KR20150093979A

KR20150093979A - Method and apparatus for assigning namenode in virtualized cluster environments

Info

Publication number: KR20150093979A
Application number: KR1020140014684A
Authority: KR
Inventors: 최종무; 김태원; 정혜진
Original assignee: 단국대학교 산학협력단
Priority date: 2014-02-10
Filing date: 2014-02-10
Publication date: 2015-08-19
Also published as: KR101654969B1

Abstract

Disclosed are a method and an apparatus for selecting an optimal name node in a virtualized cluster environment in which a plurality of name nodes exist. The method for assigning a name node in a virtualized cluster environment includes a step of receiving a job, and a step of determining, among at least two name nodes included in a virtualized cluster, the name node to which the job is to be assigned, considering the performance of the at least two name nodes or their processing speed for the job. Accordingly, the optimal name node can be chosen effectively in a virtualized cluster environment that uses a plurality of name nodes.

Description

METHOD AND APPARATUS FOR ASSIGNING NAMENODE IN VIRTUALIZED CLUSTER ENVIRONMENTS

The present invention relates to a technique for the distributed processing of large-scale data and, more particularly, to a method and apparatus for selecting an optimal name node in a virtualized cluster environment in which a plurality of name nodes exist.

Virtualization technology has recently been introduced and is widely used to make the most of computer hardware resources. Virtualization abstracts a physical resource into multiple virtual resources; running a different operating system on each virtual resource raises the utilization of multiplexed computing resources and enables efficient provisioning. In addition, when a fault occurs in a particular virtual server, the fault can be isolated and uninterrupted service provided through another virtual server, which improves system availability and reliability.

Meanwhile, in the database field, parallel and distributed databases are being actively studied to handle the recent rapid growth of large-scale data. That is, parallel DB technology has recently drawn attention, alongside virtualization technology, as a software technology for maximizing the parallelism of the hardware computing resources provided by many-core systems.

Parallel DB technology organizes data as key-value pairs and processes the data in parallel to improve processing performance. Representative examples are Google's GFS (Google File System) and Bigtable, and the open-source Hadoop, HBase, and Hive; programming models such as Map-Reduce, DryadLINQ, and MPI are used.

Hadoop is a Java software framework that supports distributed applications running on computer clusters so that large amounts of data can be processed. It consists mainly of the Hadoop Distributed File System (HDFS) and Map-Reduce.

When a Hadoop cluster is built on the conventional Hadoop Distributed File System (HDFS), there is only a single name node, which causes various problems. For example, the memory limit of the name node restricts the number of files and directories it can manage; because there is only one name node, file I/O throughput is constrained; and all users and applications must use that one name node.

Recently, the concept of NameNode Federation, a set of multiple independent name nodes, has been applied to Hadoop. However, when name node federation is applied in a virtual machine environment with heterogeneous computing characteristics, the name nodes differ in performance, so assigning jobs to name nodes by the conventional method causes problems due to the performance imbalance.

To solve the above problems, an object of the present invention is to provide a method of assigning a job to a name node in a virtualized cluster environment.

Another object of the present invention, to solve the above problems, is to provide a system for parallel distributed processing of large-scale data.

To achieve the above object, a method of assigning a name node in a virtualized cluster environment according to an embodiment of the present invention includes: receiving a job; and determining, among at least two name nodes included in a virtualized cluster, the name node to which the job is to be assigned, considering the performance of the at least two name nodes or their processing speed for the job.

Here, the virtualized cluster may consist of a set of multiple virtual nodes residing on each of a plurality of physical machines.

Here, the at least two name nodes may be located one on each of the plurality of physical machines.

Here, the name node assignment method may be applied on the basis of the Hadoop Distributed File System (HDFS).

Here, the step of determining the name node to which the job is to be assigned may randomly select two name nodes from among the at least two name nodes and, considering the performance and workload of the two selected name nodes, determine which of the two is to be assigned the job.

To achieve the other object, a system for parallel distributed processing of large-scale data according to an embodiment of the present invention includes: a name node federation consisting of a set of at least two name nodes included in a virtualized cluster; and a name node control unit that determines, among the at least two name nodes, the name node to which a job is to be assigned, considering the performance of the at least two name nodes included in the name node federation or their processing speed for the job.

With the method of assigning a name node in a virtualized cluster environment and the system for parallel distributed processing of large-scale data according to the present invention as described above, the optimal name node can be selected effectively in a virtualized cluster environment that uses a plurality of name nodes.

In addition, the present invention has the advantage of maintaining consistent performance while minimizing the checks made on the performance and workload of the name nodes.

FIG. 1 is a conceptual diagram for explaining a distributed file system.
FIG. 2 is a conceptual diagram for explaining a cluster to which a plurality of name nodes are applied according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram for explaining a system for parallel distributed processing of large-scale data according to an embodiment of the present invention.
FIG. 4 is a flowchart for explaining a method of assigning a name node in a virtualized cluster environment according to an embodiment of the present invention.
FIG. 5 shows graphs for explaining how often jobs are assigned to each name node in a virtualized cluster to which a plurality of name nodes are applied according to an embodiment of the present invention.
FIG. 6 is a graph comparing job processing times when the method of assigning a name node in a virtualized cluster environment according to an embodiment of the present invention is applied.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be called a second element, and similarly a second element may be called a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that intervening elements may be present. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

First, the terms used in the present invention are briefly described as follows.

The Hadoop Distributed File System (HDFS) uses large quantities of inexpensive hardware and is designed on the premise that failures will occur, so it always makes multiple copies of a file and stores the copies in a distributed manner. Information about the contents and location of a file is likewise made into multiple copies and stored in a distributed manner. Because file contents and information are distributed across many computers, search time is shortened, and even when searches are issued from many places at once, the workload is not concentrated in any one place.

Map-Reduce is a distributed data processing technique that uses multiple computers for efficient data processing. In the Map step, large-scale data is distributed across multiple computers and processed in parallel to produce new data (intermediate results); in the Reduce step, the intermediate results are combined to produce the final desired result.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram for explaining a distributed file system.

First, a conventional distributed file system is described in order to make the technical features of the present invention clearer.

Referring to FIG. 1, the Hadoop Distributed File System (HDFS) includes a name node 110, which is the master node, and a plurality of data nodes 120-1 to 120-n. HDFS is structured to support the management of large files and to ensure availability by storing multiple copies of each file. The data nodes 120-1 to 120-n store the file copies, and the name node 110 stores the metadata of the files.

Specifically, the name node 110 manages the namespace of the file system and processes file access requests from the client 130. In HDFS, file data is divided into blocks and stored in a distributed manner across the data nodes 120-1 to 120-n. Here, the client 130 may refer to various types of user terminals.

The data nodes 120-1 to 120-n process data input/output requests from the client 130, and the name node 110 can check the state of the data nodes 120-1 to 120-n by periodically receiving heartbeats from them.

To ensure data availability, HDFS divides relational data into blocks and then replicates the blocks across the data nodes 120-1 to 120-n for distributed storage. This technique makes n complete copies of each piece of data and stores them in a distributed manner, so the original data can be fully recovered even if only one of the nodes survives; however, the more replicas are made to raise availability, the more storage is consumed.
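The block splitting and replication described above can be illustrated with a minimal Python sketch. The function names, the data node labels, and the simple rotating placement are assumptions made for illustration only; they are not taken from the patent and are not the placement policy of HDFS itself.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide file data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, data_nodes: list[str], replicas: int) -> dict[int, list[str]]:
    """Assign each block to `replicas` distinct data nodes (hypothetical rotating placement)."""
    if replicas > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        block: [data_nodes[(block + r) % len(data_nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replicas=3)

# Storage consumed grows linearly with the number of replicas.
stored_bytes = sum(len(b) for b in blocks) * 3
```

With three replicas, each block remains readable as long as any one of its three holders survives, at the cost of three times the storage.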

However, as described above, the HDFS of FIG. 1 can suffer from various problems because it uses a single name node for the parallel distributed processing of data.

FIG. 2 is a conceptual diagram for explaining a cluster to which a plurality of name nodes are applied according to an embodiment of the present invention.

Referring to FIG. 2, a single cluster 20 contains several independent name nodes 210-1, ..., 210-k, ..., 210-n, and each of these name nodes may have its own namespace (NS).

In addition, the block pools 230 hold information about the blocks of the entire cluster 20, and each of the name nodes 210-1, ..., 210-k, ..., 210-n may hold a particular portion of the block pools 230. For example, NameNode 1 (210-1) may hold Pool 1 (230-1), NameNode k (210-k) may hold Pool k (230-k), and NameNode n (210-n) may hold Pool n (230-n).

For example, the cluster 20 may have N name nodes 210-1, ..., 210-k, ..., 210-n, and each name node may have a namespace. The cluster 20 may also have M data nodes 220-1, ..., 220-k, ..., 220-n, and the data nodes may interwork with the name nodes.

Here, a namespace (NS) may refer to a conceptual area that supports operations such as create, delete, modify, and list on files and directories.

Block storage may refer to a conceptual area that manages the membership of data nodes and supports and manages operations such as create, delete, and modify on blocks. Here, a block may refer to a unit of divided data that is stored in a distributed manner across multiple data nodes.

Meanwhile, a set of multiple independent name nodes may be called a "Name Node Federation".

FIG. 3 is a conceptual diagram for explaining a system for parallel distributed processing of large-scale data according to an embodiment of the present invention.

If the name node federation 310 is implemented with only a cluster environment in mind in which the Hadoop cluster is built solely from physical nodes, performance problems may appear in a virtualized Hadoop cluster environment that has virtual nodes.

That is, because the performance of virtual nodes is affected most by the performance of the physical node that hosts them, a virtualized Hadoop cluster becomes a heterogeneous cluster in which the performance of the virtual nodes differs according to the performance of each physical node.

For example, when jobs are placed in the name node federation 310, they may be placed on the assumption that all name nodes 311 have the same performance.

However, when the name node federation 310 is applied in a virtual machine environment with heterogeneous computing characteristics, the name nodes differ in performance, so assigning jobs to name nodes by the conventional method incurs a loss due to the performance imbalance.

Accordingly, the system for parallel distributed processing of large-scale data according to an embodiment of the present invention provides a technique for effectively assigning jobs to a plurality of name nodes in a virtualized cluster environment.

The system for parallel distributed processing of large-scale data may include a name node federation 310 and a name node control unit 330.

The name node federation 310 may consist of a set of at least two name nodes included in a virtualized cluster.

Here, the virtualized cluster may consist of a set of multiple virtual nodes residing on each of a plurality of physical machines, and the name nodes may be located one on each of the plurality of physical machines.

The name node control unit 330 may determine, among the at least two name nodes included in the name node federation 310, the name node to which a job is to be assigned, considering the performance of those name nodes or their processing speed for the job.

Specifically, it may randomly select two name nodes from among the at least two name nodes and, considering the performance and workload of the two selected name nodes, determine which of the two is to be assigned the job.

Referring to FIG. 3, a system for parallel processing of large-scale data based on the Hadoop Distributed File System (HDFS) is described as follows.

In FIG. 3, each physical machine 30 may include one Hadoop name node 311 and a plurality of Hadoop data nodes 321. The set of Hadoop name nodes 311 included in the physical machines 30 may form the name node federation 310.

The name node control unit 330 may control and manage the Hadoop name nodes 311 included in the name node federation 310.

The name node control unit 330, which may be called a High Ability NameNode, holds information about the entire namespace and the entire block pool and may perform overall management and control of the cluster.

For example, when multiple jobs are executed, the name node control unit 330 selects a Hadoop name node 311 considering the performance and current workload of each physical node, and may also assign the job to the selected Hadoop name node 311.

One Hadoop name node 311 exists per physical node, and it may hold information about the blocks held by the virtual Hadoop data nodes 321 within that physical node, together with its own namespace.

The ways in which the name node control unit 330 may assign jobs to Hadoop name nodes are compared below.

1) Selecting among all name nodes sequentially (first technique)

According to the first technique, jobs are assigned to all name nodes in sequence. This is how jobs are placed in the name node federation of a conventional physical environment; when it is applied to a virtualized Hadoop cluster, the imbalance among Hadoop NameNodes of differing performance may degrade the performance of the entire cluster.
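A minimal Python sketch of sequential assignment is given below, assuming that "sequentially" means cycling through the name nodes in a fixed order. The function name and the node labels are hypothetical.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(jobs, name_nodes):
    """Assign each job to the next name node in a fixed cyclic order."""
    order = cycle(name_nodes)
    return {job: next(order) for job in jobs}


name_nodes = [f"NameNode {i}" for i in range(1, 11)]
assignment = assign_round_robin(range(100), name_nodes)

# Every name node receives the same number of jobs, whatever its performance.
jobs_per_node = Counter(assignment.values())
```

No performance or workload check is needed, which is why a slow name node ends up with as many jobs as a fast one.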

2) Selecting by the performance and workload of all name nodes (second technique)

According to the second technique, the performance and workload of all name nodes are checked before a job is assigned. Performance stays consistent under this technique, but checking the performance and workload of every Hadoop NameNode has a cost, and that cost may itself degrade performance.
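The following Python sketch illustrates one way to check every name node before each assignment. The patent does not say how performance and workload are combined, so the `NameNode` fields and the `estimated_finish` score are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    name: str
    seconds_per_job: float  # hypothetical performance measure (lower is faster)
    queued_jobs: int = 0    # hypothetical workload measure


def estimated_finish(node: NameNode) -> float:
    # Assumed score: time until a newly added job would finish on this node.
    return (node.queued_jobs + 1) * node.seconds_per_job


def assign_by_full_check(nodes: list[NameNode]) -> tuple[NameNode, int]:
    """Check every name node, give one job to the best, return it and the check count."""
    checks = len(nodes)  # one check per name node, for every job
    best = min(nodes, key=estimated_finish)
    best.queued_jobs += 1
    return best, checks


# Five slower and five faster name nodes, 100 jobs.
nodes = [NameNode(f"NameNode {i}", 2.0 if i <= 5 else 1.0) for i in range(1, 11)]
total_checks = sum(assign_by_full_check(nodes)[1] for _ in range(100))
```

Under these assumptions the faster name nodes receive more jobs, but the number of checks grows with the number of jobs times the number of name nodes.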

3) Randomly selecting one of all name nodes (third technique)

According to the third technique, one name node is selected at random from among all name nodes and the job is assigned to it. Unlike the second technique, this incurs no cost for checking performance and workload, but, as with the first technique, the imbalance among Hadoop NameNodes may degrade the performance of the entire cluster.
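Random selection can be sketched in Python as follows. The function name, node labels, and fixed seed are hypothetical and serve only to make the example reproducible.

```python
import random
from collections import Counter


def assign_random(name_nodes, rng):
    """Pick one name node uniformly at random, with no performance or workload check."""
    return rng.choice(name_nodes)


rng = random.Random(0)  # seeded only so that repeated runs give the same result
name_nodes = [f"NameNode {i}" for i in range(1, 11)]
jobs_per_node = Counter(assign_random(name_nodes, rng) for _ in range(100))
```

The resulting counts vary from run to run when the seed is changed, and nothing steers jobs away from slower name nodes.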

4) Randomly selecting two of all name nodes and choosing one by performance and workload (fourth technique)

According to the fourth technique, two name nodes are selected at random from among all name nodes, the performance and workload of the two selected name nodes are checked, and the final name node is determined and assigned the job. The fourth technique is a compromise between the two concerns raised by the first to third techniques, consistency of performance and checking cost: it delivers reasonably consistent performance while greatly reducing the checking cost.
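A minimal Python sketch of the two-candidate selection follows. As before, the `NameNode` fields and the `estimated_finish` score are assumptions, since the text does not specify how performance and workload are weighed against each other.

```python
import random
from dataclasses import dataclass


@dataclass
class NameNode:
    name: str
    seconds_per_job: float  # hypothetical performance measure (lower is faster)
    queued_jobs: int = 0    # hypothetical workload measure


def estimated_finish(node: NameNode) -> float:
    # Assumed score: time until a newly added job would finish on this node.
    return (node.queued_jobs + 1) * node.seconds_per_job


def assign_two_choices(nodes: list[NameNode], rng: random.Random) -> tuple[NameNode, int]:
    """Sample two distinct name nodes, check only those two, assign the job to the better."""
    candidates = rng.sample(nodes, 2)
    chosen = min(candidates, key=estimated_finish)
    chosen.queued_jobs += 1
    return chosen, len(candidates)  # two checks per job, regardless of cluster size


rng = random.Random(0)  # seeded only for reproducibility
nodes = [NameNode(f"NameNode {i}", 2.0 if i <= 5 else 1.0) for i in range(1, 11)]
total_checks = sum(assign_two_choices(nodes, rng)[1] for _ in range(100))
```

Each job costs exactly two checks, compared with one check per name node when every name node is examined.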

FIG. 4 is a flowchart for explaining a method of assigning a name node in a virtualized cluster environment according to an embodiment of the present invention.

Referring to FIG. 4, the method of assigning a name node in a virtualized cluster environment according to an embodiment of the present invention may include a step of receiving a job and a step of determining the name node to which the job is to be assigned.

First, a job assigned by a client may be received (S410).

The name node to which the job is to be assigned may be determined from among at least two name nodes included in the virtualized cluster, considering the performance of the at least two name nodes or their processing speed for the job.

Here, the virtualized cluster may consist of a set of multiple virtual nodes residing on each of a plurality of physical machines, and the at least two name nodes may be located one on each of the plurality of physical machines.

Two name nodes may be selected at random from among the at least two name nodes (S420).

The name node to which the job is to be assigned may be determined from the two selected name nodes (S430). That is, the final decision between the two selected name nodes may be made considering their performance and workload.

Finally, the job may be assigned to the name node finally determined (S440).

The name node assignment method may be applied on the basis of the Hadoop Distributed File System (HDFS), but is not limited thereto.

도 5는 본 발명의 실시예에 따라 다수의 네임 노드가 적용된 가상화 클러스터에서 작업이 할당되는 네임 노드의 빈도를 설명하기 위한 그래프들이다. 5 is a graph illustrating a frequency of a name node to which tasks are allocated in a virtualization cluster to which a plurality of name nodes are applied according to an embodiment of the present invention.

즉, 도 5는 상술한 4가지 기법에 대한 성능을 비교할 수 있는 시뮬레이션 결과를 나타내는 그래프들이다. That is, FIG. 5 is a graph showing the simulation results that can compare the performance of the four techniques described above.

시뮬레이션을 위한 가정(assumption)은 다음과 같다. The assumptions for the simulation are as follows.

A) 가상화 Hadoop 클러스터에 Hadoop NameNode가 10개 있고 이중 NameNode 1 내지 5까지의 5개의 네임 노드는 성능이 좋지 않은 그룹으로, NameNode 6 내지 10까지의 5개의 네임 노드는 성능이 좋은 그룹 B로 구분한다.A) Virtualization Hadoop cluster has 10 Hadoop NameNodes, 5 namesnodes 1 through 5 are poor performance group, and 5 namesnodes 6 to 10 are distinguished as good group B .

B) The name node control unit selects a Hadoop NameNode according to each of the four techniques described above and distributes a total of 100 jobs. The time at which all jobs across the cluster have been processed is then calculated.

C) A Hadoop NameNode in group A is assumed to take 2 seconds to process one job, and a Hadoop NameNode in group B is assumed to take 1 second to process one job.

D) The cost of checking the workload and performance of a Hadoop NameNode is assumed to be 0.01 seconds per check.
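
Assumptions A) to D) can be turned into a small simulation along the following lines. This is a hedged sketch, not the simulator used for the figures. The technique names (`even`, `check_all`, `random`, `two_random`) are hypothetical labels inferred from the discussion of FIG. 5A to FIG. 5D. Adding the total check overhead to the busiest node's processing time is also an assumption, so the output is not expected to reproduce FIG. 6.

```python
import random

SECONDS_PER_JOB = [2.0] * 5 + [1.0] * 5  # A), C): group A (slow), group B (fast)
NUM_JOBS = 100                           # B): total number of jobs
CHECK_COST = 0.01                        # D): seconds per node check


def finish_time(queue, node):
    """Time at which `node` would finish if it received one more job."""
    return (queue[node] + 1) * SECONDS_PER_JOB[node]


def simulate(technique, rng):
    """Return the assumed total time for one technique (hypothetical model)."""
    queue = [0] * len(SECONDS_PER_JOB)   # jobs assigned to each name node
    nodes = range(len(queue))
    checks = 0
    for job in range(NUM_JOBS):
        if technique == "even":          # assumed first technique
            node = job % len(queue)
        elif technique == "check_all":   # assumed second technique
            node = min(nodes, key=lambda n: finish_time(queue, n))
            checks += len(queue)
        elif technique == "random":      # assumed third technique
            node = rng.randrange(len(queue))
        elif technique == "two_random":  # assumed fourth technique
            a, b = rng.sample(nodes, 2)
            node = min(a, b, key=lambda n: finish_time(queue, n))
            checks += 2
        else:
            raise ValueError(technique)
        queue[node] += 1
    # Assumption: nodes work in parallel, and the check overhead is added on top.
    busiest = max(q * s for q, s in zip(queue, SECONDS_PER_JOB))
    return busiest + checks * CHECK_COST
```

Calling `simulate("two_random", random.Random(seed))` with several seeds mirrors running the randomized techniques more than once.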

FIG. 5A shows the frequency with which jobs are assigned to each name node under the first technique, FIG. 5B under the second technique, FIG. 5C under the third technique, and FIG. 5D under the fourth technique. Here, the third and fourth techniques, which involve random selection, were each run three times.

Referring to FIG. 5A, jobs are assigned evenly to all name nodes regardless of their performance, which may degrade the performance of the cluster as a whole.

Referring to FIG. 5B, it can be seen that jobs are assigned less frequently to the low-performance group A. However, checking the performance and workload of the name nodes incurs a cost, which may itself degrade performance.

Referring to FIG. 5C, it can be seen that jobs are assigned to name nodes without any consistency. That is, because jobs are assigned at random regardless of name node performance, the performance of the cluster as a whole may be degraded.

Referring to FIG. 5D, it can be seen that jobs are assigned less frequently to the low-performance group A. In addition, because only the performance and workload of the two randomly selected name nodes are compared, the added cost is expected to be small.

FIG. 6 is a graph comparing job processing times when the method of assigning a name node in a virtualization cluster environment according to an embodiment of the present invention is applied.

FIG. 6 shows the time required to process 100 jobs under each of the four techniques described above, based on the assumptions given for FIG. 5.

Referring to FIG. 6, it can be seen that the fourth technique yields the shortest processing time and the third technique the longest.

In other words, the fourth technique is a compromise between consistency of performance and check cost: it can deliver reasonably consistent performance while greatly reducing the check cost.

The method of assigning a name node in a virtualization cluster environment and the system for parallel distributed processing of large-scale data according to the embodiments of the present invention described above make it possible to effectively select an optimal name node in a virtualization cluster environment that uses a plurality of name nodes. That is, they have the advantage of maintaining consistent performance while minimizing checks on the performance and workload of the name nodes.

Therefore, the present invention can contribute to improving the overall performance of the virtualization cluster.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be variously modified and changed without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A name node assignment method comprising:
receiving a job; and
determining, from among at least two name nodes included in a virtualization cluster, a name node to which the job is to be assigned, in consideration of the performance of the at least two name nodes or their processing speed for the job.

The method according to claim 1,
wherein the virtualization cluster
consists of a set of a plurality of virtual nodes residing on each of a plurality of physical machines.

The method according to claim 1,
wherein the at least two name nodes
are each located, one per machine, on a plurality of physical machines.

The method according to claim 1,
wherein the name node assignment method
is applied on the basis of the Hadoop Distributed File System (HDFS).

The method according to claim 1,
wherein the determining of the name node to which the job is to be assigned comprises:
randomly selecting two name nodes from among the at least two name nodes; and
determining, from the two selected name nodes, the name node to which the job is to be assigned, in consideration of the performance and workload of the two selected name nodes.

A system for parallel distributed processing of large-scale data, comprising:
a name node federation consisting of a set of at least two name nodes included in a virtualization cluster; and
a name node control unit that determines, from among the at least two name nodes included in the name node federation, a name node to which a job is to be assigned, in consideration of the performance of the at least two name nodes or their processing speed for the job.

The system according to claim 6,
wherein the virtualization cluster
consists of a set of a plurality of virtual nodes residing on each of a plurality of physical machines.

The system according to claim 6,
wherein the at least two name nodes
are each located, one per machine, on a plurality of physical machines.

The system according to claim 6,
wherein the system for parallel distributed processing of large-scale data
is based on the Hadoop Distributed File System (HDFS).

The system according to claim 6,
wherein the name node control unit
randomly selects two name nodes from among the at least two name nodes, and
determines, from the two selected name nodes, the name node to which the job is to be assigned, in consideration of the performance and workload of the two selected name nodes.