KR101559901B1

KR101559901B1 - System for storing distributed file in HDFS(Hadoop Distributed File System) and providing method thereof

Info

Publication number: KR101559901B1
Application number: KR1020140075827A
Authority: KR
Inventors: 김철기; 장민욱; 이대철; 류충모
Original assignee: 한국항공대학교산학협력단
Priority date: 2014-06-20
Filing date: 2014-06-20
Publication date: 2015-10-15

Abstract

The present invention relates to an HDFS distributed storage system and a method for providing it and, more particularly, to a system for distributing and storing a file in the Hadoop Distributed File System (HDFS), a system for performing MapReduce operations on video stored in HDFS, and methods for providing them. According to one aspect of the present invention, the HDFS distributed storage system for distributing and storing a file in HDFS includes an obtaining module that obtains the replication factor of the HDFS, a division module that generates M divided file blocks, and a control module.

Description

[0001] The present invention relates to an HDFS distributed storage system and a method for providing the HDFS distributed storage system.

The present invention relates to a system for distributing and storing files in the Hadoop Distributed File System (hereinafter, 'HDFS'), a system for performing MapReduce on video stored in HDFS, and methods for providing them.

The present invention was produced as a result of the Gyeonggi-do Regional Research Center (GRRC) research project on developing cloud technology for multi-channel AVB broadcasting and lighting convergence control (project number: GRRC 항공 2013-B02).

Cloud computing refers to binding many computers into a single aggregate to form a cloud, a virtual computing group. Rather than entrusting data storage or computation to individual computers, these tasks are delegated to the cloud, the collection of computers. Cloud computing is used in many fields, and the field that has recently drawn the most attention as its biggest application is 'big data'. Big data refers to the field of research on methodologies for storing and processing data whose size goes beyond large numbers of terabytes to petabytes or more. Given this definition, broadcast media data such as video can be regarded as the most representative big data. Because such large volumes of data cannot be processed by a single computer, cloud computing is regarded as the basic platform for big data processing.

A representative open-source cloud computing environment for big data is Apache's Hadoop system, a multi-computer cloud environment based on the Java language.

Because the data Hadoop processes is typically at least several hundred gigabytes in size, the data is not stored on a single computer but is divided into multiple blocks and distributed across multiple computers. Hadoop therefore includes the Hadoop Distributed File System (HDFS), which allows input data to be divided for processing, and the distributed data is processed by the MapReduce process, which was developed for high-speed parallel processing of large volumes of data in a cluster environment. Hadoop also supports data replication so that data is retained even when some computers in the cloud fail, and it has a built-in fault-tolerant structure in which, even if some computers stop during a computation, other computers can take over the work and complete it.

HDFS provides an environment in which a file system can be distributed across a large number of computers bound together as a cloud. When a file system is distributed across many computers, the failure of one of them could break the whole file system; to prevent this, each block is replicated and stored on several computers. More specifically, HDFS divides a file into multiple blocks and stores them across multiple computers. The size of each file block is 64 MB by default, which is very large considering that blocks in an ordinary file system are a few kB. In addition, each data block is stored redundantly a preset number of times (the replication factor); by default, each block is stored in triplicate. HDFS differs from ordinary file systems in several ways, one of which is that a file cannot be modified once it has been written. To change the contents of a file, the entire file must therefore be rewritten; this is a costly compromise that allows files distributed across many nodes to keep consistent contents.
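As a rough illustration of the default layout described above, the following Python sketch computes how a file would be cut into fixed-size blocks and how many block replicas result. The function name `hdfs_block_layout` and its parameters are hypothetical and exist only for this example; this is not the Hadoop API.

```python
MB = 1024 * 1024

def hdfs_block_layout(file_size, block_size=64 * MB, replication=3):
    """Return the (offset, length) of each block and the total replica count.

    Hypothetical helper for illustration only; it models the default
    64 MB block size and replication factor of 3 mentioned in the text.
    """
    blocks = [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]
    return blocks, len(blocks) * replication

blocks, replicas = hdfs_block_layout(200 * MB)
print(len(blocks), replicas)
```

Under these assumptions, a 200 MB file yields four blocks (three of 64 MB and one of 8 MB) and twelve stored replicas.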

HDFS has a master/slave structure. An HDFS cluster consists of a name node, which is the master server, and many data nodes, which store the contents of files in a distributed manner block by block. A data node performs the file reads and writes requested by clients and creates, deletes, and replicates blocks as instructed by the name node. There is exactly one name node per HDFS cluster. It manages the file system namespace, controls file access by clients, manages the meta information of the files stored in HDFS (the HDFS namespace, the mapping between file blocks and data nodes, and so on), and performs file system namespace operations such as opening, closing, and renaming files and directories. It also periodically receives from each data node a report of its current state and the list of data blocks it holds (heartbeat).

When a client wants to create a file in HDFS, it first creates the file in its own local file system. When creation of the file is finished, or when the amount of data to be written to the file reaches the preset data block size (for example, 64 MB), the client communicates with the name node and receives from it the data node on which the data block is to be stored and a block ID. The client then stores the file block on that data node. If the client still has data to write to the file, it repeats this process block by block. To read data stored in HDFS, the client first obtains from the name node the data block location list of the file (block IDs and the list of data nodes storing each block) and then reads the data blocks by communicating directly with the data nodes that store them.
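The write and read paths above can be sketched with a toy in-memory model. Everything here (`ToyNameNode`, `ToyDataNode`, `write_file`, `read_file`, and the round-robin placement) is a hypothetical simplification for illustration; it does not use or reproduce the real Hadoop client API, and it stores a single copy of each block.

```python
import itertools

class ToyDataNode:
    """Holds block contents keyed by block ID."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

class ToyNameNode:
    """Hands out block IDs and remembers which data node stores each block."""
    def __init__(self, data_nodes):
        self.block_map = {}                       # path -> [(block_id, node_id)]
        self._ids = itertools.count(1)
        self._next_node = itertools.cycle(data_nodes)  # placement policy is an assumption

    def allocate_block(self, path):
        block_id, node = next(self._ids), next(self._next_node)
        self.block_map.setdefault(path, []).append((block_id, node.node_id))
        return block_id, node

    def block_locations(self, path):
        return list(self.block_map[path])

def write_file(name_node, path, data, block_size):
    # For each block: ask the name node where it goes, then store it on that data node.
    for offset in range(0, len(data), block_size):
        block_id, node = name_node.allocate_block(path)
        node.blocks[block_id] = data[offset:offset + block_size]

def read_file(name_node, nodes_by_id, path):
    # Get the block location list first, then read each block from its data node.
    return b"".join(
        nodes_by_id[node_id].blocks[block_id]
        for block_id, node_id in name_node.block_locations(path)
    )
```

A caller would build a few `ToyDataNode` objects, pass them to `ToyNameNode`, and check that `read_file` returns the bytes given to `write_file`.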

MapReduce is a framework that supports parallel processing in a distributed environment in order to analyze the large volumes of data stored in HDFS, and it is known as the most effective of the cloud computation parallelization techniques known to date.

A MapReduce job processed by MapReduce consists of multiple (n) map tasks and multiple (m) reduce tasks. A map task divides the raw data (the input file) into independent units of computation, processes them, and classifies them into associated data of the form <key, value>; a reduce task receives the outputs of the map tasks as input and generates the final result.

FIG. 1 is a diagram briefly illustrating how MapReduce is performed in Hadoop.

Referring to FIG. 1, to perform MapReduce on an input file stored in HDFS, Hadoop divides the input file into multiple splits and allocates them to map tasks. Each split consists of map entries, which are independent units of processing.

Each map task server analyzes its allocated split map entry by map entry to generate <key, value> data, and a server that has finished processing all of its allocated splits is allocated additional splits to process. Because today's computers have multiple physical/logical cores per CPU, several servers may run on a single computer.

The result data generated by the map process goes through a shuffling process, is classified by key into reduce groups, and is sent to the servers that will perform the reduce tasks. A reduce task processes its reduce group reduce entry by reduce entry to generate the final result (the output file). The generated output file is stored in HDFS.
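The map, shuffle, and reduce stages described above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, using word counting as a stand-in workload; the function names are hypothetical, and a real Hadoop job would run these stages on separate servers.

```python
from collections import defaultdict

def map_task(split):
    """Process one split entry by entry and emit <key, value> pairs."""
    for entry in split:              # here each map entry is a line of text
        for word in entry.split():
            yield word, 1

def shuffle(pairs):
    """Group the map output by key into reduce groups."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Combine all values of one reduce group into a final result."""
    return key, sum(values)

def run_job(splits):
    pairs = [pair for split in splits for pair in map_task(split)]
    return dict(reduce_task(key, values) for key, values in shuffle(pairs).items())

print(run_job([["a b a"], ["b c"]]))
```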

The MapReduce system consists of many task trackers and one job tracker. The job tracker is a server that acts as the master managing the tasks (map tasks and reduce tasks) performed by the task trackers, and it is generally co-located on the server on which the HDFS name node runs. A task tracker performs the jobs requested by users task by task and is usually located on a server that acts as a data node.

When a client requests a MapReduce job from the job tracker, the job tracker builds a list of tasks to be processed by each task tracker. Each task tracker periodically sends a heartbeat message, and the job tracker returns the ID of a task to process in its reply to that message. A task tracker that has received a task ID fetches the relevant task information and the program to run from HDFS and uses a fork call to run a map task and/or a reduce task (a task tracker can run several tasks at the same time). If a task tracker fails or a new task tracker is added, the job tracker detects this and removes or adds it accordingly (the work of a failed task tracker is reassigned to other task trackers). To make the most of data locality, when looking for a server to run a map task the job tracker first tries to find a server that already holds the input file block or one that is in the same rack.
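The data locality preference in the last sentence can be sketched as a three-tier choice: a tracker that holds the block, then a tracker in the same rack, then any tracker. The function `pick_tracker` and its arguments are hypothetical names for illustration, not the job tracker's actual scheduling code.

```python
def pick_tracker(block_hosts, idle_trackers, rack_of):
    """Choose a task tracker for a map task, preferring data locality.

    block_hosts:   set of hosts that store the input file block
    idle_trackers: list of hosts that can accept a task
    rack_of:       dict mapping host name to rack name
    """
    # 1. A tracker that already holds the block.
    for tracker in idle_trackers:
        if tracker in block_hosts:
            return tracker
    # 2. A tracker in the same rack as a host holding the block.
    block_racks = {rack_of[host] for host in block_hosts}
    for tracker in idle_trackers:
        if rack_of[tracker] in block_racks:
            return tracker
    # 3. Any available tracker, or None if there is none.
    return idle_trackers[0] if idle_trackers else None
```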

In Hadoop, because several computers (data nodes) must cooperate to use a file, a problem can arise in which file operations become impossible if even one of them fails. To solve this problem, HDFS stores each file block redundantly on several data nodes. The number of redundant copies is called the replication factor. The replication factor is usually set to 3, but it is an adjustable value.

Video is the most representative field of big data. A single HD (High Definition) video can be as large as several to several tens of gigabytes, and because shooting video with smartphones has become commonplace, a very large amount of video is shared on the Internet. Although video is representative big data, only the basic concept of cloud storage, gathering the videos in a single virtual environment, is in use, and research on cloud environments for analyzing media such as video is lacking.

The present invention provides an HDFS distributed storage system, and a method thereof, that can store the data contained in a file redundantly on HDFS as many times as the replication factor while also distributing the file across HDFS in a way that allows MapReduce on the file to be performed efficiently.

In particular, it provides a system and method that, in distributing and storing a video file in HDFS, minimize the network traffic between the servers that perform MapReduce on the video file and the servers that actually store the video file.

It also provides a system and method that can efficiently perform MapReduce on a video file distributed and stored in this way.

According to one aspect of the present invention, there is provided an HDFS distributed storage system for distributing and storing a file in HDFS (Hadoop Distributed File System), comprising: a division module that generates M divided file blocks corresponding to the file, where M is the number of blocks obtained when the file is divided into pieces of size S/R, R is the replication factor of the HDFS, and S is a predetermined default block size; and a control module that distributes and stores the M divided file blocks on data nodes of the HDFS, wherein the i-th divided file block (1≤i≤M-R+1) among the M divided file blocks includes the data in the region of the file that extends from the start point of the i-th block obtained when the file is divided into pieces of size S/R for a length equal to the default block size, and the j-th divided file block (M-R+2≤j≤M) among the M divided file blocks includes the data in the region of the file that extends from the start point of the i-th block obtained when the file is divided into pieces of size S/R to the end point of the file.

In one embodiment, the j-th divided file block (M-R+2≤j≤M) among the M divided file blocks may further include the data in the region from the start point of the file to the end point of the (R-(M-j+1))-th block obtained when the file is divided into pieces of size S/R.
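The division rule can be made concrete with a small Python sketch that returns the byte ranges of each divided file block. It rests on several assumptions that are not stated in the text: S is taken to be an exact multiple of R, the file is taken to span at least R pieces, and the start point of the j-th divided file block is read as the start of the j-th piece of size S/R. The function name `divided_file_blocks` and the `wrap` flag (which switches the optional wrap-around embodiment on or off) are hypothetical.

```python
def divided_file_blocks(file_size, S, R, wrap=True):
    """Return, for each of the M divided file blocks, a list of (start, end) byte ranges.

    Illustrative sketch only: assumes S % R == 0 and M >= R.
    """
    if S % R != 0:
        raise ValueError("this sketch assumes S is a multiple of R")
    piece = S // R                          # size S/R of one piece
    M = -(-file_size // piece)              # number of pieces (ceiling division)
    if M < R:
        raise ValueError("this sketch assumes the file spans at least R pieces")
    blocks = []
    for k in range(1, M + 1):               # k is the 1-based block index
        start = (k - 1) * piece
        if k <= M - R + 1:
            # Default-block-size region starting at the k-th piece.
            ranges = [(start, min(start + S, file_size))]
        else:
            # From the k-th piece to the end of the file ...
            ranges = [(start, file_size)]
            if wrap:
                # ... plus, optionally, the first R-(M-k+1) pieces of the file.
                ranges.append((0, (R - (M - k + 1)) * piece))
        blocks.append(ranges)
    return blocks

for ranges in divided_file_blocks(file_size=10, S=6, R=3):
    print(ranges)
```

With `file_size=10`, `S=6`, and `R=3`, the pieces are 2 bytes long and M is 5. With the wrap-around ranges included, every byte of the file appears in exactly R = 3 divided file blocks, which matches the stated goal of storing the data as many times as the replication factor.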

In one embodiment, the file may be a video file.

In one embodiment, the control module may additionally store a header block containing the header information of the video file on every data node on which the M divided file blocks are distributed and stored.

According to another aspect of the present invention, there is provided a video MapReduce system that performs MapReduce on a video file stored by the HDFS distributed storage system described above and that includes a job tracker server and a plurality of task tracker servers, wherein the job tracker server divides the video file into unit splits on the basis of key frames and allocates each unit split to a map task executed on the task tracker server, among the plurality of task tracker servers, that stores the divided file block containing that unit split, and each task tracker server performs the map task on the data contained in the unit split allocated to it.
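One possible reading of this allocation step is sketched below in Python. It assumes that key frames are identified by byte offsets, that a unit split runs from one key frame to the next, and that a split is assigned to the first server whose divided file block fully contains it. The names `unit_splits` and `assign_splits` and the host names are hypothetical; the text does not specify these details.

```python
def unit_splits(key_frame_offsets, file_size):
    """Cut the file at key frame offsets into (start, end) unit splits."""
    offsets = sorted(key_frame_offsets)
    return list(zip(offsets, offsets[1:] + [file_size]))

def assign_splits(splits, blocks):
    """Map each unit split to a host whose divided file block contains it.

    blocks: list of (host, [(start, end), ...]) describing the byte ranges
            held in the divided file block stored on that host.
    Returns None for a split that no single block fully contains.
    """
    assignment = {}
    for start, end in splits:
        assignment[(start, end)] = next(
            (host for host, ranges in blocks
             if any(rs <= start and end <= re for rs, re in ranges)),
            None,
        )
    return assignment

splits = unit_splits([0, 3, 7], file_size=10)
blocks = [("dn1", [(0, 6)]), ("dn2", [(2, 8)]), ("dn3", [(4, 10)])]
print(assign_splits(splits, blocks))
```

Because each split is processed where its bytes already reside, no block data needs to cross the network for the map task in this sketch.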

According to another aspect of the present invention, there is provided a method for providing an HDFS distributed storage system that distributes and stores a file in HDFS, the method comprising: generating M divided file blocks corresponding to the file, where M is the number of blocks obtained when the file is divided into pieces of size S/R, R is the replication factor of the HDFS, and S is a predetermined default block size; and distributing and storing the M divided file blocks on data nodes of the HDFS, wherein the i-th divided file block (1≤i≤M-R+1) among the M divided file blocks includes the data in the region of the file that extends from the start point of the i-th block obtained when the file is divided into pieces of size S/R for a length equal to the default block size, and the j-th divided file block (M-R+2≤j≤M) among the M divided file blocks includes the data in the region of the file that extends from the start point of the i-th block obtained when the file is divided into pieces of size S/R to the end point of the file.

In one embodiment, the file is a video file, and the method for providing the HDFS distributed storage system may further include storing a header block containing the header information of the video file on every data node on which the M divided file blocks are distributed and stored.

According to another aspect of the present invention, there is provided a computer-readable recording medium on which a program for performing the method described above is recorded.

According to another aspect of the present invention, there is provided an HDFS distributed storage system comprising a processor and a memory storing a computer program executed by the processor, wherein the computer program, when executed by the processor, causes the HDFS distributed storage system to perform the method described above.

According to another aspect of the present invention, there is provided a method for providing a video MapReduce system that performs MapReduce on a video file stored by the HDFS distributed storage system described above and that includes a job tracker server and a plurality of task tracker servers, the method comprising: (a) dividing, by the job tracker server, the video file into unit splits on the basis of key frames; (b) allocating, by the job tracker server, each unit split to a map task executed on the task tracker server, among the plurality of task tracker servers, that stores the divided file block containing that unit split; and (c) performing, by each task tracker server to which a unit split has been allocated, the map task on the data contained in the unit split allocated to it.

According to an embodiment of the present invention, a Hadoop system can be provided that stores large files and allows them to be analyzed by high-speed parallel processing. That is, according to an embodiment of the present invention, the data contained in the file is stored redundantly on HDFS as many times as the replication factor, which increases the robustness of the file.

In addition, when the file stored in HDFS is a video, the amount of communication needed to perform MapReduce on the video can be minimized while parallelism is increased.

In addition, the header information essential for analyzing and processing the video is stored on every server on which the video is distributed and stored, so that each server performing MapReduce on the video can refer to the header information in its own local storage. This reduces the network overhead that would arise if the header information were read remotely.

A brief description of each drawing is provided so that the drawings cited in the detailed description of the present invention can be more fully understood.
FIG. 1 is a diagram briefly illustrating how MapReduce is performed in Hadoop.
FIG. 2 is a block diagram schematically illustrating an HDFS distributed storage system according to an embodiment of the present invention.
FIGS. 3a to 3c are diagrams comparing the divided file blocks generated by the HDFS distributed storage system according to an embodiment of the present invention with the file blocks generated by conventional HDFS.
FIG. 4 is a diagram illustrating a problem that can arise when a video file is divided into file blocks of a fixed size.
FIG. 5 is a diagram illustrating a MapReduce system according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating an example in which a video file is divided into splits by the MapReduce system according to an embodiment of the present invention and the splits are allocated to task tracker servers by the job tracker server.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the present invention, detailed descriptions of related known technology are omitted where they are judged likely to obscure the gist of the invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in this application are used only to describe particular embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise.

In this specification, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Also, in this specification, when one component 'transmits' data to another component, this means that the component may transmit the data to the other component directly or via at least one further component. Conversely, when one component 'directly transmits' data to another component, this means that the data is transmitted from the component to the other component without passing through any further component.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings, focusing on embodiments of the invention. Like reference numerals in the drawings denote like members.

FIG. 2 is a block diagram schematically illustrating an HDFS distributed storage system according to an embodiment of the present invention.

Referring to FIG. 2, an HDFS distributed storage system 100 may be provided to implement the method for providing an HDFS distributed storage system according to an embodiment of the present invention.

When a client 200 requests that a file be stored on the HDFS 300, the HDFS distributed storage system 100 may generate a plurality of divided file blocks corresponding to the requested file and distribute and store the generated divided file blocks on the data nodes (for example, 320-1 to 320-3) that make up the HDFS 300.

The file may be a video file. The video files that the present invention can handle are not restricted to any storage format, such as avi, mov, mp4, or wmv, and may be encoded with any of various well-known codecs such as divx, mpeg, and h.264.

The video file may contain many frames (still images). Some of the frames in the video may be key frames. A key frame may mean a frame that holds all of the image data for its point in time and therefore has no dependency on preceding or following frames. In one embodiment of the present invention, a key frame may also mean a frame that holds all of the image data for its point in time and is not referenced by a preceding frame.

The video file may also contain header information. The header information may include information about each frame in the video (for example, the position or type of the frame), and the HDFS distributed storage system 100 may obtain the positions of key frames from the header information in order to divide the video into a plurality of divided file blocks.

To store each of the generated divided file blocks, the HDFS distributed storage system 100 may query the name node 310 included in the HDFS, for each file block, for the data node on which that block is to be stored. When the name node 310 responds to the query with information about the data node on which the block will actually be stored (for example, the ID of the data node), the system may transmit the block to that data node for storage.

The HDFS distributed storage system 100 may include a partitioning module 110 and a control module 120. Depending on the embodiment, some of these components may not be essential to implementing the present invention, and the distributed storage system 100 may of course include more components than these.

The distributed storage system 100 may include the hardware resources and/or software needed to implement the technical idea of the present invention, and does not necessarily mean a single physical component or a single device. That is, the distributed storage system 100 may mean a logical combination of hardware and/or software provided to implement the technical idea of the present invention and, where necessary, may be implemented as a set of logical components that are installed on devices separated from one another and each perform their own function. The distributed storage system 100 may also mean a set of components implemented separately for each function or role needed to implement the technical idea of the present invention. For example, the partitioning module 110 and/or the control module 120 may be located on different physical devices or on the same physical device. Depending on the implementation, the elements that make up each individual module, such as the acquisition module 110, the partitioning module 110, and/or the control module 120, may themselves be located on different physical devices and cooperate with one another to realize the function performed by that module.

In this specification, a module may mean a functional and structural combination of hardware for carrying out the technical idea of the present invention and software for driving that hardware. For example, a module may mean a logical unit of predetermined code and the hardware resources on which that code is executed; a person of ordinary skill in the art will readily understand that it does not necessarily mean physically connected code or a single kind of hardware.

The control module 120 may control the functions and/or resources of the other components (for example, the partitioning module 110) included in the distributed storage system 100 according to an embodiment of the present invention.

The partitioning module 110 may divide a target file to be stored in the HDFS into a plurality of corresponding divided file blocks.

As described above, a conventional HDFS may divide a large file into pieces of a preset fixed size (that is, the basic block size, for example 64 MB or 96 MB) and store each piece redundantly as many times as the replication factor of the HDFS (for example, 3). In a conventional HDFS, therefore, there may be as many identical copies of each file block as the replication factor.

By contrast, the divided file blocks generated by the partitioning module 110 according to an embodiment of the present invention may all differ from one another, and taken together they may contain the data of the target file as many times as the replication factor. The number of divided file blocks generated by the partitioning module may be the number of blocks M obtained when the target file is divided into units of size S/R, where R is the replication factor of the HDFS and S is the basic block size.

Hereinafter, a specific method by which the partitioning module 110 generates the M divided file blocks corresponding to the target file will be described with reference to FIGS. 3A to 3C.

FIG. 3A shows a target file to be stored by the HDFS distributed storage system according to an embodiment of the present invention, FIG. 3B shows the file blocks generated by a conventional HDFS, and FIG. 3C shows the divided file blocks generated by the HDFS distributed storage system according to an embodiment of the present invention. The example of FIGS. 3A to 3C illustrates the case where M is 9, the basic block size is 96 MB, and the replication factor of the HDFS is 3.

Referring first to FIG. 3B, a conventional HDFS may divide the target file 1 shown in FIG. 3A into units of the preset basic block size to generate three file blocks 11, 12, and 13. In a conventional HDFS, each of the three file blocks 11, 12, and 13 may then be stored redundantly as many times as the replication factor (that is, 3). Three copies of each of file blocks 11 to 13 may therefore exist on a conventional HDFS.

By contrast, according to an embodiment of the present invention, the partitioning module 110 may generate M mutually different divided file blocks corresponding to the target file. Here, M is the number of blocks obtained when the target file is divided into units of size S/R, R is the replication factor of the HDFS, and S is the basic block size.
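
The block count M defined here can be illustrated with a minimal Python sketch. The function name `sub_block_count`, the requirement that S be divisible by R, and the rounding-up of a partial final unit are assumptions made for illustration; the description only states that M is the number of blocks obtained when the file is cut into units of size S/R.

```python
MB = 1024 * 1024

def sub_block_count(file_size: int, block_size: int, replication: int) -> int:
    """Return M, the number of units of size S/R needed to cover the file.

    Assumption (not stated in the source): a partial final unit counts as
    one more block, so the division rounds up.
    """
    if file_size <= 0 or block_size <= 0 or replication <= 0:
        raise ValueError("sizes and replication factor must be positive")
    if block_size % replication != 0:
        raise ValueError("this sketch assumes S is divisible by R")
    unit = block_size // replication            # S/R
    return (file_size + unit - 1) // unit       # integer ceiling division

# Values from the example of FIGS. 3A to 3C: S = 96 MB, R = 3.
# A 288 MB file is an assumed size that is consistent with M = 9.
m = sub_block_count(288 * MB, 96 * MB, 3)
```

With S = 96 MB and R = 3 each unit is 32 MB, so an assumed 288 MB file gives M = 9, matching the figure example.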

In a first embodiment of the present invention, the M divided file blocks generated by the partitioning module 110 may contain the following data.

1) The i-th divided file block (1 ≤ i ≤ M-R+1) of the M divided file blocks may contain the data of the target file that lies in a region of the basic block size starting at the start point of the i-th block obtained when the target file is divided into units of size S/R.

2) The j-th divided file block (M-R+2 ≤ j ≤ M) of the M divided file blocks may contain the data of the target file that lies in the region from the start point of the j-th block obtained when the target file is divided into units of size S/R to the end point of the target file.

According to a second embodiment of the present invention, in case 2) above the j-th divided file block may further contain the data that lies in the region from the start point of the target file to the end point of the (R-(M-j+1))-th block obtained when the target file is divided into units of size S/R.

FIG. 3C shows the divided file blocks, generated by the partitioning module 110 according to the second embodiment, that correspond to the target file shown in FIG. 3A. Divided file blocks 21 to 27 are generated by rule 1), and divided file blocks 28 and 29 are generated by rule 2).

If each block obtained by dividing the target file into units of size S/R is called a sub-block, rule 1) applies to the fifth divided file block 25 of FIG. 3C, which may therefore contain the data of sub-blocks ⑤, ⑥, and ⑦, that is, the region of the basic block size (equal to the size of R sub-blocks) starting at the start point of the fifth sub-block ⑤.

Rule 2) applies to the eighth divided file block 28 of FIG. 3C, which may therefore contain the data in the region from the start point of the eighth sub-block ⑧ to the end point of the target file (that is, the end point of sub-block ⑨), together with the data in the region from the start point of the target file (that is, the start point of sub-block ①) to the end point of sub-block number R-(M-j+1) = 3-(9-8+1) = 1 (that is, sub-block ①). As a result, the eighth divided file block 28 may contain the data of sub-blocks ⑧, ⑨, and ①.
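
Rules 1) and 2) and the wrap-around of the second embodiment can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the function name `divided_blocks`, the `wrap_around` flag, and the check that M is at least R are assumptions, and blocks are described only by the 1-based sub-block numbers they contain rather than by actual file data.

```python
def divided_blocks(m: int, r: int, wrap_around: bool = True) -> list[list[int]]:
    """List, for divided file blocks 1..m, the sub-block numbers each contains.

    wrap_around=False follows the first embodiment (rule 2 stops at the end
    of the file); wrap_around=True follows the second embodiment.
    """
    if r < 1 or m < r:
        raise ValueError("this sketch assumes M >= R >= 1")
    blocks = []
    for k in range(1, m + 1):
        if k <= m - r + 1:
            # Rule 1: R consecutive sub-blocks starting at sub-block k.
            subs = list(range(k, k + r))
        else:
            # Rule 2: from sub-block k to the end of the file.
            subs = list(range(k, m + 1))
            if wrap_around:
                # Second embodiment: also sub-blocks 1 .. R-(M-k+1).
                subs += list(range(1, r - (m - k + 1) + 1))
        blocks.append(subs)
    return blocks

# The example of FIG. 3C: M = 9, R = 3.
blocks = divided_blocks(9, 3)
```

For M = 9 and R = 3 the fifth entry is sub-blocks 5, 6, 7 and the eighth is sub-blocks 8, 9, 1, which agrees with the description of divided file blocks 25 and 28.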

Referring again to FIG. 2, the control module 120 may distribute and store the M divided file blocks on the data nodes of the HDFS.

Each time a divided file block is generated or determined, the control module 120 may ask the name node 310 included in the HDFS 300 for the ID of the data node on which that divided file block is to be stored or replicated, and may cause the divided file block to be stored on a data node (for example, 320-1) on the basis of the information returned by the name node 310.

The individual divided file blocks generated by the partitioning module 110 are not themselves replicated as many times as the replication factor. However, because the divided file blocks generated by the partitioning module 110 already contain data that overlaps as many times as the replication factor, the data making up the target file can still be stored redundantly in the HDFS as many times as the replication factor.
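
The redundancy property can be checked with a short Python sketch. It assumes the second embodiment, in which the last blocks wrap around to the start of the file, and models that layout with modular arithmetic; the function names `wraparound_blocks` and `copies_per_sub_block` are hypothetical.

```python
from collections import Counter

def wraparound_blocks(m: int, r: int) -> list[list[int]]:
    """Block k holds sub-blocks k, k+1, ..., k+r-1, wrapping past m back to 1."""
    return [[(k - 1 + off) % m + 1 for off in range(r)] for k in range(1, m + 1)]

def copies_per_sub_block(blocks: list[list[int]]) -> Counter:
    """Count how many divided file blocks contain each sub-block."""
    return Counter(sub for block in blocks for sub in block)

blocks = wraparound_blocks(9, 3)          # M = 9, R = 3 as in FIG. 3C
counts = copies_per_sub_block(blocks)
all_distinct = len({tuple(b) for b in blocks}) == len(blocks)
```

In this sketch every sub-block is counted R = 3 times while no two divided file blocks are identical. Without the wrap-around of the second embodiment, the first R-1 sub-blocks would be counted fewer than R times.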

There is no restriction on the kind (or type) of file that the HDFS distributed storage system 100 can store in the HDFS.

When the file stored in the HDFS is a video file, the header region holding the video's meta information needs special handling. Every server that processes the video must process the header before it processes the images themselves, or it cannot process the video properly. That is, to perform MapReduce on its assigned split, each task tracker server must know, among other things, the position of each frame, since the frames are the map entries that make up the split, and it must consult the header of the video to find those positions. If every task tracker server processing a split requested this data from the data node that stores the header portion, the load on that data node would grow and performance could drop because of network delay and the like. The control module 120 may therefore cause a separate header block containing the header information of the video file to be additionally stored on every data node that stores the divided file blocks. Each task tracker server can then read the header information from its own local storage, which resolves the problem described above.

As described above, the MapReduce technique carries out the map phase by dividing a large file into map entries, which are independent units of computation. In practice, to improve speed, these independent units are bundled into groups called splits, and each split is assigned to one server (that is, a map task tracker server). The map task tracker server processes the map entries in its assigned split sequentially. The performance of MapReduce depends to a large extent on how the input file is divided into map entries, the minimum unit of processing, and how those entries are then grouped into splits. Considering the simplicity of the algorithm and the flexibility of the split size, a smaller map entry is better. A split, the unit in which work is assigned to a server, should be kept small enough to allow adequate parallelism, yet large enough to minimize the load incurred in cutting the file. In short, it is important to find an appropriate balance for the split size.

An embodiment of the present invention targets video files in particular, and because smaller map entries are better, an embodiment of the present invention that targets video may treat each frame of the video as one map entry.

One problem with treating each frame of a video as one map entry is that, because of the way video is compressed, most frames are compressed in a manner that depends on the preceding or following frames. A video is a succession of still images with continuous motion, so the compression ratio of the current frame can be raised by assuming that the neighboring images are known; real video compression therefore uses techniques that depend on the images of the preceding and following frames. Consequently, to process a given frame (that is, one map entry), the previously processed frames must be consulted unless that frame is a key frame (a frame that does not depend on its neighbors and holds all of its image data, so that it can be analyzed independently). In short, treating each frame of a video as one map entry violates one of the assumptions of the Hadoop MapReduce pattern, namely that each map entry is independent. However, if MapReduce processes the map entries within a split in file order, then even with each frame as one map entry there may be no problem in processing the frames, provided the file is not cut so that a split contains images that depend on another split.

As described above, each split must be processable on a separate server, so the splits must be completely independent of one another. Each split must therefore be delimited by key frames, which do not depend on other frames. This may mean that every split must start at the start point of one key frame and end immediately before the start point of another key frame.
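
The key-frame rule can be illustrated with a small Python sketch that turns a list of key-frame start offsets into half-open byte ranges. The function name `unit_splits`, the representation of key frames as byte offsets, and the choice to end the last split at the end of the video data are assumptions made for illustration.

```python
def unit_splits(key_frame_offsets: list[int], data_end: int) -> list[tuple[int, int]]:
    """Return one [start, end) byte range per key frame.

    Each split starts at a key frame and ends just before the next key frame.
    Assumption: the final split runs to data_end, the end of the video data.
    """
    offsets = sorted(set(key_frame_offsets))
    if not offsets:
        return []
    if offsets[-1] >= data_end:
        raise ValueError("every key frame must start before data_end")
    bounds = offsets + [data_end]
    return [(bounds[i], bounds[i + 1]) for i in range(len(offsets))]

# Hypothetical offsets, chosen only to show the shape of the result.
splits = unit_splits([0, 40, 100], 150)
```

Because every range begins at a key frame, the frames inside one range never depend on frames in another, which is the independence that the map phase relies on.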

If, however, the video file is divided uniformly into units of the basic block size and stored in the HDFS, the boundaries of the blocks stored on the data nodes may differ from the boundaries of the splits cut at key frames. That is, as shown in FIG. 4, when splits are cut at key frames, the key-frame boundaries and the file-block boundaries do not coincide, so the two sets of boundaries are slightly offset from each other. Even if the job of processing a split is assigned to the server that holds the corresponding file block, that server must therefore read from elsewhere the leftover fragment of the split that extends beyond its file block.
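
The size of that leftover fragment under conventional fixed-size blocks can be estimated with a minimal Python sketch. The function name `remote_bytes` and the assumption that a split is processed on the server holding the fixed-size block in which the split starts are illustrative, and the numbers are hypothetical.

```python
def remote_bytes(split: tuple[int, int], block_size: int) -> int:
    """Bytes of a [start, end) split that lie beyond the fixed-size block
    containing the split's start, and so would have to be read remotely."""
    start, end = split
    if not 0 <= start < end:
        raise ValueError("split must satisfy 0 <= start < end")
    block_end = (start // block_size + 1) * block_size
    return max(0, end - block_end)

# Hypothetical split boundaries against a block size of 96 units.
inside = remote_bytes((0, 90), 96)       # ends before the block boundary
straddle = remote_bytes((90, 130), 96)   # crosses the boundary at 96
```

A non-zero result marks a split whose tail lives in the next block, which is the situation that overlapping divided file blocks are meant to avoid.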

A video MapReduce system according to an embodiment of the present invention divides the video file into splits at key frames and resolves this problem by using the HDFS distributed storage system according to an embodiment of the present invention described above. Whereas the example of FIG. 4 sets each split as close as possible to the size of a file block, the video MapReduce system according to an embodiment of the present invention may instead divide the video file into splits at every key frame.

FIG. 5 illustrates a MapReduce system 400 according to an embodiment of the present invention, and FIG. 6 illustrates an example in which the MapReduce system 400 divides a video file into splits and the job tracker server assigns the splits to task tracker servers.

A video file (v) distributed and stored in the HDFS as described above may be processed by the MapReduce system 400 according to an embodiment of the present invention. As in an ordinary Hadoop system, the MapReduce system 400 may include one job tracker server 410 and a plurality of task tracker servers (for example, 420-1 to 420-2), and each task tracker server may also act as a data node. That is, each of the task tracker servers 420-1 to 420-2 may store at least some of the divided file blocks generated from the video file in its local storage (421-1 to 421-4). The job tracker server 410 may assign splits to the task tracker servers 420-1 to 420-2, and each task tracker server 420-1 to 420-2 may perform a map task on the data contained in the unit split assigned to it.

As described above, the splits processed by the task tracker servers 420-1 to 420-2 must be independent of one another, so the job tracker server 410 may divide the video file to be processed at key frames; this may mean that the file must be divided so that the boundary of each split is the start point of a key frame. As shown in FIG. 6, the job tracker server 410 may divide the video file 2 that is the target of MapReduce at key frames into as many unit splits (S1 to S9) as there are key frames.

The job tracker server 410 may assign each of the resulting unit splits to a task tracker server that stores a divided file block containing that split.

In the example of FIG. 6, the job tracker server 410 may assign unit split S1 to a task tracker server that stores any one of the divided file blocks 21, 28, and 29, which contain unit split S1.

Likewise, unit split S2 may be assigned to the task tracker server that stores divided file block 21, which contains unit split S2.

Likewise, unit split S3 may be assigned to a task tracker server that stores either of the divided file blocks 22 and 23, which contain unit split S3. Divided file blocks 21 and 24, however, contain only part of unit split S3, so unit split S3 may not be assigned to the task tracker servers that store divided file blocks 21 and 24.

Likewise, unit split S4 may be assigned to a task tracker server that stores either of the divided file blocks 23 and 24, which contain unit split S4.

For the remaining unit splits (S5 to S9), the job tracker server may in the same way assign each unit split to a task tracker server that stores a divided file block containing that unit split.
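
The selection of candidate blocks for a unit split can be sketched in Python as follows, assuming the wrap-around layout of the second embodiment. The function names `block_ranges` and `candidate_blocks`, the byte offsets of the splits, and the sub-block size of 32 units are hypothetical; the sketch only decides which divided file blocks contain a split in full and leaves the mapping from blocks to task tracker servers aside.

```python
def block_ranges(k: int, m: int, r: int, sub_size: int,
                 file_size: int) -> list[tuple[int, int]]:
    """Byte ranges [start, end) of the file held by divided file block k (1-based)."""
    first = (k - 1) * sub_size
    last_sub = k + r - 1
    if last_sub <= m:
        return [(first, min(last_sub * sub_size, file_size))]
    # Wrap-around: tail of the file plus the first (last_sub - m) sub-blocks.
    return [(first, file_size), (0, (last_sub - m) * sub_size)]

def candidate_blocks(split: tuple[int, int], m: int, r: int,
                     sub_size: int, file_size: int) -> list[int]:
    """Numbers of the divided file blocks that contain the whole split."""
    start, end = split
    return [
        k for k in range(1, m + 1)
        if any(lo <= start and end <= hi
               for lo, hi in block_ranges(k, m, r, sub_size, file_size))
    ]

# M = 9, R = 3; sub-blocks of 32 units give a 288-unit file (hypothetical units).
args = (9, 3, 32, 288)
s1 = candidate_blocks((0, 20), *args)     # lies inside sub-block 1
s2 = candidate_blocks((20, 70), *args)    # spans sub-blocks 1 to 3
s3 = candidate_blocks((70, 120), *args)   # spans sub-blocks 3 and 4
```

Block numbers 1 to 9 in this sketch correspond to divided file blocks 21 to 29 of FIG. 3C, so the first result, blocks 1, 8, and 9, corresponds to blocks 21, 28, and 29 named for unit split S1. A block that holds only part of a split is never returned.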

Because the job tracker server 410 assigns splits to the task tracker servers 420-1 to 420-4 in this way, a task tracker server performing the map task for a split needs to access only the divided file blocks stored in its own local storage and need not fetch data remotely from another server in order to process the split assigned to it.

Depending on the implementation, the HDFS distributed storage system 100 may include a processor and a memory that stores a program executed by the processor. The processor may include a single-core CPU or a multi-core CPU. The memory may include high-speed random access memory and may also include non-volatile memory such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state memory devices. Access to the memory by the processor and other components may be controlled by a memory controller. When executed by the processor, the program may cause the distributed storage system 100 according to this embodiment to perform the above-described method of providing a system for distributed storage in the HDFS.

The method of providing an HDFS distributed storage system and the method of providing a video MapReduce system according to embodiments of the present invention may be implemented in the form of computer-readable program instructions and stored on a computer-readable recording medium, and the control program and the target program according to embodiments of the present invention may likewise be stored on a computer-readable recording medium. Computer-readable recording media include every kind of recording device that stores data readable by a computer system.

The program instructions recorded on the recording medium may be specially designed and constructed for the present invention, or may be known to and usable by those skilled in the software field.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting a signal that specifies program instructions, data structures, and the like. A computer-readable recording medium may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.

Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed, using an interpreter or the like, by a device that processes information electronically, for example a computer.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The foregoing description of the present invention is given by way of example, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and from their equivalents should be construed as falling within the scope of the present invention.

Claims

An HDFS distributed storage system for distributing and storing a file in a Hadoop Distributed File System (HDFS), comprising:
a partitioning module that generates M divided file blocks corresponding to the file, where M is the number of blocks obtained when the file is divided into units of size S/R, R is the replication factor of the HDFS, and S is a predetermined basic block size; and
a control module that distributes and stores the M divided file blocks on data nodes of the HDFS,
wherein the i-th divided file block (1 ≤ i ≤ M-R+1) of the M divided file blocks
includes the data of the file contained in a region of the basic block size starting at the start point of the i-th block obtained when the file is divided into units of size S/R, and
the j-th divided file block (M-R+2 ≤ j ≤ M) of the M divided file blocks
includes the data of the file contained in the region from the start point of the j-th block obtained when the file is divided into units of size S/R to the end point of the file.

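The HDFS distributed storage system according to claim 1,
wherein the j-th divided file block (M-R+2 ≤ j ≤ M) of the M divided file blocks further includes the data of the file contained in the region from the start point of the file to the end point of the (R-(M-j+1))-th block obtained when the file is divided into units of size S/R.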

The system according to claim 1,
Wherein the file is a video file.

The system according to claim 3,
Wherein the control module
further stores a header block including header information of the video file on all data nodes on which the M divided file blocks are distributed and stored.

A video MapReduce system that performs MapReduce on a video file stored by the system according to claim 3 and includes a job tracker server and a plurality of task tracker servers,
Wherein the job tracker server
divides the video file into unit splits with reference to key frames and allocates each unit split to a map task executed by a task tracker server, among the plurality of task tracker servers, that stores a divided file block including the corresponding unit split, and
each task tracker server
performs a map task on the data included in the unit split allocated to it.

A method for providing an HDFS distributed storage system that distributes and stores a file in an HDFS, the method comprising:
Generating M divided file blocks, where M is the number of blocks obtained when the file is divided into units of size S/R, R is the replication factor of the HDFS, and S is a predetermined basic block size; and
Distributing and storing the M divided file blocks on data nodes of the HDFS,
Wherein the i-th divided file block (1 ≤ i ≤ M-R+1) of the M divided file blocks
includes the data contained in an area of the basic block size starting from the start point of the i-th block when the file is divided into units of size S/R, and
the j-th divided file block (M-R+2 ≤ j ≤ M)
includes the data contained in the area from the start point of the j-th block to the end point of the file when the file is divided into units of size S/R.

The method according to claim 6,
Wherein the j-th divided file block (M-R+2 ≤ j ≤ M) of the M divided file blocks is divided into the R-(M-j+1)-th block of the HDFS file system.

The method according to claim 6,
Wherein the file is a video file, and
the method further comprises:
Storing a header block including header information of the video file on all data nodes on which the M divided file blocks are distributed and stored.

9. A computer-readable recording medium on which a program for carrying out the method according to any one of claims 6 to 8 is recorded.

An HDFS distributed storage system comprising:
A processor; and
A memory storing a computer program executed by the processor,
Wherein the computer program, when executed by the processor, causes the HDFS distributed storage system to perform the method of any one of claims 6 to 8.

A method for providing a video MapReduce system that performs MapReduce on a video file stored by the system of claim 3 and includes a job tracker server and a plurality of task tracker servers, the method comprising:
(a) the job tracker server dividing the video file into unit splits based on key frames;
(b) the job tracker server allocating each unit split to a map task executed on a task tracker server, among the plurality of task tracker servers, that stores a divided file block including the corresponding unit split; and
(c) each task tracker server to which a unit split has been allocated performing a map task on the data included in the unit split allocated to it.
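
Steps (a) and (b) above can be sketched as a small scheduling routine. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function name `assign_unit_splits`, the representation of key frames as byte offsets, the `block_to_node` list, and the rule of picking the first divided file block that fully contains a split are all hypothetical choices made for the example.

```python
def assign_unit_splits(key_frame_offsets, file_size, block_ranges, block_to_node):
    """Map each key-frame-delimited unit split to a node holding its data.

    key_frame_offsets -- byte offsets at which key frames start (assumed input)
    file_size         -- size of the video file in bytes
    block_ranges      -- [start, end) byte range of each divided file block
    block_to_node     -- block_to_node[k] names the node storing block k
    Returns a list of ((split_start, split_end), node_or_None).
    """
    # Step (a): unit splits run from one key frame to the next, and the
    # last split runs to the end of the file.
    starts = sorted({o for o in key_frame_offsets if 0 <= o < file_size})
    boundaries = starts + [file_size]

    assignments = []
    for start, end in zip(boundaries, boundaries[1:]):
        node = None
        # Step (b): look for a divided file block that contains the whole
        # split, so the map task can read its input locally.
        for k, (b_start, b_end) in enumerate(block_ranges):
            if b_start <= start and end <= b_end:
                node = block_to_node[k]  # first match; a policy assumption
                break
        assignments.append(((start, end), node))
    return assignments
```

A split that no single divided file block contains is returned with `None`; how such a split would be handled is not stated in the claims and is left open here.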