KR20120084100A

KR20120084100A - Input format for binary format data in Hadoop MapReduce and binary data analysis method using the same

Info

Publication number: KR20120084100A
Application number: KR1020110005424A
Authority: KR
Inventors: 이영석; 이연희
Original assignee: 충남대학교산학협력단
Priority date: 2011-01-19
Filing date: 2011-01-19
Publication date: 2012-07-27
Anticipated expiration: 2031-01-19
Also published as: KR101218087B1

Abstract

The present invention relates to an input format in Hadoop MapReduce for the distributed processing of binary data having fixed-length records, and to a method of analyzing binary data using the input format. The input format comprises the steps of: (A) receiving the record length of the binary data; (B) defining an InputSplit for the data block to be processed, among the data blocks stored in the Hadoop Distributed File System (HDFS), by setting the boundary between the previous InputSplit and the block's own InputSplit at the point closest to the start of the block among the points that are n times the record length; (C) generating and returning a RecordReader that reads the whole of the InputSplit so defined, one record length at a time from its starting point; and (D) extracting records through the RecordReader as (Key, Value) pairs of the form (LongWritable, BytesWritable).
According to the input format of the present invention, fixed-length binary data can be processed in a distributed manner in a Hadoop environment without converting the data format, so that less storage space is required and processing is faster than with other forms of data.

Description

Input format for binary format data in Hadoop MapReduce and binary data analysis method using the same

The present invention relates to a new input format for processing binary-format data having fixed-length records in Hadoop MapReduce, and to a method of analyzing binary data using the input format.

Hadoop was developed to support distributed processing for Nutch. It is a data processing platform that provides the foundation for building and operating applications that process data ranging from hundreds of gigabytes to terabytes or petabytes. Because the data Hadoop handles is typically at least several hundred gigabytes in size, it is not stored on a single computer but is divided into blocks that are distributed across multiple computers. Hadoop therefore includes the Hadoop Distributed File System (HDFS), which allows input data to be divided for processing, and the distributed data is processed by MapReduce, which was developed for the parallel processing of large volumes of data in a cluster environment.

FIG. 1 is a conceptual diagram showing the flow of data during job processing in Hadoop MapReduce. The input file holds the data on which MapReduce is to be run and is usually stored in HDFS. Hadoop supports various data formats in addition to text.

When a job is started at the request of a client, the input format (InputFormat, 101) determines how the input file is divided and read. That is, it divides the input file over the data of the relevant blocks and returns InputSplits, and it also generates and returns a RecordReader (102) that converts an InputSplit into (key, value) pairs that a mapper can read. An InputSplit is the unit of data processed by a single map task in MapReduce. Hadoop offers input formats such as TextInputFormat, KeyValueInputFormat, and SequenceInputFormat. The representative input format is TextInputFormat, which divides an input file stored in blocks on line boundaries to form InputSplits, the logical input units, and returns a LineRecordReader that extracts records of the form (LongWritable, Text) from each InputSplit.

During the ordinary Map phase, the returned RecordReader reads records consisting of key-value pairs from the InputSplit and passes them to the mapper. The mapper runs each record through the processing defined in Map and produces records with new keys and values. The output format (OutputFormat, 103) is the format used to write the data generated by MapReduce to a file in HDFS; through its subclass RecordWriter (104), it writes the key-value records received as the result of MapReduce processing to HDFS, which completes the data processing.

Reflecting its origins in web crawling, Hadoop provides various input and output formats for processing text data; among them, the sequence file format provides input and output for data formats other than text. Hadoop also supports input and output of compressed files such as deflate, gzip, ZIP, bzip2, and LZO, and these compressed file formats have the advantage of using storage space more efficiently. However, processing an input file in a compressed file format requires decompressing it before the MapReduce job starts and recompressing the processed result, which lowers the processing speed. The sequence file format provides a container that can hold data in various formats, including binary, but it requires yet another conversion step, because the raw data must first be converted into the sequence form.

For these reasons, processing large volumes of binary data, such as images and communication packets, in a Hadoop distributed environment requires converting the data to text or to another form that Hadoop can recognize. This conversion consists of a single system reading the file to be converted, converting it, and writing it back for storage, which runs counter to the basic purpose of using Hadoop's distributed system to improve processing performance. A more effective method of processing binary data in a Hadoop distributed environment is therefore needed.

To solve the above problems of the prior art, an object of the present invention is to provide an input format that allows binary data consisting of fixed-length data records, such as NetFlow v5, to be processed effectively in the Hadoop distributed processing system.

Another object of the present invention is to provide a method of processing binary data having fixed-length records with Hadoop MapReduce using the above input format.

To achieve the above objects, the present invention provides an input format in Hadoop MapReduce for the distributed processing of binary data having fixed-length records, comprising the steps of: (A) receiving the record length of the binary data; (B) defining an InputSplit for the data block to be processed, among the data blocks stored in the Hadoop Distributed File System (HDFS), by setting the boundary between the previous InputSplit and the block's own InputSplit at the point closest to the start of the block among the points that are n times the record length; (C) generating and returning a RecordReader that reads the whole of the InputSplit so defined, one record length at a time from its starting point; and (D) extracting records through the RecordReader as (Key, Value) pairs of the form (LongWritable, BytesWritable).

The present invention also relates to a method of analyzing binary data having fixed-length records in Hadoop MapReduce using the input format. More specifically, the method comprises the steps of: (A) reading binary data having fixed-length records from data blocks of the Hadoop Distributed File System; (B) generating (K, V) pairs from the binary data using the input format of claim 1; (C) performing MapReduce using the (K, V) pairs; (D) converting the result of the MapReduce into binary data; and (E) storing the converted binary data in the Hadoop Distributed File System.

As described above, with the input format of the present invention, fixed-length binary data such as NetFlow v5 can be processed in a distributed manner in a Hadoop environment without converting the data format, so that less storage space is required and processing is faster than with other forms of data.

The binary data analysis method of the present invention can be used to build an intrusion detection system through applications such as packet pattern matching on a Hadoop system, and for analysis in fields that handle binary data, such as image data, genetic information, and cryptographic processing.

FIG. 1 is a conceptual diagram showing the flow of data during job processing in Hadoop.
FIG. 2 is a flowchart showing the procedure by which each Hadoop cluster node reads and processes a data block.
FIG. 3 is a class hierarchy diagram showing the input and output interfaces of the present invention against the Hadoop API structure.
FIG. 4 is a diagram comparing the processing of NetFlow v5 flows by the method of the present invention and by a conventional method using TextInputFormat.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. The drawings are only examples intended to make the content and scope of the technical idea of the invention easy to explain; they do not limit or alter its technical scope. It will also be apparent to those skilled in the art that various modifications and changes based on these examples are possible within the scope of the technical idea of the invention.

The present invention relates to an input format in Hadoop MapReduce for the distributed processing of binary data having fixed-length records, comprising the steps of: (A) receiving the record length of the binary data; (B) defining an InputSplit for the data block to be processed, among the data blocks stored in the Hadoop Distributed File System (HDFS), by setting the boundary between the previous InputSplit and the block's own InputSplit at the point closest to the start of the block among the points that are n times the record length; (C) generating and returning a RecordReader that reads the whole of the InputSplit so defined, one record length at a time from its starting point; and (D) extracting records through the RecordReader as (Key, Value) pairs of the form (LongWritable, BytesWritable).

Examples of binary data having fixed-length records include NetFlow v5 flow data, raw gene sequence data before conversion to text form, the data of a bitmap image file made up of pixels that each consist of position and color information, and the data of a wave file with a fixed-length chunk size. The invention is not limited to these examples and is applicable to any binary data whose record length is fixed.

FIG. 2 is a flowchart showing the procedure by which each cluster node reads and processes a data block in order to carry out the Hadoop MapReduce process with the input format of the present invention.

First, the record length of the binary data is received through the JobClient. One way to receive this value is to assign the record size to a specific property using a Configuration property so that all cluster nodes share it. When a block is opened for data processing, the start of the block is checked to see whether it is a point that is n times the record length, where n is 0 or a natural number. If the start of the block is an n-multiple of the record length, that point is defined as the start of the InputSplit. If it is not, the position is advanced 1 byte at a time, checking each time whether it is an n-multiple of the record length, until the first such point is reached, and that point is defined as the start of the InputSplit. In other words, the InputSplit for a data block runs from the n-multiple of the record length closest to the start of the block up to, but not including, the start of the InputSplit for the next data block.
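The boundary rule above can be sketched in a few lines of Python. This is an illustration only, not the patented implementation or the Hadoop API: the function names `aligned_split_start` and `define_splits` are hypothetical, offsets are assumed to be measured from the start of the file, and the byte-by-byte advance described in the text is replaced by equivalent modular arithmetic.

```python
from typing import List, Tuple


def aligned_split_start(block_start: int, record_len: int) -> int:
    """Return the first offset >= block_start that is a multiple of record_len."""
    if record_len <= 0:
        raise ValueError("record_len must be positive")
    remainder = block_start % record_len
    if remainder == 0:
        return block_start
    # Same result as stepping forward 1 byte at a time until aligned.
    return block_start + (record_len - remainder)


def define_splits(file_len: int, block_size: int, record_len: int) -> List[Tuple[int, int]]:
    """Return one (start, end) byte range per block, aligned to record boundaries."""
    starts = sorted({
        aligned_split_start(block_start, record_len)
        for block_start in range(0, file_len, block_size)
        if aligned_split_start(block_start, record_len) < file_len
    })
    # Each split ends where the next one begins; the last ends at end of file.
    ends = starts[1:] + [file_len]
    return list(zip(starts, ends))
```

For a 480-byte file, 100-byte blocks, and 48-byte records, `define_splits(480, 100, 48)` gives split starts at 0, 144, 240, 336, and 432, so no record is divided between two splits.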

Once the InputSplit is defined, a RecordReader is generated and returned; to support the map task on that InputSplit, it extracts records by reading one record length at a time from the start of the InputSplit. The (Key, Value) pair that the RecordReader passes to Map uses Hadoop's Writable class types (LongWritable, BytesWritable); for example, records are extracted and passed to Map in the form (offset from the start of the file, record data).
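As a rough illustration of this reader, the Python generator below yields (offset, record bytes) pairs from a split, standing in for the (LongWritable, BytesWritable) pairs of the text. The name `read_records` and its parameters are assumptions made for this sketch; it works on any seekable binary stream rather than on HDFS.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_records(stream: BinaryIO, split_start: int, split_end: int,
                 record_len: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (file offset, record bytes) for each record that starts in the split."""
    offset = split_start
    stream.seek(offset)
    while offset < split_end:
        record = stream.read(record_len)
        if len(record) < record_len:
            break  # trailing partial record at end of file: nothing more to emit
        yield offset, record
        offset += record_len


# Usage: two 2-byte records are read from the split covering bytes 4..7.
pairs = list(read_records(io.BytesIO(bytes(range(10))), 4, 8, 2))
```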

Taking NetFlow v5 flow data as an example, the NetFlow v5 packet data can be used as the Value. That is, the Value may be one or more items selected from the group consisting of the number of packets, the number of bytes, and the number of flows, composed into a single byte array.
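A Value of this kind could be packed and summed as in the following sketch. The byte layout (three unsigned 64-bit big-endian counters) and the names `pack_value` and `sum_values` are assumptions chosen for illustration; the source does not specify a layout.

```python
import struct
from typing import Iterable

# Assumed layout: packets, bytes, flows as unsigned 64-bit big-endian integers.
VALUE_FORMAT = "!QQQ"


def pack_value(packets: int, octets: int, flows: int = 1) -> bytes:
    """Compose the three counters into a single byte array."""
    return struct.pack(VALUE_FORMAT, packets, octets, flows)


def sum_values(values: Iterable[bytes]) -> bytes:
    """Reduce-style aggregation: add the counters of all values element-wise."""
    totals = [0, 0, 0]
    for value in values:
        for i, n in enumerate(struct.unpack(VALUE_FORMAT, value)):
            totals[i] += n
    return struct.pack(VALUE_FORMAT, *totals)
```

Keeping the Value as a fixed-size byte array means a reducer can aggregate it without any text parsing.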

The Key, too, may be set to a value other than the offset that is meaningful as an index of the record, depending on the data to be processed and the nature of the job. For example, in NetFlow analysis, to obtain the total number of packets, bytes, and flows per port number, the port number rather than the offset within the file can be used as the Key; to obtain the same totals per source IP, the source IP is defined as the Key. To obtain the totals per port number for each fixed time interval, the flow timestamp and the port number are composed into a single byte array and passed as the Key. Likewise, to analyze flow data per source IP for each fixed time interval, the flow timestamp and the source IP are composed into a single byte array and passed as the Key. In this way, any combination of the fields that make up the packet can be used as the key for analysis by MapReduce. Accordingly, the Key may be one selected from the group consisting of: the offset from the file; the source port number; the destination port number; the source IP address; the destination IP address; the flow timestamp and source port number composed into a single byte array; the flow timestamp and destination port number composed into a single byte array; the flow timestamp and source IP address composed into a single byte array; and the flow timestamp and source or destination IP address composed into a single byte array.
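The key choices listed above might look as follows in Python. The sketch assumes the commonly documented 48-byte NetFlow v5 flow record layout (the source does not spell it out), uses the record's `First` field as the flow timestamp, and introduces the hypothetical helpers `port_key` and `time_port_key`.

```python
import struct

# Assumed NetFlow v5 flow record layout (48 bytes, network byte order):
# srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets, first, last,
# srcport, dstport, pad1, tcp_flags, prot, tos, src_as, dst_as,
# src_mask, dst_mask, pad2
FLOW_RECORD = struct.Struct("!IIIHHIIIIHHBBBBHHBBH")
FIRST, SRCPORT, DSTPORT = 7, 9, 10  # field positions in the unpacked tuple


def port_key(record: bytes) -> bytes:
    """Key for per-port totals: the destination port as a 2-byte array."""
    fields = FLOW_RECORD.unpack(record)
    return struct.pack("!H", fields[DSTPORT])


def time_port_key(record: bytes, interval_ms: int) -> bytes:
    """Key for per-interval, per-port totals: time bucket plus destination port."""
    fields = FLOW_RECORD.unpack(record)
    bucket = fields[FIRST] // interval_ms * interval_ms  # start of the interval
    return struct.pack("!IH", bucket, fields[DSTPORT])
```

A mapper would emit such a key together with the counters for each record, so that the reduce step groups flows by port, or by time interval and port.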

When an InputSplit whose boundary with the previous InputSplit is the first record start within the block has been defined in this way and the RecordReader has been returned, the Mapper uses the RecordReader to read one record at a time from the InputSplit and runs the Map function. To determine whether all records in the block's InputSplit have been processed, the RecordReader checks whether the file offset of the start of the record it is about to pass lies outside the area of the block it is responsible for, so that it does not intrude on the InputSplit of the following block. As long as the offset does not pass the end of the block, the RecordReader keeps reading and producing records until the offset passes the end of the block.
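The effect of this boundary check can be seen in a small in-memory sketch. The function `records_for_block` is hypothetical and operates on a byte string rather than HDFS blocks; a record that starts inside a block but ends in the next one is read in full by the reader of the block where it starts.

```python
from typing import List, Tuple


def records_for_block(data: bytes, block_start: int, block_end: int,
                      record_len: int) -> List[Tuple[int, bytes]]:
    """Return the records whose start offset lies inside [block_start, block_end)."""
    remainder = block_start % record_len
    offset = block_start if remainder == 0 else block_start + record_len - remainder
    records = []
    # Stop as soon as the next record would start outside this block.
    while offset < block_end and offset + record_len <= len(data):
        records.append((offset, data[offset:offset + record_len]))
        offset += record_len
    return records


# Usage: 12 bytes, 5-byte blocks, 3-byte records.
data = bytes(range(12))
offsets = [
    off
    for start in range(0, len(data), 5)
    for off, _ in records_for_block(data, start, min(start + 5, len(data)), 3)
]
```

Here the record starting at offset 3 crosses the block boundary at byte 5 but is emitted only by the first block, so every record is processed exactly once across all blocks.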

The input format of the present invention, including this RecordReader, is named BinaryInputFormat.

FIG. 3 is a class hierarchy diagram showing the input format implemented in the present invention against the structure of the existing Hadoop application programming interface (API). The class names and hierarchy in FIG. 3 may vary somewhat depending on the implementation.

The present invention also relates to a method of analyzing binary data having fixed-length records in Hadoop MapReduce using the BinaryInputFormat. More specifically, the method comprises the steps of: (A) reading binary data having fixed-length records from data blocks of the Hadoop Distributed File System; (B) generating (K, V) pairs from the binary data using the BinaryInputFormat of the present invention; (C) performing MapReduce using the (K, V) pairs; (D) converting the result of the MapReduce into binary data; and (E) storing the converted binary data in the Hadoop Distributed File System.

FIG. 4 compares the process (400) of handling NetFlow v5 flows with the prior-art TextInputFormat and the process (410) of handling NetFlow v5 flows with the BinaryInputFormat of the present invention.

The process of handling NetFlow v5 flows with TextInputFormat is described with reference to FIG. 4. First, a flow exporter (401) such as nProbe collects packets, generates flow information in real time, and sends it to a flow collector (402). The flow collector (402) captures the received flows and stores the flow data. Because the stored flow data is in binary format, a text converter (403) such as flow-print is run to read the stored flow data, convert it into text-form flow data that Hadoop can process, and store it in HDFS. The flow packet headers are dropped automatically during the flow-print step.

Once flow analysis is run by the flow analyzer (404) based on Hadoop MapReduce, the flows are read block by block from HDFS, and TextInputFormat (405) generates (K, V) pairs from the text-form data. The result of the map and reduce tasks on these (K, V) pairs is written back to HDFS as text by TextOutputFormat (406).

With the method of the present invention, by contrast, when the flow information generated from the packets collected by the flow exporter (411) is sent to the flow collector (412), the flow collector (412) captures the received flows, removes the flow packet headers, and stores the result in HDFS without any separate conversion step. When flow analysis is run by the Hadoop MapReduce flow analyzer (413) using BinaryInputFormat, the flows are read block by block from HDFS, and BinaryInputFormat (414) extracts binary records from each block and passes them to Map. After the Map and Reduce phases, the result is written in binary form and stored in HDFS.
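The header removal performed by the collector could be sketched as below. This assumes the commonly documented NetFlow v5 export datagram, with a 24-byte header whose first two fields are the version and the record count, followed by 48-byte flow records; the function name `flow_records` is hypothetical and the source does not describe the collector at this level of detail.

```python
import struct

HEADER_LEN = 24  # assumed NetFlow v5 export header size
RECORD_LEN = 48  # assumed NetFlow v5 flow record size


def flow_records(packet: bytes) -> bytes:
    """Strip the export header and return only the fixed-length flow records."""
    version, count = struct.unpack_from("!HH", packet, 0)
    if version != 5:
        raise ValueError("not a NetFlow v5 export packet")
    body = packet[HEADER_LEN:HEADER_LEN + count * RECORD_LEN]
    if len(body) != count * RECORD_LEN:
        raise ValueError("truncated packet")
    return body
```

Appending the returned bytes of successive packets to one file yields a stream made up purely of 48-byte records, which is the precondition for splitting on record-length multiples.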

Writing the MapReduce result in a particular form is the job of the output format (OutputFormat). The output format of the present invention can be implemented simply by extending FileOutputFormat, the class for file output to HDFS, and is named BinaryOutputFormat. With both the Key and the Value of the output record set to BytesWritable, the binary data produced as the MapReduce analysis result can be written to HDFS as BytesWritable values.
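One possible shape for such binary output is sketched below. The on-disk layout (key bytes immediately followed by value bytes, both of fixed length) is an assumption made for illustration, as are the names `write_record` and `read_back`; the source only states that key and value are both byte arrays.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def write_record(out: BinaryIO, key: bytes, value: bytes) -> None:
    """Write one output record as raw key bytes followed by raw value bytes."""
    out.write(key)
    out.write(value)


def read_back(data: bytes, key_len: int, value_len: int) -> Iterator[Tuple[bytes, bytes]]:
    """Split the written bytes back into (key, value) pairs of fixed length."""
    step = key_len + value_len
    for off in range(0, len(data) - step + 1, step):
        yield data[off:off + key_len], data[off + key_len:off + step]


# Usage: two records with 2-byte keys and 4-byte values.
buf = io.BytesIO()
write_record(buf, b"\x00P", b"\x01" * 4)
write_record(buf, b"\x01\xbb", b"\x02" * 4)
pairs = list(read_back(buf.getvalue(), 2, 4))
```

Under this assumption the output is itself a file of fixed-length records, so it could be fed to a later job through the same fixed-length input format.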

As described above, using the BinaryInputFormat of the present invention removes the need for a separate data conversion step, which saves time and storage space and improves data processing efficiency.

Claims

An input format in Hadoop MapReduce for the distributed processing of binary data having fixed-length records, comprising the steps of:
(A) receiving the record length of the binary data;
(B) defining an InputSplit for the data block to be processed, among the data blocks stored in the Hadoop Distributed File System (HDFS), by setting the boundary between the previous InputSplit and the block's own InputSplit at the point closest to the start of the block among the points that are n times the record length;
(C) generating and returning a RecordReader that reads the whole of the InputSplit so defined, one record length at a time from its starting point; and
(D) extracting records through the RecordReader as (Key, Value) pairs of the form (LongWritable, BytesWritable).

The input format in Hadoop MapReduce of claim 1,
wherein the binary data is NetFlow v5 flow data.

The input format in Hadoop MapReduce of claim 2,
wherein the Key is one selected from the group consisting of: the offset from the file; the source port number; the destination port number; the source IP address; the destination IP address; the flow timestamp and source port number composed into a single byte array; the flow timestamp and destination port number composed into a single byte array; the flow timestamp and source IP address composed into a single byte array; and the flow timestamp and source or destination IP address composed into a single byte array, and
wherein the Value is one or more items selected from the group consisting of the number of packets, the number of bytes, and the number of flows, composed into a single byte array.

The input format in Hadoop MapReduce of claim 1,
wherein the binary data is raw gene sequence data before conversion to text form, the data of a bitmap image file made up of pixels that each consist of position and color information, or the data of a wave file with a fixed-length chunk size.

A method of analyzing binary data having fixed-length records in Hadoop MapReduce using the input format of any one of claims 1 to 4, comprising the steps of:
(A) reading binary data having fixed-length records from data blocks of the Hadoop Distributed File System;
(B) generating (Key, Value) pairs from the binary data using the input format of claim 1;
(C) performing MapReduce using the (Key, Value) pairs;
(D) converting the result of the MapReduce into binary data; and
(E) storing the converted binary data in the Hadoop Distributed File System.