KR20120084880A

KR20120084880A - InputFormat for handling network packet data on Hadoop MapReduce

Info

Publication number: KR20120084880A
Application number: KR1020110006180A
Authority: KR
Inventors: 이영석; 이연희
Original assignee: 충남대학교산학협력단
Priority date: 2011-01-21
Filing date: 2011-01-21
Publication date: 2012-07-31
Also published as: KR101200773B1

Abstract

PURPOSE: An input format for analyzing network packets in Hadoop MapReduce is provided. When binary packet data with variable-length records is processed in a distributed manner in a Hadoop environment, the input format defines an InputSplit for each distributed node so that the nodes can access and process the data simultaneously. CONSTITUTION: Information about the start time and end time of the packet capture is obtained from a configuration property. The start point of the first packet is searched for in the data block to be processed among the data blocks stored in HDFS (Hadoop Distributed File System). The InputSplit is defined by using the start point of the first packet as the boundary between the previous InputSplit and the block's own InputSplit.

Description

InputFormat for handling network packet data on Hadoop MapReduce

The present invention relates to a new input format for processing packet data in a binary format with variable-length records in Hadoop MapReduce.

Hadoop was developed to support distributed processing for Nutch. It is a data processing platform that provides the foundation for building and operating applications that process data ranging from hundreds of gigabytes to terabytes or petabytes. Because the data Hadoop handles is typically at least several hundred gigabytes in size, it is not stored on a single computer but is divided into blocks that are distributed across multiple computers. Hadoop therefore includes the Hadoop Distributed File System (HDFS), which allows input data to be divided for processing, and the distributed data is processed by MapReduce, which was developed for parallel processing of large volumes of data in a cluster environment.

FIG. 1 is a conceptual diagram showing the flow of data during job processing in Hadoop MapReduce. The input file holds the data on which MapReduce is to be performed and is usually stored in HDFS. Hadoop supports various data formats in addition to text.

When a job is started at a client's request, the input format (InputFormat, 101) determines how the input file is divided and read. That is, it divides the input file over the data of the corresponding block and returns InputSplits, and it also creates and returns a RecordReader (102) that converts an InputSplit into (key, value) pairs that a mapper can read. An InputSplit is the unit of data processed by a single map task in MapReduce. Hadoop provides input formats such as TextInputFormat, KeyValueInputFormat, and SequenceInputFormat. The representative input format is TextInputFormat, which divides an input file stored in block units on line boundaries to form InputSplits, the logical input units, and returns a LineRecordReader that extracts records of the form (LongWritable, Text) from each InputSplit.

The returned RecordReader reads records consisting of key-value pairs from the InputSplit during the Map phase and passes them to the mapper. The mapper processes each record as defined in the Map function and produces a record with a new key and value. The output format (OutputFormat, 103) is the format for writing the data generated by MapReduce to a file in HDFS. Through its associated RecordWriter (104), the output format writes the key-value records received as the result of MapReduce processing to HDFS, which completes the data processing.

In keeping with its origins in web crawling, Hadoop provides a variety of input and output formats for processing text data. Among them, the sequence file format provides input and output for data formats other than text. Hadoop also supports input and output of compressed files such as deflate, gzip, ZIP, bzip2, and LZO, and these compressed file formats have the advantage of improving storage efficiency. However, processing an input file in a compressed format requires decompressing it before the MapReduce job starts and compressing the results again afterward, which lowers processing speed. The sequence file format provides a container for data in various formats, including binary, but it requires an additional conversion step in which the raw data to be stored is first converted into sequence form.

For these reasons, processing large volumes of binary data such as images and communication packets in a Hadoop distributed environment requires converting the data to text or to another form that Hadoop can recognize. This conversion is performed by a single system that reads the file, converts it, and writes it back to storage, which runs counter to the fundamental purpose of using Hadoop's distributed system to improve processing performance. A more effective method for processing binary data in a Hadoop distributed environment is therefore needed.

To solve the problems of the prior art described above, an object of the present invention is to provide an input format that allows binary packet data with variable-length data records to be processed effectively in the Hadoop distributed processing system.

To achieve the above object, the present invention relates to an input format in Hadoop MapReduce for distributed processing of binary packet data with variable-length records, comprising the steps of:
(A) obtaining information on the start time and end time of the packet capture from a configuration property;
(B) searching for the start point of the first packet in the data block to be processed among the data blocks stored in the Hadoop Distributed File System (HDFS);
(C) defining an InputSplit by setting the boundary between the previous InputSplit and the block's own InputSplit, with the start point of the first packet as the start point of the InputSplit;
(D) creating and returning a RecordReader that, over the entire region of the InputSplit defined above, reads from the start point by the captured packet length (CapLen) recorded in the pcap header of each packet; and
(E) extracting records through the RecordReader with (Key, Value) in the form (LongWritable, BytesWritable).

The start point of the first packet may be found by assuming that the start byte of the block is the start of the first packet and performing the steps of:
(a) extracting, from the pcap header of the packet at the assumed point, header information including the timestamp, the captured packet length (capture length, CapLen), and the actual length of the packet on the wire (wired length, WiredLen);
(b) moving forward from the start byte of the first packet by (length of the pcap header + CapLen) obtained in (a);
(c) assuming that the point reached in (b) is the start of the second packet, and extracting header information including the timestamp, CapLen, and WiredLen from the pcap header of the second packet;
(d) verifying, from the pcap header information of the first and second packets obtained in (a) and (c), whether the point assumed to be the start of the first packet is in fact its start point; and
(e) if the verification in (d) fails, assuming that the point one byte past the previously assumed position is the start of the first packet and repeating steps (a) to (d) until the start point of the first packet is found.

As described above, when binary packet data with variable-length records is processed in a distributed manner in a Hadoop environment, the input format of the present invention defines an InputSplit for each distributed node so that the nodes can access and process the data simultaneously. Because binary packet data is extracted from the InputSplit and delivered to the mapper, the data can be processed without any format conversion, which requires less storage space than other forms of data and enables faster processing.

The input format of the present invention for binary packet data with variable-length records can be used to build an intrusion detection system through applications such as packet pattern matching on a Hadoop system.

FIG. 1 is a conceptual diagram showing the flow of data during job processing in Hadoop.
FIG. 2 is a flowchart showing the procedure by which each Hadoop cluster node reads and processes a data block.
FIG. 3 is a flowchart showing a method for finding the start byte of the first packet at 201 of FIG. 2 according to an embodiment of the present invention.
FIG. 4 is a schematic diagram for explaining packet processing by the method of FIG. 3.

Hereinafter, the present invention is described in more detail with reference to the accompanying drawings. The drawings are only examples intended to make the content and scope of the technical idea of the present invention easy to explain, and they do not limit or change the technical scope of the invention. It will be apparent to those skilled in the art that various modifications and changes can be made within the scope of the technical idea of the present invention on the basis of these examples.

The present invention relates to an input format in Hadoop MapReduce for distributed processing of packet data, which is binary data with variable-length records. The input format comprises the steps of: (A) obtaining information on the start time and end time of the packet capture from a configuration property; (B) searching for the start point of the first packet in the data block to be processed among the data blocks stored in the Hadoop Distributed File System (HDFS); (C) defining an InputSplit by setting the boundary between the previous InputSplit and the block's own InputSplit, with the start of the first packet as the start point of the InputSplit; (D) creating and returning a RecordReader that, over the entire region of the InputSplit defined above, reads from the start point by the captured packet length (CapLen) recorded in the pcap header of each packet; and (E) extracting records through the RecordReader with (Key, Value) in the form (LongWritable, BytesWritable).

FIG. 2 is a flowchart showing the procedure by which each cluster node reads and processes a data block using the input format of the present invention, in order to read a large packet trace file and analyze the packets with Hadoop MapReduce. In FIG. 2, it is assumed that the information on the start time and end time of the packet capture has already been obtained through a configuration property before the job is executed.

When a block is opened for data processing, it is first checked whether the start of the block is the start of a packet. If the block is the first block of the packet trace file, the start of the block is the start of a packet, so that point is defined as the start point of the InputSplit. If the block is not the first block of the packet trace file, the start of the block may not coincide with the start of a packet, so the process of finding the start point for actual packet processing (201) must be performed.

FIG. 3 shows an embodiment for finding the start point of the first packet in a data block.
(A) First, the start byte of the block is assumed to be the start of a packet, and header information including the timestamp, the captured packet length (capture length, CapLen), and the actual length of the packet on the wire (wired length, WiredLen) is extracted from the pcap header at the assumed point. These values are denoted TS1, CapLen1, and WiredLen1 below. The pcap header normally records the timestamp in the first 8 bytes, CapLen in the next 4 bytes, and WiredLen in the following 4 bytes, so this information can be extracted by reading 16 bytes from the start byte of the block. Because the first 4 bytes of the timestamp alone give the time in seconds, only those 4 bytes may be used, or all 8 bytes may be used when higher accuracy is desired.
(B) Once the data for the first packet has been extracted, the header information for the second packet is extracted in the same way from the point assumed to be the start of the second packet. These values are denoted TS2, CapLen2, and WiredLen2 below. The start of the second packet is the point reached by moving forward from the start of the first packet by the length of its pcap header (normally 16 bytes) plus the CapLen recorded in that header.
(C) From the header information of the first and second packets obtained in (A) and (B), it is verified whether the first byte of the data block is in fact the start point of the first packet.
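
Steps (A) and (B) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of the invention: the function names, the `make_record` helper, and the sample timestamps are hypothetical, and little-endian byte order is an assumption (in an actual pcap file a 24-byte global file header precedes the first record, and its magic number indicates the byte order).

```python
import struct

PCAP_HEADER_LEN = 16  # 8-byte timestamp + 4-byte CapLen + 4-byte WiredLen


def read_pcap_header(buf, offset, byte_order="<"):
    """Read (ts_sec, cap_len, wired_len) from the 16 bytes at `offset`.

    Returns None when fewer than 16 bytes are available. Only the first
    4 timestamp bytes (seconds) are kept; the next 4 bytes are skipped.
    """
    if offset < 0 or offset + PCAP_HEADER_LEN > len(buf):
        return None
    ts_sec, _ts_frac, cap_len, wired_len = struct.unpack_from(
        byte_order + "IIII", buf, offset
    )
    return ts_sec, cap_len, wired_len


def second_packet_offset(first_offset, cap_len):
    """Assumed start of the second packet: pcap header length plus CapLen further on."""
    return first_offset + PCAP_HEADER_LEN + cap_len


def make_record(ts_sec, payload):
    """Hypothetical helper that builds one pcap record for this demonstration."""
    return struct.pack("<IIII", ts_sec, 0, len(payload), len(payload)) + payload


# Two back-to-back records with 60-byte and 64-byte payloads (made-up data).
block = make_record(1295568000, bytes(60)) + make_record(1295568001, bytes(64))

ts1, cap_len1, wired_len1 = read_pcap_header(block, 0)        # step (A)
offset2 = second_packet_offset(0, cap_len1)                   # 16 + 60 bytes further on
ts2, cap_len2, wired_len2 = read_pcap_header(block, offset2)  # step (B)
```

Returning `None` for a short read keeps the caller free to decide whether a candidate offset can be verified at all.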

The method of verifying the start point of a packet is described with reference to FIG. 3.
a) It is checked whether TS1 and TS2 are valid values within the range from the capture start time to the capture end time obtained from the configuration property.
b) It is checked whether the difference between WiredLen1 and CapLen1 is smaller than the difference between the maximum and minimum packet lengths. The same check is applied to WiredLen2 and CapLen2. The maximum and minimum packet lengths are assumed to be 1,518 bytes and 64 bytes, respectively, following the definition of the Ethernet frame.
c) It is verified whether the packets with TS1 and TS2 can be regarded as having arrived consecutively. For this purpose, a delta time within which two packets are regarded as consecutive is chosen, and the difference between TS1 and TS2 is checked against it. A delta time of 5 seconds or less is normally preferable, but it can of course be adjusted to suit the network environment and other variables.
If conditions a), b), and c) are all satisfied, the currently assumed start byte is accepted as the actual start byte of the packet. If any one of a), b), and c) is not satisfied, the search moves to the next byte, assumes it to be the start of the packet, and repeats a), b), and c). The start point of the first packet in the data block is found by repeating this verification.
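
The verification and the byte-by-byte search can be sketched as follows. This is only an illustration under stated assumptions: the function names and sample data are hypothetical, little-endian headers are assumed, condition b) is read as an absolute difference, and a candidate whose second header cannot be read is simply skipped, which the text does not specify.

```python
import struct

PCAP_HEADER_LEN = 16
MAX_PACKET_LEN = 1518  # maximum Ethernet frame length assumed in the text
MIN_PACKET_LEN = 64    # minimum Ethernet frame length assumed in the text


def read_header(buf, offset):
    """Return (ts_sec, cap_len, wired_len) at `offset`, or None if out of range."""
    if offset + PCAP_HEADER_LEN > len(buf):
        return None
    ts_sec, _frac, cap_len, wired_len = struct.unpack_from("<IIII", buf, offset)
    return ts_sec, cap_len, wired_len


def is_packet_start(h1, h2, capture_start, capture_end, delta_time=5):
    """Apply conditions a), b), and c) to two consecutive candidate headers."""
    (ts1, cap1, wired1), (ts2, cap2, wired2) = h1, h2
    limit = MAX_PACKET_LEN - MIN_PACKET_LEN
    in_window = all(capture_start <= ts <= capture_end for ts in (ts1, ts2))  # a)
    lengths_ok = abs(wired1 - cap1) < limit and abs(wired2 - cap2) < limit    # b)
    consecutive = abs(ts2 - ts1) <= delta_time                                # c)
    return in_window and lengths_ok and consecutive


def find_first_packet(buf, block_start, block_end, capture_start, capture_end):
    """Slide one byte at a time from block_start until a candidate passes all checks."""
    for candidate in range(block_start, block_end):
        h1 = read_header(buf, candidate)
        if h1 is None:
            break  # fewer than 16 bytes left in the data
        h2 = read_header(buf, candidate + PCAP_HEADER_LEN + h1[1])
        if h2 is not None and is_packet_start(h1, h2, capture_start, capture_end):
            return candidate
    return None


def make_record(ts_sec, cap_len):
    """Hypothetical helper: one pcap record with a zero-filled payload."""
    return struct.pack("<IIII", ts_sec, 0, cap_len, cap_len) + bytes(cap_len)


T0 = 1295568000  # made-up capture start time in seconds
data = make_record(T0, 60) + make_record(T0 + 1, 64) + make_record(T0 + 2, 70)

# The block is assumed to begin at byte 30, in the middle of the first payload.
first = find_first_packet(data, 30, len(data), T0, T0 + 100)
```

Here the search rejects every offset inside the first payload because the bytes read as a timestamp fall outside the capture window, and it stops at byte 76, where the second record begins.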

FIG. 3 uses all of a), b), and c) to verify the packet start point, but this is only one example. The packet start point may be verified using only one or two of these conditions, and additional information may also be used. Naturally, the more conditions are used, the more accurate the verification of the packet start point can be.

Once the start point of the first packet in the data block has been reached by the method illustrated in FIG. 3, it is defined as the start point of the InputSplit. That is, the InputSplit for a data block extends from the start point of its first packet up to, but not including, the start point of the InputSplit for the next data block.
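
The boundary rule above can be illustrated with a small Python sketch. It assumes that the first-packet offset of every block is already known as an absolute file offset; the function name and the sample offsets are hypothetical.

```python
def define_splits(first_packet_offsets, file_length):
    """Build (start, end) ranges, one per block, from first-packet offsets.

    Each split starts at the first packet of its block and ends just before
    the first packet of the next block; the last split ends at end of file.
    """
    splits = []
    for i, start in enumerate(first_packet_offsets):
        if i + 1 < len(first_packet_offsets):
            end = first_packet_offsets[i + 1]
        else:
            end = file_length
        splits.append((start, end))
    return splits


# Made-up offsets: the first packets of three blocks start at bytes 0, 156, and 410.
splits = define_splits([0, 156, 410], 500)
```

With these sample values the splits are contiguous and never cut a packet in two, because every boundary is a packet start.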

Once the InputSplit is defined, a RecordReader is created and returned to perform the map task on that InputSplit. Starting at the start point of the InputSplit, the RecordReader reads the CapLen recorded in the pcap header and then reads the actual packet of that length. The (Key, Value) pair passed to the Map function by the RecordReader uses Hadoop's Writable class types (LongWritable, BytesWritable), and the offset from the start of the file can be used as the Key. As the Value, the packet corresponding to a specific protocol in the OSI 7-layer model can be extracted and passed, such as the Ethernet frame that makes up the full bytes of the packet record, an IP packet, a TCP segment, a UDP segment, or an HTTP payload. Likewise, the packet with its pcap header left in place, that is, all bytes including the pcap header and the Ethernet frame, may be used as the Value. Packets of any protocol in the OSI 7-layer model, such as ICMP, ARP, RIP, or SSL, may also be used as the Value, and the Value is not limited to these. A person skilled in the art can readily select an appropriate Value according to the data to be analyzed.
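
The choice of Key and Value can be sketched as below. This is an illustration only: plain Python integers and `bytes` stand in for LongWritable and BytesWritable, the `value_kind` option names are hypothetical, little-endian headers are assumed, and the "ip" case assumes a 14-byte Ethernet header without a VLAN tag.

```python
import struct

PCAP_HEADER_LEN = 16
ETHERNET_HEADER_LEN = 14  # destination MAC + source MAC + EtherType, no VLAN tag


def make_key_value(buf, record_offset, value_kind="ethernet"):
    """Return (key, value) for the pcap record that starts at record_offset.

    key   : offset of the record from the start of the file
    value : bytes selected by value_kind (hypothetical option names):
            "record"   -> pcap header plus Ethernet frame
            "ethernet" -> Ethernet frame only
            "ip"       -> bytes after the 14-byte Ethernet header
    """
    _ts, _frac, cap_len, _wired = struct.unpack_from("<IIII", buf, record_offset)
    frame_start = record_offset + PCAP_HEADER_LEN
    frame = bytes(buf[frame_start:frame_start + cap_len])
    if value_kind == "record":
        return record_offset, bytes(buf[record_offset:frame_start + cap_len])
    if value_kind == "ip":
        return record_offset, frame[ETHERNET_HEADER_LEN:]
    if value_kind == "ethernet":
        return record_offset, frame
    raise ValueError(f"unknown value_kind: {value_kind}")


def make_record(ts_sec, payload):
    """Hypothetical helper that builds one pcap record for this demonstration."""
    return struct.pack("<IIII", ts_sec, 0, len(payload), len(payload)) + payload


# A made-up 60-byte Ethernet frame: zeroed MAC addresses, EtherType 0x0800 (IPv4),
# then 46 bytes that begin with 0x45, the usual first byte of an IPv4 header.
frame = bytes(6) + bytes(6) + b"\x08\x00" + b"\x45" + bytes(45)
data = make_record(1295568000, bytes(60)) + make_record(1295568001, frame)

key, value = make_key_value(data, 76, "ip")
```

In this example the second record starts at byte 76, so the Key is 76 and the Value is the 46 bytes that follow the Ethernet header.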

When an InputSplit whose boundary with the previous InputSplit is the start point of the first packet in the block has been defined and the RecordReader has been returned as described above, the Mapper uses the RecordReader to read one record at a time from the InputSplit and executes the Map function. To determine whether all records in the InputSplit for the data block have been processed, the RecordReader checks whether the offset of the start of the record to be delivered lies outside the area of the block it is responsible for, so that it does not intrude into the InputSplit of the following block. While the offset remains within the block, the RecordReader keeps reading and producing records until the offset passes the end of the block. If the last packet is stored across the boundary with the next block, part of the next block is read to complete the packet record before it is returned.
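
The reading loop and its boundary check can be sketched as a Python generator. This is a simplified illustration: a single in-memory buffer stands in for the HDFS file stream, the function name and block size are hypothetical, and little-endian headers are assumed.

```python
import struct

PCAP_HEADER_LEN = 16


def iter_block_records(buf, split_start, block_end):
    """Yield (offset, frame) for every record that starts before block_end.

    A record that starts inside the block but ends after block_end is still
    read in full, which corresponds to reading part of the next block.
    """
    offset = split_start
    while offset < block_end and offset + PCAP_HEADER_LEN <= len(buf):
        _ts, _frac, cap_len, _wired = struct.unpack_from("<IIII", buf, offset)
        frame_start = offset + PCAP_HEADER_LEN
        frame_end = frame_start + cap_len
        if frame_end > len(buf):
            break  # truncated final record
        yield offset, bytes(buf[frame_start:frame_end])
        offset = frame_end


def make_record(ts_sec, cap_len):
    """Hypothetical helper: one pcap record with a zero-filled payload."""
    return struct.pack("<IIII", ts_sec, 0, cap_len, cap_len) + bytes(cap_len)


# Three records starting at bytes 0, 76, and 156; block size of 100 bytes (made up).
data = make_record(1295568000, 60) + make_record(1295568001, 64) + make_record(1295568002, 70)

block0 = list(iter_block_records(data, 0, 100))    # second record spills past byte 100
block1 = list(iter_block_records(data, 156, 200))  # starts at the first packet of block 1
```

Because the stop condition looks only at where a record starts, the record at byte 76 is completed by the reader of the first block, and the reader of the second block begins at byte 156 without reading it again.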

The input format of the present invention, including the RecordReader described above, is named the packet input format (PcapInputFormat).

FIG. 4 shows an example to aid understanding of the data processing method using the PcapInputFormat (105) of the present invention. The outline of how the cluster nodes process a large packet trace file is as follows.
1) The predefined capture start time and the period up to the end of the capture are obtained and registered as configuration properties.
2) When the job starts, the configuration information is passed to each task. Each node in the cluster accesses the block it is responsible for and moves forward one byte at a time from the start of the block to find the start point of the first packet in that block.
3) From that position, the node reads each packet using the capture length recorded in the packet header and passes the packet to the map function.
4) The node repeats this process of reading and passing packets for all packets in the block. The process continues as long as the offset of the start of the next packet does not exceed the block size, which completes the processing of the split, that is, the data that a single map task must handle.

Claims

An input format in Hadoop MapReduce for distributed processing of binary packet data having variable-length records, comprising:
(A) obtaining information on the start time and end time of the packet capture from a configuration property;
(B) searching for the start point of the first packet in the data block to be processed among the data blocks stored in the Hadoop Distributed File System (HDFS);
(C) defining an InputSplit by setting the boundary between the previous InputSplit and the block's own InputSplit, with the start of the first packet as the start point of the InputSplit;
(D) creating and returning a RecordReader that, over the entire region of the InputSplit defined above, reads from the start point by the captured packet length (CapLen) recorded in the pcap header of each packet; and
(E) extracting records through the RecordReader with (Key, Value) in the form (LongWritable, BytesWritable).

The input format of claim 1,
wherein the start point of the first packet is found by assuming that the start byte of the block is the start of the first packet and performing the steps of:
(a) extracting, from the pcap header of the packet at the assumed point, header information including the timestamp, the captured packet length (CapLen), and the actual length of the packet on the wire (WiredLen);
(b) moving forward from the start byte of the first packet by (length of the pcap header + CapLen) obtained in (a);
(c) assuming that the point reached in (b) is the start of the second packet, and extracting header information including the timestamp, the captured packet length (CapLen), and the actual length of the packet on the wire (wired length, WiredLen) from the pcap header of the second packet;
(d) verifying, from the pcap header information of the first and second packets obtained in (a) and (c), whether the point assumed to be the start of the first packet is in fact its start point; and
(e) if the verification in (d) fails, assuming that the point one byte past the previously assumed position is the start of the first packet and repeating steps (a) to (d) until the start point of the first packet is found.

The input format of claim 2,
wherein in step (d) the point assumed to be the start of the first packet is determined to be the start point of the first packet if:
the timestamp of the first packet and the timestamp of the second packet, obtained in steps (a) and (c) respectively, are both valid values within the range from the capture start time to the capture end time obtained from the configuration property in step (A);
the difference between WiredLen and CapLen of the first packet obtained in step (a) is less than the difference between the maximum packet length and the minimum packet length; and
the difference between WiredLen and CapLen of the second packet obtained in step (c) is less than the difference between the maximum packet length and the minimum packet length.

The input format of claim 3,
wherein step (d) further uses, as a condition, whether the difference between the timestamp of the first packet and the timestamp of the second packet, obtained in steps (a) and (c) respectively, falls within the delta time range in which the packets are regarded as consecutive.

The input format according to any one of claims 1 to 4,
wherein the Key is the offset value from the start of the file, and
the Value is one selected from the group consisting of an Ethernet frame corresponding to all bytes of the packet record, an IP packet, a TCP segment, a UDP segment, an HTTP payload, ICMP, ARP, RIP, and SSL packets, and the full packet including the pcap header and the Ethernet frame.