KR20160050745A

KR20160050745A - Method and Apparatus for Processing Data Based on Real-Time or Batch Processing

Info

Publication number: KR20160050745A
Application number: KR1020140149635A
Authority: KR
Inventors: 이재영; 박근태; 이정룡; 최승운
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2014-10-30
Filing date: 2014-10-30
Publication date: 2016-05-11

Abstract

Disclosed are a method and an apparatus for processing data on a real-time or batch basis. The apparatus is a data processing apparatus that processes big data through real-time processing or batch processing. The method comprises: storing big data collected at the data processing apparatus; storing a mart table, generated based on the big data, on a disk or in a memory depending on the size of the mart table; and providing real-time or batch processing result information based on a query received from a manager device. According to the present invention, real-time processing or batch processing can be performed using a single query.

Description

TECHNICAL FIELD [0001] The present invention relates to a real-time or batch-based data processing method and apparatus.

The present embodiment relates to a method and apparatus for processing data on a real-time or batch basis.

The contents described in this section merely provide background information on the present embodiment and do not constitute prior art.

A typical big data system uses two separate approaches: processing data in real time, or storing the data first and then processing it in batch. Performing real-time processing and batch processing separately in this way is inefficient.

To process data in real time, a typical big data system first stores the data in memory (in-memory processing) and writes it to disk only after the real-time processing. Consequently, if a problem (e.g., an error or loss) occurs with the data in memory, the data that was to be stored on disk may be lost, and batch processing using disk-stored data becomes impossible.

In a typical big data system, the data collected by a data collector must be supplied separately to an input unit for real-time processing and an input unit for batch processing, and queries for real-time processing and batch processing are also received separately, so unnecessary data input and output must be performed.

In addition, technologies for processing big data generally handle large data well but have the drawback of slow processing speed for small data.

The main object of the present embodiment is to provide a real-time or batch-based data processing method and apparatus in which a data processing apparatus for real-time or batch processing of big data stores the big data it has collected on a disk, stores each mart table generated from the big data on the disk or in a memory according to the size of the mart table, and provides real-time or batch processing result information based on a query received from a manager device.

According to one aspect of the present embodiment, there is provided a data processing method in which a data processing apparatus processes previously collected big data in real time or in batch, the method comprising: a disk storing step of storing the big data on a disk; a table generation step of analyzing the big data and generating a plurality of mart tables for the big data based on the analysis result; a table classification step of comparing each of the plurality of mart tables with a predetermined threshold size and loading it into a memory; a data processing step of, based on a query received from a manager device, batch-processing the mart tables stored on the disk using a batch processing engine or processing the mart tables loaded in the memory in real time using a real-time processing engine; and a result providing step of providing processing result information from the real-time processing engine or the batch processing engine.

As described above, according to the present embodiment, real-time processing and batch processing can be performed on big data simultaneously or sequentially, and when a mart table generated for batch processing is small, it can be processed in real time right away. In addition, an EDW (Enterprise Data Warehouse) capable of performing real-time processing and batch processing at the same time can be built at low cost, and real-time or batch processing can be performed using a single query.

FIG. 1 is a block diagram schematically showing a data processing system according to the present embodiment.
FIG. 2 is a block diagram schematically showing a data processing apparatus according to the present embodiment.
FIG. 3 is a flowchart for explaining a method of processing data in batch or in real time according to the present embodiment.
FIG. 4 is a flowchart for explaining a method of processing data in real time according to the present embodiment.

Hereinafter, the present embodiment will be described in detail with reference to the accompanying drawings.

In the present embodiment, data is described as being stored on the basis of Hadoop and HDFS (Hadoop Distributed File System), but the storage structure is not limited thereto. Various systems for processing big data exist, such as GFS (Google File System) and MapReduce, and the technical idea of the present invention is not limited to any specific big data processing system.

FIG. 1 is a block diagram schematically showing a data processing system according to the present embodiment.

The data processing system according to the present embodiment includes a data collection device 110, a data processing apparatus 120, and a manager device 130.

The data collection device 110 is connected to an external device or a network and collects various data. The data collection device 110 according to the present embodiment collects various data for big data processing. For example, the data collection device 110 collects in-house data 112 gathered through an in-house network, external social network data 114 gathered through a social network server, various log information 116, and the like. The data collection device 110 may build a separate database for each type of data.

The data collection device 110 collects and integrates the various data, and transmits the integrated big data to the data processing apparatus 120 at a preset period or at preset times.

The data processing apparatus 120 receives and stores the big data from the data collection device 110, performs real-time processing or batch processing, and provides the resulting processing result information to the manager device 130.

The data processing apparatus 120 according to the present embodiment stores the big data acquired from the data collection device 110 on the disk 210 provided in it.

The data processing apparatus 120 analyzes the big data stored on the disk 210 and, if it is data for real-time processing, loads the real-time data newly stored on the disk 210 into the memory 220. The data processing apparatus 120 then processes the data loaded in the memory 220 using the real-time processing engine 230 and transmits the resulting processing result information to the manager device 130.

The data processing apparatus 120 analyzes the big data stored on the disk 210 and, if it is not data for real-time processing, generates a plurality of mart tables from the raw data of the big data. Here, a mart table means summary data including the data attributes, categories, classification information, and the like of the big data.
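One way to picture a mart table is as a per-category summary of raw records. The following is a minimal sketch under assumptions: the record fields `category` and `value`, and the summary columns `count` and `total`, are hypothetical and are not defined anywhere in the specification.

```python
from collections import defaultdict


def build_mart_table(raw_records):
    """Summarize raw records into one row per category (hypothetical schema)."""
    summary = defaultdict(lambda: {"count": 0, "total": 0})
    for record in raw_records:
        row = summary[record["category"]]
        row["count"] += 1                 # number of raw records in the category
        row["total"] += record["value"]   # running sum of the assumed value field
    return dict(summary)


# Hypothetical raw data standing in for big data stored on the disk 210.
raw = [
    {"category": "log", "value": 3},
    {"category": "social", "value": 5},
    {"category": "log", "value": 4},
]
mart = build_mart_table(raw)
```

The resulting summary is typically far smaller than the raw data, which is what makes the later size comparison against a threshold meaningful.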

When a generated mart table is equal to or larger than a predetermined threshold size, the data processing apparatus 120 keeps that mart table, referred to as a general mart table, stored on the disk 210.

The data processing apparatus 120 analyzes the query received from the manager device 130 and, if it is a query for batch processing, batch-processes the mart tables stored on the disk 210 to generate processing result information. The data processing apparatus 120 then transmits the batch processing result information to the manager device 130.

When a generated mart table is smaller than the predetermined threshold size, the data processing apparatus 120 loads that mart table, referred to as a small mart table, into the memory 220.
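The size-based split between general and small mart tables could be sketched as follows. This is an illustration only: the `MartTable` type, the `classify` function, and the 64 MiB default are assumptions, since the specification speaks only of a "predetermined threshold size".

```python
from dataclasses import dataclass

# Hypothetical value; the specification does not state the threshold.
THRESHOLD_BYTES = 64 * 1024 * 1024


@dataclass
class MartTable:
    name: str
    size_bytes: int


def classify(tables, threshold=THRESHOLD_BYTES):
    """Return (general, small): tables kept on disk and tables loaded into memory."""
    general = [t for t in tables if t.size_bytes >= threshold]  # stay on the disk 210
    small = [t for t in tables if t.size_bytes < threshold]     # go to the memory 220
    return general, small


tables = [MartTable("daily_sales", 10), MartTable("full_history", 100)]
general, small = classify(tables, threshold=100)
```

A table exactly at the threshold is treated as a general mart table, matching the "equal to or larger than" wording of the text.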

The data processing apparatus 120 analyzes the query received from the manager device 130 and, if it is a query for real-time processing, processes the mart tables loaded in the memory 220 in real time. The data processing apparatus 120 then transmits the real-time processing result information to the manager device 130.

The manager device 130 receives and monitors processing result information from the data processing apparatus 120. The manager device 130 according to the present embodiment transmits a query to the data processing apparatus 120 and receives the processing result information corresponding to the query. Here, the query includes query information about the operation of the data processing apparatus 120 and may be query information for real-time processing or batch processing. The manager device 130 transmits, to the data processing apparatus 120, a query for real-time processing or batch processing set on the basis of the administrator's operation or input.

The manager device 130 according to the present embodiment includes a real-time monitoring unit 132 and a recording result processing unit 134. When a query for real-time processing has been transmitted to the data processing apparatus 120, the real-time monitoring unit 132 receives and monitors the real-time processing result information. When a query for batch processing has been transmitted to the data processing apparatus 120, the recording result processing unit 134 receives and records the batch processing result information.

The manager device 130 may transmit at least one of a real-time processing query and a batch processing query to the data processing apparatus 120, but is not limited thereto; it may also transmit a single query covering both real-time processing and batch processing to the data processing apparatus 120 and receive processing result information for each.

FIG. 2 is a block diagram schematically showing a data processing apparatus according to the present embodiment.

The data processing apparatus 120 according to the present embodiment includes a data processing unit 200, a control unit 250, and a query acquisition unit 260. The data processing unit 200 includes a disk 210, a memory 220, a real-time processing engine 230, and a batch processing engine 240.

The data processing unit 200 performs real-time or batch processing of big data under the control of the control unit 250. The data processing unit 200 may perform data processing based on the Hadoop Distributed File System (HDFS), which divides the large volume of data collected for big data processing across multiple servers for storage, but is not necessarily limited thereto.

The disk 210 is a storage module in which data is stored. The disk 210 according to the present embodiment receives and stores big data from the data collection device 110. The disk 210 is implemented as nonvolatile storage in which stored data is not lost even when the power supply is interrupted. For example, the disk 210 may be a flash memory storage device, a hard disk drive (HDD), or a solid state drive (SSD). However, these are examples and the present invention is not limited thereto.

Meanwhile, the disk 210 may be a Hadoop cluster that is allocated big data in the Hadoop Distributed File System and stores the raw data, and may include a name node, data nodes, and the like.

Hereinafter, the operation of the name node, the data nodes, and the like is described for the case where the disk 210 is implemented as a Hadoop cluster based on the Hadoop Distributed File System.

The Hadoop Distributed File System is a technology that divides the large volume of data collected for big data processing across multiple servers for storage. It consists of a name node (NameNode) and data nodes (DataNode). The name node stores the meta information of the actual files stored on the data nodes; it is not where the actual data is stored. The name node comprises a master name node and a secondary name node; when the master name node fails, the secondary name node is used in its place or is used to recover the master name node.

For example, when the data nodes are data node 1, data node 2, data node 3, data node 4, and data node 5, each data node is a network-connected server or storage device that provides space for storing the actual data. The name node holds information on the files stored on the data nodes and on which data node each file is actually stored. When an application or a user wants to access a file, the name node is consulted to find the data node storing the file, and that data node is then accessed.
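The division of labor between the name node and the data nodes can be illustrated with a toy model. This is a deliberately simplified sketch, not the HDFS API: the classes `NameNode` and `DataNode` and the function `read_file` are hypothetical, and block splitting and replication, which real HDFS performs, are left out.

```python
class NameNode:
    """Holds only metadata: which data nodes store each file."""

    def __init__(self):
        self._locations = {}  # file path -> list of data node names

    def register(self, path, data_node_names):
        self._locations[path] = list(data_node_names)

    def locate(self, path):
        return self._locations[path]


class DataNode:
    """Holds the actual file contents."""

    def __init__(self, name):
        self.name = name
        self.files = {}  # file path -> bytes


def read_file(name_node, data_nodes, path):
    """Ask the name node where the file lives, then read it from a data node."""
    for node_name in name_node.locate(path):
        node = data_nodes[node_name]
        if path in node.files:
            return node.files[path]
    raise FileNotFoundError(path)


# Five data nodes, as in the example in the text.
data_nodes = {f"datanode{i}": DataNode(f"datanode{i}") for i in range(1, 6)}
name_node = NameNode()
data_nodes["datanode2"].files["/logs/a.txt"] = b"hello"
name_node.register("/logs/a.txt", ["datanode2"])
content = read_file(name_node, data_nodes, "/logs/a.txt")
```

The name node never touches file contents here; it only answers the question of where to look, which mirrors the role described above.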

The memory 220 is a memory device for temporarily storing data. The memory 220 may be a volatile memory such as DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory), but is not necessarily limited thereto, and may also be a nonvolatile memory such as EEPROM (Electrically Erasable Programmable Read-Only Memory), PRAM (Phase-change Memory), MRAM (Magnetic Random Access Memory), or flash memory. Data stored in a volatile memory is lost when the power supply is interrupted, whereas data stored in a nonvolatile memory is retained even when the power supply is interrupted.

The memory 220 according to the present embodiment extracts and loads the big data or mart tables stored on the disk 210 under the control of the control unit 250. For example, when big data for real-time processing is stored on the disk 210, the memory 220 extracts and stores that big data under the control of the control unit 250. Likewise, when a small mart table smaller than the predetermined threshold size is stored on the disk 210, the memory 220 extracts and stores the small mart table under the control of the control unit 250.

For example, the memory 220 may check at a preset period whether big data for real-time processing has been stored on the disk 210, and extract and store the big data each time new big data is stored. Alternatively, the memory 220 may extract and store the big data or small mart tables once the big data for real-time processing, or the small mart tables below the predetermined threshold size, stored on the disk 210 reach a preset data amount.
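Both loading triggers described above, loading whenever new data appears and loading once a preset amount has accumulated, can be sketched with one small class. The `MemoryLoader` class, its `min_bytes` parameter, and the use of a Python list as a stand-in for the disk 210 are all assumptions made for illustration.

```python
class MemoryLoader:
    """Copies newly stored items from a disk-like list into memory."""

    def __init__(self, min_bytes=0):
        self.min_bytes = min_bytes  # hypothetical "preset data amount"; 0 = load on every new item
        self.memory = []            # stands in for the memory 220
        self._loaded = 0            # how many disk items have already been loaded

    def poll(self, disk):
        """Run once per preset period; returns the number of items loaded."""
        pending = disk[self._loaded:]
        if pending and sum(len(item) for item in pending) >= self.min_bytes:
            self.memory.extend(pending)
            self._loaded = len(disk)
            return len(pending)
        return 0


loader = MemoryLoader(min_bytes=5)
disk = [b"ab"]                 # 2 bytes pending: below the preset amount
first = loader.poll(disk)
disk.append(b"cde")            # 5 bytes pending: the preset amount is reached
second = loader.poll(disk)
```

With `min_bytes=0` the loader behaves like the first variant in the text, copying data on the first poll after it is stored.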

The real-time processing engine 230 processes the big data or small mart tables previously stored in the memory 220 under the control of the control unit 250 and generates processing result information. Here, the batch processing engine 240 may be one of Tajo, Impala, Hive, MapReduce, HBase, Pig, and the like.

The real-time processing engine 230 transmits the processing result information generated by processing the big data or small mart tables to the manager device 130.

The batch processing engine 240 processes the mart tables previously stored on the disk 210 or in the memory 220 under the control of the control unit 250 and generates processing result information. Here, the batch processing engine 240 may be one of Tajo, Impala, Hive, MapReduce, HBase, Pig, and the like.

The batch processing engine 240 transmits the processing result information generated by processing the mart tables to the manager device 130.

The control unit 250 manages and controls the overall operation of the data processing apparatus 120. The control unit 250 according to the present embodiment controls the big data received from the data collection device 110 to be stored on the disk 210. Here, the control unit 250 determines the storage location (node) of the big data on the disk 210 and causes the big data to be stored at the determined location.

The control unit 250 analyzes the big data stored on the disk 210 and, if it is big data requiring real-time processing, causes it to be loaded into the memory 220.

The control unit 250 analyzes the big data stored on the disk 210 and, if it is not big data requiring real-time processing, controls a plurality of mart tables to be generated from the raw data of the big data. Here, a mart table means summary data including the data attributes, categories, classification information, and the like of the big data.

When a mart table has been generated, the control unit 250 checks its size and, if it is equal to or larger than the predetermined threshold size, causes that mart table, a general mart table, to be kept on the disk 210.

Meanwhile, if the size of the mart table is smaller than the predetermined threshold size, the control unit 250 controls the memory 220 to extract and load that mart table, a small mart table.

The control unit 250 analyzes the query obtained from the query acquisition unit 260 and determines whether it is a query for real-time processing or a query for batch processing. The control unit 250 controls the data processing corresponding to the real-time or batch query to be performed.

For a real-time processing query, the control unit 250 controls the real-time processing engine 230 to process the small mart tables loaded in the memory 220. For a batch processing query, the control unit 250 controls the batch processing engine 240 to process the general mart tables stored on the disk 210.

The query acquisition unit 260 acquires a query from the manager device 130. The query acquisition unit 260 may acquire a query for real-time processing or a query for batch processing, and may also acquire a single query covering both real-time processing and batch processing.
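How a single query covering both modes might be routed can be sketched with a flag type. The `Mode` enum, the `dispatch` function, and the use of plain callables as stand-ins for the real-time processing engine 230 and the batch processing engine 240 are assumptions for illustration; the specification does not define a query format.

```python
from enum import Flag, auto


class Mode(Flag):
    REAL_TIME = auto()
    BATCH = auto()


def dispatch(query_mode, memory_tables, disk_tables, real_time_engine, batch_engine):
    """Run each requested mode on its own storage tier and collect the results."""
    results = {}
    if Mode.REAL_TIME in query_mode:
        # Small mart tables loaded in memory go to the real-time engine.
        results["real_time"] = real_time_engine(memory_tables)
    if Mode.BATCH in query_mode:
        # General mart tables kept on disk go to the batch engine.
        results["batch"] = batch_engine(disk_tables)
    return results


# One query requesting both modes; sum and len are placeholder "engines".
both = dispatch(Mode.REAL_TIME | Mode.BATCH, [1, 2], [10, 20, 30], sum, len)
batch_only = dispatch(Mode.BATCH, [1], [1, 2], sum, len)
```

Using a combinable flag rather than two separate query types keeps the single-query case and the one-mode cases on the same code path.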

The data processing apparatus 120 according to an embodiment of the present invention may include a user terminal such as a personal computer (PC), a notebook computer, a tablet, a personal digital assistant (PDA), a game console, a portable multimedia player (PMP), a PlayStation Portable (PSP), a wireless communication terminal, a smartphone, a TV, or a media player, and the user terminal may be a part of the data processing apparatus 120. The data processing apparatus 120 may be a server terminal such as an application server or a service server. The data processing apparatus 120 may refer to any of various devices each equipped with (i) a communication device such as a communication modem for communicating with various devices or with wired/wireless communication networks, (ii) a memory for storing data for executing programs, and (iii) a microprocessor for executing programs to perform computation and control. According to at least one embodiment, the memory may be a computer-readable recording/storage medium such as random access memory (RAM), read-only memory (ROM), flash memory, an optical disk, a magnetic disk, or a solid state disk (SSD). According to at least one embodiment, the microprocessor may be programmed to selectively perform one or more of the operations and functions described in the specification. According to at least one embodiment, the microprocessor may be implemented, in whole or in part, as hardware such as an application-specific integrated circuit (ASIC) of a particular configuration.

FIG. 3 is a flowchart for explaining a method of processing data in batch or in real time according to the present embodiment.

The data processing apparatus 120 acquires the big data collected by the data collection device 110 (S310) and stores the acquired big data on the disk 210 provided in it (S320).

The data processing apparatus 120 analyzes the big data stored on the disk 210 and determines whether it is data for real-time processing (S330). If it is data for real-time processing, the operation of the data processing apparatus 120 is described with reference to FIG. 4 (S410 to S430).

If it is determined in step S330 that the data is not for real-time processing, the data processing apparatus 120 generates a plurality of mart tables from the raw data (S340). Here, a mart table means summary data including the data attributes, categories, classification information, and the like of the big data.

The data processing apparatus 120 checks whether the size of each of the plurality of mart tables is equal to or larger than the predetermined threshold size (S350) and, if so, keeps that mart table on the disk 210 where it is stored (S352).

If the check in step S350 shows that the size of a mart table is smaller than the predetermined threshold size, the data processing apparatus 120 loads that mart table into the memory 220 (S360).

데이터 처리장치(120)는 관리자 장치(130)로부터 쿼리를 수신하고(S370), 수신된 쿼리를 분석하여 실시간 처리 또는 일괄 처리에 대한 쿼리인지 여부를 확인한다(S372).The data processing apparatus 120 receives a query from the manager device 130 (S370), analyzes the received query, and determines whether it is a query for real-time processing or batch processing (S372).

데이터 처리장치(120)는 수신된 쿼리를 분석하여 일괄 처리에 대한 쿼리인 경우, 일괄 처리엔진(240)을 기반으로 디스크(210)에 저장된 마트 테이블을 처리한다(S380). 단계 S380의 처리결과에 따라 데이터 처리장치(120)는 쿼리의 응답신호로 처리 결과정보를 관리자 장치(130)로 전송한다(S392). The data processing apparatus 120 analyzes the received query and processes the mart table stored in the disk 210 based on the batch processing engine 240 when the query is for a batch process (S380). In accordance with the processing result of step S380, the data processing apparatus 120 transmits the processing result information to the manager device 130 as a response signal of the query (S392).

If the analysis shows that the received query is a query for real-time processing, the data processing apparatus 120 processes the mart table loaded in the memory 220 using the real-time processing engine 230 (S390). The data processing apparatus 120 then transmits the processing result information of step S390 to the manager device 130 as a response to the query (S392).
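The query handling of steps S370 to S392 can be sketched as a simple dispatch. How the apparatus actually distinguishes the two query types is not stated here, so the explicit `mode` field is an assumption, and both engine functions are hypothetical stand-ins that only count rows.

```python
from dataclasses import dataclass


@dataclass
class Query:
    table: str
    mode: str  # hypothetical field: "realtime" or "batch"


def batch_engine(rows):
    """Stand-in for the batch processing engine 240."""
    return {"engine": "batch", "row_count": len(rows)}


def realtime_engine(rows):
    """Stand-in for the real-time processing engine 230."""
    return {"engine": "realtime", "row_count": len(rows)}


def handle_query(query, disk, memory):
    """Route a query to the matching engine and return the result (S392)."""
    if query.mode == "batch":
        return batch_engine(disk[query.table])       # S380: table on disk
    if query.mode == "realtime":
        return realtime_engine(memory[query.table])  # S390: table in memory
    raise ValueError(f"unknown query mode: {query.mode}")


# Hypothetical contents of disk 210 and memory 220.
disk = {"sales_large": [1, 2, 3, 4], "sales_small": [1, 2]}
memory = {"sales_small": [1, 2]}

batch_result = handle_query(Query("sales_large", "batch"), disk, memory)
realtime_result = handle_query(Query("sales_small", "realtime"), disk, memory)
```

The same entry point serves both paths, so the manager device sends one kind of request and receives one kind of response regardless of which engine ran.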

Although FIG. 3 describes steps S310 to S392 as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. A person of ordinary skill in the art to which the embodiment pertains could, without departing from its essential characteristics, apply various modifications and variations, such as changing the order described in FIG. 3 or executing one or more of steps S310 to S392 in parallel. FIG. 3 is therefore not limited to a time-series order.

FIG. 4 is a flowchart illustrating a method of processing data in real time according to the present embodiment.

The following describes a process in which the data processing apparatus 120 processes data in real time without receiving a query.

The steps in which the data processing apparatus 120 acquires the big data, stores the acquired big data on the disk 210, and analyzes the stored big data to determine whether it is data for real-time processing (S310 to S330) have been described with reference to FIG. 3. Their description is therefore omitted here, and only the subsequent steps are described.

If it is determined in step S330 that the data is data for real-time processing, the data processing apparatus 120 loads the newly stored real-time processing data into the memory 220 (S410). The data processing apparatus 120 preferably loads the data into the memory 220 as soon as it is stored on the disk 210, but is not limited thereto; it may instead check the disk 210 at every preset real-time data check period and load the real-time processing data into the memory 220.

The data processing apparatus 120 processes the data loaded in the memory 220 using the real-time processing engine 230 (S420). The data processing apparatus 120 then transmits the processing result information of step S420 to the manager device 130 (S430).
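One check cycle of steps S410 to S430 might look like the sketch below, corresponding to the variant in which the disk is checked at a preset period. The record layout (`realtime` flag, `value` field), the doubling used as a stand-in for real-time processing, and the `send` callback representing transmission to the manager device are all hypothetical.

```python
def load_new_realtime_data(disk, memory, loaded_ids):
    """S410: copy real-time records not yet loaded from disk into memory."""
    newly_loaded = []
    for record_id, record in disk.items():
        if record["realtime"] and record_id not in loaded_ids:
            memory[record_id] = record
            loaded_ids.add(record_id)
            newly_loaded.append(record_id)
    return newly_loaded


def process_in_memory(memory, record_ids):
    """S420: stand-in real-time processing (doubles each value)."""
    return {rid: memory[rid]["value"] * 2 for rid in record_ids}


def run_cycle(disk, memory, loaded_ids, send):
    """Run one check cycle; send results only when new data was loaded."""
    new_ids = load_new_realtime_data(disk, memory, loaded_ids)
    if new_ids:
        send(process_in_memory(memory, new_ids))  # S430
    return new_ids


# Hypothetical state: record 2 is not marked for real-time processing.
disk = {1: {"realtime": True, "value": 10}, 2: {"realtime": False, "value": 7}}
memory, loaded_ids, sent = {}, set(), []

first = run_cycle(disk, memory, loaded_ids, sent.append)
second = run_cycle(disk, memory, loaded_ids, sent.append)  # nothing new
disk[3] = {"realtime": True, "value": 4}
third = run_cycle(disk, memory, loaded_ids, sent.append)
```

Calling `run_cycle` from a timer gives the periodic variant, while calling it directly from the code that writes to disk approximates the preferred "as soon as stored" behavior. In both cases the data is written to disk before it reaches memory, so a memory fault does not lose it.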

Although FIG. 4 describes steps S410 to S430 as being executed sequentially, this merely illustrates the technical idea of an embodiment of the present invention. A person of ordinary skill in the art to which the embodiment pertains could, without departing from its essential characteristics, apply various modifications and variations, such as changing the order described in FIG. 4 or executing one or more of steps S410 to S430 in parallel. FIG. 4 is therefore not limited to a time-series order.

The foregoing description merely illustrates the technical idea of the present embodiment, and a person of ordinary skill in the art to which the embodiment pertains could make various modifications and variations without departing from its essential characteristics. The embodiments are therefore intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be construed according to the following claims, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

As described above, the present embodiment is applicable to the field of big data processing. It allows real-time processing and batch processing of big data to be performed simultaneously or sequentially, and it allows a mart table generated for batch processing to be processed in real time immediately when the table is small. It is therefore a useful invention.

110: data collection apparatus 120: data processing apparatus
130: manager device 132: real-time monitoring unit
134: recording result processing unit
200: data processing unit 210: disk
220: memory 230: real-time processing engine
240: batch processing engine 250: control unit
260: query acquisition unit

Claims

A data processing method in which a data processing apparatus processes collected big data in real time or in batch, the method comprising:
a disk storing step of storing the big data on a disk;
a table generation step of analyzing the big data and generating a plurality of mart tables for the big data based on the analysis result;
a table classification step of comparing each of the plurality of mart tables with a preset threshold size and loading a mart table into a memory based on the comparison;
a data processing step of, based on a query received from a manager device, batch-processing the mart tables stored on the disk using a batch processing engine or processing in real time the mart tables loaded in the memory using a real-time processing engine; and
a result providing step of providing processing result information produced by the real-time processing engine or the batch processing engine.

The method according to claim 1,
wherein the table generation step further comprises,
when the analysis result indicates that the big data is preset for real-time processing:
loading the big data newly stored on the disk into the memory;
processing the newly stored big data loaded in the memory using the real-time processing engine; and
providing the processing result information based on the real-time processing engine to the manager device.

The method according to claim 1,
wherein the table classification step comprises
keeping, on the disk, a general mart table having a size equal to or greater than the preset threshold size among the plurality of mart tables.

The method according to claim 3,
wherein the data processing step comprises,
when the query for batch processing is received, processing the general mart table using the batch processing engine to generate the processing result information.

The method according to claim 1,
wherein the table classification step comprises
loading, into the memory, a small mart table having a size less than the preset threshold size among the plurality of mart tables.

The method according to claim 5,
wherein the data processing step comprises,
when the query for real-time processing is received, processing the small mart table using the real-time processing engine to generate the processing result information.