KR101543377B1

KR101543377B1 - Apparatus and method for analyzing data using MapReduce based on NoSQL

Info

Publication number: KR101543377B1
Application number: KR1020130091053A
Authority: KR
Inventors: 한명묵; 홍성삼; 공종환; 최보민; 최해술; 지상훈
Original assignee: 가천대학교 산학협력단; 워치아이시스템주식회사
Priority date: 2013-07-31
Filing date: 2013-07-31
Publication date: 2015-08-11
Also published as: KR20150016420A

Abstract

A data analysis apparatus using MapReduce based on NoSQL (Not Only SQL (Structured Query Language)) may include: a NoSQL-based database configured to store input data having one or more fields and a value corresponding to each of the one or more fields; a map module configured to extract the contents of a predetermined field from the input data to generate parsed data, and to generate first output data by filtering the parsed data based on a predetermined rule; and a reduce module configured to receive the first output data from the map module and to generate second output data by merging the first output data according to the contents of the predetermined field.

Description

APPARATUS AND METHOD FOR ANALYZING DATA USING MAPREDUCE BASED ON NOSQL

Embodiments relate to an apparatus and method for analyzing data using MapReduce based on NoSQL (Not Only SQL (Structured Query Language)).

Recently, as network-based service industries have expanded with the spread of computers and the Internet, the proliferation of smart devices including smartphones, and the growth of credit card use and online commerce, many companies have taken an interest in information security systems and are expanding their deployment. As the number of information security systems grows, companies struggle with operational management and with implementing consistent information security policies. Integrated security control systems have played an effective role for organizations that want early analysis of increasingly intelligent security incidents.

To operate enterprise systems efficiently, a security control system monitors system state, incoming traffic, data, and the like to quickly identify the source of a failure, and applies various forms of analysis rules to diverse data so that their harmfulness can be judged, security threats can be responded to, and an alert system can be built. Beyond companies, security control systems are also being built for the overall operation and decision-making of national computer networks and IT convergence systems in various organizational forms, so that timely action, incident prevention, and post-incident tracking can be handled in real time through comprehensive analysis and monitoring.

As ICT (Information & Communication Technology) advances and converges with industries in many fields, demand for security equipment is steadily increasing. The amount of data generated by this variety of equipment is also increasing, growing to the scale of big data: data so large that existing management systems cannot handle it.

Big data is also called "very large data", "extreme data", or "total data". Initially the term was defined by volume and meant massive data ranging from tens of terabytes up to petabytes and exabytes. Today, however, as IT has advanced and the kinds of data generated by IT devices have diversified, it also refers to structured or unstructured data sets so vast that they are difficult to collect, store, analyze, and manage by conventional means.

The core issues of big data technology are typically platform technology for storage, analysis, and systematization that compensates for the performance limits of existing databases, and data refinement technology that collects only the necessary information so that the vast amounts of collected data can be used efficiently at the right time and place. Integrated security control systems likewise need technology that adopts big data techniques to process big-data-scale security logs arriving in real time more efficiently, thereby lowering processing time cost and error rate while ensuring an accurate detection rate.

However, existing systems have many difficulties in processing and managing such growing data. The most typical is data processing performance: conventional systems that use a relational database (e.g., an RDBMS) are hard to scale out on their own when processing massive data, which leads to storage shortages and resulting slowdowns that can hinder smooth service. Proposed remedies include purchasing and installing a separate data storage and processing solution, or storing data internally in binary form to reduce the size of the increased data. However, the data used in most IT convergence systems today is not important enough to justify an expensive data management solution, so that approach is not cost-effective. In addition, converting to binary form requires a separate file conversion step, which makes it unsuitable for systems that must analyze and process data in real time.

Accordingly, a new paradigm of technology is needed that can store large amounts of data at low cost and refine the collected data more quickly and efficiently to build an optimized data set. That is, an analysis technology suited to the big data environment is required: one that analyzes and monitors the state of numerous pieces of system equipment in real time, responds urgently to failures and emergencies, and makes it possible to build a comprehensive response and alert system capable of incident prevention, harmfulness judgment, and real-time response for large volumes of data such as logs, data, events, and traffic arriving through multiple paths.

Korean Laid-Open Patent Publication No. 10-2012-0078908

According to an aspect of the present invention, an integrated security control technology capable of coping with the big data era can be provided by means of a NoSQL (Not Only SQL (Structured Query Language))-based database that can process large volumes of security logs more efficiently, and a preprocessing algorithm using a MapReduce design that removes unnecessary information from large volumes of data and extracts only the necessary information so that it can be used at the right time and place.

A data analysis apparatus according to an embodiment may include: a NoSQL (Not Only SQL (Structured Query Language))-based database configured to store input data having one or more fields and contents corresponding to each of the one or more fields; a map module configured to extract the contents of a predetermined field from the input data to generate parsed data, and to generate first output data by filtering the parsed data based on a predetermined rule; and a reduce module configured to receive the first output data from the map module and to generate second output data by merging the first output data according to the contents of the predetermined field.

A data analysis method according to an embodiment may include: storing, in a NoSQL-based database, input data having one or more fields and contents corresponding to each of the one or more fields; extracting the contents of a predetermined field from the input data to generate parsed data; filtering the parsed data based on the contents of the predetermined field to generate first output data; merging the first output data based on the contents of the predetermined field to generate second output data; and outputting the second output data.

A computer-readable storage medium according to an embodiment may store instructions that, when executed by a computer, cause the computer to perform the data analysis method.

According to the data analysis apparatus and method of an aspect of the present invention, a big data storage engine can be built that is capable of storing in a database the raw data itself needed by a large-volume log collection and analysis system. In addition, by extracting and selecting only essential features through preprocessing of big data in an IT convergence system, computation and processing efficiency can be improved, and the technique can be applied to big data arriving in real time. This technology can provide a basis for real-time search, indexing, and/or correlation analysis of data, events, or traffic in an IT convergence system. Furthermore, refined data composed of optimized features makes it possible to detect specific attacks in the data, and improves accuracy and reduces the error rate in data search and indexing.

FIG. 1 is a schematic block diagram of a data analysis apparatus according to an embodiment.
FIG. 2 is a flowchart of a data analysis method according to an embodiment.
FIG. 3 is a schematic diagram illustrating the data collection and refinement process of a data analysis method according to an embodiment.
FIG. 4 is a graph comparing the data collection speed of a data analysis method according to an embodiment with that of the prior art.
FIG. 5 is a graph comparing the data search speed of a data analysis method according to an embodiment with that of the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a schematic block diagram of a data analysis apparatus according to an embodiment.

Referring to FIG. 1, the data analysis apparatus may include a NoSQL (Not Only SQL (Structured Query Language))-based database 10, and a map module 20 and a reduce module 30 for performing MapReduce-based data analysis. In this specification, terms such as "database", "module", and "apparatus" are intended to refer to a combination of hardware and the software run by that hardware. For example, the hardware may be a data processing device including a CPU or another processor. The software run by the hardware may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

The database 10 receives input data for analysis. This specification describes the invention with reference to an embodiment in which firewall logs are analyzed using the data analysis apparatus. In this embodiment, the database 10 may receive log data as input data from a firewall, a web firewall, a VPN (Virtual Private Network), an IDS (Intrusion Detection System), an IPS (Intrusion Prevention System), an NMS (Network Management System), an SMS (System Management Solution), or other different kinds of security equipment. However, this is only an example; in other embodiments the data analysis apparatus may be used to analyze and process other kinds of big data besides firewall logs.

In embodiments, the database 10 is a NoSQL-based database and may be, for example, HBase, Cassandra, or MongoDB. Processing data on a conventional relational database (e.g., MySQL or another RDBMS) does not reflect the current nature of security log data, which changes quickly and keeps growing, and is limited for analysis. Because a relational database requires complex relationships between the log data tables to be defined and then analyzed, it is difficult to reflect changing attack patterns or new attack types. That is, whenever an attack pattern changes or a new attack appears, the relationships must be redefined in order to modify or add it.

Therefore, embodiments of the present invention use the NoSQL-based database 10 rather than a relational database. As a result, the map module 20 and the reduce module 30 can access data of various forms in the database 10 using key values, without a fixed data schema. A new attack type can therefore be reflected quickly by modifying the rules for detecting attacks.
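As a rough illustration of key-based access without a fixed schema, the sketch below models stored log entries as plain Python dictionaries. The field names (`date_time`, `src_ip`, `signature`, and so on) and the `get_field` helper are hypothetical; they only stand in for whatever a real NoSQL store would hold.

```python
# Two log documents with different field sets, as a schema-less store would allow.
# All field names and values here are illustrative assumptions.
firewall_log = {
    "date_time": "20130731 101500",
    "src_ip": "10.0.0.5",
    "dst_ip": "10.0.0.9",
    "protocol": 1,
}
ids_log = {
    "date_time": "20130731 101501",
    "src_ip": "10.0.0.7",
    "signature": "port-scan",
}


def get_field(document, key, default=None):
    """Read a field by key; a missing field is not an error."""
    return document.get(key, default)


# The same access code works on both documents even though their fields differ.
protocols = [get_field(doc, "protocol") for doc in (firewall_log, ids_log)]
print(protocols)
```

Because each document is read by key, a document that lacks a field simply yields the default value instead of requiring a schema change.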

The map module 20 and the reduce module 30 analyze data by performing MapReduce-based processing on some or all of the input data stored in the database 10. When the data analysis apparatus according to the embodiments is used to analyze security log data, the map module 20 may parse the log data stored in the database 10 and detect attacks in the parsed data according to predetermined rules. The reduce module 30 may receive the data output from the map module 20 and build output data by outputting sets of that data merged according to a criterion.

FIG. 2 is a flowchart of a data analysis method according to an embodiment. The data analysis method according to an embodiment is described below with reference to FIGS. 1 and 2.

First, input data may be collected in the NoSQL-based database 10 (S1). When the data analysis method according to the embodiments is applied to firewall log analysis, the NoSQL-based database 10 may receive log data as input data from a firewall, a web firewall, a VPN, an IDS/IPS, an NMS/SMS, or other different kinds of security equipment. The input data collected in the database 10 may include one or more fields and contents corresponding to each field. The fields and their contents may be implemented in various forms depending on what the data analysis is applied to. For example, when the input data is log data, it may be structured as shown in Table 1 below.

| Field | Content | Format and values |
|---|---|---|
| Date Time | Date and time of log occurrence | YYYYMMDD HHMMSS format (Y: year, M: month, D: day, H: hour, M: minute, S: second) |
| Source IP | IP address of the log origin | 2 * 2-byte unsigned string |
| Destination IP | IP address of the log destination | 2 * 2-byte unsigned string |
| Destination Port | Port of the log destination | 2 * 2-byte unsigned string |
| Protocol | Protocol | 2 * 2-byte unsigned string; IP: 0 / ICMP: 1 / CGP: 3 / TCP: 6 / EGP: 8 / PUP: 12 / UDP: 17 |
| Action Code | Details of the log message | String of up to 256 bytes; 1 (allow): Allow/Accept/Close; 2 (deny): Drop/Reject |

However, Table 1 is merely an example, and the form of the data collected in the database 10 according to the embodiments is not limited to the above.
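The record layout of Table 1 could be modeled roughly as follows. This is only a sketch: the `LogRecord` class, the `make_record` helper, the snake_case field names, and the choice to hold the port, protocol, and action code as integers are assumptions made for illustration, while the date format and the code values come from Table 1.

```python
from dataclasses import dataclass
from datetime import datetime

# Code values as listed in Table 1.
PROTOCOL_NAMES = {0: "IP", 1: "ICMP", 3: "CGP", 6: "TCP", 8: "EGP", 12: "PUP", 17: "UDP"}
ACTION_NAMES = {1: "allow", 2: "deny"}


@dataclass(frozen=True)
class LogRecord:
    """One log entry with the six fields of Table 1 (hypothetical class)."""
    date_time: datetime
    source_ip: str
    destination_ip: str
    destination_port: int
    protocol: int
    action_code: int


def make_record(date_time, source_ip, destination_ip,
                destination_port, protocol, action_code):
    """Build a LogRecord, parsing the 'YYYYMMDD HHMMSS' date-time string."""
    return LogRecord(
        date_time=datetime.strptime(date_time, "%Y%m%d %H%M%S"),
        source_ip=source_ip,
        destination_ip=destination_ip,
        destination_port=int(destination_port),
        protocol=int(protocol),
        action_code=int(action_code),
    )


# Example values are made up for illustration.
record = make_record("20130731 134500", "192.0.2.10", "198.51.100.7", 80, 6, 2)
print(PROTOCOL_NAMES[record.protocol], ACTION_NAMES[record.action_code])
```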

Next, the input data collected in the database 10 may be parsed by the map module 20. The map module 20 may operate using a Map() function that takes a key-value pair as input and outputs the parsed data as new key-value pairs. For example, in one embodiment the pseudocode of the Map() function that drives the map module 20 is as shown in Table 2 below.

Pseudocode of the Map() function

    while (LogList != Null) {
      map:
        function () {
          this.attack.forEach(function (attack_type) {
            emit(attack, { date, count: 1 });
          });
        }
    }

However, the Map() pseudocode above is merely an example, and in the data analysis apparatus according to the embodiments the map module 20 may be configured to operate using a function written in a different form or in a different language.
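One possible Python rendering of the Map() pseudocode is sketched below. It assumes each log entry is a dictionary with a `date` string and an `attack` list of attack-type labels; that layout, the `map_log` name, and the sample values are hypothetical. As in the pseudocode, the attack type is emitted as the key and the date with a count of 1 as the value.

```python
def map_log(log):
    """Yield one (key, value) pair per attack type attached to a log entry."""
    for attack_type in log.get("attack", []):
        # Mirrors emit(attack, {date, count: 1}) from the pseudocode.
        yield attack_type, {"date": log["date"], "count": 1}


# Illustrative input; a log with an empty attack list emits nothing.
logs = [
    {"date": "20130731 1345", "attack": ["attack1"]},
    {"date": "20130731 1346", "attack": ["attack1", "attack3"]},
    {"date": "20130731 1347", "attack": []},
]

emitted = [pair for log in logs for pair in map_log(log)]
for key, value in emitted:
    print(key, value)
```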

The map module 20 may parse the input data using, as the key of the Map() function, a predetermined field to be used for attack detection among the one or more fields in the input data, and, as the value of the Map() function, the contents of that field. There may be one or more predetermined fields serving as the basis for parsing; they may be chosen appropriately by the user according to the attack type and input to the map module 20. In one embodiment, the data analysis apparatus may further include an input module (not shown) through which the user enters the predetermined fields, attack detection rules, and the like.

Next, the map module 20 may select, from the parsed data, the data determined to be attacks by applying predetermined attack detection rules to the parsed data. The attack detection rules that serve as the filtering criteria may be chosen appropriately by the user and input to the map module 20, and may be modified or newly created as attack types change. Some examples of attack detection rules are described in detail below with reference to FIGS. 4 and 5. Unlike conventional approaches, parsing and anomaly detection are handled together within the map module 20, so the number of analysis steps can be reduced, which can contribute to cost savings when implemented as a device.
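The filtering step can be pictured as applying a list of rule predicates to the parsed records. In the sketch below, the rules `is_icmp` and `is_denied`, the `filter_parsed` helper, and the record layout are all hypothetical; only the protocol and action code values are taken from Table 1.

```python
def is_icmp(record):
    """Hypothetical rule: the record uses ICMP (protocol code 1 in Table 1)."""
    return record["protocol"] == 1


def is_denied(record):
    """Hypothetical rule: the record was denied (action code 2 in Table 1)."""
    return record["action_code"] == 2


def filter_parsed(parsed, rules):
    """Keep the records matched by at least one rule."""
    return [record for record in parsed if any(rule(record) for rule in rules)]


parsed = [
    {"date": "20130731 1345", "protocol": 1, "action_code": 1},
    {"date": "20130731 1345", "protocol": 6, "action_code": 2},
    {"date": "20130731 1346", "protocol": 6, "action_code": 1},
]

rules = [is_icmp]
first_output = filter_parsed(parsed, rules)

# A new rule is added without touching the stored data or the filter itself.
rules.append(is_denied)
updated_output = filter_parsed(parsed, rules)
print(len(first_output), len(updated_output))
```

Keeping the rules as separate callables is one way to let a rule be modified or added independently of the parsing code.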

Among the data parsed by the map module 20, the data determined to be attacks may be output from the map module 20 as first output data and passed to the reduce module 30. The first output data may consist of new key-value pairs to be used by the reduce module 30. For example, the first output data may include the contents of a specific field of the input data used for attack detection (e.g., Date Time) and aggregate information (count) for the data detected as attacks.

The reduce module 30 may receive the first output data from the map module 20 and merge it by key to generate second output data. The reduce module 30 may operate using a Reduce() function that takes key-value pairs as input and uses them to merge data. For example, in one embodiment the pseudocode of the Reduce() function that drives the reduce module 30 is as shown in Table 3 below.

Pseudocode of the Reduce() function

    while (LogList != Null) {
      reduce:
        function (key, values) { // (date, count)
          total = 0;
          for (n = 0; n < values.length; n++) {
            total += values[n].count;
          }
          return { attack: total };
        }
    }

However, the Reduce() pseudocode above is merely an example, and in the data analysis apparatus according to the embodiments the reduce module 30 may be configured to operate using a function written in a different form or in a different language.
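The Reduce() pseudocode might be written in Python along the following lines. The `group_by_key` helper stands in for the grouping a MapReduce framework performs between the two phases; it, the `reduce_counts` name, and the sample pairs are assumptions for illustration.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values emitted under each key (stand-in for the shuffle step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Sum the count of every value emitted under one key."""
    total = 0
    for value in values:
        total += value["count"]
    # Mirrors return {attack: total} from the pseudocode.
    return {"attack": total}


# Illustrative first output data: (attack type, {date, count}) pairs.
first_output = [
    ("attack1", {"date": "20130731 1345", "count": 1}),
    ("attack1", {"date": "20130731 1346", "count": 1}),
    ("attack3", {"date": "20130731 1346", "count": 1}),
]

second_output = {
    key: reduce_counts(key, values)
    for key, values in group_by_key(first_output).items()
}
print(second_output)
```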

The reduce module 30 may merge data using, as the key of the Reduce() function, the contents of the specific field used for attack detection (e.g., Date Time) in the first output data from the map module 20, and, as the value of the Reduce() function, the aggregate information (count) of the data detected as attacks. The merged data includes a tally of the log data in which attacks occurred, and may be output from the reduce module 30 as second output data.

In one embodiment, the data analysis apparatus may further include a display module (not shown) for displaying the second output data to the user. Through the display module the user can check the second output data and perform various analyses of the firewall logs, for example by the number of attacks tallied at a specific time or by attack type.

FIG. 3 is a schematic diagram illustrating the data collection and refinement process of a data analysis method according to an embodiment.

In the example shown in FIG. 3, the input data has the fields Date, source IP (Src IP), destination IP (Dst IP), and protocol (e.g., ICMP), and includes a series of data groups that correspond to attacks (Attack1-3). The Map() function performs parsing using the Date field and its contents as the key and value, respectively. As a result of parsing, the input data is converted into parsed data consisting of pairs of the Date field contents (e.g., "2012706 134", "2012706 164", "2012706 165") and the corresponding aggregate information (e.g., count(1)). The Map() function generates, as first output data, the parsed data determined to be attacks and passes it to the Reduce() function.

The Reduce() function merges data using the Date field contents in the first output data (e.g., "2012706 134", "2012706 164", "2012706 165") and the corresponding aggregate information (e.g., count(1)) as the key and value, respectively. For example, the Reduce() function merges data with the same log date and time into one, summing their aggregate information to generate second output data containing the total. FIG. 3 shows, as an example, an output of five records with the log date and time "2012706 164" and two records with the log date and time "2012706 165".
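The flow of FIG. 3 can be traced end to end with a small sketch keyed on the date value. The record layout, the `is_attack` flag, and the function names are hypothetical; the counts for "2012706 164" and "2012706 165" follow the figure, while treating the "2012706 134" record as a non-attack is purely an assumption made to show the filter dropping something.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (date, 1) for every record flagged as an attack."""
    for record in records:
        if record["is_attack"]:
            yield record["date"], 1


def reduce_phase(pairs):
    """Sum the counts emitted under each date."""
    totals = defaultdict(int)
    for date, count in pairs:
        totals[date] += count
    return dict(totals)


records = (
    [{"date": "2012706 164", "is_attack": True}] * 5
    + [{"date": "2012706 165", "is_attack": True}] * 2
    # Assumed non-attack record, included only to illustrate filtering.
    + [{"date": "2012706 134", "is_attack": False}]
)

second_output = reduce_phase(map_phase(records))
print(second_output)
```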

FIG. 4 is a graph comparing the data collection speed of a data analysis method according to an embodiment with that of the prior art.

In the results shown in FIG. 4, the prior-art database was MySQL 5.6.10, the most widely used relational database, and the NoSQL-based database according to the embodiment was MongoDB 2.4.1. The experiment used 10,000,000 records extracted from 12,847,649 real corporate security log records, and the attack detection rules used in the experiment were as follows.

i. Attack 1: 5 or more ICMP messages within 5 minutes from the same source to the same destination

ii. Attack 2: 50 or more ICMP messages within 5 minutes from 50 or more sources to the same destination

iii. Attack 3: 10,000 or more communications within 1 minute from the same source to the same destination

Rules were set so that data matching one or more of Attack 1 to Attack 3 was judged to be an attack. Using these rules, SQL query performance was tested, and suspected-attack scenarios were detected and analyzed.
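As a sketch of how a rule such as Attack 1 could be evaluated, the function below counts ICMP messages per source and destination pair inside a sliding time window. The `detect_attack1` name, the record layout, the sample addresses, and the choice to treat messages exactly 5 minutes apart as inside the window are assumptions; the source does not specify how the rule was implemented.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ICMP = 1  # protocol code for ICMP in Table 1


def detect_attack1(records, threshold=5, window=timedelta(minutes=5)):
    """Return (source, destination) pairs with at least `threshold` ICMP
    messages inside any time span no longer than `window`."""
    times_by_pair = defaultdict(list)
    for record in records:
        if record["protocol"] == ICMP:
            times_by_pair[(record["src"], record["dst"])].append(record["time"])

    flagged = set()
    for pair, times in times_by_pair.items():
        times.sort()
        start = 0
        for end, current in enumerate(times):
            # Shrink the window from the left until it spans at most `window`.
            while current - times[start] > window:
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(pair)
                break
    return flagged


base = datetime(2013, 7, 31, 13, 0, 0)
records = (
    # Five messages one minute apart: all fall inside a 5-minute window.
    [{"src": "10.0.0.1", "dst": "10.0.0.2", "protocol": ICMP,
      "time": base + timedelta(minutes=i)} for i in range(5)]
    # Five messages two minutes apart: at most three fit in any window.
    + [{"src": "10.0.0.3", "dst": "10.0.0.4", "protocol": ICMP,
        "time": base + timedelta(minutes=2 * i)} for i in range(5)]
)

flagged = detect_attack1(records)
print(flagged)
```

Attack 3 has the same shape with a different threshold and window and no protocol restriction, so the same sliding-window idea would apply with those parameters changed.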

FIG. 4 shows the results of the insertion test. The insertion test measures how quickly log data can be collected, or how quickly already-collected log data can be loaded. Graph 410 in FIG. 4 shows the result for the MySQL-based prior art, and graph 420 shows the result for the MongoDB-based embodiment of the present invention. As shown, the speed difference between the prior art and the embodiment grows noticeably as the amount of input data increases.

Table 4 below summarizes the results shown in FIG. 4. When 2 million records are inserted, the difference in insertion time between the prior art and the present embodiment is 936.275 seconds; when 10 million records are inserted, the difference is 4606.2 seconds, about five times larger. This indicates that the gap in insertion time grows roughly in proportion to the amount of data.

| Category | 2 million | 5 million | 10 million |
|---|---|---|---|
| Embodiment of the present invention | 59.022 | 152.107 | 340.533 |
| Prior art | 995.298 | 2470.203 | 4946.733 |
| Speed difference (seconds) | 936.275 | 2318.096 | 4606.2 |
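The speed-difference row of Table 4 can be rechecked by subtracting the embodiment's insertion times from the prior art's. The following sketch uses only the values listed in Table 4; the variable names are chosen here for illustration. Direct subtraction gives 936.276 for 2 million records, which differs from the listed 936.275 by 0.001, presumably because the listed times are themselves rounded.

```python
# Insertion times in seconds, keyed by number of records (values from Table 4).
embodiment = {2_000_000: 59.022, 5_000_000: 152.107, 10_000_000: 340.533}
prior_art = {2_000_000: 995.298, 5_000_000: 2470.203, 10_000_000: 4946.733}

# Speed difference = prior-art time minus embodiment time, per data size.
diff = {n: round(prior_art[n] - embodiment[n], 3) for n in embodiment}

# How much the gap grows from 2 million to 10 million records.
ratio = diff[10_000_000] / diff[2_000_000]

for n, d in diff.items():
    print(f"{n:>10,d} records: {d:.3f} s")
print(f"gap growth, 2M -> 10M: {ratio:.2f}x")
```

The ratio comes out just under 5, in line with the "about five times" statement for a fivefold increase in data.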

FIG. 5 is a graph comparing the data search speed of a data analysis method according to an embodiment with that of the prior art.

FIG. 5 shows the results of a search test measuring how quickly and accurately the information a user wants can be retrieved from the data stored in the database. As in FIG. 4, the number of records was increased to 2 million, 5 million, and 10 million while comparing the embodiment of the present invention with the prior art. In FIG. 5, graphs 510 and 520 show the results for the MongoDB-based embodiments of the present invention, and graphs 530 and 540 show the results for the MySQL-based prior art. Graphs 510 and 530 correspond to the case where the number of search queries is three, and graphs 520 and 540 correspond to the case where the number of search queries is four.

As shown, the embodiments of the present invention ran faster than the prior art regardless of the number of search query conditions or the number of records. Table 5 below summarizes the results shown in FIG. 5. The embodiments are largely unaffected by the number of records or conditions, with an overall average search query time of 0.0027 seconds, showing fast search performance and stable query times. In the MySQL-based prior art, by contrast, search time increases as the number of records grows, and it also increases as the number of search query conditions grows, showing unstable and low performance.

| Category | 2 million | 5 million | 10 million |
|---|---|---|---|
| Embodiment of the present invention (search query 3) | 0.001 | 0.0022 | 0.0037 |
| Embodiment of the present invention (search query 4) | 0.002 | 0.0032 | 0.0041 |
| Prior art (search query 3) | 5.8435 | 15.9285 | 37.9725 |
| Prior art (search query 4) | 5.422 | 22.6375 | 46.157 |
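The 0.0027-second overall average quoted for the embodiments can be reproduced from the six embodiment entries in Table 5. The sketch below uses only the table's values, taken to be in seconds as in the surrounding text; the dictionary layout and variable names are assumptions made for illustration.

```python
from statistics import mean

# Search times keyed by number of search queries; each list is ordered
# as 2 million, 5 million, 10 million records (values from Table 5).
embodiment = {3: [0.001, 0.0022, 0.0037], 4: [0.002, 0.0032, 0.0041]}
prior_art = {3: [5.8435, 15.9285, 37.9725], 4: [5.422, 22.6375, 46.157]}

# Overall average across all six embodiment measurements.
overall = mean(t for times in embodiment.values() for t in times)

# Smallest prior-art / embodiment ratio across all six table cells.
min_speedup = min(
    p / e
    for q in embodiment
    for p, e in zip(prior_art[q], embodiment[q])
)

print(f"embodiment overall average: {overall:.4f} s")
print(f"smallest speedup over prior art: {min_speedup:.0f}x")
```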

The data analysis method according to the embodiments described above can be implemented at least partially as a computer program and recorded on a computer-readable recording medium. A computer-readable recording medium on which a program for implementing the data analysis method according to the embodiments is recorded includes every kind of recording device that stores data readable by a computer. Examples of the computer-readable recording medium include a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, and an optical data storage device, and also include implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.

The present invention has been described above with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and variations of the embodiments are possible. Such modifications, however, should be regarded as falling within the technical scope of protection of the present invention. Accordingly, the true technical scope of protection of the present invention should be determined by the technical idea of the appended claims.

Claims

1. A data analysis apparatus comprising:
a NoSQL-based database configured to store input data having one or more fields and contents corresponding to each of the one or more fields;
a Map module configured to generate parsed data by extracting the contents of a predetermined field from the input data, and to generate first output data by filtering the parsed data based on a predetermined rule; and
a Reduce module configured to receive the first output data from the Map module and to merge the first output data according to the contents of the predetermined field to generate second output data,
wherein the database is a MongoDB database.

2. The apparatus according to claim 1,
wherein the input data is log data, and
wherein the one or more fields comprise at least one of a date and time, a source IP, a destination IP, a destination port, a protocol, and an action code.

3. The apparatus according to claim 2,
wherein the predetermined rule includes an attack detection rule based on at least one of a date and time, a source IP, a destination IP, a destination port, a protocol, and an action code, and
wherein the first output data includes data, among the parsed data, determined to be an attack by the attack detection rule.

4. The apparatus according to claim 1,
wherein the second output data includes the number of records in the first output data for which the contents of the predetermined field are the same.

5. (Deleted)

6. A data analysis method comprising:
storing, in a NoSQL-based database, input data having one or more fields and values corresponding to each of the one or more fields;
extracting the contents of a predetermined field from the input data to generate parsed data;
filtering, by a data analysis apparatus, the parsed data based on the contents of the predetermined field to generate first output data;
merging, by the data analysis apparatus, the first output data based on the contents of the predetermined field to generate second output data; and
outputting, by the data analysis apparatus, the second output data,
wherein the database is a MongoDB database.

7. The method according to claim 6,
wherein the input data is log data, and
wherein the one or more fields comprise at least one of a date and time, a source IP, a destination IP, a destination port, a protocol, and an action code.

8. The method according to claim 7,
wherein the predetermined rule includes an attack detection rule based on at least one of a date and time, a source IP, a destination IP, a destination port, a protocol, and an action code, and
wherein the first output data includes data, among the parsed data, determined to be an attack by the attack detection rule.

9. The method according to claim 6,
wherein the second output data includes the number of records in the first output data for which the contents of the predetermined field are the same.

10. (Deleted)

11. A computer-readable recording medium storing instructions that, when executed by a computer, perform a data analysis method comprising:
storing, in a NoSQL-based database, input data having one or more fields and values corresponding to each of the one or more fields;
extracting the contents of a predetermined field from the input data to generate parsed data;
filtering the parsed data based on the contents of the predetermined field to generate first output data;
merging the first output data based on the contents of the predetermined field to generate second output data; and
outputting the second output data,
wherein the database is a MongoDB database.