KR101435789B1

KR101435789B1 - System and Method for Big Data Processing of DLP System

Info

Publication number: KR101435789B1
Application number: KR1020130014423A
Authority: KR
Inventors: 백승태; 류승태; 최일훈
Original assignee: (주)소만사
Priority date: 2013-01-29
Filing date: 2013-02-08
Publication date: 2014-08-29
Also published as: KR20140096936A

Abstract

The present invention discloses a data processing system and method. A data processing server according to one aspect of the invention comprises: a receiving unit that receives log data from at least one DLP (Data Loss Prevention) server; a classification unit that classifies the fields of the log data by predetermined index target fields; an index generation unit that generates index data for the log data from the classified index target fields and stores it, together with the index ID of the log data, in a store within an index node; and a storage management unit that stores the unstructured data of the log data in a distributed manner across a plurality of HDFS instances in an HDFS (Hadoop Distributed File System) cluster, and stores the structured data of the log data, associated with the index ID and with the location information of the unstructured data, in a distributed manner across a plurality of Hive (HIVE) instances in the HDFS cluster.

Description

System and method for big data processing in a DLP system

The present invention relates to a distributed processing system and, more specifically, to a data processing system and method capable of distributed processing of high-volume data.

According to Wikipedia's definition, big data refers to large structured or unstructured data sets that exceed the capacity of conventional database management tools to collect, store, manage, and analyze data.

Big data processing technology, in turn, means collecting, storing, managing, and analyzing large structured or unstructured data sets in order to extract value or derive desired results.

Alternatively, big data processing technology is defined as an architecture technology that extracts value from diverse, large-scale data at low cost and enables very high-speed collection, discovery, and analysis.

Recently, big data and big data processing technology have come into use across many sectors of society, driven by Hadoop, a representative open-source platform, and the components extended from it. Specifically, big data processing technology is increasingly used in politics and society, the economy, business management, and culture, and it contributes to creating new value and new markets.

There are also attempts to apply big data processing technology in the field of information protection commonly referred to as DLP (Data Loss Prevention).

The volume of logs produced per day by DLP systems operated by companies and institutions has grown to tens or even hundreds of gigabytes. As a result, log management costs have increased, and, from a compliance standpoint, the human and time costs of finding a desired log among logs accumulated over months or years have become very high.

For example, if a DLP system produces 20 gigabytes of logs per day, one year of logs amounts to about 5 terabytes, and if it produces 100 gigabytes per day, one year of logs amounts to as much as 24 terabytes. Finding a desired log among them therefore takes from several hours at minimum to several days or more, along with the associated human cost.

As an actual example, Bank W searched about 10 terabytes of DLP logs accumulated over three years in order to submit evidence concerning a financial incident caused by one of its employees. The search took about one month and required three people together with log search and database equipment.

For this reason, DLP systems store logs in relational databases to maintain scalability as log volume grows, and they distribute and store logs across multiple log databases as appropriate. Relational databases are, of course, highly efficient at processing structured data and can store large amounts of it. Storing and managing even larger volumes, however, requires costly investment.

Moreover, because the logs generated by a DLP system mix structured and unstructured data, it is difficult to search the log data quickly, and additional human, time, and physical costs are required.

For example, the structured data in a DLP system's logs includes the protocol type, send/receive time, source/destination IP address, sender/recipient ID or email address, presence or absence of attachments, identification codes of detected sensitive-information types (personal information, financial information, business strategy, investment strategy, product planning information, service planning information, development source code/drawings, and so on), and the number of detections per sensitive-information type. The unstructured data includes email subjects, email bodies, messenger conversation messages, blog titles, blog bodies, and the text content and original files of attached or uploaded files.

FIG. 1 is a block diagram showing conventional DLP servers and DB servers for storing and searching large-volume logs.

As shown in FIG. 1, a plurality of DLP servers collect large-volume DLP logs in a distributed manner and at the same time store them in a distributed manner in a plurality of DBs. Consequently, to find a desired log among the distributed logs, each DB in which logs are stored by period must be accessed and searched separately. That is, if three years of logs are kept on three DB servers, one year per server, the process of searching one year of logs in a DB must be performed three times to find the desired DLP log. In addition, when DLP logs are searched with a condition that combines structured and unstructured data, the search takes tens of hours or several days depending on its complexity.

The present invention was devised against the technical background described above, and its object is to provide a data processing system and method capable of distributed processing of DLP logs.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

A data processing server according to one aspect of the invention comprises: a receiving unit that receives log data from at least one DLP (Data Loss Prevention) server; a classification unit that classifies the fields of the log data by predetermined index target fields; an index generation unit that generates index data for the log data from the classified index target fields and stores it, together with the index ID of the log data, in a store within an index node; and a storage management unit that stores the unstructured data of the log data in a distributed manner across a plurality of HDFS instances in an HDFS (Hadoop Distributed File System) cluster, and stores the structured data of the log data, associated with the index ID and with the location information of the unstructured data, in a distributed manner across a plurality of Hive (HIVE) instances in the HDFS cluster.

A data processing server according to another aspect of the invention retrieves requested log data from a plurality of Hive (HIVE) instances, which store in a distributed manner the structured data, including index IDs and metadata, of log data within an HDFS (Hadoop Distributed File System) cluster, and from a plurality of HDFS instances, which store in a distributed manner the unstructured data of the log data. The server comprises: a management unit that shares a search request, containing a data pair corresponding to a search condition and keyword entered by a user, with the management unit of another data processing device, searches the store in its own index node for index data corresponding to the data pair, checks the result of the search for index data corresponding to the data pair in the other management unit's index node, and combines the search results to generate a list of index IDs corresponding to the data pair; and an output unit that queries the plurality of Hive instances for information corresponding to the index ID list, and receives and outputs the information returned in response by at least one of the Hive instances.

According to the present invention, the time and cost of distributed processing of large-volume, big-data-scale DLP log data can be reduced.

FIG. 1 is a block diagram showing conventional DLP servers and DB servers for storing and searching large-volume logs.
FIG. 2 is a block diagram showing a data processing system according to an embodiment of the present invention.
FIG. 3 is a detailed block diagram of a data processing system according to an embodiment of the present invention.
FIG. 4 shows part of the index function calling code according to an embodiment of the present invention.
FIGS. 5A and 5B show the automatic balancing process between index nodes according to an embodiment of the present invention.
FIG. 6 shows a data pair for an index search request in JSON format according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the invention is complete and so that a person of ordinary skill in the art to which the invention belongs is fully informed of the scope of the invention, which is defined only by the scope of the claims. The terminology used in this specification is for describing the embodiments and is not intended to limit the invention. In this specification, the singular form also includes the plural unless the text specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more other components, steps, operations, and/or elements besides those mentioned.

Specific embodiments of the present invention will now be described with reference to the accompanying drawings.

FIG. 2 is a block diagram showing a data processing system according to an embodiment of the present invention, and FIG. 3 is a detailed block diagram of that system. FIG. 4 shows part of the index function calling code according to an embodiment of the present invention, and FIGS. 5A and 5B show the automatic balancing process between index nodes. FIG. 6 shows a data pair for an index search request in JSON format according to an embodiment of the present invention.

As shown in FIGS. 2 and 3, the data processing system 30 according to an embodiment of the present invention includes an index cluster 300, an HDFS cluster 400, and a search server 200.

The index cluster 300 includes at least one index server (at least one of 300_1 to 300_n) and is used for indexing and searching DLP log data.

Each index server (one of 300_1 to 300_n) includes a receiving unit 340, a classification unit 330, an index generation unit 320, a storage management unit 310, an index node 380, a cluster management unit 350, a search engine 370, and a search management unit 360.

The search server 200 is a middleware system for high-speed search. It searches DLP logs in response to user requests and may be web-based. The search server 200 includes a search request unit 210, a request conversion unit 240, a result receiving unit 220, and a result output unit 230.

The HDFS cluster 400 is a Hadoop-based distributed data store and includes at least one HDFS server (at least one of 400_1 to 400_n). Each HDFS server (one of 400_1 to 400_n) includes Hive (HIVE; BigData Datawarehouse System for Hadoop) 410 and HDFS (Hadoop Distributed File System) 420.

Here, Hive 410 is the data warehouse system used in Hadoop, that is, a database system specialized for big data processing, and is the store in which the structured metadata of the DLP log data is kept. HDFS 420 is the distributed file system used in Hadoop and is the store in which the unstructured data of the DLP log data is kept in units of files.

Below, each unit of the data processing system 30 is described in the context of ① the distributed storage of DLP log data and ② the search of DLP log data.

① Distributed storage of DLP log data

Each DLP server 100 transmits DLP log data sequentially, in a round-robin manner, through its log transmission unit 110. Because the distributed storage of DLP log data is performed mainly by an index server (one of 300_1 to 300_n), each unit of an index server in the index cluster 300 is described below with reference to FIG. 3.
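The round-robin transmission described above can be sketched as follows. This is a minimal illustration only: the function name `dispatch_round_robin` and the server labels are hypothetical, and it assumes the rotation is over the index servers in the index cluster, which the text implies but does not state outright.

```python
from itertools import cycle


def dispatch_round_robin(logs, index_servers):
    """Pair each log with the next index server in a fixed rotation.

    Hypothetical helper: the patent only says logs are sent sequentially
    in a round-robin manner; the rotation target is an assumption.
    """
    targets = cycle(index_servers)
    return [(next(targets), log) for log in logs]


assignments = dispatch_round_robin(
    ["log-a", "log-b", "log-c", "log-d"],
    ["300_1", "300_2", "300_3"],
)
for server, log in assignments:
    print(server, log)
```

With three servers, the fourth log wraps around to the first server, so no single index server receives consecutive logs.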

The receiving unit 340 receives DLP log data transmitted from at least one DLP server 100 and passes it to the classification unit 330. The DLP log data is transmitted from the log transmission unit 110 of the DLP server 100.

On receiving DLP log data from the receiving unit 340, the classification unit 330 classifies the detailed fields of the DLP log data and, as in FIG. 4, requests indexing of the classified fields.

Specifically, the classification unit 330 identifies the predetermined index target fields in each item of DLP log data and classifies the DLP log data by index target field. The classification unit 330 also extracts a text stream from unstructured data, including files contained in the DLP log data, classifies it as one of the index target fields, and passes it to the index generation unit 320. The classification unit 330 preferably passes the index target fields of one item of DLP log data to the index generation unit 320 separately from the index target fields of other DLP log data.

Here, the index target fields are predefined by the administrator and may be at least one of: protocol type, send time, receive time, source IP address, destination IP address, sender ID, recipient ID, email address, presence or absence of attachments, sensitive-information type, number of detections per sensitive-information type identification code, email subject, email body, messenger conversation message, blog title, blog body, and the text stream and file name extracted from attached or uploaded files. Sensitive information may be any information requiring security, such as personal information, financial information, business strategy, investment strategy, product planning information, service planning information, development source code, or drawings. The type of sensitive information and the number of detections per identification code are detected by the index server and included in the DLP log data.
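As a rough sketch of the classification step, the snippet below keeps only the administrator-defined index target fields of one log record. The field names and the record layout are assumptions made for illustration; the patent does not specify a record format.

```python
# Hypothetical set of index target fields chosen by an administrator.
INDEX_TARGET_FIELDS = {"protocol", "sender_id", "email_subject", "email_body"}


def classify(log_record, target_fields=INDEX_TARGET_FIELDS):
    """Return only the index target fields of a single log record.

    Each record is classified on its own, so the fields of one log are
    never mixed with the fields of another.
    """
    return {
        name: value
        for name, value in log_record.items()
        if name in target_fields
    }


record = {
    "protocol": "SMTP",
    "sender_id": "hong",
    "email_subject": "Q3 plan",
    "email_body": "See attached draft.",
    "internal_seq": 42,  # not an index target field, so it is dropped
}
classified = classify(record)
print(classified)
```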

The index generation unit 320 indexes the index target fields of the classified DLP log data to generate index data (that is, an index ID and the corresponding index entries), and stores the index data in a store allocated to the index node 380.

For example, the index generation unit 320 assigns a distinct index ID to each item of DLP log data and generates at least one piece of key text from each item as an index entry. Specifically, among the index target fields of the DLP log data, the index generation unit 320 performs morphological analysis on the text streams of the unstructured data fields, excluding the structured data fields. It then performs full-text indexing on the text produced by the morphological analysis to generate at least one index entry, which is the key text of that item of DLP log data. Here, a structured data field may be a data field holding an IP address, a time, or an integer value. For instance, the index generation unit 320 may analyze the morphemes of English, Korean, and numeric text in unstructured data fields using an N-gram method, so that the text stream "철수가 방에 들어간다" ("Cheolsu enters the room") is broken into "철수", "방", "들어", and "간다".
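To make the N-gram idea concrete, here is a minimal sketch of character-bigram tokenization feeding a small inverted index. It is not the patent's analyzer: a plain character N-gram yields overlapping fragments (including particles such as "수가"), whereas the tokens listed above reflect additional morphological processing. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Split text on whitespace and emit overlapping character n-grams.

    Words no longer than n characters are kept whole.
    """
    tokens = []
    for word in text.split():
        if len(word) <= n:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens


def build_index(docs):
    """Map each token to the set of index IDs whose text contains it."""
    index = defaultdict(set)
    for index_id, text in docs.items():
        for token in char_ngrams(text):
            index[token].add(index_id)
    return index


tokens = char_ngrams("철수가 방에 들어간다")
index = build_index({1: "철수가 방에 들어간다", 2: "영희가 방에 있다"})
print(tokens)
print(sorted(index["방에"]))
```

A lookup by token then returns the index IDs of every log whose unstructured text contains that fragment, which is the basis of full-text search over the logs.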

The index generation unit 320 stores the generated index data sequentially in the primary stores of the index node 380, selecting among the primary stores in a round-robin manner.

One index node 380 is provided in each index server (one of 300_1 to 300_n). It has as many primary stores and replica (replication) stores as the administrator specifies, and it stores the index data. The number of index nodes is preferably set to three or more for efficient automatic balancing and load balancing among the index nodes. A primary store holds index data, and a replica store is a copy of a primary store that is used as a backup for failure recovery.

The cluster management unit 350 performs automatic balancing (Auto Balancing) and load balancing (Load Balancing) among index nodes through automatic relocation (Auto Relocation) of primary and replica stores among the index nodes in the index cluster 300. Index data is, of course, smaller than the DLP log data, but its volume grows as index data accumulates, which can slow DLP log searches. Moreover, if any one of the index nodes fails, distributed storage and distributed search across the index nodes may be disrupted. The cluster management unit 350 therefore performs load balancing among the index nodes.

To this end, the cluster management unit 350 periodically exchanges status information with the cluster management units of all index servers in the index cluster 300. Using the exchanged status information, it performs cross health checking (Cross Health Checking) for failures of the index nodes in other index servers. The cluster management unit 350 according to the invention can thereby ensure high availability and flexible scalability of indexing operations and index data.

As one example, suppose three primary stores are specified, with one replica store per primary store, and the cluster management unit 350 has gradually expanded the index nodes 380 to a total of three. The automatic relocation performed by the cluster management unit 350 may then proceed as in FIG. 5A. Specifically, the cluster management unit 350 also uses the index nodes of other index servers, so that a primary store (Primary) and its corresponding replica store (Replica) are never placed together on the same index node.
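The placement rule (a primary store and its replica never share an index node) can be sketched as below. The specific layout is an assumption: FIG. 5A is not reproduced here, so the sketch simply puts Primary k on node k and Replica k on the preceding node, which agrees with the later statement that Replica 3 resides on index node 2. The function and node names are hypothetical.

```python
def place_stores(num_primaries, nodes):
    """Assign primary and replica stores so a pair never shares a node.

    Assumed layout: Primary k goes to node k (modulo the node count) and
    Replica k goes to the node just before it.
    """
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate a pair")
    layout = {node: [] for node in nodes}
    for k in range(num_primaries):
        primary_node = nodes[k % len(nodes)]
        replica_node = nodes[(k - 1) % len(nodes)]
        layout[primary_node].append(f"Primary {k + 1}")
        layout[replica_node].append(f"Replica {k + 1}")
    return layout


layout = place_stores(3, ["node1", "node2", "node3"])
for node, stores in layout.items():
    print(node, stores)
```

Because each replica lives on a different node from its primary, losing any single node still leaves one copy of every store on the surviving nodes.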

As another example, if index node 3 fails while a total of three index nodes are in use as in FIG. 5A, the cluster management units of the index servers to which index node 1 and index node 2 belong detect the failure automatically. As shown in FIG. 5B, the cluster management unit of the index server containing index node 2, which holds replica store 3 (Replica 3), the replica of primary store 3 (Primary 3), then creates primary store 3 from replica store 3. In addition, the cluster management unit of the index server to which index node 3 belongs newly creates replica store 1, the replica of primary store 1.
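The promotion step in this failover can be illustrated with a small sketch. It assumes a starting layout in which Primary 3 and Replica 1 sit on index node 3 (an assumption, since FIG. 5A is not reproduced here), and it models only the promotion of a surviving replica to primary; re-creating lost replicas is deliberately left out. All names are hypothetical.

```python
def handle_node_failure(layout, failed_node):
    """Drop the failed node and promote replicas of its lost primaries.

    Returns a new layout; the input layout is not modified.
    """
    survivors = {
        node: list(stores)
        for node, stores in layout.items()
        if node != failed_node
    }
    for store in layout[failed_node]:
        kind, number = store.split()
        if kind != "Primary":
            continue  # a lost replica needs no promotion
        replica = f"Replica {number}"
        for stores in survivors.values():
            if replica in stores:
                # The surviving replica takes over as the primary store.
                stores[stores.index(replica)] = store
                break
    return survivors


before = {
    "node1": ["Primary 1", "Replica 2"],
    "node2": ["Primary 2", "Replica 3"],
    "node3": ["Primary 3", "Replica 1"],
}
after = handle_node_failure(before, "node3")
print(after)
```

After index node 3 is removed, index node 2 serves Primary 3 from what used to be Replica 3, so all three primary stores remain available on two nodes.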

In this way, if any one index node fails while three index nodes are in operation, the system can switch automatically to operating with two index nodes through automatic negotiation among the cluster management units, which ensures high availability of the index nodes.

When indexing of an item of DLP log data is complete, the index generation unit 320 adds the resulting index ID to that item of DLP log data and passes it to the storage management unit 310.

The storage management unit 310 stores the DLP log data, with the index ID added, in a distributed manner in Hive 410 and HDFS 420 respectively, following the interface conventions for Hadoop-based data processing.

Specifically, the storage management unit 310 stores the metadata fields of the DLP log data in Hive 410 as table records, and stores the unstructured data fields of the DLP log data in HDFS 420 in a distributed manner as file data sets. The metadata may be specified by the administrator and may include, among the DLP log data, the protocol type, send time, receive time, source IP, destination IP, sender ID, recipient ID, email address, presence or absence of attachments, detected sensitive-information type, email subject, blog title, attachment file name, and the like. The unstructured data may be an email body, a messenger conversation message, a blog body, a text stream extracted from a file, an original file, or the like.

In doing so, the storage management unit 310 stores the unstructured data in HDFS 420 as file data sets, then determines the location information of each file data set and hashes it. The storage management unit 310 then stores the location-information hash value of the file data set, together with the index ID and the metadata, as a table record in Hive 410 of an HDFS server (one of 400_1 to 400_n) in a distributed manner. In this way, the storage management unit 310 can store the location-information hash value of a file data set in association with the index ID and the metadata.
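The association between a file data set's location hash, the index ID, and the metadata might look like the following sketch. The hash algorithm (SHA-256), the column names, and the example path are all assumptions; the patent states only that the location information is hashed and stored with the index ID and metadata as a table record.

```python
import hashlib


def make_hive_record(index_id, metadata, hdfs_path):
    """Build one table record linking metadata to a hashed file location.

    SHA-256 is an illustrative choice; the source does not name an
    algorithm. The record is a plain dict standing in for a Hive row.
    """
    location_hash = hashlib.sha256(hdfs_path.encode("utf-8")).hexdigest()
    return {"index_id": index_id, "location_hash": location_hash, **metadata}


record = make_hive_record(
    index_id=1001,
    metadata={"protocol": "SMTP", "sender_id": "hong"},
    hdfs_path="/dlp/logs/2013/01/29/part-0001",  # hypothetical path
)
print(record)
```

A search that resolves to index ID 1001 can then read this record to obtain both the structured metadata and the key that points to the unstructured file data set.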

Meanwhile, a person skilled in the art will know that Hive 410 and HDFS 420, as part of a Hadoop-based platform, also provide distributed data processing and services and are built on a structure that ensures high availability, so a detailed description is omitted. Accordingly, by using the Hadoop-based distributed storage and distributed parallel structure, the invention can provide the desired search results within a few hours even when searching DLP log data that has not been indexed.

As described above, the present invention applies big data processing platforms and technologies typified by the open-source projects Hadoop and MapReduce, so that DLP log data can be divided into sets of file units according to its characteristics on the basis of HDFS (Hadoop Distributed File System), and stored in a distributed manner after indexing.

The indexing of DLP log data and the Hadoop platform-based distributed storage process have been described above. A high-speed search method for the indexed and distributed DLP log data is described below.

② DLP Log Data Retrieval Process

The DLP log data retrieval process is mainly performed by the search server 200 and the index servers 300_1 to 300_n. The retrieval process is described below in terms of each component of the search server 200 and the index servers 300_1 to 300_n.

When a search request from an administrator is received, the request conversion unit 240 of the search server identifies and analyzes the search condition and the search keyword in the request, and generates data pairs of search condition and keyword redefined in the form "condition item:keyword". The administrator can use an administrator PC to access the search server 200 on the web, enter the search condition and search keyword, and request a search of the corresponding index.

The search request unit 210 of the search server converts the data pairs of the form "condition item:keyword" into JSON format. The search request unit 210 then selects an index server from among the index servers in the index cluster 300 in a round-robin manner, and requests an index search by transmitting a request message containing the JSON-format data pairs to the search management unit 360 of the selected index server. The JSON-format data pairs used for the index search request of the search request unit 210 according to an embodiment of the present invention are shown in FIG. 6.
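The two steps performed by the search request unit, JSON conversion and round-robin server selection, might look like the following Python sketch. The JSON layout, the key names, and the server names are assumptions; the actual format of FIG. 6 is not reproduced here.

```python
import itertools
import json


def to_json_request(pairs):
    """Serialize (condition item, keyword) pairs; the layout is hypothetical."""
    conditions = [{"field": field, "keyword": keyword} for field, keyword in pairs]
    return json.dumps({"conditions": conditions}, ensure_ascii=False)


class RoundRobinSelector:
    """Hand out index servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def next_server(self):
        return next(self._cycle)


selector = RoundRobinSelector(["idx-1", "idx-2", "idx-3"])  # hypothetical names
picked = [selector.next_server() for _ in range(4)]
request = to_json_request([("sender_id", "kim")])
```

Round-robin selection spreads incoming requests evenly without the search server having to track which index server holds which data.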

When the index search request is received, the search management unit 360 of the index server forwards the request message containing the JSON-format data pairs to the search management units of the other index servers in the index cluster 300 and shares it with them. Accordingly, the search management units 360 of all index servers in the index cluster 300 can search their index nodes 380 almost simultaneously.

At this time, the search management unit 360 and the other search management units each use their own search engine 370 to search the main storage of their own index node 380 for index data corresponding to the JSON-format data pairs. The search management unit 360 then checks the search results for index data corresponding to the JSON-format data pairs found by the other search management units. Next, the search management unit 360 combines the search results (index data and index IDs) of the other search management units with its own search results (index data and index IDs) to generate a list of index IDs that satisfy the search condition. For example, each search management unit 360 checks whether index data corresponding to the JSON-format data pairs exists in the main storage of its index node 380, and generates a list of index IDs corresponding to all index data found.
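The local search and the combination of results can be sketched in Python as follows. The toy inverted index, the integer index IDs, and the choice to AND multiple data pairs together are all assumptions made for illustration; the specification does not say how several conditions are combined.

```python
def search_local(index, pairs):
    """Look up each (field, keyword) pair in a toy inverted index.

    index maps (field, keyword) to a set of index IDs. Multiple pairs are
    intersected (AND), which is an assumption of this sketch.
    """
    result = None
    for pair in pairs:
        ids = set(index.get(pair, set()))
        result = ids if result is None else result & ids
    return result or set()


def combine_results(local_ids, peer_results):
    """Merge local and peer index IDs into one de-duplicated, sorted list."""
    merged = set(local_ids)
    for ids in peer_results:
        merged |= set(ids)
    return sorted(merged)


index = {("sender_id", "kim"): {1, 2, 3}, ("protocol", "smtp"): {2, 3, 4}}
local = search_local(index, [("sender_id", "kim"), ("protocol", "smtp")])
index_id_list = combine_results(local, [[3, 7], [9]])
```

Each index node only holds part of the index, so the union across nodes yields the full list of matching index IDs.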

Each search management unit 360 then transmits its search result, that is, the list of index IDs satisfying the search condition, to the search server 200.

The result receiving unit 220 then receives the index ID list and passes it to the result output unit 230.

Based on the index ID list, the result output unit 230 queries the Hive 410 of one of the HDFS servers 400_1 to 400_n for the table records corresponding to the index ID list (that is, the index ID, the metadata, and the location information of the file data set). The query from the result output unit 230 is delivered to the Hives 410 of all HDFS servers 400_1 to 400_n in the HDFS cluster 400.

On receiving the query from the result output unit 230, the Hive 410 of each HDFS server identifies the response corresponding to the query (that is, the record set having each index ID in the index ID list) and transmits it to the result output unit 230.
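A query of this kind could be expressed in HiveQL with an `IN` predicate over the index ID list. The Python sketch below only builds the query string. The table name `dlp_log_meta`, the column name `index_id`, and the assumption that index IDs are integers are all hypothetical, since the specification does not give a schema.

```python
def build_hive_query(index_ids, table="dlp_log_meta"):
    """Build a HiveQL SELECT for the records matching the index ID list."""
    ids = [int(i) for i in index_ids]  # int() rejects non-numeric input
    if not ids:
        raise ValueError("index ID list is empty")
    id_list = ", ".join(str(i) for i in ids)
    return f"SELECT * FROM {table} WHERE index_id IN ({id_list})"


query = build_hive_query([3, 7])
```

Converting every ID with `int()` before formatting keeps arbitrary text out of the query string.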

The result output unit 230 sequentially displays on the screen the responses received from the Hive 410 of each HDFS server, that is, the index ID of the DLP log data satisfying the search condition, the metadata, and the location information hash value of the file data set of the unstructured data corresponding to the metadata. For example, the result output unit 230 may display the received responses in the order in which they are received.

When the administrator reviews the metadata corresponding to the search condition, selects at least one item of the displayed DLP log data, and requests its details, the result output unit 230 uses the file location hash value corresponding to the selected DLP log data to fetch the details of the selected DLP log from the HDFS 420 and displays them on the screen.
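The detail lookup can be pictured as a second, on-demand step that reads the stored file only when a record is selected. In the Python sketch below, the `hash_to_path` mapping and the `read_file` callable are hypothetical stand-ins; the specification does not describe how a location hash value is resolved to a file in HDFS.

```python
def fetch_detail(record, hash_to_path, read_file):
    """Resolve the record's location hash and read the stored file data set."""
    path = hash_to_path.get(record["location_hash"])
    if path is None:
        raise KeyError(f"unknown location hash: {record['location_hash']}")
    return read_file(path)


# Toy stand-ins for HDFS and for the hash-to-location mapping.
fake_hdfs = {"/dlp/2013/01/mail-0001.txt": "Quarterly figures attached."}
hash_to_path = {"ab12": "/dlp/2013/01/mail-0001.txt"}
row = {"index_id": 17, "subject": "Q4 report", "location_hash": "ab12"}

detail = fetch_detail(row, hash_to_path, fake_hdfs.__getitem__)
```

Only the small metadata row is needed to list results, so the large payload is read once per selected record rather than once per search hit.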

Although the above example describes the DLP log data retrieval process as being performed through cooperation between the search server and the index server, the retrieval process may of course be performed by either the search server or the index server alone.

As such, the search server 200 according to an embodiment of the present invention does not need to know or check on which DLP server the index data of the DLP log data corresponding to the administrator's search request resides. It simply converts the search condition of the administrator's request into JSON format and transmits it to one of the index servers 300_1 to 300_n, and thereby obtains the desired search result from that index server.

In addition, according to the present invention, desired log data can be searched and viewed within seconds to hours in big-data-scale DLP log data of several terabytes or more accumulated over months to years, which drastically reduces the human and time costs required to query and search large volumes of DLP log data.

Moreover, instead of a conventional commercial DBMS and expensive servers and storage, the present invention uses a low-cost HDFS cluster composed of multiple PC-class servers and DAS (Direct Attached Storage, typified by HDDs) to enable storage and high-speed search of big-data-scale DLP log data. Accordingly, the present invention provides higher scalability and availability through distributed parallel search linked with a search engine, while efficiently collecting, storing, managing, and analyzing big-data-scale DLP log data at 20 to 50% of the cost of a conventional commercial database.

Furthermore, in the present invention, unstructured data such as files, whose size is much larger than that of the metadata, is excluded when the primary search results are queried and displayed, and is provided secondarily at the user's request. This allows search results to be managed efficiently and guarantees page-by-page screen output performance. Therefore, the present invention can guarantee high-speed search performance for DLP log data and indexes based on Hadoop-based distributed parallel data processing.

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is merely illustrative, and those of ordinary skill in the art to which the present invention pertains can of course make various modifications and changes within the scope of the technical idea of the present invention. For example, although the above specification describes the case in which the index server and the search server perform distributed processing of DLP log data generated by a DLP server, the index server and the search server may of course also perform distributed processing of log data containing unstructured data delivered by other servers.

Therefore, the scope of protection of the present invention should not be limited to the embodiments described above, but should be determined by the following claims.

Claims

A data processing server comprising:
a receiving unit for receiving log data from at least one DLP (Data Loss Prevention) server;
a classifier for classifying the fields of the log data into predetermined index target fields;
an index generator for generating index data of the log data using the classified index target fields and storing the index data, together with an index ID of the log data, in a storage within one index node; and
a storage management unit for distributing and storing unstructured data of the log data across a plurality of HDFSs in an HDFS (Hadoop Distributed File System) cluster, and for distributing and storing structured data of the log data across a plurality of Hives in the HDFS cluster in association with the index ID and location information indicating where the unstructured data is stored.

The data processing server of claim 1, further comprising:
a cluster manager that shares state information of the one index node and of another index node in another data processing server, and that uses the state information to balance the load between the one index node and the other index node by relocating storage between them.

The data processing server of claim 1,
wherein a location information hash value, obtained by hashing the location information of the unstructured data, is stored in the form of a table record in association with the index ID and the structured data.

The data processing server of claim 1,
wherein a text stream is extracted from a file included in the log data and is classified into one of the index target fields that contain the unstructured data.

The data processing server of claim 1,
wherein the index data for the log data is generated by morphologically analyzing the unstructured data of the log data and extracting key text from it.

The data processing server of claim 1,
wherein the unstructured data is distributed and stored in the form of file data sets.

The data processing server of claim 1,
wherein the structured data includes at least one of the protocol type of the log data, the transmission time, the reception time, the source IP address, the destination IP address, the sender ID, the receiver ID, the email address, the presence of an attachment, the email subject, the blog title, the type of sensitive information included in the log data, and the number of detections for each identification code of the sensitive information type.

The data processing server of claim 1,
wherein the unstructured data includes at least one of an email body, messenger conversation messages, a blog body, and a text stream extracted from an attached or uploaded file.

A data processing server for retrieving log data from an HDFS (Hadoop Distributed File System) cluster that includes a plurality of Hives storing, in a distributed manner, an index ID of the log data and structured data including metadata, and a plurality of HDFSs storing, in a distributed manner, unstructured data of the log data, the data processing server comprising:
a management unit that shares a search request, including a data pair corresponding to a search condition and a keyword entered by a user, with other management units of other data processing servers, searches a storage in its own index node for index data corresponding to the data pair, checks the search results for index data corresponding to the data pair found by the other management units in their index nodes, and combines the search results to generate an index ID list corresponding to the data pair; and
an information processing unit that queries the plurality of Hives for information corresponding to the index ID list, and receives and outputs the information corresponding to the index ID list from at least one of the plurality of Hives.

The data processing server of claim 9,
wherein the information corresponding to the index ID list includes the index ID of the log data corresponding to the data pair, the metadata, and a location information hash value of a file data set.

The data processing server of claim 10,
wherein, when the user requests details of the information corresponding to the index ID list, the file data set is fetched, using the location information hash value, from the HDFS among the plurality of HDFSs that corresponds to the location information hash value, and is output.

The data processing server of claim 11, further comprising:
a conversion unit that identifies the search condition and the keyword in a search request entered by the user and converts them into JSON-format data pairs of the form "condition item:keyword".