KR102090239B1

KR102090239B1 - Method for detecting anomality quickly by using layer convergence statistics information and system thereof

Info

Publication number: KR102090239B1
Application number: KR1020190123368A
Authority: KR
Inventors: 한영석; 김형근; 이정표
Original assignee: 주식회사 모비젠
Priority date: 2019-10-04
Filing date: 2019-10-04
Publication date: 2020-03-17

Abstract

According to an embodiment of the present invention, a method is disclosed that generates hierarchical convergence statistical information using both global and local statistical information and, based on the generated information, detects at high speed the abnormality (anomality) of information output by a system.

Description

High-speed anomaly detection method and system using hierarchical convergence statistics information {Method for detecting anomality quickly by using layer convergence statistics information and system thereof}

The present invention relates to a high-speed anomaly detection method using hierarchical convergence statistical information and a system thereof. More specifically, it relates to a method and system that generate hierarchical convergence statistical information using both global and local statistical information and, based on the generated information, detect at high speed the abnormality (anomality) of information output by a system.

For an administrator who manages a database of large-scale numerical information, it is important to quickly find outliers that exhibit abnormality. The sooner the outliers output by the database are identified, the easier it is to recover and correct the system.

Methodologies for finding outliers include statistical methods, machine learning methods, and methods using neural networks, but to date no method guarantees clearly superior performance in terms of accuracy relative to cost.

As one example, methods based on statistical information have difficulty balancing global and local coverage of abnormality, and adjusting their level of accuracy is not easy either. A statistical method uses statistical metric formulas to model the distribution of the data, judges from that model how regular or irregular each data point is, and can assess abnormality with respect to the distribution using the z-score. A frequently cited weakness of statistical methods is that they take no account of the temporal or spatial context in which an event occurs. In particular, data that appear anomalous also include outliers that occur infrequently but are in fact merely noise and have little to do with real abnormality, and statistical methods have difficulty distinguishing such outliers from meaningful anomalous data.

As another example, neural network-based techniques can achieve the most sophisticated and flexible accuracy when carefully designed, but their high cost can be an obstacle. In particular, applying neural network-based techniques to very large databases may be considered unrealistic because of the enormous amount of computation and cost involved. Autoencoders are one example of a neural network method. Such methods can process data fast enough when applied at the application stage after modeling is complete, but when the given data amount to millions of records and modeling has not been completed, it is impossible to output results within a few seconds.

As yet another example, finding outliers with machine learning requires training time before the model can produce results, and the resulting delay is unavoidable, so results cannot be produced in real time for large-scale data. Machine learning methods attempt a semantic analysis that takes into account the context in which anomalous events occur within the system; representative examples include the support vector machine (SVM) and principal component analysis (PCA).

In other words, with conventionally known methodologies alone it has been nearly impossible to determine quickly whether a database that outputs numerical values on a large scale is outputting anomalous data. A methodology is therefore needed that minimizes computational complexity and finds outliers quickly while guaranteeing at least a certain level of accuracy.

Republic of Korea Registered Patent No. 10-1964412 (published April 1, 2019)

The technical problem to be solved by the present invention is to provide a method for determining the abnormality of database data that uses statistical information yet derives results very quickly by means of an algorithmic mechanism, and a system for implementing the method.

A method according to an embodiment of the present invention for solving the above technical problem includes: an overall statistics calculation step of receiving raw data including a plurality of data from a database and calculating overall statistics of the data included in the raw data; a first cluster statistics calculation step of dividing the raw data, according to a preset division constant, into first layer clusters that contain the data, and calculating first cluster statistics for each of the first layer clusters; a first layer statistics calculation step of calculating hierarchical convergence statistics for the first layer based on the calculated overall statistics and the calculated first cluster statistics; a second cluster statistics calculation step of dividing the first layer clusters, according to the division constant, into second layer clusters that contain the data, and calculating second cluster statistics for each of the second layer clusters; a second layer statistics calculation step of calculating hierarchical convergence statistics for the second layer based on the hierarchical convergence statistics for the first layer and the calculated second cluster statistics; a statistical value iterative calculation step of repeating the second cluster statistics calculation step and the second layer statistics calculation step until a preset condition is satisfied; and an abnormality determination step of determining the abnormality (anomality) of the data belonging to the last layer clusters of the final layer, based on the last hierarchical convergence statistics calculated in the statistical value iterative calculation step.

A system according to another embodiment of the present invention for solving the above technical problem includes: an overall statistics calculation unit that receives raw data including a plurality of data from a database and calculates overall statistics of the data included in the raw data; a first cluster statistics calculation unit that divides the raw data, according to a preset division constant, into first layer clusters that contain the data, and calculates first cluster statistics for each of the first layer clusters; a first layer statistics calculation unit that calculates hierarchical convergence statistics for the first layer based on the calculated overall statistics and the calculated first cluster statistics; a second cluster statistics calculation unit that divides the first layer clusters, according to the division constant, into second layer clusters that contain the data sets, and calculates second cluster statistics for each of the second layer clusters; a second layer statistics calculation unit that calculates hierarchical convergence statistics for the second layer based on the hierarchical convergence statistics for the first layer and the calculated second cluster statistics; a statistical value iterative calculation unit that repeats the calculation of cluster statistics and hierarchical convergence statistics until a preset condition is satisfied; and an abnormality determination unit that determines the abnormality (anomality) of the data belonging to the last layer clusters of the final layer, based on the last hierarchical convergence statistics calculated by the statistical value iterative calculation unit.

An embodiment of the present invention may provide a computer-readable recording medium storing a program for executing the method.

By using a statistical method while at the same time fusing global and local information, the present invention can output data exhibiting abnormality as a result at very high speed. This distinguishes it from existing statistical methods, which have produced results reflecting only global characteristics, and also from neural network and machine learning techniques, which by the nature of their algorithms cannot output the results of model training and inference within a few seconds on the ordinary servers in common use today.

FIG. 1 is a block diagram showing an example of a system according to the present invention.
FIG. 2 is a diagram schematically showing an example of a statistical value iterative calculation unit implemented as a recursive function.
FIG. 3 is a diagram for explaining hierarchical convergence statistics and layer cluster statistics.
FIG. 4 is a diagram illustrating an example of a process for outputting data sets that exhibit abnormality.
FIG. 5 is a flowchart of an example of a high-speed anomaly detection method using hierarchical convergence statistical information according to the present invention.

The terms used in the embodiments were chosen, as far as possible, from general terms currently in wide use while taking into account their function in the present invention, but they may vary according to the intent of those working in the field, precedent, or the emergence of new technology. In certain cases a term has been chosen arbitrarily by the applicant, in which case its meaning is set out in detail in the relevant part of the description. The terms used in the present invention should therefore be defined on the basis of their meaning and the overall content of the invention, not merely by their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. Terms such as "…unit" and "…module" used in the specification denote a unit that handles at least one function or operation, which may be implemented in hardware, in software, or in a combination of the two.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing an example of a system according to the present invention.

Referring to FIG. 1, the system 100 according to the present invention includes an overall statistics calculation unit 110, a first cluster statistics calculation unit 120, a first layer statistics calculation unit 130, a second cluster statistics calculation unit 140, a second layer statistics calculation unit 150, a statistical value iterative calculation unit 160, an abnormality determination unit 170, and an anomaly count receiving unit 180. Depending on the embodiment, the anomaly count receiving unit 180 may be omitted from the system 100.

The overall statistics calculation unit 110 receives raw data including a plurality of data sets from a database and calculates overall statistics of the data included in the raw data.

Depending on the embodiment, the database may be included in the system 100 according to the present invention or may be omitted from it. When the database is omitted from the system 100, the overall statistics calculation unit 110 may receive the raw data from the database over a wired cable or by wireless communication. The database stores the raw data, and the raw data are the various items of numerical information themselves, organized according to a format of their own defined within the database.

The raw data received from the database may be log data of a particular system. The raw data may also be data consisting of numbers or characters that have been stored in binary or hexadecimal form. The raw data include a plurality of data, each of which is output at a different point in time or from a different module. Each datum may include not only the essential data that are actually meaningful for calculating statistics, but also metadata, such as the time at which the datum was recorded or information for telling different data sets apart. In what follows, the term data set may be used for the combination of essential data and metadata.

Table 1 shows an example of raw data containing a total of eight data sets. Referring to Table 1, the raw data include eight data sets, which may be numerical data output at the same point in time by eight different modules, or numerical data output by a single module at eight different points in time.

The overall statistics calculation unit 110 receives the raw data, identifies the individual data sets it contains, and then calculates overall statistics of the data. The statistical value that the overall statistics calculation unit 110 calculates from the data may be at least one of the mean (average), variance, standard deviation, and z-score. Depending on the embodiment, the statistic may also be a new statistical value derived from a combination of these.
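The overall statistics named above can be illustrated with a minimal Python sketch. The function name `overall_statistics`, the sample values, and the use of population (rather than sample) variance are assumptions made for illustration only; the specification does not fix any of them.

```python
import statistics

def overall_statistics(values):
    """Return mean, variance, standard deviation and z-scores for `values`."""
    mean = statistics.fmean(values)
    # Assumption: population variance. The text does not say whether
    # population or sample variance is intended.
    variance = statistics.pvariance(values, mu=mean)
    std = variance ** 0.5
    # Guard against a zero standard deviation (all values identical).
    z_scores = [(v - mean) / std if std > 0 else 0.0 for v in values]
    return {"mean": mean, "variance": variance, "std": std, "z_scores": z_scores}

raw_values = [1, 2, 3, 4]  # hypothetical raw data
overall = overall_statistics(raw_values)
print(overall["mean"])      # 2.5
print(overall["variance"])  # 1.25
```

The z-score of each value expresses its distance from the mean in units of the standard deviation, which is how a purely global statistical check would rank candidates.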

The first cluster statistics calculation unit 120 divides the raw data, according to a preset division constant, into first layer clusters that contain the data, and calculates first cluster statistics for each of the first layer clusters.

The division constant is a value that the first cluster statistics calculation unit 120 consults in order to divide the raw data into groups of data (or data sets), and it is preset inside the unit. For example, if the division constant is 2, the first cluster statistics calculation unit 120 divides the raw data into two first layer clusters. As another example, if the division constant is 2 and the raw data contain eight data (or data sets) as in Table 1, the unit generates two first layer clusters in total, each of which is a group containing four data (or data sets). Here, "first layer" is a label that distinguishes this layer from the second layer, third layer, and so on described later, and both clusters generated by the first cluster statistics calculation unit 120 are first layer clusters.

After dividing the data included in the raw data into first layer clusters, the first cluster statistics calculation unit 120 calculates first cluster statistics for each first layer cluster. The first cluster statistics are statistical values calculated from the data belonging to each first layer cluster, and they are calculated in the same way that the overall statistics calculation unit 110 calculates the overall statistics. As an example, suppose the raw data contain 1, 2, 3, and 4, the overall statistic is the mean, calculated as 2.5, and the first layer clusters are {1, 2} and {3, 4}. The first cluster statistics are then 1.5 and 3.5, respectively. Although 1.5 and 3.5 are different values, both are statistical values of clusters belonging to the first layer and may be referred to collectively in this specification as first cluster statistics. The second cluster statistics and third cluster statistics described later are to be understood in the same way.
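The {1, 2}, {3, 4} example can be reproduced with a short Python sketch. The helper name `split_into_clusters` is hypothetical, and splitting into contiguous, near-equal slices is an assumption; the specification does not state how individual data are assigned to clusters.

```python
from statistics import fmean

def split_into_clusters(values, division_constant):
    """Split `values` into `division_constant` contiguous clusters.

    Assumption: contiguous slices of near-equal size; any remainder is
    spread over the leading clusters.
    """
    base, extra = divmod(len(values), division_constant)
    clusters, start = [], 0
    for i in range(division_constant):
        size = base + (1 if i < extra else 0)
        if size:
            clusters.append(values[start:start + size])
        start += size
    return clusters

raw_values = [1, 2, 3, 4]
overall_mean = fmean(raw_values)                    # overall statistic (mean)
first_layer = split_into_clusters(raw_values, 2)    # division constant 2
first_cluster_stats = [fmean(c) for c in first_layer]

print(overall_mean)         # 2.5
print(first_layer)          # [[1, 2], [3, 4]]
print(first_cluster_stats)  # [1.5, 3.5]
```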

Based on the overall statistics calculated by the overall statistics calculation unit 110 and the first cluster statistics calculated by the first cluster statistics calculation unit 120, the first layer statistics calculation unit 130 calculates the first layer convergence statistics. Here, the first layer convergence statistics are a value that combines the statistics of the first layer clusters with the statistical values accumulated in the layers above the first layer.

The present invention calculates the first layer convergence statistics in order to overcome the problems of the conventionally known statistical methods, neural network techniques, and machine learning methods.

First, the drawback of statistical methods is that they reflect only global information and have difficulty reflecting local information. Neural network and machine learning methods can learn local patterns, but they cannot produce results of at least a certain accuracy within a few seconds for large-scale data (data sets). Yet if, to overcome these problems, a statistical method is used with local information simply extended without limit, then even the statistical method can no longer detect anomalous data within a short time.

The present invention makes partial use of statistical methods but, in contrast to a purely statistical approach, calculates statistical values that reflect local as well as global information and uses them to detect data sets exhibiting abnormality within a few seconds in large-scale data. This can be implemented as follows: cluster statistics are first calculated for each layer as local information; then, on the principle of a recursive function, hierarchical convergence statistics are calculated repeatedly, taking into account the statistical values accumulated in the layers above the current one; finally, the data sets belonging to the clusters of the final layer are compared with the final layer convergence statistics to identify the data with the highest abnormality. This process is described in detail below with reference to FIGS. 2 to 5.

The description of FIG. 1 now continues.

The second cluster statistics calculation unit 140 divides the first layer clusters produced by the first division, according to the division constant, into second layer clusters that contain the data sets, and calculates second cluster statistics for each second layer cluster.

Here, the division constant that the second cluster statistics calculation unit 140 consults to divide the first layer clusters is the same as the one used to divide the raw data. For example, if the raw data were divided into four first layer clusters, each of the four first layer clusters may in turn be divided into four second layer clusters, giving 16 clusters in the second layer in total. Clusters created at the same layer depth contain the same number of data sets. The second cluster statistics calculation unit 140 then calculates a statistical value for each of the second layer clusters, and these values may be referred to as second cluster statistics.
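The cluster counts in this passage follow from simple arithmetic, sketched below in Python. The function name `layer_shape` and the requirement that the data divide evenly are assumptions made for illustration; the 64-element example is hypothetical.

```python
def layer_shape(n_data_sets, division_constant, layer):
    """Return (number of clusters, data sets per cluster) at a given layer.

    Assumption: the data divide evenly at every layer, so each layer has
    division_constant ** layer clusters of equal size.
    """
    clusters = division_constant ** layer
    if n_data_sets % clusters != 0:
        raise ValueError("data do not divide evenly at this layer")
    return clusters, n_data_sets // clusters

print(layer_shape(8, 2, 1))   # (2, 4): Table 1 example, two clusters of four
print(layer_shape(64, 4, 2))  # (16, 4): division constant 4, second layer
```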

The second layer statistics calculation unit 150 calculates the hierarchical convergence statistics for the second layer based on the hierarchical convergence statistics for the first layer calculated by the first layer statistics calculation unit 130 and the second cluster statistics calculated by the second cluster statistics calculation unit 140. Whereas the hierarchical convergence statistics for the first layer integrate the statistical values of the reference layer, where the raw data reside, and of the first layer, the hierarchical convergence statistics for the second layer integrate the statistical values of the reference layer, the first layer, and the second layer.
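How an upper-layer value might be carried down and merged with a cluster statistic can be sketched as follows. The fusion rule is not given in this part of the specification, so the weighted sum below, the function name `fuse`, and the equal weights of 0.5 are purely illustrative assumptions, loosely modeled on the global ratio and local ratio parameters that the description lists for FIG. 2.

```python
def fuse(upper_stat, cluster_stat, global_ratio=0.5, local_ratio=0.5):
    """Combine an upper-layer statistic with a cluster statistic.

    Assumption: a plain weighted sum. The actual fusion rule of the
    invention is not stated in this passage.
    """
    return global_ratio * upper_stat + local_ratio * cluster_stat

overall_mean = 2.5                 # reference layer: mean of {1, 2, 3, 4}
first_cluster_stats = [1.5, 3.5]   # means of {1, 2} and {3, 4}

# Layer 1: fuse the overall statistic with each first cluster statistic.
first_layer_fused = [fuse(overall_mean, s) for s in first_cluster_stats]
print(first_layer_fused)  # [2.0, 3.0]

# Layer 2: each first layer cluster splits into singletons; the fused value
# of the parent cluster is carried down and merged with the child statistic.
second_layer_fused = [
    fuse(first_layer_fused[0], 1.0), fuse(first_layer_fused[0], 2.0),
    fuse(first_layer_fused[1], 3.0), fuse(first_layer_fused[1], 4.0),
]
print(second_layer_fused)  # [1.5, 2.0, 3.0, 3.5]
```

Under this assumed rule the second layer value depends on the reference layer, the first layer, and the second layer at once, which mirrors the integration described above.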

Until a preset condition is satisfied, the statistical value iterative calculation unit 160 repeatedly generates layer clusters, calculates cluster statistics for the generated layer clusters, and fuses the calculated cluster statistics with the statistics of the upper layer to calculate hierarchical convergence statistics.

For example, the statistical value iterative calculation unit 160 divides the second layer clusters into third layer clusters according to the preset division constant, calculates third cluster statistics for each third layer cluster, and calculates the hierarchical convergence statistics for the third layer based on the hierarchical convergence statistics for the second layer and the third cluster statistics. By repeating this, it may calculate hierarchical convergence statistics up to, for example, the 100th layer.

The statistical value iterative calculation unit 160 may be implemented logically rather than as a physical component, controlling the second cluster statistics calculation unit 140 and the second layer statistics calculation unit 150 so that they operate repeatedly, as shown in FIG. 1. In this case, the second cluster statistics calculation unit 140 and the second layer statistics calculation unit 150 first calculate the second cluster statistics and the hierarchical convergence statistics for the second layer, and subsequently calculate the third cluster statistics and the hierarchical convergence statistics for the third layer. The data values needed to calculate the results may be held temporarily in a high-speed buffer until they are used.

The condition under which the statistical value iterative calculation unit 160 stops iterating is preset in the unit.

As one example, the statistical value iterative calculation unit 160 may repeat layer division and statistical value calculation until the data sets belonging to the clusters of a given layer can no longer be divided. In other words, once a cluster contains only a single data set, the unit regards the preset condition as satisfied and may end layer division and statistical value calculation.

As another example, the statistical value iterative calculation unit 160 may end layer division and statistical value calculation when the number of data sets belonging to a cluster of the current layer is less than or equal to a preset number of data sets. In this optional embodiment, if the number of data sets preset in the statistical value iterative calculation unit 160 is 4 and each of the clusters in the current layer contains four or fewer data sets, the unit regards the preset condition as satisfied and may end layer division and statistical value calculation.
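The two stopping conditions can be illustrated with a small recursive Python sketch. The names `split` and `descend`, the contiguous split, the use of the mean as the statistic, and the equal-weight fusion of parent and cluster values are all assumptions for illustration, not the specification's own procedure.

```python
from statistics import fmean

def split(values, k):
    """Assumed contiguous split into at most k slices of near-equal size."""
    size = -(-len(values) // k)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def descend(values, k, min_size, parent_stat=None, depth=0):
    """Divide recursively until a cluster holds `min_size` or fewer data sets.

    Returns a list of (cluster, fused_statistic, depth) for the last layer.
    Assumption: fused statistic = equal-weight mix of parent and cluster mean.
    """
    if k < 2:
        raise ValueError("division constant must be at least 2")
    local = fmean(values)
    fused = local if parent_stat is None else 0.5 * parent_stat + 0.5 * local
    # Stop condition: cluster size at or below the preset number
    # (min_size=1 corresponds to "cannot be divided any further").
    if len(values) <= max(min_size, 1):
        return [(values, fused, depth)]
    leaves = []
    for cluster in split(values, k):
        leaves.extend(descend(cluster, k, min_size, fused, depth + 1))
    return leaves

data = list(range(1, 9))               # eight hypothetical data sets
leaves_at_4 = descend(data, 2, 4)      # preset number of data sets = 4
leaves_at_1 = descend(data, 2, 1)      # divide down to single data sets
print(leaves_at_4)  # [([1, 2, 3, 4], 3.5, 1), ([5, 6, 7, 8], 5.5, 1)]
print(len(leaves_at_1))  # 8
```

With a preset number of 4 the recursion stops after one division, whereas dividing down to single data sets takes three layers for eight data sets and a division constant of 2.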

As an optional embodiment, because its operation is repetitive, the statistical value iterative calculation unit 160 may be implemented as a recursive function. The specific operation of the statistical value iterative calculation unit 160 implemented as a recursive function may be defined by a script written in a programming language, and various function parameters may be employed to drive the recursive function.

FIG. 2 is a diagram schematically showing an example of the statistical value iterative calculation unit implemented as a recursive function.

When implemented as a recursive function as in FIG. 2, the statistical value iterative calculation unit 160 may require an input file name 210, an N value 220, a reference σ value 230, a sample count 240, a global ratio 250, and a regional ratio 260 as function parameters. The statistical value iterative calculation unit 160 may be implemented as a script that runs an executable file repeatedly, several times, according to the conditions.

First, the input file name 210 is the name of the file into which the raw data has been converted.

The N value 220 is the total number of data sets included in the raw data. According to FIG. 2, the raw data contains 300,000 data sets.

The reference σ value 230 is the initial reference value consulted to determine whether a value is anomalous and, depending on the embodiment, may be expressed as a multiple of a preset σ.

The sample count 240 is the number of data sets selected from a cluster for use in computing statistical values when the recursive function calculates the cluster statistics for the clusters of each layer and the hierarchical convergence statistics for each layer. For example, if a second layer cluster contains 10 data sets and the sample count is 6, the statistical values of that second layer cluster may be calculated from six data sets selected at random from those belonging to it. The smaller the sample count 240, the faster the overall system 100 operates. If the sample count 240 is not set to a specific value, the statistical value iterative calculation unit 160 computes the statistical values over all of the data sets without sampling.
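The effect of the sample count can be sketched as follows, using the mean as the cluster statistic. This is an illustrative sketch only: the function name `cluster_mean` and its parameters are assumptions introduced here, and the specification does not prescribe a particular random number generator.

```python
import random
from statistics import fmean


def cluster_mean(values, sample_count=None, rng=None):
    """Mean of a cluster, optionally computed from a random sample.

    sample_count=None models the case where no sample count is specified,
    so the statistic is computed over every data set in the cluster.
    All names here are hypothetical, chosen for illustration.
    """
    if sample_count is None or sample_count >= len(values):
        return fmean(values)
    rng = rng or random.Random()
    # random.sample draws sample_count distinct elements without replacement.
    return fmean(rng.sample(values, sample_count))


cluster = [3, 5, 4, 6, 5, 4, 90, 5, 4, 6]   # 10 data sets in one cluster
full = cluster_mean(cluster)                 # no sampling
sampled = cluster_mean(cluster, sample_count=6, rng=random.Random(0))
print(full, sampled)
```

A smaller sample count reduces the work per cluster at the cost of a noisier statistic, which is the speed and accuracy trade-off the paragraph above describes.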

The global ratio 250 and the regional ratio 260 are the ratios at which the recursive function fuses the statistical values received from the upper level with the statistical values of the current layer, and the two values must sum to 1.0. Here, the statistical values received from the upper level are the hierarchical convergence statistics for the immediately preceding layer, and the statistical values of the current layer are the cluster statistics for the current layer. Through the global ratio 250 and the regional ratio 260, the administrator of the system 100 may assign a higher weight to either the hierarchical convergence statistics for the immediately preceding layer or the statistical values of the current layer. By adjusting the weights in this way, the administrator of the system 100 of the present invention can separately obtain a result that emphasizes global information or a result that emphasizes regional information.
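One way to read the fusion described above is as a weighted sum of the two statistics. The specification does not give the fusion formula, so the linear blend below is an assumption made for illustration, and the function name `fuse` is hypothetical.

```python
def fuse(parent_stat, cluster_stat, global_ratio=0.5, regional_ratio=0.5):
    """Blend the preceding layer's fused statistic with the current cluster's.

    Assumption: fusion is a linear blend. The specification only requires
    that the two ratios sum to 1.0; it does not state the formula.
    """
    if abs(global_ratio + regional_ratio - 1.0) > 1e-9:
        raise ValueError("global_ratio and regional_ratio must sum to 1.0")
    return global_ratio * parent_stat + regional_ratio * cluster_stat


# Equal weights: the result lies halfway between the two statistics.
print(fuse(10.0, 20.0))
# Emphasis on regional information: the result moves toward the cluster value.
print(fuse(10.0, 20.0, global_ratio=0.25, regional_ratio=0.75))
```

Under this reading, a global ratio of 1.0 reproduces a purely global statistic, while a regional ratio of 1.0 ignores everything above the current cluster.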

The recursive function of FIG. 2 is applied anew each time a cluster of a layer is divided by the division constant. In particular, when the sample count 240 is set to a fixed value, the statistical values are calculated on the basis of sampling. In this process the statistical value iterative calculation unit 160 may sample at random, but it may also sample in an equidistant ("equi distance") manner in order to avoid slowing down the computation.
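The specification does not define the equidistant sampling further. A plausible reading, taken here as an assumption, is that elements are picked at evenly spaced positions, which avoids the cost of random number generation. The function name `equidistant_sample` is hypothetical.

```python
def equidistant_sample(values, sample_count):
    """Pick sample_count elements at evenly spaced positions.

    Assumption: "equi distance" means a fixed stride over the cluster.
    If the cluster is no larger than sample_count, it is returned whole.
    """
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    n = len(values)
    if sample_count >= n:
        return list(values)
    step = n / sample_count
    # int() truncates, so the indices are 0, floor(step), floor(2*step), ...
    return [values[int(i * step)] for i in range(sample_count)]


print(equidistant_sample(list(range(10)), 5))
print(equidistant_sample(list(range(10)), 3))
```

Because the positions are deterministic, repeated runs over the same cluster give the same statistic, unlike random sampling.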

As described above, as the statistics are processed cumulatively and repeatedly by the statistical value iterative calculation unit 160, the global information takes on a progressively less global character as the layers deepen. This minimizes the problem inherent in traditional statistical methods, namely the difficulty of reflecting regional information. In addition, by adjusting the sample count 240, the global ratio 250, and the regional ratio 260, the administrator can tune the overall processing speed and accuracy of the system 100 as desired.

FIG. 3 is a diagram for explaining the hierarchical convergence statistics and the layer cluster statistics.

Since FIG. 3 can be implemented by the system 100 according to the present invention, it is described with reference to FIG. 1, and descriptions that overlap with those given for FIGS. 1 and 2 are omitted. In addition, although the division constant is set to 2 in FIG. 3 for convenience, it will be apparent to those of ordinary skill in the art that it may be an integer greater than 2 depending on the embodiment.

Referring to FIG. 3, a total of four layers are shown, and the raw data contains a total of eight data sets, ds1 through ds8. In FIG. 3, the cluster statistic of the reference layer is G1. As already described, G1 is a value determined as one of, or a combination of at least two of, the mean, variance, standard deviation, and z-score of the data sets ds1 through ds8, and it is calculated by the overall statistics calculation unit 110.

The first cluster statistics calculation unit 120 divides the raw data by the division constant, 2, so that each first layer cluster contains four data sets. Referring to FIG. 3, a total of two first layer clusters are generated according to the division constant, and the first cluster statistics of the two first layer clusters are G2 and G3, respectively.

The first layer statistics calculation unit 130 fuses the first cluster statistics G2 and G3 of the two first layer clusters with the previously calculated cluster statistic G1 to calculate the hierarchical convergence statistics S1 for the first layer. The hierarchical convergence statistics S1 for the first layer may be set as the value of a function having G1, G2, and G3 as parameters. As already described with reference to FIG. 2, the first layer statistics calculation unit 130 may take the preset global ratio and regional ratio into account when calculating S1, so various values of S1 may be obtained.

The second cluster statistics calculation unit 140 again divides the two first layer clusters generated in the first layer into second layer clusters on the basis of the division constant, and calculates the second cluster statistics. Referring to FIG. 3, a total of four second layer clusters are generated according to the number of first layer clusters and the division constant, and the second cluster statistics of the four second layer clusters are G4 through G7, respectively. For convenience of description, FIG. 3 depicts the data sets as being assigned to the clusters in their original order, but it is apparent that, depending on the embodiment, the data sets may be assigned to the clusters of each layer at random.

The second layer statistics calculation unit 150 fuses the second cluster statistics G4 through G7 of the four second layer clusters with the previously calculated hierarchical convergence statistics S1 to calculate the hierarchical convergence statistics S2 for the second layer. The hierarchical convergence statistics S2 for the second layer may be set as the value of a function having S1, G4, G5, G6, and G7 as parameters. As already described with reference to FIG. 2, the second layer statistics calculation unit 150 may take the preset global ratio and regional ratio into account when calculating S2, so various values of S2 may be obtained.

The statistical value iterative calculation unit 160 repeats the layer division process and the hierarchical convergence statistics calculation process until the preset condition is satisfied, and FIG. 3 shows the third layer as the final (last) layer. Although the third layer is set as the final layer in FIG. 3 for convenience, the number of layer division and hierarchical convergence statistics calculation passes performed by the statistical value iterative calculation unit 160 may be larger or smaller depending on at least one of the number of data sets included in the raw data, the division constant used to divide the layer clusters, and the condition preset in the statistical value iterative calculation unit 160.

Because each third layer cluster in the third layer contains a single data set, the statistical value iterative calculation unit 160 determines that layer division can proceed no further, and passes the most recently calculated last layer convergence statistics, S2, to the anomality determination unit 170.
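The walk through FIG. 3 can be sketched as a short recursive function. The sketch makes several assumptions that are not in the specification: the statistic is the mean, fusion is a linear blend of the inherited statistic and the cluster mean, fusion is applied per cluster rather than once per layer, and the eight values standing in for ds1 through ds8 are invented for illustration. The names `split` and `leaf_stats` are hypothetical.

```python
from statistics import fmean


def split(cluster, k):
    """Split a list into up to k contiguous parts of near-equal size."""
    size = -(-len(cluster) // k)  # ceiling division
    return [cluster[i:i + size] for i in range(0, len(cluster), size)]


def leaf_stats(cluster, inherited, k=2, global_ratio=0.5, regional_ratio=0.5):
    """Pair every value with the fused statistic of the deepest layer above it.

    inherited is the fused statistic handed down from the preceding layer
    (G1 at the reference layer). Recursion ends when a cluster holds one value.
    """
    if k < 2:
        raise ValueError("division constant must be at least 2")
    if len(cluster) == 1:
        return [(cluster[0], inherited)]
    pairs = []
    for child in split(cluster, k):
        if len(child) == 1:
            fused = inherited  # a singleton contributes no cluster statistic
        else:
            fused = global_ratio * inherited + regional_ratio * fmean(child)
        pairs.extend(leaf_stats(child, fused, k, global_ratio, regional_ratio))
    return pairs


data = [10, 12, 11, 13, 50, 12, 11, 10]   # illustrative stand-ins for ds1..ds8
pairs = leaf_stats(data, fmean(data))      # fmean(data) plays the role of G1
for value, fused in pairs:
    print(value, fused)
```

With a division constant of 2 and eight values, the recursion produces the reference layer, two layers of fused statistics, and a final layer of singletons, matching the four layers of FIG. 3.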

Next, the description of the configuration of FIG. 1 continues.

The anomality determination unit 170 determines the anomality of the data sets belonging to the last layer clusters of the final layer on the basis of the last layer convergence statistics most recently calculated by the statistical value iterative calculation unit 160.

Through the repeated operation of the statistical value iterative calculation unit 160, the clusters of the first layer are divided into clusters of the n-th layer (where n is an integer greater than 1), and hierarchical convergence statistics that fuse the statistical values from the reference layer through the (n-1)-th layer can be calculated. Here, the last layer convergence statistics are the hierarchical convergence statistics that fuse the statistical values from the reference layer through the (n-1)-th layer, and the final layer is the n-th layer.

The anomality determination unit 170 receives the last layer convergence statistics from the statistical value iterative calculation unit 160 and uses them to determine the anomality of the data sets belonging to the last layer clusters. As one example, if the last layer convergence statistics are a value relating to the mean, the data set that deviates most from the mean among the data sets belonging to the last layer clusters may be determined to be the most anomalous data set. If the last layer convergence statistics are a value relating to the standard deviation, the data set showing the largest standard deviation among the data sets belonging to the last layer clusters may be selected by the anomality determination unit 170.
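The mean-based example can be sketched as follows. The functions `anomaly_scores` and `most_anomalous` are hypothetical names, and dividing by a fused standard deviation to obtain a z-score-like value is an assumption offered as one possible scoring, not the method of the specification.

```python
def anomaly_scores(values, fused_mean, fused_std=None):
    """Score each value by its absolute deviation from the fused mean.

    If a positive fused_std is supplied, the deviation is scaled by it,
    giving a z-score-like value. Both parameters are illustrative.
    """
    scores = []
    for v in values:
        deviation = abs(v - fused_mean)
        scores.append(deviation / fused_std if fused_std else deviation)
    return scores


def most_anomalous(values, fused_mean, fused_std=None):
    """Return the value with the largest score (first one on ties)."""
    scores = anomaly_scores(values, fused_mean, fused_std)
    return values[scores.index(max(scores))]


print(anomaly_scores([10, 12, 50, 11], 15.0))
print(most_anomalous([10, 12, 50, 11], 15.0))
```

Scaling by the standard deviation does not change which value ranks first within one cluster, but it makes scores comparable across clusters whose spreads differ.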

Interpreted, this means that the anomality determination unit 170 compares the last layer convergence statistics, which are statistical values that are both global and regional in character, with the data sets belonging to the clusters of the final layer, and determines how anomalous each of those data sets is. If no value is preset, the anomality determination unit 170 may identify only the single data set showing the highest anomality; depending on the embodiment, however, it may identify and output several data sets in descending order of anomality.

Table 2 shows an example of the results determined by the anomality determination unit 170.

In Table 2, Line Number is the sequence number of the raw data; the position in the array is the location of the data set showing the highest anomality; Data Value is the value of that data set; Zscore is the standard score of that data set; Sampling Average is the mean of the sampled data set values; Sampling Std. is the standard deviation of the sampled data set values; and Depth is the depth of the function call, that is, the number of the last layer.

The first row of Table 2 reads as follows: when the 72nd raw data was taken as input data and analyzed for anomality, the data set located at position 222520, which lies in the 19th layer and has a value of 97, was determined to be the most anomalous; with sampling applied to accelerate data processing, the mean of the sampled data sets was 54.513008 and their standard deviation was 23.187880. In addition, the fact that Table 2 lists the Zscore alongside the mean and standard deviation for each data set shows that statistical values calculated in various ways at each layer were used to determine the anomality of the data sets.

Next, the description of the configuration of FIG. 1 continues.

The anomaly count receiving unit 180 receives a target anomaly count from the user and passes it to the anomality determination unit 170. The anomality determination unit 170 may then sequentially extract, from the data sets belonging to the clusters of the last layer, as many of the most anomalous data sets as the target anomaly count. Depending on the embodiment, the anomaly count receiving unit 180 may be omitted from the system 100.

FIG. 4 is a diagram showing an example of a process for outputting the data sets that exhibit anomality.

FIG. 4 shows a scheme in which, after the anomality determination unit 170 has first determined through the process of FIG. 3 which data set is the most anomalous, the highly anomalous values are traced layer by layer in reverse order.

First, in the final layer each layer cluster contains only one data set, and step S410 leads to the second layer, in which each layer cluster contains two data sets. The anomality determination unit 170 compares the anomality of the data sets within each second layer cluster and detects only the data set showing the higher anomality. According to FIG. 4, ds1, ds4, ds5, and ds8 show higher anomality than ds2, ds3, ds6, and ds7, respectively.

Next, step S430 leads to the first layer, in which each layer cluster contains four data sets. The anomality determination unit 170 compares the anomality of the data sets within each first layer cluster and detects only the data set showing the highest anomality. According to FIG. 4, ds4 and ds5 show higher anomality than ds1 and ds8, respectively.

Finally, step S450 leads to the reference layer, in which the data sets included in the raw data form a single cluster. The anomality determination unit 170 compares ds4 and ds5, the data sets that showed the highest anomality in the first layer, and can detect ds5 as the data set showing the highest anomality.

In addition, putting the above process together, the anomality determination unit 170 may sort the data sets that showed the highest anomality in each layer and each cluster, and thereby detect a plurality of the most anomalous data sets. As an example with reference to FIG. 4, the anomality determination unit 170 determines ds5 to be the most anomalous data set on the basis of the comparison of ds4 and ds5 in the first layer, and may then arrange the data sets in descending order of anomality as ds4 followed by ds1 (or ds8). If the anomaly count receiving unit 180 receives 3 from the user, the system 100 may output ds5, ds4, and ds1 (or ds8), and may additionally compare the magnitude of anomality of ds1 and ds8 to output a more accurate result.
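The reverse-order comparison of FIG. 4 resembles a knockout tournament, and can be sketched as below. The anomality scores assigned to ds1 through ds8 are invented so that the winners match FIG. 4, and the function name `tournament_ranking` is hypothetical. Ordering the losers of each round by score is one possible way of carrying out the additional comparison between ds1 and ds8 mentioned above. Data set names are assumed to be unique.

```python
def tournament_ranking(names, scores, k=2):
    """Rank data sets by merging clusters of k from the final layer upward.

    Each round keeps the most anomalous member of every group; the losers
    of later rounds are ranked ahead of the losers of earlier rounds.
    """
    current = list(names)
    losers_by_round = []
    while len(current) > 1:
        winners, losers = [], []
        for i in range(0, len(current), k):
            group = current[i:i + k]
            best = max(group, key=scores.get)
            winners.append(best)
            losers.extend(name for name in group if name != best)
        # Compare losers of the same round with each other (e.g. ds1 vs ds8).
        losers_by_round.append(sorted(losers, key=scores.get, reverse=True))
        current = winners
    ranking = list(current)
    for losers in reversed(losers_by_round):
        ranking.extend(losers)
    return ranking


# Hypothetical anomality scores, chosen so the winners match FIG. 4.
scores = {"ds1": 3, "ds2": 1, "ds3": 2, "ds4": 8,
          "ds5": 9, "ds6": 1, "ds7": 2, "ds8": 4}
ranking = tournament_ranking(list(scores), scores)
print(ranking[:3])   # target anomaly count of 3
```

Note that this ordering is only as reliable as the per-round comparisons: a data set eliminated early by a very anomalous neighbor is ranked behind later losers even if its own score is higher.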

The process described with reference to FIG. 4 is one example of a method of ordering the data sets once the most anomalous data set has been identified as in FIG. 3, and it will be apparent to those of ordinary skill in the art that examples other than the scheme of FIG. 4 are possible.

FIG. 5 is a flowchart of an example of a high-speed anomaly detection method using hierarchical convergence statistical information according to the present invention.

Since the method of FIG. 5 can be implemented by the system of FIG. 1, it is described below with reference to FIG. 1, and descriptions that overlap with those given for FIG. 1 are omitted.

The overall statistics calculation unit 110 receives the raw data from the database and calculates the overall statistics of the data sets (S510).

The first cluster statistics calculation unit 120 divides the raw data into first layer clusters and calculates the first cluster statistics for each first layer cluster (S520).

The first layer statistics calculation unit 130 calculates the hierarchical convergence statistics for the first layer on the basis of the overall statistics and the first cluster statistics (S530).

The second cluster statistics calculation unit 140 divides the first layer clusters into second layer clusters and calculates the second cluster statistics for each second layer cluster (S540). As already described, the division constant used to divide the clusters in steps S520 and S540 is kept at the same value, and the division constant may be adjusted as desired for the sake of accuracy or processing speed.

The second layer statistics calculation unit 150 calculates the hierarchical convergence statistics for the second layer on the basis of the hierarchical convergence statistics for the first layer and the second cluster statistics (S550).

The statistical value iterative calculation unit 160 repeatedly performs cluster division and layer statistics calculation until the preset condition is satisfied (S560). When the preset condition is satisfied (S570), the anomality of the data sets belonging to the last layer clusters is determined on the basis of the last layer convergence statistics calculated in step S560 (S580).

The process by which the anomality determination unit 170 determines and orders the anomality of the data sets belonging to the last layer clusters in step S580 has been described with reference to FIGS. 3 and 4.

The present invention is characterized in that, by using a statistical method while also fusing global information and regional information, it can output the data exhibiting anomality as a result at very high speed. This distinguishes it from existing statistical methods, which have output results reflecting only global characteristics, and also from neural network techniques and machine learning techniques, which by the nature of their algorithms cannot output a result within a few seconds.

When the present invention was implemented as a script, results were output within 3 to 5 seconds at the latest for 10 million data sets. If the sampling rate, or the division constant (reference σ value) that serves as the basis for cluster division and anomality determination, is adjusted, results that are more useful and better suited to the user who manages and operates the system 100 can be output.

The embodiments of the present invention described above may be implemented in the form of a computer program executable through various components on a computer, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer programs include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The specific implementations described in the present invention are embodiments and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connecting lines or connecting members between the components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example, and in an actual device they may be replaced by, or supplemented with, various functional, physical, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important", it may not be a component that is necessarily required for the application of the present invention.

In the specification of the present invention (particularly in the claims), the use of the term "the" and similar referring terms may cover both the singular and the plural. Where a range is described in the present invention, the invention includes inventions to which the individual values within the range are applied (unless stated otherwise), as though each individual value constituting the range were set out in the detailed description. Finally, unless the order of the steps constituting the method according to the present invention is explicitly stated, or stated otherwise, the steps may be performed in any suitable order, and the present invention is not necessarily limited to the order in which the steps are described. The use of any example or exemplary term (for example, "etc.") in the present invention is merely to describe the present invention in detail, and the scope of the present invention is not limited by such examples or exemplary terms unless limited by the claims. Those skilled in the art will also appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Claims

A high-speed anomaly detection method using hierarchical convergence statistical information, performed by a high-speed anomaly detection system, the method comprising:
an overall statistics calculation step of receiving raw data including a plurality of data items from a database and calculating overall statistics of the data included in the raw data;
a first cluster statistics calculation step of dividing the raw data into first-layer clusters that include the data in accordance with a preset division constant, and calculating first cluster statistics for each of the divided first-layer clusters;
a first-layer statistics calculation step of calculating hierarchical convergence statistics for the first layer based on the calculated overall statistics and the calculated first cluster statistics;
a second cluster statistics calculation step in which a second cluster statistics calculation unit divides the divided first-layer clusters into second-layer clusters that include the data in accordance with the division constant, and calculates second cluster statistics for each of the divided second-layer clusters;
a second-layer statistics calculation step of calculating hierarchical convergence statistics for the second layer based on the calculated hierarchical convergence statistics and the calculated second cluster statistics;
a statistics iteration step in which a statistics iteration unit repeats the second cluster statistics calculation step and the second-layer statistics calculation step until a preset condition is satisfied; and
an anomaly determination step of determining the abnormality of data belonging to the last-layer clusters of the last layer, based on the last hierarchical convergence statistics calculated in the statistics iteration step.
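
The steps of this claim can be illustrated with a minimal Python sketch. The claim does not state how cluster statistics are combined with the statistics of the layer above, so the equal-weight averaging in `fuse`, the contiguous `split`, the `min_size` stop condition, and the z-score `threshold` are all assumptions made for illustration and should not be read as the patented method itself. The division constant `k` is assumed here to be the number of sub-clusters each cluster is divided into.

```python
from statistics import fmean, pstdev


def stats(values):
    """Return (mean, population standard deviation) of a non-empty list."""
    return fmean(values), pstdev(values)


def fuse(parent, local, weight=0.5):
    """Hypothetical fusion rule: weighted blend of upper-layer and cluster statistics."""
    return tuple(weight * p + (1 - weight) * l for p, l in zip(parent, local))


def split(values, k):
    """Split a list into k contiguous, nearly equal parts (empty parts are dropped)."""
    size, rem = divmod(len(values), k)
    parts, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        if end > start:
            parts.append(values[start:end])
        start = end
    return parts


def detect(raw, k=2, min_size=4, threshold=2.0):
    """Return sorted (index, value, score) tuples whose fused z-score exceeds threshold."""
    results = []
    # Each entry: (offset of the cluster in raw, cluster values, fused statistics of the cluster).
    # For the whole data set the "fused" statistics are simply the overall statistics.
    stack = [(0, list(raw), stats(raw))]
    while stack:
        offset, cluster, fused = stack.pop()
        # Assumed stop condition: the cluster is small enough, or cannot be split further.
        if len(cluster) <= min_size or len(cluster) < k:
            mean, std = fused
            for i, value in enumerate(cluster):
                score = abs(value - mean) / std if std > 0 else 0.0
                if score > threshold:
                    results.append((offset + i, value, score))
            continue
        position = offset
        for part in split(cluster, k):
            # Fuse the statistics of the layer above with this sub-cluster's own statistics.
            stack.append((position, part, fuse(fused, stats(part))))
            position += len(part)
    return sorted(results)


raw = [10, 11, 9, 10, 10, 11, 9, 10, 10, 50, 9, 10, 11, 10, 9, 10]
print(detect(raw))  # only index 9 (value 50) is expected to be flagged
```

In this sketch each data item is scored once, against statistics that blend global and local information, which is one plausible reading of why the approach is described as fast.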

The method according to claim 1,
wherein the statistics calculated at each step constituting the high-speed anomaly detection method
are at least one of a mean, a variance, a standard deviation, and a Z-score.
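
For reference, the four statistics named in this claim can be computed with Python's standard library. This is only a sketch: the function name `describe` is hypothetical, and the use of population (rather than sample) variance is an assumption, since the claim does not say which is meant.

```python
from statistics import fmean, pstdev, pvariance


def describe(values):
    """Compute mean, variance, standard deviation, and per-item Z-scores."""
    mean = fmean(values)
    variance = pvariance(values)  # population variance (assumption)
    std = pstdev(values)          # population standard deviation (assumption)
    # Z-score: distance from the mean in units of standard deviation.
    zscores = [(v - mean) / std if std > 0 else 0.0 for v in values]
    return {"mean": mean, "variance": variance, "std": std, "zscores": zscores}


summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary["mean"], summary["std"], summary["zscores"][-1])
```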

The method according to claim 1,
wherein the high-speed anomaly detection method
further comprises an anomaly count receiving step in which an anomaly count receiving unit receives a target anomaly count, and
wherein the anomaly determination step
extracts, from the data belonging to the clusters of the last layer, as many data items as the target anomaly count, starting from the data showing the highest abnormality.
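
Extracting a target number of the most abnormal items can be sketched as a top-n selection. The function name `top_anomalies` and the `(index, value, score)` tuple layout are hypothetical; the score is assumed to be an abnormality measure such as an absolute Z-score computed in an earlier step.

```python
import heapq


def top_anomalies(scored, target_count):
    """Return the target_count items with the highest abnormality score, highest first.

    scored: iterable of (index, value, score) tuples (hypothetical layout).
    """
    return heapq.nlargest(target_count, scored, key=lambda item: item[2])


scored = [(0, 10.0, 0.4), (1, 50.0, 2.3), (2, 9.0, 0.5), (3, 30.0, 1.2)]
print(top_anomalies(scored, 2))
```

`heapq.nlargest` avoids sorting the full list when the target count is small relative to the number of scored items.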

The method according to claim 1,
wherein the anomaly determination step
detects the data having the highest abnormality among the data belonging to the raw data, based on the hierarchical convergence statistics calculated at each step constituting the high-speed anomaly detection method.

The method according to claim 1,
wherein the division constant is 2.

The method according to claim 1,
wherein the division constant is an integer of 3 or more.

The method according to claim 1,
wherein the statistics calculated at each step constituting the high-speed anomaly detection method
are calculated from a preset number of sampled data items.
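
Computing statistics from a preset number of sampled items, rather than from every item, can be sketched as follows. The function name `sampled_stats`, the uniform random sampling, and the fixed `seed` are assumptions for illustration; the claim does not specify the sampling scheme.

```python
import random
from statistics import fmean, pstdev


def sampled_stats(values, sample_size, seed=0):
    """Return (mean, standard deviation) estimated from at most sample_size items."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    if len(values) > sample_size:
        # Uniform sampling without replacement (assumption).
        values = rng.sample(values, sample_size)
    return fmean(values), pstdev(values)


data = list(range(1000))
print(sampled_stats(data, 100))
```

Sampling trades some accuracy in the estimated statistics for a lower cost per cluster, which matters most at the upper layers where clusters are large.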

A computer-readable recording medium storing a program for executing the method according to claim 1.

A high-speed anomaly detection system using hierarchical convergence statistical information, comprising:
an overall statistics calculation unit that receives raw data including a plurality of data items from a database and calculates overall statistics of the data included in the raw data;
a first cluster statistics calculation unit that divides the raw data into first-layer clusters that include the data in accordance with a preset division constant, and calculates first cluster statistics for each of the divided first-layer clusters;
a first-layer statistics calculation unit that calculates hierarchical convergence statistics for the first layer based on the calculated overall statistics and the calculated first cluster statistics;
a second cluster statistics calculation unit that divides the divided first-layer clusters into second-layer clusters that include the data in accordance with the division constant, and calculates second cluster statistics for each of the divided second-layer clusters;
a second-layer statistics calculation unit that calculates hierarchical convergence statistics for the second layer based on the calculated hierarchical convergence statistics and the calculated second cluster statistics;
a statistics iteration unit that repeats the calculation of cluster statistics and hierarchical convergence statistics until a preset condition is satisfied; and
an anomaly determination unit that determines the abnormality of data belonging to the last-layer clusters of the last layer, based on the last hierarchical convergence statistics calculated by the statistics iteration unit.

The system according to claim 9,
wherein the overall statistics, the cluster statistics, and the hierarchical convergence statistics
are at least one of a mean, a variance, a standard deviation, and a Z-score.

The system according to claim 9,
wherein the high-speed anomaly detection system
further comprises an anomaly count receiving unit that receives a target anomaly count, and
wherein the anomaly determination unit
extracts, from the data belonging to the clusters of the last layer, as many data items as the target anomaly count, starting from the data showing the highest abnormality.

The system according to claim 9,
wherein the anomaly determination unit
detects the data having the highest abnormality among the data belonging to the raw data, based on the calculated hierarchical convergence statistics.

The system according to claim 9,
wherein the division constant is 2.

The system according to claim 9,
wherein the division constant is an integer of 3 or more.

The system according to claim 9,
wherein the overall statistics, the cluster statistics, and the hierarchical convergence statistics
are calculated from a preset number of sampled data items.