KR102280884B1

KR102280884B1 - Method for analyzing categorical data

Info

Publication number: KR102280884B1
Application number: KR1020150152117A
Authority: KR
Inventors: 강지훈; 권순목; 유동호; 박성미; 박용로
Original assignee: 삼성에스디에스 주식회사
Priority date: 2015-10-30
Filing date: 2015-10-30
Publication date: 2021-07-23
Also published as: KR20170050506A

Abstract

A method for analyzing categorical data is disclosed. The method according to an embodiment of the present invention includes: generating a first cumulative function from first categorical data and a second cumulative function from second categorical data; generating a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; generating a correlation function representing the correlation between the first residual function and the second residual function; generating a final residual function as the difference between the first residual function and the correlation function; and determining that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range.

Description

METHOD FOR ANALYZING CATEGORICAL DATA

The present invention relates to a method for analyzing categorical data and, more particularly, to a method that converts discrete categorical data with intermittent values into continuous data and then analyzes it.

In the modern big data era, many forms of categorical data exist alongside continuous data. A variable whose possible values are given as categories is called a categorical variable, and data made up of such variables is called categorical data.

The detailed types of categorical data are as follows.

Categorical:
  Dichotomous scale (e.g., for, against)
  Nominal scale (e.g., various religions)
  Ordinal scale (e.g., young, middle-aged, elderly)

Such categorical data is being actively used across many industries to create new value.

However, conventional analysis logic based on statistics and machine learning focuses mainly on processing continuous data, and is therefore limited in analyzing categorical data with discrete values.

For example, the data to which practical algorithms for prediction, classification, and monitoring are applied is mainly expressed as continuous functions, and analysis logic for continuous data has been developed and refined in various forms over a long period.

By contrast, research on analyzing categorical data with discrete values, and in particular on analyzing the properties of data in relation to how a specific event changes over time, has been quite limited.

FIG. 1 is a diagram for explaining a process of analyzing discrete categorical data.

FIG. 1 shows how the frequencies of log events A and B, which may occur in a specific device, change over time. Referring to FIGS. 1 and 2, event A has a value of 0 most of the time, and occurs intermittently from April 30, 2014 to mid-July 2014 and from January 2015 to April 2015.

Similarly, event B occurs more frequently than event A, but its frequency is 0 from the end of July 2014 through December 2014.

With discontinuous data such as that shown in FIG. 1, existing statistical techniques cannot adequately analyze the correlation between events A and B. A new form of monitoring method that can effectively analyze discrete categorical data is therefore needed.

The present invention has been devised to solve the problems described above, and its object is to provide a method for analyzing categorical data that converts discrete categorical data, from which correlations are difficult to derive, into continuous data so that it can be analyzed.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a method for analyzing categorical data according to an embodiment of the present invention includes: generating a first cumulative function from first categorical data and a second cumulative function from second categorical data; generating a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; generating a correlation function representing the correlation between the first residual function and the second residual function; generating a final residual function as the difference between the first residual function and the correlation function; and determining that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range.

According to an embodiment of the present invention, generating the first residual function and the second residual function may include: generating a first trend line representing the rising pattern of the first cumulative function and a second trend line representing the rising pattern of the second cumulative function; and generating the first residual function from the residual that is the difference between the first trend line and the first cumulative function, and the second residual function from the residual that is the difference between the second trend line and the second cumulative function.

According to an embodiment of the present invention, generating the first trend line and the second trend line may include calculating the slope of the first trend line and the slope of the second trend line by time-series linear regression.

According to an embodiment of the present invention, calculating the slope of the first trend line and the slope of the second trend line may include: determining, as the slope of the first trend line, the slope that satisfies the least-squares criterion on the residual between the first trend line and the first cumulative function; and determining, as the slope of the second trend line, the slope that satisfies the least-squares criterion on the residual between the second trend line and the second cumulative function.

According to an embodiment of the present invention, generating the correlation function may include generating the correlation function by using the residual that is the difference between the second trend line and the second cumulative function as an independent variable in a regression model.

According to an embodiment of the present invention, generating the correlation function may use, as the model, at least one of multiple linear regression, a neural network model, a decision tree (regression tree), and a regularized regression technique.

According to an embodiment of the present invention, determining that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range may include: calculating a reference value from the mean or median of the first categorical data; calculating an upper reference value by adding an acceptable variation value to the reference value, and a lower reference value by subtracting the variation value from it; and determining that an abnormality has occurred in the monitoring target when the final residual function goes beyond the upper reference value or the lower reference value.

According to an embodiment of the present invention, the first categorical data and the second categorical data may be discontinuous, discrete data.

An apparatus for analyzing categorical data according to an embodiment of the present invention includes: a cumulative function generator that generates a first cumulative function from first categorical data and a second cumulative function from second categorical data; a residual function generator that generates a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; a correlation function generator that generates a correlation function representing the correlation between the first residual function and the second residual function; a final residual function generator that generates a final residual function as the difference between the first residual function and the correlation function; and an abnormality detector that determines that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range.

According to an embodiment of the present invention, the residual function generator may include a trend line generator that generates a first trend line representing the rising pattern of the first cumulative function and a second trend line representing the rising pattern of the second cumulative function, and may generate the first residual function from the residual that is the difference between the first trend line and the first cumulative function, and the second residual function from the residual that is the difference between the second trend line and the second cumulative function.

According to an embodiment of the present invention, the correlation function generator may generate the correlation function by using the residual that is the difference between the second trend line and the second cumulative function as an independent variable in a regression model.

According to an embodiment of the present invention, the abnormality detector may include a reference value calculator that calculates a reference value from the mean or median of the first categorical data, an upper reference value calculator that calculates an upper reference value by adding an acceptable variation value to the reference value, and a lower reference value calculator that calculates a lower reference value by subtracting the variation value, and may determine that an abnormality has occurred in the monitoring target when the final residual function goes beyond the upper reference value or the lower reference value.

An apparatus for analyzing categorical data according to another embodiment of the present invention includes one or more processors, a memory into which a computer program executed by the processor is loaded, and storage that stores a computer program capable of analyzing categorical data. The computer program includes: an operation of generating a first cumulative function from first categorical data and a second cumulative function from second categorical data; an operation of generating a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; an operation of generating a correlation function representing the correlation between the first residual function and the second residual function; an operation of generating a final residual function as the difference between the first residual function and the correlation function; and an operation of determining that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range.

A computer program according to yet another embodiment of the present invention is stored in a computer-readable recording medium so as to execute, in combination with a computer device, the steps of: generating a first cumulative function from first categorical data and a second cumulative function from second categorical data; generating a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; generating a correlation function representing the correlation between the first residual function and the second residual function; generating a final residual function as the difference between the first residual function and the correlation function; and determining that an abnormality has occurred in the monitoring target when the final residual function falls outside a preset range.

The method for analyzing categorical data according to an embodiment of the present invention described above makes it possible to convert discrete categorical data, from which correlations are difficult to derive, into continuous data and to analyze it.

FIG. 1 is a diagram for explaining a process of analyzing discrete categorical data.
FIG. 2 is a flowchart illustrating a method for analyzing categorical data according to an embodiment of the present invention.
FIGS. 3 and 4 are diagrams for explaining a process of converting discrete categorical data into continuous data according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the result of removing the rising pattern from a cumulative function according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a residual that excludes the correlation with other events according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a process in which discrete categorical data is converted into a final residual that excludes the correlation with other events according to an embodiment of the present invention.
FIG. 8 is a functional block diagram illustrating an apparatus for analyzing categorical data according to an embodiment of the present invention.
FIG. 9 is a functional block diagram illustrating an apparatus for analyzing categorical data according to another embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

In this specification, the singular form also includes the plural unless the phrase specifically states otherwise. As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

FIG. 2 is a flowchart illustrating a method for analyzing categorical data according to an embodiment of the present invention.

The apparatus for analyzing categorical data according to an embodiment of the present invention generates a first cumulative function from discrete first categorical data and a second cumulative function from second categorical data (S210).

Here, the first categorical data may be the occurrence frequency of event A in the monitoring target, and the second categorical data may be the occurrence frequency of event B. That is, the first cumulative function may be generated by sequentially summing the occurrence frequency of event A over time. Likewise, the second cumulative function may be generated by sequentially summing the occurrence frequency of event B over time.
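As a minimal sketch of step S210, a cumulative function can be built as a running total of per-period event counts. The function name and the sample counts below are hypothetical and serve only as an illustration.

```python
from itertools import accumulate


def cumulative_function(frequencies):
    """Return the running total of per-period event counts."""
    return list(accumulate(frequencies))


# Hypothetical per-period counts for one event type: mostly zero, occasional bursts.
event_a = [0, 0, 3, 0, 0, 1, 0, 2]
print(cumulative_function(event_a))  # [0, 0, 3, 3, 3, 4, 4, 6]
```

The running total never decreases, which is why the text later removes a rising pattern from it.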

Next, a first residual function is generated by removing the rising pattern from the first cumulative function, and a second residual function is generated by removing the rising pattern from the second cumulative function (S220). The specific method of generating the first and second residual functions is described in detail with reference to FIGS. 4 and 5.

The first residual function and the second residual function generated through the above process are continuous-valued. Accordingly, a correlation function representing the correlation between the first residual function and the second residual function can be generated (S230).

That is, the first and second categorical data themselves are discrete, which limits how well a correlation between them can be derived. Once they have been converted into the first and second residual functions, which are continuous data, the correlation between the two can be derived easily.
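One possible reading of step S230 is sketched below: the second residual function is used as the independent variable in a simple linear regression that predicts the first residual function. Ordinary least squares with an intercept is an assumption made here for brevity, and the function name is hypothetical; the source allows other regression models as well.

```python
def fit_correlation_function(res_b, res_a):
    """Fit res_a ~ intercept + coef * res_b by ordinary least squares.

    Returns a callable that maps a value of the second residual function
    to the expected value of the first residual function.
    """
    n = len(res_b)
    mean_b = sum(res_b) / n
    mean_a = sum(res_a) / n
    sxx = sum((b - mean_b) ** 2 for b in res_b)
    if sxx == 0:
        raise ValueError("the independent variable has zero variance")
    sxy = sum((b - mean_b) * (a - mean_a) for a, b in zip(res_a, res_b))
    coef = sxy / sxx
    intercept = mean_a - coef * mean_b
    return lambda b: intercept + coef * b


# Hypothetical residual values, chosen so that res_a = 1 + 2 * res_b exactly.
correlation = fit_correlation_function([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(correlation(3.0))  # 7.0
```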

Next, a final residual function is generated as the difference between the first residual function and the correlation function (S240). The final residual function is then monitored, and when it falls outside a preset range, it can be determined that an abnormality has occurred in the monitoring target (S250).
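Steps S240 and S250 might be sketched as follows. The reference value and the acceptable variation are passed in as plain numbers; how they are chosen (the summary mentions the mean or median of the first categorical data) is left to the caller. All names and values are hypothetical.

```python
def final_residual_function(res_a, predicted_a):
    """Difference between the first residual function and the correlation function's output."""
    return [a - p for a, p in zip(res_a, predicted_a)]


def detect_anomalies(final_residual, reference, tolerance):
    """Return the time indices at which the final residual leaves [reference - tolerance, reference + tolerance]."""
    upper = reference + tolerance
    lower = reference - tolerance
    return [t for t, r in enumerate(final_residual) if r > upper or r < lower]


# Hypothetical values for illustration only.
final = final_residual_function([1.0, 5.0, -4.0], [0.5, 0.5, 0.5])
print(final)                              # [0.5, 4.5, -4.5]
print(detect_anomalies(final, 0.0, 2.0))  # [1, 2]
```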

Each step is described in detail below.

FIGS. 3 and 4 are diagrams for explaining a process of converting discrete categorical data into continuous data according to an embodiment of the present invention.

In this embodiment the monitoring target is described as a piece of mechanical equipment by way of example, but what may be monitored is not limited to this.

In FIG. 3, A and B denote types of failure that may occur in the mechanical equipment. FIG. 3 shows that A and B occur in an intermittent pattern. For events such as those in FIG. 3, Pearson's correlation coefficient between events A and B is therefore close to 0, so no useful analysis of the correlation between the frequencies of A and B can be obtained.
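For reference, the correlation coefficient mentioned above can be computed directly on the raw counts. The sketch below is a plain implementation of the standard Pearson formula; the sample counts are hypothetical, and no particular output is claimed for them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical sparse counts whose bursts do not line up in time.
event_a = [0, 2, 0, 0, 0, 0, 1, 0]
event_b = [0, 0, 0, 3, 0, 0, 0, 0]
print(pearson(event_a, event_b))
```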

Accordingly, the apparatus for analyzing categorical data according to an embodiment of the present invention generates cumulative functions in order to convert the intermittent occurrence frequencies of event A and event B into continuous data.

FIG. 4 is a diagram for explaining a cumulative function generated by accumulating intermittent data according to an embodiment of the present invention.

The method for analyzing categorical data according to an embodiment of the present invention generates a first cumulative function from the first categorical data and a second cumulative function from the second categorical data. Here, the first categorical data may be the occurrence frequency of event A over time, and the second categorical data may be the occurrence frequency of event B over time.

Taking the cumulative sum of discontinuous and relatively sparse events such as those shown in FIG. 3 produces cumulative functions such as those shown in FIG. 4. In intervals where events are concentrated, the slope of the cumulative function becomes steeper; in intervals where events are relatively rare, the slope is gentle.

Time-series linear regression can then be used to calculate the slope of the first trend line 420, which represents the rising pattern of the first cumulative function 410. Here, the first cumulative function 410 is a graph of the accumulated occurrence frequency of event A. In the same way, the slope of the second trend line 440, which represents the rising pattern of the second cumulative function 430 that accumulates the occurrence frequency of event B, can also be calculated.
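A minimal sketch of the slope calculation is given below. It assumes a trend line through the origin fitted against the time index 0, 1, 2, ..., because the text names only a slope; a fit with an intercept term would be an equally reasonable choice. The function name is hypothetical.

```python
def trend_slope(cumulative):
    """Least-squares slope of a line through the origin fitted to cumulative[t] at t = 0, 1, 2, ..."""
    numerator = sum(t * value for t, value in enumerate(cumulative))
    denominator = sum(t * t for t in range(len(cumulative)))
    if denominator == 0:
        raise ValueError("need at least two time steps to fit a slope")
    return numerator / denominator


# A cumulative series that rises by exactly 2 per step has slope 2.
print(trend_slope([0, 2, 4, 6]))  # 2.0
```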

Because the graph shown in FIG. 4 accumulates the occurrence frequencies of events, the first cumulative function 410 and the second cumulative function 430 both follow a rising pattern. The present invention, however, aims to observe how the occurrence frequency of each event changes relative to normal operation of the mechanical equipment, so a step that removes this rising pattern is required.

FIG. 5 is a diagram for explaining the result of removing the rising pattern from a cumulative function according to an embodiment of the present invention.

The method for analyzing categorical data according to an embodiment of the present invention converts discrete categorical data into continuous data by removing the rising pattern from the cumulative function. Specifically, converting the occurrence frequency of event A over time into a residual, that is, the difference between the cumulative frequency of event A and its predicted cumulative frequency, yields data values from which the rising pattern has been removed.

Specifically, calculating the difference between the first cumulative function 410 in FIG. 4, which is the cumulative frequency of event A, and the first trend line 420, which is its predicted cumulative frequency, yields the first residual function 510 shown in FIG. 5.

Likewise, calculating the difference between the second cumulative function 430 in FIG. 4, which is the cumulative frequency of event B, and the second trend line 440, which is the predicted cumulative frequency of event B, yields the second residual function 520 shown in FIG. 5.
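The subtraction described above can be sketched as follows, under the assumption that the trend line passes through the origin and is evaluated at the time index 0, 1, 2, .... The function name, the sample series, and the slope value are hypothetical.

```python
def residual_function(cumulative, slope):
    """Observed cumulative frequency minus the trend-line prediction slope * t."""
    return [value - slope * t for t, value in enumerate(cumulative)]


# Hypothetical cumulative series and slope, for illustration only.
cumulative_a = [0, 0, 3, 3, 3, 4, 4, 6]
print(residual_function(cumulative_a, 0.75))
# [0.0, -0.75, 1.5, 0.75, 0.0, 0.25, -0.5, 0.75]
```

Unlike the raw counts, the result takes both positive and negative values around the trend, which is what allows it to be treated as continuous data.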

This can be expressed as the following equation.

Here, Ψ is the cumulative function that accumulates the occurrence frequency of a specific event, α is the slope of the trend line of the cumulative function, and e(t) is the difference between the actually measured value and the predicted value, that is, the residual.
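The equation itself does not appear in this text. A form consistent with the symbol definitions above, offered only as an assumption, is a linear trend plus a residual. The absence of an intercept term is also an assumption, since the text names only the slope α.

```latex
\Psi(t) = \alpha\, t + e(t), \qquad e(t) = \Psi(t) - \alpha\, t
```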

That is, the first residual function 510 and the second residual function 520 are the result of converting the discrete first and second categorical data into continuous data.

A time-series regression model can be used to remove the rising pattern described above. Time-series regression is a method for expressing the pattern of a variable's values over time, and it proceeds by calculating the trend-line slope α that satisfies the least-squares criterion on the residual.
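As a worked illustration of the least-squares criterion, assume a trend line through the origin observed at times t = 1, ..., T (both the no-intercept form and the symbol T are assumptions, not taken from the source). Setting the derivative of the squared residual sum with respect to α to zero gives a closed-form slope:

```latex
\hat{\alpha}
  = \arg\min_{\alpha} \sum_{t=1}^{T} e(t)^{2}
  = \arg\min_{\alpha} \sum_{t=1}^{T} \bigl(\Psi(t) - \alpha t\bigr)^{2}
\quad\Longrightarrow\quad
\hat{\alpha} = \frac{\sum_{t=1}^{T} t\,\Psi(t)}{\sum_{t=1}^{T} t^{2}}
```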

Through the process described above, discrete categorical data can be converted into data having continuous values. That is, the first residual function 510 of FIG. 5 is obtained by converting the occurrence frequency of event A in FIG. 3, and it can be seen that data having discrete values has been converted into data having continuous values.

Likewise, the occurrence frequency of event B, which is the discrete categorical data of FIG. 3, has been converted into the continuous second residual function 520.

The following describes a method for detecting in advance an abnormality that may occur in the monitored mechanical equipment, using the correlation between events that have been converted into continuous data.

FIG. 6 is a diagram illustrating a residual from which the correlation with other events has been excluded, according to an embodiment of the present invention.

FIGS. 3 to 5 described a method of converting discrete categorical data generated by mechanical equipment into continuous data. Here, event A and event B, which may occur in the mechanical equipment, may be correlated with each other.

For example, if the occurrence of event B causes event A to occur, it is normal for the occurrence frequency of event A to rise as the occurrence frequency of event B rises.

On the other hand, if the occurrence frequency of event A has risen even though event B has not occurred, it is highly likely that an abnormality has occurred in the mechanical equipment.

For example, in mechanical equipment an increase in temperature is accompanied by an increase in pressure. It is therefore natural that the frequency with which the pressure exceeds its preset threshold rises as the frequency with which the temperature exceeds its preset threshold rises.

However, if the frequency of events in which the pressure exceeds its preset threshold has increased even though no event in which the temperature exceeds its preset threshold has occurred, it can be determined that an abnormality has very likely occurred in the mechanical equipment.

That is, a correlation function is generated that represents the correlation between the first residual function 510, obtained by converting the occurrence frequency of event A into continuous data, and the second residual function 520, obtained by converting the occurrence frequency of event B into continuous data. Subtracting this correlation function from the first residual function 510 makes it possible to detect cases in which event A occurred independently of event B.

This can be expressed as the following formula.

Here, e'(t) is the residual excluding the correlation with other events, Φ is the residual function converted into continuous data by the method described with reference to FIGS. 3 to 5, and f is the correlation function representing the correlation with the other residual functions.

Taking event A and event B as an example, Φ(t) may be the first residual function, and f may be the correlation function representing the correlation between event A and event B.
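The formula itself did not survive in this text. A form consistent with the definitions above is sketched below. The subscripts A and B are notation introduced here for illustration, and the argument of f is assumed to be the residual function of the other event.

```latex
e'(t) = \Phi(t) - f(\cdot)
\qquad\text{e.g.}\qquad
e'_{A}(t) = \Phi_{A}(t) - f\bigl(\Phi_{B}(t)\bigr)
```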

According to an embodiment of the present invention, the function f representing the correlation between the categorical data can be obtained as a regression model built using residual values as independent variables. The regression model may be multiple linear regression, a neural network model, a decision tree (regression tree), a regularized regression technique, or the like, but is not limited to these, and other general-purpose regression models may also be used.

For example, to generate a correlation function representing the correlation between the first categorical data related to event A and the second categorical data related to event B, the residual values constituting the first residual function 510 can be expressed by a correlation function whose independent variable is the residual values constituting the second residual function 520.
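As a minimal sketch of one such regression model, the example below fits a simple linear model f(e_B) = w·e_B + b by ordinary least squares, with the residual of event B as the independent variable and the residual of event A as the dependent variable. The function name `fit_correlation_function` and the sample values are hypothetical. Simple linear regression stands in here for the model families listed above.

```python
import numpy as np

def fit_correlation_function(residual_b, residual_a):
    """Fit f(e_B) = w * e_B + b by least squares and return f as a callable."""
    x = np.asarray(residual_b, dtype=float)  # independent variable: residual of event B
    y = np.asarray(residual_a, dtype=float)  # dependent variable: residual of event A
    w, b = np.polyfit(x, y, deg=1)           # coefficients, highest degree first

    def f(e_b):
        return w * np.asarray(e_b, dtype=float) + b

    return f

# Hypothetical residuals in which event A follows event B exactly.
e_b = np.array([-1.0, 0.5, 1.5, -0.5, 0.0])
e_a = 2.0 * e_b + 0.1
f = fit_correlation_function(e_b, e_a)
```

In this constructed case `f(e_b)` reproduces `e_a` up to floating-point error, so nothing of event A would remain unexplained by event B.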

In this regard, FIG. 6 shows the final residual function 620, from which the correlation with other events has been removed. Specifically, FIG. 6 shows the first residual function 510, in which the occurrence frequency of event A has been converted into continuous data, and the correlation function 610 between event A and event B, calculated using a regression model.

Calculating the final residual function 620, which is the difference between the first residual function 510 and the correlation function 610 with event B, reveals the independent variation of event A that is unrelated to event B.

Thereafter, if monitoring the independent variation of event A shows that the variation is outside a preset range, it can be determined that a failure has occurred in the mechanical equipment under observation.

Here, the preset range is calculated by taking a representative value of the normal statistic, such as the mean or median, and adding or subtracting a variation value that accommodates a specific amount of dispersion. Specifically, the reference value, the upper threshold, and the lower threshold used to determine whether a failure has occurred in the mechanical equipment by monitoring the final residual function 620 related to a specific event can be calculated by the following formulas.

That is, the range is centered on the mean of the normal interval, and its half-width is the standard deviation of the normal statistic multiplied by a constant k. If the statistic of an observation lies within this preset range, it is judged normal; if it is outside the preset range, it can be determined that a failure has occurred in the mechanical equipment.
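The formulas themselves did not survive in this text. A form consistent with the description above is sketched below. The symbols CL, UCL, LCL, μ, and σ are notation introduced here: μ and σ denote the mean and standard deviation of the final residual over the normal interval, and k is the constant mentioned in the text.

```latex
\mathrm{CL} = \mu, \qquad
\mathrm{UCL} = \mu + k\,\sigma, \qquad
\mathrm{LCL} = \mu - k\,\sigma
```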

In this regard, FIG. 6 shows the final residual function 620 together with the upper threshold 630 and the lower threshold 640.
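The threshold check can be sketched in Python as follows. The sketch assumes the reference value is the mean of final-residual values from a known-normal interval and the variation value is k times their sample standard deviation. The function names and the default k = 3 are illustrative choices, not values taken from the specification.

```python
import numpy as np

def control_limits(normal_values, k=3.0):
    """Return (reference, upper, lower) computed from a known-normal interval."""
    values = np.asarray(normal_values, dtype=float)
    center = float(values.mean())        # reference value (mean; the median is an alternative)
    spread = float(values.std(ddof=1))   # sample standard deviation
    return center, center + k * spread, center - k * spread

def is_abnormal(value, upper, lower):
    """True when the monitored value is outside the [lower, upper] range."""
    return value > upper or value < lower

# Hypothetical final-residual values observed during normal operation.
center, upper, lower = control_limits([-1.0, 0.0, 1.0], k=3.0)
```

A larger k widens the accepted band and lowers the false-alarm rate, at the cost of later detection.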

FIG. 7 is a diagram illustrating the process by which discrete categorical data is converted into a final residual that excludes the correlation with other events, according to an embodiment of the present invention.

In step S710, data received from the monitoring target is collected. The collected data relates to the occurrence frequency of events arising in the monitoring target and may be discrete categorical data.

In step S720, the discrete categorical data is accumulated to generate a cumulative function. Because the cumulative function is generated by sequentially summing the number of event occurrences, it has a pattern that rises over time.

However, the rising pattern of the cumulative function is an inevitable consequence of the nature of a cumulative function and is not of interest in the present invention, so it is removed through time-series regression analysis.

Once the rising pattern has been removed in step S720, step S730 yields the result of converting the discrete categorical data into continuous data. The continuous data converted in step S730 may contain information on whether events occurred more or less often than in the average case.

That is, the continuous data in step S730 is the difference between the cumulative frequency of the events that actually occurred and the predicted cumulative frequency. A residual greater than 0 therefore means that more events occurred than in the average case, and a residual less than 0 means that fewer events occurred than in the average case.

In step S740, the final residual, which excludes the correlation with other events, is calculated. Because the occurrence of a specific event A may be attributable to the occurrence of event B, excluding the correlation with other events yields the independent variation in the occurrence frequency of event A.

Thereafter, if the independent variation in the occurrence frequency of event A is outside the preset range, it can be determined that there is an abnormality in the monitoring target. The categorical data analysis method described above thus makes it possible to analyze discrete categorical data effectively.
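Steps S710 to S740 and the final range check can be strung together in a short Python sketch. Several choices here are assumptions rather than details from the specification: the trend line passes through the origin, the correlation function is a simple linear regression, and both the regression and the thresholds are estimated on the first `n_normal` samples, which are taken to represent normal operation. All names and data are hypothetical.

```python
import numpy as np

def to_residual(counts):
    """S720-S730: accumulate counts, remove the rising trend, return the residual."""
    counts = np.asarray(counts, dtype=float)
    t = np.arange(1, len(counts) + 1, dtype=float)
    cumulative = np.cumsum(counts)
    slope = t @ cumulative / (t @ t)   # least-squares slope through the origin (assumed)
    return cumulative - slope * t

def detect(counts_a, counts_b, n_normal, k=3.0):
    """Return (final_residual, flags); flags mark samples outside mean +/- k*std."""
    e_a = to_residual(counts_a)
    e_b = to_residual(counts_b)
    # Correlation function fitted on the assumed-normal interval only.
    w, b = np.polyfit(e_b[:n_normal], e_a[:n_normal], deg=1)
    final = e_a - (w * e_b + b)        # S740: remove the part explained by event B
    center = final[:n_normal].mean()
    spread = final[:n_normal].std(ddof=1)
    flags = (final > center + k * spread) | (final < center - k * spread)
    return final, flags

# S710: hypothetical per-interval counts collected for events A and B.
counts_a = [1, 0, 2, 1, 0, 1, 3, 0, 4, 5, 4, 6]
counts_b = [1, 0, 2, 1, 0, 1, 3, 0, 1, 0, 1, 0]
final, flags = detect(counts_a, counts_b, n_normal=8)
```

Fitting on a normal interval keeps a later fault from distorting the regression and the thresholds. In practice the trend line would probably be fitted on that interval as well, whereas this sketch fits it over the whole series for brevity.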

FIG. 8 is a functional block diagram illustrating a categorical data analysis apparatus according to an embodiment of the present invention.

The categorical data analysis apparatus 800 shown in FIG. 8 includes a cumulative function generator 810, a residual function generator 820, a correlation function generator 830, a final residual function generator 840, and an abnormality detection unit 850.

FIG. 8 shows only the components relevant to the embodiment of the present invention. Accordingly, a person of ordinary skill in the art to which the present invention pertains will appreciate that other general-purpose components may be included in addition to those shown in FIG. 8.

The cumulative function generator 810 generates a first cumulative function from first categorical data and a second cumulative function from second categorical data. The specific method of generating a cumulative function from discrete categorical data is as described with reference to FIG. 4.

The residual function generator 820 generates a first residual function by removing the rising pattern from the first cumulative function, and generates a second residual function by removing the rising pattern from the second cumulative function. To this end, the residual function generator 820 may include a trend line generator 821 that generates a first trend line representing the rising pattern of the first cumulative function and a second trend line representing the rising pattern of the second cumulative function.

The specific method of generating a residual function, which is continuous data, by removing the rising pattern from the cumulative function is as described with reference to FIG. 5, so a redundant description is omitted.

The correlation function generator 830 generates a correlation function representing the correlation between the first residual function and the second residual function. The specific method of generating the correlation function is as described with reference to FIG. 6, so a redundant description is omitted.

The final residual function generator 840 generates a final residual function as the difference between the first residual function and the correlation function. The specific method of generating the final residual function is likewise described with reference to FIG. 6, so a redundant description is omitted.

The abnormality detection unit 850 determines that an abnormality has occurred in the monitoring target when the final residual function is outside a preset range. To this end, the abnormality detection unit 850 according to an embodiment of the present invention may include a reference value calculator 851 that calculates a reference value as the mean or median of the first categorical data, an upper limit reference value calculator 852 that calculates an upper limit reference value by adding an acceptable variation value to the reference value, and a lower limit reference value calculator 853 that calculates a lower limit reference value by subtracting the variation value from the reference value.

FIG. 9 is a functional block diagram illustrating a categorical data analysis apparatus according to another embodiment of the present invention.

The categorical data analysis apparatus 900 shown in FIG. 9 includes a processor 910, storage 920, a memory 930, a network interface 940, and a bus 950.

The processor 910 executes a categorical data analysis program. However, the programs that can be executed on the processor 910 are not limited to this, and other general-purpose programs may also be executed.

The storage 920 stores the categorical data analysis program. The categorical data analysis program according to an embodiment of the present invention executes the steps of: generating a first cumulative function from first categorical data and a second cumulative function from second categorical data; generating a first residual function by removing the rising pattern from the first cumulative function, and a second residual function by removing the rising pattern from the second cumulative function; generating a correlation function representing the correlation between the first residual function and the second residual function; generating a final residual function as the difference between the first residual function and the correlation function; and determining that an abnormality has occurred in the monitoring target when the final residual function is outside a preset range.

The memory 930 loads the categorical data analysis program so that it can be executed on the processor 910.

Various computing devices may be connected to the network interface 940. For example, the mechanical equipment to be monitored may be connected so that the categorical data measured at the mechanical equipment is received.

The bus 950 serves as a data path connecting the processor 910, the storage 920, the memory 930, and the network interface 940 described above.

Meanwhile, the method described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. In addition, the structure of the data used in the method may be recorded on a computer-readable recording medium by various means. The computer-readable recording medium includes storage media such as magnetic storage media (e.g., ROM, floppy disks, hard disks) and optically readable media (e.g., CD-ROMs, DVDs).

A person of ordinary skill in the technical field related to the present embodiment will understand that the invention may be implemented in modified forms without departing from the essential characteristics of the above description. The disclosed methods should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

1. A method for analyzing categorical data, the method comprising:
generating a first cumulative function from first categorical data and generating a second cumulative function from second categorical data;
generating a first residual function by removing a rising pattern from the first cumulative function, and generating a second residual function by removing a rising pattern from the second cumulative function;
generating a correlation function representing a correlation between the first residual function and the second residual function;
generating a final residual function as a difference value between the first residual function and the correlation function; and
determining that an abnormality has occurred in a monitoring target when the final residual function is outside a preset range.

2. The method of claim 1,
wherein the generating of the first residual function and the second residual function comprises:
generating a first trend line representing the rising pattern of the first cumulative function and a second trend line representing the rising pattern of the second cumulative function; and
generating the first residual function from a residual that is a difference value between the first trend line and the first cumulative function, and generating the second residual function from a residual that is a difference value between the second trend line and the second cumulative function.

3. The method of claim 2,
wherein the generating of the first trend line and the second trend line comprises
calculating a slope of the first trend line and a slope of the second trend line by time-series linear regression analysis.

4. The method of claim 3,
wherein the calculating of the slope of the first trend line and the slope of the second trend line comprises:
determining, as the slope of the first trend line, a slope that satisfies the least-squares criterion for the difference between the first trend line and the first cumulative function; and
determining, as the slope of the second trend line, a slope that satisfies the least-squares criterion for the difference between the second trend line and the second cumulative function.

5. The method of claim 2,
wherein the generating of the correlation function comprises
generating the correlation function by using, as an independent variable in a regression model, the residual that is the difference value between the second trend line and the second cumulative function.

6. The method of claim 5,
wherein the generating of the correlation function comprises
using at least one of multiple linear regression, a neural network model, a decision tree, and regularized regression as the regression model.

7. The method of claim 1,
wherein the determining that an abnormality has occurred in the monitoring target when the final residual function is outside a preset range comprises:
calculating a reference value as an average value or a median value of the first categorical data;
calculating an upper limit reference value by adding an acceptable variation value to the reference value, and calculating a lower limit reference value by subtracting the variation value from the reference value; and
determining that an abnormality has occurred in the monitoring target when the final residual function deviates from the upper limit reference value or the lower limit reference value.

8. The method of claim 1,
wherein the first categorical data and the second categorical data are non-continuous discrete data.

9. A categorical data analysis apparatus comprising:
a cumulative function generator configured to generate a first cumulative function from first categorical data and a second cumulative function from second categorical data;
a residual function generator configured to generate a first residual function by removing a rising pattern from the first cumulative function, and to generate a second residual function by removing a rising pattern from the second cumulative function;
a correlation function generator configured to generate a correlation function representing a correlation between the first residual function and the second residual function;
a final residual function generator configured to generate a final residual function as a difference value between the first residual function and the correlation function; and
an abnormality detection unit configured to determine that an abnormality has occurred in a monitoring target when the final residual function is outside a preset range.

10. The apparatus of claim 9,
wherein the residual function generator
comprises a trend line generator configured to generate a first trend line representing the rising pattern of the first cumulative function and a second trend line representing the rising pattern of the second cumulative function, and
generates the first residual function from a residual that is a difference value between the first trend line and the first cumulative function, and generates the second residual function from a residual that is a difference value between the second trend line and the second cumulative function.

11. The apparatus of claim 10,
wherein the correlation function generator
generates the correlation function by using, as an independent variable in a regression model, the residual that is the difference value between the second trend line and the second cumulative function.

12. The apparatus of claim 9,
wherein the abnormality detection unit comprises:
a reference value calculation unit configured to calculate a reference value as an average value or a median value of the first categorical data;
an upper limit reference value calculation unit configured to calculate an upper limit reference value by adding an acceptable variation value to the reference value; and
a lower limit reference value calculation unit configured to calculate a lower limit reference value by subtracting the variation value from the reference value,
and determines that an abnormality has occurred in the monitoring target when the final residual function deviates from the upper limit reference value or the lower limit reference value.

13. The apparatus of claim 9,
wherein the first categorical data and the second categorical data are non-continuous discrete data.

14. A categorical data analysis apparatus comprising:
one or more processors;
a memory for loading a computer program executed by the one or more processors; and
storage for storing a computer program capable of analyzing categorical data,
wherein the computer program comprises:
an operation of generating a first cumulative function from first categorical data and generating a second cumulative function from second categorical data;
an operation of generating a first residual function by removing a rising pattern from the first cumulative function, and generating a second residual function by removing a rising pattern from the second cumulative function;
an operation of generating a correlation function representing a correlation between the first residual function and the second residual function;
an operation of generating a final residual function as a difference value between the first residual function and the correlation function; and
an operation of determining that an abnormality has occurred in a monitoring target when the final residual function is outside a preset range.

15. A computer program stored in a computer-readable recording medium to execute, in combination with a computer device, the steps of:
generating a first cumulative function from first categorical data and generating a second cumulative function from second categorical data;
generating a first residual function by removing a rising pattern from the first cumulative function, and generating a second residual function by removing a rising pattern from the second cumulative function;
generating a correlation function representing a correlation between the first residual function and the second residual function;
generating a final residual function as a difference value between the first residual function and the correlation function; and
determining that an abnormality has occurred in a monitoring target when the final residual function is outside a preset range.