KR102489502B1

KR102489502B1 - Sensor outlier detection system for building monitoring using ensemble algorithm

Info

Publication number: KR102489502B1
Application number: KR1020210127549A
Authority: KR
Inventors: 최우현; 양동원
Original assignee: (주)엘 테크
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2023-01-17

Abstract

The present invention relates to a sensor outlier detection system for building monitoring using an ensemble algorithm. In operating vibration, crack, and tilt sensors and the like used for safety monitoring of a building, the system sequentially applies a plurality of detection algorithms to effectively detect outliers in detection signals arising from various causes and to respond to errors, thereby improving monitoring precision. The system comprises a first detection module, a second detection module, a third detection module, a fourth detection module, and a verification module.

Description

Sensor outlier detection system for building monitoring using ensemble algorithm

The present invention relates to a sensor outlier detection system, and more particularly to a sensor outlier detection system for building monitoring using an ensemble algorithm which, in operating vibration, crack, and tilt sensors and the like used for building safety monitoring, sequentially applies a plurality of detection algorithms to effectively detect outliers in detection signals arising from various causes and to respond to errors, thereby improving monitoring precision.

An outlier arises in various sensors and is defined as data that fall outside the range in which data would be expected to be observed, data that are very unlikely to be observed, or data whose observed attribute values are heterogeneous with respect to other data. Outliers are also called anomalous values, extreme values, or singular values. They can have an unexpected effect on the stability of a model built from the data, or can indicate anomalies or distinctive features in the collected data, and are therefore very important for detecting trend changes or the occurrence of significant issues.

The vibration, crack, and tilt sensors used in a building safety monitoring system deliver IoT sensor data in the form of a time series. The collected data contain many elements that can be regarded as heterogeneous outliers, caused by temperature, rain, depletion of the built-in battery, sensor power loss due to construction work, or an actual sudden crack or tilt of the building. Outlier detection and appropriate handling are therefore needed in the preprocessing stage before data modeling for the safety monitoring system and in data visualization.

Moreover, unlike fields where detection errors matter relatively little, in building safety the accurate detection of outliers in safety-related data is the main reason the whole system exists, so a rigorous outlier detection model is required.

Because outliers arise from diverse causes, deciding whether data are outliers often involves the researcher's subjective judgment or expertise, reflecting several factors. Outliers range from values that simply fall outside the normal range to values caused by measurement errors, communication errors, recording errors, or departures from the usual operating environment, so detection must take into account the context of data collection and the effect on the model.

Sensors that measure vibration, cracks, and tilt for building monitoring also exhibit outliers in various forms: a large variance for a short time away from the trend (additive outlier); a different trend over a relatively long period followed by a return to the original level (transient change outlier); a large change that then persists (level shift); and a large change that continues to affect subsequent observations (innovational outlier).

However, because the causes are so varied, the algorithms used for outlier detection are numerous, their logical bases differ, and their results differ as well, so it is hard to conclude clearly that one is better or worse than another.

Outlier search for general data, as currently practiced, is divided into univariate and multivariate cases. Various detection techniques can be used depending on whether the detection model is built with prior information on whether each observation is normal or an outlier, and on whether the data can be assumed to follow a specific probability distribution. In addition, there are methods that detect outliers while fitting an ARIMA model to exploit the characteristics of time series data, and methods that search for outliers using nonparametric quantile regression.

The serious problem with outlier detection is in fact a methodological "poverty in the midst of plenty": various algorithms exist depending on the form of the data, yet there is no criterion for the suitability of the detection result that can confidently support the claim that the detected outliers are true outliers.

In statistical methodology, a goodness-of-fit (GOF) test based on the difference between observed and predicted values is usually used, but in outlier search such a method is hard to apply unless it is a supervised method in which it is known in advance whether particular data are outliers.

Republic of Korea Patent No. 10-2274389 (2021.07.01)

The present invention was created to solve the above problems. Its object is to provide a sensor outlier detection system for building monitoring using an ensemble algorithm which, in monitoring building safety with various sensors, detects outliers in detection signals arising from various causes by sequentially applying a plurality of algorithms and responds to errors so that accurate detection results are obtained.

To this end, the present invention provides a system for detecting outliers of a sensor installed in a building to monitor safety, comprising: a first detection module having a sorting unit that sorts the data values obtained from the sensor in ascending order and an interval setting unit that divides the sorted data values at a set ratio to set a boundary interval, the module defining and storing data values outside the boundary interval as first outliers; a second detection module that normalizes the data values obtained from the sensor, tests them at a set confidence level with reference to the standard normal distribution, and extracts and stores second outliers; a third detection module having a test statistic unit that sets an upper limit on the number of outliers from the first and second outliers and calculates a test statistic, and a threshold setting unit that sets a threshold to be compared with the test statistic, the module defining and storing a value as a third outlier when the test statistic exceeds the set threshold; a fourth detection module having a score calculation unit that calculates an outlier score using the isolation forest algorithm, the module setting a proportion of outliers and defining and storing the data values whose outlier scores fit the set proportion as fourth outliers; and a verification module having an entropy calculation unit that, based on the first to fourth outliers, calculates entropy for the original data values and for the data set from which the outlier values have been removed, and an analysis unit that reinterprets the change in entropy as a change in probability, the module comparing the probability change with a set value to judge whether the outlier extraction is satisfactory.

Here, the sensor may be a sensor capable of measuring vibration, cracks, or tilt.

The interval setting unit may divide the data values sorted in ascending order into four equal parts, multiply the difference between the 75th percentile value and the 25th percentile value by 1.5, and set a boundary interval whose maximum is that product added to the 75th percentile value and whose minimum is that product subtracted from the 25th percentile value.

The second detection module may test for outliers at a confidence level of 95 to 99%.

The test statistic unit may set the upper limit on the number of outliers to the maximum of the first and second outlier counts plus 1, calculate the sample mean and sample standard deviation, find the maximum deviation between an observation and the sample mean, remove the corresponding data value from the sample, and recalculate the test statistic, repeating until test statistics have been calculated up to the upper limit on the number of outliers.

Existing outlier detection methods are hard to apply because of limitations such as the need to estimate the number of outliers in advance, the assumption that the data follow a normal distribution, and the assumption that the data are balanced on both sides of the median. In addition, different methodologies give different results, so the detection result depends on the methodology used. Furthermore, there is no measurement tool that objectively describes how appropriate a detection result is, which makes follow-up action difficult.

The present invention overcomes the problems of these existing methods by applying an ensemble technique that combines several techniques so that outliers are detected accurately. This dispels concerns about whether the collected data satisfy the assumptions each technique requires. The methodologies are applied sequentially so that they satisfy one another's required assumptions, and whether the outliers extracted by the various methodologies were detected appropriately can be checked on the basis of the entropy information amount. Stable outlier detection can therefore be expected.

This use of an information-amount statistic is applicable not only to univariate techniques but also to multivariate outlier search results. It can therefore provide a basis for mutually evaluating the results of the dozens of competing outlier detection algorithms on a common measure, the change in information amount.

FIG. 1 is a conceptual diagram of the present invention;
FIG. 2 is a block diagram showing the configuration and connections according to an embodiment of the present invention;
FIG. 3 is a flowchart showing the sequence according to an embodiment of the present invention;
FIGS. 4 and 5 are sheets showing examples of outlier detection according to experimental examples of the present invention;
FIGS. 6 to 9 are sheets showing analysis results for sensor data values according to experimental examples of the present invention.

Hereinafter, the configuration of the sensor outlier detection system for building monitoring using an ensemble algorithm according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram of the present invention. The present invention is a system for detecting outliers of a sensor installed in a building to monitor safety. The sensor may be a sensor capable of measuring vibration, cracks, or tilt, and the outlier detection is based on the following four algorithms. Even if other techniques are added, outliers can be consolidated by following the same steps, so outliers can be detected without depending on a specific methodology.

The first is the IQR (interquartile range): after the data are arranged in order of size, the interval between the first quartile (25%) and the third quartile (75%) is called the IQR. A data value greater than the third quartile plus 1.5 times the IQR, or smaller than the first quartile minus 1.5 times the IQR, is regarded as an outlier.
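The quartile rule can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes the fences are Q1 − 1.5·IQR and Q3 + 1.5·IQR as stated for the interval setting unit, and the function name `iqr_outliers`, the sample `readings`, and the choice of the "inclusive" quantile method are all hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, flagged values) for the quartile rule."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

# Hypothetical tilt readings with one suspicious spike.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 25.0]
lower, upper, flagged = iqr_outliers(readings)
print(lower, upper, flagged)
```

Different quantile conventions shift the fences slightly on small samples, so the method argument is a design choice rather than something the text prescribes.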

This method makes no distributional assumption about the data and involves no subjective factor in interpreting the result. However, for unbalanced data (where the frequencies on the two sides of the median differ greatly) its results differ from those of other methods, and because it still detects outliers on the basis of distance, it has difficulty when two heterogeneous groups of data are mixed (mixed distribution).

The second is the z-score, also called the standardized score, which expresses how far a data value lies from the mean, taking the standard deviation into account, under the assumption that the data follow a normal probability distribution. It is expressed as follows.

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the data value, $\mu$ is the mean, and $\sigma$ is the standard deviation (the square root of the variance).

The z-score is widely used, but it runs into problems when the normality assumption fails or when the data contain many extreme values, because the mean is sensitive to extreme values. Whether to use a 95% or a 99% confidence coefficient is also a matter of the user's subjective judgment. If the normality assumption holds, however, it is the most sensitive to outliers and has the advantage of being intuitive.
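A small Python sketch may make the dependence on the confidence level concrete. It assumes a two-sided test against the standard normal distribution and uses the sample standard deviation; the function name `zscore_outliers` and the data are hypothetical, and the patent text does not specify these details.

```python
import statistics

def zscore_outliers(values, confidence=0.95):
    """Flag values whose |z| exceeds the two-sided standard normal critical value."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # Two-sided critical value, e.g. about 1.96 for 95% and 2.58 for 99%.
    z_crit = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return [v for v in values if abs(v - mean) / sd > z_crit]

readings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 25.0]
at_95 = zscore_outliers(readings, 0.95)
at_99 = zscore_outliers(readings, 0.99)
print(at_95, at_99)
```

On this small sample the spike inflates both the mean and the standard deviation, so it is flagged at 95% but not at 99%, which illustrates why the choice of confidence level matters.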

The third is the generalized ESD test, which uses the same statistic as the Grubbs test but remedies its weakness: instead of having to fix the number of outliers in advance, only an upper limit on the number of outliers needs to be specified.

Although it is based on the distance of outliers, it has the theoretical advantage that the type I error probability for judging several points as outliers is mathematically corrected. The ESD test requires only an upper limit on the number of outliers and is well suited to detecting several outliers at once. However, it still relies on normality, shares the weaknesses of distance-based detection, and for large data sets the simulated type I error probability of the test becomes unstable.
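The iterative part of the generalized ESD procedure (compute the statistic, remove the most extreme value, repeat up to the upper limit) can be sketched as follows. This is an illustration under assumptions: the critical values normally come from a t-distribution formula that is not computed here, so the caller supplies them, and the numbers passed as `thresholds` below are placeholders. The names `esd_statistics` and `count_outliers` are hypothetical.

```python
import statistics

def esd_statistics(values, max_outliers):
    """Return [(candidate, R_i)] for i = 1..max_outliers (fewer if the sample runs out)."""
    sample = list(values)
    results = []
    for _ in range(max_outliers):
        if len(sample) < 3:
            break
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)
        if sd == 0:
            break
        # Value with the largest absolute deviation from the current mean.
        candidate = max(sample, key=lambda v: abs(v - mean))
        results.append((candidate, abs(candidate - mean) / sd))
        sample.remove(candidate)
    return results

def count_outliers(results, thresholds):
    """Number of outliers = largest i whose statistic exceeds its threshold."""
    n_out = 0
    for i, ((_, r), lam) in enumerate(zip(results, thresholds), start=1):
        if r > lam:
            n_out = i
    return n_out

readings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 25.0]
results = esd_statistics(readings, max_outliers=2)
# Placeholder thresholds for illustration only; real values depend on n and alpha.
n_out = count_outliers(results, thresholds=[2.13, 2.02])
print(results, n_out)
```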

For reference, the Grubbs test takes the absolute deviation of the data value farthest from the mean and tests whether it is an outlier under the assumption that the data come from a normal distribution.

It tests the single value with the largest deviation and repeats the test until no further outlier is detected. As the number of samples decreases, the probability of concluding that a value is an outlier when it is not increases, so the test needs to be stopped when the sample becomes small.

The fourth is the isolation forest. This technique keeps generating random straight lines that split the data values in two until every data point is separated, and then regards the data that were separated from the others relatively early as outliers.

The i-Forest technique is a kind of simulation technique that scores and judges the average value obtained by applying the same algorithm many times, and an extended version using sloped straight lines has recently appeared. It is a methodology based on repeated simulation, specialized for multivariate data and unsupervised outlier detection.

It is easy to use because the computation is relatively fast and no special assumptions are made, but the results are hard to interpret because the proportion of the data to be regarded as outliers must be entered in advance.

The Isolation Forest is an unsupervised anomaly detection method used mainly to detect outliers in data already at hand. As its name suggests, it is tree-based: it splits the data at random until every observation is isolated. In particular, it works efficiently even on data with many variables.

The concept of the Isolation Forest is that outliers are easier to isolate (separate) than normal data. Learning relies on the fact that normal data end up near the terminal nodes of a tree and have long path lengths, while outliers end up near the root node and have short path lengths.

Just as a random forest ensembles decision trees by repeating them many times, the Isolation Forest ensembles iTrees by repeating them many times. An iTree is built by ① sub-sampling: sampling part of the data without replacement; ② variable selection: randomly selecting a variable q among the variables of data X; ③ split point selection: choosing a split point uniformly within the range (min to max) of variable q; and ④ repeating steps ① to ③ recursively until every observation is isolated or a set number of splits is reached, storing all path lengths. The Isolation Forest then ⑤ repeats steps ① to ④, that is, the iTree, many times.
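The iTree steps can be illustrated with a simplified, univariate Python sketch. It is not a full isolation forest: with a single variable the variable-selection step is trivial, only the branch containing the query value is followed instead of storing whole trees, and the usual leaf-size correction is omitted. The names `path_length` and `mean_path_length` and all parameter defaults are hypothetical.

```python
import random

def path_length(x, sample, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate x within a 1-D sample."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)  # split point chosen uniformly in [min, max]
    if x < split:
        side = [v for v in sample if v < split]
    else:
        side = [v for v in sample if v >= split]
    return path_length(x, side, rng, depth + 1, max_depth)

def mean_path_length(x, data, n_trees=200, sample_size=8, seed=0):
    """Average path length of x over many randomly built trees."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        # Sub-sampling without replacement.
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(x, sample, rng)
    return total / n_trees

readings = [10.1, 10.3, 9.9, 10.0, 10.2, 10.4, 9.8, 25.0]
short = mean_path_length(25.0, readings)  # isolated after very few splits
long_ = mean_path_length(10.1, readings)  # needs more splits
print(short, long_)
```

The spike is separated from the rest after very few splits, so its average path length is much shorter than that of a typical reading.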

The evaluation is performed as follows.

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

Here h(x) is the path length of the observation, E(h(x)) is the average path length of the observation over all iTrees, and c(n) is the value used to normalize h(x), namely the average path length of an iTree. (Because an iTree has the same structure as a binary search tree, c(n) is easy to obtain.)

The score s varies with E(h(x)) as follows: (1) when observation x has a path length similar to the average over all path lengths (normal data), E(h(x)) → c(n) and s → 0.5; (2) when x is an outlier, E(h(x)) → 0 and s → 1; (3) when x has the maximum path length, E(h(x)) → n − 1 and s → 0. That is, the score lies between 0 and 1; the closer it is to 1, the more likely the observation is an outlier, and a value of 0.5 or less can be judged to be normal data.
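The three limiting cases can be checked numerically. The sketch below assumes the standard isolation-forest score s(x, n) = 2^(−E(h(x))/c(n)) with c(n) = 2H(n−1) − 2(n−1)/n, where H(i) is approximated by ln(i) plus the Euler-Mascheroni constant; these formulas are taken from the general isolation-forest literature, not from this text, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path, n):
    """Score in (0, 1]; values near 1 indicate likely outliers."""
    return 2.0 ** (-mean_path / c(n))

n = 256
typical = anomaly_score(c(n), n)   # E(h(x)) = c(n)
isolated = anomaly_score(0.0, n)   # E(h(x)) -> 0
deepest = anomaly_score(n - 1, n)  # E(h(x)) -> n - 1
print(typical, isolated, deepest)
```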

The present invention organically combines these algorithms into a so-called ensemble algorithm that can detect outliers accurately and cope with errors.

As the examples with real data described later show, the existing outlier detection methodologies rest on different theories and are intended for different situations, so when applied to real data their results are sometimes similar but often differ.

For a building safety diagnosis platform in particular, whether an outlier has been detected goes beyond a prediction problem and bears on the very purpose of the platform, so accuracy is required. It is undesirable for results to vary greatly with the methodology used, and in practice even the criteria for judging which methodology suits the situation are ambiguous.

This is because each methodology develops its own argument for picking out heterogeneous data on a different basis, so methodologies with different bases cannot be ranked theoretically. It is also rare to know which real data are outliers, which would allow a supervised learning algorithm to be used, so post hoc verification is not possible either.

Accordingly, the present invention determines the suitability of outlier detection using the entropy information amount.

In data science, entropy means an amount of information; Shannon entropy, or the average information amount, is used as the entropy of information.

The information amount is defined as the logarithm of the reciprocal of the probability that an event occurs, and the average of this information amount is called the entropy. The information amount is small when an event that is likely to occur occurs, and large for an event that is unlikely to occur.

From the standpoint of outliers, it is reasonable to compare the entropies of the data with the outliers detected by each method excluded, and to adopt the outliers of the methodology that gives the smallest entropy. In other words, when outliers (values with a small probability of occurring) are excluded, the average information amount of the remaining data should be clearly smaller than that calculated from the data including the outliers. By comparing the average information amounts of the data sets from which the outliers proposed by each methodology have been removed, one can judge which outlier extraction is more suitable in terms of information amount. This serves as a suitability test that integrates the various methodologies.

The average information amount (entropy) is defined as follows.

$$H(X) = E[I(X)] = -\sum_{i} P(x_i) \log_b P(x_i)$$

Here I(X) is the information amount of an individual data value and b is the base of the logarithm. Shannon entropy uses base 2, which gives units of bits, and base 10 or Euler's number e are also used.
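As a rough illustration of the entropy calculation, the sketch below estimates probabilities from a histogram of the readings. Binning continuous sensor values is an assumption made here for illustration, since the text does not state how the probabilities are estimated; `shannon_entropy` and `bin_width` are hypothetical names.

```python
import math
from collections import Counter

def shannon_entropy(values, bin_width=0.5, base=2):
    """Entropy of the empirical distribution of values after fixed-width binning."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    # H = -sum p * log_b(p) over the occupied bins.
    return -sum((k / n) * math.log(k / n, base) for k in bins.values())

two_bins = shannon_entropy([0.2, 0.4, 1.1, 1.7], bin_width=1.0)  # two equally likely bins
one_bin = shannon_entropy([3.0, 3.1, 3.2], bin_width=1.0)        # all values in one bin
print(two_bins, one_bin)
```

Two equally likely bins give one bit of entropy, while a sample that falls entirely in one bin gives zero.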

After detecting outliers with several extraction methodologies, it is advisable to select the case in which the average information amount of the data set with the outliers removed is significantly smaller. Entropy has the property that it becomes smaller as data with a low probability of occurring are excluded.

Note, however, that the value also decreases when data with a relatively small probability are excluded, even if they are not outliers with a very small probability. Excluding data with a very high probability can increase the entropy, and because the calculation takes the logarithm of the probability, entropy is nonlinear in the probability.

Because the difference between the information amounts of the outlier-free data and the original data is hard to interpret intuitively, converting the average information amount into a change in the average probability of the data makes the difference easier to grasp.

Accordingly, the present invention defines the difference in average information amount as follows. Using the definition H(X) directly makes theoretical calculation difficult, because the probability distribution itself changes when outliers are excluded.

The average information content is therefore regarded as the representative information content I(X) of a data set, and the entropy is taken as the representative information content of both the full data set and the data set with outliers removed (the reduced data set). The difference between the representative information contents of the two data sets can be expressed as $I(X_{\mathrm{full}}) - I(X_{\mathrm{reduced}})$.
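The difference in representative information content can be related to probabilities as sketched below. This is an assumption offered for illustration, not the patent's own formula: it writes the representative information content as $I(X) = -\log_b \bar{p}$ for a representative probability $\bar{p}$, where the symbols $\bar{p}$ and the logarithm base $b$ are introduced here.

```latex
\Delta I = I(X_{\mathrm{full}}) - I(X_{\mathrm{reduced}})
         = \log_b \frac{\bar{p}_{\mathrm{reduced}}}{\bar{p}_{\mathrm{full}}},
\qquad \bar{p} = b^{-I(X)} .
```

Under this reading, removing genuine outliers lowers the entropy, so $\bar{p}_{\mathrm{reduced}} > \bar{p}_{\mathrm{full}}$ and $\Delta I > 0$.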

Taking the average information content of each data set as its representative information content and viewing the difference in terms of probability, the larger the difference in probability, the better the outlier extraction can be judged to be.

Conversely, if the probability values differ little, the outlier extraction result can be judged to be poor.

Because many algorithms exist for detecting sensor outliers, each methodology has its own advantages and disadvantages. A further problem is that several algorithms can be used only if the number of outliers is specified in advance.

Among these algorithms, outlier detection techniques that fit a time-series model and use the difference from the predicted values are recent and of proven effectiveness, but they were excluded because, given the nature of building-related data, fitting a dedicated model does not appear to be strictly necessary.

In addition, multivariate techniques are not needed here, and because stability data for building-related sensors have not yet been accumulated, only unsupervised techniques can be used. Since sensor data are time series, outlier detection was performed on the first-order differenced residuals in order to exclude the effect of time.

FIG. 2 is a block diagram showing the configuration and connections according to an embodiment of the present invention, and FIG. 3 is a flowchart showing the sequence according to the embodiment. The present invention proposes an ensemble algorithm that applies four representative univariate outlier detection methodologies simultaneously, calculates the information content (entropy) of the result each methodology flags as outliers, and defines the outliers of the sensor data set on the basis of the change in the probability value of the representative information content.

To this end, taking a tilt sensor as an example, the readings are first converted from Cartesian coordinates to polar coordinates. The angle is not considered at this stage; outlier detection uses only the magnitude of the tilt, which is expressed as $r = \sqrt{x^2 + y^2}$, where $x$ and $y$ are the two Cartesian components of the tilt reading.
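The preprocessing described so far can be sketched as follows. This is a minimal illustration, assuming the tilt sensor reports two orthogonal components and that the first-order difference is taken on the magnitudes; the function names and sample readings are hypothetical.

```python
import math

def tilt_magnitude(x, y):
    """Radial part of the polar form of (x, y); the angle is ignored."""
    return math.hypot(x, y)

def first_difference(series):
    """First-order difference d[t] = series[t+1] - series[t]."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical two-axis tilt readings (arbitrary units).
xs = [3.0, 6.0, 5.0]
ys = [4.0, 8.0, 12.0]

magnitudes = [tilt_magnitude(x, y) for x, y in zip(xs, ys)]
diffs = first_difference(magnitudes)
print(magnitudes, diffs)
```

The differenced series is one element shorter than the input, so any outlier index found later refers to the transition between two consecutive readings.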

First, in the first step (S110), outliers are detected by the IQR (interquartile range) method. This is performed by the first detection module 110, which includes a sorting unit 111 that sorts the data values obtained from the sensor in ascending order and a section setting unit 112 that divides the sorted data values at set ratios to establish a boundary section, and which defines and stores data values outside the boundary section as first outliers.

That is, the data values are read in ascending order and divided into four equal parts, and the difference between the 75th percentile and the 25th percentile is calculated; this value is called the IQR. The section setting unit 112 then sets a boundary section whose maximum is the 75th percentile plus 1.5 times the IQR and whose minimum is the 25th percentile minus 1.5 times the IQR. Data points outside this range are defined as outliers, and the detected outlier values and their count are stored.
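The IQR step can be sketched in a few lines. The percentile convention is not stated in the text, so the "inclusive" method of Python's `statistics.quantiles` is an assumption here, as are the function name and the sample data.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower, upper)) using the k * IQR boundary rule."""
    # statistics.quantiles sorts the data internally; "inclusive" is an
    # assumed percentile convention, not one given in the source.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical readings with one obvious spike.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
outliers, bounds = iqr_outliers(data)
print(outliers, len(outliers), bounds)
```

Both the outlier values and their count are kept, because the count feeds the upper limit used by the later generalized ESD step.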

Next, in the second step (S120), outliers are detected by the Z-score method. This is performed by the second detection module 120, which standardizes the data values obtained from the sensor under a normal-distribution assumption, tests them at a set confidence level with reference to the standard normal distribution, and extracts and stores the second outliers.

That is, the data values are assumed to follow a normal distribution, and the mean and standard deviation are calculated to compute a score for each value. This is a reasonable assumption: under normal conditions, crack and tilt sensors used for building safety produce normally distributed white noise, whereas an abnormality appears as an unusual trend or as outliers.

The second detection module 120 then refers to the standard normal distribution to test, at the 95% or 99% confidence level, whether each calculated score is an outlier, and stores the detected outliers and the results.
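A minimal sketch of the Z-score step follows. It assumes a two-sided test and the sample (n - 1) standard deviation, neither of which is specified in the text; the function name and data are hypothetical.

```python
import statistics

def zscore_outliers(values, confidence=0.95):
    """Return (outliers, z_crit) for a two-sided test at the given level."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    # Two-sided critical value of the standard normal distribution.
    z_crit = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    outliers = [v for v in values if abs((v - mean) / sd) > z_crit]
    return outliers, z_crit

# Hypothetical readings with one obvious spike.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
outliers_95, z95 = zscore_outliers(data, confidence=0.95)
outliers_99, z99 = zscore_outliers(data, confidence=0.99)
print(outliers_95, round(z95, 3), outliers_99, round(z99, 3))
```

Because the spike itself inflates the mean and standard deviation, the Z-score can miss outliers that the other methods catch, which is one reason for comparing several methods.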

Next, in the third step (S130), outliers are detected using the generalized ESD test. This is performed by the third detection module 130, which includes a test statistics unit 131 that sets an upper limit on the number of outliers from the numbers of first and second outliers and calculates the test statistic, and a threshold setting unit 132 that sets the threshold to be compared with the test statistic, and which defines and stores a data value as a third outlier when its test statistic is greater than the set threshold.

That is, the test statistics unit 131 sets the upper limit on the number of outliers to the larger of the outlier counts from the first step (S110) and the second step (S120), plus 1, and then calculates the test statistic. It computes the sample mean and sample standard deviation, finds the maximum deviation between an observation and the sample mean, removes the corresponding data value from the sample, and recalculates the test statistic, repeating until the estimated maximum number of outliers is reached. This is expressed as follows.

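$R_i = \dfrac{\max_j |x_j - \bar{x}|}{s}, \quad i = 1, 2, \ldots, r$

Here $\bar{x}$ and $s$ are the sample mean and sample standard deviation of the data remaining at step $i$, and $r$ is the estimated maximum number of outliers.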

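In addition, the threshold setting unit 132 calculates the threshold to be compared with the test statistic as follows.

$\lambda_i = \dfrac{(n-i)\, t_{p,\,n-i-1}}{\sqrt{\left(n-i-1+t_{p,\,n-i-1}^{2}\right)(n-i+1)}}, \quad i = 1, 2, \ldots, r$

Here $t_{p,v}$ denotes the 100p percentage point of the t distribution with $v$ degrees of freedom, and $p$ is defined as $p = 1 - \dfrac{\alpha}{2(n-i+1)}$, where $n$ is the sample size and $\alpha$ is the significance level. If the calculated test statistic $R_i$ is greater than the threshold $\lambda_i$, the value is regarded as an outlier and stored. This test is fairly accurate when the sample size $n$ is sufficiently large; simulation studies of the generalized ESD test report that the critical-value approximation is very accurate for $n \ge 25$ and reasonably accurate for $n \ge 15$.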
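The generalized ESD procedure can be sketched with the standard library alone. Two points are assumptions of this sketch rather than of the patent: the t quantile is approximated by a series expansion around the normal quantile (an exact routine such as `scipy.stats.t.ppf` would normally be preferred), and the function names, the significance level 0.05, and the data are hypothetical.

```python
import math
import statistics

def t_quantile(p, v):
    """Approximate 100p percentage point of Student's t with v d.o.f.

    Series expansion around the normal quantile; an approximation only.
    """
    z = statistics.NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    g3 = (3 * z**7 + 19 * z**5 + 17 * z**3 - 15 * z) / 384
    return z + g1 / v + g2 / v**2 + g3 / v**3

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the values declared outliers by the generalized ESD test."""
    sample = list(values)
    n = len(sample)
    max_outliers = min(max_outliers, n - 2)  # keep n - i - 1 >= 1
    removed, n_out = [], 0
    for i in range(1, max_outliers + 1):
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        if sd == 0:
            break  # all remaining values are identical
        worst = max(sample, key=lambda x: abs(x - mean))
        r_i = abs(worst - mean) / sd              # test statistic R_i
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_quantile(p, n - i - 1)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        sample.remove(worst)
        removed.append(worst)
        if r_i > lam:
            n_out = i  # number of outliers = largest i with R_i > lambda_i
    return removed[:n_out]

# Hypothetical data: 19 readings near 10 and one spike.
values = [10.1, 9.9, 10.0, 10.2, 9.8, 10.05, 9.95, 10.15, 9.85, 10.0,
          10.1, 9.9, 10.0, 10.2, 9.8, 10.05, 9.95, 10.15, 9.85, 50.0]
n_iqr, n_z = 1, 1                       # counts assumed from steps 1 and 2
upper_limit = max(n_iqr, n_z) + 1       # rule stated in the text
esd_outliers = generalized_esd(values, upper_limit)
print(esd_outliers)
```

Testing up to the upper limit and then taking the largest index that exceeds its threshold is what lets the test find several outliers even when an early step alone would not be significant.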

Next, in the fourth step (S140), outliers are detected using the isolation forest technique. This is performed by the fourth detection module 140, which includes a score calculation unit 141 that calculates outlier scores using the isolation forest algorithm, and which sets a ratio of outliers and defines and stores the data values whose outlier scores fall within the set ratio as fourth outliers.

This technique detects outliers using decision trees: the proportion of outliers in the data is estimated in advance, and a space-partitioning algorithm is then used to detect outliers that bear no similarity to the other observations.

Many outlier detection techniques rely on statistical distances between data points, which has the drawback that the amount of computation explodes as the data grow or become multidimensional. A partition-based technique carries less distance-computation burden and is less affected by two difficult problems in outlier detection: masking (outliers that cluster together are classified as normal values) and swamping (normal values close to outliers are wrongly classified as outliers).

The algorithm creates a root node holding all the data and a list C of candidate nodes for later splitting. One node R is drawn at random from this list and removed from C. For the selected R, a characteristic value a is chosen at random; data less than or equal to a are placed in one child node and the remaining data in the other. If either child node contains two or more data points, it is added back to the candidate list C; otherwise it is regarded as an isolated node. The same operation is repeated until every data point is in an isolated node. The number of operations needed to isolate each data point is stored, and points that require few operations are regarded as outliers.
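The isolation procedure can be sketched for one-dimensional data as below. This is a simplified illustration, not a full isolation forest: it omits subsampling and the normalized anomaly score, averages raw isolation depths over a number of random trees, and uses hypothetical names, tree count, seed, and data.

```python
import random

def isolation_depths(values, rng):
    """Build one random isolation tree; return the depth of each point."""
    depths = [0] * len(values)
    candidates = [(list(range(len(values))), 0)]  # list C: (indices, depth)
    while candidates:
        # Draw one candidate node at random and remove it from the list.
        idx, depth = candidates.pop(rng.randrange(len(candidates)))
        lo = min(values[i] for i in idx)
        hi = max(values[i] for i in idx)
        if lo == hi:  # identical values cannot be separated further
            for i in idx:
                depths[i] = depth
            continue
        a = rng.uniform(lo, hi)  # random split value
        left = [i for i in idx if values[i] <= a]
        right = [i for i in idx if values[i] > a]
        for child in (left, right):
            if len(child) >= 2:
                candidates.append((child, depth + 1))
            elif len(child) == 1:
                depths[child[0]] = depth + 1  # isolated node
    return depths

def mean_isolation_depth(values, n_trees=200, seed=0):
    """Average isolation depth per point over n_trees random trees."""
    rng = random.Random(seed)
    totals = [0.0] * len(values)
    for _ in range(n_trees):
        for i, d in enumerate(isolation_depths(values, rng)):
            totals[i] += d
    return [t / n_trees for t in totals]

# Hypothetical readings: seven values near 10 and one spike at the end.
data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.05, 9.95, 50.0]
depths = mean_isolation_depth(data)
most_isolated = depths.index(min(depths))
print(most_isolated, [round(d, 2) for d in depths])
```

A point far from the rest is usually cut off by the first random split, so its average depth stays close to 1, while clustered points need several more splits.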

As its result, the algorithm calculates an anomaly score, and data with high anomaly scores are declared outliers. In the present invention, outlier detection reflects the maximum number of outliers suggested by the other methodologies. iForest fixes the proportion of outliers in advance and declares the data in the corresponding top percentage of anomaly scores to be outliers, which has the difficulty that the approximate proportion of outliers must be decided before the search. For now, the procedure is run only as far as calculating the anomaly scores.

In the fifth and final step (S150), after the four outlier detection techniques above have been applied in sequence, entropy is calculated, on the basis of the detected outliers, for both the original data and the data set from which the detected outliers have been removed.

This is performed by the verification module 150, which includes an entropy calculation unit 151 that calculates entropy for the original data values and for the data set with the outlier data values removed, based on the first, second, third, and fourth outliers, and an analysis unit 152 that reinterprets the change in entropy in terms of a change in probability, and which compares the probability change with a set value to judge whether the outlier extraction is sound.

Calculating entropy requires an appropriate discretization of the data. The usual range over which the data occur is divided into an appropriate number of classes to build a discrete frequency distribution table, and the frequency falling in each class is counted.

Entropy is calculated from this table, and to interpret the change in entropy caused by the outliers, the change in entropy (the average information content) is reinterpreted as a change in the probability of the data. If this probability change is large, the outlier detection is judged to have worked well; otherwise, it is concluded that there is a problem with the outliers extracted by that methodology.
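The verification step can be sketched as follows. The conversion from entropy to a probability is an assumption of this sketch: it takes the representative probability to be 2 raised to the negative entropy in bits. The bin range, the bin count, the function names, and the data are also hypothetical.

```python
import math
from collections import Counter

def histogram_entropy(values, lo, hi, n_bins):
    """Shannon entropy (bits) of values binned into equal-width classes."""
    width = (hi - lo) / n_bins

    def bin_of(v):
        # Clamp so values outside [lo, hi) fall into the edge classes.
        return max(0, min(int((v - lo) / width), n_bins - 1))

    counts = Counter(bin_of(v) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def representative_probability(entropy_bits):
    """Assumed conversion: probability whose information equals the entropy."""
    return 2.0 ** (-entropy_bits)

# Hypothetical full data set and the same set with the detected outlier removed.
full = [0.5, 0.6, 0.7, 9.5]
reduced = [0.5, 0.6, 0.7]

h_full = histogram_entropy(full, lo=0.0, hi=10.0, n_bins=10)
h_reduced = histogram_entropy(reduced, lo=0.0, hi=10.0, n_bins=10)
delta_p = representative_probability(h_reduced) - representative_probability(h_full)
print(round(h_full, 4), abs(h_reduced), round(delta_p, 4))
```

Using the same classes for both data sets keeps the two entropies comparable; only the frequencies change when the outliers are removed.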

FIGS. 4 and 5 are sheets showing examples of outlier detection according to an experimental example of the present invention; the outliers reported by the different methodologies coincide, and the change in information content is also large.

As with other statistical hypothesis tests, how large the change in the representative probability value must be for the change in information content to count as significant depends on the judgment of the user or data analyst. If the probability distribution of the statistic could be derived exactly, a solution could exist, but without an assumption on that distribution a mathematical answer is difficult.

For example, if the data values are assumed to follow a normal distribution, the information content function can be derived in closed form.

Under the normality assumption, the average entropy increases as the variance grows and decreases as the variance shrinks. However, imposing a distributional assumption on the data values causes problems in handling those methodologies that do not assume normality.
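For the normal case mentioned above, the dependence on the variance can be written out explicitly. The following is the standard result for a normal density, assuming natural logarithms; the symbols $\mu$, $\sigma^2$, and $f$ are introduced here for illustration and may differ from the notation of the omitted derivation.

```latex
I(x) = -\ln f(x) = \tfrac{1}{2}\ln\!\left(2\pi\sigma^{2}\right) + \frac{(x-\mu)^{2}}{2\sigma^{2}},
\qquad
H(X) = \mathbb{E}\left[I(X)\right] = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^{2}\right).
```

Since $H(X)$ depends on the data only through $\sigma^{2}$ and increases with it, removing values far from the mean lowers the entropy under this assumption.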

To get around this, the present invention assumes that, when the average entropy is converted to a probability value, a probability change of about 1% per outlier constitutes a significant change.

Conversion to the probability viewpoint means treating the average information content as the representative value of the information content of the data and converting that value back into a probability for inspection. After applying this to real data, the analyst can use the converted probability figures, together with the outlier detection results, to estimate roughly what amount of information change is significant.

FIGS. 6 to 9 are sheets showing the analysis results for sensor data values according to an experimental example of the present invention.

FIG. 6 shows a result in which none of the methods detected any outlier. In FIG. 7, IQR and Z-score detected the same outliers while the generalized ESD test detected none; because the change in probability for the outliers detected by IQR and Z-score exceeded 5%, the two extracted values were accepted as outliers.

In FIG. 8, only the Z-score methodology detected a single outlier, and since the probability change did not exceed 1%, it was concluded that no outlier exists. In FIG. 9, Z-score detected two outliers while the other methodologies detected four, including the two found by Z-score; treating all four as outliers gave a probability change of more than 15%, so it was concluded that the four values are outliers.
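The acceptance rule applied to these cases can be sketched as a one-line comparison. The reading that the threshold scales with the number of outliers (about 1% each) is an interpretation, and the probability changes below are hypothetical stand-ins chosen only to satisfy the inequalities reported for FIGS. 7 to 9, not values taken from the figures.

```python
def extraction_is_significant(delta_p, n_outliers, per_outlier=0.01):
    """Accept the candidate outliers if the probability change reaches
    about per_outlier for each of them (assumed reading of the rule)."""
    if n_outliers == 0:
        return False
    return delta_p >= per_outlier * n_outliers

# Hypothetical probability changes consistent with the reported outcomes.
fig7 = extraction_is_significant(delta_p=0.06, n_outliers=2)   # over 5%
fig8 = extraction_is_significant(delta_p=0.008, n_outliers=1)  # under 1%
fig9 = extraction_is_significant(delta_p=0.16, n_outliers=4)   # over 15%
print(fig7, fig8, fig9)
```

The `per_outlier` parameter is left adjustable because the text notes that the significance level is ultimately a judgment of the analyst.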

Comparing the results of different outlier detection algorithms side by side in this way lends more confidence to each detected outlier, provides the probability change as supporting evidence, and removes the confusion and distrust that arise when individual methodologies yield different results.

The rights of the present invention are not limited to the embodiments described above but are defined by the claims, and it is evident that a person of ordinary skill in the art can make various modifications and adaptations within the scope of the claims.

110: first detection module 111: sorting unit
112: section setting unit 120: second detection module
130: third detection module 131: test statistics unit
132: threshold setting unit 140: fourth detection module
141: score calculation unit 150: verification module
151: entropy calculation unit 152: analysis unit

Claims

A system for detecting outliers of a sensor installed in a building to monitor its safety, the system comprising:
a first detection module 110 including a sorting unit 111 that sorts data values obtained from the sensor in ascending order and a section setting unit 112 that divides the sorted data values at set ratios to set a boundary section, the first detection module defining and storing data values outside the boundary section as first outliers;
a second detection module 120 that standardizes the data values obtained from the sensor under a normal distribution and extracts and stores second outliers by testing them at a set confidence level with reference to the standard normal distribution;
a third detection module 130 including a test statistics unit 131 that sets an upper limit on the number of outliers to the larger of the numbers of first outliers and second outliers plus 1, calculates the sample mean and sample standard deviation, finds the maximum deviation between an observed value and the sample mean, removes the corresponding data value from the sample, and recalculates the test statistic until the estimated maximum number of outliers is reached, and a threshold setting unit 132 that sets a threshold to be compared with the test statistic, the third detection module defining and storing a data value as a third outlier when the test statistic is greater than the set threshold;
a fourth detection module 140 including a score calculation unit 141 that calculates outlier scores using an isolation forest algorithm, the fourth detection module setting a ratio of outliers and defining and storing data values whose outlier scores fall within the set ratio as fourth outliers; and
a verification module 150 including an entropy calculation unit 151 that calculates entropy for the original data values and for the data set from which the outlier data values have been removed, based on the first, second, third, and fourth outliers, and an analysis unit 152 that reinterprets the change in entropy in terms of a change in probability, the verification module comparing the probability change with a set value to determine whether the outlier extraction is sound.

The sensor outlier detection system for building monitoring using an ensemble algorithm according to claim 1, wherein the sensor is a sensor capable of measuring vibration, cracks, and tilt.

The sensor outlier detection system for building monitoring using an ensemble algorithm according to claim 1, wherein the section setting unit 112 divides the data values sorted in ascending order into four parts, multiplies the difference between the 75% value and the 25% value by 1.5, and sets a boundary section whose maximum is that product added to the 75% value and whose minimum is that product subtracted from the 25% value.

The sensor outlier detection system for building monitoring using an ensemble algorithm according to claim 1, wherein the second detection module 120 tests for outliers at a confidence level of 95% to 99%.

(Deleted)