KR20170078252A

KR20170078252A - Method and apparatus for time series data monitoring

Info

Publication number: KR20170078252A
Application number: KR1020150188589A
Authority: KR
Inventors: 권순환; 김형찬; 김성희; 오규삼; 서범준; 이민형
Original assignee: 삼성에스디에스 주식회사
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2017-07-07
Also published as: KR102215690B1

Abstract

A time series data monitoring method is provided. A time series data monitoring method according to an embodiment of the present invention includes: predicting, from environmental data of a prediction period, the cluster of the measurement time series data of the prediction period, according to a result of analyzing the measurement time series data and environmental data of a training period; setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period that belong to the predicted cluster; and monitoring whether the actual measurement time series data of the prediction period satisfies the management range for each time point.

Description

Method and apparatus for time series data monitoring

The present invention relates to a method and an apparatus for monitoring time series data. More specifically, it relates to a method and an apparatus that predict the time series data of a specific period using the result of training on time series data generated during a certain past period, and that monitor the actual time series data on the basis of that prediction.

Time series data refers to data expressed as a function of time over a certain period. Such time series data can be predicted by analyzing past time series data. It can then be monitored whether the difference between the actually generated time series data and the predicted time series data exceeds a predetermined threshold, or whether it violates a rule specified by an expert.

Time series data monitoring rules that depend on experts have the following problems: i) because existing rules are applied only to a sensor's instantaneous value, amount of change, or statistics, they cannot reflect complex patterns such as time series changes in a process or in a start-up mode; ii) because the management range regarded as normal is the same at every managed time point, its reliability is questionable; and iii) an anomaly of the type in which small changes occur frequently is difficult to detect, even though the actual measurement time series data stays within the management range.

Korean Patent Laid-Open Publication No. 1998-7002852
Korean Patent Laid-Open Publication No. 2009-0073937

The technical problem to be solved by the present invention is to provide a time series data monitoring method and apparatus that dynamically set a management range for each time point according to the per-time-point variability of the measurement time series data of a training period.

Another technical problem to be solved by the present invention is to provide a time series data monitoring method and apparatus that can determine a global anomaly when minute abnormalities accumulate and exceed a threshold, even though the actual measurement time series data lies within the management range.

Yet another technical problem to be solved by the present invention is to provide a time series data monitoring method and apparatus that can indicate the time point responsible when minute abnormalities accumulate and exceed a threshold, even though the actual measurement time series data lies within the management range.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned here will be clearly understood by those skilled in the art from the description below.

A time series data monitoring method according to an embodiment of the present invention for solving the above technical problem includes: predicting, from environmental data of a prediction period, the cluster of the measurement time series data of the prediction period, according to a result of analyzing the measurement time series data and environmental data of a training period; setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period that belong to the predicted cluster; and monitoring whether the actual measurement time series data of the prediction period satisfies the management range for each time point.

In one embodiment, the prediction period includes a first time point and a second time point after the first time point, and setting the management range for each time point includes setting the management range of the first time point and the management range of the second time point to different values. Here, setting the two management ranges to different values includes setting the management range of the first time point to a larger value than that of the second time point when the variability at the first time point is greater than the variability at the second time point.

In one embodiment, the time series data monitoring method further includes predicting the measurement time series data of the prediction period using a regression model for the predicted cluster. Here, the monitoring includes determining whether the difference between the actual measurement time series data and the predicted measurement time series data at each time point is within the management range for that time point.
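The per-time-point check described above can be sketched as follows. This is a minimal illustration, assuming the management range is expressed as a maximum allowed absolute difference at each time point; the function name `find_range_violations` and the sample values are hypothetical and do not come from the source.

```python
def find_range_violations(actual, predicted, ranges):
    """Return the indices of time points where |actual - predicted| exceeds the range."""
    if not (len(actual) == len(predicted) == len(ranges)):
        raise ValueError("all three series must have the same length")
    return [
        t
        for t, (a, p, r) in enumerate(zip(actual, predicted, ranges))
        if abs(a - p) > r
    ]


# Illustrative values only (assumed, not taken from the source).
predicted = [10.0, 12.0, 15.0, 13.0]  # output of the predicted cluster's regression model
actual = [10.5, 12.2, 18.0, 13.1]     # measurements received during the prediction period
ranges = [1.0, 1.0, 2.0, 0.05]        # management range for each time point
violations = find_range_violations(actual, predicted, ranges)
print(violations)
```

With these values the third and fourth time points are reported. The fourth is flagged by a small deviation only because its range is narrow, which is the effect of setting ranges per time point.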

A time series data monitoring method according to an embodiment of the present invention for solving the above technical problem includes: predicting, from environmental data of a prediction period, the cluster of the measurement time series data of the prediction period, according to a result of analyzing the measurement time series data and environmental data of a training period; predicting the measurement time series data of the prediction period using a regression model for the predicted cluster; receiving the actual measurement time series data of the prediction period; obtaining representative time series data of the predicted cluster; and determining a global anomaly when the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained on the basis of the representative time series data, even though the actual measurement time series data lies within the management range.

In one embodiment, determining the global anomaly may include generating a histogram of the DTW (Dynamic Time Warping) distances between the representative time series data and each measurement time series data of the training period that belongs to the predicted cluster, and determining, as the threshold, a DTW distance that satisfies a predetermined requirement on the histogram. Here, determining the threshold may include choosing the DTW distance at which a predetermined ratio of all measurement time series data belonging to the cluster is included when counted in ascending order of DTW distance, or determining the threshold using the mean and standard deviation of the DTW distances on the histogram.
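The two threshold rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the DTW below is the textbook dynamic-programming form with absolute difference as the local cost, the multiplier `k` in the mean and standard deviation rule is a free parameter the source does not specify, and all function names and sample series are hypothetical.

```python
import math
import statistics


def dtw_distance(a, b):
    """Textbook DTW: accumulated absolute difference along the cheapest warping path."""
    n, m = len(a), len(b)
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def threshold_by_ratio(distances, ratio):
    """Smallest distance that covers at least `ratio` of the series in ascending order."""
    ordered = sorted(distances)
    count = max(1, math.ceil(ratio * len(ordered)))
    return ordered[count - 1]


def threshold_by_mean_std(distances, k=3.0):
    """Mean plus k population standard deviations (k is an assumed parameter)."""
    return statistics.mean(distances) + k * statistics.pstdev(distances)


# Illustrative cluster: a representative series and four training-period members.
representative = [0, 1, 2, 1, 0]
members = [[0, 1, 2, 1, 0], [0, 1, 3, 1, 0], [0, 0, 1, 2, 1], [1, 1, 2, 1, 1]]
distances = [dtw_distance(series, representative) for series in members]
print(distances, threshold_by_ratio(distances, 0.75), threshold_by_mean_std(distances, 2.0))
```

The ratio rule is insensitive to a few extreme training days, while the mean and standard deviation rule reacts to them; which behaviour is preferable depends on how clean the training data is.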

In one embodiment, determining the global anomaly includes generating a local cost matrix that holds the data difference values, by time point, between the predicted measurement time series data and the actual measurement time series data; searching for a minimum cost path on the local cost matrix; and generating an anomaly impact for each time point using the minimum cost path. Here, generating the anomaly impact for each time point may include summing, for each time point of the predicted measurement time series data, the data difference values that belong to the minimum cost path.
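The local cost matrix, minimum cost path, and per-time-point anomaly impact can be illustrated as follows. The sketch assumes absolute difference as the local cost and the usual DTW step pattern (diagonal, vertical, and horizontal moves); neither choice is stated in this passage, and the function names and sample values are hypothetical.

```python
import math


def local_cost_matrix(predicted, actual):
    """cost[i][j] = |predicted[i] - actual[j]| for every pair of time points."""
    return [[abs(p - a) for a in actual] for p in predicted]


def min_cost_path(cost):
    """Return the cheapest warping path from (0, 0) to the last cell as (i, j) pairs."""
    n, m = len(cost), len(cost[0])
    acc = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j] if i > 0 else math.inf,
                    acc[i][j - 1] if j > 0 else math.inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                )
            acc[i][j] = cost[i][j] + best
    # Backtrack from the last cell, always moving to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


def anomaly_impact(predicted, actual):
    """Sum the path's difference values per time point of the predicted series."""
    cost = local_cost_matrix(predicted, actual)
    impact = [0.0] * len(predicted)
    for i, j in min_cost_path(cost):
        impact[i] += cost[i][j]
    return impact


predicted = [0, 1, 2, 1, 0]
actual = [0, 1, 5, 1, 0]  # one spike at the third time point
impact = anomaly_impact(predicted, actual)
print(impact)
```

In this example all of the accumulated cost falls on the third time point, so that time point would be the one reported as the cause.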

A time series data monitoring apparatus according to an embodiment of the present invention for solving the above technical problem includes: a memory into which a computer program for analyzing the measurement time series data of a training period and predicting the measurement time series data of a prediction period is loaded; a processor that executes the computer program loaded in the memory; a network interface; and a storage that stores the measurement time series data received through the network interface, the environmental data, and data queried by the computer program.
The computer program includes training logic, prediction logic, and monitoring logic. The training logic includes an operation of clustering the measurement time series data of the training period, in units of a predetermined cycle, into a plurality of clusters; an operation of collecting a plurality of environmental data of the training period; and an operation of generating an optimal classification model that optimally classifies the clusters of the measurement time series data using at least some of the plurality of environmental data as factors. The prediction logic includes an operation of predicting the cluster of the measurement time series data of the prediction period using the optimal classification model, and an operation of predicting the measurement time series data of the prediction period using a regression model for the predicted cluster. The monitoring logic includes an operation of setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period that belong to the predicted cluster, and an operation of determining a global anomaly when the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained on the basis of the representative time series data, even though the actual measurement time series data lies within the management range.

FIG. 1 is a configuration diagram of a time series data prediction and monitoring system according to an embodiment of the present invention.
FIGS. 2 to 4 are flowcharts of a time series data prediction and monitoring method according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the measurement time series data referred to in some embodiments of the present invention.
FIG. 6 is a diagram for explaining the result of clustering the measurement time series data of FIG. 5 and generating representative time series data for each cluster.
FIG. 7 is a diagram for explaining, in some embodiments of the present invention, the storage format of the result of clustering the measurement time series data collected during a training period.
FIG. 8 is a diagram for explaining the multidimensional measurement time series data referred to in some embodiments of the present invention.
FIG. 9 is a diagram for explaining, in some embodiments of the present invention, the process of determining the number of clusters when clustering time series data.
FIG. 10 is a diagram for explaining the multidimensional environmental time series data among the environmental data referred to in some embodiments of the present invention.
FIG. 11 is a diagram for explaining the date attribute among the environmental data referred to in some embodiments of the present invention.
FIG. 12 is a diagram for explaining how environmental data is clustered in some embodiments of the present invention.
FIG. 13 is a diagram for explaining, among the environmental data referred to in some embodiments of the present invention, a representative value that represents a specific environment.
FIG. 14 is a diagram for explaining, in some embodiments of the present invention, the storage format of the result of clustering the environmental time series data among the environmental data.
FIGS. 15 and 16 are diagrams for explaining the classification model referred to in some embodiments of the present invention.
FIG. 17 is a diagram for explaining, in some embodiments of the present invention, how a regression model for predicting the measurement time series data of a prediction period is designated for each measurement time series data cluster.
FIGS. 18 and 19 are diagrams for explaining, in some embodiments of the present invention, how the per-time-point variability of the measurement time series data belonging to a specific measurement time series data cluster is evaluated.
FIG. 20 is a diagram for explaining how the management range for each time point is set according to an embodiment of the present invention.
FIG. 21 is a diagram for explaining, according to an embodiment of the present invention, why an anomaly determination is needed even when the actual measurement time series data lies within the management range.
FIG. 22 is a diagram for explaining how the threshold for determining a global anomaly is set according to an embodiment of the present invention.
FIGS. 23 to 25 are diagrams for explaining, according to an embodiment of the present invention, how the anomaly impact of each time point is evaluated when a global anomaly is determined and how the time point causing the problem is indicated.
FIG. 26 is a configuration diagram of a time series data monitoring apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention belongs; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the phrase specifically states otherwise.

For ease of understanding, the meanings of the terms used in this specification are explained before the embodiments of the present invention are described in earnest.

Measurement time series data: time series data of measurements taken by a sensor or the like. Measurement time series data may be divided by a predetermined cycle (for example, 24 hours). The sensor may be, for example, a temperature sensor, a brightness sensor, or a power usage sensor connected to a building management system; a temperature or pressure sensor provided in production equipment; or a temperature sensor, CPU usage sensor, memory usage sensor, storage I/O load sensor, or network usage sensor provided in a computing device. The sensors capable of generating measurement time series data may of course include measuring devices other than those listed above.

Environmental data: data on the various environments that can affect the measurement time series data. Environmental data can be divided into i) environmental time series data, ii) environmental representative values, and iii) environmental attribute values. For example, 24-hour temperature time series data and 24-hour humidity time series data correspond to environmental time series data; the average temperature and average humidity of each date correspond to environmental representative values; and whether each date is a holiday or a weekday corresponds to an environmental attribute value.

Training period: to predict time series data, data from a certain past period needs to be collected and learned through techniques such as machine learning. The training period refers to that past period, which is the subject of learning. The training period may end at the present; that is, current data may become a learning target as soon as it is collected. The measurement time series data of the training period, and the environmental time series data among the environmental data, may be clustered through learning.

Prediction period: using the learning result from the training period, the measurement time series data of a specific period can be predicted. In this specification, the period for which the measurement time series data is predicted is called the prediction period. The prediction period may be a specific period in the future, or it may be a specific period in the past when the purpose is to diagnose a period that has already elapsed.

Time series data monitoring system

Hereinafter, the configuration and operation of a time series data monitoring system according to an embodiment of the present invention are described with reference to FIG. 1. The time series data monitoring system according to this embodiment may include a measuring device (10) and a measurement monitoring apparatus (20).

The measuring device (10) is a device that generates measurement time series data. The measuring device (10) may transmit the generated measurement time series data to the measurement monitoring apparatus (20) and a terminal device (40) over a network. As already mentioned, the measuring device (10) may be, for example, a temperature sensor, a brightness sensor, or a power usage sensor connected to a building management system; a temperature or pressure sensor provided in production equipment; or a temperature sensor, CPU usage sensor, memory usage sensor, storage I/O load sensor, or network usage sensor provided in a computing device.

The environmental data management apparatus (30) generates or collects environmental data that can affect the measurement time series data, and provides it to the measurement monitoring apparatus (20).

The measurement monitoring apparatus (20) learns the measurement time series data and the environmental data of a training period, and uses the learning result to predict the measurement time series data of a prediction period.

Hereinafter, the data learning operations of the measurement monitoring apparatus (20) are described.

As a result of learning the data of the training period, the measurement time series data in units of a predetermined cycle may be clustered into a plurality of clusters, and representative time series data may be determined for each measurement time series data cluster.
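As a rough illustration of this step, the sketch below clusters equal-length daily series with a plain k-means and takes the point-wise mean of each cluster as its representative series. Both choices are assumptions made for the example: this passage does not name the clustering algorithm or say how the representative series is computed. The function names and the toy data are hypothetical, and the initialization (the first `k` series) is chosen only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pointwise_mean(series_list):
    """Average the series time point by time point."""
    return [sum(values) / len(values) for values in zip(*series_list)]


def kmeans(series_list, k, max_iterations=50):
    """Return (labels, centroids); centroids act as representative series."""
    centroids = [list(s) for s in series_list[:k]]  # deterministic initialization
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in series_list
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(series_list, labels) if lab == c]
            if members:
                centroids[c] = pointwise_mean(members)
    return labels, centroids


# Four toy "days", each with three time points.
days = [[1, 2, 3], [10, 11, 12], [1, 2, 4], [10, 12, 12]]
labels, representatives = kmeans(days, k=2)
print(labels, representatives)
```

A shape-aware distance such as DTW could replace `squared_distance` when similar days are shifted in time; that substitution is likewise not prescribed by this passage.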

Also, as a result of learning the data of the training period, the environmental time series data in units of a predetermined cycle may be clustered into a plurality of clusters, and representative time series data may be determined for each environmental time series data cluster. The measurement time series data and the environmental time series data are preferably clustered in the same manner.

Also, as a result of learning the data of the training period, an optimal classification model may be generated that takes environmental data as input and outputs a cluster of the measurement time series. The optimal classification model may be generated by i) selecting at least some of the collected environmental data as factors; ii) generating a classification model that optimally classifies the clusters of the measurement time series data in the space or plane formed by the axes representing those factors; iii) determining a performance indicator value for the generated classification model; and iv) repeating the factor selection, the model generation, and the performance evaluation while changing which factors are selected, and then choosing the optimal classification model from among the generated classification models on the basis of the performance indicator value.
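Steps i) to iv) amount to a search over factor subsets. The sketch below shows one way such a loop could look, assuming a nearest-centroid classifier as the classification model and validation accuracy as the performance indicator; the source specifies neither, and every name and data value here is hypothetical.

```python
from itertools import combinations


def fit_centroids(rows, labels, factors):
    """Nearest-centroid model: mean of the selected factors for each cluster label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        total = sums.setdefault(label, [0.0] * len(factors))
        for i, name in enumerate(factors):
            total[i] += row[name]
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict_cluster(model, row, factors):
    """Return the label whose centroid is closest in the selected factor space."""
    point = [row[name] for name in factors]
    return min(
        sorted(model),
        key=lambda label: sum((p - c) ** 2 for p, c in zip(point, model[label])),
    )


def accuracy(model, rows, labels, factors):
    hits = sum(predict_cluster(model, r, factors) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def select_best_model(train, train_labels, valid, valid_labels, candidates):
    """Try every non-empty factor subset and keep the first one with the best score."""
    best = None  # (score, factors, model)
    for size in range(1, len(candidates) + 1):
        for factors in combinations(candidates, size):
            model = fit_centroids(train, train_labels, factors)
            score = accuracy(model, valid, valid_labels, factors)
            if best is None or score > best[0]:
                best = (score, factors, model)
    return best


# Toy data: the measurement cluster depends on temperature, not on humidity.
train = [
    {"avg_temp": 5, "avg_humidity": 40},
    {"avg_temp": 7, "avg_humidity": 80},
    {"avg_temp": 25, "avg_humidity": 45},
    {"avg_temp": 27, "avg_humidity": 75},
]
train_labels = [0, 0, 1, 1]
valid = [{"avg_temp": 6, "avg_humidity": 70}, {"avg_temp": 26, "avg_humidity": 50}]
valid_labels = [0, 1]
best_score, best_factors, best_model = select_best_model(
    train, train_labels, valid, valid_labels, ["avg_temp", "avg_humidity"]
)
print(best_score, best_factors)
```

Exhaustive enumeration grows exponentially with the number of candidate factors, so a real system with many environmental variables would likely need a greedy or otherwise restricted search.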

Also, as a result of learning the data of the training period, a regression model for predicting the measurement data from the environmental data may be built for each measurement time series cluster. The regression model may be any one of various regression models, such as MARS (Multivariate Adaptive Regression Splines) or polynomial regression. Because a wide range of material on regression models and regression analysis is publicly available, a detailed description is omitted here; see, for example, https://en.wikipedia.org/wiki/Regression_analysis.
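To make the idea of one regression model per cluster concrete, the sketch below fits an ordinary least-squares line (a degree-1 polynomial) for each cluster and uses it to turn an environmental series into a predicted measurement series. This is a deliberate simplification of the MARS and polynomial models mentioned above, and the names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_models_per_cluster(samples):
    """samples: iterable of (cluster_id, environment_value, measured_value)."""
    grouped = {}
    for cluster_id, x, y in samples:
        xs, ys = grouped.setdefault(cluster_id, ([], []))
        xs.append(x)
        ys.append(y)
    return {cluster_id: fit_line(xs, ys) for cluster_id, (xs, ys) in grouped.items()}


def predict_series(models, cluster_id, environment_series):
    """Apply the cluster's line to each value of the environmental series."""
    slope, intercept = models[cluster_id]
    return [slope * x + intercept for x in environment_series]


# Toy training samples: cluster 0 follows y = 2x + 1, cluster 1 follows y = -x + 10.
samples = [
    (0, 0, 1), (0, 1, 3), (0, 2, 5),
    (1, 0, 10), (1, 2, 8), (1, 4, 6),
]
models = fit_models_per_cluster(samples)
forecast = predict_series(models, 0, [3, 4])
print(models, forecast)
```

Keeping a separate model per cluster lets each operating pattern have its own relationship between environment and measurement, which is the point of predicting the cluster first.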

Hereinafter, the operations of the measurement monitoring apparatus (20) related to predicting measurement time series data are described.

The measurement monitoring apparatus (20) inputs the predicted environmental data of the prediction period into the optimal classification model to predict the measurement time series data cluster of the prediction period. The predicted environmental data of the prediction period may be weather forecast information such as average temperature, average humidity, and average wind speed. The measurement monitoring apparatus (20) may receive the predicted environmental data of the prediction period from the environmental data management apparatus (30).

The measurement monitoring apparatus (20) may transmit information about the predicted measurement time series data cluster, such as its representative time series data, to the terminal device (40) over the network.

If the factors of the optimal classification model include an environmental data time series, the measurement monitoring apparatus (20) determines which of the environmental data time series clusters obtained by learning over the training period the predicted environmental data time series of the prediction period (for example, a predicted daily temperature time series) belongs to. The cluster can be determined quickly by comparing the predicted environmental data time series with the representative time series data of each environmental data time series cluster. The measurement monitoring apparatus (20) then inputs the identifier (for example, an index value) of the determined environmental data time series cluster into the optimal classification model to predict the measurement time series data cluster of the prediction period.
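Comparing a forecast series only against each cluster's representative series, rather than against every training series, can be sketched as below. Squared Euclidean distance is an assumption made for brevity (the passage does not name the distance measure), and the cluster identifiers and values are hypothetical.

```python
def nearest_cluster(series, representatives):
    """Return the id of the cluster whose representative series is closest.

    representatives: dict mapping cluster id -> representative series.
    """
    def distance(representative):
        return sum((x - y) ** 2 for x, y in zip(series, representative))

    # Iterating over sorted ids makes ties resolve deterministically.
    return min(sorted(representatives), key=lambda cid: distance(representatives[cid]))


# Two environmental (temperature) clusters learned during training, as toy values.
temperature_representatives = {
    "T0": [0, 5, 10, 5],
    "T1": [10, 15, 20, 15],
}
forecast = [1, 6, 9, 4]  # predicted temperature series for the prediction period
cluster_id = nearest_cluster(forecast, temperature_representatives)
print(cluster_id)
```

The returned identifier is what would then be passed to the classification model as a factor value.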

The measurement monitoring device 20 predicts the measurement time series data for the prediction period using the regression model for the predicted measurement time series data cluster. The regression model receives time series data for a first environment (for example, temperature) as a factor and outputs the corresponding measurement time series data. The regression model may additionally receive at least one of a time series data cluster identifier for a second environment (for example, humidity), a representative value for a third environment (for example, solar radiation), and an environmental attribute (for example, weekday or holiday).

The measurement monitoring device 20 may transmit the predicted measurement time series data (for example, the predicted energy consumption time series for the 24 hours of tomorrow) to the terminal device 40 over the network.

Hereinafter, the data monitoring operations of the measurement monitoring device 20 are described.

The measurement monitoring device 20 sets a management range for each time point according to the variability, at that time point, of the measurement time series data belonging to the predicted measurement time series data cluster. For example, if analysis of the training-period measurement time series data belonging to the cluster shows greater variability at 3 p.m. than at 10 a.m., the measurement monitoring device 20 sets a wider management range at 3 p.m. than at 10 a.m. Because the learning result for the training period indicates that the data deviated more at 3 p.m. than at 10 a.m., setting a wider management range at 3 p.m. prevents unnecessary alarms.
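The per-time-point range described above can be sketched as follows. The text does not state how variability is turned into a range, so this sketch assumes a band of mean plus or minus a multiple of the standard deviation at each time point; the function names and the `width` parameter are hypothetical.

```python
from statistics import mean, pstdev


def management_ranges(cluster_series, width=3.0):
    """Return one (lower, upper) bound per time point.

    cluster_series: equal-length training series from one cluster.
    width: hypothetical multiplier applied to the per-time-point
           standard deviation (an assumption, not from the source).
    """
    ranges = []
    for values in zip(*cluster_series):  # walk the series time point by time point
        centre = mean(values)
        spread = pstdev(values)
        ranges.append((centre - width * spread, centre + width * spread))
    return ranges


def out_of_range(actual, ranges):
    """Indices of time points where the actual value leaves its range."""
    return [t for t, (x, (lo, hi)) in enumerate(zip(actual, ranges))
            if not lo <= x <= hi]
```

A time point whose training values scatter more gets a wider band, so the same absolute deviation may raise an alarm at a stable hour and pass at a volatile one.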

Even when the actual measurement time series data lies within the management range, the measurement monitoring device 20 may determine a global anomaly if the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained from the representative time series data. According to this embodiment, a global anomaly is reported when small deviations accumulate and exceed the threshold even though the actual measurement time series data stays within the management range, which allows an administrator managing the time series data to monitor it at a finer level than in the prior art.
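A minimal sketch of the global anomaly check follows. The text only says that the threshold is obtained from the representative time series data; the rule used here (a fixed fraction of the representative series' total magnitude) and the `ratio` parameter are assumptions made for illustration.

```python
def is_global_anomaly(actual, predicted, representative, ratio=0.1):
    """Flag a global anomaly from the accumulated prediction error.

    ratio is a hypothetical tuning parameter: the threshold is taken as
    ratio times the total magnitude of the representative series.
    """
    # Small per-point deviations add up over the whole period.
    accumulated = sum(abs(a - p) for a, p in zip(actual, predicted))
    threshold = ratio * sum(abs(r) for r in representative)
    return accumulated > threshold
```

This check is independent of the per-time-point management range, so a series that never leaves its range can still be flagged when it drifts consistently away from the prediction.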

Although FIG. 1 shows the measurement monitoring device 20 and the environmental data management device 30 as physically separate, in some embodiments the environmental data management device 30 may be configured as a module inside the large-capacity file generation system 300.

Time series data monitoring method

Hereinafter, a time series data monitoring method according to an embodiment of the present invention is described with reference to FIGS. 2 to 25. The method according to this embodiment may be executed by a computing device, for example by the measurement monitoring device 20 described with reference to FIG. 1.

The time series data monitoring method according to this embodiment includes an operation of learning the training-period data, an operation of predicting the measurement time series data for the prediction period using the learning result, and an operation of receiving and monitoring the actual measurement time series data. The learning operation is described with reference to FIGS. 2 and 3, and the prediction and monitoring operations are then described with reference to FIG. 4.

Referring to FIG. 2, the measurement time series data and a plurality of environmental data for the training period are received (S100, S102). The plurality of environmental data may include time series data or a representative value indicating a first environment (for example, temperature), time series data or a representative value indicating a second environment (for example, humidity), and a value indicating an environmental attribute (for example, holiday or weekday). During training, the received measurement time series data and the time series portion of the environmental data are each clustered into groups of similar series (S104, S106). The clustering process (S104, S106) is described in detail below.

The received measurement time series data is processed in units of a predefined period. For example, if the period is 24 hours, the measurement time series data may be split at 00:00. The period may be set differently for each kind of measurement time series data. For example, time series data of energy consumption in a building may be split into 24-hour units, while data on the travel distance of the building's elevators may be split into one-week units.
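Splitting a continuous series into periods can be sketched as below. The sketch assumes evenly spaced samples that start on a period boundary; the function name is hypothetical.

```python
def split_into_periods(samples, period_length):
    """Cut a flat sample list into consecutive periods of equal length.

    A trailing incomplete period is dropped.
    """
    full = len(samples) // period_length
    return [samples[i * period_length:(i + 1) * period_length]
            for i in range(full)]
```

With hourly samples, `period_length=24` yields one series per day and `period_length=168` yields one series per week.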

The measurement time series data of each period is assigned to one of a plurality of clusters through clustering. FIG. 5 shows energy usage time series data split into 24-hour units and overlaid. Time series data such as that shown in FIG. 5 can be clustered by well-known clustering logic such as k-means. k-means groups the given data into k clusters and works by minimizing the within-cluster variance, that is, the squared distance between each data point and the center of its cluster. It is a form of unsupervised learning that assigns labels to unlabeled input data, and its structure resembles clustering based on the EM algorithm. Because k-means performs well on time series data, this embodiment improves clustering quality by using it.
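For illustration, k-means over per-period series can be sketched in a few lines. This sketch assumes equal-length series, Euclidean distance, and the first k series as initial centroids (a simplification that keeps the result deterministic); none of these choices is specified in the text.

```python
def kmeans(series, k, iterations=20):
    """Cluster equal-length series with plain k-means.

    Returns (labels, centroids). Initial centroids are the first k
    series, which is an assumption of this sketch.
    """
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(s, centroids[c])))
            for s in series
        ]
        # Update step: each centroid becomes the point-wise mean of its members.
        for c in range(k):
            members = [s for s, l in zip(series, labels) if l == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids
```

The returned label of each period plays the role of the cluster index stored alongside that period's data.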

According to other embodiments, various clustering logic other than k-means may also be used. For information on clustering logic, see the web document 'https://en.wikipedia.org/wiki/Cluster_analysis'.

In one embodiment, after clustering, representative time series data for the series in each cluster may be selected using time series averaging logic. Various well-known time series averaging logic, such as DTW Barycenter Averaging (DBA), may be used. For DBA, see well-known papers such as 'F. Petitjean, A. Ketterlin, P. Gancarski, A global averaging method for dynamic time warping, with applications to clustering'. FIG. 6 shows the measurement time series data of FIG. 5 clustered into five clusters, with the representative time series data extracted for each cluster.

DBA effectively extracts representative time series data within the clusters produced by k-means. This embodiment combines k-means clustering with DBA-based extraction of each cluster's representative time series data to provide optimal clustering and optimal extraction of the representative time series data.

FIG. 7 shows how measurement time series data with a 24-hour period is stored by date. As shown in FIG. 7, the measurement time series data of each period may be stored together with a cluster index that serves as the cluster identifier. In addition, the representative time series data of each cluster may be stored as a result of clustering. The collected measurement time series data may also be multidimensional time series data composed of n (n >= 2) different measurement time series, as shown in FIG. 8.

When clustering time series data, the number of clusters must be chosen. With too few clusters, the time series data within each cluster is not homogeneous enough; with too many, clustering becomes inefficient. Choosing an appropriate number of clusters is therefore important for clustering quality. In one embodiment of the present invention, the number of clusters is finalized based on the sum of the similarities between each cluster's representative time series data and the measurement time series data of each period belonging to that cluster. The similarity sum can be computed with any of various logic for calculating the difference between time series, for example the DTW distance.
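The similarity sum can be sketched with a textbook DTW distance. The absolute-difference local cost and the function names are assumptions of this sketch; the text does not fix a particular DTW variant.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def total_distance(series, labels, representatives):
    """Sum of DTW distances between each series and its cluster representative."""
    return sum(dtw_distance(s, representatives[l])
               for s, l in zip(series, labels))
```

Evaluating `total_distance` once per candidate number of clusters gives the curve from which the final cluster count is chosen.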

In the case shown in FIG. 9, the sum of DTW distances between each cluster's representative time series data and the measurement time series data of each period in that cluster falls sharply as the number of clusters increases from 1 to 5, and once the number of clusters reaches 5 the further decrease becomes negligible. That is, in the case of FIG. 9, increasing the number of clusters beyond 5 has little effect on clustering quality. Accordingly, in one embodiment, if the decrease in the DTW distance sum is at or above a reference value as the number of clusters increases from 1 to k, but falls below the reference value as the number of clusters increases beyond k, the number of clusters may be finalized as k.
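The stopping rule above can be sketched as follows, given the distance sums already computed for 1, 2, 3, ... clusters. The function name and the `min_drop` parameter (standing in for the reference value) are hypothetical.

```python
def choose_cluster_count(totals, min_drop):
    """Pick k from distance totals, where totals[i] belongs to k = i + 1.

    Returns the first k after which adding one more cluster lowers the
    total by less than min_drop; falls back to the largest k tried.
    """
    for i in range(len(totals) - 1):
        if totals[i] - totals[i + 1] < min_drop:
            return i + 1
    return len(totals)
```

For a made-up curve such as `[100, 60, 35, 20, 10, 9, 8.5]` with `min_drop=5`, the drops are 40, 25, 15, 10, and then 1, so the function settles on five clusters.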

When the measurement time series data is multidimensional time series data composed of two or more individual measurement time series, the number of clusters may be finalized based on the sum of similarities (for example, DTW distances) computed by MD-DTW (Multi-Dimensional Dynamic Time Warping) logic between each cluster's representative time series and the measurement time series data of each period in that cluster. Because the time series data monitoring method according to the present invention can cluster multidimensional time series data and generate each cluster's representative time series data in the same way as for one-dimensional data, it extends to multidimensional time series data. That is, this embodiment allows multidimensional time series data to be used as a factor for predicting the cluster of the measurement time series data for the prediction period.

As already mentioned, the time series portion of the environmental data is clustered in the same way as the measurement time series data, and the representative time series data of each cluster is extracted. FIG. 12 shows the result of clustering summer and winter temperature time series data and extracting the representative time series data of each cluster.

The time series portion of the environmental data is clustered because, by the nature of time series data, exactly identical series are unlikely to recur. The environmental data is therefore clustered so that environmental time series data can be included as a factor for predicting the cluster of the measurement time series data. The identifier (for example, the index) of each cluster may be used as a factor of the optimal classification model for predicting the cluster of the measurement time series data. The optimal classification model is described in detail later with reference to FIGS. 3, 15, and 16.

FIG. 10 shows multidimensional environmental time series data. For example, for n-dimensional environmental time series data comprising n different environmental time series, clustering it as a single n-dimensional series may cluster each day's environment more effectively than splitting it into n one-dimensional series and clustering them separately. Multidimensional environmental time series data therefore also needs to be usable as a factor of the optimal classification model. Multidimensional environmental time series data can be clustered, and its representative time series data extracted, using the same method already described for multidimensional measurement time series data.

As already mentioned, the environmental data collected and learned in some embodiments of the present invention also includes data that is not time series data. For example, data indicating an attribute value of the environment (for example, whether each date in FIG. 11 is a Saturday, weekday, or holiday) or data indicating a representative value of each environment (for example, the daily average temperature, humidity, and atmospheric pressure in FIG. 13) may be included in the environmental data. According to one embodiment, non-time-series environmental data may also be clustered by a well-known clustering method, and the representative value of each cluster may be extracted.

FIG. 14 shows how environmental time series data with a 24-hour period is stored for each period. As shown in FIG. 14, the environmental time series data of each period may be stored together with a cluster index that serves as the cluster identifier. In addition, the representative time series data of each cluster may be stored as a result of clustering.

Returning to FIG. 2, the operations after clustering are described. When clustering is complete, an optimal model for obtaining the cluster of the measurement time series data is generated (S108). This model is the optimal classification model that best separates the clusters of the measurement time series data on a plane or in a space whose axes are at least some of the received environmental data.

For example, if the first axis is the temperature time series data cluster and the second axis is the humidity time series data cluster, and the training-period measurement time series data is plotted on the plane formed by the two axes, a reference line that best separates the clusters of the measurement time series data can be drawn on that plane. With this reference line, the cluster of the measurement time series data can be obtained by inputting the temperature time series data cluster and the humidity time series data cluster for the prediction period. The optimal model for obtaining the cluster of the measurement time series data is therefore the optimal classification model that best separates those clusters on the plane or in the space whose axes are at least some of the received environmental data.

The operation of generating the optimal classification model (S108) is described in more detail with reference to FIG. 3.

First, the environmental data to be used as factors is selected from the plurality of environmental data. For example, if three kinds of environmental data (A, B, C) have been collected, there are seven possible selections (A, B, C, AB, AC, BC, ABC). The measurement time series data is assumed not to depend on only a single kind of environmental data. If two kinds of environmental data are selected as factors, they define a plane on which the measurement time series data of each period of the training period can be plotted.
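The seven selections for three kinds of environmental data can be enumerated as in the sketch below. The function name and the `min_size` parameter are hypothetical; `min_size` only illustrates how the number of factors could be restricted.

```python
from itertools import combinations


def factor_subsets(factors, min_size=1):
    """All subsets of the candidate factors with at least min_size members."""
    return [combo
            for size in range(min_size, len(factors) + 1)
            for combo in combinations(factors, size)]
```

`factor_subsets(["A", "B", "C"])` yields the seven selections listed above; in general n candidates give 2^n - 1 non-empty selections, so an exhaustive search is practical only for a modest number of environmental data kinds.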

FIG. 15 shows the case in which two environmental time series data are selected: on a plane formed by a first axis indicating the cluster index of the first environmental time series data and a second axis indicating the cluster index of the second environmental time series data, the measurement time series data of each period of the training period is plotted as the index number of its cluster. If the training-period data has been processed as in Table 1 below, the clusters of the measurement time series data can be displayed as in FIG. 15.

Table 1

Period | First environmental time series data cluster index | Second environmental time series data cluster index | Measurement time series data cluster index
1 | 1 | 1 | 1
2 | 2 | 1 | 1
3 | 1 | 2 | 1
4 | 2 | 2 | 1
5 | 3 | 1 | 1
6 | 4 | 1 | 2
7 | 3 | 2 | 2
8 | 1 | 3 | 2
9 | 2 | 3 | 2
10 | 3 | 3 | 2
11 | 4 | 2 | 2
12 | 5 | 1 | 2
13 | 6 | 1 | 2
14 | 5 | 2 | 2
15 | 4 | 3 | 2
16 | 5 | 3 | 2
17 | 6 | 3 | 2
18 | 6 | 2 | 2

On the plane shown in FIG. 15, the problem of optimally classifying the clusters of the measurement time series data can be solved with various classification logic such as SVM (Support Vector Machine) logic or decision tree logic. That is, embodiments of the present invention extend to generating, with any of the classification logic introduced in, for example, the web document 'https://en.wikipedia.org/wiki/Statistical_classification', a model that optimally classifies the cluster of each period's measurement time series data on the plane or in the space formed by the environmental data. For ease of understanding, an embodiment using SVM logic is described below.

FIG. 16 shows the case in which one environmental time series (temperature time series data) and one environmental representative value (average humidity) are selected as factors. As already described, environmental time series data cannot be plotted on an axis as it is, so the first axis indicates the cluster index of the environmental time series data. Running SVM logic yields a hyperplane 63 that optimally separates the two different groups of data on the plane (the measurement time series data of the first cluster and that of the second cluster). The maximum margin is the distance between the two vectors 61 and 62 that are parallel to the hyperplane 63 and pass through the data points closest to it.

With the two factors selected as in FIG. 16, the hyperplane 63 shown in FIG. 16 is the optimal classification model. With a different selection of factors, however, a classification model that separates the measurement time series data of the first cluster from that of the second cluster even better might be generated. In one embodiment, the performance index of a classification model is the maximum margin of the hyperplane generated by the SVM logic; the larger the maximum margin, the better the performance index.
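The margin used as the performance index can be computed geometrically once a separating hyperplane is known. The sketch below assumes the hyperplane w.x + b = 0 has already been found by some SVM solver and lies midway between the two groups, so the margin is twice the distance to the nearest point; the function name is hypothetical and no solver is included.

```python
from math import sqrt


def margin_width(points, w, b):
    """Width of the empty band around the hyperplane w.x + b = 0.

    Assumes the hyperplane is centred between the two groups, as a
    maximum-margin hyperplane is, so the band is twice the distance
    from the hyperplane to its nearest point.
    """
    norm = sqrt(sum(c * c for c in w))
    nearest = min(abs(sum(wi * xi for wi, xi in zip(w, p)) + b) / norm
                  for p in points)
    return 2 * nearest
```

For a hard-margin SVM scaled so that the nearest points satisfy |w.x + b| = 1, this value equals 2 / ||w||.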

Therefore, the optimal classification model that best separates the clusters of the measurement time series data can be found by trying various combinations of the collected environmental data and choosing the combination with the largest maximum margin.

Returning to FIG. 3, the environmental data to be used as factors is selected from the plurality of environmental data (S180), a classification model is generated in the space (when three or more factors are selected) or on the plane (when two factors are selected) formed by the axes of the selected factors (S182), and the performance index value of the classification model (the maximum margin, when SVM logic is used) is determined (S184). Until no unexamined factor selection remains (S186), the factor selection is changed (S188), a classification model is generated with the selected factors (S182), and the performance index value of the generated model is determined (S184), repeatedly.

When selecting factors, every way of choosing at least some of the plurality of environmental data may be allowed, the range of the number of selectable factors may be specified, or the selectable data may be restricted to particular types (for example, environmental time series data and environmental representative values).

After all factor selections have been examined, the performance index values of the classification models generated for the respective selections are compared, and the classification model with the highest performance index value is chosen as the optimal classification model (S189).
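The loop of steps S180 to S189 can be sketched as a search over factor selections. Building and scoring a classification model is left to a caller-supplied `score` function, which is a hypothetical placeholder for "train a classifier on this selection and return its performance index".

```python
def select_best_factors(candidates, score):
    """Return (best_subset, best_score) over the candidate factor selections.

    score is a caller-supplied function that builds a classification
    model for one selection and returns its performance index
    (for SVM logic, the maximum margin).
    """
    best_subset, best_score = None, float("-inf")
    for subset in candidates:       # S180 / S188: take the next selection
        value = score(subset)       # S182 and S184: build the model, measure it
        if value > best_score:
            best_subset, best_score = subset, value
    return best_subset, best_score  # S189: the highest index wins
```

For example, with made-up margins `{("A",): 0.5, ("A", "B"): 1.8, ("B",): 0.7}` passed as `candidates=list(margins)` and `score=margins.get`, the selection `("A", "B")` is returned.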

Next, as part of training, a regression model is built for each measurement time series cluster; it takes as input the environmental data of the periods belonging to that cluster and outputs measurement time series data. That is, building the regression model includes building the regression model for a first measurement time series data cluster using only the data of the periods clustered into the first cluster, without using the data of the periods clustered into a second measurement time series data cluster. For example, in the case of Table 1, only the environmental data of periods 1 to 5 is used to build the regression model for measurement time series data cluster 1.
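Per-cluster model building can be sketched as below. For brevity the sketch swaps in a one-variable ordinary least squares line (measurement against one environmental series) in place of the richer regression models the embodiment contemplates; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_per_cluster(env_series, meas_series, meas_labels):
    """One (slope, intercept) model per measurement cluster.

    Only the periods labelled with a cluster contribute to that
    cluster's fit; periods from other clusters are ignored.
    """
    models = {}
    for cluster in set(meas_labels):
        xs, ys = [], []
        for env, meas, label in zip(env_series, meas_series, meas_labels):
            if label == cluster:
                xs.extend(env)
                ys.extend(meas)
        models[cluster] = fit_line(xs, ys)
    return models
```

Keeping the fits separate means each model only has to describe the behaviour of periods that already look alike, which is the point of regressing within a cluster.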

The regression model may be built by applying various logic presented in, for example, the web document 'https://en.wikipedia.org/wiki/Regression_analysis'. For example, it may be any of various regression models such as MARS (Multivariate Adaptive Regression Splines) or polynomial regression.

The regression model has the first environmental time series data among the environmental data as its first independent variable, because at least one time series that varies over time must be input in order to output measurement time series data.

The regression model may have, as additional independent variables, at least one of the cluster identifier of second environmental time series data different from the first environmental time series data, a representative value of a particular environment for each period (for example, average temperature), and data indicating an environmental attribute (for example, weekday or holiday).

Hereinafter, the operations of predicting and monitoring the measurement time series data for the prediction period are described with reference to FIG. 4.

Environmental data for the prediction period is received (S200). The received environmental data may be predicted values, for example weather forecast information. The weather forecast information may include, for example, the average temperature and average humidity of the prediction period and predicted temperature time series data over time. The environmental data preferably includes all the data used as factors of the optimal classification model for the time series data to be predicted.

If environmental time series data is included among the factors of the optimal classification model, it is determined which of the clusters of the environmental time series data is closest to the time series forecast for that environmental time series data (S202).

As already described, representative time series data of each cluster is extracted when the environmental time series data is clustered (S106). When determining the cluster corresponding to the environmental time series data of the prediction period (S202), the environmental time series data of the prediction period need not be compared with all of the data belonging to each cluster; it only needs to be compared with the representative time series data of each cluster. That is, the cluster of environmental time series data to which the environmental time series data of the prediction period belongs is selected based on a similarity computed by difference-value calculation logic between the environmental time series data of the prediction period and the representative time series of each cluster.

The similarity may be obtained by any of various logics that compute a difference value between time series, for example DTW (Dynamic Time Warping) difference-value calculation logic applied to the environmental time series data of the prediction period and the representative time series data of each cluster of environmental time series data.

For example, if there are 10 clusters, only 10 DTW comparisons need to be performed in the cluster determination step (S202). The present embodiment therefore has the effect of quickly determining the cluster corresponding to the environmental time series data of the prediction period.
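The cluster determination of step S202 can be illustrated with a minimal sketch. It assumes an absolute-difference local cost and the textbook dynamic-programming form of DTW; the function names `dtw_distance` and `nearest_cluster` and the sample values are hypothetical and do not come from the specification.

```python
from typing import Dict, List


def dtw_distance(a: List[float], b: List[float]) -> float:
    """Classic dynamic-programming DTW with an absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_cluster(series: List[float], representatives: Dict[int, List[float]]) -> int:
    """Return the identifier of the cluster whose representative series is closest.

    Only one DTW computation per cluster is needed, regardless of cluster size.
    """
    return min(representatives, key=lambda cid: dtw_distance(series, representatives[cid]))


# Hypothetical representative temperature series for two clusters.
representatives = {0: [10.0, 11.0, 12.0], 1: [21.0, 22.0, 24.0]}
forecast = [20.0, 22.0, 25.0]
cluster_id = nearest_cluster(forecast, representatives)
```

With 10 clusters, `nearest_cluster` would evaluate `dtw_distance` exactly 10 times, which is the saving described above.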

By inputting the environmental data of the prediction period into the factors of the optimal classification model, the cluster of the measurement time series data of the prediction period is predicted (S204). As already mentioned, when the factors of the optimal classification model include environmental time series data, what is input is not the environmental time series data itself but the cluster identifier (for example, the cluster index) of the environmental time series data.

When the environmental data of the prediction period is input into the regression model corresponding to the predicted cluster of measurement time series data, the measurement time series data of the prediction period is obtained (S206). As shown in FIG. 17, according to the present embodiment, a different measurement time series data cluster means that a different regression model is applied. For example, when energy usage is the measurement time series data to be predicted and the energy usage data cluster of the prediction period is predicted to be #1, model No. 1 in the MARS (Multivariate Adaptive Regression Splines) model format may be used as the regression model. When the energy usage data cluster of the prediction period is predicted to be #2, the regression model changes to model No. 2.

Meanwhile, when the measurement time series data to be predicted is different, a regression model of a different model format may be applied. For example, FIG. 17 shows that polynomial regression is used for water usage time series data.
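The selection of a regression model per measurement type and per predicted cluster can be sketched as a simple lookup. This is only an illustration under stated assumptions: polynomial models stand in for every entry (including energy usage, for which the specification names MARS), and the registry `MODELS`, the coefficients, and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def make_polynomial(coeffs: List[float]) -> Callable[[float], float]:
    """Return p(x) = coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... (Horner's rule)."""
    def p(x: float) -> float:
        result = 0.0
        for c in reversed(coeffs):
            result = result * x + c
        return result
    return p


# Hypothetical registry: (measurement type, cluster id) -> fitted regression model.
# The coefficients are made up; in practice they would come from training.
MODELS: Dict[Tuple[str, int], Callable[[float], float]] = {
    ("energy", 1): make_polynomial([100.0, 2.0]),
    ("energy", 2): make_polynomial([80.0, 3.0]),
    ("water", 1): make_polynomial([10.0, 0.5, 0.01]),
}


def predict_series(measurement: str, cluster_id: int, env_series: List[float]) -> List[float]:
    """Apply the model registered for the predicted cluster to an environmental series."""
    model = MODELS[(measurement, cluster_id)]
    return [model(x) for x in env_series]


energy_forecast = predict_series("energy", 1, [10.0, 20.0])
```

Changing the predicted cluster from 1 to 2 changes only the lookup key, which mirrors how the applied model switches with the cluster.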

Hereinafter, operations related to monitoring the actual measurement time series data are described.

The variability at each time point of the measurement time series data belonging to the cluster of measurement time series data predicted in step S204 is evaluated (S208). FIG. 18 shows the measurement time series data belonging to a cluster of a particular measurement (energy usage), overlaid on one another. In evaluating the variability, the variance or standard deviation of the measurement time series data belonging to the cluster is computed for each time point. That is, in this case, the variance or standard deviation is the variability at each time point. FIG. 19 shows the standard deviation computed for each time point (time of day) from the data of FIG. 18. FIG. 19 shows that the standard deviation is at its minimum at 6 o'clock (reference numeral 60) and at its maximum at 9 o'clock (reference numeral 71).
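The per-time-point variability evaluation (S208) can be sketched as follows. The sketch assumes that all series in the cluster have the same length and uses the population standard deviation; the specification does not say whether the population or the sample form is intended, and the name `pointwise_std` is hypothetical.

```python
from statistics import pstdev
from typing import List


def pointwise_std(cluster_series: List[List[float]]) -> List[float]:
    """Standard deviation across the cluster's series, computed per time point."""
    # zip(*...) regroups the values so that each tuple holds one time point.
    return [pstdev(values) for values in zip(*cluster_series)]


# Two hypothetical daily series with two time points each.
sigma = pointwise_std([[1.0, 2.0], [3.0, 2.0]])
```

In this toy input the series disagree at the first time point and coincide at the second, so the first time point receives the larger variability.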

Next, management information for each time point is generated (S210) using the result of the per-time-point variability evaluation (S208). For example, the management information for each time point may be generated as follows.

U(t) = P(t) + ασ(t)

L(t) = P(t) - ασ(t)

In the above equations, U(t) is the upper limit value at time point t, L(t) is the corresponding lower limit value, α is a coefficient that can be set by an administrator, and σ(t) is the standard deviation at time point t of the measurement time series data belonging to the predicted cluster of measurement time series data, obtained as a result of the variability evaluation (S208).
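The two limit equations can be evaluated directly, as in the following sketch. It assumes that P(t) denotes the predicted measurement value at time point t, which the text above does not define explicitly; the function name `management_range` and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def management_range(
    predicted: List[float], sigma: List[float], alpha: float
) -> Tuple[List[float], List[float]]:
    """Return (L, U) with L(t) = P(t) - alpha*sigma(t) and U(t) = P(t) + alpha*sigma(t)."""
    lower = [p - alpha * s for p, s in zip(predicted, sigma)]
    upper = [p + alpha * s for p, s in zip(predicted, sigma)]
    return lower, upper


# Assumption: P(t) is the predicted value; alpha = 3 is an arbitrary administrator choice.
lower, upper = management_range([10.0, 20.0], [1.0, 2.0], 3.0)
```

The band width at each time point is 2ασ(t), so a time point with a larger σ(t) automatically receives a wider range.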

FIG. 20 shows the result of generating management information for each time point, reflecting the result of the variability evaluation (S208) shown in FIG. 19. As shown in FIG. 20, the management range 72 is set narrowest at 6 o'clock, when variability was lowest during the training period, and the management range 73 is set widest at 9 o'clock, when variability was highest. After the management range is set as in FIG. 20, when actual measurement time series data is received (S214), abnormal pattern monitoring for each time point (S216) is performed by determining whether the difference between the actual measurement time series data and the predicted measurement time series data at each time point is within the management range for that time point.
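The per-time-point check of step S216 can be sketched as a comparison of the absolute deviation from the prediction against ασ(t). This reading of "within the management range" is an assumption, as are the function name `abnormal_points` and the sample values.

```python
from typing import List


def abnormal_points(
    actual: List[float], predicted: List[float], sigma: List[float], alpha: float
) -> List[int]:
    """Indices of time points where |actual - predicted| exceeds alpha * sigma(t)."""
    return [
        t
        for t, (a, p, s) in enumerate(zip(actual, predicted, sigma))
        if abs(a - p) > alpha * s
    ]


# The middle time point deviates by 10 while its allowed deviation is 3 * 2 = 6.
flagged = abnormal_points([10.0, 30.0, 20.0], [10.0, 20.0, 20.0], [1.0, 2.0, 1.0], 3.0)
```

An empty result means that every time point stayed inside its own range; it does not by itself rule out a sustained one-sided deviation.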

According to the present embodiment, setting a different management range for each time point, reflecting what was learned during the training period, increases the accuracy of the management range and, as a result, allows abnormal actual measurement time series data to be detected accurately when it occurs.

Meanwhile, even if the accuracy of the management range is increased by setting a dynamic management range for each time point as in FIG. 20, actual measurement time series data that does not leave the management range is not determined to be an abnormal pattern. The actual measurement time series data 78 shown in FIG. 21 also stays between the upper management range limit line 76 and the lower management range limit line 77, so it would not be determined to be an abnormal pattern. However, it can be seen that the actual measurement time series data 78 shown in FIG. 21 continuously exceeds (79) the predicted measurement time series data 75 during the daytime.

According to an embodiment of the present invention, even if the actual measurement time series data lies within the management range, a global anomaly is determined when the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained based on the representative time series data. To determine a global anomaly, global anomaly range information must be generated (S212). The global anomaly range information may refer to the threshold. Hereinafter, a method of setting the threshold is described with reference to FIG. 22.

The distance between the actual measurement time series data and the predicted measurement time series data can be obtained using well-known logic for computing a difference value between time series. For example, the distance may be obtained by various methods, such as the DTW distance or the sum of the Euclidean distances at each time point. In that case, the threshold may likewise be set in terms of the DTW distance or the sum of the Euclidean distances at each time point. Hereinafter, for ease of understanding, the description assumes that the DTW distance is used as the distance between the actual measurement time series data and the predicted measurement time series data.

A histogram of the DTW (Dynamic Time Warping) distances between each measurement time series of the training period belonging to the predicted cluster and the representative time series data may be generated, and the DTW distance that satisfies a predetermined requirement on the histogram may be determined as the threshold.

The DTW distances between the measurement time series belonging to a particular cluster and the representative time series data of that cluster are, in most cases, distributed as shown in FIG. 22. That is, the frequency graph 81 of the DTW distance from the representative time series data of the cluster has a pattern that gradually rises from 0 and then falls. The threshold is determined as the DTW distance that satisfies a predetermined requirement on a histogram such as the one shown in FIG. 22. When actual measurement time series data is received (S214), an anomaly is determined if the DTW distance between the actual measurement time series data and the predicted measurement time series data exceeds the threshold; otherwise, the data may be determined to be normal (S216).

The predetermined requirement may be that a predetermined proportion of all the measurement time series belonging to the cluster is included, taken in ascending order of DTW distance. For example, when 1000 measurement time series belong to the cluster, the DTW distance at which a frequency of 990, that is, 99% of the total frequency of 1000, is included on the histogram may be determined as the threshold.
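The proportion-based requirement can be sketched by sorting the training distances rather than building an explicit histogram, which gives the same cut-off. The function name `ratio_threshold`, the integer-percent parameter, and the synthetic distances are assumptions made for illustration.

```python
from typing import List


def ratio_threshold(distances: List[float], percent: int) -> float:
    """Smallest distance d such that at least `percent`% of the series have distance <= d."""
    ordered = sorted(distances)
    n = len(ordered)
    # Ceiling of percent * n / 100 in integer arithmetic, clamped to [1, n].
    k = max(1, min(n, (percent * n + 99) // 100))
    return ordered[k - 1]


# 1000 synthetic distances 1, 2, ..., 1000: 99% coverage is reached at the 990th value.
threshold = ratio_threshold([float(d) for d in range(1, 1001)], 99)
```

Integer arithmetic is used for the rank so that a case such as 99% of 1000 is not affected by floating-point rounding.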

Alternatively, the predetermined requirement may be obtained by computing the mean DTW distance and the standard deviation on the histogram and then adding a predetermined multiple of the standard deviation to the mean DTW distance.
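The mean-plus-multiple rule and the resulting global anomaly decision can be sketched as follows. The population standard deviation is an assumption, and the names `sigma_threshold` and `is_global_anomaly` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def sigma_threshold(distances: List[float], multiple: float) -> float:
    """Threshold = mean DTW distance + multiple * standard deviation of the distances."""
    return mean(distances) + multiple * pstdev(distances)


def is_global_anomaly(distance: float, threshold: float) -> bool:
    """A global anomaly is flagged only when the distance exceeds the threshold."""
    return distance > threshold


# Toy training distances with mean 2 and standard deviation 1; multiple = 2.
limit = sigma_threshold([1.0, 3.0], 2.0)
```

A larger multiple makes the decision more tolerant, at the cost of missing milder sustained deviations.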

In one embodiment, when a global anomaly is determined even though the actual measurement time series data lies within the management range, information indicating which time points of the time series are responsible may additionally be provided.

As shown in FIG. 23, assume that actual measurement time series data 83 is input while predicted measurement time series data 82 exists. FIG. 24 shows a local cost matrix indicating the data difference value (for example, an absolute value) between the predicted measurement time series data 82 and the actual measurement time series data 83 for each time point. As shown in FIG. 23, there are 10 time points in total. A minimum cost path may be searched for in the matrix, from (first time point, first time point) to (tenth time point, tenth time point), by moving at each step to the adjacent cell with the smallest difference value.

After the search for the minimum cost path is complete, the abnormal influence degree may be generated by summing, for each time point of the predicted measurement time series data 82, the data difference values that belong to the minimum cost path. For example, for the first time point, the abnormal influence degree is computed as '1', the difference value at (first time point, first time point). For the fifth time point, a total of four data difference values belong to the minimum cost path, so the abnormal influence degree is computed as '10', the sum of those data difference values 4, 5, 0, and 1. That is, the abnormal influence degree, which indicates how much each time point contributes to the determination of a global anomaly, is '1, 3, 7, 3, 10, 2, 3, 6, 9, 6' in order from the first time point to the tenth time point.
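The local cost matrix, the path search, and the per-time-point summation can be sketched together. The sketch follows the greedy rule stated in the description (move to the adjacent cell with the smallest difference value) instead of the dynamic-programming search used by conventional DTW, treats rows as predicted time points, and breaks ties in favor of the diagonal; these choices, the function names, and the small input series are assumptions, and the matrix of FIG. 24 is not reproduced.

```python
from typing import List, Tuple


def local_cost_matrix(predicted: List[float], actual: List[float]) -> List[List[float]]:
    """cost[i][j] = |predicted[i] - actual[j]| (rows: predicted time points)."""
    return [[abs(p - a) for a in actual] for p in predicted]


def greedy_path(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Walk from the top-left to the bottom-right cell, always stepping to the
    adjacent cell (diagonal, down, or right) with the smallest local cost."""
    n, m = len(cost), len(cost[0])
    i = j = 0
    path = [(0, 0)]
    while (i, j) != (n - 1, m - 1):
        candidates = [(i + 1, j + 1), (i + 1, j), (i, j + 1)]
        candidates = [(r, c) for r, c in candidates if r < n and c < m]
        i, j = min(candidates, key=lambda rc: cost[rc[0]][rc[1]])
        path.append((i, j))
    return path


def abnormal_influence(predicted: List[float], actual: List[float]) -> List[float]:
    """Sum, per predicted time point, the local costs of the path cells in that row."""
    cost = local_cost_matrix(predicted, actual)
    influence = [0.0] * len(predicted)
    for r, c in greedy_path(cost):
        influence[r] += cost[r][c]
    return influence


# The last predicted time point absorbs the spike in the actual series.
influence = abnormal_influence([0.0, 0.0, 0.0], [1.0, 5.0, 1.0])
```

A time point whose row contains several path cells, like the fifth time point in the example above, accumulates all of their difference values, which is why its influence can dominate.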

The abnormal influence degree may be displayed on the administrator's terminal device as shown in FIG. 25, so that the administrator can intuitively see which time points are problematic. The larger the abnormal influence degree, the more strongly or prominently it may be displayed, or additional alarm means such as sound may be employed.

The methods according to the embodiments of the present invention described so far with reference to FIGS. 2 to 25 may be performed by executing a computer program implemented as computer-readable code. The computer program may be transmitted from a first computing device to a second computing device over a network such as the Internet and installed on the second computing device, and may thereby be used on the second computing device. The first computing device and the second computing device include server devices, physical servers belonging to a server pool for a cloud service, and stationary computing devices such as desktop PCs.

The computer program may be stored on a recording medium such as a DVD-ROM or a flash memory device.

Time series data monitoring apparatus

Hereinafter, the configuration and operation of a time series data monitoring apparatus according to another embodiment of the present invention are described with reference to FIG. 26.

As shown in FIG. 26, the time series data monitoring apparatus 20 according to the present embodiment includes a processor 200, a memory 206, a network interface 204, a storage 208, and a system bus 202. The processor 200, the network interface 204, the storage 208, and the memory 206 exchange data over the system bus 202. The memory 206 loads a computer program for analyzing the measurement time series data of a training period and predicting the measurement time series data of a prediction period. The processor 200 executes the computer program loaded in the memory 206.

The network interface 204 receives the measurement time series data and environmental data of the training period over a network connected to a plurality of sensors and an environmental data management apparatus, receives the environmental data and actual measurement time series data of the prediction period, and transmits to a terminal device the cluster information of the measurement time series data of the prediction period, the prediction result for the measurement time series data of the prediction period, or the monitoring result for the measurement time series data.

The storage 208 may store the measurement time series data and the environmental data received through the network interface 204, as well as the following data queried by the computer program: measurement time series clustering result data 280, environmental time series clustering result data 282, per-cluster regression models 284 for the measurement time series data, per-time-point management range information 286, and global anomaly range information 288.

The measurement time series clustering result data 280 includes the result of clustering the measurement time series data of the training period and the representative time series data of each cluster.

The environmental time series clustering result data 282 includes the result of clustering the environmental time series data of the training period and the representative time series data of each cluster.

The per-cluster regression models 284 for the measurement time series data include configuration information of the regression model for each cluster of each type of measurement time series data. The configuration information of a regression model may include regression model type information and a factor list.
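One possible in-memory shape for such configuration information is sketched below. Only the model type and the factor list come from the description; the class name, the remaining field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegressionModelConfig:
    """Configuration of one per-cluster regression model (hypothetical layout)."""
    measurement: str                  # e.g. "energy" or "water" (assumed key)
    cluster_id: int                   # cluster of the measurement time series data
    model_type: str                   # regression model type information
    factor_list: List[str] = field(default_factory=list)  # independent variables


# Example entry; the factor names are illustrative only.
config = RegressionModelConfig(
    measurement="energy",
    cluster_id=1,
    model_type="MARS",
    factor_list=["temperature_series", "is_holiday"],
)
```

Keeping the measurement type and cluster identifier in the record lets the prediction logic look up the model that matches a predicted cluster.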

The per-time-point management range information 286 may include, for each measurement time series data cluster, the standard deviation at each time point of the measurement time series data of the training period belonging to that cluster.

The global anomaly range information 288 may include, for each measurement time series data cluster, a threshold expressed as a DTW distance from the representative time series data of that cluster.

The storage 208 may further store information on the optimal classification model for each type of measurement time series data.

The computer program includes training logic 260, prediction logic 262, and monitoring logic 264.

The training logic 260 includes: an operation of clustering the measurement time series data of the training period, in units of a predetermined cycle, into a plurality of clusters; an operation of collecting a plurality of environmental data of the training period; an operation of selecting at least some of the plurality of environmental data as factors; an operation of generating a classification model that optimally separates the clusters of the measurement time series data in the space or plane formed by the axes representing the factors; an operation of determining a performance indicator value of the generated classification model; and an operation of selecting an optimal classification model from among the generated classification models based on the performance indicator value, by repeating the factor selection, the classification model generation, and the performance indicator determination while changing the selection of factors.
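The repeated loop of factor selection, model generation, and performance evaluation can be sketched as an exhaustive search over factor subsets. The sketch substitutes a nearest-centroid classifier and in-sample accuracy for the classification model and performance indicator, neither of which is specified in this section; all names and data values are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def fit_centroids(rows: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Mean feature vector per cluster label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, v in enumerate(row):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], row: List[float]) -> int:
    """Label of the centroid closest to `row` in squared Euclidean distance."""
    def sq_dist(c: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def select_best_factors(
    data: List[Dict[str, float]], labels: List[int], factor_names: Sequence[str]
) -> Tuple[float, Tuple[str, ...]]:
    """Try every non-empty factor subset; keep the first one with the best accuracy."""
    best: Tuple[float, Tuple[str, ...]] = (-1.0, ())
    for size in range(1, len(factor_names) + 1):
        for subset in combinations(factor_names, size):
            rows = [[sample[f] for f in subset] for sample in data]
            centroids = fit_centroids(rows, labels)
            hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
            score = hits / len(labels)  # in-sample accuracy (assumed indicator)
            if score > best[0]:
                best = (score, subset)
    return best


# Hypothetical daily environment summaries and the cluster of each day's measurements.
days = [
    {"temp": 30.0, "humidity": 50.0},
    {"temp": 31.0, "humidity": 80.0},
    {"temp": 5.0, "humidity": 52.0},
    {"temp": 4.0, "humidity": 79.0},
]
clusters = [1, 1, 2, 2]
best_score, best_factors = select_best_factors(days, clusters, ["temp", "humidity"])
```

Exhaustive enumeration grows exponentially with the number of candidate factors, so a practical implementation would likely restrict the subsets tried and score each model on held-out data.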

The prediction logic 262 includes an operation of predicting the cluster of the measurement time series data of the prediction period using the optimal classification model, and an operation of predicting the measurement time series data of the prediction period using the regression model for the predicted cluster of measurement time series data.

The monitoring logic 264 includes an operation of setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period belonging to the predicted cluster, and an operation of determining a global anomaly when, even though the actual measurement time series data lies within the management range, the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained based on the representative time series data.

In this specification, an operation consists of a series of one or more instructions that can be interpreted and executed by the processor 200 and that perform a specific function.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A time series data monitoring method comprising:
predicting a cluster of measurement time series data of a prediction period from environmental data of the prediction period, according to a result of analyzing measurement time series data and environmental data of a training period;
setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period belonging to the predicted cluster; and
monitoring whether actual measurement time series data of the prediction period satisfies the management range for each time point.

The method of claim 1,
wherein the prediction period includes a first time point and a second time point later than the first time point, and
wherein setting the management range for each time point comprises setting the management range of the first time point and the management range of the second time point to different values.

The method of claim 2,
wherein setting the management range of the first time point and the management range of the second time point to different values comprises
setting the management range of the first time point to a larger value than the management range of the second time point when the variability at the first time point is greater than the variability at the second time point.

The method of claim 1,
further comprising predicting the measurement time series data of the prediction period using a regression model for the predicted cluster,
wherein the monitoring comprises
determining whether the difference between the actual measurement time series data and the predicted measurement time series data at each time point is within the management range for that time point.

A time series data monitoring method comprising:
predicting a cluster of measurement time series data of a prediction period from environmental data of the prediction period, according to a result of analyzing measurement time series data and environmental data of a training period;
predicting the measurement time series data of the prediction period using a regression model for the predicted cluster;
receiving actual measurement time series data of the prediction period;
obtaining representative time series data of the predicted cluster; and
determining a global anomaly when, even though the actual measurement time series data lies within a management range, the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained based on the representative time series data.

The method of claim 5,
wherein determining the global anomaly comprises:
generating a histogram of the time series distances between each measurement time series of the training period belonging to the predicted cluster and the representative time series data; and
determining, as the threshold, the time series distance that satisfies a predetermined requirement on the histogram.

The method of claim 6,
wherein determining, as the threshold, the DTW distance that satisfies a predetermined requirement on the histogram comprises
determining, as the threshold, the time series distance at which a predetermined proportion of all the measurement time series data belonging to the cluster is included, taken in ascending order of the time series distance.

The method of claim 6,
wherein determining, as the threshold, the time series distance that satisfies a predetermined requirement on the histogram comprises
determining the threshold using the mean and the standard deviation of the time series distances on the histogram.

The method of claim 5,
wherein determining the global anomaly comprises:
generating a local cost matrix indicating the data difference value between the predicted measurement time series data and the actual measurement time series data at each time point;
searching for a minimum cost path on the local cost matrix; and
generating an abnormal influence degree for each time point using the minimum cost path.

The method of claim 9,
wherein generating the abnormal influence degree for each time point comprises
generating the abnormal influence degree by summing, for each time point of the predicted measurement time series data, the data difference values belonging to the minimum cost path.

A time series data monitoring apparatus comprising:
a memory that loads a computer program for analyzing measurement time series data of a training period to predict the measurement time series data of a prediction period;
a processor that executes the computer program loaded in the memory;
a network interface; and
a storage that stores the measurement time series data and environmental data received through the network interface, and data queried by the computer program,
wherein the computer program comprises training logic, prediction logic, and monitoring logic,
wherein the training logic comprises:
an operation of clustering the measurement time series data of the training period, in units of a predetermined cycle, into a plurality of clusters;
an operation of collecting a plurality of environmental data of the training period; and
an operation of generating an optimal classification model that optimally classifies the clusters of the measurement time series data using at least some of the plurality of environmental data as factors,
wherein the prediction logic comprises:
an operation of predicting the cluster of the measurement time series data of the prediction period using the optimal classification model; and
an operation of predicting the measurement time series data of the prediction period using a regression model for the predicted cluster, and
wherein the monitoring logic comprises:
an operation of setting a management range for each time point according to the per-time-point variability of the measurement time series data of the training period belonging to the predicted cluster; and
an operation of determining a global anomaly when, even though the actual measurement time series data lies within the management range, the difference between the actual measurement time series data and the predicted measurement time series data exceeds a threshold obtained based on the representative time series data.