KR20170078256A

KR20170078256A - Method and apparatus for time series data prediction

Info

Publication number: KR20170078256A
Application number: KR1020150188595A
Authority: KR
Inventors: 권순환; 김형찬; 오규삼; 서범준; 김성희; 오민환; 유창준
Original assignee: 삼성에스디에스 주식회사
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2017-07-07
Also published as: KR102340258B1

Abstract

A method for predicting time series data is provided. A method according to an embodiment of the present invention includes: i) clustering measurement time series data from a training period, divided into units of a predetermined period, into a plurality of clusters; ii) collecting a plurality of environmental data for the training period; iii) selecting at least some of the plurality of environmental data as factors; iv) generating a classification model that optimally classifies the clusters of the measurement time series data in a space or plane formed by axes representing the factors, determining a performance index value of the generated classification model, and repeating the factor selection, the model generation, and the performance index determination while changing the selection of factors, so as to select an optimal classification model from among the generated classification models on the basis of the performance index value; and v) predicting the cluster of the measurement time series data for a prediction period using the optimal classification model.

Description

Method and apparatus for predicting time series data

The present invention relates to a method and apparatus for predicting time series data. More specifically, it relates to a method and apparatus for predicting the time series data of a specific period using the result of training on time series data generated during a certain past period.

Time series data is data expressed as a function of time over a certain period. Such data can be predicted by analyzing past time series data.

One known approach to predicting time series data clusters the time series data generated during a past training period so that series with similar patterns are grouped together, and then determines the cluster corresponding to the prediction period. The time series data of the prediction period is then predicted to show a pattern similar to that of the determined cluster.

In this approach, the criterion for determining the cluster corresponding to the prediction period depends on an expert's experience or knowledge. For example, when predicting the power consumption of a building (on a 24-hour basis), a power consumption cluster may be assigned to the prediction period according to the type of day: weekday, Saturday, or public holiday. In this example, the fact that the power consumption pattern varies with the type of day is itself something obtained from knowledge or experience.

Because existing prediction methods select the criterion for determining the cluster of the target time series data on the basis of an expert's knowledge or experience, the appropriateness of the criterion is open to question. In addition, because the methods depend on people, it is difficult to establish a clear criterion for determining the cluster of the target time series data from environmental time series data, such as temperature and humidity, that describe the environment of the prediction period.

Korean Patent Publication No. 1998-7002852
Korean Patent Publication No. 2009-0073937

A technical problem to be solved by the present invention is to provide a method and apparatus that automatically generate an optimal criterion model for determining the cluster of target time series data, and that use the optimal criterion model to determine the cluster of the target time series data for a prediction period.

Another technical problem to be solved by the present invention is to provide a method and apparatus that use environmental time series data, which describe an environment that can affect the target time series data, as factors of the criterion model for determining the cluster of the target time series data.

Yet another technical problem to be solved by the present invention is to provide a method and apparatus that use multidimensional environmental time series data, which describe an environment that can affect the target time series data and consist of a plurality of time series, as factors of the criterion model for determining the cluster of the target time series data.

Yet another technical problem to be solved by the present invention is to provide a method and apparatus that predict the target time series data of a prediction period with high accuracy using prediction models built for each cluster of the target time series data.

The technical problems of the present invention are not limited to those mentioned above; other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above technical problems, a method for predicting time series data according to an embodiment of the present invention includes: i) clustering measurement time series data from a training period, divided into units of a predetermined period, into a plurality of clusters; ii) collecting a plurality of environmental data for the training period; iii) selecting at least some of the plurality of environmental data as factors; iv) generating a classification model that optimally classifies the clusters of the measurement time series data in a space or plane formed by axes representing the factors, determining a performance index value of the generated classification model, and repeating the factor selection, the model generation, and the performance index determination while changing the selection of factors, so as to select an optimal classification model from among the generated classification models on the basis of the performance index value; and v) predicting the cluster of the measurement time series data for a prediction period using the optimal classification model.

In one embodiment, collecting the environmental data includes clustering the environmental time series data among the collected environmental data, in units of the predetermined period, into a plurality of clusters. In this case, when the selected factors include the environmental time series data, the axis representing the environmental time series data is an axis representing the cluster of the environmental time series data.

In one embodiment, generating the classification model includes generating the classification model using SVM (Support Vector Machine) logic, and determining the performance index may include using the maximum margin of the hyperplane generated by the SVM logic as the performance index of the classification model.
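As a rough illustration of using a margin as the performance index, the sketch below computes the geometric margin of a given hyperplane and compares two candidate factor sets. It is only a sketch under stated assumptions: the data, the factor names, and the hyperplane coefficients `w` and `b` are hypothetical, and no SVM training is performed. In practice the factors would probably need to be scaled to a common range before margins are compared; that is an observation of this sketch, not something the source states.

```python
import math

def geometric_margin(points, labels, w, b):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    labels are +1 or -1. A negative result means at least one point lies
    on the wrong side, i.e. the hyperplane does not separate the classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: four days, two measurement clusters (-1 and +1).
labels = [-1, -1, 1, 1]
# Factor set A: (average temperature, holiday flag)
points_a = [(10.0, 0.0), (12.0, 0.0), (24.0, 1.0), (26.0, 1.0)]
# Factor set B: (average humidity, holiday flag)
points_b = [(40.0, 0.0), (44.0, 0.0), (46.0, 1.0), (50.0, 1.0)]

# Hypothetical separating hyperplanes; a real SVM trainer would search
# for the maximum-margin hyperplane instead of being handed one.
margin_a = geometric_margin(points_a, labels, w=(1.0, 0.0), b=-18.0)
margin_b = geometric_margin(points_b, labels, w=(1.0, 0.0), b=-45.0)

# The factor set whose classifier separates the clusters more widely wins.
best = "A" if margin_a > margin_b else "B"
```

Here the temperature-based factor set separates the two clusters with a wider margin than the humidity-based one, so it would be preferred under this index.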

In one embodiment, predicting the cluster of the measurement time series data for the prediction period includes receiving predicted values of those environmental data of the prediction period that were used as factors of the selected classification model, and inputting the predicted values into the optimal classification model to predict the cluster of the measurement time series data.

In one embodiment, clustering the measurement time series data into a plurality of clusters includes applying k-means logic to cluster the individual measurement time series data into k clusters, and calculating representative time series data for each cluster using DBA (DTW Barycenter Averaging) logic.
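The clustering step can be sketched as follows. This is a simplified illustration, not the claimed method: it uses Euclidean distance and a pointwise arithmetic mean as the representative series, whereas the embodiment names DBA for the representative. The function names, the initial centroids, and the toy data are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pointwise_mean(series_list):
    """Average the series sample by sample (a simple stand-in for DBA)."""
    n = len(series_list)
    return [sum(values) / n for values in zip(*series_list)]

def kmeans(series_list, initial_centroids, iterations=20):
    """Plain k-means over fixed-length series; k = len(initial_centroids)."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(s, centroids[j]))
            for s in series_list
        ]
        if new_assignment == assignment:
            break  # converged: no series changed cluster
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [s for s, a in zip(series_list, assignment) if a == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = pointwise_mean(members)
    return assignment, centroids

# Hypothetical 4-sample "daily" series (real ones might have 24 points).
days = [
    [1.0, 5.0, 6.0, 2.0],  # weekday-like load
    [1.2, 5.2, 5.8, 2.2],
    [1.0, 1.5, 1.4, 1.0],  # holiday-like load
    [0.8, 1.3, 1.6, 1.2],
]
assignment, representatives = kmeans(days, initial_centroids=[days[0], days[2]])
```

Each entry of `assignment` is the cluster index of the corresponding day, and `representatives` holds one representative series per cluster.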

To solve the above technical problems, a method for predicting time series data according to an embodiment of the present invention includes: i) predicting the cluster of the measurement time series data for a prediction period from environmental data of the prediction period, according to the result of analyzing measurement time series data and environmental data from a training period; and ii) predicting the measurement time series data of the prediction period using a regression model for the selected cluster of measurement time series data. Here, the regression model takes first environmental time series data among the environmental data as a first independent variable and the measurement time series data as the dependent variable. The regression model includes a first model and a second model, and predicting the measurement time series data includes using the first model when the cluster of the measurement time series data is a first cluster, and using the second model, which differs from the first model, when the cluster is a second cluster different from the first cluster.

In addition to the first independent variable, the regression model may have, as additional independent variables, at least one of a cluster identifier of second environmental time series data different from the first environmental time series data, and a representative value, among the environmental data, that represents a specific environment of each period.

In one embodiment, the method further includes clustering the measurement time series data from the training period, in units of a predetermined period, into a plurality of clusters; collecting a plurality of different types of environmental data for the training period; and building a regression model corresponding to each cluster of the measurement time series data. Building the regression model may include building the regression model corresponding to a first measurement time series data cluster using only the data of the periods clustered into the first cluster, without using the data of the periods clustered into a second measurement time series data cluster.
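A minimal sketch of building one regression model per cluster, each fitted only on the periods assigned to that cluster, might look like this. A straight-line least-squares fit with a single independent variable (temperature) is assumed purely for illustration; the function names and training data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_models_per_cluster(periods):
    """periods: list of (cluster_id, temperature_series, measured_series).

    Samples are pooled per cluster, so a cluster's model never sees
    data from periods that were assigned to another cluster.
    """
    samples = {}
    for cluster_id, temps, measured in periods:
        xs, ys = samples.setdefault(cluster_id, ([], []))
        xs.extend(temps)
        ys.extend(measured)
    return {cid: fit_line(xs, ys) for cid, (xs, ys) in samples.items()}

def predict_series(models, cluster_id, temps):
    """Apply the model of the predicted cluster to a temperature series."""
    slope, intercept = models[cluster_id]
    return [slope * t + intercept for t in temps]

# Hypothetical training periods: cluster 0 follows y = 2x + 1,
# cluster 1 follows y = 0.5x + 3.
periods = [
    (0, [10.0, 20.0], [21.0, 41.0]),
    (0, [30.0, 40.0], [61.0, 81.0]),
    (1, [10.0, 20.0], [8.0, 13.0]),
    (1, [30.0, 40.0], [18.0, 23.0]),
]
models = fit_models_per_cluster(periods)
forecast = predict_series(models, cluster_id=1, temps=[12.0, 16.0])
```

Because the two clusters follow different relationships, a single pooled model would fit neither well, which is the motivation for selecting the model by cluster.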

To solve the above technical problems, an apparatus for predicting time series data according to an embodiment of the present invention includes: a memory into which is loaded a computer program for analyzing measurement time series data from a training period and predicting the measurement time series data for a prediction period; a processor that executes the computer program loaded in the memory; a network interface; and storage that stores the measurement time series data and environmental data received through the network interface and the data queried by the computer program. The computer program includes training logic and prediction logic. The training logic includes: i) an operation of clustering measurement time series data from the training period, divided into units of a predetermined period, into a plurality of clusters; ii) an operation of collecting a plurality of environmental data for the training period; iii) an operation of selecting at least some of the plurality of environmental data as factors; iv) an operation of generating a classification model that optimally classifies the clusters of the measurement time series data in a space or plane formed by axes representing the factors; v) an operation of determining a performance index value of the generated classification model; and vi) an operation of repeating the factor selection, the model generation, and the performance index determination while changing the selection of factors, and selecting an optimal classification model from among the generated classification models on the basis of the performance index value. The prediction logic includes an operation of predicting the cluster of the measurement time series data for the prediction period using the optimal classification model.

In one embodiment, the prediction logic further includes an operation of predicting the measurement time series data of the prediction period using a regression model for the predicted cluster of measurement time series data.

FIG. 1 is a block diagram of a time series data prediction system according to an embodiment of the present invention.
FIGS. 2 to 4 are flowcharts of a time series data prediction method according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating measurement time series data referred to in some embodiments of the present invention.
FIG. 6 is a diagram illustrating the result of clustering the measurement time series data of FIG. 5 and generating representative time series data for each cluster.
FIG. 7 is a diagram illustrating, in some embodiments of the present invention, the form in which the result of clustering the measurement time series data collected during a training period is stored.
FIG. 8 is a diagram illustrating multidimensional measurement time series data referred to in some embodiments of the present invention.
FIG. 9 is a diagram illustrating the process of determining the number of clusters when clustering time series data in some embodiments of the present invention.
FIG. 10 is a diagram illustrating multidimensional environmental time series data among the environmental data referred to in some embodiments of the present invention.
FIG. 11 is a diagram illustrating date attributes among the environmental data referred to in some embodiments of the present invention.
FIG. 12 is a diagram illustrating how environmental data is clustered in some embodiments of the present invention.
FIG. 13 is a diagram illustrating representative values, among the environmental data referred to in some embodiments of the present invention, that represent a specific environment.
FIG. 14 is a diagram illustrating, in some embodiments of the present invention, the form in which the result of clustering the environmental time series data among the environmental data is stored.
FIGS. 15 and 16 are diagrams illustrating a classification model referred to in some embodiments of the present invention.
FIG. 17 is a diagram illustrating, in some embodiments of the present invention, how a regression model for predicting the measurement time series data of a prediction period is assigned to each measurement time series data cluster.
FIG. 18 is a block diagram of a time series data prediction apparatus according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention belongs are fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular includes the plural unless the context specifically states otherwise.

For ease of understanding, the meanings of the terms used in this specification are explained before the embodiments of the present invention are described in detail.

Measurement time series data: time series data of measurements taken by a sensor or the like. The measurement time series data may be divided into predetermined periods (for example, 24 hours). The sensor may be, for example, a temperature sensor, brightness sensor, or power usage sensor connected to a building management system; a temperature or pressure sensor installed in production equipment; or a temperature sensor, CPU usage sensor, memory usage sensor, storage I/O load sensor, or network usage sensor installed in a computing device. The sensors that can generate measurement time series data may of course include measuring devices other than those listed.
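Dividing a continuous stream of readings into predetermined periods can be sketched as below. The hourly sampling rate, the function name, and the choice to drop a trailing partial period are assumptions made for this illustration only.

```python
def split_into_periods(samples, period_length):
    """Cut a flat list of readings into consecutive fixed-length periods.

    A trailing partial period is dropped so every period has equal length.
    """
    full = len(samples) - len(samples) % period_length
    return [samples[i:i + period_length] for i in range(0, full, period_length)]

# Hypothetical hourly readings covering two full days plus three extra hours.
hourly = [float(h % 24) for h in range(51)]
daily = split_into_periods(hourly, period_length=24)
```

Each element of `daily` is then one 24-sample series that can be clustered as a unit.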

Environmental data: data about the various environments that can affect the measurement time series data. Environmental data can be divided into i) environmental time series data, ii) environmental representative values, and iii) environmental attribute values. For example, a 24-hour temperature time series or a 24-hour humidity time series is environmental time series data; the average temperature or average humidity of each date is an environmental representative value; and whether each date is a holiday or a weekday is an environmental attribute value.
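One possible way to hold the three kinds of environmental data for a single period is sketched below. The class name, field names, and the use of the arithmetic mean as the representative value are hypothetical choices for illustration; the source only gives averages as an example of a representative value.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class EnvironmentPeriod:
    """Environmental data for one period (for example, one day)."""
    series: dict = field(default_factory=dict)           # e.g. {"temperature": [...]}
    representatives: dict = field(default_factory=dict)  # e.g. {"temperature": 14.0}
    attributes: dict = field(default_factory=dict)       # e.g. {"is_holiday": False}

def with_mean_representatives(period):
    """Fill in a per-period average for every environmental series."""
    for name, values in period.series.items():
        period.representatives[name] = mean(values)
    return period

# Hypothetical 4-sample day; a real day might carry 24 hourly samples.
day = with_mean_representatives(EnvironmentPeriod(
    series={
        "temperature": [10.0, 14.0, 18.0, 14.0],
        "humidity": [60.0, 50.0, 40.0, 50.0],
    },
    attributes={"is_holiday": False},
))
```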

Training period: to predict time series data, data from a certain past period must be collected and learned through techniques such as machine learning. The training period is the past period that is the subject of this learning. The training period may end at the present time; that is, current data may become subject to learning as soon as it is collected. The measurement time series data of the training period, and the environmental time series data among the environmental data, may be clustered through learning.

Prediction period: the measurement time series data of a specific period can be predicted using the result of learning from the training period. In this specification, the period for which the measurement time series data is predicted is called the prediction period. The prediction period may be a specific period in the future, or it may be a specific period in the past when the purpose is to diagnose a period that has already elapsed.

Time series data prediction system

Hereinafter, the configuration and operation of a time series data prediction system according to an embodiment of the present invention are described with reference to FIG. 1. The time series data prediction system according to this embodiment may include a measurement device 10 and a measurement prediction device 20.

The measurement device 10 generates measurement time series data. The measurement device 10 can transmit the generated measurement time series data over a network to the measurement prediction device 20 and a terminal device 40. As already mentioned, the measurement device 10 may be, for example, a temperature sensor, brightness sensor, or power usage sensor connected to a building management system; a temperature or pressure sensor installed in production equipment; or a temperature sensor, CPU usage sensor, memory usage sensor, storage I/O load sensor, or network usage sensor installed in a computing device.

The environmental data management device 30 generates or collects environmental data that can affect the measurement time series data and provides it to the measurement prediction device 20.

The measurement prediction device 20 learns the measurement time series data and the environmental data of a training period, and uses the learning result to predict the measurement time series data of a prediction period.

The data learning operations of the measurement prediction device 20 are described below.

As a result of learning from the data of the training period, the measurement time series data, in units of a predetermined period, may be clustered into a plurality of clusters, and representative time series data may be determined for each cluster of measurement time series data.

Likewise, as a result of learning from the data of the training period, the environmental time series data, in units of a predetermined period, may be clustered into a plurality of clusters, and representative time series data may be determined for each cluster of environmental time series data. The measurement time series data and the environmental time series data are preferably clustered in the same manner.

In addition, as a result of learning from the data of the training period, an optimal classification model may be generated that takes environmental data as input and outputs a cluster of the measurement time series. The optimal classification model may be generated by i) selecting at least some of the collected environmental data as factors; ii) generating a classification model that optimally classifies the clusters of the measurement time series data in a space or plane formed by axes representing the factors; iii) determining a performance index value of the generated classification model; and iv) repeating the factor selection, the model generation, and the performance index determination while changing the selection of factors, and selecting the optimal classification model from among the generated classification models on the basis of the performance index value.
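The search over factor selections can be sketched as an exhaustive loop over subsets of candidate factors. The sketch swaps in a nearest-centroid classifier and its training accuracy as the performance index, purely to keep the example self-contained; it is not the SVM-based embodiment, and the factor names and data are hypothetical.

```python
from itertools import combinations

def centroid_accuracy(rows, labels, factors):
    """Training accuracy of a nearest-centroid classifier using only `factors`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append([row[f] for f in factors])
    centroids = {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }

    def predict(row):
        point = [row[f] for f in factors]
        return min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )

    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def select_best_factors(rows, labels, candidates):
    """Try every non-empty factor subset and keep the best-scoring one.

    Ties keep the subset found first, which favours smaller subsets.
    """
    best = None
    for size in range(1, len(candidates) + 1):
        for factors in combinations(candidates, size):
            score = centroid_accuracy(rows, labels, factors)
            if best is None or score > best[0]:
                best = (score, factors)
    return best

# Hypothetical per-day environmental data and measurement cluster labels.
rows = [
    {"avg_temp": 5.0, "avg_humidity": 50.0, "is_holiday": 0.0},
    {"avg_temp": 7.0, "avg_humidity": 70.0, "is_holiday": 1.0},
    {"avg_temp": 25.0, "avg_humidity": 52.0, "is_holiday": 0.0},
    {"avg_temp": 27.0, "avg_humidity": 68.0, "is_holiday": 1.0},
]
labels = [0, 0, 1, 1]
score, factors = select_best_factors(
    rows, labels, ["avg_humidity", "is_holiday", "avg_temp"]
)
```

In this toy data only the average temperature separates the two measurement clusters, so the search settles on that single factor. Exhaustive enumeration grows exponentially with the number of candidate factors, so a real system with many factors would likely need a cheaper search strategy.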

또한, 트레이닝 기간 동안의 데이터에 대한 학습의 결과로, 각각의 측정치 시계열 클러스터 별로, 환경 데이터로부터 상기 측정치 데이터를 예측하기 위한 회귀 모델이 구축 될 수 있다. 상기 회귀 모델은, 예를 들어 MARS(Multivariate Adaptive Regression Splines)나, 다항 회귀 모델(polynomial regression) 등 다양한 회귀 모델 중 어느 하나 일 수 있다. 회귀 모델 또는 회귀 분석에 대하여는, 다양한 논문 등의 자료가 공개 되어 있으므로, 회귀 모델에 대한 자세한 설명은 생략하기로 한다. 예를 들어, 웹 사이트(https://en.wikipedia.org/wiki/Regression_analysis)를 참조할 수 있다.Also, as a result of learning about the data during the training period, a regression model for predicting the measurement data from the environmental data may be constructed for each measurement time series cluster. The regression model may be any of a variety of regression models such as, for example, Multivariate Adaptive Regression Splines (MARS) or polynomial regression. Regarding the regression model or regression analysis, since various data such as articles are disclosed, a detailed description of the regression model will be omitted. For example, you can visit the website at https://en.wikipedia.org/wiki/Regression_analysis.

이하, 측정치 예측 장치(20)의 측정치 시계열 데이터 예측 관련 동작을 설명한다.Hereinafter, the measurement time series data prediction related operation of the measurement value prediction apparatus 20 will be described.

측정치 예측 장치(20)는, 상기 최적 분류 모델에 상기 예측 기간의 환경 데이터 예측치를 입력하여, 예측 기간의 측정치 시계열 데이터 클러스터를 예측한다. 상기 예측 기간의 환경 데이터 예측치는, 평균 온도, 평균 습도, 평균 풍속 등의 기상 예보 정보일 수 있다. 측정치 예측 장치(20)는 상기 예측 기간의 환경 데이터 예측치를 환경 데이터 관리 장치(30)로부터 제공 받을 수 있다.The measurement predictor 20 inputs the predicted environmental data predicted value to the optimum classification model to predict a measured time series data cluster in the predicted period. The environmental data predicted value in the prediction period may be weather forecast information such as average temperature, average humidity, average wind speed, and the like. The measurement value prediction apparatus 20 can receive the environmental data prediction value of the prediction period from the environment data management device 30. [

The measurement prediction apparatus 20 may transmit information about the predicted measurement time series data cluster, such as its representative time series data, to the terminal device 40 over a network.

If the factors of the optimal classification model include an environmental data time series, the measurement prediction apparatus 20 determines which of the environmental data time series clusters obtained as a learning result from the training period the predicted environmental data time series for the prediction period (for example, a predicted daily temperature time series) belongs to. The cluster can be determined quickly by comparing the predicted environmental data time series with the representative time series data of each environmental data time series cluster. The measurement prediction apparatus 20 inputs the identifier (for example, an index value) of the determined environmental data time series cluster into the optimal classification model to predict the measurement time series data cluster of the prediction period.

The measurement prediction apparatus 20 predicts the measurement time series data of the prediction period using the regression model for the predicted measurement time series data cluster. The regression model receives, as a factor, time series data for a first environment (for example, temperature) and outputs the corresponding measurement time series data. The regression model may additionally receive at least one of a time series data cluster identifier for a second environment (for example, humidity), a representative value for a third environment (for example, solar radiation), and an environmental attribute (for example, whether the day is a weekday or a holiday).

The measurement prediction apparatus 20 may transmit the predicted measurement time series data (for example, the predicted energy consumption time series for the 24 hours of tomorrow) to the terminal device 40 over the network.

Although FIG. 1 shows the measurement prediction apparatus 20 and the environmental data management apparatus 30 as physically separate from each other, in some embodiments the environmental data management apparatus 30 may be configured as a module inside the large-capacity file generation system 300.

Time series data prediction method

Hereinafter, a time series data prediction method according to an embodiment of the present invention is described with reference to FIGS. 2 to 17. The method may be executed by a computing device, for example by the measurement prediction apparatus 20 described with reference to FIG. 1. The method includes an operation of learning from the data of a training period and an operation of predicting the measurement time series data of a prediction period using the result of the learning. The learning operation is described with reference to FIGS. 2 and 3, and the prediction operation is then described with reference to FIG. 4.

Referring to FIG. 2, measurement time series data and a plurality of environmental data for the training period are received (S100, S102). The plurality of environmental data may include time series data or a representative value indicating a first environment (for example, temperature), time series data or a representative value indicating a second environment (for example, humidity), and a value indicating an environmental attribute (for example, whether the day is a holiday or a weekday). During training, the received measurement time series data and the environmental time series data among the environmental data are each clustered into groups of similar series (S104, S106). The clustering process (S104, S106) is described in detail below.

The received measurement time series data is processed in units of a predetermined cycle. For example, when the cycle is 24 hours, the measurement time series data may be split at 0:00. The cycle may be set to a different value for each kind of measurement time series data. For example, time series data of energy consumption in a building may be split every 24 hours, while data on the travel distance of the building's elevators may be split every week.
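The cycle-based splitting described above can be illustrated with a minimal Python sketch. The function name `split_into_cycles`, the hourly sampling rate, and the sample values are assumptions made for illustration; the source does not specify a data layout. The sketch also assumes that the first sample falls on a cycle boundary (for example, 0:00).

```python
def split_into_cycles(samples, samples_per_cycle):
    """Split a flat list of readings into consecutive fixed-length cycles.

    Assumes the first reading is aligned to a cycle boundary (e.g. 0:00).
    A trailing incomplete cycle is dropped.
    """
    usable = len(samples) - len(samples) % samples_per_cycle
    return [samples[i:i + samples_per_cycle]
            for i in range(0, usable, samples_per_cycle)]


# Hypothetical input: three days of hourly readings (values are placeholders).
hourly_readings = list(range(72))
daily_cycles = split_into_cycles(hourly_readings, 24)
```

A weekly cycle for hourly data would use `samples_per_cycle=168` with the same function.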

The measurement time series data of each cycle is assigned to one of a plurality of clusters through clustering. FIG. 5 shows energy usage time series data, split into 24-hour units, overlaid on one another. Time series data such as that shown in FIG. 5 may be clustered by a well-known clustering logic such as k-means. The k-means logic is an algorithm that groups given data into k clusters and operates by minimizing the variance of the distances between each cluster and its members. It is a form of unsupervised learning that assigns labels to unlabeled input data, and its structure is similar to clustering based on the EM algorithm. Because the k-means logic performs well when clustering time series data, this embodiment improves clustering quality by using it.
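As a rough illustration of the k-means logic applied to per-cycle profiles, the following Python sketch alternates the usual assignment and update steps. It assumes equal-length cycles and squared Euclidean distance, which the source does not prescribe; the function names, the seeded random initialization, and the toy profiles are likewise assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Point-wise mean of a non-empty list of equal-length profiles."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]


def k_means(profiles, k, iterations=50, seed=0):
    """Return (labels, centroids) after at most `iterations` Lloyd steps."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each cycle joins the nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            updated.append(mean_profile(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Toy profiles (made-up values): three low-usage and three high-usage cycles.
profiles = [[1, 1, 1, 1], [1, 2, 1, 1], [2, 1, 1, 1],
            [9, 9, 9, 9], [8, 9, 9, 9], [9, 9, 8, 9]]
labels, centroids = k_means(profiles, k=2)
```

The resulting `labels` play the role of the cluster indices that later serve as classifier inputs and outputs.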

According to other embodiments, various clustering logics other than k-means may also be used. For information on clustering logic, see the web document 'https://en.wikipedia.org/wiki/Cluster_analysis'.

In one embodiment, after clustering is performed, representative time series data for the time series data belonging to each cluster may be selected using a time series averaging logic. For example, any of various well-known time series averaging logics, such as DTW Barycenter Averaging (DBA), may be used. For the DBA logic, see well-known papers such as 'F. Petitjean, A. Ketterlin, P. Gancarski, A global averaging method for dynamic time warping, with applications to clustering'. FIG. 6 shows the measurement time series data of FIG. 5 clustered into a total of five clusters, with the representative time series data extracted for each cluster.

The DBA logic effectively extracts representative time series data within the clusters produced by the k-means logic. By combining clustering based on the k-means logic with extraction of each cluster's representative time series data based on the DBA logic, this embodiment provides the effect of optimal clustering and optimal extraction of representative time series data.

FIG. 7 shows the form in which measurement time series data with a 24-hour cycle is stored for each date. As shown in FIG. 7, the measurement time series data of each cycle may be stored together with a cluster index that serves as the identifier of its cluster. In addition, the representative time series data of each cluster may be stored as a result of the clustering. The collected measurement time series data may also be multidimensional time series data composed of n (n >= 2) different measurement time series, as shown in FIG. 8.

When clustering time series data, the question arises of how many clusters to use. If there are too few clusters, the time series data within each cluster is insufficiently homogeneous; if there are too many, clustering becomes inefficient. Determining an appropriate number of clusters is therefore important for clustering quality. In one embodiment of the present invention, the number of clusters is finally determined on the basis of the summed similarity between the representative time series data of each cluster and the measurement time series data of each cycle belonging to that cluster. The summed similarity may be calculated using any of various logics for computing a difference value between time series data, for example the DTW distance.

In the case shown in FIG. 9, as the number of clusters increases from 1 to 5, the sum of the DTW distances between the representative time series data of each cluster and the measurement time series data of each cycle belonging to that cluster decreases sharply, and once the number of clusters reaches 5 or more the decrease in the sum becomes negligible. That is, in the case shown in FIG. 9, increasing the number of clusters beyond 5 has little effect on clustering quality. Accordingly, in one embodiment, when the decrease in the DTW distance sum is at or above a reference value as the number of clusters increases from 1 to k, but falls below the reference value as the number of clusters increases beyond k, the number of clusters may be finally determined to be k.
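The stopping rule above can be sketched in Python as follows. The function name `choose_cluster_count`, the reference value, and the list of DTW distance sums are hypothetical; the numbers are made up to mimic the shape described for FIG. 9 and are not taken from the figure.

```python
def choose_cluster_count(distance_sums, reference_value):
    """Pick k from DTW distance sums, where distance_sums[i] was obtained with i + 1 clusters.

    Returns the first k for which adding one more cluster reduces the sum
    by less than reference_value. If every step still reduces the sum by at
    least reference_value, the largest tested cluster count is returned.
    """
    for i in range(len(distance_sums) - 1):
        decrease = distance_sums[i] - distance_sums[i + 1]
        if decrease < reference_value:
            return i + 1
    return len(distance_sums)


# Made-up sums for 1..7 clusters: sharp decreases up to 5, then a plateau.
example_sums = [100.0, 60.0, 35.0, 20.0, 12.0, 11.0, 10.5]
chosen_k = choose_cluster_count(example_sums, reference_value=5.0)
```

With these example values the decreases are 40, 25, 15, 8, 1, and 0.5, so the rule stops at five clusters.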

When the measurement time series data is multidimensional time series data composed of two or more individual measurement time series, the number of clusters may be finally determined on the basis of the sum of similarities (for example, DTW distances) computed by MD-DTW (Multi-Dimensional Dynamic Time Warping) logic between the representative time series of each cluster and the measurement time series data of each cycle belonging to that cluster. Because the time series data prediction method according to the present invention can cluster multidimensional time series data and generate the representative time series data of each cluster in the same way as for one-dimensional time series data, it extends to multidimensional time series data. That is, this embodiment allows multidimensional time series data to be used as a factor for predicting the cluster of the measurement time series data of the prediction period.
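One way to compute such a summed multidimensional DTW distance is sketched below in Python. This is a minimal sketch under stated assumptions: each time step is a tuple of values, the local cost is the Euclidean distance between two such tuples, and no normalization or warping window is applied. The source does not specify these details, and the function names are illustrative.

```python
import math


def md_dtw_distance(a, b):
    """DTW distance between two multidimensional series (lists of equal-length tuples)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: Euclidean distance between the two sample vectors.
            step = math.sqrt(sum((x - y) ** 2
                                 for x, y in zip(a[i - 1], b[j - 1])))
            cost[i][j] = step + min(cost[i - 1][j],      # stretch series b
                                    cost[i][j - 1],      # stretch series a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def total_distance(cycles, labels, representatives):
    """Sum of distances between each cycle and its cluster's representative series."""
    return sum(md_dtw_distance(cycle, representatives[label])
               for cycle, label in zip(cycles, labels))
```

Evaluating `total_distance` for each candidate number of clusters yields the sums to which the decrease-based rule can be applied.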

As already mentioned, the time series data among the environmental data is clustered in the same way as the measurement time series data, and the representative time series data of each cluster is extracted. FIG. 12 shows the result of clustering summer and winter temperature time series data and extracting the representative time series data of each cluster.

Time series data among the environmental data is clustered because, by the nature of time series data, exactly identical data is unlikely to recur. The environmental data is therefore clustered so that environmental time series data can be included as a factor for predicting the cluster of the measurement time series data. The identifier (for example, the index) of each cluster may be used as a factor of the optimal classification model for predicting the cluster of the measurement time series data. The optimal classification model is described in detail later with reference to FIGS. 3, 15, and 16.

FIG. 10 shows multidimensional environmental time series data. For example, for n-dimensional environmental time series data comprising n different environmental time series, clustering the data as a single n-dimensional environmental time series may cluster each day's environment more effectively than splitting it into n one-dimensional environmental time series and clustering them separately. Multidimensional environmental time series data therefore also needs to be usable as a factor of the optimal classification model. Multidimensional environmental time series data can be clustered, and its representative time series data extracted, using the same method already described for multidimensional measurement time series data.

As already mentioned, the environmental data collected and learned in some embodiments of the present invention also includes data other than time series data. For example, data indicating an attribute value of the environment (for example, whether each date in FIG. 11 is a Saturday, a weekday, or a holiday) or data indicating a representative value of each environment (for example, the average temperature, humidity, and atmospheric pressure for each date in FIG. 13) may also be included in the environmental data. According to one embodiment, environmental data other than time series data may also be clustered by a well-known clustering method, and a representative value of each cluster may be extracted.

FIG. 14 shows the form in which environmental time series data with a 24-hour cycle is stored for each cycle. As shown in FIG. 14, the environmental time series data of each cycle may be stored together with a cluster index that serves as the identifier of its cluster. In addition, the representative time series data of each cluster may be stored as a result of the clustering.

Returning to FIG. 2, the operations after clustering are described. When clustering is complete, an optimal model for obtaining the cluster of the measurement time series data is generated (S108). This model is the optimal classification model that best classifies the clusters of the measurement time series data in a plane or space whose axes are at least some of the received plurality of environmental data.

For example, when the first axis is the temperature time series data cluster and the second axis is the humidity time series data cluster, and the measurement time series data of the training period is plotted on the plane formed by the first and second axes, a single reference line that best separates the clusters of the measurement time series data can be drawn on that plane. Using this reference line, the cluster of the measurement time series data can be found by inputting the temperature time series data cluster and the humidity time series data cluster of the prediction period. Thus, the optimal model for obtaining the cluster of the measurement time series data is the optimal classification model that best classifies the clusters of the measurement time series data in a plane or space whose axes are at least some of the received plurality of environmental data.

The operation of generating the optimal classification model (S108) is described in more detail with reference to FIG. 3.

First, the environmental data to be used as factors is selected from the plurality of environmental data. For example, if three kinds of environmental data (A, B, C) have been collected, there are seven possible selections (A, B, C, AB, AC, BC, ABC). It is assumed that the measurement time series data does not depend on only one kind of environmental data. If two kinds of environmental data are selected as factors, the two factors form a plane, and the measurement time series data of each cycle of the training period can be plotted on this plane.
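The count of seven selections for three kinds of environmental data (2^3 - 1 non-empty subsets) can be reproduced with a short Python sketch. The function name `factor_selections` and its optional size bounds are illustrative assumptions.

```python
from itertools import combinations


def factor_selections(names, min_size=1, max_size=None):
    """Yield every selection of factors whose size lies in [min_size, max_size]."""
    if max_size is None:
        max_size = len(names)
    for size in range(min_size, max_size + 1):
        for selection in combinations(names, size):
            yield selection


# Three kinds of environmental data, as in the example in the text.
all_selections = ["".join(s) for s in factor_selections(["A", "B", "C"])]
# Restricting the number of factors to exactly two leaves AB, AC, and BC.
pairs_only = ["".join(s) for s in factor_selections(["A", "B", "C"], 2, 2)]
```

The `min_size` and `max_size` bounds show how a range on the number of selectable factors would shrink the search.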

FIG. 15 shows the case where two environmental time series are selected: on a plane formed by a first axis indicating the cluster index of the first environmental time series data and a second axis indicating the cluster index of the second environmental time series data, the measurement time series data of each cycle of the training period is marked with the index number of its cluster. When the data of the training period has been processed as in Table 1 below, the clusters of the measurement time series data may be displayed as in FIG. 15.

[Table 1]
Cycle | First environmental time series data cluster index | Second environmental time series data cluster index | Measurement time series data cluster index
1 | 1 | 1 | 1
2 | 2 | 1 | 1
3 | 1 | 2 | 1
4 | 2 | 2 | 1
5 | 3 | 1 | 1
6 | 4 | 1 | 2
7 | 3 | 2 | 2
8 | 1 | 3 | 2
9 | 2 | 3 | 2
10 | 3 | 3 | 2
11 | 4 | 2 | 2
12 | 5 | 1 | 2
13 | 6 | 1 | 2
14 | 5 | 2 | 2
15 | 4 | 3 | 2
16 | 5 | 3 | 2
17 | 6 | 3 | 2
18 | 6 | 2 | 2

On the plane shown in FIG. 15, the problem of optimally classifying the clusters of the measurement time series data can be solved using any of various classification logics, such as SVM (Support Vector Machine) logic or decision tree logic. That is, embodiments of the present invention extend to generating a model that optimally classifies the cluster of each cycle's measurement time series data in the plane or space formed by the environmental data, using the various classification logics introduced in, for example, the web document 'https://en.wikipedia.org/wiki/Statistical_classification'. For ease of understanding, however, an embodiment using SVM logic is described below.

FIG. 16 shows the case where one environmental time series (temperature time series data) and one environmental representative value (average humidity) are each selected as factors. As already described, environmental time series data cannot be plotted on an axis as is, so the first axis indicates the cluster index of the environmental time series data. Running the SVM logic yields a hyperplane 63 that optimally separates the two heterogeneous groups of data on the plane (the measurement time series data of the first cluster and that of the second cluster). The maximum margin is then the distance between the two vectors 61 and 62 that are parallel to the hyperplane 63 and pass through the data closest to it.
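For a linear SVM, the margin has a closed form that the source does not spell out. If the hyperplane is written w·x + b = 0 and scaled so that the closest points on either side satisfy w·x + b = +1 and w·x + b = -1, the distance between the two parallel boundaries is 2/||w||. The Python sketch below assumes this standard formulation; the weight vector, offset, and sample points are made-up values.

```python
import math


def margin_width(w):
    """Margin of a linear SVM in canonical form: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(c * c for c in w))


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(c * c for c in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Made-up separating line x1 = 2.5 on a two-factor plane, in canonical form.
w, b = (1.0, 0.0), -2.5
closest_first_cluster = (1.5, 7.0)    # satisfies w.x + b = -1
closest_second_cluster = (3.5, -2.0)  # satisfies w.x + b = +1
```

Here each closest point lies at distance 1.0 from the hyperplane, so the margin is 2.0.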

When two factors are selected as in FIG. 16, the hyperplane 63 shown in FIG. 16 is the optimal classification model. If other factor selections are considered, however, a classification model that better separates the measurement time series data of the first cluster from that of the second cluster may be generated. In one embodiment, the performance indicator of a classification model is the maximum margin value of the hyperplane generated by the SVM logic; the larger the maximum margin, the better the performance indicator.

Accordingly, to obtain the optimal classification model that best classifies the clusters of the measurement time series data, it suffices to try various combinations of the collected plurality of environmental data and find the case with the largest maximum margin value.

Returning to FIG. 3, the environmental data to be used as factors is selected from the plurality of environmental data (S180), a classification model is generated in the space (when three or more factors are selected) or plane (when two factors are selected) formed by the axes indicating the selected factors (S182), and the performance indicator value of the classification model (the maximum margin value when SVM logic is used) is determined (S184). Until no unexamined factor selection case remains (S186), the operations of changing the factor selection (S188), generating a classification model using the selected factors (S182), and determining the performance indicator value of the generated classification model (S184) are repeated.

When selecting factors, every case of selecting at least some of the plurality of environmental data may be allowed, the range of the number of selectable factors may be specified, or the type of selectable data may be restricted to specific types (for example, to environmental time series data and environmental representative values).

After all factor selection cases have been examined, the performance indicator values of the classification models generated in the respective factor selection cases are compared, and the classification model with the highest performance indicator value is selected as the optimal classification model (S189).
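The loop through steps S180 to S189 amounts to an exhaustive search that keeps the best-scoring model, as the following Python sketch shows. The callable `fit_and_score` is a hypothetical stand-in for generating a classification model and determining its performance indicator value; the toy score table below is invented purely so the sketch can run.

```python
from itertools import combinations


def select_optimal_model(factor_names, fit_and_score):
    """Try every non-empty factor selection and keep the best-scoring model.

    fit_and_score(selection) must return (model, performance_indicator_value).
    """
    best = None
    for size in range(1, len(factor_names) + 1):
        for selection in combinations(factor_names, size):  # S180 / S188
            model, score = fit_and_score(selection)         # S182, S184
            if best is None or score > best[2]:
                best = (selection, model, score)
    return best                                             # S189


# Invented indicator values for illustration only.
toy_scores = {
    ("temperature",): 0.4,
    ("humidity",): 0.3,
    ("holiday",): 0.2,
    ("temperature", "humidity"): 0.9,
    ("temperature", "holiday"): 0.6,
    ("humidity", "holiday"): 0.5,
    ("temperature", "humidity", "holiday"): 0.7,
}


def toy_fit_and_score(selection):
    return "model(" + ", ".join(selection) + ")", toy_scores[selection]


best_selection, best_model, best_score = select_optimal_model(
    ["temperature", "humidity", "holiday"], toy_fit_and_score)
```

In an SVM-based embodiment, the returned score would be the maximum margin value of the fitted hyperplane.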

Next, as part of the training work, a regression model is built for each measurement time series cluster; it receives the environmental data of the cycles belonging to that measurement time series cluster and outputs measurement time series data. That is, building the regression model includes building the regression model corresponding to the first measurement time series data cluster using only the data of the cycles clustered into the first measurement time series data cluster, without using the data of the cycles clustered into the second measurement time series data cluster. For example, in the case shown in Table 1, only the environmental data of cycles 1 to 5 is used when building the regression model corresponding to measurement time series data cluster 1.
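The partitioning of training cycles by measurement cluster can be sketched in Python using the rows of Table 1. The tuple layout and the function name are assumptions made for illustration; fitting the regression models themselves is outside this sketch.

```python
from collections import defaultdict

# Rows of Table 1: (cycle, first environmental cluster index,
#                   second environmental cluster index, measurement cluster index)
TABLE_1 = [
    (1, 1, 1, 1), (2, 2, 1, 1), (3, 1, 2, 1), (4, 2, 2, 1), (5, 3, 1, 1),
    (6, 4, 1, 2), (7, 3, 2, 2), (8, 1, 3, 2), (9, 2, 3, 2), (10, 3, 3, 2),
    (11, 4, 2, 2), (12, 5, 1, 2), (13, 6, 1, 2), (14, 5, 2, 2),
    (15, 4, 3, 2), (16, 5, 3, 2), (17, 6, 3, 2), (18, 6, 2, 2),
]


def cycles_by_measurement_cluster(rows):
    """Group cycle numbers by the measurement cluster they were assigned to."""
    groups = defaultdict(list)
    for cycle, _first_env, _second_env, measurement_cluster in rows:
        groups[measurement_cluster].append(cycle)
    return dict(groups)


training_cycles = cycles_by_measurement_cluster(TABLE_1)
```

Each group would then supply the training data for exactly one regression model.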

The regression model may be built by applying any of the various logics presented in, for example, the web document 'https://en.wikipedia.org/wiki/Regression_analysis'. For example, the regression model may be any one of various regression models such as MARS (Multivariate Adaptive Regression Splines) or polynomial regression.

The regression model has first environmental time series data among the environmental data as a first independent variable. This is because at least one time series that varies over time must be input in order to output measurement time series data.

The regression model may have, as additional independent variables, at least one of: the cluster identifier of second environmental time series data different from the first environmental time series data; a representative value representing a specific environment in each cycle (for example, average temperature); and data indicating an attribute of the environment (for example, whether the day is a weekday or a holiday).

Hereinafter, the operation of predicting the measurement time series data of the prediction period is described with reference to FIG. 4.

Environmental data for the prediction period is received (S200). The received environmental data may be predicted values, for example weather forecast information. The weather forecast information may include, for example, the average temperature and average humidity of the prediction period and predicted temperature time series data over time. The environmental data preferably includes all the data included as factors of the optimal classification model for the time series data to be predicted.

If environmental time series data is included among the factors of the optimal classification model, it is determined which of the clusters of that environmental time series data the time series predicted for it is closest to (S202).

As already described, representative time series data of each cluster is extracted when the environmental time series data is clustered (S106). When determining the cluster corresponding to the environmental time series data of the prediction period (S202), that data need not be compared with all the data belonging to each cluster; it only needs to be compared with the representative time series data of each cluster. That is, the cluster of the environmental time series data to which the environmental time series data of the prediction period belongs is selected on the basis of a similarity computed by a difference value calculation logic between the environmental time series data of the prediction period and the representative time series of each cluster.

The similarity may be obtained by any of various logics that compute a difference value between time series data, for example DTW (Dynamic Time Warping) difference-value calculation logic applied to the environmental time series data of the prediction period and the representative time series data of each cluster of the environmental time series data.

For example, if there are 10 clusters, only 10 DTW comparisons need to be performed in the cluster determination (S202). This embodiment therefore has the effect that the cluster corresponding to the environmental time series data of the prediction period can be determined quickly.
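The cluster determination of step S202 can be illustrated with a minimal sketch, assuming one-dimensional series and an absolute-difference local cost. The function names `dtw_distance` and `nearest_cluster` and the sample temperature profiles are hypothetical and are not taken from the specification; any other difference-value logic could be substituted for DTW.

```python
from math import inf

def dtw_distance(a, b):
    """Textbook DTW distance with absolute difference as the local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def nearest_cluster(forecast, representatives):
    """Return the id of the cluster whose representative series is closest.

    Only one DTW computation per cluster is needed, regardless of how many
    training periods each cluster contains.
    """
    return min(representatives, key=lambda cid: dtw_distance(forecast, representatives[cid]))

# Hypothetical representative temperature profiles, keyed by cluster id.
representatives = {
    0: [10.0, 12.0, 15.0, 13.0],
    1: [20.0, 24.0, 29.0, 25.0],
    2: [-2.0, 0.0, 3.0, 1.0],
}
forecast = [19.0, 23.0, 28.0, 27.0]  # hypothetical forecast for the prediction period
cluster_id = nearest_cluster(forecast, representatives)
```

With k clusters the loop inside `nearest_cluster` evaluates exactly k distances, which is the property the embodiment relies on for speed.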

By inputting the environmental data of the prediction period into the factors of the optimal classification model, the cluster of the measured time series data of the prediction period is predicted (S204). As already mentioned, when environmental time series data is included among the factors of the optimal classification model, what is input is not the environmental time series data itself but the cluster identifier (for example, a cluster index) of the environmental time series data.

Inputting the environmental data of the prediction period into the regression model that corresponds to the predicted cluster of the measured time series data yields the measured time series data of the prediction period (S206). As shown in FIG. 17, according to this embodiment, a different cluster of the measured time series data means a different regression model is applied. For example, when energy usage is the measured time series data to be predicted and the energy usage cluster of the prediction period is predicted to be #1, model No. 1 in the MARS (Multivariate Adaptive Regression Splines) model format may be used as the regression model. When the energy usage cluster of the prediction period is predicted to be #2, the regression model changes to model No. 2.

Meanwhile, when the measured time series data to be predicted is different, a regression model of a different model format may be applied. For example, FIG. 17 shows that a polynomial regression model is used for water usage time series data.
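The per-cluster model selection of step S206 can be sketched as follows. This is only an illustration under stated assumptions: a simple least-squares line stands in for the MARS and polynomial models named in the specification, a single environmental variable is used, and the helper names (`fit_line`, `train_cluster_models`, `predict_measurement`) and all sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (stand-in for MARS/polynomial)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def train_cluster_models(samples):
    """samples: (cluster_id, env_value, measured_value) tuples from the training period."""
    by_cluster = {}
    for cid, x, y in samples:
        xs, ys = by_cluster.setdefault(cid, ([], []))
        xs.append(x)
        ys.append(y)
    # Each model is fitted only on the data assigned to its own cluster.
    return {cid: fit_line(xs, ys) for cid, (xs, ys) in by_cluster.items()}

def predict_measurement(models, predicted_cluster, env_series):
    """Apply the regression model that belongs to the predicted cluster."""
    a, b = models[predicted_cluster]
    return [a * x + b for x in env_series]

samples = [
    (1, 20.0, 50.0), (1, 25.0, 60.0), (1, 30.0, 70.0),  # cluster #1 follows y = 2x + 10
    (2, 20.0, 25.0), (2, 25.0, 30.0), (2, 30.0, 35.0),  # cluster #2 follows y = x + 5
]
models = train_cluster_models(samples)
# Suppose the classification model predicted cluster #2 for the prediction period.
forecast_usage = predict_measurement(models, predicted_cluster=2, env_series=[22.0, 28.0])
```

The same environmental forecast produces different usage predictions depending on which cluster was predicted, which is the behavior FIG. 17 describes.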

The methods according to the embodiments of the present invention described so far with reference to FIGS. 1 to 17 may be performed by executing a computer program implemented as computer-readable code. The computer program may be transmitted from a first computing device to a second computing device over a network such as the Internet and installed on the second computing device, and may thereby be used on the second computing device. The first computing device and the second computing device each encompass server devices, physical servers belonging to a server pool for a cloud service, and stationary computing devices such as desktop PCs.

The computer program may also be stored on a recording medium such as a DVD-ROM or a flash memory device.

Time series data prediction device

Hereinafter, the configuration and operation of a time series data prediction device according to another embodiment of the present invention will be described with reference to FIG. 18.

As shown in FIG. 18, the time series data prediction device 20 according to this embodiment includes a processor 200, a memory 206, a network interface 204, a storage 208, and a system bus 202. The processor 200, the network interface 204, the storage 208, and the memory 206 exchange data over the system bus 202. The memory 206 loads a computer program that analyzes the measured time series data of the training period in order to predict the measured time series data of the prediction period. The processor 200 executes the computer program loaded in the memory 206.

The network interface 204 receives the measured time series data and environmental data of the training period, as well as the environmental data of the prediction period, over a network connected to a plurality of sensors and an environmental data management device. It also transmits to a terminal device either the cluster information of the measured time series data of the prediction period or the prediction result for the measured time series data of the prediction period.

The storage 208 stores the measured time series data and the environmental data received through the network interface 204, together with the data queried by the computer program: measured time series clustering result data 280, environmental time series clustering result data 282, and regression models 284 for each cluster of the measured time series data.

The measured time series clustering result data 280 includes the result of clustering the measured time series data of the training period and the representative time series data of each cluster.

The environmental time series clustering result data 282 includes the result of clustering the environmental time series data of the training period and the representative time series data of each cluster.

The regression models 284 for each cluster of the measured time series data include configuration information of the regression model for each cluster of each measured time series data. The configuration information of a regression model may include regression model type information and a factor list.
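One possible in-memory layout for this configuration information is sketched below. The class name `RegressionModelConfig`, the registry keys, the cluster ids, and the factor names are all hypothetical; the specification only states that a model type and a factor list may be stored for each cluster of each measured time series.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegressionModelConfig:
    model_type: str   # regression model type information, e.g. "MARS" or "polynomial"
    factors: tuple    # factor list: names of the independent variables (hypothetical names)

# Keyed by (measured time series name, cluster id of that measured time series).
registry = {
    ("energy_usage", 1): RegressionModelConfig("MARS", ("temperature_series", "avg_humidity")),
    ("water_usage", 1): RegressionModelConfig("polynomial", ("temperature_series",)),
}

def lookup(measurement, cluster_id):
    """Fetch the configuration of the regression model for a predicted cluster."""
    return registry[(measurement, cluster_id)]

config = lookup("water_usage", 1)
```

Keying on both the measured quantity and its cluster reflects that the model may differ across clusters and across the kinds of measured data.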

The storage 208 may further store information on the optimal classification model for each measured time series data.

The computer program includes training logic 260 and prediction logic 262.

The training logic 260 includes: an operation of clustering measured time series data of a predetermined period unit during a training period into a plurality of clusters; an operation of collecting a plurality of environmental data during the training period; an operation of selecting at least some of the plurality of environmental data as factors; an operation of generating a classification model that optimally classifies the clusters of the measured time series data in a space or on a plane formed by axes indicating the factors; an operation of determining a performance indicator value of the generated classification model; and an operation of repeating the selecting of the factors, the generating of the classification model, and the determining of the performance indicator value while changing the selection of the factors, and selecting an optimal classification model among the generated classification models based on the performance indicator value.
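The repeated select, generate, and evaluate loop of the training logic can be sketched as an exhaustive search over factor subsets. This is a simplified illustration, not the claimed implementation: a nearest-centroid classifier and training accuracy are used as stand-ins for the classification model and the performance indicator (one embodiment in the claims uses an SVM and its maximum margin instead), and the factor names and data are hypothetical.

```python
from itertools import combinations

def fit_centroids(rows, labels, factors):
    """Stand-in classifier: per-cluster mean of each selected factor."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(factors))
        for i, f in enumerate(factors):
            acc[i] += row[f]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, row, factors):
    def sq_dist(label):
        return sum((row[f] - c) ** 2 for f, c in zip(factors, centroids[label]))
    return min(centroids, key=sq_dist)

def accuracy(centroids, rows, labels, factors):
    """Stand-in performance indicator: fraction of training periods classified correctly."""
    hits = sum(classify(centroids, r, factors) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def select_optimal_model(rows, labels, candidate_factors):
    """Try every non-empty factor subset and keep the best-scoring model."""
    best = None
    for size in range(1, len(candidate_factors) + 1):
        for factors in combinations(candidate_factors, size):
            model = fit_centroids(rows, labels, factors)
            score = accuracy(model, rows, labels, factors)
            if best is None or score > best[0]:  # ties keep the earlier, smaller subset
                best = (score, factors, model)
    return best

# Hypothetical per-period environmental data and measured-series cluster labels.
rows = [
    {"avg_humidity": 60.0, "avg_temp": 5.0},
    {"avg_humidity": 40.0, "avg_temp": 7.0},
    {"avg_humidity": 62.0, "avg_temp": 25.0},
    {"avg_humidity": 38.0, "avg_temp": 27.0},
]
labels = [0, 0, 1, 1]
score, factors, model = select_optimal_model(rows, labels, ["avg_humidity", "avg_temp"])
```

Exhaustive enumeration grows exponentially with the number of candidate factors, so a practical system with many environmental variables would likely need a more selective search strategy; the specification does not prescribe one in this section.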

The prediction logic 262 includes an operation of predicting the cluster of the measured time series data of the prediction period using the optimal classification model, and an operation of predicting the measured time series data of the prediction period using the regression model for the predicted cluster of the measured time series data.

In this specification, an operation consists of a series of instructions that can be interpreted and executed by the processor 200 and that perform a specific function.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

Clustering measurement time series data of a predetermined period unit during a training period into a plurality of clusters;
Collecting a plurality of environmental data during the training period;
Selecting at least a part of the plurality of environmental data as a factor;
Generating a classification model that optimally classifies clusters of the measured time series data on a space or a plane constituted by axes indicating the factor;
Determining a performance indicator value of the generated classification model;
Selecting an optimal classification model among the generated classification models based on the performance indicator value, by repeating the selecting of the factor, the generating of the classification model, and the determining of the performance indicator value while changing the selection of the factor; And
Predicting a cluster of the measured time series data of a prediction period using the optimal classification model,
Time series data prediction method.

The method according to claim 1,
Wherein the collecting of the environmental data comprises:
Clustering environmental time series data of a predetermined period unit, among the collected environmental data, into a plurality of clusters,
Wherein, when environmental time series data is included among the selected factors, the axis indicating the environmental time series data is an axis indicating the cluster of the environmental time series data,
Time series data prediction method.

3. The method of claim 2,
Wherein clustering the environmental time series data into a plurality of clusters includes:
And clustering in the same manner as the clustering method performed in the step of clustering the measured value time series data into a plurality of clusters.
Time series data prediction method.

4. The method of claim 2,
Wherein the step of predicting the cluster of the measured time series data of the prediction period using the optimal classification model comprises:
Selecting, when environmental time series data is included among the factors of the optimal classification model, the cluster of the environmental time series data to which the environmental time series data of the prediction period belongs; And
Applying the selected cluster of the environmental time series data of the prediction period to the optimal classification model,
Time series data prediction method.

5. The method of claim 4,
Wherein clustering the environmental time series data into a plurality of clusters includes:
Further comprising the step of deriving cluster-wise representative time series data,
Wherein the step of selecting the clusters of the environmental time series data to which the environmental time series data of the prediction period belongs comprises the steps of:
Selecting a cluster of the environmental time series data to which the environmental time series data of the prediction period belongs, based on the similarity according to the difference value calculation logic between the environmental time series data of the prediction period and the representative time series for each cluster of the environmental time series data,
Time series data prediction method.

6. The method of claim 2,
The environmental time series data is m-dimensional time series data composed of m (m > = 2) different environment time series data,
Wherein clustering the environmental time series data into a plurality of clusters includes:
Expressing the environmental time series data in a space formed by a time axis and m axes respectively indicating the m environmental time series; And
And clustering the expressed environmental time series data.
Time series data prediction method.

The method according to claim 1,
Wherein the step of generating the classification model comprises:
Generating the classification model using SVM (Support Vector Machine) logic,
Wherein determining the performance indicator comprises:
And using a maximum margin according to a hyperplane generated according to the SVM logic as a performance index of the classification model.
Time series data prediction method.

The method according to claim 1,
Wherein the step of estimating the cluster of the measured time series data in the prediction period comprises:
Receiving a prediction value of environmental data used as a factor of the selected classification model among environmental data of the prediction period; And
And inputting the predicted value of the environment data to the optimum classification model to predict a cluster of the measured time series data.
Time series data prediction method.

The method according to claim 1,
The step of clustering the measurement time series data into a plurality of clusters,
Clustering each of the measurement time series data into k clusters by applying clustering logic; And
And calculating representative time series data of each cluster using time series averaging logic.
Time series data prediction method.

10. The method of claim 9,
The step of clustering the measurement time series data into a plurality of clusters,
Performing the clustering into the k clusters and the calculating of a representative time series of each of the clusters;
And finally determining the k value based on the sum of similarities between the representative time series data for each cluster and the measured time series data of each cycle belonging to each cluster,
Time series data prediction method.

11. The method of claim 9,
Wherein the measurement time series data is multidimensional time series data composed of two or more individual measurement time series,
The step of clustering the measurement time series data into a plurality of clusters,
Performing the clustering into the k clusters and the calculating of a representative time series of each of the clusters;
Determining the k value based on the similarity between the representative time series data for each cluster and the measured time series data of each cycle belonging to each cluster,
Time series data prediction method.

Predicting a cluster of measured time series data of the predicted time period from environmental data of a predicted time period according to an analysis result of measurement time series data and environmental data during a training period; And
And estimating the measured time series data of the prediction period by using a regression model for a cluster of the selected measurement time series data,
Wherein the regression model uses the first environment time series data of the environment data as a first independent variable and the measurement time series data as a dependent variable,
Wherein the regression model includes a first model and a second model,
The step of estimating the time series data includes:
Using the first model when the cluster of the measured time series data is a first cluster, and using the second model, different from the first model, when the cluster of the measured time series data is a second cluster different from the first cluster,
Time series data prediction method.

13. The method of claim 12,
Wherein the regression model uses a cluster identifier of second environment time series data different from the first environment time series data as a second independent variable,
Time series data prediction method.

14. The method of claim 12,
Wherein the regression model has a representative value representing a specific environment of each cycle of the environmental data as a second independent variable,
Time series data prediction method.

15. The method of claim 12,
Clustering measurement time series data of a predetermined period unit during a training period into a plurality of clusters;
Collecting a plurality of different kinds of environmental data during the training period; And
Further comprising the step of constructing a regression model corresponding to a cluster of each measurement time series data,
The step of constructing the regression model comprises:
Constructing a regression model corresponding to a first measured time series data cluster using only the data of the periods clustered into the first measured time series data cluster, without using the data of the periods clustered into a second measured time series data cluster,
Time series data prediction method.

A memory for loading a computer program for analyzing measured time series data during a training period to predict said measured time series data in a predicted period;
A processor for executing the computer program loaded in the memory;
Network interface; And
A storage for storing measurement time series data received via the network interface, the environment data, and data inquired by the computer program,
The computer program comprising training logic and prediction logic,
The training logic comprises:
Clustering measurement time series data of a predetermined period unit during a training period into a plurality of clusters;
Collecting a plurality of environmental data during the training period;
Selecting at least some of the plurality of environmental data as a factor;
Generating a classification model that optimally classifies clusters of the measured time series data in a space or a plane comprising axes indicating the factors;
An operation for determining a performance indicator value of the generated classification model;
An operation of repeating the selecting of the factor, the generating of the classification model, and the determining of the performance indicator value while changing the selection of the factor, and selecting an optimal classification model among the generated classification models based on the performance indicator value,
The prediction logic comprises:
An operation of predicting a cluster of the measured time series data of the prediction period using the optimal classification model,
Time series data prediction device.

17. The device of claim 16,
The prediction logic comprises:
Further comprising an operation of predicting the measured time series data of the prediction period using a regression model for a cluster of the predicted measured time series data,
Time series data prediction device.