KR20220123845A - Method and device for measuring similarity between time series data

Info

Publication number: KR20220123845A
Application number: KR1020210027293A
Authority: KR
Inventors: 홍성록; 박대흠
Original assignee: 주식회사 케이티
Priority date: 2021-03-02
Filing date: 2021-03-02
Publication date: 2022-09-13

Abstract

Disclosed is a method for measuring similarity between time series data. According to the present invention, the method includes: collecting first time series data; dividing the first time series data into sections of a certain size and obtaining a main symbol based on the measurement values of each section; obtaining, from the section, at least one pattern symbol representing a structural pattern of the first time series data; and calculating a similarity between the first time series data and second time series data using the main symbol and the at least one pattern symbol combined with the main symbol.

Description

Method and apparatus for measuring similarity between time series data

The present invention relates to a method and apparatus for measuring similarity between time series data, which calculate the similarity between time series data using a main symbol and a pattern symbol representing the variation pattern of the time series data, and which can express that similarity as a normalized value.

With recent advances in technology, the use of edge computing devices such as IoT sensors has increased greatly.

Specifically, advances in microprocessor manufacturing have improved hardware specifications and strengthened the computing power of edge devices, and advances in communication technology have greatly increased the penetration and utilization of edge devices.

For example, as sensors have improved and become more widespread, the volume of sensor data collected or analyzed for building energy monitoring and control has grown substantially, and the growth is expected to be much larger in the future.

Meanwhile, because of the growing number of edge devices and the spread of complex computations such as sensor artificial intelligence and machine learning, the central cloud connected to many edge devices receives a very large amount of data and must perform complex computations on it.

To reduce data loss and delay caused by network latency and network stability problems, as well as traffic and the computational load on the central cloud, that computational load needs to be distributed to the edge devices.

However, although edge devices perform much better than before, their performance remains limited for reasons such as cost, so computation techniques that process and analyze data efficiently on edge devices are needed.

The present invention is intended to solve the problems described above. Its object is to provide a method and apparatus for measuring similarity between time series data that calculate the similarity using a main symbol and a pattern symbol representing the variation pattern of the time series data, and that can express the similarity as a normalized value.

A method for measuring similarity between time series data according to the present invention includes: collecting first time series data; dividing the first time series data into sections of a certain size and obtaining a main symbol based on the measurement values of each section; obtaining, from the section, one or more pattern symbols representing a structural pattern of the first time series data; and calculating a similarity between the first time series data and second time series data using the main symbol and the one or more pattern symbols combined with the main symbol.

In this case, the pattern symbols may include an increase symbol, a decrease symbol, a maintenance symbol, and a noise symbol.

In this case, obtaining the pattern symbol may include: calculating a standard deviation from the differences between the measurement value at the previous time point and the measurement value at the current time point of the first time series data; and determining that the pattern symbol is a noise symbol if the standard deviation is greater than a noise threshold.

In this case, obtaining the pattern symbol may further include: if the standard deviation is smaller than the noise threshold, calculating average values of the sub-sections within the section; and determining one of the increase symbol, the decrease symbol, and the maintenance symbol using the slope between the average values.
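The two steps above (a noise check on successive differences, then a slope check on sub-section averages) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pattern_symbol`, the symbol letters, the threshold values, the number of sub-sections, and the use of the population standard deviation are all hypothetical choices that the text does not specify.

```python
from statistics import mean, pstdev


def pattern_symbol(section, noise_threshold=1.0, slope_threshold=0.1, n_sub=2):
    """Return 'N' (noise), 'U' (increase), 'D' (decrease) or 'M' (maintenance).

    All thresholds and symbol letters are hypothetical illustration values.
    The section must hold at least two values and at least n_sub values.
    """
    # Differences between each measurement and the one before it.
    diffs = [cur - prev for prev, cur in zip(section, section[1:])]

    # A large spread of the differences is treated as noise.
    if pstdev(diffs) > noise_threshold:
        return "N"

    # Average each sub-section (trailing values that do not fill a
    # whole sub-section are ignored in this sketch).
    size = len(section) // n_sub
    means = [mean(section[k * size:(k + 1) * size]) for k in range(n_sub)]

    # Slope between the first and last sub-section averages, per step.
    slope = (means[-1] - means[0]) / (n_sub - 1)
    if slope > slope_threshold:
        return "U"
    if slope < -slope_threshold:
        return "D"
    return "M"
```

A section such as `[1, 2, 3, 4]` has constant differences, so it passes the noise check and is classified by its positive slope, whereas an oscillating section such as `[0, 5, 0, 5, 0, 5]` is caught by the noise check first.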

Meanwhile, the noise symbol may have the largest distance from the other symbols, and, excluding distances between identical symbols, the maintenance symbol may have the smallest distance from the increase symbol or the decrease symbol.

Meanwhile, calculating the similarity between the first time series data and the second time series data may include calculating a distance between a first combined symbol, which is a combination of the main symbol and the pattern symbol of the first time series data, and a second combined symbol, which is a combination of a second main symbol and a second pattern symbol of the second time series data.

In this case, calculating the distance between the first combined symbol and the second combined symbol may include: calculating a main symbol distance between the main symbol and the second main symbol using a comparison table of main symbols; calculating a pattern symbol distance between the pattern symbol and the second pattern symbol using a comparison table of pattern symbols; and summing the main symbol distance and the pattern symbol distance to obtain the distance between the first combined symbol and the second combined symbol.
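A minimal Python sketch of this table-based summation is given below. The table contents are hypothetical: the main symbol distance is assumed to be the rank difference of letters `a` to `e`, and the pattern symbol distances are made-up values chosen only to respect the ordering stated above (noise farthest from everything, maintenance closest to increase and decrease). The actual comparison tables of the invention are not reproduced here.

```python
# Hypothetical main symbols, ordered from lowest to highest level.
MAIN_ORDER = "abcde"

# Hypothetical pattern symbol distances: U = increase, D = decrease,
# M = maintenance, N = noise. Identical symbols have distance 0.
PATTERN_DIST = {
    frozenset("UM"): 1.0,
    frozenset("DM"): 1.0,
    frozenset("UD"): 2.0,
    frozenset("NU"): 3.0,
    frozenset("ND"): 3.0,
    frozenset("NM"): 3.0,
}


def main_distance(m1, m2):
    """Assumed main symbol distance: difference of the symbol ranks."""
    return float(abs(MAIN_ORDER.index(m1) - MAIN_ORDER.index(m2)))


def pattern_distance(p1, p2):
    """Look up the pattern symbol distance in the (hypothetical) table."""
    if p1 == p2:
        return 0.0
    return PATTERN_DIST[frozenset((p1, p2))]


def combined_distance(sym1, sym2):
    """Sum the main and pattern distances of two (main, pattern) pairs."""
    (m1, p1), (m2, p2) = sym1, sym2
    return main_distance(m1, m2) + pattern_distance(p1, p2)
```

With these assumed tables, `combined_distance(("c", "U"), ("c", "D"))` evaluates to 2.0: the main symbols agree, so only the pattern symbols contribute.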

Meanwhile, calculating the similarity between the first time series data and the second time series data may further include: generating a DTW matrix using the distance between the first combined symbol and the second combined symbol of each section; and selecting, using the DTW matrix, an optimal warping path along which the first time series data and the second time series data are mapped most closely.

In this case, calculating the similarity between the first time series data and the second time series data may further include asynchronously mapping the first time series data and the second time series data using the plurality of index pairs that make up the optimal warping path.
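The DTW matrix, the optimal warping path, and its index pairs can be sketched with the textbook dynamic-programming form of DTW. The sketch below is generic and is not taken from the disclosure: `dtw_path` and its `dist` parameter are hypothetical names, and `dist` stands in for whatever element-wise distance is used (for example, a combined symbol distance).

```python
def dtw_path(seq_a, seq_b, dist):
    """Return (total_cost, path) for two sequences under distance `dist`.

    `path` is a list of (i, j) index pairs mapping seq_a[i] to seq_b[j].
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")

    # acc[i][j] is the cheapest cost of aligning seq_a[:i] with seq_b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])

    # Backtrack from the last cell to recover the optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path
```

One index of the shorter sequence may appear in several pairs, which is how sequences of different lengths, or with a time lag, are mapped onto each other.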

In this case, calculating the similarity between the first time series data and the second time series data may further include calculating, using the asynchronously mapped values of the first time series data and the second time series data, a normalized statistical similarity that indicates the correlation between the first time series data and the second time series data.
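As one possible reading of this step, the sketch below computes Pearson's correlation coefficient over the value pairs selected by the warping path. The choice of Pearson's coefficient, the function name `mapped_similarity`, the assumption that the path indices point directly into the value lists, and the return value of 0.0 for a constant series are all assumptions made for illustration.

```python
from math import sqrt


def mapped_similarity(values_a, values_b, path):
    """Pearson correlation of the values paired by `path` (range -1 to 1)."""
    # Pair up the values according to the (i, j) index pairs of the path.
    xs = [values_a[i] for i, _ in path]
    ys = [values_b[j] for _, j in path]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    # Correlation is undefined for a constant series; 0.0 is an assumed fallback.
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Because the result is bounded, it can be interpreted on its own, without comparing it to the distances of other time series pairs.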

Meanwhile, an apparatus for measuring similarity between time series data according to the present invention includes: a data collection unit that collects first time series data; and a control unit that divides the first time series data into sections of a certain size, obtains a main symbol based on the measurement values of each section, obtains from the section one or more pattern symbols representing a structural pattern of the first time series data, and calculates a similarity between the first time series data and second time series data using the main symbol and the one or more pattern symbols combined with the main symbol.

In this case, the control unit may calculate a standard deviation from the differences between the measurement value at the previous time point and the measurement value at the current time point of the first time series data, and may determine that the pattern symbol is a noise symbol if the standard deviation is greater than a noise threshold.

In this case, if the standard deviation is smaller than the noise threshold, the control unit may calculate average values of the sub-sections within the section and determine one of the increase symbol, the decrease symbol, and the maintenance symbol using the slope between the average values.

Meanwhile, the control unit may calculate a distance between a first combined symbol, which is a combination of the main symbol and the pattern symbol of the first time series data, and a second combined symbol, which is a combination of a second main symbol and a second pattern symbol of the second time series data.

In this case, the control unit may calculate a main symbol distance between the main symbol and the second main symbol using a comparison table of main symbols, calculate a pattern symbol distance between the pattern symbol and the second pattern symbol using a comparison table of pattern symbols, and sum the main symbol distance and the pattern symbol distance to obtain the distance between the first combined symbol and the second combined symbol.

Meanwhile, the control unit may generate a DTW matrix using the distance between the first combined symbol and the second combined symbol of each section, and may select, using the DTW matrix, an optimal warping path along which the first time series data and the second time series data are mapped most closely.

In this case, the control unit may asynchronously map the first time series data and the second time series data using the plurality of index pairs that make up the optimal warping path.

Meanwhile, a computer program may be stored in a medium in order to perform a method for measuring similarity between time series data, the method including: collecting first time series data; dividing the first time series data into sections of a certain size and obtaining a main symbol based on the measurement values of each section; obtaining, from the section, one or more pattern symbols representing a structural pattern of the first time series data; and calculating a similarity between the first time series data and second time series data using the main symbol and the one or more pattern symbols combined with the main symbol.

According to the present invention, symbolization significantly reduces the amount of computation while the information distortion caused by symbolization is minimized, which improves the accuracy of similarity determination for time series data.

In addition, according to the present invention, the similarity between time series data is expressed as a normalized value, and a state can be classified or an anomaly detected from the normalized value alone. For example, the control unit 120 may judge the state to be normal when the normalized statistical similarity is 0.7 or more, a sign of abnormality when it is between 0.5 and 0.7, and an abnormality when it is 0.5 or less. In other words, because an absolute distance between time series data is calculated, there is no need to store many comparison time series or to calculate distance values for them, which reduces the required memory and the amount of computation.
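The example thresholds above translate directly into a small decision function. The sketch below uses the 0.7 and 0.5 values from the example; the function name `classify_state` and the returned labels are hypothetical.

```python
def classify_state(similarity):
    """Map a normalized statistical similarity to a state label.

    Thresholds follow the example in the text; labels are illustrative.
    """
    if similarity >= 0.7:
        return "normal"
    if similarity > 0.5:
        return "sign of abnormality"
    return "abnormal"
```

No stored reference distances are needed for this decision, which is the memory and computation saving the paragraph describes.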

In addition, when the apparatus 100 for measuring similarity between time series data is implemented as an edge device, the apparatus 100 itself performs state classification or anomaly detection based on the time series data, which reduces the traffic load on the cloud server and network, the computational load on the cloud server, and data loss and delay.

In addition, when the apparatus 100 for measuring similarity between time series data is implemented as an edge device, the low amount of computation allows the hardware specifications of the edge device to be lowered, which makes it price competitive.

FIG. 1 is a diagram for explaining the problems of conventional statistical similarity, distance-based similarity, and asynchronous distance-based similarity measurement.
FIG. 2 is a diagram for explaining the limitations of the asynchronous distance-based similarity measurement method.
FIG. 3 is a diagram for explaining a conventional technique for representing the structural features of time series data.
FIG. 4 is a block diagram for explaining the components of an apparatus for measuring similarity between time series data according to the present invention.
FIG. 5 is a flowchart for explaining a method for measuring similarity between time series data according to the present invention.
FIGS. 6 and 7 are diagrams for explaining a method of obtaining symbols according to the present invention.
FIG. 8 is a diagram for explaining the comparison table of main symbols and the comparison table of pattern symbols according to the present invention.
FIG. 9 is a diagram for explaining general dynamic time warping (DTW).
FIG. 10 is a diagram for explaining a method of calculating a normalized similarity according to the present invention.
FIG. 11 is an example of calculating the normalized statistical similarity of first time series data and second time series data that have been asynchronously mapped based on DTW.
FIG. 12 is a diagram for explaining the effects of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting and do not in themselves carry distinct meanings or roles. In describing the embodiments, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments; the technical idea disclosed herein is not limited by the drawings and should be understood to cover all modifications, equivalents, and substitutes within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In implementing the present invention, components may be described separately for convenience, but they may be implemented within a single device or module, and a single component may be divided among multiple devices or modules.

FIGS. 1 to 3 are diagrams for explaining the problems of conventional methods for determining time series similarity.

Techniques for analyzing time series data (for example, sensor data such as temperature, humidity, gas flow rate, electric energy, cold and hot water flow rate, wind speed, and pressure) include statistical analysis (Mean, Max, Min, Mode, Median, Correlation Analysis, Principal Component Analysis, etc.), frequency analysis (DWT (Discrete Wavelet Transform), FFT (Fast Fourier Transform), etc.), and time series analysis (STL (Seasonal and Trend decomposition using Loess), Pattern Recognition/Matching, etc.).

Pattern matching, a time series analysis technique, is the approach mainly used to classify states or detect anomalies by analyzing time series data.

FIG. 1 is a diagram for explaining the problems of conventional statistical similarity, distance-based similarity, and asynchronous distance-based similarity measurement.

Similarity measurement underlies a variety of tasks such as classification, anomaly detection, and prediction of time series data, and it is an essential operation for intelligent analysis on edge devices.

FIG. 1A illustrates a method of calculating statistical similarity using Pearson's correlation.

This approach expresses the strength of a relationship as a value normalized to the range -1 to 1, so the relationship is easy to interpret and the computation is simple. On the other hand, it cannot capture nonlinear relationships, and it can only compare data of the same length.

FIG. 1B illustrates a general distance-based similarity measurement method using the Euclidean distance.

This approach allows intuitive comparison and is easy to compute, so it can serve as a general-purpose similarity measure. However, it is unsuitable for comparing time series patterns that have a time lag or different data lengths.
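The time-lag weakness can be seen in a small Python example. The data values below are made up for illustration: a pulse, the same pulse delayed by one sample, and a flat line.

```python
from math import sqrt


def euclidean(a, b):
    """Point-by-point Euclidean distance; both inputs must have equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


pulse = [0, 0, 5, 0, 0, 0]
shifted_pulse = [0, 0, 0, 5, 0, 0]   # same shape, one sample later
flat = [0, 0, 0, 0, 0, 0]            # no pulse at all

d_shifted = euclidean(pulse, shifted_pulse)
d_flat = euclidean(pulse, flat)
```

Here `d_shifted` is larger than `d_flat`, so the point-by-point distance rates the delayed copy of the pulse as less similar than a series with no pulse at all.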

To overcome the shortcomings of these similarity measurement methods, asynchronous distance-based similarity techniques such as dynamic time warping (DTW) emerged, as shown in FIG. 1C.

This approach can measure distances between asynchronous time points, can compare data of different lengths, and is good at comparing time series patterns. However, it requires far more computation than the Euclidean distance, and determining the degree of similarity requires a large amount of comparison data.

This is described with reference to FIG. 2.

FIG. 2 is a diagram for explaining the limitations of the asynchronous distance-based similarity measurement method.

Conventional distance calculation methods (Euclidean, Manhattan, Mahalanobis, etc.) produce results that range over the positive real numbers, and the distance values between time series data are not standardized.

Therefore, even when a distance value between time series data is calculated, the degree of similarity cannot be judged from that value alone; it can only be judged by comparing the value with the distances between other time series data.

In other words, the degree of similarity cannot be determined from the distance value alone, and an accurate determination is possible only when the number of comparison data is large.

For example, referring to FIG. 2, suppose that the degree of similarity between the first time series data S1 and the second time series data S2 is to be determined.

Referring to the table of FIG. 2B, the Euclidean distance between the first time series data S1 and the second time series data S2 is 8.48.

However, the value 8.48 alone does not indicate how similar the first time series data S1 and the second time series data S2 are. To conclude that a distance of about 8.48 indicates considerable similarity, a relative judgment using the distances to other time series data is required.

For example, only by comparing against the Euclidean distance between the first time series data S1 and the third time series data S3 (22.35) and the Euclidean distance between the first time series data S1 and the fourth time series data S4 (23.05) can it be seen that, at a distance of about 8.48, the two time series are quite similar.

Suppose further that new fifth time series data S5 is collected and that the distance between the first time series data S1 and the fifth time series data S5 is 2.23. In this case, the earlier judgment that "a distance of about 8.48 means the two time series are quite similar" no longer holds.

That is, to judge accurately how similar the first time series data S1 and the second time series data S2 are from their Euclidean distance of 8.48, a large number of comparison time series are needed.

Therefore, under the prior art, the required memory and amount of computation increase, because the comparison time series must be stored and the distance values for them must be calculated and stored.

In addition, as described above, the asynchronous distance-based similarity measurement is itself computationally very expensive, which further increases the amount of computation.

FIG. 3 is a diagram for explaining a conventional technique for representing the structural features of time series data.

SAX (Symbolic Aggregate Approximation) is a technique that extracts the structural features of data and condenses them into symbols (for example, letters of the alphabet).

Specifically, using the PAA (Piecewise Aggregate Approximation) technique, time series data is divided into W segments of equal size and converted into a vector of W values, each being the average of one segment. The data compressed into W sections can then be expressed as a discretized sequence of L symbols.

For example, FIG. 3A shows the symbols (b, a, e, d, c, b, b, a) corresponding to the average Z value of each section.
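The PAA and symbol assignment described above can be sketched as follows, using the common SAX formulation in which the series is z-normalized and the breakpoints are equal-probability quantiles of the standard normal distribution. The function name `sax_word`, the default alphabet, and the handling of a constant series are assumptions for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, n_segments, alphabet="abcde"):
    """Return the SAX word of `series` using `n_segments` PAA segments."""
    # Z-normalize so that the Gaussian breakpoints apply
    # (a constant series is mapped to all zeros in this sketch).
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma if sigma else 0.0 for v in series]

    # PAA: average each equal-size segment (leftover values are ignored).
    size = len(z) // n_segments
    paa = [mean(z[k * size:(k + 1) * size]) for k in range(n_segments)]

    # Breakpoints split the standard normal curve into equal-probability bins.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Each segment average is replaced by the letter of the bin it falls in.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)
```

For instance, `sax_word([1, 1, 5, 5, 9, 9], 3, "abc")` yields `"abc"`: three segments whose averages fall into the low, middle, and high bins.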

However, in condensing information into segments, SAX cannot retain information about changes within each segment. As long as the symbols corresponding to the averages are the same, time series with different detailed patterns within a section may be distorted into appearing very similar.

For example, the pattern of the first time series data (S1) of FIG. 3b and the pattern of the second time series data (S2) of FIG. 3c are very different. However, when the data is expressed with SAX, the symbols of both the first time series data (S1) and the second time series data (S2) are C-C-C-C-C-C-C-C, so when similarity is judged by comparing symbols, the first time series data (S1) and the second time series data (S2) are judged to be very similar.
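The limitation described above can be reproduced with a minimal sketch of PAA followed by symbol assignment. The function names, the five-symbol alphabet, the breakpoints, and the two sample series are all illustrative assumptions and are not taken from FIG. 3; the sketch only shows that a flat series and an oscillating series with the same segment means receive identical symbol strings.

```python
from statistics import mean

def paa(series, w):
    """Split the series into w equal-size segments and return each segment's mean."""
    size = len(series) // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]

def sax(series, w, breakpoints, alphabet):
    """Map each PAA mean to the symbol of the interval it falls into."""
    symbols = []
    for m in paa(series, w):
        # Count how many breakpoints lie at or below the segment mean.
        idx = sum(1 for b in breakpoints if m >= b)
        symbols.append(alphabet[idx])
    return "".join(symbols)

# Assumed breakpoints for a five-symbol alphabet (illustrative only).
BREAKPOINTS = [-0.84, -0.25, 0.25, 0.84]
ALPHABET = "abcde"

flat = [0.0] * 16            # no variation at all
zigzag = [0.5, -0.5] * 8     # oscillates, but every segment mean is 0.0

print(sax(flat, 8, BREAKPOINTS, ALPHABET))
print(sax(zigzag, 8, BREAKPOINTS, ALPHABET))
```

Both calls produce the same string, even though the two series behave very differently within each segment.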

따라서 정보 축약에 의한 정보 왜곡을 최소화 하는 시계열 구조적 특징 정보 추출 방법이 요구된다.Therefore, a method for extracting time-series structural feature information that minimizes information distortion due to information reduction is required.

위에서 설명한 종래 기술의 문제점을 종합해 볼 때, 본 발명은 시계열 유사도 측정 시 연산 부하를 낮추는 것, 그리고 시계열 데이터를 적게 활용하면서 시계열 유사도 측정의 정확도를 높이는 것을 목적으로 한다.Considering the problems of the prior art described above, an object of the present invention is to reduce the computational load when measuring time series similarity and to increase the accuracy of time series similarity measurement while using less time series data.

Further, the present invention minimizes distortion while condensing the information of time series data, and makes similarity determination possible without using a large number of time series data by comparing absolute distances rather than relative distances between time series data.

FIG. 4 is a block diagram for explaining the components of an apparatus for measuring similarity between time series data according to the present invention.

시계열 데이터 간의 유사도 측정 장치(100)는 데이터 수집부(110), 제어부(120), 통신부(130) 및 메모리(140)를 포함할 수 있다.The apparatus 100 for measuring the similarity between time series data may include a data collection unit 110 , a control unit 120 , a communication unit 130 , and a memory 140 .

데이터 수집부(110)는 시계열 데이터를 수집할 수 있다. The data collection unit 110 may collect time series data.

여기서 시계열 데이터는 센서 데이터일 수 있다. 예를 들어 시계열 데이터는 온도 데이터, 습도 데이터, 가스 유량 데이터, 전력량 데이터, 냉온수 유량 데이터, 풍속 데이터, 압력 데이터, 진동 데이터 등의 센서 데이터일 수 있다. 이 경우 데이터 수집부(110)는 상응하는 시계열 데이터를 수집하기 위한 센서를 구비할 수 있다.Here, the time series data may be sensor data. For example, the time series data may be sensor data such as temperature data, humidity data, gas flow data, wattage data, cold/hot water flow data, wind speed data, pressure data, vibration data, and the like. In this case, the data collection unit 110 may include a sensor for collecting the corresponding time series data.

다만 시계열 데이터는 센서 데이터에 한정되지 않으며, 다른 시계열 데이터와의 유사도 판단의 대상이 되는 다양한 데이터가 시계열 데이터로 사용될 수 있다.However, the time series data is not limited to the sensor data, and various data that is the target of determining the similarity with other time series data may be used as the time series data.

한편 시계열 데이터는 외부 장치로부터 수신될 수도 있으며, 이 경우 데이터 수집부(110)는 외부 장치와 통신하기 위한 통신부를 포함할 수 있다.Meanwhile, time series data may be received from an external device, and in this case, the data collection unit 110 may include a communication unit for communicating with the external device.

제어부(120)는 시계열 데이터 간의 유사도 측정 장치(100)의 전반적인 동작을 제어할 수 있다.The controller 120 may control the overall operation of the apparatus 100 for measuring the similarity between time series data.

또한 제어부(120)는 시계열 데이터를 일정 크기의 구간으로 분할하고, 구간의 측정 값들에 기초하여 메인 심볼을 획득할 수 있다.Also, the controller 120 may divide the time series data into sections of a predetermined size, and obtain a main symbol based on measurement values of the sections.

또한 제어부(120)는, 구간에서, 시계열 데이터의 변동 패턴을 나타내는 하나 이상의 패턴 심볼을 획득할 수 있다.In addition, the controller 120 may acquire one or more pattern symbols indicating a variation pattern of the time series data in the section.

또한 제어부(120)는, 메인 심볼 및 메인 심볼과 결합된 하나 이상의 패턴 심볼을 이용하여, 시계열 데이터와 다른 시계열 데이터 간의 유사도를 산출할 수 있다.Also, the controller 120 may calculate a similarity between the time series data and other time series data by using the main symbol and one or more pattern symbols combined with the main symbol.

한편 제어부(120)는 프로세서, 마이크로 프로세서, 컨트롤러, 마이크로 컨트롤러 등의 용어와 혼용되어 사용될 수 있다.Meanwhile, the controller 120 may be used interchangeably with terms such as a processor, a microprocessor, a controller, and a microcontroller.

통신부(130)는 외부 장치 사이의 유선 또는 무선 통신을 지원할 수 있다.The communication unit 130 may support wired or wireless communication between external devices.

구체적으로 통신부(130)는 시계열 데이터 간의 유사도 측정 장치(100)를 인터넷망을 포함하는 유/무선 네트워크와 연결하기 위한 인터페이스를 제공할 수 있다. 그리고 통신부(130)는 통신부(130)에 접속된 네트워크 또는 접속된 네트워크에 링크된 다른 네트워크를 통해, 외부 장치와 데이터를 송수신 할 수 있다.Specifically, the communication unit 130 may provide an interface for connecting the apparatus 100 for measuring the similarity between time series data with a wired/wireless network including an Internet network. In addition, the communication unit 130 may transmit/receive data to and from an external device through a network connected to the communication unit 130 or another network linked to the connected network.

The memory 140 may store programs for signal processing and control in the controller 120, contents, signal-processed image, audio, or data signals, and other data.

The memory 140 may also store application programs or data for the operation of the apparatus 100 for measuring similarity between time series data.

한편 시계열 데이터 간의 유사도 측정 장치(100)는 엣지 디바이스 일 수 있다. 예를 들어 시계열 데이터 간의 유사도 측정 장치(100)는 IOT 디바이스일 수 있다. 이 경우 통신부(130)는 복수의 엣지 디바이스와 연결된 클라우드 서버와 통신할 수 있다.Meanwhile, the apparatus 100 for measuring the similarity between time series data may be an edge device. For example, the apparatus 100 for measuring the similarity between time series data may be an IOT device. In this case, the communication unit 130 may communicate with a cloud server connected to a plurality of edge devices.

또한 한편 제어부(120)는 통신부(130)를 통하여 시계열 데이터 및 시계열 데이터 간의 유사도 중 적어도 하나를 클라우드 서버에 전송할 수 있다.In addition, the control unit 120 may transmit at least one of the time series data and the similarity between the time series data to the cloud server through the communication unit 130 .

또한 제어부(120)는 시계열 데이터 간의 유사도에 기반하여 현재 상태의 분류 또는 이상 상태에 대한 정보를 획득할 수 있다. 이 경우 제어부(120)는 통신부(130)를 통하여 현재 상태의 분류 또는 이상 상태에 대한 정보를 클라우드 서버에 전송할 수 있다.Also, the controller 120 may acquire information on a classification of a current state or an abnormal state based on the similarity between time series data. In this case, the controller 120 may transmit the classification of the current state or information on the abnormal state to the cloud server through the communication unit 130 .

한편 클라우드 서버는 복수의 엣지 디바이스로부터 수신되는 데이터들에 기초하여, 상태 정보의 획득, 이상 탐지 등의 동작을 수행할 수 있다.Meanwhile, the cloud server may perform operations such as obtaining status information and detecting anomalies based on data received from a plurality of edge devices.

FIG. 5 is a flowchart for explaining a method for measuring similarity between time series data according to the present invention.

The method for measuring similarity between time series data according to the present invention may include: collecting first time series data (S510); dividing the first time series data into sections of a predetermined size and obtaining a main symbol based on the measured values of each section (S530); obtaining, in the section, one or more pattern symbols representing a variation pattern of the first time series data (S550); and calculating a similarity between the first time series data and second time series data using the main symbol and the one or more pattern symbols combined with the main symbol (S570).

FIGS. 6 and 7 are diagrams for explaining a method of obtaining symbols according to the present invention.

데이터 수집부(110)는 시계열 데이터를 수집할 수 있다(S510). 이 경우 제어부(120)는 다른 시계열 데이터와의 비교를 위하여 시계열 데이터를 정규화 할 수도 있다.The data collection unit 110 may collect time series data (S510). In this case, the controller 120 may normalize the time series data for comparison with other time series data.

그리고 나서 제어부(120)는 시계열 데이터를 일정 크기의 구간으로 분할할 수 있다(S530). 구체적으로 제어부(120)는 시계열 데이터를 일정 크기를 가지는 복수의 시간 구간(610, 620, 630, 640, 650, 660, 670, 680)으로 분할할 수 있다.Then, the controller 120 may divide the time series data into sections of a predetermined size (S530). Specifically, the controller 120 may divide the time series data into a plurality of time sections 610 , 620 , 630 , 640 , 650 , 660 , 670 , and 680 having a predetermined size.

이 경우 제어부(120)는 상기 구간의 레벨에 기초하여 메인 심볼을 획득할 수 있다.In this case, the controller 120 may acquire the main symbol based on the level of the section.

구체적으로 구간 내에는 시계열 데이터를 구성하는 측정 값들이 존재한다. 이 경우 제어부(120)는 구간 내 측정 값들의 평균 값을 산출하고, 평균 값에 상응하는 메인 심볼을 획득할 수 있다.Specifically, there are measured values constituting time series data within the section. In this case, the controller 120 may calculate an average value of the measured values within the section, and obtain a main symbol corresponding to the average value.

즉 메인 심볼은 각 구간의 측정 값들의 레벨을 대표하는 심볼일 수 있다. 예를 들어 제1 구간(610)의 메인 심볼은 제1 구간의 측정 값들의 레벨을 대표하는 E 심볼일 수 있다. 다른 예를 들어 제2 구간(620)의 메인 심볼은 제2 구간의 측정 값들의 레벨을 대표하는 E 심볼일 수 있다.That is, the main symbol may be a symbol representing the level of the measurement values of each section. For example, the main symbol of the first section 610 may be an E symbol representing the level of the measurement values of the first section. For another example, the main symbol of the second section 620 may be an E symbol representing the level of the measurement values of the second section.
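The main symbol assignment can be sketched as follows. The sketch assumes normalized measured values and reuses the boundary values that the description gives for the comparison table of FIG. 8; the function name `main_symbol` and the choice to place a mean that lies exactly on a boundary in the upper symbol are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import mean

# Interior boundary values between the six main symbols (from the FIG. 8 description).
BOUNDARIES = [-0.967, -0.431, 0.000, 0.431, 0.967]
MAIN_SYMBOLS = "ABCDEF"

def main_symbol(section):
    """Return the main symbol for the mean level of one section."""
    level = mean(section)
    # bisect_right counts the boundaries at or below the level,
    # which is the index of the interval containing it.
    return MAIN_SYMBOLS[bisect_right(BOUNDARIES, level)]

print(main_symbol([0.5, 0.7, 0.6]))   # mean lies between 0.431 and 0.967
print(main_symbol([-2.0, -1.5]))      # mean lies below -0.967
```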

Meanwhile, the controller 120 may obtain, in the section, one or more pattern symbols representing a structural pattern of the time series data (S550).

도 7a를 참고하면, 패턴 심볼은, 증가 심볼(u), 감소 심볼(d), 유지 심볼(s) 및 노이즈 심볼(n)을 포함할 수 있다.Referring to FIG. 7A , the pattern symbol may include an increase symbol (u), a decrease symbol (d), a maintenance symbol (s), and a noise symbol (n).

여기서 증가 심볼(u)은 측정 값의 레벨이 증가하는, 즉 측정 값의 기울기가 0보다 크다는 것을 나타낼 수 있다.Here, the increase symbol u may indicate that the level of the measurement value increases, that is, the slope of the measurement value is greater than zero.

그리고 감소 심볼(d)은 측정 값의 레벨이 감소하는, 즉 측정 값의 기울기가 0보다 작다는 것을 나타낼 수 있다.In addition, the decrease symbol d may indicate that the level of the measured value is decreasing, that is, the slope of the measured value is less than zero.

또한 유지 심볼(s)은 측정 값의 레벨에 큰 변화가 없는, 즉 측정 값이 임계치 내에서 유지되는 것을 나타낼 수 있다.In addition, the holding symbol s may indicate that there is no significant change in the level of the measured value, that is, the measured value is maintained within a threshold.

또한 노이즈 심볼(n)은 잡음과 같이, 측정 값의 레벨의 변동성이 큰 상태를 나타낼 수 있다. In addition, the noise symbol n may represent a state in which the level of the measurement value is highly variable, such as noise.

한편 하나의 구간에서 하나 이상의 패턴 심볼이 획득될 수 있다. 이하에서는 하나의 구간에서 두개의 패턴 심볼이 획득되는 것으로 설명하나 이에 한정되지 않으며, 하나의 구간에서는 하나 또는 셋 이상의 패턴 심볼이 획득될 수도 있다.Meanwhile, one or more pattern symbols may be obtained in one section. Hereinafter, it will be described that two pattern symbols are obtained in one section, but the present invention is not limited thereto, and one or three or more pattern symbols may be obtained in one section.

다음은 패턴 심볼을 획득하는 순서에 대하여 설명한다.The following describes a procedure for obtaining a pattern symbol.

먼저 제어부(120)는 패턴 심볼이 노이즈 심볼에 해당하는지 판단할 수 있다.First, the controller 120 may determine whether a pattern symbol corresponds to a noise symbol.

구체적으로 제어부(120)는 이전 시점의 측정 값과 현 시점의 측정 값의 차이들을 이용하여 표준 편차를 산출할 수 있다.In more detail, the controller 120 may calculate the standard deviation by using the differences between the measured value at the previous time and the measured value at the current time.

예를 들어 도 7b에서는 제1 내지 제10 시점의 측정 값(1, 9, 2, 8, 3, 7, 4, 6, 5, 10)이 도시되어 있다. For example, in FIG. 7B , measured values 1, 9, 2, 8, 3, 7, 4, 6, 5, and 10 at first to tenth time points are shown.

이 경우 제어부(120)는 이전 시점의 측정 값과 현 시점의 측정 값의 차이들(9-1=8, 2-9=-7, 8-2=6, 3-8=-5, 7-3=4, 4-7=-3, 6-4=2, 5-6=-1, 10-5=5)을 산출할 수 있다.In this case, the controller 120 determines the differences between the measured value of the previous time and the measured value of the current time (9-1=8, 2-9=-7, 8-2=6, 3-8=-5, 7- 3=4, 4-7=-3, 6-4=2, 5-6=-1, 10-5=5) can be calculated.

그리고 나서 제어부(120)는 전 시점의 측정 값과 현 시점의 측정 값의 차이들에 대한 표준 편차를 산출할 수 있다. 즉 도 6을 참고하면, 잡음(670 구간 참고)의 경우 구간 내에서 위아래로 변동성이 심한 경우로, 각 시점의 측정 값의 증감 분에 대한 표준 편차를 산출함으로써 변동성의 정도를 판단할 수 있다.Then, the controller 120 may calculate a standard deviation for differences between the measured value at the previous time and the measured value at the current time. That is, referring to FIG. 6 , in the case of noise (refer to section 670), fluctuations are severe up and down within the section, and the degree of volatility can be determined by calculating the standard deviation for the increase or decrease of the measured value at each time point.

한편 제어부(120)는 산출된 표준 편차가 노이즈 임계치보다 큰지(또는 크거나 같은지) 판단할 수 있다. 그리고 산출된 표준 편차가 노이즈 임계치보다 크면, 제어부(120)는 패턴 심볼이 노이즈 심볼인 것으로 결정할 수 있다.Meanwhile, the controller 120 may determine whether the calculated standard deviation is greater than (or greater than or equal to) a noise threshold. And if the calculated standard deviation is greater than the noise threshold, the controller 120 may determine that the pattern symbol is a noise symbol.
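The noise check can be sketched with the ten measured values quoted for FIG. 7b. The description does not state whether a population or sample standard deviation is used, nor does it give a threshold; the population form (`statistics.pstdev`) and the threshold value 3.0 below are assumptions for illustration, as are the function names.

```python
from statistics import pstdev

def diff_std(values):
    """Standard deviation of the differences between consecutive measured values."""
    diffs = [cur - prev for prev, cur in zip(values, values[1:])]
    return pstdev(diffs)  # population standard deviation (assumed)

def is_noise(values, noise_threshold):
    """True when the variability of the increments exceeds the noise threshold."""
    return diff_std(values) > noise_threshold

values = [1, 9, 2, 8, 3, 7, 4, 6, 5, 10]   # measured values from FIG. 7b
print(round(diff_std(values), 3))           # about 4.944
print(is_noise(values, noise_threshold=3.0))  # hypothetical threshold
```

A steadily rising series such as 1, 2, 3, 4, 5 has identical increments, so its standard deviation is 0 and it is not treated as noise regardless of the threshold chosen.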

Meanwhile, when two pattern symbols are assigned to one section, the controller 120 may divide the section into two sub-sections of equal size and determine whether each of the two sub-sections corresponds to a noise symbol.

한편 표준 편차가 노이즈 임계치보다 작은 경우(또는 작거나 같은 경우), 제어부(120)는 패턴 심볼이 증가 심볼, 감소 심볼 및 유지 심볼 중 어느 심볼에 해당하는지 판단할 수 있다.Meanwhile, when the standard deviation is less than (or less than or equal to) the noise threshold, the controller 120 may determine which symbol the pattern symbol corresponds to among an increase symbol, a decrease symbol, and a maintenance symbol.

구체적으로, 표준 편차가 노이즈 임계치 보다 작으면, 제어부(120)는 시간 구간 내 세부 구간들의 평균 값들을 산출할 수 있다.Specifically, when the standard deviation is smaller than the noise threshold, the controller 120 may calculate average values of detailed sections within the time section.

Specifically, the controller 120 may divide the first section into a plurality of sub-sections of equal size. For example, referring to the enlarged view 610-1 of the first section in FIG. 6, when two pattern symbols are assigned to one section, the controller 120 may divide the first section into three sub-sections of equal size, so that two slopes between adjacent sub-sections are available, and calculate the average of the measured values in each sub-section.

그리고 제어부(120)는 세부 구간들에서의 평균 값들 간의 기울기를 이용하여 증가 심볼, 감소 심볼 및 유지 심볼 중 어느 하나를 결정할 수 있다.In addition, the controller 120 may determine any one of an increase symbol, a decrease symbol, and a maintenance symbol by using the slope between average values in the detailed sections.

예를 들어 도 6의 제1 구간의 확대 영상(610-1)을 참고하면, 제어부(120)는 제1 세부 구간에서의 제1 평균 값(611) 및 제2 세부 구간에서의 제2 평균 값(612)을 이용하여 제1 세부 구간 및 제2 세부 구간 간의 기울기를 획득할 수 있다. For example, referring to the enlarged image 610 - 1 of the first section of FIG. 6 , the controller 120 controls the first average value 611 in the first detailed section and the second average value in the second detailed section. A gradient between the first subsection and the second subsection may be obtained using 612 .

이 경우 제어부(120)는 제1 세부 구간 및 제2 세부 구간 간의 기울기의 절대 값이 유지 임계치보다 작은지(또는 작거나 같은지) 판단할 수 있다. 또한 제1 세부 구간 및 제2 세부 구간 간의 기울기의 절대 값이 유지 임계치보다 작으면(또는 작거나 같으면), 제어부(120)는 제1 세부 구간 및 제2 세부 구간 간의 패턴 심볼이 유지 심볼인 것으로 결정할 수 있다.In this case, the controller 120 may determine whether the absolute value of the slope between the first subsection and the second subsection is less than (or less than or equal to) the maintenance threshold. In addition, if the absolute value of the slope between the first subsection and the second subsection is less than (or less than or equal to) the maintenance threshold, the controller 120 determines that the pattern symbol between the first subsection and the second subsection is the maintenance symbol. can decide

한편 제1 세부 구간 및 제2 세부 구간 간의 기울기의 절대 값이 유지 임계치보다 큰 경우, 제어부(120)는 제1 세부 구간 및 제2 세부 구간 간의 기울기가 0보다 큰지 판단할 수 있다. Meanwhile, when the absolute value of the slope between the first subsection and the second subsection is greater than the maintenance threshold, the controller 120 may determine whether the slope between the first subsection and the second subsection is greater than zero.

그리고 제1 세부 구간 및 제2 세부 구간 간의 기울기가 0보다 크면, 제어부(120)는 제1 세부 구간 및 제2 세부 구간 간의 패턴 심볼이 증가 심볼(u)인 것으로 결정할 수 있다.And if the slope between the first subsection and the second subsection is greater than 0, the controller 120 may determine that the pattern symbol between the first subsection and the second subsection is the increment symbol u.

반대로 제1 세부 구간 및 제2 세부 구간 간의 기울기가 0보다 작으면, 제어부(120)는 제1 세부 구간 및 제2 세부 구간 간의 패턴 심볼이 감소 심볼(d)인 것으로 결정할 수 있다.Conversely, if the slope between the first subsection and the second subsection is less than 0, the controller 120 may determine that the pattern symbol between the first subsection and the second subsection is the reduction symbol d.

한편 동일한 과정을 통하여, 제2 세부 구간 및 제3 세부 구간 간의 패턴 심볼이 획득될 수 있다.Meanwhile, through the same process, a pattern symbol between the second subsection and the third subsection may be obtained.

예를 들어 도 6의 제1 구간의 확대 영상(610-1)을 참고하면, 제어부(120)는 제2 세부 구간에서의 제2 평균 값(612) 및 제3 세부 구간에서의 제3 평균 값(613)을 이용하여 제2 세부 구간 및 제3 세부 구간 간의 기울기를 획득할 수 있다. For example, referring to the enlarged image 610 - 1 of the first section of FIG. 6 , the controller 120 controls the second average value 612 in the second sub section and the third average value in the third sub section A gradient between the second subsection and the third subsection may be obtained using 613 .

In this case, if the absolute value of the slope between the second sub-section and the third sub-section is less than (or less than or equal to) the maintenance threshold, the controller 120 may determine that the pattern symbol between the second sub-section and the third sub-section is the maintenance symbol.

한편 제2 세부 구간 및 제3 세부 구간 간의 기울기의 절대 값이 유지 임계치보다 큰 경우, 제어부(120)는 제2 세부 구간 및 제3 세부 구간 간의 기울기가 0보다 큰지 판단할 수 있다. Meanwhile, when the absolute value of the slope between the second subsection and the third subsection is greater than the maintenance threshold, the controller 120 may determine whether the slope between the second subsection and the third subsection is greater than zero.

그리고 제2 세부 구간 및 제3 세부 구간 간의 기울기가 0보다 작으면, 제어부(120)는 제2 세부 구간 및 제3 세부 구간 간의 패턴 심볼이 감소 심볼(d)인 것으로 결정할 수 있다.And when the slope between the second subsection and the third subsection is less than 0, the controller 120 may determine that the pattern symbol between the second subsection and the third subsection is the reduction symbol (d).
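The decision between the maintenance, increase, and decrease symbols can be sketched as below for a section that has already passed the noise check. The function names, the maintenance threshold value, and the use of the plain difference of adjacent sub-section averages as the slope (the sub-sections are equal in size, so the time step is taken as 1) are assumptions for illustration. The "less than or equal to" variant of the threshold comparison is used.

```python
from statistics import mean

def slope_symbol(first_mean, second_mean, hold_threshold):
    """Classify the slope between two adjacent sub-section averages as s, u, or d."""
    slope = second_mean - first_mean
    if abs(slope) <= hold_threshold:
        return "s"                      # level is held within the threshold
    return "u" if slope > 0 else "d"    # rising or falling

def trend_symbols(section, hold_threshold):
    """Split a section into three equal sub-sections and return two pattern symbols."""
    n = len(section) // 3
    means = [mean(section[i * n:(i + 1) * n]) for i in range(3)]
    first = slope_symbol(means[0], means[1], hold_threshold)
    second = slope_symbol(means[1], means[2], hold_threshold)
    return first + second

print(trend_symbols([1, 1, 3, 3, 1, 1], hold_threshold=0.5))  # rises, then falls
```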

그리고 제어부(120)는, 복수의 구간(610, 620, 630, 640, 650, 660, 670, 680)에서, 복수의 구간 별 패턴 심볼을 획득할 수 있다.In addition, the controller 120 may acquire pattern symbols for each of the plurality of sections in the plurality of sections 610 , 620 , 630 , 640 , 650 , 660 , 670 , and 680 .

That is, the controller 120 may obtain a 1-1 pattern symbol and a 1-2 pattern symbol for the first section 610. In this case, following the order described above (first determining whether the 1-1 pattern symbol corresponds to a noise symbol, and then determining whether it corresponds to a maintenance symbol, an increase symbol, or a decrease symbol), the controller 120 may determine which of the increase symbol, the decrease symbol, the maintenance symbol, and the noise symbol the 1-1 pattern symbol corresponds to. Then, in the same manner, the controller 120 may determine which of these symbols the 1-2 pattern symbol corresponds to.

같은 방식으로, 제어부(120)는 제2 구간(620)에 대해서, 제2-1 패턴 심볼 및 제2-2 패턴 심볼을 획득할 수 있으며, 나머지 구간(630, 640, 650, 660, 670, 680)에 대해서도 동일한 과정을 반복함으로써 각 구간에 대응하는 패턴 심볼을 획득할 수 있다.In the same way, the control unit 120 may obtain a 2-1 pattern symbol and a 2-2 pattern symbol with respect to the second section 620, and the remaining sections 630, 640, 650, 660, 670, 680), by repeating the same process, it is possible to obtain a pattern symbol corresponding to each section.

Meanwhile, the main symbol may be combined with the pattern symbols. For example, the main symbol (E) of the first section 610 may be combined with the 1-1 pattern symbol (u) and the 1-2 pattern symbol (d) to generate a first combined symbol (Eud). As another example, the main symbol (E) of the seventh section 670 may be combined with the 7-1 pattern symbol (n) and the 7-2 pattern symbol (n) to generate a seventh combined symbol (Enn).

FIG. 8 is a diagram for explaining a comparison table of main symbols and a comparison table of pattern symbols according to the present invention.

시계열 데이터 간의 유사도 측정 장치(100)의 메모리(140)에는 메인 심볼의 비교 테이블(810) 및 패턴 심볼의 비교 테이블(820)이 저장될 수 있다.A comparison table 810 of a main symbol and a comparison table 820 of a pattern symbol may be stored in the memory 140 of the apparatus 100 for measuring the similarity between time series data.

Here, the comparison table 810 of main symbols may include information on the distances between main symbols. In this case, the comparison table of main symbols may be obtained by converting the measured values that the time series data can take into a standard normal distribution, and calculating the boundary values at which the normally distributed measured values are allocated to the main symbols (A, B, C, D, E, F) in equal proportions.

For example, referring to FIG. 8, the boundary values of main symbol A may be -3.719 and -0.967, those of main symbol B may be -0.967 and -0.431, those of main symbol C may be -0.431 and 0.000, those of main symbol D may be 0.000 and 0.431, those of main symbol E may be 0.431 and 0.967, and those of main symbol F may be 0.967 and 3.719.

이 경우 메인 심볼의 비교 테이블(810)에 포함되는 메인 심볼 간의 거리는, 특정 메인 심볼의 두 경계값의 중앙 값과 다른 메인 심볼의 두 경계 값의 중앙 값의 차이로 정의될 수 있다.In this case, the distance between the main symbols included in the comparison table 810 of the main symbols may be defined as a difference between the center value of two boundary values of a specific main symbol and the center value of two boundary values of another main symbol.

For example, when the two boundary values of main symbol A are -3.719 and -0.967, the midpoint of the two boundary values of main symbol A is -2.343. As another example, when the two boundary values of main symbol D are 0.000 and 0.431, the midpoint of the two boundary values of main symbol D is 0.2155. In this case, the distance between main symbol A and main symbol D may be defined as the difference between -2.343 and 0.2155, that is, 2.5585 (2.559 when rounded).
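The distance definition above can be sketched by computing the midpoint of each main symbol's two boundary values and taking the absolute difference of midpoints. The boundary values are those listed for FIG. 8; the names `BOUNDS`, `midpoint`, `main_distance`, `MAIN_TABLE`, and `M_D` are assumptions for illustration. With these boundaries the midpoint of A is -2.343, the midpoint of D is 0.2155, and the A to D distance is 2.5585.

```python
from itertools import combinations

# Lower and upper boundary values of each main symbol (from the FIG. 8 description).
BOUNDS = {
    "A": (-3.719, -0.967), "B": (-0.967, -0.431), "C": (-0.431, 0.000),
    "D": (0.000, 0.431),   "E": (0.431, 0.967),   "F": (0.967, 3.719),
}

def midpoint(symbol):
    """Midpoint of the two boundary values of a main symbol."""
    lo, hi = BOUNDS[symbol]
    return (lo + hi) / 2

def main_distance(a, b):
    """Distance between two main symbols: absolute difference of their midpoints."""
    return abs(midpoint(a) - midpoint(b))

# Full comparison table keyed by unordered symbol pair, and its smallest entry.
MAIN_TABLE = {frozenset(p): main_distance(*p) for p in combinations(BOUNDS, 2)}
M_D = min(MAIN_TABLE.values())

print(round(main_distance("A", "D"), 4))
print(round(M_D, 3))
```

The smallest entry of the table is the C to D distance, 0.431, which is the value the description uses as the minimum main symbol distance.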

한편 패턴 심볼의 비교 테이블(820)은, 패턴 심볼 간의 거리들에 대한 정보를 포함할 수 있다. Meanwhile, the comparison table 820 of pattern symbols may include information on distances between pattern symbols.

이 경우 패턴 심볼 간의 거리는, 메인 심볼 간의 거리들 중 최소값(예를 들어 메인 심볼 테이블(810)에 포함되는 거리들 중 최소 값인 0.431)을 이용하여 산출될 수 있다.In this case, the distance between pattern symbols may be calculated using a minimum value among distances between main symbols (eg, 0.431, which is a minimum value among distances included in the main symbol table 810 ).

즉 시계열 데이터의 비교시, 메인 심볼은 측정 값의 레벨을 비교하는 것이고 패턴 심볼은 시계열 데이터의 구조적 패턴(구조적 변동 패턴)을 비교하는 것이기 때문에, 시계열 데이터의 유사도 판단 시 메인 심볼 간의 차이의 반영 정도가 패턴 심볼 간의 차이의 반영 정도에 비해 커야 한다.That is, when comparing time series data, since the main symbol compares the level of measured values and the pattern symbol compares the structural pattern (structural variation pattern) of time series data, the degree of reflection of the difference between the main symbols when determining the similarity of time series data should be larger than the degree of reflection of the difference between pattern symbols.

Therefore, in the present invention, the distances between pattern symbols may be calculated using the minimum value among the distances between main symbols. Here, the minimum value among the distances between main symbols is defined as m_d.

Specifically, the distance (u-n, d-n, s-n) between the noise symbol (n) and each of the other symbols (u, d, s) may be defined as 3*m_d. For example, when the minimum value (m_d) among the distances between main symbols is 0.431, the distance between the noise symbol (n) and each of the other symbols (u, d, s) may be 1.293 (1.29 when rounded).

In addition, the distance (u-s, d-s) between the maintenance symbol (s) and the increase symbol (u) or the decrease symbol (d) may be defined as m_d. For example, when the minimum value (m_d) among the distances between main symbols is 0.431, the distance between the maintenance symbol (s) and the increase symbol (u) or the decrease symbol (d) may be 0.431 (0.43 when rounded).

In addition, the distance (u-d) between the increase symbol (u) and the decrease symbol (d) may be defined as 2*m_d. For example, when the minimum value (m_d) among the distances between main symbols is 0.431, the distance between the increase symbol (u) and the decrease symbol (d) may be 0.862 (0.86 when rounded).

즉 심볼들 간의 거리들 중, 노이즈 심볼(n)이 다른 심볼들(u, d, s)와 가지는 거리(u-n, d-n, s-n)가 가장 크고, 같은 심볼들 간의 거리인 0을 제외하면, 유지 심볼(s)이 증가 심볼(u) 또는 감소 심볼(d)와 가지는 거리(u-s, d-s)는 가장 작을 수 있다.That is, among the distances between symbols, the distance (u-n, d-n, s-n) of the noise symbol (n) to other symbols (u, d, s) is the largest, except for 0, which is the distance between the same symbols, maintain The distance (u-s, d-s) of the symbol (s) to the increasing symbol (u) or the decreasing symbol (d) may be the smallest.

이것은 패턴 심볼 간의 거리가, 시계열 데이터의 구조적 패턴의 차이를 나타내기 때문이다. This is because the distance between the pattern symbols indicates a difference in the structural pattern of the time series data.

즉 노이즈 심볼(n)은 잡음과 같이 변동성이 매우 큰 구조적 패턴을 나타내는 것이기 때문에, 다른 패턴 들과의 구조적 패턴 차이가 가장 클 수 있다. That is, since the noise symbol n represents a structural pattern with very high variability like noise, the structural pattern difference from other patterns may be the largest.

또한 유지 심볼(s)이 증가 심볼(u) 또는 감소 심볼(d)와 가지는 거리(u-s, d-s)는 기울기가 +인 구조적 패턴(또는 기울기가 -인 구조적 패턴)을 기울기가 0인 구조적 패턴과 비교하는 것인데 반해, 증가 심볼(u)과 감소 심볼(d) 간의 거리(u-d)는 기울기가 +인 구조적 패턴을 기울기가 -인 구조적 패턴과 비교하는 것이다. 따라서 유지 심볼이 증가 심볼 또는 감소 심볼과 가지는 거리는, 증가 심볼이 감소 심볼과 가지는 거리보다 작을 수 있다.In addition, the distance (u-s, d-s) that the maintenance symbol (s) has with the increasing symbol (u) or the decreasing symbol (d) is the difference between a structural pattern with a slope of + (or a structural pattern with a slope of -) and a structural pattern with a slope of 0 In contrast, the distance (u-d) between the increment symbol (u) and the decrement symbol (d) compares a structural pattern with a slope of + to a structural pattern with a slope of -. Accordingly, the distance between the maintenance symbol and the increment symbol or the decrement symbol may be smaller than the distance between the increment symbol and the decrement symbol.
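The pattern symbol distances described above follow directly from m_d and can be sketched as a small function. The function name and the default value of `m_d` (0.431, the example value in the description) are assumptions for illustration.

```python
def pattern_distance(a, b, m_d=0.431):
    """Distance between two pattern symbols among u, d, s, n, scaled by m_d."""
    if a == b:
        return 0.0          # identical patterns
    if "n" in (a, b):
        return 3 * m_d      # noise versus any other pattern: largest distance
    if "s" in (a, b):
        return m_d          # held level versus rising or falling: smallest non-zero distance
    return 2 * m_d          # rising versus falling

for pair in ("us", "ud", "un"):
    print(pair, round(pattern_distance(*pair), 3))
```

The ordering s to u (or d), then u to d, then n to anything matches the reasoning in the text: opposite slopes differ more than a slope and a flat level, and noise differs most from every other pattern.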

한편 제어부(120)는 메인 심볼의 비교 테이블(810) 및 패턴 심볼의 비교 테이블(820)을 이용하여 결합 심볼들 간의 거리를 획득할 수 있다.Meanwhile, the controller 120 may obtain the distance between the combined symbols by using the comparison table 810 of the main symbol and the comparison table 820 of the pattern symbol.

Specifically, the following description assumes that the first combined symbol is Csn and the second combined symbol is Aud.

The controller 120 may calculate the main symbol distance between the main symbols using the comparison table of main symbols. For example, the main symbol of the first combined symbol is C, and the main symbol of the second combined symbol is A. In this case, the controller 120 may obtain 2.128, the distance between C and A, from the comparison table 810 of main symbols.

다음으로, 제어부(120)는 패턴 심볼의 비교 테이블을 이용하여, 패턴 심볼들 간의 패턴 심볼 거리를 산출할 수 있다. Next, the controller 120 may calculate the pattern symbol distance between the pattern symbols by using the comparison table of the pattern symbols.

예를 들어 제1 결합 심볼의 제1-1 패턴 심볼은 s이고, 제2 결합 심볼의 제2-1 패턴 심볼은 u이다. 이 경우 제어부(120)는 패턴 심볼의 비교 테이블(820)로부터 s와 u의 거리인 0.43을 획득할 수 있다.For example, the 1-1 pattern symbol of the first combined symbol is s, and the 2-1 pattern symbol of the second combined symbol is u. In this case, the controller 120 may obtain 0.43, which is the distance between s and u, from the comparison table 820 of pattern symbols.

또한 제1 결합 심볼의 제1-2 패턴 심볼은 n이고, 제2 결합 심볼의 제2-2 패턴 심볼은 d이다. 이 경우 제어부(120)는 패턴 심볼의 비교 테이블(820)로부터 n와 d의 거리인 1.29를 획득할 수 있다.In addition, the 1-2th pattern symbol of the first combined symbol is n, and the 2-2th pattern symbol of the second combined symbol is d. In this case, the controller 120 may obtain 1.29, which is the distance between n and d, from the comparison table 820 of the pattern symbol.

In this case, the controller 120 may calculate the distance between the first combined symbol (Csn) and the second combined symbol (Aud). Specifically, the controller 120 may calculate this distance by summing the main symbol distance and the pattern symbol distances. For example, summing the main symbol distance (2.128) and the two pattern symbol distances (0.43 and 1.29) gives a distance of 3.848 between the first combined symbol (Csn) and the second combined symbol (Aud).
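The summation described above can be sketched in Python as follows. This is only an illustrative sketch: the table names `MAIN_TABLE` and `PATTERN_TABLE` and the helper functions are hypothetical, the tables hold only the three entries quoted in the text rather than the full comparison tables 810 and 820, and the sketch assumes that the tables are symmetric and that identical symbols have a distance of 0.

```python
# Hypothetical lookup tables holding only the entries quoted in the text.
MAIN_TABLE = {("A", "C"): 2.128}
PATTERN_TABLE = {("s", "u"): 0.43, ("d", "n"): 1.29}


def lookup(table, a, b):
    """Symmetric table lookup; identical symbols are assumed to be at distance 0."""
    if a == b:
        return 0.0
    return table[tuple(sorted((a, b)))]


def combined_distance(sym1, sym2):
    """Sum the main symbol distance and the position-wise pattern symbol distances."""
    main = lookup(MAIN_TABLE, sym1[0], sym2[0])
    patterns = sum(lookup(PATTERN_TABLE, p, q) for p, q in zip(sym1[1:], sym2[1:]))
    return main + patterns


print(round(combined_distance("Csn", "Aud"), 3))  # 3.848
```

The first character of each combined symbol is treated as the main symbol and the remaining characters as pattern symbols, which matches the Csn and Aud example.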

한편 제어부(120)는 메인 심볼 및 메인 심볼과 결합된 하나 이상의 패턴 심볼을 이용하여, 제1 시계열 데이터와 제2 시계열 데이터 간의 유사도를 산출할 수 있다(S570).Meanwhile, the controller 120 may calculate a similarity between the first time series data and the second time series data by using the main symbol and one or more pattern symbols combined with the main symbol ( S570 ).

여기서 제2 시계열 데이터는 상태의 분류 또는 이상 상태를 나타내는 데이터로써, 메모리(140)에 저장될 수 있다. 또한 제2 시계열 데이터가 메모리(140)에 저장되는 것 없이, 제2 시계열 데이터의 각 구간의 결합 심볼만이 메모리(140)에 저장될 수도 있다.Here, the second time series data may be stored in the memory 140 as data indicating a classification of a state or an abnormal state. Also, without the second time series data being stored in the memory 140 , only the combined symbol of each section of the second time series data may be stored in the memory 140 .

In this case, for each section, the controller 120 may calculate the distance between a combined symbol, which is the combination of the main symbol and the pattern symbol of the collected first time series data, and a second combined symbol, which is the combination of the second main symbol and the second pattern symbol of the second time series data.

구체적으로 제어부(120)는 각 구간에서, 메인 심볼의 비교 테이블을 이용하여 메인 심볼과 제2 메인 심볼 간의 메인 심볼 거리를 산출하고, 패턴 심볼의 비교 테이블을 이용하여 패턴 심볼과 제2 패턴 심볼 간의 패턴 심볼 거리를 산출하고, 메인 심볼 거리와 패턴 심볼 거리를 합산하여 결합 심볼과 제2 결합 심볼 간의 거리를 산출할 수 있다.Specifically, in each section, the control unit 120 calculates a main symbol distance between the main symbol and the second main symbol using the comparison table of the main symbol, and uses the comparison table of the pattern symbol to calculate the distance between the pattern symbol and the second pattern symbol. The distance between the combined symbol and the second combined symbol may be calculated by calculating the pattern symbol distance and summing the main symbol distance and the pattern symbol distance.

한편 결합 심볼은 메인 심볼에 구조적 패턴을 나타내는 패턴 심볼이 결합된 것으로써, P-SAX(Pattern-added SAX)로 명칭될 수도 있다.Meanwhile, the combined symbol is a combination of a main symbol and a pattern symbol indicating a structural pattern, and may be referred to as P-SAX (Pattern-added SAX).

한편 제어부(120)는 DTW 행렬을 이용하여 제1 시계열 데이터와 제2 시계열 데이터 간의 유사도를 산출할 수 있다.Meanwhile, the controller 120 may calculate a similarity between the first time series data and the second time series data by using the DTW matrix.

본 발명의 유사도 산출 방법을 설명하기에 앞서, 일반적인 동적 시간 워핑(Dynamic Time Warping, DTW)을 먼저 설명하도록 한다.Before describing the similarity calculation method of the present invention, general dynamic time warping (DTW) will be described first.

FIG. 9 is a diagram for explaining general dynamic time warping (DTW).

DTW 알고리즘을 통하여 DTW 행렬이 생성될 수 있다.A DTW matrix may be generated through the DTW algorithm.

제1 시계열 데이터(x)와 제2 시계열 데이터(y)를 비교한다고 가정한다.It is assumed that the first time series data (x) and the second time series data (y) are compared.

The first time series data (x) is divided into a plurality of sections, and there are a plurality of measurement values, each representing one of the sections. In the DTW matrix, the sections are represented by indices 1 to 10, and the measurement values representing them are 1, 3, 5, 7, 6, 8, 9, 10, 8, and 7.

그리고 제2 시계열 데이터(y)는 복수의 구간으로 나뉘고, 복수의 구간을 각각 대표하는 복수의 측정 값이 존재한다. DTW 행렬에서 복수의 구간은 인덱스 1 내지 인덱스 8로 표현되어 있으며, 복수의 구간을 각각 대표하는 복수의 측정 값은 1, 2, 6, 6, 7, 9, 8, 7로 표현되어 있다.In addition, the second time series data y is divided into a plurality of sections, and a plurality of measurement values each representing the plurality of sections exist. In the DTW matrix, a plurality of sections are represented by indices 1 to 8, and a plurality of measurement values representing the plurality of sections are represented by 1, 2, 6, 6, 7, 9, 8, and 7.

In a general DTW algorithm, the DTW matrix 930 is built according to Equation 1 below.

$$DTW(i, j) = d(i, j) + \min\{DTW(i-1, j-1),\ DTW(i-1, j),\ DTW(i, j-1)\} \quad \text{(Equation 1)}$$

($DTW(i, j)$: DTW value of the current index; $d(i, j)$: distance value of the current index; $\min\{\cdot\}$: minimum DTW value among the previous indices)

The distance value of the current index can be expressed as in Equation 2.

$$d(i, j) = \sqrt{(x_i - y_j)^2} = |x_i - y_j| \quad \text{(Equation 2)}$$

($d(i, j)$: Euclidean distance between the value $x_i$ of the first time series data and the value $y_j$ of the second time series data)

For example, referring to FIG. 9A, the value of the sixth index of the first time series data (x) is 8, and the value of the fifth index of the second time series data (y) is 7. In this case, by Equation 2, $d(6, 5) = |8 - 7| = 1$.

Also, referring to the DTW matrix 930, the values $DTW(5, 4) = 3$, $DTW(5, 5) = 4$, and $DTW(6, 4) = 5$ have already been calculated using Equations 1 and 2.

Therefore, $DTW(6, 5)$ is 1 + 3 (the minimum among 3, 4, and 5) and is calculated as 4. This value of 4 is entered in the cell 931 corresponding to $DTW(6, 5)$.

If this process is repeated for all indices, the DTW matrix 930 is completed, and the final DTW value can be expressed by Equation 3.

$$DTW(x, y) = DTW(M, N) \quad \text{(Equation 3)}$$

(M: data length of x, N: data length of y)

That is, the first time series data (x) has 10 indices and the second time series data (y) has 8 indices.

Therefore, the final DTW is $DTW(10, 8)$, and its value is the 5 entered in the cell 932 corresponding to $DTW(10, 8)$.

한편 도 9b에서는 제1 시계열 데이터(x) 및 제2 시계열 데이터(y)의 그래프를 도시하였다. 이와 같이 DTW 알고리즘을 통하여 서로 길이가 다른 두 시계열 데이터의 유사도가 산출될 수 있다.Meanwhile, FIG. 9B shows graphs of the first time series data (x) and the second time series data (y). As described above, the degree of similarity between two time series data having different lengths may be calculated through the DTW algorithm.
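The construction of the DTW matrix for the FIG. 9 example can be reproduced with the short Python sketch below. The function name `dtw_matrix` is hypothetical, and the sketch assumes the standard DTW recurrence with the absolute difference as the distance of each index pair; row 0 and column 0 are padding so that the remaining indices match the 1-based indices used in the text.

```python
def dtw_matrix(x, y):
    """Build the cumulative DTW matrix; D[i][j] uses 1-based indices as in the text."""
    inf = float("inf")
    m, n = len(x), len(y)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # padding cell that lets the recurrence start at (1, 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])  # distance of the current index pair
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D


x = [1, 3, 5, 7, 6, 8, 9, 10, 8, 7]  # first time series data of FIG. 9
y = [1, 2, 6, 6, 7, 9, 8, 7]         # second time series data of FIG. 9
D = dtw_matrix(x, y)
print(D[6][5])   # 4.0, the value described for cell 931
print(D[10][8])  # 5.0, the final DTW value described for cell 932
```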

이와 같이 동적 시간 워핑(Dynamic Time Warping, DTW)은 두 시계열 데이터 간의 패턴 비교를 위하여 사용되는 거리 기반 유사도 방법 중 비동기적 비교 능력이 우수한 방법이다.As described above, Dynamic Time Warping (DTW) is a method with excellent asynchronous comparison ability among distance-based similarity methods used for pattern comparison between two time series data.

다만 앞서 설명한 바와 같이, 이 방식은 상대적 비교를 통한 유사도 판단을 수행해야 함으로, 다수의 비교 데이터가 필요하다는 단점이 존재한다. However, as described above, this method has a disadvantage in that a large number of comparison data is required because similarity determination through relative comparison is required.

따라서 본 발명에서는 와핑 패스(Warping Path)를 이용하여 시계열 데이터를 새롭게 매핑하는 방식으로 유사도를 표준화한다.Therefore, in the present invention, similarity is standardized by newly mapping time series data using a warping path.

먼저 제어부(120)는 앞서 설명한 방식과 동일한 방식으로, DTW 알고리즘에 따라 DTW 행렬을 생성할 수 있다.First, the control unit 120 may generate a DTW matrix according to the DTW algorithm in the same manner as described above.

그리고 나서 제어부(120)는 DTW 행렬을 이용하여 제1 시계열 데이터와 제2 시계열 데이터의 유사도가 가장 가깝게 매핑된 최적 와핑 패스(Optimal Warping Path)를 선택할 수 있다.Then, the controller 120 may select an optimal warping path to which the similarity between the first time series data and the second time series data is most closely mapped using the DTW matrix.

Specifically, referring to FIG. 9A, the optimal warping path is indicated by arrows. Here, the optimal warping path may refer to the index path along which the two time series data are mapped most closely.

For example, referring to FIG. 9A, the index path along which the two time series data are mapped most closely is (1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 5), (7, 6), (8, 6), (9, 7), (10, 8), and accordingly the final similarity is calculated as the 5 entered in the cell 932 corresponding to $DTW(10, 8)$.

이 경우 인덱스 경로 (1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 5), (7, 6), (8, 6), (9, 7), (10, 8)는 최적 와핑 패스(Optimal Warping Path)를 구성할 수 있다.In this case the index paths (1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 5), (7, 6), (8, 6) ), (9, 7), and (10, 8) may constitute an optimal warping path.
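One common way to obtain such a path is to backtrack through the DTW matrix from the last cell to the first. The Python sketch below illustrates this for the FIG. 9 data. The function name `optimal_warping_path` is hypothetical, and the rule of preferring the diagonal step when several predecessors tie is an assumption of this sketch; the text does not state how ties are resolved.

```python
def optimal_warping_path(x, y):
    """Return (path, final DTW value); path is a list of 1-based index pairs."""
    inf = float("inf")
    m, n = len(x), len(y)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from (M, N) to (1, 1), always moving to the cheapest predecessor.
    i, j = m, n
    path = [(i, j)]
    while (i, j) != (1, 1):
        # The diagonal is listed first, so min() picks it when costs tie.
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda c: D[c[0]][c[1]])
        path.append((i, j))
    path.reverse()
    return path, D[m][n]


x = [1, 3, 5, 7, 6, 8, 9, 10, 8, 7]
y = [1, 2, 6, 6, 7, 9, 8, 7]
path, final = optimal_warping_path(x, y)
print(path)
print(final)
```

With this tie rule, the backtracked path for the FIG. 9 data coincides with the ten index pairs listed above.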

FIG. 10 is a diagram for explaining a method of calculating a normalized similarity according to the present invention.

FIG. 10C shows a general Euclidean distance calculation method. In the general Euclidean distance calculation method, the index pairs are the same-index pairs $(1, 1), (2, 2), \dots, (M, M)$.

That is, the general Euclidean distance calculation method pairs the same index of the first time series data (x) and the second time series data (y) and compares the distance. For example, referring to FIG. 10B, the warping path of the Euclidean distance calculation method is shown as 1010 in FIG. 10B.

In the present invention, however, the indices of the first time series data (x) and the indices of the second time series data (y) are mapped asynchronously based on the optimal warping path 1020.

That is, in FIG. 10A the two time series are shifted in time relative to each other, but in the optimal warping path 1020 of FIG. 10B corresponding index pairs are obtained despite that shift.

따라서 본 발명에서는 DTW 알고리즘에 기인한 연산 과정에서 도출되는 최적 와핑 패스(1020)를 이용하여 제1 시계열 데이터(x)와 제2 시계열 데이터(y)를 다시 매핑한다.Accordingly, in the present invention, the first time-series data (x) and the second time-series data (y) are re-mapped using the optimal warping path 1020 derived from the operation process due to the DTW algorithm.

구체적으로 제어부(120)는 최적 와핑 패스(1020)를 구성하는 복수의 인덱스 쌍을 이용하여 제1 시계열 데이터(x)와 제2 시계열 데이터(y)를 비동기적으로 매핑할 수 있다.In more detail, the controller 120 may asynchronously map the first time series data x and the second time series data y using a plurality of index pairs constituting the optimal warping path 1020 .

최적 와핑 패스(1020)에 위치한 복수의 인덱스 쌍을 기준으로 비동기적으로 재매핑된 제1 시계열 데이터(x)와 제2 시계열 데이터(y)는 도 10d에 도시되어 있다.The first time series data x and the second time series data y that are asynchronously remapped based on a plurality of index pairs located in the optimal warping path 1020 are illustrated in FIG. 10D .

In this case, the plurality of index pairs constituting the optimal warping path 1020, referred to as the DTW pairs, may be illustrated by the following example.

For example, the index pair (3, 4) means that one pair has been generated by mapping the third index of the first time series data (x) to the fourth index of the second time series data (y). The index pair (3, 4) may lie on the optimal warping path 1020, and the other index pairs constituting the DTW pairs, such as (1, 1), (3, 3), and (M, M), are likewise index pairs on the optimal warping path 1020.

In addition, the number of indices used to calculate the normalized statistical similarity can be expressed as the number of index pairs in the DTW pairs.

That is, according to Equation 5, the number of indices of the first time series data (x) before mapping is M, and the number of indices of the second time series data (y) before mapping is also M. If the Euclidean warping path 1010 of FIG. 10B were used, the number of index pairs would also be M. In the present invention, however, the optimal warping path 1020 is selected, so the number of indices used to calculate the normalized statistical similarity is the number of index pairs in the DTW pairs.

That is, the number of indices used to calculate the normalized statistical similarity may be greater than or equal to the number of indices of the first time series data and greater than or equal to the number of indices of the second time series data.

한편 제어부(120)는 비동기적으로 매핑된 제1 시계열 데이터의 값 및 제2 시계열 데이터의 값을 비교하여 정규화된 통계적 유사도를 산출할 수 있다.Meanwhile, the controller 120 may calculate a normalized statistical similarity by comparing the asynchronously mapped values of the first time series data and the values of the second time series data.

여기서 정규화된 통계적 유사도란, 제1 시계열 데이터 및 제2 시계열 데이터의 상관 관계를 정규화된 수치로 나타낸 값을 의미할 수 있다. 예를 들어 앞서 설명한 피어슨 상관관계(Pearson’s Correlation) 등이 정규화된 통계적 유사도를 산출하는데 사용될 수 있다.Here, the normalized statistical similarity may mean a value representing a correlation between the first time series data and the second time series data as a normalized numerical value. For example, the Pearson's correlation described above may be used to calculate the normalized statistical similarity.

Meanwhile, the method of calculating the normalized statistical similarity by comparing the asynchronously mapped values of the first time series data and the second time series data can be expressed by Equation 7.

Specifically, the controller 120 may calculate the average ($\bar{x}$) of the values of the indices constituting the first time series data and the average ($\bar{y}$) of the values of the indices constituting the second time series data.

In addition, the controller 120 may substitute into Equation 7 the value ($x_i$) of the index of the first time series data and the value ($y_j$) of the index of the second time series data within a specific index pair constituting the DTW pairs. By repeating this for every index pair in the DTW pairs, the controller 120 may calculate the normalized statistical similarity of the asynchronously mapped first time series data and second time series data.

그리고 이와 같은 과정을 통하여 -1 내지 1 범위로 정규화된, DTW 기반의 상관 관계가 도출될 수 있다.And through this process, a DTW-based correlation normalized to a range of -1 to 1 may be derived.
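Equation 7 itself is not reproduced in this text. Assuming it is Pearson's correlation evaluated over the DTW pairs, as the surrounding description suggests, it would take roughly the following form. The symbols $r_{DTW}$ and $P$ (the set of index pairs on the optimal warping path) are notation introduced here for illustration only.

$$r_{DTW} = \frac{\sum_{(i, j) \in P} (x_i - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{(i, j) \in P} (x_i - \bar{x})^2}\ \sqrt{\sum_{(i, j) \in P} (y_j - \bar{y})^2}}$$

By the Cauchy-Schwarz inequality, a ratio of this form always lies between -1 and 1, which is consistent with the normalized range stated above.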

FIG. 11 is an example of calculating the normalized statistical similarity of the first time series data and the second time series data that are asynchronously mapped based on DTW.

매핑 전 그래프(1110)를 참고하면, 제1 시계열 데이터(x) 및 제2 시계열 데이터(y) 간의 시계열은 서로 어긋나있는 상태이다. 그리고 제1 시계열 데이터(x) 및 제2 시계열 데이터(y)의 인덱스의 개수(M)은 10이다.Referring to the graph 1110 before mapping, the time series between the first time series data (x) and the second time series data (y) is out of alignment with each other. And the number (M) of the indices of the first time series data (x) and the second time series data (y) is 10.

먼저 제어부(120)는 앞서 설명한 방식과 동일한 방식으로, DTW 알고리즘에 따라 DTW 행렬(1120)을 생성할 수 있다.First, the controller 120 may generate the DTW matrix 1120 according to the DTW algorithm in the same manner as described above.

그리고 나서 제어부(120)는 DTW 행렬(1120)을 이용하여 제1 시계열 데이터와 제2 시계열 데이터의 유사도가 가장 가깝게 매핑된 최적 와핑 패스(Optimal Warping Path)를 선택할 수 있다.Then, the controller 120 may select an optimal warping path to which the similarity between the first time series data and the second time series data is most closely mapped using the DTW matrix 1120 .

In this case, as shown in Table 1 (1130), the plurality of index pairs constituting the optimal warping path may be (1,1), (2,2), (3,3), (4,3), (5,4), (6,5), (7,6), (8,6), (9,7), (10,8), (10,9), and (10,10). That is, the number of index pairs in the DTW pairs may be 12.

이 경우 제어부(120)는 최적 와핑 패스를 구성하는 복수의 인덱스 쌍을 이용하여 제1 시계열 데이터(x) 및 제2 시계열 데이터(y)를 비동기적으로 매핑할 수 있다.In this case, the controller 120 may asynchronously map the first time series data (x) and the second time series data (y) using a plurality of index pairs constituting the optimal warping path.

매핑 후 그래프(1140)를 참고하면, 제1 시계열 데이터(x) 및 제2 시계열 데이터(y)가 비동기적으로 매핑되어 있는 상태이다.Referring to the graph 1140 after mapping, the first time series data (x) and the second time series data (y) are asynchronously mapped.

한편 제어부(120)는 최적 와핑 패스를 구성하는 복수의 인덱스 쌍 각각에 대응하는 값을 획득할 수 있다.Meanwhile, the controller 120 may obtain a value corresponding to each of a plurality of index pairs constituting an optimal warping path.

For example, referring to Table 1 (1130) and Table 2 (1150), the first index pair (pairs NO. 1) constituting the optimal warping path consists of the first index of the first time series data (x) and the first index of the second time series data (y). The value of the first index of the first time series data (x) before mapping is 1, and the value of the first index of the second time series data (y) before mapping is 1. Accordingly, the value of the first time series data (x) corresponding to the first index pair is 1, and the value of the second time series data (y) corresponding to the first index pair is 1.

As another example, the ninth index pair (pairs NO. 9) constituting the optimal warping path consists of the ninth index of the first time series data (x) and the seventh index of the second time series data (y). The value of the ninth index of the first time series data (x) before mapping is 8, and the value of the seventh index of the second time series data (y) before mapping is 8. Accordingly, the value of the first time series data (x) corresponding to the ninth index pair is 8, and the value of the second time series data (y) corresponding to the ninth index pair is 8.

As a further example, the eleventh index pair (pairs NO. 11) constituting the optimal warping path consists of the tenth index of the first time series data (x) and the ninth index of the second time series data (y). The value of the tenth index of the first time series data (x) before mapping is 7, and the value of the ninth index of the second time series data (y) before mapping is 6. Accordingly, the value of the first time series data (x) corresponding to the eleventh index pair is 7, and the value of the second time series data (y) corresponding to the eleventh index pair is 6.

이 경우 제어부(120)는 비동기적으로 매핑된 제1 시계열 데이터의 값 및 제2 시계열 데이터의 값을 이용하여 정규화된 통계적 유사도를 산출할 수 있다.In this case, the controller 120 may calculate the normalized statistical similarity by using the asynchronously mapped values of the first time series data and the second time series data.

구체적으로 제어부(120)는, 최적 와핑 패스를 구성하는 복수의 인덱스 쌍 각각에 대응하는 값(제1 시계열 데이터의 값 및 제2 시계열 데이터의 값)을 수학식 7에 대입하여, 정규화된 통계적 유사도를 산출할 수 있다.Specifically, the controller 120 substitutes the values (the value of the first time series data and the value of the second time series data) corresponding to each of the plurality of index pairs constituting the optimal warping path into Equation 7, and the normalized statistical similarity degree can be calculated.
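The substitution step can be sketched in Python as follows. Because the complete second time series of FIG. 11 is not listed in this text, the sketch uses the FIG. 9 series and the ten index pairs given for them instead. The function names are hypothetical, and the sketch assumes that Equation 7 is Pearson's correlation with the averages taken over the remapped values; the text leaves open whether the averages are taken before or after remapping.

```python
from math import sqrt


def pearson(a, b):
    """Plain Pearson correlation of two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def dtw_correlation(x, y, pairs):
    """Remap x and y along the 1-based index pairs, then correlate the remapped values."""
    xs = [x[i - 1] for i, _ in pairs]
    ys = [y[j - 1] for _, j in pairs]
    return pearson(xs, ys)


x = [1, 3, 5, 7, 6, 8, 9, 10, 8, 7]
y = [1, 2, 6, 6, 7, 9, 8, 7]
pairs = [(1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 5),
         (7, 6), (8, 6), (9, 7), (10, 8)]
r = dtw_correlation(x, y, pairs)
print(round(r, 4))
```

Since the two series have different lengths, a same-index comparison is not even defined for this data, whereas the remapped comparison is.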

In the example of FIG. 11, the normalized statistical similarity of the first time series data and the second time series data asynchronously mapped based on DTW is calculated as 0.94838.

즉 정규화된 통계적 유사도가 산출되기 때문에 1에 가까울수록 두 시계열 데이터가 매우 유사하다는 것을 의미한다. That is, since normalized statistical similarity is calculated, the closer to 1, the more similar the two time series data.

That is, a value calculated by a conventional asynchronous distance-based similarity measurement method is only a relative value, so a large amount of comparison data was needed to judge what degree of similarity the value represents. In the present invention, by contrast, calculating the normalized statistical similarity yields an absolute value indicating the degree of similarity.

한편 도 11의 예시에서, 비동기적인 매핑 과정 없이(즉 도 10c의 방식으로) 산출된 통계적 유사도는 0.7619로 계산된다. 이는 제1 시계열 데이터와 제2 시계열 데이터 간에 양의 상관 관계가 있다는 것만을 판단할 수 있을 뿐, 두 데이터가 매우 유사하다고 판단하기는 힘들다.Meanwhile, in the example of FIG. 11 , the statistical similarity calculated without the asynchronous mapping process (ie, the method of FIG. 10C ) is calculated to be 0.7619. This can only determine that there is a positive correlation between the first time series data and the second time series data, and it is difficult to determine that the two data are very similar.

즉 비동기적인 매핑 과정 없이(즉 도 10c의 방식으로), 제1 시계열 데이터(x)와 제2 시계열 데이터(y)의 동일 인덱스를 하나의 쌍을 지어 거리를 비교하는 경우에는 그 정확도가 현격하게 떨어지게 된다. 다만 본 발명에서는 최적 와핑 패스를 이용한 비동기적인 매핑 과정을 거쳐 인덱스 쌍에 대응하는 값들을 산출하고, 이를 이용하여 정규화된 통계적 유사도를 산출하기 때문에, 데이터의 구조적 패턴의 유사도를 매우 정확하게 판단할 수 있는 장점이 있다.That is, if the distance is compared by pairing the same index of the first time series data (x) and the second time series data (y) without an asynchronous mapping process (that is, in the method of FIG. 10C ), the accuracy is significantly improved. will fall However, in the present invention, values corresponding to index pairs are calculated through an asynchronous mapping process using an optimal warping path, and normalized statistical similarity is calculated using this. There are advantages.

다음은 앞서 설명한 심볼을 이용하여 시계열 데이터 간의 정규화된 통계적 유사도를 산출하는 방법을 설명한다.Next, a method of calculating the normalized statistical similarity between time series data using the symbols described above will be described.

여기에도 도 9 내지 도 11의 설명은 적용될 수 있으며, 아래에서는 다른 점 위주로 설명하도록 한다.The descriptions of FIGS. 9 to 11 may be applied here as well, and below, different points will be mainly described.

제어부(120)는 제1 시계열 데이터의 각 구간의 결합 심볼과 제2 시계열 데이터의 각 구간의 결합 심볼 간의 거리를 이용하여 DTW 행렬을 생성할 수 있다.The controller 120 may generate a DTW matrix by using a distance between a combined symbol of each section of the first time series data and a combined symbol of each section of the second time series data.

구체적으로 도 9를 다시 참고하면, 제1 시계열 데이터(x)는 복수의 구간으로 나뉘고, 이 경우 복수의 구간을 각각 대표하는 복수의 측정 값은 제1 시계열의 결합 심볼이 될 수 있다.Specifically, referring back to FIG. 9 , the first time series data x is divided into a plurality of sections, and in this case, a plurality of measurement values representing the plurality of sections may be combined symbols of the first time series.

또한 제2 시계열 데이터(y)는 복수의 구간으로 나뉘고, 이 경우 복수의 구간을 각각 대표하는 복수의 측정 값은 제2 시계열의 결합 심볼이 될 수 있다.In addition, the second time series data y is divided into a plurality of sections, and in this case, a plurality of measurement values representing the plurality of sections may be combined symbols of the second time series.

Referring again to Equation 2, the distance value of the current index may be calculated as the Euclidean distance. When the DTW matrix is generated using combined symbols, however, the distance value of the current index may be the distance between combined symbols described with reference to FIG. 8 instead of the Euclidean distance.

For example, assume that the value of the sixth index of the first time series data (x) is the symbol Csn and the value of the fifth index of the second time series data (y) is the symbol Aud. In this case, $d(6, 5)$ may be calculated as 3.848, the distance between the symbol Csn and the symbol Aud. The controller 120 may then calculate the DTW value of the current index using the distance $d(6, 5)$ of the current index and the minimum DTW value among the previous indices.

그리고 이와 같은 과정을 반복함으로써, 제어부(120)는 DTW 행렬을 생성할 수 있다.And by repeating this process, the controller 120 may generate the DTW matrix.

또한 제어부(120)는 DTW 행렬 이용하여 제1 시계열 데이터와 제2 시계열 데이터의 유사도가 가장 가깝게 매핑된 최적 와핑 패스를 선택할 수 있다.Also, the controller 120 may select an optimal warping path in which the similarity between the first time series data and the second time series data is most closely mapped using the DTW matrix.
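The symbol-based variant differs from the numeric one only in the distance function passed to the DTW recurrence, as the Python sketch below illustrates. The tables hold only the three distances quoted in the text, the two short symbol sequences are made-up toy input, and all names are hypothetical. The sketch also assumes symmetric tables and a distance of 0 between identical symbols.

```python
# Hypothetical tables with only the entries quoted in the text.
MAIN = {("A", "C"): 2.128}
PATTERN = {("s", "u"): 0.43, ("d", "n"): 1.29}


def table_distance(table, a, b):
    """Symmetric lookup; identical symbols are assumed to be at distance 0."""
    return 0.0 if a == b else table[tuple(sorted((a, b)))]


def symbol_distance(s, t):
    """Main symbol distance plus position-wise pattern symbol distances."""
    return table_distance(MAIN, s[0], t[0]) + sum(
        table_distance(PATTERN, p, q) for p, q in zip(s[1:], t[1:]))


def dtw(seq_x, seq_y, dist):
    """Final DTW value for two sequences under an arbitrary distance function."""
    inf = float("inf")
    m, n = len(seq_x), len(seq_y)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = dist(seq_x[i - 1], seq_y[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[m][n]


# Toy sequences: the same shape, shifted by one section.
x_symbols = ["Csn", "Aud", "Aud"]
y_symbols = ["Csn", "Csn", "Aud"]
print(dtw(x_symbols, y_symbols, symbol_distance))  # 0.0
```

In this toy input the warping absorbs the one-section shift, so the accumulated distance is zero even though a same-index comparison would add 3.848 at the second section.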

또한 제어부(120)는 최적 와핑 패스를 구성하는 복수의 인덱스 쌍을 이용하여 제1 시계열 데이터 및 제2 시계열 데이터를 비동기적으로 매핑 할 수 있다. Also, the controller 120 may asynchronously map the first time series data and the second time series data using a plurality of index pairs constituting the optimal warping path.

예를 들어, 최적 와핑 패스를 구성하는 제9 인덱스 쌍(pairs NO. 9)은 제1 시계열 데이터(x)의 제9 인덱스 및 제2 시계열 데이터(y)의 제7 인덱스로 구성되고, 매핑전 제1 시계열 데이터(x)의 제9 인덱스의 값은 Ass, 매핑 전 제2 시계열 데이터(y)의 제7 인덱스의 값은 Bnn이다. 이 경우 제9 인덱스 쌍(pairs NO. 9)에 대응하는 제1 시계열 데이터(x)의 값은 Ass, 제9 인덱스 쌍(pairs NO. 9)에 대응하는 제2 시계열 데이터(y)의 값은 Bnn일 수 있다.For example, the ninth index pair (pairs NO. 9) constituting the optimal warping path is composed of the ninth index of the first time series data (x) and the seventh index of the second time series data (y), and before mapping The value of the ninth index of the first time series data (x) is Ass, and the value of the seventh index of the second time series data (y) before mapping is Bnn. In this case, the value of the first time series data x corresponding to the ninth index pair (pairs NO. 9) is Ass, and the value of the second time series data (y) corresponding to the ninth index pair (pairs NO. 9) is It can be Bnn.

또한 제어부(120)는 비동기적으로 매핑된 제1 시계열 데이터의 값 및 제2 시계열 데이터의 값을 이용하여 정규화된 통계적 유사도를 산출할수 있다.Also, the controller 120 may calculate the normalized statistical similarity by using the asynchronously mapped values of the first time series data and the second time series data.

Specifically, referring to Equation 7, the controller 120 needs to calculate the average ($\bar{x}$) of the values of the indices constituting the first time series data.

In this case, the average ($\bar{x}$) of the values of the indices constituting the first time series data may be the average of the combined symbols of the indices (sections) constituting the first time series data.

구체적으로 제1 시계열 데이터를 구성하는 인덱스(구간)들의 결합 심볼들의 평균이란, 제1 시계열 데이터에 포함되는 메인 심볼들 중 최다 빈도 심볼과 패턴 심볼들 중 최다 빈도 심볼의 결합 심볼을 의미할 수 있다.Specifically, the average of the combined symbols of indices (intervals) constituting the first time series data may mean a combined symbol of the most frequent symbol among the main symbols included in the first time series data and the most frequent symbol among the pattern symbols. .

For example, when the first to ninth combined symbols exist in the first time series data, the first time series data includes 9 main symbols and 18 pattern symbols. If the main symbol that appears most often among the 9 main symbols is C and the pattern symbol that appears most often among the 18 pattern symbols is n, the average of the combined symbols of the indices (sections) constituting the first time series data may be Cnn.
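The most-frequent-symbol rule can be sketched in Python as follows. The function name `average_symbol` and the nine combined symbols are made up for illustration. How ties between equally frequent symbols are resolved is not stated in the text; this sketch keeps whichever symbol was encountered first.

```python
from collections import Counter


def average_symbol(symbols, n_patterns=2):
    """Combine the most frequent main symbol with the most frequent pattern symbol."""
    main = Counter(s[0] for s in symbols).most_common(1)[0][0]
    pattern = Counter(p for s in symbols for p in s[1:]).most_common(1)[0][0]
    # The most frequent pattern symbol fills every pattern position, as in "Cnn".
    return main + pattern * n_patterns


# Made-up example: 9 combined symbols, i.e. 9 main symbols and 18 pattern symbols.
symbols = ["Csn", "Cnn", "Aud", "Cnd", "Bnn", "Csn", "Ann", "Cun", "Bsn"]
print(average_symbol(symbols))  # Cnn
```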

같은 방식으로 제2 시계열 데이터를 구성하는 인덱스(구간)들의 결합 심볼들의 평균 역시 산출될 수 있다.In the same way, the average of the combined symbols of the indices (intervals) constituting the second time series data may also be calculated.

한편 수학식 7에서 인덱스의 값과 인덱스들의 값들의 평균의 차(

,

)는, 도 8에서 설명한 결합 심볼 간의 거리에 기반하여 산출될 수 있다.On the other hand, in Equation 7, the difference between the value of the index and the average of the values of the indexes (

,

) may be calculated based on the distance between the combined symbols described with reference to FIG. 8 .

예를 들어 제9 인덱스 쌍(pairs NO. 9)의 값들이 수학식 7에 대입된다고 가정한다. 또한 제9 인덱스 쌍(pairs NO. 9)에 대응하는 제1 시계열 데이터(x)의 값은 Ass, 제9 인덱스 쌍(pairs NO. 9)에 대응하는 제2 시계열 데이터(y)의 값은 Bnn이다. 또한 제1 시계열 데이터를 구성하는 인덱스들의 값들의 평균은 Cnn이고, 제2 시계열 데이터를 구성하는 인덱스 값들의 평균은 Bss이다.For example, it is assumed that the values of the ninth index pair (pairs NO. 9) are substituted in Equation (7). In addition, the value of the first time series data (x) corresponding to the ninth index pair (pairs NO. 9) is Ass, and the value of the second time series data (y) corresponding to the ninth index pair (pairs NO. 9) is Bnn to be. Also, an average of values of indices constituting the first time series data is Cnn, and an average of index values constituting the second time series data is Bss.

이 경우 제9 인덱스 쌍에서의 제1 시계열 데이터의 값과 제1 시계열 데이터의 평균의 차(

)는 Ass와 Cnn의 거리로써 산출될 수 있다.In this case, the difference between the value of the first time series data in the ninth index pair and the average of the first time series data (

) can be calculated as the distance between Ass and Cnn.

Likewise, the difference between the value of the second time series data at the ninth index pair and the average of the second time series data may be calculated as the distance between Bnn and Bss.

That is, according to the present invention, even when the structural features of time series data are extracted as symbols, a normalized statistical similarity between two time series data can be calculated.

The following compares the computational complexity of the conventional time-series distance-based pattern comparison method (Dynamic Time Warping, DTW) with that of the structural feature-based pattern comparison method of the present invention.

Here, N is the length of the time series, W is the number of sections, S is the number of types of main symbols, P is the number of types of pattern symbols, and O(*) denotes the notation used to compare the computational efficiency of algorithms.

In this case, the computational complexity of the conventional time-series distance-based pattern comparison method (Dynamic Time Warping) may be calculated as O(N²), whereas that of the structural feature-based pattern comparison method of the present invention may be calculated as O(N + W*log(S*2P)).

For example, when N = 288, W = 8, S = 6, and P = 4, the conventional time-series distance-based pattern comparison method (Dynamic Time Warping) yields O(288·288), that is, a computational complexity of O(82944).

In contrast, the structural feature-based pattern comparison method of the present invention yields O(288 + 8·log(6, 4)), that is, a computational complexity of about O(325).

Therefore, according to the present invention, the computational complexity can be reduced by 99.61% compared with the conventional time-series distance-based pattern comparison method (Dynamic Time Warping).

In addition, the computational complexity of normalized statistical similarity calculation based on conventional DTW may be calculated as O(N² + N), whereas that of normalized statistical similarity calculation based on the structural feature-based pattern comparison method of the present invention is O(N + W*log(S*2P) + W² + W).

Therefore, when N = 288, W = 8, S = 6, and P = 4, normalized statistical similarity calculation based on conventional DTW yields O(288·288 + 288), that is, a computational complexity of O(83232). In contrast, the structural feature-based pattern comparison method of the present invention yields O(325 + 8·8 + 8), that is, a computational complexity of about O(397). Accordingly, the computational complexity according to the present invention can be reduced by 99.52%.
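The arithmetic in the two worked examples can be checked with a short script. This is a sketch under two assumptions that the text does not state: the logarithm is natural, and its argument evaluates to 96 for S = 6 and P = 4 (S·2^P and S·P² both give 96 for these values). Under those assumptions the figures quoted above are reproduced after rounding.

```python
import math

N, W, S, P = 288, 8, 6, 4

# Assumption: the logarithm's argument is 96 for S = 6, P = 4, and the
# logarithm is natural; both are inferred from the stated result of ~325.
log_arg = 96

# Pattern comparison only.
dtw = N * N
proposed = N + W * math.log(log_arg)
reduction = (1 - proposed / dtw) * 100

# Normalized statistical similarity calculation.
dtw_norm = N * N + N
proposed_norm = proposed + W * W + W
reduction_norm = (1 - proposed_norm / dtw_norm) * 100

print(dtw, round(proposed), round(reduction, 2))
print(dtw_norm, round(proposed_norm), round(reduction_norm, 2))
```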

FIG. 12 is a diagram for explaining the effect of the present invention.

The patterns of the first time series data (S1) and the second time series data (S2) are very different.

However, when the data is expressed using SAX (Symbolic Aggregate Approximation), the symbols of both the first time series data (S1) and the second time series data (S2) are expressed as C-C-C-C-C-C-C-C. As a result, when similarity is determined by comparing symbols, the first time series data (S1) and the second time series data (S2) are wrongly judged to be very similar.

In contrast, the method of extracting and symbolizing structural patterns according to the present invention (Pattern-added SAX, P-SAX) symbolizes the first time series data (S1) as Cnn-Cnn-Cnn-Cnn-Cnn-Cnn-Cnn-Cnn and the second time series data (S2) as Css-Css-Css-Css-Css-Css-Css-Css. Accordingly, the first time series data (S1) and the second time series data (S2) are judged to have different patterns.
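The contrast between the two representations can be illustrated with the symbol sequences quoted above. This is a minimal sketch, assuming each combined symbol is a three-character string whose first character is the main symbol. The helper names sax_view and psax_view are hypothetical.

```python
# Combined-symbol sequences of S1 and S2 as quoted in the description.
s1 = ["Cnn"] * 8
s2 = ["Css"] * 8

def sax_view(combined):
    """Keep only the main symbol of each combined symbol (plain SAX)."""
    return "-".join(sym[0] for sym in combined)

def psax_view(combined):
    """Keep the main symbol together with its pattern symbols (P-SAX)."""
    return "-".join(combined)

same_under_sax = sax_view(s1) == sax_view(s2)
same_under_psax = psax_view(s1) == psax_view(s2)
print(sax_view(s1), same_under_sax, same_under_psax)
```

The main-symbol view cannot tell the two series apart, while the combined-symbol view can.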

That is, according to the present invention, the amount of computation is significantly reduced through symbolization while the distortion of information caused by symbolization is minimized, which improves the accuracy of determining the similarity of time series data.

In addition, the control unit 120 may group time series data by using the combined symbols of the time series data. That is, according to the present invention, the combined symbols of time series data may be used as a criterion for grouping time series data according to structural features.
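One simple way to realize such grouping is to use the combined-symbol sequence itself as the group key. This is only a sketch. The text does not specify the grouping rule, so exact-sequence matching, the function name group_by_combined_symbols, and the sensor names are all assumptions for illustration.

```python
from collections import defaultdict

def group_by_combined_symbols(series_symbols):
    """Group series names by their full combined-symbol sequence
    (assumed rule: identical sequences fall into the same group)."""
    groups = defaultdict(list)
    for name, symbols in series_symbols.items():
        groups[tuple(symbols)].append(name)
    return dict(groups)

# Hypothetical series names paired with combined-symbol sequences.
data = {
    "sensor_a": ["Cnn"] * 8,
    "sensor_b": ["Css"] * 8,
    "sensor_c": ["Cnn"] * 8,
}
groups = group_by_combined_symbols(data)
print(groups)
```

A distance threshold on the combined-symbol distance could replace exact matching where looser groups are wanted.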

In addition, the control unit 120 may train an artificial intelligence model by using time series data as training data. In this case, storing the training data as raw time series data requires a very large amount of memory. However, by storing the combined symbols of the time series data and using the stored combined symbols as training data for the artificial intelligence model, the control unit 120 can solve the memory shortage problem caused by retaining past time series data.
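The scale of the saving can be estimated with the example sizes used earlier (N = 288 measurements, W = 8 sections). This is a rough sketch under assumptions not stated in the text: raw measurements are stored as 64-bit floating-point values, and each combined symbol is stored as three ASCII characters.

```python
import struct

N, W = 288, 8

# Assumption: one 64-bit float per raw measurement.
raw_bytes = N * struct.calcsize("<d")

# Assumption: one ASCII byte per character of each combined symbol.
symbol_bytes = len("".join(["Cnn"] * W).encode("ascii"))

ratio = raw_bytes / symbol_bytes
print(raw_bytes, symbol_bytes, ratio)
```

The actual ratio depends on the storage format chosen for both the raw values and the symbols.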

In addition, according to the present invention, the control unit 120 may train the artificial intelligence model by providing the combined symbols of time series data as input values and state classifications as output values. When new time series data is collected, the combined symbols of the newly collected time series data are provided to the artificial intelligence model, so that state prediction based on the newly collected time series data can be performed.

The present invention described above can be implemented as computer-readable code on a medium in which a program is recorded. The computer-readable medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable medium include a Hard Disk Drive (HDD), a Solid State Disk (SSD), a Silicon Disk Drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. In addition, the computer may include a control unit. Accordingly, the above detailed description should not be construed as restrictive in any respect but should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

100: device for measuring the similarity between time series data

Claims

1. A method of measuring the similarity between time series data, the method comprising:
collecting first time series data;
dividing the first time series data into sections of a predetermined size, and obtaining a main symbol based on a measurement value of the section;
obtaining at least one pattern symbol representing a structural pattern of the first time series data in the section; and
calculating a degree of similarity between the first time series data and second time series data by using the main symbol and the at least one pattern symbol combined with the main symbol.

2. The method of claim 1,
wherein the pattern symbol includes an increase symbol, a decrease symbol, a maintenance symbol, and a noise symbol.

3. The method of claim 2,
wherein obtaining the pattern symbol comprises:
calculating a standard deviation by using differences between measurement values at a previous time point and a current time point of the first time series data; and
determining that the pattern symbol is the noise symbol if the standard deviation is greater than a noise threshold.

4. The method of claim 3,
wherein obtaining the pattern symbol further comprises:
calculating average values of subsections within the section if the standard deviation is smaller than the noise threshold; and
determining one of the increase symbol, the decrease symbol, and the maintenance symbol by using the slope between the average values.

5. The method of claim 2,
wherein the noise symbol has the largest distance from the other symbols, and
apart from the distance between identical symbols, the distance between the maintenance symbol and the increase symbol or the decrease symbol is the smallest.

6. The method of claim 1,
wherein calculating the degree of similarity between the first time series data and the second time series data comprises:
calculating a distance between a first combined symbol, which is a combination of the main symbol and the pattern symbol of the first time series data, and a second combined symbol, which is a combination of a second main symbol and a second pattern symbol of the second time series data.

7. The method of claim 6,
wherein calculating the distance between the first combined symbol and the second combined symbol comprises:
calculating a main symbol distance between the main symbol and the second main symbol by using a comparison table of main symbols;
calculating a pattern symbol distance between the pattern symbol and the second pattern symbol by using a comparison table of pattern symbols; and
calculating the distance between the first combined symbol and the second combined symbol by summing the main symbol distance and the pattern symbol distance.

8. The method of claim 6,
wherein calculating the degree of similarity between the first time series data and the second time series data comprises:
generating a DTW matrix by using the distance between the first combined symbol and the second combined symbol in each section; and
selecting, by using the DTW matrix, an optimal warping path along which the first time series data and the second time series data are mapped most similarly.

9. The method of claim 8,
wherein calculating the degree of similarity between the first time series data and the second time series data further comprises:
asynchronously mapping the first time series data and the second time series data by using a plurality of index pairs constituting the optimal warping path.

10. The method of claim 9,
wherein calculating the degree of similarity between the first time series data and the second time series data further comprises:
calculating a normalized statistical similarity indicating a correlation between the first time series data and the second time series data by using the asynchronously mapped values of the first time series data and the second time series data.

11. A device for measuring the similarity between time series data, the device comprising:
a data collection unit configured to collect first time series data; and
a control unit configured to divide the first time series data into sections of a predetermined size, obtain a main symbol based on a measurement value of the section, obtain at least one pattern symbol representing a structural pattern of the first time series data in the section, and calculate a degree of similarity between the first time series data and second time series data by using the main symbol and the at least one pattern symbol combined with the main symbol.

12. The device of claim 11,
wherein the pattern symbol includes an increase symbol, a decrease symbol, a maintenance symbol, and a noise symbol.

13. The device of claim 12,
wherein the control unit is configured to:
calculate a standard deviation by using differences between measurement values at a previous time point and a current time point of the first time series data; and
determine that the pattern symbol is the noise symbol if the standard deviation is greater than a noise threshold.

14. The device of claim 13,
wherein the control unit is configured to:
calculate average values of subsections within the section if the standard deviation is smaller than the noise threshold; and
determine one of the increase symbol, the decrease symbol, and the maintenance symbol by using the slope between the average values.

15. The device of claim 12,
wherein the noise symbol has the largest distance from the other symbols, and
apart from the distance between identical symbols, the distance between the maintenance symbol and the increase symbol or the decrease symbol is the smallest.

16. The device of claim 11,
wherein the control unit is configured to:
calculate a distance between a first combined symbol, which is a combination of the main symbol and the pattern symbol of the first time series data, and a second combined symbol, which is a combination of a second main symbol and a second pattern symbol of the second time series data.

17. The device of claim 16,
wherein the control unit is configured to:
calculate a main symbol distance between the main symbol and the second main symbol by using a comparison table of main symbols;
calculate a pattern symbol distance between the pattern symbol and the second pattern symbol by using a comparison table of pattern symbols; and
calculate the distance between the first combined symbol and the second combined symbol by summing the main symbol distance and the pattern symbol distance.

18. The device of claim 16,
wherein the control unit is configured to:
generate a DTW matrix by using the distance between the first combined symbol and the second combined symbol in each section; and
select, by using the DTW matrix, an optimal warping path along which the first time series data and the second time series data are mapped most similarly.

19. The device of claim 18,
wherein the control unit is configured to:
asynchronously map the first time series data and the second time series data by using a plurality of index pairs constituting the optimal warping path.

20. A stored computer program for performing the steps of:
collecting first time series data;
dividing the first time series data into sections of a predetermined size, and obtaining a main symbol based on a measurement value of the section;
obtaining at least one pattern symbol representing a structural pattern of the first time series data in the section; and
calculating a degree of similarity between the first time series data and second time series data by using the main symbol and the at least one pattern symbol combined with the main symbol.