KR20200052425A - Method for analyzing time series data, determining a key influence variable and apparatus supporting the same

Info

Publication number: KR20200052425A
Application number: KR1020180128528A
Authority: KR
Inventors: 이정림; 진유리
Original assignee: 삼성에스디에스 주식회사
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2020-05-15
Also published as: KR102472637B1

Abstract

Provided is a method of determining a key influence variable for a target class among a plurality of time series variables. The method comprises: acquiring, from multiple time series data related to the time series variables, a first matrix predicted as the target class; acquiring a second matrix belonging to a specific class; calculating a similarity between the two matrices after excluding the values of a first time series variable from the first matrix and the second matrix; and determining the first time series variable as the key influence variable in response to a determination that the calculated similarity satisfies a predetermined condition. A first row or first column of the first matrix may consist of measured values of the first time series variable, and a second row or second column of the first matrix may consist of measured values of a second time series variable.

Description

Method for analyzing time series data, method for determining a key influence variable, and apparatus supporting the same {METHOD FOR ANALYZING TIME SERIES DATA, DETERMINING A KEY INFLUENCE VARIABLE AND APPARATUS SUPPORTING THE SAME}

The present invention relates to a method for analyzing time series data, a method for determining a key influence variable, and an apparatus supporting these methods. More specifically, it relates to a method of analyzing multiple time series data in consideration of the correlation between a plurality of time series variables in order to improve the accuracy of data analysis, a method of determining, among a plurality of time series variables, the key influence variable that most influenced the analysis result, and an apparatus supporting these methods.

The semiconductor process is one of the most complex manufacturing processes, and most of it is automated. Anomaly detection is one of the essential technologies for operating an automated semiconductor process efficiently, because a failure in an automated process halts the entire manufacturing process and the economic loss grows rapidly.

In a semiconductor process, many sensors are used to detect abnormal conditions in real time, and these sensors generate a large amount of data in real time. Such multiple time series data consists of measured values for anywhere from tens to hundreds of time series variables (e.g., temperature, humidity).

Conventional anomaly detection methods analyze the characteristics of the time series data for each time series variable separately and predict whether the process is abnormal from the analysis result. That is, even though many time series variables are monitored, the analysis for anomaly detection is performed independently on each single time series variable. As a result, the correlation between time series variables is not reflected in the anomaly detection process, and the accuracy of anomaly detection suffers.

Some methods have been proposed that perform anomaly detection in consideration of the correlation among a small number of time series variables, but no method has yet been proposed that can consider the correlation among tens or hundreds of time series variables.

In addition, no method has yet been proposed for accurately identifying, among many time series variables, the key influence factor that most influenced the abnormal state.

Korean Patent Application Publication No. 10-2016-0026492 (published 2016-03-09)

A technical problem to be solved by the present invention is to provide a method of analyzing multiple time series data in consideration of the correlation between a plurality of time series variables in order to improve the accuracy of the analysis, and an apparatus supporting the method.

Another technical problem to be solved by the present invention is to provide a method of accurately determining, among the plurality of time series variables, the key influence variable that most influenced the analysis result, and an apparatus supporting the method.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above technical problem, a method for determining a key influence variable according to an embodiment of the present invention is a method by which a computing device determines a key influence variable for a target class among a plurality of time series variables, and may include: acquiring, from multiple time series data associated with the plurality of time series variables, a first matrix predicted as the target class; acquiring a second matrix belonging to a specific class; calculating a similarity between the two matrices after excluding the values of a first time series variable from the first matrix and the second matrix; and determining the first time series variable as the key influence variable in response to a determination that the calculated similarity satisfies a predetermined condition. Here, a first row or first column of the first matrix may consist of measured values of the first time series variable, and a second row or second column of the first matrix may consist of measured values of a second time series variable.

In one embodiment, the specific class is a class different from the target class, and the determining of the key influence variable may include determining the first time series variable as the key influence variable in response to a determination that the calculated similarity is equal to or greater than a threshold value.

In one embodiment, the specific class is the same class as the target class, and the determining of the key influence variable may include determining the first time series variable as the key influence variable in response to a determination that the calculated similarity is less than a threshold value.
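The similarity-based determination described in the preceding embodiments can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes that each row of a matrix holds one time series variable, that "excluding" a variable means dropping its row from both matrices, and that cosine similarity is used as the similarity measure. The source specifies none of these choices, and the function names are hypothetical.

```python
import math

def drop_row(matrix, row):
    """Return a copy of `matrix` without the row of the excluded variable."""
    return [list(r) for i, r in enumerate(matrix) if i != row]

def cosine_similarity(a, b):
    """Cosine similarity between two equally shaped matrices, flattened."""
    x = [v for r in a for v in r]
    y = [v for r in b for v in r]
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

def key_influence_variables(first, second, threshold, same_class):
    """Return the row indices (variables) that satisfy the condition.

    same_class=False: `second` belongs to a class other than the target class,
                      and the condition is similarity >= threshold.
    same_class=True:  `second` belongs to the target class,
                      and the condition is similarity < threshold.
    """
    selected = []
    for var in range(len(first)):
        sim = cosine_similarity(drop_row(first, var), drop_row(second, var))
        hit = sim < threshold if same_class else sim >= threshold
        if hit:
            selected.append(var)
    return selected

# Hypothetical data: rows are variables, columns are time steps.
abnormal = [[1, 1, 1, 1], [2, 2, 2, 2], [9, 0, 9, 0]]  # predicted as target class
normal = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]    # belongs to another class
print(key_influence_variables(abnormal, normal, 0.99, same_class=False))  # [2]
```

In this toy example, only removing the third variable makes the abnormal matrix look like the normal one, so only that variable is reported.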

In one embodiment, the acquiring of the first matrix may include generating the first matrix by extracting data of a preset time series section from the multiple time series data, and predicting the class of the first matrix as the target class based on an analysis result of the first matrix.

In one embodiment, the acquiring of the second matrix may include acquiring a plurality of candidate matrices belonging to the specific class, and selecting the second matrix from among the plurality of candidate matrices by applying a locality sensitive hashing (LSH) algorithm.
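One possible way to select a candidate with LSH is sketched below, under assumptions not stated in the source: that the goal is to pick the candidate most similar to the first matrix, and that random-hyperplane (sign) hashing is used as the LSH family. All names and parameters (`n_bits`, `seed`) are hypothetical.

```python
import random

def flatten(matrix):
    """Flatten a matrix (list of rows) into a single vector."""
    return [v for row in matrix for v in row]

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw `n_bits` random hyperplanes of dimension `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(matrix, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    x = flatten(matrix)
    return tuple(1 if sum(w * v for w, v in zip(p, x)) >= 0 else 0
                 for p in planes)

def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(p != q for p, q in zip(a, b))

def select_second_matrix(first, candidates, n_bits=16, seed=0):
    """Return the index of the candidate whose signature is closest to `first`."""
    planes = make_hyperplanes(len(flatten(first)), n_bits, seed)
    target = signature(first, planes)
    return min(range(len(candidates)),
               key=lambda i: hamming(target, signature(candidates[i], planes)))
```

Comparing short bit signatures instead of full matrices keeps the selection cheap when many candidate matrices are stored; a production system would typically index the signatures in hash buckets rather than scanning them.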

To solve the above technical problem, a method for determining a key influence variable according to another embodiment of the present invention is a method by which a computing device determines a key influence variable for a specific class among a plurality of time series variables, and may include: generating a first matrix by extracting data of a preset time series section from multiple time series data associated with the plurality of time series variables; inputting the first matrix to a prediction model and predicting the class of the first matrix as a first class based on a first confidence score output from the prediction model; and determining a key influence variable for the first class. A first row or first column of the first matrix consists of measured values of a first time series variable, and a second row or second column of the first matrix consists of measured values of a second time series variable. The determining of the key influence variable may include: obtaining a second confidence score by inputting the first matrix, with the values of the first time series variable excluded, to the prediction model again; and determining the first time series variable as the key influence variable of the first class in response to a determination that the second confidence score satisfies a predetermined condition.

In one embodiment, the first confidence score is a confidence score for the first class and the second confidence score is a confidence score for a second class, and the determining of the first time series variable as the key influence variable of the first class may include determining the first time series variable as the key influence variable of the first class in response to a determination that the second confidence score is equal to or greater than a threshold value.

In one embodiment, the first confidence score and the second confidence score are both confidence scores for the first class, and the determining of the first time series variable as the key influence variable of the first class may include determining the first time series variable as the key influence variable in response to a determination that the difference between the first confidence score and the second confidence score satisfies a predetermined condition.
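The confidence-score variant can be sketched as follows. This is an illustration only: `toy_model` is a hypothetical stand-in for the prediction model, "excluding" a variable is assumed to mean overwriting its row with a fill value, and the predetermined condition is assumed to be a minimum drop (`min_drop`) in the first-class confidence. The source fixes none of these details.

```python
import math

def toy_model(matrix):
    """Hypothetical model: confidence grows with the largest per-row swing."""
    swing = max(max(row) - min(row) for row in matrix)
    p_abnormal = 1.0 / (1.0 + math.exp(-(swing - 3.0)))
    return {"abnormal": p_abnormal, "normal": 1.0 - p_abnormal}

def mask_variable(matrix, row, fill=0.0):
    """Copy of `matrix` with the given variable's row overwritten by `fill`."""
    return [[fill] * len(r) if i == row else list(r)
            for i, r in enumerate(matrix)]

def key_variables_by_confidence(model, matrix, min_drop):
    """Return (first_class, variables whose masking lowers its confidence)."""
    scores = model(matrix)
    first_class = max(scores, key=scores.get)   # class predicted for the matrix
    first_conf = scores[first_class]            # first confidence score
    selected = []
    for var in range(len(matrix)):
        second_conf = model(mask_variable(matrix, var))[first_class]
        if first_conf - second_conf >= min_drop:
            selected.append(var)
    return first_class, selected

# Hypothetical data: rows are variables, columns are time steps.
window = [[1, 1, 1, 1], [2, 2, 2, 2], [9, 0, 9, 0]]
print(key_variables_by_confidence(toy_model, window, min_drop=0.5))
# ('abnormal', [2])
```

The alternative embodiment, in which the second confidence score is taken for a different class and compared against a threshold, would only change the comparison inside the loop.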

To solve the above technical problem, a time series data analysis method according to yet another embodiment of the present invention is a method by which a computing device analyzes multiple time series data associated with a prediction target using a prediction model based on a convolutional neural network, and may include: generating a first matrix by extracting data of a preset time series section from the multiple time series data; and predicting the class of the prediction target by applying the first matrix to the prediction model. Here, the multiple time series data includes measured values of a first time series variable and a second time series variable, a first row or first column of the first matrix consists of the measured values of the first time series variable for the time series section, and a second row or second column of the first matrix may consist of the measured values of the second time series variable for the time series section.

In one embodiment, the generating of the first matrix may include: arranging the measured values of the first time series variable and the second time series variable along the time series variable axis on a data plane formed by a time axis and the time series variable axis; and generating the first matrix by extracting the measured values corresponding to a sliding window on the data plane.
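The sliding-window extraction can be sketched as below, assuming that the data plane is stored with one row per time series variable and one column per time step. The parameter names `width` and `stride` are illustrative and do not come from the source.

```python
def sliding_windows(series_by_variable, width, stride=1):
    """Cut the data plane into matrices of `width` consecutive time steps.

    series_by_variable: list of equally sampled series, one per variable.
    Each returned matrix keeps one row per variable.
    """
    n_steps = min(len(s) for s in series_by_variable)
    return [[list(s[start:start + width]) for s in series_by_variable]
            for start in range(0, n_steps - width + 1, stride)]

# Hypothetical measurements for two variables over five time steps.
temperature = [20, 21, 22, 23, 24]
humidity = [40, 41, 42, 43, 44]
for matrix in sliding_windows([temperature, humidity], width=3, stride=2):
    print(matrix)
# [[20, 21, 22], [40, 41, 42]]
# [[22, 23, 24], [42, 43, 44]]
```

Because every window contains all variables over the same time section, a model that consumes these matrices can see relationships between variables rather than each series in isolation.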

In one embodiment, the prediction model is further based on a recurrent neural network, and the predicting of the class of the prediction target may include: extracting a feature map by inputting the first matrix to the convolutional neural network; and inputting the extracted feature map to the recurrent neural network and predicting the class of the prediction target based on the output of the recurrent neural network.
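The data flow of a convolutional stage followed by a recurrent stage can be traced with a deliberately small, untrained sketch in plain Python. It assumes a single convolution kernel, a simple tanh recurrent cell, and feature-map columns used as time steps; the layer sizes, the random weights, and every name below are illustrative assumptions rather than the architecture of the source.

```python
import math
import random

def conv2d_relu(matrix, kernel):
    """Valid 2-D convolution (cross-correlation) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(matrix) - kh + 1):
        row = []
        for j in range(len(matrix[0]) - kw + 1):
            acc = sum(kernel[a][b] * matrix[i + a][j + b]
                      for a in range(kh) for b in range(kw))
            row.append(max(0.0, acc))
        out.append(row)
    return out

def rnn_last_hidden(steps, w_in, w_rec):
    """Run a tanh recurrent cell over the steps; return the final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in steps:
        hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w_in[k], x)) +
                            sum(wr * h for wr, h in zip(w_rec[k], hidden)))
                  for k in range(len(w_rec))]
    return hidden

def softmax(logits):
    """Convert logits into class confidence scores that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(matrix, params):
    """Feature map from the convolution, then recurrence over its columns."""
    fmap = conv2d_relu(matrix, params["kernel"])
    steps = [list(col) for col in zip(*fmap)]  # one step per time column
    hidden = rnn_last_hidden(steps, params["w_in"], params["w_rec"])
    logits = [sum(w * h for w, h in zip(row, hidden))
              for row in params["w_out"]]
    return softmax(logits)

def random_params(n_vars, kernel_shape, n_hidden, n_classes, seed=0):
    """Untrained random weights with shapes that match `predict`."""
    rng = random.Random(seed)
    kh, kw = kernel_shape

    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    return {"kernel": rand(kh, kw),
            "w_in": rand(n_hidden, n_vars - kh + 1),
            "w_rec": rand(n_hidden, n_hidden),
            "w_out": rand(n_classes, n_hidden)}
```

Calling `predict(matrix, random_params(3, (2, 2), 4, 2))` on a matrix with three variables returns two confidence scores. With untrained weights the values carry no meaning; the sketch only shows how the matrix passes through both stages.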

FIG. 1 is a configuration diagram showing a time series data analysis system according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram of data sources and multiple time series data that may be referenced in some embodiments of the present invention.
FIGS. 3 and 4 are block diagrams showing a time series data analysis device according to an embodiment of the present invention.
FIG. 5 is a hardware configuration diagram showing a time series data analysis device according to an embodiment of the present invention.
FIG. 6 is a flowchart showing a time series data analysis method according to a first embodiment of the present invention.
FIGS. 7 to 9 are exemplary diagrams for explaining a preprocessing method according to some embodiments of the present invention.
FIGS. 10 to 12 are exemplary diagrams for explaining a method of generating a matrix based on multiple time series data according to an embodiment of the present invention.
FIG. 13 is an exemplary diagram for explaining a method for determining a key influence variable according to the first embodiment of the present invention.
FIGS. 14 and 15 are exemplary diagrams for explaining a method for determining a key influence variable more efficiently according to an embodiment of the present invention.
FIG. 16 is a flowchart showing a time series data analysis method according to a second embodiment of the present invention.
FIG. 17 is an exemplary diagram for explaining the structure of a prediction model according to an embodiment of the present invention.
FIG. 18 is an exemplary diagram for explaining a method for determining a key influence variable according to the second embodiment of the present invention.
FIG. 19 is a configuration diagram for explaining an anomaly detection system according to an application example of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims.

In assigning reference numerals to the components of each drawing, it should be noted that the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related well-known configurations or functions are omitted when they are judged likely to obscure the gist of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) may be used with the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the context specifically states otherwise.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the present invention. These terms serve only to distinguish one component from another, and they do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more other components, steps, operations, and/or elements in addition to the stated components, steps, operations, and/or elements.

Before describing this specification, some terms used in it are clarified.

In this specification, multiple time series data means data composed of measured values for two or more time series variables. In the art, the term may be used interchangeably with terms such as multidimensional time series data or multivariate time series data.

In this specification, a time series variable refers to any variable with a characteristic that can be measured or observed over time. In the art, the term variable may be used interchangeably with terms such as attribute or factor. Examples of time series variables include temperature, humidity, stock indices, and exchange rates, but the technical scope of the present invention is not limited to these examples.

In this specification, a prediction target (target of prediction) literally refers to what is to be predicted by analyzing multiple time series data. For example, when a process anomaly is predicted using multiple time series data composed of measured values such as temperature and humidity, the prediction target may be the process state. As another example, when the value of a specific asset (e.g., a company or real estate) is predicted using multiple time series data composed of observed values such as an exchange rate and a composite stock index, the prediction target may be the value of that asset. When the class of a prediction target (e.g., abnormal, normal) is predicted using the measured values of a plurality of time series variables, the time series variables may correspond to independent variables and the prediction target to the dependent variable.

In this specification, a prediction model means a model used to predict the class of a prediction target. For example, the prediction model may be a model built through machine learning, but the technical scope of the present invention is not limited thereto.

In this specification, instructions refers to a series of commands grouped by function, which are a component of a computer program and are executed by a processor.

Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a configuration diagram showing a time series data analysis system according to an embodiment of the present invention.

As shown in FIG. 1, the time series data analysis system may include at least one data source 10-1 to 10-n, a collection device 50, and a time series data analysis device 100. However, this is merely a preferred embodiment for achieving the object of the present invention, and some components may be added or removed as needed. Note also that the components of the time series data analysis system shown in FIG. 1 are functionally distinguished elements, and a plurality of components may be integrated with one another in an actual physical environment. For example, the collection device 50 and the time series data analysis device 100 may be implemented as different logic within the same physical computing device.

In addition, in an actual physical environment, each of the components may be implemented as a plurality of separate detailed functional elements. For example, a first function of the time series data analysis device 100 may be implemented in a first computing device of a computing system, and a second function may be implemented in a second computing device of the same computing system. Each component is described below.

In the time series data analysis system, the at least one data source 10-1 to 10-n is a device or repository that provides the time series data to be analyzed. For example, as shown in FIG. 2, when the data to be analyzed are measured values of temperature, humidity, and the like, the data sources 10-1 to 10-n may be the various sensors 20-1 to 20-n that provide those measured values. As another example, when the data to be analyzed are financial data such as exchange rates and stock indices, the data sources 10-1 to 10-n may be the repositories or devices that provide the financial data.

In the time series data analysis system, the collection device 50 is a device that collects multiple time series data from the at least one data source 10-1 to 10-n. For example, the collection device 50 may collect first time series data from a first data source 10-1 and second time series data from a second data source 10-2. The collection device 50 may collect the multiple time series data in any manner.

In the time series data analysis system, the time series data analysis device 100 is a computing device equipped with a function for analyzing multiple time series data. The computing device may be a notebook, a desktop, a laptop, or the like, but is not limited thereto and may be any type of device equipped with a computing function. However, in an environment where a large volume of multiple time series data is analyzed, it may be preferable to implement the time series data analysis device 100 as a high-performance server-class computing device. For convenience of description, the time series data analysis device 100 is hereinafter abbreviated as the analysis device 100.

According to an embodiment of the present invention, the analysis device 100 may provide class information on a prediction target by analyzing multiple time series data in consideration of the correlation between time series variables. For example, when the prediction target is a process state, class information on the process state (e.g., abnormal, normal) may be provided by analyzing the multiple time series data in consideration of the correlation between process variables. According to this embodiment, considering the correlation between time series variables makes it possible to provide highly reliable prediction information. This embodiment is described in detail later with reference to FIG. 3 and the subsequent drawings.

In addition, according to an embodiment of the present invention, the analysis device 100 may determine, among a plurality of time series variables, the key influence variable that most influenced the class determination. For example, when a process anomaly is predicted, the analysis device 100 may determine which of the time series variables (e.g., temperature, humidity) most influenced the process anomaly, that is, the key influence variable (or key influence factor). According to this embodiment, cause information for the prediction result (the key influence variable) is provided in addition to the prediction result itself (the class information), so the information provided is more useful and more valuable. This embodiment is also described in detail later with reference to FIG. 3 and the subsequent drawings.

At least some components of the time series data analysis system shown in FIG. 1 may communicate over a network. Here, the network may be implemented as any kind of wired or wireless network, such as a local area network (LAN), a wide area network (WAN), a mobile radio communication network, or WiBro (Wireless Broadband Internet).

A time series data analysis system according to an embodiment of the present invention has been described above with reference to FIGS. 1 and 2. The configuration and operation of the analysis device 100 are described in more detail below with reference to FIGS. 3 to 5.

FIG. 3 is a block diagram showing the analysis device 100 according to an embodiment of the present invention.

Referring to FIG. 3, the analysis device 100 may include a data collection unit 110, a preprocessing unit 120, a matrix generation unit 130, an analysis unit 140, a pattern DB 150, and a main influence variable determination unit 160. FIG. 3 shows only the components relevant to this embodiment of the present invention. A person of ordinary skill in the art to which the present invention pertains will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 3. Note also that the components of the analysis device 100 shown in FIG. 3 represent functionally distinct elements, and that a plurality of components may be implemented in integrated form in an actual physical environment.

Looking at each component, the data collection unit 110 collects multiple time series data from at least one data source (e.g., 10-1 to 100-n). Alternatively, the data collection unit 110 may collect the multiple time series data from another collection device (e.g., 50 in FIG. 1).

Next, the preprocessing unit 120 preprocesses the collected multiple time series data. To avoid duplication, the operation of the preprocessing unit 120 is described in detail below with reference to FIGS. 6 to 9.

Next, the matrix generation unit 130 generates a matrix from the preprocessed multiple time series data. Specifically, the matrix generation unit 130 may extract data for each preset time series interval from the preprocessed multiple time series data and generate a matrix based on the extracted data. The generated matrices may be stored in the pattern DB 150 by pattern. To avoid duplication, the operation of the matrix generation unit 130 is described in detail below with reference to FIG. 6 and FIGS. 10 to 12.

Next, the analysis unit 140 analyzes the generated matrix to predict the class of the prediction target. As shown in FIG. 4, the analysis unit 140 may include a first analysis unit 141 and a second analysis unit 143.

The first analysis unit 141 predicts the class of the prediction target based on the occurrence frequency of the pattern matched with the generated matrix. One matrix may match one pattern, or a plurality of matrices may match one pattern. The operation of the first analysis unit 141 is described in detail below with reference to FIGS. 6 to 12.

Next, the second analysis unit 143 applies the generated matrix to a prediction model to predict the class of the prediction target. As described above, the prediction model may be a model built through machine learning, although the scope of the present invention is not limited thereto. The operation of the second analysis unit 143 is described in detail below with reference to FIGS. 16 and 17.

Referring again to FIG. 3, the pattern DB 150 is a repository that stores the matrices generated by the matrix generation unit 130 by pattern. When a matrix is stored, the pattern DB 150 may update the occurrence frequency of the matching pattern.

Next, the main influence variable determination unit 160 determines, among the plurality of time series variables, the main influence variable that most affected the class determination. The main influence variable determination unit 160 may also determine, for each time series variable, the degree to which it affected the class determination. As shown in FIG. 4, the main influence variable determination unit 160 may include a first main influence variable determination unit 161 and a second main influence variable determination unit 163.

The first main influence variable determination unit 161 determines the main influence variable based on matrix similarity. Its operation is described in detail below with reference to FIG. 6 and FIGS. 13 to 15.

Next, the second main influence variable determination unit 163 determines the main influence variable based on the confidence score of the prediction model. Its operation is described in detail below with reference to FIGS. 16 and 18.

Meanwhile, according to another embodiment of the present invention, the analysis device 100 may be implemented with some of the components shown in FIG. 3 omitted. That is, the form shown in FIG. 3 is not the only possible configuration of the analysis device 100.

Each component shown in FIGS. 3 and 4 may be software, or hardware such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). However, the components are not limited to software or hardware; they may be configured to reside in an addressable storage medium, or configured to execute on one or more processors. The functions provided within the components may be implemented by further subdivided components, or a plurality of components may be combined into a single component that performs a specific function.

FIG. 5 is a hardware configuration diagram of the analysis device 100 according to an embodiment of the present invention.

Referring to FIG. 5, the analysis device 100 may include one or more processors 101, a bus 105, a communication interface 107, a memory 103 into which a computer program executed by the processor 101 is loaded, and a storage 109 that stores the computer program 109a. FIG. 5 shows only the components relevant to this embodiment of the present invention. A person of ordinary skill in the art to which the present invention pertains will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 5.

The processor 101 controls the overall operation of each component of the analysis device 100. The processor 101 may include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any other type of processor well known in the technical field of the present invention. The processor 101 may also perform operations for at least one application or program that executes the methods according to embodiments of the present invention. The analysis device 100 may have one or more processors.

The memory 103 stores various data, commands, and/or information. The memory 103 may load one or more programs 109a from the storage 109 in order to execute the time series data analysis method according to embodiments of the present invention. The memory 103 may be implemented as volatile memory such as RAM, although the scope of the present invention is not limited thereto. When the computer program 109a is loaded into the memory 103, the modules shown in FIG. 3 may be implemented on the memory 103 in the form of logic.

The bus 105 provides communication between the components of the analysis device 100. The bus 105 may be implemented as various types of buses, such as an address bus, a data bus, and a control bus.

The communication interface 107 supports wired and wireless Internet communication for the analysis device 100. The communication interface 107 may also support various communication methods other than Internet communication. To this end, the communication interface 107 may include a communication module well known in the technical field of the present invention.

The storage 109 may non-transitorily store the multiple time series data (not shown) and the one or more programs 109a. The storage 109 may include non-volatile memory such as ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory, a hard disk, a removable disk, or any other form of computer-readable recording medium well known in the technical field to which the present invention pertains.

The computer program 109a may include one or more instructions that, when loaded into the memory 103, cause the processor 101 to perform methods according to some embodiments of the present invention. The processor 101 can perform those methods by executing the one or more instructions.

For example, the computer program 109a may include instructions for extracting data of a preset time series interval from the multiple time series data, generating a matrix from it, and analyzing the generated matrix to predict the class of the prediction target.

The configuration and operation of the analysis device 100 according to an embodiment of the present invention have been described above with reference to FIGS. 3 to 5. A time series data analysis method according to some embodiments of the present invention is described in detail below with reference to FIG. 6 and the subsequent drawings.

Each step of the time series data analysis method may be performed by a computing device. In other words, each step may be implemented as one or more instructions executed by a processor of a computing device. All steps of the method may be executed by a single physical computing device, or first steps of the method may be performed by a first computing device and second steps by a second computing device. The description below assumes that each step is performed by the analysis device 100. For convenience, however, the subject performing each step may be omitted from the description.

First, a time series data analysis method according to a first embodiment of the present invention is described with reference to FIGS. 6 to 15.

FIG. 6 is a flowchart of the time series data analysis method according to the first embodiment of the present invention. This is only a preferred embodiment for achieving the object of the present invention, and steps may of course be added or removed as needed.

As shown in FIG. 6, the first embodiment begins with step S10, in which the analysis device 100 collects multiple time series data. As described above, multiple time series data means data composed of measured values of a plurality of time series variables. Any method may be used to collect the multiple time series data.

In step S30, the analysis device 100 preprocesses the collected multiple time series data. The specific operation of the preprocessing step S30 may vary depending on the embodiment.

In one embodiment, the analysis device 100 may perform data compression. For example, as shown in FIG. 7, the analysis device 100 may divide the time series data 210 into preset intervals (e.g., a first interval, a second interval, and so on) and compute the average value (201 to 209) of the time series data 210 for each interval. The graph at the top of FIG. 7 is the time series data 210 before compression, and the graph at the bottom is the time series data (201 to 209) after compression. In this embodiment, only the per-interval average values (201 to 209) are stored rather than the whole of the time series data 210, so storage space is used efficiently and the computing cost of analysis is reduced. In addition, averaging reduces the influence of noise, so a noise removal effect can also be achieved.
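The interval averaging described above can be illustrated with a minimal sketch. The function name `compress_by_segment_mean` and the parameter `segment_len` are hypothetical and do not appear in the specification; the sketch also assumes that a shorter trailing interval is averaged over the samples it actually contains.

```python
def compress_by_segment_mean(series, segment_len):
    """Split `series` into consecutive intervals and return each interval's mean.

    Assumption: a trailing interval shorter than `segment_len` is averaged
    over the samples it actually contains.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    means = []
    for start in range(0, len(series), segment_len):
        segment = series[start:start + segment_len]
        means.append(sum(segment) / len(segment))
    return means


# Six raw samples are reduced to three stored averages.
compressed = compress_by_segment_mean([1.0, 3.0, 2.0, 4.0, 10.0, 20.0], 2)
```

With an interval length of 2, the six samples above are reduced to three stored values, which is the source of both the storage saving and the smoothing effect.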

In one embodiment, the analysis device 100 may perform normalization. For example, as shown in FIG. 8, the analysis device 100 may use the mean and variance to normalize the multiple time series data (211-1, 213-1) to values within a certain range. Normalization may of course be performed in other ways that do not use the mean and variance. In FIG. 8, the graph at the top is the time series data (211-1, 213-1) before normalization, and the graph at the bottom is the time series data (211-2, 213-2) after normalization.
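One common way to normalize with the mean and variance is z-score normalization, sketched below. The specification does not state which formula is used, so this choice, the function name `z_normalize`, and the handling of a constant series are assumptions.

```python
import math


def z_normalize(series):
    """Return (x - mean) / std for each sample, using the population variance.

    Assumption: a constant series (zero variance) is mapped to all zeros.
    """
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


# Two variables on very different scales become directly comparable.
temperature = z_normalize([200.0, 210.0, 220.0])
humidity = z_normalize([0.2, 0.3, 0.4])
```

After this step each variable has zero mean and unit variance, so variables measured in different units contribute on a comparable scale to the matrices built later.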

In one embodiment, the analysis device 100 may perform symbolization. For example, as shown in FIG. 9, the analysis device 100 may symbolize the time series data 221 through a SAX (Symbolic Aggregate approXimation) transformation. More specifically, the analysis device 100 may segment the time series data 221 by interval through a PAA (Piecewise Aggregate Approximation) transformation, and convert the segmented time series data 223 into matching symbols (e.g., a, b, c) based on threshold values. FIG. 9 shows, as an example, the time series data 221 being converted into "baabccbc" as a result of symbolization, but symbolization may also include converting the time series data into numbers rather than letters of an alphabet.
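The threshold-based symbol assignment can be sketched as follows, assuming the PAA values have already been computed and normalized. The function name `sax_symbols` is hypothetical. The breakpoints -0.43 and 0.43 are the values commonly tabulated for a three-symbol SAX alphabet and are not taken from the specification, and the input values are invented for illustration (they are not the data of FIG. 9).

```python
import bisect


def sax_symbols(paa_values, breakpoints=(-0.43, 0.43), alphabet="abc"):
    """Map each PAA value to the symbol of the region it falls into.

    Values below the first breakpoint map to alphabet[0], values between the
    first and second breakpoints map to alphabet[1], and so on.
    """
    if len(alphabet) != len(breakpoints) + 1:
        raise ValueError("alphabet must have one more symbol than breakpoints")
    return "".join(
        alphabet[bisect.bisect_right(breakpoints, value)] for value in paa_values
    )


# Eight illustrative PAA values, one per interval.
word = sax_symbols([0.0, -1.0, -0.5, 0.1, 0.9, 1.2, 0.0, 0.8])
```

Passing digits as the alphabet (for example `alphabet="012"`) gives the numeric variant mentioned in the text.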

In one embodiment, the analysis device 100 may perform dimension reduction on the multiple time series data through principal component analysis (PCA). For example, the analysis device 100 may extract k principal component variables (where k is a natural number less than n) from n time series variables through principal component analysis. The n-dimensional data can thereby be reduced to k-dimensional data while preserving the inherent characteristics of the data as far as possible. Principal component analysis is already widely known in the art, so a detailed description is omitted. Because the dimension of the data is reduced, the computing cost of time series data analysis can be greatly reduced.

In one embodiment, the analysis device 100 may perform the preprocessing step S30 based on a combination of the embodiments described above.

Referring again to FIG. 6, in step S50 the analysis device 100 extracts, from the preprocessed multiple time series data, the data corresponding to a preset time series interval and generates a matrix having a two-dimensional data structure. The length of the time series interval may vary depending on the embodiment. Note that "two-dimensional matrix" only means that the data can be represented as a two-dimensional array, as shown in FIGS. 10 to 12; it does not mean that there are two time series variables. Depending on the embodiment, a multidimensional matrix of two or more dimensions may also be generated.

In this step S50, the analysis device 100 may generate a matrix for each time series interval. For ease of understanding, step S50 is described further with reference to FIGS. 10 to 12.

As shown in FIG. 10, the multiple time series data (231 to 236) may be arranged on a two-dimensional data plane formed by a time axis and a time series variable axis. The order in which the time series variables are arranged may vary depending on the embodiment.

In one embodiment, the arrangement order may be determined randomly.

In one embodiment, the arrangement order may be determined based on the result of a correlation analysis between time series variables. For example, the analysis device 100 may perform a correlation analysis on a first time series variable and a second time series variable and, in response to determining that a correlation exists, place the first and second time series variables adjacent to each other on the time series variable axis. If prior knowledge about the correlation is available, the arrangement order may be determined from that prior knowledge without performing a correlation analysis. In this embodiment, time series variables that the correlation analysis indicates are likely to be related are placed adjacent to one another on the data plane, so the correlation between time series variables can be better reflected in the data analysis. For example, when the data is analyzed with a convolutional neural network (CNN), local features will be extracted more effectively, and prediction accuracy can improve.

In one embodiment, the arrangement order of the time series variables may be determined by a combination of the embodiments described above. For example, the analysis device 100 may place a first plurality of correlated time series variables adjacent to one another and place a second plurality of uncorrelated time series variables randomly.
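As an illustration of correlation-based ordering, the sketch below uses a simple greedy rule: starting from the first variable, it repeatedly appends the remaining variable whose absolute Pearson correlation with the most recently placed variable is largest. The specification does not prescribe a particular ordering algorithm or correlation measure, so the greedy rule, the use of Pearson correlation, and the names `pearson` and `order_by_correlation` are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (norm_x * norm_y)


def order_by_correlation(variables):
    """Greedily order variable names so correlated variables end up adjacent.

    `variables` maps a variable name to its list of measured values.
    """
    names = list(variables)
    if not names:
        return []
    order, remaining = [names[0]], names[1:]
    while remaining:
        last = variables[order[-1]]
        best = max(remaining, key=lambda name: abs(pearson(last, variables[name])))
        order.append(best)
        remaining.remove(best)
    return order


row_order = order_by_correlation({
    "temp": [1.0, 2.0, 3.0, 4.0],
    "pressure": [1.0, -1.0, 1.0, -1.0],
    "humidity": [2.0, 4.0, 6.0, 8.5],
})
```

In this made-up example, humidity tracks temperature closely, so it is placed next to temperature and ahead of pressure on the variable axis.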

FIG. 11 shows the two-dimensional data plane of FIG. 10 in matrix form.

As shown in FIG. 11, the first row of the matrix corresponds to the first time series data 231, and the second row corresponds to the second time series data 232. The first column of the matrix corresponds to the values of each time series data (231 to 236) measured at a first time point, and the second column corresponds to the values measured at a second time point. Depending on the embodiment, the roles of rows and columns may of course be exchanged.

FIG. 12 illustrates the process of generating a matrix corresponding to each time series interval from the data plane shown in FIG. 11.

As shown in FIG. 12, the analysis device 100 may generate matrices (241-2, 243-2) successively using a sliding window. Specifically, the analysis device 100 may extract the region 241-1 of the data plane that corresponds to the current window to generate a first matrix 241-2, and then extract the region 243-1 that corresponds to the slid window to generate a second matrix 243-2. The movement interval of the window (the stride) and the size of the window (the length of the time series interval) may vary depending on the embodiment.

For example, when strict monitoring is required, as in a semiconductor process, or when the temporal relationship within the data is important, the movement interval may be set to a relatively small value, because this allows more thorough monitoring. As another example, when the computing resources available to the analysis device 100 are limited, the movement interval may be set to a relatively large value, because this minimizes the data duplicated between successive matrices and reduces the number of matrices to be analyzed.
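The sliding-window extraction and the effect of the stride can be sketched as follows. The function name `sliding_window_matrices` is hypothetical, and the data plane is represented here as a list of rows with one row per time series variable, which is one possible representation rather than the one required by the specification.

```python
def sliding_window_matrices(data_plane, window, stride):
    """Cut a data plane into matrices of `window` columns, moving `stride` columns each time.

    `data_plane` is a list of rows, one row per time series variable, and all
    rows have the same length. Windows that would run past the end are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    n_time = len(data_plane[0]) if data_plane else 0
    matrices = []
    for start in range(0, n_time - window + 1, stride):
        matrices.append([row[start:start + window] for row in data_plane])
    return matrices


plane = [
    [1, 2, 3, 4, 5],        # first time series variable
    [10, 20, 30, 40, 50],   # second time series variable
]
dense = sliding_window_matrices(plane, window=3, stride=1)   # thorough monitoring
sparse = sliding_window_matrices(plane, window=3, stride=2)  # fewer matrices to analyze
```

With a stride of 1 the five time points yield three overlapping matrices; with a stride of 2 they yield two, which illustrates the trade-off between monitoring granularity and computing cost.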

The matrices (241-2, 243-2) generated as described above may be stored in the pattern DB 150. When storing the matrices (241-2, 243-2), the pattern DB 150 may also determine the pattern that matches each matrix and increment the occurrence frequency of that pattern.

Here, a pattern may have a 1:1 relationship with matrices or a 1:N relationship. When the relationship is 1:1, each matrix itself may be used as a pattern. When the relationship is 1:N, a representative matrix generated through clustering may be used as the pattern. The representative matrix may be generated, for example, by averaging the one or more matrices belonging to a cluster, but it may equally be generated in other ways (e.g., median, mode). In that case, one pattern corresponds to one cluster, and the occurrence frequency of the pattern can be calculated as the number of matrices belonging to the cluster.
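For the 1:1 case, in which each matrix is itself the pattern, the frequency bookkeeping can be sketched as below. The class name `PatternStore` and its methods are hypothetical stand-ins for the pattern DB 150, and exact equality of matrices (practical mainly for symbolized data) is assumed as the matching rule; the clustering-based 1:N case is not covered by this sketch.

```python
from collections import Counter


class PatternStore:
    """In-memory sketch of a pattern store that counts pattern occurrences."""

    def __init__(self):
        self._counts = Counter()

    @staticmethod
    def _key(matrix):
        # Convert the nested lists to nested tuples so they can be dictionary keys.
        return tuple(tuple(row) for row in matrix)

    def store(self, matrix):
        """Record one occurrence of the matrix's pattern and return its new frequency."""
        key = self._key(matrix)
        self._counts[key] += 1
        return self._counts[key]

    def frequency(self, matrix):
        """Return how many times the matrix's pattern has been stored (0 if never)."""
        return self._counts[self._key(matrix)]


db = PatternStore()
db.store([["a", "b"], ["c", "c"]])
db.store([["a", "b"], ["c", "c"]])
db.store([["b", "b"], ["a", "c"]])
```

After these three calls, the first pattern has a frequency of 2 and the second a frequency of 1.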

Note that FIGS. 10 to 12 describe the process of generating a matrix from multiple time series data at a conceptual level. In an actual implementation, the time series data need not be arranged on a data plane; instead, a matrix can be generated by forming its first row from the measured values of the first time series variable in a specific time series interval and its second row from the measured values of the second time series variable in that interval.

Referring again to FIG. 6, in step S70 the analysis device 100 predicts the class of the prediction target based on the occurrence frequency of the pattern that matches the matrix.

For example, when the classes of the prediction target are an abnormal class and a normal class, the analysis device 100 may predict the class of the prediction target to be the abnormal class in response to determining that the occurrence frequency of the pattern is below a threshold value, because a rare pattern with a low occurrence frequency is likely to be close to the abnormal class. The threshold value may be a preset fixed value or a value that varies with the situation.
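The frequency-threshold rule can be sketched in a few lines. The function name `predict_class`, the class labels, and the use of short strings as pattern keys are illustrative assumptions; an unseen pattern is treated as having a frequency of 0.

```python
from collections import Counter


def predict_class(pattern, pattern_counts, threshold):
    """Return "abnormal" when the pattern's occurrence frequency is below `threshold`."""
    frequency = pattern_counts.get(pattern, 0)
    return "abnormal" if frequency < threshold else "normal"


# Illustrative counts: one common pattern and one rare pattern.
counts = Counter({"baab": 40, "ccbc": 1})
common = predict_class("baab", counts, threshold=5)
rare = predict_class("ccbc", counts, threshold=5)
```

A variable threshold could be obtained, for example, by recomputing `threshold` from the current distribution of counts, but the specification leaves that choice open.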

In step S90, the analysis device 100 determines, among the plurality of time series variables, the main influence variable that most affected the class determination. For example, when the class of the prediction target is predicted to be the abnormal class, the analysis device 100 may determine which of the time series variables most affected the abnormality determination. Step S90 is described in detail below with reference to FIGS. 13 to 15.

FIGS. 13 to 15 are exemplary diagrams for explaining the process of determining the main influence variable for the abnormal class. For ease of understanding, FIGS. 13 to 15 illustrate the case in which the classes of the prediction target are divided into normal and abnormal, but the following description applies equally when three or more classes exist. In addition, in the following drawings, the shaded portion of a matrix indicates the portion to which a blind filter is applied, and the blind filter conceptually means that the values of that portion are excluded from the similarity calculation. The values may be excluded from the similarity calculation by, for example, replacing them with "0" or changing them to an arbitrary value, and any manner of implementation is acceptable. The description continues below with reference to FIG. 13.

As shown in FIG. 13, the analysis device 100 searches a plurality of matrices already classified as normal (253 to 257, hereinafter referred to as "normal matrices") for one that matches the matrix predicted as the abnormal class (251, hereinafter referred to as the "abnormal matrix"). Here, the matching condition may be that the similarity between the matrices is greater than or equal to a threshold value, but the matching condition may vary depending on the embodiment.

More specifically, the analysis device 100 applies the blind filter row by row (that is, time series variable by time series variable) and then calculates the similarity between the abnormal matrix 251 and the first normal matrix 253. For example, the analysis device 100 may apply the blind filter to the first rows 251-1 and 253-1 of the two matrices 251 and 253, calculate the matrix similarity, and repeat this process through the last rows 251-2 and 253-2. The similarity may be calculated in any manner.

The above process may be performed in the same way for the other normal matrices (e.g., the second normal matrix 255, the third normal matrix 257, and so on). The analysis device 100 may perform this search until a normal matrix satisfying the matching condition is found, or may perform it for all of the given normal matrices.

When a matching normal matrix is found, the main influence variable can be determined immediately. For example, suppose that a matching normal matrix is found when the blind filter is applied to the first row 251-1. Then, the analysis device 100 may determine the first time series variable corresponding to the first row 251-1 as the main influence variable. This is because the fact that the abnormal matrix 251 is close to a normal matrix once the measured values of the first time series variable are excluded means that those measured values had the greatest influence on the abnormality determination.
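The row-by-row search described above may be sketched as follows. The text leaves the similarity measure open, so the distance-based `similarity` below is an assumption, as are the function names; zero replacement is used for the blind filter because it is one of the options the description mentions.

```python
import math


def blind(matrix, row):
    """Return a copy of the matrix with the given row replaced by zeros."""
    return [[0.0] * len(r) if i == row else list(r)
            for i, r in enumerate(matrix)]


def similarity(a, b):
    """Assumed measure: 1 / (1 + Euclidean distance), which lies in (0, 1]."""
    dist = math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))
    return 1.0 / (1.0 + dist)


def find_main_influence_rows(abnormal, normals, threshold):
    """Map each row index whose masking yields a match to its best similarity.

    `normals` must be a non-empty list of matrices shaped like `abnormal`.
    """
    hits = {}
    for row in range(len(abnormal)):
        masked = blind(abnormal, row)
        best = max(similarity(masked, blind(n, row)) for n in normals)
        if best >= threshold:
            hits[row] = best
    return hits
```

A row index returned by `find_main_influence_rows` corresponds to a time series variable whose removal makes the abnormal matrix look like some normal matrix.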

Meanwhile, a plurality of matching normal matrices may be found, so that two or more time series variables are determined as main influence variables. For example, a matching normal matrix may also be found when the blind filter is applied to a row other than the first row 251-1 (e.g., the last row 251-2), so that two time series variables are determined as main influence variables. In such a case, the analysis device 100 may rank the main influence variables according to a predetermined criterion. For example, when the first time series variable and the second time series variable are determined as main influence variables, and the matrix similarity associated with the first time series variable (that is, the matrix similarity calculated when the first time series variable is blocked) is higher than the matrix similarity associated with the second time series variable, the first time series variable may be determined as the higher-ranked main influence variable. As another example, when the number of normal matrices associated with the first time series variable (that is, the normal matrices matched when the first time series variable is blocked) is greater than the number of normal matrices associated with the second time series variable, the first time series variable may be determined as the higher-ranked main influence variable.
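The two ranking criteria described above may be sketched as follows. The function names and the dictionary inputs (variable name mapped to its best similarity, or to its number of matched normal matrices) are assumptions made for illustration.

```python
def rank_by_similarity(best_similarity):
    """Order variables from highest to lowest associated matrix similarity."""
    return sorted(best_similarity, key=lambda v: best_similarity[v], reverse=True)


def rank_by_match_count(match_counts):
    """Order variables from most to fewest matched normal matrices."""
    return sorted(match_counts, key=lambda v: match_counts[v], reverse=True)
```

The two criteria can disagree, so an implementation would choose one of them or combine them under whatever predetermined criterion applies.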

Meanwhile, because the number of normal matrices is typically very large, calculating the similarity with every normal matrix is very inefficient in terms of computing cost. It is therefore necessary to select, according to an appropriate criterion, the normal matrices for which the similarity is to be calculated. According to an embodiment of the present invention, the matrices subject to similarity calculation may be selected by applying a Locality Sensitive Hashing (LSH) algorithm. This embodiment is described below with reference to FIGS. 14 and 15.

As shown in FIG. 14, among the matrices 261-1 to 261-n, the first matrix 261-1 is an abnormal matrix and the remaining matrices 261-2 to 261-n are normal matrices. The analysis device 100 generates a signature vector (263-1 to 263-n) or a signature matrix (hereinafter collectively referred to as a "signature") from each matrix 261-1 to 261-n through min-hashing. Min-hashing is a technique that converts a large set into a small signature while preserving similarity (e.g., Jaccard similarity). Since min-hashing is already well known in the art, a detailed description thereof is omitted.
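For readers unfamiliar with min-hashing, a minimal sketch is given below. How a matrix is converted into a set is not stated in this passage, so the tokenisation in `matrix_to_set` is an assumption, as are the function names, the linear hash family, and the use of CRC32 to map tokens to integers.

```python
import random
import zlib

PRIME = 4294967311  # smallest prime above 2**32, used as the hash modulus


def matrix_to_set(matrix, step=1.0):
    """Assumed tokenisation: one (row, column, quantised value) token per cell."""
    return {(i, j, round(v / step))
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions of the form (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """Return one minimum hash value per hash function (tokens must be non-empty)."""
    ids = [zlib.crc32(repr(t).encode()) for t in tokens]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The fraction of agreeing signature positions estimates the Jaccard similarity of the underlying sets, which is why short signatures can stand in for the full matrices.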

Next, as shown in FIG. 15, the analysis device 100 combines the generated signatures 263-1 to 263-n to generate a matrix 265. The analysis device 100 then applies the LSH algorithm to divide the matrix 265 into b bands and calculates a hash value for each band. The bucket shown in FIG. 15 conceptually refers to a set of bands having the same hash value.

Here, the analysis device 100 may select, as a matrix subject to similarity calculation, a normal matrix (e.g., the matrix of signature 263-k) that is in the same bucket as a band (e.g., 266) of the first signature 263-1. For example, since the specific band 266 of the first signature 263-1 and the band 267 of the k-th signature 263-k are in the same bucket, the normal matrix indicated by the k-th signature 263-k may be selected as a matrix subject to similarity calculation. This is because, owing to the characteristics of min-hashing and the LSH algorithm, two matrices that share the same hash value for some band are likely to be similar to each other.

Among the b bands constituting the first signature 263-1, the analysis device 100 may select as similarity calculation targets both the first normal matrices that are in the same bucket as the first band and the second normal matrices that are in the same bucket as the second band. Alternatively, only the matrices belonging to the intersection of the first normal matrices and the second normal matrices may be selected as similarity calculation targets.
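The banding and candidate selection described above may be sketched as follows. The function names are hypothetical, and the sketch uses the band index together with the band contents as the bucket key rather than an explicit hash value, relying on the dictionary's own hashing. The `min_bands` parameter is an assumption that generalises the two options in the text: 1 corresponds to the union, and a larger value requires collisions in that many bands, as in the intersection.

```python
from collections import Counter, defaultdict


def band_keys(signature, bands):
    """Split a signature into `bands` equal slices, keyed by band index."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)]


def lsh_candidates(query_sig, normal_sigs, bands, min_bands=1):
    """Return indices of normal signatures sharing a bucket with the query
    in at least `min_bands` bands."""
    buckets = defaultdict(set)
    for idx, sig in enumerate(normal_sigs):
        for key in band_keys(sig, bands):
            buckets[key].add(idx)
    collisions = Counter()
    for key in band_keys(query_sig, bands):
        for idx in buckets.get(key, ()):
            collisions[idx] += 1
    return {idx for idx, n in collisions.items() if n >= min_bands}
```

Only the returned candidates would then be passed to the full matrix similarity calculation.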

According to the embodiment described above, some normal matrices that are expected to have a high similarity are selected, and the matrix similarity calculation is performed only on the selected normal matrices. The computing cost required to determine the main influence variable can therefore be greatly reduced.

Meanwhile, according to another embodiment of the present invention, the analysis device 100 may select the matrices subject to similarity calculation through a clustering technique. Specifically, the analysis device 100 may cluster the set of normal matrices to build a preset number of clusters and select a representative matrix of each cluster as a similarity calculation target. According to this embodiment, only the representative patterns of the normal matrices are selected as similarity calculation targets, so that the computing cost can be greatly reduced.
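The clustering technique is not specified in the text. As one possible reading, the sketch below uses a small k-means loop and takes, for each cluster, the member closest to the centroid as the representative matrix. The function names, the evenly spaced initial centroids, and the fixed iteration count are all assumptions of this sketch.

```python
def flatten(matrix):
    """Concatenate the rows of a matrix into one vector."""
    return [v for row in matrix for v in row]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representative_indices(matrices, k, iterations=10):
    """Cluster the matrices into k groups and return, sorted, the index of the
    member nearest to each non-empty cluster's centroid."""
    points = [flatten(m) for m in matrices]
    # Deterministic seeding: take k evenly spaced points as initial centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(idx)
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(points[i][d] for i in members) / len(members)
                                for d in range(len(points[0]))]
    reps = [min(members, key=lambda i: sq_dist(points[i], centroids[c]))
            for c, members in enumerate(groups) if members]
    return sorted(reps)
```

Similarity would then be computed only against the matrices at the returned indices.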

The method of determining a main influence variable according to an embodiment of the present invention has been described above with reference to FIGS. 13 to 15. In that description, the classes of the prediction target were limited to the normal and abnormal classes. It should be noted, however, that the method can be performed in the same manner when determining the main influence variable for an arbitrary first class. For example, suppose that there are a first matrix predicted as a first class and at least one second matrix belonging to a second class, and that the main influence variable for the first class is to be determined. In this case, the analysis device 100 may determine the main influence variable for the first class by applying the blind filter to the two matrices and calculating the similarity between the first matrix and the second matrix.

Meanwhile, according to an embodiment of the present invention, a matrix of the same class may also be used to determine the main influence variable for the first class. For example, suppose that the matrix currently predicted as the abnormal class is a first abnormal matrix and that a matrix already classified as the abnormal class is a second abnormal matrix. Then, the analysis device 100 may apply the blind filter to each row (that is, each time series variable) of the first abnormal matrix and the second abnormal matrix and then calculate the similarity between the two matrices. When the matrix similarity associated with the first time series variable (that is, the similarity calculated when the first time series variable is blocked) is lower than the matrix similarity associated with the other time series variables by a threshold value or more (that is, the difference is greater than or equal to the threshold value), the first time series variable may be determined as the main influence variable for the abnormal class. This is because the fact that a specific abnormal matrix is least similar to another abnormal matrix once the measured values of the first time series variable are excluded means that those measured values had the greatest influence on the abnormality determination. Of course, the condition for determining the main influence variable may be modified in any number of ways depending on the embodiment.
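The same-class condition described above may be sketched as follows, under the assumption that one similarity value has already been computed per blocked row. The function name and the reading of "lower than the other variables by a threshold or more" as applying to every other variable are assumptions of this sketch.

```python
def main_influence_row_same_class(similarities, margin):
    """Return the row whose blocked similarity is lower than every other row's
    by at least `margin`, or None if no row satisfies the condition.

    similarities[i] is the similarity between the two abnormal matrices
    computed with row i blocked.
    """
    lowest = min(range(len(similarities)), key=lambda i: similarities[i])
    others = [s for i, s in enumerate(similarities) if i != lowest]
    if others and all(s - similarities[lowest] >= margin for s in others):
        return lowest
    return None
```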

For reference, among the steps S10 to S90 described above, step S10 may be performed by the data collection unit 110 and step S30 may be performed by the pre-processing unit 120. In addition, step S50 may be performed by the matrix generation unit 130, step S70 may be performed by the first analysis unit 141, and step S90 may be performed by the first main influence variable determination unit 161.

The method of analyzing time series data according to the first embodiment of the present invention has been described above with reference to FIGS. 6 to 15. According to this method, multiple time series data are processed into a matrix, which is a two-dimensional data structure. A two-dimensional data structure is suitable for reflecting both the autocorrelation of the time series data and the correlation between the time series variables. Therefore, the accuracy of analysis and prediction on multiple time series data can be greatly improved. Furthermore, the main influence variable that affected the class determination can be accurately identified by using the blind filter.

Hereinafter, a method of analyzing time series data according to a second embodiment of the present invention will be described with reference to FIGS. 16 to 18.

FIG. 16 is a flowchart illustrating a method of analyzing time series data according to the second embodiment of the present invention. In the following description, matters that overlap with the first embodiment described above are omitted for clarity of the specification.

As shown in FIG. 16, the overall process of the second embodiment is similar to that of the first embodiment described above. The second embodiment differs from the first embodiment, however, in that in steps S170 and S190 the class of the prediction target is predicted and the main influence variable is determined based on a prediction model.

The prediction model is a model used to predict the class of the prediction target. The prediction model may be built through machine learning, but its specific configuration and operation may vary depending on the embodiment.

In one embodiment, the prediction model may be configured based on a convolutional neural network. A convolutional neural network is a neural network specialized for extracting local features from data of two or more dimensions. Therefore, a convolutional neural network is the most suitable model for extracting features from matrix data in consideration of the correlation between time series variables. In some embodiments, before the matrix is input to the convolutional neural network, the values of the matrix may be adjusted to fit the range of pixel values. Of course, this adjustment may also be performed together with other pre-processing in step S130. The convolutional neural network is a neural network specialized for image classification tasks, and a person skilled in the art will readily understand its configuration and operation, so a detailed description thereof is omitted.
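The adjustment to the pixel value range may be sketched as follows. The text does not say how the adjustment is done, so the global min-max scaling to the range 0 to 255 and the function name `to_pixel_range` are assumptions of this sketch.

```python
def to_pixel_range(matrix, low=0, high=255):
    """Linearly rescale all matrix values into the integer range [low, high]."""
    values = [v for row in matrix for v in row]
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant matrix carries no contrast; map every cell to `low`.
        return [[low for _ in row] for row in matrix]
    scale = (high - low) / (vmax - vmin)
    return [[round(low + (v - vmin) * scale) for v in row] for row in matrix]
```

When the time series variables have very different units, scaling each row separately may be preferable so that one variable does not dominate the range.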

In this embodiment, the analysis device 100 may predict the class of the prediction target based on the per-class confidence scores output by the prediction model. For example, the analysis device 100 may predict the class of the prediction target as a first class in response to a determination that the confidence score of the first class is the highest. According to this embodiment, accurate analysis and prediction can be performed by utilizing the characteristics of the convolutional neural network.

In one embodiment, the prediction model may be configured based on a combination of a convolutional neural network and a recurrent neural network (RNN). A recurrent neural network is a neural network specialized for extracting features that depend on temporal order through its recurrent connection structure. In addition, time series data generally exhibit autocorrelation, so that past data affect current data. Therefore, when the two neural networks are combined, the autocorrelated nature of time series data can be better taken into account, and multiple time series data can be analyzed accurately.

As a more specific example, as shown in FIG. 17, the prediction model may be configured based on a convolutional neural network 273 and a Long Short-Term Memory (LSTM) neural network 275, which is a type of recurrent neural network. In this case, the convolutional neural network 273 receives the plurality of matrices 271-1 to 271-n generated in step S150 and extracts features (e.g., feature maps) from the plurality of matrices 271-1 to 271-n. The LSTM neural network 275 then outputs per-class confidence scores for the prediction target based on the features extracted by the convolutional neural network 273. As in the previous embodiment, the analysis device 100 may predict the class of the prediction target based on the per-class confidence scores.

According to this embodiment, the combination of a convolutional neural network and a recurrent neural network allows the autocorrelated nature of time series data to be considered in depth. Accordingly, the accuracy of analysis and prediction can be further improved.

Next, the method of determining the main influence variable in step S190 will be described in detail with reference to FIG. 18.

FIG. 18 is an exemplary diagram for explaining a method of determining a main influence variable using a prediction model according to an embodiment of the present invention. For ease of understanding, FIG. 18 also illustrates the case in which the classes of the prediction target are divided into normal and abnormal, but the following description applies equally when three or more classes exist. The description continues below with reference to FIG. 18.

As shown in FIG. 18, the analysis device 100 may obtain per-class confidence scores 283 and 285 by inputting to the prediction model the abnormal matrix 281 with the blind filter applied to a specific row (that is, a specific time series variable). For example, the analysis device 100 may apply the blind filter to the first row 281-1 of the abnormal matrix 281 to calculate the confidence score 283, and repeat this process through the last row 281-2.

Here, the analysis device 100 may determine as the main influence variable a specific time series variable, among the plurality of time series variables, whose associated per-class confidence score (that is, the confidence score calculated when that time series variable is blocked) satisfies a predetermined condition.

Here, the predetermined condition may include a first condition, in which the confidence score of the normal class associated with the specific time series variable (hereinafter referred to as the "normal confidence score") is greater than or equal to a threshold value; a second condition, in which the normal confidence score associated with the specific time series variable is higher than the original score (that is, the score obtained when no blind filter is applied) by a threshold value or more (that is, the difference is greater than or equal to the threshold value); or a third condition, in which the normal confidence score associated with the specific time series variable is higher than the normal confidence scores associated with the other time series variables by a threshold value or more.

Alternatively, the predetermined condition may include a fourth condition, in which the confidence score of the abnormal class associated with the specific time series variable (hereinafter referred to as the "abnormal confidence score") is less than a threshold value; a fifth condition, in which the abnormal confidence score associated with the specific time series variable is lower than the original score (that is, the score obtained when no blind filter is applied) by a threshold value or more (that is, the difference is greater than or equal to the threshold value); or a sixth condition, in which the abnormal confidence score associated with the specific time series variable is lower than the abnormal confidence scores associated with the other time series variables by a threshold value or more. However, the conditions listed above are examples for explaining some embodiments of the present invention, and the technical scope of the present invention is not limited to them.
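The model-based procedure and three of the listed conditions may be sketched as follows. The prediction model is represented by any callable that returns a dictionary with `"normal"` and `"abnormal"` confidence scores; `toy_predict` is a hypothetical stand-in for a trained model, and the function and condition names are assumptions of this sketch.

```python
def blind(matrix, row):
    """Return a copy of the matrix with the given row replaced by zeros."""
    return [[0.0] * len(r) if i == row else list(r)
            for i, r in enumerate(matrix)]


def main_influence_rows_by_model(matrix, predict, threshold, condition):
    """Return the rows whose blocking satisfies the chosen condition."""
    base = predict(matrix)  # scores with no blind filter applied
    selected = []
    for row in range(len(matrix)):
        scores = predict(blind(matrix, row))
        if condition == "normal_above":      # first condition
            ok = scores["normal"] >= threshold
        elif condition == "normal_rise":     # second condition
            ok = scores["normal"] - base["normal"] >= threshold
        elif condition == "abnormal_drop":   # fifth condition
            ok = base["abnormal"] - scores["abnormal"] >= threshold
        else:
            raise ValueError(f"unknown condition: {condition}")
        if ok:
            selected.append(row)
    return selected


def toy_predict(matrix):
    """Hypothetical model: flags a matrix as abnormal if any value exceeds 5."""
    abnormal = 0.9 if max(v for row in matrix for v in row) > 5.0 else 0.1
    return {"normal": 1.0 - abnormal, "abnormal": abnormal}
```

Unlike the similarity-based search of the first embodiment, this approach needs only one model evaluation per row and no stored normal matrices.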

For reference, among the steps described above, step S170 may be performed by the second analysis unit 143, and step S190 may be performed by the second main influence variable determination unit 163.

The method of analyzing time series data according to the second embodiment of the present invention has been described above with reference to FIGS. 16 to 18. According to this method, multiple time series data can be analyzed and predicted through a prediction model based on a convolutional neural network. The convolutional neural network is a machine learning model specialized for extracting local features from data of two or more dimensions. Therefore, the correlation between time series variables can be well reflected in the analysis and prediction process, and the accuracy of analysis and prediction on time series data can be greatly improved. Furthermore, unlike the first embodiment, the use of a prediction model together with the blind filter allows the main influence variable to be identified in a simple manner, without searching for similar matrices.

Hereinafter, for further ease of understanding, an example in which the technical idea of the present invention is applied to the field of process anomaly detection will be briefly described with reference to FIG. 19.

FIG. 19 is a configuration diagram illustrating a process anomaly detection system according to an application example of the present invention.

As shown in FIG. 19, the technical idea of the present invention described above may be embodied in an anomaly detection device 300.

The anomaly detection device 300 is a device that detects an abnormality of the process facility 310 in real time by analyzing multiple time series data composed of the measured values of a plurality of sensors 320-1 to 320-n.

The anomaly detection device 300 may generate a matrix from the multiple time series data and analyze the matrix to determine whether the process facility 310 is abnormal. For example, as in the first embodiment described above (e.g., see FIG. 6), the anomaly detection device 300 may determine that the process facility 310 is abnormal in response to a determination that the occurrence frequency of the pattern matching the matrix is less than a threshold value. As another example, as in the second embodiment described above (e.g., see FIG. 16), the anomaly detection device 300 may determine whether the process facility 310 is abnormal based on the confidence scores of the prediction model. In addition, in response to an abnormality determination, the anomaly detection device 300 may provide a predetermined alarm to an administrator. This can improve the efficiency of the manufacturing process and make it more convenient to manage.

Furthermore, the anomaly detection device 300 may determine, among a plurality of time series variables (that is, the measured variables of the sensors), the main influence variable that most affected the abnormality determination. Since information on the cause of the abnormality can thereby be additionally provided to the administrator, the efficiency of the manufacturing process and the convenience of management can be further increased.

Meanwhile, according to another embodiment of the present invention, the anomaly detection device 300 may perform anomaly detection based on a combination of the first embodiment and the second embodiment described above. Specifically, the anomaly detection device 300 may perform anomaly detection according to the first embodiment while accumulating a training dataset, train a prediction model using the accumulated training dataset, and thereafter perform anomaly detection based on the prediction model according to the second embodiment.

To elaborate, performing anomaly detection according to the first embodiment amounts to a labeling operation that assigns a class label to each matrix based on the occurrence frequency of its pattern. The abnormal matrices and normal matrices labeled in this way can be used as the training dataset of the prediction model. That is, the anomaly detection apparatus 300 may train the prediction model with the training dataset. Once the prediction model is sufficiently trained, anomaly detection may be performed according to the second embodiment.
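
The labeling step described above can be sketched as follows. This is a simplified illustration that assumes each matrix has already been symbolized, so that identical patterns compare equal, and that "matching" means exact equality. The function name `label_by_frequency` and the threshold value are hypothetical.

```python
from collections import Counter


def label_by_frequency(matrices, threshold):
    """Return (matrix, label) pairs; patterns seen fewer than `threshold` times are 'abnormal'."""
    # Convert each matrix (a list of rows) into a hashable key so patterns can be counted.
    keys = [tuple(tuple(row) for row in m) for m in matrices]
    counts = Counter(keys)
    return [
        (m, "abnormal" if counts[k] < threshold else "normal")
        for m, k in zip(matrices, keys)
    ]


# Toy symbolized matrices: rows are variables, columns are time steps.
collected = [
    [["a", "b"], ["c", "c"]],
    [["a", "b"], ["c", "c"]],
    [["a", "b"], ["c", "c"]],
    [["d", "a"], ["b", "c"]],
]
training_set = label_by_frequency(collected, threshold=2)
labels = [label for _, label in training_set]
print(labels)
```

The resulting (matrix, label) pairs could then serve as the training dataset for the prediction model of the second embodiment.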

In some embodiments, the anomaly detection apparatus 100 may perform the first anomaly detection according to the first embodiment and the second anomaly detection according to the second embodiment in parallel. In this case, the anomaly detection apparatus 100 may adjust the relative weights given to the first anomaly detection and the second anomaly detection in proportion to the learning maturity of the prediction model. For example, as the learning maturity increases, the anomaly detection apparatus 100 may increase the weight of the second anomaly detection and decrease the weight of the first anomaly detection, because the accuracy of a machine learning model improves as its learning maturity increases. According to this embodiment, increasing the weight of the prediction model over time gradually improves the accuracy of anomaly detection.
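
One simple way to realize such a weighting is a linear blend of the two detectors' outputs. The sketch below assumes that both detectors produce an anomaly score in [0, 1] and that learning maturity is expressed as a number in [0, 1]. How maturity is actually measured is not specified here, so the `maturity` parameter and the linear form are assumptions made for illustration.

```python
def blended_anomaly_score(frequency_score, model_score, maturity):
    """Blend the pattern-frequency score and the prediction-model score.

    maturity = 0 relies only on the first (frequency-based) detection;
    maturity = 1 relies only on the second (model-based) detection.
    """
    if not 0.0 <= maturity <= 1.0:
        raise ValueError("maturity must be between 0 and 1")
    return (1.0 - maturity) * frequency_score + maturity * model_score


# Early in training the frequency-based score dominates; later the model score does.
early = blended_anomaly_score(0.8, 0.2, maturity=0.25)
late = blended_anomaly_score(0.8, 0.2, maturity=0.9)
print(early, late)
```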

Thus far, an example in which the technical idea of the present invention is applied to the field of process anomaly detection has been briefly described with reference to FIG. 19. As described above, performing anomaly detection in consideration of the correlation between time series variables can solve the conventional problem of low anomaly detection accuracy.

Meanwhile, it should be noted that the technical idea of the present invention is applicable not only to process anomaly detection but also to various other fields that deal with multiple time series data. For example, when multiple time series data on exchange rates, stock indices, and the like are analyzed to predict the rise or fall in value of a specific asset (e.g., stocks, real estate), the technical ideas described above can be applied as they are, without any substantial change. Furthermore, according to embodiments of the present invention, it is even possible to accurately identify which of the influencing factors, such as exchange rates and stock indices, had the greatest influence on the change in value.

Thus far, some embodiments of the present invention and their effects have been described with reference to FIGS. 1 to 19. The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

The concepts of the present invention described thus far with reference to FIGS. 1 to 19 may be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, removable hard disk) or a fixed recording medium (ROM, RAM, hard disk built into a computer). The computer program recorded on the computer-readable recording medium may be transmitted to another computing device over a network such as the Internet and installed on that computing device, so that it can be used there.

Although all the components constituting the embodiments of the present invention have been described above as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated.

Although operations are shown in a particular order in the drawings, this should not be understood to mean that the operations must be performed in the particular order shown or in sequential order, or that all illustrated operations must be performed, to obtain the desired results. In certain situations, multitasking and parallel processing may be advantageous. Moreover, the separation of various components in the embodiments described above should not be understood as requiring such separation; the described program components and systems can generally be integrated together into a single software product or packaged into multiple software products.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within the scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

Claims

A method, performed by a computing device, of determining a key influence variable for a target class among a plurality of time series variables, the method comprising:
obtaining a first matrix predicted as the target class from multiple time series data associated with the plurality of time series variables;
obtaining a second matrix belonging to a specific class;
calculating a similarity between the first matrix and the second matrix with the values of a first time series variable excluded from both matrices; and
determining the first time series variable as the key influence variable in response to a determination that the calculated similarity satisfies a predetermined condition,
wherein a first row or first column of the first matrix consists of measured values of the first time series variable, and
a second row or second column of the first matrix consists of measured values of a second time series variable.

The method of claim 1,
wherein the specific class is a class different from the target class, and
the determining as the key influence variable comprises
determining the first time series variable as the key influence variable in response to a determination that the calculated similarity is equal to or greater than a threshold.

The method of claim 1,
wherein the specific class is the same class as the target class, and
the determining as the key influence variable comprises
determining the first time series variable as the key influence variable in response to a determination that the calculated similarity is less than a threshold.

The method of claim 1,
wherein the obtaining of the first matrix comprises:
generating the first matrix by extracting data of a preset time series section from the multiple time series data; and
predicting a class of the first matrix as the target class based on a result of analyzing the first matrix.

The method of claim 4,
wherein the target class is an abnormal class, and
the predicting as the target class comprises
predicting the class of the first matrix as the abnormal class in response to a determination that an occurrence frequency of a first pattern matching the first matrix is below a threshold.

The method of claim 4,
wherein the generating of the first matrix comprises:
normalizing the multiple time series data; and
generating the first matrix based on the normalized multiple time series data.

The method of claim 6,
wherein the generating of the first matrix based on the normalized multiple time series data comprises:
symbolizing the normalized multiple time series data through a symbolic aggregate approximation (SAX) transformation; and
generating the first matrix based on the symbolized multiple time series data.
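
For illustration only, a minimal SAX sketch for a single series is shown below. It follows the usual three steps of the technique: z-normalization, piecewise aggregate approximation (PAA), and mapping of each segment mean to a symbol using breakpoints that split the standard normal distribution into equiprobable regions. The alphabet size of 4, the function name `sax`, and the requirement that the series length be divisible by the number of segments are assumptions of this example.

```python
import numpy as np

# Quartile breakpoints of the standard normal distribution (alphabet size 4).
BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])


def sax(series, n_segments, alphabet="abcd"):
    """Symbolize one time series; len(series) must be divisible by n_segments."""
    x = np.asarray(series, dtype=float)
    std = x.std()
    # z-normalize (a constant series maps to all zeros).
    z = (x - x.mean()) / std if std > 0 else np.zeros_like(x)
    # PAA: mean of each equal-length segment.
    paa = z.reshape(n_segments, -1).mean(axis=1)
    # Map each segment mean to the index of the region it falls into.
    indices = np.searchsorted(BREAKPOINTS, paa)
    return "".join(alphabet[i] for i in indices)


word = sax([0, 0, 4, 4, 6, 6, 10, 10], n_segments=4)
print(word)
```

Applying the same transformation to every time series variable and stacking the resulting words row by row would yield a symbolized matrix.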

The method of claim 4,
wherein the generating of the first matrix comprises:
arranging measured values of the first time series variable and the second time series variable along a time series variable axis on a data plane formed by a time axis and the time series variable axis; and
generating the first matrix by extracting the measured values corresponding to a sliding window on the data plane.
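
As a sketch of the sliding-window extraction, assume the data plane is a two-dimensional array whose rows are time series variables and whose columns are time steps. The window width, the step size, and the function name below are hypothetical.

```python
import numpy as np


def sliding_window_matrices(data, window, step=1):
    """Cut a (n_variables, n_timesteps) array into matrices of `window` columns."""
    data = np.asarray(data)
    n_steps = data.shape[1]
    return [
        data[:, start:start + window]
        for start in range(0, n_steps - window + 1, step)
    ]


# Two variables measured over five time steps.
plane = np.array([
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
])
matrices = sliding_window_matrices(plane, window=3)
print(len(matrices))
```

Each extracted matrix keeps one row per variable, so relationships between variables within the same time section remain visible to the later analysis.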

The method of claim 4,
wherein the predicting as the target class comprises:
inputting the first matrix into a prediction model composed of a convolutional neural network; and
predicting the class of the first matrix as the target class based on an output of the prediction model.

The method of claim 1,
wherein the obtaining of the second matrix comprises:
obtaining a plurality of candidate matrices belonging to the specific class; and
selecting the second matrix from the plurality of candidate matrices by applying a locality sensitive hashing (LSH) algorithm.
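
The claim does not fix a particular LSH scheme. As one possible illustration, the sketch below uses sign random projections (a common LSH family for angular similarity) and assumes the goal is to pick the candidate most similar to the first matrix. For brevity it compares hash signatures of all candidates directly instead of building hash buckets, which a real LSH index would do. All names and parameters are hypothetical.

```python
import numpy as np


def lsh_signature(matrix, hyperplanes):
    """Bit signature: which side of each random hyperplane the flattened matrix lies on."""
    return (hyperplanes @ np.asarray(matrix, dtype=float).ravel()) > 0


def select_reference(query, candidates, n_bits=16, seed=0):
    """Index of the candidate whose signature is closest (Hamming) to the query's."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((n_bits, np.asarray(query).size))
    q_sig = lsh_signature(query, hyperplanes)
    distances = [
        int(np.count_nonzero(q_sig != lsh_signature(c, hyperplanes)))
        for c in candidates
    ]
    return int(np.argmin(distances))


query = np.array([[1.0, 2.0], [3.0, 4.0]])
# The negated matrix points in the opposite direction; the copy is identical.
candidates = [-query, query.copy()]
chosen = select_reference(query, candidates)
print(chosen)
```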

A method, performed by a computing device, of determining a key influence variable for a specific class among a plurality of time series variables, the method comprising:
generating a first matrix by extracting data of a preset time series section from multiple time series data associated with the plurality of time series variables;
inputting the first matrix into a prediction model and predicting a class of the first matrix as a first class based on a first confidence score output from the prediction model; and
determining a key influence variable for the first class,
wherein a first row or first column of the first matrix consists of measured values of a first time series variable,
a second row or second column of the first matrix consists of measured values of a second time series variable, and
the determining of the key influence variable comprises:
obtaining a second confidence score by re-inputting, into the prediction model, the first matrix from which the values of the first time series variable are excluded; and
determining the first time series variable as the key influence variable of the first class in response to a determination that the second confidence score satisfies a predetermined condition.

The method of claim 11,
wherein the first confidence score is a confidence score for the first class,
the second confidence score is a confidence score for a second class, and
the determining of the first time series variable as the key influence variable of the first class comprises
determining the first time series variable as the key influence variable of the first class in response to a determination that the second confidence score is equal to or greater than a threshold.

The method of claim 11,
wherein the first confidence score and the second confidence score are both confidence scores for the first class, and
the determining of the first time series variable as the key influence variable of the first class comprises
determining the first time series variable as the key influence variable in response to a determination that a difference between the first confidence score and the second confidence score satisfies a predetermined condition.
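
The confidence-difference variant can be sketched as follows. This is a hedged illustration only: the variable is "excluded" by zeroing its row, which is an assumption, and `toy_model` is a stand-in function, not a trained prediction model. The names `influence_by_confidence_drop` and `min_drop` are hypothetical.

```python
import numpy as np


def influence_by_confidence_drop(matrix, predict_confidence, min_drop):
    """Variables whose exclusion lowers the first-class confidence by at least `min_drop`."""
    matrix = np.asarray(matrix, dtype=float)
    base = predict_confidence(matrix)  # first confidence score
    influential = []
    for i in range(matrix.shape[0]):
        masked = matrix.copy()
        masked[i, :] = 0.0  # assumption: exclude a variable by zeroing its row
        if base - predict_confidence(masked) >= min_drop:
            influential.append(i)
    return influential


def toy_model(m):
    """Stand-in for a prediction model: abnormal-class confidence driven by row 1 only."""
    return float(1.0 / (1.0 + np.exp(-(m[1].mean() - 5.0))))


first_matrix = [[1, 1, 1], [9, 9, 9], [3, 3, 3]]
found = influence_by_confidence_drop(first_matrix, toy_model, min_drop=0.5)
print(found)
```

Because the stand-in model depends only on the second variable, excluding that variable is the only change that lowers the confidence score, so only index 1 is reported.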

A method, performed by a computing device using a prediction model based on a convolutional neural network, of analyzing multiple time series data associated with a prediction target, the method comprising:
generating a first matrix by extracting data of a preset time series section from the multiple time series data; and
predicting a class of the prediction target by applying the first matrix to the prediction model,
wherein the multiple time series data includes measured values of a first time series variable and a second time series variable,
a first row or first column of the first matrix consists of measured values of the first time series variable for the time series section, and
a second row or second column of the first matrix consists of measured values of the second time series variable for the time series section.

The method of claim 14,
wherein the generating of the first matrix comprises:
arranging measured values of the first time series variable and the second time series variable along a time series variable axis on a data plane formed by a time axis and the time series variable axis; and
generating the first matrix by extracting the measured values corresponding to a sliding window on the data plane.

The method of claim 15,
wherein arrangement positions of the first time series variable and the second time series variable on the time series variable axis
are determined based on a result of a correlation analysis of the first time series variable and the second time series variable.
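
The claim leaves open how the correlation result determines the arrangement. One plausible reading, sketched below, is to place strongly correlated variables next to each other so that a convolution kernel sees them together. The greedy ordering by absolute Pearson correlation and the function name are assumptions made for this example.

```python
import numpy as np


def order_by_correlation(data):
    """Greedy row order: each next variable is the one most correlated with the previous."""
    corr = np.abs(np.corrcoef(data))  # rows of `data` are variables
    order = [0]
    remaining = set(range(1, corr.shape[0]))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: corr[last, j])
        order.append(nxt)
        remaining.remove(nxt)
    return order


series = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # variable 0
    [5.0, 1.0, 4.0, 2.0, 3.0],   # variable 1: weakly related to variable 0
    [2.0, 4.0, 6.0, 8.0, 10.0],  # variable 2: perfectly correlated with variable 0
])
order = order_by_correlation(series)
arranged = series[order]
print(order)
```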

The method of claim 14,
wherein the prediction model is further based on a recurrent neural network, and
the predicting of the class of the prediction target comprises:
extracting a feature map by inputting the first matrix into the convolutional neural network; and
inputting the extracted feature map into the recurrent neural network and predicting the class of the prediction target based on an output of the recurrent neural network.

The method of claim 14,
wherein classes of the prediction target include a normal class and an abnormal class,
the method further comprises training the prediction model, and
the training of the prediction model comprises:
generating a plurality of training matrices based on collected multiple time series data;
generating a training dataset by assigning the abnormal class to each matrix, among the plurality of training matrices, for which an occurrence frequency of a matching pattern is below a threshold, and assigning the normal class to the remaining matrices; and
training the prediction model using the training dataset.