KR102276801B1

KR102276801B1 - Method and apparatus for processing time series data based on machine learning

Info

Publication number: KR102276801B1
Application number: KR1020210025624A
Authority: KR
Inventors: 김상엽
Original assignee: (주)알티엠
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2021-07-14

Abstract

Provided are a method and a device that identify, in first time series data, at least one data set of a second-section unit taken at intervals of a first section; train a sub-model for each identified data set, using that data set as training data; compare the output value of the sub-model corresponding to each piece of training data with a first threshold value; and detect whether second time series data received after the training is abnormal, using a sub-model selected based on the result of comparing the sub-model output values with the first threshold value.

Description

Method and apparatus for processing time series data based on machine learning {METHOD AND APPARATUS FOR PROCESSING TIME SERIES DATA BASED ON MACHINE LEARNING}

The present disclosure relates to a machine learning-based method and apparatus for processing time series data.

Time series data may be defined as a set of sequentially ordered data collected over a certain period of time. Time series data are ordered in time, and consecutive data points are correlated with one another. Accordingly, abnormal data contained in time series data may be detected based on the autocorrelation of the time series data or on the correlation among multiple time series. For example, abnormal data may be detected, or future time series data may be predicted from past time series data, using deep learning techniques such as the recurrent neural network (RNN) and long short-term memory (LSTM).

Prior art document: Korean Patent Registration Publication No. 10-1940029

Meanwhile, recent high-tech manufacturing processes generate such a vast amount of sensor time series data that managing the sensor data is difficult. In addition, when in-process anomaly detection is performed based on machine learning, training a model on all of the time series data can actually degrade performance because of the imbalance in the amount of data.

Accordingly, there is a need for a method and apparatus for selecting the training data and models used to perform anomaly detection.

The problem to be solved by the present embodiments is to provide a machine learning-based time series data processing method and apparatus that detect abnormalities in time series data using selected training data and sub-models.

The technical problems to be solved by the present embodiments are not limited to those described above, and other technical problems may be inferred from the following embodiments.

According to a first embodiment, a method of processing time series data based on machine learning may include: identifying, in first time series data, at least one data set of a second-section unit taken at intervals of a first section; training, with each identified data set as training data, a sub-model corresponding to each piece of training data; comparing the output value of the sub-model corresponding to each piece of training data with a first threshold value; and detecting whether second time series data received after the training is abnormal, using a sub-model selected based on the result of comparing the sub-model output values with the first threshold value.

According to a second embodiment, an apparatus for processing time series data based on machine learning may include a memory storing at least one instruction, and a processor that executes the at least one instruction to: identify, in first time series data, at least one data set of a second-section unit taken at intervals of a first section; train, with each identified data set as training data, a sub-model corresponding to each piece of training data; compare the output value of the sub-model corresponding to each piece of training data with a first threshold value; and detect whether second time series data received after the training is abnormal, using a sub-model selected based on the result of comparing the sub-model output values with the first threshold value.

According to a third embodiment, a computer-readable recording medium is a non-transitory computer-readable recording medium on which a program for causing a computer to execute a method of processing time series data based on machine learning is recorded. The method may include: identifying, in first time series data, at least one data set of a second-section unit taken at intervals of a first section; training, with each identified data set as training data, a sub-model corresponding to each piece of training data; comparing the output value of the sub-model corresponding to each piece of training data with a first threshold value; and detecting whether second time series data received after the training is abnormal, using a sub-model selected based on the result of comparing the sub-model output values with the first threshold value.

Details of other embodiments are included in the detailed description and the drawings.

The machine learning-based time series data processing method and apparatus according to the present disclosure detect abnormalities in time series data using selected training data and sub-models, and can thereby reduce the resource usage and time required for anomaly detection.

In addition, because the method and apparatus according to the present disclosure select training data based on a threshold value, they can reduce resource usage while maintaining anomaly detection accuracy.

The effects of the invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description of the claims.

FIG. 1 is a diagram for explaining time series data according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining time series data and sub-model output values according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining an anomaly detection model according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a method of performing machine learning-based anomaly detection according to an embodiment of the present invention.
FIGS. 5 and 6 are diagrams for explaining types of time series data and a method of identifying data sets according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining anomaly detection results obtained using training data selected by the time series data processing method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a machine learning-based time series data processing method according to an embodiment.
FIG. 9 is a block diagram illustrating a machine learning-based time series data processing apparatus according to an embodiment.

The terms used in the embodiments were chosen, as far as possible, from general terms in wide current use in view of their functions in the present disclosure, but they may vary with the intention of those skilled in the art, with precedent, or with the emergence of new technology. In certain cases a term has been arbitrarily selected by the applicant, and its meaning is then described in detail in the corresponding part of the description. Terms used in the present disclosure should therefore be defined by their meaning and by the overall content of the present disclosure, not simply by their names.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise.

Throughout the specification, the expression "at least one of a, b, and c" may cover 'a alone', 'b alone', 'c alone', 'a and b', 'a and c', 'b and c', or 'all of a, b, and c'.

Throughout the specification, time series data may be defined as a sequence of data arranged at regular time intervals. A set of values observed sequentially in time may therefore also be defined as time series data. Time series data are time-dependent: data generated at time t may be influenced by the data at time t-1. For example, time series data may include sensor data receivable from various sensors, including observations of temperature, stock prices, exchange rates, and sea level. More specifically, the sensor data may be data received from a thickness sensor, a speed sensor, an acceleration sensor, a vibration sensor, a force sensor, a pressure sensor, a position sensor, a plasma intensity sensor, a temperature sensor, a pH sensor, a chemical composition sensor, or a chemical concentration sensor, but is not limited thereto.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described herein.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings.

FIG. 1 is a diagram for explaining time series data according to an embodiment of the present invention.

The time series data 100 according to an embodiment of the present invention may be time series data obtained in a semiconductor process. For example, the time series data 100 may show, over time, the intensity of a pump that regulates pressure in a cleaning process. Because a vacuum must be maintained for plasma to be used effectively in a semiconductor process, the intensity of the pressure-regulating pump may be sensed continuously. In this case, a pressure leak may be defined as an abnormal state of the process. The method of the present disclosure may collect the time series data 100 shown in FIG. 1 each time the process is performed at a different time, and may detect from the collected time series data whether a pressure leak has occurred in the process.

When the time series data 100 show the intensity of the pressure-regulating pump in a cleaning process over time, the time series data 100 may include data for a process preparation stage (A) in which the vacuum is created, a main process stage (B) in which the cleaning takes place, and a process finishing stage (C) in which the vacuum is released.

Machine learning may be used to detect process abnormalities from such time series data. However, when machine learning is performed with every section of the time series data as training data, the data for the main process stage (B) are so numerous that the likelihood of producing a biased model increases. A biased model may lower the anomaly detection rate. In addition, when machine learning is performed with all of the data as training data, the data contain so many attributes that learning all of them may be difficult, which may degrade anomaly detection performance. What is needed, therefore, is a method of selecting, from the time series data, data sets that can improve the anomaly detection rate, and a method of using only the selected data sets as training data.

FIG. 2 is a diagram for explaining time series data and sub-model output values according to an embodiment of the present invention.

FIG. 2(a) is a graph listing the output values of sub-models trained on data sets identified from the time series data 100 of FIG. 1, with each identified data set used as training data. FIG. 2(b) shows the graph of FIG. 2(a) stretched along the time axis to match the graph of FIG. 1 and overlaid on it. For example, if one data point is output per unit of time ("1") in FIG. 1, there are 1000 data points in total. If the time series data 100 of FIG. 1 are divided into 200 parts and 200 sub-models are trained with each part as training data, the 200 sub-models yield 200 output values, and listing those 200 output values produces the graph shown in FIG. 2(a). Because the time series data 100 of FIG. 1 contain 1000 data points, dividing them into 200 parts is the same as dividing them into groups of 5.
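The division described above can be sketched as follows. This is a minimal illustration in Python, not code from the disclosure: the function name `split_into_segments` and the placeholder values are assumptions, and the split is the non-overlapping one used in this example.

```python
def split_into_segments(series, segment_length):
    """Split a sequence into consecutive, non-overlapping segments."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if len(series) % segment_length != 0:
        raise ValueError("series length must be a multiple of segment_length")
    return [
        series[i:i + segment_length]
        for i in range(0, len(series), segment_length)
    ]


# Placeholder values standing in for the 1000 samples of FIG. 1 (assumed).
series = list(range(1000))
segments = split_into_segments(series, 5)

print(len(segments))  # 200
print(segments[0])    # [0, 1, 2, 3, 4]
```

Each of the 200 segments would then serve as the training data for one sub-model.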

A large sub-model output value in FIG. 2(a) means that the data set corresponding to that output value contains a large amount of information useful for detecting abnormalities. In other words, when a sub-model output value in FIG. 2(a) is large, the anomaly detection rate obtained with the data set that served as that sub-model's training data may also be higher.

Referring to FIG. 2(a), the sub-model output values show an initial peak and then increase steadily.

FIG. 2(b) makes it possible to see which data set within the time series data produced each sub-model output value shown in FIG. 2(a). For example, the x value of the first point 210, which has the largest output value in FIG. 2(a), may be 4. If time series data consisting of 1000 data points are divided without overlap into 200 groups of 5 and these groups are used as training data, the output value at the first point 210 may be the output value of the sub-model whose training data are the five data points from the 16th to the 20th of the 1000.
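The mapping from an x value in FIG. 2(a) to positions in the original series is simple arithmetic. A short sketch, assuming 1-based indexing for both the sub-model index and the sample positions (the helper name `segment_bounds` is hypothetical):

```python
def segment_bounds(segment_index, segment_length):
    """Return the 1-based (first, last) sample positions of a 1-based segment."""
    first = (segment_index - 1) * segment_length + 1
    last = segment_index * segment_length
    return first, last


print(segment_bounds(4, 5))  # (16, 20)
```

With groups of 5, the 4th sub-model therefore corresponds to the 16th through 20th data points, matching the example in the text.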

The first graph 220 shown in FIG. 2(b) may be the graph of FIG. 2(a) stretched by a factor of 5 along the x axis, because the 1000 time series data points shown in FIG. 1 were divided into groups of 5 and used as training data.

Referring to FIG. 2(b), the detection rate for process abnormalities (for example, a pressure leak) can be high when machine learning is performed using the process preparation stage (A in FIG. 1) and the final part of the main process stage (B in FIG. 1). Accordingly, the method of the present disclosure may train a sub-model on each of a plurality of data sets within the time series data, using that data set as training data, and detect abnormalities using only those sub-models whose output value is greater than or equal to a predetermined threshold. Compared with using all of the time series data, this can reduce the amount of computation and data storage while improving the anomaly detection rate.

According to an embodiment, the graph shown in FIG. 2(a) may be defined as the output values of the individual sub-models of an ensemble model. Here, an ensemble model is a learning model that can improve prediction performance by generating several sub-models and computing a single prediction from the output values of the sub-models. The ensemble model is described in detail with reference to FIG. 3.

FIG. 3 is a diagram for explaining an anomaly detection model according to an embodiment of the present invention.

A method according to an embodiment of the present disclosure may train the sub-models of an ensemble model using different sections of the time series data as the respective training data. The ensemble model may be a decision tree ensemble model.

For example, when the time series data 100 shown in FIG. 1 are divided into units of 5 data points and every resulting segment is used as training data, the number of training data and the number of sub-models in FIG. 3 may each be 200. The finer the division, the greater the number of training data and sub-models and the shorter each piece of training data. The length of the training data and the number of sub-models may vary with the implementation of the anomaly detection system.

The method of the present disclosure may perform anomaly detection using only those sub-models whose output value is greater than or equal to a threshold. The threshold may vary with the implementation of the anomaly detection system and may be determined by user input.
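The train, compare, select, and detect flow described above can be sketched end to end as follows. This is only an illustration under stated assumptions: the nearest-centroid class is a simple stand-in for the decision tree sub-models of the disclosure, the "output value" is taken to be accuracy on held-out labeled runs (the text does not define it), and all names and data are hypothetical.

```python
from collections import Counter


def centroid(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroidSubModel:
    """Stand-in sub-model that only sees one window of each run (assumed)."""

    def __init__(self, start, window):
        self.start, self.window = start, window
        self.centroids = {}

    def _slice(self, run):
        return run[self.start:self.start + self.window]

    def fit(self, runs, labels):
        for label in sorted(set(labels)):  # sorted for deterministic ties
            members = [self._slice(r) for r, l in zip(runs, labels) if l == label]
            self.centroids[label] = centroid(members)
        return self

    def predict(self, run):
        piece = self._slice(run)
        return min(self.centroids,
                   key=lambda label: squared_distance(piece, self.centroids[label]))

    def score(self, runs, labels):
        """Accuracy on labeled runs, used here as the sub-model output value."""
        hits = sum(self.predict(r) == l for r, l in zip(runs, labels))
        return hits / len(runs)


def build_detector(train, train_labels, valid, valid_labels,
                   stride, window, threshold):
    """Train one sub-model per window and keep those at or above the threshold."""
    selected = []
    for start in range(0, len(train[0]) - window + 1, stride):
        model = NearestCentroidSubModel(start, window).fit(train, train_labels)
        if model.score(valid, valid_labels) >= threshold:
            selected.append(model)
    return selected


def detect(selected, run):
    """Majority vote of the selected sub-models on a newly received run."""
    if not selected:
        raise ValueError("no sub-model met the threshold")
    votes = Counter(model.predict(run) for model in selected)
    return votes.most_common(1)[0][0]


# Made-up runs: only the last two samples separate normal from abnormal.
train = [[0, 1, 0, 1, 5, 5], [1, 0, 1, 0, 5, 5],
         [0, 1, 0, 1, 9, 9], [1, 0, 1, 0, 9, 9]]
train_labels = ["normal", "normal", "abnormal", "abnormal"]
valid = [[1, 1, 0, 0, 5, 6], [0, 0, 1, 1, 9, 8]]
valid_labels = ["normal", "abnormal"]

selected = build_detector(train, train_labels, valid, valid_labels,
                          stride=2, window=2, threshold=0.9)
print([m.start for m in selected])            # [4]
print(detect(selected, [0, 1, 1, 0, 9, 9]))   # abnormal
```

In this toy data, the sub-models trained on the first four samples cannot separate the classes and score 0.5, so only the sub-model covering the last window is kept.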

Decision tree ensemble learning, one form of ensemble learning, is described in detail with reference to FIG. 4.

FIG. 4 is a diagram for explaining a method of performing machine learning-based anomaly detection according to an embodiment of the present invention.

A method of performing machine learning-based anomaly detection according to an embodiment may output an anomaly detection result using decision tree ensemble learning.

Decision tree ensemble learning is one example of a machine learning technique: it predicts an answer by splitting on conditions over each feature and branching in the shape of a tree. Predicting with a single tree may give low accuracy, so an ensemble that combines several trees may be applied.

Decision tree ensemble learning methods may be classified into bagging and boosting. Bagging builds several classifiers (sub-models) and combines their output values to select the most accurate output as the final result, whereas boosting can be viewed as a technique that keeps improving the performance of a single tree. The random forest algorithm is a well-known bagging algorithm, and the gradient boosting algorithm is a well-known boosting algorithm.
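As general background not spelled out in the text, bagging is short for bootstrap aggregating: each classifier is usually trained on a resample of the training data drawn with replacement. A minimal sketch of the resampling step, using only the standard library (the function name and the seed are assumptions for illustration):

```python
import random


def bootstrap_samples(data, n_models, seed=0):
    """Draw one resample (with replacement) of the data per classifier."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_models)]


data = ["run-a", "run-b", "run-c", "run-d"]  # hypothetical training items
resamples = bootstrap_samples(data, n_models=3)

for sample in resamples:
    print(sample)
```

Each resample has the same size as the original data but may repeat some items and omit others, which is what makes the individual classifiers differ.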

To shorten algorithm execution time, the method of the present disclosure may perform decision tree ensemble learning such as XGBoost (eXtreme Gradient Boosting) or LightGBM.

FIGS. 5 and 6 are diagrams for explaining types of time series data and a method of identifying data sets according to an embodiment of the present invention.

Time series data according to an embodiment may be univariate or multivariate. Univariate time series data contain a single variable, and multivariate time series data contain two or more variables. The method of the present disclosure may perform machine learning using, as training data, data sets identified from multivariate time series data as well as from univariate time series data.

FIGS. 5(a) and 5(b) show an example of univariate time series data, and FIGS. 6(a) and 6(b) show an example of multivariate time series data.

The method of the present disclosure may identify, in the time series data, at least one data set of a second-section unit taken at intervals of a first section. If the highlighted first data set 510 in FIG. 5(a) is a data set of the second-section unit and the data set identified after the first data set 510 is the second data set 520 shown in FIG. 5(b), then the first section may be an interval of two data points and the second section an interval of three data points. Each of the identified data sets 510 and 520 may be used as training data.
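Read this way, the first section acts as the step between window start positions and the second section as the window length. A minimal sketch under that reading, with made-up values and hypothetical parameter names `stride` and `window`:

```python
def identify_data_sets(series, stride, window):
    """Return windows of `window` samples whose starts are `stride` apart."""
    if stride <= 0 or window <= 0:
        raise ValueError("stride and window must be positive")
    return [
        series[start:start + window]
        for start in range(0, len(series) - window + 1, stride)
    ]


series = [10, 11, 12, 13, 14, 15, 16, 17, 18]  # made-up univariate samples
data_sets = identify_data_sets(series, stride=2, window=3)

print(data_sets[0])  # [10, 11, 12]
print(data_sets[1])  # [12, 13, 14]
```

Because the step (2) is shorter than the window (3), consecutive data sets share one sample. Setting the step equal to the window length gives non-overlapping data sets.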

When multivariate time series data with a total of N data items and v variables are analyzed and three data sets are identified, three sub-models may be generated. In this case, N data items of size 3v, corresponding to the first of the three identified data sets, may be used as the training data of the first sub-model, and the N data items corresponding to the next identified data set may be used as the training data of the second sub-model. Likewise, the N data items of the last identified data set may be used as the training data of the last sub-model. The multivariate time series data shown in FIGS. 6(a) and 6(b) have a total of 9 data items and 5 variables.
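One way to picture the "N data items of size 3v" is to flatten a window of 3 time steps across v variables into one feature vector per item. The sketch below assumes that N counts separately collected runs and that each run is a list of time steps holding v values; both are interpretations, and the names are hypothetical.

```python
def window_features(runs, start, window):
    """Flatten rows start..start+window-1 of each run into one feature vector."""
    features = []
    for run in runs:
        rows = run[start:start + window]
        features.append([value for row in rows for value in row])
    return features


# Two hypothetical runs (N = 2), each with 4 time steps and v = 2 variables.
runs = [
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    [[9, 10], [11, 12], [13, 14], [15, 16]],
]
features = window_features(runs, start=0, window=3)

print(len(features))     # 2  (N vectors)
print(len(features[0]))  # 6  (3 * v values each)
```

The resulting N vectors would form the training data of the sub-model tied to that window.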

도 5의 (a)와 (b)를 참조하면, 본 개시의 방법이 식별한 데이터 세트(510, 520)는 서로 중첩되는 것일 수 있다. 또한, 도 6의 (a)와 (b)를 참조하면, 본 개시의 방법이 식별한 데이터 세트(610, 620)는 서로 중첩되지 않을 수 있다. 식별된 데이터 세트의 중첩 여부는 이상 탐지 시스템의 구현 사항에 따라 달라질 수 있다.Referring to FIGS. 5A and 5B , the data sets 510 and 520 identified by the method of the present disclosure may overlap each other. Also, referring to FIGS. 6A and 6B , the data sets 610 and 620 identified by the method of the present disclosure may not overlap each other. Whether or not the identified data sets overlap may depend on the implementation of the anomaly detection system.

도 7은 본 발명의 일 실시예에 따른 시계열 데이터 처리 방법에 따라 선별된 훈련 데이터를 이용한 이상 탐지 결과를 설명하기 위한 도면이다.7 is a diagram for explaining an anomaly detection result using training data selected according to a time series data processing method according to an embodiment of the present invention.

본 개시의 방법은 결정 트리 앙상블 학습, 랜덤 포레스트 앙상블 학습 및 XGBoost 앙상블 학습 중 하나에 기초하여 이상 탐지를 수행할 수 있다. 도 7은 각각 결정 트리 앙상블 학습, 랜덤 포레스트 앙상블 학습 및 XGBoost 앙상블 학습에 기초하여 이상 탐지를 수행했을 때, 임계값에 따른 예측 성능을 설명하는 도면이다.The method of the present disclosure may perform anomaly detection based on one of decision tree ensemble learning, random forest ensemble learning, and XGBoost ensemble learning. 7 is a diagram illustrating prediction performance according to a threshold value when anomaly detection is performed based on decision tree ensemble learning, random forest ensemble learning, and XGBoost ensemble learning, respectively.

도 7을 참고하면, 출력값이 0.7 이상인 서브 모델 만으로 구성된 이상 탐지 모델이 결정 트리 앙상블 학습에 기초하여 기계학습을 수행한 경우, 예측 정확도는 92.35% 일 수 있다. 또한, 출력값이 0.8 이상인 서브 모델 만으로 구성된 이상 탐지 모델이 랜덤 포레스트 앙상블 학습에 기초하여 기계학습을 수행한 경우, 예측 정확도는 94.85% 일 수 있다.Referring to FIG. 7 , when an anomaly detection model composed of only sub-models having an output value of 0.7 or more performs machine learning based on decision tree ensemble learning, the prediction accuracy may be 92.35%. In addition, when the anomaly detection model composed of only sub-models having an output value of 0.8 or more performs machine learning based on random forest ensemble learning, the prediction accuracy may be 94.85%.

한편, 도 7을 참고하면 모든 시계열 데이터를 훈련 데이터로 이용하여 기계학습을 수행한 모델의 성능이 일부 시계열 데이터를 훈련 데이터로 하여 기계학습을 수행한 모델의 성능보다 낮을 수 있다. 예를 들어, 결정 트리 앙상블 학습을 수행한 경우, 모든 시계열 데이터를 훈련 데이터로 이용하였을 때 이상 탐지 모델의 예측 정확도는 93.6%이지만, 출력값이 0.9 이상인 서브 모델 만을 이용하는 경우 이상 탐지 모델의 예측 정확도는 95.78%로 더 높은 것을 확인할 수 있다.Meanwhile, referring to FIG. 7 , the performance of a model in which machine learning is performed using all time series data as training data may be lower than the performance of a model in which machine learning is performed using some time series data as training data. For example, when decision tree ensemble learning is performed, the prediction accuracy of the anomaly detection model is 93.6% when all time series data are used as training data, but the prediction accuracy of the anomaly detection model is 93.6% when only sub-models with an output value of 0.9 or higher are used. 95.78%, which is higher.

Using all of the data in the time series yields too many attributes relative to the amount of data, so learning every attribute may be difficult. Conversely, when the amount of data is reduced, the number of attributes is reduced as well, so learning the attributes within the reduced data may be comparatively easy. Accordingly, using the data sets identified from the time series data as training data may further improve prediction accuracy.

Specifically, the performance shown in FIG. 7 was calculated using 1000-dimensional data, that is, time series data composed of 1000 data points, and only three sub-models met the threshold value of 0.9 or more. Because each training data set input to a sub-model consisted of five data points, an anomaly detection model composed only of the sub-models meeting the 0.9 threshold can be regarded as having been trained on a total of 15 data points. According to these experimental results, the method of the present disclosure improves the performance of the anomaly detection model while reducing the amount of training data used for learning.
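The arithmetic behind this experiment can be checked with a short sketch. The figures are the ones quoted for FIG. 7; the variable names are illustrative and do not come from the disclosure.

```python
# Figures quoted in the description of FIG. 7; variable names are illustrative.
total_points = 1000        # data points in the full time series
selected_sub_models = 3    # sub-models that met the 0.9 threshold
points_per_data_set = 5    # data points in each training data set

used_points = selected_sub_models * points_per_data_set
used_fraction = used_points / total_points

print(used_points)              # 15
print(f"{used_fraction:.1%}")   # 1.5%
```

Under these figures, the selected sub-models together cover 1.5% of the original data points.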

FIG. 8 is a flowchart illustrating a machine learning-based time series data processing method according to an embodiment.

In step S810, at least one data set in units of a second section, spaced at intervals of a first section, may be identified from the first time series data.
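One way to read step S810 is as a sliding window over the series. The sketch below assumes that the first section is the offset between the starts of consecutive data sets (`stride`) and that the second section is the length of each data set (`window`). The function name and both parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def identify_data_sets(
    series: Sequence[float], stride: int, window: int
) -> List[Tuple[int, List[float]]]:
    """Return (start_index, values) pairs for every full window in the series.

    stride -- assumed meaning of the "first section" (offset between starts)
    window -- assumed meaning of the "second section" (length of each data set)
    """
    if stride < 1 or window < 1:
        raise ValueError("stride and window must be positive")
    data_sets = []
    # Only windows that fit entirely inside the series are kept.
    for start in range(0, len(series) - window + 1, stride):
        data_sets.append((start, list(series[start:start + window])))
    return data_sets


series = list(range(10))
sets = identify_data_sets(series, stride=2, window=4)
print([start for start, _ in sets])  # [0, 2, 4, 6]
print(sets[-1][1])                   # [6, 7, 8, 9]
```

With `stride` smaller than `window`, consecutive data sets overlap; with the two equal, they tile the series without overlap.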

In step S820, a sub-model corresponding to each training data may be trained, using each of the identified data sets as training data.

In step S830, the output value of the sub-model corresponding to each training data may be compared with a first threshold value.

In step S840, whether second time series data received after the training is abnormal may be detected using a sub-model selected based on the result of comparing the output value of the sub-model with the first threshold value. Here, selecting the sub-model based on the comparison result may mean selecting a sub-model having an output value equal to or greater than the first threshold value. The first threshold value may be a preset value.
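Steps S830 and S840 can be sketched as a filter followed by a combined decision. This is a minimal illustration under several assumptions: each sub-model is represented as a plain function returning `True` for abnormal, the output values are made-up scores, and a majority vote stands in for whatever ensemble rule the embodiment actually uses. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A sub-model is represented here as a function that maps a sample to
# True (abnormal) or False (normal). This representation is an assumption.
SubModel = Callable[[Sequence[float]], bool]


def select_sub_models(scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep the names of sub-models whose output value is >= threshold."""
    return [name for name, score in scores.items() if score >= threshold]


def detect_abnormal(
    sample: Sequence[float], models: Dict[str, SubModel], selected: List[str]
) -> bool:
    """Majority vote over the selected sub-models (voting rule is assumed)."""
    if not selected:
        raise ValueError("no sub-model met the threshold")
    votes = sum(models[name](sample) for name in selected)
    return votes * 2 > len(selected)


# Hypothetical sub-models, each looking at its own 2-point window.
models: Dict[str, SubModel] = {
    "w0": lambda x: max(x[0:2]) > 1.0,
    "w2": lambda x: max(x[2:4]) > 1.0,
    "w4": lambda x: max(x[4:6]) > 1.0,
}
scores = {"w0": 0.95, "w2": 0.62, "w4": 0.91}  # made-up output values

selected = select_sub_models(scores, threshold=0.9)
print(selected)  # ['w0', 'w4']
print(detect_abnormal([0.1, 0.2, 5.0, 0.1, 0.3, 0.2], models, selected))  # False
print(detect_abnormal([2.0, 0.2, 0.1, 0.1, 3.0, 0.2], models, selected))  # True
```

In the first sample the spike falls inside the window of the rejected sub-model `w2`, so the selected sub-models do not flag it. This shows that the choice of threshold determines which regions of the series the detector can see.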

Detecting whether the second time series data is abnormal in step S840 may be performed using a decision tree ensemble model.

When selecting the sub-model based on the comparison result means selecting a sub-model having an output value equal to or greater than the first threshold value, the method of the present disclosure may further include detecting a partial region of the first time series data based on the training data corresponding to the sub-model having an output value equal to or greater than the first threshold value, and storing information about the detected partial region as important information of the first time series data. Here, the important information of the first time series data may be prioritized according to the output value of the sub-model corresponding to the detected partial region.
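The detection and prioritization of important regions might look like the following sketch. It assumes each sub-model is keyed by the start index of its training window, that every window has the same length, and that a higher output value means a higher priority. The function name and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_important_regions(
    scores: Dict[int, float], window: int, threshold: float
) -> List[Tuple[int, int, float]]:
    """Return (start, end_exclusive, score) for qualifying regions, best first."""
    regions = [
        (start, start + window, score)
        for start, score in scores.items()
        if score >= threshold
    ]
    # Higher output value -> higher priority; ties broken by earlier start.
    return sorted(regions, key=lambda r: (-r[2], r[0]))


scores = {0: 0.72, 5: 0.93, 10: 0.88, 15: 0.97}  # made-up output values
print(rank_important_regions(scores, window=5, threshold=0.9))
# [(15, 20, 0.97), (5, 10, 0.93)]
```

The returned list could then be stored as the important information of the series, with list position serving as the priority.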

The length of the first section may be equal to or shorter than the length of the second section. At least one of the first threshold value, the length of the first section, and the length of the second section may be determined based on a user input.
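Because these parameters may come from user input, a simple check of the stated constraint can be useful before any training starts. The sketch below assumes that the first section acts as the offset between consecutive data sets and that the threshold lies in the range 0 to 1; neither assumption is stated explicitly in the text, and the function names are hypothetical.

```python
def validate_parameters(threshold: float, first_len: int, second_len: int) -> None:
    """Raise ValueError if the user-supplied parameters are inconsistent."""
    if not 0.0 <= threshold <= 1.0:  # assumed range for the output value
        raise ValueError("threshold is assumed to lie in [0, 1]")
    if first_len < 1 or second_len < 1:
        raise ValueError("section lengths must be positive")
    if first_len > second_len:
        raise ValueError("first section must not be longer than second section")


def count_data_sets(series_len: int, first_len: int, second_len: int) -> int:
    """Number of full data sets when first_len is the offset between them."""
    if series_len < second_len:
        return 0
    return (series_len - second_len) // first_len + 1


validate_parameters(threshold=0.9, first_len=5, second_len=5)
print(count_data_sets(1000, first_len=5, second_len=5))  # 200
print(count_data_sets(1000, first_len=1, second_len=5))  # 996
```

A shorter first section produces more data sets, and therefore more sub-models to train, from the same series.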

The first time series data may be multivariate time series data. The second time series data may be time series data including at least one of the variables included in the first time series data.

FIG. 9 is a block diagram illustrating a machine learning-based time series data processing apparatus according to an embodiment.

According to an embodiment, the machine learning-based time series data processing apparatus 900 may include a memory 910 and a processor 920. FIG. 9 shows only the components of the apparatus 900 that are related to the present embodiment. Accordingly, those of ordinary skill in the art related to the present embodiment will understand that other general-purpose components may be included in addition to the components shown in FIG. 9.

The memory 910 is hardware that stores various data processed in the machine learning-based time series data processing apparatus 900. For example, the memory 910 may store data that has been processed and data that is to be processed by the apparatus 900. The memory 910 may store at least one instruction for the operation of the processor 920. The memory 910 may also store programs or applications to be run by the apparatus 900. The memory 910 may include random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM, Blu-ray or other optical disk storage, a hard disk drive (HDD), a solid state drive (SSD), or flash memory.

The processor 920 may control the overall operation of the machine learning-based time series data processing apparatus 900 and may process data and signals. The processor 920 may control the apparatus 900 as a whole by executing at least one instruction or at least one program stored in the memory 910. The processor 920 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or the like, but is not limited thereto.

The processor 920 may identify, from the first time series data, at least one data set in units of a second section, spaced at intervals of a first section; train a sub-model corresponding to each training data, using each of the identified data sets as training data; compare the output value of the sub-model corresponding to each training data with a first threshold value; and detect whether second time series data received after the training is abnormal, using a sub-model selected based on the result of comparing the output value of the sub-model with the first threshold value. Here, selecting the sub-model based on the comparison result may mean selecting a sub-model having an output value equal to or greater than the first threshold value.

In addition, when selecting the sub-model based on the comparison result means selecting a sub-model having an output value equal to or greater than the first threshold value, the processor 920 may detect a partial region of the first time series data based on the training data corresponding to the sub-model having an output value equal to or greater than the first threshold value, and may store information about the detected partial region as important information of the first time series data. Here, the important information of the first time series data may be prioritized according to the output value of the sub-model corresponding to the detected partial region.

The processor 920 may detect whether the second time series data is abnormal using a decision tree ensemble model.

The apparatus according to the above-described embodiments may include a processor, a memory for storing and executing program data, permanent storage such as a disk drive, a communication port for communicating with an external device, and user interface devices such as a touch panel, keys, and buttons. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on the processor. Here, computer-readable recording media include magnetic storage media (e.g., read-only memory (ROM), random-access memory (RAM), floppy disks, hard disks) and optically readable media (e.g., CD-ROM, digital versatile disc (DVD)). The computer-readable recording medium may be distributed among computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. The medium may be readable by a computer, stored in a memory, and executed on a processor.

The present embodiment may be represented in terms of functional block configurations and various processing steps. These functional blocks may be implemented with any number of hardware and/or software configurations that perform specific functions. For example, an embodiment may employ integrated circuit configurations, such as memory, processing, logic, and look-up tables, that can execute various functions under the control of one or more microprocessors or other control devices. Just as components may be implemented with software programming or software elements, the present embodiment may be implemented in a programming or scripting language such as Python, C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors. In addition, the present embodiment may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" are used broadly and are not limited to mechanical and physical configurations. These terms may include the meaning of a series of software routines in association with a processor or the like.

The above-described embodiments are merely examples, and other embodiments may be implemented within the scope of the claims that follow.

Claims

1. A machine learning-based time series data processing method performed by a machine learning-based time series data processing apparatus, the method comprising:
identifying, from first time series data, at least one data set in units of a second section, spaced at intervals of a first section;
training, using each of the identified data sets as training data, a sub-model corresponding to each of the training data;
comparing an output value of the sub-model corresponding to each of the training data with a first threshold value; and
detecting whether second time series data received after the training is abnormal, using a sub-model selected based on a result of comparing the output value of the sub-model with the first threshold value,
wherein the length of the first section is equal to or shorter than the length of the second section, and the length of the first section is determined relative to the length of the second section in consideration of the performance of the machine learning-based time series data processing apparatus, and
wherein the length of the second section, which is determined in consideration of the performance of the machine learning-based time series data processing apparatus, decreases as the number of the identified data sets increases.

2. The method of claim 1,
wherein selecting the sub-model based on the result of comparing the output value of the sub-model with the first threshold value comprises selecting a sub-model having an output value equal to or greater than the first threshold value.

3. The method of claim 2, further comprising:
detecting a partial region of the first time series data based on the training data corresponding to the sub-model having an output value equal to or greater than the first threshold value; and
storing information about the detected partial region of the first time series data as important information of the first time series data.

4. The method of claim 3,
wherein the important information of the first time series data is prioritized according to the output value of the sub-model corresponding to the detected partial region of the first time series data.

5. (Deleted)

6. The method of claim 1,
wherein at least one of the first threshold value, the length of the first section, and the length of the second section is determined based on a user input.

7. The method of claim 1,
wherein detecting whether the second time series data is abnormal comprises detecting whether the second time series data is abnormal using a decision tree ensemble model.

8. The method of claim 1,
wherein the first time series data is multivariate time series data.

9. The method of claim 8,
wherein the second time series data is time series data including at least one of the variables included in the first time series data.

10. A machine learning-based time series data processing apparatus comprising:
a memory storing at least one instruction; and
a processor configured to, by executing the at least one instruction:
identify, from first time series data, at least one data set in units of a second section, spaced at intervals of a first section;
train, using each of the identified data sets as training data, a sub-model corresponding to each of the training data;
compare an output value of the sub-model corresponding to each of the training data with a first threshold value; and
detect whether second time series data received after the training is abnormal, using a sub-model selected based on a result of comparing the output value of the sub-model with the first threshold value,
wherein the length of the first section is equal to or shorter than the length of the second section, and the length of the first section is determined relative to the length of the second section in consideration of the performance of the machine learning-based time series data processing apparatus, and
wherein the length of the second section, which is determined in consideration of the performance of the machine learning-based time series data processing apparatus, decreases as the number of the identified data sets increases.

11. A non-transitory computer-readable recording medium having recorded thereon a program for executing a machine learning-based time series data processing method on a computer, the method comprising:
identifying, from first time series data, at least one data set in units of a second section, spaced at intervals of a first section;
training, using each of the identified data sets as training data, a sub-model corresponding to each of the training data;
comparing an output value of the sub-model corresponding to each of the training data with a first threshold value; and
detecting whether second time series data received after the training is abnormal, using a sub-model selected based on a result of comparing the output value of the sub-model with the first threshold value,
wherein the length of the first section is equal to or shorter than the length of the second section, and the length of the first section is determined relative to the length of the second section in consideration of the performance of the machine learning-based time series data processing apparatus, and
wherein the length of the second section, which is determined in consideration of the performance of the machine learning-based time series data processing apparatus, decreases as the number of the identified data sets increases.