KR102471201B1

KR102471201B1 - Outlier detection and automatic threshold system for unsupervised learning-based time series data

Info

Publication number: KR102471201B1
Application number: KR1020210148323A
Authority: KR
Inventors: 김지혁; 장한솔; 이제동
Original assignee: (주)위세아이텍
Priority date: 2021-11-01
Filing date: 2021-11-01
Publication date: 2022-11-25

Abstract

The present invention is an outlier detection and automatic threshold point system for unsupervised learning-based time series data. The system comprises: a database unit that stores time series data collected from an open source and sensors in a factory; a data preprocessing unit that refines the data to extract and distinguish features for detecting outliers from the time series data; an outlier detection unit that trains the data preprocessed by the data preprocessing unit in an outlier prediction model by using an autoencoder structure in adversarial machine learning and detects outliers; a threshold calculation unit that derives a threshold for finding an outlier using a machine learning algorithm based on the outlier prediction model; and an outlier distinguishing unit that distinguishes an outlier of the time series data based on the threshold. The present invention aims to detect outliers of the time series data not labeled in advance by applying an outlier detection algorithm and deep learning technology.

Description

Outlier detection and automatic threshold system for unsupervised learning-based time series data

The present application relates to an outlier detection and automatic threshold system for time series data based on unsupervised learning.

An anomaly detection system that uses data collected through sensors employs AI technology based on supervised or unsupervised learning, depending on whether the collected data is labeled. Supervised learning requires data labeled in advance as normal or abnormal; however, because abnormalities occur very rarely at actual industrial sites, acquiring abnormal data and labeling the collected data takes time and money.

To overcome this problem, anomaly detection systems based on unsupervised learning are also used, in which training proceeds under the assumption that the collected data is normal. In general, abnormal data is detected by compressing and then reconstructing the training data and measuring the difference between the input and the output.

However, the performance of unsupervised approaches depends on hyperparameters, so their accuracy is less stable than that of supervised anomaly detection systems, and the threshold for detecting outliers must be either calculated or set arbitrarily.

Consequently, industries that need to introduce an anomaly detection system must choose between investing time and money in collecting and pre-labeling abnormal data and accepting unstable accuracy.

Accordingly, there is a need for a technology that requires neither abnormal data collection nor pre-labeling and that allows outliers to be detected without the learner separately calculating a threshold.

The background technology of the present application is disclosed in Korean Patent Registration No. 10-1916926.

The present application is intended to solve the above-described problems of the prior art, and its object is to detect outliers by applying an anomaly detection algorithm and deep learning technology to calculate a threshold for time series data that has not been labeled in advance.

The present application is further intended to solve the above-described problems of the prior art, and its object is to train artificial intelligence on time series data that has not been labeled in advance as normal or abnormal, to calculate a threshold using a machine learning technique, and thereby to identify outliers that fall outside the normal category.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, an outlier detection and automatic threshold system for unsupervised learning-based time series data according to an embodiment of the present application may include: a database unit that stores time series data collected from open sources and from sensors in a factory; a data preprocessing unit that extracts features for outlier detection from the time series data and refines the data so that it can be discriminated; an outlier detection unit that trains an outlier prediction model on the data preprocessed by the preprocessing unit, using an autoencoder structure within adversarial machine learning, and detects anomalies; a threshold calculation unit that derives a threshold for finding outliers using a machine learning algorithm based on the outlier prediction model; and an outlier determination unit that determines outliers in the time series data based on the threshold.

According to an embodiment of the present application, the data preprocessing unit may preprocess the time series data stored in the database unit for training and then extract statistical feature variables.

In addition, for adversarial machine learning, the autoencoder may be constructed from one encoder and from a first autoencoder and a second autoencoder that share the encoder.

In addition, the first autoencoder may learn to reconstruct the input data and to deceive the second autoencoder, and the second autoencoder may learn to reconstruct the input data and to distinguish the reconstructed data produced by the first autoencoder from the input data.

In addition, the second autoencoder distinguishes real data from the reconstructed data of the first autoencoder, and the first autoencoder is trained adversarially to deceive the second autoencoder, so that the autoencoder can detect outliers that resemble normal data.

In addition, for stable training, the autoencoder may give weight to the data reconstruction error of the first autoencoder early in training and give weight to outliers that resemble normal data late in training.
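The early/late weighting described above can be illustrated with a simple schedule. The following is a minimal sketch only: the patent does not state a formula, so the `1/epoch` schedule (borrowed from USAD-style adversarial autoencoder training) and the name `schedule_weights` are assumptions made for illustration.

```python
def schedule_weights(epoch):
    """Return (w_reconstruction, w_adversarial) for a 1-based epoch index.

    Assumed schedule (not specified in the patent): the reconstruction
    weight decays as 1/epoch, and the adversarial weight is its complement,
    so early epochs emphasize plain reconstruction and later epochs
    emphasize the adversarial term.
    """
    if epoch < 1:
        raise ValueError("epoch must be >= 1")
    w_reconstruction = 1.0 / epoch
    return w_reconstruction, 1.0 - w_reconstruction


# Weights for the first few epochs.
for epoch in (1, 2, 4, 10):
    print(epoch, schedule_weights(epoch))
```

Any monotone schedule that shifts weight from the reconstruction term to the adversarial term would match the description equally well; this one is chosen only because it needs no extra hyperparameter.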

In addition, the threshold calculation unit may calculate, based on a K-means algorithm, a threshold that separates the normal data learned by the outlier detection unit from outliers, and may define outliers with reference to the calculated threshold.
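One way to picture a K-means-derived threshold is to cluster one-dimensional anomaly scores into two groups and place the threshold between the two cluster centers. The sketch below assumes exactly that; the use of reconstruction-error scores as input, the choice of k = 2, the midpoint rule, and the name `kmeans_threshold` are illustrative assumptions, not details given in the patent.

```python
def kmeans_threshold(scores, max_iter=100):
    """Two-cluster 1-D K-means over anomaly scores.

    Returns the midpoint between the two final centroids, used here as the
    threshold (an assumed rule; the patent does not specify one).
    """
    c_low, c_high = min(scores), max(scores)
    if c_low == c_high:
        raise ValueError("scores must contain at least two distinct values")
    for _ in range(max_iter):
        # Assignment step: each score joins its nearest centroid.
        low = [s for s in scores if abs(s - c_low) <= abs(s - c_high)]
        high = [s for s in scores if abs(s - c_low) > abs(s - c_high)]
        # Update step: centroids become the cluster means.
        new_low = sum(low) / len(low)
        new_high = sum(high) / len(high)
        if new_low == c_low and new_high == c_high:
            break
        c_low, c_high = new_low, new_high
    return (c_low + c_high) / 2.0


scores = [0.10, 0.12, 0.09, 0.11, 0.90, 1.00]   # hypothetical scores
threshold = kmeans_threshold(scores)
flags = [s > threshold for s in scores]          # True marks an outlier
print(threshold, flags)
```

Initializing the centroids at the minimum and maximum score keeps the procedure deterministic and guarantees that neither cluster is ever empty.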

In addition, in the outlier detection and automatic threshold system for unsupervised learning-based time series data, a method may include: a step of storing time series data collected from open sources and from sensors in a factory; a data preprocessing step of extracting features for outlier detection from the time series data and refining the data so that it can be discriminated; a step of training an outlier prediction model on the data preprocessed by the preprocessing unit, using an autoencoder structure within adversarial machine learning, and detecting anomalies; a threshold calculation step of deriving a threshold for finding outliers using a machine learning algorithm based on the outlier prediction model; and a step of determining outliers in the time series data based on the threshold.

The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and in the detailed description of the invention.

According to the above-described means for solving the problems, time series data can be determined to be normal or abnormal using only previously collected normal data, without defining an abnormal state in advance.

According to the above-described means for solving the problems, an optimal threshold can be found automatically, without defining a threshold for detecting outliers in advance.

By detecting outliers through unsupervised learning, the present application can save the time and cost of collecting abnormal data and labeling it in advance.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a diagram showing a schematic configuration of an outlier detection and automatic threshold system for unsupervised learning-based time series data according to an embodiment of the present application.
FIG. 2A is a diagram schematically showing the structure of an autoencoder within adversarial machine learning according to an embodiment of the present application.
FIG. 2B is a diagram schematically showing two decoders sharing one encoder for adversarial machine learning according to an embodiment of the present application.
FIG. 3 is a diagram illustrating an example of a method of detecting outliers and automatically calculating a threshold for unsupervised learning-based time series data according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can easily practice them. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. To describe the present application clearly, parts irrelevant to the description are omitted from the drawings.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "above", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where yet another member exists between the two members.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a diagram showing a schematic configuration of an outlier detection and automatic threshold system 100 for unsupervised learning-based time series data according to an embodiment of the present application.

Referring to FIG. 1, the outlier detection and automatic threshold system 100 for unsupervised learning-based time series data may include a database unit 110, a data preprocessing unit 120, an outlier detection unit 130, a threshold calculation unit 140, and an outlier determination unit 150.

The outlier detection and automatic threshold system 100 for unsupervised learning-based time series data determines whether time series data is normal or abnormal using only previously collected normal data, without defining an abnormal state in advance, and automatically finds an optimal threshold without a predefined threshold for detecting outliers, thereby saving the time and cost of labeling in advance.

The database unit 110 is a component in which various types of data are stored. It stores time series data collected from open sources and from sensors in a factory, as well as input data for factory equipment having predetermined parameter values and the corresponding output data.

In addition, the open source data and the sensor data collected from factories in actual operation include only structured data that contains time series information.

For example, the database unit 110 may be implemented to collect, analyze, and process WIP Transaction Logs; it may be implemented to collect, analyze, and process equipment status changes, alarms, logs, and sensor data; and it may be implemented to collect, analyze, and process production flow history and indicators.

The data preprocessing unit 120 may extract features for outlier detection from the time series data and refine the data so that it can be discriminated. According to an embodiment of the present application, the feature variables may include the shape, magnitude (amplitude), and phase of a waveform extracted from a waveform representing the change in vibration acceleration over time for rotating equipment, as well as the amperage of an electric motor, the pressure of the loop after pump operation, the temperature of the engine body, the temperature of the fluid in the circulation loop, the voltage of the electric motor, and the circulating flow rate of the fluid inside the loop. In other words, the data preprocessing unit 120 may extract from the vibration data feature variables for at least one of shape, magnitude, phase, temperature of the engine body and the loop, circulating flow rate, and current and voltage of the electric motor.

The key variable data identified through the data preprocessing unit 120 may vary depending on whether sensors are attached and on the collection method.

In addition, the data preprocessing unit 120 may preprocess the time series data stored in the database unit 110 for training and then extract statistical feature variables.

The statistical feature variables derived in the database unit 110 may include variables related to vibration data that can take numerical values. For example, in the time domain the database unit 110 may derive (extract) at least one of the following feature variables: RMS, Variance, Skewness, Kurtosis, Shape factor, Crest factor, Impulse factor, Margin factor, and Peak to Peak. In the frequency domain, the preprocessing unit may derive (extract) at least one of the feature variables FC, RMSF, RVF, SS, and SK. Among the time-domain features, the RMS feature variable tends to increase gradually as a defect develops. The Variance feature variable measures how far the signal deviates from its mean. The Skewness feature variable indicates how far the distribution departs from symmetry and is skewed to one side. The Kurtosis feature variable indicates how sharply peaked the probability density function is. The Shape factor feature variable indicates changes that depend on the shape of the equipment. The Crest factor feature variable indicates how sharp the waveform peaks are. The Impulse factor feature variable measures impulsive signals. The Margin factor feature variable measures the impact signal between a rotating element and the track. The Peak to Peak (PTP) feature variable is an amplitude that indicates the severity of vibration and noise.
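The time-domain features listed above can be computed directly from a window of samples. The sketch below uses the definitions conventional in vibration analysis (for example, crest factor as peak over RMS); the patent itself gives no formulas, so these definitions and the function name `time_domain_features` are assumptions. The signal is assumed to be non-constant and not all zeros, so that no denominator vanishes.

```python
import math


def time_domain_features(x):
    """Conventional time-domain vibration features for one sample window."""
    n = len(x)
    mean = sum(x) / n
    abs_mean = sum(abs(v) for v in x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    variance = sum((v - mean) ** 2 for v in x) / n      # population variance
    std = math.sqrt(variance)
    peak = max(abs(v) for v in x)
    sqrt_abs_mean = sum(math.sqrt(abs(v)) for v in x) / n
    return {
        "rms": rms,
        "variance": variance,
        "skewness": sum((v - mean) ** 3 for v in x) / n / std ** 3,
        "kurtosis": sum((v - mean) ** 4 for v in x) / n / std ** 4,
        "shape_factor": rms / abs_mean,
        "crest_factor": peak / rms,
        "impulse_factor": peak / abs_mean,
        "margin_factor": peak / sqrt_abs_mean ** 2,
        "peak_to_peak": max(x) - min(x),
    }


# Hypothetical window: a sine wave sampled at 64 points per period.
window = [math.sin(2 * math.pi * t / 64) for t in range(64)]
for name, value in time_domain_features(window).items():
    print(f"{name}: {value:.4f}")
```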

Among the frequency-domain features, the FC feature variable indicates a change in the position of the main frequency. The RMSF feature variable likewise indicates a change in the position of the main frequency. The RCF feature variable indicates the consistency of the spectrum. The SS feature variable indicates the symmetry of the amplitude spectrum distribution around the mean. The SK feature variable is used to measure the values of the distribution and compare them with a normal distribution.
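Two of the frequency-domain features, FC and RMSF, can be sketched as spectrum-weighted averages of frequency. The patent does not define them, so the following assumes the common definitions (frequency center as the power-weighted mean frequency, RMSF as the power-weighted root mean square frequency). A naive DFT is used to keep the example dependency-free; the function name `spectral_features` is hypothetical.

```python
import cmath
import math


def spectral_features(x, fs):
    """Return (FC, RMSF) of signal x sampled at fs Hz.

    Uses a naive O(n^2) DFT over the non-negative frequency bins and
    weights each bin frequency by its power (assumed definitions).
    """
    n = len(x)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2)
    total = sum(power)
    fc = sum(f * p for f, p in zip(freqs, power)) / total
    rmsf = math.sqrt(sum(f * f * p for f, p in zip(freqs, power)) / total)
    return fc, rmsf


# Hypothetical input: a 5 Hz sine sampled at 64 Hz for one second.
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(spectral_features(signal, fs=64))
```

For a pure tone that falls exactly on a DFT bin, both values coincide with the tone frequency; a shift of spectral energy toward higher frequencies moves both upward.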

By extracting, in the database unit 110, feature variables that capture the respective characteristics in the time domain and the frequency domain, the outlier detection unit 130 can generate a highly accurate outlier prediction model. In addition, by applying PCA (a multivariate data analysis technique) in the database unit 110, the correlations among many variables can be reduced to a small number of principal components, making the data easier to understand.

In addition, according to an embodiment of the present application, the data preprocessing unit 120 may remove noise by applying a preset low-pass filter or band-pass filter to the received data.

Here, the noise may include Gaussian noise, salt-and-pepper noise, and the like.
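As a simple illustration of low-pass filtering, a centered moving average attenuates rapid sample-to-sample fluctuations while preserving slow trends. The patent does not say which filter design is used, so the moving average, the window length, and the name `moving_average` are assumptions chosen for brevity.

```python
def moving_average(signal, window=5):
    """Centered moving-average filter (a basic FIR low-pass).

    Near the edges the window shrinks to the samples that exist, so the
    output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


noisy = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0]   # hypothetical readings
print(moving_average(noisy, window=3))
```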

In addition, the data preprocessing unit 120 may extract from the vibration data one or more variables needed for prediction (feature variables), based on at least one of RMS analysis, center frequency (CF) analysis, variance (Var) analysis, intermediate frequency (IF) analysis, skewness analysis, medium frequency (MF) analysis, kurtosis analysis, and PTP analysis of the vibration data.

By way of example, the data preprocessing unit 120 may apply a Fast Fourier Transform (FFT) or a Short-Time Fourier Transform (STFT) to the noise-removed vibration data to convert it into frequency-domain data or time-frequency-domain data. In other words, the data preprocessing unit 120 preprocesses the collected data by transforming the collected time waveform data (vibration data in the time domain) into the frequency domain through the FFT, or into the time-frequency domain through a spectrogram, and may extract the variables needed for prediction (feature variables) in each transformed domain. The data preprocessed by the data preprocessing unit 120 may be used as training data for the outlier detection unit 130.

According to an embodiment of the present application, the outlier detection unit 130 may train an outlier prediction model on the data preprocessed by the preprocessing unit, using the structure of the autoencoder 200 within adversarial machine learning, and detect anomalies.

According to an embodiment of the present application, encoded data may be extracted by applying basic information data to a neural network model. In other words, encoded data may be extracted through an Encoder-Decoder LSTM model (neural network model) based on the data stored in the database unit 110. For each of the collected maintenance history data and state data, the information collected in time series may be applied to the neural network model (for example, an LSTM) to extract the encoded data.

The outlier prediction model (neural network model) described below refers to the Encoder and the Decoder, each of which is responsible for a function within the LSTM model. The LSTM Encoder compresses information from a fixed-length input vector, and the LSTM Decoder is a generative model that reconstructs the time series signal from the compressed information. The LSTM model is trained with an objective function that minimizes the error between the reconstructed values and the actual values.

MSE (Mean Squared Error) may be used as the objective function. MSE is a distance-like measure, that is, a value that quantifies the difference between two objects.
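For reference, the MSE between an input window and its reconstruction is the average of the squared element-wise differences. A minimal sketch follows; the function name `mean_squared_error` and the sample values are illustrative only.

```python
def mean_squared_error(actual, reconstructed):
    """Average squared difference between two equal-length sequences."""
    if len(actual) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    return sum((a - r) ** 2 for a, r in zip(actual, reconstructed)) / len(actual)


actual = [1.0, 2.0, 3.0]           # hypothetical input window
reconstructed = [1.0, 2.5, 2.0]    # hypothetical model output
print(mean_squared_error(actual, reconstructed))
```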

FIG. 2A is a diagram schematically showing the structure of the autoencoder 200 within adversarial machine learning according to an embodiment of the present application.

For time series data, encoded data for time t may be generated by applying the basic information data at time t to the outlier prediction model. For basic information data collected in time series, the encoded data extraction unit may generate encoded data by applying each item of basic information data from time t back to the (t-n)-th time point before it to the neural network model.

For example, when the time series data is text data, the encoder receives all the words of the input sentence sequentially and then compresses all of this word information into a single vector, which may be called the context vector.

According to an embodiment of the present application, the LSTM cell that receives the input data may be called the encoder, and the LSTM cell that produces the output data may be called the decoder. By way of example, when the basic information data is text or sentence data, it is split into words through word tokenization, and each word token may be input at a time step of the LSTM cell. The hidden state of the encoder LSTM cell at its last time step may be passed to the decoder LSTM cell; this hidden state may be called the context vector. The context vector may be used as the first hidden state of the decoder LSTM cell. However, the foregoing is only one embodiment, and the present application is not limited thereto.

By way of example, the outlier prediction model (neural network model) may include two modules, an Encoder module and a Decoder module. The Encoder module encodes the input data, and the Decoder module decodes the encoded data.

The outlier prediction model disclosed herein may be characterized by using the structure of an LSTM-based model, a time series analysis algorithm, to extract the characteristics of the encoded data from the time series data. In particular, the LSTM (Long Short-Term Memory) algorithm is a recurrent neural network (RNN) architecture used in deep learning which, unlike a feed-forward neural network, has feedback connections. The LSTM algorithm therefore has the advantage of being able to learn from and process entire data sequences, not just single data points.

The LSTM algorithm is well suited to classification, processing, and prediction based on time series data, and it has the advantage of mitigating the vanishing gradient problem that can occur when training traditional RNNs.

Meanwhile, the input and output data of the outlier prediction model need not have the same shape. That is, the window size of the input data and the window size of the output data can be set differently. For example, with the goal of predicting whether maintenance will occur within 4 weeks, the outlier detection system may be trained with the past 8 weeks of data as input data and the following 4 weeks of data as output data. In the outlier prediction model, the encoder part and the decoder part each consist of multiple hidden layers rather than a single hidden layer. Therefore, to extract the encoded data, the output of the last layer of the multi-layer encoder part must be selected, and this value can be stored and used.
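The different input and output window sizes can be illustrated by slicing a series into (input, target) pairs. The sketch below is a generic sliding-window builder; the function name `make_windows`, the stride of one step, and the weekly granularity of the toy series are assumptions for illustration.

```python
def make_windows(series, input_len, output_len):
    """Slice a series into (input_window, target_window) pairs.

    Each pair uses input_len consecutive points as input and the next
    output_len points as the target, advancing one step at a time.
    """
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        pairs.append((series[start:split], series[split:split + output_len]))
    return pairs


weekly_values = list(range(14))          # hypothetical: 14 weekly readings
pairs = make_windows(weekly_values, input_len=8, output_len=4)
print(len(pairs))                        # number of training pairs
print(pairs[0])                          # first (8-week input, 4-week target)
```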

FIG. 2B is a diagram schematically showing two decoders sharing one encoder for adversarial machine learning according to an embodiment of the present application.

Referring to FIGS. 2A and 2B, the autoencoder 200 may be composed of a first autoencoder 201 and a second autoencoder 202 that share one encoder.

Referring to FIGS. 2A and 2B, the first autoencoder 201 and the second autoencoder 202 may each be trained so that, after the encoder compresses the input data, the decoder restores the compressed data to data that follows the normal distribution of the input data.

In anomaly detection, an autoencoder is trained using only data defined as “normal”. At detection time, it judges whether an anomaly is present from the reconstruction loss, which arises from the difference between the input and the output.

That is, when abnormal data is input, a decoder trained to generate only normal data produces a large loss value, and normal and abnormal data can be distinguished using a loss value threshold: a loss at or above a certain magnitude indicates abnormal data.
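The decision rule described above can be sketched as follows. The mean squared error is used here as the reconstruction loss, which is an assumption since the text does not name a specific loss; the function names, the sample windows, and the threshold value 0.5 are likewise illustrative.

```python
from typing import Sequence


def reconstruction_loss(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between an input window and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_anomalous(x: Sequence[float], x_hat: Sequence[float], threshold: float) -> bool:
    """Flag the window when its reconstruction loss exceeds the threshold."""
    return reconstruction_loss(x, x_hat) > threshold


# Hypothetical windows: the decoder reproduces normal data closely,
# but cannot reproduce a spike it never saw during training.
normal_in, normal_out = [1.0, 1.1, 0.9, 1.0], [1.0, 1.0, 1.0, 1.0]
spike_in, spike_out = [1.0, 4.0, 0.9, 1.0], [1.0, 1.2, 1.0, 1.0]

normal_flag = is_anomalous(normal_in, normal_out, threshold=0.5)
spike_flag = is_anomalous(spike_in, spike_out, threshold=0.5)
```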

Referring to FIG. 2B, the reconstruction data restored by the first autoencoder 201 is input to the second autoencoder 202. The second autoencoder 202 distinguishes real data from reconstruction data, and the first autoencoder 201 may be trained adversarially so as to deceive the second autoencoder 202.

Adversarial machine learning may use a network composed of a generator and a discriminator.

Through this, the second autoencoder 202 can detect outliers that resemble normal data.

According to an embodiment of the present application, the first autoencoder 201 may serve as the generator, and the second autoencoder 202 may serve as the discriminator.

The first autoencoder 201 maps the distribution of real data onto the latent space and generates sophisticated fake data from variables drawn from that distribution, and the second autoencoder 202 can distinguish the generated fake data from the real data.

In adversarial machine learning, training can proceed with a game-theoretic objective function under which two players (the first autoencoder and the second autoencoder) compete with each other while searching for an equilibrium point (Nash equilibrium).

Anomaly detection using adversarial machine learning uses the trained first autoencoder 201 and second autoencoder 202 to optimize the latent variables for the data under inspection, and then derives an anomaly score as a weighted sum of the resulting reconstruction loss and discrimination loss.
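The weighted sum described above can be written in a few lines. The parameter name `alpha` and the convention that the two weights sum to one are assumptions made for illustration; the text only states that the two loss values are combined by weighting.

```python
def anomaly_score(recon_loss: float, disc_loss: float, alpha: float = 0.5) -> float:
    """Weighted sum of reconstruction loss and discrimination loss.

    alpha is a hypothetical mixing weight in [0, 1]; the source does not give its value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * recon_loss + (1.0 - alpha) * disc_loss


# Example: weight the discrimination loss three times as heavily as the reconstruction loss.
score = anomaly_score(recon_loss=2.0, disc_loss=4.0, alpha=0.25)
```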

That is, adversarial machine learning takes as input latent variables assumed to follow a normal distribution, generates fake data that imitates normal data, and can derive an anomaly score using the reconstruction loss between the real data and the fake data.

Accordingly, in the outlier detection unit 130, the first autoencoder 201 restores the input data and learns a model that deceives the second autoencoder 202, while the second autoencoder 202 restores the input data and learns a model that distinguishes the input data from the reconstruction data restored by the first autoencoder 201.

That is, in the outlier detection unit 130, the second autoencoder 202 distinguishes real data from the reconstruction data of the first autoencoder 201, and the first decoder is trained adversarially to deceive the second autoencoder 202, so that the autoencoder can detect outliers that resemble normal data. Through this adversarial machine learning, data regarded as normal is learned and a reconstructed population with a similar distribution can be generated. However, a conventional adversarial autoencoder model can also produce a small reconstruction error for abnormal data. The outlier detection unit 130 according to the present invention therefore configures the autoencoder as the first autoencoder 201 and the second autoencoder 202 and, for stable training, gives weight to the data reconstruction error of the first decoder early in training and gives weight to outliers that resemble normal data late in training.
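The text does not give the formula for this weighting. One schedule that matches the description, sketched below purely as an assumption, gives the reconstruction term weight 1/n at epoch n and the adversarial term the remaining weight, so the emphasis moves from reconstruction early in training to the adversarial term later. The function names are hypothetical.

```python
from typing import Tuple


def loss_weights(epoch: int) -> Tuple[float, float]:
    """Return (reconstruction weight, adversarial weight) for a 1-based epoch.

    Assumed schedule: 1/epoch for reconstruction, the remainder for the adversarial term.
    """
    if epoch < 1:
        raise ValueError("epoch is 1-based")
    w_recon = 1.0 / epoch
    return w_recon, 1.0 - w_recon


def first_autoencoder_loss(recon_err: float, adv_err: float, epoch: int) -> float:
    """Combine the two error terms of the first autoencoder for the given epoch."""
    w_recon, w_adv = loss_weights(epoch)
    return w_recon * recon_err + w_adv * adv_err


early = loss_weights(1)   # all weight on reconstruction
later = loss_weights(4)   # most weight on the adversarial term
```

A fixed preset ratio, as the claims mention, would replace `loss_weights` with a constant pair.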

The autoencoder 200 may be a neural network structure commonly used in unsupervised learning. The autoencoder 200 is an unsupervised machine learning model that reduces the dimension of its input and then restores it, and it can learn the characteristics of the values used for training. More specifically, the autoencoder 200 learns a function that makes the output approximate the input, extracts features of the input through the encoder, and reconstructs the input through the decoder.

A neural network may be composed of a set of interconnected computational units, generally called nodes; these nodes may also be called neurons. A neural network generally includes a plurality of nodes. The nodes constituting the neural network may be interconnected by one or more links.

Some of the nodes constituting the neural network may form one layer based on their distance from the initial input node. For example, the set of nodes at distance n from the initial input node may form layer n.

The neural network described in this specification may include a deep neural network (DNN) that has a plurality of hidden layers in addition to an input layer and an output layer. Deep neural networks may include convolutional neural networks (CNN), recurrent neural networks (RNN), generative adversarial networks (GAN), and the like.

The neural network structure of the autoencoder 200 may be defined by the number of nodes, the connection relationships of the links that interconnect the nodes, and the weight assigned to each link.

In addition, in the autoencoder 200, the number of nodes in a hidden layer may be smaller than the number of nodes in the input layer or the output layer.

Among the layers constituting the autoencoder 200, the set consisting of the input layer and hidden layers may be defined as one neural network, the encoder network. The set consisting of hidden layers and the output layer may be defined as another neural network, the decoder network. The output layer of the encoder network may serve as the input layer of the decoder network.

The autoencoder 200 compresses the data fed to the input layer of the encoder network through the hidden layers and outputs the result as a latent variable (latent vector). It then passes the latent variable through the hidden layers of the decoder network and restores the data at the output layer.
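The compress-then-restore flow can be illustrated with a toy encoder and decoder. The weights here are fixed by hand (pair averaging and duplication) rather than learned, so this is only a sketch of the bottleneck idea, not the trained LSTM autoencoder of the disclosure; all names are hypothetical.

```python
from typing import List, Sequence


def encode(x: Sequence[float]) -> List[float]:
    """Compress an even-length input to half its size by averaging adjacent pairs."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]


def decode(z: Sequence[float]) -> List[float]:
    """Restore the original length by repeating each latent value twice."""
    restored: List[float] = []
    for value in z:
        restored.extend([value, value])
    return restored


smooth = [1.0, 1.0, 3.0, 3.0]
latent = encode(smooth)          # 4 values compressed to 2
restored = decode(latent)        # matches the smooth input exactly

spiky = [1.0, 5.0, 3.0, 3.0]
spiky_restored = decode(encode(spiky))  # the spike is lost in the bottleneck
```

Inputs that fit the pattern the bottleneck can represent are restored well, while inputs that do not fit are restored poorly, which is the property the reconstruction loss relies on.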

Unsupervised learning is used to discover hidden features or structure in data. By way of example, unsupervised learning may include the clustering methods K-means, hierarchical cluster analysis (HCA), and expectation maximization. It may also include the visualization and dimensionality reduction methods principal component analysis (PCA), kernel PCA, locally linear embedding (LLE), and t-distributed stochastic neighbor embedding (t-SNE). It may further include the association rule learning methods Apriori and Eclat.

For example, a logistic regression algorithm, a random forest algorithm, a support vector machine (SVM) algorithm, a decision-making algorithm, and a clustering algorithm may be used as the unsupervised learning-based artificial intelligence algorithm, but the algorithm is not limited thereto. The Random Forest algorithm builds a forest from many decision trees and averages their individual predictions into a single outcome variable. The SVM algorithm is a non-probabilistic algorithm that determines the class to which data belongs by finding the boundary with the widest margin in the distribution space of the data. The Extra Tree algorithm is similar to Random Forest but faster. Whereas the trees of a Random Forest are independent, the XGBoost algorithm is a boosting algorithm that applies the result of each tree to the next tree. The deep learning algorithm is based on a multi-layer neural network and learns by adjusting, through weights, the effect that patterns in the variables have on the result. The K-means clustering algorithm is a traditional classification technique that repeatedly subdivides a target group into K clusters based on the mean distance (similarity), and the SOM algorithm is a technique based on an artificial neural network that clusters by learning the input patterns of the training set as weights. The EM & Canopy algorithm refers to a technique that clusters by starting from a given initial value with the maximum likelihood and updating the parameter values through an iterative process.

A hierarchical clustering algorithm can subdivide each group into smaller groups. A visualization algorithm takes large, unlabeled, high-dimensional data and produces a 2D or 3D representation that can be plotted. Dimensionality reduction can be used to simplify data without losing too much information. For example, since a car's mileage is strongly correlated with its age, a dimensionality reduction algorithm can merge the two features into a single feature representing the car's wear, which is called feature extraction. Outlier detection is the task of automatically removing unusual values from the database unit 110 before they are fed to a learning algorithm; the detector is trained on normal samples and can then judge whether a new sample is normal. Association rule learning finds interesting relationships between features in large amounts of data and is used, for example, to find that people who buy one product tend to buy another. However, unsupervised learning-based artificial intelligence algorithms are not limited to these, and various algorithms may be included as research and development progresses.

According to an embodiment of the present application, the threshold calculation unit 140 may use a machine learning algorithm to derive a threshold with which outliers can be found.

In addition, the threshold calculation unit 140 may cluster the patterns defined by the outlier detection unit 130 to derive an optimal threshold for finding outliers.

By way of example, the threshold calculation unit 140 may use the K-means algorithm to calculate a threshold that separates the normal data and abnormal data learned by the outlier detection unit 130, and may define outliers based on the calculated threshold.

The K-means algorithm arbitrarily partitions the normal data and abnormal data learned by the outlier detection unit 130 into k clusters and repeatedly re-partitions them using each cluster's mean as its representative value, so that the data are rearranged on the basis of similarity.
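A minimal sketch of how K-means could yield a threshold is shown below for one-dimensional anomaly scores with k = 2. Several choices here are assumptions not stated in the text: the centroids are initialized at the minimum and maximum score rather than arbitrarily, and the threshold is taken as the midpoint between the two final centroids. The function names and the sample scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def two_cluster_centroids(scores: Sequence[float], max_iter: int = 100) -> Tuple[float, float]:
    """Run 1-D K-means with k=2 and return (low centroid, high centroid)."""
    low_c, high_c = min(scores), max(scores)
    if low_c == high_c:
        raise ValueError("scores must contain at least two distinct values")
    for _ in range(max_iter):
        # Assignment step: each score joins the nearer centroid.
        low = [s for s in scores if abs(s - low_c) <= abs(s - high_c)]
        high = [s for s in scores if abs(s - low_c) > abs(s - high_c)]
        # Update step: each centroid becomes the mean of its cluster.
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_c, high_c):
            break
        low_c, high_c = new_low, new_high
    return low_c, high_c


def kmeans_threshold(scores: Sequence[float]) -> float:
    """Place the threshold midway between the two cluster centroids."""
    low_c, high_c = two_cluster_centroids(scores)
    return (low_c + high_c) / 2.0


# Hypothetical anomaly scores: four normal windows and two abnormal ones.
scores = [0.10, 0.12, 0.09, 0.11, 0.95, 1.05]
threshold = kmeans_threshold(scores)
flags: List[bool] = [s > threshold for s in scores]
```

Because the threshold comes from the data's own cluster structure, no labeled anomalies are needed to set it.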

In addition, an anomaly score may be calculated based on the output data of the outlier detection unit 130. The anomaly score may be computed from the distribution of the latent variables, or from the difference between the data restored by the autoencoder and the original data.

The calculated anomaly score is then thresholded to determine whether the input data fed to the autoencoder is abnormal. For the threshold that separates normal from abnormal input data by anomaly score, the user may find an optimal threshold and detect abnormal data against it. For example, the mean and standard deviation of the reconstruction error on normal data may be computed and used to set the threshold.
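The mean and standard deviation rule mentioned above can be sketched as follows. The multiplier `k = 3.0` and the use of the population standard deviation are assumptions; the text says only that the two statistics are used to set the threshold.

```python
import statistics
from typing import Sequence


def threshold_from_normal_errors(errors: Sequence[float], k: float = 3.0) -> float:
    """Set the threshold at mean + k * standard deviation of normal-data errors.

    k is a hypothetical multiplier; larger values flag fewer points.
    """
    mu = statistics.mean(errors)
    sigma = statistics.pstdev(errors)  # population standard deviation
    return mu + k * sigma


# Hypothetical reconstruction errors measured on normal data only.
normal_errors = [0.10, 0.12, 0.09, 0.11, 0.08]
limit = threshold_from_normal_errors(normal_errors)

new_error = 0.50
new_is_abnormal = new_error > limit
```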

The outlier determination unit 150 may determine outliers in the time series data based on the threshold derived by the threshold calculation unit 140.

The outlier determination unit may include an algorithm that determines whether newly input data is normal or abnormal.

FIG. 3 is a diagram showing an example of a method of detecting outliers and automatically calculating a threshold for unsupervised learning-based time series data according to an embodiment of the present application.

The outlier detection and automatic threshold calculation for unsupervised learning-based time series data shown in FIG. 3 may be performed by an outlier detection and automatic threshold calculation device. Therefore, even where omitted below, the description given for the outlier detection and automatic threshold system 100 applies equally to the description of the outlier detection and automatic threshold calculation method.

In step S301, time series data collected from open sources and factory sensors may be stored.

In step S302, data preprocessing may be performed to extract features for outlier detection from the time series data and to refine the data so that outliers can be discriminated.

In step S303, the data preprocessed in the preprocessing step may be used to train the outlier prediction model using the structure of the autoencoder 200 in adversarial machine learning, and anomalies may be detected.

In step S304, a threshold with which outliers can be found may be derived using a machine learning algorithm.

In step S305, outliers in the time series data may be determined based on the threshold.

The outlier detection and automatic threshold system 100 for unsupervised learning-based time series data according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

In addition, the above-described method for monitoring the vibration condition of rotating equipment using deep learning-based time series analysis may also be implemented in the form of a computer program or application that is stored in a recording medium and executed by a computer.

The foregoing description of the present application is illustrative, and those of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present application is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: Outlier detection and automatic threshold system
110: Database unit
120: Data preprocessing unit
130: Outlier detection unit
140: Threshold calculation unit
150: Outlier determination unit

Claims

An outlier detection and automatic threshold system for unsupervised learning-based time series data, comprising:
a database unit that stores time series data collected from open sources and factory sensors;
a data preprocessing unit that extracts features for detecting outliers from the time series data and refines the data so that outliers can be discriminated;
an outlier detection unit that trains an outlier prediction model on the data preprocessed by the data preprocessing unit, using an autoencoder structure in adversarial machine learning, and detects anomalies;
a threshold calculation unit that derives, using a machine learning algorithm based on the outlier prediction model, a threshold with which outliers can be found; and
an outlier determination unit that determines outliers in the time series data based on the threshold,
wherein the autoencoder is built from one encoder and a first autoencoder and a second autoencoder that share the encoder for adversarial machine learning,
wherein the first autoencoder restores input data and learns a model that deceives the second autoencoder, and the second autoencoder restores the input data and learns a model that distinguishes whether the input data is normal or abnormal, from the input data and the reconstruction data restored by the first autoencoder, based on a loss value threshold of at least a preset magnitude, and
wherein the first autoencoder and the second autoencoder are trained with different weights, set according to a preset ratio, on the data reconstruction error and on the outliers.

The outlier detection and automatic threshold system of claim 1,
wherein the data preprocessing unit preprocesses the time series data stored in the database unit for training and then extracts statistical feature variables.

delete

The outlier detection and automatic threshold system of claim 1,
wherein the second autoencoder distinguishes real data from the reconstruction data of the first autoencoder, and the first autoencoder is trained adversarially to deceive the second autoencoder, so that the autoencoder detects outliers that resemble normal data.

delete

The outlier detection and automatic threshold system of claim 1,
wherein the threshold calculation unit calculates, based on a K-means algorithm, a threshold that separates the normal data and abnormal data learned by the outlier detection unit, and defines the outliers based on the calculated threshold.

An outlier detection method for unsupervised learning-based time series data, comprising:
storing time series data collected from open sources and factory sensors;
a data preprocessing step of extracting features for detecting outliers from the time series data and refining the data so that outliers can be discriminated;
training an outlier prediction model on the data preprocessed in the preprocessing step, using an autoencoder structure in adversarial machine learning, and detecting anomalies;
a threshold calculation step of deriving, using a machine learning algorithm based on the outlier prediction model, a threshold with which outliers can be found; and
determining outliers in the time series data based on the threshold,
wherein the autoencoder is built from one encoder and a first autoencoder and a second autoencoder that share the encoder for adversarial machine learning,
wherein the first autoencoder restores input data and learns a model that deceives the second autoencoder, and the second autoencoder restores the input data and learns a model that distinguishes whether the input data is normal or abnormal, from the input data and the reconstruction data restored by the first autoencoder, based on a loss value threshold of at least a preset magnitude, and
wherein the first autoencoder and the second autoencoder are trained with different weights, set according to a preset ratio, on the data reconstruction error and on the outliers.

A computer-readable recording medium recording a program for executing the method of claim 8 on a computer.