KR20210092999A

KR20210092999A - Method, apparatus and program for acquiring failure prediction model in industrial internet of things (iiot) environment

Info

Publication number: KR20210092999A
Application number: KR1020200006614A
Authority: KR
Inventors: 김의직; 권정혁
Original assignee: 한림대학교 산학협력단
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2021-07-27
Also published as: KR102328566B1

Abstract

Provided is a failure prediction model acquisition system in an industrial Internet of things (IIoT) environment. A control method of the system comprises: a step of acquiring, by a server, data from a plurality of sensors; a step of pre-processing, by the server, the acquired data; a step of acquiring, by the server, an importance of a feature based on the pre-processed data; a step of selecting, by the server, the feature based on the acquired importance and generating a plurality of models based on the selected feature; and a step of selecting, by the server, at least one model from among the plurality of generated models. Therefore, the present invention is capable of constructing an accurate failure prediction model.

Description

Method, Apparatus and Program for Acquisition of Failure Prediction Model in Industrial Internet of Things (IIoT) Environment

The present invention relates to a method, apparatus, and program for acquiring a failure prediction model in an industrial Internet of Things (IIoT) environment.

An IoT system, including the Industrial Internet of Things (IIoT), achieves the purpose of the system through the interaction of a large number of devices. Because such an IIoT system includes a plurality of devices, the failure of any one of them may cause the entire system to fail. In particular, the IIoT has recently been widely adopted by various companies because it provides connectivity and analysis functions, which are key technologies for advanced manufacturing, and IIoT systems now use a vast number of sensors.

To predict failures of such an IIoT system, methods have been proposed that build a failure prediction model by collecting a large amount of data from the plurality of sensors included in the system. Specifically, predicting failures requires a failure prediction model that determines whether a failure will occur, and most existing studies use machine learning techniques to build such a model.

However, existing studies use all of the data in the data set when building a failure prediction model. Consequently, in a real IIoT environment that uses a vast number of sensors, their prediction accuracy may be significantly degraded by the influence of data that is unrelated to failures.

Therefore, there is a need to build a failure prediction model with high prediction accuracy by screening out data that is irrelevant to failures.

Registered Patent Publication No. 10-1538709, 2015.07.16

An object of the present invention is to provide a method, apparatus, and program for acquiring a failure prediction model in an industrial Internet of Things (IIoT) environment.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, a control method of a failure prediction model acquisition system in an industrial Internet of Things (IIoT) environment according to an aspect of the present invention includes: acquiring, by a server, data from a plurality of sensors; preprocessing, by the server, the acquired data; obtaining, by the server, the importance of a feature based on the preprocessed data; selecting, by the server, a feature based on the obtained importance and generating a plurality of models based on the selected feature; and selecting, by the server, at least one model from among the plurality of generated models.

In this case, the plurality of models may be generated by Algorithm 1 below.

[Algorithm 1]

Here, cnt is a counter value, SF is the selected features, PA is the prediction accuracy, sf_cnt is the feature selected at cnt, the set of feature importances is IMP = [imp_0, ..., imp_j, ..., imp_{nf-1}], the set of features is F = [f_0, ..., f_j, ..., f_{nf-1}], imp_j is the importance value of the (j+1)-th feature, nf is the number of features, and modelBuildEvalFunction(SF) may be any function for building and evaluating a prediction model.

In this case, generating the plurality of models may include generating the models based on the SF obtained by Algorithm 1 and a support vector machine (SVM) algorithm.

In this case, selecting the at least one model may include: obtaining max(PA), which corresponds to the pa_i having the highest prediction accuracy among pa_0 to pa_cnt obtained by Algorithm 1; obtaining the index of max(PA) to obtain at least one feature having a prediction accuracy equal to or greater than a preset value; and selecting the at least one model based on max(PA) and the at least one feature having a prediction accuracy equal to or greater than the preset value.

In this case, the preprocessing may include: removing at least one feature from among a plurality of features corresponding to the data acquired from the plurality of sensors; normalizing the data of the features that remain after the at least one feature is removed; and splitting the normalized data.

In this case, removing the feature may include: identifying invalid data among the data acquired from each of the plurality of sensors; and removing the feature corresponding to the invalid data when the ratio of the invalid data is greater than a preset validity coefficient in the range [0, 1].

In this case, removing the feature may include: obtaining a variance value for the data acquired from each of the plurality of sensors; and, when a variance value is less than or equal to a preset value, removing the feature corresponding to that variance value.

Other specific details of the invention are included in the detailed description and drawings.

According to the various embodiments of the present invention described above, an efficient and accurate failure prediction model can be built in an IIoT environment from failure prediction models obtained by repeatedly selecting various features.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a system diagram illustrating an IIoT system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a method for obtaining a failure prediction model according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for obtaining a failure prediction model according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method for selecting a failure prediction model according to an embodiment of the present invention.
FIGS. 5 to 9 are exemplary views for explaining experimental results according to an embodiment of the present invention.
FIG. 10 is a block diagram of an apparatus according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains; the present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. As used herein, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are of course not limited by these terms. These terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which this invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

As used herein, the term "unit" or "module" refers to software or to a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or to run on one or more processors. Thus, by way of example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as encompassing different orientations of the components in use or operation in addition to the orientation shown in the drawings. For example, when a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" the other component. Accordingly, the exemplary term "below" may include both the downward and upward directions. Components may also be oriented in other directions, and spatially relative terms may therefore be interpreted according to the orientation.

In this specification, a computer means any kind of hardware device including at least one processor and, depending on the embodiment, may be understood as also encompassing the software configuration operating on that hardware device. For example, a computer may be understood to include, without limitation, smartphones, tablet PCs, desktops, notebooks, and the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the subject of each step is not limited thereto, and depending on the embodiment at least some of the steps may be performed by different devices.

FIG. 1 is a system diagram illustrating an IIoT system according to an embodiment of the present invention.

As shown in FIG. 1, the IIoT system may include a server 100 and a plurality of sensors 200-1 to 200-5.

The server 100 may acquire sensing data from the plurality of sensors 200-1 to 200-5. Based on the acquired sensing data and the sensors 200-1 to 200-5 to which the sensing data corresponds, the server 100 may build a failure prediction model for predicting failures of the system.

FIG. 2 is a block diagram illustrating a method for obtaining a failure prediction model according to an embodiment of the present invention.

The server 100 may obtain a failure prediction model based on features selected to maximize the prediction accuracy of the model. To this end, the server 100 may perform feature selection repeatedly and obtain a plurality of failure prediction models based on the repeatedly selected features. The server 100 may then obtain, from among the plurality of failure prediction models, the one with the highest prediction accuracy.

Specifically, as shown in FIG. 2, to obtain a failure prediction model the server 100 may perform 1) a preprocessing step (Processing), 2) an importance measurement step (Importance Measurement), 3) a feature selection step (Feature Selection), 4) a model building step (Model Building), and 5) a model selection step (Model Selection).

Specifically, in the preprocessing step, the server 100 may perform feature removal, missing data replacement, normalization, and data splitting. The server 100 may then obtain the importance of the features that remain after preprocessing. Next, the server 100 may repeatedly perform the feature selection and model building steps. In the model building step, the server 100 may build a failure prediction model with an SVM based on the selected features.

FIG. 3 is a flowchart illustrating a method for obtaining a failure prediction model according to an embodiment of the present invention.

In step S110, the server 100 may acquire data from a plurality of sensors.

In various embodiments of the present invention, the terms "feature" and "sensor" may be interpreted as having the same meaning.

In step S120, the server 100 may preprocess the acquired data.

Specifically, according to an embodiment of the present invention, the server 100 may remove at least one feature from among a plurality of features corresponding to the data acquired from the plurality of sensors.

Specifically, the server 100 may identify invalid data among the data acquired from each of the plurality of sensors.

In an embodiment, when the ratio of invalid data for a specific feature is greater than a preset validity coefficient in the range [0, 1], the server 100 may remove the feature corresponding to the invalid data.

Here, invalid data may mean data that has no value among the data collected from a feature. In one embodiment, when the feature is a temperature sensor, data for which no temperature was sensed may be invalid data. Invalid data may be the data corresponding to rows that have no value for a given feature in the collected data set (which may include a training data set and a test data set). In another embodiment, the server 100 may obtain a variance value for the data acquired from each of the plurality of sensors. When a variance value is less than or equal to a preset value, the server 100 may remove the feature corresponding to that variance value.

Specifically, in the preprocessing step, the server 100 may perform feature removal, missing data replacement, normalization, and data splitting. To remove invalid features from the input data set (that is, the data collected from the plurality of sensors), the server 100 may search each feature for invalid (NA) data and calculate the variance of each feature.

According to an embodiment, when the ratio of NA data of a specific feature is greater than a preset validity coefficient in the range [0, 1], the server 100 may remove that feature from the input data set.

In another embodiment, when the variance of a specific feature is less than or equal to a preset value, the server 100 may remove that feature from the input data set. Preferably, the closer the variance of a specific feature is to zero, the more likely the feature is to be removed from the input data set.

Meanwhile, when missing data exists (that is, when NA data exists in the features that remain after feature removal), the server 100 may replace the missing data with the mean of the non-missing data of the respective feature.
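The feature removal and missing data replacement described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-of-lists data layout, the default thresholds, and the use of the population variance are all assumptions made for the example.

```python
import math
import statistics


def is_na(value):
    """Treat None and NaN as invalid (NA) readings (assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def remove_and_impute(columns, validity_coeff=0.5, min_variance=1e-8):
    """Drop unusable features, then fill remaining NA values with the mean.

    columns maps a feature (sensor) name to its list of readings.
    validity_coeff and min_variance are hypothetical preset values.
    """
    kept = {}
    for name, values in columns.items():
        valid = [v for v in values if not is_na(v)]
        if not valid:
            continue  # nothing usable was collected for this feature
        na_ratio = 1 - len(valid) / len(values)
        # Remove the feature if its NA ratio exceeds the validity coefficient.
        if na_ratio > validity_coeff:
            continue
        # Remove the feature if its variance is at or below the preset value.
        if statistics.pvariance(valid) <= min_variance:
            continue
        # Replace remaining NA readings with the mean of the valid readings.
        mean = statistics.fmean(valid)
        kept[name] = [mean if is_na(v) else v for v in values]
    return kept
```

For example, a sensor with three missing readings out of four would be dropped at a validity coefficient of 0.5, while a sensor with one missing reading would be kept and its gap filled with the mean of the other three.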

Thereafter, the server 100 may normalize the data of the features that remain after the at least one feature is removed.

Specifically, the server 100 may perform normalization to bring the data of each feature onto a common scale. According to an embodiment, the server 100 may normalize each feature based on Equation 1 below.

[Equation 1]

$$x_i' = \frac{x_i - m}{s}$$

Here, $x_i'$ is the (i + 1)-th normalized data of the feature, $x_i$ is the (i + 1)-th data of the feature, $m$ is the mean of the feature, and $s$ is the standard deviation of the feature.

Here, the mean of a feature may mean the average of the data values collected from that feature (for example, a sensor).

In an embodiment, when the feature is a temperature sensor, the mean of the feature may be the average of the temperature readings obtained by that sensor. Likewise, the standard deviation and the variance of a feature are the standard deviation and the variance of the data values collected from that feature.
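As a small worked illustration of Equation 1, the sketch below normalizes the readings of a single feature. The function name is hypothetical, and the choice of the population standard deviation is an assumption, since the text does not say whether the population or the sample form is intended.

```python
import statistics


def normalize_feature(values):
    """Apply Equation 1, x' = (x - m) / s, to the readings of one feature."""
    m = statistics.fmean(values)
    # Assumption: population standard deviation; the source does not specify.
    s = statistics.pstdev(values)
    if s == 0:
        raise ValueError("constant feature; it should be removed beforehand")
    return [(x - m) / s for x in values]
```

After this transformation the feature has zero mean and unit standard deviation, so sensors that report in different units contribute on a comparable scale.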

Thereafter, the server 100 may split the normalized data.

Specifically, the server 100 may split the normalized data set into a training data set (Training Dataset) and a test data set (Test Dataset). The server 100 may use the training data set to obtain the failure prediction model and the test data set to evaluate the performance of the prediction model.
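The split can be sketched as below. The source does not state how the rows are assigned or in what proportion, so the random shuffle, the 30% test ratio, the fixed seed, and the function name are all assumptions for illustration.

```python
import random


def split_dataset(rows, test_ratio=0.3, seed=0):
    """Shuffle the rows and split them into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

Every row ends up in exactly one of the two sets, which keeps the test set unseen while the model is built.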

In step S130, the server 100 may obtain the importance of each feature based on the preprocessed data.

According to an embodiment, the server 100 may take the training data set from the data obtained through the preprocessing step. The server 100 may then apply the random forest technique to the training data to obtain the importance of the features corresponding to the training data.

Specifically, the server 100 may obtain the importance of the features that remain after preprocessing. The server 100 may analyze the training data set with the random forest technique to obtain the importance of each feature.

Specifically, the server 100 may build a plurality of decision trees through the random forest technique and analyze the relationship between each feature and Fail by considering the built decision trees together.

According to an embodiment, in the random forest technique performed by the server 100, a plurality of subsets of the training data set may be generated so that each decision tree is built differently. Each of the generated subsets may consist of different data and features.

To this end, the server 100 may randomly select n data items (that is, n rows) and mtry features (that is, mtry columns) from the training data set. The server 100 may repeat this random selection until the number of obtained subsets reaches ntree, the predefined number of decision trees.
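The subset generation can be sketched as follows, using the symbols n, mtry, and ntree from the text. Drawing rows with replacement (bootstrap sampling) and columns without replacement is an assumption borrowed from common random forest practice; the source only says the selection is random. The function name and the index-based return format are hypothetical.

```python
import random


def draw_subsets(n_rows, feature_names, n, mtry, ntree, seed=0):
    """Return ntree random (row indices, feature names) subsets."""
    rng = random.Random(seed)
    subsets = []
    while len(subsets) < ntree:
        # Assumption: rows are drawn with replacement, as in bagging.
        row_idx = [rng.randrange(n_rows) for _ in range(n)]
        # Assumption: mtry distinct features are drawn for each subset.
        cols = rng.sample(feature_names, mtry)
        subsets.append((row_idx, cols))
    return subsets
```

Each returned pair would then be used to train one decision tree on the selected rows and columns only.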

Thereafter, each decision tree may be built separately from one of the obtained subsets. After the ntree decision trees are built, the importance of each feature is measured by Mean Decrease Gini (MDG, the Gini importance score), which indicates how much each feature contributes to accurate prediction results.

More specifically, in each decision tree, the server 100 may calculate the sum of the differences in Gini impurity between each parent node that uses a specific feature and the child nodes of that parent node. The server 100 may then average the results over all decision trees. Here, a decision tree may consist of several nodes, each of which makes a decision using a threshold on a specific feature. Each node may also have a Gini impurity that measures the likelihood of an incorrect decision.
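To make the Mean Decrease Gini computation concrete, the sketch below computes the Gini impurity of a node, the impurity decrease of one split, and the average over trees. Weighting each child by its share of the parent's samples is the usual convention and is an assumption here, since the text only speaks of the difference between parent and child impurities. All function names are hypothetical, and label lists are assumed to be non-empty.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of the class labels at one node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease of one split (children weighted by sample share)."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


def mean_decrease_gini(per_tree_decreases):
    """Sum one feature's split decreases within each tree, then average.

    per_tree_decreases holds one list per tree; a tree that never splits
    on the feature contributes an empty list.
    """
    return sum(sum(d) for d in per_tree_decreases) / len(per_tree_decreases)
```

A split that separates failure and non-failure samples perfectly removes all impurity at that node, so features that produce such splits in many trees receive a high score.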

Meanwhile, the random forest technique is an ensemble method that trains a plurality of decision trees, and it can be applied to various problems such as detection, classification, and regression.

Meanwhile, in the random forest according to the present invention, randomness gives the trees slightly different characteristics from one another. This decorrelates the predictions of the individual trees and, as a result, can improve generalization performance. Randomization can also make the forest robust to noisy data. Randomization takes place during the training of each tree, and bagging, an ensemble learning method based on random sampling of the training data, and randomized node optimization may be used.

In step S140, the server 100 may select features based on the obtained importance and generate a plurality of models based on the selected features.

The server 100 may repeatedly perform the feature selection and model building steps. Specifically, Algorithm 1 represents the overall operation of the feature selection step and the model building step.

Meanwhile, in Algorithm 1, the set of feature importances may be defined as IMP = [imp_0, ..., imp_j, ..., imp_{nf-1}], and the set of features may be defined as F = [f_0, ..., f_j, ..., f_{nf-1}].

이때, nf는 피처의 수, imp_j는 (j + 1) 번째 피처의 중요도일 수 있다.In this case, nf may be the number of features, and imp _j may be the importance of the (j + 1)-th feature.

상기 알고리즘 1은 시작 시 변수가 초기화 되며, sf_cnt는 (cnt + 1) 번째 반복으로 선택된 피처를 의미하며, pa_cnt는, (cnt + 1) 번째 반복으로 구축된 모델의 고장 예측 정확도를 의미하며, cnt는 0에서 1씩 증가하는 카운터 값을 의미할 수 있다.In Algorithm 1, variables are initialized at the start, sf _cnt means a feature selected in the (cnt + 1)-th iteration, and pa _cnt means the failure prediction accuracy of the model built in the (cnt + 1)-th iteration, , cnt may mean a counter value that increases from 0 to 1.

The server 100 may repeat Algorithm 1, that is, the feature selection step and the model building step, until the importance of the newly selected feature falls below the importance threshold.

The maximum number of iterations (max) of Algorithm 1 may be set equal to the number of features whose importance is greater than the importance threshold. Accordingly, when cnt reaches max, the server 100 may terminate Algorithm 1. In each iteration of the feature selection step, the server 100 may select the new feature with the highest importance and add it to the selected feature set (SF). Consequently, the number of features in SF increases by one with each iteration.

Then, once SF is updated, the server 100 may, in the model building step, build a failure prediction model using a support vector machine (SVM) based on SF.

The server 100 may then evaluate the prediction accuracy of the built failure prediction model and update the prediction accuracy set (PA). In Algorithm 1, modelBuildEvalFunction(SF) is a function that builds and evaluates a prediction model; a conventional prediction evaluation function may be used.

For example, if cnt is 2 (the third loop) and the selected feature set is SF = [F39, F12, F19], the server 100 may generate an SVM-based prediction model using only the training data for those three features. The server 100 may then calculate pa_2 using the test data for those three features and update the PA set. (That is, the number of pa values in the PA set increases from 2 to 3: PA = [pa_0, pa_1, pa_2].)

According to an embodiment of the present invention, SF may be the set of selected features, sorted in descending order of importance. That is, each time cnt increases by 1 in the Repeat loop of Algorithm 1, the next most important feature is added to SF.

The server 100 may obtain the prediction accuracy (PA) using SF as the input value. When the iterations of Algorithm 1 end, the server 100 may output the finally derived SF and PA as the result.
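The loop described for Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `iterative_feature_selection`, the toy importances, and the evaluator `toy_eval` are hypothetical, and `model_build_eval` merely stands in for modelBuildEvalFunction(SF), which in the embodiment builds and evaluates an SVM.

```python
def iterative_feature_selection(features, importances, threshold, model_build_eval):
    """Sketch of the feature selection / model building loop.

    features:    names f_0 .. f_{nf-1}
    importances: imp_0 .. imp_{nf-1}, aligned with features
    model_build_eval: callable(SF) -> prediction accuracy (caller-supplied
                      stand-in for modelBuildEvalFunction)
    """
    # Rank features from most to least important.
    ranked = sorted(zip(features, importances), key=lambda fi: fi[1], reverse=True)
    # max: number of features whose importance exceeds the threshold.
    max_iter = sum(1 for _, imp in ranked if imp > threshold)
    sf, pa = [], []
    cnt = 0
    while cnt < max_iter:
        sf.append(ranked[cnt][0])              # sf_cnt: next most important feature
        pa.append(model_build_eval(list(sf)))  # pa_cnt: accuracy with cnt + 1 features
        cnt += 1
    return sf, pa

def toy_eval(sf):
    # Made-up accuracies keyed by the number of selected features.
    return {1: 0.60, 2: 0.70, 3: 0.65}[len(sf)]

sf, pa = iterative_feature_selection(
    ["F1", "F2", "F3", "F4"], [0.9, 1.5, 0.2, 0.8], 0.7, toy_eval)
```

In this toy run three features exceed the threshold of 0.7, so the loop runs three times and PA holds one accuracy per prefix of SF.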

In step S150, the server 100 may select at least one model from among the plurality of generated models.

FIG. 4 is a flowchart illustrating a method for selecting a failure prediction model according to an embodiment of the present invention.

In step S210, the server 100 may obtain max(PA), that is, the value pa_i having the highest prediction accuracy among pa_0 to pa_cnt obtained by Algorithm 1.

In step S220, the server 100 may obtain the index of max(PA) and thereby obtain at least one feature having a prediction accuracy equal to or greater than a preset value.

In step S230, the server 100 may select the at least one model based on max(PA) and the at least one feature having a prediction accuracy equal to or greater than the preset value.

Thereafter, in the model selection step, the server 100 may refer to PA and select the failure prediction model with the highest prediction accuracy. Specifically, the server 100 may select the highest prediction accuracy through max(PA) and derive the index of max(PA) to determine the iteration (that is, the number of selected features) at which the prediction accuracy was highest. The server 100 may select the failure prediction model in consideration of the index of max(PA) and SF. Here, max(PA) denotes the largest value in the PA set obtained through the iterations of Algorithm 1.

The index of max(PA) indicates the position of the maximum accuracy within the PA set. That is, the index indicates the value of cnt (the iteration) at which the accuracy is highest.

The index may be obtained by comparing each pa value in the PA set with max(PA) and taking the position (index = cnt) of the pa equal to max(PA). The server 100 may obtain the index of max(PA) (that is, the cnt value) within the PA set and finally select the features of that iteration to build the prediction model.

For example, as shown in FIG. 8, the index of max(PA) is 7, and the selected feature set is SF = [F60, F349, F41, F289, F427, F65, F66, F154] (that is, the eight most important features). The SVM-based prediction model built from the data of the features in this SF may be finally selected and used.
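The selection by the index of max(PA) can be sketched as follows. The function name `select_best` is hypothetical, the accuracy values are made up for illustration, and taking the first occurrence on ties is an assumption rather than something the description specifies.

```python
def select_best(sf, pa):
    """Return (index of max(PA), max(PA), features used at that iteration)."""
    best = max(pa)
    idx = pa.index(best)          # first position where pa equals max(PA), i.e. cnt
    return idx, best, sf[: idx + 1]

# SF is ordered by descending importance; pa[k] is the accuracy with k + 1 features.
sf_example = ["F60", "F349", "F41"]
pa_example = [0.61, 0.70, 0.66]   # illustrative values only
idx, best, chosen = select_best(sf_example, pa_example)
```

With an index of k, the chosen model is the one built from the first k + 1 features of SF, which matches the example above where index 7 corresponds to eight features.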

FIGS. 5 to 9 are exemplary diagrams illustrating experimental results according to an embodiment of the present invention.

In the present invention, an experimental implementation was carried out using open-source R version 3.4.3 to verify the feasibility of the proposed failure prediction model. The SECOM data set provided by the UCI repository was used. The data set consists of 1567 instances and 591 features and was collected from a semiconductor manufacturing process by monitoring sensors and process measurement points. The data for 590 features were measured by various sensors, and the remaining feature is the in-house line test result, labeled Pass or Fail.

In this experiment, the validity coefficient for feature removal was set to 0.1. That is, features with 10% or more NA data and features with no variance were removed from the data set. Through this feature removal process, the number of features was reduced from 591 to 393.
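The feature removal rule can be sketched as follows. This is a minimal illustration: the function name `remove_features`, the use of `None` to mark NA values, and the toy columns are assumptions, and the NA rule follows the experiment's wording ("10% or more").

```python
import statistics

def remove_features(columns, validity_coefficient=0.1):
    """columns: dict mapping feature name -> list of values (None marks NA).
    Drops features whose NA ratio is at least the validity coefficient
    and features whose non-NA values have zero variance."""
    kept = {}
    for name, values in columns.items():
        na_ratio = sum(v is None for v in values) / len(values)
        if na_ratio >= validity_coefficient:
            continue                                  # too many NA values
        present = [v for v in values if v is not None]
        if len(present) < 2 or statistics.pvariance(present) == 0:
            continue                                  # no variance
        kept[name] = values
    return kept

columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],         # kept
    "b": [1, 2, 3, 4, None, 6, 7, 8, 9, 10],      # 10% NA, removed
    "c": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],          # zero variance, removed
    "d": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],          # kept
}
kept = remove_features(columns)
```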

The ratio of the training data set to the test data set was set to 7:3. The random forest technique and the caret package were used to measure the importance of each feature. n, mtry, and ntree were set to 1000, 19, and 500, respectively. With this setting, 500 decision trees were built using a randomly generated 1000 × 19 matrix. The importance of each feature was measured by Mean Decrease Gini (MDG).
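The quantity underlying MDG can be illustrated with the Gini decrease of a single split. This is a minimal sketch with hypothetical function names and a toy split; MDG as reported by a random forest aggregates such decreases over the splits made on a feature and averages them over the trees, which this sketch does not do.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = ["Pass"] * 4 + ["Fail"] * 4
left, right = ["Pass"] * 4, ["Fail"] * 4      # a split that separates the classes
decrease = gini_decrease(parent, left, right)
```

A feature that frequently produces splits with a large decrease receives a high MDG, which is why MDG can serve as the importance value imp_j.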

FIG. 5 shows the importance of the top 30 features. The x-axis and y-axis represent the MDG and the features, respectively. In the figure, F60 has the highest MDG (1.52) among all features.

In this experiment, the importance threshold for the iterative feature selection process was set to 0.7. Accordingly, max was determined to be 24, meaning that the number of iterations is 24.

The training data set contains 70 Fail and 1038 Pass instances. This imbalance makes it difficult for the failure prediction model to predict failure cases. To address this, sampling was performed before the failure prediction model was built: some Pass data were removed, reducing the effect of the imbalance on model building. As described above, the failure prediction model was built with an SVM from the feature selection result and the sampled training data set, using the e1071 package in R. The table in FIG. 6 shows the obtained SF and PA. In the table of FIG. 6, max(PA) is 0.72 and its index is 7. Consequently, eight features (namely F60, F349, F41, F289, F427, F65, F66, and F154) are selected.
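Removing part of the Pass data can be sketched as random undersampling. This is a minimal illustration: the function name `undersample`, the 1:1 target ratio, and the seed are assumptions, since the description does not state how many Pass instances were kept.

```python
import random

def undersample(rows, label_of, majority="Pass", ratio=1.0, seed=0):
    """Keep every minority row and a random subset of majority rows.
    ratio: majority rows kept per minority row (assumed value)."""
    rng = random.Random(seed)
    minority = [r for r in rows if label_of(r) != majority]
    majority_rows = [r for r in rows if label_of(r) == majority]
    keep = min(len(majority_rows), int(len(minority) * ratio))
    return minority + rng.sample(majority_rows, keep)

# Class counts taken from the training set described above; row contents are dummies.
rows = [("Fail", i) for i in range(70)] + [("Pass", i) for i in range(1038)]
balanced = undersample(rows, label_of=lambda r: r[0])
n_fail = sum(1 for r in balanced if r[0] == "Fail")
n_pass = sum(1 for r in balanced if r[0] == "Pass")
```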

For performance evaluation, the prediction accuracy of the proposed model was compared with that of existing models. FIGS. 7 to 9 show receiver operating characteristic (ROC) curves for three failure prediction models that use different numbers of features. The ROC curve is a measure of predictive model performance that shows the relationship between the true positive rate (TPR) and the false positive rate (FPR). The area under the curve (AUC) is used to evaluate the prediction accuracy of a model: the larger the AUC, the higher the prediction accuracy. In FIG. 5, the failure prediction model built on iterative feature selection has the largest AUC among the models. This is because models are built iteratively with different numbers of features and the one with the highest prediction accuracy is selected.
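The AUC can be computed without plotting the ROC curve, using its equivalent interpretation as the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case. The following is a minimal sketch with a hypothetical function name and made-up scores; it is not the evaluation code of the experiment.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.
    labels: 1 for positive (e.g. Fail), 0 for negative; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]     # illustrative classifier scores
auc = roc_auc(labels, scores)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.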

That is, because irrelevant features make it difficult to build an accurate prediction model, prediction accuracy may be relatively lower when a fixed number of features is used to build the model.

In other words, building a model with more features that are unrelated to failures may reduce its prediction accuracy. Accordingly, when all features of the data set are used, as shown in FIG. 5, the AUC decreases considerably. Quantitatively, the proposed model achieves an AUC that is 14.3% and 22.0% higher than the fixed-number-of-features case and the all-features case, respectively.

In contrast, the present invention builds failure prediction models iteratively with different numbers of features and, furthermore, selectively obtains failure-related features to build the failure prediction model. As shown in FIG. 5, it can therefore achieve better results than existing models.

FIG. 10 is a block diagram of an apparatus according to an embodiment of the present invention.

The processor 102 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (for example, a bus) for transmitting and receiving signals to and from other components.

The processor 102 according to an embodiment performs the method described with reference to FIGS. 3 and 4 by executing one or more instructions stored in the memory 104.

The processor 102 may further include random access memory (RAM, not shown) and read-only memory (ROM, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 102. The processor 102 may also be implemented as a system on chip (SoC) including at least one of a graphics processing unit, RAM, and ROM.

The memory 104 may store programs (one or more instructions) for the processing and control of the processor 102. The programs stored in the memory 104 may be divided into a plurality of modules according to function.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements; similarly, embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, including various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms executed on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: server
200-1 to 200-5: a plurality of IoT devices

Claims

A control method of a failure prediction model acquisition system in an industrial Internet of Things (IIoT) environment, the control method comprising:
obtaining, by a server, data from a plurality of sensors;
pre-processing, by the server, the obtained data;
obtaining, by the server, an importance of each feature based on the pre-processed data;
selecting, by the server, features based on the obtained importance, and generating a plurality of models based on the selected features; and
selecting, by the server, at least one model from among the plurality of generated models.

The control method of claim 1,
wherein the generating of the plurality of models
is performed by the following Algorithm 1.
[Algorithm 1]

Here, cnt is a counter value, SF is the set of selected features, PA is the set of prediction accuracies, sf_cnt is the feature selected at cnt, the importance set of the features is IMP = [imp_0, ..., imp_j, ..., imp_{nf-1}], the set of features is F = [f_0, ..., f_j, ..., f_{nf-1}], imp_j is the importance of the (j+1)-th feature, nf is the number of features, and modelBuildEvalFunction(SF) is an arbitrary function for building and evaluating a prediction model.

The control method of claim 2,
wherein the generating of the plurality of models comprises:
generating the plurality of models based on the SF obtained by Algorithm 1 and a support vector machine (SVM) algorithm.

The control method of claim 2,
wherein the selecting of the at least one model comprises:
obtaining max(PA), that is, the value pa_i having the highest prediction accuracy among pa_0 to pa_cnt obtained by Algorithm 1;
obtaining the index of max(PA) to obtain at least one feature having a prediction accuracy equal to or greater than a preset value; and
selecting the at least one model based on max(PA) and the at least one feature having a prediction accuracy equal to or greater than the preset value.

The control method of claim 1,
wherein the pre-processing comprises:
removing at least one feature from among a plurality of features corresponding to the data obtained from the plurality of sensors;
normalizing the data for the plurality of features from which the at least one feature has been removed; and
segmenting the normalized data.

The control method of claim 5,
wherein the removing of the feature comprises:
obtaining invalid data from among the data obtained from each of the plurality of sensors; and
removing the feature corresponding to the invalid data when the ratio of the invalid data is greater than a preset validity coefficient in the range [0, 1].

The control method of claim 5,
wherein the removing of the feature comprises:
obtaining a variance value for the data obtained from each of the plurality of sensors; and
removing, when a variance value is less than or equal to a preset value, the feature corresponding to that variance value.

An apparatus comprising:
a memory storing one or more instructions; and
a processor executing the one or more instructions stored in the memory,
wherein the processor performs the method of claim 1 by executing the one or more instructions.

A computer program stored in a computer-readable recording medium to perform the method of claim 1 in combination with a computer, which is hardware.