KR102340652B1 - Method, apparatus and program for acquiring failure prediction model

Info

Publication number: KR102340652B1
Application number: KR1020200029693A
Authority: KR
Inventors: 김의직; 권정혁
Original assignee: 한림대학교 산학협력단
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2021-12-17
Also published as: KR20210114244A

Abstract

A method for acquiring a failure prediction model is provided. The method includes: acquiring, by a server, data from a plurality of sensors; acquiring, by the server, a plurality of features for each of the plurality of sensors based on the data; acquiring, by the server, a first feature set obtained by removing, from the plurality of features, features corresponding to invalid data; acquiring, by the server, a second feature set obtained by replacing the data of features in the first feature set whose data is missing; normalizing, by the server, the second feature set; and splitting, by the server, the normalized data.

Description

Method, device and program for acquiring failure prediction model

The present invention relates to a method, apparatus, and program for acquiring a failure prediction model.

The Industrial Internet of Things (IIoT) refers to devices such as sensors and equipment that are networked together with computers' industrial applications, including manufacturing and energy management. This connectivity facilitates data collection, exchange, and analysis, improves productivity and efficiency, and can yield other economic benefits. IIoT is an evolution of the distributed control system (DCS) that enables a high degree of automation by using cloud computing to improve process control.

However, because a large number of devices interact in an IoT system, including an IIoT system, the failure of any single device can cause the entire system to fail.

Accordingly, there is a need for a method of predicting failures in an IIoT system. More specifically, there is a growing need for a method that covers the whole process, from preprocessing the data of the plurality of sensors included in the IIoT system to acquiring a failure prediction model based on the preprocessed data.

Korean Registered Patent Publication No. 10-1796583, 2017.11.06

The problem to be solved by the present invention is to provide a method, apparatus, and program for acquiring a failure prediction model.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for acquiring a failure prediction model according to one aspect of the present invention for solving the above problem includes: acquiring, by a server, data from a plurality of sensors; acquiring, by the server, a plurality of features for each of the plurality of sensors based on the data; acquiring, by the server, a first feature set obtained by removing, from the plurality of features, features corresponding to invalid data; acquiring, by the server, a second feature set obtained by replacing the data of features in the first feature set whose data is missing; normalizing, by the server, the second feature set; and splitting, by the server, the normalized data.

Other specific details of the invention are included in the detailed description and drawings.

According to the various embodiments of the present invention described above, a failure prediction model acquired by repeatedly selecting various features makes it possible to build an efficient and accurate failure prediction model in an IIoT environment.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a system diagram illustrating an IIoT system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a method for acquiring a failure prediction model according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for acquiring a failure prediction model according to an embodiment of the present invention.
FIGS. 4 to 9 are exemplary views illustrating experimental results according to an embodiment of the present invention.
FIG. 10 is a block diagram of an apparatus according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of its scope; the present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

As used herein, the term "unit" or "module" refers to a software component or a hardware component such as an FPGA or ASIC, and a "unit" or "module" performs certain roles. However, "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside on an addressable storage medium or to run on one or more processors. Thus, by way of example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Spatially relative terms such as "below", "beneath", "lower", "above", and "upper" may be used to easily describe the relationship between one component and other components as shown in the drawings. Spatially relative terms should be understood as encompassing different orientations of the components in use or operation in addition to the orientation shown in the drawings. For example, if a component shown in a drawing is turned over, a component described as "below" or "beneath" another component may be placed "above" that other component. Accordingly, the exemplary term "below" may encompass both the below and above directions. Components may also be oriented in other directions, and spatially relative terms may be interpreted according to the orientation.

In this specification, a computer means any kind of hardware device including at least one processor and, depending on the embodiment, may be understood to also encompass the software configuration operating on that hardware device. For example, a computer may be understood to include, without limitation, a smartphone, a tablet PC, a desktop, a notebook, and the user clients and applications running on each of these devices.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Each step described in this specification is described as being performed by a computer, but the subject of each step is not limited thereto, and depending on the embodiment at least some of the steps may be performed by different devices.

FIG. 1 is a system diagram illustrating an IIoT system according to an embodiment of the present invention.

As shown in FIG. 1, the IIoT system may include a server 100 and a plurality of sensors 200-1 to 200-5.

The server 100 may acquire sensing data from the plurality of sensors 200-1 to 200-5. Based on the acquired sensing data and the sensors 200-1 to 200-5 that the sensing data corresponds to, the server 100 may build a failure prediction model for predicting failure of the system.

FIG. 2 is a block diagram illustrating a method for acquiring a failure prediction model according to an embodiment of the present invention.

The server 100 may acquire a failure prediction model based on features selected to maximize the prediction accuracy of the model. To this end, the server 100 may perform feature selection repeatedly and may acquire a plurality of failure prediction models based on the repeatedly selected features. The server 100 may then acquire, from among the plurality of failure prediction models, the one with the highest prediction accuracy.

Specifically, as shown in FIG. 2, to acquire the failure prediction model the server 100 may perform 1) a preprocessing step (Processing), 2) an importance measurement step (Importance Measurement), 3) a feature selection step (Feature Selection), 4) a model building step (Model Building), and 5) a model selection step (Model Selection).

Specifically, in the preprocessing step, the server 100 may perform feature removal, missing-data replacement, normalization, and data splitting. The server 100 may then obtain the importance of the features that result from preprocessing. Next, the server 100 may repeatedly perform the feature selection and model building steps. In the model building step, the server 100 may build a failure prediction model through an SVM based on the selected features.

FIG. 3 is a flowchart illustrating a method for acquiring a failure prediction model according to an embodiment of the present invention.

In step S110, the server 100 may acquire data from a plurality of sensors.

In various embodiments of the present invention, the terms "feature" and "sensor" may be interpreted interchangeably.

In step S120, the server 100 may acquire a plurality of features for each of the plurality of sensors.

In step S130, the server 100 may acquire a first feature set obtained by removing, from the plurality of features, features corresponding to invalid data.

In step S140, the server 100 may acquire a second feature set obtained by replacing the data of features in the first feature set whose data is missing.

In step S150, the server 100 may normalize the second feature set.

In step S160, the server 100 may split the normalized data.

Specifically, according to an embodiment of the present invention, the server 100 may remove at least one feature from among the plurality of features corresponding to the data acquired from the plurality of sensors.

Specifically, the server 100 may identify invalid data among the data acquired from each of the plurality of sensors.

In an embodiment, when the ratio of invalid data for a specific feature is greater than a preset validity coefficient in the range [0, 1], the server 100 may remove that feature.

Here, invalid data may mean data that has no value among the data collected from a feature. For example, when the feature is a temperature sensor, a record in which no temperature was sensed may be invalid data. Invalid data may be the data corresponding to rows that have no value for a given feature in the collected data set (which may include a training data set and a test data set). In another embodiment, the server 100 may compute the variance of the data acquired from each of the plurality of sensors. When a variance is less than or equal to a preset value, the server 100 may remove the feature corresponding to that variance.

Specifically, in the preprocessing step, the server 100 may perform feature removal, missing-data replacement, normalization, and data splitting. To remove invalid features from the input data set (that is, the data collected from the plurality of sensors), the server 100 may search for the invalid (NA) data of each feature and calculate the variance of each feature.

According to an embodiment, when the ratio of NA data of a specific feature is greater than the preset validity coefficient in the range [0, 1], the server 100 may remove that feature from the input data set.

In another embodiment, when the variance of a specific feature is less than or equal to a preset value, the server 100 may remove that feature from the input data set. Preferably, the closer the variance of a feature is to 0, the more likely the feature is to be removed from the input data set.
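The two removal rules above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `remove_invalid_features`, the dictionary layout (feature name mapped to a list of readings, with `None` standing for NA), and the threshold values 0.5 and 0.0 are all hypothetical and do not come from the specification.

```python
from statistics import pvariance


def remove_invalid_features(columns, validity_coeff=0.5, min_variance=0.0):
    """Keep only the features that pass both removal rules.

    columns: dict mapping a feature name to its non-empty list of readings,
             where None marks an invalid (NA) reading.
    validity_coeff, min_variance: illustrative thresholds (assumptions).
    """
    kept = {}
    for name, values in columns.items():
        na_ratio = sum(v is None for v in values) / len(values)
        if na_ratio > validity_coeff:
            continue  # rule 1: NA ratio exceeds the validity coefficient
        present = [v for v in values if v is not None]
        if not present or pvariance(present) <= min_variance:
            continue  # rule 2: variance at or below the preset value
        kept[name] = values
    return kept


# Made-up sensor readings, for illustration only.
sensors = {
    "temp": [20.1, None, 21.3, 19.8],      # 25% NA, non-zero variance: kept
    "pressure": [None, None, None, 1.0],   # 75% NA: removed
    "flag": [1.0, 1.0, 1.0, 1.0],          # zero variance: removed
}
first_feature_set = remove_invalid_features(sensors)
```

Here the variance is computed over the non-missing readings only, which is one reasonable reading of the text; the specification does not say how NA values are treated when the variance is calculated.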

Meanwhile, when missing data exists (that is, when NA data remains in the features left after feature removal), the server 100 may replace the missing data with the mean of the non-missing data of the respective feature.
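As a minimal sketch of this mean replacement, assuming the same hypothetical layout as before (a dictionary of feature name to readings, with `None` for NA) and a hypothetical function name `impute_with_mean`:

```python
from statistics import mean


def impute_with_mean(columns):
    """Replace each None (NA) with the mean of that feature's non-missing values."""
    filled = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        m = mean(present)  # assumes at least one non-missing value remains
        filled[name] = [m if v is None else v for v in values]
    return filled


# Made-up readings: the missing value becomes (20.0 + 22.0) / 2.
second_feature_set = impute_with_mean({"temp": [20.0, None, 22.0]})
```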

Thereafter, the server 100 may normalize the data of the features that remain after at least one feature has been removed.

Specifically, the server 100 may perform normalization according to the data scale of each feature. According to an embodiment, the server 100 may normalize each feature based on Equation 1 below.

[Equation 1]

$$x_i' = \frac{x_i - m}{s}$$

Here, $x_i'$ is the (i + 1)-th normalized feature data, $x_i$ is the (i + 1)-th data of the feature, m is the mean of the feature, and s is the standard deviation of the feature.

Meanwhile, the mean m and the standard deviation s of the feature may be obtained based on Equation 2 and Equation 3 below, respectively.

[Equation 2]

[Equation 3]

Here, N denotes the number of data points of the feature. The mean m of a feature means the average of the data values collected from that feature (for example, a sensor).
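The bodies of Equations 2 and 3 did not survive in this text. Assuming the usual definitions of mean and standard deviation, and the zero-based indexing used for $x_i$ in Equation 1, they would plausibly read as follows. The 1/N (population) form of the standard deviation is an assumption; the 1/(N-1) sample form would be equally consistent with the surrounding description.

$$m = \frac{1}{N}\sum_{i=0}^{N-1} x_i$$

$$s = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\left(x_i - m\right)^2}$$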

For example, when the feature is a temperature sensor, the mean of the feature is the average of the temperature readings acquired by that sensor. Likewise, the standard deviation and variance of a feature are the standard deviation and variance of the data values collected from that feature.
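The normalization of Equation 1 can be illustrated with a short Python sketch. The function name `z_score` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the exact form of the standard deviation is not recoverable from this text.

```python
from statistics import mean, pstdev


def z_score(values):
    """Normalize one feature as x_i' = (x_i - m) / s."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (assumption)
    return [(x - m) / s for x in values]


# Made-up readings with mean 5 and population standard deviation 2.
normalized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Because features whose variance is at or near zero are removed earlier in preprocessing, the division by s is well defined for the features that reach this step.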

Thereafter, the server 100 may split the normalized data.

Specifically, the server 100 may split the normalized data set into a training data set (Training Dataset) and a test data set (Test Dataset). The server 100 may use the training data set to acquire the failure prediction model and the test data set to evaluate the performance of the prediction model.
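A minimal sketch of such a split follows. The specification does not state the split ratio or whether rows are shuffled, so the 25% test ratio, the shuffling, the fixed seed, and the function name `split_dataset` are all assumptions made for illustration.

```python
import random


def split_dataset(rows, test_ratio=0.25, seed=0):
    """Shuffle the rows and split them into (training rows, test rows)."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = rows[:]          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Eight made-up row identifiers: six go to training, two to testing.
train_rows, test_rows = split_dataset(list(range(8)))
```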

Meanwhile, the server 100 may obtain the importance of each feature based on the preprocessed data.

According to an embodiment, the server 100 may take the training data set from the data obtained in the preprocessing step. The server 100 may then apply the random forest technique to the training data to obtain the importance of the features corresponding to the training data.

Specifically, the server 100 may obtain the importance of the features that result from preprocessing. The server 100 may analyze the training data set with the random forest technique to obtain the importance of each feature.

Specifically, the server 100 may build a plurality of decision trees through the random forest technique and analyze the relationship between each feature and Fail by considering the built decision trees together.

According to an embodiment, in the random forest technique performed by the server 100, a plurality of subsets of the training data set may be generated so that each decision tree is built differently. Each of the generated subsets may consist of different data and features.

To this end, the server 100 may randomly select n data points (that is, n rows) and mtry features (that is, mtry columns) from the training data set. The server 100 may repeat this random selection until the number of subsets obtained reaches ntree, the predefined number of decision trees.
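The subset generation described above can be sketched as follows. The names n, mtry, and ntree follow the text; the function name `draw_subsets`, the fixed seed, and the choice to sample rows with replacement (as in bagging) are assumptions, since the text only says the rows and columns are chosen at random.

```python
import random


def draw_subsets(n_rows, feature_names, n, mtry, ntree, seed=0):
    """Draw ntree random subsets, each a (row indices, feature names) pair."""
    rng = random.Random(seed)
    subsets = []
    while len(subsets) < ntree:  # repeat until ntree subsets exist
        # n rows, sampled with replacement (assumption: bagging-style sampling)
        rows = [rng.randrange(n_rows) for _ in range(n)]
        # mtry distinct features (columns)
        feats = rng.sample(feature_names, mtry)
        subsets.append((rows, feats))
    return subsets


# Made-up sizes for illustration.
subsets = draw_subsets(n_rows=100, feature_names=["f0", "f1", "f2", "f3"],
                       n=60, mtry=2, ntree=5)
```

Each subset would then be used to build one decision tree, so the trees differ in both the rows and the features they see.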

Thereafter, each decision tree may be built separately from one of the obtained subsets. After the ntree decision trees are built, the importance of each feature is measured by the Mean Decrease Gini (MDG, the Gini importance score), which indicates how much each feature contributes to correct prediction results.

More specifically, in each decision tree, the server 100 may calculate the sum of the differences in Gini impurity between each parent node that uses a specific feature and that parent node's child nodes. The server 100 may then average the results over all decision trees. Here, a decision tree may consist of several nodes, each of which makes a decision using a threshold on a specific feature. Each node also has a Gini impurity, which measures the probability of a wrong decision.
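A small Python sketch of these quantities follows. The function names are hypothetical, and weighting each child's impurity by its share of the parent's samples is the conventional definition of the Gini decrease; the specification itself only speaks of the "difference" between parent and child impurities, so the weighting is an assumption.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity drop of one split, children weighted by size (assumption)."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


def mean_decrease_gini(per_tree_decreases):
    """Average, over all trees, of the summed decreases for one feature.

    per_tree_decreases: one list per tree holding the decreases of every
    split in that tree that used the feature (empty if it was never used).
    """
    return sum(sum(d) for d in per_tree_decreases) / len(per_tree_decreases)


# Made-up labels: a perfect split of a balanced node removes all impurity.
parent = ["fail", "fail", "ok", "ok"]
decrease = gini_decrease(parent, ["fail", "fail"], ["ok", "ok"])
mdg = mean_decrease_gini([[0.5], [0.25, 0.25], []])
```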

Meanwhile, the random forest technique is an ensemble method that trains a plurality of decision trees, and it can be applied to a variety of problems such as detection, classification, and regression.

In the random forest according to the present invention, randomness gives the trees slightly different characteristics from one another. This decorrelates the predictions of the individual trees and, as a result, can improve generalization performance. Randomization can also make the forest robust to noisy data. Randomization takes place while each tree is trained; bagging, an ensemble learning method based on random sampling of the training data, and randomized node optimization may be used.

Meanwhile, the server 100 may select features based on the obtained importance and generate a plurality of models based on the selected features.

The server 100 may repeatedly perform the feature selection and model building steps. Specifically, Algorithm 1 below shows the overall operation of the feature selection step and the model building step.

In Algorithm 1, the set of feature importances may be defined as $IMP = [imp_0, \ldots, imp_j, \ldots, imp_{nf-1}]$, and the set of features may be defined as $F = [f_0, \ldots, f_j, \ldots, f_{nf-1}]$.

Here, nf is the number of features and $imp_j$ is the importance of the (j + 1)-th feature.

Algorithm 1 initializes its variables at the start. $sf_{cnt}$ denotes the feature selected in the (cnt + 1)-th iteration, $pa_{cnt}$ denotes the failure prediction accuracy of the model built in the (cnt + 1)-th iteration, and cnt is a counter that starts at 0 and increases by 1. The server 100 may repeat Algorithm 1 until the importance of the newly selected feature is less than the importance threshold. That is, the server 100 may repeat the feature selection step and the model building step until the importance of the newly selected feature falls below the importance threshold.

Here, the maximum number of iterations (max) of Algorithm 1 may be set equal to the number of features whose importance is greater than the importance threshold. Therefore, when cnt reaches max, the server 100 may terminate the iteration of Algorithm 1. Specifically, each time the feature selection step is repeated, the server 100 may select the new feature with the highest importance and add it to the selected feature set (SF), in accordance with Equation 4 below.

[수학식 4][Equation 4]

SF=[sf₀, sf₁,..,sf_cnt], cnt ∈ [0, max-1]SF=[sf ₀ , sf ₁ ,..,sf _cnt ], cnt ∈ [0, max-1]

결론적으로, SF에 포함 된 피처의 수는 반복 횟수가 증가함에 따라 1씩 증가할 수 있다. 이후, SF가 업데이트되면, 서버(100)는 모델 구축 단계에서, SF에 기초한 SVM을 통해 고장 예측 모델을 구축할 수 있다.In conclusion, the number of features included in SF can increase by one as the number of iterations increases. Then, when the SF is updated, the server 100 may build a failure prediction model through the SVM based on the SF in the model building step.

이후, 서버(100)는 구축된 고장 예측 모델의 예측 정확도를 평가하여 하기 수학식 5로 표현된 예측 정확도 세트(PA)를 업데이트할 수 있다.Thereafter, the server 100 may update the prediction accuracy set PA expressed by Equation 5 below by evaluating the prediction accuracy of the built failure prediction model.

[수학식 5][Equation 5]

PA=[pa₀,pa₁,.. pa_cnt,], cnt ∈ [0, max-1]PA=[pa ₀ ,pa ₁ ,.. pa _cnt ,], cnt ∈ [0, max-1]

이때, SF 및 PA 각각의 인덱스(즉, 선택된 각 피처의 인덱스 및 예측 정확도)는 반복 횟수를 나타낼 수 있다, 상기 알고리즘 1에서 modelBuildEvalFunction()은 예측 모델을 구축하고 평가하는 함수이며 SF를 입력으로 하여 예측 정확도를 도출할 수 있다.At this time, each index of SF and PA (that is, the index and prediction accuracy of each selected feature) may indicate the number of iterations. In Algorithm 1, modelBuildEvalFunction() is a function that builds and evaluates a predictive model, Prediction accuracy can be derived.

이때, 알고리즘 1에서 modelBuildEvalFunction(SF)은 예측 모델을 구축하고 평가하기 위한 함수로, 종래의 예측 평가 함수가 사용될 수 있음은 물론이다.In this case, in Algorithm 1, modelBuildEvalFunction (SF) is a function for building and evaluating a prediction model, and of course, a conventional prediction evaluation function may be used.

For example, if cnt is 2 (the third loop) and the selected feature set is SF = [f_39, f_12, f_19] (that is, sf_0 = f_39, sf_1 = f_12, sf_2 = f_19), the server 100 may generate an SVM-based prediction model using only the training data of these three features. The server 100 may then calculate pa_2 using the test data of these three features and update the PA set. (That is, the number of pa values in the PA set increases from two (PA = [pa_0, pa_1]) to three (PA = [pa_0, pa_1, pa_2]).) When cnt reaches max, the procedure ends.

Meanwhile, according to an embodiment of the present invention, SF may be the set of selected features, sorted in descending order of importance. That is, each time cnt increases by 1 in the Repeat loop of Algorithm 1, the most important remaining feature is added, one at a time.

The server 100 may derive the prediction accuracy using SF as input. When the iteration ends, the server 100 may output the final SF and PA as the result of the procedure.

That is, the server 100 may obtain the prediction accuracy (PA) using SF as the input value, and when the iteration of Algorithm 1 ends, it may output the final SF and PA given by Equations 6 and 7 below.

[Equation 6]

SF = [sf_0, ..., sf_cnt, ..., sf_max-1]

[Equation 7]

PA = [pa_0, ..., pa_cnt, ..., pa_max-1]
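
The loop described above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's Algorithm 1 itself: the function name `iterative_feature_selection`, the stand-in `toy_eval`, and all importance and accuracy values are assumptions made for the example. In practice `model_build_eval` would train an SVM on the training data of the features in SF and return its accuracy on the test data.

```python
from typing import Callable, Dict, List, Tuple


def iterative_feature_selection(
    importance: Dict[str, float],
    threshold: float,
    model_build_eval: Callable[[List[str]], float],
) -> Tuple[List[str], List[float]]:
    """Grow SF by one feature per iteration and record one accuracy per iteration."""
    # max = number of features whose importance exceeds the threshold
    candidates = [f for f, imp in importance.items() if imp > threshold]
    # Highest importance first, so sf_0 is the most important feature
    candidates.sort(key=lambda f: importance[f], reverse=True)
    max_iter = len(candidates)

    sf: List[str] = []
    pa: List[float] = []
    for cnt in range(max_iter):
        sf.append(candidates[cnt])             # SF gains exactly one feature
        pa.append(model_build_eval(list(sf)))  # pa_cnt for the model built on SF
    return sf, pa


# Hypothetical stand-in for modelBuildEvalFunction(SF): it looks up a made-up
# accuracy by the number of features instead of training a real model.
TOY_ACCURACY = {1: 0.60, 2: 0.70, 3: 0.65}


def toy_eval(features: List[str]) -> float:
    return TOY_ACCURACY[len(features)]


# Made-up importance values; only f39, f12, and f19 exceed the threshold 0.7.
importance = {"f12": 0.80, "f5": 0.30, "f39": 0.90, "f19": 0.75}
sf, pa = iterative_feature_selection(importance, 0.7, toy_eval)
print(sf)  # ['f39', 'f12', 'f19']
print(pa)  # [0.6, 0.7, 0.65]
```

Because the evaluation function is passed in as a parameter, the same loop works with any model builder, which mirrors the statement that a conventional prediction evaluation function may be used.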

Meanwhile, the server 100 may select at least one model from among the plurality of generated models.

Specifically, the server 100 may select the failure prediction model with the highest prediction accuracy by referring to PA and SF. In particular, in step S150, the server 100 may find the highest prediction accuracy in PA using max(PA), where max() is a function that returns the maximum value of the given set. The server 100 may then obtain the index of max(PA) in PA: each element of PA is compared with max(PA), and the index of the element equal to max(PA) is taken from PA. The server 100 may then select the failure prediction model in consideration of the index of max(PA) and SF.

More specifically, the server 100 may obtain from SF the elements whose index is less than or equal to the index of max(PA). The server 100 may then compare the features used in the model building step with the features extracted from SF, and select one of the failure prediction models generated in the model building step.

The failure prediction model selection method is now described in more detail. First, the server 100 may obtain max(PA), which corresponds to the pa_i having the highest prediction accuracy among pa_0 to pa_max-1 obtained by Algorithm 1. The server 100 may then obtain the index of max(PA) and thereby obtain at least one feature having a prediction accuracy equal to or greater than a preset value. The server 100 may then select the at least one model based on max(PA) and the at least one feature having a prediction accuracy equal to or greater than the preset value.

Thereafter, in the model selection step, the server 100 may select the failure prediction model with the highest prediction accuracy by referring to PA. Specifically, the server 100 may pick the highest prediction accuracy through max(PA) and derive the index of max(PA) to determine the iteration (that is, the number of selected features) at which the prediction accuracy is highest. The server 100 may select the failure prediction model in consideration of the index of max(PA) and SF. Here, max(PA) denotes the largest value in PA (that is, the PA set) obtained through the iterations of Algorithm 1.

Here, the index of max(PA) indicates the position of the maximum accuracy within the PA set. That is, the index indicates the value of cnt (in other words, the iteration) at which the accuracy is highest.

The index may be obtained by comparing the pa values in the PA set with max(PA) and taking the position (index = cnt) of the pa equal to max(PA). The server 100 may obtain the index of max(PA) (that is, the cnt value) within the PA set and build the prediction model by finally selecting the features used in that iteration.

For example, as shown in FIG. 5, the index of max(PA) is 7, and the feature set selected at that point is SF = [F60, F349, F41, F289, F427, F65, F66, F154] (that is, the 8 most important features). The SVM-based prediction model built from the data of the features in this SF may be finally selected and used.
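
The selection step can be illustrated with a short Python sketch. The first eight feature names and the maximum accuracy of 0.72 at index 7 come from the FIG. 5 example; every other accuracy value and the two trailing placeholder features (`Fx`, `Fy`) are made up for illustration, and the helper name `select_best` is hypothetical. The source does not say how ties are broken, so taking the first matching index is an assumption.

```python
from typing import List, Tuple


def select_best(sf: List[str], pa: List[float]) -> Tuple[int, List[str]]:
    """Return the index of max(PA) and the features of SF up to that index."""
    best = max(pa)           # max(PA): the highest prediction accuracy
    idx = pa.index(best)     # first index whose element equals max(PA) (assumed tie rule)
    return idx, sf[: idx + 1]  # elements of SF with index <= index of max(PA)


# SF in descending order of importance; Fx and Fy are placeholders, not real features.
sf = ["F60", "F349", "F41", "F289", "F427", "F65", "F66", "F154", "Fx", "Fy"]
# Illustrative accuracies; only pa_7 = 0.72 is taken from the text.
pa = [0.60, 0.62, 0.65, 0.66, 0.68, 0.69, 0.70, 0.72, 0.71, 0.70]

idx, selected = select_best(sf, pa)
print(idx)       # 7
print(selected)  # ['F60', 'F349', 'F41', 'F289', 'F427', 'F65', 'F66', 'F154']
```

The model built in iteration `idx` is the one that used exactly the features in `selected`, so it is the model that would be chosen.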

FIGS. 4 to 9 are exemplary diagrams for explaining experimental results according to an embodiment of the present invention.

In the present invention, an experimental implementation was carried out using open-source R version 3.4.3 to verify the validity of the proposed failure prediction model. The SECOM data set provided by the UCI repository was used for this purpose. The data set consists of 1567 records and 591 features, collected from a semiconductor manufacturing process by monitoring sensors and process measurement points. The data of 590 features were measured by various sensors, and the remaining feature is the in-house line test result, labeled Pass or Fail.

In this experiment, the validity coefficient was set to 0.1 for feature removal. That is, features with 10% or more NA data and features with zero variance were removed from the data set. This feature removal reduced the number of features from 591 to 393.
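
As a minimal sketch of this removal rule, assuming that NA values are represented as `None` and that every feature column is non-empty, the filter could look like the following. The function name `remove_invalid_features` and the sample columns are hypothetical.

```python
from typing import Dict, List, Optional


def remove_invalid_features(
    data: Dict[str, List[Optional[float]]],
    validity_coefficient: float = 0.1,
) -> Dict[str, List[Optional[float]]]:
    """Keep features with an NA ratio below the coefficient and non-zero variance."""
    kept = {}
    for name, values in data.items():
        na_ratio = sum(v is None for v in values) / len(values)
        observed = [v for v in values if v is not None]
        has_variance = len(set(observed)) > 1  # a constant column has zero variance
        if na_ratio < validity_coefficient and has_variance:
            kept[name] = values
    return kept


# Ten records per feature (made-up values)
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],   # kept
    "b": [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 10% NA: removed
    "c": [5.0] * 10,                                            # zero variance: removed
    "d": [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.1, 0.2],    # kept
}
print(list(remove_invalid_features(data)))  # ['a', 'd']
```

The comparison is strict (`<`), so a feature with exactly 10% NA data is removed, matching the "10% or more" wording.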

The ratio of the training data set to the test data set was set to 7:3. The random forest technique and the caret package were used to measure the importance of each feature. n, mtry, and ntree were set to 1000, 19, and 500, respectively. With these settings, 500 decision trees were built using randomly generated 1000 × 19 matrices. The importance of each feature was measured by the mean decrease in Gini (MDG).

FIG. 4 shows the importance of the top 30 features. The x-axis and y-axis represent the MDG and the features, respectively. The figure shows that F60 has the highest MDG (1.52) among all features.

Meanwhile, in this experiment, the importance threshold was set to 0.7 for the iterative feature selection process. max was therefore determined to be 24, meaning that the number of iterations is 24.

The training data set contains 70 Fail and 1038 Pass records. This imbalance makes it difficult for the failure prediction model to predict failure cases. To address this, sampling was performed before building the failure prediction model: some Pass data were removed to reduce their influence on model building. Using the feature selection results and the sampled training data set, the failure prediction model was built with an SVM, using the e1071 package in R. The table in FIG. 5 shows the obtained SF and PA. In the table of FIG. 5, max(PA) is 0.72 and its index is 7. Consequently, eight features (F60, F349, F41, F289, F427, F65, F66, and F154) are selected.
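
The text states only that some Pass data were removed; it does not give the sampling method or the resulting class ratio. The sketch below therefore assumes random undersampling of the Pass class down to the size of the Fail class, purely for illustration. The function name `undersample` and the placeholder records are hypothetical.

```python
import random
from typing import List, Tuple

Row = Tuple[List[float], str]  # (feature values, label)


def undersample(rows: List[Row], majority: str = "Pass", seed: int = 0) -> List[Row]:
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    minority_rows = [r for r in rows if r[1] != majority]
    majority_rows = [r for r in rows if r[1] == majority]
    k = min(len(majority_rows), len(minority_rows))
    return minority_rows + rng.sample(majority_rows, k)


# Placeholder training set with the class counts given in the text
rows = [([float(i)], "Fail") for i in range(70)]
rows += [([float(i)], "Pass") for i in range(1038)]

balanced = undersample(rows)
print(sum(1 for r in balanced if r[1] == "Fail"))  # 70
print(sum(1 for r in balanced if r[1] == "Pass"))  # 70
```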

For performance evaluation, the prediction accuracy of the proposed model was compared with that of existing models. Specifically, three existing models were considered: two built on a fixed number of features (12 and 24 features) and one built on all features. FIGS. 6 to 9 show the ROC (receiver operating characteristic) curves of the proposed model and the three existing models, which use different numbers of features. The ROC curve is a performance measure of a prediction model and represents the relationship between the true positive rate (TPR) and the false positive rate (FPR).

Here, TPR and FPR may be expressed as in Equations 8 and 9 below.

[Equation 8]

TPR = TP / (TP + FN)

[Equation 9]

FPR = FP / (FP + TN)

Here, TP, FN, FP, and TN denote the numbers of true positives, false negatives, false positives, and true negatives, respectively.
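
Both rates follow directly from the four confusion counts, using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A minimal Python sketch, with a hypothetical helper name and made-up counts:

```python
from typing import Tuple


def tpr_fpr(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Return (TPR, FPR) from the confusion counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were predicted positive
    fpr = fp / (fp + tn)  # share of actual negatives that were predicted positive
    return tpr, fpr


# Made-up counts: 10 actual positives and 20 actual negatives
tpr, fpr = tpr_fpr(tp=8, fn=2, fp=5, tn=15)
print(tpr, fpr)  # 0.8 0.25
```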

In an ROC curve, the area under the curve (AUC) is used to evaluate the prediction accuracy of a model: the larger the AUC, the higher the prediction accuracy. FIG. 6 shows that the failure prediction model built on iterative feature selection has the largest AUC among the models. This is because models are built iteratively with different numbers of features and the one with the highest prediction accuracy is selected.
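
To make the AUC concrete, the following sketch builds the ROC points by sweeping the decision threshold over a model's scores and integrates them with the trapezoidal rule. It assumes that label 1 marks the positive class, that both classes are present, and that the scores are distinct; the function names and sample values are illustrative only.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, lowering the threshold one sample at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and scores: three positives, three negatives
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
print(round(area, 4))  # 0.8889
```

In this example eight of the nine positive-negative pairs are ranked correctly, which is why the area equals 8/9.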

Furthermore, as shown in FIGS. 7 and 8, when a fixed number of features is used to build the model, the prediction accuracy may be relatively lower.

That is, because irrelevant features make it difficult to build an accurate prediction model, prediction accuracy may be relatively lower when a fixed number of features is used to build the model.

In other words, building a model with more features that are unrelated to failure may reduce its prediction accuracy. Accordingly, when all features of the data set are used, as shown in FIG. 9, the AUC decreases considerably. Quantitatively, the proposed model achieves AUCs that are 14.3% and 22.0% higher than those of the fixed-feature-count models and the all-features model, respectively.

In contrast, the present invention builds failure prediction models iteratively using different numbers of features and, moreover, builds the failure prediction model from selectively obtained failure-related features, so that, as shown in FIG. 9, it can achieve better results than existing models.

FIG. 10 is a block diagram of an apparatus according to an embodiment of the present invention.

The processor 102 may include one or more cores (not shown), a graphics processing unit (not shown), and/or a connection path (for example, a bus) for exchanging signals with other components.

The processor 102 according to an embodiment performs the method described with reference to FIG. 3 by executing one or more instructions stored in the memory 104.

Meanwhile, the processor 102 may further include a random access memory (RAM, not shown) and a read-only memory (ROM, not shown) that temporarily and/or permanently store signals (or data) processed inside the processor 102. The processor 102 may also be implemented as a system on chip (SoC) including at least one of a graphics processing unit, a RAM, and a ROM.

The memory 104 may store programs (one or more instructions) for the processing and control of the processor 102. The programs stored in the memory 104 may be divided into a plurality of modules according to function.

The steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in a random access memory (RAM), a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

The components of the present invention may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware. The components of the present invention may be implemented through software programming or software elements. Similarly, embodiments may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

100: server
200-1 to 200-5: a plurality of IoT devices

Claims

A method for acquiring a failure prediction model, the method comprising:
obtaining, by a server, data from a plurality of sensors;
obtaining, by the server, a plurality of features for each of the plurality of sensors based on the data;
obtaining, by the server, a first feature set by removing, from among the plurality of features, features corresponding to invalid data;
obtaining, by the server, a second feature set by replacing the data of features in the first feature set whose data is missing;
normalizing, by the server, the second feature set; and
dividing, by the server, the normalized data,
wherein the normalizing is performed based on Equation 1 below,
wherein the method further comprises:
obtaining, by the server, the importance of each feature based on the normalized data;
selecting, by the server, features based on the obtained importance, and generating a plurality of models based on the selected features; and
selecting, by the server, at least one model from among the plurality of generated models,
wherein the plurality of models are generated by Algorithm 1 below,
wherein the selecting of the at least one model comprises:
obtaining a selected feature set (SF) while increasing cnt until the value of cnt reaches max according to Algorithm 1; and
obtaining a prediction accuracy set (PA) by inputting the selected feature set into a failure prediction model as an input value,
wherein the method further comprises:
obtaining max(PA), corresponding to the pa_i having the highest prediction accuracy among pa_0 to pa_max-1 obtained through the obtaining of the selected feature set (SF) and the obtaining of the prediction accuracy set (PA);
obtaining the index of max(PA) to obtain at least one feature having a prediction accuracy equal to or greater than a preset value; and
selecting the at least one model based on max(PA) and the at least one feature having a prediction accuracy equal to or greater than the preset value.
[Equation 1]
x_i' = (x_i - m) / s
Here, x_i' is the (i+1)-th normalized data value of the feature, x_i is the (i+1)-th data value of the feature, m is the mean of the feature, and s is the standard deviation of the feature.
[Algorithm 1]

Here, cnt is the counter value, SF is the selected feature set, PA is the prediction accuracy set, sf_cnt is the feature selected at iteration cnt, IMP = [imp_0, ..., imp_j, ..., imp_nf-1] is the importance set of the features, F = [f_0, ..., f_j, ..., f_nf-1] is the set of features, imp_j is the importance of the (j+1)-th feature, nf is the number of features, and modelBuildEvalFunction(SF) is an arbitrary function for building and evaluating a prediction model.

(Deleted)

The method of claim 1, further comprising:
obtaining, by the server, training data based on the normalized data; and
obtaining, by the server, the importance of features corresponding to the training data based on a random forest technique.

(Deleted)

An apparatus comprising:
a memory storing one or more instructions; and
a processor executing the one or more instructions stored in the memory,
wherein the processor performs the method of claim 1 by executing the one or more instructions.

A computer program stored in a computer-readable recording medium and combined with a computer, which is hardware, to perform the method of claim 1.