KR102072836B1

KR102072836B1 - Data classification method based on correlation, and a computer-readable storage medium having program to perform the same

Info

Publication number: KR102072836B1
Application number: KR1020190079144A
Authority: KR
Inventors: 서호준
Original assignee: 비앤에프테크놀로지 주식회사
Priority date: 2019-07-02
Filing date: 2019-07-02
Publication date: 2020-02-03
Also published as: KR20190082715A

Abstract

A data classification method that takes correlation into account, and a computer-readable storage medium storing a program for performing the method, are provided. According to an embodiment of the present invention, the method includes: receiving a dataset including first and second data, which are operation data received from equipment of a plant; determining a correlation factor of the dataset; and setting the dataset as a first group based on the correlation factor.
A computer-readable storage medium according to an embodiment of the present invention stores a program for performing the data classification method.

Description

DATA CLASSIFICATION METHOD BASED ON CORRELATION, AND A COMPUTER-READABLE STORAGE MEDIUM HAVING PROGRAM TO PERFORM THE SAME

The present invention relates to a data classification method that takes correlation into account and to a computer-readable storage medium. More particularly, it relates to a method of grouping data according to the similarity between data, and to a computer-readable storage medium storing a program for performing the method.

In general, industrial plants contain many pieces of equipment, and their operation is monitored so that action can be taken before a serious problem occurs.

A power plant, for example, consists of subsystems such as a turbine and auxiliary equipment system, a generator and auxiliary equipment system, a boiler and auxiliary equipment system, a main feedwater system, a condensate system, a fuel supply system, a cooling water system, a circulating water system, and an auxiliary steam system. The turbine and auxiliary equipment system in turn consists of a high-pressure turbine, an intermediate-pressure turbine, a low-pressure turbine, a main steam control valve system, a main steam stop valve system, a turbine speed control system, a turbine extraction system, a turbine bearing lubricating oil system, and so on. Each of these systems is itself made up of unit devices or detailed subsystems, and the facilities operate in close coordination to produce electricity. When the operating state of a facility departs from normal or its performance degrades, an alarm is raised; when continued operation becomes dangerous, the facility or the entire plant is forcibly shut down.

Therefore, for a plant to produce its intended product at the desired quality level and cost, the operating status of the subsystems that make up the plant must be monitored continuously in real time so that optimal operating conditions and performance are maintained.

By its nature, a plant includes a plurality of modules and a plurality of sensors that monitor them, and as many as tens of thousands of detection signals per unit time must be received from these sensors to determine whether the plant is operating normally. In a power plant, a minor failure in one module can cause a serious malfunction of the entire plant. If a power plant stops operating, the consequences go beyond the plant's own maintenance costs: in seasons of high electricity demand, the stoppage can lead directly to a large-scale outage such as a blackout, and the resulting economic loss can be severe.

In existing industrial plants, when an abnormal signal occurs, experts identify its cause by checking the plant system diagram (P&ID) item by item. However, because a plant produces tens of thousands of operating signals, observing them and analyzing the relationships among them to find the abnormal condition and the failed device is very difficult and time-consuming. A systematized monitoring technology is therefore required that, when an abnormal signal occurs, quickly identifies the failed device and its ripple effect on the system, so that appropriate corrective action can be taken and plant operating stability improved.

The technical problem to be solved by the present invention, conceived in view of the problems above, is to provide a data classification method that determines correlation factors between received data and groups the data based on those correlation factors, and a computer-readable storage medium storing a program for performing the method.

Another technical problem to be solved by the present invention is to provide a data classification method that filters abnormal data out of a dataset based on correlation factors, and a computer-readable storage medium storing a program for performing the method.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the technical problems mentioned above, a data classification method considering correlation according to an embodiment of the present invention includes: receiving a dataset including first and second data, which are operation data received from equipment of a plant; determining a correlation factor of the dataset; and setting the dataset as a first group based on the correlation factor.

The determining step may include determining coordinates of the dataset for each unit time and plotting the coordinates.

The determining step may further include extracting a continuous section of the plotted coordinates, and the correlation factor may be determined based on the slope of the continuous section.

The determining step may further include determining the distribution of the plotted coordinates.

In the setting step, the first group may be set based on the absolute value of the correlation factor.

The setting step may further include setting the dataset into a plurality of groups including a second group.

In the setting step, abnormal data in the dataset may be filtered based on the correlation factor.

A computer-readable storage medium according to an embodiment of the present invention stores a program for performing a data classification method considering correlation, the method including: receiving a dataset including first and second data, which are operation data received from equipment of a plant; determining a correlation factor of the dataset; and setting the dataset as a first group based on the correlation factor.

According to the present invention as described above, received datasets are grouped into a plurality of groups based on correlation factors, which can improve the efficiency of learning from the data received from the plurality of equipment that make up a plant. In addition, abnormal data is filtered out, so the dataset representing the state of the plant can be monitored more accurately.

FIG. 1 is a diagram showing the schematic flow of a data classification method considering correlation according to an embodiment of the present invention.
FIG. 2 is a diagram showing the process of determining a correlation factor in the data classification method considering correlation according to an embodiment of the present invention.
FIG. 3 is a diagram showing the process of grouping a dataset in the data classification method considering correlation according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the invention pertains of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless expressly and specifically defined.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components other than those mentioned.

Hereinafter, a data classification method considering correlation according to embodiments of the present invention, and a computer-readable storage medium storing a program for performing the method, are described with reference to the drawings.

FIG. 1 shows the schematic flow of a data classification method considering correlation according to an embodiment of the present invention. FIG. 2 shows the process of determining a correlation factor in that method, and FIG. 3 shows the process of grouping a dataset in that method.

Specifically, the data classification method considering correlation according to this embodiment includes: receiving a dataset including first and second data, which are operation data received from equipment of a plant (S10); determining a correlation factor of the dataset (S20, S30); and setting the dataset as a first group based on the correlation factor (S40).

In a plant, a plurality of modules are often organically coupled and closely affect one another. A plant therefore includes, by its nature, a plurality of modules and a plurality of sensors that monitor them in real time. As a way of determining whether the plant is operating normally, this embodiment can manage the history of the datasets received from the plurality of equipment that make up the plant. The plant can learn from the received datasets and compare them with the plant's current state to determine or predict whether the plant is operating normally or abnormally.

However, as many as tens of thousands of detection signals per unit time must be received from the sensors to determine whether the plant is operating normally, and abnormal data can reduce the efficiency of learning. A method of learning effectively from the received dataset, and a method of improving the accuracy of the dataset to be learned, are therefore required.

In this embodiment, the received dataset is grouped in advance according to similarity so that the data groups needed for a given piece of equipment can be learned selectively, and abnormal data groups are filtered out in advance to improve learning efficiency.

To this end, the step of receiving a dataset including first and second data, which are operation data received from the plurality of equipment that make up the plant (S10), may be performed first. The operation data included in the received dataset may differ in unit or scale. For example, the equipment of a power plant includes a cooling water pump, a high-pressure turbine, a boiler, current-collecting facilities, a generator, and so on, and the operation data received from such equipment may have various units or scales, such as °C, kW, tons, and rpm.

Accordingly, in this embodiment, a scale correction process between data can prevent data distortion caused by differing units or scales. One possible scale correction method extracts a conversion factor using the average of the maximum and minimum values and corrects the data scale based on that conversion factor. The process is not limited to this, however, and various correction or interpolation techniques may be used.
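The scale correction described above can be sketched as follows. This is only an illustration under stated assumptions: the conversion factor is taken to be the midpoint of each series' maximum and minimum, and each value is divided by it. The source does not fix the exact formula, and the function name `rescale_by_midrange` and the sample values are hypothetical.

```python
def rescale_by_midrange(series):
    """Divide every value by the mean of the series' maximum and minimum.

    Assumption: the "conversion factor" is (max + min) / 2. The source does
    not specify the formula, so this is one plausible reading.
    """
    factor = (max(series) + min(series)) / 2
    if factor == 0:
        raise ValueError("midrange is zero; choose another correction method")
    return [value / factor for value in series]


# Two signals in different units end up on a comparable scale.
temperature_c = [480.0, 500.0, 520.0]   # degrees Celsius (made-up values)
speed_rpm = [2900.0, 3000.0, 3100.0]    # revolutions per minute (made-up values)
scaled_temperature = rescale_by_midrange(temperature_c)
scaled_speed = rescale_by_midrange(speed_rpm)
```

After rescaling, both series are centered near 1, so neither dominates a later comparison merely because of its unit.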

Next, the step of extracting the first data and the second data from the received dataset (S20) may be performed. Various criteria may be applied for this extraction: for example, extracting in order of importance for determining whether the plant or equipment is operating normally, selecting at random from the dataset, or extracting the data that comes earlier in the time series.

The step of determining the correlation factor of the extracted first and second data (S30) may then be performed. In this embodiment, the correlation factor may be determined using the slope, the distribution, and the like, but the determination is not limited to these and may be made in various ways. The specific process of determining the correlation factor is described later with reference to FIG. 2.

Finally, the step of grouping the dataset based on the determined correlation factor (S40) may be performed. The correlation factor generated from each item of operation data is compared with a reference correlation factor, and operation data with similar degrees of correlation can be grouped together. So that a negative correlation factor and a positive correlation factor of the same magnitude fall into the same group, the grouping step (S40) may be performed on the absolute value of each correlation factor.
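The role of the absolute value in step S40 can be illustrated with a minimal sketch. The function name `in_same_group` and the `tolerance` parameter are assumptions made for illustration; the source does not say how close two correlation factors must be to count as similar.

```python
def in_same_group(r, reference_r, tolerance=0.1):
    """Return True when |r| is within `tolerance` of |reference_r|.

    `tolerance` is a hypothetical parameter; the source does not state
    how close two correlation factors must be to count as similar.
    """
    return abs(abs(r) - abs(reference_r)) <= tolerance


# A strong negative and a strong positive correlation land in the same group.
negative_matches = in_same_group(-0.92, 0.90)
# A weak correlation does not match a strong reference.
weak_matches = in_same_group(0.30, 0.90)
```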

In this embodiment, K-nearest neighbors (K-NN) and the Gaussian mixture model (GMM) are used as classifiers for the operation data. The classification is not limited to these, however, and the data may be grouped using various classification algorithms, for example by comparing each correlation factor with the reference correlation factor and sequentially grouping the operation data with similar degrees of correlation into a plurality of groups.

The K-nearest neighbors classifier finds the K data points closest to a given data point and assigns the point to the group to which the largest number of those K points belong. K-nearest neighbors is a nonparametric learning algorithm and a kind of distance-based classifier.
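A minimal sketch of the K-nearest neighbors rule described above, applied to one-dimensional values such as absolute correlation factors. The function name `knn_assign`, the group labels, and the sample values are hypothetical; the source does not specify the features or the distance used.

```python
from collections import Counter


def knn_assign(value, labeled_values, k=3):
    """Assign `value` to the majority group among its k nearest neighbors.

    `labeled_values` is a list of (value, group_label) pairs; distance is
    the absolute difference between values (an assumption for this sketch).
    """
    nearest = sorted(labeled_values, key=lambda item: abs(item[0] - value))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Already-grouped absolute correlation factors (made-up values).
labeled = [(0.95, "strong"), (0.90, "strong"), (0.85, "strong"),
           (0.20, "weak"), (0.15, "weak"), (0.10, "weak")]
group_high = knn_assign(0.80, labeled, k=3)
group_low = knn_assign(0.25, labeled, k=3)
```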

The Gaussian mixture model (GMM) is also useful for classifying data and is generally used as a parametric learning algorithm. It is a density estimation method that improves on modeling the distribution density of a given dataset with a single probability density function. A Gaussian mixture model represents the data as a linear combination of Gaussian distributions, each of which can describe one group. Its parameters can be inferred from training data using the expectation-maximization (EM) algorithm.
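To make the mixture-model idea concrete, the following is a small sketch of a one-dimensional, two-component Gaussian mixture fitted with expectation-maximization (EM). It is not the patent's implementation: the function names, the initialization (means at the data minimum and maximum), and the sample values are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_gmm(data, iterations=30):
    """Fit a 1-D, two-component Gaussian mixture with EM."""
    lo, hi = min(data), max(data)
    if lo == hi:
        raise ValueError("data must contain at least two distinct values")
    means = [lo, hi]                          # assumed initialization
    variances = [((hi - lo) / 4) ** 2] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)     # floor to avoid a zero variance
    return weights, means, variances, resp


# Absolute correlation factors forming two clusters (made-up values).
abs_factors = [0.10, 0.15, 0.20, 0.85, 0.90, 0.95]
weights, means, variances, resp = fit_two_component_gmm(abs_factors)
# Each point joins the component with the larger responsibility.
labels = [0 if r[0] >= r[1] else 1 for r in resp]
```

Each fitted component plays the role of one group, and a point is assigned to the component that most probably generated it.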

In addition, in the grouping step (S40), abnormal operation data may be filtered out of the dataset based on the correlation factor. By comparing correlation factors and filtering out in advance the abnormal data whose degree of correlation falls outside a given range, more precise plant monitoring and data learning become possible. Here, abnormal data may mean data that does not reach a predetermined degree of correlation, and abnormal data falling within the abnormal category may also be sequentially grouped into the plurality of groups according to its degree of correlation. When learning is performed on the grouped abnormal data, abnormal operation can be determined by comparison with the current state of the plant equipment.

FIG. 2 shows the process of determining the correlation factor in the data classification method considering correlation according to an embodiment of the present invention.

First, the step of extracting the first data and the second data from the received dataset (S100) may be performed. The first data may be used as reference data, and the rest of the dataset, including the second data, as comparison data. As described above, the reference data may be data that serves as an important indicator for plant learning or monitoring, but it is not limited to this and may be chosen according to various criteria.

Next, the step of determining the coordinates of the first data and the second data for each unit time (S200) may be performed. Because the data generated by plant equipment is generally time-series data, each item of data can be arranged along the time axis. As in the embodiment described with reference to FIG. 1, the dataset may undergo scale correction so that it is not affected by unit or scale.

The step of plotting the coordinates of the first data and the second data (S300) may then be performed. For example, the first data can be placed on the x-axis and the second data on the y-axis, and the coordinates for each unit time plotted in this two-dimensional region. If, at the first unit time, the value of the first data is 3 and the value of the second data is 7, the point is plotted as D1(3,7). The coordinates of the first and second data at each unit time can be plotted in the same way.
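Forming the per-unit-time coordinates amounts to pairing the two series sample by sample. In the short sketch below, only the first point D1(3,7) comes from the text; the remaining values and the variable names are made up for illustration.

```python
# One value per unit time for each signal; only the first pair (3, 7)
# is taken from the description, the rest are illustrative.
first_data = [3, 4, 6, 8]      # plotted on the x-axis
second_data = [7, 9, 12, 17]   # plotted on the y-axis

# Pair the samples that share a unit time: D1, D2, D3, ...
coordinates = list(zip(first_data, second_data))
d1 = coordinates[0]
```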

Depending on the criterion, either the step of extracting a continuous section of the coordinates (S410) or the step of determining the distribution of the coordinates (S450) may be performed on the plotted first and second data. The former step (S410) turns the discrete data coordinates into a continuous section. Once the continuous section has been extracted (S410), the step of extracting its slope (S420) may be performed.
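The source does not say how the continuous section or its slope is obtained. One plausible reading, shown here purely as an assumption, is to fit a straight line through the plotted points by least squares and take its slope. The function name `least_squares_slope` and the sample points are hypothetical.

```python
def least_squares_slope(points):
    """Slope of the least-squares line through (x, y) points.

    Assumption: a straight-line fit stands in for the "continuous
    section"; the source does not name a fitting method.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x values are constant; slope is undefined")
    return sxy / sxx


# Points lying exactly on y = 2x (made-up values).
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
slope = least_squares_slope(points)
```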

After steps S410 to S450, the step of determining the correlation factor using the slope or the distribution (S500) may be performed.

In this embodiment, the correlation factor may be determined using the Pearson correlation coefficient, but it is not limited to this and may be determined according to various criteria such as the characteristics of each piece of equipment or importance weights of the data.

A correlation factor is a statistic that expresses how two variables vary together. In general, the terms correlation factor and correlation coefficient refer to the Pearson correlation coefficient. Pearson's product-moment correlation coefficient, Pearson's r, r, and R are all other names for it. The Pearson correlation coefficient takes values from -1 to 1. When the correlation factor between two variables is positive, the relationship is called a positive correlation; in other words, when the first variable increases, the second variable tends to increase as well.

For two variables x and y, suppose n observations (x1,y1), (x2,y2), ..., (xn,yn) are given, and let m(x) and m(y) denote the means of x and y. The correlation factor r for x and y is then determined as follows.

First, the deviation of each variable from its mean is obtained. For the first observation (x1,y1), the deviation of x1 is x1 - m(x) and the deviation of y1 is y1 - m(y); likewise, for the i-th observation, the deviations of x and y are xi - m(x) and yi - m(y).

Next, the covariance is obtained. Whereas the variance shows how a population is distributed around the mean of a single variable, the covariance is an indicator of the relationship between two variables. It is obtained by multiplying the deviations of the two variables and averaging the products. The covariance is given by Equation 1.
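Equation 1 itself is not reproduced in this text. From the surrounding description (products of the deviations, averaged with a denominator of n-1), it would presumably read as follows, where the symbol cov(x, y) is notation assumed here:

$$
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)\bigl(y_i - m(y)\bigr) \tag{1}
$$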

In Equation 1, the denominator of the average over n terms is n-1 rather than n; this is because of the degrees of freedom. As noted above, xi - m(x) and yi - m(y) are deviations, and the sum of the deviations must always be 0. Equations 2 and 3 must therefore be satisfied.
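Equations 2 and 3 are likewise not reproduced in this text. Given the statement that the deviations of each variable sum to zero, they would presumably be:

$$
\sum_{i=1}^{n} \bigl(x_i - m(x)\bigr) = 0 \tag{2}
$$

$$
\sum_{i=1}^{n} \bigl(y_i - m(y)\bigr) = 0 \tag{3}
$$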

For the sum of the deviations to be 0 however the observations vary, one of the values must be determined, and thus fixed, by all the others. The number of observations that can vary freely is the degrees of freedom, which in this embodiment is n-1.

Finally, the standard deviations of the two variables are obtained, and the covariance is divided by the product of the standard deviations to obtain the correlation factor. The standard deviation of x can be expressed as Equation 4.
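Equation 4 is not reproduced in this text. Assuming the usual sample standard deviation with the n-1 denominator mentioned in the description, and writing s_x for the standard deviation of x (notation assumed here), it would presumably read:

$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)^2} \tag{4}
$$

The standard deviation s_y of y follows in the same way with y_i and m(y).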

Here too, because of the degrees of freedom, the denominator is n-1 rather than n. The correlation factor r is then the covariance divided by the product of the two standard deviations.
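The equation for r is not reproduced in this text. Written out from the definitions above (covariance divided by the product of the standard deviations, each with an n-1 denominator), it would presumably be:

$$
r = \frac{\dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)\bigl(y_i - m(y)\bigr)}
{\sqrt{\dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)^2}\;
 \sqrt{\dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n} \bigl(y_i - m(y)\bigr)^2}}
$$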

Here, the n-1 terms in the numerator and the denominator cancel and can be ignored. The correlation factor r can therefore also be calculated from the sum of the deviation products and the sums of the squared deviations alone, without the n-1 factors.
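The simplified equation is not reproduced in this text. The numerator carries a factor 1/(n-1) and the denominator carries two factors of its square root, so the factor cancels, and the result would presumably be:

$$
r = \frac{\displaystyle\sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)\bigl(y_i - m(y)\bigr)}
{\sqrt{\displaystyle\sum_{i=1}^{n} \bigl(x_i - m(x)\bigr)^2}\;
 \sqrt{\displaystyle\sum_{i=1}^{n} \bigl(y_i - m(y)\bigr)^2}}
$$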

As noted above, the value of the correlation factor r lies in the range from -1 to 1, inclusive. Because the correlation factor has no unit, it is defined independently of the units of measurement. The correlation factor also has no direction: the correlation coefficient of x and y equals the correlation coefficient of y and x. Using the procedure above, the correlation factor between the first and second variables, represented by a slope or a distribution, can be determined.
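The procedure above (deviations, sum of deviation products, square roots of the summed squared deviations) can be sketched directly in code. The function name `pearson_r` and the sample values are chosen for illustration; the checks at the end exercise the symmetry and unit-independence properties just described.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]          # x_i - m(x)
    dev_y = [y - mean_y for y in ys]          # y_i - m(y)
    numerator = sum(a * b for a, b in zip(dev_x, dev_y))
    denominator = math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    if denominator == 0:
        raise ValueError("a constant series has no defined correlation")
    return numerator / denominator


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up values
y = [2.0, 4.0, 5.0, 4.0, 5.0]   # made-up values
r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                            # swapping the arguments
r_scaled = pearson_r([v * 1000 for v in x], y)    # changing the unit of x
```

For these values the sum of deviation products is 6 and the summed squared deviations are 10 and 6, so r is 6 divided by the square root of 60, about 0.775.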

마지막으로, 제1 데이터와 제2 데이터의 상관인자를 기초로, 제1 데이터와 데이터셋의 상관인자를 비교하여 그룹화하는 단계(S600)를 수행할 수 있다.Finally, in operation S600, the correlation factors of the first data and the data set may be compared and grouped based on the correlation factors of the first data and the second data.

도 3을 참조하면, 본 발명의 일 실시예에 따른 상관도를 고려한 데이터 분류 방법에서 데이터셋의 그룹화 과정이 개시된다.Referring to FIG. 3, a grouping process of datasets is disclosed in a data classification method considering a correlation according to an embodiment of the present invention.

데이터셋(10)에 포함된 복수의 운전데이터에서 기준데이터를 결정하고, 기준데이터인 제1 데이터와 와 비교데이터인 제2 데이터에서 기준 상관인자(20)를 결정한다. 제1 데이터와 나머지 데이터셋에 포함된 운전데이터와의 상관인자를 결정하여, 기준 상관인자(20)와 비교하여 상관도에 따라 데이터셋을 제1 그룹을 포함은 복수의 그룹(12)으로 그룹화 할 수 있다. 데이터셋을 분류하기 위한 방법으로, K-근접이웃 또는 가우시안 혼합 모델(GMM)을 이용할 수 있다.The reference data is determined from the plurality of operation data included in the data set 10, and the reference correlation factor 20 is determined from the first data which is the reference data and the second data which is the comparison data. A correlation factor between the first data and operation data included in the remaining data sets is determined, and the data set is grouped into a plurality of groups 12 including the first group according to the degree of correlation by comparing with the reference correlation factor 20. can do. As a method for classifying datasets, K-Near or Gaussian Mixed Model (GMM) can be used.

Meanwhile, data that do not meet a predetermined degree of correlation may be judged to be abnormal data (11) and filtered out. The remaining data groups (12), excluding the filtered abnormal data (11), may be selectively learned according to the characteristics of the plant or equipment.

Alternatively, the data that do not meet the predetermined degree of correlation may be selectively learned and compared with the current state of the plant equipment to determine whether the equipment is operating abnormally.
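Both options above start from the same split of the dataset into data that meet the predetermined degree of correlation and data that do not. A minimal sketch of that split follows; it assumes the Pearson coefficient as the correlation factor and treats "predetermined degree of correlation" as a lower bound on |r|, which is an interpretation rather than something the text specifies. The names `split_by_correlation` and `min_abs_r` and the sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (assumed correlation factor)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def split_by_correlation(dataset, reference, min_abs_r=0.8):
    """Return (normal, abnormal) dicts mapping series name to its correlation
    with the reference data. `min_abs_r` is a hypothetical threshold."""
    ref = dataset[reference]
    normal, abnormal = {}, {}
    for name, series in dataset.items():
        if name == reference:
            continue
        r = pearson(ref, series)
        target = normal if abs(r) >= min_abs_r else abnormal
        target[name] = r
    return normal, abnormal

# Hypothetical operation data.
dataset = {
    "load": [10.0, 20.0, 30.0, 40.0, 50.0],
    "current": [1.1, 2.0, 3.2, 3.9, 5.1],
    "noise_sensor": [3.0, 9.0, 1.0, 7.0, 4.0],
}

normal, abnormal = split_by_correlation(dataset, "load")

# Option 1: learn only from the data that meet the threshold.
training_names = sorted(normal)
# Option 2: learn from the non-conforming data instead.
abnormal_names = sorted(abnormal)
```

Which of the two sets is passed to the learning step depends on whether the goal is to model normal operation or to characterize abnormal operation.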

Meanwhile, the present invention may be implemented by storing computer-readable code in a computer-readable storage medium. The computer-readable storage medium includes all kinds of storage devices in which data readable by a computer system are stored.

The computer-readable code is configured so that, when it is read from the computer-readable storage medium and executed by a processor, it performs the steps that implement the data classification method considering abnormal correlation according to the present invention. The computer-readable code may be implemented in various programming languages. Functional programs, code, and code segments for implementing the embodiments of the present invention can be readily programmed by those of ordinary skill in the art to which the present invention pertains.

Examples of the computer-readable storage medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementation in the form of a carrier wave (for example, transmission over the Internet). The computer-readable storage medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

10: dataset
11: abnormal data
12: data group
20: correlation factor

Claims

A data classification method considering correlation, performed by a device interworking with plant equipment, the method comprising:
receiving a dataset from the plant equipment, the dataset comprising a plurality of operation data;
determining a correlation factor of the dataset; and
grouping the dataset into a plurality of groups based on the correlation factor,
wherein receiving the dataset comprises,
when the plurality of operation data included in the dataset have different scales, performing scale correction between the plurality of operation data based on a scale conversion factor, and
wherein grouping into the plurality of groups comprises:
grouping the plurality of operation data included in the dataset into the plurality of groups according to the degree of correlation between the plurality of operation data and the correlation factor; and
selectively learning, according to the characteristics of the plant equipment, based on the plurality of operation data for each of the plurality of groups.

The method of claim 1,
wherein determining the correlation factor comprises:
extracting reference data and comparison data from the dataset; and
determining a correlation factor between the reference data and the comparison data, and determining a reference correlation factor from the reference data.

The method of claim 2,
wherein grouping into the plurality of groups comprises
comparing the correlation factor with the reference correlation factor to sequentially group data having similar degrees of correlation among the plurality of operation data into the plurality of groups.

The method of claim 2,
wherein grouping into the plurality of groups comprises
comparing the correlation factor with the reference correlation factor to sequentially group data having different degrees of correlation among the plurality of operation data into the plurality of groups.

The method of claim 1,
further comprising comparing the data selectively learned according to the characteristics of the plant equipment with the current state of the plant equipment.

The method of claim 1,
further comprising monitoring the dataset indicative of the condition of the plant equipment.

A computer-readable storage medium having stored thereon a program for performing the method of claim 1.