KR20220160974A

KR20220160974A - Method and system for determining abnormalities in air quality data using the ensemble structure of supervised and unsupervised learning models

Info

Publication number: KR20220160974A
Application number: KR1020210069288A
Authority: KR
Inventors: 권장우; 이선우; 최정무; 이문형; 양호준; 신혜정; 박지훈; 이지선
Original assignee: 인하대학교 산학협력단; 대한민국 (관리부서 : 환경부 국립환경과학원장)
Priority date: 2021-05-28
Filing date: 2021-05-28
Publication date: 2022-12-06
Also published as: KR102622895B1

Abstract

Disclosed are a method and system for determining an abnormality of the air quality data using an ensemble structure of a supervised learning model and an unsupervised learning model. The method for determining the abnormality of the air quality data according to one embodiment may comprise: a step of receiving the air quality data measured from a plurality of measurement stations into a learning model in which a supervised learning model and an unsupervised learning model are ensembled; and a step of determining whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled.

Description

Method and system for determining abnormality of air quality data using ensemble structure of supervised learning model and unsupervised learning model

The following description relates to a technique for determining anomalies in air quality data using an ensemble structure of a supervised learning model and an unsupervised learning model.

As interest in global warming and abnormal weather has grown in recent years, the number of stations in the nationwide air pollution monitoring network has been increasing. The data sets observed at these measurement stations may contain missing values or outliers caused by, for example, defects in recording devices or natural disasters.

Because the quantity of information does not guarantee its quality, the quality of information matters as much in data analysis as its quantity once such outliers or missing values become numerous. There are two reasons. First, even when a large amount of information is available, if the content of individual records cannot be trusted, the reliability of statistical inference drawn from them cannot be guaranteed either. Second, such data fail to satisfy the condition of being a sample, which is a fundamental characteristic of most statistical data. The essence of a statistical method for explaining a phenomenon is to construct a sample that represents the population and to perform statistical inference from it. If the data collected from the sample are incomplete and their quality cannot be guaranteed, however, it is difficult to guarantee that the data are representative.

Missing values in spatiotemporal data, that is, data with both temporal and spatial components, are a major obstacle to spatiotemporal analysis. To make measurement station data highly reliable, outlier detection and missing value handling in monitoring network data are among the main concerns, because missing values break the data apart spatially and temporally and make it difficult to analyze changes in spatial phenomena. Incomplete data caused by anomalies and missing values can lead to incorrect results through problems such as biased parameter estimation during modeling, so proper handling of missing values is an important part of any analysis. In addition, if attribute values are missing for some regions of the study area, statistical inference about the whole area based on that data is likely to be less accurate than inference based on complete data. Furthermore, if the spatial sample at a time point where values are missing has characteristics that differ from the values obtained from spatial samples at other time points, the result of spatial analysis may be distorted; spatiotemporal data therefore require high reliability.

A method and system for detecting abnormal data in air pollution monitoring network data using artificial intelligence may be provided.

A method and system for performing abnormality determination on air quality data using an ensemble structure of a supervised learning model and an unsupervised learning model may be provided.

A method for determining an abnormality in air quality data may include: receiving air quality data measured at a plurality of measurement stations as input to a learning model in which a supervised learning model and an unsupervised learning model are ensembled; and determining whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled.

The supervised learning model may use a DeepLab V3+ model including deep convolution, and the receiving step may include inputting, to the supervised learning model, air quality data composed of one-dimensional data that are hourly measurement values of a plurality of components.

In the supervised learning model, a feature extractor may be configured by modifying a ResNet34-based model, and the model may be trained such that a first feature map is output as the air quality data passes through the ResNet-based model, a feature map for each component is generated as the output feature map passes through ASSPP (Atrous Separable Spatial Pyramid Pooling), and the generated per-component feature maps are integrated with the first feature map.

The receiving step may include removing noise contained in the air quality data by applying piecewise aggregate approximation to the air quality data, and supplementing the data by applying linear interpolation to the noise data that disappeared when the piecewise aggregate approximation was applied.

The receiving step may include forming sections according to the time at which the air quality data was measured, and removing noise by approximating the average value of the air quality data in each formed section as the representative value of that section.

The determining step may include outputting hourly baseline determination results for the air quality data through the supervised learning model.

The unsupervised learning model may use a generative adversarial network (GAN) composed of a generator and a discriminator, and the receiving step may include inputting time-series air quality data to the unsupervised learning model.

The unsupervised learning model may be composed of an encoder that extracts features and a decoder that restores them, and may be trained to detect abnormal patterns in time-series data using a BeatGan model, which detects abnormal patterns in multivariate time-series data.

The unsupervised learning model may be configured such that training proceeds on time-series data having normal patterns, so that the BeatGan model learns the feature information of normal data.

The determining step may include outputting, through the unsupervised learning model, anomaly values for abnormal data contained in the air quality data.

An abnormality determination system may include: an input unit that receives air quality data measured at a plurality of measurement stations as input to a learning model in which a supervised learning model and an unsupervised learning model are ensembled; and a determination unit that determines whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled.

With supervised learning, performance can be improved by applying piecewise aggregate approximation and removing outliers from the air quality data; with unsupervised learning, the accuracy of anomaly determination for the air quality data can be improved.

FIG. 1 is a diagram for explaining the operation of an abnormality determination system according to an embodiment.
FIG. 2 is a diagram for explaining the structure of the DeepLab V3+ model according to an embodiment.
FIG. 3 is a diagram for explaining the structure of a baseline abnormality determination model according to an embodiment.
FIG. 4 is a diagram comparing data before and after application of piecewise aggregate approximation according to an embodiment.
FIG. 5 is a diagram for explaining the training structure of a GAN model according to an embodiment.
FIG. 6 is a diagram for explaining the training structure of a BeatGan model according to an embodiment.
FIG. 7 shows, according to an embodiment, that high anomaly values are produced for abnormal data.
FIG. 8 is a diagram for explaining the generation of training data according to an embodiment.
FIGS. 9 and 10 are graphs showing experimental results for abnormal data and normal data according to an embodiment.
FIG. 11 is a block diagram for explaining the configuration of an abnormality determination system according to an embodiment.
FIG. 12 is a flowchart for explaining a method of determining an abnormality in air quality data in an abnormality determination system according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

The embodiments describe a method and system for detecting abnormal data in order to improve the reliability of air pollution monitoring network data (for example, the air pollution monitoring network data of the National Institute of Environmental Research). To this end, existing monitoring network data are analyzed and mislabeled labeling data are corrected so that the data suit the training of an artificial intelligence model, which can improve abnormal-data detection performance.

FIG. 1 is a diagram for explaining the operation of an abnormality determination system according to an embodiment.

FIG. 1 visualizes the process of a structure in which a supervised learning model and an unsupervised learning model are ensembled: with air quality data as input, the abnormality determination may proceed using both supervised and unsupervised learning.

Unclear labeling degrades the accuracy of a learning model. It arises because the judgment criteria differ between the persons in charge of labeling, the measuring instruments differ between measurement stations, and the criteria for judging abnormal data differ between instruments. To solve this problem, the embodiment sets consistent judgment criteria for each type of abnormal symptom and then performs the labeling again to generate a training data set. A learning model is selected using the training data set, and the learning model may be built by applying it to the air quality data and training it.

The abnormality determination system may determine whether air quality data is abnormal using artificial intelligence. It may do so through an ensemble model that applies two approaches: supervised learning, which is based on training data together with labeled training data in which the correct answers are marked, and unsupervised learning, which has no correct answers and makes judgments using only the training data. Because the labeling criteria of existing supervised learning differ between measurement stations and between users, the model has considerable difficulty learning from them; performance can therefore be maximized by applying both an unsupervised learning model, which can work without labels, and a supervised learning model.

In the case of a baseline anomaly, the abnormality determination system may determine whether the input data has a baseline anomaly by integrating the result of the supervised model, trained with the labeling of existing data, with the result of the unsupervised model, trained only on normal data with abnormal data excluded.
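The description does not say how the two results are integrated. Purely as an illustration, the sketch below assumes both models emit a per-hour value in [0, 1] and combines them by a weighted average followed by a threshold. The function name, `weight`, and `threshold` are hypothetical and are not taken from the source.

```python
import numpy as np

def ensemble_baseline_anomaly(supervised_prob, unsupervised_score,
                              weight=0.5, threshold=0.5):
    """Combine per-hour outputs of the two models into boolean anomaly flags.

    supervised_prob    : per-hour anomaly probabilities in [0, 1] (supervised model)
    unsupervised_score : per-hour anomaly scores in [0, 1] (unsupervised model)
    weight, threshold  : hypothetical tuning values, not given in the source
    """
    p = np.asarray(supervised_prob, dtype=float)
    s = np.asarray(unsupervised_score, dtype=float)
    if p.shape != s.shape:
        raise ValueError("both model outputs must cover the same hours")
    combined = weight * p + (1.0 - weight) * s
    return combined >= threshold

# Three hours: both models agree on the first two, disagree mildly on the third.
flags = ensemble_baseline_anomaly([0.1, 0.9, 0.4], [0.2, 0.8, 0.7])
```

Other integration rules (a logical OR, or a small model trained on both outputs) would fit the same description equally well.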

FIG. 2 is a diagram for explaining the structure of the DeepLab V3+ model according to an embodiment.

The DeepLab V3+ model is used in semantic segmentation research within the field of deep learning. It is a model for solving segmentation: when an image is given as input, it passes through a deep convolutional neural network (DCNN), generating feature maps of a plurality of sizes. The DeepLab V3+ model integrates the feature maps of a plurality of sizes output by the DCNN with a feature map generated early in the computation, and then expands the result again through convolution operations on the feature maps. Whether there is a baseline anomaly, the class to be determined, may be output through the pixel values of the restored image.

Because the air quality data supplied to DeepLab V3+ are one-dimensional, namely hourly measurement values for the eight components PM10, PM2.5, NO, NO2, NOx, O3, CO, and SO2, a one-dimensional convolutional neural network (1D CNN) may be used.

In the embodiment, the trend of the data is identified, and data values that deviate from that trend may be determined to be baseline anomalies. For example, to observe the trend of each component, at least two months (720 hours) of data, consisting of one month of preceding reference data plus one month of data to be judged, may be set as the input data.
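As a minimal sketch of how such fixed-length inputs could be prepared, the snippet below cuts an hourly series of shape (hours, components) into 720-hour windows. The window length comes from the description; the one-day `stride` and the function name are assumptions made for the example.

```python
import numpy as np

WINDOW_HOURS = 720  # input length stated in the description

def make_windows(series, window=WINDOW_HOURS, stride=24):
    """Split an array of shape (hours, components) into overlapping windows.

    stride=24 (one day) is a hypothetical choice; the source does not give one.
    Returns an array of shape (n_windows, window, components).
    """
    series = np.asarray(series, dtype=float)
    n_hours = series.shape[0]
    if n_hours < window:
        raise ValueError("series is shorter than one window")
    starts = range(0, n_hours - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

hourly = np.zeros((768, 8))      # 32 days of hourly values for 8 components
windows = make_windows(hourly)   # window starts at hours 0, 24 and 48
```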

FIG. 3 is a diagram for explaining the structure of the baseline abnormality determination model. In this structure, the feature extractor may be configured by modifying a ResNet34-based model so that the features of the input data can be extracted as high-dimensional features. The final feature map output by the ResNet passes through three ASSPP (Atrous Separable Spatial Pyramid Pooling) modules, one each for SO2, CO, and O3, so that a feature map for each component may be generated. To extract features at various scales, ASSPP applies convolution filters of different sizes, extracts the resulting features, and then integrates them again. The finally generated per-component features are integrated with the initial features, and the integrated result is expanded to the same length as the original time axis. Ultimately, for the 720-hour (about two months) input and output, an 'hourly baseline determination result' may be output for each component.
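The multi-scale idea behind the pyramid pooling step can be illustrated with plain NumPy. The sketch below applies the same three-tap kernel to a one-dimensional signal at several dilation (atrous) rates and stacks the results. It is not the patented network: the kernel is a fixed moving average rather than learned weights, and the dilation rates are assumed values.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Same-length 1D cross-correlation with a dilated, odd-length kernel."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    pad = dilation * (len(kernel) - 1) // 2
    padded = np.pad(x, pad, mode="edge")   # repeat edge values at the borders
    out = np.zeros_like(x)
    for i, w in enumerate(kernel):
        out += w * padded[i * dilation : i * dilation + len(x)]
    return out

def multi_scale_features(x, dilations=(1, 6, 12, 18)):
    """Stack outputs of one kernel applied at several dilation rates."""
    kernel = np.ones(3) / 3.0              # fixed averaging kernel (illustrative)
    return np.stack([dilated_conv1d(x, kernel, d) for d in dilations])

signal = np.arange(720, dtype=float)       # a 720-hour linear ramp
features = multi_scale_features(signal)    # one row per dilation rate
```

Larger dilation rates let the same small kernel cover a wider span of hours without adding parameters, which is what makes features at several time scales available for integration.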

FIG. 4 is a diagram comparing data before and after application of piecewise aggregate approximation according to an embodiment.

Piecewise aggregate approximation (PAA) may be applied to confirm the air pollution network data (air quality data). Noting that labeling at every hour can include false detections that vary from user to user and that the data are noisy, the existing data are divided into sections according to observation time, and the mean of the data in each section is then taken as the representative value of that section, which minimizes noise. FIG. 4(a) shows the data before PAA is applied, and FIG. 4(b) shows the data after. As FIG. 4 shows, noise can be reduced while the characteristics of the data are preserved.
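The section-mean step described above can be sketched as follows. The segment length is an assumed parameter (the description does not give one), and the output is kept at the input length so that it can be compared hour by hour with the raw series.

```python
import numpy as np

def piecewise_aggregate(values, segment_len):
    """Replace each consecutive segment of `segment_len` samples by its mean.

    A trailing partial segment is averaged over the samples it contains.
    """
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for start in range(0, len(values), segment_len):
        segment = values[start:start + segment_len]
        out[start:start + segment_len] = segment.mean()
    return out

raw = [1.0, 3.0, 2.0, 10.0, 4.0, 6.0, 5.0]
smoothed = piecewise_aggregate(raw, segment_len=3)
# segments [1, 3, 2], [10, 4, 6] and [5] are replaced by their means
```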

In FIG. 4, the green graph represents the actual measured values of a measurement station, and the gray graph represents the hourly minimum and maximum values of the several measurement stations nearest to that station. Comparing the data before and after PAA, large spikes in the graph decrease, and small variations that were previously not visible become more prominent. As a result, the trend of the graph is clearer than before PAA. For the noise-reduced data, linear interpolation may be used to supplement the noise data that disappeared. Consequently, noise is reduced relative to the original data while the characteristics of the data's flow are largely preserved. Alternatively, for example, a normalization step may be performed on the air quality data before outlier removal. In this case, outliers may be removed using the interquartile range (IQR) method.
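A minimal sketch of IQR-based outlier removal followed by linear interpolation is given below. The multiplier `k=1.5` is the conventional Tukey value and is an assumption here, since the description names the IQR method without giving a multiplier. The normalization step is omitted.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def fill_linear(values, mask):
    """Replace masked samples by linear interpolation between kept neighbours."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(len(values))
    filled = values.copy()
    filled[mask] = np.interp(idx[mask], idx[~mask], values[~mask])
    return filled

series = np.array([10.0, 11.0, 12.0, 95.0, 14.0, 15.0, 16.0])
outliers = iqr_mask(series)              # only the spike at index 3 is flagged
cleaned = fill_linear(series, outliers)  # the spike becomes the neighbours' midpoint
```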

FIG. 5 is a diagram for explaining the training structure of a GAN model according to an embodiment.

As the unsupervised learning model, a generative adversarial network (GAN), which trains two models called a generator and a discriminator at the same time, may be applied.

Because the persons in charge of data confirmation at existing air pollution network measurement stations label on differing grounds, the labels are far less reliable; unsupervised learning may therefore be performed so as to learn from the training data alone. A GAN is a model that learns the predicted form for a specific category. FIG. 5 shows the training structure of a GAN model adapted to air quality data.

The producer of fake weather data (the generator) aims to deceive the classification expert and therefore generates weather data that are as realistic as possible, while the classification expert (the discriminator) aims not to be deceived by the producer. A key feature is thus that the two improve together and training proceeds within this adversarial relationship. In the embodiment, among GAN models, a model called 'BeatGan', which can detect anomalous patterns in multivariate time-series data, may be used.
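For reference, the adversarial relationship described above is commonly written as the standard GAN minimax objective. The symbols are not used in the description and are introduced here only for illustration: G is the generator, D the discriminator, x a real data sample, and z the generator's random input.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator raises V by scoring real samples high and generated samples low, while the generator lowers V by making D(G(z)) approach 1. The description does not state which loss the embodiment actually uses.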

Referring to FIG. 6, BeatGan consists of an encoder that extracts features and a decoder that restores them, combined with the training scheme of a GAN. Its key feature is the ability to detect anomalous patterns in time-series data. Training may be performed only on time-series data with normal patterns so that the BeatGan model learns the important feature information of normal data. Referring to FIG. 7, it can be seen that high anomaly values are produced for abnormal data.
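The principle of scoring a window by how poorly an encoder and decoder trained on normal data can reproduce it can be sketched without a neural network. The example below swaps BeatGan for a linear stand-in (PCA fitted by SVD) and uses the mean squared reconstruction error as the anomaly value. The stand-in model, the synthetic sine data, and all names are assumptions for illustration only.

```python
import numpy as np

def fit_linear_autoencoder(normal_windows, n_components=1):
    """Fit a PCA 'encoder/decoder' on normal windows only (stand-in for BeatGan)."""
    x = np.asarray(normal_windows, dtype=float)
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def anomaly_score(window, mean, components):
    """Mean squared error between a window and its reconstruction."""
    centered = np.asarray(window, dtype=float) - mean
    code = components @ centered             # encode
    reconstruction = components.T @ code     # decode
    return float(np.mean((centered - reconstruction) ** 2))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 48)
# Normal windows: one sine shape whose amplitude varies slightly.
normal = np.array([a * np.sin(t) for a in rng.uniform(0.8, 1.2, size=50)])
mean, components = fit_linear_autoencoder(normal)

normal_score = anomaly_score(np.sin(t), mean, components)
shifted_score = anomaly_score(np.sin(t) + 3.0, mean, components)  # baseline jump
```

The window with the raised baseline cannot be expressed by what was learned from normal data, so its reconstruction error, and hence its anomaly value, is much larger.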

FIG. 8 is a diagram for explaining the generation of training data. Application to air quality data may be determined using the unsupervised learning model. Training data may be constructed using the data labels created by the existing persons in charge of air quality data confirmation. The left part of FIG. 8 shows the data labels assigned by experts, and the right part shows the data of the corresponding section.

FIGS. 9 and 10 are graphs showing experimental results for abnormal data and normal data according to an embodiment.

FIG. 9 shows experimental results for abnormal data. Referring to FIG. 9, recall is high when the baseline suddenly rises but lower when it drops. FIG. 10 shows experimental results for normal data. Referring to FIG. 10, normal data with large-amplitude patterns are detected as outliers. Because the amplitude of the input is very large, the error function becomes large and the values are all judged to be wrong, that is, false positives. Accordingly, corrections such as threshold adjustment or pre-/post-processing may additionally be performed.
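One possible form of the threshold correction mentioned above is to calibrate the decision threshold on anomaly scores of known-normal windows, so that only a small fraction of them is flagged. The description does not specify a procedure; the 0.99 quantile and the function names below are hypothetical.

```python
import numpy as np

def calibrate_threshold(normal_scores, quantile=0.99):
    """Pick the score below which `quantile` of known-normal windows fall."""
    return float(np.quantile(np.asarray(normal_scores, dtype=float), quantile))

def flag_anomalies(scores, threshold):
    """Flag windows whose anomaly score exceeds the calibrated threshold."""
    return np.asarray(scores, dtype=float) > threshold

normal_scores = np.arange(1, 101) / 100.0        # scores 0.01 ... 1.00
threshold = calibrate_threshold(normal_scores)   # just above 0.99
flags = flag_anomalies([0.5, 0.995, 2.0], threshold)
```

Scaling each window by its own amplitude before scoring would be an alternative pre-processing correction for the large-amplitude false positives.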

FIG. 11 is a block diagram for explaining the configuration of an abnormality determination system according to an embodiment, and FIG. 12 is a flowchart for explaining a method of determining an abnormality in air quality data in the abnormality determination system according to an embodiment.

The processor of the abnormality determination system 100 may include an input unit 1110 and a determination unit 1120. These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the abnormality determination system. The processor and its components may control the abnormality determination system to perform steps 1210 to 1220 of the method for determining an abnormality in air quality data of FIG. 12. Here, the processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program contained in a memory.

The processor may load into the memory the program code stored in a file of a program for the method of determining an abnormality in air quality data. For example, when the program is executed in the abnormality determination system, the processor may control the system, under the control of the operating system, to load the program code from the program file into the memory. Here, the processor and the input unit 1110 and determination unit 1120 it includes may each be a different functional representation of the processor for executing the corresponding part of the program code loaded into the memory and thereby performing the subsequent steps 1210 to 1220.

In step 1210, the input unit 1110 may receive air quality data measured at a plurality of measurement stations as input to a learning model in which supervised learning and unsupervised learning are ensembled. For example, air quality data measured at a plurality of measurement stations may be collected, and the data of all measurement stations may be retrieved periodically, every day, to select the hourly measurement values of the previous day. The input unit 1110 may input, to the supervised learning model, air quality data composed of one-dimensional data that are hourly measurement values of a plurality of components. The input unit 1110 may remove noise contained in the air quality data by applying piecewise aggregate approximation to the air quality data, and may supplement the data by applying linear interpolation to the noise data that disappeared when the piecewise aggregate approximation was applied. The input unit 1110 may form sections according to the time at which the air quality data was measured, and may remove noise by approximating the average value of the air quality data in each formed section as the representative value of that section. The input unit 1110 may input time-series air quality data to the unsupervised learning model.

In step 1220, the determination unit 1120 may determine whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled. The determination unit 1120 may output an hourly baseline determination result for the air quality data through the supervised learning model. The determination unit 1120 may indicate outliers for abnormal data included in the air quality data through the unsupervised learning model.
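One way the two outputs of step 1220 might be combined is sketched below. This passage does not specify the combination rule, so the sketch assumes a simple logical OR between the hourly flags of the supervised model and a thresholded anomaly score from the unsupervised model. It also assumes the score is the squared reconstruction error of an encoder-decoder model. The function names and the threshold value are hypothetical.

```python
from typing import List, Sequence


def reconstruction_scores(
    observed: Sequence[float], reconstructed: Sequence[float]
) -> List[float]:
    """Per-hour squared reconstruction error, used here as an anomaly score."""
    if len(observed) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    return [(o - r) ** 2 for o, r in zip(observed, reconstructed)]


def ensemble_flags(
    supervised_flags: Sequence[bool],
    scores: Sequence[float],
    threshold: float,
) -> List[bool]:
    """Flag an hour as abnormal if either model flags it (assumed OR rule)."""
    if len(supervised_flags) != len(scores):
        raise ValueError("sequences must have the same length")
    return [bool(f) or s > threshold for f, s in zip(supervised_flags, scores)]


if __name__ == "__main__":
    observed = [10, 11, 40, 12]               # hypothetical hourly readings
    reconstructed = [10, 12, 12, 12]          # hypothetical decoder output
    supervised = [False, True, False, False]  # hypothetical hourly flags
    scores = reconstruction_scores(observed, reconstructed)
    print(ensemble_flags(supervised, scores, threshold=25))
```

An OR rule favours recall: an hour is reported if either model objects to it. A stricter AND rule or a weighted vote would trade recall for precision.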

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. Although a single processing device is sometimes described for ease of understanding, a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over computer systems connected by a network and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A method for determining an abnormality in air quality data, the method comprising:
receiving air quality data measured at a plurality of measurement stations as input to a learning model in which a supervised learning model and an unsupervised learning model are ensembled; and
determining whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled.

The method of claim 1,
wherein the supervised learning model uses a DeepLab V3+ model including deep convolution, and
wherein the receiving step comprises
receiving, as input to the supervised learning model, air quality data composed of one-dimensional data that are hourly measurement values for a plurality of components.

The method of claim 2,
wherein the supervised learning model
comprises a feature extractor that is a modified ResNet34-based model, and is trained such that a first feature map is output as the air quality data passes through the ResNet-based model, the output first feature map is passed through ASSPP (Atrous Separable Spatial Pyramid Pooling) to generate a feature map for each component, and the generated feature map for each component is integrated with the first feature map.

The method of claim 2,
wherein the receiving step comprises
removing noise included in the air quality data by applying piecewise aggregate approximation to the air quality data, and using linear interpolation to supplement the data that was removed as noise when the piecewise aggregate approximation was applied.

The method of claim 4,
wherein the receiving step comprises
forming segments according to the time at which the air quality data was measured, and removing noise by approximating each segment by the mean of the air quality data included in the segment as the representative value of the segment.

The method of claim 1,
wherein the determining step comprises
outputting an hourly baseline determination result for the air quality data through the supervised learning model.

The method of claim 1,
wherein the unsupervised learning model uses a generative adversarial network (GAN) composed of a generator and a discriminator, and
wherein the receiving step comprises
receiving time-series air quality data as input to the unsupervised learning model.

The method of claim 7,
wherein the unsupervised learning model
is composed of an encoder that extracts features and a decoder that reconstructs the features, and is trained to detect anomalies in multivariate time-series data using a BeatGAN model.

The method of claim 8,
wherein the unsupervised learning model
is configured such that the BeatGAN model learns feature information of normal data by being trained on time-series data having a normal pattern.

The method of claim 1,
wherein the determining step comprises
indicating outliers for abnormal data included in the air quality data through the unsupervised learning model.

An abnormality determination system comprising:
an input unit that receives air quality data measured at a plurality of measurement stations as input to a learning model in which a supervised learning model and an unsupervised learning model are ensembled; and
a determination unit that determines whether the air quality data is abnormal using the learning model in which the supervised learning model and the unsupervised learning model are ensembled.