CN117332344A - Air quality anomaly detection method based on error optimization automatic encoder model - Google Patents

Air quality anomaly detection method based on error optimization automatic encoder model

Info

Publication number
CN117332344A
Authority
CN
China
Prior art keywords
air quality
data
outlier
cluster
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311051456.6A
Other languages
Chinese (zh)
Inventor
刘希亮
智晓颖
赵俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202311051456.6A priority Critical patent/CN117332344A/en
Publication of CN117332344A publication Critical patent/CN117332344A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display
    • G01N33/0068 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display using a computer specifically programmed
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Combustion & Propulsion (AREA)
  • Pathology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an air quality anomaly detection method based on an error-optimized autoencoder (EOA) model. In the first stage, error-optimization training is performed: unsupervised clusters of the air quality anomaly detection dataset are obtained; probable outlier instances are then selected from these clusters using different strategies, and an air-quality-normal dataset is constructed; finally, a deep autoencoder model is applied to the air-quality-normal dataset to optimize its reconstruction error. In the second stage, the invention constructs an air quality anomaly detection method based on the EOA model, and the model is trained and validated using the air quality anomaly detection dataset. Because the proposed error-optimized autoencoder computes the reconstruction error only on the air-quality-normal dataset, it optimizes the deep autoencoder model and improves the air quality anomaly detection effect.

Description

Air quality anomaly detection method based on error optimization automatic encoder model
Technical Field
The invention relates to air quality anomaly detection, density peak clustering (Density Peak Clustering, DPC), and the self-organizing map (Self-Organizing Map, SOM), and in particular to an air quality anomaly detection method based on an error-optimized autoencoder model (Error-optimized Autoencoder Model, EOA).
Background
Existing studies have produced a variety of outlier detection methods, such as clustering-based techniques, nearest-neighbour-based methods, and deep-learning-based models. Clustering-based outlier detection groups data instances by similarity or shared patterns, and instances that do not belong to any cluster are regarded as anomalous. Common clustering methods for detecting outliers and noise points include: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is robust to noise and finds all dense regions of sample points, treating each dense region as a cluster; OPTICS (Ordering Points To Identify the Clustering Structure), a density-based algorithm that can find clusters of arbitrary shape in large-scale data and is fairly robust to noise points; and CBLOF (Cluster-Based Local Outlier Factor), which adopts the idea of the local outlier factor and detects outliers on the basis of a clustering. The core goal of clustering-based approaches is to identify cluster structure, so such approaches may miss outlier instances. Nearest-neighbour-based anomaly detection falls into two categories: distance-based and density-based methods. Distance-based methods detect anomalies by computing the distance between anomalous and normal instances; density-based techniques compare the estimated density of each data instance with that of its neighbours, and instances with lower estimated density are considered outliers. Breunig et al. introduced the Local Outlier Factor (LOF), which computes the relative isolation of a given data instance; however, its performance is poor on scattered datasets, and it also degrades when the density around outlier points approaches the density of their neighbourhood or near boundary instances. To further improve this approach, researchers have proposed variants such as the Connectivity-based Outlier Factor (COF), influence-based outlier detection (INFLO), the Local Distance-based Outlier Factor (LDOF), and the Local Correlation Integral (LOCI). Deep learning models are currently an effective technique in the field of outlier detection and have been applied in supervised, semi-supervised, and unsupervised modes. In the supervised mode, the model is trained on normal instances and the trained model is used for anomaly detection; the key problem of this mode is obtaining accurate labels for the inlier and outlier classes in the respective domains. In the semi-supervised mode of anomaly detection, class labels are available only for inliers.
The unsupervised mode is widely applicable because it can handle unlabeled datasets. In this mode, a deep-learning network attempts to reconstruct its input at the output, and the reconstruction error is measured to rank all instances of the dataset as outliers. Many deep learning techniques, such as Adaptive Resonance Theory (ART), Generative Adversarial Networks (GAN), and Restricted Boltzmann Machines (RBM), have been applied in the field of anomaly detection. Beyond these networks, recurrent neural networks (RNN), autoencoder ensembles (RandNet), boosting-based autoencoder ensembles (BAE), and similar models can also be used. Hawkins et al. describe a method for outlier detection using a recurrent neural network, in which the network is trained on a sample dataset and outliers are found for each instance based on its reconstruction error. Later, Hadzic and Dillon introduced a technique based on the self-organizing map (SOM). The SOM, commonly referred to as a Kohonen network or Kohonen map, is an unsupervised neural network widely used for clustering and visualising high-dimensional data. The SOM has also been used as a clustering method: Olszewski introduced a fraud detection method that uses self-organizing maps to visualise user profiles. The One-Class Support Vector Machine (OCSVM), a special case of the SVM in which a smooth hyperplane is built around most data instances, is also widely used for anomaly detection. The core approach among unsupervised deep learning models is the autoencoder. An autoencoder is a symmetric artificial neural network whose layers can be trained in an unsupervised manner. It extracts important attributes from the data and reconstructs the input as closely as possible from the learned encoded representation; the output neurons of the autoencoder are therefore a reconstruction of the input neurons. Fig. 1 depicts the basic architecture of an autoencoder. The mechanism of an autoencoder comprises three parts: the encoder, the compressed (hidden) layer, and the decoder. In the encoder stage, the autoencoder compresses the data to extract useful information (the compression layer); the decoder works in the opposite way, reconstructing the input features as closely as possible. Minimising the reconstruction error is the primary goal of an autoencoder. Autoencoders can be applied for various purposes, including anomaly detection, image denoising, and data compression (dimensionality reduction). During outlier detection, outlier instances have a higher reconstruction error than normal instances. An and Cho describe an anomaly detection method using a variational autoencoder, in which the reconstruction probability is used to detect outliers. Later, Chalapathy, Menon, and Chawla proposed a robust, deep, and inductive autoencoder-based model for detecting outliers, which learns the non-linear subspace of the majority of instances. Zong et al. first compute the reconstruction error of each data instance and then feed it into a Gaussian mixture model. Various ensemble-based autoencoder networks have also been introduced to detect outlier instances.
Chen et al. proposed a popular autoencoder-ensemble model for outlier detection, named RandNet. In this ensemble-based network, RandNet is an alternative to a fully connected network, using random connections between nodes. Sarvari et al. developed an unsupervised outlier detection model based on a boosted ensemble of autoencoders; to reduce the influence of outliers in the training dataset, they apply weighted sampling to the data instances. Later, Du et al. introduced an unsupervised outlier detection method based on a graph autoencoder: the Euclidean-structured dataset is converted into a graph, the graph is used to train the graph autoencoder, and outliers are determined from the reconstruction of each instance at the output layer.
In summary, autoencoders have been used for outlier detection. However, the reconstruction process of autoencoder-based models involves the entire dataset (normal and abnormal instances). As a result, the reconstruction error is overestimated for normal instances and underestimated for abnormal instances. The present invention therefore considers alternative ways of computing the reconstruction error so as to detect air quality anomaly data efficiently.
Disclosure of Invention
The problem solved by the invention is as follows: an air quality anomaly detection method based on an error-optimized autoencoder (Error-optimized Autoencoder Model, EOA) is provided, which applies intelligent clustering techniques to identify outliers in an air quality anomaly detection dataset, i.e. air quality anomaly data, thereby realising air quality anomaly detection. In the first stage, error-optimization training is carried out: first, two intelligent clustering techniques, density peak clustering (Density Peak Clustering, DPC) and the self-organizing map (Self-Organizing Map, SOM), are used to obtain unsupervised clusters of the air quality anomaly detection dataset; second, probable outlier instances are selected from the unsupervised clusters using different strategies, and an air-quality-normal dataset is constructed; finally, a deep autoencoder model (Deep Autoencoder Model) is applied to the air-quality-normal dataset to optimize its reconstruction error. In the second stage, the invention constructs an air quality anomaly detection method based on the EOA model, and the model is trained and validated using the air quality anomaly detection dataset. The specific steps are as follows:
(1) Data preparation: an air quality anomaly detection dataset is established from the air quality monitoring data collected by air quality monitoring stations, stored, and preliminarily preprocessed.
(2) Data preprocessing: to ensure the accuracy of analysis and modelling, the selected air quality data are first cleaned, which specifically includes handling missing values and abnormal values. Missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted. Some data types are also converted, e.g. numerical timestamps are converted into dates, to facilitate subsequent processing.
(3) Probable outlier instances are selected from each cluster using different strategies, specifically:
(a) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outlier instances in each cluster (a sketch of these computations is given after step (3)). The density of data point i can be expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j. For data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points can be expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j. Based on these two features, points that have low density but lie very close to a dense region, and points that lie very far from any dense region, are taken as possible outliers.
(b) Different strategies are used to obtain probable outlier instances from each cluster.
In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance; a smaller score indicates a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1).
In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2). The instances selected in probable_outlier_point^(1) and probable_outlier_point^(2) are then combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the two strategies above, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained. However, true outlier instances that lie very close to dense regions may still be missed.
In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively. After the score POS of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
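The following Python sketch illustrates step (3) under stated assumptions: the function name probable_outliers, the cutoff rule d_c = mean + standard deviation of the intra-cluster distances, and the estimation of σ and μ from the distance feature itself are illustrative choices not fixed by the description, so this is a sketch of the three strategies rather than the patented implementation.

# Per-cluster density/distance features and the three probable-outlier scores of step (3).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def probable_outliers(cluster_points: np.ndarray, top_n: int) -> set:
    d = squareform(pdist(cluster_points))                 # pairwise distances d_ik
    np.fill_diagonal(d, np.inf)                           # exclude the point itself
    finite = d[np.isfinite(d)]
    d_c = finite.mean() + finite.std()                    # assumed cutoff rule for eq. (1)

    density = (d < d_c).sum(axis=1)                       # eq. (1): neighbours within d_c

    distance = np.empty(len(cluster_points))              # eq. (2): nearest higher-density point
    for i in range(len(cluster_points)):
        higher = density > density[i]
        distance[i] = d[i, higher].min() if higher.any() else d[i][np.isfinite(d[i])].max()

    pos1 = density * distance                             # eq. (3): low density, close to dense area
    pos2 = density / np.maximum(distance, 1e-12)          # eq. (4): low density, far from dense area
    mu, sigma = distance.mean(), distance.std() + 1e-12   # assumed Gaussian parameters
    f = np.exp(-(distance - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    pos3 = density * f                                     # eq. (6): true outliers near dense areas

    top = lambda score: set(np.argsort(score)[:top_n])    # smallest scores are most outlying
    return top(pos1) | top(pos2) | top(pos3)              # union of eq. (5) and the third strategy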
(4) An air-quality-normal dataset is created and used to train a deep autoencoder model, specifically:
(a) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data.
(b) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air-quality-normal dataset.
(c) The reconstruction error of the deep autoencoder model is computed on the air-quality-normal dataset using the mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output (the reconstruction of the input), and n is the size of the dataset.
(d) The rectified linear unit (ReLU) and the hyperbolic tangent (tanh) are used as the activation functions of the model:
ReLU: f(x) = max{0, x}  (8)
where x denotes the input vector.
(e) To improve model performance, L1 regularization is used, and adaptive moment estimation (Adaptive Moment Estimation, Adam), i.e. the Adam optimizer, is used for loss optimization. A sketch of this training step is given below.
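The following Python (Keras) sketch shows one way to realise step (4) under stated assumptions: the layer widths (8 and 4 units), the L1 strength, and the training hyperparameters are illustrative and not specified by the description; only the use of ReLU/tanh activations, L1 regularization, the Adam optimizer, and the MSE loss follows the text.

# A symmetric deep autoencoder trained only on the air-quality-normal dataset (D - P).
import tensorflow as tf

def build_autoencoder(n_features: int) -> tf.keras.Model:
    reg = tf.keras.regularizers.l1(1e-5)                                  # L1 regularization, step (4)(e)
    inp = tf.keras.Input(shape=(n_features,))
    h = tf.keras.layers.Dense(8, activation="relu", activity_regularizer=reg)(inp)
    code = tf.keras.layers.Dense(4, activation="tanh")(h)                 # compression (hidden) layer
    h = tf.keras.layers.Dense(8, activation="relu")(code)
    out = tf.keras.layers.Dense(n_features, activation="linear")(h)       # reconstruction of the input
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")                           # Adam optimizer, MSE of eq. (7)
    return model

# X_normal: the air-quality-normal dataset, scaled to a common range beforehand
# ae = build_autoencoder(X_normal.shape[1])
# ae.fit(X_normal, X_normal, epochs=100, batch_size=64, validation_split=0.1)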
(5) The error-optimized autoencoder (EOA) model is constructed, comprising:
(a) The EOA model is trained using the air quality anomaly detection dataset and outputs a reconstruction error for each instance, where the reconstruction error is computed with the MSE.
(b) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (d) and (e) of step (4).
(c) According to the computed reconstruction errors, the data instances are sorted in descending order of reconstruction error, and the first N instances are judged to be outliers, i.e. air quality anomaly data (outliers reconstruct poorly and therefore have the largest errors); a sketch of this ranking follows.
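A minimal Python sketch of the ranking in step (5), assuming the autoencoder ae and the variable names from the previous sketch; the value of N (n_outliers) is an assumption.

# Rank every instance of the anomaly-detection dataset by its reconstruction error
# and flag the N instances with the largest errors as air quality anomalies.
import numpy as np

def detect_anomalies(ae, X_all: np.ndarray, n_outliers: int) -> np.ndarray:
    recon = ae.predict(X_all)                              # reconstructed inputs
    errors = np.mean((X_all - recon) ** 2, axis=1)         # per-instance MSE, eq. (7)
    return np.argsort(errors)[::-1][:n_outliers]           # indices with the largest errors

# anomaly_idx = detect_anomalies(ae, X_all, n_outliers=50)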
(6) The EOA-based air quality anomaly detection method is validated by cross-validation, specifically:
(a) The air quality anomaly detection dataset is split into a training set and a test set in the ratio 8:2, and 10-fold cross-validation is used to test the accuracy of the EOA model.
(b) Precision@N, Recall, and the area under the receiver operating characteristic curve (AUC_ROC) are used as the evaluation metrics of the experiments.
TP and FN denote true positives and false negatives respectively: TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive. TOP_N denotes the first N data instances after the instances have been sorted in ascending order of their probable outlier scores, i.e. the instances judged to be air quality anomalies. The AUC_ROC value is the area under the ROC curve, a graphical tool for depicting classifier performance that shows the relationship between the true positive rate (True Positive Rate, TPR) and the false positive rate (False Positive Rate, FPR) of the classifier at different thresholds.
Drawings
FIG. 1 is a basic structural diagram of an automatic encoder
FIG. 2 is a framework diagram of the air quality anomaly detection method based on the EOA model
Detailed Description
(1) Data preparation:
(a) Air quality monitoring data collected by air quality monitoring stations: the data, collected at a number of monitoring sites distributed across a city, comprise atmospheric particulates, gaseous pollutants, and meteorological factors; the hourly data of each site are stored as one csv file. The site data comprise 11 features in total: PM2.5, PM10, CO, NO2, SO2, O3, pressure, humidity, temperature, wind_direction, and wind_speed, from which the air quality anomaly detection dataset is established.
(2) Data preprocessing: to ensure the accuracy of analysis and modelling, the selected air quality data are first cleaned, which specifically includes handling missing values and abnormal values. Missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted. Some data types are also converted, e.g. numerical timestamps are converted into dates, to facilitate subsequent processing. A sketch of this preprocessing is given below.
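A minimal pandas sketch of steps (1)-(2) under stated assumptions: the column name datetime, the csv layout, the 3-hour interpolation limit, the previous-day fill, and the negative-value rule are illustrative choices not specified by the description.

# Load one station's hourly csv, fill short and long gaps, and convert timestamps to dates.
import pandas as pd

FEATURES = ["PM2.5", "PM10", "CO", "NO2", "SO2", "O3",
            "pressure", "humidity", "temperature", "wind_direction", "wind_speed"]

def load_station(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["datetime"] = pd.to_datetime(df["datetime"])              # numeric/text timestamp -> date
    df = df.set_index("datetime").sort_index()

    # short gaps: linear (or quadratic) interpolation, here limited to 3 consecutive hours
    df[FEATURES] = df[FEATURES].interpolate(method="linear", limit=3)

    # long gaps: fill with the value at the same hour on the previous day
    df[FEATURES] = df[FEATURES].fillna(df[FEATURES].shift(24))

    # clearly abnormal values (e.g. negative concentrations) are treated as missing
    df[FEATURES] = df[FEATURES].mask(df[FEATURES] < 0)
    return df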
(3) Probable outlier instances are selected from each cluster using different strategies, specifically:
(a) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outlier instances in each cluster. The density of data point i can be expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j. For data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points can be expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j. Based on these two features, points that have low density but lie very close to a dense region, and points that lie very far from any dense region, are taken as possible outliers.
(b) Different strategies are used to obtain probable outlier instances from each cluster.
In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance; a smaller score indicates a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1).
In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2). The instances selected in probable_outlier_point^(1) and probable_outlier_point^(2) are then combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the two strategies above, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained. However, true outlier instances that lie very close to dense regions may still be missed.
In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively. After the score POS of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
(4) An air-quality-normal dataset is created and used to train a deep autoencoder model, specifically:
(a) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data.
(b) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air-quality-normal dataset.
(c) The reconstruction error of the deep autoencoder model is computed on the air-quality-normal dataset using the mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output (the reconstruction of the input), and n is the size of the dataset.
(d) The rectified linear unit (ReLU) and the hyperbolic tangent (tanh) are used as the activation functions of the model:
ReLU: f(x) = max{0, x}  (8)
where x denotes the input vector.
(e) To improve model performance, L1 regularization is used, and adaptive moment estimation (Adaptive Moment Estimation, Adam), i.e. the Adam optimizer, is used for loss optimization.
(5) The error-optimized autoencoder (EOA) model is constructed, comprising:
(a) The EOA model is trained using the air quality anomaly detection dataset and outputs a reconstruction error for each instance, where the reconstruction error is computed with the MSE.
(b) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (d) and (e) of step (4).
(c) According to the computed reconstruction errors, the data instances are sorted in descending order of reconstruction error, and the first N instances are judged to be outliers, i.e. air quality anomaly data (outliers reconstruct poorly and therefore have the largest errors).
(6) The EOA-based air quality anomaly detection method is validated by cross-validation, specifically:
(a) The air quality anomaly detection dataset is split into a training set and a test set in the ratio 8:2, and 10-fold cross-validation is used to test the accuracy of the EOA model.
(b) Precision@N, Recall, and the area under the receiver operating characteristic curve (AUC_ROC) are used as the evaluation metrics of the experiments.
TP and FN denote true positives and false negatives respectively: TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive. TOP_N denotes the first N data instances after the instances have been sorted in ascending order of their probable outlier scores, i.e. the instances judged to be air quality anomalies. The AUC_ROC value is the area under the ROC curve, a graphical tool for depicting classifier performance that shows the relationship between the true positive rate (True Positive Rate, TPR) and the false positive rate (False Positive Rate, FPR) of the classifier at different thresholds. A sketch of these metrics is given below.
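A minimal Python sketch of the evaluation in step (6)(b) under stated assumptions: binary ground-truth labels y_true (1 = anomaly) and the EOA reconstruction errors used as outlier scores (larger = more anomalous) are assumed; Precision@N and Recall are computed by treating the TOP_N ranked instances as the predicted anomalies, and scikit-learn is assumed for AUC_ROC.

# Precision@N, Recall, and AUC_ROC from labels and per-instance reconstruction errors.
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_n(y_true: np.ndarray, scores: np.ndarray, n: int) -> float:
    top_n = np.argsort(scores)[::-1][:n]                 # N instances with the largest errors
    return float(y_true[top_n].sum()) / n                # TP among TOP_N divided by N

def recall_at_n(y_true: np.ndarray, scores: np.ndarray, n: int) -> float:
    top_n = np.argsort(scores)[::-1][:n]
    tp = float(y_true[top_n].sum())                      # true anomalies recovered in TOP_N
    return tp / max(float(y_true.sum()), 1.0)            # TP / (TP + FN)

# auc = roc_auc_score(y_true, scores)                    # area under the ROC curve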

Claims (6)

1. An air quality anomaly detection method based on an error optimization automatic encoder model is characterized by comprising the following implementation steps:
(1) Collecting air quality monitoring data from air quality monitoring stations, and storing and preliminarily preprocessing the air quality monitoring data to form an air quality anomaly detection dataset;
(2) Intelligent clustering is carried out on the air quality anomaly detection dataset by applying the DPC and SOM techniques, yielding Π = {C_1, C_2, C_3, ..., C_k}, the set of clusters obtained after applying both the DPC and SOM techniques to the air quality anomaly detection dataset;
(3) Selecting possible outlier instances from each cluster using different strategies;
(4) Removing the probable air quality anomaly data from the air quality anomaly detection dataset to obtain an air quality normal dataset, and training a deep autoencoder model with the air quality normal dataset to optimize its reconstruction error;
(5) Constructing an error-optimized autoencoder (EOA) model, training the EOA model using the air quality anomaly detection dataset, and judging whether each instance is air quality anomaly data according to the reconstruction error output by the model for that instance;
(6) Verifying the air quality anomaly detection method based on the EOA model by means of cross-validation.
2. The method for detecting air quality anomalies based on the error-optimized automatic encoder model according to claim 1, wherein in the step (1), performing data preprocessing includes:
(1) Handling missing values and abnormal values: missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted;
(2) Converting some data types, e.g. converting numerical timestamps into dates, to facilitate subsequent processing.
3. The method for detecting air quality anomalies based on an error-optimized automatic encoder model of claim 1, wherein in the step (3), obtaining possible outlier instances from each cluster using different strategies includes:
(1) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outliers in each cluster. The density of data point i is expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j; for data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points is expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j; points that have low density but lie close to a dense region, and points that lie far from any dense region, are taken as possible outliers;
(2) Obtaining possible outlier instances from each cluster using different strategies;
(a) In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance, a smaller score indicating a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1);
(b) In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2);
the sets probable_outlier_point^(1) and probable_outlier_point^(2) are combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the above two strategies, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained; however, true outlier instances that lie very close to a dense region may still be missed;
(c) In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively; after the score POS of every data point in the cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
4. The method for detecting air quality anomalies based on an error-optimized automatic encoder model of claim 1, wherein in the step (4), creating an air quality normal dataset and training a depth automatic encoder model therewith includes:
(1) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data;
(2) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air quality normal dataset;
(3) The reconstruction error of the deep automatic encoder model is computed on the air quality normal dataset using the mean square error MSE:
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output, and n is the size of the dataset;
(4) Using the rectified linear unit (ReLU) and the hyperbolic tangent (tanh) as the activation functions of the model;
ReLU: f(x) = max{0, x}  (8)
wherein x represents an input vector;
(5) In order to improve the model performance, L1 regularization is used, and adaptive moment estimation optimization, abbreviated as Adam optimizer, is used for loss optimization.
5. The method for detecting air quality anomalies based on the error-optimized automatic encoder model according to claim 1, wherein in the step (5), constructing an error-optimized automatic encoder EOA model includes:
(1) Training an EOA model using the air quality anomaly detection dataset, outputting a reconstruction error for each instance, wherein the reconstruction error is calculated using MSE;
(2) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (4) and (5) of claim 4;
(3) According to the computed reconstruction errors, sorting the data instances in descending order of reconstruction error, and judging the first N data instances as outliers, i.e. air quality anomaly data.
6. The method according to claim 1, wherein in the step (6), the model verification section includes:
(1) The air quality anomaly detection dataset is divided into a training set and a test set in the ratio 8:2, and the accuracy of the EOA model is tested using 10-fold cross-validation;
(2) Using Precision@N, Recall, and the area under the receiver operating characteristic curve as the evaluation metrics of the experiments;
TP and FN denote true positives and false negatives respectively, where TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive; TOP_N denotes the first N data instances after the instances have been sorted in ascending order of probable outlier score, i.e. the instances judged to be air quality anomalies; the AUC_ROC value is the area under the ROC curve, which is used to measure classifier performance, where the ROC curve shows the relationship between the true positive rate and the false positive rate of the classifier at different thresholds.
CN202311051456.6A 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model Pending CN117332344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311051456.6A CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311051456.6A CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Publications (1)

Publication Number Publication Date
CN117332344A true CN117332344A (en) 2024-01-02

Family

ID=89292194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311051456.6A Pending CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Country Status (1)

Country Link
CN (1) CN117332344A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972618A (en) * 2024-04-01 2024-05-03 青岛航天半导体研究所有限公司 Method and system for detecting secondary power failure of hybrid integrated circuit

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN111695639A (en) Power consumer power consumption abnormity detection method based on machine learning
WO2024021246A1 (en) Cross-device incremental bearing fault diagnosis method based on continuous learning
CN111353373B (en) Related alignment domain adaptive fault diagnosis method
WO2021114231A1 (en) Training method and detection method for network traffic anomaly detection model
Liu et al. A two-stage deep autoencoder-based missing data imputation method for wind farm SCADA data
Wang et al. Random convolutional neural network structure: An intelligent health monitoring scheme for diesel engines
CN104751229A (en) Bearing fault diagnosis method capable of recovering missing data of back propagation neural network estimation values
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110020712B (en) Optimized particle swarm BP network prediction method and system based on clustering
CN117332344A (en) Air quality anomaly detection method based on error optimization automatic encoder model
CN115484102A (en) Industrial control system-oriented anomaly detection system and method
CN110245390B (en) Automobile engine oil consumption prediction method based on RS-BP neural network
CN113705099B (en) Social platform rumor detection model construction method and detection method based on contrast learning
CN114548199A (en) Multi-sensor data fusion method based on deep migration network
CN116663613A (en) Multi-element time sequence anomaly detection method for intelligent Internet of things system
CN114049305A (en) Distribution line pin defect detection method based on improved ALI and fast-RCNN
CN116248392A (en) Network malicious traffic detection system and method based on multi-head attention mechanism
CN116542170A (en) Drainage pipeline siltation disease dynamic diagnosis method based on SSAE and MLSTM
Chou et al. SHM data anomaly classification using machine learning strategies: A comparative study
CN116776270A (en) Method and system for detecting micro-service performance abnormality based on transducer
CN114861778A (en) Method for rapidly classifying rolling bearing states under different loads by improving width transfer learning
CN114139624A (en) Method for mining time series data similarity information based on integrated model
CN113987910A (en) Method and device for identifying load of residents by coupling neural network and dynamic time planning
CN117112992A (en) Fault diagnosis method for polyester esterification stage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination