CN117332344A - Air quality anomaly detection method based on error optimization automatic encoder model - Google Patents

Air quality anomaly detection method based on error optimization automatic encoder model

Info

Publication number
CN117332344A
Authority
CN
China
Prior art keywords
air quality
data
outlier
cluster
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311051456.6A
Other languages
Chinese (zh)
Inventor
刘希亮
智晓颖
赵俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202311051456.6A priority Critical patent/CN117332344A/en
Publication of CN117332344A publication Critical patent/CN117332344A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0062 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display
    • G01N33/0068 General constructional details of gas analysers, e.g. portable test equipment concerning the measuring method or the display, e.g. intermittent measurement or digital display using a computer specifically programmed
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Combustion & Propulsion (AREA)
  • Pathology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an air quality anomaly detection method based on an error-optimized autoencoder (EOA) model. In the first stage, error-optimization training is performed: unsupervised clusters of the air quality anomaly detection dataset are obtained; probable outlier instances are then selected from these clusters using different strategies, and an air-quality-normal dataset is constructed; finally, a deep autoencoder model is applied to the air-quality-normal dataset to optimize its reconstruction error. In the second stage, the invention constructs an air quality anomaly detection method based on the EOA model, and the model is trained and validated using the air quality anomaly detection dataset. Because the proposed error-optimized autoencoder computes the reconstruction error only on the air-quality-normal dataset, it optimizes the deep autoencoder model and improves the air quality anomaly detection effect.

Description

Air quality anomaly detection method based on error optimization automatic encoder model
Technical Field
The invention relates to air quality anomaly detection, density peak clustering (Density Peak Clustering, DPC), and the self-organizing map (Self-Organizing Map, SOM), and in particular to an air quality anomaly detection method based on an error-optimized autoencoder model (Error-optimized Autoencoder Model, EOA).
Background
Existing studies have produced a variety of outlier detection methods, such as clustering-based techniques, nearest-neighbour-based methods, and deep-learning-based models. Clustering-based outlier detection groups data instances by similarity or shared patterns, and instances that do not belong to any cluster are regarded as anomalous. Common clustering methods for detecting outliers and noise points include: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is robust to noise and finds all dense regions of sample points, treating each dense region as a cluster; OPTICS (Ordering Points To Identify the Clustering Structure), a density-based algorithm that can find clusters of arbitrary shape in large-scale data and is fairly robust to noise points; and CBLOF (Cluster-Based Local Outlier Factor), which adopts the idea of the local outlier factor and detects outliers on the basis of a clustering. The core goal of clustering-based approaches is to identify cluster structure, so such approaches may miss outlier instances. Nearest-neighbour-based anomaly detection falls into two categories: distance-based and density-based methods. Distance-based methods detect anomalies by computing the distance between anomalous and normal instances; density-based techniques compare the estimated density of each data instance with that of its neighbours, and instances with lower estimated density are considered outliers. Breunig et al. introduced the Local Outlier Factor (LOF), which computes the relative isolation of a given data instance; however, its performance is poor on scattered datasets, and it also degrades when the density around outlier points approaches the density of their neighbourhood or near boundary instances. To further improve this approach, researchers have proposed variants such as the Connectivity-based Outlier Factor (COF), influence-based outlier detection (INFLO), the Local Distance-based Outlier Factor (LDOF), and the Local Correlation Integral (LOCI). Deep learning models are currently an effective technique in the field of outlier detection and have been applied in supervised, semi-supervised, and unsupervised modes. In the supervised mode, the model is trained on normal instances and the trained model is used for anomaly detection; the key problem of this mode is obtaining accurate labels for the inlier and outlier classes in the respective domains. In the semi-supervised mode of anomaly detection, class labels are available only for inliers.
The unsupervised mode is widely applicable because it can handle unlabeled datasets. In this mode, a deep-learning network attempts to reconstruct its input at the output, and the reconstruction error is measured to rank all instances of the dataset as outliers. Many deep learning techniques, such as Adaptive Resonance Theory (ART), Generative Adversarial Networks (GAN), and Restricted Boltzmann Machines (RBM), have been applied in the field of anomaly detection. Beyond these networks, recurrent neural networks (RNN), autoencoder ensembles (RandNet), boosting-based autoencoder ensembles (BAE), and similar models can also be used. Hawkins et al. describe a method for outlier detection using a recurrent neural network, in which the network is trained on a sample dataset and outliers are found for each instance based on its reconstruction error. Later, Hadzic and Dillon introduced a technique based on the self-organizing map (SOM). The SOM, commonly referred to as a Kohonen network or Kohonen map, is an unsupervised neural network widely used for clustering and visualising high-dimensional data. The SOM has also been used as a clustering method: Olszewski introduced a fraud detection method that uses self-organizing maps to visualise user profiles. The One-Class Support Vector Machine (OCSVM), a special case of the SVM in which a smooth hyperplane is built around most data instances, is also widely used for anomaly detection. The core approach among unsupervised deep learning models is the autoencoder. An autoencoder is a symmetric artificial neural network whose layers can be trained in an unsupervised manner. It extracts important attributes from the data and reconstructs the input as closely as possible from the learned encoded representation; the output neurons of the autoencoder are therefore a reconstruction of the input neurons. Fig. 1 depicts the basic architecture of an autoencoder. The mechanism of an autoencoder comprises three parts: the encoder, the compressed (hidden) layer, and the decoder. In the encoder stage, the autoencoder compresses the data to extract useful information (the compression layer); the decoder works in the opposite way, reconstructing the input features as closely as possible. Minimising the reconstruction error is the primary goal of an autoencoder. Autoencoders can be applied for various purposes, including anomaly detection, image denoising, and data compression (dimensionality reduction). During outlier detection, outlier instances have a higher reconstruction error than normal instances. An and Cho describe an anomaly detection method using a variational autoencoder, in which the reconstruction probability is used to detect outliers. Later, Chalapathy, Menon, and Chawla proposed a robust, deep, and inductive autoencoder-based model for detecting outliers, which learns the non-linear subspace of the majority of instances. Zong et al. first compute the reconstruction error of each data instance and then feed it into a Gaussian mixture model. Various ensemble-based autoencoder networks have also been introduced to detect outlier instances.
Chen et al. proposed a popular autoencoder-ensemble model for outlier detection, named RandNet. In this ensemble-based network, RandNet is an alternative to a fully connected network, using random connections between nodes. Sarvari et al. developed an unsupervised outlier detection model based on a boosted ensemble of autoencoders; to reduce the influence of outliers in the training dataset, they apply weighted sampling to the data instances. Later, Du et al. introduced an unsupervised outlier detection method based on a graph autoencoder: the Euclidean-structured dataset is converted into a graph, the graph is used to train the graph autoencoder, and outliers are determined from the reconstruction of each instance at the output layer.
In summary, autoencoders have been used for outlier detection. However, the reconstruction process of autoencoder-based models involves the entire dataset (normal and abnormal instances). As a result, the reconstruction error is overestimated for normal instances and underestimated for abnormal instances. The present invention therefore considers alternative ways of computing the reconstruction error so as to detect air quality anomaly data efficiently.
Disclosure of Invention
The problem solved by the invention is as follows: an air quality anomaly detection method based on an error-optimized autoencoder (Error-optimized Autoencoder Model, EOA) is provided, which applies intelligent clustering techniques to identify outliers in an air quality anomaly detection dataset, i.e. air quality anomaly data, thereby realising air quality anomaly detection. In the first stage, error-optimization training is carried out: first, two intelligent clustering techniques, density peak clustering (Density Peak Clustering, DPC) and the self-organizing map (Self-Organizing Map, SOM), are used to obtain unsupervised clusters of the air quality anomaly detection dataset; second, probable outlier instances are selected from the unsupervised clusters using different strategies, and an air-quality-normal dataset is constructed; finally, a deep autoencoder model (Deep Autoencoder Model) is applied to the air-quality-normal dataset to optimize its reconstruction error. In the second stage, the invention constructs an air quality anomaly detection method based on the EOA model, and the model is trained and validated using the air quality anomaly detection dataset. The specific steps are as follows:
(1) Data preparation: an air quality anomaly detection dataset is established from the air quality monitoring data collected by air quality monitoring stations, stored, and preliminarily preprocessed.
(2) Data preprocessing: to ensure the accuracy of analysis and modelling, the selected air quality data are first cleaned, which specifically includes handling missing values and abnormal values. Missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted. Some data types are also converted, e.g. numerical timestamps are converted into dates, to facilitate subsequent processing.
(3) Probable outlier instances are selected from each cluster using different strategies, specifically:
(a) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outlier instances in each cluster (a sketch of these computations is given after step (3)). The density of data point i can be expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j. For data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points can be expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j. Based on these two features, points that have low density but lie very close to a dense region, and points that lie very far from any dense region, are taken as possible outliers.
(b) Different strategies are used to obtain probable outlier instances from each cluster.
In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance; a smaller score indicates a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1).
In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2). The instances selected in probable_outlier_point^(1) and probable_outlier_point^(2) are then combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the two strategies above, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained. However, true outlier instances that lie very close to dense regions may still be missed.
In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively. After the score POS of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
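The following Python sketch illustrates step (3) under stated assumptions: the function name probable_outliers, the cutoff rule d_c = mean + standard deviation of the intra-cluster distances, and the estimation of σ and μ from the distance feature itself are illustrative choices not fixed by the description, so this is a sketch of the three strategies rather than the patented implementation.

# Per-cluster density/distance features and the three probable-outlier scores of step (3).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def probable_outliers(cluster_points: np.ndarray, top_n: int) -> set:
    d = squareform(pdist(cluster_points))                 # pairwise distances d_ik
    np.fill_diagonal(d, np.inf)                           # exclude the point itself
    finite = d[np.isfinite(d)]
    d_c = finite.mean() + finite.std()                    # assumed cutoff rule for eq. (1)

    density = (d < d_c).sum(axis=1)                       # eq. (1): neighbours within d_c

    distance = np.empty(len(cluster_points))              # eq. (2): nearest higher-density point
    for i in range(len(cluster_points)):
        higher = density > density[i]
        distance[i] = d[i, higher].min() if higher.any() else d[i][np.isfinite(d[i])].max()

    pos1 = density * distance                             # eq. (3): low density, close to dense area
    pos2 = density / np.maximum(distance, 1e-12)          # eq. (4): low density, far from dense area
    mu, sigma = distance.mean(), distance.std() + 1e-12   # assumed Gaussian parameters
    f = np.exp(-(distance - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    pos3 = density * f                                     # eq. (6): true outliers near dense areas

    top = lambda score: set(np.argsort(score)[:top_n])    # smallest scores are most outlying
    return top(pos1) | top(pos2) | top(pos3)              # union of eq. (5) and the third strategy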
(4) An air-quality-normal dataset is created and used to train a deep autoencoder model, specifically:
(a) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data.
(b) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air-quality-normal dataset.
(c) The reconstruction error of the deep autoencoder model is computed on the air-quality-normal dataset using the mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output (the reconstruction of the input), and n is the size of the dataset.
(d) The rectified linear unit (ReLU) and the hyperbolic tangent (tanh) are used as the activation functions of the model:
ReLU: f(x) = max{0, x}  (8)
where x denotes the input vector.
(e) To improve model performance, L1 regularization is used, and adaptive moment estimation (Adaptive Moment Estimation, Adam), i.e. the Adam optimizer, is used for loss optimization. A sketch of this training step is given below.
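The following Python (Keras) sketch shows one way to realise step (4) under stated assumptions: the layer widths (8 and 4 units), the L1 strength, and the training hyperparameters are illustrative and not specified by the description; only the use of ReLU/tanh activations, L1 regularization, the Adam optimizer, and the MSE loss follows the text.

# A symmetric deep autoencoder trained only on the air-quality-normal dataset (D - P).
import tensorflow as tf

def build_autoencoder(n_features: int) -> tf.keras.Model:
    reg = tf.keras.regularizers.l1(1e-5)                                  # L1 regularization, step (4)(e)
    inp = tf.keras.Input(shape=(n_features,))
    h = tf.keras.layers.Dense(8, activation="relu", activity_regularizer=reg)(inp)
    code = tf.keras.layers.Dense(4, activation="tanh")(h)                 # compression (hidden) layer
    h = tf.keras.layers.Dense(8, activation="relu")(code)
    out = tf.keras.layers.Dense(n_features, activation="linear")(h)       # reconstruction of the input
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")                           # Adam optimizer, MSE of eq. (7)
    return model

# X_normal: the air-quality-normal dataset, scaled to a common range beforehand
# ae = build_autoencoder(X_normal.shape[1])
# ae.fit(X_normal, X_normal, epochs=100, batch_size=64, validation_split=0.1)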
(5) The error-optimized autoencoder (EOA) model is constructed, comprising:
(a) The EOA model is trained using the air quality anomaly detection dataset and outputs a reconstruction error for each instance, where the reconstruction error is computed with the MSE.
(b) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (d) and (e) of step (4).
(c) According to the computed reconstruction errors, the data instances are sorted in descending order of reconstruction error, and the first N instances are judged to be outliers, i.e. air quality anomaly data (outliers reconstruct poorly and therefore have the largest errors); a sketch of this ranking follows.
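A minimal Python sketch of the ranking in step (5), assuming the autoencoder ae and the variable names from the previous sketch; the value of N (n_outliers) is an assumption.

# Rank every instance of the anomaly-detection dataset by its reconstruction error
# and flag the N instances with the largest errors as air quality anomalies.
import numpy as np

def detect_anomalies(ae, X_all: np.ndarray, n_outliers: int) -> np.ndarray:
    recon = ae.predict(X_all)                              # reconstructed inputs
    errors = np.mean((X_all - recon) ** 2, axis=1)         # per-instance MSE, eq. (7)
    return np.argsort(errors)[::-1][:n_outliers]           # indices with the largest errors

# anomaly_idx = detect_anomalies(ae, X_all, n_outliers=50)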
(6) The EOA-based air quality anomaly detection method is validated by cross-validation, specifically:
(a) The air quality anomaly detection dataset is split into a training set and a test set in the ratio 8:2, and 10-fold cross-validation is used to test the accuracy of the EOA model.
(b) Precision@N, Recall, and the area under the receiver operating characteristic curve (AUC_ROC) are used as the evaluation metrics of the experiments.
TP and FN denote true positives and false negatives respectively: TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive. TOP_N denotes the first N data instances after the instances have been sorted in ascending order of their probable outlier scores, i.e. the instances judged to be air quality anomalies. The AUC_ROC value is the area under the ROC curve, a graphical tool for depicting classifier performance that shows the relationship between the true positive rate (True Positive Rate, TPR) and the false positive rate (False Positive Rate, FPR) of the classifier at different thresholds.
Drawings
FIG. 1 is a basic structural diagram of an automatic encoder
FIG. 2 is a framework diagram of the air quality anomaly detection method based on the EOA model
Detailed Description
(1) Data preparation:
(a) Air quality monitoring data collected by air quality monitoring stations: the data, collected at a number of monitoring sites distributed across a city, comprise atmospheric particulates, gaseous pollutants, and meteorological factors; the hourly data of each site are stored as one csv file. The site data comprise 11 features in total: PM2.5, PM10, CO, NO2, SO2, O3, pressure, humidity, temperature, wind_direction, and wind_speed, from which the air quality anomaly detection dataset is established.
(2) Data preprocessing: to ensure the accuracy of analysis and modelling, the selected air quality data are first cleaned, which specifically includes handling missing values and abnormal values. Missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted. Some data types are also converted, e.g. numerical timestamps are converted into dates, to facilitate subsequent processing. A sketch of this preprocessing is given below.
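A minimal pandas sketch of steps (1)-(2) under stated assumptions: the column name datetime, the csv layout, the 3-hour interpolation limit, the previous-day fill, and the negative-value rule are illustrative choices not specified by the description.

# Load one station's hourly csv, fill short and long gaps, and convert timestamps to dates.
import pandas as pd

FEATURES = ["PM2.5", "PM10", "CO", "NO2", "SO2", "O3",
            "pressure", "humidity", "temperature", "wind_direction", "wind_speed"]

def load_station(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["datetime"] = pd.to_datetime(df["datetime"])              # numeric/text timestamp -> date
    df = df.set_index("datetime").sort_index()

    # short gaps: linear (or quadratic) interpolation, here limited to 3 consecutive hours
    df[FEATURES] = df[FEATURES].interpolate(method="linear", limit=3)

    # long gaps: fill with the value at the same hour on the previous day
    df[FEATURES] = df[FEATURES].fillna(df[FEATURES].shift(24))

    # clearly abnormal values (e.g. negative concentrations) are treated as missing
    df[FEATURES] = df[FEATURES].mask(df[FEATURES] < 0)
    return df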
(3) Probable outlier instances are selected from each cluster using different strategies, specifically:
(a) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outlier instances in each cluster. The density of data point i can be expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j. For data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points can be expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j. Based on these two features, points that have low density but lie very close to a dense region, and points that lie very far from any dense region, are taken as possible outliers.
(b) Different strategies are used to obtain probable outlier instances from each cluster.
In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance; a smaller score indicates a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1).
In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2). The instances selected in probable_outlier_point^(1) and probable_outlier_point^(2) are then combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the two strategies above, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained. However, true outlier instances that lie very close to dense regions may still be missed.
In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively. After the score POS of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
(4) An air-quality-normal dataset is created and used to train a deep autoencoder model, specifically:
(a) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data.
(b) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air-quality-normal dataset.
(c) The reconstruction error of the deep autoencoder model is computed on the air-quality-normal dataset using the mean squared error (MSE):
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output (the reconstruction of the input), and n is the size of the dataset.
(d) The rectified linear unit (ReLU) and the hyperbolic tangent (tanh) are used as the activation functions of the model:
ReLU: f(x) = max{0, x}  (8)
where x denotes the input vector.
(e) To improve model performance, L1 regularization is used, and adaptive moment estimation (Adaptive Moment Estimation, Adam), i.e. the Adam optimizer, is used for loss optimization.
(5) The error-optimized autoencoder (EOA) model is constructed, comprising:
(a) The EOA model is trained using the air quality anomaly detection dataset and outputs a reconstruction error for each instance, where the reconstruction error is computed with the MSE.
(b) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (d) and (e) of step (4).
(c) According to the computed reconstruction errors, the data instances are sorted in descending order of reconstruction error, and the first N instances are judged to be outliers, i.e. air quality anomaly data (outliers reconstruct poorly and therefore have the largest errors).
(6) The EOA-based air quality anomaly detection method is validated by cross-validation, specifically:
(a) The air quality anomaly detection dataset is split into a training set and a test set in the ratio 8:2, and 10-fold cross-validation is used to test the accuracy of the EOA model.
(b) Precision@N, Recall, and the area under the receiver operating characteristic curve (AUC_ROC) are used as the evaluation metrics of the experiments.
TP and FN denote true positives and false negatives respectively: TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive. TOP_N denotes the first N data instances after the instances have been sorted in ascending order of their probable outlier scores, i.e. the instances judged to be air quality anomalies. The AUC_ROC value is the area under the ROC curve, a graphical tool for depicting classifier performance that shows the relationship between the true positive rate (True Positive Rate, TPR) and the false positive rate (False Positive Rate, FPR) of the classifier at different thresholds. A sketch of these metrics is given below.
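A minimal Python sketch of the evaluation in step (6)(b) under stated assumptions: binary ground-truth labels y_true (1 = anomaly) and the EOA reconstruction errors used as outlier scores (larger = more anomalous) are assumed; Precision@N and Recall are computed by treating the TOP_N ranked instances as the predicted anomalies, and scikit-learn is assumed for AUC_ROC.

# Precision@N, Recall, and AUC_ROC from labels and per-instance reconstruction errors.
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_n(y_true: np.ndarray, scores: np.ndarray, n: int) -> float:
    top_n = np.argsort(scores)[::-1][:n]                 # N instances with the largest errors
    return float(y_true[top_n].sum()) / n                # TP among TOP_N divided by N

def recall_at_n(y_true: np.ndarray, scores: np.ndarray, n: int) -> float:
    top_n = np.argsort(scores)[::-1][:n]
    tp = float(y_true[top_n].sum())                      # true anomalies recovered in TOP_N
    return tp / max(float(y_true.sum()), 1.0)            # TP / (TP + FN)

# auc = roc_auc_score(y_true, scores)                    # area under the ROC curve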

Claims (6)

1. An air quality anomaly detection method based on an error optimization automatic encoder model is characterized by comprising the following implementation steps:
(1) Collecting air quality monitoring data from air quality monitoring stations, and storing and preliminarily preprocessing the air quality monitoring data to form an air quality anomaly detection dataset;
(2) Intelligent clustering is carried out on the air quality anomaly detection dataset by applying the DPC and SOM techniques, yielding Π = {C_1, C_2, C_3, ..., C_k}, the set of clusters obtained after applying both the DPC and SOM techniques to the air quality anomaly detection dataset;
(3) Selecting possible outlier instances from each cluster using different strategies;
(4) Removing the probable air quality anomaly data from the air quality anomaly detection dataset to obtain an air quality normal dataset, and training a deep autoencoder model with the air quality normal dataset to optimize its reconstruction error;
(5) Constructing an error-optimized autoencoder (EOA) model, training the EOA model using the air quality anomaly detection dataset, and judging whether each instance is air quality anomaly data according to the reconstruction error output by the model for that instance;
(6) Verifying the air quality anomaly detection method based on the EOA model by means of cross-validation.
2. The method for detecting air quality anomalies based on the error-optimized automatic encoder model according to claim 1, wherein in the step (1), performing data preprocessing includes:
(1) Handling missing values and abnormal values: missing values spanning a short time are filled by linear or quadratic interpolation; missing values spanning a long time are filled with data from the same time period on adjacent dates; clearly abnormal data are replaced or deleted;
(2) Converting some data types, e.g. converting numerical timestamps into dates, to facilitate subsequent processing.
3. The method for detecting air quality anomalies based on an error-optimized automatic encoder model of claim 1, wherein in the step (3), obtaining possible outlier instances from each cluster using different strategies includes:
(1) Within a cluster C_j, two important features are computed for every point, namely its density and its distance to higher-density instances; these features are used to identify probable outliers in each cluster. The density of data point i is expressed as:
density(i) = Σ_{k∈C_j, k≠i} χ(d_ik - d_c^(j)), with χ(d) = 1 if d < 0 and χ(d) = 0 otherwise  (1)
where d_ik is the distance from instance i to each other instance k of cluster C_j, and d_c^(j) is a cutoff distance that depends on the mean and standard deviation of the pairwise distances within cluster C_j; for data point i, whose neighbourhood contains δ(i) samples, the distance to the higher-density points is expressed as:
distance(i) = min { d_ik : k ∈ C_j, density(k) > density(i) }  (2)
i.e. the closest distance to a higher-density point in cluster C_j; points that have low density but lie close to a dense region, and points that lie far from any dense region, are taken as possible outliers;
(2) Obtaining possible outlier instances from each cluster using different strategies;
(a) In the first strategy, to find points that have low density but lie very close to a dense region, a probable outlier score POS^(1) is first computed for each point by multiplying its density by its intra-cluster distance, a smaller score indicating a greater likelihood of being an outlier:
POS^(1)(i) = density(i) * distance(i)  (3)
After the score POS^(1) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(1);
(b) In the second strategy, to identify instances that have low density and lie very far from dense regions, the probable outlier score POS^(2) is computed as the density of each point multiplied by the inverse of its intra-cluster distance:
POS^(2)(i) = density(i) * (1 / distance(i))  (4)
After the score POS^(2) of every data point in a cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point^(2);
the sets probable_outlier_point^(1) and probable_outlier_point^(2) are combined and expressed as:
probable_outlier_point = probable_outlier_point^(1) ∪ probable_outlier_point^(2)  (5)
By using the above two strategies, anomalous instances that have low density and lie either very far from or very close to dense regions can be obtained; however, true outlier instances that lie very close to a dense region may still be missed;
(c) In the third strategy, to find true outlier instances that lie very close to a dense region, a probable outlier score POS is obtained using a Gaussian function of the distance feature, expressed as:
POS(i) = density(i) * f(distance(i))  (6)
where f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²)), and σ and μ are the parameters of the normal distribution, representing its standard deviation and mean respectively; after the score POS of every data point in the cluster has been computed, the data instances are sorted in ascending order and the first N instances are selected as the probable outlier data of cluster C_j, denoted probable_outlier_point.
4. The method for detecting air quality anomalies based on an error-optimized automatic encoder model of claim 1, wherein in the step (4), creating an air quality normal dataset and training a depth automatic encoder model therewith includes:
(1) Let D be the unlabeled dataset and P the set of detected probable outlier points, i.e. the air quality anomaly data;
(2) After the probable air quality anomaly data have been detected, these instances are excluded from the air quality anomaly detection dataset, and the remaining points (D - P) are regarded as "normal points", i.e. the air quality normal dataset;
(3) The reconstruction error of the deep automatic encoder model is computed on the air quality normal dataset using the mean square error MSE:
MSE = (1/n) Σ_{i=1}^{n} (X_i - X̂_i)²  (7)
where X_i denotes the input, X̂_i the output, and n is the size of the dataset;
(4) Using the rectified linear unit (ReLU) and the hyperbolic tangent (tanh) as the activation functions of the model;
ReLU: f(x) = max{0, x}  (8)
wherein x represents an input vector;
(5) In order to improve the model performance, L1 regularization is used, and adaptive moment estimation optimization, abbreviated as Adam optimizer, is used for loss optimization.
5. The method for detecting air quality anomalies based on the error-optimized automatic encoder model according to claim 1, wherein in the step (5), constructing an error-optimized automatic encoder EOA model includes:
(1) Training an EOA model using the air quality anomaly detection dataset, outputting a reconstruction error for each instance, wherein the reconstruction error is calculated using MSE;
(2) The activation functions, L1 regularization, and Adam optimizer are the same as in sub-steps (4) and (5) of claim 4;
(3) According to the computed reconstruction errors, sorting the data instances in descending order of reconstruction error, and judging the first N data instances as outliers, i.e. air quality anomaly data.
6. The method according to claim 1, wherein in the step (6), the model verification section includes:
(1) The air quality anomaly detection dataset is divided into a training set and a test set in the ratio 8:2, and the accuracy of the EOA model is tested using 10-fold cross-validation;
(2) Using Precision@N, Recall, and the area under the receiver operating characteristic curve as the evaluation metrics of the experiments;
TP and FN denote true positives and false negatives respectively, where TP is the number of samples predicted to be positive that are actually positive, and FN is the number of samples predicted to be negative that are actually positive; TOP_N denotes the first N data instances after the instances have been sorted in ascending order of probable outlier score, i.e. the instances judged to be air quality anomalies; the AUC_ROC value is the area under the ROC curve, which is used to measure classifier performance, where the ROC curve shows the relationship between the true positive rate and the false positive rate of the classifier at different thresholds.
CN202311051456.6A 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model Pending CN117332344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311051456.6A CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311051456.6A CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Publications (1)

Publication Number Publication Date
CN117332344A true CN117332344A (en) 2024-01-02

Family

ID=89292194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311051456.6A Pending CN117332344A (en) 2023-08-21 2023-08-21 Air quality anomaly detection method based on error optimization automatic encoder model

Country Status (1)

Country Link
CN (1) CN117332344A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972618A (en) * 2024-04-01 2024-05-03 青岛航天半导体研究所有限公司 Method and system for detecting secondary power failure of hybrid integrated circuit

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN111695639A (en) Power consumer power consumption abnormity detection method based on machine learning
WO2024021246A1 (en) Cross-device incremental bearing fault diagnosis method based on continuous learning
CN111353373B (en) Related alignment domain adaptive fault diagnosis method
WO2021114231A1 (en) Training method and detection method for network traffic anomaly detection model
Liu et al. A two-stage deep autoencoder-based missing data imputation method for wind farm SCADA data
Wang et al. Random convolutional neural network structure: An intelligent health monitoring scheme for diesel engines
CN104751229A (en) Bearing fault diagnosis method capable of recovering missing data of back propagation neural network estimation values
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN110020712B (en) Optimized particle swarm BP network prediction method and system based on clustering
CN117332344A (en) Air quality anomaly detection method based on error optimization automatic encoder model
CN115484102A (en) Industrial control system-oriented anomaly detection system and method
CN110245390B (en) Automobile engine oil consumption prediction method based on RS-BP neural network
CN113705099B (en) Social platform rumor detection model construction method and detection method based on contrast learning
CN114548199A (en) Multi-sensor data fusion method based on deep migration network
CN116663613A (en) Multi-element time sequence anomaly detection method for intelligent Internet of things system
CN114049305A (en) Distribution line pin defect detection method based on improved ALI and fast-RCNN
CN116248392A (en) Network malicious traffic detection system and method based on multi-head attention mechanism
CN116542170A (en) Drainage pipeline siltation disease dynamic diagnosis method based on SSAE and MLSTM
Chou et al. SHM data anomaly classification using machine learning strategies: A comparative study
CN116776270A (en) Method and system for detecting micro-service performance abnormality based on transducer
CN114861778A (en) Method for rapidly classifying rolling bearing states under different loads by improving width transfer learning
CN114139624A (en) Method for mining time series data similarity information based on integrated model
CN113987910A (en) Method and device for identifying load of residents by coupling neural network and dynamic time planning
CN117112992A (en) Fault diagnosis method for polyester esterification stage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination