CN116776271A

CN116776271A - Polluted time sequence unsupervised anomaly detection method based on negative correlation

Info

Publication number: CN116776271A
Application number: CN202310790901.4A
Authority: CN
Inventors: 李佐勇; 林晓辉; 樊好义; 陈新伟; 黄训华
Original assignee: Minjiang University
Current assignee: Minjiang University
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-09-19

Abstract

The invention relates to an unsupervised anomaly detection method for contaminated time series based on negative correlation. A soft contamination calibration strategy is established using the negative correlation between semantic representation and anomaly detection within an autoencoder framework. To model this negative correlation, the invention introduces morphological similarity to represent semantics and reconstruction consistency to detect anomalies. First, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution. Then, an anomaly metric calibration loss function is designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples. Experiments on public time series datasets show that the method effectively improves anomaly detection performance when the training set is contaminated.

Description

Polluted time sequence unsupervised anomaly detection method based on negative correlation

Technical Field

The invention belongs to the technical field of time sequence anomaly detection, and particularly relates to a polluted time sequence unsupervised anomaly detection method based on negative correlation.

Background

In recent years, the rapid development of science and technology has produced large amounts of data in many fields. Time series data make up a considerable share of it, including electrocardiogram records, network traffic records, stock price quotations and the like. How to fully mine and use such time series data has become a research hotspot in data mining. In some fields, anomalous data often carries the more valuable information, so research on time series anomaly detection has attracted growing attention.

At present, supervised learning, a theoretically well-founded approach, is widely applied in research on time series anomaly detection. However, supervised learning requires a large amount of labeled training data, which is time-consuming and labor-intensive to obtain. Unsupervised learning, which can extract useful information from unlabeled data, therefore shows broad research prospects in time series anomaly detection, but it also faces great challenges.

In recent years, many researchers have made significant progress on unsupervised time series anomaly detection. Some approaches use conventional proximity-based algorithms [1], which use spatial information such as distance and density to separate outlying points from normal data and define them as anomalies. While these methods are easy to understand, they have limited applicability. Recently, with the rapid development of deep learning, methods based on deep convolutional neural networks have been applied to time series anomaly detection [2-4]. These methods work mainly from the viewpoint of reconstruction error: a generative model learns the intrinsic features of the dataset and relies on commonalities within the data to detect outliers, and a larger reconstruction error is taken as a sign of an anomaly. When the training data is sufficiently rich, the generative model can converge quickly.

In real-world scenarios, because labeling data is time-consuming and labor-intensive, datasets are easily contaminated by anomalies. Traditional unsupervised time series anomaly detection methods produce biased anomaly metrics when trained on contaminated data. Currently, most existing methods use a hard strategy to calibrate contaminated data, i.e., they assign pseudo-labels to the training data. However, these hard strategies depend on the choice of threshold, which leads to suboptimal performance.

Disclosure of Invention

The invention aims to solve the above problems and provides an unsupervised anomaly detection method for contaminated time series based on negative correlation, which uses the negative correlation between semantic representation and anomaly detection in an autoencoder framework to establish a soft contamination calibration strategy. To model this negative correlation, the invention introduces morphological similarity to represent semantics and reconstruction consistency to detect anomalies. First, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution. Then, an anomaly metric calibration loss function is designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples. Experiments on public time series datasets show that the method effectively improves anomaly detection performance when the training set is contaminated.

In order to achieve the above purpose, the technical scheme of the invention is as follows: an unsupervised anomaly detection method for contaminated time series based on negative correlation uses the negative correlation between semantic representation and anomaly detection in an autoencoder framework to establish a soft contamination calibration strategy; to model the negative correlation, morphological similarity is introduced to represent semantics and reconstruction consistency is introduced to detect anomalies; first, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution; an anomaly metric calibration loss function is then designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples.

In one embodiment of the invention, the method is implemented as follows:

a set of contaminated training samples $\{x_1, x_2, x_3, \ldots, x_N\}$ is input, which includes examples of the anomaly category;

mapping the samples to a low-dimensional feature space using a lightweight encoder network f;

using an adversarial training paradigm, aligning the posterior distribution of the low-dimensional feature space with a Gaussian distribution by means of a discriminator D;

reconstructing the samples from the low-dimensional feature space using a decoder g, and generating representative normal samples by sampling from the center of the distribution;

calculating a reconstruction consistency loss between each input sample and its reconstructed sample, and a morphological similarity loss between the reconstructed sample and the representative normal sample, which measure the quality of the reconstructed sample and its similarity to the representative normal sample;

feeding the reconstruction consistency loss and the morphological similarity loss into an anomaly metric calibration loss, which uses the negative correlation between reconstruction consistency and morphological similarity to calibrate the anomaly metric bias caused by contaminated samples.

In one embodiment of the present invention, the generation method of the representative normal sample is specifically as follows:

a representative normal sample is generated from the central region of the low-dimensional feature space and follows a Gaussian distribution; this is realized with a generative adversarial network (GAN) framework; first, a discriminator D is introduced and trained to distinguish the features of input samples from noise sampled from a Gaussian distribution; the adversarial loss function has the following form:

wherein the noise vector ω is sampled from the Gaussian distribution $\mathcal{N}(\mu_z, I_d)$, $\mu_z$ denotes the mean of the features, and $I_d$ denotes the covariance matrix; one term is the expected value of the logarithm of the output of the discriminator D on the input ω; in the other term, a sample x is drawn from the real data distribution, f(x) denotes the features extracted from x by the encoder network, and the expectation is taken over the logarithm of the output of the discriminator D on the encoder feature f(x). The adversarial loss function forces the distribution of the features to stay consistent with $\mathcal{N}(\mu_z, I_d)$;

to generate representative normal samples, random noise is sampled from the learned Gaussian distribution, wherein γ is a hyperparameter of the sampling distribution; adjusting the magnitude of γ controls how closely the random noise is sampled around the center of the distribution; to generate a representative normal sample, the decoder network g maps the sampled random noise to the representative normal sample. In the present invention, γ is set to 0.1. Setting γ to a smaller value gives a higher chance of generating a representative normal sample.
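The effect of γ on center sampling can be illustrated with a minimal sketch. It assumes that γ scales the covariance of the learned Gaussian, i.e. noise is drawn from $\mathcal{N}(\mu_z, \gamma I_d)$; this parameterization and the helper names `sample_center_noise` and `mean_distance_to_center` are hypothetical and not taken from the source.

```python
import math
import random


def sample_center_noise(mu_z, gamma, rng):
    """Draw one noise vector from N(mu_z, gamma * I) (assumed parameterization)."""
    std = math.sqrt(gamma)  # gamma scales the covariance, so std = sqrt(gamma)
    return [m + std * rng.gauss(0.0, 1.0) for m in mu_z]


def mean_distance_to_center(mu_z, gamma, n, seed=0):
    """Average Euclidean distance between sampled noise and the center mu_z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = sample_center_noise(mu_z, gamma, rng)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(w, mu_z)))
    return total / n


mu_z = [0.0] * 8  # toy 8-dimensional feature mean
near = mean_distance_to_center(mu_z, gamma=0.1, n=2000)
far = mean_distance_to_center(mu_z, gamma=1.0, n=2000)
print(near < far)  # smaller gamma keeps samples closer to the center
```

Under this assumption, feeding such low-γ noise to the decoder g would yield samples decoded from the central region of the feature space, which is where the text expects representative normal samples to lie.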

In one embodiment of the present invention, the calibration method for abnormal measurement deviation caused by the polluted sample is specifically implemented as follows:

introducing a morphological similarity loss and a reconstruction consistency loss, defined in terms of the input sample $x_i$, the representative normal sample, and the reconstruction result of $x_i$; these loss functions quantify the quality of the reconstructed sample and its similarity to the representative normal sample;

designing the anomaly metric calibration loss: first consider regression with the L2 reconstruction loss, whose goal is to minimize the squared error between the reconstruction value and zero; when the model infers a Gaussian distribution with reconstruction mean μ and variance $\sigma^2$, the objective is modified so as to maximize the probability that the reconstruction value is zero; thus, using the Gaussian probability density function, the objective is expressed as maximizing the following form:

equation (2) is further rewritten as minimizing the following form:

as shown in equation (3), the L2 loss term is attenuated by the variance and is further penalized by the noise term; amplifying the noise term weakens the influence of the L2 loss while increasing the penalty on the noise term itself, which effectively prevents the reconstruction loss from being minimized indefinitely; to ensure reliability and reduce ambiguity in representing the noise term, morphological similarity is used to set it; taking the reconstruction target into account, the anomaly metric calibration loss derived from equation (3) is expressed as follows:

wherein the reconstruction consistency loss is used to represent $\mu^2$;

Equation (4) uses the morphological similarity loss to impose a smaller penalty on the reconstruction consistency loss of contaminated samples, which effectively hinders overfitting on contaminated samples; furthermore, abnormal samples show a reconstruction pattern different from that of the representative normal samples, so the morphological similarity loss of contaminated samples is higher; the reconstruction consistency loss is therefore scaled down to a lower level by the morphological similarity loss; on the other hand, the second term penalizes relatively high morphological similarity loss values and prevents the reconstruction loss from being minimized indefinitely.
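The two effects described for the calibration loss can be checked numerically with a small sketch. It assumes the loss takes the Gaussian negative log-likelihood form `rec / (2 * sim) + 0.5 * log(sim)`, with the reconstruction consistency loss standing in for $\mu^2$ and the morphological similarity loss for $\sigma^2$; the function names and the `eps` stabilizer are hypothetical, and the exact constants in the original equation (4) may differ.

```python
import math


def calibration_loss(rec_loss, sim_loss, eps=1e-8):
    """Assumed Gaussian-NLL-style form: rec / (2 * sim) + 0.5 * log(sim)."""
    sim = sim_loss + eps  # eps avoids division by zero and log(0)
    return rec_loss / (2.0 * sim) + 0.5 * math.log(sim)


def rec_penalty_weight(sim_loss, eps=1e-8):
    """Derivative of the loss w.r.t. rec_loss: the weight on reconstruction error."""
    return 1.0 / (2.0 * (sim_loss + eps))


# A sample that resembles the representative normal sample (low sim loss)
# versus one that does not (high sim loss, e.g. a contaminated sample).
normal_weight = rec_penalty_weight(sim_loss=0.2)
contaminated_weight = rec_penalty_weight(sim_loss=2.0)
print(contaminated_weight < normal_weight)  # weaker pull to fit contaminated samples

# For a fixed reconstruction loss, the log term stops sim_loss from growing freely:
# the loss is smallest near sim_loss == rec_loss and rises on both sides.
print(calibration_loss(1.0, 0.5), calibration_loss(1.0, 1.0), calibration_loss(1.0, 2.0))
```

In this form, a high morphological similarity loss down-weights the reconstruction error of a sample, while the logarithmic term penalizes inflating that loss, which is the trade-off the text attributes to equation (4).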

In one embodiment of the invention, given a test set $\{x_1, x_2, x_3, \ldots, x_N\}$, the anomaly score of data $x_i$ is calculated as follows:

wherein the score is computed from $x_i$ and its reconstruction result; $\mathrm{Score}(x_i)$ is then compared with a threshold t to determine whether $x_i$ belongs to the normal or the abnormal category.

Compared with the prior art, the invention has the following beneficial effects: the invention provides an unsupervised anomaly detection method for contaminated time series based on negative correlation, which uses the negative correlation between semantic representation and anomaly detection in an autoencoder framework to establish a soft contamination calibration strategy. To model this negative correlation, the invention introduces morphological similarity to represent semantics and reconstruction consistency to detect anomalies. First, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution. Then, an anomaly metric calibration loss function is designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples. Experiments on public time series datasets show that the method effectively improves anomaly detection performance when the training set is contaminated.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The technical scheme of the invention is specifically described below with reference to the accompanying drawings.

The invention provides an unsupervised anomaly detection method for contaminated time series based on negative correlation, which uses the negative correlation between semantic representation and anomaly detection in an autoencoder framework to establish a soft contamination calibration strategy; to model the negative correlation, morphological similarity is introduced to represent semantics and reconstruction consistency is introduced to detect anomalies; first, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution; an anomaly metric calibration loss function is then designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples.

The method of the invention is concretely realized as follows:

The model proposed by the present invention is shown in FIG. 1. The input of the model is a set of contaminated training samples $\{x_1, x_2, x_3, \ldots, x_N\}$, which includes examples of the anomaly category. The labels of the input samples are not available to the model.

First, we use a lightweight encoder network f to map the samples into a low-dimensional feature space. Subsequently, using an adversarial training paradigm, the discriminator D aligns the posterior distribution of the feature space with a Gaussian distribution. Next, we reconstruct the samples from the feature space with the decoder g and generate representative normal samples by sampling from the center of the distribution. Then, we calculate the reconstruction consistency loss between each input sample and its reconstructed sample, and the morphological similarity loss between the reconstructed sample and the representative normal sample. These losses measure the quality of the reconstructions and their similarity to a representative normal sample. Finally, the reconstruction consistency loss and the morphological similarity loss are fed into the anomaly metric calibration loss, which uses the negative correlation between reconstruction consistency and morphological similarity to correct the biased anomaly metric.

1. Generation of representative samples

The invention aims to measure morphological similarity against representative normal samples and thereby highlight the negative correlation. For this purpose, we focus on generating representative normal samples, which are drawn from the central region of the feature space and follow a Gaussian distribution. To achieve this, we employ a generative adversarial network (GAN) framework. First, we introduce a discriminator D and train it to distinguish the features of input samples from noise sampled from the Gaussian distribution. The adversarial loss function has the following form:

2. Abnormal metric calibration loss

In order to calibrate biased anomaly metrics under data contamination, the present invention exploits the negative correlation between reconstruction consistency and morphological similarity. To this end, we introduce two loss functions, a morphological similarity loss and a reconstruction consistency loss, defined in terms of the input sample $x_i$, the representative normal sample, and the reconstruction result of $x_i$. These loss functions quantify the quality of the reconstruction and its similarity to a representative normal sample.

We then detail the design of the anomaly metric calibration loss. We first consider regression with the L2 reconstruction loss, whose goal is to minimize the squared error between the reconstruction value and zero. When the model infers a Gaussian distribution with reconstruction mean μ and variance $\sigma^2$, we modify the objective so that it maximizes the probability of the reconstruction value being zero. Thus, using the Gaussian probability density function, we express the learning objective as maximizing the following form:

Equation (2) can be further rewritten as minimizing the following form:

As shown in equation (3), the L2 loss term is attenuated by the variance and is further penalized by the noise term. Amplifying the noise term weakens the influence of the L2 loss while increasing the penalty on the noise term itself, which effectively prevents the model from minimizing the reconstruction loss indefinitely. In the presence of data contamination, assigning a larger noise term to contaminated samples effectively mitigates overfitting to those samples. To ensure reliability and reduce ambiguity in representing the noise term, we set it using morphological similarity. Taking the reconstruction target into account, the anomaly metric calibration loss derived from equation (3) is expressed as follows:

where we use the reconstruction consistency loss to represent $\mu^2$.

The anomaly metric calibration loss function is defined in equation (4). It uses the morphological similarity loss to impose a smaller penalty on the reconstruction consistency loss of contaminated samples, which effectively prevents overfitting to such data. Furthermore, we expect abnormal samples to exhibit a reconstruction pattern different from that of the representative normal samples, so the morphological similarity loss of contaminated samples is higher. The reconstruction consistency loss is therefore scaled down to a lower level by the morphological similarity loss. On the other hand, the second term penalizes relatively high morphological similarity loss values and prevents the reconstruction loss from being minimized indefinitely.
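The derivation described in this section can be written out as a sketch. The symbols $\mathcal{L}_{rec}$ (reconstruction consistency loss), $\mathcal{L}_{sim}$ (morphological similarity loss) and $\mathcal{L}_{cal}$ (calibration loss) are assumed notation, and the constant factors follow the standard Gaussian negative log-likelihood rather than a form confirmed by the text.

```latex
% Probability of a zero reconstruction value under N(mu, sigma^2); compare Eq. (2)
\max_{\mu,\,\sigma^2}\; p(0 \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\mu^2}{2\sigma^2}\right)

% Negative logarithm, dropping the constant (1/2) log(2 pi); compare Eq. (3)
\min_{\mu,\,\sigma^2}\; \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2

% Assumed substitution mu^2 -> L_rec and sigma^2 -> L_sim; compare Eq. (4)
\mathcal{L}_{cal}
  = \frac{\mathcal{L}_{rec}}{2\,\mathcal{L}_{sim}} + \frac{1}{2}\log\mathcal{L}_{sim}
```

Read this way, a large $\mathcal{L}_{sim}$ shrinks the weight $1/(2\mathcal{L}_{sim})$ on the reconstruction error, while the logarithmic second term grows with $\mathcal{L}_{sim}$, which matches the two effects attributed to equation (4).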

3. Abnormality score

By training the anomaly detector on normal data, we can use the reconstruction error to assess the degree of anomaly of the data. When the training data is contaminated, the present invention uses the negative correlation between reconstruction consistency and morphological similarity to calibrate the biased anomaly metric caused by the contaminated data. This ensures that contaminated data exhibits a higher reconstruction error than normal samples. Thus, given a test set $\{x_1, x_2, x_3, \ldots, x_N\}$, the anomaly score of data $x_i$ is calculated as follows:

where the score is computed from $x_i$ and its reconstruction result. $\mathrm{Score}(x_i)$ is compared with a threshold t to determine whether $x_i$ belongs to the normal or the abnormal category.
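The scoring and thresholding step can be sketched as follows. The score is assumed here to be the mean squared error between a sample and its reconstruction, which is one common choice of reconstruction error; the source does not confirm the exact formula, and the function names and the threshold value are hypothetical.

```python
def anomaly_score(x, x_rec):
    """Mean squared reconstruction error between x and its reconstruction (assumed score)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def classify(x, x_rec, threshold):
    """Label a sample abnormal when its score exceeds the threshold t."""
    return "abnormal" if anomaly_score(x, x_rec) > threshold else "normal"


# Toy series: the first reconstruction is close to its input, the second is not.
well_reconstructed = classify([0.0, 1.0, 0.0], [0.0, 1.0, 0.1], threshold=0.1)
poorly_reconstructed = classify([0.0, 1.0, 0.0], [1.0, 0.0, 1.0], threshold=0.1)
print(well_reconstructed, poorly_reconstructed)
```

In practice the threshold t would be chosen on held-out data; AUC-based evaluation, as used in the experiments, avoids fixing a single threshold.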

4. Experiment verification

(1) Data set and evaluation index

To verify the effectiveness of the algorithm of the present invention, experiments were performed on two datasets from the public UCR time series archive, namely the UWaveGestureLibraryAll dataset and the InsectWingbeatSound dataset. In the experiments, we took the following steps to verify the robustness of the algorithm on contaminated datasets.

First, we set one category in the dataset as the normal category and the other categories as abnormal categories to simulate anomalies in a real scenario. Second, we calculate the contamination rate ρ from N and A, the numbers of normal and abnormal samples respectively, which quantifies the proportion of abnormal samples in the dataset. We then construct a contaminated training set based on the contamination rate, so that the proportion of abnormal samples in the training set equals the contamination rate. Finally, we split the data into a training set and a test set in a ratio of 80% to 20%.
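A contaminated split of this kind could be built as in the sketch below. It assumes the contamination rate is A / (N + A) within the training set, that 80% of the normal samples go to training, and that the remaining samples form the test set; the function name `build_contaminated_split` and these splitting details are hypothetical, since the source gives only the overall procedure.

```python
import random


def build_contaminated_split(normal, abnormal, rate, train_frac=0.8, seed=0):
    """Return (train, test) lists of (sample, label) pairs; label 1 marks an anomaly."""
    rng = random.Random(seed)
    normal, abnormal = normal[:], abnormal[:]  # copy so the caller's lists are untouched
    rng.shuffle(normal)
    rng.shuffle(abnormal)
    n_train_normal = int(train_frac * len(normal))
    # Choose the anomaly count a so that a / (n + a) equals the contamination rate.
    n_anom = round(rate * n_train_normal / (1.0 - rate))
    n_anom = min(n_anom, len(abnormal))
    train = [(x, 0) for x in normal[:n_train_normal]]
    train += [(x, 1) for x in abnormal[:n_anom]]
    test = [(x, 0) for x in normal[n_train_normal:]]
    test += [(x, 1) for x in abnormal[n_anom:]]
    rng.shuffle(train)
    return train, test


# Toy identifiers stand in for time series samples.
train, test = build_contaminated_split(list(range(100)), list(range(1000, 1050)), rate=0.2)
print(sum(label for _, label in train) / len(train))  # contamination rate of the training set
```

The labels are kept only so that the contamination rate and the test AUC can be checked; an unsupervised model would not see them during training.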

In the invention, the evaluation index AUC value commonly used in anomaly detection is selected to measure the algorithm performance. The calculation formula of the AUC value is as follows:

wherein $\mathrm{rank}_i$ denotes the position of the i-th normal sample in the ordered sequence, and N and A denote the numbers of normal and abnormal samples, respectively. By calculating the AUC value, we can evaluate the performance of the algorithm on the anomaly detection task.
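A rank-based AUC of this kind can be computed as in the following sketch. It assumes the standard rank-sum (Mann-Whitney) form, AUC = (sum of ranks of normal samples − N(N+1)/2) / (N·A), with samples ordered from the highest anomaly score to the lowest and tied scores sharing their average rank; the ordering convention and tie handling are assumptions, as the source formula is not reproduced in the text.

```python
def rank_auc(scores, labels):
    """Rank-based AUC; labels: 0 = normal, 1 = abnormal; higher score = more anomalous."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sample has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + 1 + end + 1) / 2.0  # average 1-based rank of the tied group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    n = sum(1 for y in labels if y == 0)  # number of normal samples, N
    a = len(labels) - n                   # number of abnormal samples, A
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 0)
    return (rank_sum - n * (n + 1) / 2.0) / (n * a)


print(rank_auc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))  # anomalies ranked above all normals
```

The numerator counts the (normal, abnormal) pairs in which the abnormal sample receives the higher score, so the value is the fraction of correctly ordered pairs.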

(2) Baseline method

The invention adopts the following methods to carry out comparison experiments:

AE-CNN method [2]: the method builds an end-to-end data reconstruction model based on a Convolutional Neural Network (CNN) and an automatic encoder. It is used for an unsupervised anomaly detection task to capture anomaly patterns by learning a low-dimensional representation of the data and reconstructing the input data.

DeepSVDD method [3]: this is an unsupervised anomaly detection method that uses a neural network to extract features and introduces a deep one-class classification objective. It separates abnormal samples from normal samples by gathering the normal samples in a compact spherical region.

NCAE method [4]: the method is an unsupervised anomaly detection method, and improves anomaly detection performance on a polluted data set by distributing pseudo-pollution labels to training samples within a given threshold range and maximizing reconstruction errors of the training samples.

(3) Experimental details

The experiments use Python 3.6 and the PyTorch deep learning framework as the basic tools. All experiments were performed on an Ubuntu 18.04 system configured with an Intel Core i7-8700, 64 GB of memory and an NVIDIA GeForce RTX 2070. We used a four-layer 1D convolutional neural network with LeakyReLU activation and batch normalization as the backbone encoder f for our method and for the baseline methods. The discriminator D consists of two linear layers. To generate representative normal samples, we selected γ = 0.1. For optimization, we use the Adam optimizer with an initial learning rate lr = 0.0001 and momentum parameters $\beta_1 = 0.5$, $\beta_2 = 0.999$. All models were trained for 200 iterations with a batch size of 64.

(4) Experimental results

To reduce the extent to which the experimental results are affected by randomness, the experimental results in this study are presented based on the average of 5 runs.

To evaluate the anomaly detection performance and robustness of the AE-CNN method [2], the DeepSVDD method [3], the NCAE method [4] and the algorithm of the present invention under data contamination, we set different contamination rates (5%, 10%, 15% and 20%). The AUC values of the different methods are listed in Table 1, where the best result in each column is shown in bold.

Table 1 experimental results for different baseline methods

First, as can be seen from Table 1, the algorithm of the present invention is significantly better than the baseline methods when the dataset is contaminated (i.e., ρ is greater than 0%). For example, when the contamination rate of the training set is 20%, the algorithm improves by 9.3% over the best baseline, the NCAE method [4], on the UWaveGestureLibraryAll dataset, and by 8.8% on the InsectWingbeatSound dataset. Likewise, when the contamination rate is 15%, the algorithm improves by 7.3% over the best baseline, the NCAE method [4], on the UWaveGestureLibraryAll dataset, and by 1.6% on the InsectWingbeatSound dataset. When the contamination rate is 10%, the algorithm improves by 3.3% over the best baseline, the NCAE method [4], on the UWaveGestureLibraryAll dataset, and by 2.5% over the best baseline, the AE-CNN method [2], on the InsectWingbeatSound dataset. When the contamination rate is 5%, the algorithm improves by 1.1% over the best baseline, the NCAE method [4], on the UWaveGestureLibraryAll dataset, and by 1.5% over the best baseline, the AE-CNN method [2], on the InsectWingbeatSound dataset.

Compared with NCAE method [4] for treating pollution problem based on hard strategy, the algorithm of the invention has obvious advantage in performance. Although the NCAE method [4] can deal with the problem of data contamination by assigning pseudo tags to training samples, it still faces the complex problem of threshold selection. In contrast, the algorithm of the present invention provides a flexible pollution calibration strategy that automatically adapts to data pollution using negative correlation. In general, the algorithm of the invention can effectively reduce the influence of pollution data on the unsupervised anomaly detection performance of the time series.

References:

[1] Angiulli F, Pizzuti C. Fast outlier detection in high dimensional spaces [C]. Principles of Data Mining and Knowledge Discovery: 6th European Conference, PKDD 2002, Helsinki, Finland, August 19–23, 2002, Proceedings 6. Springer Berlin Heidelberg, 2002, pp. 15–27.

[2] Haselmann M, Gruber D P, Tabatabai P. Anomaly detection using deep learning based image completion [C]. 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018, pp. 1237–1242.

[3] Ruff L, Vandermeulen R, Goernitz N, et al. Deep one-class classification [C]. International Conference on Machine Learning, 2018, pp. 4393–4402.

[4] Yu J, Oh H, Kim M, et al. Normality-calibrated autoencoder for unsupervised anomaly detection on data contamination [C]. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

The above is a preferred embodiment of the present invention; all changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the resulting functional effects do not exceed the scope of the technical solution.

Claims

1. An unsupervised anomaly detection method for contaminated time series based on negative correlation, characterized in that a soft contamination calibration strategy is established using the negative correlation between semantic representation and anomaly detection in an autoencoder framework; to model the negative correlation, morphological similarity is introduced to represent semantics and reconstruction consistency is introduced to detect anomalies; first, morphological similarity is measured against representative normal samples generated from the learned Gaussian distribution; an anomaly metric calibration loss function is then designed, based on the negative correlation between morphological similarity and reconstruction consistency, to calibrate the anomaly metric bias caused by contaminated samples.

2. The method for unsupervised anomaly detection of a contaminated time series based on negative correlation according to claim 1, wherein the method is implemented as follows:

3. The method for unsupervised anomaly detection of contaminated time series based on negative correlation according to claim 2, wherein the representative normal sample is generated by the following method:

wherein the noise vector ω is sampled from the Gaussian distribution $\mathcal{N}(\mu_z, I_d)$, $\mu_z$ denotes the mean of the features, and $I_d$ denotes the covariance matrix; one term is the expected value of the logarithm of the output of the discriminator D on the input ω; in the other term, a sample x is drawn from the real data distribution, f(x) denotes the features extracted from x by the encoder network, and the expectation is taken over the logarithm of the output of the discriminator D on the encoder feature f(x); the adversarial loss function forces the distribution of the features to stay consistent with $\mathcal{N}(\mu_z, I_d)$;

to generate representative normal samples, random noise is sampled from the learned Gaussian distribution, wherein γ is a hyperparameter of the sampling distribution; adjusting the magnitude of γ controls how closely the random noise is sampled around the center of the distribution; to generate a representative normal sample, the decoder network g is used to map the sampled random noise to the representative normal sample; setting the value of γ to 0.1 gives a higher chance of generating a representative normal sample.

4. The method for unsupervised anomaly detection of contaminated time series based on negative correlation according to claim 2, wherein the calibration of anomaly metric deviation caused by contaminated samples is specifically implemented as follows:

equation (2) is further modified to minimize the form:

wherein the reconstruction consistency loss is used to represent $\mu^2$;

Equation (4) uses the morphological similarity loss to impose a smaller penalty on the reconstruction consistency loss of contaminated samples, which effectively prevents overfitting to the contaminated samples; furthermore, abnormal samples show a reconstruction pattern different from that of the representative normal samples, so the morphological similarity loss of contaminated samples is higher; the reconstruction consistency loss is therefore scaled down to a lower level by the morphological similarity loss; on the other hand, the second term penalizes relatively high morphological similarity loss values and prevents the reconstruction loss from being minimized indefinitely.

5. The method for unsupervised anomaly detection of contaminated time series based on negative correlation according to claim 2, wherein, given a test set $\{x_1, x_2, x_3, \ldots, x_N\}$, the anomaly score of data $x_i$ is calculated as follows: