CN111275113B - Skew time series anomaly detection method based on cost-sensitive hybrid network - Google Patents

Skew time series anomaly detection method based on cost-sensitive hybrid network

Info

Publication number
CN111275113B
Authority
CN
China
Prior art keywords
representing, cost, samples, layer, network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010065816.8A
Other languages
Chinese (zh)
Other versions
CN111275113A (en)
Inventor
王晓峰
张英
李斌
王妍
雷锦锦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010065816.8A priority Critical patent/CN111275113B/en
Publication of CN111275113A publication Critical patent/CN111275113A/en
Application granted granted Critical
Publication of CN111275113B publication Critical patent/CN111275113B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/2433 - Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 - Combinations of networks
    • G06N3/048 - Activation functions
    • G06N3/084 - Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a skew time series anomaly detection method based on a cost-sensitive hybrid network. A cost-sensitive hybrid network model consisting of a deep convolutional neural network, a gated recurrent network and a cost-sensitive loss function is first established and trained: the local features of the time series are learned by the deep convolutional neural network, the sequence features of the time series are learned by the gated recurrent network, and the combined features are then used for classification. During model training, the similarity between the output result and the true value is measured with the cost-sensitive loss function, the parameters of the network model are adjusted by the back-propagation algorithm, and different penalty factors are applied to samples of different numbers and categories to penalize misdetections of the network model. The method is simple and efficient, has high precision and strong robustness, and achieves high detection accuracy on both skewed and non-skewed time series data sets.

Description

Skew time series anomaly detection method based on cost-sensitive hybrid network
Technical Field
The invention belongs to the technical field of time series data anomaly detection, and relates to a skew time series anomaly detection method based on a cost-sensitive hybrid network.
Background
Skewed time series data refers to data sets in which the sample sizes of the different classes differ widely. In practical applications, most of the time series data obtained by engineering measurement lies within the normal range and contains only a very small number of abnormal values, which is a typical skewed time series data set. In the binary classification problem, the result of a general classifier is biased toward the normal class, so the false detection rate for the abnormal class is very high. In practice, however, it is the minority class that is of interest, for example fault detection of spacecraft, disease diagnosis in the medical field, and credit card fraud detection in the financial field.
Time series classification methods based on deep learning operate on the whole time series and combine the feature extraction stage and the classification stage into a single process. Classification of univariate time series based on a multi-channel convolutional neural network model (MC-CNN) was proposed in the literature [Y. Zheng, Q. Liu, E. Chen, et al., Exploiting multi-channels deep convolutional neural networks for multivariate time series classification [J], Frontiers of Computer Science, 2016, 10(1): 96-112]. MC-CNN learns time series features with three separate channels, combines the features learned by the three channels, and finally sends the combined features to a Softmax layer for classification. Compared with traditional algorithms, the MC-CNN model performs better, but it was only evaluated on two benchmark UCR data sets and could not demonstrate superior performance. A multi-scale convolutional neural network (MCNN) was proposed in the literature [Z. Cui, W. Chen, Y. Chen, Multi-scale convolutional neural networks for time series classification [J], arXiv preprint arXiv:1603.06995, 2016]; its most notable advantage is that several branch transformation layers are set up to preprocess the data in the time domain and the frequency domain respectively, and several local convolution layers are established to automatically extract features of different sizes and frequencies, giving an outstanding feature representation. MCNN was evaluated on 44 UCR benchmark data sets and showed better experimental results on 10 of them, but it requires more preprocessing and hyper-parameter settings. Because the MC-CNN and MCNN models require a large amount of preprocessing before training, the learning rate must be set manually, and a fully connected layer is used to flatten the features before the network output layer, the number of parameters to be learned increases significantly. The network models proposed in the literature [Z. Wang, W. Yan, T. Oates, Time series classification from scratch with deep neural networks: A strong baseline, Neural Networks (IJCNN), 2017 International Joint Conference on, IEEE, 2017, 1578-1585] and [F. Karim, S. Majumdar, H. Darabi, LSTM fully convolutional networks for time series classification [J], IEEE Access, 2018, 6: 1662-1669] all use a global average pooling layer instead of a fully connected layer before the output layer, which reduces the network parameters, and use an adaptive optimizer to train the loss function, which avoids setting the learning rate manually. The former reference proposes benchmark neural network models in which the fully convolutional network (FCN) and the residual network (ResNet) achieve true end-to-end learning without cumbersome preprocessing work, and showed better performance on 18 of the 44 UCR data sets in the experiments.
To further improve network performance, a combined network model (LSTM-FCN) that joins a long short-term memory (LSTM) network and a fully convolutional network to learn time series features was proposed in the literature [F. Karim, S. Majumdar, H. Darabi, LSTM fully convolutional networks for time series classification [J], IEEE Access, 2018, 6: 1662-1669], and an attention mechanism was further introduced on the basis of LSTM-FCN to obtain the ALSTM-FCN model. The ALSTM-FCN model outperforms existing methods on 51 of the 85 UCR data sets. However, the ALSTM-FCN model has to learn more parameters in the training phase, which obviously reduces the learning efficiency. The prior literature also reports the use of long short-term memory networks for the classification of medical and industrial data, but the improvement in detection accuracy is not significant.
Most existing algorithms are aimed at the situation where the scale of the minority class is comparable to that of the majority class. For a skewed time series data set, i.e., a data set in which the numbers of minority-class and majority-class samples differ severely, the features of the minority-class samples are learned insufficiently, so the classification accuracy drops sharply. Existing solutions are data-based sampling methods and algorithm-based methods. Data-based sampling methods usually preprocess the data set, including down-sampling the majority class and over-sampling the minority class. The former balances the data set by randomly deleting samples from the majority class, with the disadvantage of losing valuable information. The latter randomly generates samples that did not originally exist in order to adjust the data distribution; because the original structure of the time series data is changed, this usually causes an over-fitting problem in the final result. The algorithm-based method is the threshold-moving method, in which the decision threshold of the classifier is adjusted by experiment or set manually, and the cost of finding a suitable threshold is very high. For severely skewed data sets the existing processing methods therefore have many problems and seriously affect the detection performance of the classifier, which in turn hinders the development of data analysis technology in fields such as industry, finance, medicine and the military.
Disclosure of Invention
The invention aims to provide a skew time series anomaly detection method based on a cost-sensitive hybrid network, which solves the problem in the prior art that the detection accuracy of minority-class samples in a skewed time series data set is low because existing algorithms learn the features of minority-class samples insufficiently and the classification accuracy is therefore severely reduced.
The technical scheme adopted by the invention is a skew time series anomaly detection method based on a cost-sensitive hybrid network. First, a cost-sensitive hybrid network model consisting of a deep convolutional neural network (DCNN), a gated recurrent network (GRU) and a cost-sensitive loss function is established and trained: the local features of the time series are learned by the deep convolutional neural network DCNN, the sequence features of the time series are learned by the gated recurrent network GRU, and the combined features are classified by a Softmax classifier. During model training, the similarity between the output result and the true value is measured with the cost-sensitive loss function, the parameters of the network model are then adjusted by the back-propagation algorithm, and different penalty factors are used for the skewed classes to penalize misdetections of the network model.
The invention is also characterized in that:
the method specifically comprises the following steps:
step 1, integrating a deep convolutional neural network (DCNN) and a gated recurrent network (GRU) containing 128 cell units, introducing a cost-sensitive loss function, and constructing the cost-sensitive hybrid network model CSHN;
step 1.1, learning local features of a time sequence by using a Deep Convolutional Neural Network (DCNN) composed of three convolutional layers, wherein each convolutional layer comprises convolution operation and batch normalization operation, and a global average pooling layer is introduced into an output layer and used for reducing feature dimensions;
step 1.2, the sequence features of the time series are learned through the gated recurrent network GRU. The gated recurrent network GRU consists of a reset gate p_s and an update gate q_s. X denotes a time-series data sample, g_s denotes the amount of output information at time s, and g̃_s denotes the hidden state at time s; the inputs of the memory unit at time s are g_{s-1} and X. The reset gate p_s controls how much of the output value g_{s-1} of the previous moment flows into the current hidden state g̃_s; the reset gate is mapped into [0,1] by activation function one, and the hidden state g̃_s is mapped into the range [-1,1] by activation function two. The mathematical expressions are as follows:
p_s = σ(K_p · [g_{s-1}, X]) (6)
g̃_s = tanh(K_g̃ · [p_s ⊙ g_{s-1}, X]) (7)
wherein K_p is the weight matrix of the reset gate, [g_{s-1}, X] denotes connecting the two input vectors g_{s-1} and X into one long vector, σ is activation function one, and ⊙ denotes element-wise multiplication;
the update gate q_s determines the degree to which the output information g_{s-1} of time s-1 is brought into the output information g_s at time s. The value of the update gate q_s lies in [0,1]; the larger the value, the less the output information g_{s-1} of the previous moment is brought into the current output information g_s. The mathematical expressions are as follows:
q_s = σ(K_q · [g_{s-1}, X_s]) (8)
g_s = (1 - q_s) ⊙ g_{s-1} + q_s ⊙ g̃_s (9)
wherein K_q is the weight matrix of the update gate, [g_{s-1}, X_s] denotes connecting the two input vectors g_{s-1} and X_s into one long vector, and σ is activation function one;
step 1.3, in the training process of the cost-sensitive hybrid network model, the similarity between the output result and the true value is measured by the cost-sensitive loss function, whose expression is:
f(K, b) = -(1/N) Σ_{j=1}^{N} [ η · l_j · log σ_{K,b}(X_j) + ν · (1 - l_j) · log(1 - σ_{K,b}(X_j)) ] (11)
wherein l_j denotes the true label of the j-th training sample, X_j denotes the j-th input time-series sample, σ_{K,b}(X_j) denotes the probability value output by the model, K denotes the weight parameter, b denotes the bias, and N denotes the total number of samples; η and ν denote the penalty factors for the cases where minority-class samples and majority-class samples are misclassified, respectively. When a minority-class sample is misdetected, the loss is multiplied by the larger penalty factor η, so that the total loss is amplified; when a majority-class sample is misdetected, the loss is multiplied by the smaller penalty factor ν. η and ν are calculated as follows:
η = N / (n_classes · n_abnormal_total),  ν = N / (n_classes · n_normal_total) (12)
where N is the total number of samples, n_normal_total is the number of normal samples, n_abnormal_total is the number of abnormal samples, and n_classes is the number of sample classes; in the present invention n_classes = 2;
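For illustration only, a minimal NumPy sketch of formulas (11) and (12) is given below; the function and variable names are chosen for this sketch and are not part of the claimed method.

```python
import numpy as np

def penalty_factors(labels, n_classes=2):
    """Penalty factors of formula (12): eta for the abnormal (minority) class,
    nu for the normal (majority) class."""
    labels = np.asarray(labels)
    n_total = labels.size
    n_abnormal = int(np.sum(labels == 1))
    n_normal = n_total - n_abnormal
    eta = n_total / (n_classes * n_abnormal)
    nu = n_total / (n_classes * n_normal)
    return eta, nu

def cost_sensitive_loss(y_true, p_pred, eta, nu, eps=1e-12):
    """Cost-sensitive cross-entropy of formula (11).
    y_true: 0/1 labels (1 = abnormal/minority); p_pred: predicted probability of class 1."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    per_sample = eta * y_true * np.log(p_pred) + nu * (1 - y_true) * np.log(1 - p_pred)
    return -np.mean(per_sample)

# example: a skewed batch with 1 abnormal sample out of 10
y = np.array([0] * 9 + [1])
p = np.array([0.1] * 9 + [0.4])
eta, nu = penalty_factors(y)        # eta = 5.0, nu ≈ 0.556
print(cost_sensitive_loss(y, p, eta, nu))
```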
Step 2, a skew time series data anomaly detection algorithm based on the cost sensitive hybrid network model:
the algorithm is mainly divided into three stages: the first stage is the data preprocessing stage; the second stage is the time series feature learning stage, which mainly includes the local feature learning of the time series based on the deep convolutional neural network DCNN of step 1 and the sequence feature learning of the time series based on the gated recurrent network GRU of step 1; the third stage is the anomaly detection stage;
step 2.1, preprocessing data mainly comprises normalization operation and time slicing operation;
step 2.2, feature learning of the time series: 80% of the time-series data, L_train_set, is input as training samples into the cost-sensitive hybrid network model constructed in step 1 to learn the features of the time series; part of the training samples is used for cross-validation, and the model parameters are updated with the back-propagation algorithm throughout the training and learning process. The specific process of feature learning comprises: local feature learning of the time series based on the deep convolutional neural network DCNN of step 1, sequence feature learning of the time series based on the gated recurrent network GRU of step 1, classification with the Softmax classifier to obtain the probability value output by the cost-sensitive hybrid network model, and measuring the similarity between the predicted value and the true value with the cost-sensitive loss function to update the parameters;
step 2.3, anomaly detection phase
The test data are tested with the cost-sensitive hybrid network model trained in step 2.2, taking the remaining 20% of the time-series data, L_test_set, as the test samples. Let Φ(L_r; K, b) denote the cost-sensitive hybrid network model, L_r ∈ L_test_set; the mathematical expressions are:
P_nclass(L_r) = Φ(L_r; K*, b*)
l_r_label = argmax_{nclass ∈ {0,1}} P_nclass(L_r)
wherein P_nclass(L_r) is the probability value predicted by Φ(L_r; K, b), l_r_label is the predicted label of the sample, and K* and b* are the parameters obtained in the learning process.
Step 1.1 specifically comprises the following steps:
step 1.1.1, convolution operation
Define K_{u,v}^d as the convolution kernel between the u-th channel in layer d and the v-th channel in layer d-1, and Z_v^{d-1} as the output value of the v-th channel of the sample in layer d-1. The local features of the time series are learned by the convolution operation:
Z_u^d = Σ_{v=1}^{V} K_{u,v}^d ⊗ Z_v^{d-1} + b_u^d (1)
wherein Z_u^d denotes the output value of the u-th channel in layer d, b_u^d denotes the bias of the u-th channel in layer d, ⊗ denotes the convolution operation, and V denotes the number of convolution kernels in the previous layer;
step 1.1.2, batch normalization operation
For an input time-series sample X = {x_1, x_2, …, x_z}, the batch normalization operation is expressed as follows:
x̂_i = (x_i - μ) / sqrt(σ_B² + τ) (2)
y_i = γ · x̂_i + β (3)
wherein x̂_i is the standard normalized value, μ and σ_B² are the mean and variance of the batch, τ is a constant used to ensure that the denominator is greater than 0, γ represents the data scale change, β represents the data offset, and y_i represents the value after the batch normalization operation;
step 1.1.3, global average pooling layer
The global average pooling layer performs the average pooling operation on the feature vectors obtained from the previous convolutional layer:
a_u = K_GAP ⊗ X_u (4)
A = {a_1, a_2, …, a_U} (5)
wherein X_u denotes the feature vector of the u-th channel after the last convolution layer, K_GAP denotes the global average pooling matrix, U denotes the dimension of the output feature vector, and A combines the output value a_u of each channel as the final output vector.
In step 1.2, activation function one is the Sigmoid activation function and activation function two is the tanh activation function;
the reset gate p_s is mapped into [0,1] by the Sigmoid activation function, and the hidden state g̃_s is mapped into the range [-1,1] by the tanh activation function; the mathematical expressions are as follows:
p_s = σ(K_p · [g_{s-1}, X]) (6)
g̃_s = tanh(K_g̃ · [p_s ⊙ g_{s-1}, X]) (7)
wherein K_p is the weight matrix of the reset gate, [g_{s-1}, X] denotes connecting the two input vectors g_{s-1} and X into one long vector, σ is the Sigmoid activation function, and K_g̃ is the weight matrix used to compute the hidden state.
The specific steps of step 2.1 are as follows:
step 2.1.1, data normalization processing
X{t_m(x_m, l_m)} (m = 1, 2, …, M) denotes the time-series data set, where t_m(x_m, l_m) denotes a time-series sample, x_m denotes the signal value of the m-th sample, l_m denotes the label of the m-th sample, l_m is 0 or 1, and M denotes the total number of samples. The normalization is expressed as follows:
x̃_m = (x_m - x_min) / (x_max - x_min) (15)
wherein x_max and x_min denote the maximum and minimum signal values in the data set, and X̃{t_m(x̃_m, l_m)} denotes the normalized time-series data set;
step 2.1.2, time slicing
A sliding window is used to divide the long time-series data X{t_m(x_m, l_m)} (m = 1, 2, …, M) into short overlapping segments. A window function window(·) of length w is taken and moved with step size h, and the normalized data X̃ obtained in step 2.1.1 is divided into R segments, each segment L_r having length w; the expression is as follows:
L_r = window(X̃, w, h), r = 1, 2, …, R (16)
wherein L_r denotes the r-th segment, w is set to half the period of the time-series data, R denotes the total number of segments, and M denotes the total number of samples.
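The preprocessing of step 2.1 can be sketched as follows; this is a simplified illustration in which the helper names, the example signal, the values of w and h, and the rule that a segment is labeled abnormal when it contains at least one abnormal point are assumptions of the sketch.

```python
import numpy as np

def normalize(x):
    """Min-max normalization of the signal values to [0, 1] (step 2.1.1)."""
    return (x - x.min()) / (x.max() - x.min())

def time_slice(x, labels, w, h):
    """Slide a window of length w with step h over the long series (step 2.1.2).
    A segment is labeled abnormal (1) if it contains at least one abnormal point."""
    segments, seg_labels = [], []
    for start in range(0, len(x) - w + 1, h):
        segments.append(x[start:start + w])
        seg_labels.append(int(labels[start:start + w].max()))
    return np.asarray(segments), np.asarray(seg_labels)

# example: a noisy sine wave with a short injected anomaly
t = np.linspace(0, 40 * np.pi, 4000)
signal = np.sin(t) + 0.05 * np.random.randn(t.size)
labels = np.zeros(t.size, dtype=int)
signal[1500:1510] += 3.0; labels[1500:1510] = 1      # injected anomaly
segments, seg_labels = time_slice(normalize(signal), labels, w=100, h=20)
```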
The specific steps of step 2.2 are as follows:
step 2.2.1, deep convolutional neural network feature learning: the local features of the time series are learned by the deep convolutional neural network DCNN of step 1; the hidden layer of the convolutional network consists of three convolutional layers, each of which comprises three processing operations. The specific flow is as follows.
Conv1 layer: assume that the Conv1 layer has e_1 convolution kernels of size k_1, K^1 = {K^1_1, …, K^1_{e_1}}, and take e_1 = 32, k_1 = 8. The sample L_r (L_r ∈ L_train_set) is convolved with the convolution kernels K^1 to obtain e_1 feature vectors C^1 of length w - 7; the final output Z^1 of the Conv1 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^1 = K^1 ⊗ L_r + b^1
B^1 = BN(C^1)
Z^1 = LeakyReLU(B^1)
wherein b^1 denotes the bias of the Conv1 layer and ⊗ denotes the convolution operation;
Conv2 layer: assume that the Conv2 layer has e_2 convolution kernels of size k_2, K^2 = {K^2_1, …, K^2_{e_2}}, and take e_2 = 64, k_2 = 5. The feature vectors Z^1 obtained from the Conv1 layer are convolved with the convolution kernels K^2 to generate e_2 feature vectors C^2 of length w - 11; the final output Z^2 of the Conv2 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^2 = K^2 ⊗ Z^1 + b^2
B^2 = BN(C^2)
Z^2 = LeakyReLU(B^2)
wherein b^2 denotes the bias of the Conv2 layer;
Conv3 layer: assume that the Conv3 layer has e_3 convolution kernels of size k_3, K^3 = {K^3_1, …, K^3_{e_3}}, and take e_3 = 128, k_3 = 3. The feature vectors Z^2 obtained from the Conv2 layer are convolved with the convolution kernels K^3 to generate 128 feature vectors C^3 of length w - 13; the final output Z^3 of the Conv3 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^3 = K^3 ⊗ Z^2 + b^3
B^3 = BN(C^3)
Z^3 = LeakyReLU(B^3)
wherein b^3 denotes the bias of the Conv3 layer.
GAP layer: for the feature vectors Z^3 output by the Conv3 layer, a convolution kernel K_GAP with the same dimension as Z^3 is used to perform the convolution operation with Z^3, generating a 128-dimensional feature vector F_DCNN:
a_u = K_GAP ⊗ Z^3_u, u = 1, 2, …, 128
F_DCNN = {a_1, a_2, …, a_128}
wherein a_u denotes each component value of the feature vector F_DCNN finally learned by the deep convolutional neural network;
step 2.2.2, gated recurrent network feature learning: for an input time-series sample L_r (L_r ∈ L_train_set), the sequence features are learned with the gated recurrent network GRU containing 128 cell units, and the feature vector F_GRU finally output by the gated recurrent network is obtained:
F_GRU = F_GRU(K_p, K_q; L_r)
wherein K_p and K_q denote the weight matrices of the reset gate and the update gate, respectively, and F_GRU(·) denotes the mapping function of the GRU network;
step 2.2.3, output of the cost-sensitive hybrid network model: for the input time-series sample L_r (L_r ∈ L_train_set), the cost-sensitive hybrid network model finally outputs a probability value P_nclass(L_r) using a Softmax classifier, where nclass = 0, 1; nclass = 0 indicates that L_r belongs to the majority class, and nclass = 1 indicates that L_r belongs to the minority class. This process is expressed as follows:
P_nclass(L_r) = Softmax(concat(F_DCNN, F_GRU))
wherein F_DCNN denotes the feature vector output by the convolutional network, F_GRU denotes the feature vector output by the GRU network, and the function concat(·) splices the feature vectors F_DCNN and F_GRU into one long vector;
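As a sketch of the architecture described in steps 2.2.1-2.2.3 (three Conv1D blocks with batch normalization, LeakyReLU and global average pooling, in parallel with a 128-unit GRU, concatenated and fed to a Softmax layer), a Keras definition is given below; the kernel counts and sizes follow the description above, while the function name and the example input length are assumptions of this sketch.

```python
from tensorflow.keras import layers, models

def build_cshn(w):
    """Cost-sensitive hybrid network: DCNN branch + GRU branch + Softmax output."""
    inputs = layers.Input(shape=(w, 1))

    # DCNN branch: three convolutional layers, each with convolution, BN and LeakyReLU
    x = layers.Conv1D(32, 8)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv1D(64, 5)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Conv1D(128, 3)(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    f_dcnn = layers.GlobalAveragePooling1D()(x)      # 128-dimensional F_DCNN

    # GRU branch: 128 cell units learning the sequence features, F_GRU
    f_gru = layers.GRU(128)(inputs)

    # concatenate both feature vectors and classify with Softmax
    merged = layers.concatenate([f_dcnn, f_gru])
    outputs = layers.Dense(2, activation="softmax")(merged)
    return models.Model(inputs, outputs)

model = build_cshn(w=100)
model.summary()
```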
step 2.2.4, updating the parameters with the cost-sensitive loss function: for the probability value output by the cost-sensitive hybrid network model CSHN obtained in step 2.2.3, the similarity between the predicted value and the true value is measured by the cost-sensitive loss function of formula (11) in step 1.3, with weight K and bias b. A learning rate of 0.001 with a gradient-decay mechanism every 200 segments is adopted, 40% of the training samples are used for cross-validation, and the weight K and the bias b are updated through the back-propagation mechanism of the Adam optimization algorithm;
the final weight K and bias b are related to the penalty factors η and ν: when a minority-class sample is misdetected, the relatively large penalty factor η is used to enlarge the total loss; when a majority-class sample is misdetected, the relatively small penalty factor ν is used to control the increase of the total loss;
the proposed cost-sensitive loss function is generalized to the multi-class case, where the penalty factor for multi-class skewed data samples is as shown in equation (28):
η_c = N / (n_classes · n_c_total) (28)
wherein N is the total number of samples, n_c_total is the total number of samples of class c, and η_c is the penalty factor corresponding to class c, c = {1, 2, …, n_classes}.
The invention has the beneficial effects that:
the invention provides a skew time series data anomaly detection algorithm based on a cost-sensitive hybrid network model, wherein the cost-sensitive hybrid network model integrates the characteristics that a DCNN has strong local feature learning capability and a GRU has good sequence feature learning capability. The cost-sensitive hybrid network model has stronger nonlinear representation performance, is an end-to-end network model, and avoids a complex data preprocessing process. According to the invention, a cost sensitive loss function is introduced into the CSHN network model, and parameters of the network model are adjusted by adopting different penalty loss factors aiming at different types of samples, so that the problem of insufficient feature learning of a few types of samples is solved. The invention solves the problem of insufficient learning of a few types of samples in the prior art, and avoids the problems that the sampling method changes the structure of data, the threshold in the threshold moving method is difficult to determine and the like. The method is simple and efficient, high in precision and strong in robustness. The method has higher detection precision on the skewed time series data set and the non-skewed time series data set.
Drawings
FIG. 1 is a schematic diagram of a time series anomaly detection algorithm in the skew time series anomaly detection method based on the cost-sensitive hybrid network according to the present invention;
FIG. 2 is a schematic diagram of a cost-sensitive hybrid network model in the skew time series anomaly detection method based on the cost-sensitive hybrid network of the present invention;
FIG. 3 is a schematic diagram of a full connection layer in the skew time series anomaly detection method based on a cost-sensitive hybrid network according to the present invention;
FIG. 4 is a schematic diagram of a global average pooling layer in the skew time series anomaly detection method based on the cost-sensitive hybrid network according to the present invention;
FIG. 5 is a schematic diagram of a GRU network structure in the method for detecting the skew time series anomaly based on the cost-sensitive hybrid network according to the present invention;
FIG. 6 is a comparison of F-measure for different models on Dataset Dataset 1;
FIG. 7 is a comparison of F-measure for different models on Dataset Dataset 2;
FIG. 8 (a) is a graph comparing the loss variation of four networks on Dataset Dataset 1;
fig. 8 (b) is a graph comparing the loss variation of the four networks on Dataset 2.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1 and FIG. 2, in the skew time series anomaly detection method based on a cost-sensitive hybrid network of the present invention, a cost-sensitive hybrid network model consisting of a deep convolutional neural network DCNN, a gated recurrent network GRU containing 128 cell units, and a cost-sensitive loss function is first established and trained. The local features of the time series are learned by the deep convolutional neural network DCNN, the sequence features of the time series are learned by the gated recurrent network GRU, and the combined features are classified by a Softmax classifier. During model training, the similarity between the output result and the true value is measured with the cost-sensitive loss function, the parameters of the network model are then adjusted by the back-propagation algorithm, and different penalty factors are used for samples of different numbers and categories to penalize misdetections of the network model. By introducing the BN operation and the GAP operation, the over-fitting problem that a general neural network suffers from due to its many training parameters is avoided, and the training speed of the model is improved.
The method for detecting the skew time series abnormity based on the cost sensitive hybrid network specifically comprises the following steps:
step 1, integrating a deep convolutional neural network DCNN and a gated recurrent network GRU containing 128 cell units, introducing a cost-sensitive loss function, and constructing the cost-sensitive hybrid network model CSHN;
step 1.1, learning local features of a time sequence by using a Deep Convolutional Neural Network (DCNN) composed of three convolutional layers, wherein each convolutional layer comprises convolution operation and batch normalization operation, and a global average pooling layer is introduced into an output layer and is used for reducing feature dimensions;
step 1.1.1, convolution operation
The purpose of the convolution operation is to learn the local features of the sample. Define K_{u,v}^d as the convolution kernel between the u-th channel in layer d and the v-th channel in layer d-1, and Z_v^{d-1} as the output value of the v-th channel of the sample in layer d-1. The local features of the time series are learned by the convolution operation:
Z_u^d = Σ_{v=1}^{V} K_{u,v}^d ⊗ Z_v^{d-1} + b_u^d (1)
wherein Z_u^d denotes the output value of the u-th channel in layer d, b_u^d denotes the bias of the u-th channel in layer d, ⊗ denotes the convolution operation, and V denotes the number of convolution kernels in the previous layer;
step 1.1.2, batch normalization operation
Batch normalization makes the intermediate output values of each layer tend to be stable and thus solves the problem of unstable data distribution during training. The data are therefore normalized before being input to each layer, which not only improves the stability of the network but also improves its generalization capability.
For an input time-series sample X = {x_1, x_2, …, x_z}, the batch normalization operation is expressed as follows:
x̂_i = (x_i - μ) / sqrt(σ_B² + τ) (2)
y_i = γ · x̂_i + β (3)
wherein x̂_i is the standard normalized value, μ and σ_B² are the mean and variance of the batch, τ is a constant used to ensure that the denominator is greater than 0, γ represents the data scale change, β represents the data offset, and y_i represents the value after the batch normalization operation;
step 1.1.3, global average pooling layer
To reduce the dimensionality of the feature vectors obtained by the last convolution layer, a general convolutional network usually contains one or more fully connected layers near the output layer, as shown in FIG. 3. In the method of the present invention, a global average pooling layer GAP is used instead of the fully connected layer, as shown in FIG. 4; this reduces not only the dimensionality of the feature vectors but also the parameters of the network. The global average pooling layer performs the average pooling operation on the feature vectors obtained from the previous convolutional layer:
a_u = K_GAP ⊗ X_u (4)
A = {a_1, a_2, …, a_U} (5)
wherein X_u denotes the feature vector of the u-th channel after the last convolution layer, K_GAP denotes the global average pooling matrix, U denotes the dimension of the output feature vector, and A combines the output value a_u of each channel as the final output vector.
Step 1.2, the sequence features of the time series are learned through the gated recurrent network GRU. The gated recurrent network GRU consists of a reset gate p_s and an update gate q_s, where the update gate is obtained by merging the forget gate and the input gate of the LSTM network structure. The structure of the GRU network is shown in FIG. 5: X denotes a time-series data sample, g_s denotes the amount of output information at time s, and g̃_s denotes the hidden state at time s; the inputs of the memory unit at time s are g_{s-1} and X. The reset gate p_s controls how much of the output value g_{s-1} of the previous moment flows into the current hidden state g̃_s; the reset gate is mapped into [0,1] by activation function one, and a smaller value of the reset gate indicates that less information of the previous moment flows into the current hidden state; the hidden state g̃_s is mapped into the range [-1,1] by activation function two. The mathematical expressions are as follows:
p_s = σ(K_p · [g_{s-1}, X]) (6)
g̃_s = tanh(K_g̃ · [p_s ⊙ g_{s-1}, X]) (7)
wherein K_p is the weight matrix of the reset gate, [g_{s-1}, X] denotes connecting the two input vectors g_{s-1} and X into one long vector, σ is activation function one, and ⊙ denotes element-wise multiplication;
the update gate q_s determines the degree to which the output information g_{s-1} of time s-1 is brought into the output information g_s at time s. The value of the update gate q_s lies in [0,1]; the larger the value, the less the output information g_{s-1} of the previous moment is brought into the current output information g_s. The mathematical expressions are as follows:
q_s = σ(K_q · [g_{s-1}, X_s]) (8)
g_s = (1 - q_s) ⊙ g_{s-1} + q_s ⊙ g̃_s (9)
wherein K_q is the weight matrix of the update gate, [g_{s-1}, X_s] denotes connecting the two input vectors g_{s-1} and X_s into one long vector, and σ is activation function one;
in step 1.2, activation function one is the Sigmoid activation function and activation function two is the tanh activation function; the reset gate p_s is mapped into [0,1] by the Sigmoid activation function, the hidden state g̃_s is mapped into [-1,1] by the tanh activation function, and K_g̃ is the weight matrix used to compute the hidden state.
Step 1.3. In general, during model training a cross-entropy loss function is used to measure the similarity between the true value and the predicted value; its expression is as follows:
f_1(K, b) = -(1/N) Σ_{j=1}^{N} [ l_j · log σ_{K,b}(X_j) + (1 - l_j) · log(1 - σ_{K,b}(X_j)) ] (10)
wherein l_j denotes the true label of the j-th training sample, X_j denotes the j-th input time-series sample, σ_{K,b}(X_j) denotes the probability value output by the model, K denotes the weight parameter, b denotes the bias, and N denotes the total number of samples.
In general, the smaller the total loss f_1(K, b), the better the learning effect of the model. When the data distribution is severely skewed, the network model cannot obtain a sufficient feature representation of the minority class with the general cross-entropy loss function, so the detection accuracy of the minority class is severely affected. The reason is that the general cross-entropy loss function applies the same penalty factor to the losses of minority-class samples and majority-class samples.
To solve this problem, in the training process of the cost-sensitive hybrid network model the similarity between the output result and the true value is measured by a cost-sensitive loss function, whose expression is:
f(K, b) = -(1/N) Σ_{j=1}^{N} [ η · l_j · log σ_{K,b}(X_j) + ν · (1 - l_j) · log(1 - σ_{K,b}(X_j)) ] (11)
wherein l_j denotes the true label of the j-th training sample, X_j denotes the j-th input time-series sample, σ_{K,b}(X_j) denotes the probability value output by the model, K denotes the weight parameter, b denotes the bias, and N denotes the total number of samples; η and ν denote the penalty factors for the cases where minority-class samples and majority-class samples are misclassified, respectively. When a minority-class sample is misdetected, the loss is multiplied by the larger penalty factor η, so that the total loss is amplified; when a majority-class sample is misdetected, the loss is multiplied by the smaller penalty factor ν (since the total number of majority-class samples is larger, their total loss is already large). η and ν are calculated as follows:
η = N / (n_classes · n_abnormal_total),  ν = N / (n_classes · n_normal_total) (12)
where N is the total number of samples, n_normal_total is the number of normal samples, n_abnormal_total is the number of abnormal samples, and n_classes is the number of sample classes; in the present invention n_classes = 2;
Step 2, a skew time series data anomaly detection algorithm based on the cost sensitive hybrid network model:
the specific algorithm framework is shown in FIG. 1, and the algorithm mainly comprises three stages: the first stage is the data preprocessing stage; the second stage is the time series feature learning stage, which mainly includes the local feature learning of the time series based on the deep convolutional neural network DCNN of step 1 and the sequence feature learning of the time series based on the gated recurrent network GRU of step 1; the third stage is the anomaly detection stage;
step 2.1, preprocessing data mainly comprises normalization operation and time slicing operation;
step 2.1.1, data normalization processing
X{t_m(x_m, l_m)} (m = 1, 2, …, M) denotes the time-series data set, where t_m(x_m, l_m) denotes a time-series sample, x_m denotes the signal value of the m-th sample, l_m denotes the label of the m-th sample, l_m is 0 or 1, and M denotes the total number of samples. The normalization is expressed as follows:
x̃_m = (x_m - x_min) / (x_max - x_min) (15)
wherein x_max and x_min denote the maximum and minimum signal values in the data set, and X̃{t_m(x̃_m, l_m)} denotes the normalized time-series data set;
step 2.1.2, time slicing
A sliding window is used to divide the long time-series data X{t_m(x_m, l_m)} (m = 1, 2, …, M) into short overlapping segments. A window function window(·) of length w is taken and moved with step size h, and the normalized data X̃ obtained in step 2.1.1 is divided into R segments, each segment L_r having length w; the expression is as follows:
L_r = window(X̃, w, h), r = 1, 2, …, R (16)
wherein L_r denotes the r-th segment, w is set to half the period of the time-series data, R denotes the total number of segments, and M denotes the total number of samples.
Step 2.2, local feature learning of time series: data of 80% in the time series data
Figure BDA0002375933850000193
Inputting the training samples into the local features of the learning time sequence in the cost-sensitive hybrid network model constructed in the step 1, simultaneously performing cross verification by using part of the training samples, and updating model parameters by adopting a back propagation algorithm in the whole training and learning process; the specific process of feature learning comprises the following steps: based on the local feature learning of the time sequence of the deep convolutional neural network DCNN in the step 1, based on the local feature learning of the time sequence of the gated recursive network GRU in the step 1, obtaining a probability value output by the cost-sensitive hybrid network model by using a Softmax classifier to classify, and measuring the similarity between a predicted value and a true value by using the cost-sensitive loss function to update parameters;
step 2.2.1, deep convolutional neural network feature learning: the local features of the time series are learned by the deep convolutional neural network DCNN of step 1; the hidden layer of the convolutional network consists of three convolutional layers, each of which comprises three processing operations. The specific flow is as follows.
Conv1 layer: assume that the Conv1 layer has e_1 convolution kernels of size k_1, K^1 = {K^1_1, …, K^1_{e_1}}, and take e_1 = 32, k_1 = 8. The sample L_r (L_r ∈ L_train_set) is convolved with the convolution kernels K^1 to obtain e_1 feature vectors C^1 of length w - 7; the final output Z^1 of the Conv1 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^1 = K^1 ⊗ L_r + b^1
B^1 = BN(C^1)
Z^1 = LeakyReLU(B^1)
wherein b^1 denotes the bias of the Conv1 layer and ⊗ denotes the convolution operation;
Conv2 layer: assume that the Conv2 layer has e_2 convolution kernels of size k_2, K^2 = {K^2_1, …, K^2_{e_2}}, and take e_2 = 64, k_2 = 5. The feature vectors Z^1 obtained from the Conv1 layer are convolved with the convolution kernels K^2 to generate e_2 feature vectors C^2 of length w - 11; the final output Z^2 of the Conv2 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^2 = K^2 ⊗ Z^1 + b^2
B^2 = BN(C^2)
Z^2 = LeakyReLU(B^2)
wherein b^2 denotes the bias of the Conv2 layer;
Conv3 layer: assume that the Conv3 layer has e_3 convolution kernels of size k_3, K^3 = {K^3_1, …, K^3_{e_3}}, and take e_3 = 128, k_3 = 3. The feature vectors Z^2 obtained from the Conv2 layer are convolved with the convolution kernels K^3 to generate 128 feature vectors C^3 of length w - 13; the final output Z^3 of the Conv3 layer is then obtained through the BN operation and the activation function LeakyReLU. This process is expressed as follows:
C^3 = K^3 ⊗ Z^2 + b^3
B^3 = BN(C^3)
Z^3 = LeakyReLU(B^3)
wherein b^3 denotes the bias of the Conv3 layer.
GAP layer: for the feature vectors Z^3 output by the Conv3 layer, a convolution kernel K_GAP with the same dimension as Z^3 is used to perform the convolution operation with Z^3, generating a 128-dimensional feature vector F_DCNN:
a_u = K_GAP ⊗ Z^3_u, u = 1, 2, …, 128
F_DCNN = {a_1, a_2, …, a_128}
wherein a_u denotes each component value of the feature vector F_DCNN finally learned by the deep convolutional neural network;
step 2.2.2, gated recurrent network feature learning: for an input time-series sample L_r (L_r ∈ L_train_set), the sequence features are learned with the gated recurrent network GRU containing 128 cell units, and the feature vector F_GRU finally output by the gated recurrent network is obtained:
F_GRU = F_GRU(K_p, K_q; L_r)
wherein K_p and K_q denote the weight matrices of the reset gate and the update gate, respectively, and F_GRU(·) denotes the mapping function of the GRU network;
step 2.2.3, output of the cost-sensitive hybrid network model: for the input time-series sample L_r (L_r ∈ L_train_set), the cost-sensitive hybrid network model finally outputs a probability value P_nclass(L_r) using a Softmax classifier, where nclass = 0, 1; nclass = 0 indicates that L_r belongs to the majority class, and nclass = 1 indicates that L_r belongs to the minority class. This process is expressed as follows:
P_nclass(L_r) = Softmax(concat(F_DCNN, F_GRU))
wherein F_DCNN denotes the feature vector output by the convolutional network, F_GRU denotes the feature vector output by the GRU network, and the function concat(·) splices the feature vectors F_DCNN and F_GRU into one long vector;
step 2.2.4, updating the parameters with the cost-sensitive loss function: for the probability value output by the cost-sensitive hybrid network model CSHN obtained in step 2.2.3, the similarity between the predicted value and the true value is measured by the cost-sensitive loss function of formula (11) in step 1.3, with weight K and bias b. A learning rate of 0.001 with a gradient-decay mechanism every 200 segments is adopted, 40% of the training samples are used for cross-validation, and the weight K and the bias b are updated through the back-propagation mechanism of the Adam optimization algorithm;
the final weight K and bias b are related to the penalty factors η and ν: when a minority-class sample is misdetected, the relatively large penalty factor η is used to enlarge the total loss; when a majority-class sample is misdetected, the relatively small penalty factor ν is used to control the increase of the total loss;
the proposed cost-sensitive loss function is generalized to the multi-class case, where the penalty factor for multi-class skewed data samples is as shown in equation (28):
η_c = N / (n_classes · n_c_total) (28)
wherein N is the total number of samples, n_c_total is the total number of samples of class c, and η_c is the penalty factor corresponding to class c, c = {1, 2, …, n_classes}.
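A hedged sketch of the training configuration of step 2.2.4 with Keras is given below, reusing `model` from the architecture sketch, `segments` and `seg_labels` from the preprocessing sketch and the `penalty_factors` helper from the loss sketch; the class-weighted cross-entropy used here realizes the per-class penalty factors in the same spirit as the cost-sensitive loss, the epoch and batch-size values are placeholders, and the exact gradient-decay schedule of the invention is not reproduced.

```python
from tensorflow.keras.optimizers import Adam

# penalty factors of formula (12), computed from the training segment labels
eta, nu = penalty_factors(seg_labels)

# Adam optimizer with a learning rate of 0.001, as in step 2.2.4
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# class 0 (majority/normal) is weighted by nu and class 1 (minority/abnormal) by eta,
# so misdetections of the minority class contribute more to the total loss;
# 40% of the training samples are held out for validation
history = model.fit(segments[..., None], seg_labels,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.4,
                    class_weight={0: nu, 1: eta})
```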
Step 2.3, anomaly detection phase
The test data are tested with the cost-sensitive hybrid network model trained in step 2.2, taking the remaining 20% of the time-series data, L_test_set, as the test samples. Let Φ(L_r; K, b) denote the cost-sensitive hybrid network model, L_r ∈ L_test_set; the mathematical expressions are:
P_nclass(L_r) = Φ(L_r; K*, b*)
l_r_label = argmax_{nclass ∈ {0,1}} P_nclass(L_r)
wherein P_nclass(L_r) is the probability value predicted by Φ(L_r; K, b), l_r_label is the predicted label of the sample, and K* and b* are the parameters obtained in the learning process.
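A short sketch of this detection step, reusing the trained `model` from the sketches above; `test_segments` stands for the 20% held-out segments and is a hypothetical name.

```python
import numpy as np

# predict P_nclass(L_r) for every test segment and take the argmax as the label
probs = model.predict(test_segments[..., None])
pred_labels = np.argmax(probs, axis=1)   # 0 = majority/normal class, 1 = minority/abnormal class
```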
Simulation experiment results of the skew time series anomaly detection method based on a cost-sensitive hybrid network of the invention are as follows:
The experiments were carried out on an actual engineering data set and 44 UCR benchmark data sets. The actual engineering data comprise a flywheel rotation speed data set (DataSet1) and a gyroscope temperature data set (DataSet2) of a certain device. The numbers of normal and abnormal values in these data sets differ greatly; the majority class is taken to represent the normal class and the minority class the abnormal class. In the experiments, the ALSTM-FCN network, the ResNet network, the FCN network and the proposed CSHN were implemented with the Keras deep learning package. The SVC (Support Vector Classification), AdaBoost and RFC (Random Forest Classification) algorithms were implemented with the scikit-learn package in Python 3.5.
1. Evaluation index
The performance of the method was evaluated using true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP), which are defined as follows:
TP is the number of positive-class samples detected by the classifier as positive; FN is the number of positive-class samples detected as negative; FP is the number of negative-class samples detected as positive; TN is the number of negative-class samples detected as negative.
In the experiments, ACC+, ACC-, G-means and F-measure are used to evaluate the performance of the algorithm. ACC+ and ACC- denote the detection rates of normal samples and abnormal samples, respectively, and G-means evaluates the detection performance of the algorithm comprehensively. They are defined as follows:
ACC+ = TP / (TP + FN)
ACC- = TN / (TN + FP)
G-means = sqrt(ACC+ · ACC-)
The F-measure is a comprehensive evaluation index that measures the detection performance of the classifier on abnormal samples and is defined by the weighted harmonic mean of Recall and Precision:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F-measure = (1 + β²) · Recall · Precision / (β² · Precision + Recall)
where Recall is a measure of completeness (i.e., how many samples of the abnormal class are correctly identified), Precision is a measure of exactness, and β is used to adjust the importance of Precision relative to Recall (typically β = 1).
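The evaluation indexes above can be computed from the four confusion-matrix counts as in the following sketch (β = 1 is assumed, and the example counts are invented for illustration):

```python
import math

def evaluation_indexes(tp, fn, tn, fp, beta=1.0):
    """ACC+, ACC-, G-means and F-measure from the confusion-matrix counts."""
    acc_pos = tp / (tp + fn)                 # detection rate of the positive class
    acc_neg = tn / (tn + fp)                 # detection rate of the negative class
    g_means = math.sqrt(acc_pos * acc_neg)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = (1 + beta**2) * recall * precision / (beta**2 * precision + recall)
    return acc_pos, acc_neg, g_means, f_measure

print(evaluation_indexes(tp=950, fn=50, tn=40, fp=10))
```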
2. CSHN model assessment
In order to verify the effectiveness of the CSHN model of the method, the influence of a cost sensitive loss function and a general cross entropy loss function on the detection precision is compared. For this purpose, experiments were performed combining the proposed hybrid network model (DCNN + GRU) with cost-sensitive loss functions and general cross-entropy loss functions, which were performed on DataSet1 and DataSet2, respectively, and the results are shown in tables 1 and 2.
TABLE 1 Test results on data set DataSet1
TABLE 2 Test results on data set DataSet2
As can be seen from Tables 1 and 2, for the normal samples the ACC+ values vary only within a small range. For the abnormal samples, the ACC- value is only around 78% when the general cross-entropy loss function is used, whereas with the cost-sensitive loss function proposed by the invention the ACC- value increases by about 5% to 10%. The method therefore obviously improves the detection accuracy of the minority class, which means that the proposed cost-sensitive loss function can solve the problem of low minority-class detection accuracy caused by the skewed data distribution. In addition, the G-means and F-measure values show that using the cost-sensitive loss function improves the detection performance.
3. Performance comparison
3.1 Evaluation and comparison of ACC+, ACC- and G-means
In order to evaluate the detection performance of the method, the evaluation indexes on the data sets DataSet1 and DataSet2 were calculated; the results are shown in Tables 3 and 4:
TABLE 3 Detection accuracy of different methods on DataSet1
TABLE 4 Detection accuracy of different methods on DataSet2
As can be seen from Tables 3 and 4, the deep learning based methods are superior to the machine learning based algorithms. For normal samples, the ACC+ detection results of all methods are above 94%. For abnormal samples, the ACC- value of the method of the invention is greater than the ACC- values of the comparison methods. According to the comprehensive G-means evaluation, the detection performance of the method is superior to that of the comparison methods.
3.2 comparison of F-measure detection results
To further investigate the detection performance of the method, Figs. 6 and 7 show the F-measure results of the different methods on the data sets DataSet1 and DataSet2, respectively. In Figs. 6 and 7, the abscissa represents the names of the different methods and the ordinate represents the F-measure values of the different models on DataSet1 and DataSet2, respectively.
As can be seen from Figs. 6 and 7, the deep-learning-based methods outperform the machine-learning-based algorithms, because neural networks have a better ability to characterize non-linear relationships. Among the compared methods, the F-measure of the method of the invention is significantly higher than that of the other methods, which means that its detection performance is superior to that of the comparison methods.
3.3 Evaluation and comparison of convergence rate and stability
For deep neural networks, the training loss reflects the convergence speed and stability of the network model. In terms of training loss, the CSHN model is compared with the deep-learning-based FCN_ALSM, ResNet and FCN models; Figs. 8(a) and 8(b) show the training-loss curves on the data sets DataSet1 and DataSet2, respectively. In Figs. 8(a) and 8(b), the abscissa represents the number of training iterations and the ordinate represents the loss values of the models on DataSet1 and DataSet2, respectively.
Figs. 8(a) and 8(b) show the trend of the training loss on DataSet1 and DataSet2, respectively. It can be seen that the loss of the CSHN model converges faster. On DataSet1, when the number of iterations exceeds 250, the loss of the proposed CSHN model becomes stable and is significantly lower than that of the comparison network models. On DataSet2, when the number of iterations exceeds 120, the loss of the proposed CSHN model becomes stable and is lower than that of the comparison network models. This means that the stability of the model is better than that of the comparison network models.
4. Performance evaluation of UCR public datasets
To further verify the detection performance of the proposed CSHN model, experiments were carried out on 44 balanced UCR data sets. Since the model is tested over multiple data sets, new metrics are needed to evaluate the overall detection performance. For this purpose, the detection performance is evaluated with the accuracy and the mean per-class error (MPCE).
A data pool G = {g_z} is defined, where g_z denotes the z-th data set and C_z denotes the number of classes in g_z. The evaluation indexes are defined as follows:

PCE_z = e_z / C_z

MPCE = (1/Z) × Σ_z PCE_z

where e_z is the error rate on data set g_z, PCE_z is the per-class error of data set g_z, MPCE is the mean of the per-class errors over the data pool G, and Z is the number of data sets in G; in the invention, Z = 44. The experimental results are shown in Table 5:
Table 5. Accuracy and MPCE of different methods on the 44 UCR data sets (table provided as an image in the original document)
In Table 5, the first column gives the names of the 44 UCR data sets, Nclasses gives the number of classes in each data set, and Win gives the number of data sets on which a method achieves the highest accuracy, where the highest accuracy means the best experimental accuracy among the different methods on the same data set.
As can be seen from Table 5, the method provided by the invention achieves a notable detection effect not only on skewed data sets but also on non-skewed data sets.
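For illustration only (not part of the patent text), the following is a minimal sketch of how the MPCE defined above could be computed over a data pool; the function name and the toy numbers are assumptions.

```python
def mpce(error_rates, num_classes):
    """Mean per-class error over a data pool G:
    error_rates[z] is the error rate e_z on data set g_z,
    num_classes[z] is the number of classes C_z in g_z."""
    assert len(error_rates) == len(num_classes)
    pce = [e / c for e, c in zip(error_rates, num_classes)]  # PCE_z = e_z / C_z
    return sum(pce) / len(pce)                               # average over the Z data sets

# Toy example with three data sets.
print(mpce(error_rates=[0.10, 0.04, 0.20], num_classes=[2, 4, 5]))
```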

Claims (5)

1. A method for detecting anomalies in skewed time series based on a cost-sensitive hybrid network, characterized in that a cost-sensitive hybrid network model consisting of a deep convolutional neural network DCNN, a gated recurrent network GRU and a cost-sensitive loss function is first established, wherein the local features of the time series are learned by the deep convolutional neural network DCNN, the sequential features of the time series are learned by the gated recurrent network GRU, the features are then combined and classified by a Soft-max classifier, the similarity between the output result and the true value is measured by the cost-sensitive loss function during model training, the parameters of the network model are adjusted by a back-propagation algorithm, and different penalty factors are applied to the samples of the differently sized classes to penalize the misdetections of the network model; the method specifically comprises the following steps:
Step 1: integrate a deep convolutional neural network DCNN and a gated recurrent network GRU containing 128 cell units, introduce a cost-sensitive loss function, and construct the cost-sensitive hybrid network model CSHN;
Step 1.1: learn the local features of the time series with a deep convolutional neural network DCNN composed of three convolutional layers, each of which comprises a convolution operation and a batch-normalization operation; a global average pooling layer is introduced at the output layer to reduce the feature dimension;
Step 1.2: learn the sequential features of the time series with the gated recurrent network GRU, which is composed of a reset gate p_s and an update gate q_s; X represents a time-series data sample, g_s represents the output information at time s, and g̃_s represents the hidden state at time s; the input of the memory unit at time s is g_{s-1} and X; the reset gate p_s controls how much of the output value g_{s-1} of the previous moment enters the hidden state g̃_s of the current moment; the reset gate is mapped into [0, 1] by activation function one, and the hidden state g̃_s is mapped into [-1, 1] by activation function two; the mathematical expressions are as follows:

p_s = σ(K_p · [g_{s-1}, X])    (6)

g̃_s = tanh(K_g̃ · [p_s ⊙ g_{s-1}, X])    (7)

where K_p represents the weight matrix of the reset gate, [g_{s-1}, X] represents the concatenation of the two input vectors g_{s-1} and X into one long vector, and σ is activation function one;

the update gate q_s determines to what degree the output information g_{s-1} of time s-1 is carried into the output information g_s of time s; the value of the update gate q_s lies in [0, 1], and the larger the value, the less of the previous output information g_{s-1} is carried into the current output information g_s; the mathematical expressions are as follows:

q_s = σ(K_q · [g_{s-1}, X_s])    (8)

g_s = (1 - q_s) ⊙ g_{s-1} + q_s ⊙ g̃_s    (9)

where K_q represents the weight matrix of the update gate, [g_{s-1}, X_s] represents the concatenation of the two input vectors g_{s-1} and X_s into one long vector, and σ is activation function one;
Step 1.3: in the training process of the cost-sensitive hybrid network model, measure the similarity between the output result and the true value with the cost-sensitive loss function, whose expression is as follows:
L(K, b) = -(1/N) × Σ_{j=1}^{N} [ η · l_j · log σ_{k,b}(X_j) + ν · (1 - l_j) · log(1 - σ_{k,b}(X_j)) ]    (11)
where l_j represents the true label of the j-th training sample, X_j represents the j-th input time-series sample, σ_{k,b}(X_j) represents the probability value output by the model, K represents the weight parameters, b represents the bias, and N represents the total number of samples; η and ν respectively represent the penalty factors applied when minority-class samples and majority-class samples are misclassified: when a minority-class sample is misdetected, the corresponding term is multiplied by the larger penalty factor η, so that the total loss is amplified; when a majority-class sample is misdetected, it is multiplied by the smaller penalty factor ν; η and ν are calculated as follows:
η = N / (n_classes × n_abnormal_total),    ν = N / (n_classes × n_normal_total)    (12)
where N is the total number of samples, n_normal_total is the number of normal samples, n_abnormal_total is the number of abnormal samples, and n_classes is the number of sample classes, n_classes = 2;
Step 2: anomaly detection algorithm for skewed time-series data based on the cost-sensitive hybrid network model:
the algorithm is mainly divided into three stages: the first stage is the data preprocessing stage; the second stage is the time-series feature learning stage, which mainly comprises the local feature learning of the time series based on the deep convolutional neural network DCNN of step 1 and the sequential feature learning of the time series based on the gated recurrent network GRU of step 1; the third stage is the anomaly detection stage;
Step 2.1: data preprocessing, which mainly comprises a normalization operation and a time-slicing operation;
Step 2.2: feature learning of the time series: 80% of the time-series data, denoted L_train_set, is input as training samples into the cost-sensitive hybrid network model constructed in step 1 to learn the features of the time series, part of the training samples is used for cross-validation at the same time, and the model parameters are updated with the back-propagation algorithm throughout the training and learning process; the specific feature learning process comprises: local feature learning of the time series based on the deep convolutional neural network DCNN of step 1, sequential feature learning of the time series based on the gated recurrent network GRU of step 1, classification with the Softmax classifier to obtain the probability value output by the cost-sensitive hybrid network model, and measurement of the similarity between the predicted value and the true value with the cost-sensitive loss function in order to update the parameters;
Step 2.3: anomaly detection stage
the test data are detected with the cost-sensitive hybrid network model trained in step 2.2, taking the remaining 20% of the time-series data, denoted L_test_set, as test samples; let φ(L_r; K, b) denote the cost-sensitive hybrid network model, with L_r ∈ L_test_set; the mathematical expressions are:

P_nclass(L_r) = φ(L_r; K*, b*)

l_r_label = argmax_nclass P_nclass(L_r)

where P_nclass(L_r) is the probability value predicted by φ(L_r; K, b), l_r_label is the predicted label of the sample, and K*, b* are the parameters obtained in the learning process.
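For illustration only (not part of the claims): the exact loss of step 1.3 is reproduced only as an image in the source, so the weighted binary cross-entropy below is an assumed form consistent with the description, with the penalty factors η and ν derived from the class counts as in the reconstruction of formula (12) above.

```python
import numpy as np

def penalty_factors(n_normal, n_abnormal, n_classes=2):
    """Class-frequency-based penalty factors (assumed form of formula (12)):
    the minority (abnormal) class receives the larger factor eta."""
    n_total = n_normal + n_abnormal
    eta = n_total / (n_classes * n_abnormal)   # minority-class penalty
    nu = n_total / (n_classes * n_normal)      # majority-class penalty
    return eta, nu

def cost_sensitive_loss(y_true, p_pred, eta, nu, eps=1e-12):
    """Weighted binary cross-entropy over N samples: label 1 = abnormal
    (minority class), label 0 = normal (majority class)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    per_sample = -(eta * y_true * np.log(p)
                   + nu * (1.0 - y_true) * np.log(1.0 - p))
    return per_sample.mean()

# Toy example: a data set with 900 normal and 100 abnormal samples.
eta, nu = penalty_factors(n_normal=900, n_abnormal=100)
y = np.array([1, 0, 1, 0])          # true labels
p = np.array([0.2, 0.1, 0.9, 0.4])  # predicted probability of the abnormal class
print(eta, nu, cost_sensitive_loss(y, p, eta, nu))
```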
2. The method for detecting anomalies in skewed time series based on a cost-sensitive hybrid network according to claim 1, wherein step 1.1 specifically comprises the following steps:
Step 1.1.1: convolution operation
define k_{uv}^d as the convolution kernel between the u-th channel in layer d and the v-th channel in layer d-1, and y_v^{d-1} as the output value of the v-th channel of the sample in layer d-1; the local features of the time series are learned by the convolution operation:

y_u^d = Σ_{v=1}^{V} k_{uv}^d ⊗ y_v^{d-1} + b_u^d    (1)

where y_u^d represents the output value of the u-th channel of layer d, b_u^d represents the bias of the u-th channel of layer d, ⊗ represents the convolution operation, and V represents the number of convolution kernels in the previous layer;
Step 1.1.2, batch normalization operation
For an input time series of samples X = { X 1 ,x 2 ,…,x z And expressing the batch normalization operation as follows:
Figure FDA00040520304000000410
Figure FDA00040520304000000411
wherein the content of the first and second substances,
Figure FDA00040520304000000412
Figure FDA00040520304000000413
is a standard normalization value, τ is a constant used to ensure that the denominator is greater than 0, γ represents a data scale change, β represents a data offset, and->
Figure FDA00040520304000000414
Representing a value after a batch normalization operation;
Step 1.1.3: global average pooling layer
the global average pooling layer performs an average pooling operation on the feature vectors obtained from the last convolutional layer, giving:

a_u = X_u ⊗ K_GAP    (4)

A = {a_1, a_2, …, a_U}    (5)

where X_u represents the feature vector of the u-th channel after the last convolutional layer, K_GAP represents the global average pooling matrix, U represents the dimension of the output feature vector, and A represents the final output vector obtained by combining the output values a_u of the channels.
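For illustration only (not part of the claims), the following numpy sketch covers the three building blocks described in claim 2 — a one-dimensional convolution over channels (eq. (1)), batch normalization (eqs. (2)-(3)) and global average pooling (eqs. (4)-(5)) — with assumed shapes and variable names; global average pooling is written directly as the per-channel mean, which is what convolving each channel with an averaging kernel of full length yields.

```python
import numpy as np

def conv1d_layer(y_prev, kernels, bias):
    """One convolutional layer in the sense of eq. (1): y_prev has shape
    (V, L), kernels has shape (U, V, k), bias has shape (U,); returns
    (U, L - k + 1) feature maps (a 'valid' sliding dot product)."""
    U, V, k = kernels.shape
    L_out = y_prev.shape[1] - k + 1
    out = np.zeros((U, L_out))
    for u in range(U):
        for v in range(V):
            for t in range(L_out):
                out[u, t] += np.dot(kernels[u, v], y_prev[v, t:t + k])
        out[u] += bias[u]
    return out

def batch_norm(x, gamma=1.0, beta=0.0, tau=1e-5):
    """Batch normalization in the sense of eqs. (2)-(3), over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + tau) + beta

def global_average_pooling(feature_maps):
    """Global average pooling: one scalar per channel."""
    return feature_maps.mean(axis=-1)

# Toy forward pass: 1 input channel of length 16, 4 kernels of size 8.
x = np.random.randn(1, 16)
kernels = np.random.randn(4, 1, 8)
bias = np.zeros(4)
z = conv1d_layer(x, kernels, bias)
h = batch_norm(z)
h = np.maximum(0.01 * h, h)                # LeakyReLU
print(global_average_pooling(h).shape)     # (4,)
```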
3. The method for detecting anomalies in skewed time series based on a cost-sensitive hybrid network according to claim 1, wherein activation function one in step 1.2 is the Sigmoid activation function and activation function two is the tanh activation function;
the reset gate p_s is mapped into [0, 1] by the Sigmoid activation function, and the hidden state g̃_s is mapped into [-1, 1] by the tanh activation function; the mathematical expressions are as follows:

p_s = σ(K_p · [g_{s-1}, X])    (6)

g̃_s = tanh(K_g̃ · [p_s ⊙ g_{s-1}, X])    (7)

where K_p represents the weight matrix of the reset gate, [g_{s-1}, X] represents the concatenation of the two input vectors g_{s-1} and X into one long vector, σ is the Sigmoid activation function, and K_g̃ represents the weights used to compute the hidden state.
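For illustration only (not part of the claims): a single GRU step in numpy following eqs. (6)-(9) as reconstructed above, with a reset gate p_s, an update gate q_s and a candidate hidden state; the elementwise combination in the last line is the assumed standard GRU form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(g_prev, x, K_p, K_q, K_h):
    """One GRU step: g_prev is the previous output g_{s-1}, x is the
    current input; each weight matrix acts on the concatenated vector."""
    z = np.concatenate([g_prev, x])                            # [g_{s-1}, X]
    p = sigmoid(K_p @ z)                                       # reset gate, eq. (6)
    q = sigmoid(K_q @ z)                                       # update gate, eq. (8)
    h_tilde = np.tanh(K_h @ np.concatenate([p * g_prev, x]))   # candidate state, eq. (7)
    return (1.0 - q) * g_prev + q * h_tilde                    # new output, eq. (9)

# Toy example: hidden size 4, input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
g = np.zeros(H)
x = rng.standard_normal(D)
K_p, K_q, K_h = (rng.standard_normal((H, H + D)) for _ in range(3))
print(gru_step(g, x, K_p, K_q, K_h))
```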
4. The method for detecting anomalies in skewed time series based on a cost-sensitive hybrid network according to claim 1, wherein step 2.1 specifically comprises the following steps:
Step 2.1.1: data normalization
let X = {t_m(x_m, l_m)} (m = 1, 2, …, M) denote the time-series data set, where t_m(x_m, l_m) represents a time-series sample, x_m represents the signal value of the m-th sample, l_m represents the label of the m-th sample, l_m is 0 or 1, and M represents the total number of samples; the signal values are normalized, and the normalized time-series data set is denoted X̄;
Step 2.1.2: time slicing
the long time-series data X = {t_m(x_m, l_m)} (m = 1, 2, …, M) is segmented into short overlapping segments with a sliding window: a window function window() of length w is taken, the moving step is h, and the normalized data X̄ obtained in step 2.1.1 is divided into segments, each segment L_r having length w; the expression is as follows:

L_r = window(X̄, w, h),  r = 1, 2, …, R

where L_r denotes the r-th segment, w is set to half the period of the time-series data, R is the total number of segments, and M represents the total number of samples.
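For illustration only (not part of the claims): a sketch of the preprocessing of claim 4, under the assumption that the normalization is a plain z-score over the signal values (the exact formula is not reproduced in the text above) and that the sliding window of length w advances by step h.

```python
import numpy as np

def normalize(x):
    """Assumed z-score normalization of the raw signal values."""
    return (x - x.mean()) / (x.std() + 1e-12)

def time_slices(x, w, h):
    """Cut a long 1-D series into overlapping segments of length w with
    sliding step h (step 2.1.2)."""
    starts = range(0, len(x) - w + 1, h)
    return np.stack([x[s:s + w] for s in starts])

# Toy example: a series of length 20, window w = 8 (e.g. half a period), step h = 4.
series = normalize(np.sin(np.linspace(0, 6 * np.pi, 20)))
print(time_slices(series, w=8, h=4).shape)   # (4, 8)
```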
5. The method for detecting anomalies in skewed time series based on a cost-sensitive hybrid network according to claim 1, wherein the feature learning process of step 2.2 specifically comprises the following steps:
Step 2.2.1: feature learning of the deep convolutional neural network: local feature learning of the time series is carried out with the deep convolutional neural network DCNN of step 1; the hidden layers of the convolutional network consist of three convolutional layers, each of which comprises three processing operations, and the specific flow is as follows:
Conv1 layer: assume the Conv1 layer has e_1 convolution kernels K^{Conv1} of size k_1, with e_1 = 32 and k_1 = 8; the sample L_r (L_r ∈ L_train_set) is convolved with the kernels K^{Conv1} to obtain e_1 feature vectors ỹ^{Conv1} of length w-7, and the final output y^{Conv1} of the Conv1 layer is then obtained through the BN operation and the LeakyReLU activation function; this process is expressed as follows:

ỹ^{Conv1} = L_r ⊗ K^{Conv1} + b^{Conv1}

ŷ^{Conv1} = BN(ỹ^{Conv1})

y^{Conv1} = LeakyReLU(ŷ^{Conv1})

where b^{Conv1} represents the bias of the Conv1 layer and ⊗ represents the convolution operation;
Conv2 layer: assume the Conv2 layer has e_2 convolution kernels K^{Conv2} of size k_2, with e_2 = 64 and k_2 = 5; the feature vectors y^{Conv1} obtained from the Conv1 layer are convolved with the kernels K^{Conv2} to generate e_2 feature vectors ỹ^{Conv2} of length w-11, and the final output y^{Conv2} of the Conv2 layer is then obtained through the BN operation and the LeakyReLU activation function; this process is expressed as follows:

ỹ^{Conv2} = y^{Conv1} ⊗ K^{Conv2} + b^{Conv2}

ŷ^{Conv2} = BN(ỹ^{Conv2})

y^{Conv2} = LeakyReLU(ŷ^{Conv2})

where b^{Conv2} represents the bias of the Conv2 layer;
Conv3 layer: assume the Conv3 layer has e_3 convolution kernels K^{Conv3} of size k_3, with e_3 = 128 and k_3 = 3; the feature vectors y^{Conv2} obtained from the Conv2 layer are convolved with the kernels K^{Conv3} to generate 128 feature vectors ỹ^{Conv3} of length w-13, and the final output y^{Conv3} of the Conv3 layer is then obtained through the BN operation and the LeakyReLU activation function; this process is expressed as follows:

ỹ^{Conv3} = y^{Conv2} ⊗ K^{Conv3} + b^{Conv3}

ŷ^{Conv3} = BN(ỹ^{Conv3})

y^{Conv3} = LeakyReLU(ŷ^{Conv3})

where b^{Conv3} represents the bias of the Conv3 layer;
GAP layer: the feature vectors y^{Conv3} output by the Conv3 layer are convolved with a convolution kernel K_GAP of the same dimension as y_u^{Conv3} to generate a 128-dimensional feature vector y^{GAP}:

y_u^{GAP} = y_u^{Conv3} ⊗ K_GAP

y^{GAP} = {y_1^{GAP}, y_2^{GAP}, …, y_128^{GAP}}

where y_u^{GAP} represents each component value of the feature vector y^{GAP} finally learned by the deep convolutional neural network;
Step 2.2.2: feature learning of the gated recurrent network: for the input time-series samples L_r (L_r ∈ L_train_set), the sequential features are learned with the gated recurrent network GRU containing 128 cell units, and the feature vector y^{GRU} finally output by the gated recurrent network is obtained:

y^{GRU} = F_GRU(L_r; K_p, K_q)

where K_p and K_q represent the weight matrices of the reset gate and the update gate, respectively, and F_GRU represents the mapping function of the GRU network;
Step 2.2.3: output of the cost-sensitive hybrid network model: for the input time-series sample L_r (L_r ∈ L_train_set), the cost-sensitive hybrid network model finally outputs a probability value P_nclass(L_r) with the Softmax classifier, where nclass = 0, 1; nclass = 0 indicates that L_r belongs to the majority class and nclass = 1 indicates that L_r belongs to the minority class; this process is expressed as follows:

P_nclass(L_r) = Softmax(K · concat(y^{GAP}, y^{GRU}) + b)

where y^{GAP} represents the feature vector output by the convolutional network, y^{GRU} represents the feature vector output by the GRU network, and the function concat(·) splices the feature vectors y^{GAP} and y^{GRU} into one long vector;
Step 2.2.4: parameter updating with the cost-sensitive loss function: for the probability value output by the cost-sensitive hybrid network model CSHN obtained in step 2.2.3, the similarity between the predicted value and the true value is measured with the cost-sensitive loss function of formula (11) in step 1.3, where the weights are K and the bias is b; a learning rate of 0.001 is adopted with a gradient-descent step performed for every 200 segments, 40% of the training samples are used for cross-validation, and the weights K and the bias b are updated through the back-propagation mechanism of the Adam optimization algorithm;
the final weights K and bias b are related to the penalty factors η and ν: when a minority-class sample is misdetected, the relatively large penalty factor η is used to enlarge the total loss; when a majority-class sample is misdetected, the relatively small penalty factor ν is used to control the increase of the total loss;
the proposed cost-sensitive loss function can be generalized to the multi-class case, in which the penalty factor for multi-class skewed data samples is given by formula (28):

η_c = N / (n_classes × n_c_total)    (28)

where n_c_total is the total number of samples of class c, η_c is the penalty factor corresponding to class c, and c = {1, 2, …, n_classes}.
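To make the overall pipeline of claims 1 and 5 concrete, here is a rough Keras sketch (an interpretation, not the patented implementation): a DCNN branch with the Conv1D/BN/LeakyReLU stack and global average pooling, a 128-unit GRU branch, concatenation, and a softmax output; the cost sensitivity of formula (11) is approximated here with Keras per-class weights rather than the exact custom loss, and all names and hyper-parameters not stated in the claims are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_cshn(window_len, eta, nu):
    """Hybrid model sketch: a three-layer 1-D CNN branch and a 128-unit
    GRU branch, concatenated and classified with a softmax layer."""
    inp = layers.Input(shape=(window_len, 1))

    # DCNN branch: Conv1D -> BatchNorm -> LeakyReLU, three times, then GAP.
    x = inp
    for filters, ksize in [(32, 8), (64, 5), (128, 3)]:
        x = layers.Conv1D(filters, ksize, padding="valid")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
    cnn_feat = layers.GlobalAveragePooling1D()(x)

    # GRU branch: sequential features with 128 cell units.
    gru_feat = layers.GRU(128)(inp)

    # Concatenate both feature vectors and classify (0 = majority, 1 = minority).
    merged = layers.Concatenate()([cnn_feat, gru_feat])
    out = layers.Dense(2, activation="softmax")(merged)

    model = Model(inp, out)
    # Cost sensitivity approximated with per-class weights (eta for the
    # minority class, nu for the majority class) instead of the custom loss.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy")
    return model, {0: nu, 1: eta}

# Usage sketch (X_train: array of shape (n, w, 1); y_train: 0/1 labels):
# model, class_weight = build_cshn(window_len=64, eta=5.0, nu=0.55)
# model.fit(X_train, y_train, epochs=50, class_weight=class_weight)
```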
CN202010065816.8A 2020-01-20 2020-01-20 Skew time series abnormity detection method based on cost sensitive hybrid network Active CN111275113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065816.8A CN111275113B (en) 2020-01-20 2020-01-20 Skew time series abnormity detection method based on cost sensitive hybrid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010065816.8A CN111275113B (en) 2020-01-20 2020-01-20 Skew time series abnormity detection method based on cost sensitive hybrid network

Publications (2)

Publication Number Publication Date
CN111275113A CN111275113A (en) 2020-06-12
CN111275113B true CN111275113B (en) 2023-04-07

Family

ID=71003352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065816.8A Active CN111275113B (en) 2020-01-20 2020-01-20 Skew time series abnormity detection method based on cost sensitive hybrid network

Country Status (1)

Country Link
CN (1) CN111275113B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112073298B (en) * 2020-08-26 2021-08-17 重庆理工大学 Social network link abnormity prediction system integrating stacked generalization and cost sensitive learning
CN112073227B (en) * 2020-08-26 2021-11-05 重庆理工大学 Social network link abnormity detection method by utilizing cascading generalization and cost sensitive learning
CN112039700B (en) * 2020-08-26 2021-11-23 重庆理工大学 Social network link abnormity prediction method based on stack generalization and cost sensitive learning
CN112364098A (en) * 2020-11-06 2021-02-12 广西电网有限责任公司电力科学研究院 Hadoop-based distributed power system abnormal data identification method and system
CN112836719B (en) * 2020-12-11 2024-01-05 南京富岛信息工程有限公司 Indicator diagram similarity detection method integrating two classifications and triplets
CN112686372A (en) * 2020-12-28 2021-04-20 哈尔滨工业大学(威海) Product performance prediction method based on depth residual GRU neural network
CN113035361A (en) * 2021-02-09 2021-06-25 北京工业大学 Neural network time sequence classification method based on data enhancement
CN113660196A (en) * 2021-07-01 2021-11-16 杭州电子科技大学 Network traffic intrusion detection method and device based on deep learning
CN113705715B (en) * 2021-09-04 2024-04-19 大连钜智信息科技有限公司 Time sequence classification method based on LSTM and multi-scale FCN
CN114881928A (en) * 2022-04-02 2022-08-09 合肥工业大学 Wheat frost disease detection method and system based on deep cost sensitive learning
CN116937758B (en) * 2023-09-19 2023-12-19 广州德姆达光电科技有限公司 Household energy storage power supply system and operation method thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846380A (en) * 2018-04-09 2018-11-20 北京理工大学 Facial expression recognition method based on cost-sensitive convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932480B (en) * 2018-06-08 2022-03-15 电子科技大学 Distributed optical fiber sensing signal feature learning and classifying method based on 1D-CNN
EP3594861B1 (en) * 2018-07-09 2024-04-03 Tata Consultancy Services Limited Systems and methods for classification of multi-dimensional time series of parameters

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846380A (en) * 2018-04-09 2018-11-20 北京理工大学 Facial expression recognition method based on cost-sensitive convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭洁帆; 朱焱; 陈同孝; 张真诚. Imbalanced image classification method based on convolutional neural network and cost sensitivity. Journal of Computer Applications, 2018, (07). *

Also Published As

Publication number Publication date
CN111275113A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
CN111275113B (en) Skew time series abnormity detection method based on cost sensitive hybrid network
Kwon et al. Beta shapley: a unified and noise-reduced data valuation framework for machine learning
He et al. A novel ensemble method for credit scoring: Adaption of different imbalance ratios
CN110472817B (en) XGboost integrated credit evaluation system and method combined with deep neural network
CN110287983B (en) Single-classifier anomaly detection method based on maximum correlation entropy deep neural network
CN109034194B (en) Transaction fraud behavior deep detection method based on feature differentiation
CN109993236A (en) Few sample language of the Manchus matching process based on one-shot Siamese convolutional neural networks
Mienye et al. A deep learning ensemble with data resampling for credit card fraud detection
CN110084609B (en) Transaction fraud behavior deep detection method based on characterization learning
Shang et al. A hybrid method for traffic incident detection using random forest-recursive feature elimination and long short-term memory network with Bayesian optimization algorithm
CN113269647A (en) Graph-based transaction abnormity associated user detection method
CN114547299A (en) Short text sentiment classification method and device based on composite network model
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
Das et al. Determining attention mechanism for visual sentiment analysis of an image using svm classifier in deep learning based architecture
Dong et al. CML: A contrastive meta learning method to estimate human label confidence scores and reduce data collection cost
Sreedhar et al. An Improved Technique to Identify Fake News on Social Media Network using Supervised Machine Learning Concepts
Jeyakarthic et al. Optimal bidirectional long short term memory based sentiment analysis with sarcasm detection and classification on twitter data
CN113837266A (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
Li et al. An improved adaboost algorithm for imbalanced data based on weighted KNN
Wang et al. Early diagnosis of Parkinson's disease with Speech Pronunciation features based on XGBoost model
Wu et al. From grim reality to practical solution: Malware classification in real-world noise
Zhang et al. Evaluation of judicial imprisonment term prediction model based on text mutation
Liu et al. MRD-NETS: multi-scale residual networks with dilated convolutions for classification and clustering analysis of spacecraft electrical signal
CN116170187A (en) Industrial Internet intrusion monitoring method based on CNN and LSTM fusion network
CN115688101A (en) Deep learning-based file classification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant