CN116361640A - Multivariate time series anomaly detection method based on hierarchical attention network - Google Patents

Multivariate time series anomaly detection method based on hierarchical attention network Download PDF

Info

Publication number
CN116361640A
Authority
CN
China
Prior art keywords
sequence
variable
graph
variables
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310024568.6A
Other languages
Chinese (zh)
Inventor
栾宁
张震宇
赵琳
冯曙明
曹杰
王惠
陶海成
缪佳伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Finance and Economics
Jiangsu Electric Power Information Technology Co Ltd
Original Assignee
Nanjing University of Finance and Economics
Jiangsu Electric Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Finance and Economics, Jiangsu Electric Power Information Technology Co Ltd filed Critical Nanjing University of Finance and Economics
Priority to CN202310024568.6A priority Critical patent/CN116361640A/en
Publication of CN116361640A publication Critical patent/CN116361640A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a multivariate time series anomaly detection method based on a hierarchical attention network. The method first extracts temporal and sequence features with a bidirectional gated recurrent unit (Bi-GRU). It then builds the variables and the time series into a similarity graph with a graph attention network (GAN), in which the nodes of the graph represent the variables in the sequence and the edges represent the relationships between the variables. A first graph attention layer is constructed on the graph to extract feature representations of the relationships between different variables, namely variable-level learning, and a second graph attention layer is constructed to learn the interactions between the variables and the sequence, namely sequence-level learning. Finally, an autoencoder reconstructs the time series and a loss value is computed to detect abnormal sequences. The method effectively detects anomalies in multivariate time series, and the experimental results outperform the current state-of-the-art methods.

Description

Multivariate time series anomaly detection method based on hierarchical attention network
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a multivariate time series anomaly detection method based on a hierarchical attention network.
Background
Anomaly detection aims at identifying data records or events that differ significantly from normal data. In recent years, researchers have carried out extensive work on anomaly detection in fields such as network security and fraud detection. Anomaly detection usually involves time-series data; traditional methods based on univariate time series ignore the relationships between different variables, whereas methods based on multivariate time series (Multivariate Time Series, MTS) can fully capture these temporal correlations.
Anomaly detection in multivariate time series plays an important role in most real-world physical systems, such as smart grids, and analyzing the relationships between different variables in MTS data is essential for detecting anomalies. As technology advances, MTS data has become high-dimensional and complex, with dependencies among variables, which poses challenges for anomaly detection. Methods based on distance, linear models, or probability and density estimation cannot meet the requirements of high-dimensional, complex multivariate time-series anomaly detection, while deep learning methods greatly improve the feature representation of time-series data. Most existing methods use graph neural networks to learn the relationships between variables, but they often ignore that variable relationships differ across sequences, i.e. the relationships between variables are dynamic and evolve over time. For example, suppose a smart grid uses three sensors, for voltage, temperature and current, to monitor the grid's operating conditions. Normally the three sensors follow the same trend: the temperature always rises as the voltage or current increases. In some specific cases, however, this trend is violated without the data being abnormal, e.g. when the outdoor temperature is high and the temperature inside the plant is lowered manually. This exposes a common problem in real-world applications: how to capture the temporal dependencies in different time series and integrate them with the relationships between different variables.
Patents 202210819480.9 and 202210351790.2 propose detecting abnormal sequences with graph neural networks, but they only consider the static correlations that may exist between variables and ignore that the relationships between variables differ across time sequences. Patent 202210042038X proposes time-series detection based on a generative adversarial network, which can extract correlation features between time series, but it obtains global dependencies through a fully connected layer and ignores the correlations between variables. The multivariate time-series anomaly detection methods based on inter-metric and temporal embeddings proposed by Li et al., on graph neural networks proposed by Deng et al., and on graph attention networks proposed by Zhao et al. all model the relationships between different variables implicitly or explicitly, but they study only static correlations and ignore the dynamic nature of variable relationships.
Disclosure of Invention
The invention aims to provide a multivariate time series anomaly detection method based on a hierarchical attention network (Hierarchical Attention Network for Context Anomaly Detection, abbreviated HAN-CAD), which extracts the dynamic characteristics of variables in different time sequences, improves MTS anomaly detection efficiency, effectively detects anomalies in multivariate time series, and achieves experimental results superior to the current state-of-the-art methods.
The invention adopts the following technical scheme:
a multi-variable time sequence anomaly detection method based on a layered attention network firstly adopts a Bi-directional gating circulation unit Bi-GRU to extract characteristics of time and sequence, then adopts a graph attention network GAN to construct a similar graph, nodes of the graph represent variables in the sequence, edges of the graph represent relations among the variables, a first graph attention layer is constructed on the graph, characteristic representation of the relation among different variables, namely variable learning, is extracted, a second graph attention layer is constructed on the graph, interaction between the learning variables and the sequence, namely sequence learning, finally adopts an automatic encoder to reconstruct a time sequence, calculates loss values, and performs experimental verification on a real data set.
The method comprises the following specific steps:
Step 1, data definition: define a set of multivariate time series data $X = \{x_1, x_2, \ldots, x_N\}$, $x_t \in \mathbb{R}^d$.
Step 2, feature learning (Feature Learning): extract the time-related features $v_i$ of the variables and the time series with a bidirectional gated recurrent unit (Bidirectional Gated Recurrent Unit, Bi-GRU).
Step 3, variable-level learning (Variable-level Learning): obtain the relationships $\alpha_{ij}$ between different variables with a graph attention network (Graph Attention Network, GAN).
Step 4, sequence-level learning (Sequence-level Learning): learn the evolving relationship between the variables and the time series with an attention mechanism and obtain the feature vector $s$ of the sequence.
Step 5, reconstruction-based detection (Reconstruction-based Detection): reconstruct the sequence X into $\hat{X}$ with an autoencoder (AE), thereby detecting abnormal sequences.
Further, in step 1, $x_t \in \mathbb{R}^d$ denotes the observed values (variable features) of the d variables at time step t, and N denotes the maximum length in time steps; the task is to detect whether a subsequence $X_{i:j} = \{x_i, \ldots, x_j\}$ of the normal series is abnormal, with $1 \le i \le N$ and $1 \le j \le N$.
In step 2, the steps of extracting the time-related features $v_i$ of the variables and the time series are:
1) Take a sequence $X_L \in \mathbb{R}^{d \times L}$ of length L, $1 \le L \le N$, as the input of the feature-learning model; the vectors $v_i$ and $s$ serve as the feature outputs of variable i and of the sequence, respectively;
2) Use a bidirectional gated recurrent unit (Bi-GRU) network model to capture the temporal dependencies in the time series and extract the variable and sequence features. Let $x^i = \{x^i_1, x^i_2, \ldots, x^i_L\}$ be the initialized feature representation of variable i, which consists of L consecutive observations (for example, readings over consecutive time steps of the running values of hardware such as the CPU, memory and network of a smart-grid server). The feature values are updated by nonlinear transformations:

$r^i_t = \sigma\big(W_r \, [h^i_{t-1}, x^i_t]\big)$
$z^i_t = \sigma\big(W_z \, [h^i_{t-1}, x^i_t]\big)$
$\tilde{h}^i_t = \tanh\big(W_i \, [r^i_t \odot h^i_{t-1}, x^i_t]\big)$
$h^i_t = z^i_t \odot h^i_{t-1} + (1 - z^i_t) \odot \tilde{h}^i_t$

where $W_r$, $W_z$ and $W_i$ all denote training weight matrices, $h^i_{t-1}$ denotes the model output for variable i at time step t-1, $r^i_t$ denotes the output of the Bi-GRU reset-gate neurons for variable i at time step t, $z^i_t$ denotes the output of the Bi-GRU update-gate neurons for variable i at time step t, and $\tilde{h}^i_t$ denotes the candidate hidden value of the fully connected layer computed with the tanh activation function for variable i at time step t; if the update-gate value is close to 0 the hidden value of the previous time step t-1 is discarded, and if it is close to 1 the hidden value of the previous time step t-1 is retained; $\overrightarrow{h}^i_t$ denotes the forward output of the Bi-GRU model and $\overleftarrow{h}^i_t$ its backward output; the final feature representation of variable i is:

$v_i = \big[\overrightarrow{h}^i_L \,;\, \overleftarrow{h}^i_L\big]$
In step 3, the steps of obtaining the relationships $\alpha_{ij}$ between different variables are:
1) Model the relationships with a graph attention network GAN to obtain the updated variable features. First construct a similarity graph between the different variables, the variable graph $G = \{V, E\}$, whose nodes V and edges E represent the variables and the relationships between them, respectively; the node set is $V = \{v_1, v_2, \ldots, v_d\}$, and the node features in V are the variable features extracted by the Bi-GRU, i.e. $\{v_1, v_2, \ldots, v_d\}$; the similarity between variables is computed as the cosine similarity:

$e_{ij} = \dfrac{v_i^{\top} v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$

and the results are sorted in descending order;
2) Select the top k most similar pairs as edges and model the relationships between the variables with a graph attention mechanism, thereby learning the variable features;
3) Use a multi-head attention mechanism to extract more expressive node features $v'_i$:

$v'_i = \Big\Vert_{h=1}^{H} \, \sigma\Big(\sum_{j \in N_i} \alpha^h_{ij} W^h v_j\Big)$

where H denotes the number of heads, $1 \le h \le H$, $N_i$ denotes the neighbourhood set of node i, i.e. the top k attention-weighted nodes $v_j$ connected to node $v_i$, of which node j is one, and $\alpha_{ij}$ is the attention weight of node i's neighbours normalized with the softmax function, specifically the attention value measuring the contribution of neighbour node j to node i; $\alpha_{ij}$ is computed as:

$r_{ij} = \mathrm{LeakyReLU}\big(W_r (v_i \oplus v_j)\big)$
$\alpha_{ij} = \dfrac{\exp(r_{ij})}{\sum_{l=1}^{d} \exp(r_{il})}$

where $r_{ij}$ denotes the dependency of node j on node i, $\oplus$ denotes concatenation, $W_r$ denotes training weights, LeakyReLU is a nonlinear activation function, and d is the number of nodes, i.e. the d variables of step 1.
In step 4, the steps of obtaining the feature vector s of the sequence are:
1) Learn the interactions between the variables and the sequence with an attention mechanism; the sequence attention weights $\beta_j$ are computed as:

$m_j = \mathrm{LeakyReLU}\big(W_m v_j\big)$
$\beta_j = \dfrac{\exp(m_j^{\top} s)}{\sum_{l=1}^{d} \exp(m_l^{\top} s)}$

where $m_j$ denotes the value of neighbour node j after the nonlinear LeakyReLU transformation, s denotes the feature vector of the sequence, the normalization is computed with the softmax function, and $W_m$ denotes training parameters;
2) Update the sequence feature:

$s' = \sum_{j=1}^{d} \beta_j v_j$
In step 5, the steps of reconstructing the sequence X into $\hat{X}$ are:
1) Obtain the sequence $X = \{x_1, x_2, \ldots, x_L\}$ through the hierarchical attention process of steps 1-4;
2) Reconstruct the sequence X with an autoencoder; let $f_e(\cdot)$ denote encoding and $f_d(\cdot)$ denote decoding; for the feature vector $s'$ of sequence X, the encoding process maps $s'$ to a latent representation z and the decoding process maps z to the reconstruction $\hat{X}$:

$z = f_e(s', W_e)$
$\hat{X} = f_d(z, W_d)$

where $W_e$ and $W_d$ are training parameters;
3) Compute the loss function Loss:

$\mathrm{Loss} = \big\lVert X - \hat{X} \big\rVert_2$

where $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm; if the reconstruction loss is greater than a certain threshold, the sequence is regarded as abnormal, and the threshold is adjusted continuously to obtain the maximum F1 score.
The invention has the following characteristics:
(1) To make full use of the information from earlier time steps (forward) and later time steps (backward) and to capture the temporal dependencies in the sequence, the invention adopts a bidirectional gated recurrent unit network Bi-GRU, which better extracts the variable and sequence features.
(2) Learning the features of the variables in a multivariate time series in isolation often fails to adequately capture the characteristics of sequence anomalies; moreover, the relationships between the variables can reveal different time-dependent patterns. To better detect anomalies in the sequence, the invention therefore models the relationships between the variables with the graph attention network GAN, analyzes their mutual influence, and integrates the variable features.
(3) The relationships between variables are not stable but evolve over time, and anomalies of strongly correlated variables can vary significantly across time sequences. Previous studies treat the variables and the sequence equally and assign them the same weight, which does not reflect the effect of the sequence on the variables. To capture the time-series dependencies, the invention learns the interactions between the variables and the sequence with another attention mechanism.
Drawings
FIG. 1 is the HAN-CAD framework of the detection method proposed by the present invention;
FIG. 2 shows the F1 values of the HAN-CAD, MTAD-GAT, GDN and InterFusion methods under different sliding windows;
FIG. 3 shows the F1 values of the MTAD-GAT, GDN and proposed HAN-CAD methods at different edge ratios on the three data sets.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings. The embodiments described below are only some embodiments of the invention; a person skilled in the art can obtain other embodiments from them without inventive effort.
A multivariate time series anomaly detection method based on a hierarchical attention network: first, a bidirectional gated recurrent unit Bi-GRU extracts temporal and sequence features; then a graph attention network GAN builds a similarity graph, in which the nodes of the graph represent the variables in the sequence and the edges represent the relationships between the variables; a first graph attention layer is constructed on the graph to extract feature representations of the relationships between different variables, namely variable-level learning; a second graph attention layer is constructed on the graph to learn the interactions between the variables and the sequence, namely sequence-level learning; finally, an autoencoder reconstructs the time series, loss values are computed, and the method is experimentally verified on real data sets. The method comprises the following steps:
step 1: data definition
Define a set of multivariate time series data of length N, $X = \{x_1, x_2, \ldots, x_N\}$, where $x_t \in \mathbb{R}^d$ denotes the observed values (variable features) of the d variables at time step t and N denotes the maximum length in time steps; the task is to detect whether a subsequence $X_{i:j} = \{x_i, \ldots, x_j\}$ of the normal series is abnormal, with $1 \le i \le N$ and $1 \le j \le N$. For example, with i = 1, j = 4 and d = 1 (one variable, current; d = 2 would represent the two variables current and voltage), $X_{1:4}$ denotes the observed values, i.e. the feature values, of the current variable from time step 1 to time step 4.
Step 2: feature learning
1) Take a sequence $X_L \in \mathbb{R}^{d \times L}$ of length L, $1 \le L \le N$, as the input of the feature-learning model; the vectors $v_i$ and $s$ serve as the feature outputs of variable i and of the sequence, respectively;
2) Use a bidirectional gated recurrent unit (Bi-GRU) network model to capture the temporal dependencies in the time series and extract the variable and sequence features. Let $x^i = \{x^i_1, x^i_2, \ldots, x^i_L\}$ be the initialized feature representation of variable i, which consists of L consecutive observations (for example, readings over consecutive time steps of the running values of hardware such as the CPU, memory and network of a smart-grid server). The feature values are updated by nonlinear transformations:

$r^i_t = \sigma\big(W_r \, [h^i_{t-1}, x^i_t]\big)$ (1)
$z^i_t = \sigma\big(W_z \, [h^i_{t-1}, x^i_t]\big)$ (2)
$\tilde{h}^i_t = \tanh\big(W_i \, [r^i_t \odot h^i_{t-1}, x^i_t]\big)$ (3)
$h^i_t = z^i_t \odot h^i_{t-1} + (1 - z^i_t) \odot \tilde{h}^i_t$ (4)

In formulas (1), (2), (3) and (4), $W_r$, $W_z$ and $W_i$ all denote training weight matrices, and $h^i_{t-1}$ denotes the model output for variable i at time step t-1. $r^i_t$ denotes the output of the Bi-GRU reset-gate neurons for variable i at time step t, $z^i_t$ denotes the output of the Bi-GRU update-gate neurons for variable i at time step t, and $\tilde{h}^i_t$ denotes the candidate hidden value of the fully connected layer computed with the tanh activation function for variable i at time step t. If the update-gate value is close to 0, the hidden value of the previous time step t-1 is discarded; if it is close to 1, the hidden value of the previous time step t-1 is retained. $\overrightarrow{h}^i_t$ denotes the forward output of the Bi-GRU model and $\overleftarrow{h}^i_t$ its backward output. Thus, the final feature representation of variable i is:

$v_i = \big[\overrightarrow{h}^i_L \,;\, \overleftarrow{h}^i_L\big]$ (5)
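A minimal PyTorch sketch of this feature-learning step, under the assumption (not stated explicitly in the patent) that a shared Bi-GRU processes each variable's L observations as a univariate sequence and that $v_i$ concatenates the final forward and backward hidden states; all layer sizes are illustrative:

    import torch
    import torch.nn as nn

    class FeatureLearning(nn.Module):
        def __init__(self, hidden=32):
            super().__init__()
            # a bidirectional GRU implements the gate updates of formulas (1)-(4)
            self.gru = nn.GRU(input_size=1, hidden_size=hidden,
                              batch_first=True, bidirectional=True)

        def forward(self, x):                          # x: (batch, L, d)
            b, L, d = x.shape
            u = x.permute(0, 2, 1).reshape(b * d, L, 1)  # one sequence per variable
            out, _ = self.gru(u)                       # (b*d, L, 2*hidden)
            v = out[:, -1, :]                          # forward/backward concat, as in formula (5)
            return v.reshape(b, d, -1)                 # v_i per variable: (b, d, 2*hidden)

    v = FeatureLearning()(torch.randn(8, 100, 2))      # batch of 8 windows, L=100, d=2
    print(v.shape)                                     # torch.Size([8, 2, 64])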
step 3: variable learning
1) Model the relationships with a graph attention network GAN to obtain the updated variable features. First construct a similarity graph between the different variables, the variable graph $G = \{V, E\}$, whose nodes V and edges E represent the variables and the relationships between them, respectively; the node set is $V = \{v_1, v_2, \ldots, v_d\}$, and the node features in V are the variable features extracted by the Bi-GRU in step 2, i.e. $\{v_1, v_2, \ldots, v_d\}$. The similarity between variables is computed as the cosine similarity:

$e_{ij} = \dfrac{v_i^{\top} v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$ (6)

The results are sorted in descending order.
2) Select the top k most similar pairs as edges and model the relationships between the variables with a graph attention mechanism, thereby learning the variable features.
3) Use a multi-head attention mechanism to extract more expressive node features $v'_i$:

$v'_i = \Big\Vert_{h=1}^{H} \, \sigma\Big(\sum_{j \in N_i} \alpha^h_{ij} W^h v_j\Big)$ (7)

In formula (7), H denotes the number of heads, $1 \le h \le H$, $N_i$ denotes the neighbourhood set of node i, i.e. the top k attention-weighted nodes $v_j$ connected to node $v_i$, of which node j is one, and $\alpha_{ij}$ is the attention weight of node i's neighbours normalized with the softmax function, specifically the attention value measuring the contribution of neighbour node j to node i. $\alpha_{ij}$ is computed as:

$r_{ij} = \mathrm{LeakyReLU}\big(W_r (v_i \oplus v_j)\big)$ (8)
$\alpha_{ij} = \dfrac{\exp(r_{ij})}{\sum_{l=1}^{d} \exp(r_{il})}$ (9)

In formulas (8) and (9), $r_{ij}$ denotes the dependency of node j on node i, $\oplus$ denotes concatenation, $W_r$ denotes training weights, LeakyReLU is a nonlinear activation function, and d is the number of nodes, i.e. the d variables of step 1.
Step 4: sequence learning
1) Learn the interactions between the variables and the sequence with an attention mechanism; the sequence attention weights $\beta_j$ are computed as:

$m_j = \mathrm{LeakyReLU}\big(W_m v_j\big)$ (10)
$\beta_j = \dfrac{\exp(m_j^{\top} s)}{\sum_{l=1}^{d} \exp(m_l^{\top} s)}$ (11)

In formulas (10) and (11), $m_j$ denotes the value of neighbour node j after the nonlinear LeakyReLU transformation, s denotes the feature vector of the sequence, the normalization is computed with the softmax function, and $W_m$ denotes training parameters.
2) Update the sequence feature:

$s' = \sum_{j=1}^{d} \beta_j v_j$ (12)
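A short sketch of this sequence-level attention (Python/PyTorch; scoring $m_j$ against s in formula (11) is a reconstruction from the surrounding text, and all names and shapes are illustrative):

    import torch
    import torch.nn.functional as F

    def sequence_attention(v, s, W_m):
        # v: (d, f) variable features, s: (f,) sequence feature
        m = F.leaky_relu(v @ W_m)                # m_j, formula (10)
        beta = torch.softmax(m @ s, dim=0)       # beta_j, formula (11)
        s_new = (beta.unsqueeze(-1) * v).sum(0)  # s' = sum_j beta_j v_j, formula (12)
        return s_new, beta

    v, s = torch.randn(5, 8), torch.randn(8)
    W_m = torch.randn(8, 8)
    s_new, beta = sequence_attention(v, s, W_m)
    print(s_new.shape, float(beta.sum()))        # torch.Size([8]) 1.0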
step 5: detection method based on reconstruction
1) Obtain the sequence $X = \{x_1, x_2, \ldots, x_L\}$ through the hierarchical attention process of steps 1 to 4.
2) Reconstruct the sequence X with an autoencoder. Let $f_e(\cdot)$ denote encoding and $f_d(\cdot)$ denote decoding; for the feature vector $s'$ of sequence X, the encoding process maps $s'$ to a latent representation z and the decoding process maps z to the reconstruction $\hat{X}$:

$z = f_e(s', W_e)$ (13)
$\hat{X} = f_d(z, W_d)$ (14)

In formulas (13) and (14), $W_e$ and $W_d$ are training parameters.
3) Compute the loss function Loss:

$\mathrm{Loss} = \big\lVert X - \hat{X} \big\rVert_2$ (15)

In formula (15), $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm. If the reconstruction loss is greater than a certain threshold, the sequence can be regarded as abnormal; the threshold is adjusted continuously to obtain the maximum F1 score.
The validity of the proposed method is verified based on the actual data.
The experiments use three multivariate time-series anomaly detection datasets: SMD, WADI and ASD. The SMD dataset is available at https://github.com/NetManAIOps/OmniAnomaly, the WADI dataset at https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/, and the ASD dataset at https://github.com/zhhlee/InterFusion. SMD and WADI are experimental datasets commonly used for multivariate time-series anomaly detection, and ASD is a newer dataset from a large Internet company. The sample information of each dataset is shown in Table 1.
TABLE 1 Sample information of the three data sets

Data set                   ASD     SMD     WADI
Feature number             19      38      112
Training sample number     8640    28479   335999
Test sample number         4320    28479   172801
Abnormal sample ratio (%)  3.40    5.84    5.85
In the experiments, the sliding-window lengths for ASD, SMD and WADI were set to 100, 100 and 30, respectively. Model parameters were optimized with the Adam optimizer, the learning rate was set to 5e-4, and the variable and sequence representation length was 64. The dropout algorithm was used to prevent overfitting of the training results, with the dropout probability set to 0.2, meaning that 20% of the training model's neurons are randomly dropped, i.e. made inactive. The number of heads of the multi-head attention mechanism was 2. All experiments were trained on a Microsoft server with a 3.60 GHz Intel i9-9900K CPU, 11 GB of GPU memory, and an Nvidia GeForce RTX 2080 Ti graphics chip.
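The experimental settings above, collected into one configuration sketch (Python/PyTorch; the dictionary layout and the placeholder model are illustrative, only the values come from the text above):

    import torch

    config = {
        "window": {"ASD": 100, "SMD": 100, "WADI": 30},  # sliding-window lengths
        "lr": 5e-4,                                      # Adam learning rate
        "repr_dim": 64,                                  # variable/sequence representation
        "dropout": 0.2,                                  # 20% of neurons dropped
        "heads": 2,                                      # multi-head attention heads
    }

    model = torch.nn.Linear(1, 1)                        # placeholder for the HAN-CAD model
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    dropout = torch.nn.Dropout(p=config["dropout"])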
To verify the superiority of the proposed HAN-CAD method, five recently proposed MTS anomaly detection methods, LSTM-AE, MAD-GAN, MTAD-GAT, GDN and InterFusion, were selected and compared with HAN-CAD in terms of precision, recall and F1 value. The experimental results are shown in Table 2. They show that the proposed HAN-CAD detection method achieves better precision and recall and the largest F1 value. Table 2 also shows that: 1) ranked from best to worst, the methods are the proposed method, InterFusion, GDN, MTAD-GAT, MAD-GAN and LSTM-AE, where the proposed HAN-CAD, InterFusion, GDN and MTAD-GAT are all superior to the traditional reconstruction methods; 2) all detection methods perform worse on the WADI dataset than on the other two datasets, because WADI contains 112 variables with complex relationships among them, which makes anomaly detection difficult; nevertheless, the proposed HAN-CAD performs best, verifying that it can capture more complex variable relationships, namely the dynamic relationships between variables in different sequences.
TABLE 2 Precision, recall and F1 values of LSTM-AE, MAD-GAN, MTAD-GAT, GDN, InterFusion and the proposed HAN-CAD method
FIG. 2 compares the F1 values of the proposed HAN-CAD method with MTAD-GAT, GDN and InterFusion under different sliding windows; in each group of FIG. 2, the first bar from the left is MTAD-GAT, the second GDN, the third InterFusion, and the fourth HAN-CAD. FIG. 2 shows that the proposed HAN-CAD method always performs best and most stably on the three datasets, and obtains the highest F1 value when the sliding-window length is 100. The results of the other three methods fluctuate to some extent, indicating that integrating the relationships between variables with the time series makes graph-neural-network-based MTS anomaly detection more robust.
Fig. 3 shows the F1 values of MTAD-GAT, GDN and the proposed HAN-CAD method at different edge ratios on the three datasets. Fig. 3 shows that the proposed HAN-CAD outperforms the graph-neural-network methods MTAD-GAT and GDN. In addition, MTAD-GAT and GDN perform worse when the graph has fewer edges, since graph-neural-network-based anomaly detection methods can extract few nonlinear structural features from a sparse graph with a small number of edges.
Table 3 compares, on the three datasets, the precision, recall and F1 values of the proposed HAN-CAD method, the w/o feature learning variant (HAN-CAD without Bi-GRU feature learning), and the w/o variable learning variant (HAN-CAD without GAN variable learning). Table 3 shows that without Bi-GRU and GAN the experimental result values are considerably lower, illustrating that the graph attention mechanism is very important for feature learning, because it can extract the complex relationships between variables.
TABLE 3 Precision, recall and F1 values of the HAN-CAD, w/o feature learning and w/o variable learning methods
Based on a hierarchical attention network, the method extracts the dynamic characteristics of variables in different time sequences, improves MTS anomaly detection efficiency, effectively detects anomalies in multivariate time series, and achieves experimental results superior to the current state-of-the-art methods.

Claims (7)

1. A multivariate time series anomaly detection method based on a hierarchical attention network, characterized by comprising the following steps: first, extracting temporal and sequence features with a bidirectional gated recurrent unit Bi-GRU; then constructing a similarity graph with a graph attention network GAN, the nodes of the graph representing the variables in the sequence and the edges representing the relationships between the variables; constructing a first graph attention layer on the graph and extracting feature representations of the relationships between different variables, namely variable-level learning; constructing a second graph attention layer on the graph and learning the interactions between the variables and the sequence, namely sequence-level learning; finally, reconstructing the time series with an autoencoder, computing loss values, and performing experimental verification on real data sets.
2. The hierarchical attention network based multivariate time series anomaly detection method of claim 1, comprising the following steps:
Step 1, data definition: define a set of multivariate time series data $X = \{x_1, x_2, \ldots, x_N\}$, $x_t \in \mathbb{R}^d$;
Step 2, feature learning: extract the time-related features $v_i$ of the variables and the time series with a bidirectional gated recurrent unit;
Step 3, variable-level learning: obtain the relationships $\alpha_{ij}$ between different variables with a graph attention network;
Step 4, sequence-level learning: learn the evolving relationship between the variables and the time series with an attention mechanism and obtain the feature vector s of the sequence;
Step 5, reconstruction-based detection: reconstruct the sequence X into $\hat{X}$ with an autoencoder, thereby detecting abnormal sequences.
3. The hierarchical attention network based multivariate time series anomaly detection method of claim 2, wherein: in step 1, $x_t \in \mathbb{R}^d$ denotes the observed values (variable features) of the d variables at time step t, and N denotes the maximum length in time steps; the task is to detect whether a subsequence $X_{i:j} = \{x_i, \ldots, x_j\}$ of the normal series is abnormal, with $1 \le i \le N$ and $1 \le j \le N$.
4. The hierarchical attention network based multivariate time series anomaly detection method of claim 2, wherein in step 2 the steps of extracting the time-related features $v_i$ of the variables and the time series are:
1) Take a sequence $X_L \in \mathbb{R}^{d \times L}$ of length L, $1 \le L \le N$, as the input of the feature-learning model; the vectors $v_i$ and $s$ serve as the feature outputs of variable i and of the sequence, respectively;
2) Use a bidirectional gated recurrent unit (Bi-GRU) network model to capture the temporal dependencies in the time series and extract the variable and sequence features; let $x^i = \{x^i_1, x^i_2, \ldots, x^i_L\}$ be the initialized feature representation of variable i, which consists of L consecutive observations; the feature values are updated by nonlinear transformations:

$r^i_t = \sigma\big(W_r \, [h^i_{t-1}, x^i_t]\big)$
$z^i_t = \sigma\big(W_z \, [h^i_{t-1}, x^i_t]\big)$
$\tilde{h}^i_t = \tanh\big(W_i \, [r^i_t \odot h^i_{t-1}, x^i_t]\big)$
$h^i_t = z^i_t \odot h^i_{t-1} + (1 - z^i_t) \odot \tilde{h}^i_t$

where $W_r$, $W_z$ and $W_i$ all denote training weight matrices, $h^i_{t-1}$ denotes the model output for variable i at time step t-1, $r^i_t$ denotes the output of the Bi-GRU reset-gate neurons for variable i at time step t, $z^i_t$ denotes the output of the Bi-GRU update-gate neurons for variable i at time step t, and $\tilde{h}^i_t$ denotes the candidate hidden value of the fully connected layer computed with the tanh activation function for variable i at time step t; if the update-gate value is close to 0 the hidden value of the previous time step t-1 is discarded, and if it is close to 1 the hidden value of the previous time step t-1 is retained; $\overrightarrow{h}^i_t$ denotes the forward output of the Bi-GRU model and $\overleftarrow{h}^i_t$ its backward output; the final feature representation of variable i is:

$v_i = \big[\overrightarrow{h}^i_L \,;\, \overleftarrow{h}^i_L\big]$
5. The hierarchical attention network based multivariate time series anomaly detection method of claim 2, wherein in step 3 the steps of obtaining the relationships $\alpha_{ij}$ between different variables are:
1) Model the relationships with a graph attention network GAN to obtain the updated variable features; first construct a similarity graph between the different variables, the variable graph $G = \{V, E\}$, whose nodes V and edges E represent the variables and the relationships between them, respectively; the node set is $V = \{v_1, v_2, \ldots, v_d\}$, and the node features in V are the variable features extracted by the Bi-GRU, i.e. $\{v_1, v_2, \ldots, v_d\}$; the similarity between variables is computed as the cosine similarity:

$e_{ij} = \dfrac{v_i^{\top} v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$

and the results are sorted in descending order;
2) Select the top k most similar pairs as edges and model the relationships between the variables with a graph attention mechanism, thereby learning the variable features;
3) Use a multi-head attention mechanism to extract more expressive node features $v'_i$:

$v'_i = \Big\Vert_{h=1}^{H} \, \sigma\Big(\sum_{j \in N_i} \alpha^h_{ij} W^h v_j\Big)$

where H denotes the number of heads, $1 \le h \le H$, $N_i$ denotes the neighbourhood set of node i, i.e. the top k attention-weighted nodes $v_j$ connected to node $v_i$, of which node j is one, and $\alpha_{ij}$ is the attention weight of node i's neighbours normalized with the softmax function, specifically the attention value measuring the contribution of neighbour node j to node i; $\alpha_{ij}$ is computed as:

$r_{ij} = \mathrm{LeakyReLU}\big(W_r (v_i \oplus v_j)\big)$
$\alpha_{ij} = \dfrac{\exp(r_{ij})}{\sum_{l=1}^{d} \exp(r_{il})}$

where $r_{ij}$ denotes the dependency of node j on node i, $\oplus$ denotes concatenation, $W_r$ denotes training weights, LeakyReLU is a nonlinear activation function, and d is the number of nodes.
6. The hierarchical attention network based multivariate time series anomaly detection method of claim 2, wherein in step 4 the steps of obtaining the feature vector s of the sequence are:
1) Learn the interactions between the variables and the sequence with an attention mechanism; the sequence attention weights $\beta_j$ are computed as:

$m_j = \mathrm{LeakyReLU}\big(W_m v_j\big)$
$\beta_j = \dfrac{\exp(m_j^{\top} s)}{\sum_{l=1}^{d} \exp(m_l^{\top} s)}$

where $m_j$ denotes the value of neighbour node j after the nonlinear LeakyReLU transformation, s denotes the feature vector of the sequence, the normalization is computed with the softmax function, and $W_m$ denotes training parameters;
2) Update the sequence feature:

$s' = \sum_{j=1}^{d} \beta_j v_j$
7. The hierarchical attention network based multivariate time series anomaly detection method of claim 2, wherein in step 5 the steps of reconstructing the sequence X into $\hat{X}$ are:
1) Obtain the sequence $X = \{x_1, x_2, \ldots, x_L\}$ through the hierarchical attention process of steps 1-4;
2) Reconstruct the sequence X with an autoencoder; let $f_e(\cdot)$ denote encoding and $f_d(\cdot)$ denote decoding; for the feature vector $s'$ of sequence X, the encoding process maps $s'$ to a latent representation z and the decoding process maps z to the reconstruction $\hat{X}$:

$z = f_e(s', W_e)$
$\hat{X} = f_d(z, W_d)$

where $W_e$ and $W_d$ are training parameters;
3) Compute the loss function Loss:

$\mathrm{Loss} = \big\lVert X - \hat{X} \big\rVert_2$

where $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm; if the reconstruction loss is greater than a certain threshold, the sequence is regarded as abnormal, and the threshold is adjusted continuously to obtain the maximum F1 score.
CN202310024568.6A 2023-01-09 2023-01-09 Multivariate time series anomaly detection method based on hierarchical attention network Pending CN116361640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310024568.6A CN116361640A (en) Multivariate time series anomaly detection method based on hierarchical attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310024568.6A CN116361640A (en) Multivariate time series anomaly detection method based on hierarchical attention network

Publications (1)

Publication Number Publication Date
CN116361640A true CN116361640A (en) 2023-06-30

Family

ID=86938454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310024568.6A Pending CN116361640A (en) Multivariate time series anomaly detection method based on hierarchical attention network

Country Status (1)

Country Link
CN (1) CN116361640A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541794A (en) * 2023-07-06 2023-08-04 中国科学技术大学 Sensor data anomaly detection method based on adaptive graph attention network
CN116541794B (en) * 2023-07-06 2023-10-20 中国科学技术大学 Sensor data anomaly detection method based on adaptive graph attention network

Similar Documents

Publication Publication Date Title
Wang et al. A novel weighted sparse representation classification strategy based on dictionary learning for rotating machinery
CN110020623B (en) Human body activity recognition system and method based on conditional variation self-encoder
CN111914873A (en) Two-stage cloud server unsupervised anomaly prediction method
Che et al. Hybrid multimodal fusion with deep learning for rolling bearing fault diagnosis
CN109612513B (en) Online anomaly detection method for large-scale high-dimensional sensor data
CN113673346B (en) Motor vibration data processing and state identification method based on multiscale SE-Resnet
Lee et al. Studies on the GAN-based anomaly detection methods for the time series data
CN113159163A (en) Lightweight unsupervised anomaly detection method based on multivariate time series data analysis
CN112465798B (en) Anomaly detection method based on generation countermeasure network and memory module
CN116361640A (en) Multivariate time series anomaly detection method based on hierarchical attention network
CN116522265A (en) Industrial Internet time sequence data anomaly detection method and device
CN114067915A (en) scRNA-seq data dimension reduction method based on deep antithetical variational self-encoder
CN116796272A (en) Method for detecting multivariate time series anomalies based on Transformer
CN112163020A (en) Multi-dimensional time series anomaly detection method and system
CN116451117A (en) Power data anomaly detection method based on federal learning
CN115587335A (en) Training method of abnormal value detection model, abnormal value detection method and system
CN117056874A (en) Unsupervised electricity larceny detection method based on deep twin autoregressive network
Zhang et al. MS-TCN: A multiscale temporal convolutional network for fault diagnosis in industrial processes
Terbuch et al. Hybrid machine learning for anomaly detection in industrial time-series measurement data
CN116306780B (en) Dynamic graph link generation method
CN117009900A (en) Internet of things signal anomaly detection method and system based on graph neural network
CN111858343A (en) Countermeasure sample generation method based on attack capability
CN116400168A (en) Power grid fault diagnosis method and system based on depth feature clustering
CN110990383A (en) Similarity calculation method based on industrial big data set
Wu et al. Genetic-algorithm-based Convolutional Neural Network for Robust Time Series Classification with Unreliable Data.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination