AU2021106594A4

AU2021106594A4 - Online anomaly detection method and system for streaming data

Info

Publication number: AU2021106594A4
Application number: AU2021106594A
Authority: AU
Inventors: Xingrong FAN; Zhiwei Guo; Yu Shen; Jianhui Wang; Xianming ZHANG; Dujiang ZHAO; Xiaolong ZHAO
Original assignee: Engineering Research Center for Waste Oil Recovery Technology and Equipment Ministry of Education Chongqing Technology and Business University; Chongqing Technology and Business University S&T Developing Ltd
Current assignee: Engineering Research Center for Waste Oil Recovery Technology and Equipment Ministry of Education Chongqing Technology and Business University; Chongqing Technology and Business University S&T Developing Ltd
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2021-11-11
Anticipated expiration: 2029-08-23

Abstract

The present disclosure relates to the technical field of streaming data mining, and in particular, to an online anomaly detection method and system for streaming data. The online anomaly detection method for streaming data includes: processing a data block by using a matrix sketching model to obtain a sketch matrix, where the data block is transmitted at a high speed; importing the sketch matrix to a hash learning model to obtain an optimal model parameter and a feature hash table for the current moment; and constructing an anomaly score calculation model based on the optimal model parameter and the feature hash table, importing to-be-detected sample data to the anomaly score calculation model for detection, and determining whether the to-be-detected sample data is abnormal. The present disclosure uses matrix sketching and hash learning technologies. This reduces data sizes and feature dimensions and improves detection speed and storage efficiency. In addition, the detection model can be updated online to adapt to dynamic changes of data distribution. Therefore, when a large amount of high-dimensional streaming data is transmitted at a high speed, anomalies in the streaming data can be efficiently detected in real time.

Description

[FIG. 2 (drawing text): normal data block $D_t$; matrix sketching model (matrix sketching-based sub model); sketch matrix $B_t$; hash learning model (hash learning-based sub model); feature hash table $H_t$; anomaly score calculation model (coupling model for detecting anomalies in streaming data); abnormal data; normal data.]

ONLINE ANOMALY DETECTION METHOD AND SYSTEM FOR STREAMING DATA

TECHNICAL FIELD

[01] The present disclosure relates to the technical field of streaming data mining, and in particular, to an online anomaly detection method and system for streaming data.

BACKGROUND ART

[02] Streaming data (SD) is a continuous flow of sequential data that is transmitted in a large volume and at a high speed. An anomaly detection method can be used to detect anomalies in streaming data and is essential to data mining.

[03] Demand is growing for detecting anomalies in streaming data with limited storage and computing resources. Key technologies based on distance, density, incremental learning, or ensemble learning have been proposed to perform online anomaly detection on a large amount of high-dimensional high-speed streaming data. In addition, various technologies that integrate incremental learning and ensemble learning have been developed to reduce computing and storage overheads.

[04] However, these existing technologies are based on space division and use multiple detectors to detect anomalies in streaming data. This incurs large storage and computing overheads, which reduces the efficiency of detecting anomalies in high-dimensional streaming data. In addition, these technologies ignore the encoding characteristics of streaming data. Therefore, an online anomaly detection method for streaming data is urgently required.

SUMMARY

[05] To resolve the technical issues in the prior art, the present disclosure provides an online anomaly detection method for streaming data. The method includes: obtaining a normal data block that is transmitted at a high speed and importing data in the normal data block to an online anomaly detection model for training; importing to-be-detected sample data to the trained online anomaly detection model and then identifying whether the to-be-detected sample data is normal data; and if the to-be-detected sample data is normal data, updating the normal data block to generate a new normal data block, and using the new normal data block as training data for a next anomaly detection; or if the to-be-detected sample data is abnormal data, labeling the abnormal data. The online anomaly detection model consists of a modified matrix sketching model, a hash learning model, and an anomaly score calculation model.

[06] An online anomaly detection system for streaming data includes a data collection module, a matrix sketching module, a hash learning module, an anomaly identification module, an identification result output module, and a model update module.

[07] The present disclosure combines a matrix sketching technology with a hash learning technology and proposes a new solution for detecting anomalies in a large amount of high-dimensional high-speed streaming data online. This facilitates online detection of anomalies in a large amount of high-dimensional high-speed streaming data and in 5G scenarios, and provides technical support for achieving ultra-high speed and performance, ultra-low latency, and ultra-high computing and storage efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

[08] FIG. 1 is a structural block diagram of an online anomaly detection method for streaming data according to the present disclosure; and

[09] FIG. 2 is a technical roadmap of an online anomaly detection method for streaming data according to the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[10] The following clearly and completely describes the technical solutions in the embodiments of the present disclosure with reference to the accompanying drawings of the present disclosure.

[11] FIG. 1 is a structural block diagram of an online anomaly detection method for streaming data. The online anomaly detection method consists of two sub models. The upper part of FIG. 1 represents the matrix sketching-driven sub model, which is built on the matrix sketching-based anomaly detection technology. The lower part of FIG. 1 represents the hash learning-driven sub model, which is built on the hash learning-based anomaly detection technology. The two sub models are bidirectionally connected by a coupling operator that provides the flexibility to represent various forms of interaction between the two sub models. Data is imported to and processed by the sub models. Then, normal data and abnormal data are obtained. $X_{t+1}$ represents streaming data that is imported at moment $t+1$. $Y_{t+1}$ and $\bar{Y}_{t+1}$ represent the normal data and abnormal data, respectively, that are detected in real time by the sub models at moment $t+1$.

[12] In the present disclosure, a large amount of high-dimensional high-speed streaming data is abstracted as an ever-increasing dynamic data set in which data is continuously generated with time, $SD = \{D_t \in \mathbb{R}^{d \times n},\ t = 1, 2, \ldots\}$, namely, $D_1, D_2, \ldots, D_t, \ldots$. $D_t$ represents a normal data block that is transmitted at a high speed at moment $t$. $d$ and $n$ represent the feature space dimension and the sample data size of the data block $D_t$, respectively.
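As a rough illustration of this abstraction, the following Python sketch groups an incoming stream of d-dimensional samples into d x n blocks, one block per moment t. The function name `data_blocks` and the choice to hold back an incomplete trailing block are assumptions made for illustration; the disclosure does not prescribe how blocks are assembled.

```python
from typing import Iterable, Iterator

import numpy as np


def data_blocks(samples: Iterable[np.ndarray], n: int) -> Iterator[np.ndarray]:
    """Yield d x n blocks D_1, D_2, ... from a stream of d-dimensional samples.

    Hypothetical helper: samples are stored as columns, matching the d x n
    layout used in the text. An incomplete trailing block is not emitted.
    """
    buffer = []
    for x in samples:
        buffer.append(np.asarray(x, dtype=float))
        if len(buffer) == n:
            yield np.column_stack(buffer)  # shape (d, n)
            buffer = []


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = (rng.normal(size=4) for _ in range(10))  # d = 4
    for t, D_t in enumerate(data_blocks(stream, n=5), start=1):
        print(t, D_t.shape)
```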

[13] The online anomaly detection method for streaming data includes the following steps: Obtain a normal data block that is transmitted at a high speed and import data in the normal data block to an online anomaly detection model for training. Import to-be-detected sample data to the trained online anomaly detection model and identify whether the to-be-detected sample data is normal data. If the to-be-detected sample data is normal data, update the normal data to generate a new normal data block, and use the new normal data block as training data for a next anomaly detection. If the to-be-detected sample data is abnormal data, label the abnormal data. The online anomaly detection model includes a modified matrix sketching model, a hash learning model, and an anomaly score calculation model.

[14] In an implementation of the online anomaly detection method for streaming data, the following steps are included: Obtain the normal data block that is transmitted at a high speed. Process the normal data block by using the modified matrix sketching model to obtain a sketch matrix. Import the sketch matrix to the hash learning model and optimize the sketch matrix by using a hash objective function, to obtain an optimal model parameter and a feature hash table $H_t$ for the current moment. Obtain to-be-detected sample data of a next moment and import the to-be-detected sample data and the feature hash table $H_t$ to the anomaly score calculation model, to calculate an anomaly score for the to-be-detected sample data. Specify an anomaly score threshold (whose default value is 0.5) and compare the anomaly score of the to-be-detected sample data with the anomaly score threshold. If the calculated anomaly score is greater than the specified anomaly score threshold, the to-be-detected sample data is abnormal data. If the calculated anomaly score is less than or equal to the specified anomaly score threshold, the to-be-detected sample data is normal data.

[15] In a preferred embodiment, the to-be-detected sample data is imported to the trained online anomaly detection model for detection, as shown in FIG. 2. This process includes the following steps:

[16] S1: Import the data in the normal data block to the modified matrix sketching model to obtain a sketch matrix.

[17] S2: Import the sketch matrix to the hash learning model, optimize the sketch matrix by using a hash objective function to obtain an optimal model parameter $W_t^*$, and then obtain a hash projection matrix based on the optimal model parameter.

[18] S3: Map the sketch matrix by using the hash projection matrix to obtain a feature hash table $H_t$.

[19] S4: Obtain the to-be-detected sample data.

[20] S5: Import the to-be-detected sample data to the anomaly score calculation model to identify whether the to-be-detected sample data is abnormal data.

[21] A process of processing the data in the normal data block by using the modified matrix sketching model includes the following steps:

[22] S11: Construct a data matrix $Z$ based on the data in the normal data block and select a precision parameter $\varepsilon$. The data matrix $Z \in \mathbb{R}^{d \times n}$, where $\mathbb{R}^{d \times n}$ represents a real number space of $d \times n$.

[23] Optionally, a value range of the selected precision parameter $\varepsilon$ is $(0, 1]$.

[24] S12: Specify a number of iterations based on the data matrix Z.

[25] The data matrix $Z$ belongs to a real number space of $d \times n$. Therefore, the specified number of iterations equals the number of columns in the data matrix $Z$. In other words, the specified number of iterations is $n$.

[26] S13: Initialize a zero matrix $B$ of $d \times \ell$ based on the precision parameter $\varepsilon$, where $B = [b_1, b_2, \ldots, b_i, \ldots, b_\ell]$.

[27] The selected precision parameter is $\varepsilon$. Therefore, the number of columns in the initialized zero matrix can be obtained by rounding up the reciprocal of the precision parameter, that is, $\ell \leftarrow \lceil 1/\varepsilon \rceil$, where $\lceil \cdot \rceil$ represents a round up operation.

[28] S14: Replace the last column in the zero matrix $B$ with an $i$th column in the data matrix $Z$ to obtain a new matrix $T$, where $T \leftarrow [b_1, \ldots, b_{\ell-1}, z_i]$ and $i \in \{1, 2, \ldots, n\}$.

[29] S15: Perform singular value decomposition (SVD) on the new matrix $T$ to obtain the singular values, left singular matrix $U$, and diagonal matrix $\Sigma$ of the matrix $T$. A formula for performing SVD on the new matrix $T$ is as follows: $[U, \Sigma, V] \leftarrow \mathrm{SVD}(T)$

[30] $\Sigma = \mathrm{diag}([\sigma_1, \ldots, \sigma_\ell])$, $\sigma_1 \ge \cdots \ge \sigma_\ell$

[31] $U$, $\Sigma$, and $V$ represent the left singular matrix, the diagonal matrix, and the right singular matrix of the matrix $T$, respectively. $\mathrm{diag}(\cdot)$ represents a diagonal matrix whose diagonal elements are $\sigma_1, \ldots, \sigma_\ell$. $\sigma_l$ represents the $l$th singular value of the matrix $T$.

[32] S16: Select a minimum singular value $\delta$ of the matrix $T$, and scan and update the diagonal matrix of the matrix $T$ based on the minimum singular value.

[33] A formula for selecting the minimum singular value is as follows:

[34] $\delta \leftarrow \sigma_\ell$

[35] A formula for scanning and updating the diagonal matrix of the matrix T based on the minimum singular value is as follows:

[36] $\hat{\Sigma} \leftarrow \max(\Sigma - \delta I_\ell, 0)$

[37] $I_\ell$ represents an identity matrix of $\ell \times \ell$ and $\delta$ represents the minimum singular value.

[38] S17: Construct and update the sketch matrix B based on the updated diagonal matrix and the left singular matrix U, and add one to a value of i. A formula for updating the sketch matrix is as follows:

[39] $B \leftarrow U\hat{\Sigma}$

[40] S18: Compare the value of i with the number of iterations, and export the current sketch matrix B if the value of i is greater than the number of iterations or return to S14 if the value of i is less than or equal to the number of iterations.
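Steps S11 to S18 can be sketched in Python with NumPy roughly as follows. This is a minimal reading of the text and not a definitive implementation: the function name `sketch_matrix` and the argument name `eps` (the precision parameter) are illustrative, the shrinkage subtracts the minimum singular value directly as the step description states, and the zero-padding for the case d < ell is an added assumption. The classical frequent-directions algorithm shrinks the squared singular values instead, so this variant may behave differently.

```python
import math

import numpy as np


def sketch_matrix(Z: np.ndarray, eps: float) -> np.ndarray:
    """Sketch a d x n data matrix Z into a d x ell matrix B (steps S11-S18)."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    d, n = Z.shape                      # S12: one iteration per column of Z
    ell = math.ceil(1.0 / eps)          # S13: number of sketch columns
    B = np.zeros((d, ell))
    for i in range(n):
        # S14: overwrite the last column of B with the i-th column of Z.
        T = np.hstack([B[:, : ell - 1], Z[:, [i]]])
        # S15: thin SVD; s holds min(d, ell) singular values, largest first.
        U, s, _ = np.linalg.svd(T, full_matrices=False)
        # S16: shrink by the minimum singular value, clipping at zero.
        delta = s[-1]
        s_hat = np.maximum(s - delta, 0.0)
        # S17: B <- U diag(s_hat), zero-padded to d x ell when d < ell.
        B = np.zeros((d, ell))
        B[:, : s.size] = U * s_hat
    return B                            # S18: all n columns processed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(8, 200))
    B = sketch_matrix(Z, eps=0.25)
    print(B.shape)  # (8, 4)
```

Because the shrinkage zeroes the smallest direction on every pass, the last column of B is empty again at the start of each iteration, which is why S14 can overwrite it without discarding information from the sketch.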

[41] In an implementation of processing the sketch matrix by using the hash learning model, the following steps are included: Process data in each column of the sketch matrix by using a hash projection method, to obtain a hash projection vector for the data in each column. Obtain the optimal model parameter and the projection matrix from the hash projection vectors and the sketch matrix by maximizing an objective function. The optimal model parameter is the projection matrix that maximizes the hash objective function.

[42] The hash learning model is constructed by using the following linear hash projection method:

[43] $h_k = \mathrm{sgn}(w_k^T b_i)$

[44] $h_k$ represents a $k$th hash function in a hash function group $H_t = [h_1, h_2, \ldots, h_k, \ldots, h_r]$, $w_k \in \mathbb{R}^d$ represents a $k$th projection vector in the hash projection matrix $W_t = [w_1, w_2, \ldots, w_k, \ldots, w_r] \in \mathbb{R}^{d \times r}$, $\mathrm{sgn}(\cdot)$ represents a sign function, $B_t = [b_1, b_2, \ldots, b_i, \ldots, b_\ell] \in \mathbb{R}^{d \times \ell}$ represents a sketch matrix of a data block $D_t$, and $b_i$ represents an $i$th column in the sketch matrix.

[45] The feature hash table is calculated by using the linear hash projection method based on the following formula:

[46] $H_t = \mathrm{sgn}(W_t^T B_t)$

[47] $W_t$ represents the hash projection matrix, $T$ represents transposition, and $B_t$ represents the sketch matrix of the data block $D_t$.

[48] Optimization of the hash objective function is to maximize the objective function and obtain the optimal model parameter $W_t^*$. A formula for maximizing the objective function is as follows:

[49] $W_t^* \leftarrow \arg\max_{W_t \in \mathbb{R}^{d \times r}} \mathrm{tr}(W_t^T B_t B_t^T W_t)$ s.t. $W_t^T W_t = I_r$

[50] $\mathbb{R}^{d \times r}$ represents a real number space of $d \times r$, $B_t$ represents the sketch matrix, $W_t$ represents the projection matrix, $T$ represents transposition, $\mathrm{tr}(\cdot)$ represents a matrix trace, and $I_r$ represents an identity matrix of $r \times r$.
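One way to solve this trace maximization is sketched below, under stated assumptions. For an orthonormality-constrained trace maximization of this form, a maximizer is given by the eigenvectors of the sketch Gram matrix (sketch matrix times its transpose) for the r largest eigenvalues, which are the first r left singular vectors of the sketch matrix. The disclosure does not say which solver it uses, so the SVD route, the function names, and the convention of mapping a zero projection to +1 are all illustrative choices.

```python
import numpy as np


def learn_hash_projection(B: np.ndarray, r: int) -> np.ndarray:
    """Return a d x r matrix W with orthonormal columns maximizing tr(W^T B B^T W).

    Assumption: the maximizer is taken as the top-r left singular vectors of B.
    """
    d, ell = B.shape
    if not 1 <= r <= min(d, ell):
        raise ValueError("r must satisfy 1 <= r <= min(d, ell)")
    U, _, _ = np.linalg.svd(B, full_matrices=False)  # singular values descend
    return U[:, :r]


def feature_hash_table(W: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute sgn(W^T B) as an r x ell table of +1/-1 codes.

    Assumption: sgn(0) is mapped to +1 so that every entry is a valid bit.
    """
    return np.where(W.T @ B >= 0, 1, -1).astype(np.int8)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.normal(size=(6, 4))   # stand-in for a sketch matrix
    W = learn_hash_projection(B, r=3)
    H = feature_hash_table(W, B)
    print(W.shape, H.shape)       # (6, 3) (3, 4)
```

Under the +1 convention, an all-zero column of the sketch matrix receives the all-ones code. An implementation may prefer to drop such columns before building the table; the text does not address this case.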

[51] In an implementation of processing the to-be-detected sample data by using the anomaly score calculation model, the following steps are included:

[52] Step 1: Import a processed to-be-detected sample data matrix $X_{t+1}$, a hash table $H_t$ of normal sample features, and the hash projection matrix $W_t$ to the anomaly score calculation model, where $X_{t+1} \in \mathbb{R}^{d \times n}$, $H_t \in \mathbb{R}^{r \times \ell}$, $W_t \in \mathbb{R}^{d \times r}$, and $r < d$.

[53] Step 2: Specify a threshold $\xi$.

[54] Step 3: Perform binary hash encoding on data $x_i$ of each column in the to-be-detected sample data matrix based on the hash projection matrix to obtain a binary hash code $h_i$, where $i \in \{1, 2, \ldots, n\}$.

[55] Step 4: Find the $K$ hash codes $h_j^K$ that are closest to the binary hash code $h_i$ in the hash table of normal sample features.

[56] Step 5: Calculate an average Hamming distance $a_i$ between the binary hash code $h_i$ and the $K$ closest hash codes $h_j^K$.

[57] Step 6: Compare the mean value $a_i$ with the specified threshold $\xi$, and determine that the data of the column is normal data if $a_i \le \xi$ or the data of the column is abnormal data if $a_i > \xi$.

[58] Step 7: Determine whether all of the to-be-detected sample data has been detected. If so, collectively label all abnormal data and export normal data. If not, return to Step 3.

[59] The anomaly score calculation model is constructed based on the average Hamming distance between the binary hash code $h_i$ of the to-be-detected sample data and the $K$ hash codes $h_j^K$ in the feature hash table that are closest to the binary hash code.

[60] The binary hash code of the to-be-detected sample data can be expressed as follows:

[61] $h_i = \mathrm{sgn}(W_t^T x_i)$

[62] $h_i$ is the binary hash code of $x_i$ in a Hamming space.

[63] A formula for calculating the average Hamming distance is as follows:

[64] $a_i = \frac{1}{K} \sum_{j=1}^{K} \mathrm{HamDist}(h_i, h_j^K)$

[65] $a_i$ represents the anomaly score of the to-be-detected sample data, $K$ represents a number of closest hash codes that are specified by a user, and $\mathrm{HamDist}(h_i, h_j^K)$ represents a Hamming distance between $h_i$ and $h_j^K$. $K$ is usually set to 10. A threshold is specified to determine whether the data is abnormal data by using the following formulas:

[66] $x_i \in Y_{t+1}$ if $a_i \le \xi$; $x_i \in \bar{Y}_{t+1}$ if $a_i > \xi$

[67] $\xi$ represents the specified threshold, and $Y_{t+1}$ and $\bar{Y}_{t+1}$ denote the normal data and the abnormal data, respectively.
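The scoring and threshold steps can be sketched as follows, as an illustration only. The function names are hypothetical, and codes are stored as +1/-1 entries so that the Hamming distance between two r-bit codes equals (r minus their inner product) divided by 2. The text does not state whether the score is normalized; because the default threshold is given as 0.5, this sketch assumes by default that the average distance is divided by the code length r so that scores fall in [0, 1].

```python
import numpy as np


def binary_codes(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Encode each column of X as a +1/-1 code sgn(W^T x), with sgn(0) := +1."""
    return np.where(W.T @ X >= 0, 1, -1).astype(int)


def anomaly_scores(W, H, X, K=10, normalize=True):
    """Average Hamming distance from each code to its K nearest codes in H.

    H is the r x m feature hash table of normal samples (+1/-1 entries).
    normalize=True divides by r (an assumption, see the lead-in).
    """
    codes = binary_codes(W, X)                   # r x n
    r = codes.shape[0]
    # Hamming distance between +1/-1 codes: (r - inner product) / 2.
    dist = (r - codes.T @ H.astype(int)) // 2    # n x m
    k = min(K, H.shape[1])                       # table may hold fewer than K codes
    nearest = np.sort(dist, axis=1)[:, :k]       # K smallest distances per sample
    scores = nearest.mean(axis=1)
    return scores / r if normalize else scores


def detect(W, H, X, xi=0.5, K=10):
    """Return a boolean array: True where the score exceeds the threshold xi."""
    return anomaly_scores(W, H, X, K=K) > xi


if __name__ == "__main__":
    W = np.eye(3)                                # toy projection, d = r = 3
    H = np.ones((3, 4), dtype=np.int8)           # four identical normal codes
    X = np.array([[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]])
    print(anomaly_scores(W, H, X, K=2))          # [0. 1.]
    print(detect(W, H, X, xi=0.5, K=2))          # [False  True]
```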

[68] The online anomaly detection model is updated in real time based on accumulation of sample data. If the sample data accumulates to a specified data size, repeat Steps 1 and 2 and update the model parameter $W_t$, sketch matrix $B_t$, and feature hash table $H_t$ online.
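The online update can be sketched as a small class that buffers samples judged normal and refits once the buffer reaches a specified size. Everything here is a hedged reading rather than the disclosed implementation: the class name `OnlineDetector`, the parameter `block_size`, the normalized score, and the choice to form the new normal data block by concatenating the current sketch matrix with the newly accumulated data matrix (one interpretation of the model update description) are assumptions. The code also assumes r does not exceed min(d, ell).

```python
import math

import numpy as np


def sketch(Z, eps):
    """Compact form of steps S11-S18 (linear shrinkage, as read from the text)."""
    d, n = Z.shape
    ell = math.ceil(1.0 / eps)
    B = np.zeros((d, ell))
    for i in range(n):
        T = np.hstack([B[:, : ell - 1], Z[:, [i]]])
        U, s, _ = np.linalg.svd(T, full_matrices=False)
        B = np.zeros((d, ell))
        B[:, : s.size] = U * np.maximum(s - s[-1], 0.0)
    return B


def codes(W, X):
    """+1/-1 hash codes sgn(W^T X), mapping zero to +1 (assumed convention)."""
    return np.where(W.T @ X >= 0, 1, -1)


class OnlineDetector:
    """Hypothetical wrapper: score each sample, buffer normals, refit when full."""

    def __init__(self, D0, eps=0.25, r=2, K=10, xi=0.5, block_size=50):
        self.eps, self.r, self.K, self.xi = eps, r, K, xi
        self.block_size = block_size
        self.buffer = []
        self.refits = 0
        self._fit(D0)

    def _fit(self, D):
        self.B = sketch(D, self.eps)                       # sketch matrix
        U, _, _ = np.linalg.svd(self.B, full_matrices=False)
        self.W = U[:, : self.r]                            # hash projection matrix
        self.H = codes(self.W, self.B)                     # feature hash table

    def score(self, x):
        h = codes(self.W, x.reshape(-1, 1))[:, 0]
        dist = (self.r - h @ self.H) // 2                  # Hamming distances
        k = min(self.K, dist.size)
        return np.sort(dist)[:k].mean() / self.r           # normalized to [0, 1]

    def process(self, x):
        """Return True if x is labeled abnormal, False otherwise."""
        if self.score(x) > self.xi:
            return True
        self.buffer.append(x)
        if len(self.buffer) >= self.block_size:
            Z_new = np.column_stack(self.buffer)
            # Assumed packaging: current sketch plus new normal data.
            self._fit(np.hstack([self.B, Z_new]))
            self.buffer = []
            self.refits += 1
        return False
```

Abnormal samples are never added to the buffer, so only data judged normal influences the next sketch, projection matrix, and hash table.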

[69] An online anomaly detection system for streaming data includes a data collection module, a matrix sketching module, a hash learning module, an anomaly identification module, an identification result output module, and a model update module.

[70] The data collection module is configured to collect data and import the collected data to the matrix sketching module.

[71] The matrix sketching module is configured to perform matrix sketching on a large amount of high-dimensional high-speed streaming data, to generate a sketch matrix.

[72] The hash learning module is configured to map data in the sketch matrix to a Hamming space to generate a hash projection matrix and a feature hash table.

[73] The anomaly identification module is configured to: calculate an anomaly score for to-be-detected data based on the hash projection matrix and the feature hash table, and compare the calculated anomaly score with a specified anomaly threshold to obtain a detection result for the to-be-detected data.

[74] The identification result output module is configured to export the detection result.

[75] The model update module is configured to update data attributes and distribution characteristics of a model.

[76] The data collection module includes devices such as a sensor and data collector. These devices can be used to collect network logs, data of industrial sensors, and data in other fields.

[77] A process of processing data in a normal data block by using the matrix sketching module includes the following steps: Construct a data matrix $Z$ based on the data in the normal data block and select a precision parameter $\varepsilon$. Specify a number of iterations based on the data matrix $Z$. Initialize a zero matrix $B$ of $d \times \ell$ based on the precision parameter $\varepsilon$. Replace the last column in the zero matrix $B$ with an $i$th column in the data matrix $Z$ to obtain a new matrix $T$. Perform SVD on the new matrix $T$ to obtain a singular value, left singular matrix $U$, and diagonal matrix $\Sigma$ of the matrix $T$. Select a minimum singular value $\delta$ of the matrix $T$, and scan and update the diagonal matrix of the matrix $T$ based on the minimum singular value. Construct and update the sketch matrix $B$ based on the updated diagonal matrix and the left singular matrix $U$, and add one to a value of $i$. Compare the value of $i$ with the number of iterations, and export the current sketch matrix $B$ if the value of $i$ is greater than the number of iterations or reselect data from the data matrix $Z$ for matrix sketching if the value of $i$ is less than or equal to the number of iterations.

[78] A process of processing data by using the hash learning module includes the following steps: Process data in each column of the sketch matrix by using a hash projection method, to obtain a hash projection vector for the data in each column. Obtain an optimal model parameter and the projection matrix from the hash projection vectors and the sketch matrix by maximizing an objective function. The optimal model parameter is the projection matrix that maximizes the hash objective function.

[79] A process of processing data by using the anomaly identification module includes the following steps: Import a processed to-be-detected sample data matrix, a hash table of normal sample features, and the hash projection matrix to an anomaly score calculation model. Specify a threshold $\xi$. Perform binary hash encoding on data $x_i$ of each column in the to-be-detected sample data matrix based on the hash projection matrix to obtain a binary hash code $h_i$. Find the $K$ hash codes $h_j^K$ that are closest to the binary hash code $h_i$ in the hash table of normal sample features. Calculate an average Hamming distance $a_i$ between the binary hash code $h_i$ and the $K$ closest hash codes $h_j^K$. Compare the mean value $a_i$ with the specified threshold $\xi$, and determine that the data of the column is normal data if $a_i \le \xi$ or the data of the column is abnormal data if $a_i > \xi$. Determine whether all of the to-be-detected sample data has been detected. If so, collectively label all abnormal data and export normal data. If not, detect again.

[80] Then, the identification result output module updates and exports the detection result.

[81] A process of updating data by using the model update module includes the following steps: Convert the obtained normal data to a data matrix. Map the sketch matrix that is obtained by using the matrix sketching model to a binary Hamming space by using a linear hash projection method, to obtain an updated hash projection matrix. Package the data matrix and the sketch matrix to generate a new normal data block.

[82] The implementations in the system of the present disclosure are the same as those in the method of the present disclosure.

[83] The objectives, technical solutions, and beneficial effects of the present disclosure are further described in detail in the foregoing specific implementations. It should be understood that the foregoing descriptions are merely specific implementations of the present disclosure, but are not intended to limit the protection scope of the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall fall within the protection scope of the present disclosure.

Claims

WHAT IS CLAIMED IS:

1. An online anomaly detection method for streaming data, comprising: obtaining a normal data block that is transmitted at a high speed and importing data in the normal data block to an online anomaly detection model for training; importing to-be-detected sample data to the trained online anomaly detection model and determining whether the to-be-detected sample data is normal data; and if the to-be-detected sample data is normal data, updating the normal data to generate a new normal data block, and using the new normal data block as training data for a next anomaly detection; or if the to-be-detected sample data is abnormal data, labeling the abnormal data, wherein the online anomaly detection model comprises a modified matrix sketching model, a hash learning model, and an anomaly score calculation model.

2. The online anomaly detection method for streaming data according to claim 1, wherein a process of importing the to-be-detected sample data to the trained online anomaly detection model for detection comprises: S1: importing the data in the normal data block to the modified matrix sketching model to obtain a sketch matrix; S2: importing the sketch matrix to the hash learning model, optimizing the sketch matrix by using a hash objective function to obtain an optimal model parameter $W_t^*$, and obtaining a hash projection matrix based on the optimal model parameter; S3: mapping the sketch matrix by using the hash projection matrix to obtain a feature hash table $H_t$; S4: obtaining the to-be-detected sample data; and S5: importing the to-be-detected sample data to the anomaly score calculation model to determine whether the to-be-detected sample data is abnormal data.

3. The online anomaly detection method for streaming data according to claim 2, wherein a process of processing the data in the normal data block by using the modified matrix sketching model comprises: S11: constructing a data matrix $Z$ based on the data in the normal data block and selecting a precision parameter $\varepsilon$, wherein the data matrix $Z \in \mathbb{R}^{d \times n}$ and $\mathbb{R}^{d \times n}$ represents a real number space of $d \times n$; S12: specifying a number of iterations based on the data matrix $Z$; S13: initializing a zero matrix $B$ of $d \times \ell$ based on the precision parameter $\varepsilon$, wherein $B = [b_1, b_2, \ldots, b_i, \ldots, b_\ell]$; S14: replacing the last column in the zero matrix $B$ with an $i$th column in the data matrix $Z$ to obtain a new matrix $T$, wherein $i \in \{1, 2, \ldots, n\}$; S15: performing singular value decomposition (SVD) on the new matrix $T$ to obtain a singular value, left singular matrix $U$, and diagonal matrix $\Sigma$ of the matrix $T$; S16: selecting a minimum singular value $\delta$ of the matrix $T$, and scanning and updating the diagonal matrix of the matrix $T$ based on the minimum singular value; S17: constructing and updating the sketch matrix $B$ based on the updated diagonal matrix and the left singular matrix $U$, and adding one to a value of $i$; and S18: comparing the value of $i$ with the number of iterations, and exporting the current sketch matrix $B$ if the value of $i$ is greater than the number of iterations or returning to S14 if the value of $i$ is less than or equal to the number of iterations; wherein a process of processing the sketch matrix by using the hash learning model comprises: processing data in each column of the sketch matrix by using a hash projection method, to obtain a hash projection vector for the data in each column; and obtaining the optimal model parameter $W_t^*$ based on the hash projection vector and the sketch matrix and the projection matrix based on a maximum objective function, wherein the optimal model parameter is the maximum objective function that is obtained after the hash objective function is optimized; wherein a formula for the optimal model parameter is as follows: $W_t^* \leftarrow \arg\max_{W_t \in \mathbb{R}^{d \times r}} \mathrm{tr}(W_t^T B_t B_t^T W_t)$ s.t. $W_t^T W_t = I_r$, wherein $\mathbb{R}^{d \times r}$ represents a real number space of $d \times r$, $B_t$ represents the sketch matrix, $W_t$ represents the projection matrix, $T$ represents transposition, $\mathrm{tr}(\cdot)$ represents a matrix trace, and $I_r$ represents an identity matrix of $r \times r$; wherein a formula for obtaining the feature hash table based on the hash projection matrix is as follows: $H_t = \mathrm{sgn}(W_t^T B_t)$, wherein $\mathrm{sgn}(\cdot)$ represents a sign function, $W_t$ represents the hash projection matrix, $T$ represents transposition, and $B_t$ represents the sketch matrix; wherein a process of processing the to-be-detected sample data by using the anomaly score calculation model comprises: Step 1: importing a processed to-be-detected sample data matrix, a hash table of normal sample features, and the hash projection matrix to the anomaly score calculation model; Step 2: specifying a threshold $\xi$; Step 3: performing binary hash encoding on data $x_i$ of each column in the to-be-detected sample data matrix based on the hash projection matrix to obtain a binary hash code $h_i$, wherein $i \in \{1, 2, \ldots, n\}$; Step 4: seeking for $K$ hash codes $h_j^K$ that are closest to the binary hash code $h_i$ in the hash table of normal sample features; Step 5: calculating an average Hamming distance $a_i$ between the binary hash code $h_i$ and the $K$ closest hash codes $h_j^K$; Step 6: comparing the mean value $a_i$ with the specified threshold $\xi$, and determining that the data of the column is normal data if $a_i \le \xi$ or the data of the column is abnormal data if $a_i > \xi$; and Step 7: determining whether the to-be-detected sample data is detected; and if the to-be-detected sample data is detected, collectively labeling all abnormal data and exporting normal data, or if the to-be-detected sample data is not detected, returning to Step 3; wherein a formula for calculating the average Hamming distance between the binary hash code $h_i$ and the $K$ closest hash codes $h_j^K$ is as follows: $a_i = \frac{1}{K} \sum_{j=1}^{K} \mathrm{HamDist}(h_i, h_j^K)$, wherein $K$ represents a number of closest hash codes that are specified by a user, and $\mathrm{HamDist}(h_i, h_j^K)$ represents a Hamming distance between $h_i$ and $h_j^K$.

4. The online anomaly detection method for streaming data according to claim 1, wherein a process of updating the normal data comprises: converting the obtained normal data to a data matrix; mapping a sketch matrix that is obtained by using the matrix sketching model to a binary Hamming space by using a linear hash projection method, to obtain an updated hash projection matrix; and packaging the data matrix and the sketch matrix to generate a new normal data block.

5. An online anomaly detection system for streaming data, wherein the system comprises a data collection module, a matrix sketching module, a hash learning module, an anomaly identification module, an identification result output module, and a model update module, wherein the data collection module is configured to collect data and import the collected data to the matrix sketching module; the matrix sketching module is configured to perform matrix sketching on a large amount of high-dimensional high-speed streaming data, to generate a sketch matrix; the hash learning module is configured to map data in the sketch matrix to a Hamming space to generate a hash projection matrix and a feature hash table; the anomaly identification module is configured to: calculate an anomaly score for to-be-detected data based on the hash projection matrix and the feature hash table, and compare the calculated anomaly score with a specified anomaly threshold to obtain a detection result for the to-be-detected data; the identification result output module is configured to export the detection result; and the model update module is configured to update data attributes and distribution characteristics of a model.

DRAWINGS

FIG. 1

FIG. 2