CN110837874B - Business data anomaly detection method based on time sequence classification

Info

Publication number: CN110837874B
Application number: CN201911127919.6A
Authority: CN
Inventors: 程永新; 宋辉
Original assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Current assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2023-05-26
Anticipated expiration: 2039-11-18
Also published as: CN110837874A

Abstract

The invention discloses a business data anomaly detection method based on time series classification, comprising the following steps: S1: extracting offline business data, classifying it by time series, and generating a sample library containing time series of different types; S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms; S3: acquiring online business data and classifying it according to the time series classes in the sample library; S4: performing anomaly detection on the classified online time series according to the association between time series classes and anomaly detection algorithms. The invention automatically classifies and identifies time series of different types and automatically selects the parameters or algorithm used to detect anomalies in each type. When anomaly detection is performed on time series at large scale, the time series type is identified automatically, which reduces false alarms and missed alarms and effectively saves labor cost.

Description

Business data anomaly detection method based on time sequence classification

Technical Field

The present invention relates to an anomaly detection method, and more particularly, to a method for detecting anomalies in service data based on time series classification.

Background

Anomaly detection on time series metrics is a core step in discovering problems. Traditionally, static threshold detection is used: if the threshold is too high, too many alarms are missed and hidden quality risks are hard to find; if the threshold is too low, too many alarms are raised, causing an alarm storm that interferes with the judgment of business operation and maintenance personnel. An anomaly detection algorithm can be selected manually for each type of time series when the number of time series is small, but manual processing is severely limited when anomaly detection must be performed on time series at large scale. Therefore, there is a need for a method that classifies large-scale time series and performs anomaly detection with different parameters or algorithms according to the classification.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a business data anomaly detection method based on time series classification that automatically classifies and identifies time series of different types and automatically selects the parameters or algorithm used for anomaly detection on each type.

To solve this technical problem, the invention provides a business data anomaly detection method based on time series classification, comprising the following steps: S1: extracting offline business data, classifying it by time series, and generating a sample library containing time series of different types; S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms; S3: acquiring online business data and classifying it according to the time series classes of the sample library from step S1; S4: performing anomaly detection on the classified online time series according to the association between time series classes and anomaly detection algorithms established in step S2.

Further, the time series classification in step S1 includes clustering according to the similarity of the time series, which specifically comprises the following steps: S11: defining a distance between time series; S12: calculating a distance matrix between the time series according to the distance defined in step S11; S13: dividing the time series into several classes according to the result of step S12, a given maximum distance between any two time series, and a given minimum number of samples in each class.

Further, the time series used for similarity clustering have the same time stamps, the same time interval, and the same length; the distance between time series is defined on the basis of Euclidean distance with a DTW time alignment strategy, and is calculated by obtaining a bound on the DTW distance with the LB_Keogh lower bound method; the time series are then classified by a density clustering algorithm.

Further, the time series classification in step S1 also includes hierarchical clustering according to global features of the time series, where the classification features used for hierarchical clustering include trend, seasonality, periodicity, sequence correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, decomposed sequence correlation, decomposed nonlinearity, decomposed skewness, and decomposed kurtosis.

Further, when hierarchical clustering is performed on the global features of the time series, the time stamps, time intervals, and lengths of the time series may differ from one series to another.

Further, step S2 specifically includes selecting the main categories of time series as the time series types of the sample library, binding the corresponding anomaly detection algorithm and parameters to each type, and using these types as the basis for online classification.

Further, the anomaly detection algorithm comprises a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm and a 3-sigma algorithm, wherein the prediction-based ARIMA algorithm and the weighted moving average algorithm are anomaly detection algorithms for stable periodic time sequences; the wavelet decomposition algorithm and the 3-sigma algorithm are anomaly detection algorithms for unstable time series.

Further, the step S4 specifically includes obtaining an abnormality detection algorithm associated with the time sequences of the same type in the sample library according to the classified online time sequence types, and performing abnormality detection on the classified online time sequences by the algorithm.

Compared with the prior art, the invention has the following beneficial effects: the business data anomaly detection method based on time series classification automatically classifies and identifies time series of different types and automatically selects the parameters or algorithm used for anomaly detection on each type. The time series type is therefore identified automatically when anomaly detection is performed at large scale, little human involvement is needed, false alarms and missed alarms are reduced, and labor cost is effectively saved.

Drawings

FIG. 1 is a flowchart of the business data anomaly detection method based on time series classification in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the business data anomaly detection method based on time series classification in an embodiment of the present invention;

FIG. 3 is a diagram of the time series similarity clustering effect in an embodiment of the present invention;

FIG. 4 is a diagram of the time series hierarchical clustering effect in an embodiment of the present invention;

FIG. 5 is a diagram of the effect of the business data anomaly detection method based on time series classification in an embodiment of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and examples.

Referring to fig. 1 and 2, the method for detecting abnormal business data based on time sequence classification provided by the invention comprises the following steps:

S1: extracting offline business data, classifying it by time series, and generating a sample library containing time series of different types;

S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms;

S3: acquiring online business data and classifying it according to the time series classes of the sample library from step S1;

S4: performing anomaly detection on the classified online time series according to the association between time series classes and anomaly detection algorithms established in step S2.
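Steps S1 to S4 can be read as a small dispatch pipeline. The following is a minimal sketch under stated assumptions, not the implementation described by the invention: the names `build_sample_library`, `detect_online`, `toy_classify`, `flag_nothing`, `flag_maximum`, and `DETECTORS` are all hypothetical, and the toy classifier and detectors merely stand in for the clustering and detection algorithms discussed later in the description.

```python
from typing import Callable, Dict, List, Sequence

Series = Sequence[float]
Detector = Callable[[Series], List[int]]  # returns indices flagged as anomalous


def build_sample_library(offline: Dict[str, Series],
                         classify: Callable[[Series], str]) -> Dict[str, List[str]]:
    """S1: group offline series names by the class label assigned to each."""
    library: Dict[str, List[str]] = {}
    for name, series in offline.items():
        library.setdefault(classify(series), []).append(name)
    return library


def detect_online(series: Series,
                  classify: Callable[[Series], str],
                  detectors: Dict[str, Detector]) -> List[int]:
    """S3 + S4: classify the online series, then run the detector bound to its class."""
    label = classify(series)
    return detectors[label](series)


# Hypothetical stand-ins for the real classifier and detection algorithms.
def toy_classify(series: Series) -> str:
    return "flat" if max(series) - min(series) < 1.0 else "volatile"


def flag_nothing(series: Series) -> List[int]:
    return []


def flag_maximum(series: Series) -> List[int]:
    return [max(range(len(series)), key=lambda i: series[i])]


# S2: bind one detector to each class found in the sample library.
DETECTORS: Dict[str, Detector] = {"flat": flag_nothing, "volatile": flag_maximum}
```

Keeping the class-to-detector binding in a plain dictionary means a new time series type can be supported by adding one entry, without touching the online path.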

Specifically, in the business data anomaly detection method based on time series classification provided by the invention, the time series classification includes clustering according to the similarity of the time series, which comprises the following steps:

S11: defining a distance between time series;

S12: calculating a distance matrix between the time series according to the distance defined in step S11;

S13: dividing the time series into several classes according to the result of step S12, a given maximum distance between any two time series, and a given minimum number of samples in each class.

The distance between two time series can be defined on the basis of Euclidean distance or on the basis of DTW (Dynamic Time Warping) time alignment. Euclidean distance is a commonly used distance definition: it is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin); in two- and three-dimensional space it is the actual distance between two points. The DTW algorithm is based on dynamic programming. It was developed to solve template matching between utterances of different lengths, and is an early, classic algorithm in speech recognition, where it was used to recognize isolated words. The DTW distance is used here. The clustering algorithm used is density clustering (DBSCAN), a density-based algorithm that assumes a class can be determined by how tightly its samples are distributed. Samples of the same class are closely connected; that is, not far from any sample of a class there must be other samples of the same class. Grouping closely connected samples into one class yields one cluster, and grouping every set of closely connected samples into its own class yields the final clustering result. Given a maximum distance between samples and a minimum number of samples per class, the algorithm divides the full sample set into several classes without supervision.
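The clustering steps S11 to S13 can be illustrated with a short standard-library sketch. It is an illustration under assumptions rather than the implementation used in the embodiment: the function names `dtw_distance`, `distance_matrix`, and `dbscan` are hypothetical, the DTW shown is the plain unconstrained dynamic program with squared point differences, and the parameters `eps` (maximum distance) and `min_samples` (minimum samples per class, counting the point itself) follow common DBSCAN conventions.

```python
import math


def dtw_distance(a, b):
    """S11: DTW distance via dynamic programming, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def distance_matrix(series_list):
    """S12: symmetric matrix of pairwise DTW distances."""
    k = len(series_list)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series_list[i], series_list[j])
    return dist


def dbscan(dist, eps, min_samples):
    """S13: DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not yet visited"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_samples:
            labels[p] = -1
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_samples:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels
```

A call such as `dbscan(distance_matrix(series_list), eps=1.0, min_samples=2)` returns one label per series; series that share a label belong to the same class, and series labeled -1 were not dense enough to join any class.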

The specific implementation is as follows: 16 time series were exported from the database, each containing 7000 data points, and a distance matrix was calculated using the DTW distance. DTW is a time alignment strategy for measuring the similarity and dissimilarity of time series, but its time complexity is O(n^2), so calculating distances for a large number of time series is computationally expensive. For these 16 series, calculating the distance matrix on a computer took about 1 hour. The distance calculation is therefore accelerated by using the LB_Keogh lower bound method to compute a bound on the DTW distance; this method has linear time complexity and greatly improves computational efficiency. The time series similarity clustering effect is shown in FIG. 3.
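To make the lower-bound idea concrete, here is a minimal sketch of LB_Keogh as it is commonly formulated, given as an illustration and not as the embodiment's code. It assumes equal-length series and a warping window of half-width `r` (a parameter the description does not specify); the bound holds for DTW restricted to that window, so a windowed DTW, `dtw_windowed`, is included only for comparison. Both function names are hypothetical.

```python
import math


def lb_keogh(query, candidate, r):
    """Lower bound on windowed DTW(query, candidate); one pass over the series."""
    if len(query) != len(candidate):
        raise ValueError("LB_Keogh assumes equal-length series")
    n = len(candidate)
    total = 0.0
    for i, q in enumerate(query):
        # Envelope of the candidate within the warping window around position i.
        window = candidate[max(0, i - r): min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if q > upper:
            total += (q - upper) ** 2
        elif q < lower:
            total += (q - lower) ** 2
    return math.sqrt(total)


def dtw_windowed(a, b, r):
    """DTW restricted to |i - j| <= r (Sakoe-Chiba band), for comparison only."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])
```

For a fixed window `r`, the bound costs time proportional to the series length, whereas the full dynamic program is quadratic. One common use is to compute the cheap bound first and run DTW only for pairs whose bound is below the clustering distance threshold.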

Specifically, in the business data anomaly detection method based on time series classification provided by the invention, the time series classification also includes hierarchical clustering according to global features of the time series. The classification features used for hierarchical clustering include trend, seasonality, periodicity, sequence correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, decomposed sequence correlation, decomposed nonlinearity, decomposed skewness, and decomposed kurtosis. The advantage of global features is that the time series do not need to share the same time stamps, the same time interval, or the same length. The time series hierarchical clustering effect is shown in FIG. 4.
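Feature-based hierarchical clustering can be sketched as follows. This is a simplified illustration under assumptions: only four of the listed features are computed (trend as a least-squares slope, sequence correlation as lag-1 autocorrelation, skewness, and excess kurtosis), the features are left unscaled, and single-linkage agglomerative clustering is chosen arbitrarily because the description does not name a linkage. The names `global_features` and `agglomerative` are hypothetical.

```python
import math
from statistics import mean


def global_features(series):
    """Return (trend slope, lag-1 autocorrelation, skewness, excess kurtosis)."""
    n = len(series)
    mu = mean(series)
    dev = [x - mu for x in series]
    var = sum(d * d for d in dev) / n
    if var == 0:
        return (0.0, 0.0, 0.0, 0.0)  # constant series: no structure to describe
    t_mu = (n - 1) / 2
    slope = (sum((t - t_mu) * d for t, d in enumerate(dev))
             / sum((t - t_mu) ** 2 for t in range(n)))
    acf1 = sum(dev[t] * dev[t + 1] for t in range(n - 1)) / (n * var)
    skew = sum(d ** 3 for d in dev) / (n * var ** 1.5)
    kurt = sum(d ** 4 for d in dev) / (n * var ** 2) - 3.0
    return (slope, acf1, skew, kurt)


def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Because each series is reduced to a fixed-length feature vector, series of different lengths can be clustered together, for example `agglomerative([global_features(s) for s in series_list], n_clusters=2)`. In practice the features would usually be standardized first so that no single feature dominates the distance.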

Specifically, in the business data anomaly detection method based on time series classification provided by the invention, the main categories of time series are selected as the time series types of the sample library, and the corresponding anomaly detection algorithm and parameters are bound to each type; these types serve as the basis for online classification. The anomaly detection algorithms include a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm, and a 3-sigma algorithm, wherein the prediction-based ARIMA algorithm and the weighted moving average algorithm are anomaly detection algorithms for stable periodic time series, and the wavelet decomposition algorithm and the 3-sigma algorithm are anomaly detection algorithms for unstable time series. Finally, according to the type of the classified online time series, the anomaly detection algorithm associated with time series of the same type in the sample library is obtained, and anomaly detection is performed on the classified online time series with that algorithm.
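Two of the named detectors are simple enough to sketch with the standard library. The code below is a hedged illustration, not the embodiment's implementation: the function names `three_sigma` and `wma_detect`, the linear weights, the window size, and the multiplier `k` are all assumptions, and the band around the weighted-moving-average prediction is taken as `k` standard deviations of the recent window. ARIMA and wavelet decomposition are omitted because they need external libraries.

```python
from statistics import mean, pstdev


def three_sigma(series, k=3.0):
    """Flag indices lying more than k standard deviations from the series mean."""
    mu, sd = mean(series), pstdev(series)
    return [i for i, x in enumerate(series) if abs(x - mu) > k * sd]


def wma_detect(series, window=5, k=3.0):
    """Predict each point from a linearly weighted average of the previous
    `window` points and flag it when it leaves the band prediction +/- k * std."""
    weights = list(range(1, window + 1))  # the most recent point weighs most
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        prediction = sum(w * x for w, x in zip(weights, past)) / sum(weights)
        band = k * pstdev(past)  # assumed band width; not taken from the source
        if abs(series[i] - prediction) > band:
            flagged.append(i)
    return flagged
```

The 3-sigma rule uses a single global mean and deviation, so it ignores ordering, while the moving-average detector compares each point with a local prediction. Note that a perfectly constant window gives a zero-width band here, so a real deployment would likely enforce a minimum band width.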

The effect of the service data anomaly detection method based on time sequence classification is shown in fig. 5, wherein a curve 1 is an original value curve, a curve 2 is a predicted value curve, a curve 3 is an upper limit value curve, and a curve 4 is a lower limit value curve.

In summary, the business data anomaly detection method based on time series classification provided by the invention automatically classifies and identifies time series of different types and automatically selects the parameters or algorithm used for anomaly detection on each type. The time series type is identified automatically when anomaly detection is performed at large scale, little human involvement is needed, false alarms and missed alarms are reduced, and labor cost is effectively saved.

While the invention has been described with reference to the preferred embodiments, it is not intended to limit the invention thereto, and it is to be understood that other modifications and improvements may be made by those skilled in the art without departing from the spirit and scope of the invention, which is therefore defined by the appended claims.

Claims

1. A business data anomaly detection method based on time sequence classification is characterized by comprising the following steps:

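S1: extracting offline business data, classifying it by time series, and generating a sample library containing time series of different types;

S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms;

S3: acquiring online business data and classifying it according to the time series classes of the sample library from step S1;

S4: performing anomaly detection on the classified online time series according to the association between time series classes and anomaly detection algorithms established in step S2;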

the time sequence classification method in the step S1 includes clustering according to the similarity of the time sequences, and specifically includes the following steps:

S11: defining a distance between time series;

S12: calculating a distance matrix between the time series according to the distance defined in step S11;

S13: dividing the time series into several classes according to the result of step S12, a given maximum distance between any two time series, and a given minimum number of samples in each class;

the time series used for similarity clustering have the same time stamps, the same time interval, and the same length; the distance between time series is defined on the basis of Euclidean distance with a DTW time alignment strategy, and is calculated by obtaining a bound on the DTW distance with the LB_Keogh lower bound method; and the time series are classified by a density clustering algorithm.

2. The method for detecting abnormal business data based on time series classification as claimed in claim 1, wherein the time series classification in step S1 further comprises hierarchical clustering according to global features of the time series, and the classification features of the time series hierarchical clustering include trend, seasonal, periodic, sequence correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, decomposed sequence correlation, decomposed nonlinearity, decomposed skewness, and decomposed kurtosis.

3. The business data anomaly detection method based on time series classification according to claim 2, wherein the time stamps, time intervals, and lengths of the time series differ when hierarchical clustering is performed on the global features of the time series.

4. The business data anomaly detection method based on time series classification according to claim 1, wherein step S2 specifically includes selecting the main categories of time series as the time series types of the sample library, binding the corresponding anomaly detection algorithm and parameters to each type, and using these types as the basis for online classification.

5. The business data anomaly detection method based on time series classification according to claim 1, wherein the anomaly detection algorithms comprise a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm, and a 3-sigma algorithm, the prediction-based ARIMA algorithm and the weighted moving average algorithm being anomaly detection algorithms for stable periodic time series, and the wavelet decomposition algorithm and the 3-sigma algorithm being anomaly detection algorithms for unstable time series.

6. The method for detecting abnormal business data based on time series classification according to claim 1, wherein the step S4 specifically comprises obtaining an abnormality detection algorithm associated with time series of the same type in a sample library according to the classified online time series type, and performing abnormality detection on the classified online time series by the algorithm.