CN110837874B - Business data anomaly detection method based on time sequence classification - Google Patents

Publication number
CN110837874B
Authority
CN
China
Prior art keywords
time
time series
time sequence
algorithm
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911127919.6A
Other languages
Chinese (zh)
Other versions
CN110837874A (en)
Inventor
程永新
宋辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai New Torch Network Information Technology Ltd By Share Ltd
Original Assignee
Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-18
Filing date
2019-11-18
Publication date
2023-05-26
Application filed by Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority to CN201911127919.6A
Publication of CN110837874A
Application granted
Publication of CN110837874B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Testing And Monitoring For Control Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a business data anomaly detection method based on time sequence classification, comprising the following steps: S1: extracting offline business data, classifying the offline business data by time series, and generating a sample library containing time series of different types; S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms; S3: acquiring online business data and classifying it according to the time series classification in the sample library; S4: performing anomaly detection on the classified online time series according to the association between the time series classes and the anomaly detection algorithms. The invention automatically classifies and identifies time series of different types and automatically selects parameters or algorithms for time series anomaly detection. When processing large-scale time series anomaly detection, it identifies the time series type automatically, reduces false alarms and missed alarms, and effectively saves labor cost.

Description

Business data anomaly detection method based on time sequence classification
Technical Field
The present invention relates to an anomaly detection method, and more particularly, to a method for detecting anomalies in service data based on time series classification.
Background
Anomaly detection on time series indicators is a core step in discovering problems. Traditionally, static threshold detection is used: if the threshold is too high, too many alarms are missed and hidden quality problems are hard to find; if the threshold is too low, too many alarms cause an alarm storm and interfere with the judgment of business operation and maintenance personnel. Alternatively, an anomaly detection algorithm can be selected manually according to the type of each time series. Manual selection is feasible when the number of time series is small, but it is severely limited when anomaly detection must be performed on large-scale time series. Therefore, there is a need for a method that classifies large-scale time series and performs anomaly detection using different parameters or algorithms according to the classification.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a business data anomaly detection method based on time sequence classification that automatically classifies and identifies time series of different types and automatically selects parameters or algorithms for anomaly detection on each type.
To solve this technical problem, the invention provides a business data anomaly detection method based on time sequence classification, comprising the following steps: S1: extracting offline business data, classifying the offline business data by time series, and generating a sample library containing time series of different types; S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms; S3: acquiring online business data and classifying it according to the time series classification in the sample library from step S1; S4: performing anomaly detection on the classified online time series according to the association between the time series classes and the anomaly detection algorithms established in step S2.
Further, the time series classification in step S1 includes clustering according to the similarity of the time series, which specifically comprises the following steps: S11: defining a distance between time series; S12: computing a distance matrix between the time series using the distance defined in step S11; S13: dividing the time series into several classes according to the distance matrix from step S12, a given maximum distance between time series, and a given minimum number of samples per class.
Further, the time series used for similarity clustering share the same timestamps, time interval, and length. The distance between time series is defined on the basis of Euclidean distance with a DTW time alignment strategy, and the LB Keogh lower-bound method is used to bound the DTW distance when the distance between time series is computed. The time series are then classified by a density clustering algorithm.
Further, the time series classification in step S1 also includes hierarchical clustering according to global features of the time series. The classification features used for hierarchical clustering include trend, seasonality, periodicity, serial correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, serial correlation after decomposition, nonlinearity after decomposition, skewness after decomposition, and kurtosis after decomposition.
Further, when hierarchical clustering is performed on the global features of the time series, the timestamps, time intervals, and lengths of the time series may differ.
Further, step S2 specifically includes selecting the main categories of time series as the time series types of the sample library, binding the corresponding anomaly detection algorithm and parameters to each type, and using the types as the basis for online classification.
Further, the anomaly detection algorithms include a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm, and a 3-sigma algorithm. The prediction-based ARIMA algorithm and the weighted moving average algorithm are anomaly detection algorithms for stable periodic time series; the wavelet decomposition algorithm and the 3-sigma algorithm are anomaly detection algorithms for unstable time series.
Further, step S4 specifically includes obtaining, according to the type of the classified online time series, the anomaly detection algorithm associated with time series of the same type in the sample library, and performing anomaly detection on the classified online time series with that algorithm.
Compared with the prior art, the invention has the following beneficial effects: the business data anomaly detection method based on time sequence classification automatically classifies and identifies time series of different types and automatically selects parameters or algorithms for anomaly detection on each type. When large-scale time series anomaly detection is processed, the time series type is identified automatically without excessive human involvement, false alarms and missed alarms are reduced, and labor cost is effectively saved.
Drawings
FIG. 1 is a flowchart of a business data anomaly detection method based on time sequence classification in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a business data anomaly detection method based on time sequence classification in an embodiment of the present invention;
FIG. 3 is a diagram of the time series similarity clustering effect in an embodiment of the present invention;
FIG. 4 is a diagram of the time series hierarchical clustering effect in an embodiment of the present invention;
FIG. 5 is a diagram of the effect of the business data anomaly detection method based on time sequence classification in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
Referring to FIG. 1 and FIG. 2, the business data anomaly detection method based on time sequence classification provided by the invention comprises the following steps:
S1: extracting offline business data, classifying the offline business data by time series, and generating a sample library containing time series of different types;
S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms;
S3: acquiring online business data and classifying it according to the time series classification in the sample library from step S1;
S4: performing anomaly detection on the classified online time series according to the association between the time series classes and the anomaly detection algorithms established in step S2.
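The four steps can be read as a classify-then-dispatch pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `build_sample_library`, `detect_online`, `toy_classify`, and `deviation_detector` are hypothetical, and the toy classifier and detectors merely stand in for the clustering and detection algorithms described later.

```python
from typing import Callable, Dict, List, Sequence

Series = Sequence[float]
Classifier = Callable[[Series], str]
Detector = Callable[[Series], List[int]]  # returns indices of anomalous points


def build_sample_library(offline: Dict[str, Series], classify: Classifier) -> Dict[str, List[str]]:
    """S1: group the names of offline series by class label."""
    library: Dict[str, List[str]] = {}
    for name, series in offline.items():
        library.setdefault(classify(series), []).append(name)
    return library


def detect_online(series: Series, classify: Classifier, bindings: Dict[str, Detector]) -> List[int]:
    """S3 and S4: classify an online series, then run the detector bound to its class."""
    return bindings[classify(series)](series)


# Hypothetical stand-ins, for illustration only.
def toy_classify(series: Series) -> str:
    return "volatile" if max(series) - min(series) > 10 else "flat"


def deviation_detector(limit: float) -> Detector:
    def detect(series: Series) -> List[int]:
        mean = sum(series) / len(series)
        return [i for i, v in enumerate(series) if abs(v - mean) > limit]
    return detect


# S2: bind each class in the sample library to a detector and its parameter.
bindings = {"flat": deviation_detector(1.0), "volatile": deviation_detector(20.0)}

library = build_sample_library({"cpu": [1, 1, 1, 1], "orders": [0, 20, 0, 20]}, toy_classify)
anomalies = detect_online([1, 1, 1, 1, 1, 1, 1, 4], toy_classify, bindings)
print(library, anomalies)
```

The point of the design is that the online path never chooses an algorithm by hand: the class label looked up in the bindings table decides which detector and parameters run.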
Specifically, in the business data anomaly detection method based on time sequence classification provided by the invention, the time series classification in step S1 includes clustering according to the similarity of the time series, as follows:
S11: defining a distance between time series;
S12: computing a distance matrix between the time series using the distance defined in step S11;
S13: dividing the time series into several classes according to the distance matrix from step S12, a given maximum distance between time series, and a given minimum number of samples per class.
The distance between two time series can be defined on the basis of Euclidean distance or on the basis of DTW (Dynamic Time Warping) alignment. Euclidean distance is a commonly used distance definition: it is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In two-dimensional and three-dimensional space, the Euclidean distance is the actual distance between two points. The DTW algorithm is based on dynamic programming. It was developed to solve the problem of matching templates of different pronunciation lengths and is one of the earlier, classical algorithms in speech recognition, where it is used to recognize isolated words. The DTW algorithm is used here.
The clustering algorithm used is density clustering (DBSCAN), a density-based clustering algorithm that assumes a class can be determined by how tightly its samples are distributed. Samples of the same class are closely connected; that is, around any sample of a class there must be other samples of the same class. Grouping closely connected samples together yields one cluster, and grouping all sets of closely connected samples into different classes yields the final clustering result. Given a maximum distance between samples and a minimum number of samples per class, the algorithm partitions the whole sample set into several classes without supervision.
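To make the roles of the two parameters concrete, here is a minimal pure-Python sketch of DBSCAN over a precomputed distance matrix. It is an illustration under assumptions, not the implementation used in the embodiment: the function name `dbscan_precomputed` and the parameter names `eps` (maximum distance) and `min_samples` (minimum number of samples, counting the point itself) are chosen here for readability.

```python
from typing import List, Optional

NOISE = -1  # label for samples that belong to no cluster


def dbscan_precomputed(dist: List[List[float]], eps: float, min_samples: int) -> List[int]:
    """Cluster samples given a full pairwise distance matrix."""
    n = len(dist)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding the cluster
    return [NOISE if label is None else label for label in labels]


# Toy example: scalar "series" positions; in the patent the matrix would hold DTW distances.
points = [0, 1, 2, 10, 11, 12, 50]
matrix = [[abs(a - b) for b in points] for a in points]
labels = dbscan_precomputed(matrix, eps=1.5, min_samples=2)
print(labels)
```

Because the algorithm only reads the matrix, any distance definition from step S11 can be plugged in without changing the clustering code.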
The specific implementation process is as follows. Sixteen time series are exported from the database, each containing 7000 data points, and a distance matrix is computed using the DTW distance. DTW is a time alignment strategy that measures the similarity and dissimilarity of time series, but its time complexity is O(n²), which makes computing a large number of time series distances expensive. For the 16 series above, computing the distance matrix takes about 1 hour. To accelerate the distance computation, the LB Keogh lower-bound method is therefore used to bound the DTW distance; its time complexity is linear, which greatly improves computational efficiency. The time series similarity clustering effect is shown in FIG. 3.
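The relationship between the two quantities can be sketched as follows. This is a hedged illustration, assuming equal-length series, a squared pointwise cost, and a Sakoe-Chiba band of half-width `window`; the function names and the `window` parameter are assumptions, since the text does not state these details. The envelope here is built by slicing, which costs O(n·window); a strictly linear-time envelope needs a sliding-window min/max structure.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float], window: int) -> float:
    """DTW restricted to a band of half-width `window` around the diagonal."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def lb_keogh(a: Sequence[float], b: Sequence[float], window: int) -> float:
    """Lower bound on the banded DTW distance between equal-length series."""
    total = 0.0
    for i, x in enumerate(a):
        segment = b[max(0, i - window): i + window + 1]
        upper, lower = max(segment), min(segment)  # envelope of b around position i
        if x > upper:
            total += (x - upper) ** 2
        elif x < lower:
            total += (x - lower) ** 2
    return math.sqrt(total)


a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1]  # the same bump, shifted by one step
bound = lb_keogh(a, b, window=1)
exact = dtw_distance(a, b, window=1)
print(bound, exact)
```

One common way to use the bound, offered here only as a possibility, is to skip the full DTW computation for any pair whose lower bound already exceeds the clustering distance threshold.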
Specifically, in the business data anomaly detection method based on time sequence classification provided by the invention, the time series classification also includes hierarchical clustering according to global features of the time series. The classification features used for hierarchical clustering include trend, seasonality, periodicity, serial correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, serial correlation after decomposition, nonlinearity after decomposition, skewness after decomposition, and kurtosis after decomposition. The advantage of global features is that the time series need not share the same timestamps, time intervals, or lengths. The time series hierarchical clustering effect is shown in FIG. 4.
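The reason global features tolerate different lengths is that every series is reduced to a fixed-size feature vector before clustering. The sketch below is illustrative only: it computes just four of the listed features (trend as a least-squares slope, lag-1 serial correlation, skewness, excess kurtosis) and uses single-linkage agglomerative clustering as one possible hierarchical method. The function names and the choice of linkage are assumptions, and a real system would likely scale the features first.

```python
import math
from typing import List, Sequence


def global_features(x: Sequence[float]) -> List[float]:
    """Return [trend slope, lag-1 autocorrelation, skewness, excess kurtosis]; needs len(x) >= 2."""
    n = len(x)
    mean = sum(x) / n
    t_mean = (n - 1) / 2
    slope = (sum((i - t_mean) * (v - mean) for i, v in enumerate(x))
             / sum((i - t_mean) ** 2 for i in range(n)))
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        return [slope, 0.0, 0.0, 0.0]  # constant series: higher moments undefined
    std = math.sqrt(var)
    acf1 = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / (var * n)
    skew = sum(((v - mean) / std) ** 3 for v in x) / n
    kurt = sum(((v - mean) / std) ** 4 for v in x) / n - 3.0
    return [slope, acf1, skew, kurt]


def single_linkage(features: List[List[float]], n_clusters: int) -> List[int]:
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(features))]

    def gap(p: int, q: int) -> float:
        return min(math.dist(features[i], features[j]) for i in clusters[p] for j in clusters[q])

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: gap(*pq))
        clusters[p] += clusters.pop(q)  # q > p, so index p stays valid
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Four series of different lengths: two ramps and two alternating signals.
series = [list(range(10)), list(range(30)), [0, 1] * 6, [0, 1] * 10]
vectors = [global_features(s) for s in series]
labels = single_linkage(vectors, n_clusters=2)
print(labels)
```

In this toy run the two ramps end up in one class and the two alternating signals in the other, even though no two series have the same length.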
Specifically, in the business data anomaly detection method based on time sequence classification provided by the invention, the main categories of time series are selected as the time series types of the sample library, and the corresponding anomaly detection algorithm and parameters are bound to each type, which serves as the basis for online classification. The anomaly detection algorithms include a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm, and a 3-sigma algorithm. The prediction-based ARIMA algorithm and the weighted moving average algorithm are anomaly detection algorithms for stable periodic time series; the wavelet decomposition algorithm and the 3-sigma algorithm are anomaly detection algorithms for unstable time series. Finally, according to the type of the classified online time series, the anomaly detection algorithm associated with time series of the same type in the sample library is obtained, and anomaly detection is performed on the classified online time series with that algorithm.
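The binding of a class to an algorithm and its parameters can be pictured as a lookup table. The sketch below is a simplified illustration under assumptions: it implements only the two simplest detectors named above (weighted moving average and 3-sigma), and the class labels, the `BINDINGS` table, and the parameter names `weights`, `tolerance`, and `k` are hypothetical.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple


def three_sigma(series: Sequence[float], k: float = 3.0) -> List[int]:
    """Flag points more than k standard deviations from the mean."""
    mean = statistics.mean(series)
    std = statistics.pstdev(series)
    return [i for i, v in enumerate(series) if abs(v - mean) > k * std]


def weighted_moving_average(series: Sequence[float],
                            weights: Sequence[float] = (1.0, 2.0, 3.0),
                            tolerance: float = 5.0) -> List[int]:
    """Predict each point from the preceding window; flag large prediction errors."""
    w, total = len(weights), sum(weights)
    flagged = []
    for i in range(w, len(series)):
        predicted = sum(wt * v for wt, v in zip(weights, series[i - w:i])) / total
        if abs(series[i] - predicted) > tolerance:
            flagged.append(i)
    return flagged


# Class label -> (algorithm, parameters). Labels and values are illustrative.
BINDINGS: Dict[str, Tuple[Callable[..., List[int]], dict]] = {
    "stable_periodic": (weighted_moving_average, {"weights": (1.0, 2.0, 3.0), "tolerance": 12.0}),
    "unstable": (three_sigma, {"k": 3.0}),
}


def detect(label: str, series: Sequence[float]) -> List[int]:
    """Run the algorithm and parameters bound to the given class label."""
    algorithm, params = BINDINGS[label]
    return algorithm(series, **params)


spike_in_flat = detect("stable_periodic", [10, 10, 10, 10, 30, 10, 10, 10, 10])
outlier = detect("unstable", [10] * 20 + [100])
print(spike_in_flat, outlier)
```

Note that a spike also distorts the moving-average predictions for the points that follow it, so the tolerance has to be chosen with that contamination in mind.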
The effect of the service data anomaly detection method based on time sequence classification is shown in fig. 5, wherein a curve 1 is an original value curve, a curve 2 is a predicted value curve, a curve 3 is an upper limit value curve, and a curve 4 is a lower limit value curve.
In summary, the business data anomaly detection method based on time sequence classification provided by the invention automatically classifies and identifies time series of different types and automatically selects parameters or algorithms for anomaly detection on each type. When large-scale time series anomaly detection is processed, the time series type is identified automatically without excessive human involvement, false alarms and missed alarms are reduced, and labor cost is effectively saved.
While the invention has been described with reference to the preferred embodiments, it is not intended to limit the invention thereto, and it is to be understood that other modifications and improvements may be made by those skilled in the art without departing from the spirit and scope of the invention, which is therefore defined by the appended claims.

Claims (6)

1. A business data anomaly detection method based on time sequence classification is characterized by comprising the following steps:
S1: extracting offline business data, classifying the offline business data by time series, and generating a sample library containing time series of different types;
S2: associating the different types of time series in the sample library with different time series anomaly detection algorithms;
S3: acquiring online business data, and classifying the online business data according to the time series classification in the sample library from step S1;
S4: performing anomaly detection on the classified online time series according to the association between the time series classes and the time series anomaly detection algorithms in step S2;
wherein the time series classification in step S1 includes clustering according to the similarity of the time series, specifically comprising the following steps:
S11: defining a distance between time series;
S12: computing a distance matrix between the time series using the distance defined in step S11;
S13: dividing the time series into a plurality of classes according to the calculation result of step S12, a given maximum distance between every two time series, and a given minimum number of samples in each class;
wherein the time series used for similarity clustering have the same timestamps, time interval, and length; the distance between time series is defined on the basis of Euclidean distance, a DTW time alignment strategy is adopted, the LB Keogh lower-bound method is used to bound the DTW distance, and the distance between time series is computed; and the time series are classified by a density clustering algorithm.
2. The business data anomaly detection method based on time sequence classification according to claim 1, wherein the time series classification in step S1 further comprises hierarchical clustering according to global features of the time series, and the classification features used for hierarchical clustering include trend, seasonality, periodicity, serial correlation, skewness, kurtosis, nonlinearity, self-similarity, chaos, serial correlation after decomposition, nonlinearity after decomposition, skewness after decomposition, and kurtosis after decomposition.
3. The business data anomaly detection method based on time sequence classification according to claim 2, wherein, when hierarchical clustering is performed on the global features of the time series, the timestamps, time intervals, and lengths of the time series have different values.
4. The business data anomaly detection method based on time sequence classification according to claim 1, wherein step S2 specifically comprises selecting the main categories of time series as the time series types of the sample library, binding the corresponding anomaly detection algorithm and parameters, and using the types as the basis for online classification.
5. The business data anomaly detection method based on time sequence classification according to claim 1, wherein the anomaly detection algorithms comprise a prediction-based ARIMA algorithm, a weighted moving average algorithm, a wavelet decomposition algorithm, and a 3-sigma algorithm, the prediction-based ARIMA algorithm and the weighted moving average algorithm being anomaly detection algorithms for stable periodic time series, and the wavelet decomposition algorithm and the 3-sigma algorithm being anomaly detection algorithms for unstable time series.
6. The business data anomaly detection method based on time sequence classification according to claim 1, wherein step S4 specifically comprises obtaining, according to the type of the classified online time series, the anomaly detection algorithm associated with time series of the same type in the sample library, and performing anomaly detection on the classified online time series with that algorithm.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911127919.6A CN110837874B (en) 2019-11-18 2019-11-18 Business data anomaly detection method based on time sequence classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911127919.6A CN110837874B (en) 2019-11-18 2019-11-18 Business data anomaly detection method based on time sequence classification

Publications (2)

Publication Number Publication Date
CN110837874A CN110837874A (en) 2020-02-25
CN110837874B true CN110837874B (en) 2023-05-26

Family

ID=69576766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911127919.6A Active CN110837874B (en) 2019-11-18 2019-11-18 Business data anomaly detection method based on time sequence classification

Country Status (1)

Country Link
CN (1) CN110837874B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001756B (en) * 2020-08-24 2022-07-12 北京道隆华尔软件股份有限公司 Method and device for determining abnormal telecommunication service scene and computer equipment
CN114548465A (en) * 2020-11-19 2022-05-27 上海宝信软件股份有限公司 Service data time sequence reusability abnormity detection method and system
CN112565422B (en) * 2020-12-04 2022-07-22 杭州佳速度产业互联网有限公司 Method, system and storage medium for identifying fault data of power internet of things
CN113535458B (en) * 2021-09-17 2021-12-28 上海观安信息技术股份有限公司 Abnormal false alarm processing method and device, storage medium and terminal
CN114330583B (en) * 2021-12-31 2022-11-08 四川大学 Abnormal electricity utilization identification method and abnormal electricity utilization identification system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN104899327A (en) * 2015-06-24 2015-09-09 哈尔滨工业大学 Method for detecting abnormal time sequence without class label
KR101621019B1 (en) * 2015-01-28 2016-05-13 한국인터넷진흥원 Method for detecting attack suspected anomal event
CN110032670A (en) * 2019-04-17 2019-07-19 腾讯科技(深圳)有限公司 Method for detecting abnormality, device, equipment and the storage medium of time series data
CN110197211A (en) * 2019-05-17 2019-09-03 河海大学 Similarity data clustering method for dam safety monitoring data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628435B2 (en) * 2017-11-06 2020-04-21 Adobe Inc. Extracting seasonal, level, and spike components from a time series of metrics data

Non-Patent Citations (1)

Title
丁小欧; 于晟健; 王沐贤; 王宏志; 高宏; 杨东华. Anomaly detection for industrial time series data based on correlation analysis. Journal of Software (软件学报), (03), full text. *

Also Published As

Publication number Publication date
CN110837874A (en) 2020-02-25

Similar Documents

Publication Publication Date Title
CN110837874B (en) Business data anomaly detection method based on time sequence classification
CN114090396B (en) Cloud environment multi-index unsupervised anomaly detection and root cause analysis method
US8520949B1 (en) Self-similar descriptor filtering
CN111127517A (en) Production line product positioning method based on monitoring video
CN111367777B (en) Alarm processing method, device, equipment and computer readable storage medium
CN108304567B (en) Method and system for identifying working condition mode and classifying data of high-voltage transformer
CN110942099A (en) Abnormal data identification and detection method of DBSCAN based on core point reservation
CN112926045B (en) Group control equipment identification method based on logistic regression model
CN113344133B (en) Method and system for detecting abnormal fluctuation of time sequence behaviors
CN111126820A (en) Electricity stealing prevention method and system
CN111860981A (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN111325410A (en) General fault early warning system based on sample distribution and early warning method thereof
CN112966088B (en) Unknown intention recognition method, device, equipment and storage medium
CN105426441B (en) A kind of automatic preprocess method of time series
CN110794360A (en) Method and system for predicting fault of intelligent electric energy meter based on machine learning
CN112528774B (en) Intelligent unknown radar signal sorting system and method in complex electromagnetic environment
CN111796957A (en) Transaction abnormal root cause analysis method and system based on application log
CN113537321A (en) Network traffic anomaly detection method based on isolated forest and X-means
CN114020811A (en) Data anomaly detection method and device and electronic equipment
US20230377132A1 (en) Wafer Bin Map Based Root Cause Analysis
CN114490235A (en) Algorithm model for intelligently identifying quantity relation and abnormity of log data
GB2610989A (en) Systems and methods for state identification and classification of text data
CN113128584A (en) Mode-level unsupervised sorting method of multifunctional radar pulse sequence
CN116342422A (en) Defect identification method based on wafer map denoising
CN113705624B (en) Intrusion detection method and system for industrial control system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant