CN109032829B - Data anomaly detection method and device, computer equipment and storage medium - Google Patents


Publication number
CN109032829B
Authority
CN
China
Prior art keywords
time sequence
time
data point
anomaly detection
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810813779.7A
Other languages
Chinese (zh)
Other versions
CN109032829A (en)
Inventor
刘彪
张戎
李剑锋
胡婧茹
汪华
任思宇
刘玉杰
肖世广
林向东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810813779.7A
Publication of CN109032829A
Application granted
Publication of CN109032829B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766: Error or fault reporting or storing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The application relates to a data anomaly detection method, a data anomaly detection device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence; performing primary anomaly identification on the time sequence in a primary judgment mode; when the time sequence is identified to be suspected to be abnormal, performing feature extraction on the time sequence; inputting the extracted characteristic data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm. The scheme of the application improves the accuracy of anomaly detection.

Description

Data anomaly detection method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data anomaly detection method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of science and technology, online systems are used more and more widely. Anomaly detection is especially important for ensuring the normal operation of an online system and the accuracy of online data.
In the traditional method, anomaly detection is performed by setting a threshold: a value is set manually as the threshold, and any value exceeding it is considered abnormal. However, in practice data takes many different forms, and simply setting a threshold to decide whether data is abnormal is too rigid, resulting in low anomaly detection accuracy.
Disclosure of Invention
Therefore, it is necessary to provide a data anomaly detection method, an apparatus, a computer device, and a storage medium for solving the problem of relatively low accuracy of anomaly detection in the conventional method.
A method of data anomaly detection, the method comprising:
acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence;
performing primary anomaly identification on the time sequence in a primary judgment mode;
when the time sequence is identified to be suspected to be abnormal, performing feature extraction on the time sequence;
inputting the extracted characteristic data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
An apparatus for data anomaly detection, the apparatus comprising:
the acquisition module is used for acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence;
the primary judgment module is used for performing primary anomaly identification on the time sequence in a primary judgment mode;
the feature extraction module is used for performing feature extraction on the time sequence when the time sequence is identified as suspected to be abnormal;
the anomaly detection module is used for inputting the extracted characteristic data into an anomaly detection model and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence;
performing primary anomaly identification on the time sequence in a primary judgment mode;
when the time sequence is identified to be suspected to be abnormal, performing feature extraction on the time sequence;
inputting the extracted characteristic data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence;
performing primary anomaly identification on the time sequence in a primary judgment mode;
when the time sequence is identified to be suspected to be abnormal, performing feature extraction on the time sequence;
inputting the extracted characteristic data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
The data anomaly detection method, the data anomaly detection device, the computer equipment and the storage medium acquire a time sequence comprising a target data point and historical data points reported before the target data point, with the target data point and the historical data points arranged in order of reporting time. Primary anomaly identification is performed on the time sequence in a primary judgment mode, which is equivalent to first-level anomaly detection. When the time sequence is identified as suspected to be abnormal, feature extraction is performed on the time sequence, the extracted feature data is input into an anomaly detection model obtained by training through a supervised machine learning algorithm, which is equivalent to second-level anomaly detection, and an anomaly detection result for the target data point is output. The method uses multi-level anomaly detection: it combines a primary judgment mode that differs from a supervised machine learning algorithm with a supervised algorithm, and performs in-depth detection through the anomaly detection model obtained by supervised learning training, thereby improving the accuracy of anomaly detection.
Drawings
FIG. 1 is a diagram illustrating an exemplary implementation of a data anomaly detection method;
FIG. 2 is a flow diagram illustrating a method for data anomaly detection in one embodiment;
FIG. 3 is a graphical representation of a time series in one embodiment;
FIG. 4 is a schematic diagram illustrating historical data point selection in one embodiment;
FIG. 5 is a graphical representation of an anomaly detection result in one embodiment;
FIG. 6 is a schematic diagram of the principle of three sigma's law in one embodiment;
FIG. 7 is a schematic diagram of a data anomaly detection method in one embodiment;
FIG. 8 is a technical framework diagram of a data anomaly detection method in one embodiment;
FIG. 9 is a schematic diagram of an interface for alert information in one embodiment;
FIG. 10 is a block diagram of a data anomaly detection apparatus in one embodiment;
FIG. 11 is a block diagram of a data anomaly detection apparatus in another embodiment;
FIG. 12 is a diagram showing an internal configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a diagram illustrating an application scenario of the data anomaly detection method in one embodiment. Referring to fig. 1, the application scenario includes a data reporting device 110 and an anomaly detection device 120 connected via a network. The data reporting device 110 is a device for reporting data points, and the anomaly detection device 120 is a device for performing anomaly detection processing on the reported data points. Both the data reporting device 110 and the anomaly detection device 120 may be terminals or servers. The terminal may be a smart television, a desktop computer, or a mobile terminal, and the mobile terminal may include at least one of a mobile phone, a tablet computer, a notebook computer, a personal digital assistant, a wearable device, and the like. The server may be implemented as a stand-alone server or as a server cluster of multiple physical servers. There may be one or more data reporting devices 110; for example, a plurality of terminals may each report their own data to the anomaly detection device 120.
The data reporting device 110 may report data points to the anomaly detection device 120 periodically at certain time intervals. The anomaly detection device 120 can obtain a time series including a target data point and historical data points reported before the target data point, where the target data point and the historical data points are arranged according to a reporting time sequence. The anomaly detection device 120 can perform primary anomaly identification on the time series in a primary decision manner; the primary decision mode is different from supervised machine learning algorithms. When the time series is identified as suspected to be abnormal, the abnormality detection device 120 may perform feature extraction on the time series. The anomaly detection device 120 may input the extracted feature data into an anomaly detection model, and output an anomaly detection result for the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
FIG. 2 is a flowchart illustrating a data anomaly detection method according to an embodiment. This embodiment is described mainly by taking application of the method to a computer device as an example, and the computer device may be the anomaly detection device 120 in fig. 1. Referring to fig. 2, the method specifically includes the following steps:
S202, acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reporting time sequence.
It can be understood that the time series is a series formed by a group of data points arranged according to the reported time sequence.
FIG. 3 is a graphical illustration of a time series in one embodiment. For a more intuitive understanding of the time series, reference is now made to fig. 3. Referring to fig. 3, the horizontal axis is a time axis, and the vertical axis is the reported number of requests; for example, 20730 received requests are reported at 16:25. The time series can be formed by arranging the reported data points in order of reporting time, and the curve 302 in fig. 3 is an intuitive graphical representation of the time series.
In an embodiment of the present application, the time series includes a target data point and a historical data point. In the acquired time sequence, the target data points and the historical data points are arranged according to the reported time sequence. Then, the time series includes the number series of the target data points and the historical data points arranged according to the reported time sequence.
The target data point is the data point on which anomaly detection is to be performed, that is, the data point for which it needs to be determined whether it is abnormal. The historical data points are data points reported before the target data point.
In one embodiment, the target data point is the data point reported at the current time. In another embodiment, the target data point may also be a designated data point. It will be appreciated that any data point for which anomaly detection is desired can be designated as the target data point.
In one embodiment, step S202 includes: determining a target data point; acquiring historical data points reported before a target data point; and arranging the historical data points and the target data points according to the reported time sequence to obtain a time sequence.
In one embodiment, obtaining historical data points reported before the target data point comprises: acquiring historical data points reported within a preset duration before the reporting time corresponding to the target data point. For example, if the preset duration is 3 hours, the historical data points reported within the 3 hours before the reporting time corresponding to the target data point can be obtained.
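The window selection described above can be sketched as follows. This is only a minimal illustration: the function name `build_time_series`, the dictionary used to hold reported points, and the default 3-hour window are assumptions made for the example, not details given in the patent.

```python
from datetime import datetime, timedelta

def build_time_series(points, target_time, window=timedelta(hours=3)):
    """Return (time, value) pairs ordered by reporting time.

    points: hypothetical dict mapping reporting time -> reported value.
    The last element of the result is the target data point; the
    preceding elements are the historical data points in the window.
    """
    start = target_time - window
    selected = [(t, v) for t, v in points.items() if start <= t <= target_time]
    selected.sort(key=lambda pair: pair[0])  # arrange by reporting time
    return selected
```

Placing the target data point last keeps the later steps simple, since the history is everything except the final element.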
In another embodiment, obtaining historical data points reported before the target data point comprises: determining the same-period reporting time of the reporting time corresponding to the target data point, and acquiring historical data points within a preset duration before and/or after the same-period reporting time.
It can be understood that, assuming the reporting time corresponding to the target data point is 14:00 in the current period, the same-period reporting time is 14:00 in the previous period. A period can be a day, a week, or a month.
For example, suppose the reporting time corresponding to the target data point is 14:00 on January 2, 2000, and the preset duration is 3 hours. If the period is one day, the same-period reporting time is 14:00 on January 1, 2000, and the acquired historical data points can be those reported within the 3 hours before and the 3 hours after 14:00 on January 1, 2000. Similarly, if the period is one week, the same-period reporting time is 14:00 on December 26, 1999, and the acquired historical data points can be those reported within the 3 hours before and the 3 hours after 14:00 on December 26, 1999.
FIG. 4 is a diagram illustrating historical data point selection in one embodiment. To more clearly understand the selection of historical data points, reference is now made to FIG. 4. Referring to fig. 4, the horizontal axis represents reporting time and the vertical axis represents the period, where a day or a week may be used as a period. The historical data points may be selected in the sequential (ring-ratio) selection mode (1), that is, the historical data points reported within the 180 minutes before the reporting time point 402 corresponding to the target data point are selected. The selection can also be performed in either of the same-period modes (2) and (3). In mode (2), the period is one day, the same-period reporting time point of the reporting time point 402 corresponding to the target data point is 404, and the historical data points at the same-period reporting time point 404 and within the 180 minutes before and after it can be obtained. In mode (3), the period is one week, the same-period reporting time point of the reporting time point 402 corresponding to the target data point is 406, and the historical data points at the same-period reporting time point 406 and within the 180 minutes before and after it can be obtained.
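As a hedged sketch of the period-based selection, the helper below computes the time window around the corresponding time point in the previous period. The name `same_period_window` and its parameters are hypothetical; only the one-day and one-week periods and the 180-minute half-width come from the text.

```python
from datetime import datetime, timedelta

def same_period_window(target_time,
                       period=timedelta(days=1),
                       half_width=timedelta(minutes=180)):
    """Return (start, end) of the window centered on the time point
    one period before target_time."""
    anchor = target_time - period  # corresponding time in the previous period
    return anchor - half_width, anchor + half_width

# Worked example from the text: target reported at 14:00 on January 2, 2000.
target = datetime(2000, 1, 2, 14, 0)
daily = same_period_window(target)                             # one-day period
weekly = same_period_window(target, period=timedelta(weeks=1)) # one-week period
```

With the one-day period the window runs from 11:00 to 17:00 on January 1, 2000; with the one-week period it runs from 11:00 to 17:00 on December 26, 1999.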
S204, performing primary anomaly identification on the time sequence in a primary judgment mode.
It should be noted that the primary decision manner may be different from the supervised machine learning algorithm.
The primary decision manner is a manner of performing primary anomaly identification on the time series to judge whether the time series is abnormal. It is to be understood that the primary decision manner is a general term, and any manner that is different from a supervised machine learning algorithm and can perform primary anomaly identification on a time series may be referred to as the primary decision manner.
The primary decision manner may include a statistical decision algorithm and/or an unsupervised algorithm. The statistical decision algorithm judges whether the time series is abnormal through statistical analysis. The unsupervised algorithm is an algorithm that performs machine learning training on unlabeled training samples to discover structural knowledge in the training sample set.
S206, when the time sequence is identified as suspected to be abnormal, performing feature extraction on the time sequence.
Specifically, the computer device may perform primary anomaly identification on the time series in a primary decision manner, where the primary anomaly identification result is either normal or suspected abnormal. When the time series is identified as normal, the subsequent anomaly detection processing need not be continued. When the time series is identified as suspected to be abnormal, the computer device can perform feature extraction on the time series, that is, analyze the features of the time series and extract feature data.
It is understood that a computer device may extract features of a time series from multiple dimensions. In one embodiment, the computer device may extract features of the time series from a time domain dimension and a frequency domain dimension.
The time domain describes a mathematical function or physical signal as a function of time. The frequency domain, in contrast, refers to analyzing a function or signal with respect to frequency rather than time.
In one embodiment, step S206 includes: when the suspected abnormality of the time sequence is identified, extracting corresponding time domain characteristic data from the time sequence in a time domain; and/or, performing frequency domain transformation on the time sequence, and extracting corresponding frequency domain characteristic data from the transformed time sequence in a frequency domain.
The time domain feature data is feature data extracted in the time domain. The frequency domain feature data is feature data extracted in the frequency domain.
In one embodiment, extracting the respective time-domain feature data for the time series in the time domain comprises: performing statistical analysis on the time sequence to obtain statistical characteristic data; fitting the trend distribution of the time sequence to obtain fitting characteristic data; and extracting the characteristic data for classification in the time sequence to obtain classified characteristic data.
It is to be understood that the time domain feature data includes at least one of statistical feature data, fitting feature data, classification feature data, and the like.
The statistical feature data is feature data obtained by performing statistical analysis on the time series. The fitting feature data is feature data obtained by fitting the trend distribution of the time series.
The classification feature data is feature data indicating the classification to which the time series belongs. In one embodiment, the classifications of a time series include shapes such as a spike type, a stable type, or an oscillating type. It is understood that the feature data for classification in the time series is used to indicate the classification to which the time series belongs.
In one embodiment, the computer device may extract statistical feature data, fitting feature data, and classification feature data in the time series through feature engineering. Feature engineering, which is essentially an engineering activity, aims to extract feature data from raw data for use by algorithms and/or models.
In one embodiment, the computer device may extract the statistical feature data, the fitting feature data, and the classification feature data according to the numerical statistics, algorithms, or features shown in table 1, respectively.
TABLE 1
[Table 1 appears as an image in the original publication; its contents are summarized in the paragraph that follows.]
This is illustrated in connection with table 1. The computer device can obtain statistical feature data by computing numerical statistics on the time series, such as extreme values (maximum value, minimum value, and the like), the mean value, the same-period ratio, and the sequential (ring) ratio. The computer device can fit the trend distribution of the time series through algorithms such as various moving average algorithms and deep learning algorithms to obtain fitting feature data. The computer device can analyze the entropy features, value distribution features, wavelet analysis features, and the like of the time series, and determine the classification of the time series from these features to obtain classification feature data.
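A few of the time domain features named above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice of a simple moving average with a window of 3, and the definition of the sequential ratio as the relative change from the previous point are all assumptions for the example, since the text does not fix them.

```python
from statistics import mean

def statistical_features(series):
    """series: list of values ordered by reporting time, target last."""
    history, target = series[:-1], series[-1]
    prev = history[-1]
    return {
        "max": max(history),
        "min": min(history),
        "mean": mean(history),
        # Relative change of the target against the previous point
        # (assumed definition of the sequential ratio).
        "sequential_ratio": (target - prev) / prev if prev else 0.0,
    }

def moving_average(series, window=3):
    """Simple moving average, one of many possible trend-fitting choices."""
    return [mean(series[i - window + 1 : i + 1])
            for i in range(window - 1, len(series))]
```

The gap between the target data point and the last moving average value could then serve as one fitting feature, although the patent does not prescribe a particular residual.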
It is understood that a time series is normally expressed in the time domain. The computer device may perform a frequency domain transformation on the time series to convert it into the frequency domain, and extract corresponding frequency domain feature data from the transformed time series in the frequency domain.
In one embodiment, the computer device may transform the time series from the time domain into the frequency domain by Fourier transform. It will be appreciated that the Fourier transform is a method of analyzing a signal: it can decompose a signal into its components and can also be used to synthesize a signal from them. The Fourier transform decomposes a signal that is difficult to process in the time domain into components and expresses them as a signal in the frequency domain that is easier to analyze. Here, the signal components of the time series in the time domain are analyzed and converted into a signal in the frequency domain, yielding the transformed time series in the frequency domain.
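To make the transformation concrete, here is a small sketch that computes discrete Fourier transform magnitudes using only the Python standard library. The function name `dft_magnitudes` and the choice of normalized magnitudes as the frequency domain feature are assumptions for illustration; a production system would more likely use an FFT library.

```python
import cmath

def dft_magnitudes(series):
    """Return normalized magnitudes of the DFT bins 0 .. n//2."""
    n = len(series)
    mags = []
    for k in range(n // 2 + 1):
        # k-th DFT coefficient: sum over t of x[t] * exp(-2*pi*i*k*t/n)
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(series))
        mags.append(abs(coeff) / n)
    return mags

# A series that repeats every 4 samples concentrates its energy in bin 2
# when the series has 8 samples (2 full cycles).
wave = [1, 0, -1, 0, 1, 0, -1, 0]
mags = dft_magnitudes(wave)
dominant_bin = max(range(len(mags)), key=mags.__getitem__)
```

The dominant bin and its magnitude are examples of values that could be passed on as frequency domain feature data.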
It is to be appreciated that the computer device can extract at least one of the time domain feature data and the frequency domain feature data. That is, the computer device may extract only the time domain feature data, only the frequency domain feature data, or both.
It should be noted that extracting the time domain feature data reflects the features in the time dimension, so that the extracted feature data reflects the features of the time series more accurately. The frequency domain feature data directly reflects the features in the frequency domain and is easier to extract than the time domain feature data, which improves feature extraction efficiency. In addition, extracting both the time domain feature data and the frequency domain feature data captures the features of the time series from multiple dimensions, so that the extracted feature data is more comprehensive and the accuracy of anomaly detection is improved.
S208, inputting the extracted feature data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by training through a supervised machine learning algorithm.
Specifically, the computer device may perform machine learning training in advance using a supervised machine learning algorithm, resulting in an anomaly detection model. It is understood that the anomaly detection model is a machine learning model having an anomaly data point detection function. That is, the anomaly detection model can be used to detect whether the target data point is anomalous.
The computer device may input feature data obtained by feature extraction of the time series into the abnormality detection model. The computer equipment can analyze and process the characteristic data through the anomaly detection model and output an anomaly detection result aiming at the target data point.
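The text does not name a specific supervised algorithm, so the sketch below uses a k-nearest-neighbor classifier purely as a stand-in because it is short and deterministic. The function name, the feature vectors, and the label convention (1 for abnormal, 0 for normal) are all assumptions for the example.

```python
import math

def knn_predict(train_features, train_labels, query, k=3):
    """Return True if the query feature vector is classified as abnormal.

    train_features: labeled feature vectors (the supervised training set).
    train_labels: 1 = abnormal, 0 = normal (assumed convention).
    """
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = [label for _, label in distances[:k]]
    return sum(votes) * 2 > len(votes)  # majority vote among k neighbors

# Hypothetical labeled feature vectors.
X = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (3.0, 2.5), (2.8, 3.1), (3.2, 2.9)]
y = [0, 0, 0, 1, 1, 1]
```

Any supervised model that maps extracted feature data to a normal or abnormal label for the target data point could take the place of `knn_predict`.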
It is understood that the anomaly detection result for the target data point includes either the target data point being normal or the target data point being anomalous.
FIG. 5 is a graphical representation of an anomaly detection result in one embodiment. In order to intuitively understand the anomaly detection result, an example will now be described with reference to fig. 5. FIG. 5 is a graphical representation of the anomaly detection results obtained by performing anomaly detection on a series of target data points. Referring to fig. 5, the data point indicated by the circle 502 clearly deviates from the normal curve, i.e. the data point reported at 8:50 on 2017-10-19 is abnormal. When the data point reported at 8:50 is taken as the target data point for anomaly detection, if the preset duration is 3 hours, the historical data points within the previous 3 hours can be obtained, and the historical data points and the target data point are arranged in order of reporting time to obtain a time series.
It is to be understood that, when the anomaly detection result includes an anomaly of the target data point, the computer device may invoke a corresponding anomaly handling policy according to the anomaly detection result for the target data point. The exception handling strategy is a handling method adopted for an exception target data point.
In one embodiment, the exception handling policy includes triggering an alert message when a consecutive preset number of exception target data points are detected.
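The alerting rule above can be sketched as a check on the most recent detection results. The function name `should_alert` and the default threshold of 3 are hypothetical; the text only says that the number of consecutive abnormal target data points is preset.

```python
def should_alert(results, threshold=3):
    """results: detection results ordered oldest first, True = abnormal.

    Returns True when the most recent `threshold` results are all abnormal.
    """
    run = 0
    for is_abnormal in reversed(results):
        if not is_abnormal:
            break
        run += 1
    return run >= threshold
```

Requiring a run of consecutive anomalies before alerting trades a short delay for fewer alerts on isolated spikes.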
The data anomaly detection method acquires a time sequence comprising a target data point and historical data points reported before the target data point, with the target data point and the historical data points arranged in order of reporting time. Primary anomaly identification is performed on the time sequence in a primary judgment mode, which is equivalent to first-level anomaly detection. When the time sequence is identified as suspected to be abnormal, feature extraction is performed on the time sequence, the extracted feature data is input into an anomaly detection model obtained by training through a supervised machine learning algorithm, which is equivalent to second-level anomaly detection, and an anomaly detection result for the target data point is output. The method uses multi-level anomaly detection: it combines a primary judgment mode that differs from a supervised machine learning algorithm with a supervised algorithm, and performs in-depth detection through the anomaly detection model obtained by supervised learning training, thereby improving the accuracy of the anomaly detection result.
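The two-level flow summarized above might be wired together as in the following sketch. The function `detect` and its three callable parameters are hypothetical placeholders for the primary decision, the feature extraction, and the supervised model; none of these names come from the patent.

```python
def detect(series, primary_check, extract_features, model_predict):
    """Two-level anomaly detection for the last point of `series`.

    primary_check(series)      -> True if suspected abnormal (first level)
    extract_features(series)   -> feature data for the model
    model_predict(features)    -> True if abnormal (second level)
    """
    if not primary_check(series):
        # Identified as normal at the first level: skip the model.
        return "normal"
    features = extract_features(series)
    return "abnormal" if model_predict(features) else "normal"
```

Because the model runs only on series flagged by the cheaper first-level check, most normal data never reaches feature extraction.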
In one embodiment, the primary decision mode comprises a statistical decision algorithm. The step S204 of performing primary anomaly identification on the time series by a primary decision manner includes: extracting historical data points from the time series; determining the mean value and the standard deviation of the historical data points through a statistical decision algorithm; determining a numerical value interval meeting the random error according to the mean value and the standard deviation; and when the target data point is positioned outside the numerical range, identifying the suspected abnormality of the time series.
Specifically, the computer device may extract the historical data points other than the target data point from the time series, and determine the mean and the standard deviation of the historical data points through a statistical decision algorithm, i.e., compute the mean and the standard deviation of the extracted historical data points.
In one embodiment, the statistical decision algorithm comprises the three-sigma rule (three-sigma rule of thumb). The three-sigma rule, also called the Pauta criterion, assumes that a group of measurement data contains only random error; the standard deviation is calculated from the data and an interval is determined according to a certain probability. An error falling outside this interval is considered not a random error but a gross error, and the data containing that error should be removed.
The three-sigma rule is specifically: the probability that a value falls in (μ - σ, μ + σ) is 0.6827; the probability that a value falls in (μ - 2σ, μ + 2σ) is 0.9545; the probability that a value falls in (μ - 3σ, μ + 3σ) is 0.9973. Here σ represents the standard deviation and μ represents the mean, and the line x = μ is the axis of symmetry of the distribution curve. It is understood that the mean is the mean of the historical data points in the time series, and the standard deviation is the standard deviation of the historical data points in the time series.
Specifically, the computer device may obtain one end point of the numerical range satisfying the random error according to a difference between the mean value and a standard deviation of the preset multiple, and obtain the other end point of the numerical range satisfying the random error according to a sum of the mean value and the standard deviation of the preset multiple. That is, the computer device may regard the range of standard deviations between plus and minus preset multiples of the mean as the range of values that satisfy the random error. In one embodiment, the preset multiple may be any one of one time, two times, and three times. It is understood that the error of the data within the numerical range satisfying the random error is a random error, and then the data within the numerical range satisfying the random error is normal data, and the data within the numerical range satisfying the random error is abnormal data. Therefore, when the target data point is outside the value range, the computer device identifies the time series as suspected abnormal.
Fig. 6 is a schematic diagram of the principle of three sigma laws in one embodiment. For a clearer and intuitive understanding, the explanation will now be made with reference to fig. 6. Referring to fig. 6, the probability of the numerical distribution in (μ - σ, μ + σ) is 68.3%; the probability of the numerical distribution in (μ -2 σ, μ +2 σ) is 95.5%; the probability of the numerical distribution in (μ -3 σ, μ +3 σ) is 0.99.7%. Assuming the predetermined multiple is three times, the computer device may identify that the time series is suspected to be abnormal if the target data point is outside the interval (μ -3 σ, μ +3 σ).
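The statistical check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `is_suspected_abnormal`, the use of the population standard deviation, the closed interval at the boundary, and the sample values are all assumptions, not details taken from the embodiment.

```python
from statistics import mean, pstdev

def is_suspected_abnormal(series, multiple=3.0):
    """Three-sigma check: the last element is the target data point,
    everything before it is treated as the historical data points."""
    *history, target = series
    mu = mean(history)
    sigma = pstdev(history)  # assumption: population standard deviation
    lower = mu - multiple * sigma
    upper = mu + multiple * sigma
    # Assumption: a point exactly on the boundary counts as inside.
    return not (lower <= target <= upper)

# Hypothetical history of reported values.
history = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
print(is_suspected_abnormal(history + [10.3]))  # target close to the mean
print(is_suspected_abnormal(history + [25.0]))  # target far from the mean
```

Lowering `multiple` to 1 or 2 narrows the interval, so more time series are passed on to the later detection levels.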
It will be appreciated that in other embodiments, the computer device may also use other statistical decision algorithms for primary anomaly identification for time series.
In the above embodiment, historical data points are extracted from the time series; a numerical interval satisfying the random error is determined from the historical data points by the statistical decision algorithm; and when the target data point lies outside the numerical interval, the time series is identified as suspected abnormal. This amounts to applying prior knowledge through statistical means to identify whether the time series including the target data point is suspected abnormal, which guarantees the accuracy of anomaly identification to a certain extent. In addition, combining the statistical decision algorithm with the anomaly detection model obtained by supervised learning realizes multi-level anomaly detection and further improves the accuracy of anomaly detection.
In one embodiment, the primary decision mode comprises an unsupervised algorithm. Step S204 of performing primary anomaly identification on the time series in the primary decision mode includes: extracting each data point in the time series; classifying the extracted data points through the unsupervised algorithm; and performing anomaly judgment on the time series according to the classification result. The anomaly judgment result indicates whether the time series is suspected abnormal.
As described above, the unsupervised algorithm is an algorithm that performs machine learning training on training samples without labels to find structural knowledge in a training sample set.
Specifically, the computer device can perform unsupervised machine learning training in advance by substituting unlabeled training samples into the formula of the unsupervised algorithm and adjusting the parameters of the formula during training to optimize the algorithm. The computer device may extract each data point in the time series; it is understood that the extracted data points include the target data point and the historical data points. The computer device can substitute the extracted data points into the formula of the unsupervised algorithm with the adjusted parameters, thereby classifying each data point to obtain a classification result. The computer device can then perform anomaly judgment on the time series according to the classification result.
The unsupervised algorithm includes at least one of a recurrent neural network (RNN), an isolation forest algorithm (Isolation Forest), a one-class support vector machine (OneClassSVM, One-Class Support Vector Machine), an exponentially weighted moving average algorithm (EWMA, Exponentially Weighted Moving Average), and the like.
Among them, a Recurrent Neural Network (RNN) is a type of Neural Network algorithm for processing sequence data. The essential feature is that there is both an internal feedback and a feedforward connection between the processing units.
An isolation forest (Isolation Forest) is a fast anomaly detection method based on ensemble learning (Ensemble). It has linear time complexity and high accuracy, and meets the requirements of big data processing.
A one-class support vector machine (OneClassSVM, One-Class Support Vector Machine) is a classifier obtained by unsupervised training on training samples of only one class. The trained classifier labels every sample that does not belong to that class as "not belonging", rather than returning a "not belonging" result because the sample belongs to some other class.
The exponentially weighted moving average algorithm (EWMA) is a special weighted moving average method in which the weights of older data points decay exponentially.
It will be appreciated that different unsupervised algorithms will yield different classification results.
In one embodiment, when the unsupervised algorithm is a recurrent neural network algorithm, the algorithm can directly output a classification result indicating whether the target data point is abnormal. It can be understood that the anomaly judgment on the time series can then be made from this classification result, yielding an anomaly judgment result that indicates whether the time series is suspected abnormal.
In one embodiment, when the unsupervised algorithm is an isolation forest, the classification result includes the average path length to the leaf node in which the target data point is located in the trees of the isolation forest. When the average path length is less than or equal to a preset threshold, the time series can be determined to be suspected abnormal. Otherwise, when the average path length is greater than the preset threshold, the time series can be determined to be normal.
In one embodiment, when the unsupervised algorithm is a one-class support vector machine, the classification result indicates whether the target data point belongs to the normal class. When the target data point does not belong to the normal class, the time series can be determined to be suspected abnormal; when it belongs to the normal class, the time series can be determined to be normal.
In one embodiment, when the unsupervised algorithm is an exponentially weighted moving average algorithm, the computer device may smooth the time series with the exponentially weighted moving average algorithm and then use a statistical analysis algorithm on the smoothed time series to determine whether the target data point is within the random error range. If so, the time series is determined to be normal; if not, the time series is determined to be suspected abnormal.
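One possible reading of the exponentially weighted moving average embodiment is sketched below. The text does not fix the details, so several choices here are assumptions: only the historical data points are smoothed, the random error range is taken as the mean of the smoothed history plus or minus a multiple of its standard deviation, and the names `ewma`, `ewma_suspected_abnormal`, `alpha`, and `multiple` are hypothetical.

```python
from statistics import mean, pstdev

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: each smoothed value mixes
    the new observation (weight alpha) with the previous smoothed value."""
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

def ewma_suspected_abnormal(series, alpha=0.3, multiple=3.0):
    """Assumed variant: smooth the history, then test the raw target
    against mean +/- multiple * std of the smoothed history."""
    *history, target = series
    smoothed = ewma(history, alpha)
    mu, sigma = mean(smoothed), pstdev(smoothed)
    return abs(target - mu) > multiple * sigma

history = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]  # hypothetical values
print(ewma_suspected_abnormal(history + [10.0]))
print(ewma_suspected_abnormal(history + [25.0]))
```

Because smoothing reduces the spread of the history, this variant yields a narrower interval than applying the same multiple to the raw history, so the multiple may need tuning in practice.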
In the embodiment, the time sequence is subjected to the anomaly judgment processing through the unsupervised algorithm, and the unsupervised algorithm is combined with the anomaly detection model obtained through supervised learning, so that the multi-level anomaly detection processing is realized, and the accuracy of anomaly detection is improved.
In one embodiment, there are a plurality of unsupervised algorithms, and the method further comprises: obtaining the anomaly judgment result corresponding to each unsupervised algorithm; performing joint detection processing according to the anomaly judgment results corresponding to the unsupervised algorithms; and when the result of the joint detection processing indicates that the time series is abnormal, determining that the time series is suspected abnormal.
In one embodiment, the performing the joint detection processing according to the abnormal decision result corresponding to each unsupervised algorithm includes: and when the abnormity judgment result corresponding to any unsupervised algorithm represents that the time sequence is abnormal, judging that the time sequence is suspected to be abnormal. It can be understood that, since each unsupervised algorithm has its own disadvantages, and the abnormality decision result obtained by each unsupervised algorithm may have the situations of imperfection and undetected abnormality, the abnormality decision results corresponding to each unsupervised algorithm are jointly decided, and when the abnormality decision result corresponding to any unsupervised algorithm indicates that the time sequence is abnormal, the time sequence is determined to be suspected to be abnormal. Namely, the anomaly judgment results of each unsupervised algorithm are comprehensively considered, so that the primary anomaly identification of the time series can be more accurate.
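The "any algorithm reports an anomaly" rule can be sketched as follows. The detector functions here are hypothetical stand-ins for trained unsupervised algorithms; only the combination logic reflects the paragraph above.

```python
from typing import Callable, Dict, Sequence, Tuple

Detector = Callable[[Sequence[float]], bool]  # True means "time series abnormal"

def joint_detect_any(series: Sequence[float],
                     detectors: Dict[str, Detector]) -> Tuple[bool, Dict[str, bool]]:
    """Run every detector and flag the series as suspected abnormal
    as soon as at least one of them reports an anomaly."""
    results = {name: detect(series) for name, detect in detectors.items()}
    return any(results.values()), results

# Hypothetical toy detectors used only to exercise the combination rule.
detectors = {
    "spike": lambda s: s[-1] > 2 * max(s[:-1]),
    "drop": lambda s: s[-1] < 0.5 * min(s[:-1]),
}
suspected, per_detector = joint_detect_any([10, 11, 9, 10, 30], detectors)
print(suspected, per_detector)
```

This rule favors recall: a single detector is enough to send the time series to the next level, which suits a primary screening stage that is followed by a more precise supervised model.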
In one embodiment, the performing the joint detection processing according to the abnormal decision result corresponding to each unsupervised algorithm includes: and determining preset weights corresponding to the unsupervised algorithms, and determining a joint detection processing result according to the abnormal judgment result corresponding to each unsupervised algorithm and the corresponding preset weights.
The anomaly judgment result corresponding to each unsupervised algorithm is either "time series abnormal" or "time series normal". According to the weight of each unsupervised algorithm and its anomaly judgment result, the computer device can determine a first proportion for the "time series abnormal" results and a second proportion for the "time series normal" results, compare the two proportions, and take the anomaly judgment result with the larger proportion as the result of the joint detection processing.
It can be understood that when the first proportion, for "time series abnormal", is greater than the second proportion, for "time series normal", "time series abnormal" is taken as the result of the joint detection processing. Conversely, when the first proportion is smaller than the second proportion, "time series normal" is taken as the result of the joint detection processing.
For ease of understanding, an example is now given. Suppose there are 3 unsupervised algorithms A, B, and C with preset weights of 0.4, 0.4, and 0.2, respectively. The anomaly judgment result of algorithm A is "time series abnormal", that of algorithm B is "time series abnormal", and that of algorithm C is "time series normal". The first proportion, for "time series abnormal", is then 0.8, and the second proportion, for "time series normal", is 0.2. The computer device may therefore take "time series abnormal" as the result of the joint detection processing.
It can be understood that the result of the joint detection processing is determined according to the abnormality judgment result corresponding to each unsupervised algorithm and the corresponding preset weight, and the abnormality judgment result of each unsupervised algorithm is comprehensively and reasonably considered, so that the primary abnormality identification of the time series can be more accurate.
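The weighted variant of joint detection can be sketched as below. The weights 0.4, 0.4, and 0.2 are chosen so that the sketch reproduces the proportions 0.8 and 0.2 of the A, B, C example; the function name and the handling of ties (treated as normal) are assumptions, since the text does not address equal proportions.

```python
def weighted_joint_detect(decisions, weights):
    """decisions maps algorithm name -> True if it judged the series abnormal.
    weights maps algorithm name -> preset weight.
    Returns (is_abnormal, abnormal_proportion, normal_proportion)."""
    abnormal = sum(weights[name] for name, flag in decisions.items() if flag)
    normal = sum(weights[name] for name, flag in decisions.items() if not flag)
    # Assumption: a tie is resolved as "time series normal".
    return abnormal > normal, abnormal, normal

decisions = {"A": True, "B": True, "C": False}
weights = {"A": 0.4, "B": 0.4, "C": 0.2}
is_abnormal, first_proportion, second_proportion = weighted_joint_detect(decisions, weights)
print(is_abnormal, round(first_proportion, 2), round(second_proportion, 2))
```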
The computer device may determine whether the time series is suspected to be abnormal according to a result of the joint detection processing. And when the result of the joint detection processing indicates that the time sequence is abnormal, the computer equipment judges that the time sequence is suspected to be abnormal. Further, when the result of the joint detection processing indicates that the time series is normal, the computer device may determine that the time series is normal.
It should be noted that the computer device may combine the statistical decision algorithm with at least one unsupervised algorithm to perform the primary anomaly identification on the time series.
In one embodiment, the computer device may perform anomaly identification on the time series through a statistical decision algorithm at a first level, perform joint detection processing on the time series through a plurality of unsupervised algorithms at a second level after the suspected anomaly of the time series is identified, perform feature extraction on the time series at a third level after the suspected anomaly of the time series is determined through the joint detection, input the extracted feature data into an anomaly detection model obtained through supervised machine learning training for further detection, and invoke an anomaly processing strategy when the anomaly detection model outputs an anomaly detection result that a target data point is abnormal.
FIG. 7 is a schematic diagram illustrating a data anomaly detection method in accordance with one embodiment. Referring to fig. 7, the time sequence is sequentially subjected to primary anomaly identification by a first layer of statistical decision algorithm, if the time sequence is identified to be anomalous, the time sequence is subjected to joint detection by a second layer of multiple unsupervised algorithms, if the time sequence is judged to be anomalous, feature extraction is performed, the time sequence enters a third layer of supervised detection (namely, detection is performed by an anomaly detection model), and if the target data point is detected to be anomalous, an anomaly handling strategy can be called.
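The three-level flow of fig. 7 can be sketched as a short cascade. Every callable passed in below is a hypothetical placeholder (a real system would plug in the statistical decision algorithm, the trained unsupervised algorithms, the feature engineering step, and the supervised anomaly detection model), and the second level here assumes the "any algorithm" joint detection rule.

```python
def cascade_detect(series, statistical_check, unsupervised_checks,
                   extract_features, supervised_model, handle_anomaly):
    """Return True only if all three levels agree that the target is abnormal."""
    # Level 1: statistical decision algorithm.
    if not statistical_check(series):
        return False
    # Level 2: joint detection over several unsupervised algorithms.
    if not any(check(series) for check in unsupervised_checks):
        return False
    # Level 3: feature extraction followed by the supervised model.
    features = extract_features(series)
    if supervised_model(features):
        handle_anomaly(series[-1])  # invoke the anomaly handling strategy
        return True
    return False

# Hypothetical stand-ins for each level.
def three_sigma(series):
    *history, target = series
    mu = sum(history) / len(history)
    sigma = (sum((v - mu) ** 2 for v in history) / len(history)) ** 0.5
    return abs(target - mu) > 3 * sigma

spike_check = lambda s: s[-1] > 2 * max(s[:-1])
last_jump = lambda s: [s[-1] - s[-2]]   # toy one-dimensional feature vector
toy_model = lambda f: f[0] > 5          # toy decision rule, not a trained model

handled = []
print(cascade_detect([10, 11, 9, 10, 30], three_sigma, [spike_check],
                     last_jump, toy_model, handled.append))
print(handled)
```

The cheaper checks run first, so the supervised model only sees time series that the earlier levels could not clear.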
In the above embodiment, the anomaly decision results corresponding to each unsupervised algorithm are jointly decided, that is, the anomaly decision results of each unsupervised algorithm are comprehensively considered, so that the primary anomaly identification of the time series can be more accurate.
In one embodiment, the method further comprises the step of training the anomaly detection model by a supervised machine learning algorithm, specifically comprising the steps of: acquiring a sample time series and corresponding markers; wherein the mark of the positive sample time sequence is a normal mark, and the mark of the negative sample time sequence is an abnormal mark; extracting sample characteristic data in the sample time sequence; iteratively determining updated model parameters for the initial machine learning model from the sample feature data and the respective labels; and adjusting the model parameters of the initial machine learning model according to the updated model parameters until the iteration stop condition is met, and obtaining an abnormal detection model.
It will be appreciated that marked sample time series are used in machine learning training with a supervised machine learning algorithm. A sample time series is a time series used as a training sample. The mark of a positive sample time series is a normal mark, and the mark of a negative sample time series is an abnormal mark. The sample feature data is the feature data of a sample time series.
In one embodiment, a sample library may be pre-configured in the computer device. The sample base is used for storing sample data. The computer device may obtain a time series of samples and corresponding markers from a sample library.
The computer device may extract sample feature data in the time series of samples, iteratively determine updated model parameters for the initial machine learning model based on the sample feature data and corresponding labels. The updated model parameters of the initial machine learning model are the model parameters to which the model parameters of the initial machine learning model are updated. It will be appreciated that each iteration determines a new model parameter, and that the model parameters of the initial machine-learned model need to be updated to the new model parameters. This new model parameter is the updated model parameter for the initial machine learning model.
The computer equipment can directly adjust the model parameters of the initial machine learning model according to the updated model parameters, namely, the model parameters of the initial machine learning model are adjusted to the determined updated model parameters, and the iteration is repeated in this way until the iteration stop condition is met, so that the abnormal detection model is obtained. Namely, the computer device can use the model parameters meeting the iteration stop condition as final model parameters to obtain the anomaly detection model.
The model parameters of the initial machine learning model (i.e., the model parameters that need to be adjusted) referred to herein are the model parameters of the initial machine learning model after the previous iteration process has completed updating the model parameters and before the current iteration process updates the model parameters, and are not limited to the initial model parameters before updating the model parameters.
The iteration stop condition is the condition under which iterative updating of the model parameters stops. In one embodiment, the iteration stop condition may be that the number of iterations reaches a preset number of iterations. For example, if the preset number of iterations is 20, iteration stops after 20 iterations. The iteration stop condition may also be that the model parameters are stable, meaning that the model parameters no longer change or that their change stays within a preset range.
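The iterative parameter update and the two stop conditions can be illustrated with a small training loop. The embodiment does not name a specific supervised algorithm, so this sketch assumes logistic regression trained by gradient descent; the feature values, the encoding of the abnormal mark as 1 and the normal mark as 0, and the names `train` and `predict_proba` are all hypothetical.

```python
import math

def predict_proba(params, x):
    """params holds one weight per feature followed by a bias term."""
    z = params[-1] + sum(w * xi for w, xi in zip(params, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, marks, lr=0.5, max_iterations=20, tolerance=1e-6):
    params = [0.0] * (len(samples[0]) + 1)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grads = [0.0] * len(params)
        for x, y in zip(samples, marks):
            err = predict_proba(params, x) - y
            for i, xi in enumerate(x):
                grads[i] += err * xi
            grads[-1] += err
        # Updated model parameters determined in this iteration.
        updated = [p - lr * g / len(samples) for p, g in zip(params, grads)]
        change = max(abs(u - p) for u, p in zip(updated, params))
        params = updated
        # Stop condition: parameters are stable (change within a preset range).
        if change < tolerance:
            break
    # Otherwise the loop stops when the preset number of iterations is reached.
    return params, iteration

# Hypothetical one-dimensional sample feature data (e.g. deviation in sigmas).
samples = [[0.1], [0.3], [0.2], [3.5], [4.0], [3.8]]
marks = [0, 0, 0, 1, 1, 1]  # assumed encoding: 0 = normal mark, 1 = abnormal mark
params, iterations = train(samples, marks)
print(iterations, predict_proba(params, [4.0]) > 0.5)
```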
In one embodiment, after determining the updated model parameters for the initial machine learning model in each iteration, the computer device may further verify the updating effect of the model parameters, and when the verification is passed, perform the step of adjusting the model parameters of the initial machine learning model according to the updated model parameters.
In the embodiment, the anomaly detection model is obtained through supervised machine learning training, and the supervised anomaly detection model is used for carrying out deep detection on the time sequence after primary anomaly identification, so that the accuracy of anomaly detection is improved.
In one embodiment, the method further includes a step of verifying the update effect of the model parameter, and specifically includes the following steps: after each iteration determines updated model parameters, determining a first experimental model and a second experimental model; the model parameters of the first experimental model are the model parameters of the initial machine learning model before the current iteration is updated, and the model parameters of the second experimental model are the updated model parameters determined by the current iteration; inputting the same experimental data into the first experimental model and the second experimental model respectively, and outputting a first experimental result of the first experimental model and a second experimental result of the second experimental model; and when the second experimental result reaches the preset optimization condition compared with the first experimental result, executing the step of adjusting the model parameters of the initial machine learning model according to the updated model parameters.
The experimental model is used for verifying the updating effect of the model parameters. It will be appreciated that the first and second experimental models are identical before each new iteration begins, and are identical to the initial machine learning model before the current iteration is updated. After determining updated model parameters in each iteration, keeping the first experimental model unchanged (namely, the model parameters of the first experimental model are the model parameters of the initial machine learning model before updating in the current iteration), and updating the determined updated model parameters to the second experimental model. At this time, the model parameters of the second experimental model are updated model parameters determined through the current iteration.
The computer device can input the same experimental data into the first experimental model and the second experimental model respectively and output a first experimental result of the first experimental model and a second experimental result of the second experimental model. The computer device may compare the first experimental result with the second experimental result, and adjust the model parameters of the initial machine learning model to the updated model parameters determined through the current iteration when the second experimental result reaches the preset optimization condition compared with the first experimental result.
The preset optimization condition is a condition which can play an optimization role after the preset model parameters are updated. It can be understood that, when the preset optimization condition is satisfied, the updated model parameters determined by the current iteration are updated to the initial machine learning model, so that the optimization function can be achieved.
In one embodiment, the preset optimization condition may include the accuracy of the second experimental result and the accuracy of the first experimental result, and when the accuracy of the second experimental result is higher than the accuracy of the first experimental result, the preset optimization condition may be considered to be reached. It is understood that the experimental data have preset practical results. The computer device may compare the second experimental result and the first experimental result with actual results set in advance, respectively, to determine the accuracy of the second experimental result and the first experimental result.
In the above embodiment, in the training process of the supervised anomaly detection model, the model parameter updating effect is verified through the first experimental model and the second experimental model, and when the second experimental result reaches the preset optimization condition compared with the first experimental result, the step of adjusting the model parameters of the initial machine learning model according to the updated model parameters is performed. Resource waste caused by unnecessary updating is avoided, meanwhile, the effectiveness of model training is verified, and optimization of the model training is facilitated.
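The verification step can be sketched as a comparison of two experimental models on the same experimental data. The model here is a deliberately trivial threshold classifier, and the names `accuracy`, `accept_update`, and `threshold_predict` are hypothetical; the only part taken from the embodiment is that the update is applied when the second experimental result is more accurate than the first.

```python
def accuracy(predict, params, experiment_data):
    """Fraction of experimental samples whose prediction matches the preset actual result."""
    correct = sum(1 for x, actual in experiment_data if predict(params, x) == actual)
    return correct / len(experiment_data)

def accept_update(current_params, updated_params, predict, experiment_data):
    """Return the parameters the model should use after verification."""
    first_result = accuracy(predict, current_params, experiment_data)   # first experimental model
    second_result = accuracy(predict, updated_params, experiment_data)  # second experimental model
    # Preset optimization condition (assumed): strictly higher accuracy.
    return updated_params if second_result > first_result else current_params

# Hypothetical model: a single threshold; 1 = abnormal, 0 = normal.
threshold_predict = lambda threshold, x: int(x > threshold)
experiment_data = [(1.0, 0), (2.0, 0), (5.0, 1), (6.0, 1)]

print(accept_update(5.5, 3.0, threshold_predict, experiment_data))  # update improves accuracy
print(accept_update(3.0, 5.5, threshold_predict, experiment_data))  # update would make it worse
```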
FIG. 8 is a technical framework diagram of a data anomaly detection method in one embodiment. Referring to fig. 8, the method mainly includes three parts of offline model training, model update effect verification and online anomaly detection.
In the offline model training part, the anomaly detection model is obtained by training. During offline training, data to be used as training samples can be obtained from a database storing the data; primary anomaly identification is performed on the obtained data through the statistical decision algorithm and the unsupervised algorithm; the data is then imported into the sample library as training samples, and the training samples can be manually labeled according to the primary anomaly identification result to add the corresponding marks. Sample feature data of the training samples is extracted through feature engineering, and supervised machine learning training is performed with a supervised algorithm according to the extracted sample feature data and the corresponding marks to obtain the anomaly detection model.
In the model update effect verification part, during training of the anomaly detection model, the model update effect can be verified through A/B experimental models. In each training iteration, the iterative update is carried out only if the verified update effect reaches the optimization condition.
In the online anomaly detection part, data extraction is performed to obtain a time series including the target data point. Primary anomaly identification by the statistical decision algorithm and joint detection by multiple unsupervised algorithms are then performed in sequence. When the output indicates that the time series is suspected abnormal, feature data of the time series is extracted through feature engineering, and the supervised model (namely the anomaly detection model) is loaded to perform anomaly detection. It should be noted that the order of extracting the feature data through feature engineering and loading the supervised model is not limited; the feature data may also be extracted after the supervised model is loaded. It is understood that when the anomaly detection result is that the target data point is abnormal, an abnormal mark can be automatically added to the target data point and updated to the sample library. The automatically added abnormal mark can also be manually checked, and updated to the sample library after the check passes.
In one embodiment, the target data point is a data point reported at the current time; the method further comprises the following steps: when the anomaly detection result is that the target data point reported at the current time is abnormal, performing anomaly recording; and after the data point reported at the next time is obtained, the data point reported at the next time is used as the target data point reported at the current time again, and the step of obtaining the time sequence is returned to continue execution until the abnormal target data points with the continuous preset number are recorded, and the alarm information is triggered.
It should be noted that the next time refers to the time at which the next data point is reported. For example, if a data point is reported every minute and the current time is 8:52, the next time is 8:53.
Specifically, after the computer device obtains the data point reported at the next time, the next time is the new current time, and the computer device may regard the data point reported at the next time as the target data point reported at the current time, that is, the new target data point, and return to step S202 to continue the execution. That is, the computer device may acquire a time series that includes a new target data point and historical data points reported before the new target data point. Similarly, the new target data point in the time sequence and the historical data point reported before the new target data point are arranged according to the reporting time sequence. The computer device may continue to perform steps S204-S208 for the re-acquired time series to obtain an anomaly detection result for the new target data point. And when the anomaly detection result is that the new target data point is abnormal, continuing to record the anomaly. Similarly, after the data points reported at the next time are obtained, the steps are continuously repeated until the preset number of abnormal target data points are continuously recorded, and the alarm information is triggered.
Recording a consecutive preset number of abnormal target data points means that the recorded abnormal target data points are consecutive in reporting time. For example, if data points are reported once every minute and the preset number is 3, then when the target data points reported at 8:52, 8:53, and 8:54 are all recorded as abnormal, 3 consecutive abnormal target data points have been recorded and the alarm information may be triggered. By contrast, if the target data points reported at 8:52, 8:53, and 8:55 are recorded as abnormal while the target data point reported at 8:54 is normal, the three abnormal points are not consecutive in reporting time because 8:54 is missing between them. In that case 3 consecutive abnormal target data points have not been recorded, and no alarm information is triggered.
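The consecutive-count rule can be sketched with a small counter that is fed one detection result per reporting time. The class name `ConsecutiveAnomalyAlarm` is hypothetical, and resetting the counter after an alarm is an assumption; the text does not say what happens to the count once the alarm information has been triggered.

```python
class ConsecutiveAnomalyAlarm:
    """Triggers once a preset number of consecutive abnormal target data points is recorded."""

    def __init__(self, preset_number=3):
        self.preset_number = preset_number
        self.consecutive = 0

    def record(self, is_abnormal):
        """Record the detection result for the latest reporting time.
        Returns True when the alarm information should be triggered."""
        # A normal point breaks the run of consecutive anomalies.
        self.consecutive = self.consecutive + 1 if is_abnormal else 0
        if self.consecutive >= self.preset_number:
            self.consecutive = 0  # assumption: start counting afresh after an alarm
            return True
        return False

# 8:52, 8:53, 8:54 all abnormal: the third record triggers the alarm.
alarm = ConsecutiveAnomalyAlarm(preset_number=3)
print([alarm.record(flag) for flag in [True, True, True]])

# 8:52, 8:53 abnormal, 8:54 normal, 8:55 abnormal: no alarm.
alarm = ConsecutiveAnomalyAlarm(preset_number=3)
print([alarm.record(flag) for flag in [True, True, False, True]])
```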
The alarm information is prompt information used for reporting and reflecting the abnormal data points. It is understood that the alarm information may be presented in at least one of text, voice, video, and graphic forms.
In one embodiment, the alarm information includes first order alarm information and second order alarm information. The first-order alarm information is basic alarm information, and the second-order alarm information is used for displaying detailed alarm information in an advanced mode after the first-order alarm information is triggered.
FIG. 9 is an interface diagram of alarm information in one embodiment. It can be understood that fig. 9 is an interface diagram of first-order alarm information. As shown in fig. 9, the target data points in the dashed box 902 all deviate greatly from the normal curve and are therefore all abnormal, which corresponds to recording a consecutive preset number of abnormal target data points and triggers the first-order alarm information shown in fig. 9. The first-order alarm information includes an anomaly display chart and a text description, such as "Time: 2018-06-19 14:35, an abnormal condition occurred"; that is, the text description gives the starting time point of the abnormal condition. Fig. 9 also contains a view link address. After the user triggers the view link address, a display interface of the second-order alarm information is entered, and the second-order alarm information can display detailed alarm information in the form of a view. Here, a view refers to a view in a computer database, which is a virtual table whose content is defined by a query. Like a real table, a view contains a series of named columns and rows.
In the above embodiment, newly reported data points are checked for anomalies in a loop according to the data anomaly detection method, and the alarm information is triggered when a consecutive preset number of abnormal target data points has been recorded, which improves safety. In addition, a single abnormal target data point may be accidental and may not carry enough risk to require an alarm, whereas a consecutive preset number of abnormal target data points indicates a greater risk than a single one, so triggering the alarm information on that basis is more accurate.
As shown in fig. 10, in one embodiment, there is provided a data anomaly detection apparatus 1000, the apparatus 1000 including: an obtaining module 1002, a primary decision module 1004, a feature extraction module 1006, and an anomaly detection module 1008, wherein:
an obtaining module 1002, configured to obtain a time series; the time series comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged in the order in which they were reported.
a primary decision module 1004, configured to perform primary anomaly identification on the time series in a primary decision manner.
a feature extraction module 1006, configured to perform feature extraction on the time series when the time series is identified as suspected to be abnormal.
an anomaly detection module 1008, configured to input the extracted feature data into an anomaly detection model and output an anomaly detection result for the target data point; the anomaly detection model is obtained by training with a supervised machine learning algorithm.
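As a rough sketch of how the four modules hand work to one another, the function below chains a primary decision, feature extraction, and a trained model. The function name `detect_target_point`, the callable parameters, and the stand-in lambdas in the usage are all hypothetical. Returning "not abnormal" when the primary decision does not flag the series is an assumption of this sketch.

```python
from typing import Callable, Sequence


def detect_target_point(
    series: Sequence[float],
    primary_decision: Callable[[Sequence[float]], bool],
    extract_features: Callable[[Sequence[float]], Sequence[float]],
    anomaly_model: Callable[[Sequence[float]], bool],
) -> bool:
    """Return True when the last point of `series` (the target) is judged abnormal."""
    if len(series) < 2:
        raise ValueError("need at least one historical point plus the target point")
    # Primary decision: a cheap screen, so only suspected series go further.
    if not primary_decision(series):
        return False  # assumption: an unsuspected series is treated as normal
    # Feature extraction runs only for suspected series.
    features = extract_features(series)
    # The supervised model gives the final result for the target data point.
    return anomaly_model(features)


# Usage with trivial stand-ins for the three stages.
is_abnormal = detect_target_point(
    [10, 11, 10, 12, 50],
    primary_decision=lambda s: abs(s[-1] - s[-2]) > 5,
    extract_features=lambda s: [max(s) - min(s)],
    anomaly_model=lambda f: f[0] > 20,
)
```

Running the expensive model only on suspected series is the point of the two-stage arrangement; the stand-ins here exist only to make the control flow executable.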
In one embodiment, the primary decision manner comprises a statistical decision algorithm; the primary decision module 1004 is further configured to extract the historical data points from the time series; determine the mean and the standard deviation of the historical data points through the statistical decision algorithm; determine, according to the mean and the standard deviation, a numerical interval that is consistent with random error; and, when the target data point lies outside the numerical interval, identify the time series as suspected to be abnormal.
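A minimal sketch of this statistical decision follows, assuming the interval is the mean plus or minus `k` standard deviations. The multiplier `k = 3.0`, the use of the population standard deviation, and the function name `is_suspected_by_statistics` are assumptions; the embodiment only says the interval is derived from the mean and the standard deviation.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_suspected_by_statistics(series: Sequence[float], k: float = 3.0) -> bool:
    """Flag the series when the target (last) point falls outside mean +/- k * std."""
    history, target = series[:-1], series[-1]
    if len(history) < 2:
        raise ValueError("need at least two historical data points")
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the history
    lower, upper = mu - k * sigma, mu + k * sigma
    # With sigma == 0 the interval collapses to the mean, so any change is flagged.
    return target < lower or target > upper


suspected = is_suspected_by_statistics([10, 11, 9, 10, 10, 11, 9, 10, 30])
not_suspected = is_suspected_by_statistics([10, 11, 9, 10, 10, 11, 9, 10, 10.5])
```

Only the historical points feed the mean and standard deviation, so an extreme target value cannot widen the interval it is tested against.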
In one embodiment, the primary decision manner comprises an unsupervised algorithm; the primary decision module 1004 is further configured to extract each data point in the time series; classify the extracted data points through the unsupervised algorithm; and perform anomaly judgment processing on the time series according to the classification result obtained by the classification processing, where the anomaly judgment result obtained by the anomaly judgment processing indicates whether the time series is suspected to be abnormal.
In one embodiment, there are a plurality of unsupervised algorithms; the primary decision module 1004 is further configured to obtain the anomaly judgment result corresponding to each unsupervised algorithm; perform combined detection processing according to the anomaly judgment results corresponding to the unsupervised algorithms; and, when the result of the combined detection processing indicates that the time series is abnormal, determine that the time series is suspected to be abnormal.
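The combined detection can be illustrated with two simple label-free detectors and a vote. Everything specific here is an assumption: the two stand-in detectors (an interquartile-range rule and a median-absolute-deviation rule), their thresholds, and the strict-majority combination are chosen for the sketch and are not named in this embodiment.

```python
from statistics import median, quantiles
from typing import Callable, List, Sequence

Detector = Callable[[Sequence[float]], List[bool]]


def iqr_labels(points: Sequence[float], factor: float = 1.5) -> List[bool]:
    """Label each point as outlier (True) or inlier using the interquartile range."""
    q1, _, q3 = quantiles(points, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [p < low or p > high for p in points]


def mad_labels(points: Sequence[float], threshold: float = 3.5) -> List[bool]:
    """Label each point using a score based on the median absolute deviation."""
    med = median(points)
    mad = median(abs(p - med) for p in points)
    if mad == 0:
        return [p != med for p in points]
    return [0.6745 * abs(p - med) / mad > threshold for p in points]


def combined_suspected(series: Sequence[float], detectors: Sequence[Detector]) -> bool:
    """Suspect the series when a strict majority of detectors flag the target point."""
    votes = sum(1 for detect in detectors if detect(list(series))[-1])
    return votes * 2 > len(detectors)


detectors = [iqr_labels, mad_labels]
flagged = combined_suspected([10, 11, 9, 10, 12, 10, 11, 40], detectors)
cleared = combined_suspected([10, 11, 9, 10, 12, 10, 11, 11], detectors)
```

With two detectors a strict majority means both must agree; other combination rules, such as flagging when any detector fires, would fit the same structure.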
In one embodiment, the feature extraction module 1006 is further configured to, when the time series is identified as suspected to be abnormal, extract corresponding time-domain feature data from the time series in the time domain; and/or perform frequency domain transformation on the time series and extract corresponding frequency-domain feature data from the transformed time series in the frequency domain.
In one embodiment, the feature extraction module 1006 is further configured to perform statistical analysis on the time series to obtain statistical feature data; fit the trend distribution of the time series to obtain fitting feature data; and extract, from the time series, feature data used for classification to obtain classification feature data.
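The following sketch shows one possible shape for the statistical, fitting, and frequency-domain feature groups. The specific features (mean, standard deviation, extremes, the slope of a least-squares line, and low-frequency discrete Fourier transform magnitudes) and all function names are assumptions for illustration; the classification feature group is not sketched because this passage does not say what it contains.

```python
import cmath
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def statistical_features(series: Sequence[float]) -> Dict[str, float]:
    """Summary statistics over the whole series."""
    return {"mean": mean(series), "std": pstdev(series),
            "max": max(series), "min": min(series)}


def fitting_features(series: Sequence[float]) -> Dict[str, float]:
    """Least-squares line over the index; slope plus the target point's residual."""
    n = len(series)
    x_mean, y_mean = (n - 1) / 2, mean(series)
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series)) / denom
    fitted_last = y_mean + slope * ((n - 1) - x_mean)
    return {"trend_slope": slope, "last_residual": series[-1] - fitted_last}


def frequency_features(series: Sequence[float], top_k: int = 3) -> List[float]:
    """Normalized DFT magnitudes of the lowest non-zero frequencies."""
    n = len(series)
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(series))
        magnitudes.append(abs(coeff) / n)
    return magnitudes[:top_k]


def extract_features(series: Sequence[float]) -> List[float]:
    """Concatenate the feature groups into one flat vector for the model."""
    return (list(statistical_features(series).values())
            + list(fitting_features(series).values())
            + frequency_features(series))


vector = extract_features([1, 2, 3, 4, 5, 6, 7, 8])
```

The direct transform above costs quadratic time in the series length, which is acceptable for short windows; a fast Fourier transform would be the usual choice for longer ones.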
As shown in fig. 11, in one embodiment, the apparatus 1000 further comprises:
a model training module 1007, configured to obtain sample time series and corresponding labels, where the label of a positive sample time series is a normal label and the label of a negative sample time series is an abnormal label; extract sample feature data from the sample time series; iteratively determine updated model parameters for an initial machine learning model according to the sample feature data and the corresponding labels; and adjust the model parameters of the initial machine learning model according to the updated model parameters until an iteration stop condition is met, to obtain the anomaly detection model.
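The iterative training can be illustrated with a small logistic regression fitted by gradient descent. The choice of logistic regression, the learning rate, the stop condition (an iteration cap or a negligible change in loss), and the encoding of labels as 0 for normal and 1 for abnormal are assumptions of this sketch; the embodiment only requires a supervised algorithm trained iteratively on labelled sample features.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(params: Sequence[float], features: Sequence[float]) -> float:
    """Probability that the sample is abnormal; params[0] is the bias term."""
    return sigmoid(params[0] + sum(w * x for w, x in zip(params[1:], features)))


def log_loss(params, samples, labels) -> float:
    eps = 1e-12
    total = 0.0
    for x, y in zip(samples, labels):
        p = min(max(predict_proba(params, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)


def train(samples: Sequence[Sequence[float]], labels: Sequence[int],
          lr: float = 0.1, max_iter: int = 2000, tol: float = 1e-6) -> List[float]:
    """Iterate parameter updates until the assumed stop condition is met."""
    params = [0.0] * (len(samples[0]) + 1)
    previous = log_loss(params, samples, labels)
    for _ in range(max_iter):
        grad = [0.0] * len(params)
        for x, y in zip(samples, labels):
            error = predict_proba(params, x) - y
            grad[0] += error
            for i, xi in enumerate(x):
                grad[i + 1] += error * xi
        params = [p - lr * g / len(samples) for p, g in zip(params, grad)]
        current = log_loss(params, samples, labels)
        if abs(previous - current) < tol:  # stop once the loss barely changes
            break
        previous = current
    return params


# Hypothetical one-dimensional sample features: 0 = normal label, 1 = abnormal label.
samples = [[0.1], [0.2], [0.3], [2.0], [2.5], [3.0]]
labels = [0, 0, 0, 1, 1, 1]
model = train(samples, labels)
```

After training, `predict_proba(model, features) > 0.5` can serve as the abnormal/normal decision for a new feature vector.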
In one embodiment, the model training module 1010 is further configured to determine a first experimental model and a second experimental model after determining updated model parameters for each iteration; the model parameters of the first experimental model are the model parameters of the initial machine learning model before the current iteration is updated, and the model parameters of the second experimental model are the updated model parameters determined by the current iteration; inputting the same experimental data into a first experimental model and a second experimental model respectively, and outputting a first experimental result of the first experimental model and a second experimental result of the second experimental model; and when the second experimental result reaches the preset optimization condition compared with the first experimental result, adjusting the model parameters of the initial machine learning model according to the updated model parameters.
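A sketch of the comparison between the first and second experimental models follows. The accuracy metric, the one-parameter threshold model, the `min_gain` margin standing in for the preset optimization condition, and the names `choose_parameters` and `threshold_accuracy` are hypothetical; the embodiment does not specify how the two experimental results are scored.

```python
from typing import Callable, List, Sequence

Params = List[float]
Evaluator = Callable[[Params, Sequence[Sequence[float]], Sequence[int]], float]


def threshold_accuracy(params: Params, samples: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> float:
    """Hypothetical metric: accuracy of a model that flags feature > params[0]."""
    correct = sum(int(x[0] > params[0]) == y for x, y in zip(samples, labels))
    return correct / len(samples)


def choose_parameters(current: Params, candidate: Params,
                      samples: Sequence[Sequence[float]], labels: Sequence[int],
                      evaluate: Evaluator, min_gain: float = 0.0) -> Params:
    """Adopt the candidate only if it beats the current model on the same data."""
    first_result = evaluate(current, samples, labels)     # first experimental model
    second_result = evaluate(candidate, samples, labels)  # second experimental model
    if second_result - first_result > min_gain:
        return candidate
    return current


# The same experimental data is fed to both models.
experiment_samples = [[0.2], [0.4], [1.5], [2.5]]
experiment_labels = [0, 0, 1, 1]
adopted = choose_parameters([2.0], [1.0], experiment_samples, experiment_labels,
                            threshold_accuracy)
rejected = choose_parameters([1.0], [2.0], experiment_samples, experiment_labels,
                             threshold_accuracy)
```

Gating each update on a side-by-side comparison keeps an iteration that makes the model worse from being applied.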
In one embodiment, the target data point is a data point reported at the current time; the apparatus 1000 further comprises:
an alarm module (not shown in the figure), configured to make an anomaly record when the anomaly detection result indicates that the target data point reported at the current time is abnormal; and, after the data point reported at the next time is obtained, take that data point as the new target data point reported at the current time and return to the step of obtaining the time series to continue execution, until a predetermined number of consecutive abnormal target data points have been recorded, whereupon the alarm information is triggered.
FIG. 12 is a diagram showing an internal configuration of a computer device according to an embodiment. Referring to fig. 12, the computer device may be the abnormality detection device 120 shown in fig. 1. It will be appreciated that the computer device may also be a terminal. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. The computer program, when executed, causes a processor to perform a method of data anomaly detection. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The internal memory may store a computer program that, when executed by the processor, causes the processor to perform a data anomaly detection method. The network interface of the computer device is used for network communication.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, the data anomaly detection apparatus provided in the present application may be implemented in the form of a computer program that can run on the computer device shown in fig. 12. The non-volatile storage medium of the computer device may store the program modules constituting the data anomaly detection apparatus, such as the obtaining module 1002, the primary decision module 1004, the feature extraction module 1006, and the anomaly detection module 1008 shown in fig. 10. The computer program composed of these program modules causes the computer device to execute the steps of the data anomaly detection method according to the embodiments of the present application described in this specification. For example, the computer device may obtain a time series through the obtaining module 1002 in the data anomaly detection apparatus 1000 shown in fig. 10; the time series comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged in the order in which they were reported. The computer device may perform primary anomaly identification on the time series in a primary decision manner through the primary decision module 1004. The computer device may perform feature extraction on the time series through the feature extraction module 1006 when the time series is identified as suspected to be abnormal. The computer device may input the extracted feature data into an anomaly detection model through the anomaly detection module 1008 and output an anomaly detection result for the target data point; the anomaly detection model is obtained by training with a supervised machine learning algorithm.
A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the data anomaly detection method according to any one of the embodiments of the present application.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the data anomaly detection method according to any one of the embodiments of the present application.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (15)

1. A method of data anomaly detection, the method comprising:
acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence; the target data point is a data point needing to be subjected to anomaly detection;
performing primary anomaly identification on the time sequence in a primary judgment mode;
when the time sequence is identified to be suspected to be abnormal, performing feature extraction on the time sequence; the extracted feature data comprises at least one of time domain feature data and frequency domain feature data; the time domain feature data is used for reflecting the features of the time sequence in a time dimension;
inputting the extracted characteristic data into an anomaly detection model, and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by performing iterative training according to a supervised machine learning algorithm by taking the sample time sequence and the corresponding mark as sample data.
2. The method of claim 1, wherein the primary decision manner comprises a statistical decision algorithm; the primary anomaly identification of the time series by the primary decision mode comprises the following steps:
extracting the historical data points from the time series;
determining the mean and standard deviation of the historical data points through a statistical decision algorithm;
determining a numerical value interval meeting random errors according to the mean value and the standard deviation;
and when the target data point is positioned outside the numerical value interval, identifying that the time sequence is suspected to be abnormal.
3. The method of claim 1, wherein the preliminary decision manner comprises an unsupervised algorithm; the primary anomaly identification of the time series by the primary decision mode comprises the following steps:
extracting each data point in the time series;
classifying the extracted data points through an unsupervised algorithm;
performing abnormity judgment processing on the time sequence according to a classification result obtained by classification processing; and an anomaly judgment result obtained by the anomaly judgment processing is used for indicating whether the time sequence is suspected to be abnormal or not.
4. The method of claim 3, wherein the unsupervised algorithm is plural; the method further comprises the following steps:
obtaining an abnormal judgment result corresponding to each unsupervised algorithm;
performing combined detection processing according to the abnormal judgment results corresponding to the unsupervised algorithms;
and when the result of the combined detection processing shows that the time sequence is abnormal, judging that the time sequence is suspected to be abnormal.
5. The method of claim 1, wherein when the time series is identified as suspected to be abnormal, the performing feature extraction on the time series comprises:
when the time sequence is identified as suspected to be abnormal,
extracting corresponding time domain feature data from the time sequence in the time domain; and/or
performing frequency domain transformation on the time sequence, and extracting corresponding frequency domain feature data from the transformed time sequence in the frequency domain.
6. The method of claim 1, wherein the extracting the corresponding time domain feature data for the time series in the time domain comprises:
performing statistical analysis on the time sequence to obtain statistical characteristic data;
fitting the trend distribution of the time sequence to obtain fitting characteristic data;
and extracting the characteristic data for classification in the time sequence to obtain classified characteristic data.
7. The method of claim 1, further comprising:
acquiring a sample time series and corresponding markers; wherein the mark of the positive sample time sequence is a normal mark, and the mark of the negative sample time sequence is an abnormal mark;
extracting sample characteristic data in the sample time series;
iteratively determining updated model parameters for an initial machine learning model from the sample feature data and corresponding labels;
and adjusting the model parameters of the initial machine learning model according to the updated model parameters until an iteration stop condition is met, to obtain the anomaly detection model.
8. The method of claim 7, further comprising:
after each iteration determines updated model parameters, determining a first experimental model and a second experimental model; the model parameters of the first experimental model are the model parameters of the initial machine learning model before the current iteration is updated, and the model parameters of the second experimental model are the updated model parameters determined by the current iteration;
inputting the same experimental data into the first experimental model and the second experimental model respectively, and outputting a first experimental result of the first experimental model and a second experimental result of the second experimental model;
and when the second experimental result reaches a preset optimization condition compared with the first experimental result, executing the step of adjusting the model parameters of the initial machine learning model according to the updated model parameters.
9. The method according to any one of claims 1 to 8, wherein the target data point is a data point reported at a current time; the method further comprises the following steps:
when the anomaly detection result is that the target data point reported at the current time is abnormal, performing anomaly recording;
and after the data point reported at the next time is acquired, taking that data point as the new target data point reported at the current time and returning to the step of acquiring the time sequence to continue execution, until a predetermined number of consecutive abnormal target data points have been recorded, whereupon the alarm information is triggered.
10. An apparatus for detecting data abnormality, the apparatus comprising:
the acquisition module is used for acquiring a time sequence; the time sequence comprises a target data point and historical data points reported before the target data point, and the target data point and the historical data points are arranged according to the reported time sequence; the target data point is a data point needing to be subjected to anomaly detection;
the primary judgment module is used for carrying out primary abnormity identification on the time sequence in a primary judgment mode;
the characteristic extraction module is used for extracting the characteristics of the time sequence when the suspected abnormality of the time sequence is identified; the extracted feature data comprises at least one of time domain feature data and frequency domain feature data; the time domain feature data is used for reflecting the features of the time sequence in a time dimension;
the anomaly detection module is used for inputting the extracted characteristic data into an anomaly detection model and outputting an anomaly detection result aiming at the target data point; the anomaly detection model is obtained by performing iterative training according to a supervised machine learning algorithm by taking the sample time sequence and the corresponding mark as sample data.
11. The apparatus of claim 10, wherein the primary decision manner comprises a statistical decision algorithm; the primary decision module is also used for extracting the historical data points from the time sequence; determining the mean and standard deviation of the historical data points through a statistical decision algorithm; determining a numerical value interval meeting random errors according to the mean value and the standard deviation; and when the target data point is positioned outside the numerical value interval, identifying that the time sequence is suspected to be abnormal.
12. The apparatus of claim 10, wherein the preliminary decision manner comprises an unsupervised algorithm; the primary decision module is further configured to extract each data point in the time series; classifying the extracted data points through an unsupervised algorithm; performing abnormity judgment processing on the time sequence according to a classification result obtained by classification processing; and an anomaly judgment result obtained by the anomaly judgment processing is used for indicating whether the time sequence is suspected to be abnormal or not.
13. The apparatus according to any one of claims 10 to 12, wherein the feature extraction module is further configured to, when it is identified that the time series is suspected to be abnormal, extract corresponding time-domain feature data for the time series in a time domain; and/or, carrying out frequency domain transformation on the time sequence, and extracting corresponding frequency domain characteristic data from the transformed time sequence in a frequency domain.
14. A computer arrangement comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1 to 9.
15. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the method of any one of claims 1 to 9.
CN201810813779.7A 2018-07-23 2018-07-23 Data anomaly detection method and device, computer equipment and storage medium Active CN109032829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810813779.7A CN109032829B (en) 2018-07-23 2018-07-23 Data anomaly detection method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN109032829A CN109032829A (en) 2018-12-18
CN109032829B true CN109032829B (en) 2020-12-08

Family

ID=64645225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810813779.7A Active CN109032829B (en) 2018-07-23 2018-07-23 Data anomaly detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109032829B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104182623A (en) * 2014-08-12 2014-12-03 南京工程学院 Thermal process data detection method based on equivalent change rate calculation
CN105760978A (en) * 2015-07-22 2016-07-13 北京师范大学 Agricultural drought grade monitoring method based on temperature vegetation drought index (TVDI)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102540165B (en) * 2011-12-19 2013-07-17 北京师范大学 Method and system for preprocessing MODIS (Moderate-Resolution Imaging Spectroradiometer) surface albedo data
CN103093078B (en) * 2012-12-18 2016-02-17 湖南大唐先一科技有限公司 Data checking method based on an improved 53H algorithm
CN103234767B (en) * 2013-04-21 2016-01-06 苏州科技学院 Nonlinear fault detection method based on semi-supervised manifold learning
JP2015026252A (en) * 2013-07-26 2015-02-05 株式会社豊田中央研究所 Abnormality detection device and program

Also Published As

Publication number Publication date
CN109032829A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109032829B (en) Data anomaly detection method and device, computer equipment and storage medium
WO2020177377A1 (en) Machine learning-based data prediction processing method and apparatus, and computer device
CN110912867B (en) Intrusion detection method, device, equipment and storage medium for industrial control system
CN112800116B (en) Method and device for detecting abnormity of service data
CN111309539A (en) Abnormity monitoring method and device and electronic equipment
CN111625516A (en) Method and device for detecting data state, computer equipment and storage medium
CN109325118B (en) Unbalanced sample data preprocessing method and device and computer equipment
US11977536B2 (en) Anomaly detection data workflow for time series data
CN111711608A (en) Method and system for detecting abnormal flow of power data network and electronic equipment
CN115936262B (en) Yield prediction method, system and medium based on big data environment interference
CN115204536A (en) Building equipment fault prediction method, device, equipment and storage medium
JP2020071845A (en) Abnormality detection device, abnormality detection method, and abnormality detection program
CN113110961B (en) Equipment abnormality detection method and device, computer equipment and readable storage medium
US20220342861A1 (en) Automatic model selection for a time series
CN114547145A (en) Method, system, storage medium and equipment for detecting time sequence data abnormity
WO2021114613A1 (en) Artificial intelligence-based fault node identification method, device, apparatus, and medium
CN114266284A (en) Method, device, equipment and program product for detecting insulation defect type of switch cabinet
US20240004847A1 (en) Anomaly detection in a split timeseries dataset
CN116910592B (en) Log detection method and device, electronic equipment and storage medium
CN115587898B (en) Financial data secure sharing method and system based on cloud service
CN110865939B (en) Application program quality monitoring method, device, computer equipment and storage medium
CN114039837A (en) Alarm data processing method, device, system, equipment and storage medium
US10970288B2 (en) Analysis device
CN117312350B (en) Steel industry carbon emission data management method and device
CN113407422A (en) Data abnormity alarm processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant