CN114528934A - Time series data abnormity detection method, device, equipment and medium - Google Patents


Info

Publication number
CN114528934A
Authority
CN
China
Prior art keywords
data
sequence
abnormal
time
time series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210149024.8A
Other languages
Chinese (zh)
Inventor
林荣吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202210149024.8A
Publication of CN114528934A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention relates to artificial intelligence and provides a method, a device, equipment and a medium for detecting anomalies in time series data. An ARIMA model is constructed from historical time series data, the Grubbs method is then used to detect abnormal points in the time series handled by the ARIMA model, and an abnormal value threshold reference and a time series prediction sequence are determined from the detection results. Combining the abnormal data with the time series prediction sequence can improve the precision of abnormal data detection, and the obtained abnormal value threshold reference and time series prediction sequence are used directly for subsequent anomaly detection. This fusion strategy of the ARIMA model and the Grubbs model can be applied to abnormal data detection on online data and can respond quickly for anomaly diagnosis.

Description

Time series data abnormity detection method, device, equipment and medium
Technical Field
The invention relates to the field of data processing of big data, in particular to a time sequence data anomaly detection method and device, computer equipment and a storage medium.
Background
Anomaly detection on time series data can effectively identify the small number of series that differ from most others, give timely early warning, help locate causes, and provide strong support for follow-up measures such as operating strategies. It is widely applied in the fields of economy, industry, medical treatment and financial insurance.
In some business scenarios, for example agent screening, a large amount of time series data accumulates. The conversion trend of each link (stage) can be analyzed from this data, weak links can be located and improved in a targeted manner, and business conversion efficiency can be raised. Detecting anomalies in the link data therefore becomes important.
Currently popular anomaly detection methods are divided into supervised and unsupervised methods. Supervised methods convert anomaly detection into a binary classification problem, with normal data as one class and abnormal data as the other. Unsupervised methods mainly include rule-based methods, such as the 3σ criterion, and clustering-based methods, which cluster the data into several classes and treat a data point as abnormal if it is far from its class center. In the field of financial insurance, data such as the link data and performance data of the member increase (recruitment) process is often periodic and seasonal; the above methods are poorly suited to such data, and their detection accuracy for abnormal data is low.
Disclosure of Invention
The embodiments of the invention provide a time series data anomaly detection method and device, computer equipment and a storage medium, aiming to solve the problem in the prior art that supervised and unsupervised methods achieve low detection accuracy when applied to periodic and seasonal data.
In a first aspect, an embodiment of the present invention provides a method for detecting time series data anomalies, where the method includes:
acquiring historical time series data, and constructing an autoregressive integrated moving average (ARIMA) model based on the historical time series data;
verifying the seasonality of the historical time series data to obtain a verification result;
if the verification result is that the data is seasonal, taking the data of the same season as sample data of one category, and forming a sample data set from the sample data of a plurality of categories;
if the verification result is that the data is not seasonal, dividing the samples within a preset time range into sample data of one category, and forming a sample data set from the sample data of a plurality of categories;
performing time series anomaly detection on the sample data of each category in the sample data set based on a Grubbs model to obtain an anomaly detection result, and determining an abnormal value threshold reference and a time series prediction sequence for each category according to the anomaly detection result; and
when data to be detected is received, performing anomaly detection on the data to be detected based on the abnormal value threshold reference and the time series prediction sequence of each category to obtain a detection result.
In a second aspect, an embodiment of the present invention provides a time series data anomaly detection apparatus, which includes:
a first model acquisition unit, configured to acquire historical time series data and construct an autoregressive integrated moving average (ARIMA) model based on the historical time series data;
a verification result acquisition unit, configured to verify the seasonality of the historical time series data to obtain a verification result;
a first dividing unit, configured to, if the verification result is that the data is seasonal, take the data of the same season as sample data of one category and form a sample data set from the sample data of a plurality of categories;
a second dividing unit, configured to, if the verification result is that the data is not seasonal, divide the samples within a preset time range into sample data of one category and form a sample data set from the sample data of a plurality of categories;
a second model acquisition unit, configured to perform time series anomaly detection on the sample data of each category in the sample data set based on the Grubbs model to obtain an anomaly detection result, and to determine an abnormal value threshold reference and a time series prediction sequence for each category according to the anomaly detection result; and
a detection result acquisition unit, configured to, when data to be detected is received, perform anomaly detection on the data to be detected based on the abnormal value threshold reference and the time series prediction sequence of each category to obtain a detection result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the time series data abnormality detection method described in the first aspect when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the time-series data anomaly detection method according to the first aspect.
The embodiments of the invention provide a time series data anomaly detection method and device, computer equipment and a storage medium. An ARIMA model is constructed from historical time series data, the Grubbs method is then used to detect abnormal points in the time series handled by the ARIMA model, and an abnormal value threshold reference and a time series prediction sequence are determined from the detection result. Combining the abnormal data with the time series prediction sequence can improve the accuracy of abnormal data detection, and the obtained abnormal value threshold reference and time series prediction sequence are used directly for subsequent detection. This fusion strategy of the ARIMA model and the Grubbs model can be applied to abnormal data detection on online data and can respond quickly for anomaly diagnosis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a time series data anomaly detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for detecting an anomaly in time series data according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an apparatus for detecting anomalies in time series data according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a method for detecting a time series data anomaly according to an embodiment of the present invention; fig. 2 is a schematic flowchart of a time series data anomaly detection method according to an embodiment of the present invention, where the time series data anomaly detection method is applied to a server and is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S106.
S101, acquiring historical time series data, and constructing an autoregressive integrated moving average (ARIMA) model based on the historical time series data.
In this embodiment, the technical solution is described with the server as the execution subject. The time series data in this application is data from the enterprise's member increase link (the increase can be understood as recruitment of members) or members' business performance data. Such data is often periodic and seasonal, and abnormal data in it is difficult to detect accurately with a supervised binary classification method or an unsupervised clustering method.
Historical time series data refers to time series data that has already been generated and stored. Time series data is a data column recorded for one unified indicator in chronological order, and it can be uploaded to the server by a user terminal. All data in the same column must be measured on the same basis so that the values are comparable. Time series data can consist of period figures (totals over an interval) or point-in-time figures. The purpose of time series analysis is to build a time series model by finding the statistical characteristics and development regularity of the series within the sample, and to use it to predict outside the sample.
The autoregressive integrated moving average model, namely the ARIMA model (also called the difference integration moving average autoregressive model, where "moving" may also be rendered as "sliding"), is one of the methods of time series prediction analysis. In ARIMA(p, d, q), AR stands for "autoregressive" and p is the number of autoregressive terms; MA stands for "moving average" and q is the number of moving average terms; d is the number of differences (the order) applied to make the series stationary.
In one embodiment, step S101 includes:
segmenting the historical time sequence data according to a preset time interval to obtain an original data sequence taking the preset time interval as a time interval;
performing a stationarity test on the original data sequence to obtain a stationarity test result;
if the stationarity test result is that the original data sequence is non-stationary, making the original data sequence stationary by differencing to obtain a stationary data sequence;
if the stationarity test result is that the original data sequence is stationary, taking the original data sequence as the stationary data sequence;
and fitting the stationary data sequence with an initial ARIMA model, and updating the order of the initial ARIMA model to obtain the ARIMA model.
In this embodiment, an ARIMA model is constructed from the historical time series data. After a stationarity test, a white noise test and similar checks yield a stationary data sequence, an initial ARMA model is fitted to the stationary data sequence to obtain the target ARIMA model, which can then be used for subsequent abnormal data detection. The stationarity test is performed on the sequence first; if the sequence is non-stationary, a difference operation is performed, where the differencing comprises a d-order ordinary difference and a D-order periodic difference.
For example, take historical time series data composed of the member data of the enterprise's member increase link. The historical member data of department A over the last 3 years may be obtained and segmented by a preset time interval ΔT (for example, ΔT set to 1 month), giving an original data sequence with 36 segments in total. A stationarity test is then performed on this 36-segment sequence, for example by the Daniel test, to obtain a stationarity test result. If the original data sequence is found to be non-stationary, it is made stationary by differencing to obtain the stationary data sequence; if it is found to be stationary, it is used directly as the stationary data sequence without further processing. Finally, a preset initial ARMA model is fitted to the stationary data sequence to obtain the target ARIMA model.
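The segmentation step in this example can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(day, value)`, the function name `segment_by_month` and the choice to sum values within a month are hypothetical and are not specified by the source.

```python
from collections import defaultdict
from datetime import date


def segment_by_month(records):
    """Sum (day, value) records into one value per calendar month.

    Returns a list of ((year, month), total) pairs in chronological order.
    Summation is an assumption; another aggregate could be used instead.
    """
    totals = defaultdict(float)
    for day, value in records:
        totals[(day.year, day.month)] += value
    return sorted(totals.items())


# Made-up recruitment counts for one department.
records = [
    (date(2021, 1, 5), 3),
    (date(2021, 1, 20), 2),
    (date(2021, 2, 11), 4),
    (date(2021, 3, 2), 1),
    (date(2021, 3, 30), 6),
]
monthly = segment_by_month(records)
print(monthly)
```

With ΔT set to 1 month and three years of records, the same call would produce the 36-segment original data sequence. Months without any record do not appear in the result, so a real pipeline would need to decide how to fill them.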
Optionally, fitting the stationary data sequence with the initial ARIMA model and updating the order of the initial ARIMA model to obtain the ARIMA model includes:
fitting the stationary data sequence with the initial ARIMA model to determine the maximum lag order of the non-periodic autoregressive polynomial, the maximum lag order of the non-periodic moving average polynomial, the maximum lag order of the periodic autoregressive polynomial and the maximum lag order of the periodic moving average polynomial, wherein the ARIMA model is obtained from the number of periodic differences, the number of non-periodic differences and these four maximum lag orders.
That is, an initial ARMA model is fitted to the stationary data sequence to determine the order of the ARMA model, namely the values of (p, q) and (P, Q); combining these with the d-order ordinary difference and the D-order periodic difference gives the complete structure of the multiplicative periodic model fitted to the data sequence:
$$\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$
where $y_t$ is an observed value of the original data sequence, $\varepsilon_t$ is the residual term, $B$ is the lag operator, $S$ is the period of variation, $1-B$ is the non-periodic difference, $1-B^S$ is the periodic difference, $\phi_p(B)$ is the non-periodic autoregressive polynomial, $\Phi_P(B^S)$ is the periodic autoregressive polynomial, $\theta_q(B)$ is the non-periodic moving average polynomial, $\Theta_Q(B^S)$ is the periodic moving average polynomial, $p$ is the maximum lag order of the non-periodic autoregressive polynomial, $P$ is the maximum lag order of the periodic autoregressive polynomial, $q$ is the maximum lag order of the non-periodic moving average polynomial, $Q$ is the maximum lag order of the periodic moving average polynomial, $d$ is the number of non-periodic differences, and $D$ is the number of periodic differences.
In this embodiment, stationarity is first judged visually from the time series plot and then checked further with the correlogram. If a non-stationary time series shows an increasing or decreasing trend, it is differenced and the stationarity test is repeated until the series is stationary. The number of differences is the order d of the model ARIMA(p, d, q), that is, the value of d in the above formula. In theory, the more times the series is differenced, the more fully its non-stationary deterministic information is extracted; however, more differencing is not always better, because every difference operation loses information. Over-differencing should therefore be avoided, and in practice the order of differencing generally does not exceed 2.
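The two kinds of differencing mentioned here, ordinary (lag 1) and periodic (lag equal to the period), can be sketched in a few lines. The function names, the default period and the sample data below are illustrative assumptions, not part of the source.

```python
def difference(series, lag=1):
    """Apply one difference of the given lag: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def apply_differences(series, d=0, D=0, period=12):
    """Apply D periodic differences of length `period`, then d ordinary ones."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=period)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# Made-up data: a linear trend plus a repeating pattern of period 4.
pattern = [5, 1, 3, 7]
series = [2 * t + pattern[t % 4] for t in range(16)]
result = apply_differences(series, d=1, D=1, period=4)
print(result)
```

Each pass shortens the series (by `period` points for a periodic difference and by one point for an ordinary difference), which is one concrete way to see the information loss that argues against over-differencing.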
S102, verifying the seasonality of the historical time series data to obtain a verification result.
Specifically, this embodiment combines the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to verify whether the historical time series data used in the ARIMA model is periodic and seasonal.
Here, cyclic variation refers to a wave-like or oscillating variation around the long-term trend of a time series. The time span of its fluctuations can vary widely and need not be fixed.
Seasonal means that if a sequence shows similarity after every S time intervals, the sequence is said to have a periodic characteristic with period S. A sequence with such a periodic characteristic is called a seasonal time series, where S is the period length.
Seasonality is a special case of periodicity, so verifying whether the sequence is periodic or seasonal amounts to verifying whether it is periodic. Because seasonal patterns are the most common in actual business scenarios, this embodiment preferably uses the seasonality verification as the basis for the subsequent classification, so as to improve classification accuracy.
The ACF describes the correlation between the value Y_t at time t and its lagged values (Y_{t-1}, Y_{t-2}, …, Y_{t-n}), including any influence passed on through the intermediate values; the PACF describes the correlation between Y_t and Y_{t-k} alone, with the linear influence of the intermediate values removed.
For the seasonal terms of AR and MA models, the differences appear at the seasonal lags of the ACF and PACF. For example, an ARIMA(0,0,0)(0,0,1)12 model shows a spike at lag 12 of the ACF and no other significant points, while its PACF shows exponential decay at the seasonal lags, i.e. lags 12, 24, 36, …. Similarly, an ARIMA(0,0,0)(1,0,0)12 model shows exponential decay at the seasonal lags of the ACF plot and a spike at lag 12 of the PACF plot.
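To make the idea of reading seasonal lags concrete, the sketch below computes the sample ACF with the standard library only and applies it to a made-up series that repeats exactly every 12 points. The function name and the data are illustrative assumptions; the PACF is omitted for brevity.

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    acf = []
    for k in range(max_lag + 1):
        num = sum(centered[t] * centered[t + k] for t in range(n - k))
        acf.append(num / denom)
    return acf


# Made-up monthly pattern repeated for four years (48 points).
year = [3, 5, 8, 6, 4, 2, 1, 2, 4, 7, 9, 5]
series = year * 4
acf = sample_acf(series, max_lag=24)
print(round(acf[12], 3), round(acf[24], 3))
```

For this exactly periodic toy series the autocorrelation at lags 12 and 24 is 0.75 and 0.5 (36/48 and 24/48 of the overlapping points). Real data would show noisier peaks, which is why the text goes on to prefer an automatic parameter search over visual judgment.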
Whether a sequence is periodic is usually judged from the ACF and PACF plots described above. However, different people can read the same ACF and PACF plots quite differently, which introduces subjective factors and makes the other parameters harder to determine. Therefore, to determine the model parameters reasonably and allow the model to iterate on its own, the overall trend and periodicity are analyzed from the plots while the model is being explored, but when the model is applied, (p, d, q) and (P, D, Q) are determined automatically by grid search, guided by the results of the plot analysis.
The method comprises the following specific steps:
1) According to the analysis of the ACF and PACF plots, and because time series data has short-term correlation, p usually takes a value in [0, 6], d in [0, 2], q in [0, 6], P in [0, 2] and Q in [0, 2]. The period parameter D (the length of the seasonal period) is generally set according to the time granularity and period of the data; for example, if the data is hourly and the period is one day, the corresponding parameter is 24. The data here is historical time series data composed of the member data of the member increase link and is monthly. Analysis shows that the monthly data is seasonal: for example, months 1, 4, 7 and 10 of each historical year follow the same pattern, as do months 2, 5, 8 and 11 and months 3, 6, 9 and 12, so D can be set to 4; if the same month of each historical year follows the same pattern, D is set to 12. D is therefore set to 0, 4 or 12 here;
2) When grid-searching the parameters, the usual practice is to look for the parameter set with the smallest AIC value in order to find the optimal model. The AIC value is computed from the parameters fitted on the training set and therefore depends on the training data; if data outside the training set changes abruptly, it often cannot be predicted, and the fitted parameters are biased. The optimal parameters are therefore determined here by considering both the AIC value and the model accuracy under cross-time validation. Cross-time validation means that part of the data, usually 12 months, is held out for validation when the model is built, and 12 models are built one after another with the same set of parameters in a rolling manner: the 1st model predicts months 1 to 12 of the held-out period; the 2nd model, with 1 more month of training data, predicts months 2 to 12; the 3rd model, with 2 more months of training data, predicts months 3 to 12; and so on. The average accuracy is calculated from the deviations between the predicted and actual results of the 12 models. The optimal parameters are usually taken to be the set with the smallest AIC value among the five sets with the highest cross-time validation accuracy.
3) Based on the optimal parameters, the value of D determines whether seasonality exists and whether it is quarterly or annual.
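The three steps above can be sketched without committing to a particular ARIMA library. The code below only enumerates the candidate grid from the stated ranges, generates the rolling train/validation splits, and applies the selection rule (smallest AIC among the five most accurate candidates). All function names, the tuple layout `(p, d, q, P, Q, period)` and the AIC and accuracy figures are hypothetical; the model fitting that would produce those figures is deliberately left out.

```python
from itertools import product


def parameter_grid():
    """Enumerate candidate (p, d, q, P, Q, period) tuples from the stated ranges."""
    return list(product(range(7), range(3), range(7), range(3), range(3), (0, 4, 12)))


def rolling_splits(series, holdout=12):
    """Yield (train, validation) pairs; each later model gets one more training point."""
    n = len(series)
    for i in range(holdout):
        cut = n - holdout + i
        yield series[:cut], series[cut:]


def pick_parameters(results, top_k=5):
    """Among the top_k most accurate candidates, return the one with the lowest AIC.

    `results` is a list of dicts with keys "params", "aic" and "accuracy".
    """
    most_accurate = sorted(results, key=lambda r: r["accuracy"], reverse=True)[:top_k]
    return min(most_accurate, key=lambda r: r["aic"])["params"]


series = list(range(36))  # placeholder for 36 monthly values
splits = list(rolling_splits(series))

# Made-up evaluation results for six candidates.
results = [
    {"params": (1, 1, 1, 0, 0, 0), "aic": 310.0, "accuracy": 0.91},
    {"params": (2, 1, 0, 1, 1, 12), "aic": 295.5, "accuracy": 0.93},
    {"params": (0, 1, 1, 1, 0, 4), "aic": 301.2, "accuracy": 0.90},
    {"params": (3, 0, 2, 0, 1, 12), "aic": 280.0, "accuracy": 0.60},
    {"params": (1, 0, 0, 0, 0, 0), "aic": 320.0, "accuracy": 0.88},
    {"params": (0, 0, 1, 0, 0, 0), "aic": 330.0, "accuracy": 0.85},
]
best = pick_parameters(results)
print(len(parameter_grid()), len(splits), best)
```

In the made-up results, the candidate with the lowest AIC overall (280.0) is not selected because its validation accuracy puts it outside the top five, which is the situation the combined criterion is meant to guard against. The last element of the selected tuple then indicates no seasonality (0), a quarterly pattern (4) or an annual pattern (12).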
S103, if the verification result is seasonal, taking the data in the same season as sample data of one category, and forming a sample data set by the sample data of a plurality of categories.
In this embodiment, when the verification result indicates that the data is seasonal, the data belonging to the same season is treated as one category and forms the sample data of that seasonal category. Once the sample data of every seasonal category has been collected, the sample data of the multiple seasonal categories together form a sample data set.
In this embodiment, the seasonality verification takes the seasonality of the time series data fully into account and groups data of the same kind into one category. This improves classification accuracy, reduces error when the deviation values are calculated later, and improves the accuracy with which abnormal points are identified.
S104, if the verification result is that the data is not seasonal, dividing the samples within a preset time range into sample data of one category, and forming a sample data set from the sample data of a plurality of categories.
In this embodiment, when the verification result is that the data is not seasonal, a preset time interval threshold is obtained (for example, equal to the preset time interval ΔT above, more specifically 1 month), the samples falling within each preset time range are then divided into one category according to this threshold, and finally the sample data of the multiple categories form a sample data set. The preset time range can be set according to actual conditions and is not limited here.
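The two branches S103 and S104 can be sketched with one grouping function. The mapping below is an assumption built from the examples given in the text: a period of 12 groups the same calendar month of every year, a period of 4 groups months 1/4/7/10, 2/5/8/11 and 3/6/9/12, and any other value falls back to one-month windows. The names `category_of` and `build_sample_set` and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def category_of(day, period):
    """Map a date to its category label for the given seasonal period."""
    if period == 12:  # yearly pattern: same calendar month of every year
        return ("month", day.month)
    if period == 4:   # quarterly pattern: same position within the quarter
        return ("quarter_pos", (day.month - 1) % 3)
    return ("window", day.year, day.month)  # not seasonal: one-month windows


def build_sample_set(records, period):
    """Group (day, value) records into {category: [values, ...]}."""
    sample_set = defaultdict(list)
    for day, value in records:
        sample_set[category_of(day, period)].append(value)
    return dict(sample_set)


# Made-up monthly values for two years.
records = [(date(2020 + y, m, 1), 100 * y + m) for y in range(2) for m in range(1, 13)]
seasonal = build_sample_set(records, period=4)
plain = build_sample_set(records, period=0)
print(len(seasonal), len(plain))
```

With monthly input the non-seasonal branch leaves a single value per category; it becomes meaningful when the raw records are finer than the preset time range, for example daily values grouped by month.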
S105, performing time series anomaly detection on the sample data of each category in the sample data set based on the Grubbs model to obtain an anomaly detection result, and determining an abnormal value threshold reference and a time series prediction sequence for each category according to the anomaly detection result.
In this embodiment, the Grubbs model can also be understood as the Grubbs algorithm (i.e. the Grubbs method), which removes "suspicious values" from a sequence sample so that they do not take part in the calculation; such a suspicious value is called an outlier. The Grubbs algorithm, also known as the maximum normed residual test or the extreme studentized deviate test, tests for outliers in a univariate sequence that is assumed to follow a normal distribution. Because the samples are drawn from a population whose variance is unknown, the statistic follows a t distribution, and the critical value is given by formula (1):
[Formula (1), critical value of the Grubbs test; equation image not reproduced in the text]
where t_{α/N, N-2} denotes the critical value of the t distribution with N-2 degrees of freedom at significance level α/N. The Grubbs method can quickly detect whether the sample data of each category in the sample data set contains abnormal time series data, and the abnormal time series data can be replaced based on a preset value in the Grubbs table (specifically, the critical value at 95% confidence in the Grubbs table) to preserve the integrity and continuity of the time series data.
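For reference, the textbook one-sided critical value of the Grubbs test uses exactly the quantity $t_{\alpha/N,\,N-2}$ described here. The expression below is that standard form; whether it matches the patent's formula (1) symbol for symbol is an assumption, since the original equation is not available in the text. $G_{\text{crit}}$ is a label introduced only for this illustration.

```latex
G_{\text{crit}} = \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t_{\alpha/N,\,N-2}^{2}}{N-2+t_{\alpha/N,\,N-2}^{2}}}
```

A sample of size $N$ is judged to contain an outlier when its largest standardized deviation exceeds $G_{\text{crit}}$; tabulated values of this quantity are what a Grubbs table lists for each $N$ and confidence level.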
In one embodiment, step S105 includes:
taking the data of each category in the sample data set as a basic time sequence;
calculating the mean value and the standard deviation of each basic time sequence, and determining the deviation value of each time sequence data in the basic time sequence according to the mean value and the standard deviation;
comparing the deviation value with a preset value in the Grubbs table to obtain a comparison result;
and if the comparison result is that the deviation value is greater than the preset value, determining that the time series data corresponding to the deviation value is abnormal, and taking the abnormal time series data as an abnormal point.
In this embodiment, the mean and standard deviation of each basic time series sequence are calculated, with the mean denoted x_mean and the standard deviation denoted x_std. The deviation value of each time series data point xi in the basic time series sequence is denoted Gi and is computed from the mean and standard deviation as Gi = (xi - x_mean)/x_std. The preset value in the Grubbs table is denoted G_p(n). If the deviation value Gi of a data point xi is greater than the preset value G_p(n) in the Grubbs table, the time series data corresponding to Gi is determined to be abnormal and is taken as an abnormal point. Anomaly detection is performed in this way on the sample data of every category, giving a comprehensive detection result.
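The comparison of each deviation value Gi with the table value G_p(n) can be sketched as below. Two points are assumptions rather than statements from the source: the absolute deviation is used so that unusually low values are also caught (the text writes Gi without an absolute value), and the critical value 2.176 is used as the 95% table entry for n = 10 purely for illustration and should be checked against the table actually in use. The function name and data are hypothetical.

```python
from statistics import mean, stdev


def grubbs_outliers(series, critical_value):
    """Return the indices whose deviation G_i exceeds the critical value G_p(n)."""
    x_mean = mean(series)
    x_std = stdev(series)  # sample standard deviation
    if x_std == 0:
        return []          # all values identical: nothing can deviate
    return [
        i for i, x in enumerate(series)
        if abs(x - x_mean) / x_std > critical_value  # assumption: two-sided check
    ]


# Made-up category data with one suspicious value.
series = [12, 11, 13, 12, 14, 11, 12, 13, 40, 12]
outliers = grubbs_outliers(series, critical_value=2.176)
print(outliers)
```

Here the value 40 has a deviation of about 2.83 and is flagged, while every other point stays below 0.5. Note that this follows the description in testing all points against a single mean and standard deviation; the classical Grubbs procedure instead tests the most extreme point, removes it if rejected, and repeats.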
In one embodiment, in the base time series sequence, the mean value and the standard deviation of the time series data of the non-abnormal points are used as the abnormal value threshold reference of the corresponding category of the base time series sequence.
In this embodiment, once all the abnormal points in the basic time series sequence of a certain category have been removed, the mean value and the standard deviation may be calculated from the remaining non-abnormal points and used as the abnormal value threshold reference for the basic time series sequence of that category. When a future time series sequence is checked for anomalies and is judged to belong to the same category as this basic time series sequence, the mean value and standard deviation of the category can be called directly as the abnormal value threshold reference.
In one embodiment, the abnormal points in the basic time series sequence are replaced by a preset value in the Grubbs table, and the resulting sequence is used as the time series prediction sequence.
In this embodiment, if it is determined that abnormal time series data exists in the basic time series sequence of a certain category, the abnormal time series data may be replaced based on a preset value in the Grubbs table to ensure the integrity and continuity of the time series data.
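A possible way to derive the per-category threshold reference and the repaired sequence is sketched below. The text does not spell out how the table value is turned into a replacement value, so the rule used here (replace an abnormal point with the nearest boundary, mean ± G_p(n) × standard deviation of the normal points) is an assumption, as are the function name and the placeholder threshold.

```python
from statistics import mean, stdev


def build_category_baseline(series, outlier_idx, g_critical):
    """Return (normal_mean, normal_std, repaired_series) for one category."""
    flagged = set(outlier_idx)
    normal = [x for i, x in enumerate(series) if i not in flagged]
    # Threshold reference: statistics of the non-abnormal points only.
    normal_mean, normal_std = mean(normal), stdev(normal)
    upper = normal_mean + g_critical * normal_std
    lower = normal_mean - g_critical * normal_std
    # Assumption: an abnormal point is replaced by the nearest boundary value,
    # so the sequence keeps its length and stays continuous.
    repaired = [
        (upper if x > normal_mean else lower) if i in flagged else x
        for i, x in enumerate(series)
    ]
    return normal_mean, normal_std, repaired


series = [52, 49, 51, 50, 48, 53, 50, 51, 49, 120]
ref_mean, ref_std, prediction_sequence = build_category_baseline(
    series, outlier_idx=[9], g_critical=2.0  # 2.0 is a placeholder threshold
)
```

The returned mean and standard deviation play the role of the abnormal value threshold reference, and the repaired list plays the role of the time series prediction sequence for that category.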
Steps S101-S105 correspond to offline processing of the historical time series data to obtain sample data of multiple categories (that is, a time series prediction sequence is obtained for each category). The sample data in each category are normal points, and each category has its own calculated mean value, standard deviation and abnormal value threshold reference. Abnormal points in future time series data are then detected by comparison against the per-category sample data obtained from this offline processing.
And S106, when the data to be detected is received, performing anomaly detection on the data to be detected based on the abnormal value threshold reference of each category and the time sequence prediction sequence to obtain a detection result.
In this embodiment, when data to be detected is received, double anomaly detection is performed on it based on the abnormal value threshold reference of each category and the time series prediction sequence, so as to obtain a detection result. The data to be detected is a current time series sequence, distinct from the historical time series sequence, for which it must be judged whether abnormal points exist. The time series prediction sequences of the multiple categories previously processed offline in the server are therefore retrieved, and the judgment is made in real time; this can be understood as an online processing procedure.
In this embodiment, the detection is performed in order. First, whether the data to be detected is an abnormal point is determined according to the abnormal value threshold reference (for the specific detection method, refer to the description of the above embodiment). If it is determined to be an abnormal point, secondary detection is performed according to the time series prediction sequence; if it is not determined to be an abnormal point, the data to be detected is determined to be normal.
In one embodiment, step S106 includes:
determining the deviation value of the data to be detected according to the data to be detected and the abnormal value threshold reference corresponding to the category;
if the deviation value of the data to be detected is larger than a preset value in the Grubbs table, determining a residual sequence based on the time series prediction sequence and the basic time series sequence corresponding to the data to be detected;
if the residual sequence is determined to meet the 3σ criterion, judging that the detection result is not abnormal;
and if the residual sequence is determined not to meet the 3σ criterion, judging that the detection result is abnormal.
In this embodiment, the category to which the data to be detected belongs may be determined from the time period it covers. For example, if the data to be detected is the time series data sequence of department A for March of the current year, and the historical time series data has been grouped by season to obtain time series prediction sequences of multiple categories, the category of the data to be detected can be determined from the season to which it belongs. The abnormal value threshold reference corresponding to that category is then obtained and the deviation value of the data to be detected is determined, which constitutes a first round of abnormal point detection based on the Grubbs method. If an abnormal point is found in the data to be detected, a residual sequence is determined from the difference between the time series prediction sequence and the basic time series sequence corresponding to the data to be detected, and whether the residual sequence meets the 3σ criterion is then judged to obtain the final result. This double anomaly detection yields a more accurate detection result. Here σ denotes the standard deviation and μ the mean; under the 3σ criterion, the probability that a value falls in (μ-σ, μ+σ) is 0.6826, in (μ-2σ, μ+2σ) is 0.9544, and in (μ-3σ, μ+3σ) is 0.9974. If all values in the residual sequence fall within the interval (μ-3σ, μ+3σ), the 3σ criterion is met; if any value in the residual sequence falls outside that interval, the 3σ criterion is not met.
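The two-stage check can be sketched as below. Several points are assumptions rather than statements of the embodiment: a residual sequence that stays inside the 3σ band is treated as normal, the residual mean and standard deviation (`resid_mu`, `resid_sigma`) are supplied by the caller (for instance estimated from historical residuals), and all names and numbers are hypothetical.

```python
def meets_three_sigma(residuals, mu, sigma):
    """True if every residual lies inside the open interval (mu - 3*sigma, mu + 3*sigma)."""
    return all(mu - 3 * sigma < r < mu + 3 * sigma for r in residuals)


def detect(observed, predicted, ref_mean, ref_std, g_critical, resid_mu, resid_sigma):
    """Return 'normal' or 'abnormal' for one sequence of data to be detected."""
    # Stage 1: Grubbs-style deviation against the category's threshold reference.
    deviations = [abs(x - ref_mean) / ref_std for x in observed]
    if all(g <= g_critical for g in deviations):
        return "normal"
    # Stage 2: residuals between the observations and the prediction sequence.
    residuals = [o - p for o, p in zip(observed, predicted)]
    if meets_three_sigma(residuals, resid_mu, resid_sigma):
        return "normal"  # assumption: inside the 3-sigma band means not abnormal
    return "abnormal"


# Placeholder reference values for one category.
params = dict(ref_mean=50.3, ref_std=1.6, g_critical=2.0, resid_mu=0.0, resid_sigma=2.0)
print(detect([50, 51, 90], [50, 50, 52], **params))
print(detect([50, 51, 55], [50, 50, 54], **params))
```

The second stage is only reached when the first stage flags a point, which mirrors the ordered detection described in this embodiment.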
The embodiment of the application can acquire and process related data in the server based on the artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The method realizes the construction of the ARIMA model based on historical time sequence data, performs abnormal point detection on the time sequence in the ARIMA model, further determines an abnormal value threshold benchmark and a time sequence prediction sequence according to the detection result, and then combines the abnormal value threshold benchmark and the time sequence prediction sequence when performing abnormal detection on the data to be detected, thereby being capable of more accurately detecting the abnormal data.
The embodiment of the invention also provides a time sequence data abnormity detection device which is used for executing any embodiment of the time sequence data abnormity detection method. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a time series data anomaly detection apparatus 100 according to an embodiment of the present invention.
As shown in fig. 3, the time-series data abnormality detection apparatus 100 includes a first model acquisition unit 101, a verification result acquisition unit 102, a first division unit 103, a second division unit 104, a second model acquisition unit 105, and a detection result acquisition unit 106.
The first model obtaining unit 101 is configured to obtain historical time series data, and construct a difference-integrated moving average autoregressive model based on the historical time series data.
In this embodiment, the technical solution is described with a server as the execution subject. The time series data in the technical solution of the present application is data on staff increase in an enterprise's staff-increase link (staff increase can be understood as recruiting members) or data on members' business performance. Such data is often periodic and seasonal, and abnormal data in it is difficult to detect accurately with a supervised binary-classification anomaly detection method or an unsupervised clustering-based method.
The historical time series data refers to time series data that has already been generated and stored. Time series data is a data column in which the same unified indicator is recorded in chronological order. The data within one data column must be measured on the same basis so that the values are comparable. Time series data may consist of period figures or point-in-time figures. The purpose of time series analysis is to construct a time series model by finding the statistical characteristics and development regularity of the time series within the sample, and to use the model for out-of-sample prediction.
The differential integrated moving average autoregressive model is the ARIMA model (Autoregressive Integrated Moving Average model, in which "moving" may also be rendered as "sliding"), one of the methods of time series prediction analysis. In ARIMA(p, d, q), AR stands for "autoregressive" and p is the number of autoregressive terms; MA stands for "moving average" and q is the number of moving average terms; d is the number of differences (the order) applied to make the sequence stationary.
In an embodiment, the first model obtaining unit 101 is specifically configured to:
segmenting the historical time series data according to a preset time interval to obtain an original data sequence taking the preset time interval as a time interval;
performing sequence stability test on the original data sequence to obtain a stability test result;
if the stationarity test result is that the original data sequence is a non-stationary data sequence, carrying out stabilization processing on the original data sequence by adopting a difference to obtain a stationary data sequence;
if the stationarity test result is that the original data sequence is a steady data sequence, taking the original data sequence as the steady data sequence;
and fitting the stable data sequence through an initial difference integration moving average autoregressive model, and updating the order of the initial difference integration moving average autoregressive model to obtain the difference integration moving average autoregressive model.
In this embodiment, the ARIMA model is constructed based on the historical time series data. After a stationarity test, a white noise test and the like are performed to obtain a stationary data sequence, an initial ARMA model is used to fit the stationary data sequence to obtain the target ARIMA model, which can be used for subsequent abnormal data detection. A stationarity test is performed on the sequence; if the sequence is non-stationary, a difference operation is performed, comprising a d-order ordinary difference and a D-order periodic difference.
For example, taking historical time series data composed of the staff data of an enterprise's staff-increase link, the historical staff data of department A for the last 3 years may be obtained, and this historical time series data may then be segmented according to the preset time interval ΔT (for example, ΔT set to 1 month), giving an original data sequence with 36 segments in total. A stationarity test is then performed on the 36-segment original data sequence, for example by the Daniel test, to obtain a stationarity test result. If the stationarity test result is that the original data sequence is non-stationary, it is made stationary by differencing to obtain the stationary data sequence; if the stationarity test result is that the original data sequence is stationary, the original data sequence is used directly as the stationary data sequence without any processing. Finally, the stationary data sequence is fitted with a preset initial ARMA model to obtain the target ARIMA model.
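The segmentation step in this example can be sketched as follows. It is only an illustration under stated assumptions: the raw records are taken to be (date, value) pairs, each monthly segment is summarised by summing its values, and the function name and the synthetic records are hypothetical.

```python
from datetime import date


def segment_by_month(records):
    """Aggregate (date, value) records into one total per calendar month, in time order."""
    totals = {}
    for day, value in records:
        key = (day.year, day.month)
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


# Synthetic records: two entries per month over three years (36 months).
records = [
    (date(2019 + m // 12, m % 12 + 1, d), 1)
    for m in range(36)
    for d in (1, 15)
]
original_sequence = segment_by_month(records)
print(len(original_sequence))  # 36 monthly segments
```

With ΔT set to one month, three years of history yields the 36-segment original data sequence mentioned above.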
Optionally, the fitting the stationary data sequence through an initial difference integrated moving average autoregressive model, and updating the order of the initial difference integrated moving average autoregressive model to obtain the difference integrated moving average autoregressive model includes:
fitting the stationary data sequence through the initial differential integrated moving average autoregressive model to determine the maximum lag order of the aperiodic autoregressive polynomial, the maximum lag order of the aperiodic moving average polynomial, the maximum lag order of the periodic autoregressive polynomial and the maximum lag order of the periodic moving average polynomial, wherein the differential integrated moving average autoregressive model is obtained from the number of periodic differences, the number of aperiodic differences and these four maximum lag orders.
An initial ARMA model is used to fit the stationary data sequence and the order of the ARMA model is determined, that is, the values of (p, q) and (P, Q) are determined. Combining these with the d-order ordinary difference and the D-order periodic difference gives the complete structure of the multiplicative periodic model fitted to the data sequence, as follows:
$$\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,y_t=\theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$
where $y_t$ is an observed value of the original data sequence, $\varepsilon_t$ is the residual term, $B$ is the lag operator, $S$ is the period of variation, $1-B$ denotes the aperiodic difference, $1-B^S$ denotes the periodic difference, $\phi_p(B)$ is the aperiodic autoregressive polynomial, $\Phi_P(B^S)$ is the periodic autoregressive polynomial, $\theta_q(B)$ is the aperiodic moving average polynomial, and $\Theta_Q(B^S)$ is the periodic moving average polynomial. Further, $p$ is the maximum lag order of the aperiodic autoregressive polynomial, $P$ is the maximum lag order of the periodic autoregressive polynomial, $q$ is the maximum lag order of the aperiodic moving average polynomial, $Q$ is the maximum lag order of the periodic moving average polynomial, $d$ is the number of aperiodic differences, and $D$ is the number of periodic differences.
Stationarity is first judged visually from the time series plot and then checked further with the correlogram. If a non-stationary time series shows an increasing or decreasing trend, it must be differenced and then tested for stationarity again, repeating until it is stationary. The number of differences is the order d of the model ARIMA(p, d, q), that is, the value of d in the formula above. In theory, the more times the sequence is differenced, the more fully the non-stationary deterministic information in the time series is extracted. More differencing is not always better, however: every difference operation loses information, so over-differencing should be avoided, and in practice the order of differencing generally does not exceed 2.
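The ordinary and periodic difference operations can be illustrated with a short sketch. The function names are hypothetical, the stationarity test itself is left out, and the order in which the two kinds of difference are applied is an arbitrary choice here (the operators commute).

```python
def difference(series, lag=1):
    """One difference at the given lag: x_t - x_(t-lag)."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def apply_differences(series, d=0, D=0, S=1):
    """Apply D periodic differences at lag S, then d ordinary differences at lag 1."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=S)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# A linear trend plus a repeating pattern of period 4.
pattern = [0, 5, 2, 7]
series = [t + pattern[t % 4] for t in range(12)]
print(apply_differences(series, D=1, S=4))       # the periodic pattern cancels out
print(apply_differences(series, d=1, D=1, S=4))  # the remaining constant cancels too
```

Each difference shortens the sequence (by S points for a periodic difference, by one point for an ordinary difference), which is one concrete form of the information loss that argues against over-differencing.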
A verification result obtaining unit 102, configured to verify seasonality of the historical time series data to obtain a verification result.
Specifically, this embodiment combines the autocorrelation function (ACF) and the partial autocorrelation function (PACF) to verify whether the historical time series data in the ARIMA model has periodicity and seasonality.
Periodicity (cyclic) refers to wave-like or oscillatory variation around the long-term trend of a time series. The duration of such fluctuations varies widely and need not be fixed.
Seasonality (seasonal): if a sequence shows similarity after every S time intervals, the sequence is said to have a periodic characteristic with period S. A sequence with this characteristic is called a seasonal time series, where S is the period length.
Seasonality is a special case of periodicity, so verifying whether a sequence is periodic or seasonal amounts to verifying whether it is periodic. Because seasonality is the more common case in actual business scenarios, this embodiment preferably uses the seasonality verification as the basis for subsequent classification, so as to improve classification accuracy.
The ACF describes the correlation between the value Yt at time t of a time series and each of its lagged values (Yt-1, Yt-2, …, Yt-n); the PACF describes the correlation between Yt and Yt-k after the linear influence of the intermediate values has been removed.
For the seasonal terms of AR and MA models, the differences appear at the seasonal lags of the ACF and PACF. For example, an ARIMA(0,0,0)(0,0,1)_12 model shows a spike at lag 12 of the ACF and no other significant spikes, while its PACF shows exponential decay at the seasonal lags, that is, lags 12, 24, 36, …. Similarly, an ARIMA(0,0,0)(1,0,0)_12 model shows exponential decay at the seasonal lags of the ACF and a single spike at lag 12 of the PACF.
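To make the role of the seasonal lags concrete, the sample autocorrelation can be computed directly. The sketch below uses the standard sample ACF estimator on a synthetic, exactly periodic monthly series; the series and the function name are hypothetical, and this is not a fitted seasonal ARIMA model.

```python
from statistics import mean


def sample_acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    m = mean(series)
    centered = [x - m for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Four years of a repeating 12-month pattern.
pattern = [10, 12, 15, 11, 9, 8, 10, 13, 16, 12, 10, 9]
series = pattern * 4
acf = sample_acf(series, max_lag=12)
print(round(acf[12], 2))  # 0.75 for this synthetic series
```

The value at lag 12 stands out against the short lags, which is the kind of feature a reader looks for in the ACF plot when judging seasonality.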
Whether a sequence is periodic is usually judged from the ACF and PACF plots described above. Readings of these plots can differ considerably from person to person, however, which introduces subjective factors and makes it harder to determine the other parameters. Therefore, so that the model parameters can be determined reasonably and the model can iterate on its own, the plots are used to analyse the overall trend and periodicity during model exploration, while in model application (p, d, q) and (P, D, Q) are determined automatically by grid search, guided by the results of the plot analysis.
The specific steps are as follows:
1) Based on the analysis of the ACF and PACF plots and on the short-term correlation of time series data, the value of p is usually taken in [0, 6], d in [0, 2], q in [0, 6], P in [0, 2] and Q in [0, 2]. The period parameter D is generally set according to the time granularity and period length of the data; for example, if the data is hourly and the period is one day, the corresponding parameter is 24. The data here is historical time series data composed of the staff data of the staff-increase link and is monthly. Analysis shows that the monthly data is seasonal: in each historical year, months 1, 4, 7 and 10 follow the same pattern, as do months 2, 5, 8 and 11, and months 3, 6, 9 and 12, so D can be set to 4; if the same month of each historical year follows the same pattern, D is set to 12. D is therefore set to 0, 4 or 12 here;
2) When performing a grid search over the parameters, the set of parameters with the smallest AIC value is usually sought in order to find the optimal model. The AIC value is calculated from the parameters fitted on the training set and therefore depends on the training data; if the data outside the training set changes abruptly, it often cannot be predicted, and the fitted parameters are biased. The optimal parameters here are therefore determined by considering the AIC value together with the model accuracy under cross-time validation. Cross-time validation means that part of the data, usually 12 months, is held out for validation when the model is built, and 12 models are built with the same set of parameters by rolling forward one month at a time: the first model predicts future months 1 to 12; the second model, with one more month of training data, predicts future months 2 to 12; the third model, with two more months of training data, predicts future months 3 to 12; and so on. The average accuracy is calculated from the deviation between the predicted and actual results of the 12 models. The optimal parameter set is usually the one with the smallest AIC value among the five sets with the highest cross-time validation accuracy.
3) Based on the optimal parameters, whether seasonality exists, and whether it is quarterly or annual, is determined from the value of D.
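The search procedure in steps 1) to 3) can be outlined in code. The sketch below covers only the bookkeeping: enumerating the candidate parameters, building the rolling train/validation splits, and choosing among scored candidates. Fitting the model and computing AIC or accuracy is left to whatever library is used, so the result dictionaries and all function names here are hypothetical.

```python
from itertools import product


def candidate_grid(periods=(0, 4, 12)):
    """Yield ((p, d, q), (P, Q), period) over the ranges given in step 1)."""
    for p, d, q, P, Q, period in product(
        range(7), range(3), range(7), range(3), range(3), periods
    ):
        yield (p, d, q), (P, Q), period


def rolling_splits(n_total, holdout=12):
    """k-th split trains on the first n_total - holdout + k points and predicts the rest."""
    start = n_total - holdout
    return [(range(0, start + k), range(start + k, n_total)) for k in range(holdout)]


def select_best(results, top_n=5):
    """Among the top_n results by accuracy, return the one with the smallest AIC."""
    top = sorted(results, key=lambda r: r["accuracy"], reverse=True)[:top_n]
    return min(top, key=lambda r: r["aic"])


# Made-up scores for seven candidate parameter sets.
accuracies = [0.90, 0.85, 0.95, 0.70, 0.88, 0.92, 0.60]
aics = [120, 100, 130, 90, 110, 125, 80]
results = [
    {"params": i, "accuracy": a, "aic": c}
    for i, (a, c) in enumerate(zip(accuracies, aics))
]
best = select_best(results)
```

Note that the candidates with the two lowest AIC values in this made-up example are not selected, because they fall outside the five most accurate sets; that is the effect of combining AIC with cross-time validation.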
A first dividing unit 103, configured to, if the verification result is seasonal, use the seasonal data as sample data of one category, and compose a sample data set from sample data of multiple categories.
In this embodiment, when the verification result indicates that the data has seasonality, the data belonging to the same season is treated as one category to form the sample data of that season category. Once the sample data of every season category has been obtained, the sample data of the multiple season categories together form the sample data set.
In the embodiment, the seasonality of the time sequence data is fully considered through the verification of the seasonality, the data of the same type is classified into one type, the classification accuracy is improved, the error is reduced when the deviation value is calculated subsequently, and the judgment accuracy of the abnormal point is improved.
A second dividing unit 104, configured to divide the samples within the preset time range into sample data of one category if the verification result indicates that the samples do not have seasonality, and form a sample data set with sample data of multiple categories.
In this embodiment, when the verification result is that the data has no seasonality, a preset time interval threshold is obtained (for example, equal to the preset time interval ΔT above, specifically 1 month), the samples falling within the same preset time range are divided into one category according to this threshold, and the sample data set is finally composed of the sample data of multiple categories. The preset time range can be set according to actual conditions and is not limited herein.
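The two ways of forming categories can be sketched side by side. This is an illustration under assumptions: samples are taken to be (year, month, value) tuples, the seasonal grouping follows the monthly pattern described earlier (months 1, 4, 7, 10 together, and so on), the non-seasonal grouping uses consecutive blocks of a fixed number of samples, and the function names are hypothetical.

```python
from collections import defaultdict


def group_by_season(samples):
    """Seasonal case: months at the same position within a quarter share a category."""
    groups = defaultdict(list)
    for _year, month, value in samples:
        groups[(month - 1) % 3].append(value)
    return dict(groups)


def group_by_window(samples, window):
    """Non-seasonal case: each consecutive block of `window` samples is one category."""
    return {
        start // window: [value for _y, _m, value in samples[start:start + window]]
        for start in range(0, len(samples), window)
    }


# Three years of synthetic monthly samples; the value encodes the month number.
samples = [(2019 + i // 12, i % 12 + 1, i % 12 + 1) for i in range(36)]
seasonal = group_by_season(samples)
windowed = group_by_window(samples, window=12)
print(sorted(set(seasonal[0])))  # months 1, 4, 7 and 10 land in the same category
```

Either grouping produces a dictionary of categories, each of which then serves as one basic time series sequence for the Grubbs-based detection.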
The second model obtaining unit 105 is configured to perform time series anomaly detection on the sample data of each category in the sample data set based on a Grubbs model to obtain an anomaly detection result, and to determine the abnormal value threshold reference and the time series prediction sequence of each category according to the anomaly detection result.
In this embodiment, the Grubbs model can also be understood as the Grubbs algorithm (that is, the Grubbs method), which removes "suspicious values" from the sequence samples so that they do not take part in the calculation; these "suspicious values" are the abnormal values. The Grubbs algorithm, also known as the maximum normed residual test or the extreme studentized deviate test, is used to test for outliers in a univariate sequence assumed to follow a normal distribution. Because the samples are drawn from a population whose variance is unknown, the test statistic follows the t distribution, and its critical value is given by formula (1) above. The Grubbs method can quickly detect whether the sample data of each category in the sample data set contains abnormal time series data, and the abnormal time series data can be replaced based on a preset value in the Grubbs table so as to ensure the integrity and continuity of the time series data.
In an embodiment, the second model obtaining unit 105 is specifically configured to:
taking the data of each category in the sample data set as a basic time sequence;
calculating the mean value and the standard deviation of each basic time sequence, and determining the deviation value of each time sequence data in the basic time sequence according to the mean value and the standard deviation;
comparing the deviation value with a preset value in the Grubbs table to obtain a comparison result;
and if the comparison result is that the deviation value is larger than the preset value, determining that the time sequence data corresponding to the deviation value is abnormal, and taking the time sequence data with the abnormality as an abnormal point.
In this embodiment, the mean value and the standard deviation of each basic time series sequence are calculated, where the mean value is denoted x_mean and the standard deviation x_std. The deviation value of each time series data point xi in the basic time series sequence is denoted Gi, where Gi = (xi - x_mean)/x_std, and the preset value in the Grubbs table is denoted G_p(n). If the deviation value Gi of a time series data point xi is greater than the preset value G_p(n) in the Grubbs table, the time series data corresponding to Gi is determined to be abnormal and is taken as an abnormal point. Anomaly detection is performed in this way on the sample data of every category, thereby obtaining a comprehensive detection result.
In an embodiment, in the base time sequence, the mean and the standard deviation of the time sequence data of the non-abnormal points are used as the abnormal value threshold reference of the corresponding category of the base time sequence.
In this embodiment, once all the abnormal points in the basic time series sequence of a certain category have been removed, the mean value and the standard deviation may be calculated from the remaining non-abnormal points and used as the abnormal value threshold reference for the basic time series sequence of that category. When a future time series sequence is checked for anomalies and is judged to belong to the same category as this basic time series sequence, the mean value and standard deviation of the category can be called directly as the abnormal value threshold reference.
In an embodiment, after determining that the time series data corresponding to the deviation value is abnormal and taking the abnormal time series data as an abnormal point if the comparison result is that the deviation value is greater than the preset value, the method further includes: replacing the abnormal points in the basic time series sequence with a preset value in the Grubbs table, and taking the resulting sequence as the time series prediction sequence.
In this embodiment, if it is determined that abnormal time series data exists in the basic time series sequence of a certain category, the abnormal time series data may be replaced based on a preset value in the Grubbs table to ensure the integrity and continuity of the time series data.
The processing performed in the first model obtaining unit 101, the verification result obtaining unit 102, the first dividing unit 103, the second dividing unit 104 and the second model obtaining unit 105 corresponds to offline processing of the historical time series data to obtain sample data of multiple categories (that is, a time series prediction sequence is obtained for each category). The sample data in each category are normal points, and each category has its own calculated mean value, standard deviation and abnormal value threshold reference. Abnormal points in future time series data are then detected by comparison against the per-category sample data obtained from this offline processing.
A detection result obtaining unit 106, configured to, when receiving data to be detected, perform anomaly detection on the data to be detected based on the anomaly threshold criterion of each category and the time sequence prediction sequence, so as to obtain a detection result.
In this embodiment, when data to be detected is received, double anomaly detection is performed on it based on the abnormal value threshold reference of each category and the time series prediction sequence, so as to obtain a detection result. The data to be detected is a current time series sequence, distinct from the historical time series sequence, for which it must be judged whether abnormal points exist. The time series prediction sequences of the multiple categories previously processed offline in the server are therefore retrieved, and the judgment is made in real time; this can be understood as an online processing procedure.
In this embodiment, the detection is ordered detection, that is, whether the data to be detected is an abnormal point is determined according to the abnormal value threshold criterion, and the specific detection method may refer to the description of the above embodiment.
In an embodiment, the detection result obtaining unit 106 is specifically configured to:
determining the deviation value of the data to be detected according to the data to be detected and the abnormal value threshold reference corresponding to the category;
if the deviation value of the data to be detected is larger than a preset value in the Grubbs table, determining a residual sequence based on the time series prediction sequence and the basic time series sequence corresponding to the data to be detected;
if the residual sequence is determined to meet the 3σ criterion, judging that the detection result is not abnormal;
and if the residual sequence is determined not to meet the 3σ criterion, judging that the detection result is abnormal.
In this embodiment, the category to which the data to be detected belongs may be determined from the time period it covers. For example, if the data to be detected is the time series data sequence of department A for March of the current year, and the historical time series data has been grouped by season to obtain time series prediction sequences of multiple categories, the category of the data to be detected can be determined from the season to which it belongs. The abnormal value threshold reference corresponding to that category is then obtained and the deviation value of the data to be detected is determined, which constitutes a first round of abnormal point detection based on the Grubbs method. If an abnormal point is found in the data to be detected, a residual sequence is determined from the difference between the time series prediction sequence and the basic time series sequence corresponding to the data to be detected, and whether the residual sequence meets the 3σ criterion is then judged to obtain the final result. This double anomaly detection yields a more accurate detection result.
The apparatus constructs an ARIMA model from historical time series data, detects abnormal points of the time series in the ARIMA model, determines an abnormal value threshold reference and a time series prediction sequence from the detection result, and then combines the abnormal value threshold reference with the time series prediction sequence when performing anomaly detection on the data to be detected, so that abnormal data can be detected more accurately.
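As a rough illustration of the training-side step summarized above, the sketch below flags abnormal points in one category's basic time sequence with a single-pass Grubbs-style comparison and then takes the mean and standard deviation of the remaining points as that category's threshold reference. The function name `grubbs_baseline` and the example series are hypothetical, and `critical_value` stands in for the value looked up in a Grubbs table (2.176 is the commonly tabulated figure for n = 10 at the 0.05 level and should be checked against the table actually used). Construction of the time series prediction sequence is not shown.

```python
from statistics import mean, stdev


def grubbs_baseline(series, critical_value):
    """Flag outliers in one category's base sequence and summarize the rest.

    Returns (outlier_indices, baseline_mean, baseline_std).
    """
    mu = mean(series)
    sigma = stdev(series)  # sample standard deviation
    if sigma == 0:
        # A constant sequence has no outliers by this measure.
        return [], mu, 0.0
    # Deviation of each point from the mean, in standard deviations,
    # compared against the table value.
    outliers = [
        i for i, x in enumerate(series)
        if abs(x - mu) / sigma > critical_value
    ]
    flagged = set(outliers)
    kept = [x for i, x in enumerate(series) if i not in flagged]
    # The threshold reference is taken from the non-abnormal points only.
    return outliers, mean(kept), stdev(kept)


# Hypothetical example data: nine ordinary readings and one spike.
series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 30]
outliers, base_mean, base_std = grubbs_baseline(series, critical_value=2.176)
print(outliers)  # [9]
```

The classical Grubbs test examines only the most extreme value and is repeated after each removal; the single pass here is a simplification that compares every point's deviation against the table value.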
The above-described time series data abnormality detecting apparatus may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 may be a server or a server cluster. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Referring to fig. 4, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform the time series data anomaly detection method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 performs the time series data anomaly detection method.
The network interface 505 is used for network communication, such as transmitting data. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only the part of the configuration related to aspects of the present invention and does not limit the computer device 500 to which aspects of the present invention may be applied; a particular computer device 500 may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the time series data anomaly detection method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the processor 502 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general purpose processor may be a microprocessor or any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the time series data anomaly detection method disclosed by the embodiments of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. The above-described device embodiments are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in an actual implementation: units having the same function may be grouped into one unit, a plurality of units or components may be combined or integrated into another device, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a storage medium. Based on such understanding, the part of the technical solution of the present invention that in essence contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a background server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting time series data abnormity is characterized by comprising the following steps:
acquiring historical time sequence data, and constructing a difference integration moving average autoregressive model based on the historical time sequence data;
verifying the seasonality of the historical time sequence data to obtain a verification result;
if the verification result is seasonal, taking the seasonal data as sample data of one category, and forming a sample data set by the sample data of multiple categories;
if the verification result is that the data is not seasonal, dividing the samples within a preset time range into sample data of one category, and forming a sample data set by the sample data of multiple categories;
performing time sequence anomaly detection on the sample data of each category in the sample data set based on a Grubbs model to obtain an anomaly detection result, and determining an abnormal value threshold reference and a time sequence prediction sequence of each category according to the anomaly detection result; and
when receiving data to be detected, performing anomaly detection on the data to be detected based on the abnormal value threshold reference of each category and the time sequence prediction sequence to obtain a detection result.
2. The method for detecting anomalies in time series data according to claim 1, wherein the acquiring historical time sequence data and constructing a difference integration moving average autoregressive model based on the historical time sequence data includes:
segmenting the historical time sequence data according to a preset time interval to obtain an original data sequence taking the preset time interval as a time interval;
performing a stationarity test on the original data sequence to obtain a stationarity test result;
if the stationarity test result is that the original data sequence is a non-stationary data sequence, making the original data sequence stationary by differencing to obtain a stationary data sequence;
if the stationarity test result is that the original data sequence is a stationary data sequence, taking the original data sequence as the stationary data sequence;
and fitting the stationary data sequence through an initial difference integration moving average autoregressive model, and updating the order of the initial difference integration moving average autoregressive model to obtain the difference integration moving average autoregressive model.
3. The method for detecting the anomaly of time series data according to claim 2, wherein the fitting the stationary data sequence through an initial difference integrated moving average autoregressive model, and updating the order of the initial difference integrated moving average autoregressive model to obtain the difference integrated moving average autoregressive model comprises:
and fitting the stationary data sequence through the initial difference integration moving average autoregressive model to determine the maximum lag order of the aperiodic autoregressive polynomial, the maximum lag order of the aperiodic moving average polynomial, the maximum lag order of the periodic autoregressive polynomial and the maximum lag order of the periodic moving average polynomial, wherein the difference integration moving average autoregressive model is obtained from the periodic difference times, the aperiodic difference times, the maximum lag order of the aperiodic autoregressive polynomial, the maximum lag order of the aperiodic moving average polynomial, the maximum lag order of the periodic autoregressive polynomial and the maximum lag order of the periodic moving average polynomial.
4. The method according to claim 1, wherein the performing time sequence anomaly detection on the sample data of each category in the sample data set based on the Grubbs model to obtain an anomaly detection result comprises:
taking the data of each category in the sample data set as a basic time sequence;
calculating the mean value and the standard deviation of each basic time sequence, and determining the deviation value of each time sequence data in the basic time sequence according to the mean value and the standard deviation;
comparing the deviation value with a preset value in a Grubbs table to obtain a comparison result;
and if the comparison result is that the deviation value is larger than the preset value, determining that the time sequence data corresponding to the deviation value is abnormal, and taking the time sequence data with the abnormality as an abnormal point.
5. The method according to claim 4, wherein if the comparison result indicates that the deviation value is greater than the predetermined value, determining that the time series data corresponding to the deviation value is abnormal, and taking the time series data with the abnormality as an abnormal point, further comprising:
and replacing abnormal points in the basic time sequence by preset values in the Grubbs table, and taking the obtained sequence as a time sequence prediction sequence.
6. The method for detecting the abnormality of the time series data according to claim 1, wherein the detecting the abnormality of the data to be detected based on the threshold reference of the abnormal value of each category and the time series prediction sequence to obtain a detection result comprises:
determining a deviation value of the data to be detected according to the data to be detected and an abnormal value threshold reference corresponding to the category;
if the deviation value of the data to be detected is larger than a preset value in a Grubbs table, determining a residual sequence based on the time sequence prediction sequence and a basic time sequence corresponding to the data to be detected;
if the residual sequence is determined to accord with the 3 sigma criterion, judging that the detection result is not abnormal;
and if the residual sequence is determined not to meet the 3 sigma criterion, judging that the detection result is abnormal.
7. The method according to claim 4, wherein the mean value and the standard deviation of the time series data at non-abnormal points in the basic time sequence are used as the abnormal value threshold reference of the category corresponding to the basic time sequence.
8. An apparatus for detecting abnormality in time-series data, comprising:
the first model acquisition unit is used for acquiring historical time series data and constructing a difference integration moving average autoregressive model based on the historical time series data;
the verification result acquisition unit is used for verifying the seasonality of the historical time series data to obtain a verification result;
the first dividing unit is used for taking the seasonal data as sample data of one category and forming a sample data set by the sample data of a plurality of categories if the verification result indicates that the seasonal data exists;
the second dividing unit is used for dividing the samples in the preset time range into sample data of one category and forming a sample data set by the sample data of a plurality of categories if the verification result shows that the samples do not have seasonality;
the second model acquisition unit is used for carrying out time sequence anomaly detection on the sample data of each category in the sample data set based on the Grubbs model to obtain an anomaly detection result, and determining an abnormal value threshold reference and a time sequence prediction sequence of each category according to the anomaly detection result; and
the detection result acquisition unit is used for carrying out anomaly detection on the data to be detected based on the abnormal value threshold reference of each category and the time sequence prediction sequence when the data to be detected is received, so as to obtain a detection result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the time series data anomaly detection method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the time-series data abnormality detection method according to any one of claims 1 to 7.
CN202210149024.8A 2022-02-18 2022-02-18 Time series data abnormity detection method, device, equipment and medium Pending CN114528934A (en)


Publications (1)

Publication Number Publication Date
CN114528934A true CN114528934A (en) 2022-05-24

Family

ID=81623472




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination