CN110399400B

CN110399400B - Method, device, equipment and medium for detecting abnormal data

Info

Publication number: CN110399400B
Application number: CN201811288656.2A
Authority: CN
Inventors: 胡天行; 杨凡; 戴兴虎; 黄斐
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2023-08-15
Anticipated expiration: 2038-10-31
Also published as: CN110399400A

Abstract

A method, apparatus, device and medium for detecting anomalous data in a data sequence are disclosed. The method for detecting the abnormal data comprises the following steps: extracting an analysis sample set of data to be analyzed from the data sequence; determining trend parameters of the analysis sample set; determining a confidence range of the data to be analyzed according to the trend parameters; and carrying out anomaly detection on the data to be analyzed according to the confidence range. By considering the data change trend of the data sequence, the confidence range of the data to be analyzed is adjusted according to the data change trend, so that the abnormality detection has stronger adaptability, and continuous over-high abnormality alarm and continuous over-low abnormality alarm can be avoided.

Description

Method, device, equipment and medium for detecting abnormal data

Technical Field

The present disclosure relates to detection of abnormal data, and more particularly, to a method, apparatus, device, and medium for detecting abnormal data.

Background

By performing anomaly monitoring on the key indicators of a service system, an alarm can be raised when abnormal fluctuation occurs in the service system, so that a service operator can promptly inspect the system and find and locate the problem.

Typically, data discrimination and anomaly detection are performed based on data statistics, i.e., with statistical discrimination algorithms. According to a statistical discrimination algorithm, a confidence interval is determined for a given confidence probability, and the data to be discriminated is identified as abnormal when it falls outside the confidence interval, that is, when the data fluctuates so sharply that it leaves the confidence range defined by the confidence interval. Common statistical discrimination algorithms are the Grubbs test and the Dixon test.

However, the conventional statistical discrimination algorithms do not consider the trend of data change, so consecutive anomaly alarms tend to occur when anomaly detection is performed on data that changes according to a certain trend. Moreover, the conventional statistical discrimination algorithms perform poorly in anomaly detection on periodically varying data.

Therefore, a method capable of performing good anomaly detection on time series data that exhibits a trend is required.

Disclosure of Invention

In view of the above problems, the present disclosure provides a method for detecting abnormal data in a data sequence, which adjusts a confidence range of data to be analyzed according to a data change trend by considering the data change trend of the data sequence, so that the abnormal detection has stronger adaptability, and can avoid continuous over-high abnormal alarm and continuous over-low abnormal alarm.

According to an aspect of the present disclosure, there is provided a method of detecting abnormal data in a data sequence, comprising: extracting an analysis sample set of data to be analyzed from the data sequence; determining trend parameters of the analysis sample set; determining a confidence range of the data to be analyzed according to the trend parameters; and carrying out anomaly detection on the data to be analyzed according to the confidence range.

In one embodiment, determining the confidence range of the data to be analyzed based on the trend parameter comprises: determining a confidence range adjustment coefficient of the data to be analyzed according to the trend parameter; and determining the confidence range of the data to be analyzed according to the statistical parameters of the analysis sample set and the confidence range adjustment coefficient.

In one embodiment, determining trend parameters of the analysis sample set includes: determining a linear regression slope of the sample data sequence in the analysis sample set.

In one embodiment, extracting an analysis sample set of data to be analyzed in a data sequence comprises: preprocessing the data sequence to obtain a ring ratio data sequence; and extracting the analysis sample set in the ring ratio data sequence.

In one embodiment, the data sequence is a periodic data sequence, and extracting an analysis sample set of data to be analyzed in the data sequence comprises: dividing the data sequence according to the data period of the data sequence; sequentially determining a preset number of data periods from the data to be analyzed as periods to be extracted; determining the position of the data to be analyzed in the data period; and extracting data at the position from each period to be extracted to obtain the analysis sample set.

According to another aspect of the present disclosure, there is provided an apparatus for detecting abnormal data in a data sequence, including: a sample set generation module configured to extract an analysis sample set of data to be analyzed in a data sequence; a trend determination module configured to determine a trend parameter of the analysis sample set; a threshold determination module configured to determine a confidence range of the data to be analyzed according to the trend parameter; and an abnormality detection module configured to perform anomaly detection on the data to be analyzed according to the confidence range.

In one embodiment, the sample set generation module further comprises: a data preprocessing sub-module configured to preprocess the data sequence to obtain a ring ratio data sequence; and a sample set construction sub-module configured to extract the analysis sample set in the ring ratio data sequence.

In one embodiment, the data sequence is a periodic data sequence, and the sample set generation module includes: a sequence dividing sub-module configured to divide the data sequence according to a data period of the data sequence; a period intercepting sub-module configured to sequentially determine a preset number of data periods, starting from the data to be analyzed, as periods to be extracted; and a position determination sub-module configured to determine a position of the data to be analyzed in its data period. The sample set generation module extracts the data at that position from each period to be extracted to obtain the analysis sample set.

According to still another aspect of the present disclosure, there is provided an apparatus for detecting abnormal data in a data sequence, including: a processor, and a memory containing a set of processor-executable instructions that, when executed by the processor, cause the apparatus to: extracting an analysis sample set of data to be analyzed from the data sequence; determining trend parameters of the analysis sample set; determining a confidence range of the data to be analyzed according to the trend parameters; and carrying out anomaly detection on the data to be analyzed according to the confidence range.

According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium, characterized in that computer-readable instructions are stored thereon, which when executed by a computer, perform the above-described method.

By using the method and the device for detecting the abnormal data in the data sequence, which are provided by the invention, the confidence range of the data to be analyzed is adjusted according to the data change trend by considering the data change trend of the data sequence, so that the abnormal detection has stronger adaptability, and continuous over-high abnormal alarm and continuous over-low abnormal alarm can be avoided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without making creative efforts to one of ordinary skill in the art. The following drawings are not intended to be drawn to scale on actual dimensions, emphasis instead being placed upon illustrating the principles of the disclosure.

FIG. 1 shows a schematic diagram of a time data sequence;

FIG. 2 illustrates an exemplary flow chart of a method of detecting anomalous data in a data sequence in accordance with an embodiment of the disclosure;

FIG. 3 illustrates an exemplary flow chart for extracting an analysis sample set in accordance with an embodiment of the present disclosure;

FIG. 4A shows an exemplary illustration of the analysis sample sets of an aperiodic time series data sequence and of a periodic time series data sequence, according to an embodiment of the disclosure;

FIG. 4B illustrates an exemplary flow chart of a method of extracting an analysis sample set from a time series data sequence having a periodic variation law, according to an embodiment of the present disclosure;

FIG. 5 illustrates an exemplary flow chart for determining confidence limits for data to be analyzed based on trend parameters, according to an embodiment of the present disclosure;

FIG. 6 illustrates a more specific exemplary flow chart for determining confidence limits for data to be analyzed based on trend parameters based on a Grubbs' test algorithm, according to an embodiment of the present disclosure;

FIG. 7 illustrates an exemplary flowchart of a method of determining confidence limits for data to be analyzed based on a Grubbs' test algorithm, according to an embodiment of the present disclosure;

FIG. 8 illustrates one example of an application of a method of detecting anomalous data in accordance with an embodiment of the disclosure;

FIG. 9 illustrates an exemplary block diagram of an apparatus for detecting anomalous data in a data sequence in accordance with an embodiment of the disclosure;

FIG. 10 illustrates an exemplary block diagram of a sample set generation module in an apparatus for detecting anomalous data in accordance with an embodiment of the disclosure;

FIG. 11 shows a schematic block diagram of a threshold determination module in an apparatus for detecting anomalous data in accordance with an embodiment of the disclosure;

FIG. 12 illustrates an exemplary block diagram of an apparatus for detecting anomalous data in a data sequence in accordance with an embodiment of the disclosure.

Detailed Description

The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, which can be made by one of ordinary skill in the art without undue burden based on the embodiments of the present disclosure, are also within the scope of the present disclosure.

As used in the specification and in the claims, the terms "a," "an," and "the" do not refer specifically to the singular, but may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.

A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.

In some business systems, as time progresses, it is necessary to continually update its key indicators and accordingly evaluate and further guide the operation of the business system. By constantly monitoring the key indicators, a data sequence is generated that continuously increases new data over time, i.e. a time data sequence. For example, in the financial arts, key indicators of a financial business system may include the number of user openings, the amount purchased, the amount redeemed, and so forth; in the gaming field, key indicators of a gaming system may include new user registration number, user login amount, user online time, game equipment sales amount, and so forth; in the field of the internet of things, key indexes of the business system of the electric vehicle can comprise the registration number of new users, sales amount and the like.

In some business systems, the time data sequence of the key indicator shows a certain trend. For example, (A) in FIG. 1 shows a time data sequence with an upward trend, and (B) in FIG. 1 shows a time data sequence with a downward trend. Neither the common Grubbs test algorithm nor the Dixon test algorithm takes the trend of the data sequence into account. When the data sequence shows a certain trend, such as an upward or downward trend, the Grubbs test algorithm tends to raise consecutive abnormal data alarms, so that the false alarm rate is high and the accuracy is low. The Dixon test algorithm sorts the data in the data sequence by magnitude, which scrambles the original order of the data; the temporal relationship of the data in the sequence is not considered at all, and the accuracy of its anomaly alarms is low.

According to the embodiment of the disclosure, a method for detecting abnormal data in a data sequence is provided, which enables the abnormal detection to be more efficient and accurate by considering the change trend of the data sequence in the data abnormal detection process.

Next, a method 200 of detecting abnormal data in a data sequence according to an embodiment of the present disclosure will be described with reference to fig. 2.

First, in step S210, an analysis sample set of data to be analyzed is extracted in a data sequence. The data to be analyzed may be currently monitored data, such as total user account opening amount, user account opening amount on the same day, game equipment sales amount on the same day, and the like.

According to an embodiment of the present disclosure, the data sequence may be a data sequence of an incremental data type, or may be a data sequence of a cumulative total data type. Cumulative total data is accumulated and keeps growing, such as the total number of accounts opened; incremental data is non-cumulative, such as the number of new accounts opened. Further, according to an embodiment of the present disclosure, the data sequence may be a time-series data sequence without a periodic variation law, or may be a time-series data sequence exhibiting a periodic variation law.

According to embodiments of the present disclosure, the number of samples in an analysis sample set may be selected according to system design requirements, which may include, for example, system complexity, accuracy, false positive rate, etc.

In step S220, trend parameters of the analysis sample set are determined. The trend parameters of the analysis sample set may include linear slope, tangential slope of a curve, and so forth.

According to an embodiment of the present disclosure, a trend parameter of an analysis sample set is determined based on a data value of each data sample in the analysis sample set and a time parameter thereof. Specifically, for the data to be analyzed, according to the data value and the time parameter of the data sample, determining the change trend of the analysis sample set at the time point of the data to be analyzed.

Next, in step S230, a confidence range of the data to be analyzed is determined according to the determined trend parameter of the analysis sample set. According to the embodiment of the disclosure, after determining the change trend of the position of the data to be analyzed, the confidence range of the data to be analyzed is determined based on the data value of each data sample in the analysis sample set and the change trend of the time point of the data to be analyzed.

For example, the confidence range of the data to be analyzed may be provided by a single-sided threshold or a double-sided threshold. If only too high data or only too low data need to be detected in anomaly detection, a single-sided threshold may be employed to define the confidence range. If both too high and too low data are to be detected in anomaly detection, a double-sided threshold may be employed to define confidence ranges, which may include an upper-sided threshold and a lower-sided threshold.

Then, in step S240, according to the confidence range, abnormality detection is performed on the data to be analyzed.

Alternatively, according to the embodiment of the present disclosure, an abnormality detection result of the data to be analyzed may also be output. For example, an anomaly alert may be output only when anomaly data is detected; different abnormal warnings can be generated according to whether the data to be analyzed are too high abnormality or too low abnormality.

According to the embodiment of the disclosure, during anomaly detection, the confidence range of the data to be analyzed is adjusted according to the data change trend by considering the data change trend of the data sequence, so that the anomaly detection has stronger adaptability, and continuous over-high anomaly alarm and continuous over-low anomaly alarm can be avoided.

Next, an exemplary implementation of extracting an analysis sample set in step S210 according to an embodiment of the present disclosure will be described with reference to fig. 3.

As shown in FIG. 3, in step S2110, the data sequence is preprocessed to obtain a ring ratio data sequence.

In one embodiment, where the data sequence of the cumulative total data type is $y_0, y_1, y_2, \ldots, y_{N-1}, y_N$, it can be preprocessed through formula (1) to obtain the ring ratio data sequence $a_1, a_2, \ldots, a_{M-1}, a_M$. For example, the data sequence of the cumulative total data type may include a holdings amount, a cumulative user count, and so forth.

In one embodiment, where the data sequence of the incremental data type is $\Delta y_1, \Delta y_2, \ldots, \Delta y_{N-1}, \Delta y_N$, it can be preprocessed through formula (2) to obtain the ring ratio data sequence $a_1, a_2, \ldots, a_{M-1}, a_M$.
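As a rough illustration of this preprocessing step, the following sketch assumes that the ring ratio is the period-over-period growth rate of the increments, i.e. $a_i = (\Delta y_i - \Delta y_{i-1}) / \Delta y_{i-1}$, and that cumulative totals are first differenced into increments. This definition is an assumption and may differ from formulas (1) and (2); the function names are hypothetical.

```python
from typing import List, Sequence


def increments_from_cumulative(y: Sequence[float]) -> List[float]:
    """Turn cumulative totals y_0..y_N into increments dy_1..dy_N."""
    return [y[i] - y[i - 1] for i in range(1, len(y))]


def ring_ratio(increments: Sequence[float]) -> List[float]:
    """ASSUMED definition: growth rate of each increment over the previous one."""
    ratios = []
    for prev, cur in zip(increments, increments[1:]):
        if prev == 0:
            raise ValueError("previous increment is zero; ratio is undefined")
        ratios.append((cur - prev) / prev)
    return ratios


cumulative_totals = [100, 110, 132, 165]
daily_increments = increments_from_cumulative(cumulative_totals)
print(daily_increments)              # [10, 22, 33]
print(ring_ratio(daily_increments))  # [1.2, 0.5]
```

Under this assumption, an incremental sequence is passed to `ring_ratio` directly, while a cumulative sequence goes through `increments_from_cumulative` first.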

In step S2120, an analysis sample set is extracted from the ring ratio data sequence. In the following description, $a_M$ is referred to as the data to be analyzed.

In the case where the data sequence is a time-series data sequence without a periodic variation law, a predetermined number of data may be extracted consecutively, counting back from the data to be analyzed, to constitute an analysis sample set. According to the embodiment of the disclosure, for example, the number of samples in the analysis sample set is selected to be N according to the system design requirements, and N data are extracted consecutively from the data sequence, counting back from the data to be analyzed, to serve as the analysis sample set. As shown in (1) of FIG. 4A, the original time series data sequence is $a_0, a_1, \ldots, a_{M-N}, a_{M-N+1}, \ldots, a_{M-1}, a_M$, where it needs to be determined whether the data $a_M$ is abnormal. A predetermined number N of data are extracted, counting back from the data to be analyzed $a_M$, to form the analysis sample set $x_1, x_2, \ldots, x_{N-1}, x_N$, where $x_N$ is the data to be analyzed (i.e., $x_N = a_M$) and $x_1, x_2, \ldots, x_{N-1}$ are the time series data samples before the data to be analyzed. In the embodiments of the present disclosure, for convenience of description, $x_N$ and $a_M$ are both referred to as the data to be analyzed in this case and are no longer distinguished. Furthermore, for a time series data sequence, each data sample also has a corresponding time parameter, and the time parameters of the data samples in the analysis sample set can be marked in turn as $t_1, t_2, \ldots, t_{N-1}, t_N$. For example, where the data samples are daily data samples, the time parameters may be marked in sequence as $t_1 = 1, t_2 = 2, \ldots, t_{N-1} = N-1, t_N = N$.
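The aperiodic extraction can be sketched as taking the last N elements of the sequence, with the data to be analyzed as the final sample. The function name `extract_recent_samples` is hypothetical, and the time parameters 1..N follow the daily-sample example in the text.

```python
from typing import List, Sequence, Tuple


def extract_recent_samples(a: Sequence[float], n: int) -> Tuple[List[float], List[int]]:
    """Return (x_1..x_N, t_1..t_N), where x_N is the last element of a (the data to be analyzed)."""
    if n < 1 or n > len(a):
        raise ValueError("n must be between 1 and len(a)")
    samples = list(a[-n:])          # x_1 = a_{M-N+1}, ..., x_N = a_M
    times = list(range(1, n + 1))   # t_i = i, as in the daily-sample example
    return samples, times


series = [0.02, 0.03, 0.01, 0.04, 0.05, 0.07]
x, t = extract_recent_samples(series, 4)
print(x)  # [0.01, 0.04, 0.05, 0.07]
print(t)  # [1, 2, 3, 4]
```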

In the case where the data sequence is a time-series data sequence having a periodic variation law, it is necessary to extract data samples at the same position from each period to constitute an analysis sample set. The data period of the data sequence may be one week, one month, one year, etc. As shown in (2) of FIG. 4A, the original time-series data sequence is $a_0, a_1, \ldots, a_{M-Np}, \ldots, a_{M-(N-1)p}, \ldots, a_{M-p}, \ldots, a_{M-1}, a_M$, where it needs to be determined whether the data $a_M$ is abnormal. Assuming that the period of the original time series data sequence is p, the data $a_M, a_{M-p}, \ldots, a_{M-(N-2)p}, a_{M-(N-1)p}$ are extracted to construct the analysis sample set $x_1, x_2, \ldots, x_{N-1}, x_N$. Specifically, as shown in (2) of FIG. 4A, $x_N$ is the data to be analyzed (i.e., $x_N = a_M$), and $x_{N-1} = a_{M-p}$, $x_{N-2} = a_{M-2p}$, …, $x_2 = a_{M-(N-2)p}$, $x_1 = a_{M-(N-1)p}$. In the embodiments of the present disclosure, for convenience of description, $x_N$ and $a_M$ are both referred to as the data to be analyzed in this case and are no longer distinguished. Similarly, each data sample has a corresponding time parameter, and the time parameters of the data samples in the analysis sample set may be marked in turn as $t_1, t_2, \ldots, t_{N-1}, t_N$.

In one embodiment, after determining the data period p of the original time series data sequence, the data to be analyzed $a_M$, the data $a_{M-p}$ located p positions before the data to be analyzed, the data $a_{M-2p}$ located 2p positions before the data to be analyzed, …, the data $a_{M-(N-2)p}$ located (N-2)p positions before the data to be analyzed, and the data $a_{M-(N-1)p}$ located (N-1)p positions before the data to be analyzed may be extracted to form an analysis sample set comprising N data samples.

In another embodiment, as shown in FIG. 4B, after determining the data period p of the original time-series data sequence, the original time-series data sequence may be divided according to the data period p in step S410; the data period in which the data to be analyzed is located and the N-1 periods before it are sequentially determined as the N periods to be extracted in step S420; the position of the data to be analyzed within its data period is determined in step S430; and the data at that position is then extracted from each period to be extracted in step S440 to obtain the analysis sample set. The analysis sample set comprising N data samples is thus formed from the data to be analyzed together with the data at the same position in each of the N-1 preceding periods. It should be appreciated that the operations of step S420 and step S430 may be performed in parallel, step S420 may be performed before step S430, or step S430 may be performed before step S420. For example, the data period of the data sequence may be one week, in which case the data sequence is divided into data periods of one week, each data period comprising one week of data samples. Depending on the characteristics of the business system, one week of data samples may include only the data samples of Monday through Friday, or may include the data samples of Monday through Sunday, which is not limited by the embodiments of the present disclosure. Similarly, in the case where the data period is one month, one month of data samples may include only the data samples of working days, or may include the data samples of every day of the month.
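Both periodic extraction variants can be sketched as follows. The function names are hypothetical, and the period-division variant assumes that periods are aligned so that $a_0$ starts a period, which the text does not specify. Under that assumption the two variants select the same samples.

```python
from typing import List, Sequence


def extract_by_stride(a: Sequence[float], p: int, n: int) -> List[float]:
    """Direct variant: x_N = a_M, x_{N-1} = a_{M-p}, ..., x_1 = a_{M-(N-1)p}."""
    m = len(a) - 1
    if p < 1 or n < 1 or m - (n - 1) * p < 0:
        raise ValueError("sequence too short for n samples at period p")
    return [a[m - j * p] for j in range(n - 1, -1, -1)]


def extract_by_periods(a: Sequence[float], p: int, n: int) -> List[float]:
    """Steps S410-S440, ASSUMING a_0 is the first element of a period."""
    if p < 1 or n < 1 or not a:
        raise ValueError("invalid arguments")
    m = len(a) - 1
    periods = [a[i:i + p] for i in range(0, len(a), p)]   # S410: divide by period
    last = m // p                                          # period containing a_M
    first = last - (n - 1)
    if first < 0:
        raise ValueError("not enough periods")
    chosen = periods[first:last + 1]                       # S420: N periods to extract
    pos = m % p                                            # S430: position inside the period
    return [period[pos] for period in chosen]              # S440: same position in each period


data = list(range(20))                    # a_0..a_19, weekly period p = 7
print(extract_by_stride(data, 7, 3))      # [5, 12, 19]
print(extract_by_periods(data, 7, 3))     # [5, 12, 19]
```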

An exemplary implementation of determining confidence ranges of data to be analyzed according to trend parameters in step S230 according to an embodiment of the present disclosure is described below with reference to fig. 5.

As shown in fig. 5, at step S2310, statistical parameters of the analysis sample set are determined. According to embodiments of the present disclosure, the statistical parameters may include an average value and a standard deviation of the analysis sample set.

In one embodiment, the average may be a simple average of the analysis sample set. In this case, in step S2310, the average value of the analysis sample set is determined as shown in formula (3), and the standard deviation of the analysis sample set is determined as shown in formula (4).
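A minimal sketch of these statistical parameters, using Python's standard `statistics` module. It assumes the sample standard deviation (N-1 denominator), which is the usual convention for the Grubbs test; formula (4) may use a different denominator. The function name is hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def sample_statistics(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (simple average, sample standard deviation) of the analysis sample set."""
    mu = statistics.fmean(samples)
    s = statistics.stdev(samples)   # ASSUMPTION: N-1 denominator; needs at least 2 samples
    return mu, s


mu, s = sample_statistics([2, 4, 4, 4, 5, 5, 7, 9])
print(mu)           # 5.0
print(round(s, 3))  # 2.138
```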

In another embodiment, the average value of the analysis sample set may be a weighted average value of the analysis sample set. In some scenarios, historical data affects the prediction of future data to different degrees: data far from the target time has a relatively low influence on the prediction, while data near the target time has a relatively high influence. Thus, data far from the target time may be given a lower weight, while data near the target time may be given a higher weight. In this case, in step S2310, it is first necessary to determine the weight value of each data sample in the analysis sample set. For example, the weight value $\omega_i$ of each data sample may be determined according to the characteristics of the indicator, the variation law of the indicator, the total number of data samples in the analysis sample set, and the like, and the weighted average of the data samples in the analysis sample set is then calculated according to the weight values $\omega_i$. Equation (5) gives the constraint on the weight values $\omega_i$ and the calculation of the weighted average.
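The weighted average can be sketched as below. Two points are assumptions rather than part of the text: that the constraint in equation (5) is that the weights sum to 1, and the linearly increasing weighting scheme in `linear_weights`, which is only one hypothetical way of giving recent samples more weight.

```python
from typing import List, Sequence


def linear_weights(n: int) -> List[float]:
    """HYPOTHETICAL scheme: weight grows linearly with recency and sums to 1."""
    total = n * (n + 1) / 2
    return [i / total for i in range(1, n + 1)]


def weighted_average(samples: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average under the ASSUMED constraint sum(weights) == 1."""
    if len(samples) != len(weights):
        raise ValueError("samples and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, samples))


samples = [10.0, 20.0, 30.0]           # x_1 (oldest) .. x_3 (most recent)
weights = linear_weights(len(samples))  # 1/6, 2/6, 3/6
print(round(weighted_average(samples, weights), 4))  # 23.3333
```

With these weights the result (about 23.33) sits above the simple average of 20, reflecting the larger weight on the most recent sample.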

Then, in step S2320, a confidence range adjustment coefficient of the data to be analyzed is determined according to the trend parameter.

Finally, at step S2330, a confidence range of the data to be analyzed is determined based on the statistical parameters of the analysis sample set determined at step S2310 and the confidence range adjustment coefficients of the data to be analyzed determined at step S2320.

Next, a manner of determining the confidence range according to the embodiment of the present disclosure will be described with reference to fig. 6.

If only too high data or only too low data need to be detected in anomaly detection, a single-sided threshold may be employed to define the confidence range. In particular, the confidence range may be defined by an upper threshold, in which case the data to be analyzed is considered normal data only if it is below the upper threshold, whereas data that is too high is detected as abnormal data. Alternatively, the confidence range may be defined by a lower threshold, in which case the data to be analyzed is considered normal data only if it is higher than the lower threshold, while data that is too low is detected as abnormal data.

If both too high and too low data are to be detected in anomaly detection, a double-sided threshold may be employed to define confidence ranges, which may include an upper-sided threshold and a lower-sided threshold. In this case, the data to be analyzed is considered as normal data only if the data to be analyzed is between the upper threshold and the lower threshold; and in the case that the data to be analyzed is higher than the upper threshold or lower than the lower threshold, the data to be analyzed is detected as abnormal data; whereby both the excessively high data and the excessively low data are detected as abnormal data.
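The one-sided and two-sided checks can be sketched with a single function in which either threshold is optional. The name `classify` and the returned labels are hypothetical, and treating a value exactly equal to a threshold as normal is an assumption, since the text does not settle the boundary case.

```python
from typing import Optional


def classify(value: float,
             upper: Optional[float] = None,
             lower: Optional[float] = None) -> str:
    """Return 'too_high', 'too_low' or 'normal' for one-sided or two-sided thresholds."""
    if upper is None and lower is None:
        raise ValueError("at least one threshold is required")
    if upper is not None and value > upper:
        return "too_high"
    if lower is not None and value < lower:
        return "too_low"
    return "normal"   # ASSUMPTION: values exactly on a threshold count as normal


print(classify(120.0, upper=112.0))             # too_high (upper-only check)
print(classify(80.0, upper=112.0))              # normal   (too-low data is not checked)
print(classify(80.0, upper=112.0, lower=92.0))  # too_low  (two-sided check)
```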

According to an embodiment of the disclosure, in a case where the confidence range is a one-sided threshold, the confidence range adjustment coefficient may be a single adjustment coefficient; in the case where the confidence range is a two-sided threshold, an upper threshold adjustment coefficient and a lower threshold adjustment coefficient may be determined for the upper threshold and the lower threshold, respectively.

In one embodiment, the confidence range including the two-sided threshold may be determined based on the Grubbs test algorithm. In this embodiment, as shown in FIG. 6, at step S610, the average value and standard deviation of the analysis sample set are first determined. The operation of step S610 is similar to that of step S2310 and will not be described again here. Next, in step S620, an upper threshold adjustment coefficient $F_{up}$ and a lower threshold adjustment coefficient $F_{down}$ are respectively determined according to the trend parameter; the operation of step S620 is an example implementation of step S2320. In step S630, the reference upper critical value $G_{up}$ and the reference lower critical value $G_{down}$ determined based on the Grubbs test algorithm are adjusted based on the determined upper threshold adjustment coefficient $F_{up}$ and lower threshold adjustment coefficient $F_{down}$. Then, in step S640, the upper threshold $X_{up}$ and the lower threshold $X_{down}$ of the data to be analyzed are determined using the adjusted upper critical value $G'_{up}$ and lower critical value $G'_{down}$. The operations of steps S630 and S640 are one example implementation of step S2330.

For example, the reference upper critical value $G_{up}$ and the reference lower critical value $G_{down}$ determined from the Grubbs test critical value table may be adjusted as shown in equation (6), and the upper threshold $X_{up}$ and the lower threshold $X_{down}$ of the data to be analyzed may be determined as shown in equation (7), where $\mu$ and $s$ are the average value and the standard deviation of the analysis sample set.

$$G'_{up} = F_{up} \cdot G_{up}$$

$$G'_{down} = F_{down} \cdot G_{down} \tag{6}$$

$$X_{up} = \mu + G'_{up} \cdot s$$

$$X_{down} = \mu - G'_{down} \cdot s \tag{7}$$
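Equations (6) and (7) translate directly into a few lines of arithmetic. In the sketch below the function name is hypothetical, and the numeric inputs are made-up illustration values: an actual Grubbs critical value depends on the sample size and the chosen significance level and would be read from the critical value table.

```python
from typing import Tuple


def adjusted_thresholds(mu: float, s: float,
                        g_up: float, g_down: float,
                        f_up: float, f_down: float) -> Tuple[float, float]:
    """Apply equation (6) to adjust the critical values, then equation (7) for the thresholds."""
    g_up_adj = f_up * g_up          # G'_up   = F_up   * G_up
    g_down_adj = f_down * g_down    # G'_down = F_down * G_down
    x_up = mu + g_up_adj * s        # X_up    = mu + G'_up   * s
    x_down = mu - g_down_adj * s    # X_down  = mu - G'_down * s
    return x_up, x_down


# Illustrative values only: mu = 100, s = 5, G_up = G_down = 2.0,
# with the upper side widened (F_up = 1.2) and the lower side narrowed (F_down = 0.8).
x_up, x_down = adjusted_thresholds(100.0, 5.0, 2.0, 2.0, 1.2, 0.8)
print(round(x_up, 6), round(x_down, 6))  # 112.0 92.0
```

With both adjustment coefficients equal to 1, the thresholds reduce to the unadjusted values $\mu \pm G \cdot s$.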

According to the embodiment of the disclosure, in the case where the confidence range is defined by the one-side threshold value, a plurality of one-side threshold values may also be set, so that the degree of abnormality may be further detected. For example, two upper thresholds may be set: a first threshold and a second threshold, the second threshold being greater than the first threshold, the data to be analyzed being detectable as a general anomaly when the data to be analyzed exceeds the first threshold; and when the data to be analyzed exceeds the second threshold, the data to be analyzed may be detected as a serious abnormality.
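The two-level upper threshold example can be sketched as follows. The function name and the returned labels are hypothetical, and treating a value exactly equal to a threshold as not exceeding it is an assumption.

```python
def upper_severity(value: float, first_threshold: float, second_threshold: float) -> str:
    """Grade a too-high value against two upper thresholds (second > first)."""
    if second_threshold <= first_threshold:
        raise ValueError("second_threshold must be greater than first_threshold")
    if value > second_threshold:
        return "serious"
    if value > first_threshold:
        return "general"
    return "normal"


print(upper_severity(105.0, 110.0, 120.0))  # normal
print(upper_severity(115.0, 110.0, 120.0))  # general
print(upper_severity(125.0, 110.0, 120.0))  # serious
```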

According to the embodiment of the present disclosure, in the case where the confidence range is defined by the two-sided threshold value, a plurality of upper-sided threshold values and a plurality of lower-sided threshold values may also be similarly set, so that the degree of abnormality may be further detected.

Alternatively, according to the embodiments of the present disclosure, different anomaly alerts may be generated according to the degree to which the data to be analyzed deviates from the upper threshold and the degree to which it deviates from the lower threshold. For example, different anomaly alerts may be distinguished by color, pitch of the alert tone, melody of the alert tone, number of alert marks, etc.

Next, an example determination manner of the trend parameter of the analysis sample set in step S220 according to an embodiment of the present disclosure will be described with reference to fig. 7. The trend parameter of the analysis sample set may be a trend of the analysis sample set data sequence over a time sequence. As an example, a linear regression slope is described.

For an analysis sample set comprising N data samples, the N data samples may be sequentially labeled as $x_1, x_2, \ldots, x_{N-1}, x_N$, where $x_N$ is the data to be analyzed, and the time parameters of the N data samples may be sequentially marked as $t_1, t_2, \ldots, t_{N-1}, t_N$. For example, where the data samples are daily data samples, the time parameters may be marked in sequence as $t_1 = 1, t_2 = 2, \ldots, t_{N-1} = N-1, t_N = N$.

The linear regression slope of the N data samples is calculated according to equation (8):

where $\bar{x}$ is the simple average of the N data samples, and $\bar{t}$ is the time average of the analysis sample set, as shown in equation (9).
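As a rough illustration of a linear regression slope over the samples and their time parameters, the sketch below uses the ordinary least-squares slope, i.e. the sum of $(t_i - \bar{t})(x_i - \bar{x})$ divided by the sum of $(t_i - \bar{t})^2$. That this is the exact form of equation (8) is an assumption, and the function name `linear_regression_slope` is hypothetical.

```python
def linear_regression_slope(x, t=None):
    """Ordinary least-squares slope of samples x against time parameters t.

    Assumption: equation (8) is taken to be the standard least-squares slope.
    If t is omitted, the time parameters default to 1, 2, ..., N.
    """
    n = len(x)
    if n < 2:
        raise ValueError("at least two samples are required")
    if t is None:
        t = list(range(1, n + 1))
    if len(t) != n:
        raise ValueError("x and t must have the same length")
    t_bar = sum(t) / n  # time average of the analysis sample set
    x_bar = sum(x) / n  # simple average of the data samples
    numerator = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    denominator = sum((ti - t_bar) ** 2 for ti in t)
    if denominator == 0:
        raise ValueError("time parameters must not all be equal")
    return numerator / denominator


if __name__ == "__main__":
    print(linear_regression_slope([1.0, 3.0, 5.0, 7.0]))
```

A positive slope indicates a rising trend in the analysis sample set and a negative slope a falling one.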

Referring to fig. 7, an exemplary implementation of confidence-range adjustment coefficient determination and confidence-range adjustment is still illustrated using the Grubbs (Grubbs) test algorithm as an example.

As shown in fig. 7, in step S710, a simple average and standard deviation of the N data samples in the analysis sample set are determined. In step S720, a linear regression slope k of the analysis sample set is calculated according to the above formula (8) based on the simple average and standard deviation of the N data samples in the analysis sample set. In step S730, an upper threshold adjustment coefficient $F_{up}$ and a lower threshold adjustment coefficient $F_{down}$ are calculated using the calculated linear regression slope of the analysis sample set. The operation of step S730 is an example implementation of step S610.

For example, in step S730, the determined linear regression slope k of the analysis sample set is mapped to the interval (-1, 1) by a sigmoid function transformation using formula (10), and the upper threshold adjustment coefficient $F_{up}$ and the lower threshold adjustment coefficient $F_{down}$ are calculated using formula (11).
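One common way to map a real-valued slope into (-1, 1) with a sigmoid is $2\sigma(k) - 1$, which equals $\tanh(k/2)$. The sketch below shows only this mapping step. Whether formula (10) has exactly this form is an assumption, the name `squash_slope` is hypothetical, and the derivation of the adjustment coefficients in formula (11) is deliberately left out.

```python
import math


def squash_slope(k):
    """Map a slope k to (-1, 1) using 2 * sigmoid(k) - 1.

    Assumed form of the sigmoid transformation; tanh(k / 2) is the same
    function and avoids overflow in exp() for large negative k.
    """
    return math.tanh(k / 2.0)


if __name__ == "__main__":
    for k in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(k, squash_slope(k))
```

The mapping is odd and monotonic, so rising and falling trends of equal magnitude produce values of equal magnitude and opposite sign.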

Next, in step S740, the reference upper critical value $G_{up}$ and the reference lower critical value $G_{down}$ determined based on the Grubbs test algorithm are adjusted based on the determined upper threshold adjustment coefficient $F_{up}$ and lower threshold adjustment coefficient $F_{down}$, for example according to the above formula (6).

Then, in step S750, the upper threshold $X_{up}$ and the lower threshold $X_{down}$ of the data to be analyzed are determined using the adjusted upper critical value $G'_{up}$ and the adjusted lower critical value $G'_{down}$, for example according to the above formula (7).
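The last step can be sketched as follows, under the assumption that formula (7) follows the usual Grubbs relation in which a critical value is a distance from the mean measured in standard deviations, so that the upper threshold is the mean plus $G'_{up}$ standard deviations and the lower threshold is the mean minus $G'_{down}$ standard deviations. The adjusted critical values are passed in as plain numbers; the function names and the use of the sample standard deviation are assumptions.

```python
import statistics


def thresholds(samples, g_up_adj, g_down_adj):
    """Return (x_up, x_down) from already-adjusted critical values.

    Assumed form: x_up = mean + g_up_adj * std, x_down = mean - g_down_adj * std,
    with std being the sample standard deviation (n - 1 denominator).
    """
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return mean + g_up_adj * std, mean - g_down_adj * std


def is_abnormal(value, x_up, x_down):
    """A value outside [x_down, x_up] is treated as abnormal."""
    return value > x_up or value < x_down


if __name__ == "__main__":
    history = [1.0, 2.0, 3.0, 4.0, 5.0]
    x_up, x_down = thresholds(history, g_up_adj=2.0, g_down_adj=2.0)
    print(x_up, x_down, is_abnormal(7.0, x_up, x_down))
```

Using separate upper and lower critical values lets a rising trend widen the upper side of the range without loosening the lower side, and vice versa.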

According to the embodiment of the invention, the Grubbs test algorithm is adjusted by taking the change trend of the data sequence into account, so that the upper or lower critical value can be adaptively adjusted according to the change trend. Continuous abnormality alarms can thereby be avoided, and the accuracy of abnormality detection is improved.

Next, a method of detecting abnormal data according to an embodiment of the present disclosure will be described taking time-series data of cumulative total data having a periodic variation law as an example.

Referring to fig. 8, first, the raw data sequence of the cumulative total data is subjected to data preprocessing; for example, the above-described step S2110 is performed to preprocess the data sequence of the cumulative total data into a ring ratio data sequence.

Then, because the data sequence has a periodic variation law, the ring ratio data sequence obtained after preprocessing is divided in order to reconstruct a subsequence. For example, the above step S2120 is performed, or the above steps S410 to S440 are performed. Taking a natural week as the data period (p=7) and a data point to be analyzed that falls on a Wednesday (t=3) as an example, the ring ratio data sequence obtained after preprocessing is divided according to the data period of one week, each data period including one week of data samples, and then the Wednesday data is extracted from each data period to reconstruct the subsequence to be analyzed as the analysis sample set.
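The subsequence reconstruction described above can be sketched as follows. The sketch assumes that the last element of the input is the data to be analyzed and that the sequence has no missing days, so stepping back one period at a time lands on the same position in each earlier period. The name `same_position_samples` is hypothetical.

```python
def same_position_samples(sequence, period, n_periods):
    """Collect the samples at the same in-period position as the last element.

    Returns n_periods values, oldest first, ending with the last element
    (assumed to be the data to be analyzed).
    """
    if period < 1 or n_periods < 1:
        raise ValueError("period and n_periods must be positive")
    last = len(sequence) - 1
    first = last - (n_periods - 1) * period
    if first < 0:
        raise ValueError("not enough history for the requested number of periods")
    return [sequence[i] for i in range(first, last + 1, period)]


if __name__ == "__main__":
    # 17 daily values starting on a Monday; indices 2, 9 and 16 are Wednesdays.
    daily = list(range(17))
    print(same_position_samples(daily, period=7, n_periods=3))  # [2, 9, 16]
```

Comparing a Wednesday only with earlier Wednesdays keeps the weekly pattern itself from being mistaken for an abnormality.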

Next, the relevant data of the reconstructed subsequence is calculated. For example, the above step S2310 is performed to determine statistical parameters of the analysis sample set, which may include an average value and a standard deviation of the analysis sample set. For example, the average value may be a weighted average of the analysis sample set.

The critical value of the Grubbs test algorithm is then corrected. For example, the above steps S220 and S730 are first performed to determine trend parameters of the analysis sample set, which may include a linear slope, a tangential slope of a curve, and so on. Then, the above steps S2320 and S2330 are performed, and the above steps S620-S630 or steps S740-S750 are performed: the upper threshold (upper critical value) adjustment coefficient $F_{up}$ and the lower threshold (lower critical value) adjustment coefficient $F_{down}$ may be calculated as shown in formulas (10) and (11), and the reference upper critical value $G_{up}$ and the reference lower critical value $G_{down}$ determined from the Grubbs test critical value table may be adjusted as shown in equation (6).

Finally, anomaly detection is performed using the corrected Grubbs test critical values. For example, the upper threshold $X_{up}$ and the lower threshold $X_{down}$ of the data to be analyzed may be determined as shown in equation (7), and anomaly detection is performed according to the determined upper threshold $X_{up}$ and lower threshold $X_{down}$.

According to the present disclosure, the confusion matrix, a visualization tool commonly used in machine learning, is again used to compare the results of detecting abnormal data according to embodiments of the present disclosure with the detection results of the conventional Grubbs method. The comparison results are given taking a total of 400 samples as an example.

In the case of an anomaly alarm, the true positive TP in the table represents a positive sample that is correctly classified (i.e., an abnormal value that is correctly detected as abnormal), the false negative FN represents a positive sample that is incorrectly classified (i.e., an abnormal value that is incorrectly detected as normal), the false positive FP represents a negative sample that is incorrectly classified (i.e., a normal value that is incorrectly detected as abnormal), and the true negative TN represents a negative sample that is correctly classified (i.e., a normal value that is correctly identified as normal).

Here, accuracy represents the proportion of correctly classified samples to the total number of samples; recall represents the proportion of correctly classified positive samples to the total number of positive samples; and F1 represents the harmonic mean of accuracy and recall.
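The indices above can be computed from the four confusion-matrix counts as in the sketch below. The counts in the usage example are made up for illustration and are not the results of the disclosure. F1 is computed here in its standard form, as the harmonic mean of precision (TP divided by TP plus FP) and recall; reading the quantity paired with recall in the F1 definition as precision is an assumption.

```python
def detection_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    accuracy  = (TP + TN) / total
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = harmonic mean of precision and recall (standard definition)
    """
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("at least one sample is required")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts summing to 400 samples.
    print(detection_metrics(tp=8, fn=2, fp=4, tn=386))
```

When abnormal values are rare, a high accuracy can coexist with many false alarms, which is why precision and recall are reported alongside it.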

In the context of anomaly alarms, the index of greatest interest is accuracy, and the aim is to improve accuracy and reduce the false alarm rate as far as possible. As can be seen from the results shown in the above tables, the method for detecting abnormal data according to the embodiments of the disclosure significantly improves accuracy compared with the traditional Grubbs method (accuracy improves by 48.81% and F1 improves by 16.15%), while the recall rate remains within an acceptable range.

Next, an apparatus 900 for detecting abnormal data in a data sequence according to an embodiment of the present disclosure will be described with reference to fig. 9.

As shown in fig. 9, an apparatus 900 for detecting anomalous data in a data sequence in accordance with an embodiment of the disclosure includes a sample set generation module 910, a trend determination module 920, a threshold determination module 930, and an anomaly detection module 940.

The sample set generation module 910 is configured to extract an analysis sample set of data to be analyzed in a data sequence. The data to be analyzed may be currently monitored data, such as total user account opening amount, user account opening amount on the same day, game equipment sales amount on the same day, and the like.

According to an embodiment of the present disclosure, the data sequence may be a data sequence of an incremental data type, or may be a data sequence of a cumulative total data type; the data sequence may be a time series data sequence without a periodic variation law, or may be a time series data sequence exhibiting a periodic variation law. The number of samples in the analysis sample set may be selected according to system design requirements, which may include, for example, system complexity, accuracy, false positive rate, etc. The sample set generation module 910 may be configured to determine an analysis sample set using the exemplary methods described above with reference to fig. 3, 4A, and 4B.

As shown in fig. 10, the sample set generation module 910 may include a data preprocessing sub-module 9100 and a sample set construction sub-module 9140.

The data preprocessing sub-module 9100 is configured to preprocess the data sequence to obtain a ring ratio data sequence. For example, in the case where the data sequence is a data sequence of a cumulative total data type, it is preprocessed into a ring ratio data sequence by the above formula (1); in the case where the data sequence is of an incremental data type, it is preprocessed into a ring ratio data sequence by the above formula (2).
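As a rough illustration of a ring ratio (period-over-period) transformation, the sketch below computes the relative change of each value against the previous one, together with a helper that turns cumulative totals into increments. The exact forms of formulas (1) and (2), and how the two data types are handled by them, are not assumed here; both function names are hypothetical.

```python
def increments(cumulative):
    """Differences between consecutive cumulative totals."""
    return [cur - prev for prev, cur in zip(cumulative, cumulative[1:])]


def ring_ratio(values):
    """Relative change of each value against the previous value.

    Assumed form: (x_i - x_{i-1}) / x_{i-1}. The result has one element
    fewer than the input.
    """
    ratios = []
    for prev, cur in zip(values, values[1:]):
        if prev == 0:
            raise ValueError("previous value must be non-zero")
        ratios.append((cur - prev) / prev)
    return ratios


if __name__ == "__main__":
    totals = [100, 110, 125, 135]
    print(increments(totals))
    print(ring_ratio(totals))
```

Working on relative changes rather than raw values makes samples from periods with very different absolute levels comparable.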

In the case where the data sequence is a time-series data sequence without a periodic variation law, the sample set construction sub-module 9140 sequentially extracts a predetermined number of data samples, starting from the data to be analyzed, to construct an analysis sample set.

In the case that the data sequence is a time-series data sequence with a periodic variation rule, the sample set generating module 910 may further include a sequence dividing sub-module 9110, a period intercepting sub-module 9120, and a position determining sub-module 9130.

The sequence segmentation submodule 9110 is configured to segment the original time series data sequence by a data period p of the original time series data sequence. The period interception submodule 9120 is configured to sequentially determine the data period in which the data to be analyzed is located and the N-1 periods before that data period as the N periods to be extracted. The position determination sub-module 9130 is configured to determine the position of the data to be analyzed in its data period. The sample set construction sub-module 9140 is configured to form an analysis sample set containing N data samples from the data to be analyzed and the data at the same position in the previous N-1 periods to be extracted.

Trend determination module 920 is configured to determine trend parameters for the analysis sample set. The trend parameters of the analysis sample set may include linear slope, tangential slope of a curve, and so forth. According to an embodiment of the present disclosure, a trend parameter of an analysis sample set is determined based on a data value of each data sample in the analysis sample set and a time parameter thereof.

The threshold determination module 930 is configured to determine a confidence range of the data to be analyzed based on the determined trend parameters of the analysis sample set. According to the embodiment of the disclosure, after determining the change trend of the position of the data to be analyzed, the confidence range of the data to be analyzed is determined based on the data value of each data sample in the analysis sample set and the change trend of the time point of the data to be analyzed. For example, the confidence range of the data to be analyzed may be provided by a single-sided threshold or a double-sided threshold. If only too high data or only too low data need to be detected in anomaly detection, a single-sided threshold may be employed to define the confidence range. If both too high and too low data are to be detected in anomaly detection, a double-sided threshold may be employed to define confidence ranges, which may include an upper-sided threshold and a lower-sided threshold.

The anomaly detection module 940 is configured to perform anomaly detection on the data to be analyzed according to the confidence range.

Optionally, the apparatus 900 for detecting abnormal data in a data sequence according to an embodiment of the present disclosure may further include a result output module 950. The result output module is configured to output an abnormality detection result for the data to be analyzed. For example, an anomaly alert may be output only when anomaly data is detected; different abnormal warnings can be generated according to whether the data to be analyzed are too high abnormality or too low abnormality.

As shown in fig. 11, according to an embodiment of the disclosure, the threshold determining module 930 may further include: statistical parameter determination submodule 9310, adjustment coefficient determination submodule 9320, and confidence range determination submodule 9330.

According to an embodiment of the disclosure, the statistical parameter may include an average value and a standard deviation of the analysis sample set, and the average value may be a simple average value or a weighted average value of the analysis sample set. The statistical parameter determination submodule 9310 may further include a mean determination submodule 9310 and a standard deviation determination submodule 93120.

The average determination submodule 9310 may determine a simple average value of the analysis sample set using the above formula (3), or may calculate a weighted average value of each data sample in the analysis sample set from the weight value ωi of each data sample using the above formula (5). The standard deviation determination submodule 93120 can determine the standard deviation of the analysis sample set using the above equation (4).
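A weighted average of the kind described above can be sketched as follows, assuming the usual normalized form in which the weighted sum of the samples is divided by the sum of the weights. The linearly increasing weights produced by `recency_weights` are a hypothetical choice that gives samples closer to the data to be analyzed a larger weight; the disclosure does not prescribe this particular scheme.

```python
def weighted_mean(samples, weights):
    """Weighted average: sum(w_i * x_i) / sum(w_i)."""
    if not samples or len(samples) != len(weights):
        raise ValueError("samples and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("the sum of the weights must be positive")
    return sum(w * x for w, x in zip(weights, samples)) / total


def recency_weights(n):
    """Hypothetical weights 1, 2, ..., n so that later samples weigh more."""
    return list(range(1, n + 1))


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(weighted_mean(data, recency_weights(len(data))))
```

With equal weights the function reduces to the simple average, so the same code path can serve both cases.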

The adjustment factor determination sub-module 9320 is configured to determine a confidence range adjustment factor for the data to be analyzed from the trend parameter.

The confidence range determination sub-module 9330 is configured to determine a confidence range of the data to be analyzed based on the statistical parameters of the analysis sample set determined by the statistical parameter determination sub-module 9310 and the confidence range adjustment coefficients of the data to be analyzed determined by the adjustment coefficient determination sub-module 9320.

The operation of trend determination module 920 and threshold determination module 930 according to embodiments of the present disclosure will be described with respect to a Grubbs (Grubbs) test algorithm.

The mean determination submodule 9310 and the standard deviation determination submodule 93120 in the statistical parameter determination submodule 9310 determine simple mean values and standard deviations of the N data samples of the analysis sample set.

The trend determination module 920 calculates the linear regression slope of the N data samples based on the simple average and standard deviation of the N data samples of the analysis sample set according to equation (8) above.

The adjustment coefficient determination sub-module 9320 determines an upper threshold adjustment coefficient $F_{up}$ and a lower threshold adjustment coefficient $F_{down}$ based on the linear regression slope. As an example, the adjustment coefficient determination sub-module 9320 maps the determined linear regression slope k of the analysis sample set to the interval (-1, 1) by a sigmoid function transformation using the above formula (10), and calculates the upper threshold adjustment coefficient $F_{up}$ and the lower threshold adjustment coefficient $F_{down}$ using the above formula (11).

The confidence range determination submodule 9330 determines a reference upper critical value $G_{up}$ and a reference lower critical value $G_{down}$ based on the Grubbs test algorithm, and may adjust the reference upper critical value $G_{up}$ and the reference lower critical value $G_{down}$ based on the upper threshold adjustment coefficient $F_{up}$ and the lower threshold adjustment coefficient $F_{down}$, as shown in the above equation (6), to obtain an adjusted upper critical value $G'_{up}$ and an adjusted lower critical value $G'_{down}$.

The confidence range determination submodule 9330 may utilize the mean and standard deviation of the N data samples of the analysis sample set and the adjusted upper critical value $G'_{up}$ and lower critical value $G'_{down}$, as shown in equation (7) above, to determine an upper threshold $X_{up}$ and a lower threshold $X_{down}$ of the data to be analyzed. The average value may be a simple average value or a weighted average value. In the case where the average is a weighted average, the average determination submodule 9310 in the statistical parameter determination submodule 9310 also calculates a weighted average of the N data samples of the analysis sample set according to the above formula (5).

Fig. 12 shows an exemplary block diagram of an apparatus 1110 for detecting anomalous data in accordance with an embodiment of the disclosure.

The device 1110 for detecting abnormal data as shown in FIG. 12 may be implemented as one or more special purpose or general purpose computer system modules or components, such as a personal computer, notebook computer, tablet computer, cell phone, personal digital assistant (PDA), smart glasses, smart watch, smart ring, smart helmet, or any other smart portable device. The device 1110 for detecting abnormal data may include at least one processor 1110 and a memory 1120.

The at least one processor 1110 is configured to execute program instructions. The memory 1120 may be present in the device 1110 for detecting abnormal data in different forms of program storage units and data storage units, such as a hard disk, read-only memory (ROM), or random access memory (RAM), which can be used to store various data files used by the processor in processing and/or executing the process of detecting abnormal data, as well as possible program instructions executed by the processor. Although not shown in the figures, the device 1110 for detecting abnormal data may further include an input/output component supporting input/output data flow between the device 1110 for detecting abnormal data and other components. The device 1110 for detecting abnormal data may also send information and data to, and receive information and data from, a network through a communication port.

In some embodiments, the set of instructions stored in the memory 1120, when executed by the processor 1110, causes the device 1110 for detecting abnormal data to perform the method of detecting abnormal data as described above, or to implement the apparatus for detecting abnormal data as described above.

Although in fig. 12, the processor 1110 and the memory 1120 are presented as separate modules, those skilled in the art will appreciate that the above-described device modules may be implemented as separate hardware devices or may be integrated as one or more hardware devices. The specific implementation of the different hardware devices should not be taken as a factor limiting the scope of protection of the present disclosure, as long as the principles described in this disclosure can be implemented.

According to another aspect of the present disclosure, there is also provided a non-volatile computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a computer, may perform a method as described above, or implement an apparatus for detecting anomalous data as described above.

According to the embodiment of the disclosure, the confidence range of the data to be analyzed is adjusted according to the data change trend by considering the data change trend of the data sequence, so that the abnormality detection has stronger adaptability, and continuous over-high abnormality alarm and continuous under-low abnormality alarm can be avoided. Compared with the traditional Grubbs test method, the method for detecting the abnormal data according to the embodiment of the disclosure can remarkably improve the accuracy of abnormal warning and reduce the false alarm rate.

Program portions of the technology may be considered to be "products" or "articles of manufacture" in the form of executable code and/or associated data, embodied in or carried by a computer-readable medium. A tangible, persistent storage medium may include any memory or storage used by a computer, processor, or similar device or related module, for example various semiconductor memories, tape drives, disk drives, or the like capable of providing storage functionality for software.

All or a portion of the software may sometimes be communicated over a network, such as the internet or another communication network. Such communication may load software from one computer device or processor to another, for example from a server or host computer of the device for detecting abnormal data onto a hardware platform of a computer environment, onto another computer environment implementing the system, or onto a system with similar functions related to providing the information needed for detecting abnormal data. Thus, another kind of medium capable of carrying software elements, such as optical waves, electric waves, or electromagnetic waves propagated through cables, optical cables, or air, may also be used as a physical connection between local devices. The physical media used for such carrier waves, such as electrical cables, wireless links, or optical cables, may also be considered to be software-bearing media. Unless limited to a tangible "storage" medium, other terms used herein to refer to a computer or machine "readable medium" mean any medium that participates in the execution of any instructions by a processor.

The application uses specific words to describe embodiments of the application. Reference to "a first/second embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the application may be combined as suitable.

Furthermore, those skilled in the art will appreciate that the various aspects of the application are illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims

1. A method of detecting anomalous data in a data sequence, wherein the data sequence is a temporal data sequence obtained by monitoring a critical indicator of a business system, the method comprising:

preprocessing the data sequence to obtain a ring ratio data sequence;

extracting an analysis sample set of data to be analyzed from the ring ratio data sequence, wherein in the case that the data sequence is a time sequence data sequence with a period change rule, data samples at the same position are extracted from each period of the ring ratio data sequence to form the analysis sample set; and sequentially extracting a predetermined number of data samples from the ring ratio data sequence to construct the analysis sample set in the case that the data sequence is a time sequence data sequence without a periodic variation rule;

determining a trend parameter of the analysis sample set, wherein the trend parameter is indicative of a linear regression slope of a sequence of data samples in the analysis sample set;

determining a confidence range of the data to be analyzed according to the trend parameters; and

performing anomaly detection on the data to be analyzed according to the confidence range;

wherein, according to the trend parameter, determining the confidence range of the data to be analyzed includes:

determining a reference critical value of a confidence range of the data to be analyzed;

determining a confidence range adjustment coefficient of the data to be analyzed according to the trend parameter, wherein the confidence range adjustment coefficient is used for adjusting a reference critical value of the confidence range of the data to be analyzed according to the change trend of the analysis sample set; and

adjusting the reference critical value of the confidence range of the data to be analyzed according to the statistical parameters of the analysis sample set and the confidence range adjustment coefficient, so as to determine the confidence range of the data to be analyzed.

2. The method of detecting anomalous data according to claim 1, wherein the statistical parameters of the analysis sample set include a mean value and a standard deviation of the analysis sample set, and the method of detecting anomalous data further comprises:

determining an average value of the data samples in the analysis sample set; and

a standard deviation of the data samples in the analysis sample set is determined.

3. The method of detecting anomalous data according to claim 2 wherein the average value of the analysis sample set is a weighted average value of the analysis sample set,

wherein said determining an average value of data samples in said analysis sample set comprises:

determining a weight value for each data sample in the analysis sample set; and

a weighted average of the data samples in the analysis sample set is determined.

4. A method of detecting anomalous data according to claim 3 wherein in the analysis sample set, a data sample close to the data to be analysed has a greater weight value than a data sample far from the data to be analysed.

5. The method for detecting anomalous data according to claim 1 wherein the data sequence is a periodic data sequence,

wherein extracting an analysis sample set of data to be analyzed in the ring ratio data sequence comprises:

dividing the data sequence according to the data period of the ring ratio data sequence;

sequentially determining a preset number of data periods from the data to be analyzed as periods to be extracted;

determining the position of the data to be analyzed in the period to be extracted; and

and extracting data at the position from each period to be extracted to obtain the analysis sample set.

6. The method of detecting anomaly data of claim 1, wherein anomaly detecting the data to be analyzed according to the confidence range comprises:

detecting the data to be analyzed as abnormal data in the case that the data to be analyzed exceeds the confidence range.

7. An apparatus for detecting abnormal data in a data sequence, wherein the data sequence is a time data sequence obtained by monitoring a key index of a business system, the apparatus comprising:

a sample set generation module comprising: a data preprocessing sub-module configured to preprocess the data sequence to obtain a ring ratio data sequence; and a sample set construction sub-module configured to extract an analysis sample set of data to be analyzed in the ring ratio data sequence, wherein in the case that the data sequence is a time series data sequence having a periodic variation rule, data samples at the same position are extracted from respective periods of the ring ratio data sequence to construct the analysis sample set; and in the case that the data sequence is a time series data sequence without a periodic variation rule, a predetermined number of data samples are sequentially extracted from the ring ratio data sequence to construct the analysis sample set;

a trend determination module configured to determine a trend parameter of the analysis sample set, wherein the trend parameter is indicative of a linear regression slope of a sequence of data samples in the analysis sample set;

a threshold determination module configured to determine a confidence range of the data to be analyzed according to the trend parameter;

an anomaly detection module configured to perform anomaly detection on the data to be analyzed according to the confidence range; and

the threshold determination module is further configured to:

8. The apparatus for detecting anomalous data according to claim 7, wherein the data sequence is a periodic data sequence,

wherein the sample set generation module further comprises:

a sequence dividing sub-module configured to divide the data sequence according to a data period of the ring ratio data sequence;

a period intercepting sub-module configured to sequentially determine a preset number of data periods from the data to be analyzed as periods to be extracted; and

a position determination sub-module configured to determine a position of the data to be analyzed in the period to be extracted;

the sample set generating module extracts data at the position from each period to be extracted to obtain the analysis sample set.

9. An apparatus for detecting abnormal data in a data sequence, wherein the data sequence is a time data sequence obtained by monitoring a key index of a business system, the apparatus comprising:

a processor, and

a memory comprising a set of processor-executable instructions that, when executed by the processor, cause the apparatus to:

preprocessing the data sequence to obtain a ring ratio data sequence;

10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a computer perform the method of any of the preceding claims 1-6.