CN112905671A

CN112905671A - Time series exception handling method and device, electronic equipment and storage medium

Info

Publication number: CN112905671A
Application number: CN202110313319.XA
Authority: CN
Inventors: 张文池; 王泓琳; 陈哲康; 周波; 王勇; 刘大鹏
Original assignee: Beijing Bishi Technology Co ltd; National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-06-04

Abstract

The invention provides a time series exception handling method and device, electronic equipment and a computer readable storage medium. The time series exception handling method comprises the following steps: acquiring time series data, training on the time series data, and constructing a model; detecting, according to the model, whether abnormal data exists in time series data obtained in real time, and if so, recommending part of the abnormal data; judging whether the recommended part of the abnormal data is reasonable, and feeding back the judgment result; and optimizing the model according to the judgment result, and then continuing to detect the real-time time series data. The method shows no obvious bias toward particular data, can adapt to metrics with scenario-specific semantics, can meet operation and maintenance requirements outside the traditional internet field, has high extensibility and generality, and can give specific causes for the anomalies it reports.

Description

Time series exception handling method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing a time series exception, an electronic device, and a computer-readable storage medium.

Background

Modern software enterprises often rely on a large number of application services deployed on extensive infrastructure, including physical machines, virtual machines, and containers. To ensure the reliability of these high-level services and systems, operation and maintenance personnel need to monitor and check the operating condition of the infrastructure. In routine operation and maintenance work, an engineer typically monitors and collects various performance metrics for the infrastructure. For example, a machine usually has metrics such as memory utilization, CPU utilization, and disk utilization. In actual operation, faults caused by external attacks, aging of disk media, sustained overload, and the like severely threaten the availability of the machine, and these monitoring metrics then exhibit anomalies. Anomaly detection on time series metrics is therefore very important: it helps the operation and maintenance team discover faults as early as possible and shortens the time from fault occurrence to troubleshooting.

Anomaly detection on time series metrics has also received wide attention in academia, and a large number of algorithms have been proposed in recent years. Limited by their detection quality and performance, however, these methods still cannot meet the requirements of practical deployment. Because the number of metrics to be monitored in operation and maintenance work is extremely large, manual labeling of metrics is impractical; supervised anomaly detection is therefore difficult to apply, and an unsupervised learning approach must be adopted. In addition, time series anomaly detection scenarios differ from one another: the objects served and the loads of different services vary greatly, and the trends and characteristics shown by the metrics are sometimes strongly tied to the business. An anomaly detection method therefore needs to be able to collect feedback from operation and maintenance experts efficiently in order to acquire their knowledge.

Table 1 below lists the most advanced unsupervised time series anomaly detection approaches in academia at present. Most of them adopt deep learning models, which require large computing resources for training, leave room for improvement in computing performance, and cannot directly use user feedback to optimize the deep learning framework. Traditional unsupervised statistical learning methods require a large amount of manual parameter tuning and give uneven results. The algorithms are also clearly biased toward particular data: each performs well on a specific data type but lacks generality, and the anomaly results they produce are difficult to explain in terms of specific causes.

Characteristic	Regression statistical learning	Traditional unsupervised learning	Unsupervised deep generative model
High capacity	Poor	Average	Excellent
No parameter tuning needed	Poor	Excellent	Average
No labeling needed	Excellent	Excellent	Excellent
Fast detection	Excellent	Average	Average
Low training resources	Excellent	Average	Poor
Short training time	Excellent	Average	Poor
Manually adjustable	Poor	Average	Poor

TABLE 1

Disclosure of Invention

The present invention is directed to solve at least one of the problems in the background art and provides a time series exception handling method, a time series exception handling apparatus, an electronic device, and a computer-readable storage medium.

In order to achieve the above object, the present invention provides a method for processing time series exception, comprising the following steps:

acquiring time sequence data, training the time sequence data, and constructing a model;

detecting whether abnormal data exist in the time sequence data obtained in real time according to the model, and if so, recommending part of the abnormal data;

judging whether the recommended part of abnormal data is reasonable or not, and then feeding back a judgment result;

and optimizing the model according to the judgment result, and then continuously detecting the real-time sequence data.

According to one aspect of the invention, acquiring time series data comprises acquiring regular small-scale time series data and irregular large-scale time series data, clustering all time series data when acquiring irregular large-scale time series data, and then training various types of time series data to construct a model.

According to one aspect of the invention, the clustering process is to capture the correlation among the time sequence data to be trained through DBSCAN, and cluster the data with approximate shape and consistent periodicity.

According to an aspect of the present invention, in the clustering process, in calculating the approximation degree of the time-series data, the distance between the time-series data is calculated using DTW.

According to one aspect of the invention, according to the type of the time sequence data, feature data capable of representing the corresponding type of the time sequence data is selected for training, and a model is constructed.

According to one aspect of the invention, RRCF is adopted to select all the feature data for training, all the feature data are iterated to obtain a plurality of decision trees, the decision trees form a decision forest, and then whether abnormal data exist in the real-time sequence data is determined through voting of the decision forest.

According to one aspect of the invention, when constructing the decision tree, the RRCF selects a segmentation dimension along which to split the feature data, and the probability that the RRCF selects feature i is a probability $p_i$ computed from $l_i$, $g_i$, $\sum_j l_j$ and $\sum_j g_j$, where

$g_i = \max_{x \in S} (x_j - x_{j-1})$; i denotes a feature; $p_i$ represents the probability that feature i is selected, with a value between 0 and 1; $l_i$ represents the difference between the maximum and minimum values of feature i over the feature set computed from the training sample set; $g_i$ represents the maximum difference between two adjacent values of feature i after the values computed from the training sample set are sorted by size; $\sum_j g_j$ represents the sum of $g_j$ over all feature dimensions j; and $\sum_j l_j$ represents the sum of $l_j$ over all feature dimensions j.

According to one aspect of the invention, the RRCF equally divides the feature data in the segmentation dimension into N intervals $[l_0, h_0], [l_1, h_1], \ldots, [l_{N-1}, h_{N-1}]$ and calculates the density of each interval, $d_i = \mathrm{Count}(p,\ p \in [l_i, h_i])$; the probability that each interval is selected is determined from these densities.

Finally, a cut point $X_i \sim \mathrm{Uniform}[l_i, h_i]$ is randomly selected from the chosen interval, where $l_0$ represents the minimum value of the feature in that dimension over the training set and $h_{N-1}$ represents the maximum value; the difference between the maximum and the minimum is divided by N to obtain the N equal intervals.

According to one aspect of the present invention, when abnormal data exists, the anomaly score CoDisp of the abnormal data is calculated using the dividing points. To calculate it, the ratio $\mathrm{CoDisp}_{Node}$ between the numbers of data items contained in the sibling subtree and in the parent subtree of the dividing point is computed, and the largest ratio $\mathrm{CoDisp}_{Node}$ is selected to obtain the anomaly score of the abnormal data $x_i$.

According to one aspect of the invention, the recommending part of the abnormal data is to select a plurality of most abnormal segments in the abnormal data, and recommend after obtaining labels of the plurality of segments; or

Recommending partial abnormal data by selecting a plurality of uncertain segments in the abnormal data and recommending after obtaining labels of the segments; or

And the recommendation of the abnormal data of the part is to divide the abnormal data into a plurality of groups according to the abnormal scores, obtain a plurality of fragments in each group, and recommend after obtaining the labels of the fragments.

According to one aspect of the invention, after the model obtains n labeled segments of abnormal data, the abnormal data and the M decision trees in the decision forest of the model jointly form an anomaly score matrix $\mathrm{CoDisp\_M}[x_i][tree_j]$. For each abnormal data item $x_i$, if the fed-back judgment result is a true positive, the weight of decision tree $tree_j$ is updated as $tw_j = tw_j + \delta \times \mathrm{CoDisp\_M}[x_i][tree_j]$, and decision trees with higher weights are selected according to the fed-back judgment results so as to optimize the model.

In order to achieve the above object, the present invention further provides a time-series exception handling apparatus, including:

the data processing module is used for acquiring time series data, training the time series data and constructing a model;

the abnormal data detection recommending module detects whether abnormal data exist in the time sequence data obtained in real time according to the model, and if the abnormal data exist, part of the abnormal data are recommended;

the abnormal data judgment feedback module judges whether the part of abnormal data is reasonable or not and then feeds back a judgment result;

and the model optimization module optimizes the model according to the feedback judgment result and then continuously detects the real-time sequence data.

According to an aspect of the invention, further comprising:

and the data classification processing module is used for acquiring irregular large-scale time sequence data, clustering all the time sequence data, training various time sequence data and constructing a model.

In order to achieve the above object, the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the above time-series exception handling method.

To achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the above time-series exception handling method.

According to one scheme of the invention, as the number of time sequences to be monitored in a production environment is extremely large, each production unit can generate dozens or even hundreds of monitoring index data, the index data need to be monitored completely, if the time sequences are trained respectively in a targeted manner, the number of models and consumed resources are extremely large, and the existing operation and maintenance resources are difficult to support. Therefore, before the targeted training stage of the index data, the data are clustered, so that the detection processing time can be greatly reduced, and the abnormity can be quickly and accurately processed.

According to one scheme of the invention, a characteristic data selection stage is provided, and more appropriate characteristic data are extracted in a targeted manner according to the statistical information and characteristics of indexes, so that the accuracy of the model is improved.

According to one scheme of the invention, the most abnormal 30 segments are selected, and the labels of the abnormal segments are acquired, so that the explicit abnormality can be further confirmed, and the false positive rate can be reduced.

According to one scheme of the invention, 30 most uncertain segments (namely around the vicinity of an abnormality judgment threshold) are selected, and the labels can help the model to clearly classify boundaries, so that the identification accuracy of fuzzy abnormalities is improved.

According to one aspect of the invention, the abnormal data is divided into 10 groups according to the abnormal scores, each group obtains at most 3 segments, and the labels can capture attitudes of the judgment feedback module on different abnormal judgment conditions, so as to help the model determine the optimal threshold value selection range.

According to one scheme of the invention, the invention provides an unsupervised, white-box and accurate time series exception handling method which is matched with active learning and can actively and efficiently collect feedback information. On the basis of a traditional unsupervised learning frame, an active learning stage is introduced, abnormality is actively recommended to a judgment feedback part (such as a judgment feedback module or operation and maintenance personnel) and feedback is acquired, so that a model is corrected, and the accuracy is improved. The method reserves the advantages of the traditional unsupervised learning in the aspects of parameter adjustment and marking, designs the application strategy of marking feedback in a targeted manner, and further optimizes the recall rate, the detection speed and the capacity of the model.

According to one scheme of the invention, the processing method has no obvious bias on data, can adapt to indexes with specific scene semantics, can meet the operation and maintenance requirements in the field of non-traditional Internet, has higher expandability and universality, and can give specific abnormal reasons for the given abnormal result.

According to one aspect of the present invention, the invention can accurately detect and explain anomalies. It was tested on two data sets, a public data set and time series data from the actual production environment of a commercial bank, and reached F1-scores of 0.81 and 0.89 on the two data sets. Compared with traditional unsupervised anomaly detection methods, the best F1-score is improved by 0.19 to 0.5 on the two data sets, and the detection time is shortened by 58%.

Drawings

FIG. 1 schematically shows a flow diagram of a method for time series exception handling according to one embodiment of the present invention;

FIG. 2 schematically shows similar metric curves collected from the same switch;

FIGS. 3-5 schematically show three different schemes for actively recommending anomaly segments;

FIG. 6 schematically shows a functional configuration diagram of a time series exception handling apparatus according to an embodiment of the present invention.

Detailed Description

The content of the invention will now be discussed with reference to exemplary embodiments. It is to be understood that the embodiments discussed are merely intended to enable one of ordinary skill in the art to better understand and thus implement the teachings of the present invention, and do not imply any limitations on the scope of the invention.

As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on". The terms "one embodiment" and "an embodiment" are to be read as "at least one embodiment".

In view of the above-described drawbacks of the prior art in the background art, the present invention provides a batch task time monitoring method, which can predict the time of a batch task and detect an abnormality of the batch task, and update a task model or generate an alarm according to the prediction and detection results.

FIG. 1 schematically shows a flow diagram of a method for time series exception handling according to one embodiment of the present invention. As shown in fig. 1, a time-series exception handling method according to an embodiment of the present invention includes the following steps:

a. acquiring time sequence data, training the time sequence data, and constructing a model;

b. detecting whether abnormal data exist in the time sequence data obtained in real time according to the model, and if so, recommending part of the abnormal data;

c. judging whether the recommended part of abnormal data is reasonable or not, and then feeding back a judgment result;

d. and optimizing the model according to the judgment result, and then continuously detecting the real-time sequence data.

In practice, the time series data may be represented by x, where $x = \{x_1, x_2, \ldots, x_N\}$, N is the length of the data x, and the data point $x_t$ at any time t is a specific data value. The time series may be collected from many sources, such as networks, transaction links, request logs, and the like. Time series from the same source are more likely to have similar characteristics.

Because the number of time sequences to be monitored in a production environment is extremely large, each production unit can generate dozens or even hundreds of monitoring index data, the index data need to be monitored completely, if the time sequences are trained respectively in a targeted manner, the number of models and consumed resources are extremely large, and the existing operation and maintenance resources are difficult to support. Therefore, before the targeted training stage of the index data, the data are clustered, so that the detection processing time can be greatly reduced, and the abnormity can be quickly and accurately processed.

Specifically, according to an embodiment of the present invention, in the clustering stage of step a, the algorithm uses DBSCAN to capture the association relationships among the time series metrics to be trained and clusters metrics with similar shapes and consistent periodicity. When calculating the similarity between metrics, DTW (Dynamic Time Warping) is used to compute the distance between them. DBSCAN requires no predefined category information (such as the number of clusters), and the clustering granularity can be controlled by adjusting the clustering radius, so it is well suited to metric clustering scenarios.
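
The clustering stage can be illustrated with a minimal sketch. It assumes plain Python lists as series, absolute difference as the point-wise DTW cost, and hypothetical values for the clustering radius `eps` and the minimum neighborhood size `min_pts`; none of these choices is specified by the embodiment, and the periodicity check is omitted.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix.

    Returns one label per item; -1 marks noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # provisionally noise
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reachable from a core point becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
    return labels

# Two curves with the same shape (one shifted by a step) and one unrelated curve.
series = [
    [0, 1, 2, 3, 2, 1, 0, 1, 2, 3],
    [0, 0, 1, 2, 3, 2, 1, 0, 1, 2],
    [5, 5, 5, 9, 5, 5, 5, 5, 5, 5],
]
dist = [[dtw_distance(a, b) for b in series] for a in series]
labels = dbscan(dist, eps=3.0, min_pts=2)  # eps and min_pts are illustrative values
```

In this sketch one model would then be trained per cluster label, while series labeled -1 would be trained individually.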

FIG. 2 schematically shows similar metric curves collected from the same switch. As shown in FIG. 2, two network traffic curves for different ports of the same switch exhibit substantially the same trend and scale. In an actual production environment, data of the same type under the same monitoring unit also tend to cluster in this way. Exploiting this property greatly reduces the number of models generated in the model training stage, reduces resource consumption, and improves the cost-effectiveness of the operation and maintenance tool. In some scenarios, however, the number of data series is small and the accuracy requirement is high; training each series individually is then more cost-effective than pre-clustering, so the clustering stage is an optional step.

As can be seen from the above, in the present invention, acquiring time series data includes acquiring regular small-scale time series data and acquiring irregular large-scale time series data. Regular small-scale time series data only needs to be trained directly, whereas irregular large-scale time series data is first clustered, after which each class of time series data is trained to construct a model.

According to one embodiment of the invention, feature data capable of representing the corresponding type of time series data is selected for training according to the type of the time series data, and a model is constructed. Different time series data have different characteristics. For example, percentage-type sequence data tends to stay level, with short dips or spikes when a failure occurs; business-related transaction sequence data often shows periodic peaks and valleys, with a small amount of fluctuation when a failure occurs; and infrastructure sequence data such as swap space may rise slowly over time. Therefore, the invention provides a feature data selection stage in which more suitable feature data is extracted in a targeted manner according to the statistical information and characteristics of the metrics, thereby improving model accuracy. The specific extraction rules are shown in the following table:

TABLE 2

In this embodiment, table 2 contains simple and effective feature data that can cover the different features of most curves, and is easy to calculate and performs well.

According to one embodiment of the invention, RRCF is adopted to select all feature data for training, all feature data are iterated to obtain a plurality of decision trees, the decision trees form a decision forest, and then whether abnormal data exist in real-time sequence data is determined through voting of the decision forest.

When the decision tree is constructed, the RRCF selects the segmentation dimension along which to split the feature data, and the probability that the RRCF selects feature i as the segmentation dimension is a probability $p_i$ computed from $l_i$, $g_i$, $\sum_j l_j$ and $\sum_j g_j$, where

$g_i = \max_{x \in S} (x_j - x_{j-1})$; i denotes a feature; $p_i$ represents the probability that feature i is selected, with a value between 0 and 1; $l_i$ represents the difference between the maximum and minimum values of feature i over the feature set computed from the training sample set; $g_i$ represents the maximum difference between two adjacent values of feature i after the values computed from the training sample set are sorted by size; $\sum_j g_j$ represents the sum of $g_j$ over all feature dimensions j; and $\sum_j l_j$ represents the sum of $l_j$ over all feature dimensions j.

Specifically, the base unsupervised anomaly detection algorithm selected by the invention is RRCF (Robust Random Cut Forest). Its detection performance is better than that of other unsupervised anomaly detection algorithms, but a gap remains between its accuracy and the accuracy required for practical deployment. The RRCF trains on the feature data of all training samples in batches; each batch of feature data undergoes multiple rounds of iteration to produce a decision tree, and all the decision trees finally form a decision forest that decides by voting whether a sample is abnormal. In constructing a decision tree, the feature to split on must be selected from the multiple dimensions of the feature data. The RRCF assumes that splitting on a dimension whose data covers a larger range separates the samples better, that is, the probability that feature i is selected is

$p_i = \dfrac{l_i}{\sum_j l_j}, \quad l_i = \max_{x \in S} x_i - \min_{x \in S} x_i,$ where $p_i$ represents the probability that feature i is selected, $l_i$ represents the difference between the maximum and minimum values of feature i, S represents the training sample set, and $x_i$ represents the value of feature i computed for one sample in S. This, however, does not take into account the distribution of the data within each dimension. According to an embodiment of the invention, when a decision tree is constructed and the dimension for splitting is selected, the largest gap between adjacent data values is used as an additional influence factor besides the range covered by the data in that dimension; that is, the probability $p_i$ of selecting feature i in the invention also depends on $g_i$,

where $g_i = \max_{x \in S} (x_j - x_{j-1})$. Thus, the larger the maximum gap in the data distribution of a dimension, the higher the degree of discrimination provided by splitting at that gap, so that the segmentation dimension is selected more effectively and the model accuracy is improved.
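
The quantities involved in dimension selection can be sketched as follows. The code computes the range $l_i$ and the largest adjacent gap $g_i$ for each feature dimension. Because the exact formula that combines them is not reproduced in this text, the blending weight `mix` and the linear combination below are purely hypothetical; with `mix=0` the sketch reduces to the range-proportional rule of the original RRCF.

```python
import random

def range_and_gap(values):
    """Return (l, g): the value range and the largest gap between adjacent sorted values."""
    s = sorted(values)
    l = s[-1] - s[0]
    g = max((b - a for a, b in zip(s, s[1:])), default=0.0)
    return l, g

def dimension_probabilities(samples, mix=0.5):
    """Selection probability per feature dimension.

    samples: list of equal-length feature vectors.
    mix:     HYPOTHETICAL blending weight between the range share l_i / sum(l_j)
             and the gap share g_i / sum(g_j); not taken from the source.
    """
    columns = list(zip(*samples))  # one tuple of values per feature dimension
    ls, gs = zip(*(range_and_gap(col) for col in columns))
    sum_l, sum_g = sum(ls), sum(gs)
    if sum_l == 0:
        # Every feature is constant: fall back to a uniform choice.
        return [1.0 / len(columns)] * len(columns)
    return [(1 - mix) * l / sum_l + mix * g / sum_g for l, g in zip(ls, gs)]

# Dimension 1 has both a wider range and a large gap (0 -> 10).
samples = [(0, 0), (1, 10), (2, 11), (3, 12)]
p_range_only = dimension_probabilities(samples, mix=0.0)
p_blended = dimension_probabilities(samples, mix=0.5)

# Draw the segmentation dimension according to the blended probabilities.
chosen_dim = random.choices(range(len(p_blended)), weights=p_blended)[0]
```

In this toy data the gap term raises the probability of the second dimension above its range-only share, which is the qualitative effect the paragraph describes.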

Further, when a decision tree is constructed, after each iteration determines a segmentation dimension, a suitable cut point needs to be selected on the data of that dimension, and the left and right subtrees are divided at that point. The original RRCF selects a dividing point at random after dividing the dimension data equally, without considering the distribution characteristics of the dimension. According to one embodiment of the invention, the RRCF equally divides the feature data in the segmentation dimension into N intervals $[l_0, h_0], [l_1, h_1], \ldots, [l_{N-1}, h_{N-1}]$ and calculates the density of each interval, $d_i = \mathrm{Count}(p,\ p \in [l_i, h_i])$; the probability that each interval is selected is determined from these densities.

Finally, a cut point $X_i \sim \mathrm{Uniform}[l_i, h_i]$ is randomly selected from the chosen interval. Here $l_0$ represents the minimum value of the feature in that dimension over the training set and $h_{N-1}$ represents the maximum value; the difference between the maximum and the minimum is divided by N to obtain the N equal intervals. For example, the left and right endpoints of the i-th interval are $l_i$ and $h_i$. This selection strategy identifies the sparse part of the segmentation dimension more accurately and thereby improves discrimination. In the present embodiment, $d_i$ represents the density of an interval, i.e., the number of samples within its range; since the intervals have the same width, the more samples, the greater the density. Count denotes counting, and p denotes each sample falling in the interval, i.e., the number of samples within $[l_i, h_i]$ is counted. $\mathrm{Uniform}[l_i, h_i]$ denotes the uniform distribution over the interval $[l_i, h_i]$, and $X_i$ is a cut point drawn at random from it.
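
A rough sketch of the interval-based cut point selection follows. The equal-width binning, the density count $d_i$, and the uniform draw follow the description; the mapping from density to selection probability is not reproduced in this text, so the `weight` function used here (which favors sparse intervals) is purely an illustrative assumption, as are the function names.

```python
import random

def interval_densities(values, n_intervals):
    """Split [min, max] into n equal-width intervals and count the samples in each."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; no cut is possible")
    width = (hi - lo) / n_intervals
    edges = [(lo + k * width, lo + (k + 1) * width) for k in range(n_intervals)]
    counts = [0] * n_intervals
    for v in values:
        k = min(int((v - lo) / width), n_intervals - 1)  # the maximum falls in the last interval
        counts[k] += 1
    return edges, counts

def choose_cut(values, n_intervals, weight=lambda d: 1.0 / (1.0 + d), rng=random):
    """Pick an interval with probability proportional to weight(d_i), then a uniform cut in it.

    The default weight is a HYPOTHETICAL choice that prefers sparse intervals.
    """
    edges, counts = interval_densities(values, n_intervals)
    weights = [weight(d) for d in counts]
    k = rng.choices(range(n_intervals), weights=weights)[0]
    l_k, h_k = edges[k]
    return rng.uniform(l_k, h_k)

values = [0, 1, 1, 2, 2, 2, 10]
edges, counts = interval_densities(values, n_intervals=5)
cut = choose_cut(values, n_intervals=5, rng=random.Random(0))
```

With these values the samples crowd into the first two intervals, so a weight that decreases with density tends to place the cut in the sparse region between the bulk of the data and the outlying value.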

Further, when abnormal data exists, the anomaly score CoDisp of the abnormal data is calculated using the dividing points (specific nodes). To calculate it, the ratio $\mathrm{CoDisp}_{Node}$ between the numbers of data items contained in the sibling subtree and in the parent subtree of the dividing point is computed; the higher the ratio, the more the abnormal data stands out as an outlier. Since the calculation for each abnormal data item involves multiple feature data, the model moves upward step by step from the initial node, and after repeated iterations the largest ratio $\mathrm{CoDisp}_{Node}$ is selected to obtain the anomaly score of the abnormal data $x_i$.

The anomaly score $\mathrm{CoDisp}_{x_i}$ denotes the degree of abnormality calculated for sample $x_i$. First, $x_i$ falls into a leaf of the decision tree, and the algorithm searches upward from that leaf until it finds a branch node Node whose subtree contains far fewer samples than its sibling subtree. The final CoDisp of sample $x_i$ is the average, over all trees in the forest, of the CoDisp of the Node corresponding to the sample in each tree. In the present embodiment, the largest ratio $\mathrm{CoDisp}_{Node}$ is selected with the depth of the node taken into account, since samples at deeper nodes of a tree are more normal. The dividing point found in this way isolates the subtree containing $x_i$ from the bulk of the other samples and is therefore more representative.
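
The score can be illustrated on a toy tree. This sketch follows the usual RRCF definition of collusive displacement (the sibling subtree size divided by the size of the subtree containing the sample, maximized along the path to the root and averaged over trees), which is an assumption about how the ratio in the description is meant; the `Node` class and the function names are hypothetical.

```python
class Node:
    """Binary tree node; a leaf holds one sample, a branch holds two children."""

    def __init__(self, left=None, right=None, point=None):
        self.left, self.right, self.point = left, right, point
        self.parent = None
        # A leaf counts one sample; a branch counts the samples below it.
        self.size = 1 if point is not None else left.size + right.size
        for child in (left, right):
            if child is not None:
                child.parent = self

def codisp_in_tree(leaf):
    """Largest sibling-size / own-subtree-size ratio on the path from leaf to root."""
    best = 0.0
    node = leaf
    while node.parent is not None:
        parent = node.parent
        sibling = parent.right if parent.left is node else parent.left
        best = max(best, sibling.size / node.size)
        node = parent
    return best

def codisp(leaf_per_tree):
    """Average the per-tree score over every tree in which the sample appears."""
    return sum(codisp_in_tree(leaf) for leaf in leaf_per_tree) / len(leaf_per_tree)

# Toy tree: "x" is split off at the root, while a, b, c sit deeper.
x, a, b, c = (Node(point=name) for name in "xabc")
root = Node(x, Node(a, Node(b, c)))

score_x = codisp_in_tree(x)  # sibling subtree has 3 samples, x's subtree has 1
score_a = codisp_in_tree(a)
score_b = codisp_in_tree(b)
```

The sample isolated near the root receives the highest score, and the deeper samples receive lower ones, matching the remark that samples at deeper nodes are more normal.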

Further, in the step b, recommending part of abnormal data as a plurality of most abnormal segments in the selected abnormal data, and recommending after obtaining labels of the plurality of segments; or

Recommending partial abnormal data by selecting a plurality of uncertain segments in the abnormal data, and recommending after obtaining labels of the plurality of segments; or

And recommending part of abnormal data, namely segmenting the abnormal data into a plurality of groups according to the abnormal scores, acquiring a plurality of fragments in each group, and recommending after acquiring the labels of the fragments.

FIGS. 3-5 schematically show three different schemes for actively recommending anomaly segments. As shown in FIG. 3, according to an embodiment of the present invention, scheme A selects the 30 most abnormal segments; the labels of these segments further confirm the explicit anomalies and reduce the false positive rate.

According to another embodiment of the invention, as shown in fig. 4, the scheme B selects the most uncertain 30 segments (i.e., around the anomaly determination threshold), and these labels can help the model to clearly classify the boundary, thereby improving the identification accuracy of the fuzzy anomaly.

As shown in fig. 5, according to the third embodiment of the present invention, the solution C divides the abnormal data into 10 groups according to the abnormal score, each group obtains at most 3 segments, and these labels can capture, for example, attitudes of the judgment feedback module on different abnormal judgment conditions, thereby helping the model determine the optimal threshold selection range.

In experiments on the public data set, the F1-score of scheme A was higher than that of the other two schemes, but each of the other two schemes has specific applicable scenarios.
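
The three recommendation schemes can be sketched as simple selection rules over per-segment anomaly scores. The counts (30 segments, 10 groups, at most 3 per group) come from the description; the use of equal-width score bands and the choice of the highest-scoring members within each band are assumptions of this sketch, and all function names are hypothetical.

```python
def most_anomalous(scores, k=30):
    """Scheme A: indices of the k highest-scoring segments."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def most_uncertain(scores, threshold, k=30):
    """Scheme B: indices of the k segments whose scores lie closest to the threshold."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))[:k]

def stratified(scores, n_groups=10, per_group=3):
    """Scheme C: split the score range into bands and take up to per_group from each.

    Equal-width bands and "highest score first" inside a band are ASSUMPTIONS.
    """
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_groups or 1.0  # avoid a zero width when all scores are equal
    groups = [[] for _ in range(n_groups)]
    for i, s in enumerate(scores):
        g = min(int((s - lo) / width), n_groups - 1)  # the top score joins the last band
        groups[g].append(i)
    picked = []
    for members in groups:
        members.sort(key=lambda i: scores[i], reverse=True)
        picked.extend(members[:per_group])
    return picked

scores = [0.1, 0.9, 0.5, 0.55, 0.95, 0.2]
pick_a = most_anomalous(scores, k=2)
pick_b = most_uncertain(scores, threshold=0.5, k=2)
pick_c = stratified(scores, n_groups=2, per_group=1)
```

The returned indices identify the segments that would be shown to the judgment feedback module or the operation and maintenance personnel for labeling.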

Furthermore, the invention improves the processing efficiency of the model in the online detection stage through several techniques and gives the model the ability to adjust dynamically according to user feedback. In the online detection stage, only extreme abnormal values are selected as automatic feedback data for dynamically adjusting the RRCF model, which reduces the model update frequency and improves detection performance. According to an embodiment of the invention, after the model obtains n labeled segments of abnormal data, the abnormal data and the M trees in the decision forest of the model jointly form an anomaly score matrix $\mathrm{CoDisp\_M}[x_i][tree_j]$. For each abnormal data item $x_i$, if the user labels it a true positive, the weight of $tree_j$ is updated as $tw_j = tw_j + \delta \times \mathrm{CoDisp\_M}[x_i][tree_j]$. This feedback-driven self-correction helps the model screen out higher-quality decision trees and gives them a higher weight in later anomaly judgments; selecting decision trees with higher weight optimizes the model and increases their influence on the detection result.
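
The weight update can be written out as a short sketch. The update rule itself is the one stated in the paragraph; the value of `delta`, the initial weights, and the weighted-average scoring function are assumptions added only for illustration, since the description does not specify how the weights enter the final score.

```python
def update_tree_weights(weights, codisp_m, is_true_positive, delta=0.1):
    """Apply tw_j = tw_j + delta * CoDisp_M[x_i][tree_j] for every true-positive x_i.

    weights:          current weight tw_j of each tree
    codisp_m:         codisp_m[i][j] is the score of labeled sample i in tree j
    is_true_positive: feedback label per labeled sample
    """
    new_weights = list(weights)
    for i, confirmed in enumerate(is_true_positive):
        if not confirmed:
            continue  # the description only specifies the true-positive case
        for j in range(len(new_weights)):
            new_weights[j] += delta * codisp_m[i][j]
    return new_weights

def weighted_score(codisp_row, weights):
    """HYPOTHETICAL scoring: weight-averaged per-tree CoDisp for one sample."""
    return sum(w * c for w, c in zip(weights, codisp_row)) / sum(weights)

weights = [1.0, 1.0]
codisp_m = [
    [4.0, 1.0],  # sample 0: tree 0 scored it as far more anomalous than tree 1
    [2.0, 2.0],  # sample 1
]
feedback = [True, False]  # the user confirmed sample 0 only
weights = update_tree_weights(weights, codisp_m, feedback, delta=0.5)
score_0 = weighted_score(codisp_m[0], weights)
```

After the update, the tree that scored the confirmed anomaly more highly carries more weight, so it has more influence on later judgments.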

Furthermore, the present invention provides a time-series exception handling apparatus for implementing the time-series exception handling method, as shown in fig. 6, the apparatus including:

According to an embodiment of the present invention, further comprising:

In the invention, the data processing module acquires time series data, which includes regular small-scale time series data and irregular large-scale time series data. When irregular large-scale time series data are acquired, all the time series data are first clustered, and a model is then trained and constructed for each class of time series data.

The clustering process captures the associations among the time series data to be trained by means of DBSCAN and groups together data with similar shape and consistent periodicity.

In the clustering process, when the similarity of the time series data is evaluated, the distance between time series is calculated using Dynamic Time Warping (DTW).
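As an illustration of the distance used for clustering, the following is a minimal sketch of the classic dynamic-programming form of DTW with an absolute-difference local cost. The function names are hypothetical, and the text does not specify the local cost or any warping-window constraint, so both are assumptions here.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance between two series (absolute-difference local cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pairwise_dtw(series: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of DTW distances between every pair of series."""
    k = len(series)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist
```

A distance matrix of this kind could then be handed to a DBSCAN implementation that accepts precomputed distances, for example scikit-learn's `DBSCAN(metric="precomputed")`.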

The data classification processing module selects, according to the type of the time series data, feature data that can represent time series data of that type, and uses them to train and construct a model.

According to one embodiment of the invention, the abnormal data detection recommendation module uses RRCF to train on all the selected feature data: iterating over the feature data yields a plurality of decision trees, the decision trees form a decision forest, and voting by the decision forest then determines whether abnormal data exist in the real-time time series data.

In this embodiment, when constructing the decision tree, the RRCF selects a segmentation dimension for segmenting the feature data, and the RRCF has a probability of selecting the feature data in the segmentation dimension of

$g_i = \max_{x \in S}(x_j - x_{j-1})$; where $i$ is the feature; $p_i$ denotes the probability that feature $i$ is selected, with a value between 0 and 1; $l_i$ denotes the difference between the maximum value and the minimum value of feature $i$ in the feature set computed from the training sample set; $g_i$ denotes the maximum difference between two adjacent feature values after the values of feature $i$ in that feature set are sorted by size; $\sum g_j$ denotes the sum of the $g_j$ computed for each feature dimension $j$; and $\sum l_j$ denotes the sum of the $l_j$ computed for each feature dimension $j$.
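The two per-feature quantities defined above can be computed as in the following sketch. The function names are hypothetical. Because the expression that combines these quantities into p_i is not reproduced in this text, the sketch stops at the normalized shares l_i / Σ l_j and g_i / Σ g_j and does not attempt to form p_i.

```python
from typing import List, Sequence, Tuple


def range_and_max_gap(values: Sequence[float]) -> Tuple[float, float]:
    """Return (l, g): the value range and the largest gap between adjacent sorted values."""
    ordered = sorted(values)
    l = ordered[-1] - ordered[0]
    g = max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)
    return l, g


def normalized_shares(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """For rows of feature vectors, return per-feature l_i / sum(l_j) and g_i / sum(g_j)."""
    columns = list(zip(*samples))  # one tuple of values per feature dimension
    stats = [range_and_max_gap(col) for col in columns]
    total_l = sum(l for l, _ in stats)
    total_g = sum(g for _, g in stats)
    l_share = [l / total_l if total_l else 0.0 for l, _ in stats]
    g_share = [g / total_g if total_g else 0.0 for _, g in stats]
    return l_share, g_share
```

A feature with a wide range or a large gap between neighboring values receives a larger share, which is the property a cut-dimension selection rule based on l_i and g_i would exploit.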

In this embodiment, the RRCF equally divides the feature data in the segmentation dimension into $N$ intervals $[l_0, h_0, l_1, h_1, \ldots, l_{N-1}, h_{N-1}]$ and calculates the density of each interval, $d_i = \mathrm{Count}(p,\ p \in [l_i, h_i])$, wherein the probability that each interval is selected is

Finally, a cutting point is selected at random from the chosen interval, $X_i \sim \mathrm{Uniform}[l_i, h_i]$. Here $l_0$ and $h_{N-1}$ respectively denote the minimum value and the maximum value of the feature in that feature dimension over the training set; the difference between the maximum value and the minimum value is divided by $N$ to obtain the $N$ equal intervals.
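The interval construction, density count, and uniform draw described above can be sketched as follows. The function names are hypothetical. The interval-selection probability is not reproduced in this text, so the sketch takes the probabilities as an input; the density-proportional weighting in the usage lines is only a placeholder assumption, not the patent's formula.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def equal_intervals(values: Sequence[float], n: int) -> List[Interval]:
    """Split [min(values), max(values)] into n equal-width intervals [l_k, h_k]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    bounds = [lo + k * step for k in range(n)] + [hi]  # pin the last bound to the maximum
    return list(zip(bounds[:-1], bounds[1:]))


def interval_densities(values: Sequence[float], intervals: Sequence[Interval]) -> List[int]:
    """d_k = Count(p, p in [l_k, h_k]), using closed intervals as written in the text."""
    return [sum(1 for p in values if low <= p <= high) for low, high in intervals]


def sample_cut(intervals: Sequence[Interval], probabilities: Sequence[float],
               rng: random.Random) -> float:
    """Pick an interval with the given probabilities, then draw X ~ Uniform[l_k, h_k]."""
    low, high = rng.choices(list(intervals), weights=list(probabilities), k=1)[0]
    return rng.uniform(low, high)


values = [float(v) for v in range(9)]       # 0.0 .. 8.0
intervals = equal_intervals(values, 4)      # four intervals of width 2.0
densities = interval_densities(values, intervals)
# Placeholder weighting (an assumption, not the patent's formula): proportional to density.
cut = sample_cut(intervals, densities, random.Random(0))
```

Note that with closed intervals a value lying exactly on a shared boundary is counted in both neighboring intervals, which follows the notation in the text literally.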

When abnormal data exist, the anomaly score CoDisp of the abnormal data is calculated using the cutting point. To calculate CoDisp, the ratio $\mathrm{CoDisp}_{Node}$ of the amounts of abnormal data contained in the sibling subtree and the parent subtree of the cutting point is calculated, the largest ratio $\mathrm{CoDisp}_{Node}$ is selected, and the anomaly score of the abnormal data $x_i$ is

In the invention, the abnormal data detection recommendation module recommends part of the abnormal data by selecting the several most abnormal segments among the detected abnormal data, acquiring labels for those segments, and then recommending them; or

According to an embodiment of the present invention, after the model obtains the abnormal data of n labeled segments, these data and the M decision trees in the decision forest of the model jointly form an anomaly score matrix $\mathrm{CoDisp\_M}[x_i][\mathrm{tree}_j]$. For each abnormal datum $x_i$, if the feedback judgment result is a true positive, the weight of decision tree $\mathrm{tree}_j$ is updated as $tw_j = tw_j + \delta \times \mathrm{CoDisp\_M}[x_i][\mathrm{tree}_j]$, and decision trees with higher weight are selected according to the feedback judgment result so as to optimize the model.

To achieve the above object, the present invention also provides an electronic device comprising a processor, a memory, and a computer program that is stored on the memory and can run on the processor, wherein the computer program, when executed by the processor, implements the above time series exception handling method.

In order to achieve the above object, the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above time series exception handling method.

According to the above scheme, the invention provides an unsupervised, white-box, accurate time series exception handling method that is combined with active learning and can actively and efficiently collect feedback. On top of a traditional unsupervised learning framework, an active learning stage is introduced: anomalies are actively recommended to a judgment feedback party (such as a judgment feedback module or operation and maintenance personnel) and feedback is acquired, so that the model is corrected and its accuracy improves. The method retains the advantages of traditional unsupervised learning with respect to parameter tuning and labeling, designs a targeted strategy for applying label feedback, and further optimizes the recall rate, detection speed, and capacity of the model.

Moreover, the processing method shows no obvious bias toward particular data, adapts to indicators with scenario-specific semantics, meets operation and maintenance requirements outside the traditional Internet field, offers better extensibility and generality, and can give specific reasons for the anomalies it reports.

Moreover, the invention accurately detects and explains anomalies. It was tested on 1 public data set and on time series data from the actual production environments of 2 commercial banks, ultimately reaching F1-scores of 0.81 and 0.89 on the two data sets. Compared with traditional unsupervised exception handling methods, the best F1-score is improved by 0.19 to 0.5 on the two data sets, and the detection time is shortened by 58%.

Those of ordinary skill in the art will appreciate that the modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and devices may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a logical division, and other divisions are possible in an actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or modules, and may be electrical, mechanical, or of another form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, each functional module in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the time series exception handling method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

It should be understood that the order of execution of the steps in the summary of the invention and the embodiments of the present invention does not absolutely imply any order of execution, and the order of execution of the steps should be determined by their functions and inherent logic, and should not be construed as limiting the process of the embodiments of the present invention.

Claims

1. A time series exception handling method is characterized by comprising the following steps:

2. The method for processing the time series abnormality according to claim 1, wherein the acquiring of the time series data includes acquiring regular small-scale time series data and acquiring irregular large-scale time series data, and when acquiring the irregular large-scale time series data, all the time series data are clustered, and then each type of time series data is trained to construct a model.

3. The time series exception handling method according to claim 2, wherein the clustering process captures the associations among the time series data to be trained by means of DBSCAN, and groups together data with similar shape and consistent periodicity.

4. The method according to claim 3, wherein in the clustering process, in calculating the approximation degree of the time-series data, a distance between the time-series data is calculated using a dynamic time warping algorithm.

5. The time series exception handling method according to claim 4, wherein, according to the type of the time series data, feature data capable of representing time series data of that type are selected for training to construct a model.

6. The time series exception handling method according to claim 5, wherein a robust random cut forest (RRCF) is used to train on all the selected feature data, iterating over the feature data yields a plurality of decision trees, the decision trees form a decision forest, and voting by the decision forest then determines whether abnormal data exist in the real-time time series data.

7. The method of claim 6, wherein the RRCF selects a segmentation dimension for segmenting the feature data when constructing the decision tree, and the RRCF has a probability of selecting the feature data in the segmentation dimension as

$g_i = \max_{x \in S}(x_j - x_{j-1})$; wherein $i$ is the feature; $p_i$ denotes the probability that feature $i$ is selected, with a value between 0 and 1; $l_i$ denotes the difference between the maximum value and the minimum value of feature $i$ in the feature set computed from the training sample set; $g_i$ denotes the maximum difference between two adjacent feature values after the values of feature $i$ in that feature set are sorted by size; $\sum g_j$ denotes the sum of the $g_j$ computed for each feature dimension $j$; and $\sum l_j$ denotes the sum of the $l_j$ computed for each feature dimension $j$.

8. The time series exception handling method according to claim 7, wherein the RRCF equally divides the feature data in the segmentation dimension into $N$ intervals $[l_0, h_0, l_1, h_1, \ldots, l_{N-1}, h_{N-1}]$ and calculates the density of each interval, $d_i = \mathrm{Count}(p,\ p \in [l_i, h_i])$, wherein the probability that each of the intervals is selected is

Finally, a cutting point is selected at random from the chosen interval, $X_i \sim \mathrm{Uniform}[l_i, h_i]$; wherein $l_0$ and $h_{N-1}$ respectively denote the minimum value and the maximum value of the feature in that feature dimension over the training set, and the difference between the maximum value and the minimum value is divided by $N$ to obtain the $N$ equal intervals.

9. The method according to claim 8, wherein, when abnormal data exist, the cutting point is used to calculate an anomaly score CoDisp of the abnormal data; to calculate CoDisp, the ratio $\mathrm{CoDisp}_{Node}$ of the amounts of abnormal data contained in the sibling subtree and the parent subtree of the cutting point is calculated, the largest ratio $\mathrm{CoDisp}_{Node}$ is selected, and the anomaly score of the abnormal data $x_i$ is

T∈forest。

10. The time-series abnormality processing method according to claim 9, wherein the recommendation of the partial abnormality data is to select a plurality of pieces of the abnormality data which are presumed to be most abnormal; or

11. The time series exception handling method according to claim 10, wherein the abnormal data of n labeled segments obtained by the model and the M decision trees in the decision forest of the model together form an anomaly score matrix $\mathrm{CoDisp\_M}[x_i][\mathrm{tree}_j]$; for each abnormal datum $x_i$, if the feedback judgment result is a true positive, the weight of decision tree $\mathrm{tree}_j$ is updated as $tw_j = tw_j + \delta \times \mathrm{CoDisp\_M}[x_i][\mathrm{tree}_j]$, and decision trees with higher weight are selected according to the feedback judgment result so as to optimize the model.

12. A time-series exception handling apparatus, comprising:

13. The time-series exception handling apparatus according to claim 12, further comprising

14. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the time series exception handling method of any one of claims 1 to 11.

15. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the time-series abnormality processing method according to any one of claims 1 to 11.