KR101906859B1

KR101906859B1 - Apparatus and method for detecting anomalous subsequence

Info

Publication number: KR101906859B1
Application number: KR1020120030043A
Authority: KR
Inventors: 김상욱; 배덕호
Original assignee: 삼성전자 주식회사; 한양대학교 산학협력단
Priority date: 2012-03-23
Filing date: 2012-03-23
Publication date: 2018-10-11
Also published as: KR20130107889A

Abstract

The present invention relates to an abnormal subsequence detecting apparatus and method for detecting abnormal subsequences in time series data.
The abnormal subsequence detecting apparatus according to an embodiment of the present invention includes: a preprocessing unit that symbolizes time series data and constructs a suffix tree from the symbolized string; a window length determination unit that updates the maximal pattern length while traversing the constructed suffix tree and determines a window length range based on the maximum and minimum values of the updated maximal pattern length; an abnormal candidate derivation unit that derives abnormal subsequence candidates based on the grid distance between each subsequence and its k-th neighboring subsequence; an outlier calculation unit that calculates an outlier score for each derived abnormal subsequence candidate; and an abnormal subsequence detecting unit that detects the top-k abnormal subsequences based on the calculated outlier scores.

Description

APPARATUS AND METHOD FOR DETECTING ANOMALOUS SUBSEQUENCE

The present invention relates to an abnormal subsequence detecting apparatus and method for detecting abnormal subsequences in time series data.

Detecting abnormal subsequences in time series data is useful not only for removing noise that hinders the successful extraction of normal subsequences, but also in its own right in various domains.

In recent years, as data collection tools such as sensors and remote equipment have been developed and have grown in number, the amount of time series data has increased rapidly. It has therefore become more important to study efficient abnormal subsequence detection schemes for large-volume time series data.

Conventional methods for detecting abnormal subsequences in time series data are largely classified into projection-based abnormal subsequence detection schemes and window-based abnormal subsequence detection schemes.

The projection-based abnormal subsequence detection scheme first projects each subsequence onto a point in a space whose dimension corresponds to the subsequence length, and then detects abnormal subsequences using techniques based on clustering, statistics, prediction, information theory, and the like. Such projection-based schemes have a problem in that, as the subsequence becomes longer, it is projected onto a point in a higher-dimensional space, which makes it difficult to detect abnormal subsequences efficiently.

The window-based abnormal subsequence detection scheme derives subsequences from the entire time series data using the sliding window technique and then compares each derived subsequence with the other subsequences to find the most dissimilar subsequence. Such window-based schemes are known to be more effective than projection-based schemes at detecting abnormal subsequences that occur locally.

Conventional abnormal subsequence detection schemes have the following limitations. First, conventional studies focus on detecting abnormal subsequences of a fixed length. The length of the subsequence to be detected must therefore be input by the user, and this length greatly affects the accuracy of abnormal subsequence detection. Even subsequences with the same start position may be abnormal or normal depending on their length. However, it is very difficult for the user to determine the length of the abnormal subsequence, except when the time series data has a definite periodicity.

Second, normal subsequences and abnormal subsequences of various lengths can be mixed in one long real-world time series, depending on the domain. Referring to FIG. 1, most subsequences have similar lengths and shapes, as in group N. However, depending on the state, abnormal subsequences having various lengths and shapes, such as subsequences A1, A2 and A3, may occur. Therefore, it is difficult to derive the various abnormal subsequences occurring in the real world when only abnormal subsequences of one specific length are detected.

Third, a group of abnormal subsequences with similar length and shape may appear. In FIG. 1, the abnormal subsequences A2 and A3 have four and two similar abnormal subsequences, respectively. Although these abnormal subsequences have similar counterparts, they must be distinguished from normal subsequences because all of them cause failures or accidents.

One aspect of the present invention provides an apparatus and method for detecting abnormal subsequences that automatically determine a meaningful range of subsequence lengths and detect the top-k abnormal subsequences within that range.

The abnormal subsequence detecting apparatus according to an embodiment of the present invention includes: a preprocessing unit that symbolizes time series data and constructs a suffix tree from the symbolized string; a window length determination unit that updates the maximal pattern length while traversing the constructed suffix tree and determines a window length range based on the maximum and minimum values of the updated maximal pattern length; an abnormal candidate derivation unit that derives abnormal subsequence candidates based on the grid distance between each subsequence and its k-th neighboring subsequence; an outlier calculation unit that calculates an outlier score for each derived abnormal subsequence candidate; and an abnormal subsequence detecting unit that detects the top-k abnormal subsequences based on the calculated outlier scores.

The window length determination unit stores the number of leaf nodes for each internal node of the suffix tree while traversing the suffix tree, updates the maximum and minimum values of the maximal pattern length based on the stored leaf-node counts, and stores the information of internal nodes whose stored leaf-node count is k or less.

The window length determination unit determines the window length range to be from the minimum value of the maximal pattern length + 1 to the maximum value of the maximal pattern length.

The abnormal candidate derivation unit stores abnormal candidate cells based on the information of the internal nodes whose stored leaf-node count is k or less; stores, for each abnormal candidate cell, as its grid distance the minimum extension interval at which the sum of the occurrence frequencies of the cell and its neighboring cells is k + 1 or more; and calculates the maximum grid distance at which the sum of the occurrence frequencies, accumulated over the abnormal candidate cells in descending order of stored grid distance, is k + 1 or more.

The abnormal candidate derivation unit derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, the cells whose grid distance is smaller than the calculated maximum grid distance.

The abnormal subsequence detecting method according to an embodiment of the present invention includes the steps of: symbolizing time series data and constructing a suffix tree from the symbolized string; updating the maximal pattern length while traversing the constructed suffix tree and determining a window length range based on the maximum and minimum values of the updated maximal pattern length; deriving abnormal subsequence candidates based on the grid distance between each subsequence and its k-th neighboring subsequence; calculating an outlier score for each derived abnormal subsequence candidate; and detecting the top-k abnormal subsequences based on the calculated outlier scores.

The step of determining the window length range includes storing the number of leaf nodes for each internal node of the suffix tree while traversing the suffix tree, updating the maximum and minimum values of the maximal pattern length based on the stored leaf-node counts, and storing the information of internal nodes whose stored leaf-node count is k or less.

The step of determining the window length range determines the window length range to be from the minimum value of the maximal pattern length + 1 to the maximum value of the maximal pattern length.

The step of deriving the abnormal subsequence candidates includes storing abnormal candidate cells based on the information of the internal nodes whose stored leaf-node count is k or less; storing, for each abnormal candidate cell, as its grid distance the minimum extension interval at which the sum of the occurrence frequencies of the cell and its neighboring cells is k + 1 or more; and calculating the maximum grid distance at which the sum of the occurrence frequencies, accumulated over the abnormal candidate cells in descending order of stored grid distance, is k + 1 or more.

The step of deriving the abnormal subsequence candidates derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, the cells whose grid distance is smaller than the calculated maximum grid distance.

According to the aspect of the present invention described above, the length of abnormal subsequences can be determined automatically, abnormal subsequences of various lengths can be detected, groups of abnormal subsequences having the same length and shape can be detected, and abnormal subsequences appearing in various domains can be detected efficiently.

FIG. 1 shows time series data including groups of abnormal subsequences of various lengths and shapes.
FIG. 2 is a block diagram schematically showing a configuration of a time series data analysis system according to an embodiment of the present invention.
FIG. 3 is a block diagram schematically showing an internal configuration of the anomaly detection unit of the abnormal subsequence detection apparatus of FIG.
FIG. 4 is a block diagram schematically showing a hardware configuration of the abnormal subsequence detecting apparatus of FIG.
FIG. 5 is a diagram schematically illustrating a method of deriving subsequences according to the sliding window technique.
FIG. 6 is a diagram schematically showing whether each subsequence is frequent when the number k of abnormal subsequences desired by the user is four.
FIG. 7 is a diagram schematically showing a suffix tree annotated with subsequence lengths and the occurrence frequencies of the corresponding subsequences.
FIG. 8 is a diagram schematically showing the grid distance when the window length is 2, the number of alphabet symbols is 10, and the number k of abnormal subsequences desired by the user is 5.
FIG. 9 is a flowchart schematically illustrating a method of analyzing time series data according to an embodiment of the present invention.
FIG. 10 is a flowchart schematically illustrating the internal steps of the abnormal subsequence detecting process of FIG.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

In the embodiment of the present invention, time series data refers to sequence data having a time-stamp of a constant time interval. In the time series data, normal subsequences having a certain pattern and abnormal subsequences deviating from the normal pattern are mixed.

In an embodiment of the present invention, an anomalous subsequence, or discord, refers to the subsequence that is least similar to the other subsequences of the same length among all subsequences of the time series data.

FIG. 2 is a block diagram schematically showing a configuration of a time series data analysis system according to an embodiment of the present invention.

Referring to FIG. 2, the time series data analysis system includes a data collection environment 100 and an abnormal subsequence detecting apparatus 200.

The data collection environment 100 may be any object capable of collecting data. The data collection environment 100 includes at least one data collection unit to generate time series data.

As an example, the data collection environment 100 may represent a mechanical industry process. Here, the data collection unit generates time series data that reflects physical changes in the data collection environment 100. The data collection environment 100 may include a temperature sensor, a motion sensor, a pressure sensor, or the like as a data collection unit.

In another example, the data collection environment 100 may represent a financial analysis system. Here, the data collection unit may generate time series data reflecting the financial processing of the data collection environment 100.

In another example, the data collection environment 100 may represent a data processing device. Here, the data collection unit may generate time series data reflecting the memory consumption rate, data processing speed, and the like of the data collection environment 100.

In another example, the data collection environment 100 may correspond to a network configuration. Here, the data collection unit may generate time series data reflecting the traffic of the network or the flow of data in the data collection environment 100.

The data collection unit corresponds to at least one sensor or data collection operation. The data collection unit measures the state of the data collection environment 100 and generates time series data based on the data measured at a predetermined time interval.

The time series data generated by the data collection unit may have at least one field, and each field represents a value of a measured variable. For example, time series data reflecting the temperature change of the data collection environment 100 may have a field representing the measured temperature. The measured value of a variable can be expressed as a constant, or as a discrete combination that can be transformed into a constant.

The time series data generated by the data collection unit may also have a time field indicating the time at which the data was recorded.

The abnormal subsequence detecting apparatus 200 detects abnormal subsequences in the time series data transmitted from the data collection environment 100. The abnormal subsequence detecting apparatus 200 includes a data receiving unit 210, an anomaly detection unit 220, a result output unit 230, and a data calibration unit 240.

The data receiving unit 210 receives the time series data from the data collection environment 100. The data receiving unit 210 may convert the type of the received measurement value into a form of a measurement value usable in the analysis system. The data receiving unit 210 may store the collected time series data according to the analysis order.

The anomaly detection unit 220 receives the time series data from the data reception unit 210 and detects an abnormal subsequence within the time series data. The specific internal configuration of the anomaly detection unit 220 will be described in more detail with reference to FIG.

The result output unit 230 outputs the abnormal subsequences detected by the anomaly detection unit 220. For example, the result output unit 230 may display a graph of the time series data and may mark the abnormal subsequences within that graph.

The result output unit 230 may also display the time series data in the form of a table and mark the abnormal subsequences within the displayed table.

The data calibration unit 240 can perform an operation based on the abnormal subsequences detected by the anomaly detection unit 220. The data calibration unit 240 may perform a calibration operation on the data collection environment 100 based on the detection result of the anomaly detection unit 220. For example, if the data collection environment 100 corresponds to one process in a mechanical industry process, the data calibration unit 240 may change the state of the process so as to limit the occurrence of the anomalies detected by the anomaly detection unit 220.

FIG. 3 is a block diagram schematically showing an internal configuration of the anomaly detection unit of the abnormal subsequence detection apparatus of FIG.

Referring to FIG. 3, the anomaly detection unit 220 includes a preprocessing unit 221 for symbolizing time series data, a window length determination unit 222 for determining a meaningful window length range, an abnormal candidate derivation unit 223 for deriving subsequences that are candidates for abnormal subsequences, an outlier calculation unit 224 for calculating an outlier score for each candidate subsequence, and an abnormal subsequence detection unit 225 for detecting the abnormal subsequences.

The preprocessing unit 221 constructs a suffix tree by encoding the time series data by reflecting the distribution of the entire data. The preprocessing unit 221 encodes the time series data using the min and max ranges of the data collected by the data collection unit.

The preprocessing unit 221 scans the input time series data once and converts each value into a symbol according to the interval it falls in, where the intervals are obtained by dividing the min-to-max range of the data equally by the chosen number of alphabet symbols. For example, when the min and max are 0.0 and 1.0 and the number of alphabet symbols is 10, the intervals are assigned the symbols a, b, c, d, e, f, g, h, i, j in order. That is, a value in the third interval, from 0.2 up to 0.3, is converted to the symbol c. A suffix tree is then constructed from the entire symbolized string.
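The symbolization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `symbolize`, the use of lowercase letters, and the clamping of the maximum value into the last interval are choices made here for the example, and suffix tree construction is omitted.

```python
def symbolize(values, alphabet_size=10):
    """Map each value to a lowercase letter by equal-width binning over [min, max].

    Assumes alphabet_size <= 26 (hypothetical limit so that letters suffice).
    """
    lo, hi = min(values), max(values)
    # Width of one interval; fall back to 1.0 for constant data to avoid dividing by zero.
    width = (hi - lo) / alphabet_size or 1.0
    symbols = []
    for v in values:
        # The maximum value would land in bin `alphabet_size`, so clamp it into the last bin.
        idx = min(int((v - lo) / width), alphabet_size - 1)
        symbols.append(chr(ord("a") + idx))
    return "".join(symbols)


encoded = symbolize([0.0, 0.23, 0.55, 1.0], alphabet_size=10)
```

With a range of 0.0 to 1.0 and 10 symbols, the value 0.23 falls in the third interval and is encoded as `c`. Values that sit exactly on an interval boundary may land on either side because of floating-point rounding.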

The window length determination unit 222 determines a meaningful window length range using the concept of a maximal pattern while traversing the suffix tree constructed by the preprocessing unit 221.

The window length determination unit 222 stores, for each internal node of the suffix tree, the number of leaf nodes located beneath it. The window length determination unit 222 stores the minimum value and the maximum value of the maximal pattern length over all internal nodes, and also stores pointers to the internal nodes whose occurrence frequency is less than or equal to k. When the traversal is completed, the window length determination unit 222 sets the range from the minimum value of the maximal pattern length + 1 to the maximum value of the maximal pattern length as the window length range.

The abnormal candidate derivation unit 223 derives the subsequences that are candidates for the top-k abnormal subsequences, namely those for which the grid distance between a subsequence of a meaningful window length and the k-NN subsequence of that subsequence is large.

The abnormal candidate derivation unit 223 performs the following operations for each window length in the range stored by the window length determination unit 222. The abnormal candidate derivation unit 223 obtains the cells having occurrence frequencies of 1 or more and less than k, using the information of the internal nodes having an occurrence frequency less than or equal to k, and stores them as abnormal candidate cells. For each abnormal candidate cell, the abnormal candidate derivation unit 223 finds the minimum extension interval at which the sum of the occurrence frequencies of the cell itself and its neighboring cells is greater than or equal to k + 1, and stores it as the grid distance of that cell. The abnormal candidate derivation unit 223 then accumulates the occurrence frequencies of the cells in descending order of grid distance and stores the maximum grid distance at which the sum becomes greater than or equal to k + 1. Finally, the abnormal candidate derivation unit 223 prunes, from the abnormal candidate cells, the cells whose grid distance is smaller than the maximum grid distance so obtained.
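The grid distance and pruning steps described above can be illustrated with a small sketch. Several details are assumptions made here rather than statements from the source: cells are represented as tuples of symbol indices, the "extension interval" is interpreted as a Chebyshev radius around the cell, candidate cells are taken to be those with frequency from 1 to k, and the names `grid_distance` and `prune_candidates` are hypothetical. Cell frequencies are passed in as a plain dictionary instead of being read from a suffix tree.

```python
def grid_distance(cell, freq, k, alphabet_size):
    """Smallest radius at which the cell and its neighbors hold at least k + 1 subsequences.

    Assumption: neighbors within radius r are cells whose symbol indices differ
    from `cell` by at most r in every position (Chebyshev distance).
    """
    for radius in range(alphabet_size):
        total = sum(
            count for other, count in freq.items()
            if max(abs(a - b) for a, b in zip(cell, other)) <= radius
        )
        if total >= k + 1:
            return radius
    return alphabet_size  # the whole grid holds fewer than k + 1 subsequences


def prune_candidates(freq, k, alphabet_size):
    """Keep only candidate cells whose grid distance reaches the pruning threshold."""
    candidates = {
        cell: grid_distance(cell, freq, k, alphabet_size)
        for cell, count in freq.items() if 1 <= count <= k
    }
    # Accumulate frequencies in descending order of grid distance until k + 1 is reached.
    threshold, total = 0, 0
    for cell, dist in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
        total += freq[cell]
        if total >= k + 1:
            threshold = dist
            break
    return {cell: dist for cell, dist in candidates.items() if dist >= threshold}


# Window length 2, 10 symbols: each cell is a pair of symbol indices with its frequency.
freq = {(0, 0): 5, (0, 1): 4, (3, 3): 1, (5, 5): 1, (8, 9): 1, (9, 9): 1}
kept = prune_candidates(freq, k=2, alphabet_size=10)
```

In this example the three isolated cells have grid distance 4 and together hold k + 1 = 3 subsequences, so the threshold is 4 and the cell (3, 3), with grid distance 3, is pruned.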

The outlier calculation unit 224 calculates an outlier score for the subsequences included in each abnormal candidate cell. For each abnormal candidate cell, the outlier calculation unit 224 obtains the k-th neighbor and calculates the outlier score.

The outlier calculation unit 224 performs the following operations for each abnormal subsequence candidate included in the derived abnormal candidate cells. Using the input data of all subsequences included in the neighboring cells within the grid distance calculated for each abnormal candidate cell, the outlier calculation unit 224 calculates the actual distance (k-NN real distance) between the candidate and its k-th neighbor subsequence. Then, using the k-NN real distance, the normalized outlier score of each abnormal subsequence candidate in each abnormal candidate cell is computed, and the candidates ranked up to the k-th place are stored.

The abnormal subsequence detection unit 225 detects the top-k abnormal subsequences on the basis of the outlier scores stored by the outlier calculation unit 224. The abnormal subsequence detection unit 225 derives the top-k abnormal subsequences by aggregating the rankings of the outlier scores calculated by the outlier calculation unit 224.

FIG. 4 is a block diagram schematically showing a hardware configuration of the abnormal subsequence detecting apparatus of FIG.

Referring to FIG. 4, the hardware configuration of the abnormal subsequence detecting apparatus 200 can be any computer apparatus capable of performing the abnormal subsequence detection operations described above.

In addition, the hardware configuration of the abnormal subsequence detecting apparatus 200 can be any computer apparatus used in a general data collection environment 100. For example, if the data collection environment 100 is a network configuration, the hardware configuration of the abnormal subsequence detecting apparatus 200 may correspond to a client computer or a server computer.

The hardware configuration may include at least one non-volatile memory or volatile memory. For example, the volatile memory may be main memory 320 (RAM), and the non-volatile memory may be auxiliary memory 33 (ROM).

The hardware configuration can perform various operations as the central processing unit (CPU) 31 executes the instructions provided from the memory. The hardware configuration may include a media device, where the media device may be a hard disk device or an optical disk device, and so on.

The hardware configuration may include various input devices 340 and output devices 350 for receiving various inputs from a user or transmitting various outputs to a user. For example, the output device 350 may be a display device or a device associated with a graphical user interface (GUI).

The hardware configuration may include a communication device 360 for exchanging data with other devices. And, the communication device 360 can be connected to various kinds of networks.

The communication bus 370 connects the hardware configuration described above so that data can be exchanged between the hardware configurations described above.

Hereinafter, an abnormal subsequence detecting method according to an embodiment of the present invention will be described in detail with reference to the configuration of the above-described time series data analysis system.

First, the conventional window-based abnormal subsequence detection methods will be described, followed by the differences between the conventional detection methods and the methods proposed in the embodiments of the present invention.

FIG. 5 is a diagram schematically illustrating a method of deriving subsequences according to the sliding window technique.

Referring to FIG. 5, the sliding window technique derives the time records contained in a fixed-length window X as a single subsequence. The window X is slid by one or several positions at a time to extract all possible subsequences. Accordingly, when the length of the window X is w and the time series data has length n, a total of (n - w + 1) subsequences can be extracted.
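As a brief illustration of the count given above, the following sketch extracts the windows from a list. The function name `sliding_windows` and the `step` parameter are assumptions introduced for this example; with a step of one position, a series of length n yields n - w + 1 windows.

```python
def sliding_windows(series, w, step=1):
    """Return every length-w window of `series`, moving `step` positions at a time."""
    return [series[i:i + w] for i in range(0, len(series) - w + 1, step)]


series = list(range(10))          # n = 10
windows = sliding_windows(series, w=4)
# With step=1 there are n - w + 1 = 7 windows.
```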

In the conventional window-based abnormal subsequence detection method, all possible subsequences are derived from the entire time series data using the sliding window technique, and each derived subsequence is then checked to determine whether it is abnormal. In this case, the dissimilarity to the closest subsequence is taken as the outlier score of the corresponding subsequence, and the subsequence having the largest outlier score is detected as the abnormal subsequence. These window-based methods are known to be more effective than other schemes at deriving abnormal subsequences that occur locally.

E. Keogh, J. Lin, and A. Fu, "HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence", In Proc. of IEEE Int'l Conf. on Data Mining, IEEE ICDM, pp. 8-15, 2005, proposed an algorithm that is linear in the length of the time series data, based on the intuitive heuristic that the subsequence whose symbolized string occurs least frequently is a candidate for the abnormal subsequence.

However, all conventional detection methods deal only with the detection of abnormal subsequences of a fixed length. Some detection methods deal with the detection of clusters of abnormal subsequences with similar patterns, but there is no method for detecting the abnormal subsequences of various lengths that occur in real-world time series data.

Accordingly, in the embodiment of the present invention, the characteristics of abnormal subsequences in time series data are analyzed first. Then, based on this analysis, the top-k abnormal subsequences of various lengths are detected.

First, one subsequence can be a normal subsequence or an abnormal subsequence depending on the window length. If the window length is very large, most subsequences have a low frequency of occurrence, so most subsequences can be regarded as abnormal. Conversely, when the window length is very small, most subsequences have a high frequency of occurrence, so most subsequences can be regarded as normal.

In conventional detection methods, the length of the abnormal subsequence to be detected is input by the user or determined using a heuristic technique. However, it is difficult for the user to input the correct length, except when the periodicity of the time series data is definite.

Second, abnormal subsequences of various lengths are mixed in one long time series. Conventional methods focus on detecting abnormal subsequences of a fixed length, and abnormal subsequences of various lengths could be detected by running a conventional method once for every possible window length. In this case, however, the data structures required for each window length must be rebuilt each time, and the abnormal subsequence detection process must be performed again each time, which is very inefficient.

Third, abnormal subsequences with similar patterns may appear in one long time series. As the time series becomes longer, abnormal subsequences having the same or similar length and shape may appear in small groups. Although grouped, these subsequences are considered abnormal because they are very few in number compared to normal subsequences.

In order to derive such small groups of abnormal subsequences having similar length and shape, the distance to the k-th nearest neighbor (k-NN) subsequence is used instead of the distance to the nearest subsequence. In this case, k is the number of abnormal subsequences that the user wants to detect.
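The effect of using the k-th nearest neighbor rather than the nearest neighbor can be sketched as follows. This is a brute-force illustration, not the suffix-tree-based method of the embodiment. The names `euclidean` and `knn_distance` are hypothetical, and skipping windows that overlap the target window is an assumption made here to avoid trivial matches.

```python
import math


def euclidean(c, d):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c, d)))


def knn_distance(series, start, w, k):
    """Distance from the window at `start` to its k-th nearest non-overlapping window."""
    target = series[start:start + w]
    dists = sorted(
        euclidean(target, series[j:j + w])
        for j in range(len(series) - w + 1)
        if abs(j - start) >= w  # assumption: ignore windows overlapping the target
    )
    return dists[k - 1] if len(dists) >= k else math.inf


series = [0, 0, 0, 0, 0, 0, 9, 9, 0, 0, 0, 0]
normal_score = knn_distance(series, start=0, w=2, k=3)   # many identical windows nearby
spike_score = knn_distance(series, start=6, w=2, k=1)    # the [9, 9] window stands alone
```

A window that belongs to a small group of similar abnormal windows has a near-zero nearest-neighbor distance, but its k-th neighbor distance stays large as long as the group has fewer than k other members, which is why the k-NN distance is the more suitable score here.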

In the embodiment of the present invention, based on the above description, the time series data analysis system uses a new method that automatically detects the top-k abnormal subsequences of various lengths in one long time series.

The inputs of the time series data analysis system are the time series data T and the number k of abnormal subsequences desired by the user. The output of the time series data analysis system is the position and length information of the top-k abnormal subsequences.

In this case, when one abnormal subsequence is detected, the subsequences overlapping it are also likely to be detected as abnormal subsequences, but such results are meaningless. Therefore, when two subsequences overlap each other (that is, when the difference between their start positions is less than the length), the overlapping subsequence is excluded from the results.
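One way to apply the overlap exclusion when assembling the final top-k list is a greedy selection by descending outlier score. This is only a sketch under assumptions: the tuple layout `(start, length, score)`, the function name `top_k_non_overlapping`, and the use of interval intersection as the overlap test (so that candidates of different lengths can be compared) are choices made for this example.

```python
def top_k_non_overlapping(candidates, k):
    """Pick up to k candidates by descending score, skipping any that overlap a chosen one.

    candidates: list of (start, length, score) tuples.
    """
    selected = []
    for start, length, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # Two intervals [s, s + l) overlap when each starts before the other ends.
        overlaps = any(start < s + l and s < start + length for s, l, _ in selected)
        if not overlaps:
            selected.append((start, length, score))
            if len(selected) == k:
                break
    return selected


candidates = [(10, 5, 9.0), (12, 5, 8.5), (40, 3, 7.0), (41, 6, 6.0), (70, 4, 5.0)]
result = top_k_non_overlapping(candidates, k=3)
```

Here the candidates starting at 12 and 41 are dropped because they overlap higher-scoring candidates that were already selected.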

In the embodiment of the present invention, the time series data is T = {e1, e2, e3, ..., en} (where ei is the data stored at the i-th time), and a subsequence of the time series data T is C = {ep, ..., ep+w-1}. That is, a subsequence C of T is an ordered set of data extracted contiguously from any position p of T, whose length w is less than the length n of T.

Dist (C, D) is a symmetric function that returns a nonnegative value, which is a function to find the distance between two subsequences C and D of the same length. That is, Dist (C, D) = Dist (D, C). In the embodiment of the present invention, the Euclidean distance of the following equation (1) is used as a distance function.

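[Equation 1]

$$\mathrm{Dist}(C, D) = \sqrt{\sum_{i=1}^{w} (c_i - d_i)^2}$$

where c_i and d_i are the i-th elements of C and D, respectively.

In an embodiment of the present invention, in order to detect abnormal subsequences that form a cluster, the normalized distance between an abnormal subsequence candidate and its k-NN subsequence, given by Equation (2), is used as the outlier score.

[Equation 2]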

Here, w represents the length of each subsequence.

In the embodiment of the present invention, an abnormal subsequence detection method based on a suffix tree is used. First, a method of detecting abnormal subsequences of various lengths using a conventional single-length abnormal subsequence detection method is described below.

A simple way to detect the top-k abnormal subsequences of various lengths in one long time series is to repeat a conventional single-length abnormal subsequence detection method for every possible window length (2 to n/2, where n is the length of the entire time series data). Then, among the abnormal subsequences detected for each window length, the top-k with the largest outlier scores are selected as the final abnormal subsequences.

The problems of this method are as follows. First, abnormal subsequence detection is attempted even for meaningless window lengths. That is, detection is attempted for window lengths so short that all subsequences are normal, and for window lengths so long that all subsequences are abnormal.

Second, abnormal subsequence detection is performed independently for each window length. That is, the data structure necessary for detecting abnormal subsequences must be rebuilt each time, and the detection process must be performed again each time. Therefore, this method takes (n/2 - 1) times as long as a single-length abnormal subsequence detection scheme.

Third, abnormal subsequences are not detected accurately. Conventional detection methods use heuristic schemes that favor performance over accuracy. Such a heuristic is not a problem when detecting abnormal subsequences of a single length, but it causes a serious problem when detecting abnormal subsequences of various lengths. Therefore, a method for detecting abnormal subsequences of various lengths must have high detection accuracy.

Hereinafter, the meaningful window length range is described first, together with a method for deriving the range quickly. Next, a method for quickly detecting abnormal subsequences of various lengths using a suffix tree over the derived window length range is described. Finally, a grid-distance-based k-NN candidate detection method for improving the accuracy of abnormal subsequence detection is described.

In embodiments of the present invention, a meaningful range of window lengths for detecting an ideal subsequence is determined based on the following observations.

FIG. 6 is a diagram schematically showing whether each subsequence is frequent when the number k of abnormal subsequences desired by the user is four.

Referring to FIG. 6, a row indicates a start position of a subsequence in time series data, and a column indicates a length of a subsequence (window).

First, for a subsequence C, if there are k or more subsequences identical to C, then C can never be a top-k abnormal subsequence. In the embodiment of the present invention, a subsequence having k or more identical subsequences is defined as a frequent subsequence.

Second, for a non-frequent subsequence C of length m, all super-subsequences of length m + 1 or more that contain C are also non-frequent. In FIG. 6, the subsequence with a window length of 3 and a start position of 1 is not frequent. Therefore, all super-subsequences containing that subsequence are not frequent.

Third, for a window length w2, if all subsequences of that length are frequent, then for every window length w1 with w1 ≤ w2 ≤ n, all subsequences of length w1 are also frequent. In FIG. 6, all subsequences with a window length of 2 are frequent. Thus, all subsequences with a window length of 1 are also frequent.

Fourth, for a window length w1, if none of the subsequences of that length is frequent, then for every window length w2 with w1 ≤ w2 ≤ n, none of the subsequences of length w2 is frequent. In FIG. 6, no subsequence with a window length of 5 is frequent. Therefore, no subsequence with a window length of 6 or more is frequent.

As a result, if all subsequences of a specific window length w1 are frequent, abnormal subsequence detection need not be performed for the window length w1. Similarly, if no subsequence of a specific window length w2 is frequent, it is meaningless to perform abnormal subsequence detection for the window length w2.

Therefore, in the embodiment of the present invention, a meaningful window length range for detecting an abnormal subsequence is determined as follows. That is, the minimum window length is the smallest window length in which one or more non-frequent subsequences exist, and the maximum window length is the largest window length in which one or more frequent subsequences exist.
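The definition above can be checked by brute force on a small symbol string. The following is a minimal sketch, not the method of the embodiment: it counts every subsequence of every length directly instead of using a suffix tree, and the function name `meaningful_window_range` is hypothetical. A subsequence is treated as frequent when it occurs k + 1 times or more, that is, when it has k or more identical subsequences besides itself.

```python
from collections import Counter


def meaningful_window_range(symbols, k):
    """Return (w_min, w_max) for a symbol string by exhaustive counting.

    w_min: smallest window length with at least one non-frequent subsequence.
    w_max: largest window length with at least one frequent subsequence.
    Either value is None when no such window length exists.
    """
    n = len(symbols)
    w_min = w_max = None
    for w in range(1, n + 1):
        # Occurrence frequency of every subsequence of length w.
        counts = Counter(symbols[i:i + w] for i in range(n - w + 1))
        has_frequent = any(c >= k + 1 for c in counts.values())
        has_non_frequent = any(c <= k for c in counts.values())
        if has_non_frequent and w_min is None:
            w_min = w
        if has_frequent:
            w_max = w
    return w_min, w_max


print(meaningful_window_range("abababab", 2))
```

This exhaustive version costs quadratic time and is only meant to make the two bounds concrete.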

In the embodiment of the present invention, a suffix tree is used to store information on subsequences having various lengths in one data structure. Here, the suffix tree is a prefix tree that stores all the suffixes of a string.

The suffix tree stores all the suffixes extracted for each start position. Therefore, truncating a stored suffix to a desired length yields the subsequence of that length at the corresponding start position. Since the suffix tree thus stores subsequence information of all lengths for each start position, no new data structure needs to be built even when abnormal subsequence detection is performed for various window lengths.

In addition, the suffix tree can be constructed with a single scan of the time series data, and it is very fast to build and search. The construction complexity of the suffix tree is O(n), and the search complexity is O(l), where n is the length of the time series data and l is the length of the subsequence to be searched. Therefore, a subsequence of a desired length can be extracted quickly by using the suffix tree.

In the embodiment of the present invention, after the suffix tree is constructed, the tree is traversed once to additionally store, for each internal node, the subsequence length and the occurrence frequency of the corresponding subsequence.

FIG. 7 is a diagram schematically showing a suffix tree to which a subsequence length and the occurrence frequency of the corresponding subsequence are added.

Referring to FIG. 7, the time series data is T = {a, b, c, a, b, b, a, b, b}, and k is the number of abnormal subsequences desired by the user. The subsequence length and the occurrence frequency of the corresponding subsequence are added to each internal node and stored in the form (subsequence length: occurrence frequency of the subsequence).

This frequency information is useful in detecting abnormal subsequences of various lengths, for example in determining a meaningful window length range, deriving abnormal subsequence candidates for each determined window length, and securing neighboring subsequences for each candidate.
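As an illustration of storing the length and occurrence frequency per node, the sketch below builds an uncompressed suffix trie for the example string T. This is a simplifying assumption: a true suffix tree is compressed and can be built in O(n), whereas this trie takes O(n^2) time and space. The names `Node`, `build_suffix_trie`, and `frequency` are hypothetical.

```python
class Node:
    """Trie node; `depth` is the subsequence length and `count` is the
    occurrence frequency of the subsequence spelled by the path from the root."""

    def __init__(self, depth=0):
        self.children = {}
        self.depth = depth
        self.count = 0


def build_suffix_trie(symbols):
    """Insert every suffix; each visit to a node counts one occurrence."""
    root = Node()
    for start in range(len(symbols)):
        node = root
        for ch in symbols[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
            node.count += 1
    return root


def frequency(root, pattern):
    """Occurrence frequency of `pattern`, found in O(len(pattern)) steps."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


root = build_suffix_trie("abcabbabb")
print(frequency(root, "ab"), frequency(root, "abb"))
```

Because every subsequence is a prefix of some suffix, the count stored at a node equals the number of start positions sharing that subsequence.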

Conventional single-length abnormal subsequence detection methods use a heuristic scheme that greatly improves speed at the cost of greatly reduced accuracy in the neighbor derivation step. This heuristic scheme is not a problem when detecting abnormal subsequences of a single length. However, serious problems may occur when detecting abnormal subsequences of various lengths.

In order to solve this problem, the embodiment of the present invention uses a grid-distance-based abnormal subsequence detection method that can simultaneously perform the search for abnormal subsequence candidates and the accurate k-NN subsequence search for each candidate.

For two subsequences S = {s1, s2, ..., sw} and R = {r1, r2, ..., rw} of length w, the grid-distance between the two subsequences is denoted grid-distance(S, R).

FIG. 8 is a diagram schematically showing the grid-distance when the window length is 2, the number of alphabet symbols is 10, and the number of abnormal subsequences desired by the user is k = 5.

Referring to FIG. 8, when the window length is 2 and k is 5, the grid-distance(A, B) between the subsequence A corresponding to the string `ee` and the subsequence B corresponding to the string `ii` is 4. The grid-distance(A, 5-NN(A)) between the subsequence A and the 5-NN of the subsequence A is 1, and the grid-distance(B, 5-NN(B)) between the subsequence B and the 5-NN of the subsequence B is 2.

Here, the grid-distance (S, R) is calculated by the following equation (3).

[Equation (3)]
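Equation (3) is not reproduced in this text, so the sketch below rests on an assumption: the grid-distance is taken to be the largest per-position difference between symbol indices (the Chebyshev distance on the grid). This choice is made only because it reproduces the FIG. 8 example value grid-distance(`ee`, `ii`) = 4. It should not be read as the definition given by the embodiment. The function name `grid_distance` and the default alphabet are hypothetical.

```python
def grid_distance(s, r, alphabet="abcdefghij"):
    """Assumed grid-distance: maximum per-position gap between symbol indices."""
    if len(s) != len(r):
        raise ValueError("subsequences must have the same window length")
    index = {ch: i for i, ch in enumerate(alphabet)}
    return max((abs(index[a] - index[b]) for a, b in zip(s, r)), default=0)


print(grid_distance("ee", "ii"))
```

Under this assumption, all cells at grid-distance 1 from a cell form the ring of cells immediately surrounding it.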

Using the grid-distance, subsequences that are candidates for abnormal subsequences can be found quickly. By treating as candidates only the subsequences with the top-k largest grid-distances to their k-NN, the top-k abnormal subsequences can be computed quickly and accurately.

Furthermore, the grid-distance can be used to secure the neighboring k-NN of each candidate subsequence. While searching for the candidates, that is, the top-k subsequences with the largest grid-distances to their k-NN, a list of the subsequences visited during the search is additionally stored. The stored list becomes the k-NN candidates of each candidate subsequence. In this way, the grid-distance yields the k-NN candidates of each candidate subsequence without additional searching.

FIG. 9 is a flowchart schematically illustrating a method of analyzing time series data according to an embodiment of the present invention.
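The candidate search and the stored neighbor list can be sketched for one window length as follows. Several points are assumptions made for illustration: the grid-distance is taken as the Chebyshev distance between symbol codes, cells are compared by brute force rather than through the suffix tree, and the cut-off follows the rule worded in the claims (candidate frequencies are added in descending order of grid-distance until the sum reaches k + 1). The function names are hypothetical.

```python
from collections import Counter


def chebyshev(s, r):
    # Assumed grid-distance: largest per-position gap between symbol codes.
    return max(abs(ord(a) - ord(b)) for a, b in zip(s, r))


def knn_grid_distance(cell, counts, k):
    """Smallest radius d such that the cells within grid-distance d of `cell`
    (including `cell`) hold at least k + 1 subsequences, plus those cells."""
    radii = sorted({chebyshev(cell, other) for other in counts})
    for d in radii:
        neighbours = [o for o in counts if chebyshev(cell, o) <= d]
        if sum(counts[o] for o in neighbours) >= k + 1:
            return d, neighbours
    return None, list(counts)


def derive_candidates(symbols, w, k):
    """Return (grid-distance to k-NN, cell, neighbour cells) per candidate cell."""
    counts = Counter(symbols[i:i + w] for i in range(len(symbols) - w + 1))
    scored = []
    for cell, freq in counts.items():
        if freq <= k:  # frequent cells can never be top-k abnormal
            d, neighbours = knn_grid_distance(cell, counts, k)
            if d is not None:
                scored.append((d, cell, neighbours))
    scored.sort(key=lambda item: -item[0])
    # Add candidate frequencies in descending grid-distance until k + 1 is reached.
    total, threshold = 0, None
    for d, cell, _ in scored:
        total += counts[cell]
        if total >= k + 1:
            threshold = d
            break
    if threshold is None:
        return scored
    return [item for item in scored if item[0] >= threshold]


print(derive_candidates("aabbaabbcj", 1, 1))
```

The neighbour cells returned with each candidate play the role of the stored list, so the later k-NN computation does not need a second search.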

Referring to FIG. 9, data is collected in a data collection environment 100 including at least one data collector to generate time series data (410). At this time, the time series data has at least one segment area and a time recording area.

Next, the data receiving unit 210 receives the time series data generated from the data collection environment 100 (420). Next, the anomaly detection unit 220 detects an abnormal subsequence within the time series data received from the data receiving unit 210 (430). The specific procedure by which the anomaly detection unit 220 detects the abnormal subsequence will be described in more detail with reference to FIG. 10.

Next, the result output unit 230 outputs the result of the abnormal subsequence detected by the anomaly detection unit 220, and according to the user's selection, the data calibrating unit 240 performs a calibration operation for the data collection environment 100 (440).

FIG. 10 is a flowchart schematically illustrating the internal process of the abnormal subsequence detection step of FIG. 9.

Referring to FIG. 10, the method of detecting the top-k abnormal subsequences of various lengths is as follows. To detect abnormal subsequences of various lengths, a preprocessing process and a window length range determination process must be performed in addition to the steps of the conventional single-length abnormal subsequence detection method.

After the entire time series data is symbolized, a preprocessing process of building a suffix tree is performed (510). There are two ways to symbolize the entire data. One is to symbolize the data so as to reflect its distribution, and the other is to divide the data range into intervals of equal width.

To reflect the distribution of the data, the distribution of the entire data must either be assumed or be determined exactly by scanning the input data once beforehand. However, an additional scan of the entire input data is not appropriate for very long time series data, and an assumed distribution loses accuracy if the assumption is incorrect.

Therefore, in the embodiment of the present invention, the already known minimum and maximum of the data measuring instrument are used to divide the data range into equal-width intervals, one per alphabet symbol. For example, when the minimum and maximum are 0.0 and 1.0 and the number of alphabet symbols is 10, the range is divided into ten equal intervals that are assigned the symbols a, b, c, d, e, f, g, h, i, and j in order. That is, the value 0.23 is converted to the symbol c.
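Equal-width symbolization can be sketched as follows. The function name `symbolize`, the use of lowercase letters as the alphabet (which limits the sketch to 26 symbols), and the clamping of out-of-range values to the first or last symbol are assumptions made for illustration.

```python
import string


def symbolize(values, lo, hi, n_symbols=10):
    """Map each value to the letter of its equal-width interval in [lo, hi]."""
    if not 1 <= n_symbols <= 26:
        raise ValueError("this sketch supports 1 to 26 symbols")
    alphabet = string.ascii_lowercase[:n_symbols]
    width = (hi - lo) / n_symbols
    out = []
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so that v == hi (or any out-of-range value) still maps to a symbol.
        idx = min(max(idx, 0), n_symbols - 1)
        out.append(alphabet[idx])
    return "".join(out)


print(symbolize([0.23, 0.0, 1.0], 0.0, 1.0))
```

Only the known measurement range is needed, so the data can be symbolized in a single pass. Values that fall exactly on an interval boundary may land on either side because of floating-point rounding.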

Dividing the data range into equal-width intervals is straightforward. In addition, by using the grid-distance, which approximates the outlier score of an abnormal subsequence, k or more neighbors can be secured for each abnormal subsequence candidate in time linear in the data length.

Next, the generated suffix tree is traversed and the occurrence frequency information for each subsequence is stored, and a meaningful window length range is determined using the maximal pattern concept (520).

In the embodiment of the present invention, a suffix tree is used to determine a meaningful window length range. To do this, we construct a suffix tree and use the number of leaf nodes stored in each node while visiting all nodes. The number of leaf nodes indicates the number of start positions sharing the subsequence indicated by the path label from the root node to the corresponding internal node, that is, the occurrence frequency of the corresponding subsequence.

When detecting the top-k abnormal subsequences, a maximal pattern means the longest pattern with an occurrence frequency of k + 1 or more among the patterns sharing the same prefix. While the tree is visited, the pattern represented by an internal node whose stored occurrence frequency is k + 1 or more is taken as the maximal pattern at that point, and the maximum value and the minimum value of the maximal pattern length are continuously updated. The complexity of visiting the entire tree is linear in the data length, and after the visit is completed, the meaningful window length range is obtained from the maximum value and the minimum value of the maximal pattern length.

For efficient post-processing, pointers to the internal nodes of patterns that occur fewer than k + 1 times are stored during the tree visit. If storage space must be saved, nodes outside the meaningful window length range can be pruned from the suffix tree.
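The tree visit can be sketched on an uncompressed suffix trie, a stand-in that costs O(n^2) rather than the linear time of a real suffix tree. The text does not spell out how the minimum maximal pattern length maps to the lower bound of the range, so this sketch takes the bounds directly from the earlier definition: the lower bound is the depth of the shallowest non-frequent node, and the upper bound is the depth of the deepest frequent node (the longest maximal pattern). All names are hypothetical.

```python
class Node:
    def __init__(self, depth=0):
        self.children = {}
        self.depth = depth   # subsequence length
        self.count = 0       # occurrence frequency


def build_suffix_trie(symbols):
    root = Node()
    for start in range(len(symbols)):
        node = root
        for ch in symbols[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
            node.count += 1
    return root


def window_range_from_trie(root, k):
    """Return (w_min, w_max, non_frequent_nodes) from one traversal."""
    w_min = w_max = None
    non_frequent = []  # pointers kept for post-processing
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        if node.count >= k + 1:
            # Frequent pattern: it may still be extended, so keep descending.
            w_max = node.depth if w_max is None else max(w_max, node.depth)
            stack.extend(node.children.values())
        else:
            # Every extension of a non-frequent pattern is non-frequent too,
            # so the subtree below this node need not be visited.
            w_min = node.depth if w_min is None else min(w_min, node.depth)
            non_frequent.append(node)
    return w_min, w_max, non_frequent


w_min, w_max, nodes = window_range_from_trie(build_suffix_trie("abcabbabb"), 1)
print(w_min, w_max, len(nodes))
```

Stopping at the first non-frequent node on each path keeps the visit short, and the collected nodes are the starting points for candidate derivation.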

Next, for each window length in the meaningful range, the subsequences having the largest grid-distances to their k-NN subsequences are derived as candidates for the top-k abnormal subsequences (530).

Because of the window length range determination process, k or more neighboring subsequences can always be secured for each abnormal subsequence candidate. This prevents in advance the loss of efficiency or accuracy that would occur if fewer than k neighbors of the same pattern existed and the neighborhood had to be searched further.

Next, an outlier score is calculated for each abnormal subsequence candidate (540). For each candidate, the distances on the original input data between the candidate and the neighboring subsequences within the grid-distance are calculated, and the distance to the k-th neighbor is used as the outlier score. At this time, an outlier score normalized by the length of the subsequence is used.
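The outlier score of one candidate can be sketched as below. The text does not fix the distance measure or the form of the normalization, so two assumptions are made: the distance is Euclidean on the raw values, and normalization divides by the window length w. The name `outlier_score` and its parameters are hypothetical, and the handling of overlapping subsequences is left out.

```python
import math


def outlier_score(series, start, w, neighbour_starts, k):
    """Distance from the candidate at `start` to its k-th nearest neighbour
    among `neighbour_starts`, divided by the window length w."""
    candidate = series[start:start + w]
    distances = sorted(
        math.dist(candidate, series[s:s + w])
        for s in neighbour_starts
        if s != start
    )
    if len(distances) < k:
        raise ValueError("fewer than k neighbours were supplied")
    return distances[k - 1] / w


series = [0.0, 0.0, 0.0, 0.0, 3.0, 4.0]
print(outlier_score(series, start=4, w=2, neighbour_starts=[0, 1, 2], k=2))
```

Normalizing by length is what makes scores from different window lengths comparable in the final ranking.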

Next, the final top-k abnormal subsequences are detected based on the outlier scores (550). The outlier scores calculated for the abnormal subsequence candidates of every window length are ranked together to derive the overall top-k abnormal subsequences.
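The final ranking across window lengths can be sketched as a single merge. The input layout (a dictionary from window length to a list of (score, start position) pairs) and the name `overall_top_k` are assumptions made for illustration. The scores are assumed to be already normalized by length.

```python
import heapq


def overall_top_k(scored_by_window, k):
    """Return the k highest-scoring (score, window length, start) triples."""
    merged = (
        (score, w, start)
        for w, items in scored_by_window.items()
        for score, start in items
    )
    return heapq.nlargest(k, merged)


scores = {2: [(0.5, 3), (0.1, 7)], 3: [(0.9, 1)], 4: [(0.3, 0)]}
print(overall_top_k(scores, 2))
```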

221: preprocessing unit 222: window length determining unit
223: abnormal candidate deriving unit 224: abnormal value calculating unit
225: Abnormal subsequence detector

Claims

An abnormal subsequence detecting apparatus comprising:
a preprocessing unit for generating a symbol string by symbolizing time series data and constructing a suffix tree using the symbol string;
a window length determining unit for updating the length of a maximal pattern while visiting the constructed suffix tree and determining a window length range based on the maximum value and the minimum value of the updated maximal pattern length;
an abnormal candidate deriving unit for deriving, for the determined window length range, the top-k abnormal subsequence candidates having the largest grid-distance between a subsequence and the k-th neighboring subsequence of the subsequence;
an abnormal value calculating unit for calculating an outlier score for the derived abnormal subsequence candidates; and
an abnormal subsequence detecting unit for detecting the top-k abnormal subsequences based on the calculated outlier scores.

The apparatus of claim 1, wherein the window length determining unit:
stores the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updates the maximum value and the minimum value of the maximal pattern length based on the stored number of leaf nodes, and stores information on internal nodes whose stored number of leaf nodes is k or less.

delete

4. The apparatus of claim 2, wherein the abnormal candidate deriving unit:
stores abnormal candidate cells based on the information on the internal nodes whose stored number of leaf nodes is k or less, calculates for each stored abnormal candidate cell the grid-distance at which the sum of the occurrence frequencies of the abnormal candidate cell and its neighboring cells is k + 1 or more, and calculates a maximum grid-distance such that the sum of the occurrence frequencies is greater than or equal to k + 1 when the occurrence frequencies of the abnormal candidate cells are added in descending order of stored grid-distance.

The apparatus of claim 4, wherein the abnormal candidate deriving unit:
derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, cells having a grid-distance smaller than the calculated maximum grid-distance.

An abnormal subsequence detecting method comprising:
generating a symbol string by symbolizing time series data and constructing a suffix tree using the symbol string;
updating the length of a maximal pattern while visiting the constructed suffix tree and determining a window length range based on the maximum value and the minimum value of the updated maximal pattern length;
deriving, for the determined window length range, the top-k abnormal subsequence candidates having the largest grid-distance between a subsequence and the k-th neighboring subsequence of the subsequence;
calculating an outlier score for the derived abnormal subsequence candidates; and
detecting the top-k abnormal subsequences based on the calculated outlier scores.

7. The method of claim 6, wherein determining the window length range comprises:
storing the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updating the maximum value and the minimum value of the maximal pattern length based on the stored number of leaf nodes, and storing information on internal nodes whose stored number of leaf nodes is k or less.

delete

9. The method of claim 7, wherein deriving the abnormal subsequence candidates comprises:
storing abnormal candidate cells based on the information on the internal nodes whose stored number of leaf nodes is k or less, calculating for each stored abnormal candidate cell the grid-distance at which the sum of the occurrence frequencies of the abnormal candidate cell and its neighboring cells is k + 1 or more, and calculating a maximum grid-distance such that the sum of the occurrence frequencies is greater than or equal to k + 1 when the occurrence frequencies of the abnormal candidate cells are added in descending order of stored grid-distance.

10. The method of claim 9, wherein deriving the abnormal subsequence candidates comprises:
deriving the abnormal subsequence candidates by removing, from the abnormal candidate cells, cells having a grid-distance smaller than the calculated maximum grid-distance.