KR101906859B1 - Apparatus and method for detecting anomalous subsequence - Google Patents
- Publication number
- KR101906859B1 (application number KR1020120030043A)
- Authority
- KR
- South Korea
- Prior art keywords
- subsequence
- ideal
- length
- subsequences
- candidate
- Prior art date
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Testing And Monitoring For Control Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to an abnormal subsequence detecting apparatus and method for detecting abnormal subsequences in time series data.
The abnormal subsequence detecting apparatus according to an embodiment of the present invention includes: a preprocessing unit that symbolizes time series data and constructs a suffix tree from the symbolized string; a window length determination unit that updates the maximum pattern length while visiting the constructed suffix tree and determines a window length range based on the maximum value and the minimum value of the updated maximum pattern length; an abnormal candidate derivation unit that derives abnormal subsequence candidates based on the distance between a subsequence and the k-th neighboring subsequence of the subsequence; an outlier score calculation unit that calculates an outlier score for each derived abnormal subsequence candidate; and an abnormal subsequence detecting unit that detects the top-k abnormal subsequences based on the calculated outlier scores.
Description
The present invention relates to an abnormal subsequence detecting apparatus and method for detecting abnormal subsequences in time series data.
Detecting abnormal subsequences in time series data is useful not only for removing the noise that prevents successful extraction of normal subsequences, but also in its own right in various domains.
In recent years, as data collection tools such as sensors and remote equipment have been developed and have grown in number, the amount of time series data has been increasing rapidly. It has therefore become more important to study efficient abnormal subsequence detection schemes for large-volume time series data.
Conventional methods for detecting abnormal subsequences in time series data are broadly classified into projection-based abnormal subsequence detection schemes and window-based abnormal subsequence detection schemes.
A projection-based abnormal subsequence detection scheme first projects each subsequence to a point in a space whose dimension corresponds to the subsequence length, and then detects abnormal subsequences using techniques such as clustering, statistics, prediction, or information theory. Such projection-based schemes have the problem that, as the subsequence becomes longer, it is projected onto a point in a higher-dimensional space, which makes it difficult to detect abnormal subsequences efficiently.
A window-based abnormal subsequence detection scheme derives subsequences from the entire time series data using the sliding window technique and then compares each derived subsequence with the other subsequences to find the most dissimilar subsequence. Such window-based schemes are known to be more effective than projection-based schemes at detecting abnormal subsequences that occur locally.
Conventional abnormal subsequence detection schemes have the following limitations. First, conventional studies focus on detecting abnormal subsequences of a single fixed length. The length of the subsequence to be detected must therefore be supplied by the user, and this length greatly affects detection accuracy. Even a subsequence with the same start position may be abnormal or normal depending on its length. However, it is very difficult for the user to determine the appropriate length of the abnormal subsequence, except when the time series data has a definite periodicity.
Second, depending on the domain, normal subsequences and abnormal subsequences of various lengths can be mixed in one long real-world time series. Referring to FIG. 1, most subsequences have similar lengths and shapes, as in group N. However, depending on the state, abnormal subsequences with various lengths and shapes, such as subsequences A1, A2 and A3, may occur. Therefore, it is difficult to derive the various abnormal subsequences that occur in the real world when only abnormal subsequences of one specific length are detected.
Third, groups of abnormal subsequences with similar length and shape may appear. In FIG. 1, the abnormal subsequences A2 and A3 have four and two similar abnormal subsequences, respectively. Although these abnormal subsequences each have similar subsequences, they must be distinguished from normal subsequences because all of them cause failures and accidents.
One aspect of the present invention provides an apparatus and method for detecting abnormal subsequences that automatically determines a meaningful range of subsequence lengths and detects the top-k abnormal subsequences within that range.
The abnormal subsequence detecting apparatus according to an embodiment of the present invention includes: a preprocessing unit that symbolizes time series data and constructs a suffix tree from the symbolized string; a window length determination unit that updates the maximum pattern length while visiting the constructed suffix tree and determines a window length range based on the maximum value and the minimum value of the updated maximum pattern length; an abnormal candidate derivation unit that derives abnormal subsequence candidates based on the distance between a subsequence and the k-th neighboring subsequence of the subsequence; an outlier score calculation unit that calculates an outlier score for each derived abnormal subsequence candidate; and an abnormal subsequence detector that detects the top-k abnormal subsequences based on the calculated outlier scores.
The window length determination unit stores the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updates the maximum value and the minimum value of the maximum pattern length based on the stored number of leaf nodes, and stores the information of the internal nodes whose stored number of leaf nodes is k or less.
The window length determination unit determines the window length range from the maximum value of the maximum pattern length + 1 to the minimum length of the maximum pattern length.
The abnormal candidate derivation unit stores abnormal candidate cells based on the information of the internal nodes whose stored number of leaf nodes is k or less, stores for each abnormal candidate cell the lattice distance at which the sum of the occurrence frequencies of the stored abnormal candidate cell and its neighboring cells becomes k + 1 or more, and calculates the maximum lattice distance at which the sum of the occurrence frequencies, accumulated over the abnormal candidate cells in descending order of stored lattice distance, becomes k + 1 or more.
The abnormal candidate derivation unit derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, the cells whose lattice distance is smaller than the calculated maximum lattice distance.
The abnormal subsequence detecting method according to an embodiment of the present invention includes the steps of: symbolizing time series data and constructing a suffix tree from the symbolized string; updating the maximum pattern length while visiting the constructed suffix tree and determining a window length range based on the maximum value and the minimum value of the updated maximum pattern length; deriving abnormal subsequence candidates based on the distance between a subsequence and the k-th neighboring subsequence of the subsequence; calculating an outlier score for each derived abnormal subsequence candidate; and detecting the top-k abnormal subsequences based on the calculated outlier scores.
The step of determining the window length range includes storing the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updating the maximum value and the minimum value of the maximum pattern length based on the stored number of leaf nodes, and storing the information of the internal nodes whose stored number of leaf nodes is k or less.
The determining of the window length range determines the window length range from the maximum value of the maximum pattern length + 1 to the minimum length of the maximum pattern length.
The step of deriving the abnormal subsequence candidates includes storing abnormal candidate cells based on the information of the internal nodes whose stored number of leaf nodes is k or less, storing for each abnormal candidate cell the lattice distance at which the sum of the occurrence frequencies of the stored abnormal candidate cell and its neighboring cells becomes k + 1 or more, and calculating the maximum lattice distance at which the sum of the occurrence frequencies, accumulated over the abnormal candidate cells in descending order of stored lattice distance, becomes k + 1 or more.
The step of deriving the abnormal subsequence candidates derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, the cells whose lattice distance is smaller than the calculated maximum lattice distance.
According to the aspect of the present invention described above, the length of the abnormal subsequence can be determined automatically, abnormal subsequences of various lengths can be detected, groups of abnormal subsequences having the same length and shape can be detected, and the abnormal subsequences appearing in the domain can be detected efficiently.
FIG. 1 shows time series data including abnormal subsequence groups of various lengths and shapes.
FIG. 2 is a block diagram schematically showing a configuration of a time series data analysis system according to an embodiment of the present invention.
FIG. 3 is a block diagram schematically showing an internal configuration of an anomaly detection unit of the abnormal subsequence detection apparatus of FIG.
FIG. 4 is a block diagram schematically showing a hardware configuration of the abnormal subsequence detecting apparatus of FIG.
FIG. 5 is a diagram schematically illustrating a method of deriving a subsequence according to a sliding window technique.
FIG. 6 is a diagram schematically showing whether each subsequence is frequent when the number of abnormal subsequences desired by the user is four.
FIG. 7 is a diagram schematically showing a suffix tree to which a subsequence length and the occurrence frequency of the corresponding subsequence are added.
FIG. 8 is a diagram schematically showing the lattice distance when the window length is 2, the number of alphabet symbols is 10, and the number of abnormal subsequences desired by the user is k = 5.
FIG. 9 is a flowchart schematically illustrating a method of analyzing time series data according to an embodiment of the present invention.
FIG. 10 is a flowchart schematically illustrating an internal process of the abnormal subsequence detecting process of FIG.
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.
In the embodiment of the present invention, time series data refers to sequence data having a time-stamp of a constant time interval. In the time series data, normal subsequences having a certain pattern and abnormal subsequences deviating from the normal pattern are mixed.
In an embodiment of the present invention, an anomalous subsequence, or discord, refers to the subsequence that is least similar to the other subsequences of the same length among all subsequences of the time series data.
FIG. 2 is a block diagram schematically showing a configuration of a time series data analysis system according to an embodiment of the present invention.
Referring to FIG. 2, the time series data analysis system includes a data collection environment and an abnormal subsequence detection apparatus.
The
As an example, the
In another example, the
In another example, the
In another example, the
The data collection unit corresponds to at least one sensor or data collection operation. The data collector measures the state of the data collection environment (100). The data collecting unit generates time series data based on the data measured according to a predetermined time difference.
The time series data generated by the data collector may have at least one segment region, and the segment region represents a value of the measured variable. For example, the time series data reflecting the temperature change of the
The time series data generated by the data acquisition unit may have a time recording area indicating the time at which the data is recorded.
The above-described
The
The
The
Also, the
The
FIG. 3 is a block diagram schematically showing an internal configuration of an anomaly detection unit of the abnormal subsequence detection apparatus of FIG.
Referring to FIG. 3, the
The
The
The window
The window
The ideal
The ideal
The
The
The abnormal subsequence detector 225 detects a top-k abnormal subsequence on the basis of the abnormal value stored by the
FIG. 4 is a block diagram schematically showing a hardware configuration of the abnormal subsequence detecting apparatus of FIG.
Referring to FIG. 4, the hardware configuration of the abnormal
In addition, the hardware configuration of the abnormal
The hardware configuration may include at least one non-volatile memory or volatile memory. For example, the volatile memory may be main memory 320 (RAM), and the non-volatile memory may be auxiliary memory 33 (ROM).
The hardware configuration can perform various operations as the central processing unit (CPU) 31 executes the instructions provided from the memory. The hardware configuration may include a media device, where the media device may be a hard disk device or an optical disk device, and so on.
The hardware configuration may include
The hardware configuration may include a
The
Hereinafter, an abnormal subsequence detecting method according to an embodiment of the present invention will be described in detail with reference to the configuration of the above-described time series data analysis system.
First, the conventional window-based abnormal subsequence detection methods will be described, followed by the differences between the conventional detection methods and the methods proposed in the embodiments of the present invention.
FIG. 5 is a diagram schematically illustrating a method of deriving a subsequence according to a sliding window technique.
Referring to FIG. 5, the sliding window technique derives the time records contained in a fixed-length window X as a single subsequence, and extracts all possible subsequences by sliding the window X one or several positions at a time. Accordingly, when the window X has length w and the time series data has length n, sliding one position at a time yields a total of (n - w + 1) subsequences.
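As a minimal sketch of the sliding window extraction described above, the following Python function slides a window of length w one position at a time. The function name `sliding_subsequences` and the sample data are illustrative assumptions and do not appear in the source.

```python
def sliding_subsequences(series, w):
    """Return every length-w window of `series`, ordered by start position."""
    if w < 1 or w > len(series):
        return []
    # One window per start position 0 .. n - w, so n - w + 1 windows in total.
    return [series[i:i + w] for i in range(len(series) - w + 1)]


series = [3, 1, 4, 1, 5, 9, 2, 6]      # n = 8 (made-up values)
windows = sliding_subsequences(series, 3)
print(len(windows))                    # n - w + 1 = 8 - 3 + 1 = 6
```

Sliding by several positions instead of one would reduce the number of windows; the count of (n - w + 1) only holds for a step of one.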
In the conventional window-based abnormal subsequence detection method, all possible subsequences are derived from the entire time series data using the sliding window technique, and each derived subsequence is then checked to determine whether it is abnormal. The dissimilarity between a subsequence and its closest subsequence is regarded as the anomaly score of that subsequence, and the subsequence with the largest outlier score is detected as the abnormal subsequence. These window-based methods are known to be more effective than other schemes at deriving abnormal subsequences that occur locally.
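The conventional scoring described above can be sketched by brute force as follows. This is an illustration only, not the patented method: the function name is hypothetical, the data is made up, and skipping windows that overlap the query window is an assumption added so that a window is not matched against its own shifted copy.

```python
import math


def nearest_neighbor_scores(series, w):
    """Map each start position to the distance of its nearest non-overlapping window."""
    starts = range(len(series) - w + 1)
    scores = {}
    for i in starts:
        dists = [
            math.dist(series[i:i + w], series[j:j + w])
            for j in starts
            if abs(i - j) >= w  # assumption: ignore windows that overlap window i
        ]
        if dists:
            scores[i] = min(dists)
    return scores


series = [0, 1, 0, 1, 0, 1, 0, 9, 0, 1, 0, 1, 0, 1]  # one spike at index 7
scores = nearest_neighbor_scores(series, 2)
most_anomalous = max(scores, key=scores.get)
print(most_anomalous, scores[most_anomalous])
```

The quadratic number of comparisons in this sketch is the cost that heuristic and index-based methods try to avoid.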
E. Keogh, J. Lin, and A. Fu, "HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence", In Proc. of IEEE Int'l Conf. on Data Mining, IEEE ICDM, pp. 8-15, 2005, proposed an algorithm that is linear in the length of the time series data, based on the intuitive heuristic that the subsequence whose symbolized string has the lowest occurrence frequency is an abnormal subsequence candidate.
However, all conventional detection methods have dealt only with the detection of abnormal subsequences of a single fixed length. Some detection methods have dealt with the detection of abnormal subsequence clusters with similar patterns, but there is no method for detecting abnormal subsequences of various lengths occurring in real-world time series data.
Accordingly, in the embodiment of the present invention, the characteristics of abnormal subsequences in time series data are analyzed first. Then, based on this analysis, the top-k abnormal subsequences with various lengths are detected.
First, one subsequence can be a normal subsequence or an abnormal subsequence depending on the window length. If the window length is very large, most subsequences have a low occurrence frequency, so most subsequences can be regarded as abnormal subsequences. Conversely, when the window length is very small, most subsequences have a high occurrence frequency, so most subsequences can be regarded as normal subsequences.
In conventional detection methods, the length of the abnormal subsequence to be detected is input by the user or determined using a heuristic technique. However, it is difficult for the user to input the correct length, except when the periodicity of the time series data is definite.
Second, abnormal subsequences of various lengths are mixed in one long time series. Conventional methods focus on detecting abnormal subsequences of a single fixed length, and they could detect abnormal subsequences of various lengths only by being run separately for every possible window length. In that case, however, the data structures needed for each window length must be rebuilt each time, and the abnormal subsequence detection process must be performed again each time, which is very inefficient.
Third, abnormal subsequences with similar patterns may appear in one long time series. As the time series becomes longer, abnormal subsequences having the same or similar length and shape may appear in the form of small groups. Although grouped, these subsequences are still considered abnormal because they are very few in number compared to normal subsequences.
In order to derive such small groups of abnormal subsequences having a similar length and shape, the distance between a subsequence and its k-th nearest neighbor (k-NN) subsequence is used instead of the distance between the subsequence and its nearest subsequence. In this case, k is the number of abnormal subsequences that the user wants to detect.
Based on the above, the time series data analysis system in the embodiment of the present invention uses a new method that automatically detects the top-k abnormal subsequences of various lengths in one long time series.
The inputs of the time series data analysis system are the time series data T and the number k of abnormal subsequences desired by the user. The output is the position and length information of the top-k abnormal subsequences.
When one abnormal subsequence is detected, the subsequences that overlap it are also likely to be detected as abnormal subsequences, but such results are meaningless. Therefore, when two subsequences overlap each other (that is, when the difference between the start positions of the two subsequences is smaller than the subsequence length), the overlapping subsequence is excluded from the detection targets.
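One way to apply this overlap rule when picking the final top-k results is sketched below. The greedy strategy (highest score first, skip anything overlapping an already chosen result), the tuple layout `(start, length, score)`, and the sample values are all assumptions made for illustration; the source does not specify the selection procedure.

```python
def select_top_k(scored, k):
    """Greedily pick up to k non-overlapping subsequences by descending score.

    `scored` is a list of (start, length, score) tuples (hypothetical layout).
    """
    chosen = []
    for start, length, score in sorted(scored, key=lambda t: -t[2]):
        # Two half-open ranges [start, start + length) overlap when each one
        # begins before the other ends.
        overlaps = any(start < s + l and s < start + length for s, l, _ in chosen)
        if not overlaps:
            chosen.append((start, length, score))
        if len(chosen) == k:
            break
    return chosen


scored = [(10, 4, 9.0), (11, 4, 8.5), (30, 6, 7.0), (50, 4, 1.0)]
print(select_top_k(scored, 2))  # the window at 11 overlaps the one at 10
```

For two subsequences of equal length, the range test reduces to the condition in the text: the start positions differ by less than the length.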
In the embodiment of the present invention, the time series data is T = {t1, t2, t3, ..., tm} (where ti is the data stored at the i-th time), and a subsequence of the time series data T is C = {tp, ..., tp+n-1}. That is, the subsequence C of T is an ordered set of n consecutive data values extracted from a position p of T, where the length n is less than the length m of T.
Dist (C, D) is a symmetric function that returns a nonnegative value, which is a function to find the distance between two subsequences C and D of the same length. That is, Dist (C, D) = Dist (D, C). In the embodiment of the present invention, the Euclidean distance of the following equation (1) is used as a distance function.
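[Equation 1]
$$\mathrm{Dist}(C, D) = \sqrt{\sum_{i=1}^{w} (c_i - d_i)^2}$$
where c_i and d_i are the i-th elements of C and D, and w is their common length.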
In an embodiment of the present invention, to detect abnormal subsequences that form a cluster, the normalized distance between an abnormal subsequence candidate and its k-NN subsequence, given by Equation (2), is used as the outlier score.
[Equation 2]
Here, w represents the length of each subsequence.
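The following sketch shows how a k-NN based outlier score of this kind could be computed. The body of Equation (2) is not reproduced in this text, so dividing the k-th nearest Euclidean distance by the window length w is purely an assumption; the function names and data are likewise illustrative.

```python
import math


def euclidean(c, d):
    """Euclidean distance between two equal-length subsequences."""
    if len(c) != len(d):
        raise ValueError("subsequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c, d)))


def knn_outlier_score(candidate, others, k):
    """Distance from `candidate` to its k-th nearest neighbor, divided by w.

    ASSUMPTION: normalizing by w = len(candidate) stands in for the
    unrecovered Equation (2); the real normalization may differ.
    """
    dists = sorted(euclidean(candidate, o) for o in others)
    if len(dists) < k:
        return None  # fewer than k neighbors are available
    return dists[k - 1] / len(candidate)


others = [[0, 0], [3, 4], [6, 8]]           # made-up neighbors
print(knn_outlier_score([0, 0], others, 2))  # k-th distance is 5.0, w = 2
```

Using the k-th neighbor rather than the nearest one means that a small group of up to k mutually similar abnormal subsequences still receives a large score.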
In the embodiment of the present invention, an abnormal subsequence detection method based on a suffix tree is used. First, a method of detecting abnormal subsequences of various lengths using a conventional single-length abnormal subsequence detection method is described below.
A simple method for detecting the top-k abnormal subsequences of various lengths in one long time series is to repeat a conventional single-length abnormal subsequence detection method for every possible window length (2 to n/2, where n is the length of the entire time series data). Then, among the abnormal subsequences detected for each window length, the top-k with the largest outlier scores are selected as the final abnormal subsequences.
The problems of the above method are as follows. First, abnormal subsequence detection is attempted for meaningless window lengths. That is, detection is attempted even for window lengths so short that all subsequences are normal, and for window lengths so long that all subsequences are abnormal.
Second, abnormal subsequence detection is performed independently for each window length. That is, the data structure necessary for detecting abnormal subsequences must be rebuilt every time, and the detection process must be performed again each time. Therefore, the above method takes (n/2 - 1) times longer than a single-length abnormal subsequence detection scheme.
Third, abnormal subsequences are not detected accurately. Conventional detection methods use heuristics that favor performance over accuracy. Such heuristics are not a problem when detecting abnormal subsequences of a single length, but they cause serious problems when detecting abnormal subsequences of various lengths. Therefore, a method for detecting abnormal subsequences of various lengths must have high detection accuracy.
Hereinafter, the meaningful window length range is described first, together with a method for deriving the range quickly. Next, a method for quickly detecting abnormal subsequences of various lengths using a suffix tree over the derived window length range is described. Finally, a grid-distance-based k-NN candidate detection method for improving the accuracy of abnormal subsequence detection is described.
In embodiments of the present invention, a meaningful range of window lengths for detecting abnormal subsequences is determined based on the following observations.
FIG. 6 is a diagram schematically showing whether each subsequence is frequent when the number of abnormal subsequences desired by the user is four.
Referring to FIG. 6, a row indicates a start position of a subsequence in time series data, and a column indicates a length of a subsequence (window).
First, for a subsequence C, if there are k or more subsequences identical to C, then C can never be a top-k abnormal subsequence. In the embodiment of the present invention, a subsequence that has k or more identical subsequences is defined as a frequent subsequence.
Second, for a non-frequent subsequence C of length m, all longer subsequences of length m + 1 or more that contain C are also non-frequent. In FIG. 6, the subsequence with a window length of 3 and a start position of 1 is not frequent. Therefore, all longer subsequences containing that subsequence are not frequent either.
Third, for a window length w2, if all subsequences of that length are frequent, then for every window length w1 with w1 ≤ w2 ≤ n, all subsequences of length w1 are frequent. In FIG. 6, all subsequences with
Fourth, for a window length w1, if none of the subsequences of that length is frequent, then for every window length w2 with w1 ≤ w2 ≤ n, none of the subsequences of length w2 is frequent. In FIG. 6, none of the subsequences with a window length of 5 is frequent. Therefore, none of the subsequences with a window length of 6 or more is frequent.
As a result, if all subsequences of a specific window length w1 are frequent, abnormal subsequence detection need not be performed for the window length w1. Similarly, if none of the subsequences of a specific window length w2 is frequent, it is meaningless to perform abnormal subsequence detection for the window length w2.
Therefore, in the embodiment of the present invention, a meaningful window length range for detecting an abnormal subsequence is determined as follows. That is, the minimum window length is the smallest window length in which one or more non-frequent subsequences exist, and the maximum window length is the largest window length in which one or more frequent subsequences exist.
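The definition of the meaningful window length range can be checked directly by counting, as in the sketch below. It is a brute-force stand-in for the suffix tree traversal used in the embodiment, and it treats a subsequence as frequent when it occurs at least k + 1 times (itself plus k identical copies). The function name and the sample string are illustrative assumptions.

```python
from collections import Counter


def window_length_range(symbols, k):
    """Return (min_w, max_w) of the meaningful window lengths, by brute force.

    min_w: smallest length with at least one non-frequent subsequence.
    max_w: largest length with at least one frequent subsequence.
    """
    n = len(symbols)
    min_w = max_w = None
    for w in range(1, n + 1):
        counts = Counter(symbols[i:i + w] for i in range(n - w + 1))
        if min_w is None and any(c <= k for c in counts.values()):
            min_w = w
        if any(c >= k + 1 for c in counts.values()):
            max_w = w
    return min_w, max_w


print(window_length_range("ababababba", 1))
```

For this made-up string and k = 1, every single symbol is frequent, the pair "bb" is the shortest non-frequent subsequence, and "ababab" is the longest subsequence that still occurs twice.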
In the embodiment of the present invention, a suffix tree is used to store information on subsequences having various lengths in one data structure. Here, the suffix tree is a prefix tree that stores all the suffixes of a string.
The suffix tree stores all the subsequences extracted for each start position. Therefore, if a stored subsequence is cut to a desired length, a subsequence of that specific length for the corresponding start position is obtained. Because the suffix tree stores subsequence information of all lengths for each start position, it is not necessary to build a new data structure each time, even when abnormal subsequence detection is performed for various window lengths.
In addition, the suffix tree can be constructed with only one time series data scan, and is very fast to build and search. The construction complexity of the suffix tree is O (n), and the search complexity is O (l). Where n is the length of the time series data and l is the length of the subsequence to be searched. Therefore, it is possible to extract a subsequence of a desired length quickly by using a suffix tree.
In the embodiment of the present invention, after the suffix tree data structure is constructed, the tree is traversed once to additionally store, for each internal node, the subsequence length and the occurrence frequency of the corresponding subsequence.
FIG. 7 is a diagram schematically showing a suffix tree to which a subsequence length and the occurrence frequency of the corresponding subsequence are added.
Referring to FIG. 7, for the time series data T = {a, b, c, a, b, b, a, b, b} and a given number k of abnormal subsequences desired by the user, the occurrence frequency information of the corresponding subsequence is added to each internal node and stored in the form (subsequence length: occurrence frequency of the subsequence).
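The idea of annotating nodes with occurrence frequencies can be illustrated with a naive, uncompressed suffix trie. This is a simplification chosen for brevity: it takes quadratic time and space, unlike the linear-time suffix tree described in the text, and the class and function names are hypothetical.

```python
class Node:
    """Trie node; `count` is how many suffixes pass through this node."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_suffix_trie(s):
    """Insert every suffix of `s`, counting visits per node."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1  # one more start position shares this path label
    return root


def frequency(root, pattern):
    """Occurrence frequency of `pattern`, read from the node it leads to."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


trie = build_suffix_trie("abcabbabb")  # the symbol sequence T from the text
print(frequency(trie, "ab"), frequency(trie, "abb"), frequency(trie, "abc"))
```

The depth of a node equals the subsequence length, so each node carries the same (length: frequency) pair that the text describes for FIG. 7.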
This frequency information is useful throughout the detection of abnormal subsequences of various lengths: for determining the meaningful window length range, for deriving the abnormal subsequence candidates for each determined window length, and for securing the neighboring subsequences of each candidate.
Conventional single-length abnormal subsequence detection methods use a heuristic in the neighbor derivation step that greatly improves speed at the cost of accuracy. This heuristic is not a problem when detecting abnormal subsequences of a single length. However, serious problems may occur when detecting abnormal subsequences of various lengths.
In order to solve this problem, the embodiment of the present invention uses a grid-distance-based abnormal subsequence detection method that performs both the search for abnormal subsequence candidates and the accurate k-NN subsequence search for each candidate at the same time.
For two subsequences S = {s1, s2, ..., sw} and R = {r1, r2, ..., rw} of length w, the grid-distance between the two subsequences is defined as follows.
FIG. 8 is a diagram schematically showing the lattice distance when the window length is 2, the number of alphabet symbols is 10, and the number of abnormal subsequences desired by the user is k = 5.
Referring to FIG. 8, when the window length is 2 and k is 5, the grid-distance (A, B) between the subsequence A corresponding to the string `ee` and the subsequence B corresponding to the string `ii` is 4. The grid-distance (A, 5-NN (A)) between the subsequence A and the 5-NN of A is 1, and the grid-distance (B, 5-NN (B)) between the subsequence B and the 5-NN of B is 2.
Here, the grid-distance (S, R) is calculated by the following equation (3).
[Equation 3]
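Since the body of Equation (3) is not reproduced in this text, the sketch below assumes that the grid-distance is the largest per-position difference between symbol indices (a Chebyshev distance). That assumption agrees with the FIG. 8 example, where `ee` and `ii` are 4 apart, but it is not confirmed by the source. The cell counts are made up, and the exhaustive scan over all cells is a simplification.

```python
from collections import Counter


def grid_distance(s, r):
    """ASSUMED form of Equation (3): maximum symbol-index difference."""
    if len(s) != len(r):
        raise ValueError("subsequences must have the same length")
    return max(abs(ord(a) - ord(b)) for a, b in zip(s, r))


def knn_grid_distance(cell, cell_counts, k):
    """Smallest radius at which `cell` and its neighbors total k + 1 occurrences."""
    by_distance = Counter()
    for other, count in cell_counts.items():
        by_distance[grid_distance(cell, other)] += count
    total = 0
    for d in sorted(by_distance):
        total += by_distance[d]
        if total >= k + 1:  # the cell itself plus at least k neighbors
            return d
    return None  # fewer than k neighbors exist


# Hypothetical occurrence counts per grid cell for window length 2.
cell_counts = {"ee": 1, "ef": 2, "fe": 2, "dd": 1,
               "ii": 1, "ih": 1, "hh": 1, "gg": 3}
print(grid_distance("ee", "ii"))
print(knn_grid_distance("ee", cell_counts, 5), knn_grid_distance("ii", cell_counts, 5))
```

With these made-up counts the 5-NN grid-distances of `ee` and `ii` come out as 1 and 2, the same values quoted for FIG. 8, although the actual cell frequencies in the figure are not known.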
Using the grid-distance, the subsequences that are candidates for abnormal subsequences can be found quickly. Only the subsequences with the top-k largest grid-distances between the subsequence and its k-NN are considered as abnormal subsequence candidates, so the top-k abnormal subsequences can be calculated quickly and accurately.
Furthermore, the grid-distance can be used to secure the neighboring k-NNs of each candidate subsequence. While searching for the top-k subsequences with the largest grid-distance to their k-NN, the list of subsequences visited during the search is also stored. The stored list of subsequences becomes the k-NN candidates of each candidate subsequence. By using the grid-distance in this way, the exact k-NN candidates of each candidate subsequence can be found without additional searching.
FIG. 9 is a flowchart schematically illustrating a method of analyzing time series data according to an embodiment of the present invention.
Referring to FIG. 9, data is collected in a
Next, the
Next, the
FIG. 10 is a flowchart schematically illustrating an internal process of the abnormal subsequence detecting process of FIG.
Referring to FIG. 10, the method for detecting the top-k abnormal subsequences with various lengths is as follows. First, to detect abnormal subsequences of various lengths, a window length range determination process must be performed in addition to the preprocessing process, unlike the conventional single-length abnormal subsequence detection method.
After the entire time series data is symbolized, a preprocessing process of building a suffix tree is performed (510). There are two ways to symbolize the entire data. One is to symbolize the data so that the symbols reflect the distribution of the data, and the other is to encode the data uniformly using intervals of equal width.
One way to reflect the distribution of data is to presume the distribution of the entire data or to scan the input data once beforehand to determine the exact distribution of the entire data. However, further searching of the entire input data to determine the exact distribution of the data is not appropriate for large-length time series data, and even assuming a data distribution may be less accurate if the assumption is incorrect.
Therefore, in the embodiment of the present invention, the already known min and max of the data measuring mechanism are used to divide the data range into equal-width intervals, one per alphabet symbol. For example, when the min and max are 0.0 and 1.0 and the number of alphabet symbols is 10, the ten equal-width intervals are assigned, in order, to the symbols a, b, c, d, e, f, g, h, i, j. That is, a value falling in the third interval is converted to the symbol c.
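The equal-width symbolization can be sketched as follows. The function name, the clamping of out-of-range readings to the first or last symbol, and the sample values are assumptions for illustration; the text only specifies the known min and max and the number of symbols.

```python
import string


def symbolize(values, lo, hi, alphabet_size=10):
    """Map each value to a letter using equal-width intervals over [lo, hi]."""
    symbols = []
    for v in values:
        idx = int((v - lo) * alphabet_size / (hi - lo))
        # Assumption: clamp so that v == hi (or an out-of-range reading)
        # still maps to a valid symbol.
        idx = min(max(idx, 0), alphabet_size - 1)
        symbols.append(string.ascii_lowercase[idx])
    return "".join(symbols)


print(symbolize([0.05, 0.23, 0.95, 1.0], 0.0, 1.0))
```

With min 0.0, max 1.0 and ten symbols, the value 0.23 falls in the third interval and maps to `c`. No scan of the data is needed to choose the intervals, which is the property the text relies on for large time series.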
Dividing the data range into equal-width intervals is very straightforward, and by using the grid-distance, which approximates the outlier score of an abnormal subsequence, k or more neighbors of each abnormal subsequence candidate can be secured in time linear in the data length.
Next, the generated suffix tree is traversed, the occurrence frequency of each subsequence is stored, and a meaningful window length range is determined using the maximal pattern concept (520).
In the embodiment of the present invention, a suffix tree is used to determine a meaningful window length range. To do this, the suffix tree is constructed and, while all nodes are visited, the number of leaf nodes stored in each node is used. The number of leaf nodes indicates the number of start positions sharing the subsequence indicated by the path label from the root node to the corresponding internal node, that is, the occurrence frequency of that subsequence.
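The relation between leaf counts and occurrence frequency can be illustrated with a small sketch. It assumes an uncompressed suffix trie built in quadratic time, which is a simplification swapped in for the linear-time suffix tree of the text; the names `Node`, `build_suffix_trie`, and `frequency` are hypothetical.

```python
class Node:
    """Trie node; count is the number of suffixes passing through it."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_suffix_trie(symbols):
    """Insert every suffix of the symbolized string into a trie.

    Each node's count equals the number of start positions whose
    subsequence matches the path label from the root to that node.
    """
    root = Node()
    for start in range(len(symbols)):
        node = root
        for ch in symbols[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def frequency(root, pattern):
    """Return how often pattern occurs in the indexed string."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count
```

For the string "abcabcabx", for instance, the node reached by "ab" is shared by three suffixes, so its count is the occurrence frequency of "ab".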
When detecting the top-k abnormal subsequences, a maximal pattern is the longest pattern with an occurrence frequency of k + 1 or more among the patterns sharing the same prefix. During the traversal, the pattern represented by the deepest internal node on a path whose stored occurrence frequency is still k + 1 or more is a maximal pattern, and the maximum value and the minimum value of the maximal pattern length are continuously updated during the tree visit. The time complexity of the entire tree visit is linear in the data length, and after the visit is completed, the meaningful window length range is obtained from the maximum value and the minimum value of the maximal pattern length.
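One possible reading of the window length range determination is sketched below. It is a naive quadratic stand-in for the linear-time suffix tree traversal, and it assumes that a maximal pattern is, for each start position, the longest subsequence starting there that occurs at least k + 1 times; the names `substring_counts` and `window_length_range` are hypothetical.

```python
from collections import Counter


def substring_counts(symbols):
    """Count every substring of the symbolized string (naive O(n^2))."""
    counts = Counter()
    n = len(symbols)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[symbols[i:j]] += 1
    return counts


def window_length_range(symbols, k):
    """Return (min, max) of the maximal pattern lengths, or None."""
    counts = substring_counts(symbols)
    lengths = []
    n = len(symbols)
    for i in range(n):
        longest = 0
        for j in range(i + 1, n + 1):
            if counts[symbols[i:j]] >= k + 1:
                longest = j - i
            else:
                # Frequency never increases as a pattern is extended.
                break
        if longest:
            lengths.append(longest)
    if not lengths:
        return None
    return min(lengths), max(lengths)
```

Window lengths above the returned maximum cannot have k neighbors with the same pattern, which is why the range bounds the lengths worth examining.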
For efficient post-processing, pointers to the internal nodes of patterns that occur fewer than k + 1 times are stored while the tree is visited. If storage space must be saved, nodes outside the meaningful window length range can be pruned from the suffix tree.
Next, for each meaningful window length, the candidates for the top-k abnormal subsequences are derived, namely the subsequences having a large grid-distance between the subsequence and its k-NN subsequence (530).
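The candidate derivation step might look like the following sketch. It rests on assumptions that the text does not spell out: each distinct symbolized window is treated as a grid cell, the gap between two cells is the largest per-position difference in symbol index, and the grid-distance of a cell is the smallest gap within which the cell and its neighbors together hold at least k + 1 subsequences. All function names are hypothetical.

```python
from collections import Counter


def cell_counts(symbols, w):
    """Occurrence frequency of each symbolized window (cell) of length w."""
    return Counter(symbols[i:i + w] for i in range(len(symbols) - w + 1))


def cell_gap(a, b):
    """Assumed cell metric: largest per-position symbol index difference."""
    return max(abs(ord(x) - ord(y)) for x, y in zip(a, b))


def grid_distance(cell, counts, k):
    """Smallest gap that covers at least k + 1 subsequences, or None."""
    gaps = sorted((cell_gap(cell, other), n) for other, n in counts.items())
    total = 0
    for gap, n in gaps:
        total += n
        if total >= k + 1:
            return gap
    return None


def top_candidates(symbols, w, k, top):
    """Cells with the largest grid-distance, as (distance, cell) pairs."""
    counts = cell_counts(symbols, w)
    scored = [(grid_distance(c, counts, k), c) for c in counts]
    scored = [(d, c) for d, c in scored if d is not None]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]
```

Frequent cells get a grid-distance of zero because they already contain k + 1 subsequences, so only rare cells far from populated ones survive as candidates.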
In the window length range determination process, k or more neighbor subsequences can always be secured for each abnormal subsequence candidate. If fewer than k neighbors share the same pattern, the neighborhood is searched further, which prevents in advance a drop in efficiency or randomness in the result.
Next, an outlier score is calculated for each abnormal subsequence candidate (540). For each candidate, the distances on the input data to the neighboring subsequences within the grid-distance are calculated, and the distance to the kth neighbor is used as the outlier score. At this time, an outlier score normalized by the length of the subsequence is used.
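A minimal sketch of such a length-normalized k-NN outlier score follows. It makes several assumptions not stated in the text: Euclidean distance on the raw values, normalization by dividing by the window length, exclusion of windows that overlap the candidate, and a brute-force scan over all windows instead of only the neighbors within the grid-distance. The name `knn_outlier_score` is hypothetical.

```python
import math


def knn_outlier_score(series, start, w, k):
    """Distance to the kth nearest window, divided by the window length.

    Returns None when fewer than k non-overlapping windows exist.
    """
    query = series[start:start + w]
    dists = []
    for j in range(len(series) - w + 1):
        if abs(j - start) < w:
            continue  # assumed: skip windows overlapping the candidate
        other = series[j:j + w]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, other))))
    if len(dists) < k:
        return None
    dists.sort()
    return dists[k - 1] / w
```

Normalizing by the length keeps scores from different window lengths roughly comparable, which matters when candidates of several lengths are ranked together.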
Next, the final top-k abnormal subsequences are detected based on the outlier score (550). The outlier scores calculated for the abnormal subsequence candidates of each window length are ranked together to derive the overall top-k abnormal subsequences.
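The final ranking across window lengths can be sketched as a simple merge. The input layout (a dictionary from window length to a list of (start, score) pairs) and the name `final_top_k` are assumptions made for illustration only.

```python
import heapq


def final_top_k(scored_by_length, k):
    """Merge candidates of all window lengths and keep the k best.

    scored_by_length maps a window length to (start, score) pairs.
    Returns (score, length, start) tuples, highest score first.
    """
    merged = [
        (score, length, start)
        for length, items in scored_by_length.items()
        for start, score in items
    ]
    return heapq.nlargest(k, merged)
```

For example, `final_top_k({2: [(0, 0.0), (6, 4.7)], 3: [(5, 3.1)]}, 2)` keeps the candidate of length 2 starting at 6 and the candidate of length 3 starting at 5.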
221: preprocessing unit 222: window length determining unit
223: abnormal candidate deriving unit 224: abnormal value calculating unit
225: Abnormal subsequence detector
Claims (10)
A window length determination unit that updates the length of a maximal pattern while visiting the constructed suffix tree and determines a window length range based on the maximum value and the minimum value of the updated maximal pattern length;
An abnormal candidate derivation unit for deriving, for the determined window length range, the top-k abnormal subsequence candidates having the largest grid-distance between a subsequence and the kth neighboring subsequence of the subsequence;
An outlier score calculation unit for calculating an outlier score for the derived abnormal subsequence candidates; And
An abnormal subsequence detecting unit for detecting top-k abnormal subsequences based on the calculated outlier scores; The abnormal subsequence detecting apparatus comprising the above.
Stores the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updates the maximum value and the minimum value of the maximal pattern length based on the stored number of leaf nodes, and stores information on internal nodes whose number of leaf nodes is k or less.
Storing the ideal candidate cell based on the information of the internal node having the number of stored leaf nodes equal to or less than k and calculating the grid distance so that the sum of the occurrence frequencies of the stored ideal candidate cell and neighboring cells is k + And calculates a maximum lattice distance such that a sum of the occurrence frequencies is greater than or equal to k + 1 when the frequency of occurrence of the abnormal candidate cell is added in the order of the ideal cell having a larger stored lattice distance.
And derives the abnormal subsequence candidates by removing, from the abnormal candidate cells, cells having a grid-distance smaller than the calculated maximum grid-distance.
Updating the length of a maximal pattern while visiting the constructed suffix tree, and determining a window length range based on the maximum value and the minimum value of the updated maximal pattern length;
Deriving, for the determined window length range, the top-k abnormal subsequence candidates having the largest grid-distance between a subsequence and the kth neighboring subsequence of the subsequence;
Calculating an outlier score for the derived abnormal subsequence candidates; And
Detecting top-k abnormal subsequences based on the calculated outlier scores; The abnormal subsequence detecting method comprising the above.
Storing the number of leaf nodes for each internal node of the suffix tree while visiting the suffix tree, updating the maximum value and the minimum value of the maximal pattern length based on the stored number of leaf nodes, and storing information on internal nodes whose number of leaf nodes is k or less.
Storing the ideal candidate cell based on the information of the internal node having the number of stored leaf nodes equal to or less than k and calculating the grid distance so that the sum of the occurrence frequencies of the stored ideal candidate cell and neighboring cells is k + And calculating a maximum lattice distance such that a sum of the occurrence frequencies is greater than or equal to k + 1 when the frequency of occurrence of the abnormal candidate cell is added in the order of the ideal cell having the largest stored lattice distance.
And deriving the abnormal subsequence candidates by removing, from the abnormal candidate cells, cells having a grid-distance smaller than the calculated maximum grid-distance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120030043A KR101906859B1 (en) | 2012-03-23 | 2012-03-23 | Aparatus and method for detecting anomalous subsequence |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20130107889A KR20130107889A (en) | 2013-10-02 |
KR101906859B1 true KR101906859B1 (en) | 2018-10-11 |
Family
ID=49631098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020120030043A KR101906859B1 (en) | 2012-03-23 | 2012-03-23 | Aparatus and method for detecting anomalous subsequence |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101906859B1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109727446B (en) * | 2019-01-15 | 2021-03-05 | 华北电力大学(保定) | Method for identifying and processing abnormal value of electricity consumption data |
KR102245718B1 (en) | 2019-05-22 | 2021-04-28 | 카페24 주식회사 | A method for visualizing outliers occurrences based on keys, a computing device and a computer readable storage medium |
KR20200134560A (en) | 2019-05-22 | 2020-12-02 | 카페24 주식회사 | A method for detecting outliers occurrences, a computing device and a computer readable storage medium |
KR20200134562A (en) | 2019-05-22 | 2020-12-02 | 카페24 주식회사 | A method for determining risk factors of outliers occurrences, a computing device and a computer readable storage medium |
CN112347813B (en) * | 2019-08-07 | 2024-07-09 | 顺丰科技有限公司 | Baseline detection method, equipment and storage medium for high signal-to-noise ratio time sequence |
CN112819190B (en) * | 2019-11-15 | 2024-01-26 | 上海杰之能软件科技有限公司 | Device performance prediction method and device, storage medium and terminal |
CN111222710B (en) * | 2020-01-15 | 2024-07-12 | 平安银行股份有限公司 | Data abnormality warning method, device, equipment and storage medium |
CN111612082B (en) * | 2020-05-26 | 2023-06-23 | 河北小企鹅医疗科技有限公司 | Method and device for detecting abnormal subsequence in time sequence |
CN112966017B (en) * | 2021-03-01 | 2023-11-14 | 北京青萌数海科技有限公司 | Abnormal subsequence detection method for indefinite length in time sequence |
CN116402483B (en) * | 2023-06-09 | 2023-08-18 | 国网山东省电力公司兰陵县供电公司 | Online monitoring method and system for carbon emission of park |
CN117556108B (en) * | 2024-01-12 | 2024-03-26 | 泰安金冠宏食品科技有限公司 | Abnormal detection method for oil-residue separation efficiency based on data analysis |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101739992B1 (en) | 2016-01-28 | 2017-05-25 | 한양대학교 산학협력단 | Database system and method for subsequence matching |
2012-03-23: KR application KR1020120030043A, patent KR101906859B1, status: active, IP Right Grant
Non-Patent Citations (1)
Title |
---|
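Woong-Kee Loh et al., A Subsequence Matching Algorithm Supporting Normalization Transform Based on Index Interpolation in Time-Series Databases, Journal of KISS: Databases, Vol. 28, No. 2, pp. 217-232 (June 2001)* |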
Also Published As
Publication number | Publication date |
---|---|
KR20130107889A (en) | 2013-10-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |