CN112966017A - Abnormal subsequence detection method with indefinite length in time sequence - Google Patents

Publication number
CN112966017A
Authority
CN
China
Prior art keywords
subsequence
len
length
sub
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110226782.0A
Other languages
Chinese (zh)
Other versions
CN112966017B (en)
Inventor
陈逸舟
张丹
熊晓菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingmeng Shuhai Technology Co ltd
Original Assignee
Beijing Qingmeng Shuhai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingmeng Shuhai Technology Co ltd
Priority to CN202110226782.0A
Publication of CN112966017A
Application granted
Publication of CN112966017B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for detecting abnormal subsequences of indefinite length in a time series. The method defines an abnormal subsequence by the mean or median of its k-nearest-neighbor distances; computes subsequence distances with a parallelized version of the STOMP algorithm; takes a subsequence length range and a step size as input parameters; and outputs the abnormal subsequences detected at each length, which exposes different kinds of anomalies. It can also output an anomaly score and a final set of abnormal subsequences judged from those detection results and a chosen evaluation index. The method can significantly improve both the running efficiency and the detection accuracy of abnormal subsequence detection in time series.

Description

Abnormal subsequence detection method with indefinite length in time sequence
Technical Field
The invention relates to time series anomaly detection in the field of data mining algorithms, and in particular to a method for detecting abnormal subsequences of indefinite length in a time series.
Background
Anomaly detection is a broad technical field applied in many practical scenarios, and different business domains and data types generally call for different specialized methods. Time series anomaly detection, the identification of outliers in time-ordered data, is useful in many application areas. In financial markets it can detect sudden changes in a stock market or abnormal patterns within a particular time window; in system operations it can monitor equipment condition and detect intrusions; in biology, because the arrangement of amino acids resembles time series data, time series detection methods can also be applied. Because data characteristics and business requirements differ across fields, many methods have been developed: by data dimension, they divide into univariate and multivariate anomaly detection; by anomaly definition, into point anomaly detection and window anomaly detection; by implementation, into supervised and unsupervised algorithms.
Most time series anomaly detection algorithms studied and applied today detect anomalies in single data points: they output an anomaly probability for each data point and then apply a threshold to decide whether the point is anomalous. In practical scenarios, however, it is often necessary to identify abnormal patterns that last for a period of time (for example, abnormal electrocardiogram patterns during arrhythmia), and an abnormal subsequence detection algorithm is then preferable.
Given time series data T = T_1, T_2, ..., T_n of length n, the subsequence of length L starting at position i can be written T_{i,L} = T_i, T_{i+1}, ..., T_{i+L-1}. The definition of an abnormal subsequence in common use is the subsequence with the largest nearest-neighbor distance in the time series T: for a subsequence D and any other subsequence C, with M_D and M_C the corresponding sets of subsequences that do not overlap D and C respectively, if min(Dist(D, M_D)) > min(Dist(C, M_C)), then D is an abnormal subsequence of T. In the most basic definition the distance between two sequences is the Euclidean distance, and other reasonable distance measures can be used in practice. The definition also extends easily to output several abnormal subsequences.
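The definition above can be sketched as a direct brute-force search. This is only an illustration, not the method of the invention: the function name brute_force_discord is hypothetical, the distance is the raw (non-normalized) Euclidean distance named in the basic definition, and two subsequences are assumed to overlap when their start positions differ by less than the subsequence length.

```python
import numpy as np

def brute_force_discord(series, sub_len):
    """Return (start index, nearest-neighbor distance) of the top abnormal subsequence."""
    t = np.asarray(series, dtype=float)
    n_sub = len(t) - sub_len + 1
    # One row per subsequence of length sub_len.
    subs = np.lib.stride_tricks.sliding_window_view(t, sub_len)
    positions = np.arange(n_sub)
    best_pos, best_dist = -1, -np.inf
    for i in range(n_sub):
        # Euclidean distance from subsequence i to every subsequence.
        dists = np.linalg.norm(subs - subs[i], axis=1)
        # Exclude subsequences that overlap subsequence i (assumed rule).
        dists[np.abs(positions - i) < sub_len] = np.inf
        nearest = dists.min()
        # Keep the subsequence whose nearest neighbor is farthest away.
        if np.isfinite(nearest) and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist
```

Every subsequence is compared with every other one, which is what makes this approach expensive on long series.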
However, abnormal subsequence detection also faces difficulties with computational efficiency, parameter dependence, and the identification of similar anomalies. First, regarding computational efficiency: under the definition above, the most direct implementation loops over every pair of subsequences, computes the distance between them, and extracts the abnormal subsequences from the results. The time complexity of such an algorithm is O(n^2 m) (where n is the time series length and m is the subsequence length), and because time series data are often long, a brute-force algorithm is rarely practical. In recent years many studies have addressed this problem and proposed more efficient algorithms, either by reducing the dimensionality of the time series representation and pre-sorting it, or by setting a distance threshold. The former are mostly heuristic: their actual efficiency depends strongly on several parameter settings and on the characteristics of the data, and it may degrade when the parameters are set poorly or the data distribution does not match expectations. The latter prune the distance computation with a threshold and likewise depend strongly on its value: a poorly chosen threshold may cause the algorithm to fail (return no abnormal subsequence) or to run slowly, and the threshold is hard to estimate from experience in advance.
Yeh et al. (2018) proposed the breakthrough STOMP algorithm, which uses the fast Fourier transform and a sliding dot product to greatly speed up the computation of all pairwise subsequence distances. Its computation does not depend on other parameter settings or on the distribution of the data, so distance computation on large data sets becomes feasible and its cost predictable.
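The FFT-based building block mentioned here can be illustrated as follows. This is a simplified sketch, not STOMP itself: it computes the distances from one query subsequence to all windows of a series with raw Euclidean distance, whereas published STOMP works on z-normalized subsequences and reuses dot products between consecutive queries. The function names are hypothetical.

```python
import numpy as np

def sliding_dot_product(query, series):
    """Dot product of `query` with every window of `series`, computed via FFT."""
    q = np.asarray(query, dtype=float)
    t = np.asarray(series, dtype=float)
    m, n = len(q), len(t)
    size = n + m  # zero-padded length, long enough to avoid circular wrap-around
    # Convolution with the reversed query is the cross-correlation we need.
    conv = np.fft.irfft(np.fft.rfft(t, size) * np.fft.rfft(q[::-1], size), size)
    # Entry m - 1 + s of the convolution is the dot product with window s.
    return conv[m - 1:n]

def distance_profile(query, series):
    """Euclidean distance from `query` to every window of `series`."""
    q = np.asarray(query, dtype=float)
    t = np.asarray(series, dtype=float)
    m = len(q)
    qt = sliding_dot_product(q, t)
    # Sum of squares of each window, from a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(t * t)))
    window_sq = csum[m:] - csum[:-m]
    # |w - q|^2 = |w|^2 + |q|^2 - 2 w.q ; clip tiny negatives from rounding.
    d2 = window_sq + np.dot(q, q) - 2.0 * qt
    return np.sqrt(np.maximum(d2, 0.0))
```

One FFT-based pass yields a full distance vector in O(n log n) time instead of the O(n m) of a direct loop over windows.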
Another problem is parameter dependence. Under the definition above, the algorithm needs only one input parameter: the target subsequence length. However, this parameter has a decisive influence on the result: different target lengths can produce completely different detections, not merely slightly different accuracy. Previous studies have paid little attention to this problem and generally treat it as a routine parameter-tuning step, but given the time cost of abnormal subsequence detection, finding a suitable target length is not easy. The invention therefore aims to remove the dependence on this parameter as far as possible and to obtain more stable and effective output.
A further problem is similar anomalies. As above, an abnormal subsequence is defined as the subsequence with the largest nearest-neighbor distance in the time series T. In practice, the data may contain two or more abnormal subsequences with similar shapes. Because the definition measures the nearest-neighbor distance, two such similar abnormal subsequences are each other's nearest neighbor and have a very small nearest-neighbor distance, so the detection algorithm fails to find them. The invention therefore also aims to improve the distance index to cope with similar anomalies and similar data problems.
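The similar-anomaly problem can be reproduced on synthetic data. In the sketch below, which uses made-up data and a hypothetical helper knn_distances, the same unusual shape is inserted twice into a sine wave; the first nearest-neighbor distance of each copy is zero because the other copy matches it exactly, while the second nearest-neighbor distance is large.

```python
import numpy as np

def knn_distances(series, sub_len, k):
    """k smallest distances from each subsequence to non-overlapping subsequences."""
    t = np.asarray(series, dtype=float)
    subs = np.lib.stride_tricks.sliding_window_view(t, sub_len)
    positions = np.arange(len(subs))
    out = np.empty((len(subs), k))
    for i in range(len(subs)):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[np.abs(positions - i) < sub_len] = np.inf  # skip overlapping windows
        out[i] = np.sort(d)[:k]
    return out

# Synthetic periodic signal with the same unusual shape inserted twice.
t = np.sin(2 * np.pi * np.arange(400) / 20)
bump = np.array([0, 2, 4, 2, 0, -2, -4, -2, 0, 0], dtype=float)
t[100:110] = bump
t[300:310] = bump

knn = knn_distances(t, sub_len=10, k=2)
first_nn = knn[100, 0]   # the twin at position 300 is an exact match
second_nn = knn[100, 1]  # the next-closest subsequence is far away
print(first_nn, second_nn)
```

A detector that ranks subsequences by first_nn alone would rank both copies of the shape as perfectly normal.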
Disclosure of Invention
The invention provides a method for detecting abnormal subsequences of indefinite length in a time series. It addresses the problems of the prior art described above with improvements in the following respects:
(1) definition of an abnormal subsequence: the traditional nearest-neighbor distance is extended to the mean or median of the k-nearest-neighbor distances, to handle similar anomalies that may exist in the data;
(2) subsequence distance computation: the STOMP algorithm is parallelized, further improving running efficiency;
(3) parameter setting: a subsequence length range and a step size replace the single target subsequence length, and the algorithm runs abnormal subsequence detection at every step within the input length range;
(4) output: the abnormal subsequences detected at each length can be output directly, exposing different kinds of anomalies; an anomaly score and a final set of abnormal subsequences, judged from the detection results and a chosen evaluation index (such as the number of times a subsequence recurs), can also be output.
The technical scheme is as follows:
A method for detecting abnormal subsequences of indefinite length in a time series comprises the following steps:
S1: input time series data T, a minimum subsequence length min_len, a maximum subsequence length max_len, and a step size step; optionally input the number of neighbors k, the k-nearest-neighbor distance aggregation method, the number of abnormal subsequences to detect n_discords, and the number of parallel processes n_workers;
S2: determine the set of target subsequence lengths sub_len from the minimum subsequence length min_len, the maximum subsequence length max_len, and the step size step, and execute the following loop for each target subsequence length sub_len in the set:
a) divide the time series T into several time series subsets T_worker according to the number of parallel processes n_workers;
b) in each process, apply the STOMP algorithm to each subsequence in T_worker to compute the local neighbor matrix mp_worker with respect to the time series T;
c) merge the results mp_worker of all processes, keep the k nearest-neighbor distances for each subsequence of T, and take their median or mean to form the k-nearest-neighbor distance matrix mp_sub_len of T at subsequence length sub_len;
S3: from the k-nearest-neighbor distance matrix mp_sub_len at each target subsequence length sub_len, obtain the n_discords abnormal subsequences with the largest k-nearest-neighbor distance;
S4: for each data point T_i in the time series T, count how many times it appears in the abnormal subsequences of the various lengths, and mark the points whose count exceeds a certain threshold as the finally detected anomalies.
Further, in step S1, k defaults to 1, the k-nearest-neighbor distance aggregation method defaults to the median, the number of abnormal subsequences to detect n_discords defaults to 3, and the number of processes n_workers defaults to 4.
Further, in step S1, the time series data T should contain a time column and a value column giving the value of the series at each time point, and the time points are preferably equally spaced. The minimum subsequence length min_len, the maximum subsequence length max_len, and the step size step determine the subsequence length range: within the range from min_len to max_len inclusive, one value is taken every step as a target subsequence length, and the loop computation S2a-S2c is executed in turn for each length in the resulting set.
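As a small illustration of how the target length set follows from the three parameters, the sketch below enumerates it with an inclusive upper bound. The function name target_lengths and its input validation are assumptions; the sample values 8, 240 and 8 are those of the taxi embodiment described later.

```python
def target_lengths(min_len, max_len, step=1):
    """Target subsequence lengths from min_len to max_len inclusive, every `step`."""
    if min_len < 1 or max_len < min_len or step < 1:
        raise ValueError("expected 1 <= min_len <= max_len and step >= 1")
    return list(range(min_len, max_len + 1, step))

lengths = target_lengths(8, 240, 8)
print(len(lengths), lengths[:3], lengths[-1])
print(len(target_lengths(8, 240)))  # number of lengths when step defaults to 1
```

If max_len is not reachable from min_len in whole steps, this sketch stops at the last length that does not exceed max_len.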
Further, in step S2, anomaly detection is first performed with the nearest-neighbor distance (k = 1); if similar abnormal patterns that cannot be identified this way are found, k is increased and anomalies are detected with the k-nearest-neighbor distance.
In step S2, the time series T is divided into several time series subsets T_worker. In each parallel process, the inputs are the time series data T, the data subset T_worker, the target subsequence length sub_len, and the number of neighbors k; the k-nearest-neighbor distance matrix matrix_profile_worker (the local neighbor matrix mp_worker) is initialized with n - sub_len + 1 rows (n being the length of T) and k columns, with every entry set to positive infinity.
In step S2, for each subsequence T_{i,sub_len} in the data subset T_worker, the STOMP algorithm is applied, using the fast Fourier transform and a sliding dot product, to compute the distance between T_{i,sub_len} and every subsequence of the time series data T, giving a distance vector dist_{i,sub_len} of length n - sub_len + 1.
In step S2, if k = 1, i.e. only the nearest-neighbor distance is computed, each distance vector dist_{i,sub_len} is compared element by element with the local neighbor matrix mp_worker and the smaller distance is kept at each position; if k > 1, k neighbor distances must be kept, so each distance vector dist_{i,sub_len} is merged with mp_worker and the k smallest values at each position are retained.
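The update rule in this step can be sketched with NumPy. The function name update_knn is hypothetical, and the sketch assumes that the caller has already set any positions that must be excluded (for example, overlapping subsequences) to positive infinity in the distance vector.

```python
import numpy as np

def update_knn(mp_worker, dist):
    """Merge one distance vector into an (n_sub, k) matrix of the k smallest distances."""
    dist = np.asarray(dist, dtype=float)
    k = mp_worker.shape[1]
    if k == 1:
        # Nearest neighbor only: keep the smaller value at each position.
        return np.minimum(mp_worker, dist[:, None])
    # Append the new distances as an extra column, then keep the k smallest per row.
    merged = np.concatenate([mp_worker, dist[:, None]], axis=1)
    return np.partition(merged, k - 1, axis=1)[:, :k]

# Start from positive infinity, as described above, and fold in three vectors.
mp_worker = np.full((3, 2), np.inf)
for dist in ([5.0, 1.0, 4.0], [2.0, 3.0, 4.0], [9.0, 0.5, 1.0]):
    mp_worker = update_knn(mp_worker, dist)
print(np.sort(mp_worker, axis=1))
```

np.partition does not order the k retained values, which is sufficient here because only their mean or median is used afterwards.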
In step S2, the local neighbor matrices mp_worker computed for the data subsets T_worker are merged, the k smallest values at each position are retained, and their mean or median is computed, giving the k-nearest-neighbor distance matrix mp_sub_len of the time series data T at subsequence length sub_len.
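A minimal sketch of this merge step follows, assuming each worker returns an array with one row per subsequence of T and k columns. The function name merge_workers and the string values of the how argument are hypothetical.

```python
import numpy as np

def merge_workers(local_matrices, k, how="median"):
    """Combine per-worker (n_sub, k) matrices into one k-NN distance per subsequence."""
    # Put all candidate distances for each subsequence side by side.
    stacked = np.concatenate(local_matrices, axis=1)
    # Keep the k smallest distances in each row.
    smallest = np.sort(stacked, axis=1)[:, :k]
    if how == "median":
        return np.median(smallest, axis=1)
    if how == "mean":
        return smallest.mean(axis=1)
    raise ValueError("how must be 'median' or 'mean'")

a = np.array([[1.0, 4.0, 9.0]])
b = np.array([[2.0, 7.0, 10.0]])
print(merge_workers([a, b], k=3, how="median"))
print(merge_workers([a, b], k=3, how="mean"))
```

Because only the k smallest values per row matter, the result does not depend on how the subsequences were distributed across workers.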
Further, in step S3, for each target subsequence length sub_len, the computed k-nearest-neighbor distance matrix mp_sub_len is sorted in descending order and the subsequence with the largest neighbor distance is selected as the first abnormal subsequence. If n_discords > 1, the remaining candidates are checked in order: if the difference between a candidate's position i and the position of an already selected abnormal subsequence is less than sub_len, i.e. the candidate overlaps an existing abnormal subsequence, it is skipped; checking continues until n_discords abnormal subsequences have been selected, giving the abnormal subsequence set discords_sub_len at target subsequence length sub_len.
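The selection rule can be sketched as a greedy pass over the sorted distances. The function name top_discords is hypothetical, and skipping non-finite distances is an added assumption not stated in the text.

```python
import numpy as np

def top_discords(mp_sub_len, sub_len, n_discords=3):
    """Start positions of up to n_discords non-overlapping abnormal subsequences."""
    mp = np.asarray(mp_sub_len, dtype=float)
    order = np.argsort(mp)[::-1]  # positions sorted by descending distance
    chosen = []
    for pos in order:
        pos = int(pos)
        if not np.isfinite(mp[pos]):
            continue  # assumption: ignore positions without a valid distance
        # Skip candidates that overlap an already selected abnormal subsequence.
        if all(abs(pos - c) >= sub_len for c in chosen):
            chosen.append(pos)
            if len(chosen) == n_discords:
                break
    return chosen

mp = [0.1, 0.2, 5.0, 4.9, 0.3, 0.2, 3.0, 0.1, 0.1, 2.0]
print(top_discords(mp, sub_len=3, n_discords=3))  # prints [2, 6, 9]
```

Position 3 has the second-largest distance in this toy input but is skipped because it overlaps the subsequence already selected at position 2.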
Further, in step S4, once the abnormal subsequence set at each target subsequence length has been obtained, the final abnormal subsequence result is determined by establishing an evaluation index.
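One possible evaluation index, the recurrence count described in step S4, can be sketched as follows. The function names and the dictionary layout (subsequence length mapped to a list of start positions) are assumptions, and the threshold comparison uses "greater than or equal to", as in the taxi embodiment.

```python
import numpy as np

def count_recurrences(n_points, discords_by_length):
    """For each data point, count the lengths at which it lies in an abnormal subsequence."""
    counts = np.zeros(n_points, dtype=int)
    for sub_len, starts in discords_by_length.items():
        covered = np.zeros(n_points, dtype=bool)
        for start in starts:
            covered[start:start + sub_len] = True
        counts += covered  # each length contributes at most one vote per point
    return counts

def flag_anomalies(counts, min_count):
    """Indices of data points whose recurrence count reaches min_count."""
    return np.flatnonzero(counts >= min_count)

# Toy input: abnormal subsequence start positions found at three lengths.
discords = {4: [2], 6: [1, 10], 8: [0]}
counts = count_recurrences(20, discords)
print(flag_anomalies(counts, min_count=3))  # prints [2 3 4 5]
```

Points covered at only one or two lengths, such as those around position 10 in this toy input, are not flagged.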
The method can significantly improve both the running efficiency and the detection accuracy of abnormal subsequence detection in time series. For running efficiency, the original time series data are divided into several subsets, several processes compute the k-nearest-neighbor distance matrices of the subsets in parallel, and the results are merged into the k-nearest-neighbor distance matrix of the original data. For detection accuracy, the method detects abnormal subsequences with the k-nearest-neighbor distance instead of the nearest-neighbor distance, and uses a subsequence length range and step size instead of the usual fixed subsequence length, so that multiple types of abnormal patterns in the data can be detected more accurately in practice.
Drawings
FIG. 1 is a schematic flow chart of the method for detecting abnormal subsequences of indefinite length in a time series;
FIG. 2 is a schematic diagram of the abnormal subsequence detection results for the New York City taxi passenger data in the embodiment;
FIG. 3 is a schematic diagram of the effect of the method for detecting abnormal subsequences of indefinite length in a time series.
Detailed Description
As shown in FIG. 1, the method for detecting abnormal subsequences of indefinite length in a time series comprises the following steps:
S1: input time series data T, a minimum subsequence length min_len, a maximum subsequence length max_len, and a step size step; optionally input the number of neighbors k (default 1), the k-nearest-neighbor distance aggregation method (default median), the number of abnormal subsequences to detect n_discords (default 3), and the number of processes n_workers (default 4).
S2: determine the set of target subsequence lengths sub_len from the minimum subsequence length, the maximum subsequence length, and the step size, and execute the following loop for each target subsequence length sub_len in the set:
a) divide the time series T into several time series subsets T_worker according to the number of parallel processes n_workers (see 4.1);
b) in each process, apply the STOMP algorithm to each subsequence in T_worker to compute the local neighbor matrix mp_worker with respect to the time series T (see 4.2, 4.3, 4.4);
c) merge the results mp_worker of all processes, keep the k nearest-neighbor distances for each subsequence of T, and take their median (or mean) to form the k-nearest-neighbor distance matrix mp_sub_len of T at subsequence length sub_len (see 4.5);
S3: from the k-nearest-neighbor distance matrix mp_sub_len at each target subsequence length sub_len, obtain the n_discords abnormal subsequences with the largest k-nearest-neighbor distance.
S4: for each data point T_i in the time series T, count how many times it appears in the abnormal subsequences of the various lengths, and mark the points whose count exceeds a certain threshold as the finally detected anomalies.
In step S1, the time series data T should contain a time column and a value column giving the value of the series at each time point, and the time points are preferably equally spaced. The minimum subsequence length min_len, the maximum subsequence length max_len, and the step size step determine the subsequence length range: within the range from min_len to max_len inclusive, one value is taken every step as a target subsequence length, and the loop computation S2a-S2c is executed in turn for each length in the resulting set.
The embodiment of the invention is described in detail below, taking a time series data set of New York City taxi passenger counts as an example.
1. Acquire the data. The New York City taxi passenger data set has two columns, a timestamp (timestamp) and the number of passengers (value). It spans July 1, 2014 to January 31, 2015 at 30-minute intervals, for a total of 10320 data points.
2. Determine the parameters. The most important input parameters of the algorithm are those defining the subsequence length range: the minimum subsequence length, the maximum subsequence length, and the step size. Because the subsequence length strongly affects the result of abnormal subsequence detection, the invention replaces the single subsequence length with a length range, which markedly improves the stability of the detection result; the user must still supply a reasonable range. Empirically, suitable minimum and maximum lengths can be estimated with the following rule: if an actual anomaly spans L data points (e.g. 10), it is detected best when the subsequence length is roughly between 1.5L and 3L (i.e. 15-30 data points). In practice, the time span of actual anomalies can be roughly estimated from the data characteristics or domain experience, and the subsequence length range is then chosen using this rule.
3. Application to this embodiment. In this example, inspection of the data shows that the shortest anomalies may occur within 1-2 hours (2-4 data points), while the longest may last 1-2 days (48-96 data points), so the minimum subsequence length is set to 8 (4 hours), the maximum to 240 (5 days), and the step size to 8. The purpose of the step size is to reduce redundant computation at adjacent subsequence lengths. If no step size is set (i.e. it defaults to 1), the algorithm must loop over every target length between the minimum and maximum (here 8, 9, 10, ..., 239, 240, 233 target lengths in total), which takes a great deal of computation time, while the results at adjacent target lengths (for example 8 and 9) are very similar and repeating the computation adds little. Setting the step size therefore greatly improves efficiency without affecting the final detection result: with a step size of 8, the loop runs for only 30 target subsequence lengths sub_len, namely 8, 16, 24, ..., 232, 240.
4. Compute the k-nearest-neighbor distance matrix at each target subsequence length in a loop. In practice the number of neighbors k and the aggregation method for the k-nearest-neighbor distances (mean or median) must be supplied; in general, the first detection pass can use k = 1, i.e. the nearest-neighbor distance. This example first attempts detection with k = 1. Once k is fixed, the k-nearest-neighbor distance matrix is computed at the minimum subsequence length, the length is then increased by one step, and the process repeats until the maximum subsequence length is exceeded. Within each iteration, the k-nearest-neighbor distance matrix is computed as follows:
4.1. Partition the original time series data into subsets T_worker. The partitioning exists mainly to support the subsequent parallel computation; the specific partitioning scheme and its result do not affect the final detection result, only the running efficiency of the algorithm to some extent. In this example the subset size is 200, giving 51 subsets in total, and 4 processes are allocated for parallel computation. In other cases the subsets can be partitioned according to the actual data size and available computing resources; the invention places no limitation on this.
4.2. In each parallel process, the inputs are the time series data T (with n data points), the data subset T_worker, the target subsequence length sub_len, and the number of neighbors k; the local neighbor matrix mp_worker is initialized with n - sub_len + 1 rows and k columns, with every entry set to positive infinity;
4.3. For each subsequence T_{i,sub_len} in the data subset T_worker, the STOMP algorithm is applied, using the fast Fourier transform and a sliding dot product, to compute the distance between T_{i,sub_len} and every subsequence of the time series data T, giving a distance vector dist_{i,sub_len} of length n - sub_len + 1;
4.4. If k = 1, i.e. only the nearest-neighbor distance is computed, each distance vector dist_{i,sub_len} is compared element by element with the local neighbor matrix mp_worker and the smaller distance is kept at each position; if k > 1, k neighbor distances must be kept, so each distance vector dist_{i,sub_len} is merged with mp_worker and the k smallest values at each position are retained;
4.5. The local neighbor matrices mp_worker computed for the data subsets T_worker are merged, the k smallest values at each position are retained, and their mean or median is computed according to the chosen aggregation method, giving the k-nearest-neighbor distance matrix mp_sub_len of the time series data T at subsequence length sub_len.
5. From the k-nearest-neighbor distance matrix mp_sub_len at each target subsequence length sub_len, detect the corresponding abnormal subsequences. In practice the number of abnormal subsequences to detect, n_discords, must be supplied; it depends on an estimate of how many anomalies occur in the actual data, and the invention places no limitation on it. This example uses the default value 3. At each target subsequence length sub_len, the computed k-nearest-neighbor distance matrix mp_sub_len is sorted in descending order and the subsequence with the largest neighbor distance is selected as the first abnormal subsequence. If n_discords > 1, the remaining candidates are checked in order: if the difference between a candidate's position i and the position of an already selected abnormal subsequence is less than sub_len, i.e. the candidate overlaps an existing abnormal subsequence, it is skipped; checking continues until n_discords abnormal subsequences have been selected, giving the abnormal subsequence set discords_sub_len at target subsequence length sub_len.
6. Determine the final abnormal subsequence result by establishing an evaluation index. The previous step yields the abnormal subsequence set at each target subsequence length; on that basis the user can establish an evaluation index to determine the final result, and the index used is not fixed. This example uses the number of recurrences as the index: for each data point T_i in the time series T, count the number of times it appears in the abnormal subsequence sets discords_sub_len; if the count is greater than or equal to 5, the point is judged a final anomaly and output.
7. Evaluate the results. Following the steps above, the result of abnormal subsequence detection on the New York City taxi passenger data is shown in FIG. 2. The upper panel shows the original data with its anomalies marked, and the lower panel shows the abnormal subsequence sets detected by the algorithm at each subsequence length. With the evaluation index of at least 5 recurrences, the algorithm accurately identifies 5 anomalies in the data, corresponding to the winter season, Thanksgiving, Christmas, New Year's Day, and a snowstorm, and produces no false positives.
As the embodiments show, the invention can significantly improve the running efficiency and detection accuracy of abnormal subsequence detection in time series.
For running efficiency, the original time series data are divided into several subsets, several processes compute the k-nearest-neighbor distance matrices of the subsets in parallel, and the results are merged into the k-nearest-neighbor distance matrix of the original data. A test was run on KPI time series data of length 20000 with a subsequence length of 720 and 1 neighbor: computing the k-nearest-neighbor distance matrix took about 85 seconds with the original STOMP algorithm, about 36 seconds with 2 parallel processes, and about 25 seconds with 4 parallel processes, a clear improvement in running efficiency.
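The reported timings can be turned into speedup and parallel efficiency figures with a short calculation. The function name is hypothetical, and the timings are the approximate values quoted above, so the derived ratios are approximate as well.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, n_workers):
    """Speedup over the serial run, and speedup per worker process."""
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / n_workers

# Approximate timings reported in the text (seconds).
serial = 85.0
for n_workers, seconds in ((2, 36.0), (4, 25.0)):
    speedup, efficiency = speedup_and_efficiency(serial, seconds, n_workers)
    print(f"{n_workers} processes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

With these figures the 4-process run is about 3.4 times faster than the original STOMP run, and the 2-process run about 2.4 times faster.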
In terms of detection accuracy, the method uses the k-neighbor distance instead of the nearest-neighbor distance to detect abnormal subsequences, and uses a subsequence length range with a step parameter instead of the commonly used fixed subsequence length, so that multiple types of abnormal patterns in the data can be detected more accurately in practical applications. The same KPI time-series data of length 20000 is used for testing, with a minimum subsequence length of 180, a maximum subsequence length of 1440 and a step size of 30; the result of anomaly detection on this data is shown in Fig. 3. The upper part of Fig. 3 shows the original data with the actual abnormal points labeled, and the lower part shows the positions of the abnormal subsequences detected at each subsequence length. With more than 10 recurrences as the criterion for selecting the final abnormal result, the algorithm correctly identifies the 4 main anomalies in the data and produces only 1 false-positive detection.
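For a sense of scale, the parameters of this test determine the following set of target subsequence lengths. The helper name `target_lengths` is hypothetical; the values 180, 1440 and 30 are the ones quoted above.

```python
def target_lengths(min_len, max_len, step):
    """Every step-th length from min_len up to and including max_len."""
    return list(range(min_len, max_len + 1, step))

lengths = target_lengths(180, 1440, 30)
print(len(lengths), lengths[0], lengths[-1])  # 43 180 1440
```

The detection loop therefore runs 43 times, so a point can recur at most 43 times, and the "more than 10 recurrences" criterion requires detection at roughly a quarter of the lengths.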

Claims (10)

1. A method for detecting abnormal subsequences of indefinite length in a time series, comprising the following steps:
S1: inputting time-series data T, a minimum subsequence length min_len, a maximum subsequence length max_len and a step size step; optionally inputting the number of neighbors k, a k-neighbor distance integration method, the number n_discords of abnormal subsequences to detect, and the number n_works of parallel processes;
S2: determining the set of target subsequence lengths sub_len from the minimum subsequence length min_len, the maximum subsequence length max_len and the step size step, and executing the following loop for each target subsequence length sub_len in the set:
a) dividing the time series T into several time-series subsets T_worker according to the number n_works of parallel processes;
b) in each process, applying the STOMP algorithm to compute, for each subsequence in the time-series subset T_worker, the local neighbor matrix mp_worker over the time series T;
c) integrating the results mp_worker of the processes, retaining the k nearest-neighbor distances for each subsequence in the time series T, and taking the median or the mean of those k distances to form the k-neighbor distance matrix mp_sub_len of the time series T at subsequence length sub_len;
S3: from the k-neighbor distance matrix mp_sub_len of each target subsequence length sub_len, computing the n_discords abnormal subsequences with the largest k-neighbor distances;
S4: for each data point T_i in the time series T, counting how many times it occurs in the abnormal subsequences of the various lengths, and marking points whose count exceeds a threshold as the finally detected abnormal values.
2. The method for detecting abnormal subsequences of indefinite length in a time series according to claim 1, wherein: in step S1, the number of neighbors k defaults to 1, the k-neighbor distance integration method defaults to the median, the number n_discords of abnormal subsequences to detect defaults to 3, and the number n_works of parallel processes defaults to 4.
3. The method for detecting abnormal subsequences of indefinite length in a time series according to claim 1, wherein: in step S1, the time-series data T should include a time sequence and a value sequence, representing the value of the series at each time point, the time points preferably being equally spaced; the minimum subsequence length min_len, the maximum subsequence length max_len and the step size step determine the length range of the subsequences, i.e. within the range from min_len to max_len inclusive, one value is taken every step as a target subsequence length, and for the resulting set of target subsequence lengths the loop computations S2a to S2c are executed in turn.
4. The method for detecting abnormal subsequences of indefinite length in a time series according to claim 1, wherein: in step S2, anomaly detection is first performed using the nearest-neighbor distance with k = 1; if similar abnormal patterns that cannot be identified are found in the result, the value of k is increased and anomalies are detected using the k-neighbor distance.
5. The method according to claim 3, wherein: in step S2, the time series T is divided into several time-series subsets T_worker; in each parallel process, the time-series data T, the data subset T_worker, the target subsequence length sub_len and the number of neighbors k are input, and a k-neighbor distance matrix matrix_profile_worker is initialized with n-sub_len+1 rows and k columns, all initial values being positive infinity.
6. The method according to claim 5, wherein: in step S2, for each subsequence T_{i,sub_len} in the data subset T_worker, the STOMP algorithm is applied to compute, by means of the fast Fourier transform and sliding dot products, the distance between the subsequence T_{i,sub_len} and every subsequence in the time-series data T, giving a distance vector dist_{i,sub_len} of length n-sub_len+1.
7. The method according to claim 6, wherein: in step S2, if k = 1, i.e. only the nearest-neighbor distance is computed, each distance vector dist_{i,sub_len} is compared element by element with the local neighbor matrix mp_worker and the smaller distance value is retained at each position; if k > 1, k neighbor distances must be retained, and after each distance vector dist_{i,sub_len} is merged with the local neighbor matrix mp_worker, the k smallest values at each position are retained.
8. The method for detecting abnormal subsequences of indefinite length in a time series according to claim 7, wherein: in step S2, the local neighbor matrices mp_worker computed as above for the data subsets T_worker are merged, the k smallest values at each position are retained, and the mean or the median is computed according to the input summarizing mode, giving the k-neighbor distance matrix mp_sub_len of the time-series data T at subsequence length sub_len.
9. The method according to claim 8, wherein: in step S3, the k-neighbor distance matrix mp_sub_len computed for each target subsequence length sub_len is sorted in descending order, and the 1 subsequence with the largest neighbor distance is selected as an abnormal subsequence result; if n_discords is greater than 1, the remaining subsequences are checked one by one in that order: if the difference between the position i of a subsequence and the position of an already selected abnormal subsequence is less than sub_len, i.e. the checked subsequence overlaps an already selected abnormal subsequence, the current subsequence is skipped and checking continues, until the number of abnormal subsequences reaches n_discords, giving the abnormal subsequence set discords_sub_len at target subsequence length sub_len.
10. The method according to claim 9, wherein: in step S4, after the abnormal subsequence set at each target subsequence length is obtained, the final abnormal-subsequence result is determined by establishing an evaluation index.
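The selection procedure recited in claim 9 can be illustrated with a short sketch. It is a reading of the claim under stated assumptions, not the patented implementation: the function name `top_discords` is hypothetical, the input is taken to be a one-dimensional array of aggregated k-neighbor distances, and the toy values are invented for the example.

```python
import numpy as np

def top_discords(mp_sub_len, sub_len, n_discords):
    """Pick up to n_discords non-overlapping subsequences with the largest distances.

    mp_sub_len -- assumed 1-D array: aggregated k-neighbor distance per start index
    """
    order = np.argsort(mp_sub_len)[::-1]  # start indices, largest distance first
    chosen = []
    for i in order:
        i = int(i)
        # skip candidates closer than sub_len to an already selected one (overlap)
        if all(abs(i - j) >= sub_len for j in chosen):
            chosen.append(i)
        if len(chosen) == n_discords:
            break
    return chosen

# Invented toy distances for ten subsequence start positions.
mp = np.array([0.1, 0.9, 0.8, 0.2, 0.3, 0.7, 0.1, 0.6, 0.5, 0.4])
print(top_discords(mp, sub_len=3, n_discords=3))  # [1, 5, 8]
```

Index 2 has the second-largest distance but overlaps index 1, so it is skipped, as is index 7 because of index 5. If too few non-overlapping candidates exist, the function returns fewer than n_discords indices.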

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110226782.0A CN112966017B (en) 2021-03-01 2021-03-01 Abnormal subsequence detection method for indefinite length in time sequence

Publications (2)

Publication Number Publication Date
CN112966017A (en) 2021-06-15
CN112966017B (en) 2023-11-14

Family

ID=76276232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110226782.0A Active CN112966017B (en) 2021-03-01 2021-03-01 Abnormal subsequence detection method for indefinite length in time sequence

Country Status (1)

Country Link
CN (1) CN112966017B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553232A (en) * 2021-07-12 2021-10-26 厦门大学 Technology for carrying out unsupervised anomaly detection on operation and maintenance data through online matrix portrait
CN116383190A (en) * 2023-05-15 2023-07-04 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive big data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010078467A (en) * 2008-09-26 2010-04-08 Internatl Business Mach Corp <Ibm> Time-series data analysis system, method, and program
KR20130107889A (en) * 2012-03-23 2013-10-02 삼성전자주식회사 Aparatus and method for detecting anomalous subsequence
CN106127249A (en) * 2016-06-24 2016-11-16 深圳市颐通科技有限公司 A kind of single time series exception subsequence detection method
CN110378371A (en) * 2019-06-11 2019-10-25 广东工业大学 A kind of energy consumption method for detecting abnormality based on average nearest neighbor distance Outlier factor
CN110569890A (en) * 2019-08-23 2019-12-13 河海大学 Hydrological data abnormal mode detection method based on similarity measurement
WO2020019403A1 (en) * 2018-07-26 2020-01-30 平安科技(深圳)有限公司 Electricity consumption abnormality detection method, apparatus and device, and readable storage medium
CN111835738A (en) * 2020-06-30 2020-10-27 山东大学 Network abnormal flow automatic detection method based on time series mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
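刘雪梅; 王亚茹: "Time series anomaly pattern detection based on anomaly factor", Computer Technology and Development (计算机技术与发展), no. 03 *
展鹏; 陈琳; 曹鲁慧; 李学庆: "Network anomalous traffic detection algorithm based on feature symbol representation", Journal of Zhejiang University (Engineering Science) (浙江大学学报(工学版)), no. 07 *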

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553232A (en) * 2021-07-12 2021-10-26 厦门大学 Technology for carrying out unsupervised anomaly detection on operation and maintenance data through online matrix portrait
CN113553232B (en) * 2021-07-12 2023-12-05 厦门大学 Technology for carrying out unsupervised anomaly detection on operation and maintenance data through online matrix image
CN116383190A (en) * 2023-05-15 2023-07-04 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive big data
CN116383190B (en) * 2023-05-15 2023-08-25 青岛场外市场清算中心有限公司 Intelligent cleaning method and system for massive financial transaction big data

Also Published As

Publication number Publication date
CN112966017B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
WO2017008451A1 (en) Abnormal load detecting method for cloud computing oriented online service
CN108731923B (en) Fault detection method and device for rotary mechanical equipment
US9779361B2 (en) Method for learning exemplars for anomaly detection
US9245000B2 (en) Methods for the cyclical pattern determination of time-series data using a clustering approach
CN112966017B (en) Abnormal subsequence detection method for indefinite length in time sequence
CN107909344B (en) Workflow log repeated task identification method based on relation matrix
US20220042952A1 (en) State estimation device and state estimation method
Wang et al. Dimension reduction for clustering time series using global characteristics
US20230385699A1 (en) Data boundary deriving system and method
CN111611961A (en) Harmonic anomaly identification method based on variable point segmentation and sequence clustering
CN111967535A (en) Fault diagnosis method and device for temperature sensor in grain storage management scene
US7813893B2 (en) Method of process trend matching for identification of process variable
CN112487048A (en) Correlation analysis method and device based on time series abnormal fluctuation
CN114020598B (en) Method, device and equipment for detecting abnormity of time series data
JP7238378B2 (en) Abnormality detection device, abnormality detection program, and abnormality detection method
CN112183469B (en) Method for identifying congestion degree of public transportation and self-adaptive adjustment
CN117093944A (en) Time sequence data template self-adaptive abnormal mode identification method and system
CN112541016A (en) Power consumption abnormality detection method, device, computer equipment and storage medium
CN115878987A (en) Fault positioning method based on contribution value and causal graph
CN115700553A (en) Anomaly detection method and related device
CN113205146A (en) Time sequence data abnormal fluctuation detection algorithm based on fragment statistical characteristic comparison
CN113535527A (en) Load shedding method and system for real-time flow data predictive analysis
CN116662466B (en) Land full life cycle maintenance system through big data
CN111612082B (en) Method and device for detecting abnormal subsequence in time sequence
CN114886440B (en) Epileptic sample discharge classification model training and recognition method, system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 717, Building D, Fudun Center, No. 58 East Third Ring South Road, Chaoyang District, Beijing, 100022
Applicant after: Beijing Qingmeng Shuhai Technology Co.,Ltd.
Address before: 2517, block D, Futon center, No.58, South East Third Ring Road, Chaoyang District, Beijing 100022
Applicant before: Beijing Qingmeng Shuhai Technology Co.,Ltd.
GR01 Patent grant