Background
Over the last decade, hundreds of articles have studied how to find the subsequence most similar to a given time series in a large amount of time series data (that is, data recorded in time order). This patent instead studies how to find the subsequence that differs most from all the others in a large amount of time series data, which is called a time series anomaly.
In short, a time series anomaly is a fragment of a very large time series that differs greatly from the rest of the data. Time series anomalies are widely useful in data mining, including improving the quality of classification and clustering, data cleaning, and anomaly detection.
An ECG (electrocardiogram) is an important kind of time series data, and each heartbeat can be represented by one period of the ECG time series, i.e., a subsequence. Searching for anomalies in ECG data helps a physician quickly locate abnormalities in a large amount of ECG data and then carry out disease analysis.
Many algorithms exist for finding anomalies in a large amount of ECG time series data, but most of them require more than two parameters and are complicated to compute.
The brute force algorithm for finding anomalies needs only one parameter, the subsequence length n, and measures the difference between two sequences by Euclidean distance, which is relatively simple. However, its time complexity is high: with m subsequences in total, the time complexity is O(m^2). Researchers therefore added a threshold to the brute force algorithm so that it can early-abandon and skip most of the computation, making it 4 to 5 orders of magnitude faster.
The following describes how the improved brute force algorithm finds anomalies in a large amount of time series data (assuming the time series data is an ECG):
First, the technical vocabulary used below is introduced:
Time series: a time series T is an ordered set of m real-valued, time-dependent variables: T = t1, …, tm.
Subsequence: a subsequence C of T is a contiguous run of length n (n < m) taken from T: C = tp, …, tp+n-1, where 1 ≤ p ≤ m-n+1.
Sliding window: given a time series T of length m and a user-defined subsequence length n, all possible subsequences can be extracted by sliding a window of length n across T; these subsequences are denoted Cp.
Distance (distance function): Dist takes two subsequences C and M as input and returns a non-negative value R, called the distance between C and M. By definition the function is symmetric, i.e., Dist(C, M) = Dist(M, C).
Non-self match: for a time series T containing a subsequence C of length n beginning at position p and another subsequence M of the same length beginning at position q, M is a non-self match of C if |p-q| ≥ n.
nearest_neighbor_dist (nearest neighbor distance): for a subsequence C of the time series T (T contains a large number of subsequences), if the distance from C to another subsequence D of T is smaller than the distance from C to any other subsequence of T, then Dist(C, D) is called the nearest neighbor distance of C, nearest_neighbor_dist.
Time series discord (time series anomaly): for a time series T, a subsequence D of length n beginning at position i is said to be the anomaly of T if the nearest neighbor distance of D is greater than that of any other subsequence (e.g., C). That is, for every subsequence C of T, with MC a non-self match of C and MD a non-self match of D, min(Dist(D, MD)) > min(Dist(C, MC)). In short, under non-self matching, the time series anomaly has the largest nearest_neighbor_dist.
The Array structure: after the time series data has been mapped by SAX into characters such as abcbacb and divided into subsequences, the subsequences are stored in the Array together with their positions and the number of times their corresponding string occurs. This shows which subsequence's string occurs least often, and that subsequence is used as the first candidate anomaly TS in the outer loop.
The Trie structure: this is a ternary tree that indexes the subsequence strings and stores all positions at which each string occurs. When the nearest_neighbor_dist of a subsequence C is needed, the Trie supplies the other subsequences that share C's string, and these are used for the comparisons in the inner loop.
SAX technique: the time series data is divided into segments of equal length, the mean of each segment is computed, and each mean is mapped to a character according to the value range in which it falls.
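The SAX mapping and the two lookup structures can be illustrated with a minimal Python sketch. It is not the patent's implementation: the z-normalization step, the three-letter alphabet with approximate Gaussian breakpoints, and the names `sax_word` and `build_index` are all assumptions made for illustration, and a plain dictionary stands in for the Trie.

```python
import math
from collections import defaultdict

# Assumed breakpoints: they split a standard normal distribution into three
# roughly equiprobable regions, one per character of the alphabet a, b, c.
BREAKPOINTS = (-0.43, 0.43)
ALPHABET = "abc"

def sax_word(subseq, n_segments):
    """Map one subsequence to a SAX string (length must divide evenly)."""
    mean = sum(subseq) / len(subseq)
    std = math.sqrt(sum((x - mean) ** 2 for x in subseq) / len(subseq)) or 1.0
    normed = [(x - mean) / std for x in subseq]   # z-normalize (assumption)
    seg_len = len(normed) // n_segments
    word = []
    for s in range(n_segments):
        seg = normed[s * seg_len:(s + 1) * seg_len]
        avg = sum(seg) / len(seg)                 # mean of the segment
        idx = sum(avg > b for b in BREAKPOINTS)   # value range -> character
        word.append(ALPHABET[idx])
    return "".join(word)

def build_index(series, n, n_segments):
    """Return (words, positions): the SAX word at each window start, and a
    dictionary from word to all start positions (a stand-in for the Trie)."""
    words = []
    positions = defaultdict(list)
    for p in range(len(series) - n + 1):
        w = sax_word(series[p:p + n], n_segments)
        words.append(w)
        positions[w].append(p)
    return words, positions
```

With this index, the word with the fewest positions, `min(positions, key=lambda w: len(positions[w]))`, plays the role of the rarest string that the Array is used to find.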
Next, the improved brute force algorithm flow is described in detail, as shown in fig. 1.
1. It is assumed that all the ECG data is divided into subsequences according to the heartbeat cycle. These subsequences are mapped by the SAX technique to a series of strings such as acb, baa, …. In FIG. 1, p = 1 denotes the first candidate anomaly, q = 1, 2, 3, … denotes the other subsequences with which the candidate anomaly is compared (q excludes the current candidate subsequence), and n is the subsequence length. The red box marks the improvement over the brute force algorithm: a threshold best_so_far_dist is added, which is the largest nearest_neighbor_dist found so far during execution. Removing this threshold gives the plain brute force algorithm.
2. The brute force algorithm uses an outer loop to take each subsequence in turn as the candidate anomaly; for each candidate, an inner loop finds its nearest_neighbor_dist; finally, the largest of the nearest_neighbor_dist values is selected, and the corresponding subsequence is the anomaly sought. This finds the anomaly but takes too long: the time complexity is quadratic in the number of subsequences.
3. The improved brute force algorithm searches in the same way. Each subsequence is regarded as a candidate anomaly, the inner loop computes the Euclidean distance between the current candidate and the other subsequences, and the minimum distance is taken as the nearest_neighbor_dist. After the outer loop has visited every subsequence, the nearest_neighbor_dist of each subsequence is known, and the subsequence with the largest nearest_neighbor_dist is selected as the anomaly. However, if the inner loop finds a distance smaller than the current best_so_far_dist, the rest of that inner loop need not be executed: the nearest_neighbor_dist of the current candidate is then certainly smaller than best_so_far_dist, so the candidate cannot be the final result, and only the largest nearest_neighbor_dist and its subsequence are needed.
Only one nearest_neighbor_dist is ultimately required, so the nearest_neighbor_dist values computed by most inner loops are not part of the final result and those inner loops do not need to run to completion. The inner loop can stop as soon as it encounters a distance smaller than best_so_far_dist, which skips most of the redundant computation. This process is called early-abandon.
The effect of early-abandon depends on two things: first, the order in which the outer loop considers the candidate anomalies; second, the order in which the inner loop visits the subsequences compared against the candidate.
In the outer loop, a perfect ordering is not needed. It is enough to find a relatively large nearest_neighbor_dist at the start to serve as the current best_so_far_dist threshold, that is, to find a subsequence that differs greatly from the rest, so that the following inner loops can early-abandon. The Array structure finds such a subsequence quickly: it is the subsequence whose string occurs least often in the Array. Once this subsequence has been taken as the first candidate anomaly, the remaining candidates can be visited in sequence, without sorting.
In the inner loop, again, a perfect ordering (i.e., sorted from small to large by Euclidean distance from the current candidate anomaly) is not required; encountering any distance smaller than best_so_far_dist is enough to trigger early-abandon. The Trie structure is used to look up the string corresponding to the current candidate anomaly. That string generally corresponds to several subsequences, whose positions are all stored at the leaf nodes of the Trie. These subsequences differ only slightly from one another, i.e., their distances are small, which is why SAX maps them to the same string. It is therefore enough to place these few subsequences at the front of the inner loop for early-abandon to work well; the other subsequences can follow in their natural order. In other words, for each candidate anomaly the inner loop first finds the most similar subsequences through the Trie structure and then performs the comparisons.
4. When the inner and outer loops have finished, the final best_so_far_dist and best_so_far_loc are the nearest_neighbor_dist of the anomaly sought and its position.
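Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the algorithm of FIG. 1 itself: it visits candidates and comparison subsequences in their natural order instead of using the Array and Trie heuristics, and the function names `euclidean` and `find_anomaly` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_anomaly(series, n, dist=euclidean):
    """Return (best_so_far_dist, best_so_far_loc) for subsequence length n."""
    starts = range(len(series) - n + 1)
    best_so_far_dist, best_so_far_loc = 0.0, None
    for p in starts:                          # outer loop: candidate anomalies
        candidate = series[p:p + n]
        nearest_neighbor_dist = math.inf
        for q in starts:                      # inner loop: other subsequences
            if abs(p - q) < n:                # keep only non-self matches
                continue
            d = dist(candidate, series[q:q + n])
            if d < best_so_far_dist:          # early-abandon this candidate
                break
            nearest_neighbor_dist = min(nearest_neighbor_dist, d)
        else:
            # The inner loop ran to completion without abandoning.
            if best_so_far_dist < nearest_neighbor_dist < math.inf:
                best_so_far_dist = nearest_neighbor_dist
                best_so_far_loc = p
    return best_so_far_dist, best_so_far_loc
```

For example, on a short alternating series with one corrupted sample, `find_anomaly([0, 1, 0, 1, 0, 1, 0, 1, 5, 1, 0, 1, 0, 1, 0, 1, 0, 1], 4)` returns a location whose window covers the corrupted sample at index 8. Passing a different `dist` function changes only the distance measure, not the loop structure.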
Problems and disadvantages with the above methods:
Compared with the original brute force algorithm, the improved brute force algorithm skips most of the redundant computation and can find anomalies in larger time series data. However, both the original and the improved algorithm are based on Euclidean distance, which has poor robustness and cannot support accurate searching. ECG data usually contains some phase shift, which distorts the distance between two ECG subsequences computed by Euclidean distance. For example, when two otherwise identical ECG subsequences differ by a phase shift, the distance computed directly by Euclidean distance is very large, which does not reflect reality.
In particular, when noise in the original data or human error introduces a phase shift into some non-anomalous subsequences (which is generally hard to avoid in practice), the nearest_neighbor_dist computed from the Euclidean distance becomes very large, and these non-anomalous subsequences are then likely to be detected as anomalies, giving an erroneous result.
Compared with the Euclidean distance, the DTW distance is robust and resistant to noise. In particular, when two similar time series differ by a phase shift, the computed distance is smaller and closer to the real situation. However, DTW itself has high time complexity, so finding the nearest_neighbor_dist through the inner loop over a large amount of time series data takes too long. Directly replacing the Euclidean distance with the DTW distance to measure the difference between two subsequences is therefore not feasible and does not give a satisfactory result.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below. The following description encompasses numerous specific details in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a clearer understanding of the present invention by illustrating examples of the present invention. The present invention is in no way limited to any specific configuration and algorithm set forth below, but rather covers any modification, substitution, and improvement of relevant elements, components, and algorithms without departing from the spirit of the invention.
The embodiment of the invention provides a method for accurately searching for an anomalous time series in very large-scale time series data. First, on the basis of the existing improved brute force algorithm, the Euclidean distance function is replaced by the DTW distance. Second, because the time complexity of DTW itself is high, such a direct replacement is not feasible, so the idea of "secondary screening" is put forward: the first and most expensive loop pass of the original algorithm is taken out and replaced by secondary screening, which solves the problem that simply swapping the distance function cannot work because the cost of DTW is too high.
Next, DTW (Dynamic Time Warping) and secondary screening will be described.
Assume two time series Q and C, of lengths n and m respectively, expressed as:
$Q = q_1, q_2, \ldots, q_i, \ldots, q_n$
$C = c_1, c_2, \ldots, c_j, \ldots, c_m$
To calculate the distance between the two sequences using the DTW algorithm, we create an n × m matrix whose (i, j)-th element contains the distance $d(q_i, c_j)$ between the two points $q_i$ and $c_j$:
$d(q_i, c_j) = (q_i - c_j)^2$
Each matrix element (i, j) corresponds to the alignment between $q_i$ and $c_j$. A warping path W is a contiguous set of matrix elements that defines a mapping between Q and C; the k-th element of W is defined as $w_k = (i, j)_k$, so that:
$W = w_1, w_2, \ldots, w_k, \ldots, w_K, \quad \max(m, n) \le K < m + n - 1$
The warping path is generally subject to the following constraints:
Boundary conditions: $w_1 = (1, 1)$ and $w_K = (n, m)$, i.e., the path starts and ends at the first and last points of the diagonal of the warping matrix.
Continuity: the points on the path must be adjacent (including diagonally adjacent points). Given $w_k = (a, b)$ and $w_{k-1} = (a', b')$, then $a - a' \le 1$ and $b - b' \le 1$.
Monotonicity: the path never moves backward, i.e., $a - a' \ge 0$ and $b - b' \ge 0$.
Many warping paths satisfy the above conditions, but we are interested only in the one that gives the minimum distance.
This path can be found by dynamic programming. We define the cumulative distance γ(i, j) of the two time series at cell (i, j) of the warping matrix as the distance between the two points at the current position plus the smallest of the cumulative distances of the three adjacent cells, computed iteratively:
$\gamma(i, j) = d(q_i, c_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$
This recursive equation is the basis for determining the warping path; γ(i, j) is the path distance accumulated so far at the current cell. The first term on the right-hand side is the distance between the two points $q_i$ and $c_j$ of the current cell; the second term is the minimum of the cumulative distances of the cells adjacent to (i, j).
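The recursion for γ(i, j) translates directly into a small dynamic-programming routine. The following Python sketch is illustrative only: the function name `dtw_distance` is hypothetical, and the value returned is the accumulated squared cost γ(n, m), with no square root and no warping-window constraint, matching the definition of d(q_i, c_j) given above.

```python
import math

def dtw_distance(q, c):
    """Accumulated DTW cost gamma(n, m) between sequences q and c."""
    n, m = len(q), len(c)
    # gamma has an extra row and column of infinity so that the boundary
    # condition w_1 = (1, 1) falls out of the same recursion.
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2            # d(q_i, c_j)
            gamma[i][j] = d + min(gamma[i - 1][j - 1],  # diagonal step
                                  gamma[i - 1][j],      # step along Q
                                  gamma[i][j - 1])      # step along C
    return gamma[n][m]
```

For instance, `dtw_distance([0, 0, 1, 2], [0, 1, 2, 2])` is 0 because the warping path absorbs the one-sample shift, whereas the point-by-point squared difference of the same two sequences is 2.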
Secondary screening
The purpose of the secondary screening is to find, in the whole set of subsequences, the DTW distance between the first candidate anomaly TS and its closest non-self match subsequence, i.e., the nearest_neighbor_dist of TS, and then to use it as the best_so_far_dist threshold.
ECG data records the beating of the patient's heart, and phase differences are unavoidable. For two ECG sequences that are very similar but differ by a phase shift, the Euclidean distance computed directly between them is much larger than the DTW distance. The Euclidean distance therefore suggests that the two sequences differ greatly when they are in fact similar; in contrast, the DTW-based result is closer to reality.
In general, for a pair of similar ECG sequences with a significant phase shift and noise interference, the Euclidean distance is greater than the DTW distance, so the Euclidean distance gives a wrong result while DTW correctly identifies the sequences as similar.
However, because the DTW distance is expensive to compute, it is not practical to compute the DTW distance between the first candidate anomaly and every other subsequence and then select the smallest one as the threshold best_so_far_dist.
Thus, the invention adopts secondary screening:
First screening (coarse screening)
In the first screening, a set Z of sequences similar to the candidate anomaly TS is selected directly on the basis of Euclidean distance.
To ensure that the subsequence most similar to the candidate anomaly TS falls within Z, the maximum shift allowed between two similar sequences is set to r: the candidate anomaly sequence is shifted by r units, and the Euclidean distance between the shifted and the original sequence (for equal-length sequences, DTW distance ≤ Euclidean distance) is computed and taken as a threshold E. Every original subsequence whose Euclidean distance to TS is smaller than E is placed in the candidate set Z, and the members of Z are sorted from small to large as Z = Z1, …, Zn to await the second screening.
It follows that the sequence most similar to the anomalous sequence TS, whether or not it carries a phase shift or noise, will be within this set Z.
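The coarse screening can be sketched in Python as below. Several details are assumptions rather than statements from the text: the shift by r units is modeled as a circular rotation of TS, E is taken as the Euclidean distance between TS and its rotated copy, and the names `euclidean` and `coarse_screen` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def coarse_screen(ts, others, r):
    """Return (E, Z) where Z lists the indices of `others` whose Euclidean
    distance to ts is below E, sorted from nearest to farthest."""
    shifted = ts[r:] + ts[:r]          # assumed model of a shift by r units
    threshold = euclidean(ts, shifted)  # threshold E
    scored = [(euclidean(ts, s), idx) for idx, s in enumerate(others)]
    z = sorted((d, idx) for d, idx in scored if d < threshold)
    return threshold, [idx for _, idx in z]
```

The returned index list corresponds to Z1, …, Zn in ascending order of Euclidean distance, ready to be handed to the second screening.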
Second screening (fine screening)
The second screening performs an accurate search based on the DTW algorithm.
First, the LB_Keogh technique is briefly introduced. LB_Keogh can be regarded as a simplified stand-in for DTW: it is a lower bound, so the distance it yields does not exceed the DTW distance, and its time complexity is much lower than that of DTW.
Specifically, the ECG time series Q is wrapped in an envelope whose upper and lower bounds are U and L, respectively; the LB_Keogh distance from another time series C to Q is then the Euclidean distance from C to the envelope, as shown in FIGS. 2A and 2B. For details, see the article "Exact indexing of dynamic time warping".
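The envelope construction can be made concrete with a short Python sketch. It assumes a warping window of r samples on each side when building U and L (the bound holds for DTW restricted to that window), returns the squared distance without a square root so that it is comparable with the accumulated cost γ, and uses the hypothetical names `envelope` and `lb_keogh`.

```python
def envelope(q, r):
    """Upper and lower envelope (U, L) of q for a warping window r."""
    n = len(q)
    upper = [max(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    lower = [min(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    return upper, lower

def lb_keogh(q, c, r):
    """Squared distance from c to the envelope of q (equal lengths assumed)."""
    upper, lower = envelope(q, r)
    total = 0.0
    for ci, u, l in zip(c, upper, lower):
        if ci > u:                 # point lies above the envelope
            total += (ci - u) ** 2
        elif ci < l:               # point lies below the envelope
            total += (l - ci) ** 2
    return total                   # points inside the envelope contribute 0
```

Each point of C costs one comparison against U and L, so the bound is linear in the sequence length, compared with the quadratic matrix fill required for the full DTW distance.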
In the second screening, the DTW distance between TS and Z1 is computed first and set as the threshold bsf. The candidate anomaly TS is then compared with the remaining members of the set Z in order. When the distance between TS and one of these subsequences is computed, the DTW distance is used for the front part of the sequences and LB_Keogh for the back part, with the boundary between the two parts moved progressively backward; the DTW distance of the front part and the LB_Keogh distance of the back part are added together. As shown in FIGS. 3A and 3B, the region hatched with diagonal lines at the lower left represents the DTW distance between the two sequences, which is expensive to compute; the region hatched with vertical lines at the upper right represents the LB_Keogh distance between the two sequences, which is cheaper to compute. The sum of the two distances is compared with the threshold bsf. If the sum is larger than bsf, the calculation stops (early-abandon), so the DTW distance of the whole sequence does not need to be computed and time is saved. Otherwise the boundary keeps advancing to the right, and the complete DTW distance finally obtained becomes the new bsf threshold.
The following sequences are compared in order in the same way. Finally, the sequence C most similar to the candidate anomaly TS is found, and the DTW distance bsf between C and TS, which is the nearest_neighbor_dist of TS, is taken as the final best_so_far_dist.
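One way to realize this combination of partial DTW and remaining LB_Keogh is sketched below in Python. It is an assumption-laden illustration, not the patented procedure: the sequences are taken to have equal length, DTW is restricted to a warping window r, costs are squared differences without a square root, and the names `lb_tail`, `dtw_early_abandon`, and `fine_screen` are hypothetical.

```python
import math

def lb_tail(q, c, r):
    """tail[j] = LB_Keogh contribution of c[j:], measured against q's envelope."""
    n = len(q)
    tail = [0.0] * (n + 1)
    for j in range(n - 1, -1, -1):
        window = q[max(0, j - r):min(n, j + r + 1)]
        u, l = max(window), min(window)
        if c[j] > u:
            contrib = (c[j] - u) ** 2
        elif c[j] < l:
            contrib = (l - c[j]) ** 2
        else:
            contrib = 0.0
        tail[j] = tail[j + 1] + contrib
    return tail

def dtw_early_abandon(q, c, r, bsf):
    """Windowed DTW cost of q and c, or infinity if it provably exceeds bsf."""
    n = len(q)
    tail = lb_tail(q, c, r)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for j in range(1, n + 1):                  # boundary moves along c
        cur = [math.inf] * (n + 1)
        for i in range(max(1, j - r), min(n, j + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            cur[i] = d + min(prev[i - 1], prev[i], cur[i - 1])
        # DTW cost of the front part plus LB_Keogh of the back part.
        if min(cur) + tail[j] > bsf:
            return math.inf                    # early-abandon
        prev = cur
    return prev[n]

def fine_screen(ts, candidates, r):
    """Return (bsf, index) of the candidate with the smallest DTW cost to ts."""
    bsf, best = math.inf, None
    for idx, z in enumerate(candidates):
        d = dtw_early_abandon(ts, z, r, bsf)
        if d < bsf:
            bsf, best = d, idx
    return bsf, best
```

Because bsf starts at infinity, the first candidate is always evaluated in full, which corresponds to computing the DTW distance between TS and Z1 as the initial threshold.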
After the secondary screening is finished, the subsequent inner and outer loops are entered, and early-abandon is performed with best_so_far_dist as the threshold. This saves most of the redundant computation, and the anomalous time series is finally found accurately in a large amount of data.
The invention provides a new improved brute force algorithm that can accurately search for anomalies in massive data, and solves the problem that simply substituting DTW is not workable. Owing to the robustness of DTW, the method can still find anomalies accurately when the actual ECG data contains noise or phase shift, and can thereby help detect heart disease, which demonstrates the advantages of the method and matches practical application conditions.
As shown in FIG. 4, the abnormal time series data extraction method of the present invention includes the following steps:
storing the ECG data set in which the anomalous time series is to be searched;
mapping the ECG time series into a character set such as acbacab;
storing the character set data processed by the data preprocessing unit;
storing the character set and building an index to find the most likely ECG anomalous time series TS as the first candidate anomaly;
storing the arranged character set and the candidate anomaly TS;
in the improved brute force algorithm of the HOT SAX article, the first loop pass must be computed in full to obtain a best_so_far_dist that serves as the threshold for early-abandon in the following loops, thereby saving computation. When the Euclidean distance function is directly replaced by DTW, this first pass cannot be completed because of the high time complexity of DTW and the large volume of ECG time series data. The first pass is therefore taken out and replaced by secondary screening, which achieves the same effect. The first screening performs a coarse screening directly by Euclidean distance to select a smaller candidate range; the second screening then performs an accurate screening by DTW distance within that smaller range to obtain the first threshold best_so_far_dist, so that early-abandon can still be performed in the following loops and most irrelevant calculations are removed;
storing the values of best_so_far_dist and best_so_far_loc; when the distance between the current candidate anomaly TS and another subsequence is smaller than best_so_far_dist (after the first pass this is usually the case, because the first candidate anomaly has the fewest occurrences and the best_so_far_dist obtained from it is the most likely to be the maximum), performing early-abandon, exiting the inner loop, and removing most irrelevant calculations;
when the distance between a subsequence and the current candidate anomaly TS is smaller than the best_so_far_dist threshold, performing early-abandon, exiting the inner loop, returning to the start of the outer loop, updating to the next candidate anomaly TS, and restarting the inner loop; otherwise, continuing to execute the inner loop to find the nearest_neighbor_dist of the current candidate anomaly as the new best_so_far_dist threshold;
storing the latest nearest_neighbor_dist returned by the inner loop, or recording that the inner loop was exited by early-abandon;
updating the candidate anomaly TS after the inner loop has finished;
storing the best_so_far_dist and best_so_far_loc updated in real time by the inner and outer loops;
storing the final best_so_far_dist and best_so_far_loc.
The present invention operates on ECG time series. The time series data is divided by a sliding window into a set of equal-length subsequences, and the subsequences compared are non-self matches, i.e., they do not overlap one another on the time axis.
As shown in FIG. 5, once the invention is implemented in code, it forms a software system that accepts parameters. After a large ECG time series data set T is input, the parameters are set as required. The software then preprocesses the raw data set, stores it in the Array and Trie structures, and uses these two structures to determine the first candidate ECG anomaly for the outer loop and the order in which time series are compared with the candidate anomaly in the inner loop. After this preparation, the nearest_neighbor_dist of the first candidate anomaly is obtained through secondary screening and used as the best_so_far_dist threshold. The outer loop is then entered: the next candidate anomaly TS is selected, and the candidate anomaly TS is updated once after each inner loop finishes.
Each step of the outer loop corresponds to one inner loop. The main function of the inner loop is to judge whether the distance between the current candidate anomaly TS and any of its non-self match subsequences is smaller than the threshold best_so_far_dist. If one of the distances is smaller than the threshold, early-abandon is performed and the inner loop is exited, which saves most of the computation and allows the anomalous time series to be found quickly. Otherwise the inner loop continues to run, and best_so_far_dist and best_so_far_loc are updated once each time an inner loop runs to completion.
After the outer and inner loops have finished, the final result is returned: the best_so_far_dist and best_so_far_loc of the ECG anomaly.
As shown in FIG. 6, the first inner loop of the original improved algorithm is taken out and replaced by secondary screening, precisely because it is not feasible to directly replace the Euclidean distance with the DTW distance in the improved brute force algorithm. In addition, the invention finds that after an inner loop has run to completion, it is not necessary to test whether the nearest_neighbor_dist it found is greater than best_so_far_dist. The inner loop runs to completion only when no early-abandon occurred, so the resulting nearest_neighbor_dist is certainly not smaller than the threshold best_so_far_dist; otherwise the threshold test inside the inner loop would already have triggered early-abandon.
The judgment of the original improved algorithm is removed:
nearest_neighbor_dist>best_so_far_dist
in particular, see fig. 1 for a flow chart of an improved brute force algorithm.
Returning to the original improved brute force algorithm in FIG. 1: initially, best_so_far_dist is 0 and nearest_neighbor_dist is infinite, so early-abandon cannot occur; the first inner loop is executed completely, and the nearest_neighbor_dist it yields becomes the subsequent threshold best_so_far_dist. Because the amount of data processed is extremely large, this first pass can be completed with the simple Euclidean distance but is almost impossible to complete with the DTW distance, whose time complexity is high. The distance function in the original algorithm therefore cannot simply be swapped.
Therefore, to make the replacement by DTW possible, the first and most expensive inner loop is taken out, and its function is carried out through "secondary screening". The main idea of the secondary screening has been explained in detail above and is not repeated here.
Through secondary screening, the DTW-based nearest_neighbor_dist of the first candidate anomaly is obtained and used as the threshold best_so_far_dist, and early-abandon can then be performed easily in the subsequent loop calculations, so that most of the DTW-based computation is skipped and the desired final result is obtained.
The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.