CN112988512A - Method, device and equipment for detecting time sequence data abnormity and storage medium

Info

Publication number
CN112988512A
CN112988512A CN202110269901.0A CN202110269901A CN112988512A CN 112988512 A CN112988512 A CN 112988512A CN 202110269901 A CN202110269901 A CN 202110269901A CN 112988512 A CN112988512 A CN 112988512A
Authority
CN
China
Prior art keywords
time sequence
sequence data
detected
data segment
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110269901.0A
Other languages
Chinese (zh)
Inventor
曹臻
潘陈益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202110269901.0A priority Critical patent/CN112988512A/en
Publication of CN112988512A publication Critical patent/CN112988512A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

The embodiment of the invention provides a time sequence data anomaly detection method, device, equipment and storage medium. A time sequence data segment to be detected and a plurality of historical time sequence data segments are obtained, where the segment to be detected is a segment in which a numerical value mutation (an abrupt change in value) has occurred; the Euclidean distance between each historical time sequence data segment and the segment to be detected is calculated; it is judged whether a historical time sequence data segment exists whose Euclidean distance to the segment to be detected satisfies a similar distance condition; and if not, an abnormality alarm is sent for the segment to be detected. If no historical time sequence data segment close to the segment to be detected exists, the numerical value mutation has not occurred frequently within a short period and therefore cannot be attributed to a fixed short-term behavior of users. In that case the segment to be detected is determined to be abnormal and an abnormality alarm is sent, which reduces invalid alarms for time sequence data anomalies.

Description

Method, device and equipment for detecting time sequence data abnormity and storage medium
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting time series data anomalies.
Background
On an internet platform, a large amount of time series data is generated every day. Time series data is a sequence of values of the same uniform indicator recorded in chronological order; all values in the same sequence are measured on the same basis and are therefore comparable.
Time series data anomaly detection serves as an auxiliary means of judging the state of the data. Its purpose is to find abnormal data promptly, so that measures can be taken in time to eliminate or mitigate the abnormality, reduce the loss it causes, and ensure the normal operation of the various services.
In the prior art, upper and lower thresholds for the time series data over a future period can be predicted based on historical time series data, and anomaly detection is then performed on the time series data acquired during that period according to the predicted thresholds; that is, if the value of the time series data falls outside the predicted thresholds, the time series data is considered abnormal.
However, in some cases a fixed short-term behavior of users may cause the value of the time series data to rise rapidly and frequently within a short period, while the value remains relatively stable the rest of the time. With the prediction approach based on historical time series data described above, the value may exceed the predicted upper and lower thresholds after each rapid rise, so the time series data is frequently detected as abnormal. Fixed short-term user behavior is a relatively common behavior pattern, so frequent rapid rises of this kind do not need attention. Under such conditions a large number of invalid alarms are generated, which increases the operation and maintenance cost of the time series data.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device, equipment and a storage medium for detecting time sequence data abnormity, so as to improve the accuracy of time sequence data abnormity detection, reduce invalid alarms for time sequence data abnormity and reduce the operation and maintenance cost for time sequence data. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided a method for detecting time series data anomaly, where the method includes:
acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, wherein the time stamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
judging whether a historical time sequence data segment meeting a similar distance condition with the time sequence data segment to be detected exists or not;
and if not, sending an abnormal alarm aiming at the time sequence data segment to be detected.
Optionally, when a historical time series data segment exists whose Euclidean distance to the time series data segment to be detected satisfies the similar distance condition, the method further includes:
taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected;
and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
Optionally, the acquiring the time series data segment to be detected and the multiple historical time series data segments includes:
acquiring a time sequence data fragment to be detected and historical time sequence data with a timestamp earlier than the time sequence data fragment to be detected;
extracting a plurality of historical time sequence data fragments with the same length as the time sequence data fragments to be detected from the historical time sequence data, wherein an overlapping part is arranged between every two adjacent historical time sequence data fragments, and the length of the overlapping part is the length of the time sequence data fragments to be detected minus 1.
Optionally, the calculating the euclidean distance between each historical time series data segment and the time series data segment to be detected includes:
and calculating the Euclidean distance between each historical time sequence data segment and the time sequence data segment to be detected according to a proximity algorithm.
Optionally, the determining whether a historical time series data segment exists whose Euclidean distance to the time series data segment to be detected satisfies the similar distance condition includes:
determining, from the historical time sequence data segments, the segments whose Euclidean distances to the time sequence data segment to be detected rank at a first preset value and at a second preset value, and using them as a first segment and a second segment respectively, wherein the first preset value is smaller than the second preset value;
judging whether the ratio of the Euclidean distance between the first segment and the time sequence data segment to be detected to the Euclidean distance between the second segment and the time sequence data segment to be detected is greater than a preset ratio threshold;
and if the ratio is not greater than the preset ratio threshold, judging that a historical time sequence data segment exists whose Euclidean distance to the time sequence data segment to be detected satisfies the similar distance condition.
Optionally, the calculating the similarity of the variation trends of the target time series data segment and the time series data segment to be detected includes:
according to the value ranges of the plurality of historical time sequence data fragments, carrying out numerical partition, wherein each numerical partition corresponds to a partition index;
replacing time sequence data in the target time sequence data segment and the time sequence data segment to be detected with corresponding partition indexes according to the numerical value of the time sequence data and the numerical partition;
carrying out one-hot coding on the replaced target time sequence data segment and the replaced time sequence data segment to be detected;
and calculating cosine similarity between the encoded target time sequence data segment and the encoded time sequence data segment to be detected, wherein the cosine similarity is used as the change trend similarity of the target time sequence data segment and the time sequence data segment to be detected.
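The change trend similarity described above can be illustrated with a minimal Python sketch. The equal-width partitioning, the number of partitions (`n_bins`), and all function names are assumptions made for illustration; the text only states that the value range is partitioned, values are replaced by partition indexes, the result is one-hot encoded, and cosine similarity is computed. Segments are assumed to be non-empty and of equal length.

```python
import math
from typing import List, Sequence


def partition_index(value: float, lo: float, hi: float, n_bins: int) -> int:
    """Index of the equal-width partition of [lo, hi] that contains value.

    Equal-width partitions are an assumption; values outside [lo, hi]
    are clamped to the first or last partition.
    """
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def one_hot_segment(segment: Sequence[float], lo: float, hi: float, n_bins: int) -> List[int]:
    """Replace each value by its partition index and one-hot encode it.

    The per-point one-hot rows are concatenated into one flat vector.
    """
    encoded: List[int] = []
    for value in segment:
        row = [0] * n_bins
        row[partition_index(value, lo, hi, n_bins)] = 1
        encoded.extend(row)
    return encoded


def cosine_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def trend_similarity(target: Sequence[float], query: Sequence[float],
                     lo: float, hi: float, n_bins: int = 4) -> float:
    """Change trend similarity of two segments under the assumptions above."""
    return cosine_similarity(one_hot_segment(target, lo, hi, n_bins),
                             one_hot_segment(query, lo, hi, n_bins))


# Values already scaled to [0, 1]; two of the three points share a partition.
similarity = trend_similarity([0.1, 0.5, 0.9], [0.2, 0.6, 0.3], lo=0.0, hi=1.0)
print(round(similarity, 3))  # 0.667
```

With this encoding the similarity equals the fraction of positions at which the two segments fall into the same partition, so a preset similarity threshold can be read as a minimum share of matching positions.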
In a second aspect of the present invention, there is also provided a time series data abnormality detection apparatus, including:
the acquisition module is used for acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, the timestamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
the distance calculation module is used for calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
the distance judgment module is used for judging whether a historical time sequence data segment exists whose Euclidean distance to the time sequence data segment to be detected satisfies the similar distance condition; and if not, judging that the time sequence data segment to be detected is an abnormal time sequence data segment.
Optionally, the distance determining module is further configured to:
taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected;
and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
Optionally, the obtaining module is specifically configured to:
acquiring a time sequence data fragment to be detected and historical time sequence data with a timestamp earlier than the time sequence data fragment to be detected;
extracting a plurality of historical time sequence data fragments with the same length as the time sequence data fragments to be detected from the historical time sequence data, wherein an overlapping part is arranged between every two adjacent historical time sequence data fragments, and the length of the overlapping part is the length of the time sequence data fragments to be detected minus 1.
Optionally, the distance calculating module is specifically configured to:
and calculating the Euclidean distance between each historical time sequence data segment and the time sequence data segment to be detected according to a proximity algorithm.
Optionally, the distance determining module is specifically configured to:
determining, from the historical time sequence data segments, the segments whose Euclidean distances to the time sequence data segment to be detected rank at a first preset value and at a second preset value, and using them as a first segment and a second segment respectively, wherein the first preset value is smaller than the second preset value;
judging whether the ratio of the Euclidean distance between the first segment and the time sequence data segment to be detected to the Euclidean distance between the second segment and the time sequence data segment to be detected is greater than a preset ratio threshold;
and if the ratio is not greater than the preset ratio threshold, judging that a historical time sequence data segment exists whose Euclidean distance to the time sequence data segment to be detected satisfies the similar distance condition.
Optionally, the distance determining module is specifically configured to:
according to the value ranges of the plurality of historical time sequence data fragments, carrying out numerical partition, wherein each numerical partition corresponds to a partition index;
replacing time sequence data in the target time sequence data segment and the time sequence data segment to be detected with corresponding partition indexes according to the numerical value of the time sequence data and the numerical partition;
carrying out one-hot coding on the replaced target time sequence data segment and the replaced time sequence data segment to be detected;
and calculating cosine similarity between the encoded target time sequence data segment and the encoded time sequence data segment to be detected, wherein the cosine similarity is used as the change trend similarity of the target time sequence data segment and the time sequence data segment to be detected.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing any one of the time series data abnormity detection methods when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute any one of the above-mentioned time series data anomaly detection methods.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any one of the above described methods of temporal data anomaly detection.
The time sequence data abnormity detection method, the time sequence data abnormity detection device, the time sequence data abnormity detection equipment and the storage medium, which are provided by the embodiment of the invention, are used for acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, wherein the time stamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation; calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected; judging whether a historical time sequence data segment meeting a similar distance condition with the time sequence data segment to be detected exists or not; and if not, sending an abnormal alarm aiming at the time sequence data segment to be detected.
In this way, a time sequence data segment in which a numerical value mutation has occurred is examined further. If no historical time sequence data segment exists whose Euclidean distance to the segment to be detected is close, the numerical value mutation cannot be regarded as something that has occurred frequently within a short period, that is, it cannot be attributed to a fixed short-term behavior of users. In that case the time sequence data segment to be detected can be judged to be an abnormal time sequence data segment and an abnormality alarm is sent for it. Invalid alarms for time sequence data anomalies are thereby reduced, and the operation and maintenance cost of the time sequence data is lowered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart illustrating steps of a method for detecting an anomaly in time series data according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of another method for detecting anomalies in timing data according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for detecting an abnormality in time series data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the related art, the upper and lower thresholds of the time series data in a future period can be predicted based on the historical time series data, and then the time series data acquired in the future period is subjected to anomaly detection, and if the value of the time series data exceeds the threshold for predicting the time series data, the time series data is considered to be anomalous.
However, rapid rises in the value of time series data that occur frequently within a short period because of a fixed short-term behavior of users do not actually need attention. With the prediction approach based on historical time series data described above, the value may exceed the predicted upper and lower thresholds after each rapid rise, so the time series data is frequently detected as abnormal, a large number of invalid alarms are generated, and the operation and maintenance cost of the time series data increases.
In order to solve the above problem, an embodiment of the present invention provides a method for detecting an abnormality of time series data, and the following generally describes the method for detecting an abnormality of time series data provided by the embodiment of the present invention:
acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, wherein the time stamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
judging whether a historical time sequence data segment meeting a similar distance condition with the time sequence data segment to be detected exists or not;
and if not, sending an abnormal alarm aiming at the time sequence data segment to be detected.
As can be seen from the above, in the method for detecting time series data anomalies provided in the embodiment of the present invention, a time series data segment in which a numerical value mutation has occurred is examined further. If no historical time series data segment exists whose Euclidean distance to the segment to be detected is close, the numerical value mutation cannot be regarded as something that has occurred frequently within a short period, that is, it cannot be attributed to a fixed short-term behavior of users. In this case the time series data segment to be detected can be judged to be an abnormal time series data segment, so the accuracy of time series data anomaly detection is improved, invalid alarms for time series data anomalies are reduced, and the operation and maintenance cost of the time series data is lowered.
The time series data abnormality detection method provided by the embodiment of the invention will be described in detail through specific embodiments.
Referring to fig. 1, a flowchart illustrating steps of a time series data anomaly detection method according to the present application is shown, and specifically, the method may include the following steps:
S101: acquiring a time sequence data segment to be detected and a plurality of historical time sequence data segments.
The timestamp of each historical time sequence data segment is earlier than that of the time sequence data segment to be detected, and the segment to be detected is a time sequence data segment in which a numerical value mutation has occurred. The segment to be detected may be identified by the prediction approach based on historical time series data used in the related art, or by comparing it with historical time series data segments acquired within a preset time interval; this is not specifically limited.
In one implementation, obtaining a plurality of historical time series data segments may include the steps of:
First, the time series data segment to be detected and historical time series data with timestamps earlier than that segment are acquired. In the embodiment of the present invention, data from more than T periods before the segment to be detected may be regarded as older data with low reference value. Acquiring only the historical time series data within the most recent T periods before the segment to be detected saves computing resources on the one hand and improves the reference value of the historical data on the other, which further improves the accuracy of time series data anomaly detection.
Then, a plurality of historical time series data segments with the same length as the segment to be detected are extracted from the historical time series data, where every two adjacent historical segments overlap and the length of the overlap is the length of the segment to be detected minus 1. In this way, every data segment in the historical time series data that has the same length as the segment to be detected is used as a historical segment, so the subsequent analysis of the historical data is more comprehensive and detection errors caused by missing a data segment are reduced.
For example, assuming that the historical time-series data is [1, 2, 3, 4, 5, 6, …,100 ] and the length of the time-series data segment to be detected is 3, the historical time-series data segments extracted from the historical time-series data may be [1, 2, 3], [2, 3, 4], [3, 4, 5], …, and so on, respectively.
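The extraction in this example amounts to a sliding window that advances one point at a time. A minimal Python sketch follows; the function name `sliding_segments` is hypothetical and not taken from the text.

```python
from typing import List, Sequence


def sliding_segments(history: Sequence[float], length: int) -> List[List[float]]:
    """Return every contiguous window of `length` points, advancing one point at a time.

    Adjacent windows therefore share `length - 1` points, matching the
    overlap described in the text.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [list(history[i:i + length]) for i in range(len(history) - length + 1)]


history = list(range(1, 101))            # the series 1, 2, ..., 100 from the example
segments = sliding_segments(history, 3)  # segment to be detected has length 3
print(segments[:3])                      # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(len(segments))                     # 98
```

A history of n points and a segment length of m yield n - m + 1 historical segments, which is why the example produces 98 of them.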
In this step, after the historical time series data segments are obtained, they may be normalized: the value of each data point in a historical time series data segment is converted into a decimal between 0 and 1, which removes the dimension (unit) of the data and simplifies subsequent calculations on the segments.
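The text does not say which normalization is used. The sketch below assumes min-max scaling with bounds taken from the historical data; the function name `min_max_scale` and the choice of bounds are illustrative assumptions.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Map values into [0, 1] using the bounds lo and hi (assumed min-max scaling)."""
    if hi == lo:
        return [0.0 for _ in values]   # degenerate range: there is no spread to preserve
    return [(v - lo) / (hi - lo) for v in values]


history = [10.0, 20.0, 30.0, 40.0, 50.0]
lo, hi = min(history), max(history)    # bounds taken from the historical data (assumption)
print(min_max_scale([10.0, 30.0, 50.0], lo, hi))  # [0.0, 0.5, 1.0]
```

If the same bounds are applied to the segment to be detected, values outside the historical range fall outside [0, 1]; whether to clamp them is a design choice the text leaves open.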
S102: calculating the Euclidean distance between each historical time sequence data segment and the time sequence data segment to be detected.
The Euclidean distance measures the spatial distance between each historical time sequence data segment and the time sequence data segment to be detected: the larger the Euclidean distance, the farther apart the two segments are and the greater the difference between them.
In this step, the Euclidean distance between each historical time series data segment and the time series data segment to be detected can be calculated according to a nearest neighbor algorithm. The nearest neighbor algorithm may be the KNN (K-Nearest Neighbor) algorithm, in which each sample can be represented by its K nearest neighbors; specifically, a KDTree may be used for the calculation. The Euclidean distances between the historical time series data segments and the segment to be detected, calculated according to the nearest neighbor algorithm, are sorted in ascending order.
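As a rough illustration of this step, the sketch below computes the Euclidean distances by brute force and sorts them in ascending order. It deliberately swaps in a plain linear scan for the KDTree mentioned above, which only affects speed, not the resulting ranking; the function name `ranked_distances` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def ranked_distances(
    query: Sequence[float], segments: Sequence[Sequence[float]]
) -> List[Tuple[float, int]]:
    """Return (distance, segment_index) pairs sorted from nearest to farthest.

    Brute-force scan over all segments (an assumption for simplicity);
    a KDTree would return the same ordering.
    """
    pairs = [(math.dist(query, seg), i) for i, seg in enumerate(segments)]
    return sorted(pairs)


segments = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [1.0, 0.0, 0.0]]
ranked = ranked_distances([0.0, 0.0, 0.0], segments)
print(ranked)  # [(0.0, 0), (1.0, 2), (5.0, 1)]
```

Keeping the segment index next to each distance makes it possible to look up the segments that sit at particular ranks in later steps.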
S103: judging whether a historical time sequence data segment exists whose Euclidean distance to the time sequence data segment to be detected satisfies a similar distance condition; if not, executing S104.
In one implementation, the step of judging whether a historical time series data segment exists whose Euclidean distance to the time series data segment to be detected satisfies the similar distance condition may include:
First, the historical time sequence data segments whose Euclidean distances to the time sequence data segment to be detected rank at a first preset value and at a second preset value are determined from the historical time sequence data segments and used as the first segment and the second segment respectively, the first preset value being smaller than the second preset value.
The first preset value and the second preset value can be adjusted for the specific scenario, but there must be a large difference between them, that is, the historical time series data segments ranked between the first preset value and the second preset value must account for most of all the historical time series data segments, so that the Euclidean distances of the first segment and the second segment represent the range of the Euclidean distances between most of the historical time series data segments and the time series data segment to be detected. In addition, compared with judging by the historical segments with the largest and smallest Euclidean distances to the segment to be detected, setting a first preset value and a second preset value reduces misjudgments caused by extreme values at the two ends, which further improves the accuracy of time series data anomaly detection. For example, if the number of historical time series data segments is 120, the first preset value may be 3 and the second preset value may be 100.
Then, it is judged whether the ratio of the Euclidean distance between the first segment and the time sequence data segment to be detected to the Euclidean distance between the second segment and the time sequence data segment to be detected is greater than a preset ratio threshold. If it is greater, it is judged that no historical time sequence data segment exists whose Euclidean distance to the segment to be detected satisfies the similar distance condition; if it is not greater than the preset ratio threshold, it is judged that such a historical time sequence data segment exists.
The preset ratio threshold may be a preset empirical value in the range 0 to 1. When the ratio of the Euclidean distance between the first segment and the time sequence data segment to be detected to the Euclidean distance between the second segment and the time sequence data segment to be detected is greater than the preset ratio threshold, it indicates that every historical time sequence data segment differs greatly from the time sequence data segment to be detected.
For example, suppose the first preset value is 3 and the second preset value is 100. Assuming that the Euclidean distance between the first segment and the time series data segment to be detected is 100, and the Euclidean distance between the second segment and the time series data segment to be detected is 103, the ratio of the former to the latter is 100/103. If instead the Euclidean distance between the first segment and the time series data segment to be detected is 20, while that of the second segment is still 103, the ratio is 20/103.
Comparing the two cases: in the first case, the Euclidean distances between the historical time series data segments ranked between the first preset value and the second preset value and the time series data segment to be detected all lie in [100, 103], that is, the Euclidean distances of all historical time series data segments ranked after 3rd place are at least 100; in the second case, these distances all lie in [20, 103], that is, the Euclidean distances of all historical time series data segments ranked after 3rd place are at least 20. In other words, in the first case the historical time series data segments differ greatly from the time series data segment to be detected, and accordingly the ratio in the first case, 100/103, is greater than the ratio in the second case, 20/103.
If the preset ratio threshold value is between 100/103 and 20/103, in the first case, there is no historical time series data segment whose Euclidean distance from the time series data segment to be detected satisfies a distance proximity condition, and in the second case, there is a historical time series data segment whose Euclidean distance from the time series data segment to be detected satisfies a distance proximity condition.
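The ratio test described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `has_close_segment`, its parameters, the 1-based rank convention, and the handling of a zero distance at the second rank are all hypothetical choices, not part of the original disclosure.

```python
def has_close_segment(distances, first_rank, second_rank, ratio_threshold):
    """Return True if the distance proximity condition is judged to be met.

    distances: Euclidean distances from each historical segment to the
    segment to be detected (any order).
    first_rank, second_rank: 1-based ranks (the first and second preset
    values), with first_rank < second_rank.
    """
    ordered = sorted(distances)            # ascending: rank 1 is the closest
    d_first = ordered[first_rank - 1]
    d_second = ordered[second_rank - 1]
    if d_second == 0:
        # Assumption: if even the second-ranked distance is zero, exact
        # matches exist, so the proximity condition is treated as met.
        return True
    # Ratio greater than the threshold -> no sufficiently close segment.
    return d_first / d_second <= ratio_threshold


# 120 historical distances; rank 3 and rank 100 reproduce the two cases above.
case_1 = [90, 95, 100] + [101] * 96 + [103] + [110] * 20   # ratio 100/103
case_2 = [10, 15, 20] + [101] * 96 + [103] + [110] * 20    # ratio 20/103

print(has_close_segment(case_1, 3, 100, 0.5))
print(has_close_segment(case_2, 3, 100, 0.5))
```

With a threshold of 0.5, which lies between 20/103 and 100/103, the first case is judged to have no close historical segment and the second case is judged to have one, matching the text.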
S104: sending an abnormal alarm for the time series data segment to be detected.
In this step, if there is no historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies the distance proximity condition, it indicates that the numerical mutation of the time series data segment to be detected is unlikely to have occurred within a short period and cannot be regarded as a frequently occurring situation; that is, the numerical mutation cannot be attributed to a fixed behavior of the user within a short period. The time series data segment to be detected can therefore be determined to be an abnormal time series data segment.
The method for sending the abnormal alarm may be sending alarm information, or may also be sending a related instruction to the alarm device, instructing the alarm device to execute a corresponding operation, and the like, which is not limited specifically.
In one implementation, if it is determined in S103 that there is a historical time-series data segment whose euclidean distance to the time-series data segment to be detected satisfies a distance proximity condition, then, as shown in fig. 2, after S103, S105 to S106 may be continuously performed.
S105: and taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected.
The variation trend similarity measures how similar the target time series data segment is to the time series data segment to be detected: the smaller its value, the less similar the two segments are and the larger the difference between them.
In this step, the step of calculating the similarity of the variation trend of the target time series data segment and the time series data segment to be detected may include:
First, numerical partitioning is performed according to the value ranges of the plurality of historical time series data segments, with each numerical partition corresponding to a partition index. For example, [0, 1000) may be set as one numerical partition with partition index 1, [1000, 2000) as another numerical partition with partition index 2, and so on. Numerical partitioning reduces the difficulty of judging the similarity of time series segments whose values differ, and ensures that the method is not disturbed by numerical changes caused by changes in user behavior.
And secondly, replacing the time sequence data in the target time sequence data segment and the time sequence data in the time sequence data segment to be detected with corresponding partition indexes according to the numerical value and the numerical value partition of the time sequence data. For example, if a value of a certain time series data in the target time series data segment is 800, the time series data may be replaced with 1 according to the above numerical partition and the corresponding partition index.
And thirdly, carrying out one-hot coding on the replaced target time sequence data segment and the replaced time sequence data segment to be detected. The one-hot encoding is to encode N possible values of any time series data by using 0 or 1, each possible value corresponds to a unique code, and the value of each time series data at the same time is unique.
In this application, after one-hot encoding, a 1-dimensional partition index is mapped to an (M-1)-dimensional vector, where M is the number of numerical partitions; that is, each possible value of the partition index is encoded by an (M-1)-dimensional vector composed of 0s and 1s. For example, when the number of numerical partitions is 3, that is, M is 3, the partition index has 3 possible values, which may be 0, 1, and 2, and each possible value can be encoded by a 2-dimensional vector composed of 0s and 1s: when the partition index is 0, the corresponding code may be (0, 0); when the partition index is 1, the corresponding code may be (1, 0); and when the partition index is 2, the corresponding code may be (0, 1).
If the time series data values in a replaced target time series data segment are 0, 1, and 2 respectively, then after one-hot encoding the segment can be represented as (0, 0, 1, 0, 0, 1). The replaced target time series data segment and the replaced time series data segment to be detected are thus expanded, which makes it convenient to further calculate the cosine similarity between them.
And fourthly, calculating cosine similarity between the coded target time sequence data segment and the coded time sequence data segment to be detected, and taking the cosine similarity as the change trend similarity of the target time sequence data segment and the time sequence data segment to be detected.
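The four steps above can be sketched as follows. This is an illustrative sketch only, under several assumptions that are not in the original text: equal-width partitions of width 1000, 0-based partition indexes, non-negative values, clamping of out-of-range values to the last partition, and a convention for all-zero encodings (for which cosine similarity is otherwise undefined). The function names are hypothetical.

```python
import math


def encode(segment, width, num_partitions):
    """Replace each value by its partition index, then encode the index
    with an (M-1)-dimensional 0/1 vector, where M = num_partitions."""
    encoded = []
    for value in segment:
        # Assumption: equal-width partitions; values are clamped to [0, M-1].
        index = max(0, min(int(value // width), num_partitions - 1))
        bits = [0] * (num_partitions - 1)
        if index > 0:
            bits[index - 1] = 1        # index 0 -> all zeros, 1 -> (1, 0), 2 -> (0, 1)
        encoded.extend(bits)
    return encoded


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: two all-zero encodings count as identical (1.0);
        # an all-zero encoding versus a non-zero one counts as dissimilar (0.0).
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def trend_similarity(target, to_detect, width=1000, num_partitions=3):
    """Cosine similarity between the two encoded segments."""
    return cosine_similarity(
        encode(target, width, num_partitions),
        encode(to_detect, width, num_partitions),
    )


print(encode([0, 1500, 2500], 1000, 3))
print(trend_similarity([800, 1500, 2500], [100, 1900, 2100]))
print(trend_similarity([800, 1500, 2500], [2500, 800, 1500]))
```

Segments whose values fall into the same sequence of partitions obtain a similarity close to 1 even when the raw values differ, while segments that visit the partitions in a different order obtain a lower similarity.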
S106: and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
When the similarity of the variation trends is smaller than a preset threshold, the variation trends of the target time series data segment and the time series data segment to be detected are considered to be dissimilar, that is, the time series data segment to be detected cannot be determined to represent a certain frequently-occurring condition, and the time series data segment to be detected is considered to be an abnormal segment, and an abnormal alarm needs to be sent. On the contrary, if the similarity of the variation trends is not less than the preset threshold, the variation trends of the target time series data segment and the time series data segment to be detected can be considered to be similar, that is, the time series data segment to be detected can be determined to represent a certain frequently-occurring condition, and an abnormal alarm does not need to be sent for the time series data segment to be detected.
For example, if the time series data is the number of user requests to a database, and the time series data increases from 10 requests/second to 1000 requests/second in one case and from 15 requests/second to 2000 requests/second in another, then although the values differ greatly, the forms of the time series change are basically consistent and reflect reasonable user behavior. The method in the embodiment of the present invention can determine that this is a pattern that does not need attention, and no alarm is needed.
In addition, in an implementation manner, even if the variation trend similarity is not less than the preset similarity threshold, it is still necessary to determine whether the values of the target time series data segment and the time series data segment to be detected satisfy the preset condition.
For example, the request volume of a server usually jumps to 100 at 0:00 a.m., but on a certain day it jumps to 120. Although the variation trend is regular in the time domain, whether the difference between 120 and 100 is significant may depend on the specific scenario.
As can be seen from the above, the method for detecting time series data abnormality provided in the embodiment of the present invention can further examine a time series data segment with a numerical mutation. If there is no historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies the distance proximity condition, it indicates that the numerical mutation is unlikely to have occurred within a short period and cannot be regarded as a frequently occurring situation; that is, the numerical mutation cannot be attributed to a fixed behavior of the user within a short period. In this case, the time series data segment to be detected can be determined to be an abnormal time series data segment. Invalid alarms for time series data abnormality are thereby reduced, and the operation and maintenance cost for the time series data is lowered.
Referring to fig. 3, a block diagram of a time series data anomaly detection apparatus according to the present application is shown, where the apparatus may specifically include the following modules:
the acquiring module 301 is configured to acquire a time series data segment to be detected and a plurality of historical time series data segments, where a timestamp of the historical time series data segment is earlier than that of the time series data segment to be detected, and the time series data segment to be detected is a time series data segment with a numerical value mutation;
a distance calculation module 302, configured to calculate an euclidean distance between each historical time series data segment and the time series data segment to be detected;
the distance judgment module 303 is configured to judge whether there is a historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies a distance proximity condition; and if not, to determine that the time series data segment to be detected is an abnormal time series data segment.
In one implementation, the distance determining module 303 is further configured to:
taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected, wherein the target time sequence data segments are the historical time sequence data segments meeting the similar distance condition;
and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
In one implementation, the obtaining module 301 is specifically configured to:
acquiring a time sequence data fragment to be detected and historical time sequence data with a timestamp earlier than the time sequence data fragment to be detected;
extracting a plurality of historical time sequence data fragments with the same length as the time sequence data fragments to be detected from the historical time sequence data, wherein an overlapping part is arranged between every two adjacent historical time sequence data fragments, and the length of the overlapping part is the length of the time sequence data fragments to be detected minus 1.
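An overlap equal to the segment length minus 1 corresponds to a sliding window that advances one sample at a time. A minimal sketch of this extraction is shown below; the function name `extract_history_segments` and the use of plain Python lists are assumptions made for illustration.

```python
def extract_history_segments(history, segment_length):
    """Slide a window of segment_length over history with a step of 1,
    so that adjacent segments overlap by segment_length - 1 samples."""
    count = len(history) - segment_length + 1
    # If the history is shorter than one segment, range() is empty
    # and an empty list is returned.
    return [history[i:i + segment_length] for i in range(count)]


segments = extract_history_segments([1, 2, 3, 4, 5], 3)
print(segments)
```

Each extracted segment has the same length as the segment to be detected, which is what allows a Euclidean distance to be computed between them.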
In one implementation, the distance calculating module 302 is specifically configured to:
and calculating the Euclidean distance between each historical time sequence data segment and the time sequence data segment to be detected according to a proximity algorithm.
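As a rough illustration of the distance calculation, the sketch below computes the Euclidean distance from every historical segment to the segment to be detected and ranks the results in ascending order, in the spirit of a nearest-neighbor search. It is a brute-force sketch and not necessarily the proximity algorithm intended by the text; the function name `rank_by_distance` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def rank_by_distance(history_segments, to_detect):
    """Return (distance, segment_position) pairs sorted by ascending distance.

    All segments are assumed to have the same length as to_detect.
    """
    pairs = [
        (math.dist(segment, to_detect), position)
        for position, segment in enumerate(history_segments)
    ]
    return sorted(pairs)


ranked = rank_by_distance([[0, 0], [3, 4], [1, 0]], [0, 0])
print(ranked)
```

The ranked list gives direct access to the segments at the first and second preset ranks used by the distance judgment.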
In an implementation manner, the distance determining module 303 is specifically configured to:
determining, from the historical time series data segments, the historical time series data segments whose Euclidean distances to the time series data segment to be detected rank at a first preset value and at a second preset value, as a first segment and a second segment respectively, wherein the first preset value is smaller than the second preset value;
judging whether the ratio of the Euclidean distance between the first segment and the time series data segment to be detected to the Euclidean distance between the second segment and the time series data segment to be detected is greater than a preset ratio threshold;
and if the ratio is not greater than the preset ratio threshold, judging that there is a historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies the distance proximity condition.
In one implementation, the distance determining module 303 is specifically configured to:
according to the value ranges of the plurality of historical time sequence data fragments, carrying out numerical partition, wherein each numerical partition corresponds to a partition index;
replacing time sequence data in the target time sequence data segment and the time sequence data segment to be detected with corresponding partition indexes according to the numerical value of the time sequence data and the numerical partition;
carrying out one-hot coding on the replaced target time sequence data segment and the replaced time sequence data segment to be detected;
and calculating cosine similarity between the encoded target time sequence data segment and the encoded time sequence data segment to be detected, wherein the cosine similarity is used as the change trend similarity of the target time sequence data segment and the time sequence data segment to be detected.
As can be seen from the above, the time series data anomaly detection device provided in the embodiment of the present invention can further examine a time series data segment with a numerical mutation. If there is no historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies the distance proximity condition, it indicates that the numerical mutation is unlikely to have occurred within a short period and cannot be regarded as a frequently occurring situation; that is, the numerical mutation cannot be attributed to a fixed behavior of the user within a short period. In this case, the time series data segment to be detected can be determined to be an abnormal time series data segment. Invalid alarms for time series data abnormality are thereby reduced, and the operation and maintenance cost for the time series data is lowered.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, wherein the time stamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
judging whether a historical time sequence data segment meeting a similar distance condition with the time sequence data segment to be detected exists or not;
and if not, sending an abnormal alarm aiming at the time sequence data segment to be detected.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include a Random Access Memory (RAM), or may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, where instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the time series data abnormality detection method described in any one of the above embodiments.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for detecting a time series data anomaly as described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for detecting time series data abnormity, which is characterized in that the method comprises the following steps:
acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, wherein the time stamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
judging whether a historical time sequence data segment meeting a similar distance condition with the time sequence data segment to be detected exists or not;
and if not, sending an abnormal alarm aiming at the time sequence data segment to be detected.
2. The method according to claim 1, wherein in the case that there is a historical time-series data segment whose euclidean distance to the time-series data segment to be detected satisfies a distance proximity condition, the method further comprises:
taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected;
and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
3. The method according to any one of claims 1 or 2, wherein the acquiring the time series data segment to be detected and the plurality of historical time series data segments comprises:
acquiring a time sequence data fragment to be detected and historical time sequence data with a timestamp earlier than the time sequence data fragment to be detected;
extracting a plurality of historical time sequence data fragments with the same length as the time sequence data fragments to be detected from the historical time sequence data, wherein an overlapping part is arranged between every two adjacent historical time sequence data fragments, and the length of the overlapping part is the length of the time sequence data fragments to be detected minus 1.
4. The method according to any one of claims 1 or 2, wherein the calculating of the euclidean distance between each historical time series data segment and the time series data segment to be detected comprises:
and calculating the Euclidean distance between each historical time sequence data segment and the time sequence data segment to be detected according to a proximity algorithm.
5. The method according to any one of claims 1 or 2, wherein the determining whether there is a historical time series data segment whose euclidean distance to the time series data segment to be detected satisfies a condition of a close distance includes:
determining, from the historical time series data segments, the historical time series data segments whose Euclidean distances to the time series data segment to be detected rank at a first preset value and at a second preset value, as a first segment and a second segment respectively, wherein the first preset value is smaller than the second preset value;
judging whether the ratio of the Euclidean distance between the first segment and the time series data segment to be detected to the Euclidean distance between the second segment and the time series data segment to be detected is greater than a preset ratio threshold;
and if the ratio is not greater than the preset ratio threshold, judging that there is a historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies the distance proximity condition.
6. The method according to claim 2, wherein the calculating the similarity of the variation trend of the target time series data segment and the time series data segment to be detected comprises:
according to the value ranges of the plurality of historical time sequence data fragments, carrying out numerical partition, wherein each numerical partition corresponds to a partition index;
replacing time sequence data in the target time sequence data segment and the time sequence data segment to be detected with corresponding partition indexes according to the numerical value of the time sequence data and the numerical partition;
carrying out one-hot coding on the replaced target time sequence data segment and the replaced time sequence data segment to be detected;
and calculating cosine similarity between the encoded target time sequence data segment and the encoded time sequence data segment to be detected, wherein the cosine similarity is used as the change trend similarity of the target time sequence data segment and the time sequence data segment to be detected.
7. An apparatus for detecting an abnormality in time series data, the apparatus comprising:
the acquisition module is used for acquiring a time sequence data fragment to be detected and a plurality of historical time sequence data fragments, the timestamp of the historical time sequence data fragment is earlier than that of the time sequence data fragment to be detected, and the time sequence data fragment to be detected is a time sequence data fragment with numerical value mutation;
the distance calculation module is used for calculating the Euclidean distance between each historical time sequence data fragment and the time sequence data fragment to be detected;
the distance judgment module is used for judging whether there is a historical time series data segment whose Euclidean distance to the time series data segment to be detected satisfies a distance proximity condition; and if not, determining that the time series data segment to be detected is an abnormal time series data segment.
8. The apparatus of claim 7, wherein the distance determining module is further configured to:
taking the historical time sequence data segments meeting the similar distance condition as target time sequence data segments, and calculating the change trend similarity of the target time sequence data segments and the time sequence data segments to be detected;
and if the variation trend similarity is smaller than a preset similarity threshold, sending an abnormal alarm aiming at the time sequence data segment to be detected.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202110269901.0A 2021-03-12 2021-03-12 Method, device and equipment for detecting time sequence data abnormity and storage medium Pending CN112988512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110269901.0A CN112988512A (en) 2021-03-12 2021-03-12 Method, device and equipment for detecting time sequence data abnormity and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110269901.0A CN112988512A (en) 2021-03-12 2021-03-12 Method, device and equipment for detecting time sequence data abnormity and storage medium

Publications (1)

Publication Number Publication Date
CN112988512A true CN112988512A (en) 2021-06-18

Family

ID=76334615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110269901.0A Pending CN112988512A (en) 2021-03-12 2021-03-12 Method, device and equipment for detecting time sequence data abnormity and storage medium

Country Status (1)

Country Link
CN (1) CN112988512A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218164A (en) * 2021-12-17 2022-03-22 微梦创科网络科技(中国)有限公司 Data anomaly detection method and system based on time sequence vector retrieval
CN114235652A (en) * 2021-11-30 2022-03-25 国网北京市电力公司 Smoke dust particle concentration abnormity identification method and device, storage medium and equipment
CN117518939A (en) * 2023-12-06 2024-02-06 广州市顺风船舶服务有限公司 Industrial control system based on big data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708739A (en) * 2020-05-21 2020-09-25 北京奇艺世纪科技有限公司 Method and device for detecting abnormality of time series data, electronic device and storage medium
CN112165471A (en) * 2020-09-22 2021-01-01 杭州安恒信息技术股份有限公司 Industrial control system flow abnormity detection method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN112988512A (en) Method, device and equipment for detecting time sequence data abnormity and storage medium
CN110083475B (en) Abnormal data detection method and device
CN111538642B (en) Abnormal behavior detection method and device, electronic equipment and storage medium
CN110286656B (en) False alarm filtering method and device for tolerance of error data
CN112003838B (en) Network threat detection method, device, electronic device and storage medium
CN112231174A (en) Abnormity warning method, device, equipment and storage medium
CN112148768A (en) Index time series abnormity detection method, system and storage medium
CN110674014A (en) Method and device for determining abnormal query request
CN112148733A (en) Method, device, electronic device and computer readable medium for determining fault type
KR20180046598A (en) A method and apparatus for detecting and managing a fault
GB2517147A (en) Performance metrics of a computer system
CN111740865B (en) Flow fluctuation trend prediction method and device and electronic equipment
CN108399115B (en) Operation and maintenance operation detection method and device and electronic equipment
CN115185761A (en) Abnormality detection method and apparatus
CN114365094A (en) Timing anomaly detection using inverted indices
CN112765161A (en) Alarm rule matching method and device, electronic equipment and storage medium
CN115514619A (en) Alarm convergence method and system
KR101960755B1 (en) Method and apparatus of generating unacquired power data
CN115081969A (en) Abnormal data determination method and related device
CN112380073B (en) Fault position detection method and device and readable storage medium
CN114492576A (en) Abnormal user detection method, system, storage medium and electronic equipment
CN108229585B (en) Log classification method and system
CN115932144B (en) Chromatograph performance detection method, chromatograph performance detection device, chromatograph performance detection equipment and computer medium
CN112100037A (en) Alarm level identification method and device, electronic equipment and storage medium
CN113468014A (en) Abnormity detection method and device for operation and maintenance data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination