CN115456107A - Time series abnormity detection system and method - Google Patents


Publication number
CN115456107A
Authority
CN
China
Prior art keywords
data set
time sequence
pseudo
self
pseudo label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211201476.2A
Other languages
Chinese (zh)
Inventor
侯策
张振领
陈子昂
刘尚秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202211201476.2A
Publication of CN115456107A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a time-series anomaly detection system and method. An initialization pseudo-label generation unit automatically assigns pseudo-labels to part of the data in an unlabeled time-series data set, and the data carrying pseudo-labels is taken as the initialization pseudo-label data set. An iterative self-training unit is trained on this pseudo-label data set; the time-series data set is then input to the trained unit to obtain an updated pseudo-label data set, which is used to continue training the unit. The steps of inputting the time-series data set to the trained unit and obtaining an updated pseudo-label data set are repeated, and the last output of the iterative self-training unit is taken as the time-series anomaly detection result. The embodiments of the application require no manual labeling: the automatically assigned pseudo-labels guide the iterative self-training unit in unsupervised learning, so that time-series anomaly detection is performed effectively.

Description

Time series abnormity detection system and method
Technical Field
The invention relates to the field of finance, and in particular to a time-series anomaly detection system and method.
Background
Time-series anomaly detection is the process of identifying, by some method, the time points in a given time series at which abnormal conditions occur. It is important in fields that handle network intrusion data, medical data, log data and the like, where an anomaly can develop into a serious fault at any time, so accurate and timely anomaly detection helps enterprises take measures as early as possible. With the continuous development of artificial intelligence in recent years, machine learning has become an important means of time-series anomaly detection. Machine learning can be divided into supervised learning and unsupervised learning. Supervised learning is mature and achieves higher accuracy, but labeling massive time-series data takes a great deal of time and resources. Unsupervised learning needs no labels, but often suffers from insufficient accuracy because no labels guide the training process.
That is, how to make full use of time-series data to detect anomalies effectively, without manually labeling anomalies in that data, is a problem that urgently needs to be solved.
Therefore, an effective time-series anomaly detection method is needed.
Disclosure of Invention
In view of the above, an object of the present invention is to provide a time-series anomaly detection system and method that can detect anomalies in time series.
An embodiment of the application provides a time-series anomaly detection system, comprising an initialization pseudo-label generation unit and an iterative self-training unit;
the initialization pseudo-label generation unit is configured to automatically assign pseudo-labels to part of the data in an unlabeled time-series data set and to determine the data carrying pseudo-labels as an initialization pseudo-label data set;
the iterative self-training unit is configured to: acquire the initialization pseudo-label data set and the time-series data set; train itself using the pseudo-label data set; input the time-series data set to the trained iterative self-training unit to obtain an updated pseudo-label data set; continue training using the updated pseudo-label data set; and repeat the steps of inputting the time-series data set and obtaining an updated pseudo-label data set until the number of training rounds reaches a preset number, the output of the last pass over the time-series data set being taken as the time-series anomaly detection result.
An embodiment of the application provides a time-series anomaly detection method, comprising the following steps:
automatically assigning pseudo-labels to part of the data in an unlabeled time-series data set, and determining the data carrying pseudo-labels as an initialization pseudo-label data set;
acquiring the initialization pseudo-label data set and the time-series data set; training an iterative self-training unit using the pseudo-label data set; inputting the time-series data set to the trained iterative self-training unit to obtain an updated pseudo-label data set; continuing to train the iterative self-training unit using the updated pseudo-label data set; and repeating the steps of inputting the time-series data set and obtaining an updated pseudo-label data set until the number of training rounds reaches a preset number, the output of the last pass over the time-series data set being taken as the time-series anomaly detection result.
The embodiment of the application provides a time-series anomaly detection system comprising an initialization pseudo-label generation unit and an iterative self-training unit. The initialization pseudo-label generation unit automatically assigns pseudo-labels to part of the data in an unlabeled time-series data set, and the data carrying pseudo-labels is determined as the initialization pseudo-label data set. The iterative self-training unit acquires the initialization pseudo-label data set and the time-series data set and is trained using the pseudo-label data set. The time-series data set is input into the trained iterative self-training unit to obtain an updated pseudo-label data set, which is used to continue training the unit. The steps of inputting the time-series data set and obtaining an updated pseudo-label data set are repeated until the number of training rounds reaches a preset number, and the output of the last pass over the time-series data set is taken as the time-series anomaly detection result. No manual labeling is needed: the automatically assigned pseudo-labels guide the iterative self-training unit in unsupervised learning, so that time-series anomaly detection is performed effectively.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a time-series anomaly detection system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a process of generating initialization pseudo-labels according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the training of an autoencoder according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a pseudo-label generator generating an initialization pseudo-label data set according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the training process of an iterative self-training unit according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a multilayer perceptron according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a feature screening process according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a time-series anomaly detection process according to an embodiment of the present application;
Fig. 9 is a schematic flowchart of a time-series anomaly detection method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein, and it will be apparent to those of ordinary skill in the art that the present application is not limited by the specific embodiments disclosed below.
Time-series anomaly detection is the process of identifying, by some method, the time points in a given time series at which abnormal conditions occur. It is important in fields that handle network intrusion data, medical data, log data and the like, where an anomaly can develop into a serious fault at any time, so accurate and timely anomaly detection helps enterprises take measures as early as possible. With the continuous development of artificial intelligence in recent years, machine learning has become an important means of time-series anomaly detection.
In machine learning, "learning" can be understood as a process of generalizing from examples: given some training data, a machine becomes able to analyze unknown data. Machine learning can be divided into supervised learning and unsupervised learning. In supervised learning, the machine learns from an existing data set in order to capture the relationship between input and output. The training data contains both features and labels; through training, the machine finds the relationship between them, so that it can predict the label when given data that has features but no label. For time-series anomaly detection, this means labeling a time series to indicate whether each time point is abnormal, training a model on it, and then using the model to identify abnormal time points in unlabeled time series. The labels must be determined and applied manually by personnel with experience in the specific scenario. Now that the scale of time-series data is growing rapidly, this is costly in labor, time-consuming, and of limited practicality.
Supervised learning is mature and achieves higher accuracy, but labeling massive time-series data takes a great deal of time and resources. Unsupervised learning needs no labels, but often suffers from insufficient accuracy because no labels guide the training process.
That is, how to make full use of time-series data to detect anomalies effectively, without manually labeling anomalies in that data, is a problem that urgently needs to be solved.
Current methods for time-series anomaly detection fall mainly into the following categories:
First, methods based on classical statistics test the time-series data for indicators such as mean shift, variance and probability distribution, and identify the time points that fail the statistical hypothesis test.
Second, prediction-based methods build a prediction model from historical data to predict future values, and identify an anomaly according to whether the prediction error exceeds a preset threshold.
Third, reconstruction-based deep learning methods try to capture the data distribution of the time series by learning a model that maps the high-dimensional time series to a low-dimensional representation and then generates a reconstruction of the input data; time-series data points with higher reconstruction error are considered anomalies.
However, all three of the above methods have certain drawbacks:
First, methods based on traditional statistics rest entirely on statistical principles and involve no model training, so they need no manual labeling and can be put into production quickly even without accumulated historical data. However, these relatively simple and fixed statistical methods cannot cope with today's massive and complex time-series anomaly detection scenarios.
Second, prediction-based methods convert the anomaly detection problem into a time-series prediction problem. Because of the sequential relationship of the data, the label of each time point is simply the value at the next time point, so the training data set needs no manual labeling. However, the result depends heavily on how well the corresponding prediction model is trained and on whether the time series is easy to predict, so these methods cannot achieve the best detection results for many types of data or scenarios.
Third, reconstruction-based deep learning methods generally use an autoencoder or a generative adversarial network, the two most popular unsupervised deep learning reconstruction models. Because these models are unsupervised, training requires no labels, but the lack of label guidance makes the models inaccurate.
Therefore, an effective time series abnormality detection method is needed.
Based on this, the embodiment of the present application provides a time-series anomaly detection system comprising an initialization pseudo-label generation unit and an iterative self-training unit. The initialization pseudo-label generation unit automatically assigns pseudo-labels to part of the data in an unlabeled time-series data set and determines the data carrying pseudo-labels as the initialization pseudo-label data set. The iterative self-training unit acquires the initialization pseudo-label data set and the time-series data set and is trained using the pseudo-label data set. The time-series data set is input into the trained iterative self-training unit to obtain an updated pseudo-label data set, which is used to continue training the unit. The steps of inputting the time-series data set and obtaining an updated pseudo-label data set are repeated until the number of training rounds reaches a preset number, and the output of the last pass over the time-series data set is taken as the time-series anomaly detection result. The system therefore needs no manual labeling: the automatically assigned pseudo-labels guide the iterative self-training unit in unsupervised learning, so that time-series anomaly detection is performed effectively.
For a better understanding of the technical solutions and effects of the present application, specific embodiments will be described in detail below with reference to the accompanying drawings.
Referring to Fig. 1, a schematic structural diagram of a time-series anomaly detection system according to an embodiment of the present application is shown.
The time-series anomaly detection system 100 provided in this embodiment includes an initialization pseudo-label generation unit 110 and an iterative self-training unit 120.
In an embodiment of the present application, the initialization pseudo-label generation unit 110 automatically assigns pseudo-labels to part of the data in the unlabeled time-series data set and determines the data carrying pseudo-labels as the initialization pseudo-label data set. That is, the initialization pseudo-label generation unit 110 can be trained in an unsupervised manner to generate initial pseudo-labels for the time-series data set, forming an initialization pseudo-label data set that can then be used to train the iterative self-training unit.
In the embodiment of the present application, the initialization pseudo-label generation unit 110 includes an autoencoder, a reconstruction error operator, and a pseudo-label generator. Fig. 2 is a schematic diagram of the initialization pseudo-label generation process.
Specifically, the autoencoder is trained on the time-series data set, and the time-series data set is then input into the trained autoencoder to obtain a second reconstructed time-series data set. That is, the unlabeled time-series data set is fed to the autoencoder for unsupervised training, and the autoencoder outputs the reconstructed time-series data set, as shown in Fig. 3, which is a training schematic diagram of the autoencoder. The input data in the embodiment of the application is a time-series data set without manually labeled anomaly labels, and the second reconstructed time-series data set is output after the data passes through the encoder and the decoder.
The embodiment of the application aims to generate pseudo-labels for the time series automatically by borrowing the idea of self-training, without manually labeling the time-series data (that is, in an unsupervised scenario), and to improve unsupervised time-series anomaly detection under the guidance of those pseudo-labels. In supervised learning, a large amount of data carries label information, and the model is trained under the guidance of that information throughout. In semi-supervised learning, a small amount of labeled data and a large amount of unlabeled data are required. Here only a large amount of unlabeled data is available and no labeled data, so the precondition for self-training must first be created: a small amount of labeled data has to be generated. The autoencoder, a typical unsupervised model, can be trained in this label-free situation. While reconstructing the time-series data set it learns the latent patterns and regularities in the data, and it reconstructs the time-series data set from the latent representation it has learned.
Abnormal data in the time-series data set does not conform to the latent representation of the time series and is therefore not reconstructed well, so the parts of the second reconstructed time-series data set that differ greatly from the time-series data set are likely to be the abnormal parts.
The reconstruction error operator calculates the error between the time-series data set and the second reconstructed time-series data set to obtain an anomaly score for each data point in the time-series data set. The mean squared error may be chosen as the error measure: after the autoencoder outputs the reconstructed time-series data set, the reconstruction error between the time-series data set and the reconstructed time-series data set is calculated as the mean squared error. The reconstruction error of an abnormal sample is larger than that of a normal sample, which makes abnormal samples easier to identify. The output of the reconstruction error operator can therefore be used as an anomaly score that measures how abnormal each data point is: the higher the score, the more likely the point is an anomaly.
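The reconstruction error operator can be illustrated with a minimal sketch. It assumes each time point is a list of feature values and that a reconstruction of the same shape is already available; the function name `reconstruction_scores` and the toy data are hypothetical and do not come from the application.

```python
def reconstruction_scores(original, reconstructed):
    """Per-time-point mean squared error between original and reconstructed rows."""
    if len(original) != len(reconstructed):
        raise ValueError("both data sets must have the same number of time points")
    scores = []
    for x, x_hat in zip(original, reconstructed):
        if len(x) != len(x_hat):
            raise ValueError("feature dimensions differ")
        # Average of squared differences over the feature dimensions.
        scores.append(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))
    return scores


# Toy data (assumed): the third point is reconstructed poorly, so it scores highest.
original = [[1.0, 2.0], [1.1, 2.1], [9.0, 9.0]]
reconstructed = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
scores = reconstruction_scores(original, reconstructed)
```

Here the score is computed per time point rather than over the whole data set, which is one reasonable reading of "the anomaly score of each data point".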
The pseudo-label generator automatically assigns pseudo-labels to part of the data in the time-series data set according to the anomaly score of each data point, and determines the data carrying pseudo-labels as the initialization pseudo-label data set. Because the labels are assigned automatically rather than manually and may therefore be wrong, they are called pseudo-labels. A pseudo-label is either a normal label or an abnormal label: part of the data is given the normal label, part is given the abnormal label, and the data that has been given pseudo-labels forms the initialization pseudo-label data set. That is, the reconstruction error can be taken as the anomaly score of each data point in the time-series data set, and pseudo-labels can be generated according to whether the anomaly score exceeds a predefined threshold. The data points are sorted by anomaly score from high to low; the higher the anomaly score, the more likely the point is an anomaly. Taking the anomaly score as a confidence, the a% of samples with the highest anomaly scores are selected as the abnormal sample set and given the abnormal label, and the b% of samples with the lowest anomaly scores are selected as the normal sample set and given the normal label. The two sets are combined to form the initialization pseudo-label data set, as shown in Fig. 4, which is a schematic diagram of the pseudo-label generator producing the initialization pseudo-label data set. The values of a and b can be chosen according to the actual situation.
In this way, a portion of time-series data carrying pseudo-labels is obtained from the unlabeled time-series data set without manual labeling. Although the pseudo-labels are not as accurate as manual labels, their accuracy improves continuously during the subsequent rounds of training.
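The top-a% / bottom-b% selection can be sketched as follows. The encoding of labels as 1 (abnormal) and 0 (normal), the function name `generate_pseudo_labels`, and the example scores and percentages are assumptions made for illustration.

```python
def generate_pseudo_labels(scores, a_percent, b_percent):
    """Return {index: label} with 1 = abnormal and 0 = normal.

    The a_percent of points with the highest anomaly scores get the abnormal
    label, the b_percent with the lowest scores get the normal label, and the
    remaining points stay unlabeled.
    """
    if a_percent + b_percent > 100:
        raise ValueError("a_percent + b_percent must not exceed 100")
    n = len(scores)
    # Indices sorted by anomaly score, highest first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_abnormal = int(n * a_percent / 100)
    n_normal = int(n * b_percent / 100)
    labels = {i: 1 for i in order[:n_abnormal]}
    labels.update({i: 0 for i in order[n - n_normal:]})
    return labels


# Assumed toy scores: points 2 and 7 have clearly higher reconstruction error.
scores = [0.1, 0.2, 5.0, 0.15, 0.3, 0.05, 0.25, 4.0, 0.12, 0.18]
labels = generate_pseudo_labels(scores, a_percent=20, b_percent=50)
```

With these values, points 2 and 7 receive the abnormal label, the five lowest-scoring points receive the normal label, and the three points in between are left out of the pseudo-label data set.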
In the embodiment of the present application, the iterative self-training unit 120 acquires the initialization pseudo-label data set and the time-series data set and is trained using the pseudo-label data set. The time-series data set is input to the trained iterative self-training unit 120 to obtain an updated pseudo-label data set, which is used to continue training the unit. The steps of inputting the time-series data set and obtaining an updated pseudo-label data set are repeated until the number of training rounds of the iterative self-training unit 120 reaches a preset number, and the output of the last pass over the time-series data set is taken as the time-series anomaly detection result. That is, after the initialization pseudo-label data set is obtained with the initialization pseudo-label generation unit 110, it is used to train the iterative self-training unit 120, and the process of generating a pseudo-label data set with the iterative self-training unit 120 and retraining on it is repeated. As the number of training rounds increases, the output of the iterative self-training unit 120 becomes more accurate, and its final output can be used as the time-series anomaly detection result.
In the embodiment of the present application, the iterative self-training unit 120 includes an autoencoder, a multilayer perceptron and a pseudo-label generator. Fig. 5 is a schematic diagram of the training process of the iterative self-training unit.
The autoencoder is trained on the initialization pseudo-label data set, and the time-series data set is input into the trained autoencoder to obtain a 1st reconstructed time-series data set. That is, the autoencoder is trained with the initialization pseudo-label data set as input, and the trained autoencoder then takes the time-series data set as input and produces a reconstructed time-series data set.
The multilayer perceptron acquires the 1st reconstructed time-series data set and produces an anomaly score for each data point in the time-series data set. A multilayer perceptron, also called a multilayer neural network, consists of an input layer, an output layer and one or more hidden layers. Fig. 6 is a schematic diagram of a multilayer perceptron with one hidden layer. The autoencoder outputs the reconstructed time-series data set, which includes the pseudo-label data set, and passes it to the multilayer perceptron. The multilayer perceptron is trained by supervised learning on the pseudo-label data set so as to minimize the classification loss on the pseudo-labels. The trained multilayer perceptron then takes the time-series data set as input and outputs anomaly scores. Because the multilayer perceptron is trained on high-confidence data, it produces better anomaly scores on the time-series data set. From these scores, new pseudo-labels are generated to form a new pseudo-label data set, which is used to train the multilayer perceptron in the next iteration and improve its classification performance. This process is self-training; it iterates round after round until the pseudo-labels no longer change or a preset number of rounds is reached. The pseudo-label result of the last iteration is taken as the anomaly detection result.
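As a rough illustration of how a multilayer perceptron with one hidden layer can turn a feature vector into an anomaly score, the forward pass below may help. The ReLU hidden activation, the sigmoid output, and all weight values are assumptions; the application does not specify activations or weights, and training is not shown.

```python
import math


def mlp_anomaly_score(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input layer -> one hidden layer (ReLU) -> sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
w_hidden = [[1.0, -1.0], [0.5, 0.5]]
b_hidden = [0.0, 0.0]
w_out = [1.0, 1.0]
b_out = -1.0

score_low = mlp_anomaly_score([0.0, 0.0], w_hidden, b_hidden, w_out, b_out)
score_high = mlp_anomaly_score([2.0, 0.0], w_hidden, b_hidden, w_out, b_out)
```

A sigmoid output is convenient here because the score can be read as a confidence that the point is abnormal, which is what the pseudo-label generator ranks on.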
The pseudo-label generator automatically assigns pseudo-labels to part of the data in the time-series data set according to the anomaly score of each data point, and determines the data carrying pseudo-labels as the updated pseudo-label data set. That is, the anomaly scores output by the multilayer perceptron are input into the pseudo-label generator, the scores with higher confidence are selected, the corresponding data is automatically given pseudo-labels, and the labeled data is used as the new pseudo-label data set.
After the first round of training of the iterative self-training unit 120, the autoencoder continues training on the updated pseudo-label data set, and the time-series data set is again input into the trained autoencoder to obtain a 2nd reconstructed time-series data set. The multilayer perceptron acquires the 2nd reconstructed time-series data set and produces an anomaly score for each data point in the time-series data set. The pseudo-label generator then automatically assigns pseudo-labels to part of the data in the time-series data set according to these anomaly scores, and the data carrying pseudo-labels is determined as the updated pseudo-label data set.
After the second round of training, the iterative self-training unit 120 continues with the i-th round of self-training, where i is greater than 1. It is then determined whether i is greater than or equal to the preset number of times. If not, training of the iterative self-training unit 120 continues; if so, the data most recently labeled automatically in the time-series data set by the pseudo-label generator is taken as the time-series anomaly detection result.
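The control flow of the iterative self-training rounds can be sketched as below. To keep the example short and dependency-free, the autoencoder and multilayer perceptron are replaced by a nearest-centroid scorer; that substitution, the function names, the percentages and the toy data are all assumptions and are not the model described in the application. Only the loop structure (train on pseudo-labels, score the full data set, regenerate pseudo-labels, stop when the labels are stable or the preset number of rounds is reached) follows the text.

```python
import math


def fit_centroid_scorer(data, labels):
    """Stand-in for the autoencoder + multilayer perceptron: one centroid per class."""
    def centroid(cls):
        rows = [data[i] for i, y in labels.items() if y == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]

    normal, abnormal = centroid(0), centroid(1)

    def score(x):
        d_normal = math.dist(x, normal)
        d_abnormal = math.dist(x, abnormal)
        total = d_normal + d_abnormal
        # Close to 1 when x is far from the normal centroid, close to 0 when near it.
        return d_normal / total if total else 0.5

    return score


def top_bottom_labels(scores, a_percent, b_percent):
    """Pseudo-label generator: top a% abnormal (1), bottom b% normal (0)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = {i: 1 for i in order[:int(n * a_percent / 100)]}
    labels.update({i: 0 for i in order[n - int(n * b_percent / 100):]})
    return labels


def self_train(data, initial_labels, a_percent, b_percent, max_rounds):
    """Repeat train -> score -> relabel until labels are stable or max_rounds is hit."""
    labels = dict(initial_labels)
    for _ in range(max_rounds):
        score = fit_centroid_scorer(data, labels)    # train on current pseudo-labels
        scores = [score(x) for x in data]            # score the full time-series data set
        new_labels = top_bottom_labels(scores, a_percent, b_percent)
        if new_labels == labels:                     # pseudo-labels no longer change
            break
        labels = new_labels
    return labels


# Assumed toy data: one feature per time point; points 4 and 7 are outliers.
data = [[0.0], [0.1], [0.2], [0.1], [5.0], [0.0], [0.2], [4.8], [0.1], [0.3]]
initial_labels = {4: 1, 0: 0, 1: 0}  # small initialization pseudo-label data set
final_labels = self_train(data, initial_labels, a_percent=20, b_percent=60, max_rounds=10)
```

In this toy run the initial pseudo-labels mark only one outlier; after relabeling, both outliers carry the abnormal label and the loop stops once the labels repeat. The percentages must be chosen so that both label sets are non-empty, otherwise the stand-in scorer cannot compute a centroid.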
In the embodiment of the present application, the system further includes a feature selector, as shown in Fig. 7 and Fig. 8. Fig. 7 is a schematic diagram of the feature screening process, and Fig. 8 is a schematic diagram of the time-series anomaly detection process. The feature selector screens an original time-series data set to obtain the time-series data set. Specifically, time-series data is often high-dimensional; the feature selector screens the multi-dimensional features, removes irrelevant features and multicollinear features, and finally yields a time-series data set that can be used for further time-series analysis. Referring to Fig. 7, original time-series data often has many dimensions, because time series of multiple indicators are commonly used together to characterize the state of a given thing. High dimensionality carries more information, but it can also mislead model training. Two types of features may be present among the feature dimensions and need to be removed in advance. The first type is useless features, whose correlation with the other features is very low; they are screened out using a correlation calculation. The second type is redundant features; principal component analysis (PCA) is used to compute the covariance matrix of the multi-dimensional features, the eigenvalues are sorted from large to small, and the N largest eigenvalues are retained.
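The first screening step, removing features that have very low correlation with all other features, might look like the sketch below. Pearson correlation, the threshold of 0.3, and the feature names `cpu`, `mem` and `noise` are assumptions chosen for illustration; the PCA step for redundant features is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_uncorrelated_features(columns, min_abs_corr):
    """Keep only features whose strongest |correlation| with another feature
    reaches min_abs_corr. columns maps feature name -> list of values."""
    kept = {}
    for name, values in columns.items():
        best = max(
            (abs(pearson(values, other))
             for other_name, other in columns.items() if other_name != name),
            default=0.0,
        )
        if best >= min_abs_corr:
            kept[name] = values
    return kept


# Hypothetical indicators: "noise" is nearly uncorrelated with the other two.
columns = {
    "cpu": [1, 2, 3, 4, 5],
    "mem": [2, 4, 6, 8, 11],
    "noise": [3, 1, 3, 1, 3],
}
kept = drop_uncorrelated_features(columns, min_abs_corr=0.3)
```

In this example `cpu` and `mem` are strongly correlated with each other and are kept, while `noise` is dropped.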
In practice, owing to the nature of time-series data, the original time-series data set is prone to repeated or missing items. Therefore, before the original time-series data set is input into the feature selector, the data is preprocessed: repeated items are removed, and missing items are filled with the mean value for the same period.
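A minimal sketch of this preprocessing is given below. It assumes each record is keyed by a `(day, hour)` pair, that a missing value is represented by `None`, and that "the same period" means the same hour of day on other days; these are interpretive assumptions, since the application does not define the period.

```python
def preprocess(records):
    """records: list of ((day, hour), value) pairs, where value may be None."""
    seen = set()
    unique = []
    for key, value in records:
        if key not in seen:  # drop repeated items, keeping the first occurrence
            seen.add(key)
            unique.append((key, value))

    # Mean of the observed values for each hour of day (assumed "same period").
    sums, counts = {}, {}
    for (day, hour), value in unique:
        if value is not None:
            sums[hour] = sums.get(hour, 0.0) + value
            counts[hour] = counts.get(hour, 0) + 1

    filled = []
    for (day, hour), value in unique:
        if value is None and hour in counts:
            value = sums[hour] / counts[hour]  # fill the gap with the same-period mean
        filled.append(((day, hour), value))
    return filled


# Toy input: one repeated record and one missing value at hour 9.
records = [
    ((1, 9), 10.0),
    ((1, 9), 10.0),
    ((2, 9), None),
    ((3, 9), 14.0),
    ((1, 10), 5.0),
]
clean = preprocess(records)
```

A missing value is left as `None` when no observation exists for that period, so that the gap stays visible instead of being filled with an arbitrary number.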
Referring to Fig. 8, the time-series anomaly detection system provided in the embodiment of the present application implements the whole process from input of the original time-series data set to output of the time-series anomaly detection result. The operation of the system comprises three independent processes:
1. Feature screening: the original time-series data set is passed through the feature selector to obtain the time-series data set.
2. Initialization pseudo-label generation: the feature-screened time-series data set is passed through the autoencoder, the reconstruction error operator and the pseudo-label generator to generate the initialization pseudo-label data set.
3. Iterative self-training: the pseudo-label data set is passed through the autoencoder, the multilayer perceptron and the pseudo-label generator to form a new pseudo-label data set, which replaces the previous one. This process is iterated a number of times, and the time-series anomaly detection result is finally output.
As can be seen from the above description, the time series anomaly detection system provided in this embodiment of the present application does not require manual labeling of the original time series data set, which saves considerable labor cost and allows training in an unsupervised learning scenario. In addition, the system does not depend on traditional statistical methods; it is based on a self-encoder (autoencoder) and a multilayer neural network from deep learning, and is therefore better suited than relatively simple statistical methods to today's large-scale and complex time series anomaly detection scenarios. Furthermore, on top of unsupervised deep learning, the self-training idea from semi-supervised learning is introduced and pseudo labels are used to guide model training, which to a certain extent alleviates the model inaccuracy caused by the lack of label guidance in unsupervised learning.
An embodiment of the present application provides a time series anomaly detection system comprising an initialization pseudo label generation unit and an iterative self-training unit. The initialization pseudo label generation unit automatically labels part of the data in an unlabeled time series data set with pseudo labels and determines the part of the data carrying the pseudo labels as an initialized pseudo label data set. The iterative self-training unit acquires the initialized pseudo label data set and the time series data set and is trained using the pseudo label data set; the time series data set is then input to the trained iterative self-training unit to obtain an updated pseudo label data set. The iterative self-training unit continues to be trained using the updated pseudo label data set, the time series data set is input to it again, and the process of obtaining an updated pseudo label data set is repeated until the number of training rounds of the iterative self-training unit reaches a preset number. The output of the iterative self-training unit from its last processing of the time series data set is taken as the time series anomaly detection result. No manual labeling is required: the automatically generated pseudo labels guide the iterative self-training unit through unsupervised learning, so that time series anomaly detection is performed effectively.
Based on the time series anomaly detection system provided by the above embodiment, an embodiment of the present application further provides a time series anomaly detection method, whose working principle is described in detail below with reference to the accompanying drawings.
Fig. 9 is a schematic flowchart of a time series anomaly detection method provided in an embodiment of the present application.
The time series anomaly detection method provided by this embodiment comprises the following steps:
S101: automatically labeling part of the data in an unlabeled time series data set with pseudo labels, and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set.
S102: acquiring the initialized pseudo label data set and the time series data set; training an iterative self-training unit using the pseudo label data set; inputting the time series data set to the trained iterative self-training unit to obtain an updated pseudo label data set; continuing to train the iterative self-training unit using the updated pseudo label data set and continuing to input the time series data set to the trained iterative self-training unit; repeating the process of obtaining the updated pseudo label data set until the number of training rounds of the iterative self-training unit reaches a preset number; and taking the output of the iterative self-training unit from its last processing of the time series data set as the time series anomaly detection result.
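The control flow of steps S101 and S102 can be sketched as a loop over three pluggable callables. This is a sketch under stated assumptions: the names `train`, `score`, and `label` are hypothetical, and the centroid-distance model in the example is only a stand-in for the self-encoder and multilayer perceptron of the application, chosen so that the loop runs without a deep learning library.

```python
def iterative_self_training(data, initial_labels, train, score, label, num_rounds):
    """data: list of samples.
    initial_labels: dict mapping sample index to 0 (pseudo-normal) or 1 (pseudo-anomalous).
    Returns the pseudo labels produced in the last round."""
    pseudo_labels = dict(initial_labels)
    model = None
    for _ in range(num_rounds):
        model = train(model, data, pseudo_labels)  # train on the current pseudo labels
        scores = score(model, data)                # score the whole time series data set
        pseudo_labels = label(scores)              # relabel part of the data
    return pseudo_labels

# Stand-in components (assumptions, not the networks described in the application).
def train_centroid(model, data, pseudo_labels):
    # Refit from scratch on the pseudo-normal samples; the previous model is ignored.
    normals = [data[i] for i, y in pseudo_labels.items() if y == 0]
    return sum(normals) / len(normals)

def score_distance(model, data):
    # Anomaly score: distance from the centroid of the pseudo-normal samples.
    return [abs(x - model) for x in data]

def label_extremes(scores, k=2):
    # Label the k lowest-scoring samples normal and the k highest-scoring anomalous.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = {i: 0 for i in order[:k]}
    labels.update({i: 1 for i in order[-k:]})
    return labels

data = [1.0, 1.1, 0.9, 1.2, 5.0, 1.0, 0.8, -3.0]
initial_labels = {0: 0, 1: 0, 4: 1}
result = iterative_self_training(
    data, initial_labels, train_centroid, score_distance, label_extremes, num_rounds=3
)
anomalies = sorted(i for i, y in result.items() if y == 1)
print(anomalies)
```

Keeping the trainer, scorer, and labeler as parameters mirrors the separation between the units in the system; only the fixed number of rounds is taken directly from the description.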
Optionally, acquiring the initialized pseudo label data set and the time series data set, training the iterative self-training unit using the pseudo label data set, and inputting the time series data set to the trained iterative self-training unit to obtain an updated pseudo label data set comprises:
training a self-encoder according to the initialized pseudo label data set, and inputting the time series data set into the trained self-encoder to obtain a 1st reconstructed time series data set;
acquiring the 1st reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an updated pseudo label data set.
Optionally, continuing to train the iterative self-training unit using the updated pseudo label data set, continuing to input the time series data set to the trained iterative self-training unit, repeating the process of obtaining the updated pseudo label data set until the number of training rounds of the iterative self-training unit reaches a preset number, and taking the output of the iterative self-training unit from its last processing of the time series data set as the time series anomaly detection result comprises:
continuing to train the self-encoder according to the updated pseudo label data set, and inputting the time series data set into the trained self-encoder to obtain an i-th reconstructed time series data set, where i > 1;
acquiring the i-th reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an updated pseudo label data set;
and judging whether i is greater than or equal to the preset number, and if so, taking the part of the data in the time series data set that was automatically labeled in the last round as the time series anomaly detection result.
Optionally, automatically labeling part of the data in the unlabeled time series data set with pseudo labels and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set comprises:
training a self-encoder according to the time series data set, and inputting the time series data set into the trained self-encoder to obtain a second reconstructed time series data set;
calculating the error between the time series data set and the second reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set.
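The last two steps above (reconstruction error as anomaly score, then pseudo labeling of part of the data) can be sketched as follows. Several points are assumptions made for illustration: the error is taken as the per-sample mean squared error, the lowest-scoring fraction is labeled normal (0) and the highest-scoring fraction anomalous (1), and the fractions themselves are hypothetical. The reconstructed values in the example are made up; in the method they would come from the trained self-encoder.

```python
def reconstruction_errors(original, reconstructed):
    """Per-sample mean squared error between each sample and its reconstruction,
    used here as the anomaly score."""
    return [
        sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        for x, x_hat in zip(original, reconstructed)
    ]

def generate_pseudo_labels(scores, normal_fraction=0.4, anomaly_fraction=0.2):
    """Label only part of the data: the lowest-scoring samples as normal (0) and
    the highest-scoring samples as anomalous (1). Samples in between stay unlabeled.
    Both fractions are hypothetical parameters."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    n_normal = int(n * normal_fraction)
    n_anomaly = int(n * anomaly_fraction)
    labels = {i: 0 for i in order[:n_normal]}
    labels.update({i: 1 for i in order[n - n_anomaly:]})
    return labels

# Hypothetical two-dimensional samples and made-up reconstructions.
original = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, 9.0], [1.0, 2.0]]
reconstructed = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.5, 2.5], [1.1, 2.0]]

scores = reconstruction_errors(original, reconstructed)
pseudo_labels = generate_pseudo_labels(scores)
print(pseudo_labels)
```

Leaving the middle of the score range unlabeled reflects the statement that only part of the data carries a pseudo label; how that part is chosen in the application is not specified in this passage.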
Optionally, the method further comprises:
screening an original time series data set to obtain the time series data set.
Based on the time series anomaly detection method provided by the above embodiments, an embodiment of the present application further provides a time series anomaly detection device, which includes:
a processor and a memory, where there may be one or more processors. In some embodiments of the present application, the processor and the memory may be connected by a bus or by other means.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (NVRAM). The memory stores an operating system and operating instructions, executable modules, or data structures, or a subset or an extended set thereof, where the operating instructions may include various instructions for performing various operations. The operating system may include various system programs for implementing basic services and for handling hardware-based tasks.
The processor controls the operation of the device and may also be referred to as a central processing unit (CPU).
The method disclosed in the embodiments of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The embodiment of the present application further provides a computer-readable storage medium for storing a program code, where the program code is used to execute any one implementation of the methods of the foregoing embodiments.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
When introducing elements of various embodiments of the present application, the articles "a," "an," "the," and "said" are intended to mean that there are one or more of the elements. The terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements other than the listed elements.
It should be noted that, as one of ordinary skill in the art would understand, all or part of the processes of the above method embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed, may carry out the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, the apparatus embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the description of the method embodiments for relevant points. The above-described apparatus embodiments are merely illustrative, and the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is merely a preferred embodiment of the present application and, although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application. Those skilled in the art can now make numerous possible variations and modifications to the disclosed embodiments, or modify equivalent embodiments, using the methods and techniques disclosed above, without departing from the scope of the claimed embodiments. Therefore, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present application still fall within the protection scope of the technical solution of the present application without departing from the content of the technical solution of the present application.

Claims (10)

1. A time series anomaly detection system, the system comprising: an initialization pseudo label generation unit and an iterative self-training unit;
the initialization pseudo label generation unit is configured to automatically label part of the data in an unlabeled time series data set with pseudo labels, and to determine the part of the data carrying the pseudo labels as an initialized pseudo label data set;
the iterative self-training unit is configured to acquire the initialized pseudo label data set and the time series data set, train the iterative self-training unit using the pseudo label data set, input the time series data set to the trained iterative self-training unit to obtain an updated pseudo label data set, continue training the iterative self-training unit using the updated pseudo label data set, continue inputting the time series data set to the trained iterative self-training unit, repeat the process of obtaining the updated pseudo label data set until the number of training rounds of the iterative self-training unit reaches a preset number, and take the output of the iterative self-training unit from its last processing of the time series data set as a time series anomaly detection result.
2. The system of claim 1, wherein the iterative self-training unit comprises a self-encoder, a multilayer perceptron, and a pseudo label generator;
the self-encoder is configured to be trained according to the initialized pseudo label data set, the time series data set being input into the trained self-encoder to obtain a 1st reconstructed time series data set;
the multilayer perceptron is configured to acquire the 1st reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and the pseudo label generator is configured to automatically label part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and to determine the part of the data carrying the pseudo labels as an updated pseudo label data set.
3. The system of claim 2, wherein the self-encoder is configured to continue training according to the updated pseudo label data set, the time series data set being input into the trained self-encoder to obtain an i-th reconstructed time series data set, where i > 1;
the multilayer perceptron is configured to acquire the i-th reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
the pseudo label generator is configured to automatically label part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and to determine the part of the data carrying the pseudo labels as an updated pseudo label data set;
and it is judged whether i is greater than or equal to the preset number, and if so, the part of the data in the time series data set that was automatically labeled by the pseudo label generator in the last round is taken as the time series anomaly detection result.
4. The system of any one of claims 1-3, wherein the initialization pseudo label generation unit comprises a self-encoder, a reconstruction error operator, and a pseudo label generator;
the self-encoder is configured to be trained according to the time series data set, the time series data set being input into the trained self-encoder to obtain a second reconstructed time series data set;
the reconstruction error operator is configured to calculate the error between the time series data set and the second reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and the pseudo label generator is configured to automatically label part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and to determine the part of the data carrying the pseudo labels as an initialized pseudo label data set.
5. The system of any one of claims 1-3, further comprising a feature filter;
the feature filter is configured to screen an original time series data set to obtain the time series data set.
6. A time series anomaly detection method, the method comprising:
automatically labeling part of the data in an unlabeled time series data set with pseudo labels, and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set;
acquiring the initialized pseudo label data set and the time series data set, training an iterative self-training unit using the pseudo label data set, inputting the time series data set to the trained iterative self-training unit to obtain an updated pseudo label data set, continuing to train the iterative self-training unit using the updated pseudo label data set, continuing to input the time series data set to the trained iterative self-training unit, repeating the process of obtaining the updated pseudo label data set until the number of training rounds of the iterative self-training unit reaches a preset number, and taking the output of the iterative self-training unit from its last processing of the time series data set as a time series anomaly detection result.
7. The method of claim 6, wherein acquiring the initialized pseudo label data set and the time series data set, training the iterative self-training unit using the pseudo label data set, and inputting the time series data set to the trained iterative self-training unit to obtain an updated pseudo label data set comprises:
training a self-encoder according to the initialized pseudo label data set, and inputting the time series data set into the trained self-encoder to obtain a 1st reconstructed time series data set;
acquiring the 1st reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an updated pseudo label data set.
8. The method according to claim 7, wherein continuing to train the iterative self-training unit using the updated pseudo label data set, continuing to input the time series data set to the trained iterative self-training unit, repeating the process of obtaining the updated pseudo label data set until the number of training rounds of the iterative self-training unit reaches a preset number, and taking the output of the iterative self-training unit from its last processing of the time series data set as the time series anomaly detection result comprises:
continuing to train the self-encoder according to the updated pseudo label data set, and inputting the time series data set into the trained self-encoder to obtain an i-th reconstructed time series data set, where i > 1;
acquiring the i-th reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an updated pseudo label data set;
and judging whether i is greater than or equal to the preset number, and if so, taking the part of the data in the time series data set that was automatically labeled in the last round as the time series anomaly detection result.
9. The method according to any one of claims 6-8, wherein automatically labeling part of the data in the unlabeled time series data set with pseudo labels and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set comprises:
training a self-encoder according to the time series data set, and inputting the time series data set into the trained self-encoder to obtain a second reconstructed time series data set;
calculating the error between the time series data set and the second reconstructed time series data set to obtain an anomaly score for each data item in the time series data set;
and automatically labeling part of the data in the time series data set with pseudo labels according to the anomaly score of each data item, and determining the part of the data carrying the pseudo labels as an initialized pseudo label data set.
10. The method according to any one of claims 6-8, further comprising:
screening an original time series data set to obtain the time series data set.
CN202211201476.2A 2022-09-29 2022-09-29 Time series abnormity detection system and method Pending CN115456107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211201476.2A CN115456107A (en) 2022-09-29 2022-09-29 Time series abnormity detection system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211201476.2A CN115456107A (en) 2022-09-29 2022-09-29 Time series abnormity detection system and method

Publications (1)

Publication Number Publication Date
CN115456107A true CN115456107A (en) 2022-12-09

Family

ID=84309587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211201476.2A Pending CN115456107A (en) 2022-09-29 2022-09-29 Time series abnormity detection system and method

Country Status (1)

Country Link
CN (1) CN115456107A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116633705A (en) * 2023-07-26 2023-08-22 山东省计算中心(国家超级计算济南中心) Industrial control system abnormality detection method and system based on composite automatic encoder
CN117077802A (en) * 2023-06-15 2023-11-17 深圳计算科学研究院 Sequencing prediction method and device for time sequence data

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077802A (en) * 2023-06-15 2023-11-17 深圳计算科学研究院 Sequencing prediction method and device for time sequence data
CN116633705A (en) * 2023-07-26 2023-08-22 山东省计算中心(国家超级计算济南中心) Industrial control system abnormality detection method and system based on composite automatic encoder
CN116633705B (en) * 2023-07-26 2023-10-13 山东省计算中心(国家超级计算济南中心) Industrial control system abnormality detection method and system based on composite automatic encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination