CN118133103A - Time sequence data anomaly detection model generation method and device and electronic equipment

Info

Publication number: CN118133103A
Application number: CN202311764891.3A
Authority: CN
Inventors: 孙婷; 杜宇宁; 刘毅; 赵乔; 胡晓光; 于佃海; 马艳军
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2023-12-20
Filing date: 2023-12-20
Publication date: 2024-06-04

Abstract

The disclosure provides a method and a device for generating a time sequence data anomaly detection model and electronic equipment, relates to the technical field of computers, and particularly relates to the technical field of artificial intelligence such as data processing and deep learning. The specific implementation scheme is as follows: acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not; determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model; selecting a target model from a plurality of candidate models according to the first time sequence data, the first recall rate and the verification data set; and generating a time sequence data abnormality detection model according to the target model.

Description

Time sequence data anomaly detection model generation method and device and electronic equipment

Technical Field

The disclosure relates to the technical field of computers, in particular to the technical field of artificial intelligence such as data processing and deep learning, and specifically relates to a method and a device for generating a time sequence data anomaly detection model and electronic equipment.

Background

Time series data is a set of data points arranged in chronological order, each associated with a particular point in time or time period. Abnormal points in time series data are points at which the data departs from its established pattern, such as abrupt rises or falls, trend changes, level shifts, and values exceeding the historical maximum or minimum. Anomaly detection for time series data aims to find these abnormal points quickly and accurately.

Disclosure of Invention

The disclosure provides a method and device for generating a time sequence data anomaly detection model and electronic equipment.

According to a first aspect of the present disclosure, there is provided a method for generating a time series data anomaly detection model, including:

Acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not;

Determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and a first recall rate corresponding to each candidate model;

Selecting a target model from a plurality of candidate models according to the first time sequence data, the first recall rate and the verification data set;

And generating a time sequence data abnormality detection model according to the target model.

According to a second aspect of the present disclosure, there is provided a time series data anomaly detection method including:

Acquiring time sequence data to be detected;

Inputting the time sequence data to be detected into a time sequence data abnormality detection model to obtain target reconstruction time sequence data corresponding to the time sequence data to be detected, wherein the time sequence data abnormality detection model is generated based on the generation method of the time sequence data abnormality detection model in the first aspect;

And determining abnormal data in the time sequence data to be detected according to the difference between the time sequence data to be detected and the target reconstruction time sequence data.

According to a third aspect of the present disclosure, there is provided a generation apparatus of a time series data abnormality detection model, including:

The acquisition module is used for acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not;

The determining module is used for determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and a first recall rate corresponding to each candidate model;

A selection module for selecting a target model from a plurality of candidate models according to the first time sequence data, the first recall and the verification data set;

and the generating module is used for generating a time sequence data abnormality detection model according to the target model.

According to a fourth aspect of the present disclosure, there is provided a time series data abnormality detection apparatus including:

The first acquisition module is used for acquiring time sequence data to be detected;

A second obtaining module, configured to input the time sequence data to be detected into a time sequence data anomaly detection model to obtain target reconstruction time sequence data corresponding to the time sequence data to be detected, where the time sequence data anomaly detection model is generated by the generation apparatus of the third aspect;

The determining module is used for determining abnormal data in the time sequence data to be detected according to the difference between the time sequence data to be detected and the target reconstruction time sequence data.

According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of generating a temporal data anomaly detection model as described in the first aspect or to perform the method of detecting a temporal data anomaly as described in the second aspect.

According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method of generating a time series data abnormality detection model as described in the first aspect or the method of detecting a time series data abnormality as described in the second aspect.

According to a seventh aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of generating a temporal data anomaly detection model as described in the first aspect, or are capable of performing the steps of the method of detecting a temporal data anomaly as described in the second aspect.

The method and device for generating the time sequence data anomaly detection model and the electronic equipment have the following beneficial effects:

In the embodiment of the disclosure, a plurality of candidate models and verification data sets are acquired first, then first time sequence data corresponding to sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model are determined, further a target model is selected from the plurality of candidate models according to the first time sequence data, the first recall rate and the verification data sets, and finally a time sequence data abnormality detection model is generated according to the target model. Therefore, the target model can be selected from the plurality of candidate models according to the first time sequence data, the first recall rate and the verification data set to form the time sequence data abnormality detection model with higher recall rate, so that the time sequence data abnormality detection model can be fused with the prediction result of the target model, and the accuracy and the reliability of the time sequence data abnormality detection model for predicting abnormal data are improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow chart of a method for generating a time series data anomaly detection model according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method for generating a time series data anomaly detection model according to a further embodiment of the present disclosure;

FIG. 3 is a flow chart of a method for generating a time series data anomaly detection model according to a further embodiment of the present disclosure;

FIG. 4 is a flow chart of a method for detecting anomalies in time series data according to one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a device for generating a time series data anomaly detection model according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural view of a time-series data anomaly detection device according to a further embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The embodiment of the disclosure relates to the technical field of artificial intelligence such as data processing, deep learning and the like. The method can be applied to the scenes of monitoring battery charging conditions, monitoring machine operation data by a sensor to judge whether the machine has faults, monitoring and analyzing the flow of people, monitoring the flow of network and the like.

Artificial intelligence, abbreviated AI, is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.

Data processing is the collection, storage, retrieval, processing, transformation, and transmission of data. Its basic purpose is to extract and derive, from large amounts of possibly disordered and hard-to-understand data, the data that is valuable and meaningful to particular users.

Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during this learning helps in interpreting data such as text, images, and sounds. Its ultimate goal is to give machines human-like abilities to analyze and learn, so that they can recognize data such as text, images, and sounds.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.

The following describes a method and apparatus for generating a time series data anomaly detection model and an electronic device according to an embodiment of the present disclosure with reference to the accompanying drawings.

It should be noted that, the execution body of the method for generating the time series data anomaly detection model according to the present embodiment is a device for generating the time series data anomaly detection model, and the device may be implemented in a software and/or hardware manner, and the device may be configured in an electronic device, and the electronic device may include, but is not limited to, a terminal, a server, and the like.

Fig. 1 is a flowchart of a method for generating a time-series data anomaly detection model according to an embodiment of the present disclosure.

As shown in fig. 1, the method for generating the time series data anomaly detection model includes:

S101: and acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not.

A candidate model may be a pre-trained model capable of reconstructing time sequence data.

In some embodiments, a training data set and a plurality of initial models are acquired, wherein the training data set includes third time series data, each of the third time series data is normal data, and then each initial model is trained based on the training data set to acquire a plurality of candidate models. Therefore, the plurality of initial models are respectively trained based on the same training data set, so that each candidate model obtained through training can learn different semantic features of the same data.

The initial model may consist of a backbone model, such as the Patch Time Series Transformer (PatchTST), TimesNet, or the Non-stationary Transformer, together with a decoding module (for example, a linear layer whose dimension equals the length of the input sequence may be added after the output of the backbone model).

The sample time sequence data is a data sequence arranged based on time sequence. The sample time sequence data comprises normal data and abnormal data, and each data in the sample time sequence data corresponds to a label for identifying whether the data is abnormal or not. For example, when the data is abnormal data, the corresponding tag may be 1; when the data is normal data, the corresponding tag may be 0. Or the label corresponding to the abnormal data may be 0, and the label corresponding to the normal data is 1.

S102: and determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model.

In some embodiments, sample timing data is input into each candidate model to obtain first timing data reconstructed for each candidate model.

After the first time sequence data reconstructed by each candidate model is obtained, the difference value between each data point in the first time sequence data and the corresponding data point in the sample time sequence data can be determined. If the difference value for a data point is larger than a difference threshold, that data point is determined to be predicted abnormal data. A first total number of predicted abnormal data and a second total number of abnormal data marked by the labels are then determined, and the ratio of the first total number to the second total number is taken as the first recall rate. In this way, the recall rate of each candidate model for abnormal data can be accurately determined.

The distance, or the squared distance, between each data point in the first time sequence data and the corresponding data point in the sample time sequence data may be used as the corresponding difference value. The present disclosure is not limited in this regard.
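The recall computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `predict_anomalies` and `recall`, the squared-difference measure, the label convention (1 = abnormal), and the threshold value are all chosen for the example. One further assumption is flagged in the code: only predicted anomalies that coincide with a labelled anomaly are counted in the numerator, which is the conventional definition of recall, whereas the text speaks only of the total number of predicted abnormal data.

```python
from typing import List, Sequence


def predict_anomalies(sample: Sequence[float],
                      reconstructed: Sequence[float],
                      diff_threshold: float) -> List[int]:
    """Flag a point as abnormal (1) when its squared reconstruction error exceeds the threshold."""
    return [1 if (x - y) ** 2 > diff_threshold else 0
            for x, y in zip(sample, reconstructed)]


def recall(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Ratio of detected anomalies to labelled anomalies (label 1 = abnormal).

    Assumption: only predictions that coincide with a labelled anomaly are counted.
    """
    labelled = sum(labels)
    if labelled == 0:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, labels) if p == 1 and t == 1)
    return hits / labelled


# Hypothetical verification data and model output.
sample = [1.0, 1.1, 5.0, 1.0, 0.9, 4.0]
labels = [0, 0, 1, 0, 0, 1]
reconstructed = [1.0, 1.0, 1.2, 1.0, 1.0, 3.8]

predicted = predict_anomalies(sample, reconstructed, diff_threshold=1.0)
first_recall = recall(predicted, labels)
```

In this toy data the model reconstructs the second anomaly too closely for it to cross the threshold, so only one of the two labelled anomalies is recalled.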

In some embodiments, the first recall rate corresponding to each candidate model may be determined at a precision of 90%.

S103: a target model is selected from the plurality of candidate models based on the first time series data, the first recall and the verification data set.

In some embodiments, first models whose first recall rates are greater than a preset threshold may be selected from the candidate models, and the first models may then be freely combined to obtain a plurality of model combinations. The first time sequence data corresponding to the first models in each model combination is fused to obtain fused time sequence data for that model combination, a third recall rate for each model combination is determined according to the fused time sequence data and the verification data set, and the candidate models in the model combination with the highest third recall rate are determined to be the target models.

Any data fusion method can be adopted to fuse the first time sequence data corresponding to the first model in each model combination. The present disclosure is not limited in this regard.

For how the third recall rate corresponding to each model combination is determined according to the fused time sequence data and the verification data set, reference can be made to the description of determining the first recall rate of each candidate model.
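The screening and combination embodiment of S103 can be sketched as follows. The sketch makes several assumptions that are not in the text: fusion is a point-wise mean (the disclosure allows any fusion method), the difference value is the absolute reconstruction error, label 1 means abnormal, and the names `recall_of`, `select_target_models`, and `recall_floor` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def recall_of(recon: Sequence[float], sample: Sequence[float],
              labels: Sequence[int], thr: float) -> float:
    """Recall of labelled anomalies (label 1) detected via absolute reconstruction error."""
    labelled = sum(labels)
    hits = sum(1 for r, x, t in zip(recon, sample, labels)
               if t == 1 and abs(x - r) > thr)
    return hits / labelled if labelled else 0.0


def select_target_models(recons: Dict[str, List[float]], sample: Sequence[float],
                         labels: Sequence[int], thr: float,
                         recall_floor: float) -> Tuple[Tuple[str, ...], float]:
    """Screen models by first recall, then return the combination with the highest third recall."""
    first = {name: recall_of(r, sample, labels, thr) for name, r in recons.items()}
    kept = sorted(name for name, rec in first.items() if rec > recall_floor)
    best_combo: Tuple[str, ...] = ()
    best_recall = -1.0
    for size in range(1, len(kept) + 1):
        for combo in combinations(kept, size):
            # Assumed fusion: point-wise mean of the reconstructions in the combination.
            fused = [sum(vals) / len(vals) for vals in zip(*(recons[n] for n in combo))]
            third = recall_of(fused, sample, labels, thr)
            if third > best_recall:
                best_combo, best_recall = combo, third
    return best_combo, best_recall


# Hypothetical verification data and reconstructions from three candidate models.
sample = [1.0, 5.0, 1.0, 4.0]
labels = [0, 1, 0, 1]
recons = {
    "a": [1.0, 1.0, 1.0, 3.5],  # misses the second anomaly
    "b": [1.0, 4.5, 1.0, 1.0],  # misses the first anomaly
    "c": [1.0, 5.0, 1.0, 4.0],  # reproduces anomalies, so detects none
}
combo, third_recall = select_target_models(recons, sample, labels,
                                           thr=1.0, recall_floor=0.3)
```

Enumerating every combination grows exponentially with the number of screened models, which is one reason an iterative search may be preferable when there are many candidates.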

S104: and generating a time sequence data abnormality detection model according to the target model.

Specifically, the target model may be integrated into a time series data anomaly detection model.

In some embodiments, a weight corresponding to each target model may also be set in the time-series data anomaly detection model. The sum of the weights corresponding to each target model is 1. Optionally, the weights corresponding to each target model may be the same or different. The present disclosure is not limited in this regard.

In some embodiments, the target weight corresponding to each target model may be determined according to the first recall rate corresponding to each target model. The higher the first recall, the greater the corresponding target weight.

Optionally, under the condition that the number of the target models is multiple, adding the first recall rates corresponding to the target models to obtain a second numerical value, then determining the ratio of the first recall rates corresponding to the target models to the second numerical value as the target weight corresponding to the target models, and finally integrating the target models into a time sequence data anomaly detection model, wherein the time sequence data anomaly detection model comprises the target weight corresponding to the target models. Therefore, the target weight corresponding to each target model can be more accurately determined according to the first recall rate corresponding to each target model.
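Written as a formula, with the assumed notation that $r_i$ is the first recall rate of the $i$-th of $K$ target models and $w_i$ is its target weight, the rule above reads:

```latex
w_i = \frac{r_i}{\sum_{j=1}^{K} r_j}, \qquad \sum_{i=1}^{K} w_i = 1 .
```

For instance, two target models with first recall rates of 0.6 and 0.9 would receive target weights of 0.6 / 1.5 = 0.4 and 0.9 / 1.5 = 0.6.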

As shown in fig. 2, the method for generating the time series data anomaly detection model includes:

S201: and acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not.

S202: and determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model.

The specific implementation manner of step S201 and step S202 may refer to the detailed description in other embodiments in the disclosure, which will not be described in detail here.

S203: a plurality of integrated models is determined, wherein each integrated model is integrated from at least one candidate model.

In some embodiments, one or more candidate models may be randomly selected to obtain an integrated model.

In some embodiments, multiple candidate models with higher first recall rates may also be freely combined to obtain multiple integrated models.

S204: and based on the first recall rate, fusing the first time sequence data reconstructed by the candidate models in each integrated model to obtain second time sequence data corresponding to each integrated model.

In the embodiment of the disclosure, a first weight corresponding to each candidate model in the integrated model may be determined based on the first recall rate. The first weights of the candidate models in an integrated model sum to 1. The higher the first recall rate, the higher the corresponding first weight.

The first weight may be used to indicate a fusion ratio of the first time sequence data corresponding to the candidate model in the integrated model.

For example, if the integrated model includes a candidate model a and a candidate model B, the first recall rate of the candidate model a is smaller than the first recall rate of the candidate model B, the first weight of the candidate model a may be 0.4, and the first weight of the candidate model B may be 0.6.

In some embodiments, the first recall rates corresponding to each candidate model in the integrated model may be added to obtain a first value, then a ratio of the first recall rate corresponding to each candidate model in the integrated model to the first value is determined as a first weight corresponding to each candidate model in the integrated model, and finally the reconstructed first time sequence data of each candidate model in the integrated model is fused based on the first weight corresponding to each candidate model in the integrated model to obtain second time sequence data corresponding to the integrated model. Therefore, the fusion weight corresponding to each candidate model in the integrated model is more accurately determined through the ratio of the first recall rate corresponding to each candidate model in the integrated model to the first numerical value.
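The recall-weighted fusion of S204 can be sketched as follows. It is a minimal illustration that assumes fusion is a point-wise weighted sum, which is one natural reading of the fusion ratio; the function names, the equal-weight fallback, and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def recall_weights(first_recalls: Sequence[float]) -> List[float]:
    """First weight of each candidate model: its first recall divided by the sum of recalls."""
    total = sum(first_recalls)  # the "first value" in the text
    if total == 0:
        # Assumed fallback (not in the text): equal weights when every recall is zero.
        return [1.0 / len(first_recalls)] * len(first_recalls)
    return [r / total for r in first_recalls]


def fuse(reconstructions: Sequence[Sequence[float]],
         weights: Sequence[float]) -> List[float]:
    """Point-wise weighted sum of the first time sequence data of each candidate model."""
    return [sum(w * v for w, v in zip(weights, point))
            for point in zip(*reconstructions)]


# Two candidate models with hypothetical first recall rates 0.5 and 0.75.
weights = recall_weights([0.5, 0.75])
second_series = fuse([[1.0, 2.0, 3.0], [2.0, 2.0, 8.0]], weights)
```

With these inputs the weights come out as 0.4 and 0.6, matching the proportions in the candidate model A and candidate model B example.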

S205: and determining a second recall rate corresponding to each integrated model according to the second time sequence data and the verification data set.

Specifically, the difference value between each data point in the second time sequence data and the corresponding data point in the sample time sequence data is determined, and a data point is determined to be abnormal data predicted by the integrated model if its difference value is larger than the difference threshold. A third total number of abnormal data predicted by the integrated model and the second total number of abnormal data marked by the labels are then determined, and the ratio of the third total number to the second total number is taken as the second recall rate corresponding to the integrated model.

S206: and determining the candidate model in the integrated model with the highest second recall rate as the target model.

The higher the second recall ratio is, the higher the accuracy of the integrated model to predict the abnormal data is, and therefore, the candidate model in the integrated model having the highest second recall ratio is determined as the target model.

S207: and generating a time sequence data abnormality detection model according to the target model.

In the embodiment of the disclosure, the integrated model with the highest second recall rate is the time sequence data anomaly detection model.

In the embodiment of the disclosure, after determining first time sequence data corresponding to sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model, determining a plurality of integrated models, fusing the first time sequence data reconstructed by the candidate models in each integrated model based on the first recall rate to obtain second time sequence data corresponding to each integrated model, further determining second recall rate corresponding to each integrated model according to the second time sequence data and a verification data set, determining a candidate model in the integrated model with the highest second recall rate as a target model, and finally generating a time sequence data anomaly detection model according to the target model. Therefore, the first time sequence data reconstructed by the candidate models in the integrated model can be fused based on the first recall rate, the second time sequence data corresponding to each integrated model is accurately determined, the second recall rate corresponding to each integrated model is further accurately determined, the integrated model with the optimal second recall rate can be selected, and the time sequence data abnormality detection model is generated, so that the accuracy and the reliability of predicting abnormal data by the time sequence data abnormality detection model are further improved.

As shown in fig. 3, the method for generating the time series data anomaly detection model includes:

S301: and acquiring a plurality of candidate models and a verification data set, wherein each candidate model is used for reconstructing time sequence data, the verification data set comprises sample time sequence data and a label, and the label is used for describing whether each data in the sample time sequence data is abnormal or not.

S302: and determining first time sequence data corresponding to the sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model.

S303: a plurality of integrated models is determined, wherein each integrated model is integrated from at least one candidate model.

S304: and based on the first recall rate, fusing the first time sequence data reconstructed by the candidate models in each integrated model to obtain second time sequence data corresponding to each integrated model.

S305: and determining a second recall rate corresponding to each integrated model according to the second time sequence data and the verification data set.

For the specific implementation of steps S301 to S305, reference may be made to the detailed description in other embodiments of this disclosure, which is not repeated here.

S306: updating the integrated model, and determining a second recall rate corresponding to the updated integrated model.

In some embodiments, the integrated model may be updated based on genetic algorithms. Therefore, based on a genetic algorithm, the diversity of the integrated model can be increased, and the accuracy of the generated time sequence data anomaly detection model on anomaly data detection is improved.

For example, suppose there are 5 candidate models: candidate model 1, candidate model 2, candidate model 3, candidate model 4 and candidate model 5. Each candidate model is either selected or not selected. If a selected candidate model is marked 1 and an unselected one is marked 0, each integrated model corresponds to a 0/1 sequence. The sequence 00110, for instance, indicates that candidate model 3 and candidate model 4 are selected while candidate model 1, candidate model 2 and candidate model 5 are not, so the integrated model comprises candidate model 3 and candidate model 4.
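The 0/1 encoding in this example can be sketched with two small helpers. The names `decode` and `encode` are hypothetical, and representing the sequence as a Python string is an assumption made for readability.

```python
from typing import List


def decode(sequence: str) -> List[int]:
    """Return the 1-based indices of the candidate models selected by a 0/1 sequence."""
    return [i for i, bit in enumerate(sequence, start=1) if bit == "1"]


def encode(selected: List[int], num_candidates: int) -> str:
    """Build the 0/1 sequence for a list of selected 1-based candidate indices."""
    chosen = set(selected)
    return "".join("1" if i in chosen else "0"
                   for i in range(1, num_candidates + 1))


members = decode("00110")      # candidate models 3 and 4
sequence = encode([3, 4], 5)   # back to the 0/1 sequence
```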

In some embodiments, corresponding elements of the sequences of two integrated models may be exchanged to obtain new integrated models. For example, if the sequence corresponding to integrated model A is 10110 and the sequence corresponding to integrated model B is 00101, the first elements of the two sequences can be exchanged to obtain the updated integrated models 00110 and 10101.
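The element exchange in this example can be sketched as a single-position swap between two sequences. The function name `exchange` and the use of 1-based positions are assumptions chosen to match the wording of the example.

```python
from typing import Tuple


def exchange(seq_a: str, seq_b: str, position: int) -> Tuple[str, str]:
    """Swap the element at a 1-based position between two equal-length 0/1 sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not 1 <= position <= len(seq_a):
        raise ValueError("position out of range")
    i = position - 1
    new_a = seq_a[:i] + seq_b[i] + seq_a[i + 1:]
    new_b = seq_b[:i] + seq_a[i] + seq_b[i + 1:]
    return new_a, new_b


# Integrated models A and B from the example, exchanging the first element.
updated_a, updated_b = exchange("10110", "00101", position=1)
```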

In some embodiments, any element in the sequence corresponding to the integrated model may be replaced to obtain an updated integrated model. For example, if the sequence corresponding to integrated model A is 10110, replacing the third element gives the updated integrated model 10010.

In some embodiments, the candidate models with higher first recall rates in an integrated model may be retained. For example, integrated model A corresponds to the sequence 10110, and the first recall rates of candidate model 1 and candidate model 3 are higher than that of candidate model 4. Candidate model 1 and candidate model 3 can therefore be retained, and the second, fourth or fifth element of the sequence is changed to obtain a new integrated model.
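Replacing an element, and doing so while retaining the higher-recall candidate models, can be sketched as follows. The helper names `flip` and `mutate_keeping` are hypothetical, and choosing the changed position uniformly at random is an assumption; the text does not say how the position is picked.

```python
import random
from typing import Iterable, Optional


def flip(sequence: str, position: int) -> str:
    """Replace the element at a 1-based position: 1 becomes 0 and 0 becomes 1."""
    i = position - 1
    return sequence[:i] + ("0" if sequence[i] == "1" else "1") + sequence[i + 1:]


def mutate_keeping(sequence: str, retained: Iterable[int],
                   rng: Optional[random.Random] = None) -> str:
    """Flip one randomly chosen element whose 1-based position is not in `retained`."""
    rng = rng or random.Random()
    protected = set(retained)
    free = [p for p in range(1, len(sequence) + 1) if p not in protected]
    if not free:
        return sequence  # every position is retained, so nothing may change
    return flip(sequence, rng.choice(free))


replaced = flip("10110", position=3)
# Retain candidate models 1 and 3; only element 2, 4 or 5 may change.
mutated = mutate_keeping("10110", retained=[1, 3], rng=random.Random(0))
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when comparing the second recall rates of successive updates.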

In some embodiments, updating the integration model may include at least one of:

replacing the candidate model with the lowest first recall rate in the integrated model with another candidate model;

deleting the candidate model with the lowest first recall rate from the integrated model;

adding another candidate model to the integrated model;

exchanging any candidate models between two integrated models.

For example, suppose there are 5 candidate models in total, namely candidate model 1, candidate model 2, candidate model 3, candidate model 4 and candidate model 5, and integrated model A includes candidate model 1, candidate model 3 and candidate model 5, of which candidate model 5 has the lowest first recall rate.

Candidate model 5 may be replaced with candidate model 4 and/or candidate model 2 to obtain an updated integrated model.

Alternatively, candidate model 5 may be deleted from integrated model A, or candidate model 2 and/or candidate model 4 may be added to integrated model A, to obtain an updated integrated model.

Alternatively, if integrated model B includes candidate model 2, candidate model 4 and candidate model 5, candidate model 1 in integrated model A and candidate model 2 in integrated model B may be exchanged to obtain updated integrated model C, which includes candidate model 2, candidate model 3 and candidate model 5, and updated integrated model D, which includes candidate model 1, candidate model 4 and candidate model 5.

Therefore, the candidate models in the integrated model are subjected to operations such as exchanging, deleting and adding, so that the diversity of the integrated model can be increased, and the accuracy of the generated time sequence data anomaly detection model on anomaly data detection is improved.
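The four update operations and the worked example above can be sketched with sets of candidate indices. This is an illustrative sketch: the function names and the first recall values are hypothetical, and an integrated model is represented as a `frozenset` purely for convenience.

```python
from typing import Dict, FrozenSet, Tuple

Ensemble = FrozenSet[int]


def lowest(ensemble: Ensemble, first_recall: Dict[int, float]) -> int:
    """Candidate model in the integrated model with the lowest first recall rate."""
    return min(ensemble, key=lambda m: first_recall[m])


def replace_lowest(ensemble: Ensemble, first_recall: Dict[int, float],
                   new: int) -> Ensemble:
    """Replace the lowest-recall candidate model with another candidate model."""
    return (ensemble - {lowest(ensemble, first_recall)}) | {new}


def delete_lowest(ensemble: Ensemble, first_recall: Dict[int, float]) -> Ensemble:
    """Delete the lowest-recall candidate model."""
    return ensemble - {lowest(ensemble, first_recall)}


def add_model(ensemble: Ensemble, new: int) -> Ensemble:
    """Add another candidate model."""
    return ensemble | {new}


def swap(a: Ensemble, b: Ensemble, from_a: int, from_b: int) -> Tuple[Ensemble, Ensemble]:
    """Exchange one candidate model of `a` with one candidate model of `b`."""
    return (a - {from_a}) | {from_b}, (b - {from_b}) | {from_a}


# Hypothetical first recall rates; candidate model 5 is the lowest.
first_recall = {1: 0.9, 2: 0.7, 3: 0.8, 4: 0.6, 5: 0.5}
model_a = frozenset({1, 3, 5})
model_b = frozenset({2, 4, 5})

replaced = replace_lowest(model_a, first_recall, new=4)
deleted = delete_lowest(model_a, first_recall)
added = add_model(model_a, new=2)
model_c, model_d = swap(model_a, model_b, from_a=1, from_b=2)
```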

In some embodiments, only integrated models whose second recall rate is higher than a preset threshold are updated, which reduces the amount of computation and thereby improves the efficiency of generating the time sequence data anomaly detection model.

S307: and continuing to update the updated integrated model until a preset iteration stop condition is reached, and determining the candidate model in the integrated model with the highest second recall rate as the target model.

In some embodiments, the iteration stop condition may be any of the following: the iteration times reach the preset times; the difference between the maximum second recall rate in the nth iteration result and the maximum second recall rate of each iteration result in the previous m iteration results is smaller than a first threshold, wherein m is a positive integer, and n is a positive integer larger than m.

In some embodiments, the preset number of times may be determined based on the number of candidate models. The larger the number of candidate models, the larger the preset number of times.

For example, if the preset number of times is 50, 50 iterative updates may be performed to obtain the integrated model with the highest second recall rate in the 50 iterative updates.

For example, if the value of m is 5, updating is stopped and the integrated model with the highest second recall rate is acquired when the difference between the maximum second recall rate in the nth iteration result and the maximum second recall rate in each of the (n-1)th, (n-2)th, (n-3)th, (n-4)th and (n-5)th iteration results is smaller than the first threshold.

It can be understood that if the maximum second recall rate does not change significantly over several consecutive iteration results, continuing the iterative update is unlikely to increase the maximum second recall rate noticeably, so the iterative update can be stopped to avoid wasting computing resources.

In the embodiment of the disclosure, by setting the iteration stop condition, the iteration times can be controlled, iteration update can be stopped in time, the consumption of computing resources is reduced, and the efficiency of generating the time sequence data abnormal detection model is improved.
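A minimal sketch of the two iteration stop conditions described above is given below. The function name `should_stop`, its parameter names, and the use of the absolute difference are assumptions made for illustration; the disclosure only states that the difference is smaller than the first threshold.

```python
def should_stop(best_recalls, max_iterations, m, first_threshold):
    """Decide whether to stop the iterative update.

    best_recalls[i] is assumed to hold the maximum second recall rate
    observed in iteration i + 1, so len(best_recalls) plays the role of n.
    """
    n = len(best_recalls)
    # Condition 1: the number of iterations reaches the preset number.
    if n >= max_iterations:
        return True
    # Condition 2 only applies when n is larger than m.
    if n <= m:
        return False
    current = best_recalls[-1]
    previous = best_recalls[-(m + 1):-1]  # the m results before the nth one
    # Assumption: "difference" is taken as the absolute difference.
    return all(abs(current - r) < first_threshold for r in previous)


history = [0.60, 0.70, 0.80, 0.801, 0.802, 0.801, 0.802, 0.803]
stop_now = should_stop(history, max_iterations=50, m=5, first_threshold=0.01)
stop_earlier = should_stop(history[:7], max_iterations=50, m=5, first_threshold=0.01)
```

With this history, `stop_now` is true because the last five earlier maxima are all within 0.01 of the latest one, while `stop_earlier` is false because the window still contains 0.70.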

S308: Generating a time sequence data anomaly detection model according to the target model.

In the embodiment of the disclosure, after determining first time sequence data corresponding to sample time sequence data reconstructed by each candidate model and first recall rate corresponding to each candidate model, determining a plurality of integrated models, fusing the first time sequence data reconstructed by the candidate models in each integrated model based on the first recall rate to obtain second time sequence data corresponding to each integrated model, further determining second recall rate corresponding to each integrated model according to the second time sequence data and a verification data set, updating the integrated models, determining second recall rate corresponding to the updated integrated models, updating the updated integrated models again until a preset iteration stop condition is reached, determining the candidate model in the integrated model with the highest second recall rate as a target model, and finally generating a time sequence data anomaly detection model according to the target model. Therefore, after the second recall rate corresponding to each integrated model is determined, iterative updating can be carried out on the integrated models, and the second recall rate corresponding to the updated integrated models is determined, so that the integrated model with the highest second recall rate can be further obtained, a time sequence data anomaly detection model is generated, and the accuracy and reliability of the time sequence data anomaly detection model for predicting anomaly data are further improved.

FIG. 4 is a flow chart of a method for detecting anomalies in time series data according to yet another embodiment of the present disclosure.

As shown in fig. 4, the time series data abnormality detection method includes:

S401: Acquiring time sequence data to be detected.

The time series data to be detected may be time series data to be subjected to anomaly detection.

S402: Inputting the time sequence data to be detected into a time sequence data anomaly detection model to obtain target reconstruction time sequence data corresponding to the time sequence data to be detected.

The time series data abnormality detection model is generated based on a generation method of the time series data abnormality detection model in other embodiments of the present disclosure.

In some embodiments, in the case that the time series data anomaly detection model includes a plurality of target models, time series data to be detected may be input into each target model to obtain initial reconstruction time series data output by each target model, and then the plurality of initial reconstruction time series data are fused to obtain target reconstruction time series data. Therefore, the initial reconstruction time sequence data output by a plurality of target models can be combined, and the deviation of a single model is reduced, so that more reliable and accurate target reconstruction time sequence data can be obtained.

In some embodiments, a target weight corresponding to each target model is obtained, and then, based on the target weight corresponding to each target model, a plurality of initial reconstruction timing data are fused to obtain target reconstruction timing data. Therefore, based on the target weight corresponding to the target model, a plurality of initial reconstruction time sequence data can be fused, and the reliability and accuracy of the determined target reconstruction time sequence data are further improved.

In some embodiments, the same weight value may be set for each target model, and then the multiple initial reconstruction time series data are fused in an average manner to obtain the target reconstruction time series data.

S403: Determining abnormal data in the time sequence data to be detected according to the difference between the time sequence data to be detected and the target reconstruction time sequence data.

Specifically, a difference value between each data in the target reconstruction time sequence data and the corresponding data in the time sequence data to be detected is determined, and if the difference value corresponding to any data in the target reconstruction time sequence data is larger than a difference threshold value, the corresponding data in the time sequence data to be detected is determined to be abnormal data.

Wherein, the distance or the square of the distance between each data in the target reconstruction time series data and the corresponding data in the time series data to be detected can be determined as the corresponding difference value. The present disclosure is not limited in this regard.
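Steps S402 and S403 can be sketched as follows for a univariate series. This is a hedged illustration rather than the disclosed implementation: the function names `fuse` and `detect_anomalies`, the use of the absolute difference as the distance, and the sample values are all assumptions.

```python
def fuse(reconstructions, weights=None):
    """Point-wise weighted fusion of several initial reconstructions.

    When no target weights are supplied, every target model gets the same
    weight, which amounts to a plain average.
    """
    if weights is None:
        weights = [1.0 / len(reconstructions)] * len(reconstructions)
    length = len(reconstructions[0])
    return [
        sum(w * rec[t] for w, rec in zip(weights, reconstructions))
        for t in range(length)
    ]


def detect_anomalies(series, reconstruction, diff_threshold, squared=False):
    """Flag each point whose reconstruction difference exceeds the threshold."""
    flags = []
    for x, x_hat in zip(series, reconstruction):
        diff = abs(x - x_hat)  # assumption: distance = absolute difference
        if squared:
            diff = diff ** 2   # optional squared distance
        flags.append(diff > diff_threshold)
    return flags


# Hypothetical data: the third point is poorly reconstructed by both models.
series = [1.0, 1.0, 5.0, 1.0]
recon_a = [1.0, 1.2, 1.0, 0.8]
recon_b = [1.0, 0.8, 2.0, 1.2]
target = fuse([recon_a, recon_b], weights=[0.5, 0.5])
flags = detect_anomalies(series, target, diff_threshold=1.0)
```

In this example only the third point is flagged, since its fused reconstruction (1.5) differs from the observed value (5.0) by more than the difference threshold.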

In the embodiment of the disclosure, time sequence data to be detected is acquired; the time sequence data to be detected is input into the time sequence data anomaly detection model to obtain target reconstruction time sequence data corresponding to the time sequence data to be detected; and finally, abnormal data in the time sequence data to be detected is determined according to the difference between the time sequence data to be detected and the target reconstruction time sequence data. Thus, the abnormal data in the time sequence data to be detected can be accurately determined by the pre-generated time sequence data anomaly detection model.

As shown in fig. 5, the apparatus 500 for generating a time series data abnormality detection model includes:

An obtaining module 501, configured to obtain a plurality of candidate models and a verification data set, where each candidate model is used for reconstructing time sequence data, the verification data set includes sample time sequence data and a tag, and the tag is used for describing whether each data in the sample time sequence data is abnormal;

A determining module 502, configured to determine first time sequence data corresponding to the reconstructed sample time sequence data of each candidate model, and a first recall rate corresponding to each candidate model;

A selection module 503, configured to select a target model from a plurality of candidate models according to the first time-series data, the first recall and the verification data set;

a generating module 504, configured to generate a time-series data anomaly detection model according to the target model.

In some embodiments of the present disclosure, the selecting module 503 is configured to:

determining a plurality of integrated models, wherein each integrated model is integrated by at least one candidate model;

Based on the first recall rate, fusing the first time sequence data reconstructed by the candidate models in each integrated model to obtain second time sequence data corresponding to each integrated model;

determining a second recall rate corresponding to each integrated model according to the second time sequence data and the verification data set;

and determining the candidate model in the integrated model with the highest second recall rate as the target model.

In some embodiments of the present disclosure, the selecting module 503 is configured to:

adding the first recall rates corresponding to each candidate model in the integrated model to obtain a first numerical value;

Determining the ratio of the first recall rate corresponding to each candidate model in the integrated model to the first value as a first weight corresponding to each candidate model in the integrated model;

and fusing the reconstructed first time sequence data of each candidate model in the integrated model based on the first weight corresponding to each candidate model in the integrated model so as to acquire second time sequence data corresponding to the integrated model.
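The recall-based weighting described above can be sketched as follows. The helper names `first_weights` and `fuse_first_series` and the sample numbers are assumptions for illustration only.

```python
def first_weights(first_recalls):
    """Divide each candidate's first recall rate by their sum (the first value)."""
    first_value = sum(first_recalls)
    if first_value == 0:
        # Not covered by the text; guarded here to avoid division by zero.
        raise ValueError("sum of first recall rates is zero")
    return [r / first_value for r in first_recalls]


def fuse_first_series(first_series, first_recalls):
    """Fuse the candidates' reconstructions into the second time series."""
    weights = first_weights(first_recalls)
    length = len(first_series[0])
    return [
        sum(w * s[t] for w, s in zip(weights, first_series))
        for t in range(length)
    ]


# Hypothetical integrated model with three candidates.
recalls = [0.8, 0.6, 0.6]
weights = first_weights(recalls)  # approximately [0.4, 0.3, 0.3]
second_series = fuse_first_series(
    [[10.0, 20.0], [20.0, 20.0], [30.0, 20.0]], recalls
)
```

A candidate model with a higher first recall rate therefore contributes more to the fused second time sequence data, and the weights always sum to one.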

In some embodiments of the present disclosure, the method further includes an update module configured to:

Updating the integrated model, and determining a second recall rate corresponding to the updated integrated model;

and updating the updated integrated model until a preset iteration stop condition is reached, and determining a candidate model in the integrated model with the highest second recall rate as a target model.

In some embodiments of the present disclosure, wherein the updating module is configured to:

Based on the genetic algorithm, the integrated model is updated.

Adding other candidate models into the integrated model;

exchanging any candidate model between two integrated models.

In some embodiments of the present disclosure, wherein the iteration stop condition is any one of:

The iteration times reach the preset times;

the difference between the maximum second recall rate in the nth iteration result and the maximum second recall rate of each iteration result in the previous m iteration results is smaller than a first threshold, wherein m is a positive integer, and n is a positive integer larger than m.

In some embodiments of the present disclosure, the generating module 504 is configured to:

under the condition that the number of the target models is a plurality of, adding the first recall rates corresponding to the target models to obtain a second value;

Determining the ratio of the first recall rate to the second value corresponding to the target model as a target weight corresponding to the target model;

Integrating the target model into a time sequence data abnormity detection model, wherein the time sequence data abnormity detection model comprises target weights corresponding to the target model.

In some embodiments of the present disclosure, the determining module 502 is configured to:

inputting the sample time sequence data into each candidate model to obtain first time sequence data reconstructed by each candidate model;

determining a difference value between each data in the first time sequence data and the corresponding data in the sample time sequence data;

determining any one data as predicted abnormal data under the condition that a difference value corresponding to any one data in the first time sequence data is larger than a difference threshold value;

Determining a first total number of predicted abnormal data and a second total number of tag-labeled abnormal data;

The ratio of the first total number to the second total number is determined as a first recall.
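The first recall rate computation described above can be sketched as follows. The function name, the absolute difference as the difference value, and the sample data are assumptions. The sketch follows the text literally by dividing the number of predicted abnormal data by the number of labeled abnormal data; a conventional recall would count only the predicted abnormal data that are also labeled abnormal, and the two can differ when false alarms occur.

```python
def first_recall(sample, reconstructed, labels, diff_threshold):
    """Ratio of predicted abnormal points to labeled abnormal points."""
    # A point is predicted abnormal when its reconstruction difference
    # exceeds the difference threshold.
    predicted = [
        abs(x - x_hat) > diff_threshold
        for x, x_hat in zip(sample, reconstructed)
    ]
    first_total = sum(predicted)                 # predicted abnormal data
    second_total = sum(1 for y in labels if y)   # labeled abnormal data
    if second_total == 0:
        # Not covered by the text; guarded here to avoid division by zero.
        raise ValueError("verification data contains no labeled anomalies")
    return first_total / second_total


# Hypothetical verification data: two labeled anomalies, one of them detected.
sample = [1.0, 1.0, 9.0, 1.0, 8.0]
reconstructed = [1.0, 1.0, 1.0, 1.0, 7.5]
labels = [0, 0, 1, 0, 1]
recall = first_recall(sample, reconstructed, labels, diff_threshold=1.0)
```

In this example only the third point exceeds the threshold, so the first total number is 1, the second total number is 2, and `recall` is 0.5.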

In some embodiments of the present disclosure, the obtaining module 501 is configured to:

Acquiring a training data set and a plurality of initial models, wherein the training data set comprises third time sequence data, and each data in the third time sequence data is normal data;

Each initial model is trained based on the training dataset to obtain a plurality of candidate models.

It should be noted that the explanation of the method for generating the time series data anomaly detection model is also applicable to the apparatus for generating the time series data anomaly detection model in this embodiment, and will not be repeated here.

As shown in fig. 6, the time series data abnormality detection apparatus 600 includes:

a first obtaining module 601, configured to obtain time sequence data to be detected;

A second obtaining module 602, configured to input the timing data to be detected into a timing data anomaly detection model to obtain target reconstruction timing data corresponding to the timing data to be detected, where the timing data anomaly detection model is generated based on a generating device of the timing data anomaly detection model;

The determining module 603 is configured to determine abnormal data in the time series data to be detected according to a difference between the time series data to be detected and the target reconstruction time series data.

In some embodiments of the present disclosure, the second obtaining module 602 is configured to:

under the condition that the time sequence data abnormality detection model comprises a plurality of target models, inputting time sequence data to be detected into each target model so as to acquire initial reconstruction time sequence data output by each target model;

and fusing the plurality of initial reconstruction time sequence data to obtain target reconstruction time sequence data.

In some embodiments of the present disclosure, the second obtaining module 602 is configured to:

obtaining a target weight corresponding to each target model;

And fusing the plurality of initial reconstruction time sequence data based on the target weight corresponding to each target model so as to acquire target reconstruction time sequence data.

It should be noted that the explanation of the method for detecting abnormal time series data is also applicable to the apparatus for detecting abnormal time series data in this embodiment, and is not repeated here.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.

Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the method for generating a time series data anomaly detection model or the time series data anomaly detection method. For example, in some embodiments, the method for generating the time series data anomaly detection model, or the time series data anomaly detection method, may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the above-described method for generating the time series data anomaly detection model, or of the time series data anomaly detection method, may be executed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method for generating the time series data anomaly detection model, or the time series data anomaly detection method, in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means at least two, such as two, three, etc., unless explicitly specified otherwise. In the description of the present disclosure, the word "if" as used may be interpreted as "when", "upon", "in response to a determination" or "in the case of".

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A method for generating a time sequence data abnormality detection model comprises the following steps:

2. The method of claim 1, wherein the selecting a target model from a plurality of the candidate models based on the first timing data, the first recall, and the validation data set comprises:

And determining a candidate model in the integrated model with the highest second recall rate as the target model.

3. The method of claim 2, wherein the fusing the first time series data reconstructed from the candidate models in each integrated model based on the first recall rate to obtain the second time series data corresponding to each integrated model comprises:

And based on the first weight corresponding to each candidate model in the integrated model, fusing the first time sequence data reconstructed by each candidate model in the integrated model to acquire the second time sequence data corresponding to the integrated model.

4. The method of claim 2, wherein after determining the second recall corresponding to each of the integrated models from the second time series data and the validation data set, further comprising:

5. The method of claim 4, wherein the updating the integration model comprises:

Updating the integrated model based on a genetic algorithm.

6. The method of claim 4, wherein the updating the integration model comprises at least one of:

Adding other candidate models in the integrated model;

exchanging any candidate model between two integrated models.

7. The method of claim 4, wherein the iteration stop condition is any one of:

The iteration times reach the preset times;

8. The method of any of claims 1-7, wherein the generating a time series data anomaly detection model from the target model comprises:

Adding the first recall rates corresponding to the target models to obtain a second value under the condition that the number of the target models is a plurality of;

Determining the ratio of the first recall rate corresponding to the target model to the second value as a target weight corresponding to the target model;

And integrating the target model into the time sequence data abnormality detection model, wherein the time sequence data abnormality detection model comprises target weights corresponding to the target model.

9. The method of any of claims 1-7, wherein the determining the first time series data corresponding to the sample time series data reconstructed for each candidate model and the first recall corresponding to each candidate model comprises:

inputting the sample time sequence data into each candidate model to acquire the first time sequence data reconstructed by each candidate model;

determining a difference value between each of the first time series data and corresponding data in the sample time series data;

determining any one data as predicted abnormal data under the condition that a difference value corresponding to the any one data in the first time sequence data is larger than a difference threshold value;

determining a first total number of the predicted abnormal data and a second total number of the tag-labeled abnormal data;

A ratio of the first total number to the second total number is determined as the first recall.

10. The method of any of claims 1-7, wherein the obtaining a plurality of candidate models comprises:

training each of the initial models based on the training data set to obtain a plurality of candidate models.

11. A time series data anomaly detection method, comprising:

Acquiring time sequence data to be detected;

Inputting the time sequence data to be detected into a time sequence data abnormity detection model to obtain target reconstruction time sequence data corresponding to the time sequence data to be detected, wherein the time sequence data abnormity detection model is generated based on the method of any one of claims 1-10;

12. The method of claim 11, wherein the inputting the timing data to be detected into the timing data anomaly detection model to obtain the target reconstructed timing data corresponding to the timing data to be detected comprises:

Inputting the time sequence data to be detected into each target model under the condition that the time sequence data abnormality detection model comprises a plurality of target models so as to acquire initial reconstruction time sequence data output by each target model;

And fusing the plurality of initial reconstruction time sequence data to acquire the target reconstruction time sequence data.

13. The method of claim 12, wherein the fusing the plurality of initial reconstruction timing data to obtain the target reconstruction timing data comprises:

obtaining a target weight corresponding to each target model;

and fusing the plurality of initial reconstruction time sequence data based on the target weight corresponding to each target model so as to acquire the target reconstruction time sequence data.

14. A time series data abnormality detection model generation device includes:

15. The apparatus of claim 14, wherein the selection module is configured to:

16. The apparatus of claim 15, wherein the selection module is configured to:

17. The apparatus of claim 16, further comprising an update module to:

18. The apparatus of claim 17, wherein the update module is configured to:

Updating the integrated model based on a genetic algorithm.

19. The apparatus of claim 17, wherein the update module is configured to:

Adding other candidate models in the integrated model;

exchanging any candidate model between two integrated models.

20. The apparatus of claim 17, wherein the iteration stop condition is any one of:

The iteration times reach the preset times;

21. The apparatus of any of claims 14-20, wherein the generating module is configured to:

22. The apparatus of any one of claims 14-20, wherein the determining module is configured to:

23. The apparatus of any of claims 14-20, wherein the acquisition module is configured to:

24. A time series data anomaly detection device, comprising:

25. The apparatus of claim 24, wherein the second acquisition module is configured to:

26. The apparatus of claim 25, wherein the second acquisition module is configured to:

obtaining a target weight corresponding to each target model;

27. An electronic device, comprising:

at least one processor; and

A memory communicatively coupled to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10 or to perform the method of any one of claims 11-13.

28. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10 or to be capable of performing the method of any one of claims 11-13.

29. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 10 or are capable of performing the steps of the method of any one of claims 11 to 13.