CN115409066A

CN115409066A - Time series data anomaly detection method, device and computer storage medium

Info

Publication number: CN115409066A
Application number: CN202211060141.3A
Authority: CN
Inventors: 贾翠玲; 尹将伯; 郝金龙; 刘梓田; 程红星; 梁子寒; 杨洋
Original assignee: Beijing China Power Information Technology Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2022-11-29

Abstract

The method, device and computer storage medium decompose time series monitoring data and extract its high-frequency and low-frequency components. Each component is analyzed and predicted with a model suited to its characteristics, and the prediction results for the high-frequency and low-frequency components are superimposed to obtain a predicted value of the data in a future time sequence. Abnormal point data in the real data in that future time sequence is then identified on the basis of the predicted value. In this way, time series data can be checked in real time and abnormal point data found promptly, so that maintenance personnel are reminded to take countermeasures as early as possible and irreparable losses are avoided.

Description

Time series data anomaly detection method, device and computer storage medium

Technical Field

The present application belongs to the technical field of big data processing, and in particular relates to an anomaly detection method, a device and a computer storage medium for time series data.

Background

With the continuous development of the Internet and artificial intelligence, the world is in an era of information explosion, and big data is receiving more and more attention. As informatization advances, data mining and analysis help people understand the value and patterns behind data. Many fields are therefore developing information technologies such as big data and data mining, and promote their own development by uncovering the important information hidden in data.

Time series data is one kind of big data and is widely used in fields such as medicine, finance and industry. How to extract valuable and regular information from it has become a current research hotspot. Anomaly detection for time series data belongs to the field of time series data mining. Its role is to identify abnormal data in time series data so that monitoring problems such as equipment failures, medical conditions and illegal network intrusions can be resolved in time. As business systems grow, business combinations become more complex, and the volume and dimensionality of time series data keep increasing, purely manual detection can no longer meet the growing monitoring demand.

Summary of the Invention

In view of this, the present application provides an anomaly detection method, a device and a computer storage medium for time series data, so as to detect anomalies in time series data efficiently and accurately and to solve the problems of manual detection, which is time-consuming, labor-intensive and error-prone.

The specific solution is as follows:

An anomaly detection method for time series data, comprising:

acquiring time series monitoring data;

extracting a high-frequency component and a low-frequency component of the time series monitoring data;

performing prediction processing based on the high-frequency component by using a first prediction model, to obtain a first prediction result of the data in a future time sequence;

performing prediction processing based on the low-frequency component by using a second prediction model, to obtain a second prediction result of the data in the future time sequence;

superimposing the first prediction result and the second prediction result to obtain a target prediction result of the data in the future time sequence;

identifying abnormal point data in the real data in the future time sequence according to the target prediction result.
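The six steps above can be read as a small pipeline. The following is a minimal sketch of that flow, not the application's implementation: the function `detect_anomalies`, the fixed `tolerance` rule and the three toy stand-in functions are all hypothetical names introduced here for illustration, and the real method would plug in the wavelet decomposition and the two prediction models instead.

```python
from typing import Callable, List, Sequence, Tuple

Series = List[float]


def detect_anomalies(
    history: Sequence[float],
    actual_future: Sequence[float],
    split: Callable[[Sequence[float]], Tuple[Series, Series]],
    predict_high: Callable[[Series, int], Series],
    predict_low: Callable[[Series, int], Series],
    tolerance: float,
) -> List[int]:
    """Return indices of future points deviating from the forecast by more than tolerance."""
    horizon = len(actual_future)
    high, low = split(history)              # extract high/low-frequency components
    first = predict_high(high, horizon)     # first prediction result
    second = predict_low(low, horizon)      # second prediction result
    target = [a + b for a, b in zip(first, second)]  # superposition
    return [i for i, (y, p) in enumerate(zip(actual_future, target))
            if abs(y - p) > tolerance]


# Toy stand-ins (hypothetical): the mean level is "low", the remainder is "high".
def toy_split(x):
    level = sum(x) / len(x)
    return [v - level for v in x], [level] * len(x)


def toy_predict_high(high, horizon):
    # Repeat the last two samples, i.e. assume a period-2 pattern.
    return [high[len(high) - 2 + (i % 2)] for i in range(horizon)]


def toy_predict_low(low, horizon):
    return [low[-1]] * horizon


history = [10.0 + (1.0 if t % 2 else -1.0) for t in range(20)]
flags = detect_anomalies(history, [9.0, 11.0, 15.0, 11.0],
                         toy_split, toy_predict_high, toy_predict_low,
                         tolerance=2.0)
print(flags)
```

Passing the decomposition and the two predictors as callables keeps the superposition and comparison logic independent of which models are chosen.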

Optionally, extracting the high-frequency component and the low-frequency component of the time series monitoring data comprises:

performing multi-scale decomposition and reconstruction on the time series monitoring data by using wavelet analysis theory, and extracting a high-frequency periodic component, a high-frequency random component and a low-frequency trend component of the time series monitoring data;

wherein the high-frequency component includes the high-frequency periodic component and the high-frequency random component, and the low-frequency component includes the low-frequency trend component.

Optionally, performing multi-scale decomposition and reconstruction on the time series monitoring data by using wavelet analysis theory, and extracting the high-frequency periodic component, the high-frequency random component and the low-frequency trend component of the time series monitoring data, comprises:

acquiring measured deformation data of the time series monitoring data, the measured deformation data being the data obtained after outliers are cleaned from the time series monitoring data;

taking the measured deformation data as the current data to be decomposed;

decomposing, based on wavelet decomposition, the current data to be decomposed into a low-frequency part and a high-frequency part, to obtain the low-frequency component and the high-frequency component corresponding to the current data to be decomposed;

determining whether the low-frequency subsequence contained in the low-frequency component corresponding to the current data to be decomposed has a preset change trend feature;

if so, ending the decomposition of the data;

if not, updating the data to be decomposed to the low-frequency component corresponding to the current data to be decomposed, and returning to the step of decomposing the current data to be decomposed into a low-frequency part and a high-frequency part, until the low-frequency subsequence contained in the low-frequency component corresponding to the current data to be decomposed has the change trend feature, at which point the decomposition of the data ends; during the decomposition, only the low-frequency component is decomposed at each layer, and the high-frequency component is not processed;

extracting the low-frequency component of the last layer of the decomposition result as the low-frequency trend component, and extracting the high-frequency periodic component and the high-frequency random component from the high-frequency components of the respective layers.

Optionally, performing prediction processing based on the high-frequency component by using the first prediction model, to obtain the first prediction result of the data in the future time sequence, comprises:

performing denoising processing on the high-frequency random component;

inputting the high-frequency periodic component and the denoised high-frequency random component into a long short-term memory network model, the long short-term memory network model performing prediction processing based on the high-frequency periodic component and the denoised high-frequency random component to obtain the first prediction result of the data in the future time sequence.

Optionally, performing denoising processing on the high-frequency random component comprises:

performing denoising processing on the high-frequency random component based on a preset threshold;

wherein the threshold is defined in terms of σ₁ = MAD/0.6745 and N1, where MAD denotes the median of the absolute values of the first-layer wavelet decomposition coefficients, 0.6745 is the adjustment coefficient for the standard deviation of Gaussian noise, and N1 denotes the size or length of the high-frequency random component signal.

Optionally, the long short-term memory network model includes a plurality of sequentially connected memory units, and each memory unit includes a forget gate, an input gate and an output gate;

wherein the current memory unit obtains its output information by fusing the feature information passed on by the previous memory unit with the high-frequency component input data of the current memory unit; during this fusion, the current memory unit selectively memorizes and forgets the input high-frequency component in order to process the feature information passed down from the previous memory unit;

the sequentially connected memory units each process their own high-frequency component input together with the feature information passed on by the previous memory unit, thereby implementing the model's prediction processing based on the high-frequency component;

the feature information passed on by the previous memory unit includes the state value and the output value of the previous memory unit.
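To make the gate description concrete, here is a minimal sketch of one memory unit using the standard LSTM update equations with scalar input and state. The application does not give its weights or dimensions, so the `params` layout, the function names and the scalar simplification are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One memory-unit step for scalar input x_t.

    h_prev and c_prev are the output value and state value handed over by
    the previous unit. params maps a gate name to (w_x, w_h, bias).
    """
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return activation(w_x * x_t + w_h * h_prev + bias)

    f_t = gate("forget", sigmoid)        # how much of the old state to keep
    i_t = gate("input", sigmoid)         # how much new information to store
    o_t = gate("output", sigmoid)        # how much of the state to expose
    g_t = gate("candidate", math.tanh)   # candidate new information
    c_t = f_t * c_prev + i_t * g_t       # selective forgetting and memorizing
    h_t = o_t * math.tanh(c_t)           # output value of this unit
    return h_t, c_t


def run_cells(inputs, params, h0=0.0, c0=0.0):
    """Chain the units: each step receives the previous (h, c) pair."""
    h, c = h0, c0
    outputs = []
    for x_t in inputs:
        h, c = lstm_cell_step(x_t, h, c, params)
        outputs.append(h)
    return outputs, (h, c)


zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("forget", "input", "output", "candidate")}
h1, c1 = lstm_cell_step(0.3, 0.0, 2.0, zero_params)
```

With all weights at zero every gate evaluates to 0.5 and the candidate to 0, so the state is simply halved; trained weights would instead decide per step what to forget and what to store.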

Optionally, performing prediction processing based on the low-frequency component by using the second prediction model, to obtain the second prediction result of the data in the future time sequence, comprises:

inputting the low-frequency trend component into an autoregressive integrated moving average (ARIMA) model, the ARIMA model performing prediction processing based on the low-frequency trend component to obtain the second prediction result of the data in the future time sequence.
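As a rough illustration of how an ARIMA-type model forecasts a trend series, the sketch below hand-rolls the simplest non-trivial case, ARIMA(1,1,0): difference once, fit an AR(1) model to the differences by least squares, forecast the differences, then integrate back. This is a deliberately reduced stand-in; the application does not state its model orders, and the function names here are hypothetical. A production system would more likely rely on a statistics library.

```python
def difference(series):
    """First-order differencing (the 'I' step with d = 1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """Least-squares fit of d_t = c + phi * d_{t-1}; needs at least 2 differences."""
    x, y = diffs[:-1], diffs[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        # Constant differences: the series is a pure linear drift.
        return mean_y, 0.0
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - phi * mean_x, phi


def forecast_arima_110(series, horizon):
    """Forecast `horizon` future values of a trend series."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    last_value, last_diff = series[-1], diffs[-1]
    forecast = []
    for _ in range(horizon):
        last_diff = c + phi * last_diff   # predict the next difference
        last_value = last_value + last_diff  # integrate back to the level
        forecast.append(last_value)
    return forecast


trend = [2 * t + 1 for t in range(10)]    # 1, 3, ..., 19
trend_forecast = forecast_arima_110(trend, 3)
```

Differencing is what lets the autoregressive part work on a stationary series even though the low-frequency trend component itself is not stationary.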

Optionally, identifying abnormal point data in the real data in the future time sequence according to the target prediction result comprises:

determining deviation information between the real data in the future time sequence and the target prediction result of the data in the future time sequence;

identifying abnormal point data in the real data in the future time sequence according to the deviation information.
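The text leaves open how the deviation information is turned into a decision. One common choice, assumed here purely for illustration, is to flag points whose residual lies more than `k` robust standard deviations from the median residual; the function name `flag_outliers`, the parameter `k` and the MAD-based scale are not taken from the application.

```python
import statistics


def flag_outliers(actual, predicted, k=3.0):
    """Return indices where the deviation from the forecast is unusually large.

    Assumed rule: |residual - median| > k * (MAD / 0.6745), where MAD is the
    median absolute deviation of the residuals.
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = mad / 0.6745          # robust estimate of the residual spread
    return [i for i, r in enumerate(residuals)
            if abs(r - center) > k * scale]


actual_values = [10.1, 9.9, 10.0, 10.2, 14.0, 9.8]
predicted_values = [10.0] * 6
outlier_indices = flag_outliers(actual_values, predicted_values)
print(outlier_indices)
```

A median-based scale is used so that the outlier itself does not inflate the spread estimate and mask its own detection.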

An anomaly detection device for time series data, comprising:

an acquisition module, configured to acquire time series monitoring data;

an extraction module, configured to extract a high-frequency component and a low-frequency component of the time series monitoring data;

a first prediction processing module, configured to perform prediction processing based on the high-frequency component by using a first prediction model, to obtain a first prediction result of the data in a future time sequence;

a second prediction processing module, configured to perform prediction processing based on the low-frequency component by using a second prediction model, to obtain a second prediction result of the data in the future time sequence;

a superposition module, configured to superimpose the first prediction result and the second prediction result to obtain a target prediction result of the data in the future time sequence;

an abnormal point identification module, configured to identify abnormal point data in the real data in the future time sequence according to the target prediction result.

A computer-readable medium on which a computer program is stored, the computer program including program code for executing the anomaly detection method for time series data described in any one of the above.

A computer program product, comprising a computer program carried on a non-transitory computer-readable medium, the computer program including program code for executing the anomaly detection method for time series data described in any one of the above.

In summary, the anomaly detection method, device and computer storage medium for time series data provided by the present application decompose the time series monitoring data and extract its high-frequency and low-frequency components, analyze and predict each component with a model suited to its characteristics, and superimpose the corresponding prediction results to obtain the predicted value of the data in the future time sequence. Abnormal point data in the real data in the future time sequence is then identified on the basis of the predicted value. Time series data can thus be checked in real time and abnormal point data found promptly, so that maintenance personnel are reminded to take countermeasures as early as possible and irreparable losses are avoided.

Brief Description of the Drawings

The above and other features, advantages and aspects of the embodiments of the present application will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

Fig. 1 is a flow chart of the anomaly detection method for time series data provided by the present application;

Fig. 2 is a wavelet decomposition structure diagram provided by the present application;

Fig. 3 is a cell structure diagram of the LSTM model provided by the present application;

Fig. 4 is a flow chart of time series data anomaly detection based on the WA-LSTM-ARIMA model provided by the present application;

Fig. 5 is a schematic diagram of the original data set used for model training and its data distribution provided by the present application;

Fig. 6 is an exemplary residual plot of real values against predicted values provided by the present application;

Fig. 7 is a structural diagram of the anomaly detection device for time series data provided by the present application.

Detailed Description

Embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present application are shown in the drawings, it should be understood that the present application may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present application. It should be understood that the drawings and embodiments of the present application are for exemplary purposes only and are not intended to limit the scope of protection of the present application.

As used herein, the term "comprise" and its variations are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present application are only used to distinguish different devices, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.

It should be noted that the modifiers "one" and "a plurality of" mentioned in the present application are illustrative rather than restrictive. Those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".

Time series data is one kind of big data and is defined as a sequence of statistics or observations ordered by time. Time series are widely used in fields such as medicine, finance and industry. The volume and dimensionality of time series data keep growing, and how to extract valuable and regular information from it has become a current research hotspot. Time series anomaly detection belongs to the field of time series data mining and is defined as the process of identifying abnormal events or behaviors in an otherwise normal time series; its role is to pick out the abnormal data in the series. Abnormal data is sometimes more valuable than normal data and can provide a great deal of useful information. Effective anomaly detection is widely used in many real-world fields, such as quantitative trading, network security detection, self-driving cars and the routine maintenance of large industrial equipment. In particular, because time series data is cheap to visualize, clear in meaning and regular in pattern, it is often used in operation and maintenance, where it serves to monitor the running state of a system.

To solve the problems of existing manual detection, which is time-consuming, labor-intensive and error-prone, the present application provides an anomaly detection method, a device, a computer storage medium and a computer program product for time series data, so as to efficiently and accurately analyze and find data in a time series that does not conform to the normal pattern of development. Finding abnormal data makes it possible to detect equipment failures in time, effectively avoid major accidents, keep daily production and life safe, and reduce unnecessary losses of manpower and money.

Referring to the flow chart of the anomaly detection method for time series data shown in Fig. 1, the anomaly detection method for time series data provided by the present application includes the following processing flow:

Step 101: acquire time series monitoring data.

The method of the present application can be applied in many fields, such as detecting illegal intrusions in network security, detecting financial fraud in financial transactions, detecting equipment failures in industrial production, and detecting medical conditions. Taking the industrial field as an example, equipment operation data is recorded and analyzed; when the data becomes abnormal at some moment, this indicates that the equipment may have been damaged by long continuous operation or wear and needs timely maintenance, so that more serious losses caused by late discovery are avoided. Such abnormal data carries more valuable information than normal data, and finding it promptly and effectively is very important in many fields.

The time series monitoring data acquired in this step may be, but is not limited to, time series data from network security monitoring, financial transactions, equipment status monitoring, medical condition monitoring and the like, depending on actual needs.

Step 102: extract the high-frequency component and the low-frequency component of the time series monitoring data.

The present application performs multi-scale decomposition and reconstruction on the time series monitoring data (referred to as "time series data" for short) to extract its high-frequency and low-frequency components, where the high-frequency component includes a high-frequency periodic component and a high-frequency random component, and the low-frequency component includes a low-frequency trend component.

Further, optionally, the time series monitoring data is decomposed using the WA technique. WA refers to the wavelet transform, which inherits and develops the localization idea of the short-time Fourier transform while overcoming shortcomings such as a window size that does not vary with frequency.

Accordingly, the present application applies wavelet analysis theory to perform multi-scale decomposition and reconstruction on the time series monitoring data and extracts its high-frequency periodic component, high-frequency random component and low-frequency trend component. This process can be implemented as follows:

11) Acquire measured deformation data of the time series monitoring data, the measured deformation data being the data obtained after outliers are cleaned from the time series monitoring data;

12) Take the measured deformation data as the current data to be decomposed;

13) Based on wavelet decomposition, decompose the current data to be decomposed into a low-frequency part and a high-frequency part, and obtain the low-frequency component and the high-frequency component corresponding to the current data to be decomposed;

14) Determine whether the low-frequency subsequence contained in the low-frequency component corresponding to the current data to be decomposed has a preset change trend feature;

Optionally, the preset change trend feature may be set as follows: the waveform of the low-frequency time series data at the current decomposition layer fluctuates obviously and shows no periodic variation. The obvious fluctuation can be measured against a set waveform change threshold; when the fluctuation of the low-frequency waveform reaches this threshold, the waveform is regarded as fluctuating obviously.

15) If so, end the decomposition of the data;

16) If not, update the data to be decomposed to the low-frequency component corresponding to the current data to be decomposed, and return to step 13) to decompose the current data to be decomposed into a low-frequency part and a high-frequency part, until the low-frequency subsequence contained in the low-frequency component corresponding to the current data to be decomposed has the change trend feature, at which point the decomposition of the data ends;

During the decomposition, only the low-frequency component is decomposed at each layer, and the high-frequency component is not processed.

17) Extract the low-frequency component of the last layer of the decomposition result as the low-frequency trend component, and extract the high-frequency periodic component and the high-frequency random component from the high-frequency components of the respective layers.
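Steps 11) to 17) can be sketched as a loop that keeps splitting only the low-frequency part. The sketch below uses a Haar-style multiresolution split (block averages as the low-frequency part, the remainder as the high-frequency part) so that it runs without any wavelet library. The choice of the Haar wavelet, the stop test `has_trend_feature` based on counting direction reversals, and the `max_levels` safeguard are all assumptions made for illustration; the application does not name its wavelet basis or give a formula for the trend check.

```python
def block_means(series, width):
    """Low-frequency part: replace each block of `width` samples by its mean."""
    out = []
    for start in range(0, len(series), width):
        block = series[start:start + width]
        out.extend([sum(block) / len(block)] * len(block))
    return out


def has_trend_feature(low, max_turning_points):
    """Hypothetical stop test: few direction reversals means trend-like, not periodic."""
    steps = [b - a for a, b in zip(low, low[1:]) if b != a]
    turns = sum(1 for s1, s2 in zip(steps, steps[1:]) if s1 * s2 < 0)
    return turns <= max_turning_points


def decompose(series, max_turning_points=1, max_levels=10):
    """Return (high-frequency components per layer, final low-frequency component)."""
    current = list(series)            # step 12: current data to be decomposed
    highs = []
    level = 0
    while level < max_levels:
        level += 1
        low = block_means(current, 2 ** level)                   # step 13
        highs.append([c - l for c, l in zip(current, low)])      # layer detail
        current = low                 # step 16: only the low part is split again
        if has_trend_feature(current, max_turning_points):       # steps 14-15
            break
    return highs, current             # step 17: last low part is the trend


# Slow upward drift plus a fast alternating pattern.
series = [0.1 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(16)]
highs, low = decompose(series)
rebuilt = [sum(values) for values in zip(low, *highs)]
```

Because each layer's high-frequency part is defined as what the averaging removed, the components add back up to the input, which is the additive relation the description relies on when the two forecasts are later superimposed.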

For ease of understanding, further details are given below.

The applicant's research found that a time series data sequence carries information about how the environment and the acquisition process affect the data. The high-frequency data mainly shows high-frequency, periodic behavior and may be accompanied by some noise, while the low-frequency data shows clear low-frequency, trend-like behavior. The sequence is therefore decomposed at multiple scales with the WA technique to extract the high-frequency periodic component, the low-frequency trend component and the high-frequency random component.

Considering the discrete nature of the collected data, the discrete wavelet transform W_f is preferably used for data analysis. In the formula for W_f, a₀ and b₀ are real constants, j and k are integers, the wavelet basis function enters through its complex conjugate, f(t) denotes the deformation time series, and t is time.
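For reference, a standard textbook form of the discrete wavelet transform that matches the symbols defined above is given below. It is an assumption that the application uses exactly this form; the symbol ψ for the wavelet basis function is introduced here for illustration.

```latex
W_f(j,k) \;=\; a_0^{-j/2} \int_{-\infty}^{+\infty} f(t)\,
  \overline{\psi}\!\left(a_0^{-j}\, t - k\, b_0\right) \mathrm{d}t ,
\qquad j, k \in \mathbb{Z}
```

Here the scale is sampled as a₀ʲ and the translation as k·b₀·a₀ʲ; the common dyadic case takes a₀ = 2 and b₀ = 1.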

The wavelet decomposition structure is shown in Fig. 2. In Fig. 2, f₀ denotes the measured deformation data of the time series monitoring data. The measured deformation data is the data needed for detection, selected or extracted from the source data (the time series monitoring data) according to actual needs; specifically, it may be the data obtained after outliers are cleaned from the time series monitoring data, so that the cleaned data is used for training in the training stage and for prediction in the prediction stage. f₁, f₂, …, f_N1 are the low-frequency parts, and d₁, d₂, …, d_N1 are the high-frequency parts.

In each layer only the low-frequency component is decomposed, and the high-frequency component is left unprocessed. After each decomposition, the change trend of the low-frequency subsequence is examined to determine whether it satisfies the preset change trend feature. If the trend is obvious, that is, the waveform of the low-frequency time series data at the current decomposition layer fluctuates obviously and shows no periodic variation, the preset change trend feature is judged to be satisfied and decomposition stops. Otherwise, decomposition continues until an obvious trend appears. Assuming the number of decomposition layers at this point is N1, N1 high-frequency components and one low-frequency component are obtained, and their superposition gives:

f₀ = d₁ + d₂ + … + d_N1 + f_N1

That is, the low-frequency component of the last layer of the decomposition result (f_N1) is extracted as the low-frequency trend component of the time series monitoring data, and the high-frequency components of the respective layers (d₁, d₂, …, d_N1) are extracted as the high-frequency component of the time series monitoring data. The applicant's research found that the high-frequency data mainly shows high-frequency, periodic behavior and, being disturbed by factors such as working conditions and instrument errors, often contains noise. The high-frequency periodic component and the high-frequency random component can accordingly be further extracted from the high-frequency component.

The high-frequency periodic component mainly reflects the high-frequency part of the time series monitoring data, the low-frequency trend component mainly reflects its low-frequency part, and the high-frequency random component mainly reflects the influence of noise.

步骤103、利用第一预测模型基于所述高频分量进行预测处理，得到未来时序下数据的第一预测结果。Step 103 , using the first prediction model to perform prediction processing based on the high-frequency component, to obtain a first prediction result of the data in the future time series.

优选的，针对高频分量的特点，本申请实施例采用LSTM(Long Short-TermMemory，长短期记忆网络)作为高频分量的分析与预测处理模型，也就是，第一预测模型为LSTM模型。LSTM是一种时间循环神经网络，是为了解决一般的循环神经网络存在的长期依赖问题而专门设计出来的。Preferably, in view of the characteristics of high-frequency components, the embodiment of the present application adopts LSTM (Long Short-Term Memory, long-term short-term memory network) as the analysis and prediction processing model of high-frequency components, that is, the first prediction model is the LSTM model. LSTM is a time cyclic neural network, which is specially designed to solve the long-term dependence problem of general cyclic neural network.

Because of interference from factors such as operating conditions and instrument error, the data sequence of the time series monitoring data often contains noise, which degrades prediction accuracy, so the monitoring data sequence needs to be denoised. In the embodiment of the present application, since the influence of noise is reflected mainly in the high-frequency random component, denoising can be applied specifically to the extracted high-frequency random component. The key to denoising is determining the threshold λ: a threshold that is too large or too small degrades the denoising effect. This embodiment preferably determines the value of λ using the following expression:

\sigma_1 = MAD / 0.6745

where MAD denotes the median of the absolute values of the first-layer wavelet decomposition coefficients, 0.6745 is the adjustment coefficient for the standard deviation of Gaussian noise, and N1 denotes the size or length of the high-frequency random component signal (that is, the number of decomposition layers).

It has been verified that denoising with the threshold determined by the above expression achieves a good denoising effect on the high-frequency random component.
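A minimal sketch of the threshold-based denoising follows. The noise estimate σ_1 = MAD / 0.6745 is taken from the text. The expression for λ itself is not reproduced above, so the sketch assumes the widely used universal threshold λ = σ_1 · sqrt(2 ln N1); the use of soft thresholding and all function names are also assumptions.

```python
import math
from statistics import median


def noise_sigma(first_layer_detail):
    """sigma_1 = MAD / 0.6745, with MAD the median absolute first-layer coefficient."""
    mad = median(abs(c) for c in first_layer_detail)
    return mad / 0.6745


def universal_threshold(sigma, n):
    """Assumed form of lambda: sigma * sqrt(2 ln n), with n the signal length."""
    return sigma * math.sqrt(2.0 * math.log(n))


def soft_threshold(coeffs, lam):
    """Shrink each coefficient toward zero by lam; coefficients below lam become zero."""
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]


# Illustrative coefficients: mostly small noise plus two large values.
detail = [0.1, -0.2, 0.05, 3.0, -0.15, 0.1, -2.5, 0.2]
sigma = noise_sigma(detail)
lam = universal_threshold(sigma, len(detail))
denoised = soft_threshold(detail, lam)
```

With these values the small coefficients are zeroed and the two large ones are kept with reduced magnitude, which is the intended effect of choosing λ neither too large nor too small.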

After the high-frequency random component has been denoised, it is merged into the high-frequency sequence for unified processing. That is, the denoised high-frequency random component is input into the LSTM model together with the high-frequency periodic component, and the LSTM model performs prediction processing based on both, to obtain the first prediction result for the data in the future time series.

The long short-term memory network model includes a plurality of memory units connected in sequence, each memory unit including a forget gate, an input gate and an output gate.

The feature information passed on by the previous memory unit includes the state value and output value of that unit, as described in further detail below.

The LSTM model is a special form of RNN (Recurrent Neural Network). A traditional RNN model tends to forget information that lies far back in the sequence and has difficulty storing information from a long past time series; it predicts the future from only the most recent few steps, and once an error occurs it is hard to recover. The LSTM model, by contrast, controls the retention, discarding and updating of information through the input gate, forget gate and output gate, which effectively overcomes this shortcoming and greatly improves prediction accuracy.

The LSTM model consists of several memory units, referred to as cells in this embodiment. The cells of the model take part in the prediction processing based on the input data; the cell structure is shown in Figure 3, where f, i and o denote the forget gate, input gate and output gate respectively, x_t and h_t denote the input and output respectively, σ denotes the sigmoid function, and C_t denotes the state value at time t, that is, the feature information passed down after the cells before this time have selectively remembered and forgotten the input high-frequency component. Specifically, C_t is obtained by fusing the state value and output value of the previous cell with the input data of the current cell.

With reference to Figure 3, the LSTM model performs prediction processing on the high-frequency component (the high-frequency periodic component and the denoised high-frequency random component) as follows:

21) Use the forget gate to control which of the high-frequency component information obtained from the wavelet decomposition is kept, determining how much of the old information from the cell's previous state is retained:

f_t = \sigma(W_{hf} h_{t-1} + W_{xf} x_t + b_f)

where f_t is the output value of the forget gate, a value between 0 and 1; h_{t-1} and W_{hf} are the output of the previous time step and its weight; x_t and W_{xf} are the input of the current time step and its weight; and b_f is the bias term.

22) Solve for the candidate value \tilde{C}_t of the update information and the output value i_t of the update information, to determine the information the cell needs to update, that is, the data selected for memorization by applying the input gate operation to the previous cell's output value and the current cell's input value:

\tilde{C}_t = \tanh(W_{hc} h_{t-1} + W_{xc} x_t + b_c)

i_t = \sigma(W_{hi} h_{t-1} + W_{xi} x_t + b_i)

where W_{hc} and W_{xc} are the update-information weights applied to h_{t-1} and x_t when solving for \tilde{C}_t; W_{hi} and W_{xi} are the input-gate weights applied to h_{t-1} and x_t when solving for i_t; and b_c and b_i are bias terms.

23) Multiply the output value f_t by the cell's previous state C_{t-1}, multiply the candidate value \tilde{C}_t by the output value i_t, and add the two products, thereby forgetting part of the old information and remembering part of the new information, to obtain the new cell state C_t:

C_t = f_t C_{t-1} + i_t \tilde{C}_t

24) Use the output gate to determine the information o_t that the new cell needs to output, and multiply it by the result of passing the new cell state C_t through the tanh layer, to obtain the final output h_t:

o_t = \sigma(W_{ho} h_{t-1} + W_{xo} x_t + b_o)

h_t = o_t \tanh(C_t)

where W_{ho} and W_{xo} are the output-gate weights applied to h_{t-1} and x_t when solving for o_t, and b_o is the bias term.
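Steps 21) to 24) can be sketched as a single cell update. This is a scalar illustration only, not a trained model: the dictionary layout of the weights and the function name `lstm_step` are hypothetical, and the candidate value is assumed to use the usual tanh activation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w):
    """One scalar LSTM cell update following steps 21) to 24).

    w maps names such as "W_hf", "W_xf", "b_f" to floats (illustrative layout).
    """
    # 21) forget gate: how much of the previous cell state is kept
    f_t = sigmoid(w["W_hf"] * h_prev + w["W_xf"] * x_t + w["b_f"])
    # 22) input gate and candidate value (tanh is an assumption)
    i_t = sigmoid(w["W_hi"] * h_prev + w["W_xi"] * x_t + w["b_i"])
    c_hat = math.tanh(w["W_hc"] * h_prev + w["W_xc"] * x_t + w["b_c"])
    # 23) new cell state: forget part of the old, add part of the new
    c_t = f_t * c_prev + i_t * c_hat
    # 24) output gate and cell output
    o_t = sigmoid(w["W_ho"] * h_prev + w["W_xo"] * x_t + w["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


names = ["W_hf", "W_xf", "b_f", "W_hi", "W_xi", "b_i",
         "W_hc", "W_xc", "b_c", "W_ho", "W_xo", "b_o"]
zero_weights = {name: 0.0 for name in names}  # placeholder weights, not learned values
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
```

With all weights at zero every gate evaluates to 0.5 and the candidate to 0, so the new state is simply half of the previous state. This makes the roles of the forget and input gates easy to check by hand.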

Step 104. Perform prediction processing based on the low-frequency component using a second prediction model, to obtain a second prediction result for the data in the future time series.

Preferably, in view of the characteristics of the low-frequency component, the embodiment of the present application adopts ARIMA (Autoregressive Integrated Moving Average Model) as the analysis and prediction model for the low-frequency component; that is, the second prediction model is an ARIMA model. ARIMA, the autoregressive integrated moving average model, is one of the methods of time series forecasting analysis.

The ARIMA model combines a differencing operation with the autoregressive moving average (ARMA) model: a non-stationary series is differenced until it becomes approximately stationary, and ARMA prediction is then applied to the resulting series. The advantage of ARIMA is that it describes the time series with a mathematical model and predicts its future course from its past and present values.

The ARIMA model is expressed as follows:

\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d Y_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t

where p is the autoregressive order; \varphi_i are the autoregressive coefficients; L is the lag operator; d is the differencing order; Y_t is the input time series; q is the moving-average order; \theta_i are the moving-average coefficients; and \varepsilon_t is the residual series.

In the embodiment of the present application, the ARIMA model is constructed as follows:

31) Difference the time series data, stop differencing once the processed series has become stationary, and take the differencing order d reached at that point as the differencing order of the model;

32) On the basis of the differencing, determine the values of p and q using the SBC criterion.

The SBC criterion is a consistent estimate of the true order of the optimal model. Its expression is as follows:

SBC = -2 \ln M_L + n \ln N

where M_L is the value of the model's maximum likelihood function, N is the sample size, and n is the number of unknown parameters to be fitted.

The criterion changes the penalty weight on the number of unknown parameters to be fitted from a fixed value to a dynamic value that depends on the sample size, so that it adapts to changes in the length of the time series and gives more reliable results. When applying the criterion in practice, this embodiment compares the SBC values for p and q within a given range and selects the p and q corresponding to the minimum SBC value as the optimal parameters.
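The order selection in steps 31) and 32) can be sketched as below. The sketch assumes the usual form of the criterion, in which the penalty is ln N times the number of fitted parameters, and counts those parameters as p + q for simplicity. The log-likelihood values are hypothetical placeholders for whatever an ARIMA fitting routine would return, and the function names are illustrative.

```python
import math


def difference(series, d=1):
    """Apply d rounds of first-order differencing (step 31)."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def sbc(log_likelihood, n_params, n_samples):
    """SBC = -2 ln(M_L) + ln(N) * (number of fitted parameters)."""
    return -2.0 * log_likelihood + math.log(n_samples) * n_params


def select_order(fits, n_samples):
    """fits maps (p, q) to the maximised log-likelihood of that candidate model.

    Returns the (p, q) pair with the smallest SBC (step 32).
    """
    return min(fits, key=lambda pq: sbc(fits[pq], pq[0] + pq[1], n_samples))


# Hypothetical log-likelihoods for four candidate orders, for illustration only.
fits = {(1, 0): -120.0, (1, 1): -118.5, (2, 1): -118.0, (2, 2): -117.9}
best = select_order(fits, n_samples=100)
twice_differenced = difference([1, 4, 9, 16], d=2)
```

Here the larger models fit slightly better but not by enough to offset the ln N penalty, so the smallest candidate is selected. Deciding when the differenced series is stationary would need a separate stationarity test, which the sketch leaves out.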

Step 105. Superpose the first prediction result and the second prediction result to obtain a target prediction result for the data in the future time series.

The prediction results of the first prediction model and the second prediction model are then superposed to obtain the final prediction result for the data in the future time series, that is, the target prediction result.

Step 106. Identify outlier data in the real data of the future time series according to the target prediction result.

Once the final prediction result for the data in the future time series, that is, the target prediction result, has been obtained by superposing the prediction results of the different models, deviation information between the real data in the future time series and the target prediction result can be determined, and outlier data in the real data of the future time series can be identified according to that deviation information.
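Steps 105 and 106 can be sketched as follows. The superposition is an element-wise sum, as described above. The fixed `tolerance` used to turn deviations into outlier flags is an assumption for illustration, since the text does not fix a rule at this point, and the function names are hypothetical.

```python
def superpose(first_pred, second_pred):
    """Target prediction = high-frequency prediction + low-frequency prediction."""
    if len(first_pred) != len(second_pred):
        raise ValueError("predictions must cover the same time steps")
    return [a + b for a, b in zip(first_pred, second_pred)]


def flag_outliers(actual, target_pred, tolerance):
    """Return indices whose absolute deviation from the prediction exceeds tolerance."""
    return [i for i, (y, y_hat) in enumerate(zip(actual, target_pred))
            if abs(y - y_hat) > tolerance]


# Illustrative values: the second observation departs strongly from the prediction.
target = superpose([0.5, -0.5, 0.25], [10.0, 10.0, 10.0])
outliers = flag_outliers([10.5, 14.0, 10.25], target, tolerance=1.0)
```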

Further, in practical applications, after the prediction results of the two models have been superposed to obtain the final target prediction result, the residuals between the actual and predicted values can be analyzed and plotted as a residual plot, and outliers can be identified from that plot. The identified outliers are then used to judge whether the application or system concerned has failed or behaved abnormally, for example whether a network intrusion, financial fraud, equipment failure or medical accident has occurred.

An application example of the method of the present application is provided next.

In this example, a time series anomaly detection model based on WA-LSTM-ARIMA is built in advance. Using this model, wavelet analysis theory is applied to the time series data for multi-scale decomposition and reconstruction; the high-frequency and low-frequency components extracted after decomposition are analyzed with the LSTM model and the ARIMA model respectively; the prediction results of the two models are superposed to obtain the final prediction result; and outliers are identified based on the superposed prediction.

The detailed anomaly detection flow of the WA-LSTM-ARIMA model for time series data is shown in Figure 4. Wavelet analysis theory is applied to the time series data for multi-scale decomposition and reconstruction, extracting a high-frequency periodic component, a low-frequency trend component and a high-frequency random component. The high-frequency periodic component mainly reflects the high-frequency part of the data and is analyzed and predicted with the LSTM model. The low-frequency trend component reflects the low-frequency part of the data and is analyzed and predicted with the ARIMA model. The high-frequency random component mainly reflects the influence of noise and, after denoising, is merged into the high-frequency sequence for unified processing. The prediction results of the models are superposed to obtain the final prediction result for the data in the future time series, and a residual plot is drawn from the predicted values and the actual (observed) values of the data in the future time series in order to identify outliers.

In this example, a data set is selected in advance to train, test and evaluate the model.

Since data such as stock prices and trading volumes in the securities market form a continuous time series containing useful time-related information, this example uses stock trading data as the experimental data set. The data set includes information such as date and opening price; the original data set and its distribution are shown in Figure 5. Because what is observed is the trend of the predicted values over time, the data and volume (trading volume) columns of the data set are used. The first 60% of the data set serves as learning data for the model and the remaining 40% is used for prediction. Based on the residual plot, the 4% of points with the largest deviation between actual and predicted values are taken as outliers; the residual plot is shown in Figure 6.
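The 60/40 split and the top-4% rule can be sketched as below. Rounding the number of flagged points down is an assumption, and the function names are hypothetical. With 730 samples, 4% rounded down gives 29 points, which is consistent with the 25 + 4 predicted positives reported in Table 1.

```python
def split_60_40(series):
    """First 60% of the series for learning, remaining 40% for prediction."""
    cut = len(series) * 6 // 10
    return series[:cut], series[cut:]


def top_percent_outliers(actual, predicted, percent=4):
    """Indices of the points with the largest absolute residuals.

    Keeps the top `percent` of points, rounded down, but at least one point.
    """
    residuals = [abs(y - y_hat) for y, y_hat in zip(actual, predicted)]
    k = max(1, len(residuals) * percent // 100)
    ranked = sorted(range(len(residuals)), key=lambda i: residuals[i], reverse=True)
    return sorted(ranked[:k])


# Illustrative data: 50 points with two injected deviations.
actual = [1.0] * 50
actual[7], actual[20] = 9.0, -5.0
predicted = [1.0] * 50
flagged = top_percent_outliers(actual, predicted)
train, test = split_60_40(list(range(10)))
```

A percentile rule of this kind always flags a fixed share of points, so it suits an evaluation on labeled data; a deployment would more likely need an absolute deviation threshold.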

To evaluate the performance of the time series anomaly detection model, this example uses precision (P), recall (R) and the F1 score. Precision is the proportion of samples that are both predicted and actually abnormal among all samples predicted to be abnormal; recall is the proportion of samples that are both predicted and actually abnormal among all samples that are actually abnormal; the F1 score is the weighted harmonic mean of precision and recall. For all three, a larger value indicates better performance. They are calculated as follows:

P = \frac{TP}{TP + FP}

R = \frac{TP}{TP + FN}

F1 = \frac{2PR}{P + R}

where TP is the number of samples predicted abnormal that are actually abnormal, FP is the number of samples predicted abnormal that are actually normal, and FN is the number of samples predicted normal that are actually abnormal. TP + FP is therefore the total number of samples predicted to be abnormal, and TP + FN is the total number of samples that are actually abnormal.

Since the model in this example is an anomaly detection model, a positive example here means an abnormal sample. The experiment in this example uses 730 samples in total; the distribution of positive and negative samples is shown in Table 1.

Table 1. Distribution of positive and negative samples

                     Actual positive   Actual negative
Predicted positive   25                4
Predicted negative   3                 698

From Table 1, the precision, recall and F1 score of the model can be calculated, as shown in Table 2:

Table 2. Precision, recall and F1 score of the model

Precision (P)   Recall (R)   F1
0.862           0.892        0.877
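The figures in Table 2 can be checked from the counts in Table 1 with a short calculation. The function name is illustrative. Recall evaluates to 25/28 ≈ 0.8929, which Table 2 reports as 0.892.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Table 1: TP = 25, FP = 4, FN = 3.
p, r, f1 = precision_recall_f1(tp=25, fp=4, fn=3)
```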

The model in this example predicts subsequent data by learning from the historical data of a system and judges whether the system is abnormal by comparing the deviation between actual and predicted values. This enables real-time detection on the time series, so that outlier data are found accurately and promptly and maintenance personnel are reminded to take countermeasures as early as possible, avoiding irreparable losses. It also addresses the time-consuming, laborious and error-prone nature of earlier manual detection.

Corresponding to the above anomaly detection method for time series data, the present application further provides an anomaly detection device for time series data. The structure of the device is shown in Figure 7 and includes:

an acquisition module 10, configured to acquire time series monitoring data;

an extraction module 20, configured to extract a high-frequency component and a low-frequency component of the time series monitoring data;

a first prediction processing module 30, configured to perform prediction processing based on the high-frequency component using a first prediction model, to obtain a first prediction result for the data in a future time series;

a second prediction processing module 40, configured to perform prediction processing based on the low-frequency component using a second prediction model, to obtain a second prediction result for the data in the future time series;

a superposition module 50, configured to superpose the first prediction result and the second prediction result to obtain a target prediction result for the data in the future time series;

an outlier identification module 60, configured to identify outlier data in the real data of the future time series according to the target prediction result.
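One possible way to wire the six modules together is sketched below. Each module is passed in as a plain callable; the class name, the method name and the callable signatures are all hypothetical, and the stub lambdas stand in for the wavelet, LSTM and ARIMA processing described in the method embodiment.

```python
class AnomalyDetectionDevice:
    """Wires modules 10 to 60 together; each module is an injected callable."""

    def __init__(self, acquire, extract, predict_high, predict_low, identify):
        self.acquire = acquire            # acquisition module 10
        self.extract = extract            # extraction module 20
        self.predict_high = predict_high  # first prediction processing module 30
        self.predict_low = predict_low    # second prediction processing module 40
        self.identify = identify          # outlier identification module 60

    def run(self, actual_future):
        data = self.acquire()
        high, low = self.extract(data)
        horizon = len(actual_future)
        first = self.predict_high(high, horizon)
        second = self.predict_low(low, horizon)
        target = [a + b for a, b in zip(first, second)]  # superposition module 50
        return self.identify(actual_future, target)


# Stub modules for illustration only; they do not implement real models.
device = AnomalyDetectionDevice(
    acquire=lambda: [1.0, 2.0, 3.0],
    extract=lambda data: ([0.0] * len(data), data),
    predict_high=lambda high, n: [0.0] * n,
    predict_low=lambda low, n: [low[-1]] * n,
    identify=lambda actual, target: [
        i for i, (y, t) in enumerate(zip(actual, target)) if abs(y - t) > 1.0
    ],
)
flags = device.run([3.0, 9.0])
```

Injecting the modules as callables keeps each one replaceable, in line with the statement that the first and second prediction models are only preferably LSTM and ARIMA.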

In one embodiment, the extraction module 20 is specifically configured to:

In one embodiment, when extracting the high-frequency periodic component, the high-frequency random component and the low-frequency trend component of the time series monitoring data by performing multi-scale decomposition and reconstruction on the time series monitoring data using wavelet analysis theory, the extraction module 20 is specifically configured to:

In one embodiment, the first prediction processing module 30 is specifically configured to:

In one embodiment, when denoising the high-frequency random component, the first prediction processing module 30 is specifically configured to:

wherein the threshold is

In one embodiment, the long short-term memory network model includes a plurality of memory units connected in sequence, each memory unit including a forget gate, an input gate and an output gate;

In one embodiment, the second prediction processing module 40 is specifically configured to:

In one embodiment, the outlier identification module 60 is specifically configured to:

Since the anomaly detection device for time series data of the embodiment of the present application corresponds to the anomaly detection method for time series data of the method embodiment above, it is described only briefly; for the related details, refer to the description of the method embodiment above, which is not repeated here.

The present application further provides a computer-readable medium on which a computer program is stored, the computer program containing program code for executing the anomaly detection method for time series data of the method embodiment above.

In the context of the present application, a computer-readable medium (machine-readable medium) may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

It should be noted that the computer-readable medium described above may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but without limitation, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in connection with, an instruction execution system, apparatus or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to electrical wire, optical cable, RF (radio frequency), or any suitable combination of the above.

The above computer-readable medium may be contained in an electronic device, or it may exist separately without being assembled into the electronic device.

The present application further provides a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the anomaly detection method for time series data of the method embodiment above.

In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts can be implemented as computer software programs. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device, installed from a storage device, or installed from ROM. When the computer program is executed by a processing device, the functions defined in the method of the embodiments of the present application are performed.

In summary, the anomaly detection method, device, computer storage medium and computer program product for time series data provided by the present application have at least the following technical advantages:

a) Decomposing the time series data with WA separates it into low-frequency and high-frequency components, which helps remove noise from the high-frequency component and improves the prediction accuracy of the model;

b) The low-frequency component, characterized by low frequency and trend, is analyzed with the ARIMA model, so that the time series can be described by a mathematical model and its future course predicted from its past and present values;

c) The high-frequency periodic component is analyzed with the LSTM model, which has long-term memory, so results can be predicted more accurately; LSTM also performs better on longer sequences.

It should be noted that, although the subject matter has been described in language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present application. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

The above description covers only preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the present application is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the concept of the application, for example technical solutions formed by replacing the above features with technical features of similar function, including but not limited to those disclosed in the present application.

Claims

1. A method for detecting an abnormality in time-series data, comprising:

acquiring time series monitoring data;

extracting high-frequency components and low-frequency components of the time series monitoring data;

performing prediction processing on the basis of the high-frequency component by using a first prediction model to obtain a first prediction result of data in a future time sequence;

performing prediction processing based on the low-frequency component by using a second prediction model to obtain a second prediction result of the data in the future time sequence;

superposing the first prediction result and the second prediction result to obtain a target prediction result of the data in the future time sequence;

and identifying abnormal point data in the real data under the future time sequence according to the target prediction result.

2. The method of claim 1, wherein extracting the high frequency component and the low frequency component of the time series monitoring data comprises:

performing multi-scale decomposition reconstruction on the time series monitoring data by using a wavelet analysis theory, and extracting a high-frequency periodic component, a high-frequency random component and a low-frequency trend component of the time series monitoring data;

wherein the high-frequency component includes the high-frequency periodic component and the high-frequency random component, and the low-frequency component includes the low-frequency trend component.

3. The method according to claim 2, wherein performing multi-scale decomposition reconstruction on the time series monitoring data by using wavelet analysis theory, and extracting the high-frequency periodic component, the high-frequency random component and the low-frequency trend component of the time series monitoring data, comprises:

acquiring deformation actual measurement data of the time series monitoring data, wherein the deformation actual measurement data is obtained by cleaning abnormal values of the time series monitoring data;

taking the deformation actual measurement data as current data to be decomposed;

decomposing the current data to be decomposed into a low-frequency part and a high-frequency part based on wavelet decomposition to obtain a low-frequency component and a high-frequency component corresponding to the current data to be decomposed;

determining whether a low-frequency subsequence contained in a low-frequency component corresponding to current data to be decomposed has a preset variation trend characteristic or not;

if so, ending the decomposition processing of the data;

if not, updating the current data to be decomposed to the low-frequency component corresponding to the current data to be decomposed, and returning to the step of decomposing the current data to be decomposed into a low-frequency part and a high-frequency part, until the low-frequency subsequence contained in the low-frequency component corresponding to the current data to be decomposed has the preset variation trend characteristic, and then ending the decomposition processing of the data; wherein, in the decomposition process, each layer decomposes only the low-frequency component, and the high-frequency components are not further processed;

and extracting the low-frequency component of the last layer in the decomposition result as the low-frequency trend component, and extracting the high-frequency periodic component and the high-frequency random component from the high-frequency components of the layers.
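The layer-by-layer loop of claim 3 can be sketched as follows. This is a minimal illustration, assuming a Haar wavelet and a simple monotonicity test as the "preset variation trend characteristic"; the application does not name the wavelet or the trend test here, so `haar_step`, `is_trend`, `decompose` and `max_levels` are hypothetical. The reconstruction step and the separation of periodic from random high-frequency parts are omitted.

```python
import math
from typing import List, Tuple

def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One Haar level: scaled pairwise sums (low) and differences (high)."""
    if len(x) % 2:                      # pad odd-length input with its last sample
        x = x + [x[-1]]
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def is_trend(low: List[float]) -> bool:
    """Hypothetical trend test: the low-frequency subsequence is monotone."""
    diffs = [b - a for a, b in zip(low, low[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

def decompose(x: List[float], max_levels: int = 8) -> Tuple[List[float], List[List[float]]]:
    """Repeatedly split only the low-frequency part; keep each layer's high part as is."""
    current, highs = list(x), []
    for _ in range(max_levels):
        low, high = haar_step(current)
        highs.append(high)              # high-frequency component of this layer
        current = low                   # next layer decomposes the low part only
        if is_trend(low) or len(low) < 2:
            break
    return current, highs

data = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
trend, highs = decompose(data)
print(len(highs), trend)
```

In this toy series the zig-zag is captured by the first layer's high-frequency part, so the loop stops after a single level with a monotone low-frequency sequence.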

4. The method according to claim 2, wherein the performing a prediction process based on the high frequency component by using a first prediction model to obtain a first prediction result of data at a future time sequence comprises:

denoising the high-frequency random component;

and inputting the high-frequency periodic component and the denoised high-frequency random component into a long short-term memory (LSTM) network model, the LSTM network model performing prediction processing based on the high-frequency periodic component and the denoised high-frequency random component to obtain a first prediction result of the data in the future time sequence.

5. The method of claim 4, wherein denoising the high-frequency random component comprises:

denoising the high-frequency random component based on a preset threshold;

wherein the threshold is determined from σ₁ and N₁, where σ₁ = MAD/0.6745; MAD represents the median of the absolute values of the first-layer wavelet decomposition coefficients; 0.6745 is the adjustment coefficient for the standard deviation of Gaussian noise; and N₁ represents the size or length of the high-frequency random component signal.
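A minimal sketch of threshold denoising, assuming the threshold takes the widely used universal form λ = σ₁·sqrt(2·ln N₁) and that coefficients are shrunk by soft thresholding. Neither the closed form nor the shrinkage rule is confirmed by the text above; only σ₁ = MAD/0.6745 is. The function names and sample values are hypothetical.

```python
import math
from statistics import median
from typing import List

def noise_sigma(level1_details: List[float]) -> float:
    """sigma_1 = MAD / 0.6745, with MAD the median absolute first-layer coefficient."""
    return median([abs(c) for c in level1_details]) / 0.6745

def universal_threshold(sigma: float, n: int) -> float:
    """Assumed form of the threshold: sigma * sqrt(2 * ln(n))."""
    return sigma * math.sqrt(2.0 * math.log(n))

def soft_threshold(coeffs: List[float], t: float) -> List[float]:
    """Shrink each coefficient toward zero by t; small coefficients become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

details = [0.6745, -0.6745, 1.349, -2.0, 0.1]   # made-up first-layer coefficients
sigma1 = noise_sigma(details)
lam = universal_threshold(sigma1, len(details))
denoised = soft_threshold([3.0, -0.5, 1.0], lam)
print(sigma1, lam, denoised)
```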

6. The method according to claim 4, wherein the long short-term memory (LSTM) network model comprises a plurality of memory units connected in sequence, each memory unit comprising a forget gate, an input gate and an output gate;

the current memory unit obtains its output information by fusing the feature information transmitted by the previous memory unit with the high-frequency component data input to the current memory unit, and, when performing the fusion processing, processes the feature information transmitted by the previous memory unit by selectively memorizing and forgetting the input high-frequency component;

the memory units connected in sequence respectively process the high-frequency components input to them and the feature information transmitted by the previous memory unit, thereby realizing the prediction processing of the model based on the high-frequency components;

the feature information transmitted by the previous memory unit comprises a state value and an output value of the previous memory unit.
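The memory unit described in claim 6 corresponds to the standard LSTM cell update, sketched below for scalar inputs. The weights are untrained placeholders, and `lstm_unit`, `run_chain` and the weight layout are hypothetical; a real model would use vector-valued states and learned parameters.

```python
import math
from typing import Dict, List, Tuple

Weights = Dict[str, Tuple[float, float, float]]  # gate -> (w_x, w_h, bias)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_unit(x: float, h_prev: float, c_prev: float, w: Weights) -> Tuple[float, float]:
    """One memory unit: fuse input x with the previous output h_prev and state c_prev."""
    def pre(gate: str) -> float:
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))         # how much of the previous state to keep
    i = sigmoid(pre("input"))          # how much new information to store
    o = sigmoid(pre("output"))         # how much of the state to expose
    g = math.tanh(pre("candidate"))    # candidate new information
    c = f * c_prev + i * g             # updated state value
    h = o * math.tanh(c)               # output value passed to the next unit
    return h, c

def run_chain(xs: List[float], w: Weights) -> List[float]:
    """Pass (output value, state value) along memory units connected in sequence."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_unit(x, h, c, w)
        outputs.append(h)
    return outputs

zero_w: Weights = {g: (0.0, 0.0, 0.0) for g in ("forget", "input", "output", "candidate")}
h1, c1 = lstm_unit(0.0, 0.0, 2.0, zero_w)
print(h1, c1, run_chain([0.3, -0.1, 0.2], zero_w))
```

With all-zero weights every gate equals 0.5 and the candidate is 0, so the unit simply halves the incoming state, which makes the gating arithmetic easy to check by hand.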

7. The method of claim 2, wherein the performing a prediction process based on the low frequency component using a second prediction model to obtain a second prediction result of the data at the future time sequence comprises:

inputting the low-frequency trend component into an autoregressive integrated moving average (ARIMA) model, the ARIMA model performing prediction processing on the low-frequency trend component to obtain a second prediction result of the data in the future time sequence.
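To make the ARIMA step concrete, here is a deliberately reduced sketch of an ARIMA(1,1,0) forecast: one round of differencing, a single autoregressive coefficient fitted by least squares without an intercept, and no moving-average term. The order (1,1,0) and the function name `arima_110_forecast` are assumptions for illustration; the application does not state the model order in this claim, and a production system would more likely rely on an established statistics library.

```python
from typing import List, Sequence

def arima_110_forecast(series: Sequence[float], steps: int) -> List[float]:
    """Forecast `steps` values with a simplified ARIMA(1,1,0) model (needs >= 2 samples)."""
    # d = 1: difference once to remove the trend.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # p = 1: least-squares AR(1) coefficient on the differenced series (no intercept).
    num = sum(diffs[t] * diffs[t - 1] for t in range(1, len(diffs)))
    den = sum(v * v for v in diffs[:-1])
    phi = num / den if den else 0.0
    # Forecast the differences recursively and integrate back to the original scale.
    level, last_diff, forecast = series[-1], diffs[-1], []
    for _ in range(steps):
        last_diff = phi * last_diff
        level += last_diff
        forecast.append(level)
    return forecast

trend_component = [1.0, 2.0, 3.0, 4.0, 5.0]
second_prediction = arima_110_forecast(trend_component, 3)
print(second_prediction)
```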

8. The method of claim 1, wherein identifying abnormal point data in the real data under the future time sequence according to the target prediction result comprises:

determining deviation information between real data at the future time sequence and a target prediction result of the data at the future time sequence;

and identifying abnormal point data in the real data under the future time sequence according to the deviation information.
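One possible reading of "deviation information" is the residual between the real value and the target prediction, judged against a robust scale estimate. The sketch below assumes a median-based rule with a multiplier `k`; the rule, the value of `k`, and the name `flag_by_deviation` are hypothetical, since the claim does not specify how the deviation is evaluated.

```python
from statistics import median
from typing import List, Sequence

def flag_by_deviation(actual: Sequence[float], predicted: Sequence[float], k: float = 3.0) -> List[int]:
    """Return indices whose residual lies more than k robust standard deviations from the median residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    center = median(residuals)
    spread = [abs(r - center) for r in residuals]
    scale = median(spread) / 0.6745      # MAD-based estimate of the residual spread
    return [i for i, s in enumerate(spread) if s > k * scale]

real = [10.1, 9.9, 10.0, 14.0, 10.2]
target_prediction = [10.0, 10.0, 10.0, 10.0, 10.0]
abnormal = flag_by_deviation(real, target_prediction)
print(abnormal)
```

A median-based scale is used here so that a single large deviation does not inflate the cutoff that is supposed to expose it.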

9. An abnormality detection device for time-series data, comprising:

an acquisition module, configured to acquire time series monitoring data;

an extraction module, configured to extract a high-frequency component and a low-frequency component of the time series monitoring data;

a first prediction processing module, configured to perform prediction processing based on the high-frequency component by using a first prediction model to obtain a first prediction result of data in a future time sequence;

a second prediction processing module, configured to perform prediction processing based on the low-frequency component by using a second prediction model to obtain a second prediction result of the data in the future time sequence;

a superposition module, configured to superpose the first prediction result and the second prediction result to obtain a target prediction result of the data in the future time sequence;

and an abnormal point identification module, configured to identify abnormal point data in the real data under the future time sequence according to the target prediction result.

10. A computer-readable medium, characterized in that a computer program is stored thereon, the computer program comprising program code for executing the abnormality detection method for time-series data according to any one of claims 1 to 8.

11. A computer program product, characterized in that it comprises a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for executing the method of anomaly detection of time-series data according to any of claims 1-8.