CN113792749A

CN113792749A - Time series data anomaly detection method, apparatus, device and storage medium

Info

Publication number: CN113792749A
Application number: CN202011282307.7A
Authority: CN
Inventors: 李婷; 张钧波; 郑宇
Original assignee: Jingdong City Beijing Digital Technology Co Ltd
Current assignee: Jingdong City Beijing Digital Technology Co Ltd
Priority date: 2020-11-16
Filing date: 2020-11-16
Publication date: 2021-12-14

Abstract

The present disclosure provides a time series data anomaly detection method, apparatus, device and storage medium, and relates to the technical field of data processing. The method includes: acquiring time series data, where the time series data is a sequence of indicator data corresponding to each of a set of consecutive time points; obtaining a neighbor correlation feature according to the indicator data corresponding to each time point, where the neighbor correlation feature represents the correlation between the indicator data of each of the multiple neighboring time points of a given time point and the indicator data of that time point; performing dimensionality reduction on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and partitioning the associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the indicator data is anomalous. The method improves the accuracy of anomaly detection for multiple kinds of indicator data with different neighbor correlations.

Description

Time series data anomaly detection method, apparatus, device and storage medium

Technical Field

The present disclosure relates to the technical field of data processing, and in particular to a time series data anomaly detection method, apparatus, device, and readable storage medium.

Background

In real life, statistics yield time series indicators with complex correlations. Each time point corresponds to one indicator or one group of indicator data, and the sample data at different time points have no necessary connection with one another. For example, economic indicators such as the Gross Domestic Product (GDP) of a region are affected by many factors, and the indicator values of historical years have a certain influence on the indicator of the current year. When decisions are made about regional development, indicators describing the region's development in historical years, such as GDP, population, and employment, are analyzed: anomaly detection is performed on the indicator data, the anomalous historical time points are identified, and the causes of the anomalies are analyzed, so as to provide accurate decision support.

Anomaly detection usually has to cover many categories of indicators, and the indicators differ considerably from one another. For example, economic indicators include the macro-level total value of social consumer goods, each component of the household consumption structure, fixed-asset investment in each industry, and the export value of each type of commodity. There are hundreds or even thousands of categories in total, and indicators of different categories differ greatly. In the related art, a single unified model algorithm is usually used to detect anomalies in different time series indicators, and the accuracy of anomaly detection across multiple categories of indicators with a unified model is low.

As described above, how to improve the accuracy of anomaly detection for multiple categories of time series indicators has become an urgent problem to be solved.

The above information disclosed in this Background section is only intended to enhance understanding of the background of the present disclosure, and therefore it may include information that does not constitute prior art known to a person of ordinary skill in the art.

Summary of the Invention

The purpose of the present disclosure is to provide a time series data anomaly detection method, apparatus, device, and readable storage medium that improve, at least to a certain extent, the accuracy of anomaly detection for multiple categories of time series indicators.

Other features and advantages of the present disclosure will become apparent from the following detailed description, or will be learned in part through practice of the present disclosure.

According to one aspect of the present disclosure, a time series data anomaly detection method is provided, including: acquiring time series data, where the time series data is a sequence of indicator data corresponding to each of a set of consecutive time points; obtaining a neighbor correlation feature according to the indicator data corresponding to each time point, where the neighbor correlation feature represents the correlation between the indicator data of each of the multiple neighboring time points of a given time point and the indicator data of that time point; performing dimensionality reduction on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and partitioning the associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the indicator data is anomalous.

According to an embodiment of the present disclosure, the time series data includes indicator data of multiple similar regions corresponding to each time point, and acquiring the time series data includes: acquiring indicator data of a first predetermined region corresponding to each time point; acquiring indicator data of a second predetermined region corresponding to each time point; and, when the similarity between the indicator data of the first predetermined region and the indicator data of the second predetermined region is greater than a preset threshold, obtaining the multiple similar regions, where the multiple similar regions include the first predetermined region and the second predetermined region.
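The similarity test between regions can be sketched as follows. This is a minimal illustration only: the embodiment above does not fix a similarity measure, so Pearson correlation is used here as an assumption, and the region names, the sample values, the 0.9 threshold, and the helper names `pearson` and `similar_regions` are all hypothetical.

```python
from math import sqrt
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equally long series (assumed similarity measure)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no shape information
    return cov / sqrt(var_a * var_b)


def similar_regions(data: Dict[str, List[float]], first: str,
                    threshold: float = 0.9) -> List[str]:
    """Return `first` plus every region whose similarity to it exceeds `threshold`."""
    group = [first]
    for name, series in data.items():
        if name != first and pearson(data[first], series) > threshold:
            group.append(name)
    return group


# Hypothetical indicator series for three regions over five time points.
data = {
    "region_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "region_b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "region_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}
group = similar_regions(data, "region_a", threshold=0.9)
```

Pooling the series of the regions in `group` gives the later steps more samples per indicator than a single region would provide.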

According to an embodiment of the present disclosure, the neighbor correlation feature includes a first dimension and a second dimension, where the first dimension is a sequence of neighboring time points and the second dimension is the information gain sequence corresponding to the sequence of neighboring time points; and obtaining the neighbor correlation feature according to the indicator data corresponding to each time point includes: acquiring the indicator data corresponding to each neighboring time point of each time point; and calculating the information gain sequence corresponding to the sequence of neighboring time points based on the indicator data corresponding to each time point and the indicator data corresponding to each neighboring time point of each time point.
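One way to picture the two-dimensional feature is as a list of (lag, gain) pairs. The sketch below is an assumption-laden illustration: this passage does not say how the information gain is computed, so the gain of a lag is approximated here by the largest variance reduction that a single threshold split on that lag achieves. The function names and the toy series are hypothetical.

```python
from typing import List, Tuple


def variance(values: List[float]) -> float:
    """Population variance; defined as 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split_gain(feature: List[float], target: List[float]) -> float:
    """Largest variance reduction of `target` from one threshold split on `feature`."""
    n = len(target)
    base = variance(target)
    best = 0.0
    for threshold in sorted(set(feature))[1:]:
        left = [y for x, y in zip(feature, target) if x < threshold]
        right = [y for x, y in zip(feature, target) if x >= threshold]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        best = max(best, base - weighted)
    return best


def gain_sequence(series: List[float], max_lag: int = 5) -> List[Tuple[int, float]]:
    """Return [(lag, gain)] for lag = 1..max_lag (stand-in for the gain sequence)."""
    target = series[max_lag:]
    result = []
    for lag in range(1, max_lag + 1):
        # feature[i] is the value observed `lag` steps before target[i]
        feature = series[max_lag - lag: len(series) - lag]
        result.append((lag, best_split_gain(feature, target)))
    return result


# Toy series in which the value two steps back fully determines the current value.
series = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
feature_2d = gain_sequence([float(v) for v in series], max_lag=2)
```

In this toy series the lag-2 entry receives the full variance of the target as its gain, while the lag-1 entry receives none, so lag 2 would be the stronger candidate for an associated neighboring time point.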

According to an embodiment of the present disclosure, performing dimensionality reduction on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature includes: determining the associated neighboring time points of each time point from the multiple neighboring time points based on the neighbor correlation feature; determining the dimension of the reduced indicator data according to the associated neighboring time points of each time point; and reducing the indicator data corresponding to the multiple neighboring time points to that dimension by principal component analysis.
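The principal component analysis step can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the target dimension `k` is taken to be the number of associated neighboring time points, which is only one possible reading of the embodiment; `associated_lags`, the sample windows, and the function name `pca_reduce` are hypothetical; and power iteration with deflation is used merely to keep the sketch dependency-free.

```python
from math import sqrt
from typing import List


def pca_reduce(rows: List[List[float]], k: int, iters: int = 500) -> List[List[float]]:
    """Project `rows` (samples x features) onto their first `k` principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _comp in range(k):
        # Fixed, non-uniform unit start vector for the power iteration.
        start = [1.0 / (j + 1) for j in range(d)]
        scale = sqrt(sum(x * x for x in start))
        v = [x / scale for x in start]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        # Deflate: remove the direction just found before the next component.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Hypothetical neighbor windows: each row holds the values at three neighboring time points.
windows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 6.0, 8.5],
           [4.0, 8.0, 12.5], [5.0, 10.0, 14.5], [6.0, 12.0, 18.5]]
associated_lags = [1, 2]   # hypothetical outcome of the gain-based selection
k = len(associated_lags)   # assumption: reduced dimension = number of associated neighbors
reduced = pca_reduce(windows, k)
```

Each row of `reduced` would then serve as the associated neighbor data for one time point.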

According to an embodiment of the present disclosure, determining the dimension of the reduced indicator data according to the associated neighboring time points of each time point includes: determining the dimension of the reduced indicator data according to the associated neighboring time points of each time point, based on the neighbor correlation feature.

According to an embodiment of the present disclosure, performing dimensionality reduction on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature includes: determining the associated neighboring time points of each time point from the multiple neighboring time points based on the neighbor correlation feature; and obtaining the associated neighbor data corresponding to each time point includes: taking the indicator data corresponding to the associated neighboring time points of each time point as the dimension-reduced associated neighbor data.

According to an embodiment of the present disclosure, partitioning the associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the indicator data is anomalous includes: obtaining isolation trees according to the associated neighbor data corresponding to the consecutive time points; obtaining, based on the isolation trees, an anomaly score for each item of associated neighbor data corresponding to the consecutive time points; and taking the time points whose associated neighbor data has an anomaly score greater than a preset threshold as the time points at which the indicator data is anomalous.
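The isolation-tree scoring can be sketched as follows. This is a simplified illustration in the style of the standard isolation forest, not necessarily the exact procedure of the embodiment: every tree is built on all points without subsampling, the anomaly score uses the usual 2^(-E[h]/c(n)) normalization, and the sample points, the seed, the number of trees, and the 0.6 threshold are hypothetical.

```python
import math
import random
from typing import List


def c_factor(n: int) -> float:
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points: List[List[float]], depth: int, max_depth: int,
               rng: random.Random) -> dict:
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point: List[float], node: dict, depth: int = 0) -> float:
    """Depth at which `point` lands, adjusted for unsplit points in the leaf."""
    if "dim" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def anomaly_scores(points: List[List[float]], n_trees: int = 100,
                   seed: int = 0) -> List[float]:
    """Score each point (at least two are required); values closer to 1 are more anomalous."""
    rng = random.Random(seed)
    n = len(points)
    max_depth = math.ceil(math.log2(max(n, 2)))
    trees = [build_tree(points, 0, max_depth, rng) for _ in range(n_trees)]
    scores = []
    for p in points:
        mean_path = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_path / c_factor(n)))
    return scores


# Hypothetical associated neighbor data for eight time points; the last one is far away.
points = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
          [1.0, 0.8], [0.8, 1.1], [1.3, 1.0], [9.0, 9.5]]
scores = anomaly_scores(points)
threshold = 0.6  # hypothetical preset threshold
anomalous_indices = [i for i, s in enumerate(scores) if s > threshold]
```

Points that are separated from the rest after only a few random cuts have short average paths and therefore high scores, which is why the distant last point ranks highest here.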

According to another aspect of the present disclosure, a time series data anomaly detection apparatus is provided, including: a data acquisition module, configured to acquire time series data, where the time series data is a sequence of indicator data corresponding to each of a set of consecutive time points; a correlation feature extraction module, configured to obtain a neighbor correlation feature according to the indicator data corresponding to each time point, where the neighbor correlation feature represents the correlation between the indicator data of each of the multiple neighboring time points of a given time point and the indicator data of that time point; an indicator dimensionality reduction module, configured to perform dimensionality reduction on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and an anomaly detection module, configured to partition the associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the indicator data is anomalous.

According to an embodiment of the present disclosure, the time series data includes indicator data of multiple similar regions corresponding to each time point; the data acquisition module is further configured to acquire indicator data of a first predetermined region corresponding to each time point and to acquire indicator data of a second predetermined region corresponding to each time point; and the data acquisition module further includes a similar region aggregation module, configured to obtain the multiple similar regions when the similarity between the indicator data of the first predetermined region and the indicator data of the second predetermined region is greater than a preset threshold, where the multiple similar regions include the first predetermined region and the second predetermined region.

According to an embodiment of the present disclosure, the neighbor correlation feature includes a first dimension and a second dimension, where the first dimension is a sequence of neighboring time points and the second dimension is the information gain sequence corresponding to the sequence of neighboring time points; and the correlation feature extraction module is further configured to: acquire the indicator data corresponding to each neighboring time point of each time point; and calculate the information gain sequence corresponding to the sequence of neighboring time points based on the indicator data corresponding to each time point and the indicator data corresponding to each neighboring time point of each time point.

According to an embodiment of the present disclosure, the indicator dimensionality reduction module is further configured to: determine the associated neighboring time points of each time point from the multiple neighboring time points based on the neighbor correlation feature; determine the dimension of the reduced indicator data according to the associated neighboring time points of each time point; and reduce the indicator data corresponding to the multiple neighboring time points to that dimension by principal component analysis.

According to an embodiment of the present disclosure, the indicator dimensionality reduction module is further configured to determine the dimension of the reduced indicator data according to the associated neighboring time points of each time point, based on the neighbor correlation feature.

According to an embodiment of the present disclosure, the indicator dimensionality reduction module is further configured to: determine the associated neighboring time points of each time point from the multiple neighboring time points based on the neighbor correlation feature; and take the indicator data corresponding to the associated neighboring time points of each time point as the dimension-reduced associated neighbor data.

According to an embodiment of the present disclosure, the anomaly detection module is further configured to: obtain isolation trees according to the associated neighbor data corresponding to the consecutive time points; obtain, based on the isolation trees, an anomaly score for each item of associated neighbor data corresponding to the consecutive time points; and take the time points whose associated neighbor data has an anomaly score greater than a preset threshold as the time points at which the indicator data is anomalous.

According to yet another aspect of the present disclosure, a device is provided, including: a memory, a processor, and executable instructions stored in the memory and executable on the processor, where the processor implements any of the methods described above when executing the executable instructions.

According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, where the executable instructions implement any of the methods described above when executed by a processor.

In the time series data anomaly detection method provided by the embodiments of the present disclosure, a neighbor correlation feature is obtained according to the indicator data corresponding to each time point; dimensionality reduction is performed on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and the associated neighbor data corresponding to the consecutive time points is partitioned to determine, from the consecutive time points, the time points at which the indicator data is anomalous. In this way, the indicator data of the neighboring time points that are strongly correlated with the indicator at each time point can be selected as the object of anomaly detection, which improves the accuracy of anomaly detection for multiple kinds of indicator data with different neighbor correlations.

It should be understood that the foregoing general description and the following detailed description are merely exemplary and do not limit the present disclosure.

Brief Description of the Drawings

The above and other objects, features, and advantages of the present disclosure will become more apparent from the detailed description of example embodiments with reference to the accompanying drawings.

FIG. 1 shows a schematic diagram of a system structure in an embodiment of the present disclosure.

FIG. 2 shows a flowchart of a time series data anomaly detection method in an embodiment of the present disclosure.

FIG. 3A is a flowchart of a region aggregation method for obtaining anomaly detection data, according to an exemplary embodiment.

FIG. 3B shows a similarity heat map of the gross regional product of the financial industry, according to an exemplary embodiment.

FIG. 3C shows the corporate income tax revenue indicator curve of Meishan City from 2000 to 2018.

FIG. 3D shows the corporate income tax revenue indicator curve of Dazhou City from 2000 to 2018.

FIG. 3E shows the corporate income tax revenue indicator curve of Suining City from 2000 to 2018.

FIG. 3F shows the corporate income tax revenue indicator curve of Mianyang City from 2000 to 2018.

FIG. 3G shows a similarity clustering diagram of the corporate income tax revenue indicators of the four cities.

FIG. 4A is a flowchart of a feature dimensionality reduction method for anomaly detection, according to an exemplary embodiment.

FIG. 4B shows a schematic diagram of a division into left and right subtrees, according to an embodiment.

FIG. 4C shows an importance histogram of the most recent 5 years for the real estate industry, according to an embodiment.

FIG. 4D shows an importance histogram of the most recent 5 years for the construction industry, according to an embodiment.

FIG. 4E shows, based on FIG. 4C, a scatter plot of the dimension-reduced neighboring time point data for the real estate total value.

FIG. 5 is a flowchart of another feature dimensionality reduction method, according to an exemplary embodiment.

FIG. 6A is a flowchart of an anomalous point determination method, according to an exemplary embodiment.

FIG. 6B shows a schematic diagram of a sample cutting process, according to an embodiment.

FIG. 6C shows a schematic diagram of another sample cutting process, according to an embodiment.

FIG. 6D shows a schematic diagram of an isolation tree, according to an embodiment.

FIG. 6E shows a plot of the number of full-time teachers at regular institutions of higher education in Sichuan Province over time, together with the corresponding anomaly scores, according to an embodiment.

FIG. 7 is a block diagram of a time series data anomaly detection apparatus, according to an exemplary embodiment.

FIG. 8 is a block diagram of another time series data anomaly detection apparatus, according to an exemplary embodiment.

FIG. 9 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments, however, can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated descriptions of them are therefore omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or that other methods, apparatuses, steps, and the like may be employed. In other instances, well-known structures, methods, apparatuses, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the present disclosure.

In addition, the terms "first", "second", and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance or as implicitly indicating the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "multiple" means at least two, for example two or three, unless expressly and specifically defined otherwise. The symbol "/" generally indicates an "or" relationship between the objects before and after it.

In the present disclosure, unless otherwise expressly specified and limited, terms such as "connection" should be understood broadly: for example, a connection may be an electrical connection or a communicative one, and it may be direct or indirect through an intermediate medium. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present disclosure according to the specific situation.

As described above, the sample data at the various time points of a given category of indicator have no necessary connection with one another, while the indicator values of historical years have a certain influence on the indicator of the current year, so the overall development trend needs to be considered when historical-year indicators are analyzed for anomaly detection. Moreover, the historical years that influence the current-year indicator may differ from one category of indicator to another, so anomaly detection with a unified model has low accuracy. The present disclosure therefore provides a time series data anomaly detection method in which a neighbor correlation feature is obtained according to the indicator data corresponding to each time point; dimensionality reduction is performed on the indicator data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and the associated neighbor data corresponding to the consecutive time points is partitioned to determine, from the consecutive time points, the time points at which the indicator data is anomalous. In this way, the indicator data of the neighboring time points that are strongly correlated with the indicator at each time point can be selected as the object of anomaly detection, which improves the accuracy of anomaly detection for multiple kinds of indicator data with different neighbor correlations.

FIG. 1 shows an exemplary system architecture 10 to which the time series data anomaly detection method or the time series data anomaly detection apparatus of the present disclosure may be applied.

As shown in FIG. 1, the system architecture 10 may include a terminal device 102, a network 104, a server 106, and a database 108. The terminal device 102 may be any of various electronic devices that have a display screen and support input and output, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, wearable devices, virtual reality devices, and smart home devices. The network 104 is the medium that provides a communication link between the terminal device 102 and the server 106. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables. The server 106 may be a server or a server cluster that provides various services. The database 108 may be large-scale database software deployed on a server, or small-scale database software installed on a computer, and is used to store and manage data.

A user may use the terminal device 102 to interact with the server 106 and the database 108 through the network 104, to receive or send data and the like. For example, the user imports an indicator data list on the terminal device 102 and uploads the indicator data through the network 104 to the server 106 for anomaly analysis, or uploads the indicator data through the network 104 to the database 108 for storage. As another example, the user obtains indicator data of the same kind for multiple regions from the database 108 through the network 104 and processes it on the terminal device 102 to obtain similar regions.

The server 106 may also receive data from, or send data to, the database 108 through the network 104. For example, the server 106 may be a background processing server that obtains the indicator data to be checked for anomalies from the database 108 through the network 104. As another example, the server 106 may obtain indicator data of the same kind for multiple regions from the database 108 through the network 104, perform region aggregation, and transmit the aggregated indicator data through the network 104 to the database 108 for storage.

It should be understood that the numbers of terminal devices, networks, servers, and databases in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, servers, and databases, according to implementation needs.

FIG. 2 is a flowchart of a time series data anomaly detection method according to an exemplary embodiment. The method shown in FIG. 2 may be applied, for example, to the server side of the above system or to the terminal device of the above system.

Referring to FIG. 2, the method 20 provided by the embodiment of the present disclosure may include the following steps.

在步骤S202中，获取时间序列数据，时间序列数据为连续时间点中各个时间点对应的指标数据的序列。连续时间点可为连续的年份，指标数据的序列可为连续的年份中各个年份对应的指标数据，例如从1970年至2019年共50个年份中，各个年份的四川省普通高等学校专职任课教师的人数。连续时间点也可为连续的月、季度、半年等等，例如，从2000年2019年连续的各季度的房地产固定资产投资额的指标数据的序列。In step S202, time-series data is acquired, where the time-series data is a sequence of index data corresponding to each time point in consecutive time points. Consecutive time points can be consecutive years, and the sequence of indicator data can be indicator data corresponding to each year in consecutive years. number of people. The consecutive time points may also be consecutive months, quarters, half-years, etc., for example, a sequence of indicator data of real estate fixed asset investment in consecutive quarters from 2000 to 2019.

In some embodiments, for example, when index data is collected with the year as the time point and only index data from recent years is available, the amount of time series data for an area to be subjected to anomaly detection is small. Take regional economic indicators such as GDP or the number of graduates: some indicators may have been recorded only since 2000, so a single indicator has very few historical values, an area has only 20 historical reference values, and the data relevant to a given year may be only the data of the preceding three years. The sample size available for subsequent time series feature modeling and anomaly point division is therefore small, which reduces the accuracy of anomaly detection. When acquiring the time series data, the data of areas with similar index data may be aggregated. For specific implementations, refer to FIG. 3A to FIG. 3G, which are not described in detail here.

In step S204, a neighbor correlation feature is obtained according to the index data corresponding to each time point. The neighbor correlation feature represents the correlation between the index data corresponding to each of the multiple neighboring time points of a time point and the index data corresponding to that time point. A time series anomaly of an indicator is usually defined as a point that differs greatly from recent index values. Therefore, the index data of the neighboring time points of each time point may be obtained first in order to extract the neighbor correlation feature; for example, the data of the 1 to 5 years preceding each time point may be extracted for this purpose.
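The extraction of neighboring time points described in step S204 can be illustrated with a minimal sketch. The function name `neighbor_windows` and the sample values are hypothetical and are not part of the disclosure; the sketch assumes that c₀ is the value one step back, c₁ the value two steps back, and so on, and that time points without a full history are skipped.

```python
def neighbor_windows(series, n_neighbors=5):
    """Pair each time point's value with its preceding n_neighbors values.

    The neighbor list is ordered nearest first: index 0 is one step back,
    index 1 is two steps back, and so on. Time points that do not have a
    full history of n_neighbors earlier values are skipped.
    """
    rows = []
    for t in range(n_neighbors, len(series)):
        neighbors = [series[t - 1 - p] for p in range(n_neighbors)]
        rows.append((series[t], neighbors))
    return rows


# Hypothetical yearly index values; the last value jumps sharply.
values = [10, 11, 12, 13, 14, 15, 40]
rows = neighbor_windows(values, n_neighbors=5)
print(rows[0])  # (15, [14, 13, 12, 11, 10])
```

Each returned row can then serve as one sample for measuring how strongly each neighboring time point relates to the current value.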

在一些实施例中，例如，可使用基尼系数计算近邻时间点的指标对于当前时间点的指标之间的关联性(重要性)，再根据近邻关联性获得近邻关联性特征。In some embodiments, for example, the Gini coefficient can be used to calculate the correlation (importance) between the indicators of the neighboring time point and the indicators of the current time point, and then obtain the neighboring correlation feature according to the neighboring correlation.

在一些实施例中，例如，可通过信息增益衡量近邻时间点的指标对于当前时间点的指标之间的关联性(重要性)，再根据近邻关联性获得近邻关联性特征，具体实施方式可参照图4A至图4D，此处不予详述。In some embodiments, for example, the correlation (importance) between the indicators of the nearest neighbor time point and the indicators of the current time point can be measured by information gain, and then the feature of the nearest neighbor correlation can be obtained according to the nearest neighbor correlation. For the specific implementation, please refer to Figures 4A to 4D will not be described in detail here.

In step S206, dimensionality reduction is performed on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point. After the correlation feature is extracted, the neighboring time points that have a greater influence on the indicator at each time point, that is, the neighboring time points with stronger correlation (or greater importance), can be identified, and dimensionality reduction can be performed on the index data of these neighboring time points. For example, the N-dimensional index data of the N neighboring time points corresponding to each time point (N being a positive integer greater than 2) may be reduced to 2 dimensions, yielding 2-dimensional associated neighbor data for each time point.

在步骤S208中，对连续时间点对应的多个关联近邻数据进行划分以从连续时间点中确定指标数据异常的时间点。获得各个时间点对应的关联近邻数据点后，可对这些数据点进行划分以进行异常检测。In step S208, a plurality of associated neighbor data corresponding to consecutive time points are divided to determine the abnormal time point of the index data from the consecutive time points. After obtaining the associated neighboring data points corresponding to each time point, these data points can be divided for anomaly detection.

在一些实施例中，例如，可通过隔离森林作为异常检测模型，对关联近邻数据点进行划分，具体实施方式可参照图6A至图6D，此处不予详述。In some embodiments, for example, an isolation forest can be used as an anomaly detection model to divide the associated neighboring data points. The specific implementation can be referred to FIG. 6A to FIG. 6D , which will not be described in detail here.

In other embodiments, for example, a probability distribution model may be constructed using a statistical method, the probability that the 2-dimensional features of each data point conform to the model may be calculated, and objects with low probability may be regarded as abnormal points, for example using the RobustScaler method from feature engineering. As another example, abnormal points may be detected using a clustering-based method: if, after clustering, some clusters contain far fewer data samples than the others, and statistics of the data in those clusters, such as the distribution of feature means, also differ greatly from those of the other clusters, the sample points in those clusters may be regarded as abnormal points. Examples include the BIRCH clustering algorithm and the DBSCAN density clustering algorithm.
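As a rough illustration of the statistical alternative, the following sketch scales one feature by its median and interquartile range, in the spirit of robust scaling, and flags values that lie far from the median. It is an assumption-laden sketch rather than the disclosed method: the function names, the single-feature treatment, and the cutoff of 3.0 are all hypothetical, and the quartile convention of `statistics.quantiles` may differ from that of other libraries.

```python
import statistics


def robust_scale(values):
    """Scale values as (v - median) / IQR, which is insensitive to outliers."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; values cannot be scaled")
    return [(v - median) / iqr for v in values]


def flag_outliers(values, cutoff=3.0):
    """Return the indices whose robustly scaled magnitude exceeds cutoff."""
    return [i for i, s in enumerate(robust_scale(values)) if abs(s) > cutoff]


# Hypothetical feature values; the last one is far from the rest.
feature = [10, 11, 12, 11, 10, 12, 11, 50]
flagged = flag_outliers(feature)
```

For 2-dimensional associated neighbor data, the same scaling could be applied to each dimension separately before deciding which points to flag.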

According to the time series data anomaly detection method provided by the embodiments of the present disclosure, a neighbor correlation feature is obtained according to the index data corresponding to each time point; dimensionality reduction is performed on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature to obtain the associated neighbor data corresponding to each time point; and the multiple associated neighbor data corresponding to the consecutive time points are divided to determine, from the consecutive time points, the time points at which the index data is abnormal. In this way, the index data of the neighboring time points that are strongly correlated with the indicator at each time point can be selected as the object of anomaly detection, which improves the accuracy of anomaly detection for multiple kinds of index data with different neighbor correlations.

Fig. 3A is a flowchart of a region aggregation method for obtaining anomaly detection data according to an exemplary embodiment. FIG. 3A may serve as the processing procedure of step S202 shown in FIG. 2 in one embodiment. When index data is collected with the year as the time point, the historical index data of a single area to be subjected to anomaly detection may be scarce; the data of similar areas may be aggregated through the implementations of FIG. 3A to FIG. 3G to expand the sample size. The method shown in FIG. 3A may be applied, for example, to the server side of the above system, or to a terminal device of the above system.

参考图3A，本公开实施例提供的方法30可以包括以下步骤。Referring to FIG. 3A , the method 30 provided by the embodiment of the present disclosure may include the following steps.

在步骤S302中，获取各个时间点对应的第一预定区域的指标数据。In step S302, the index data of the first predetermined area corresponding to each time point is acquired.

In step S304, the index data of the second predetermined area corresponding to each time point is acquired. The purpose of region aggregation is to assist anomaly detection for a target area that has only a small amount of data by drawing on the historical conditions of similar areas. The first predetermined area and the second predetermined area may be two provinces, cities, counties, and so on whose conditions are similar. For example, when anomaly detection is performed on the economic index data of Sichuan Province, the regional GDP of the financial industry of Sichuan Province includes only the data of each year in the last 20 years; the same indicator of Chengdu, the capital of Sichuan Province, or of Chongqing, whose economic conditions are similar, may be used to assist decision-making.

在步骤S306中，在第一预定区域的指标数据与第二预定区域的指标数据的相似度大于预设阈值时，获得多个相似区域，多个相似区域包括第一预定区域和第二预定区域。In step S306, when the similarity between the index data of the first predetermined area and the index data of the second predetermined area is greater than a preset threshold, obtain multiple similar areas, and the multiple similar areas include the first predetermined area and the second predetermined area .

In some embodiments, for example, cosine similarity may be used to measure the similarity of the historical indicators of different areas. Denote the first predetermined area as i and its index sequence as the vector $x_i$, and denote the second predetermined area as j and its index sequence as the vector $x_j$. The indicators of the two areas correspond to the same time points, that is, the index sequences of the first and second predetermined areas have the same vector length. The similarity $S_{i,j}$ between the first predetermined area and the second predetermined area can then be calculated by the following formula:

$$S_{i,j} = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \tag{1}$$

where $x_i \cdot x_j$ denotes the inner product of the index vectors of area i and area j, $\|x_i\|$ denotes the 2-norm of the index vector $x_i$ of area i, and $\|x_j\|$ denotes the 2-norm of the index vector $x_j$ of area j. Taking the 2-norm of area i as an example, the calculation formula is:

$$\|x_i\| = \sqrt{\sum_{m=1}^{k} x_{i,m}^{2}} \tag{2}$$

where k denotes the length of the vector $x_i$, k is a positive integer, and $x_{i,m}$ denotes the m-th element of $x_i$.
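The cosine similarity described above, together with the threshold comparison of step S306, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the area names, the sample sequences, and the threshold of 0.95 are hypothetical and do not come from the disclosure.

```python
import math


def cosine_similarity(x_i, x_j):
    """Inner product of two index vectors divided by the product of their 2-norms."""
    if len(x_i) != len(x_j):
        raise ValueError("index sequences must cover the same time points")
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    if norm_i == 0 or norm_j == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_i * norm_j)


def similar_areas(target, candidates, threshold=0.95):
    """Return the names of candidate areas whose similarity to target exceeds threshold."""
    return [name for name, seq in candidates.items()
            if cosine_similarity(target, seq) > threshold]


# Hypothetical index sequences over the same three time points.
target_area = [1.0, 2.0, 3.0]
candidate_areas = {"area_a": [2.0, 4.0, 6.0], "area_b": [3.0, 2.0, 1.0]}
selected = similar_areas(target_area, candidate_areas)
```

Because cosine similarity depends only on the direction of the vectors, two areas whose indicators differ in scale but follow the same trend, such as a province and its capital city, can still be recognized as similar.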

在计算获得各个区域的指标之间的余弦相似度之后，可获得区域相似度的热力图，以将各个区域之间指标的相似程度直观地进行显示。例如，图3B示出了金融业地区生产总值相似度热力图。如图3B所示，分别计算出乐山市、内江市、南充市、四川省、宜宾市、德阳市、成都市、泸州市、绵阳市、自贡市中每两个城市之间的金融业地区生产总值指标的余弦相似度得分，将这些得分以热力图的形式显示出来，可以看出，对于金融业生产总值，与四川省最相似的是成都市和德阳市。成都市是四川省会，成都的金融生产指标直接影响着四川省的指标。After calculating the cosine similarity between the indicators of each region, a heat map of the regional similarity can be obtained to visually display the similarity of the indicators between the regions. For example, Figure 3B shows a heatmap of the similarity of GDP in the financial sector. As shown in Figure 3B, the regional production of the financial industry between each of Leshan City, Neijiang City, Nanchong City, Sichuan Province, Yibin City, Deyang City, Chengdu City, Luzhou City, Mianyang City, and Zigong City was calculated respectively. The cosine similarity score of the gross value index is displayed in the form of a heat map. It can be seen that Chengdu and Deyang are the most similar to Sichuan Province in terms of the gross domestic product of the financial industry. Chengdu is the capital of Sichuan Province, and the financial production indicators of Chengdu directly affect the indicators of Sichuan Province.

在另一些实施例中，例如，也可利用地区指标的相似度矩阵进行聚类，通过聚类效果获得预定区域的相似区域。如图3C-图3G所示，图3C示出了2000年份至2018份眉山市企业所得税税收收入指标曲线，图3D示出了2000年份至2018份达州市企业所得税税收收入指标曲线，图3E示出了2000年份至2018份遂宁市企业所得税税收收入指标曲线，图3F示出了2000年份至2018份绵阳市企业所得税税收收入指标曲线，从图中我们可以看出，绵阳和遂宁的波动规律相似，眉山和达州的波动规律相似；图3G示出了四城市企业所得税税收收入指标相似度聚类图，从聚类效果上可以看出眉山和达州距离最近，绵阳和遂宁距离最近。In other embodiments, for example, a similarity matrix of regional indicators may also be used to perform clustering, and a similar area of a predetermined area may be obtained through a clustering effect. As shown in Figure 3C-Figure 3G, Figure 3C shows the corporate income tax revenue index curve of Meishan City from 2000 to 2018, Figure 3D shows the corporate income tax revenue index curve of Dazhou City from 2000 to 2018, Figure 3E shows From 2000 to 2018, the corporate income tax revenue index curve of Suining City is shown. Figure 3F shows the corporate income tax revenue index curve of Mianyang City from 2000 to 2018. From the figure, we can see that the fluctuation laws of Mianyang and Suining are similar. , the fluctuation laws of Meishan and Dazhou are similar; Figure 3G shows the similarity cluster diagram of the corporate income tax revenue index of the four cities. From the clustering effect, it can be seen that Meishan and Dazhou are the closest, and Mianyang and Suining are the closest.

在步骤S308中，获取各个时间点对应的多个相似区域的指标数据。区域聚合的目的是依据少量数据，借助相似区域的历史状况，辅助目标区域进行异常检测。例如对于四川省数据量少，我们可以借助四川省省会成都市，以及经济状况相近的重庆市辅助进行决策。In step S308, index data of multiple similar regions corresponding to each time point is acquired. The purpose of region aggregation is to assist target regions to perform anomaly detection based on a small amount of data and with the help of the historical conditions of similar regions. For example, for the small amount of data in Sichuan Province, we can use Chengdu, the capital of Sichuan Province, and Chongqing, which has similar economic conditions, to assist in decision-making.

根据本公开实施例提供的区域聚合方法，在每个区域的历史数据较少时，根据历史指标的曲线进行相似性区域的聚合，借助相似区域辅助建模，有效解决了目标区域数据量少的问题，提高了后续进行异常检测的准确度。According to the area aggregation method provided by the embodiment of the present disclosure, when the historical data of each area is small, the aggregation of similarity areas is performed according to the curve of the historical index, and the auxiliary modeling of similar areas is used to effectively solve the problem of the small amount of data in the target area. This improves the accuracy of subsequent anomaly detection.

图4A是根据一示例性实施例示出的一种用于异常检测的特征降维方法的流程图。图4A可作为图2中所示的步骤S206在一实施例中的处理过程。如图4A所示的方法例如可以应用于上述系统的服务器端，也可以应用于上述系统的终端设备。Fig. 4A is a flowchart showing a feature dimension reduction method for anomaly detection according to an exemplary embodiment. FIG. 4A can be used as the processing procedure of step S206 shown in FIG. 2 in one embodiment. The method shown in FIG. 4A can be applied to, for example, the server side of the above-mentioned system, and can also be applied to the terminal device of the above-mentioned system.

参考图4A，本公开实施例提供的方法40可以包括以下步骤。Referring to FIG. 4A , the method 40 provided by the embodiment of the present disclosure may include the following steps.

In step S402, the index data corresponding to each neighboring time point of each time point is acquired. A time series anomaly of an indicator is usually defined as a point that differs greatly from recent index values. For example, the 1 to 5 years preceding each year are taken as its neighboring time points, and it is determined whether the correlation between the data of those 5 years and the data at that time point changes abnormally.

在步骤S404中，基于各个时间点对应的指标数据和各个时间点的各个近邻时间点对应的指标数据计算近邻时间点序列对应的信息增益序列。可通过信息增益衡量近邻时间点与对应时间点数据之间的关联性。In step S404, an information gain sequence corresponding to the sequence of neighboring time points is calculated based on the index data corresponding to each time point and the index data corresponding to each neighboring time point of each time point. The correlation between neighboring time points and corresponding time points can be measured by information gain.

In some embodiments, for example, the neighboring time points of each year, namely the preceding 1 to 5 years, are denoted as $c_p \in \{c_0, c_1, c_2, c_3, c_4\}$, $p \in \{0, 1, 2, 3, 4\}$. The information gain $I_p$ corresponding to each $c_p$ can then be calculated by formula (3).

[Formula (3): the equation is not preserved in this text.]

In formula (3), $y_k$ denotes the k-th index value in the index sequence and $Q$ denotes the index sequence data set. The mean of the index values of the neighboring time point $c_p$ over all time periods, written here as $\bar{c}_p$, serves as the split point. $Q_{left}$ and $Q_{right}$ denote the left-subtree data set and the right-subtree data set obtained by dividing the index sequence data according to $\bar{c}_p$, $y_{k'}$ denotes a datum in the left-subtree or right-subtree data set, and $\bar{y}'$ denotes the mean of the data in that data set.

FIG. 4B shows a schematic diagram of a left and right subtree division according to an embodiment. As shown in FIG. 4B, after division at $\bar{c}_p$, the data $y_1$ to $y_6$ in $Q_{left}$ are all smaller than $\bar{c}_p$, and the data $y_1$ to $y_6$ in $Q_{right}$ are all larger than $\bar{c}_p$. According to the calculation of information gain in formula (3), the correlation between each neighboring time point and the corresponding time point, that is, the importance of each neighboring time point, can be obtained.

FIG. 4C shows a histogram of the importance of the preceding 5 years for the real estate industry according to an embodiment. FIG. 4D shows a histogram of the importance of the preceding 5 years for the construction industry according to an embodiment. As shown in FIG. 4C, for the real estate GDP indicator, the values of the preceding three years contribute most to the current value, and the further back in time, the faster the feature importance declines. Different economic indicators have different important features. As shown in FIG. 4D, for the construction industry, the preceding year and the fifth year back have the greatest influence, which indicates that the construction industry is cyclical and that its cyclical features have a relatively large influence. Aggregated data may also be obtained by region aggregation according to the method of FIG. 3A, in which case the total number of index values k in formula (3) should be the number of index values of one area multiplied by the number of aggregated areas.
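Because the exact form of formula (3) is not available in this text, the following sketch only illustrates one plausible reading of the description: a regression-tree style gain, computed as the reduction in the sum of squared deviations of the current values after the samples are split at the mean of one neighboring time point. The function names, the decision to split samples by their neighbor value, and the sample data are assumptions and should not be read as the disclosed formula.

```python
def sum_squared_deviation(values):
    """Sum of squared deviations of values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def information_gain(current_values, neighbor_values):
    """Spread of current_values removed by splitting at the mean neighbor value.

    current_values[k] is the index value at sample k, and neighbor_values[k]
    is the value of one neighboring time point for the same sample.
    """
    split = sum(neighbor_values) / len(neighbor_values)
    left = [y for y, c in zip(current_values, neighbor_values) if c < split]
    right = [y for y, c in zip(current_values, neighbor_values) if c >= split]
    return (sum_squared_deviation(current_values)
            - sum_squared_deviation(left)
            - sum_squared_deviation(right))


# Hypothetical samples: one neighbor tracks the current value, the other does not.
current = [1, 2, 3, 10, 11, 12]
tracking_neighbor = [1, 2, 3, 10, 11, 12]
unrelated_neighbor = [1, 10, 2, 11, 3, 12]
gain_tracking = information_gain(current, tracking_neighbor)
gain_unrelated = information_gain(current, unrelated_neighbor)
```

Under this reading, a neighboring time point whose split separates the current values into two tight groups receives a large gain, which matches the use of the gain as an importance measure.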

In step S406, the associated neighboring time points of each time point are determined from the multiple neighboring time points based on the neighbor correlation feature. The neighbor correlation feature includes a first dimension and a second dimension: the first dimension is the sequence of neighboring time points, and the second dimension is the information gain sequence corresponding to the sequence of neighboring time points. Taking FIG. 4C as an example, the preceding three years c₀, c₁ and c₂ are selected as the associated neighboring time points, and the corresponding information gain sequence {I₀, I₁, I₂} is {0.7×10⁶, 1.18×10⁶, 1.35×10⁶}.

在步骤S408中，根据各个时间点的关联近邻时间点确定降维后的指标数据维度。基于近邻关联性特征根据各个时间点的关联近邻时间点确定降维后的指标数据维度。In step S408, the dimension of the index data after dimension reduction is determined according to the associated adjacent time points of each time point. The dimension of the indicator data after dimensionality reduction is determined according to the associated neighboring time points of each time point based on the neighbor correlation feature.

In some embodiments, for example, if the sequence length of the neighbor correlation feature is 3, that is, there are 3 associated neighboring time points, the index data after dimensionality reduction may be determined to be 3-dimensional. For a specific implementation, refer to FIG. 5, which is not described in detail here.

在一些实施例中，例如，若采用主成分分析方法进行降维，可先确定降维后的指标数据为2维，然后对关联近邻时间点的指标数据(如3维指标数据)进行降维操作。In some embodiments, for example, if the principal component analysis method is used for dimensionality reduction, the dimensionality reduction index data may be first determined to be 2-dimensional, and then the dimensionality reduction is performed on the index data (eg, 3-dimensional index data) associated with neighboring time points. operate.

在步骤S410中，通过主成分分析方法将多个近邻时间点对应的多个指标数据降维到降维后的指标数据维度。例如可将关联近邻时间点的指标数据通过线性变换转换为2维数据，变换后的两个维度的含义与指标本身含义无关，用于表示关联近邻时间点与对应时间点的指标数据的关联情况。In step S410, a principal component analysis method is used to reduce the dimension of the multiple index data corresponding to the multiple neighboring time points to the dimension of the index data after the dimension reduction. For example, the indicator data associated with neighboring time points can be transformed into 2-dimensional data through linear transformation. The meaning of the transformed two dimensions has nothing to do with the meaning of the indicator itself, and is used to indicate the relationship between the associated neighboring time points and the indicator data at the corresponding time point. .
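The linear transformation of step S410 can be sketched with a small principal component analysis written against the standard library only. This is an illustrative sketch, not the disclosed implementation: the function name, the use of power iteration, the iteration count, and the sample rows are assumptions, and the sketch presumes that the data vary in at least two directions.

```python
import math


def pca_project(rows, n_components=2, iterations=500):
    """Project rows onto their leading principal components via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(n_components):
        vec = [1.0 + 0.1 * j for j in range(d)]  # arbitrary deterministic start
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            # Keep the new direction orthogonal to the components already found.
            for prev in components:
                dot = sum(x * y for x, y in zip(nxt, prev))
                nxt = [x - dot * y for x, y in zip(nxt, prev)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        components.append(vec)
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


# Hypothetical 3-dimensional index data of three associated neighboring time points.
samples = [
    [1.0, 2.0, 3.0],
    [2.0, 3.1, 4.9],
    [3.0, 3.9, 7.2],
    [4.0, 5.2, 8.8],
    [5.0, 6.0, 11.1],
]
projected = pca_project(samples)
```

Each projected row is a 2-dimensional point whose coordinates no longer carry the meaning of the original indicator, consistent with the description of the transformed dimensions.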

在一些实施例中，例如，可将降维后的数据在二维坐标系中可视化显示，获得二维数据散点图。图4E根据图4C示出了房地产总值近邻时间点数据降维散点图。如图4E所示，地区房地产总值除了少数地区的一些时间点异常，总体呈现聚集的簇状结构，表明大部分地区的大部分时间，房地产规律是相近的。可根据散点图获取与大部分数据点距离较远的点为异常点。In some embodiments, for example, the dimensionality-reduced data can be visualized in a two-dimensional coordinate system to obtain a two-dimensional data scatter plot. FIG. 4E shows a dimensionality-reduced scatter plot of real estate total value neighbor time point data according to FIG. 4C . As shown in Figure 4E, the total value of real estate in the region, except for some anomalies at some time points in a few regions, generally presents a clustered structure, indicating that the real estate laws are similar in most regions for most of the time. Points that are far away from most of the data points can be obtained from the scatter plot as outliers.

根据本公开实施例提供的近邻时间点数据降维方法，对关联性特征通过信息增益进行度量，筛选出更重要的近邻时间点，综合考虑特征之间相关性，进行相关特征整合，消除噪音特征保留关键特征，提高了进行异常检测的准确性。According to the data dimensionality reduction method for neighboring time points provided by the embodiment of the present disclosure, the correlation features are measured through information gain, and more important neighboring time points are screened out, and the correlation between features is comprehensively considered, and related features are integrated to eliminate noise features. Key features are preserved, improving the accuracy of anomaly detection.

图5是根据一示例性实施例示出的另一种特征降维方法的流程图。图5可作为图2中所示的步骤S206在另一实施例中的处理过程。如图5所示的方法例如可以应用于上述系统的服务器端，也可以应用于上述系统的终端设备。Fig. 5 is a flowchart showing another feature dimension reduction method according to an exemplary embodiment. FIG. 5 can be used as the processing procedure of step S206 shown in FIG. 2 in another embodiment. The method shown in FIG. 5 can be applied to, for example, the server side of the above-mentioned system, and can also be applied to the terminal device of the above-mentioned system.

步骤S502，基于近邻关联性特征从多个近邻时间点中确定各个时间点的关联近邻时间点。以图4D为例，选取近一年c₀和之前第五年c₄为关联近邻时间点，对应的信息增益序列{I₀，I₄}为{5.8×10⁶,1.5×10⁶,1.35×10⁶}。Step S502: Determine the associated neighbor time point of each time point from the multiple neighbor time points based on the neighbor correlation feature. Taking Fig. 4D as an example, select the last year c ₀ and the previous fifth year c ₄ as the associated neighbor time points, the corresponding information gain sequence {I ₀ , I ₄ } is {5.8×10 ⁶ , 1.5×10 ⁶ , 1.35 ×10 ⁶ }.

In step S504, the index data corresponding to the associated neighboring time points of each time point is obtained as the dimension-reduced associated neighbor data. Taking FIG. 4D as an example, the construction industry regional GDP index data of the preceding year c₀ and of the fifth year back c₄ may be selected directly as the dimension-reduced 2-dimensional (time, index) data. The data may also be displayed visually in a two-dimensional coordinate system to obtain a two-dimensional scatter plot, so that abnormal data points can be found.

Fig. 6A is a flowchart of a method for determining abnormal points according to an exemplary embodiment. FIG. 6A may serve as the processing procedure of step S208 shown in FIG. 2 in one embodiment. The method shown in FIG. 6A may be applied, for example, to the server side of the above system, or to a terminal device of the above system.

参考图6A，本公开实施例提供的方法60可以包括以下步骤。Referring to FIG. 6A , the method 60 provided by the embodiment of the present disclosure may include the following steps.

In step S602, isolation trees are obtained according to the multiple associated neighbor data corresponding to the consecutive time points. The dimension-reduced associated neighbor data may be described by an isolation forest. First, n sample points are randomly selected from the associated neighbor data and placed in the root node of an isolation tree. A dimension is then specified at random (from the two dimensions), and a cut point is generated at random within the data range of the current node, between the minimum and maximum values of the specified dimension in the current node's data. The cut point defines a hyperplane that divides the data space of the current node into 2 subspaces: points whose value in the specified dimension is smaller than the cut point are placed in the left branch of the current node, and points whose value is larger are placed in the right branch. The two preceding steps are repeated recursively in the left and right branch nodes, continually constructing new nodes, until a leaf node holds only one data point (and cannot be cut further) or the tree has grown to the set height.
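The tree construction of step S602 can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the dictionary-based tree representation, the default tree count and height limit, and the sample points are hypothetical, and the sketch only cuts along dimensions in which the node's points actually differ.

```python
import random


def grow_isolation_tree(points, max_depth, rng, depth=0):
    """Recursively split points with random axis-aligned cuts."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    # Only dimensions whose values differ within this node can be cut.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:
        return {"size": len(points)}
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": grow_isolation_tree(left, max_depth, rng, depth + 1),
            "right": grow_isolation_tree(right, max_depth, rng, depth + 1)}


def path_length(tree, point):
    """Count the edges from the root to the leaf that point falls into."""
    edges = 0
    while "cut" in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["cut"] else tree["right"]
        edges += 1
    return edges


def mean_path_length(points, point, n_trees=100, sample_size=None,
                     max_depth=8, seed=0):
    """Average path length of point over trees grown on random samples."""
    rng = random.Random(seed)
    size = sample_size or len(points)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(points, size)
        total += path_length(grow_isolation_tree(sample, max_depth, rng), point)
    return total / n_trees


# Hypothetical 2-D associated neighbor data: a tight grid plus one distant point.
cluster = [(1.0 + 0.1 * i, 1.0 + 0.1 * j) for i in range(4) for j in range(4)]
outlier = (9.0, 9.0)
data = cluster + [outlier]
```

A point that lies far from the others tends to be separated by the first few random cuts, so its average path length is short, which is the property the anomaly score relies on.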

In some embodiments, for example, FIG. 6B shows a schematic diagram of a sample cutting process according to an embodiment, and FIG. 6C shows a schematic diagram of another sample cutting process according to an embodiment. As shown in FIG. 6B and FIG. 6C, point z₀ can be separated in fewer steps than point z_i. FIG. 6D shows a schematic diagram of an isolation tree according to an embodiment. As shown in FIG. 6D, x and y denote the values of the two dimensions, and the node (8.7, 9.2) in the tree may correspond to point z₀.

In step S604, the anomaly score of each associated neighbor datum corresponding to the consecutive time points is obtained based on the isolation trees. After the isolation trees are constructed, a prediction can be made for each associated neighbor datum, that is, it is determined which leaf node the datum falls into. The average path length can be used to measure the degree of abnormality of each sample point, where the path length is the number of edges traversed from the root node of an isolation tree to the leaf node. The anomaly score u(x, n) of a sample point (x, y) among n sample points can be calculated by the following formulas:

$$u(x, n) = 2^{-\frac{E(h(x))}{q(n)}}$$

$$q(n) = 2H(n-1) - \frac{2(n-1)}{n}$$

where h(x) is the path length, that is, the total path length required to traverse from the root node of an isolation tree to the leaf node containing x; E(h(x)) is the average of the path lengths of x over the multiple isolation trees obtained by sampling the associated neighbor data multiple times; q(n) is the mean path length when the number of samples is n and is used to normalize the path length h(x) of sample x; and H(n-1) is a harmonic number, which is a fixed value once n is determined.
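The normalization of the average path length can be sketched as follows, assuming the standard isolation forest form in which the score is 2 raised to the negative ratio of E(h(x)) to q(n), with q(n) = 2H(n-1) - 2(n-1)/n. The function names are hypothetical, and the harmonic number is computed exactly by summation rather than by a logarithmic approximation.

```python
def harmonic(m):
    """Exact harmonic number H(m) = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))


def mean_path_normalizer(n):
    """q(n): the mean path length used to normalize h(x) for n samples."""
    if n < 2:
        raise ValueError("q(n) needs at least two samples")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path, n):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    return 2.0 ** (-mean_path / mean_path_normalizer(n))


# With n = 256 samples, a point isolated after 2 edges on average scores
# higher than a point that needs 10 edges on average.
short_path_score = anomaly_score(2.0, 256)
long_path_score = anomaly_score(10.0, 256)
```

A point whose average path length equals q(n) receives a score of exactly 0.5, so scores well above 0.5 indicate points that are isolated unusually quickly.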

In step S606, the time points corresponding to the associated neighbor data whose anomaly score is greater than a preset threshold are obtained as the time points at which the index data is abnormal.

In some embodiments, for example, FIG. 6E shows, according to an embodiment, the number of full-time teachers in ordinary colleges and universities in Sichuan Province over time together with the corresponding anomaly scores. As shown in FIG. 6E, the horizontal axis represents the year, the vertical axis of the upper plot represents the number of teachers, and the vertical axis of the lower plot represents the anomaly score. The time points at which the number of teachers drops or rises sharply are possible anomalies. The dashed line parallel to the horizontal axis represents the preset threshold, which can be calculated from a proportion of the anomaly scores. For example, the threshold in FIG. 6E is set to 3%, that is, the nodes whose anomaly scores are in the top 3% can be judged to be abnormal. Points A and B are sudden drops and point C is a sudden rise: when the number of teachers, after fluctuating upward or downward in the preceding period, suddenly falls or rises sharply, such cases can be regarded as identified abnormal points.
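Deriving the threshold from a proportion of the anomaly scores, such as the top 3%, can be sketched as follows. The function name, the rounding rule, and the sample scores are assumptions made for illustration; ties at the threshold are all flagged, so slightly more than the requested fraction may be returned.

```python
def flag_top_fraction(scores, fraction=0.03):
    """Return (threshold, indices) for the highest-scoring fraction of points."""
    count = max(1, round(len(scores) * fraction))
    threshold = sorted(scores, reverse=True)[count - 1]
    flagged = [i for i, s in enumerate(scores) if s >= threshold]
    return threshold, flagged


# Hypothetical anomaly scores for 100 time points; three stand out.
scores = [0.4] * 97 + [0.9, 0.8, 0.7]
threshold, abnormal_indices = flag_top_fraction(scores, fraction=0.03)
```

The returned indices can be mapped back to their time points, which are then reported as the time points at which the index data is abnormal.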

根据本公开实施例提供的异常检测方法，对关联近邻数据点基于孤立树计算异常得分，并根据异常得分筛选出可能异常的时间点，相较于直接在指标维度设置阈值判断异常的方法更为准确。According to the anomaly detection method provided by the embodiment of the present disclosure, an anomaly score is calculated based on an isolated tree for associated neighboring data points, and a possible abnormal time point is screened out according to the anomaly score, which is more efficient than the method of directly setting a threshold in the indicator dimension to determine anomaly. precise.

图7是根据一示例性实施例示出的一种时间序列数据异常检测装置的框图。如图7所示的装置例如可以应用于上述系统的服务器端，也可以应用于上述系统的终端设备。Fig. 7 is a block diagram of a device for detecting abnormality in time series data according to an exemplary embodiment. The apparatus shown in FIG. 7 can be applied to, for example, the server side of the above-mentioned system, and can also be applied to the terminal device of the above-mentioned system.

Referring to FIG. 7, the apparatus 70 provided by the embodiments of the present disclosure may include a data acquisition module 702, a correlation feature extraction module 704, an index dimension reduction module 706, and an anomaly detection module 708.

The data acquisition module 702 may be configured to acquire time series data, where the time series data is a sequence of index data corresponding to each time point among consecutive time points.

The correlation feature extraction module 704 may be configured to obtain a neighbor correlation feature according to the index data corresponding to each time point. The neighbor correlation feature represents the correlation between the index data corresponding to each neighboring time point, among the multiple neighboring time points of each time point, and the index data corresponding to that time point.

The index dimension reduction module 706 may be configured to perform dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point.

The anomaly detection module 708 may be configured to divide the multiple associated neighbor data corresponding to the consecutive time points, so as to determine, from the consecutive time points, the time points at which the index data is anomalous.

FIG. 8 is a block diagram of another time series data anomaly detection apparatus according to an exemplary embodiment. The apparatus shown in FIG. 8 may be applied, for example, to the server side of the above system, or to a terminal device of the above system.

Referring to FIG. 8, the apparatus 80 provided by the embodiments of the present disclosure may include a data acquisition module 802, a correlation feature extraction module 804, an index dimension reduction module 806, and an anomaly detection module 808, where the data acquisition module 802 includes a similar region aggregation module 8022.

The data acquisition module 802 may be configured to acquire time series data, where the time series data is a sequence of index data corresponding to each time point among consecutive time points. The time series data includes the index data of multiple similar regions corresponding to each time point.

The data acquisition module 802 may also be configured to acquire the index data of a first predetermined region corresponding to each time point, and to acquire the index data of a second predetermined region corresponding to each time point.

The similar region aggregation module 8022 may also be configured to obtain multiple similar regions when the similarity between the index data of the first predetermined region and the index data of the second predetermined region is greater than a preset threshold, where the multiple similar regions include the first predetermined region and the second predetermined region.
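The similarity test performed by the similar region aggregation module can be sketched as follows. This passage does not say which similarity measure is used, so the Pearson correlation coefficient here is an assumption, as are the function names, the region keys, and the default threshold of 0.9.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def similar_regions(first, second, series_by_region, threshold=0.9):
    """Return both regions if their index series are similar enough, else []."""
    similarity = pearson(series_by_region[first], series_by_region[second])
    return [first, second] if similarity > threshold else []
```

Regions that pass the test can then contribute their index data to the same time series, so that each time point carries the data of several similar regions.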

The correlation feature extraction module 804 may be configured to obtain a neighbor correlation feature according to the index data corresponding to each time point. The neighbor correlation feature represents the correlation between the index data corresponding to each neighboring time point, among the multiple neighboring time points of each time point, and the index data corresponding to that time point. The neighbor correlation feature includes a first dimension and a second dimension: the first dimension is a sequence of neighboring time points, and the second dimension is the information gain sequence corresponding to the sequence of neighboring time points.

The correlation feature extraction module 804 may also be configured to acquire the index data corresponding to each neighboring time point of each time point, and to calculate the information gain sequence corresponding to the sequence of neighboring time points based on the index data corresponding to each time point and the index data corresponding to each neighboring time point of each time point.
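One possible reading of the two-dimensional neighbor correlation feature is sketched below. It assumes that a neighboring time point is identified by its lag (1, 2, 3, ... steps back), that the index values are discretized into equal-width bins, and that the information gain is the entropy of the current value minus its entropy conditioned on the lagged value. These choices, and all names in the code, are illustrative assumptions; the exact computation is the one defined in the method embodiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def discretize(values, bins=4):
    """Map numeric values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def information_gain(target, feature):
    """H(target) - H(target | feature) for two aligned label lists."""
    n = len(target)
    groups = {}
    for t, f in zip(target, feature):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def neighbor_correlation_feature(series, max_lag=3, bins=4):
    """Return (lags, gains): the neighbor sequence and its information gains."""
    labels = discretize(series, bins)
    lags = list(range(1, max_lag + 1))
    # Pair each value labels[t] with the value k steps earlier, labels[t - k].
    gains = [information_gain(labels[k:], labels[:-k]) for k in lags]
    return lags, gains
```

In this sketch a lag with a high gain is one whose past value says a lot about the current value, which is what makes it a candidate associated neighbor.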

The index dimension reduction module 806 may be configured to perform dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point.

The index dimension reduction module 806 may also be configured to determine the associated neighbor time points of each time point from the multiple neighboring time points based on the neighbor correlation feature; determine the dimension of the index data after dimensionality reduction according to the associated neighbor time points of each time point; and reduce the multiple index data corresponding to the multiple neighboring time points to that dimension by principal component analysis.

The index dimension reduction module 806 may also be configured to determine, based on the neighbor correlation feature, the dimension of the index data after dimensionality reduction according to the associated neighbor time points of each time point.

The index dimension reduction module 806 may also be configured to determine the associated neighbor time points of each time point from the multiple neighboring time points based on the neighbor correlation feature, and to take the index data corresponding to the associated neighbor time points of each time point as the associated neighbor data after dimensionality reduction.
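The selection-based variant of the dimensionality reduction (keeping only the index data at the associated neighbor time points) can be sketched as follows. The rule used to pick the associated neighbors, a minimum information gain `min_gain`, is an assumption, as are the function names and the lag-based representation of neighbors.

```python
def associated_lags(lags, gains, min_gain=0.5):
    """Keep the neighbor lags whose information gain reaches `min_gain`."""
    return [lag for lag, gain in zip(lags, gains) if gain >= min_gain]


def associated_neighbor_data(series, selected_lags):
    """Map each time index t to the index values at its associated neighbors.

    Only time indices that have every selected neighbor available are kept,
    so the first max(selected_lags) points are skipped.
    """
    if not selected_lags:
        return {}
    start = max(selected_lags)
    return {
        t: [series[t - lag] for lag in selected_lags]
        for t in range(start, len(series))
    }
```

The length of each returned vector equals the number of associated neighbors, which is how the associated neighbor time points fix the dimension of the reduced data in this sketch.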

The anomaly detection module 808 may be configured to divide the multiple associated neighbor data corresponding to the consecutive time points, so as to determine, from the consecutive time points, the time points at which the index data is anomalous.

The anomaly detection module 808 may also be configured to: obtain isolation trees according to the multiple associated neighbor data corresponding to the consecutive time points; obtain, based on the isolation trees, the anomaly value of each associated neighbor data corresponding to the consecutive time points; and take the time points corresponding to the associated neighbor data whose anomaly values are greater than a preset threshold as the time points at which the index data is anomalous.
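The three steps above (build isolation trees, score each associated neighbor vector, compare against a threshold) can be sketched in plain Python as follows. This follows the standard isolation forest scoring formula rather than anything specific to the disclosure; the number of trees, the sample size, the seed, and all function names are assumptions.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split further on this dimension
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(point, node, depth=0):
    """Depth at which `point` lands, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score each point in (0, 1]; shorter average paths give higher scores."""
    rng = random.Random(seed)
    m = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(m)) if m > 1 else 0
    trees = [build_tree(rng.sample(points, m), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(m)
    scores = []
    for p in points:
        mean_depth = sum(path_length(p, t) for t in trees) / n_trees
        scores.append(2 ** (-mean_depth / norm) if norm > 0 else 0.5)
    return scores


def anomalous_time_points(time_points, scores, threshold):
    """Time points whose anomaly score is greater than the preset threshold."""
    return [t for t, s in zip(time_points, scores) if s > threshold]
```

A vector that lies far from the rest is usually separated after very few random splits, so its average path is short and its score is close to 1.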

For the specific implementation of each module in the apparatus provided by the embodiments of the present disclosure, reference may be made to the description of the method above, which is not repeated here.

FIG. 9 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure. It should be noted that the device shown in FIG. 9 merely takes a computer system as an example, and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.

As shown in FIG. 9, the device 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the device 900. The CPU 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, as well as a speaker and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it can be installed into the storage section 908 as needed.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. When the computer program is executed by the central processing unit (CPU) 901, the above functions defined in the system of the present disclosure are performed.

It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor including a data acquisition module, a correlation feature extraction module, an index dimension reduction module, and an anomaly detection module. The names of these modules do not, in some cases, limit the modules themselves; for example, the data acquisition module may also be described as "a module that acquires time series data from a connected database server".

As another aspect, the present disclosure also provides a computer-readable medium. The computer-readable medium may be included in the device described in the above embodiments, or it may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: acquire time series data, where the time series data is a sequence of index data corresponding to each time point among consecutive time points; obtain a neighbor correlation feature according to the index data corresponding to each time point, where the neighbor correlation feature represents the correlation between the index data corresponding to each neighboring time point, among the multiple neighboring time points of each time point, and the index data corresponding to that time point; perform dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point; and divide the multiple associated neighbor data corresponding to the consecutive time points, so as to determine, from the consecutive time points, the time points at which the index data is anomalous.

Exemplary embodiments of the present disclosure have been specifically shown and described above. It should be understood that the present disclosure is not limited to the detailed structures, arrangements, or implementation methods described herein; on the contrary, the present disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A time series data anomaly detection method, characterized by comprising:

acquiring time series data, where the time series data is a sequence of index data corresponding to each time point among consecutive time points;

obtaining a neighbor correlation feature according to the index data corresponding to each time point, where the neighbor correlation feature represents the correlation between the index data corresponding to each neighboring time point, among multiple neighboring time points of each time point, and the index data corresponding to that time point;

performing dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point;

dividing the plurality of associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the index data is anomalous.

2. The method according to claim 1, wherein the time series data comprises index data of a plurality of similar regions corresponding to each time point;

the acquiring of time series data comprises:

acquiring the index data of a first predetermined region corresponding to each time point;

acquiring the index data of a second predetermined region corresponding to each time point;

when the similarity between the index data of the first predetermined region and the index data of the second predetermined region is greater than a preset threshold, obtaining the plurality of similar regions, the plurality of similar regions including the first predetermined region and the second predetermined region.

3. The method according to claim 1, wherein the neighbor correlation feature includes a first dimension and a second dimension, the first dimension being a sequence of neighboring time points and the second dimension being the information gain sequence corresponding to the sequence of neighboring time points;

the obtaining of the neighbor correlation feature according to the index data corresponding to each time point comprises:

acquiring the index data corresponding to each neighboring time point of each time point;

calculating the information gain sequence corresponding to the sequence of neighboring time points based on the index data corresponding to each time point and the index data corresponding to each neighboring time point of each time point.

4. The method according to claim 1, wherein the performing of dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature comprises:

determining the associated neighbor time points of each time point from the multiple neighboring time points based on the neighbor correlation feature;

determining the dimension of the index data after dimensionality reduction according to the associated neighbor time points of each time point;

reducing the multiple index data corresponding to the multiple neighboring time points to the dimension of the index data after dimensionality reduction by principal component analysis.

5. The method according to claim 4, wherein the determining of the dimension of the index data after dimensionality reduction according to the associated neighbor time points of each time point comprises:

determining, based on the neighbor correlation feature, the dimension of the index data after dimensionality reduction according to the associated neighbor time points of each time point.

6. The method according to claim 1, wherein the performing of dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature comprises:

the obtaining of the associated neighbor data corresponding to each time point comprises:

taking the index data corresponding to the associated neighbor time points of each time point as the associated neighbor data after dimensionality reduction.

7. The method according to any one of claims 1 to 6, wherein the dividing of the plurality of associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the index data is anomalous comprises:

obtaining isolation trees according to the plurality of associated neighbor data corresponding to the consecutive time points;

obtaining, based on the isolation trees, the anomaly value of each associated neighbor data corresponding to the consecutive time points;

taking the time points corresponding to the associated neighbor data whose anomaly values are greater than a preset threshold as the time points at which the index data is anomalous.

8. A time series data anomaly detection apparatus, comprising:

a data acquisition module, configured to acquire time series data, where the time series data is a sequence of index data corresponding to each time point among consecutive time points;

a correlation feature extraction module, configured to obtain a neighbor correlation feature according to the index data corresponding to each time point, where the neighbor correlation feature represents the correlation between the index data corresponding to each neighboring time point, among multiple neighboring time points of each time point, and the index data corresponding to that time point;

an index dimension reduction module, configured to perform dimensionality reduction on the multiple index data corresponding to the multiple neighboring time points based on the neighbor correlation feature, to obtain the associated neighbor data corresponding to each time point;

an anomaly detection module, configured to divide the plurality of associated neighbor data corresponding to the consecutive time points to determine, from the consecutive time points, the time points at which the index data is anomalous.

9. A device, comprising: a memory, a processor, and executable instructions stored in the memory and executable by the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the executable instructions.

10. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the method according to any one of claims 1 to 7 is implemented when the executable instructions are executed by a processor.