WO2020147480A1 - Monitoring indicator anomaly detection method, apparatus and device based on stream processing - Google Patents

Monitoring indicator anomaly detection method, apparatus and device based on stream processing

Info

Publication number
WO2020147480A1
Authority
WO
WIPO (PCT)
Prior art keywords
streaming
data
monitoring
streaming data
information
Prior art date
Application number
PCT/CN2019/125937
Other languages
English (en)
French (fr)
Inventor
赵孝松
王少华
游永胜
陈治
周扬
霍扬扬
杨树波
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020147480A1 publication Critical patent/WO2020147480A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Definitions

  • This application relates to the technical field of intelligent operation and maintenance, and in particular to a method, device and equipment for detecting abnormality of monitoring indicators based on streaming processing.
  • This specification provides a method, device and equipment for detecting abnormality of monitoring indicators based on streaming processing.
  • This specification provides a monitoring indicator abnormality detection method based on streaming processing, and the method includes:
  • obtaining index information of the target monitoring index from the log information, where the index information is streaming data;
  • This specification provides a monitoring indicator abnormality detection device based on streaming processing, and the device includes:
  • an obtaining module, which obtains indicator information of a target monitoring indicator from log information, where the indicator information is streaming data;
  • an aggregation module, which reads the streaming data in a streaming manner and aggregates the read streaming data into an aggregation window of a specified dimension;
  • an abnormality detection module, which performs abnormality detection on the streaming data in the aggregation window according to a predetermined trigger condition to determine whether the target monitoring index is abnormal.
  • This application provides a device, which includes:
  • a memory used to store executable computer instructions;
  • a processor configured to implement the following steps when executing the computer instructions:
  • obtaining index information of the target monitoring index from the log information, where the index information is streaming data;
  • The beneficial effects of this application are as follows: the indicator information of the target monitoring indicator is obtained from the log information, where the indicator information is streaming data; the streaming data is read in a streaming manner, and the read streaming data is aggregated into an aggregation window of a specified dimension; abnormality detection is performed on the streaming data in the aggregation window according to a predetermined trigger condition to determine whether the target monitoring index is abnormal.
  • Fig. 1 is a flowchart of an anomaly detection method based on streaming processing according to an exemplary embodiment of this specification;
  • Fig. 2a is a schematic diagram of a streaming data table according to an exemplary embodiment of this specification;
  • Fig. 2b is a flowchart of an abnormality detection method based on streaming processing according to an exemplary embodiment of this specification;
  • Fig. 3 is a schematic diagram of an anomaly detection method based on streaming processing according to an exemplary embodiment of this specification;
  • Fig. 4 is a schematic diagram of an anomaly detection method based on streaming processing according to an exemplary embodiment of this specification;
  • Fig. 5 is a logical block diagram of an anomaly detection device based on streaming processing according to an exemplary embodiment of this specification;
  • Fig. 6 is a structural logical block diagram of a device according to an exemplary embodiment of this specification.
  • Although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when" or "in response to a determination".
  • The anomaly detection method currently in common use for monitoring indicators is to clean the log data, write the cleaned log data into a database, and then have the algorithm platform fetch the data from the database and perform anomaly detection on it.
  • This method is suitable for scenarios with small data volumes; when the number of logs is very large, reading the data in the database is very time-consuming, so the algorithm platform cannot complete the task in a short time at all and cannot achieve anomaly detection and alarming within minutes or seconds. In short, the current monitoring platforms cannot achieve real-time anomaly detection when the log data is very large.
  • This specification provides a method for anomaly detection based on streaming processing.
  • A streaming processing platform is used to perform anomaly detection on the data of monitoring indicators, which makes real-time anomaly detection possible.
  • Streaming data stands in contrast to traditional databases, while streaming processing is real-time computation based on streaming data.
  • Stream processing is characterized by being continuous, unbounded and instantaneous, which makes it suitable for high-speed concurrency and large-scale real-time data processing scenarios.
  • In stream processing, data has no boundaries: a steady stream of data flows from input to output, but computation needs boundaries; whether it is incremental or full computation, a range is required.
  • This computation model can be called the window model.
  • In the window model, windows with a range are divided according to time, so that the batch of data within a window can be computed.
  • Windows can be divided according to the event time of the data and the processing time of the data (process time).
  • Process time: the time at which the data is processed.
  • The window length refers to the time span of the window.
  • The window size refers to the number of data items in the window.
  • A Watermark is a concept used to express the completeness of the input associated with event time. A Watermark with an event time of X means that all input data whose event time is less than X has been observed. Therefore, when the observed object is an unbounded data source with no end, the Watermark measures data progress.
  • When the Watermark reaches the threshold of a window, the system considers that the data smaller than the Watermark has entered the window, and the data in the window is computed. In other words, the Watermark is used to determine whether the threshold of a window has been reached, that is, to produce a window, and the Watermark continuously updates itself. The Watermark can be obtained based on the data generation time or the data processing time. In a streaming computing platform, once the Watermark of a window is reached, computation of the data in the window is triggered.
  • Fig. 1 is a flowchart of the abnormality detection method based on streaming processing, including steps S102-S106;
  • S106 Perform abnormality detection on the streaming data in the aggregation window according to a predetermined trigger condition to determine whether the target monitoring indicator is abnormal.
  • The anomaly detection method based on streaming processing can be used on the Kepler streaming platform; of course, it can also be used on other streaming platforms with a similar streaming computing engine, such as the Flink streaming platform, the Blink streaming platform and the STORM streaming platform.
  • Flink streaming platform: a streaming platform with a similar streaming computing engine.
  • Blink streaming platform: a streaming platform with a similar streaming computing engine.
  • STORM streaming platform: a streaming platform with a similar streaming computing engine.
  • The target monitoring indicator can be any indicator that needs to be monitored.
  • For example, the target monitoring indicators may be CPU usage, hard disk usage, memory usage, the number of GC collections, and so on.
  • The log information of CPU usage may contain a great deal of information such as the host ID corresponding to the CPU, the host IP address, the observed value of CPU usage and the time, but not all of this indicator information is needed when performing anomaly detection, so the indicator information in the log information can first be cleaned, extracting the indicator information needed for anomaly detection to form streaming data.
  • The streaming data includes monitoring dimension information, timestamps and observed values of the monitoring indicator, where the monitoring dimension information is used to identify the dimension of the streaming data. For example, when streaming data composed of indicator information is aggregated to a specified dimension, this specified dimension can generally be divided according to the single machine or cluster to which the indicator information corresponds, so the monitoring dimension information is used to identify which single machine or cluster the monitored indicator belongs to.
  • For example, if three hosts A, B and C each correspond to one CPU, the IDs of these three hosts can be used to identify the CPUs of the three hosts, so the three host IDs can be used to represent the monitoring dimension information; then, according to the monitoring dimension information, the streaming data is aggregated to the specified dimension, that is, to the aggregation window of the specified host ID, so that the CPU indicator information of the same host can be aggregated into one aggregation window for easy detection.
  • Different monitoring indicators correspond to different monitoring dimensions, but the essence is the same.
  • Fig. 2a is a data table of the indicator information streaming data obtained after cleaning the log information of a monitoring indicator in an embodiment of this specification.
  • The log information is printed with a timestamp, so the indicator observation values of the monitoring indicator at different times can be obtained, and in the cleaned indicator information streaming data the indicator observation values are in one-to-one correspondence with the times.
  • The streaming data can be read in a streaming manner, the read streaming data can be aggregated into an aggregation window of a specified dimension, and then, according to a predetermined trigger condition,
  • abnormality detection is performed on the streaming data in the aggregation window to determine whether the target monitoring index is abnormal.
  • Streaming data is unbounded data and will be input continuously. Therefore, the streaming data needs to be divided by windows, turning it into bounded data.
  • Since the streaming data contains data of different dimensions, the data of different dimensions also needs to be divided, aggregating data of the same dimension into the aggregation window of the specified dimension to facilitate anomaly detection.
  • The streaming data can be divided into different windows according to the monitoring dimension information contained in the streaming data.
  • For example, the monitoring dimension information can be the host ID number.
  • The streaming data with host ID 11111 can all be aggregated into aggregation window A,
  • and the streaming data with host ID 2222 can all be aggregated into aggregation window B.
  • The length of the aggregation window can be set according to the computing resources, the complexity of the computation and the delay of the data.
  • When the anomaly detection algorithm integrated in the streaming processing platform has relatively high computational complexity, computing resources are scarce, or data delays are long, the time span of the aggregation window can be set larger.
  • Conversely, the length of the aggregation window can be set smaller.
  • For example, if the N-sigma anomaly detection algorithm is integrated on the Kepler streaming platform to detect anomalies in the data in the aggregation window, then, considering the computing resources of the Kepler platform and the characteristics of the N-sigma algorithm, the length of the aggregation window can be set to 30 min.
  • The length of the aggregation window can be set flexibly according to the actual application and is not limited in this application.
  • The arrival of the Watermark of the aggregation window can be used as the trigger condition for anomaly detection of the data in the aggregation window.
  • After the streaming data is aggregated into the corresponding aggregation window based on the monitoring dimension information, the system determines whether the aggregation window has reached its Watermark; if so, an anomaly detection algorithm is used to perform anomaly detection on the streaming data in the aggregation window to determine whether the monitoring indicator is abnormal.
  • The anomaly detection algorithm can be the N-sigma algorithm, and there are other statistical or machine learning algorithms for anomaly detection of monitoring indicators, such as the Holt-Winters algorithm, the LOF algorithm and the Isolation Forest algorithm, which are not specifically limited in this specification.
  • The N-sigma algorithm can be integrated in the streaming processing platform to detect abnormalities of the monitoring indicators: the degree of deviation of the observed value of a monitoring indicator at a certain time is calculated by the N-sigma algorithm and compared with a preset threshold to determine whether the monitoring indicator is abnormal.
  • the specific detection method is shown in Figure 2b, including the following steps:
  • S202 Sort the streaming data in the aggregation window in chronological order, and take out the target monitoring index observation value at the latest time point;
  • S204 Calculate the average value and standard deviation of the observed values of the target monitoring indicators at other time points;
  • S206 Based on the average value and standard deviation, perform Z-score calculation on the extracted target monitoring index observation value at the most recent time point, and compare the calculation result with a preset threshold to determine whether the target monitoring index is abnormal.
  • Because each aggregation window holds data of the same type of monitoring indicator in the same dimension, and the streaming data includes the target monitoring indicator observation values at each time point, the target monitoring indicator observation values can be sorted according to the timestamp
  • in chronological order; then the target monitoring indicator observation value at the latest time point is taken out, and the average value and standard deviation of the target monitoring indicator observation values at the remaining time points are calculated.
  • Based on the average value and standard deviation, a Z-score calculation is performed on the target monitoring indicator observation value at the latest time point, and the calculation result is compared with the preset threshold to see whether the degree of deviation exceeds the preset threshold. If it does, the monitoring indicator is considered to be abnormal; if it does not, the monitoring indicator is normal.
  • For example, the cleaned streaming data is the CPU usage of different machines at different points in time. The CPU usage at different times can first be aggregated into the corresponding aggregation window according to the machine ID; for example, aggregation window A holds
  • the CPU usage of the machine with ID 1111, including the CPU usage at different points in time.
  • The CPU usage values are then sorted in chronological order.
  • For example, suppose the time corresponding to the Watermark is Vt.
  • Vt: the time corresponding to the Watermark.
  • When the Watermark, i.e. Vt, is reached, some data earlier than Vt may not have reached the aggregation window yet. If this data is simply discarded, the results of the anomaly detection calculation may not be accurate enough. Therefore, in one embodiment, a delay window can be set: after the Watermark is judged to have arrived, in addition to triggering the anomaly detection calculation on the streaming data in the aggregation window, a timer is also started.
  • The streaming data that arrives late is distributed to a preset delay waiting window; when the waiting time reaches a preset value, the streaming data in the delay waiting window is added to the aggregation window, and the anomaly detection algorithm is again used to perform anomaly detection on the streaming data in the aggregation window.
  • The delay duration can be set according to the specific situation. For example, the delay window can be set to wait only for data delayed by 30 s, or it can wait for data delayed by 1 min; this specification does not impose specific restrictions.
  • By setting a delay window, the delayed data is also taken into account, which can improve the accuracy of anomaly detection.
  • The alarm information may be generated based on the monitoring dimension information, the timestamp and the monitoring indicator observation value of the abnormal streaming data, and the alarm information may then be pushed to a designated database.
  • The designated database can be a three-dimensional HBASE database.
  • If delayed data triggers a calculation, then in theory the arrival of the delayed data makes the aggregation window data more complete and the calculated value more accurate; it therefore triggers the generation of the same rowkey, overwriting the result previously calculated by the same Watermark-triggered calculation and refreshing the alarm result. At the same time, the downstream monitoring and alarm dashboard regularly fetches data from the HBASE database and produces a summary and customized display based on the fetched data.
  • The Kepler streaming processing platform integrates an anomaly detection algorithm, the N-sigma algorithm, and uses streaming processing to perform real-time anomaly detection on monitoring indicators.
  • The log information is cleaned to obtain the streaming data of the indicator information of the monitoring indicators, anomaly detection calculations are then performed on the streaming data in a streaming manner, and the abnormal data is saved to the HBASE database so that the monitoring dashboard can obtain data from the HBASE database.
  • The specific detection method is shown in Figure 4. First, the indicator information of the CPU is obtained from the log information to obtain the indicator information streaming data of the monitoring indicator CPU (S401).
  • The streaming data includes the monitoring dimension information, that is, the host ID corresponding to the CPU, the timestamp, and the observed values of CPU usage at different times; the streaming data is then read in a streaming manner and aggregated into aggregation windows of the specified dimension according to the monitoring dimension information (S402); for example, data with host ID 1111 is aggregated into window 1, data with host ID 2222 into window 2, and data with host ID 3333 into window 3, and the length of the aggregation window is set to 30 minutes.
  • When the Watermark of an aggregation window arrives, the N-sigma anomaly detection algorithm is used to calculate the data in the aggregation window (S403).
  • The specific calculation steps are as follows: sort the data in the aggregation window in chronological order according to the timestamp, take out the data at the most recent time point, and then calculate the average value and standard deviation of the data at the other time points; using the calculated average value and standard deviation, perform Z-score (normalization) processing on the extracted data at the most recent time point to obtain the degree of deviation of the monitoring indicator. In addition, in order to take the delayed data into account so that the result of the anomaly calculation is more accurate, a delay waiting window is also set up to store the delayed data, with the waiting time set to 30 s. While anomaly detection of the data in the aggregation window is triggered, a timer is started to count the time.
  • After the timer reaches the waiting time, the data in the delay waiting window is aggregated into the aggregation window (S404), and the N-sigma anomaly detection algorithm again performs anomaly detection on the data of the aggregation window (S405), calculating the degree of deviation of the monitoring indicator. The calculated degree of deviation is then compared with the preset threshold to determine whether it exceeds the preset threshold (S406). If it does not, there is no abnormality; if it does, a unique rowkey is generated based on the host ID, the timestamp and the observed value of CPU usage, and the generated rowkey is stored in the HBASE database (S407). When delayed data arrives, the aggregation window data becomes more complete and the calculated value more accurate, so it triggers the generation of the same rowkey, overwriting the result previously calculated by the same Watermark-triggered calculation and refreshing the alarm result. In addition, the downstream monitoring and alarm dashboard regularly fetches data from the HBASE database for a summary and customized display (S408).
  • the device 500 includes:
  • the obtaining module 501 obtains indicator information of the target monitoring indicator from log information, where the indicator information is streaming data;
  • the aggregation module 502 reads the streaming data in a streaming manner, and aggregates the read streaming data into an aggregation window of a specified dimension;
  • the abnormality detection module 503 performs abnormality detection on the streaming data in the aggregation window according to a predetermined triggering condition to determine whether the target monitoring index is abnormal.
  • the target monitoring indicators include: CPU usage, Disk usage, Memory usage, and/or GC recovery times.
  • the streaming data includes at least the following information: monitoring dimension information, a time stamp, and an observation value of a target monitoring index, wherein the monitoring dimension information is used to identify the specified dimension.
  • performing abnormality detection on the streaming data in the aggregation window according to a predetermined trigger condition specifically includes:
  • anomaly detection is performed on the streaming data of the aggregation window.
  • the method further includes:
  • the streaming data of the delayed waiting window is added to the aggregation window;
  • Anomaly detection is performed on the streaming data of the aggregation window again.
  • using an anomaly detection algorithm to perform anomaly detection on the streaming data of the aggregation window specifically includes:
  • the push alarm information specifically includes:
  • the method is used in a Kepler streaming platform.
  • the relevant part can refer to the part of the description of the method embodiment.
  • the device embodiments described above are merely illustrative.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units.
  • Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Those of ordinary skill in the art can understand and implement it without creative work.
  • In terms of hardware, FIG. 6 is a hardware structure diagram of the device where the page preloading apparatus of this specification is located. In addition to the processor 601, network interface 604, memory 602 and non-volatile memory 603 shown in FIG. 6,
  • the device where the apparatus of the embodiment is located can usually include other hardware, such as a forwarding chip responsible for processing packets; in terms of hardware structure, the device may also be a distributed device, possibly including multiple interface cards, in order to extend packet processing at the hardware level.
  • the non-volatile memory 603 stores executable computer instructions, and the processor 601 implements the following steps when executing the computer instructions:
  • obtaining index information of the target monitoring index from the log information, where the index information is streaming data;
  • The computer software product is stored in a storage medium and includes several instructions to enable a terminal device to perform all or part of the steps of the method in each embodiment of this application.
  • The foregoing storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A monitoring indicator anomaly detection method, apparatus and device based on stream processing, which obtain indicator information of a target monitoring indicator from log information to form streaming data (S102); read the streaming data in a streaming manner and aggregate the read streaming data into an aggregation window of a specified dimension (S104); and perform anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition to determine whether the target monitoring indicator is abnormal (S106). By integrating an anomaly detection algorithm on a stream processing platform and performing real-time anomaly detection on the data in a streaming manner, minute-level or even second-level anomaly detection can be achieved on massive data.

Description

Monitoring indicator anomaly detection method, apparatus and device based on stream processing
Technical field
This application relates to the technical field of intelligent operation and maintenance, and in particular to a monitoring indicator anomaly detection method, apparatus and device based on stream processing.
Background
In the information age, in order to guarantee the normal operation of the various business platforms, each business platform needs to be monitored. When monitoring a business platform and its business data, a number of highly available monitoring indicators need to be extracted from massive log information, and various anomaly detection algorithms are used to perform anomaly detection on these indicators; if an anomaly is found, an alarm is raised in time so that operation and maintenance staff can handle it. In order to raise alarms in time so that operation and maintenance staff can deal with anomalies promptly and avoid excessive losses, the delay of an anomaly alarm is generally required to be within minutes or even seconds. However, the input TPS of the indicators of many business platforms currently averages in the millions, and analyzing these indicators and raising alarms in time places very high demands on the architecture and algorithm design of the whole monitoring solution. When the log data is very large, current monitoring platforms cannot yet achieve real-time anomaly detection.
Summary
To overcome the problems existing in the related art, this specification provides a monitoring indicator anomaly detection method, apparatus and device based on stream processing.
First, this specification provides a monitoring indicator anomaly detection method based on stream processing, the method including:
obtaining indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
reading the streaming data in a streaming manner, and aggregating the read streaming data into an aggregation window of a specified dimension;
performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
Second, this specification provides a monitoring indicator anomaly detection apparatus based on stream processing, the apparatus including:
an obtaining module, which obtains indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
an aggregation module, which reads the streaming data in a streaming manner and aggregates the read streaming data into an aggregation window of a specified dimension;
an anomaly detection module, which performs anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
Further, this application provides a device, the device including:
a memory for storing executable computer instructions;
a processor configured to implement the following steps when executing the computer instructions:
obtaining indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
reading the streaming data in a streaming manner, and aggregating the read streaming data into an aggregation window of a specified dimension;
performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
Beneficial effects of this application: indicator information of a target monitoring indicator is obtained from log information, the indicator information being streaming data; the streaming data is read in a streaming manner, and the read streaming data is aggregated into an aggregation window of a specified dimension; anomaly detection is performed on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal. By integrating an anomaly detection algorithm on the stream processing platform and performing anomaly detection on the data in real time in a streaming manner, minute-level or even second-level anomaly detection can be achieved on massive data.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.
Brief description of the drawings
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the principles of this application.
Fig. 1 is a flowchart of an anomaly detection method based on stream processing according to an exemplary embodiment of this specification;
Fig. 2a is a schematic diagram of a streaming data table according to an exemplary embodiment of this specification;
Fig. 2b is a flowchart of an anomaly detection method based on stream processing according to an exemplary embodiment of this specification;
Fig. 3 is a schematic diagram of an anomaly detection method based on stream processing according to an exemplary embodiment of this specification;
Fig. 4 is a schematic diagram of an anomaly detection method based on stream processing according to an exemplary embodiment of this specification;
Fig. 5 is a logical block diagram of an anomaly detection apparatus based on stream processing according to an exemplary embodiment of this specification;
Fig. 6 is a structural logical block diagram of a device according to an exemplary embodiment of this specification.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
The terms used in this application are only for the purpose of describing particular embodiments and are not intended to limit this application. The singular forms "a", "the" and "said" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when" or "in response to determining".
In the information age, in order to guarantee the normal operation of the various business platforms, each business platform needs to be monitored. When monitoring a business platform and its business data, a number of highly available monitoring indicators, such as CPU usage and memory usage, need to be extracted from massive log information, and various anomaly detection algorithms are used to perform anomaly detection on these indicators; if an anomaly is found, an alarm is raised in time so that operation and maintenance staff can handle it. In order to raise alarms in time so that operation and maintenance staff can deal with anomalies promptly and avoid excessive losses, the delay of an anomaly alarm is generally required to be within minutes or even seconds. However, the input TPS of the indicators of many business platforms currently averages in the millions, and analyzing these indicators and raising alarms in time places very high demands on the architecture and algorithm design of the whole monitoring solution. The anomaly detection method currently in common use for monitoring indicators is to clean the log data, write the cleaned log data into a database, and then have the algorithm platform fetch the data from the database and perform anomaly detection on it. This approach is suitable for scenarios with small data volumes; when the number of logs is very large, reading the data in the database is very time-consuming, so the algorithm platform simply cannot complete the task in a short time and cannot complete anomaly detection and alarming at the minute or second level. In short, current monitoring platforms cannot yet achieve real-time anomaly detection when the log data is very large.
To solve the above problems, this specification provides an anomaly detection method based on stream processing, in which a stream processing platform is used to perform anomaly detection on the data of monitoring indicators, so that real-time anomaly detection can be achieved.
Before introducing the stream-processing-based anomaly detection method of this specification, streaming data and stream processing are briefly introduced. In a big data environment, many applications exhibit multi-source concurrency, data convergence and online processing, so traditional database technology can no longer meet the real-time requirements of data processing. Streaming data stands in contrast to a traditional database, and stream processing is real-time computation based on streaming data. Compared with static, batch and persistent databases, stream processing is characterized by being continuous, unbounded and instantaneous, and is suitable for high-speed concurrency and large-scale real-time data processing scenarios. In stream processing the data has no boundaries: data flows continuously from input to output, but computation requires boundaries; whether it is incremental or full computation, a range is needed. Therefore, before stream processing is performed on streaming data, the unbounded data stream needs to be divided into segments of data sets; this computation model can be called the window model. In the window model, windows with a range are divided according to time, so that a batch of data within a window can be computed. In general, windows can be divided according to the event time of the data and the processing time of the data (process time). The configured window boundaries cause some data items in the stream to fall inside the window, while data outside the window is not considered by the computation. The window length refers to the time span of the window, and the window size refers to the number of data items in the window.
In stream processing, there is a process and a time lag between when an event is produced and when it is processed. Although in most cases data is processed in the order in which the events were produced, out-of-order data can still arise due to the network, backpressure and other reasons. However, we cannot wait indefinitely for out-of-order data; there must be a mechanism to guarantee that after a specific time, computation of the data in a window is triggered. This special mechanism is the Watermark. A Watermark is a concept used to express the completeness of the input associated with event time. A Watermark with event time X means that all input data whose event time is less than X has been observed. Therefore, when the object being observed is an unbounded data source with no end, the Watermark measures data progress. When the Watermark reaches the threshold of a window, the system considers that the data smaller than the Watermark has entered the window, and the data in the window is computed. In other words, the Watermark is used to determine whether the threshold of a window has been reached, that is, to produce a window, and the Watermark keeps updating itself. The Watermark can be derived from the data generation time or the data processing time. In a streaming computing platform, once the Watermark of a window is reached, computation of the data in the window is triggered.
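To make the watermark mechanism concrete, the following sketch models it in a few lines of Python. It is only an illustrative model, not the implementation of the Kepler, Flink or Blink engines: records are buffered per tumbling window, the watermark is taken as the largest observed event time minus an allowed out-of-orderness, and a window is computed once the watermark passes its end. Names such as WindowTrigger and max_out_of_orderness are assumptions introduced for this example.

```python
from collections import defaultdict

class WindowTrigger:
    """Toy event-time watermark: fire a window once the watermark passes its end."""

    def __init__(self, window_seconds=1800, max_out_of_orderness=30):
        self.window_seconds = window_seconds                # window length, e.g. 30 min
        self.max_out_of_orderness = max_out_of_orderness    # slack subtracted from max event time
        self.watermark = float("-inf")
        self.buffers = defaultdict(list)                    # window_end -> buffered records

    def window_end(self, event_time):
        # Tumbling windows aligned to the window length (event_time in seconds).
        return (int(event_time) // self.window_seconds + 1) * self.window_seconds

    def on_record(self, event_time, value):
        """Buffer a record and return the contents of any window the watermark has passed."""
        self.buffers[self.window_end(event_time)].append((event_time, value))
        # Watermark = largest observed event time minus the allowed out-of-orderness.
        self.watermark = max(self.watermark, event_time - self.max_out_of_orderness)
        fired = []
        for end in sorted(self.buffers):
            if end <= self.watermark:                       # watermark reached the window threshold
                fired.append((end, self.buffers.pop(end)))
        return fired
```

Records whose event time is already below the watermark when they arrive are the "late" data handled by the delay waiting window discussed further below.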
Fig. 1 is a flowchart of the anomaly detection method based on stream processing, including steps S102 to S106;
S102. Obtain indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
S104. Read the streaming data in a streaming manner, and aggregate the read streaming data into an aggregation window of a specified dimension;
S106. Perform anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
The stream-processing-based anomaly detection method provided in this specification can be used on the Kepler stream processing platform; of course, it can also be used on other stream processing platforms with a similar streaming computing engine, such as the Flink, Blink and STORM stream processing platforms. By integrating an anomaly detection algorithm into the stream processing platform, the data of the monitoring indicators can be detected in a streaming manner.
Since the volume of log information is very large, some important and highly available monitoring indicators can be selected from the massive log data as target monitoring indicators, and whether these indicators are abnormal is detected. The target monitoring indicator can be any indicator that needs to be monitored. In some embodiments, the target monitoring indicator may be CPU usage, hard disk usage, memory usage, the number of GC collections, and so on.
Since the indicator information of the target monitoring indicator in the log information is very large in volume and contains a great deal of information, the indicator information of the target monitoring indicator needs to be cleaned, extracting the indicator information useful for anomaly detection, and this indicator information forms the streaming data of the target monitoring indicator. For example, the log information of CPU usage may contain a great deal of information such as the host ID corresponding to the CPU, the host IP address, the observed value of CPU usage and the time, but not all of this indicator information is needed for anomaly detection, so the indicator information in the log information can first be cleaned, and the indicator information needed for anomaly detection is extracted to form the streaming data. In one embodiment, the streaming data contains monitoring dimension information, a timestamp and observed values of the monitoring indicator, where the monitoring dimension information is used to identify the dimension of the streaming data. For example, when the streaming data composed of indicator information is aggregated to a specified dimension, this specified dimension can generally be divided according to the single machine or cluster to which the indicator information corresponds, so the monitoring dimension information is used to identify which single machine or cluster the monitored indicator belongs to. For example, if three hosts A, B and C each correspond to one CPU, the IDs of these three hosts can be used to identify the CPUs of the three hosts, so the three host IDs can be used to represent the monitoring dimension information; then, according to the monitoring dimension information, the streaming data is aggregated to the specified dimension, that is, to the aggregation window of the specified host ID, so that the CPU indicator information of the same host can be aggregated into one aggregation window for easy detection. Of course, different monitoring indicators correspond to different monitoring dimensions, but the essence is the same. Fig. 2a is a data table of the indicator information streaming data obtained after cleaning the log information of a monitoring indicator in one embodiment of this specification.
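As an illustration of the cleaning step, the sketch below parses a hypothetical CPU-usage log line and keeps only the three fields used later (monitoring dimension, timestamp, observed value). The log format, the field names and the IndicatorRecord type are assumptions made for this example and are not prescribed by this specification.

```python
import re
from dataclasses import dataclass

# Hypothetical log line, e.g.:
# "2019-01-14 09:30:00 host_id=1111 ip=10.0.0.1 metric=cpu_usage value=50"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"host_id=(?P<host>\w+) .*?metric=(?P<metric>\w+) value=(?P<value>[\d.]+)"
)

@dataclass
class IndicatorRecord:
    dimension: str    # monitoring dimension information, here the host ID
    timestamp: str    # event time printed in the log
    value: float      # observed value of the monitoring indicator

def clean_log_line(line: str, target_metric: str = "cpu_usage"):
    """Keep only the fields needed for anomaly detection; return None for other lines."""
    m = LOG_PATTERN.search(line)
    if not m or m.group("metric") != target_metric:
        return None
    return IndicatorRecord(m.group("host"), m.group("ts"), float(m.group("value")))
```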
In addition, log information is printed with a timestamp, so the observed values of the monitoring indicator at different times can be obtained, and in the cleaned indicator information streaming data the observed indicator values are placed in one-to-one correspondence with the times.
After the indicator information streaming data of the target monitoring indicator is obtained by cleaning, the streaming data can be read in a streaming manner and the read streaming data can be aggregated into an aggregation window of a specified dimension; then, according to a predetermined trigger condition, anomaly detection is performed on the streaming data in the aggregation window to determine whether the target monitoring indicator is abnormal. In stream processing, streaming data is unbounded data that keeps arriving, so the streaming data needs to be divided by windows into bounded data. In addition, because the streaming data contains data of different dimensions, the data of different dimensions also needs to be divided, aggregating data of the same dimension into an aggregation window of the specified dimension to facilitate anomaly detection. For example, the streaming data can be divided into different windows according to the monitoring dimension information it contains; for instance, when the target monitoring indicator is CPU usage, the monitoring dimension information can be the host ID, the streaming data with host ID 11111 can all be aggregated into aggregation window A, and the streaming data with host ID 2222 into aggregation window B. In general, the length of the aggregation window can be set according to the computing resources, the complexity of the computation and the delay of the data. For example, when the anomaly detection algorithm integrated in the stream processing platform has relatively high computational complexity, computing resources are scarce, or data delays are long, the time span of the aggregation window can be set larger; conversely, if the computational complexity is low, computing resources are plentiful and data delays are short, the length of the aggregation window can be set smaller. For example, if the N-sigma anomaly detection algorithm is integrated on the Kepler stream processing platform to perform anomaly detection on the data in the aggregation window, then, considering the computing resources of the Kepler platform and the characteristics of the N-sigma algorithm, the length of the aggregation window can be set to 30 minutes. The length of the aggregation window can be set flexibly according to the actual application, and this application does not limit it.
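The grouping of cleaned records into per-host aggregation windows of a fixed length can be sketched as follows. This is only a schematic stand-in for the window operator of the stream processing platform; it assumes timestamps of the form "YYYY-MM-DD HH:MM:SS" and simply keys each (host_id, timestamp, value) record by host ID and its 30-minute window start.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_MINUTES = 30  # aggregation window length chosen in the example above

def window_start(ts: str) -> str:
    """Truncate a 'YYYY-MM-DD HH:MM:SS' timestamp to its 30-minute window start."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    t = t.replace(minute=(t.minute // WINDOW_MINUTES) * WINDOW_MINUTES, second=0)
    return t.strftime("%Y-%m-%d %H:%M:%S")

def aggregate(records):
    """Group (host_id, timestamp, value) tuples into per-host aggregation windows."""
    windows = defaultdict(list)  # (host_id, window_start) -> [(timestamp, value), ...]
    for host_id, ts, value in records:
        windows[(host_id, window_start(ts))].append((ts, value))
    return windows
```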
In one embodiment, the arrival of the Watermark of the aggregation window can be used as the trigger condition for anomaly detection of the data in the aggregation window. After the streaming data is aggregated into the corresponding aggregation window according to the monitoring dimension information, the system determines whether the Watermark of the aggregation window has been reached; if it has, an anomaly detection algorithm is used to perform anomaly detection on the streaming data in the aggregation window to determine whether the monitoring indicator is abnormal. Of course, the anomaly detection algorithm may be the N-sigma algorithm, and there are other statistical or machine learning algorithms for anomaly detection of monitoring indicators, such as the Holt-Winters algorithm, the LOF algorithm and the Isolation Forest algorithm, which are not specifically limited in this specification.
In one embodiment, the N-sigma algorithm can be integrated in the stream processing platform to perform anomaly detection on the monitoring indicator: the degree of deviation of the observed value of the monitoring indicator at a certain time is calculated by the N-sigma algorithm and compared with a preset threshold to determine whether the monitoring indicator is abnormal. The specific detection method is shown in Fig. 2b and includes the following steps:
S202. Sort the streaming data in the aggregation window in chronological order, and take out the observed value of the target monitoring indicator at the most recent time point;
S204. Calculate the mean and standard deviation of the observed values of the target monitoring indicator at the remaining time points;
S206. Based on the mean and standard deviation, perform a Z-score calculation on the extracted observed value of the target monitoring indicator at the most recent time point, and compare the calculation result with a preset threshold to determine whether the target monitoring indicator is abnormal.
Since the streaming data aggregated in each aggregation window is data of the same type of monitoring indicator in the same dimension, and the streaming data includes the observed values of the target monitoring indicator at each time point, the observed values of the target monitoring indicator can be sorted in chronological order according to the timestamp; then the observed value at the most recent time point is taken out, and the mean and standard deviation of the observed values at the remaining time points are calculated. Based on the mean and standard deviation, a Z-score calculation is performed on the extracted observed value at the most recent time point, and the calculation result is compared with the preset threshold to see whether the degree of deviation exceeds the preset threshold; if it does, the monitoring indicator is considered abnormal, and if it does not, the monitoring indicator is normal. For example, the cleaned streaming data is the CPU usage of different machines at different time points. The CPU usage at different times can first be aggregated into the corresponding aggregation window according to the machine ID; for example, aggregation window A holds the CPU usage of the machine with ID 1111, including the CPU usage at different time points. Sorting the CPU usage in chronological order, the data is as follows:
Time: 9:30, CPU usage: 50%
Time: 9:35, CPU usage: 55%
Time: 9:40, CPU usage: 69%
Time: 9:45, CPU usage: 78%
...
Time: 10:30, CPU usage: 95%
Take out the data at the most recent time point, 10:30, then calculate the mean and standard deviation of the remaining data, then perform Z-score normalization on the 95% value at the most recent time point to obtain the degree of deviation, and compare it with the preset value to determine whether the degree of deviation exceeds the preset value.
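Written out as code, the N-sigma check of this example looks as follows. The sketch uses the sample values from the table above (the rows elided by "..." are omitted here) and an assumed threshold of N = 3, since the specification only speaks of a preset threshold.

```python
from statistics import mean, pstdev

def n_sigma_is_abnormal(observations, threshold=3.0):
    """observations: (time, value) pairs sorted oldest to newest.
    Z-score of the newest value against the mean/std of the earlier values."""
    values = [v for _, v in observations]
    latest, history = values[-1], values[:-1]
    mu = mean(history)
    sigma = pstdev(history)               # population standard deviation
    if sigma == 0:
        return latest != mu, float("inf") if latest != mu else 0.0
    z = (latest - mu) / sigma
    return abs(z) > threshold, z

# Window A for host 1111 (elided rows omitted):
window_a = [("9:30", 50), ("9:35", 55), ("9:40", 69), ("9:45", 78), ("10:30", 95)]
abnormal, z = n_sigma_is_abnormal(window_a)
print(abnormal, round(z, 2))  # degree of deviation of the 10:30 reading from the earlier ones
```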
In stream processing, the problem of data arriving late, that is, the out-of-order problem, is often encountered. For example, if the time corresponding to the Watermark is Vt, then when the Watermark, i.e. Vt, is reached, some data earlier than Vt may not yet have reached the aggregation window; if this data is simply discarded, the results of the anomaly detection calculation may not be accurate enough. Therefore, in one embodiment, a delay window can be set: after the Watermark is determined to have arrived, in addition to triggering the anomaly detection calculation on the streaming data in the aggregation window, a timer is started, and the streaming data that arrives late is distributed to a preset delay waiting window; when the timed duration reaches a preset value, the streaming data in the delay waiting window is added to the aggregation window, and the anomaly detection algorithm is used again to perform anomaly detection on the streaming data in the aggregation window. The delay duration can be set according to the specific situation; for example, the delay window may wait only for data delayed by 30 s, or it may wait for data delayed by 1 min, which is not specifically limited in this specification. By setting a delay window during anomaly detection, data that arrives late is also taken into account, which can improve the accuracy of anomaly detection.
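The delay waiting window can be sketched as follows: the Watermark triggers a first detection, late records are parked for a fixed wait (30 s in the example above), and detection is then re-run over the merged data. The timer is simulated by explicit method calls rather than a real scheduler, and the class and method names are illustrative only.

```python
class DelayWaitingWindow:
    """Park late records after the Watermark fires, then re-run detection once."""

    def __init__(self, detect, wait_seconds=30):
        self.detect = detect              # detection callback, e.g. n_sigma_is_abnormal
        self.wait_seconds = wait_seconds  # how long to wait for late data
        self.late_records = []

    def on_watermark(self, window_data):
        # First detection, triggered by the Watermark of the aggregation window.
        return self.detect(window_data)

    def on_late_record(self, record):
        # Records whose event time is before the Watermark are parked here.
        # record = (event_time, value), event_time comparable (e.g. epoch seconds).
        self.late_records.append(record)

    def on_timer(self, window_data):
        # After wait_seconds, merge the parked records and run detection again.
        merged = sorted(window_data + self.late_records, key=lambda r: r[0])
        self.late_records = []
        return self.detect(merged)
```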
After anomaly detection is performed on the data in the aggregation window by the anomaly detection algorithm, if abnormal data is found, alarm information needs to be pushed. In one embodiment, the alarm information can be generated from the monitoring dimension information, timestamp and observed indicator value of the abnormal streaming data, and the alarm information is then pushed to a specified database. Of course, if late-arriving data retriggers the anomaly detection calculation of the aggregation window, the latest result is updated to the specified database. In one embodiment, the specified database may be a three-dimensional HBASE database. After an anomaly occurs, a unique rowkey can be generated from information such as the monitoring dimension information, the observed value of the target monitoring indicator and the timestamp. If late-arriving data triggers a calculation, then in theory the arrival of the delayed data makes the aggregation window data more complete and the calculated value more accurate, so it triggers the generation of the same rowkey, overwriting the result previously calculated by the same Watermark-triggered calculation, thereby refreshing the alarm result. At the same time, the downstream monitoring and alarm dashboard regularly fetches data from the HBASE database and produces a summary and customized display based on the fetched data.
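The overwrite-and-refresh behaviour follows from how the rowkey is built: the same dimension and window time always map to the same key, so a recomputation triggered by late data replaces the earlier row. The specification derives the rowkey from the host ID, timestamp and observed value; in the sketch below the key is built from the dimension and window time only, so that a recomputation lands on the same row, which is an assumption about the intended layout, and a plain dictionary stands in for the HBASE table.

```python
alarm_table = {}  # stands in for the three-dimensional HBASE table

def make_rowkey(dimension, window_time):
    """Same host + same window -> same rowkey, so a recomputation overwrites the old row."""
    return f"{dimension}|{window_time}"

def push_alarm(dimension, window_time, observed_value, deviation):
    rowkey = make_rowkey(dimension, window_time)
    alarm_table[rowkey] = {"observed_value": observed_value, "deviation": deviation}
    return rowkey

# First computation, triggered by the Watermark of the window ending at 10:30:
push_alarm("1111", "10:30", 95, 2.9)
# Recomputation after the delay waiting window closes refreshes the same row:
push_alarm("1111", "10:30", 95, 3.1)
```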
To further explain the stream-processing-based anomaly detection method of this application, a specific embodiment is described below with reference to Fig. 3 and Fig. 4.
In order to perform real-time anomaly detection on monitoring indicators, an anomaly detection algorithm, the N-sigma algorithm, is integrated on the Kepler stream processing platform, and anomaly detection is performed on the monitoring indicators in real time using stream processing. As shown in Fig. 3, the log information is cleaned to obtain the streaming data of the indicator information of the monitoring indicator; anomaly detection calculations are then performed on the streaming data in a streaming manner, and the abnormal data is saved to the HBASE database so that the monitoring dashboard can obtain data from the HBASE database. Suppose the indicator to be monitored is CPU usage; the specific detection method is shown in Fig. 4. First, the indicator information of the CPU is obtained from the log information to obtain the indicator information streaming data of the monitoring indicator CPU (S401); the streaming data includes the monitoring dimension information, namely the host ID corresponding to the CPU, the timestamp, and the observed values of CPU usage at different times. The streaming data is then read in a streaming manner and aggregated into aggregation windows of the specified dimension according to the monitoring dimension information (S402); for example, data with host ID 1111 is aggregated into aggregation window 1, data with host ID 2222 into window 2, and data with host ID 3333 into window 3, and the length of the aggregation window is set to 30 minutes. When the Watermark of an aggregation window is determined to have arrived, the N-sigma anomaly detection algorithm is used to calculate the data in the aggregation window (S403). The specific calculation steps are as follows: sort the data in the aggregation window in chronological order according to the timestamp, take out the data at the most recent time point, calculate the mean and standard deviation of the data at the remaining time points, and perform Z-score (normalization) processing on the extracted data at the most recent time point using the calculated mean and standard deviation to obtain the degree of deviation of the monitoring indicator. In addition, in order to take the late-arriving data into account so that the result of the anomaly calculation is more accurate, a delay waiting window is also set up to store late-arriving data, with the waiting time set to 30 s. While anomaly detection on the data in the aggregation window is triggered, a timer is started; after the timer reaches 30 s, the data in the delay waiting window is aggregated into the aggregation window (S404), and the N-sigma anomaly detection algorithm is again used to perform anomaly detection on the data in the aggregation window (S405), calculating the degree of deviation of the monitoring indicator. The calculated degree of deviation is then compared with the preset threshold to determine whether it exceeds the preset threshold (S406); if it does not, there is no anomaly, and if it does, a unique rowkey is generated from the host ID, the timestamp and the observed value of CPU usage, and the generated rowkey is stored in the HBASE database (S407). When delayed data arrives, the aggregation window data becomes more complete and the calculated value more accurate, so it triggers the generation of the same rowkey, overwriting the result previously calculated by the same Watermark-triggered calculation and refreshing the alarm result. In addition, the downstream monitoring and alarm dashboard regularly fetches data from the HBASE database for a summary and customized display (S408).
Corresponding to the embodiments of the stream-processing-based anomaly detection method provided in this specification, this specification also provides a stream-processing-based anomaly detection apparatus. As shown in Fig. 5, the apparatus 500 includes:
an obtaining module 501, which obtains indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
an aggregation module 502, which reads the streaming data in a streaming manner and aggregates the read streaming data into an aggregation window of a specified dimension;
an anomaly detection module 503, which performs anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
In one embodiment, the target monitoring indicator includes: CPU usage, disk usage, memory usage and/or the number of GC collections.
In one embodiment, the streaming data includes at least the following information: monitoring dimension information, a timestamp and observed values of the target monitoring indicator, where the monitoring dimension information is used to identify the specified dimension.
In one embodiment, performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition specifically includes:
determining whether the Watermark of the aggregation window has arrived;
if it has arrived, performing anomaly detection on the streaming data in the aggregation window.
In one embodiment, the method further includes:
after determining that the Watermark has arrived, starting a timer, and distributing streaming data that arrives late to a preset delay waiting window;
when the timed duration reaches a preset value, adding the streaming data in the delay waiting window to the aggregation window;
performing anomaly detection on the streaming data in the aggregation window again.
In one embodiment, using an anomaly detection algorithm to perform anomaly detection on the streaming data in the aggregation window specifically includes:
sorting the streaming data in the aggregation window in chronological order, and taking out the observed value of the target monitoring indicator at the most recent time point;
calculating the mean and standard deviation of the observed values of the target monitoring indicator at the remaining time points;
based on the mean and standard deviation, performing a Z-score calculation on the extracted observed value of the target monitoring indicator at the most recent time point, and comparing the calculation result with a preset threshold to determine whether the target monitoring indicator is abnormal.
In one embodiment, if the target monitoring indicator is abnormal, alarm information is pushed, where pushing the alarm information specifically includes:
generating the alarm information based on the monitoring dimension of the streaming data, the observed value of the target monitoring indicator and the timestamp;
pushing the alarm information to a specified database.
In one embodiment, the method is used on the Kepler stream processing platform.
For the implementation process of the functions and roles of each unit in the above apparatus, refer to the implementation process of the corresponding steps in the above method, which is not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, for relevant parts reference may be made to the description of the method embodiments. The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application. Persons of ordinary skill in the art can understand and implement it without creative effort.
In terms of hardware, Fig. 6 is a hardware structure diagram of the device where the page preloading apparatus of this specification is located. In addition to the processor 601, network interface 604, memory 602 and non-volatile storage 603 shown in Fig. 6, the device where the apparatus of the embodiment is located may generally also include other hardware, such as a forwarding chip responsible for processing packets; in terms of hardware structure, the device may also be a distributed device and may include multiple interface cards, so as to extend packet processing at the hardware level.
The non-volatile storage 603 stores executable computer instructions, and the processor 601 implements the following steps when executing the computer instructions:
obtaining indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
reading the streaming data in a streaming manner, and aggregating the read streaming data into an aggregation window of a specified dimension;
performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
Since the part of this application that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, the computer software product is stored in a storage medium and includes several instructions for causing a terminal device to perform all or part of the steps of the methods in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (10)

  1. A monitoring indicator anomaly detection method based on stream processing, the method comprising:
    obtaining indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
    reading the streaming data in a streaming manner, and aggregating the read streaming data into an aggregation window of a specified dimension;
    performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
  2. The monitoring indicator anomaly detection method based on stream processing according to claim 1, wherein the target monitoring indicator comprises: CPU usage, hard disk usage, memory usage and/or a number of GC collections.
  3. The monitoring indicator anomaly detection method based on stream processing according to claim 1, wherein the streaming data comprises at least the following information: monitoring dimension information, a timestamp and observed values of the target monitoring indicator, wherein the monitoring dimension information is used to identify the specified dimension.
  4. The monitoring indicator anomaly detection method based on stream processing according to claim 1, wherein performing anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition specifically comprises:
    determining whether a Watermark of the aggregation window has arrived;
    if it has arrived, performing anomaly detection on the streaming data in the aggregation window.
  5. The monitoring indicator anomaly detection method based on stream processing according to claim 4, the method further comprising:
    after determining that the Watermark has arrived, starting a timer, and distributing streaming data that arrives late to a preset delay waiting window;
    when the timed duration reaches a preset value, adding the streaming data in the delay waiting window to the aggregation window;
    performing anomaly detection on the streaming data in the aggregation window again.
  6. The monitoring indicator anomaly detection method based on stream processing according to any one of claims 1 to 5, wherein performing anomaly detection on the streaming data in the aggregation window specifically comprises:
    sorting the streaming data in the aggregation window in chronological order, and taking out the observed value of the target monitoring indicator at the most recent time point;
    calculating the mean and standard deviation of the observed values of the target monitoring indicator at the remaining time points;
    based on the mean and standard deviation, performing a Z-score calculation on the extracted observed value of the target monitoring indicator at the most recent time point, and comparing the calculation result with a preset threshold to determine whether the target monitoring indicator is abnormal.
  7. The monitoring indicator anomaly detection method based on stream processing according to claim 3, further comprising: if the target monitoring indicator is abnormal, pushing alarm information; wherein
    pushing the alarm information specifically comprises:
    generating the alarm information based on the monitoring dimension of the streaming data, the observed value of the target monitoring indicator and the timestamp;
    pushing the alarm information to a specified database.
  8. The monitoring indicator anomaly detection method based on stream processing according to claim 1, wherein the method is used on a Kepler stream processing platform.
  9. A monitoring indicator anomaly detection apparatus based on stream processing, the apparatus comprising:
    an obtaining module, which obtains indicator information of a target monitoring indicator from log information, the indicator information being streaming data;
    an aggregation module, which reads the streaming data in a streaming manner and aggregates the read streaming data into an aggregation window of a specified dimension;
    an anomaly detection module, which performs anomaly detection on the streaming data in the aggregation window according to a predetermined trigger condition, so as to determine whether the target monitoring indicator is abnormal.
  10. A device, the device comprising:
    a memory for storing executable computer instructions;
    a processor configured to implement, when executing the computer instructions, the steps of the method according to any one of claims 1 to 8.
PCT/CN2019/125937 2019-01-14 2019-12-17 Monitoring indicator anomaly detection method, apparatus and device based on stream processing WO2020147480A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910032707.3A CN110058977B (zh) 2019-01-14 2019-01-14 基于流式处理的监控指标异常检测方法、装置及设备
CN201910032707.3 2019-01-14

Publications (1)

Publication Number Publication Date
WO2020147480A1 true WO2020147480A1 (zh) 2020-07-23

Family

ID=67315973

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/125937 WO2020147480A1 (zh) 2019-01-14 2019-12-17 基于流式处理的监控指标异常检测方法、装置及设备

Country Status (2)

Country Link
CN (1) CN110058977B (zh)
WO (1) WO2020147480A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189844A1 (zh) * 2020-09-22 2021-09-30 平安科技(深圳)有限公司 多元kpi时间序列的检测方法、装置、设备及存储介质

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058977B (zh) * 2019-01-14 2020-08-14 阿里巴巴集团控股有限公司 基于流式处理的监控指标异常检测方法、装置及设备
CN110569166A (zh) * 2019-08-19 2019-12-13 阿里巴巴集团控股有限公司 异常检测方法、装置、电子设备及介质
CN110609780B (zh) * 2019-08-27 2023-06-13 Oppo广东移动通信有限公司 数据监控方法、装置、电子设备及存储介质
CN110764944B (zh) * 2019-10-22 2023-05-16 东软睿驰汽车技术(沈阳)有限公司 一种异常检测方法及装置
CN110971485B (zh) * 2019-11-19 2022-01-28 网联清算有限公司 业务指标的监控系统及方法
CN110990433B (zh) * 2019-11-21 2023-06-13 深圳马可孛罗科技有限公司 一种实时业务监控预警方法及预警装置
CN111210156B (zh) * 2020-01-13 2022-04-01 拉扎斯网络科技(上海)有限公司 基于流窗口实现的实时流数据处理方法及装置
CN111506581B (zh) * 2020-06-17 2020-11-06 北京北龙超级云计算有限责任公司 一种数据聚合方法和服务器
CN111815449B (zh) * 2020-07-13 2023-12-19 上证所信息网络有限公司 一种基于流计算的多主机行情系统的异常检测方法及系统
US11269706B2 (en) * 2020-07-15 2022-03-08 Beijing Wodong Tianjun Information Technology Co., Ltd. System and method for alarm correlation and aggregation in IT monitoring
CN111831383A (zh) * 2020-07-20 2020-10-27 北京百度网讯科技有限公司 窗口拼接方法、装置、设备以及存储介质
CN112445679B (zh) * 2020-11-13 2023-01-06 度小满科技(北京)有限公司 一种信息检测方法、装置、服务器及存储介质
CN112463779A (zh) * 2020-11-25 2021-03-09 中国第一汽车股份有限公司 一种数据处理方法、装置、设备及存储介质
CN114666237B (zh) * 2022-02-25 2023-10-31 众安在线财产保险股份有限公司 秒级监控方法、装置及存储介质
CN117312391A (zh) * 2023-10-23 2023-12-29 中南民族大学 一种基于流式计算的大数据平台动态指标评价方法和系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546514A (zh) * 2012-07-13 2014-01-29 阿里巴巴集团控股有限公司 一种处理延迟发送的日志数据的方法和系统
CN105812202A (zh) * 2014-12-31 2016-07-27 阿里巴巴集团控股有限公司 日志实时监控预警方法及其装置
CN108683560A (zh) * 2018-05-15 2018-10-19 中国科学院软件研究所 一种大数据流处理框架的性能基准测试系统及方法
CN110058977A (zh) * 2019-01-14 2019-07-26 阿里巴巴集团控股有限公司 基于流式处理的监控指标异常检测方法、装置及设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110083046A1 (en) * 2009-10-07 2011-04-07 International Business Machines Corporation High availability operator groupings for stream processing applications
CN103279479A (zh) * 2013-04-19 2013-09-04 中国科学院计算技术研究所 一种面向微博客平台文本流的突发话题检测方法及系统
CN108108253A (zh) * 2017-12-26 2018-06-01 北京航空航天大学 一种面向多数据流的异常状态检测方法
CN108306879B (zh) * 2018-01-30 2020-11-06 福建师范大学 基于Web会话流的分布式实时异常定位方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546514A (zh) * 2012-07-13 2014-01-29 阿里巴巴集团控股有限公司 一种处理延迟发送的日志数据的方法和系统
CN105812202A (zh) * 2014-12-31 2016-07-27 阿里巴巴集团控股有限公司 日志实时监控预警方法及其装置
CN108683560A (zh) * 2018-05-15 2018-10-19 中国科学院软件研究所 一种大数据流处理框架的性能基准测试系统及方法
CN110058977A (zh) * 2019-01-14 2019-07-26 阿里巴巴集团控股有限公司 基于流式处理的监控指标异常检测方法、装置及设备

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189844A1 (zh) * 2020-09-22 2021-09-30 平安科技(深圳)有限公司 多元kpi时间序列的检测方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN110058977B (zh) 2020-08-14
CN110058977A (zh) 2019-07-26

Similar Documents

Publication Publication Date Title
WO2020147480A1 (zh) 基于流式处理的监控指标异常检测方法、装置及设备
US10644932B2 (en) Variable duration windows on continuous data streams
WO2020233212A1 (zh) 一种日志记录的处理方法、服务器及存储介质
CN110908883B (zh) 用户画像数据监控方法、系统、设备及存储介质
EP3425524A1 (en) Cloud platform-based client application data calculation method and device
US20160275150A1 (en) Lightweight table comparison
CN107992398A (zh) 一种业务系统的监控方法和监控系统
CN112416724B (zh) 告警处理方法、系统、计算机设备和存储介质
CN111984499A (zh) 一种大数据集群的故障检测方法和装置
CN105404581B (zh) 一种数据库的评测方法和装置
CN110535713B (zh) 监控管理系统以及监控管理方法
WO2017032043A1 (zh) 快速确定网络合理告警阈值的系统和方法
US20200341868A1 (en) System and Method for Reactive Log Spooling
CN106789251B (zh) 网银运行状态监控系统及方法
JP6633642B2 (ja) 分散データベースにおけるデータブロックを処理する方法およびデバイス
CN105786973A (zh) 基于大数据技术的数据并发处理方法及系统
CN112306700A (zh) 一种异常rpc请求的诊断方法和装置
CN111240936A (zh) 一种数据完整性校验的方法及设备
WO2016095716A1 (zh) 一种故障信息处理方法与相关装置
CN113254313A (zh) 一种监控指标异常检测方法、装置、电子设备及存储介质
CN105205168A (zh) 一种基于Redis数据库的曝光系统及其操作方法
US8589444B2 (en) Presenting information from heterogeneous and distributed data sources with real time updates
WO2014184263A1 (en) Integration platform monitoring
TWI677785B (zh) 核心帳務主機監控方法
US11886453B2 (en) Quantization of data streams of instrumented software and handling of delayed or late data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19909649

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19909649

Country of ref document: EP

Kind code of ref document: A1