WO2017008451A1 - Abnormal load detection method for cloud computing online services - Google Patents

Abnormal load detection method for cloud computing online services

Info

Publication number
WO2017008451A1
Authority
WO
WIPO (PCT)
Prior art keywords
load
time series
abnormal
online service
probability
Prior art date
Application number
PCT/CN2015/098770
Other languages
English (en)
French (fr)
Inventor
周悦芝
刘金钊
张迪
张尧学
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Publication of WO2017008451A1
Priority to US15/786,426 (US10581961B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1023 Server selection for load balancing based on a hash applied to IP addresses or costs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2066 Optimisation of the communication load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3433 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment for load management
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/022 Capturing of monitoring data by sampling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis

Definitions

  • The present invention belongs to the field of cloud computing application technologies and in particular relates to a method for identifying abnormal load and abnormal operating conditions of a service by using the historical load data of an online service.
  • With the development of cloud computing technology, more and more users choose to deploy or migrate their services to cloud-based platforms.
  • With cloud computing, resources such as computing, storage, and network capacity allocated to a specific service can be increased or decreased on demand, maximizing resource utilization and reducing the operating cost of the service.
  • Online services account for a large proportion of all services deployed to cloud platforms. Because online services often expose their service interface directly to users, their load is more susceptible to user traffic.
  • Monitoring the load of a service is the basis on which cloud computing provides elastic resource scaling. With continuous load monitoring, the resource allocation can be adjusted when resource demand changes, maximizing the efficiency of resource use while ensuring service quality.
  • When the service load changes, being able to determine automatically, effectively, and quickly whether the load is abnormal is of great help to operation and maintenance personnel. If an abnormality can be discovered quickly from the service's load monitoring data after it occurs, the affected service can receive manual intervention sooner and the faulty program can be removed or repaired in time. This reduces the time the service runs in an abnormal state and protects service quality and user experience as far as possible.
  • Existing abnormal load detection methods fall into three types: threshold-based detection, detection based on statistical/regression models, and detection based on performance characteristics.
  • The threshold-based method sets a number of performance thresholds as the conditions for an abnormal load and matches the load data monitored in real time against these conditions. If a condition is met, that is, the performance data of the service exceeds a set threshold, the current load is considered abnormal.
  • This method relies on the experience of operation and maintenance personnel to set the threshold conditions. When the load characteristics of the service change (for example, after a service program upgrade), the conditions must be corrected accordingly to keep subsequent detection accurate.
  • The method also tolerates burst access poorly, and it produces a high false alarm rate when the service load legitimately rises or falls quickly.
  • An improved approach replaces the fixed threshold with an adaptive threshold.
  • The adaptive threshold method periodically (for example, every 24 hours) analyzes the characteristics of the load data and adjusts the threshold settings accordingly. Like the fixed-threshold method, it suffers from a high false positive rate. In addition, when the service load oscillates within a short period the adaptive adjustment algorithm cannot take effect, so tolerance of burst requests does not improve.
  • The method based on regression/statistical models performs regression analysis (for example, linear regression) on the load data and builds a regression model to obtain the trend of load change. The trend is then used to predict the load over a future period, and the prediction serves as the basis for abnormality detection.
  • For online services with periodic load characteristics, the load data of different periods can be modeled independently and the models cross-compared to identify abnormal periods.
  • The problem with this type of method is that the model must be calibrated and corrected continuously, and the accuracy of the prediction directly affects the accuracy of the abnormality detection.
  • The method based on performance characteristics uses statistical methods to model the performance characteristics of the service and selects certain performance parameters and metadata as the "fingerprint" of an abnormal load, so that abnormal loads can be identified from these fingerprints.
  • The service load data of a given period is first analyzed to compute its fingerprint. The fingerprint is then matched against the load characteristics and performance indicators of the service in other periods to determine whether it represents an abnormal load.
  • The recognition process typically uses statistical methods (such as the Gaussian distribution) or data mining methods (such as clustering algorithms).
  • The accuracy of the detection depends on the accuracy of the fingerprint features. Because the characteristics of a service change continuously over its life cycle, the fingerprints must also be adjusted and corrected continuously, which makes full automation difficult.
  • The present invention proposes an abnormal load detection method for cloud computing online services that uses the historical load data of an online service and detects abnormal load through wavelet analysis and statistical analysis. Compared with existing methods, it achieves higher accuracy and adapts better.
  • The invention provides a method for detecting abnormal load of a cloud computing online service, which uses the historical load data of the online service and detects abnormal load through wavelet analysis and statistical analysis, and includes the following steps:
  • Step 2) Preprocess the load item data collected from all hosts: process the data of each load item of the current online service into a time series with a fixed time interval; if the data at some time point is empty, interpolate a value for that time point. The time series of each load item is stored as tuples, where k is the ratio of the time series period to the sampling period. After the collected load data has been processed into time series with a fixed period, the load item data of all hosts are merged to obtain the time series of all load item data of the current service;
  • Step 3) Perform a discrete wavelet transform on each time series of the online service to obtain a coefficient matrix and a detail vector; perform statistical analysis on each coefficient vector of the coefficient matrix and calculate the probability that each coefficient vector contains an abnormal load;
  • Step 4) Calculate the weighted average of the probabilities of all coefficient vectors using the weighting formula to obtain the probability p that each time series contains an abnormal load;
  • Step 5) For each time series, compare the obtained probability with the confidence interval given by the confidence function to determine whether an abnormal load exists; if the probability falls within the confidence interval, the time series has no abnormality;
  • Step 6) Integrate the load information of all load items of the current online service with the obtained abnormal load probabilities and determine whether the online service has an abnormal load, specifically:
  • Step 6.1) According to the result of step 5), find the time series of all load items of the online service that have an abnormal load;
  • Step 6.2) For every load item with an abnormality, record the time point of the last point of its time series as the time at which the abnormality occurred, and store the record in the online service database;
  • Step 7) Use the K-means clustering algorithm to find the bearer servers on which the current online service is abnormal.
  • Step 2) specifically includes:
  • Step 2.2) Select all the data of one load item. If the value of that load item for some host is empty at some time point, fill in the host's value at that time point using the mean-value method.
  • Step 2.3) Merge the load item data of all hosts to obtain the time series S of the load item. At time point t_i, the merged value of S is given by equation (2), where m is the number of hosts.
  • Step 2.4) If the current online service still has unprocessed load data, go to step 2.1); otherwise go to step 3);
  • Step 3) specifically includes:
  • Step 3.1) Select an unprocessed time series of the online service and perform a one-dimensional discrete wavelet transform on it, using the Haar wavelet as the wavelet basis. The transformation level L is set from the time series period T_l and the abnormality detection period T_s through a condition relating T_l · 2^L to T_s.
  • Step 3.2) Take each coefficient vector from the coefficient matrix and apply a statistical analysis based on the normal distribution to it.
  • The mean is 0 and the variance is the estimated variance of the load values over the past T_s × m time, where m is an empirical value.
  • The probability p_i that the coefficient vector contains an abnormal load is calculated, where 1 ≤ i ≤ L.
  • The probability is the maximum of the cumulative distribution probabilities of the load values at each time point in T_s.
  • The probabilities are computed with the cumulative distribution function of the normal distribution.
  • Step 3.3) Take the detail vector d and judge the trend of the current load from it: if d[-1] < d[-2], the trend is decreasing and its value is -1; if d[-1] > d[-2], the trend is rising and its value is 1; otherwise the trend is steady and its value is 0;
  • Step 3.4) Combine the abnormal probabilities of the coefficient vectors with the current load trend to obtain the probability that the current load is abnormal;
  • Step 3.5) If any time series has not yet been processed, go to step 3.1); otherwise go to step 4);
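For reference, the cumulative distribution function of a normal distribution has the standard closed form below. The patent's own rendering of this formula is not reproduced in this text, so the symbols x, \mu, and \sigma are notation chosen here, and with the zero mean stated in step 3.2 one would set \mu = 0.

```latex
F(x;\mu,\sigma)
  = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x}
      \exp\!\left(-\frac{(t-\mu)^{2}}{2\sigma^{2}}\right)\,dt
  = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]
```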
  • Step 5) specifically includes:
  • Step 5.1) Take the time series of an unprocessed load item from the load items of the online service, together with the probability value of its abnormal load.
  • Step 5.2) Calculate the standard deviation t of the time series and substitute it as the parameter of the confidence function to obtain a confidence interval.
  • The confidence function G(t) is defined in terms of two empirical values: the confidence coefficient c and the relaxation coefficient d.
  • Step 5.3) Take the abnormal probability of the time series and compare it with the confidence interval (0, G(t)); if the probability falls within the confidence interval, the current load item has no abnormality; otherwise an abnormality exists;
  • Step 5.4) If unprocessed data remains, go to step 5.1); otherwise go to step 6);
  • Step 7) specifically includes the following steps:
  • Step 7.1) Take out the abnormal-state data of all load items of the online service;
  • Step 7.2) Determine whether the service has an abnormality; if not, end; otherwise go to step 7.3);
  • Step 7.3) Select the abnormal load item data of all bearer servers of the service and normalize it;
  • Step 7.4) Treat the load item data of each bearer server as a vector and cluster the vectors with the K-means algorithm, using the Euclidean distance;
  • Step 7.5) Compare the standard deviations of the two classes; the class with the larger standard deviation is the abnormal class, and the bearer servers in that class are abnormal.
  • Step 7.6) If an unprocessed online service remains, go back to step 1); otherwise end.
  • The present invention uses the periodicity and the variation characteristics of online service load data to identify abnormal loads.
  • The method rests on the following principle: the frequency of user access to an online service approximately obeys a normal distribution; a normal change in access frequency does not cause a large load change within a short time, whereas a load change caused by abnormal access or a program error varies widely within a short time. Whether the current service load is abnormal can therefore be determined by analyzing the rate of change of the load and its distribution characteristics.
  • The present invention uses wavelet analysis to analyze the load time series on multiple time scales.
  • The discrete wavelet transform decomposes the time series into oscillations on multiple time scales; each time scale is analyzed independently, and the results are then combined to reach a more accurate conclusion.
  • The present invention uses statistical analysis. Assuming that load changes obey a normal distribution on each time scale, the probability density function of the normal distribution gives the probability that the current load state is an abnormal load. Combining the results from all time scales yields the final abnormality probability.
  • The present invention proposes a variant of the Sigmoid function to calculate the abnormal load confidence interval under different service load characteristics. The confidence interval and the abnormality probability together determine whether the current load is abnormal.
  • By using wavelet analysis, the invention improves on the accuracy of purely statistical methods and adapts well: it can be applied to different online services, and it continues to work normally when the service program is upgraded and when the service load oscillates normally (user traffic changes periodically).
  • FIG. 2 is a detailed flowchart of preprocessing the online service load data (step 2) in the embodiment.
  • FIG. 3 is a flowchart of calculating the probability that the load of the online service is abnormal (step 3) in the embodiment.
  • FIG. 4 is a flowchart of determining whether each load of the online service is abnormal (step 5) in the embodiment.
  • FIG. 5 is a flowchart of finding the servers with an abnormality (step 7) in the embodiment.
  • The method for detecting abnormal load of a cloud computing online service uses the historical load data of the online service and detects abnormal load through wavelet analysis and statistical analysis. It is described in detail below with reference to the accompanying drawings and embodiments.
  • The flow of the method proposed by the present invention is shown in FIG. 1 and includes the following steps:
  • For CPU usage, a sample is taken every 5 minutes by default, and each data point represents the average CPU usage over the past 5 minutes.
  • The sequence of data points is stored as tuples.
  • The collected data of each load item is recorded in an online service database (for example, a MySQL database).
  • The format of the load item data record is shown in Table 1.
  • Step 2) Preprocess the load item data collected from all hosts: process the data of each load item of the current service into a time series with a fixed time interval; if the data at some time point is empty, interpolate a value for that time point. The time series of each load item is stored as tuples, where k is the ratio of the time series period to the sampling period. The collected load data is processed into a time series with a fixed period (the default time series period in this embodiment is 15 minutes). The load item data of all hosts are then merged to obtain the time series of all load item data of the current service.
  • The specific implementation process is shown in FIG. 2 and includes:
  • Step 2.3) Merge the load item data of all hosts to obtain the time series S of the load item. In this embodiment, the merged value of S at time point t_i is given by equation (2), where m is the number of hosts.
  • Step 2.4) If the current online service still has unprocessed load data, go to step 2.1); otherwise go to step 3);
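The preprocessing in step 2 can be sketched as follows, using the embodiment's defaults of a 5-minute sampling period and a 15-minute time series period, so k = 3. Several details are assumptions made only for illustration, because the text does not spell them out: the "mean-value method" is taken to mean filling an empty sample with the mean of that host's non-empty samples, each time-series point is taken to be the mean of k consecutive samples, and equation (2) is taken to be the mean over the m hosts. The function names `to_time_series` and `merge_hosts` are hypothetical.

```python
from statistics import mean

def to_time_series(samples, k):
    """Turn raw samples into a fixed-period series of k-sample means.

    `samples` holds floats, with None marking an empty sample. Empty samples
    are filled with the mean of the non-empty ones (assumed reading of the
    "mean-value method"), then each run of k samples becomes one point.
    """
    present = [s for s in samples if s is not None]
    if not present:
        raise ValueError("no data to interpolate from")
    fill = mean(present)
    filled = [fill if s is None else s for s in samples]
    usable = len(filled) - len(filled) % k  # drop an incomplete trailing window
    return [mean(filled[i:i + k]) for i in range(0, usable, k)]

def merge_hosts(series_per_host):
    """Merge per-host series point by point (assumed: mean over the m hosts)."""
    return [mean(points) for points in zip(*series_per_host)]

SAMPLING_MIN, SERIES_MIN = 5, 15
k = SERIES_MIN // SAMPLING_MIN  # ratio of time series period to sampling period

host_a = to_time_series([10.0, 20.0, None, 40.0, 50.0, 60.0], k)
host_b = to_time_series([30.0, 30.0, 30.0, 10.0, 20.0, 30.0], k)
merged = merge_hosts([host_a, host_b])
```

Here the empty sample of `host_a` is filled with 36.0, the mean of its five non-empty samples, before the 15-minute points are formed.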
  • Step 3) Perform a discrete wavelet transform on each time series to obtain a coefficient matrix and a detail vector; perform statistical analysis on each coefficient vector of the coefficient matrix and calculate the probability that each coefficient vector contains an abnormal load, as follows:
  • Step 3.1) Select an unprocessed time series of the online service and perform a one-dimensional discrete wavelet transform on it, using the Haar wavelet as the wavelet basis. The transformation level L is set from the time series period T_l and the abnormality detection period T_s through a condition relating T_l · 2^L to T_s; the transform yields the coefficient matrix cA and the detail vector cD.
  • In this embodiment, the time series period is 15 minutes and abnormality detection runs once every 12 hours.
  • With transform level L, the original time series is decomposed into L coefficient vectors (forming the coefficient matrix) and one detail vector: a discrete wavelet transform of level L on a time series yields the L coefficient vectors cA[1], cA[2], ..., cA[L] and the detail vector cD.
  • Step 3.2) Take each coefficient vector from the coefficient matrix and apply a statistical analysis based on the normal distribution to it.
  • The mean is 0 and the variance is the estimated variance of the load values over the past T_s × m time, where m is an empirical value. The probability p_i of an abnormal load is calculated for each coefficient vector, where 1 ≤ i ≤ L; it is the maximum of the cumulative distribution probabilities of the load values at each time point in T_s.
  • The probabilities are computed with the cumulative distribution function of the normal distribution.
  • Step 3.3) Take the detail vector d and judge the trend of the current load from it: if d[-1] < d[-2], the trend is decreasing and its value is -1; if d[-1] > d[-2], the trend is rising and its value is 1; otherwise the trend is steady and its value is 0;
  • Step 3.4) Combine the abnormal probabilities of the coefficient vectors with the current load trend to obtain the probability that the current load is abnormal;
  • Step 3.5) If any time series has not yet been processed, go to step 3.1); otherwise go to step 4);
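A minimal sketch of step 3 in plain Python follows. It rests on assumptions that the text does not confirm: the patent's L "coefficient vectors" are mapped to the per-level difference coefficients of a standard Haar decomposition (which have zero mean, matching step 3.2), the patent's single "detail vector" is mapped to the remaining coarse vector, and the level count and `sigma` are supplied by the caller rather than derived from T_l and T_s. All function names are hypothetical.

```python
import math

def haar_step(signal):
    """One Haar level: scaled pairwise sums and pairwise differences."""
    s = math.sqrt(2.0)
    sums = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    diffs = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return sums, diffs

def haar_dwt(signal, levels):
    """Return (coarse vector, [difference vector for level 1..levels])."""
    per_level = []
    coarse = list(signal)
    for _ in range(levels):
        if len(coarse) < 2:
            raise ValueError("series too short for the requested level")
        coarse, diffs = haar_step(coarse)
        per_level.append(diffs)
    return coarse, per_level

def normal_cdf(x, sigma):
    """Cumulative probability of a zero-mean normal distribution at x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def level_probability(coeffs, sigma):
    """p_i for one level: the largest cumulative probability among its values."""
    return max(normal_cdf(c, sigma) for c in coeffs)

def trend(d):
    """-1 if falling, 1 if rising, 0 if steady, from the last two entries."""
    if d[-1] < d[-2]:
        return -1
    if d[-1] > d[-2]:
        return 1
    return 0

series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coarse, per_level = haar_dwt(series, levels=2)
probabilities = [level_probability(v, sigma=3.0) for v in per_level]
direction = trend(coarse)
```

With this input the coarse vector is approximately [16, 12], so the trend value is -1 (falling). In practice a wavelet library would likely replace the hand-written transform.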
  • Step 4) Calculate the weighted average of the probabilities of all coefficient vectors using the weighting formula to obtain the probability p that each time series contains an abnormal load.
  • The obtained abnormal probability of the time series is stored in the online service database.
  • The record format of the abnormality data is shown in Table 2.
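The weighted average of step 4 can be illustrated as below. The weighting formula itself is not reproduced in this text, so the weights here are hypothetical caller-supplied values, and the sketch shows only the generic weighted-mean arithmetic, not how the patent chooses the weights or folds in the load trend.

```python
def weighted_probability(probs, weights):
    """Weighted mean of per-level probabilities p_1..p_L."""
    if not probs or len(probs) != len(weights):
        raise ValueError("need one weight per probability")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total

# Hypothetical example: two levels with equal weights.
p = weighted_probability([0.5, 1.0], [1.0, 1.0])
```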
  • Step 5) For each time series, compare the obtained probability with the confidence interval given by the confidence function to determine whether an abnormal load exists; if the probability falls within the confidence interval, the time series has no abnormality. Specifically:
  • Step 5.1) Take the time series of an unprocessed load item from the load items of the online service, together with the probability value of its abnormal load;
  • Step 5.2) Calculate the standard deviation t of the time series and substitute it as the parameter of the confidence function to obtain a confidence interval;
  • The confidence function G(t) is defined in terms of the confidence coefficient c and the relaxation coefficient d.
  • Step 5.3) Take the abnormal probability of the time series and compare it with the confidence interval (0, G(t)); if the probability falls within the confidence interval, the current load item has no abnormality; otherwise an abnormality exists;
  • Step 5.4) If unprocessed data remains, go to step 5.1); otherwise go to step 6);
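The decision in step 5 can be sketched as follows. The patent's confidence function is described only as a Sigmoid variant with coefficients c and d, and its exact form is not reproduced in this text, so `confidence_bound` below is a hypothetical placeholder (a plain logistic curve) and can be swapped out through the `bound` parameter. Using the population standard deviation for t is likewise an assumption.

```python
import math
from statistics import pstdev

def confidence_bound(t, c, d):
    """Hypothetical stand-in for G(t): a logistic curve, NOT the patent's formula."""
    return 1.0 / (1.0 + math.exp(-c * (t - d)))

def is_abnormal(series, probability, c, d, bound=confidence_bound):
    """True when the probability falls outside the open interval (0, G(t))."""
    t = pstdev(series)  # assumed: population standard deviation of the series
    upper = bound(t, c, d)
    return not (0.0 < probability < upper)

# Hypothetical values: a flat series gives t = 0, so G(t) = 0.5 for c = 1, d = 0.
flat = [1.0, 1.0, 1.0, 1.0]
normal_case = is_abnormal(flat, 0.4, c=1.0, d=0.0)
abnormal_case = is_abnormal(flat, 0.6, c=1.0, d=0.0)
```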
  • Step 6) Integrate the load information of all load items of the current online service with the obtained abnormal load probabilities and determine whether the online service has an abnormal load, specifically:
  • Step 6.1) According to the result of step 5), find the time series of all load items of the online service that have an abnormal load;
  • Step 6.2) For every load item with an abnormality, record the time point of the last point of its time series as the time at which the abnormality occurred, and store the record in the online service database; the data item format of this embodiment is shown in Table 3.
  • Step 7) Use the K-means clustering algorithm to find the bearer servers on which the current online service is abnormal.
  • This step specifically includes the following steps:
  • Step 7.1) Take out the abnormal-state data of all load items of the online service;
  • Step 7.2) Determine whether the service has an abnormality. If not, end; otherwise go to step 7.3);
  • Step 7.3) Select the abnormal load item data of all bearer servers of the service and normalize it;
  • Step 7.4) Treat the load item data of each bearer server as a vector and cluster the vectors with the K-means algorithm, using the Euclidean distance;
  • Step 7.5) Compare the standard deviations of the two classes; the class with the larger standard deviation is the abnormal class, and all bearer servers in that class are abnormal.
  • Step 7.6) If an unprocessed online service remains, go back to step 1); otherwise end.
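Step 7 can be sketched in plain Python as a two-class K-means over normalized per-server vectors. The following choices are assumptions for illustration only: min-max normalization per load item, deterministic seeding (the first server and the server farthest from it), and measuring a class's "standard deviation" as the root-mean-square Euclidean distance of its members from the class centroid, since the patent's formula is not reproduced in this text. All names are hypothetical.

```python
import math

def normalize(rows):
    """Min-max scale each column (load item) to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(row, lo, span)] for row in rows]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def two_means(points, max_iter=100):
    """Lloyd's algorithm with k = 2 and deterministic seeds."""
    centers = [points[0], max(points, key=lambda p: dist(p, points[0]))]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in (0, 1):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

def spread(points, center):
    """Root-mean-square distance from the centroid (assumed deviation measure)."""
    return math.sqrt(sum(dist(p, center) ** 2 for p in points) / len(points))

def abnormal_servers(names, rows):
    """Names of the servers in the class with the larger spread."""
    points = normalize(rows)
    labels, centers = two_means(points)
    spreads = []
    for j in (0, 1):
        members = [p for p, l in zip(points, labels) if l == j]
        spreads.append(spread(members, centers[j]) if members else 0.0)
    bad = 0 if spreads[0] > spreads[1] else 1
    return [n for n, l in zip(names, labels) if l == bad]

# Hypothetical data: [CPU usage, memory usage] per bearer server.
names = ["s1", "s2", "s3", "s4", "s5"]
rows = [[20.0, 30.0], [22.0, 31.0], [21.0, 29.0], [90.0, 80.0], [70.0, 95.0]]
flagged = abnormal_servers(names, rows)
```

In this example the three servers with similar load form a tight class, while `s4` and `s5` form the class with the larger spread and are flagged.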
  • The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to.
  • Accordingly, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature.
  • "A plurality" means at least two, such as two or three, unless specifically defined otherwise.
  • Unless explicitly stated and defined otherwise, the terms "installed", "connected", "coupled", "fixed", and the like shall be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements.
  • The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
  • Unless explicitly stated and defined otherwise, the first feature being "on" or "under" the second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium.
  • The first feature being "above", "over", or "on top of" the second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second feature.
  • The first feature being "below", "beneath", or "underneath" the second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second feature.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

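The present invention relates to an abnormal load detection method for cloud computing online services and belongs to the field of cloud computing application technologies. The method uses fixed-period sampling to collect the data of each load item of all hosts carrying an online service; processes the data of each load item of the current online service into a time series with a fixed time interval, obtaining the time series of all load item data of the current service; performs a discrete wavelet transform on each time series of the online service, performs statistical analysis on each coefficient vector of the resulting coefficient matrix, and calculates the probability that an abnormal load exists; compares the obtained probability with the confidence interval given by a confidence function to determine whether an abnormal load exists; and uses the K-means clustering algorithm to find the bearer servers on which the current online service is abnormal. Compared with existing methods, this method achieves higher accuracy and adapts better.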

Description

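Abnormal load detection method for cloud computing online services
Technical Field
The present invention belongs to the field of cloud computing application technologies and in particular relates to a method for identifying abnormal load and abnormal operating conditions of a service by using the historical load data of an online service.
Background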
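With the development of cloud computing, more and more users choose to deploy or migrate their services onto cloud-based platforms. With cloud computing, the compute, storage, network and other resources allocated to a particular service can be increased or decreased on demand, which maximizes resource utilization and lowers the operating cost of the service. Online services account for a large share of all services deployed on cloud platforms. Because an online service usually exposes a service interface directly to users, its load is more easily affected by user traffic. Load monitoring is the foundation on which cloud computing provides elastic resource scaling: by monitoring the load continuously, the resource allocation can be adjusted as demand changes, maximizing resource efficiency while preserving the quality of service.
Over its lifetime, a cloud computing online service will encounter abnormal operating conditions caused by bursts of user requests, program errors and similar causes. For these services, monitoring the load and identifying abnormal operating states from it is the basic way to keep the service running normally. For abnormal load changes caused by bursts of user traffic, the cloud platform can adjust resources automatically through elastic scaling, so resource allocation and quality of service are maintained without manual intervention. For abnormal load caused by program faults, however, automatic scaling cannot guarantee the quality of service, so such load must be distinguished from normal load changes so that it receives timely manual intervention.
When the load of a service changes, being able to determine automatically, effectively and quickly whether the load is abnormal is a great help to operations staff. If an anomaly can be discovered quickly from the service's load monitoring data after it occurs, manual intervention can begin sooner and the faulty program can be removed or repaired promptly, which shortens the time the service runs in an abnormal state and protects quality of service and user experience as far as possible.
Existing abnormal load detection methods fall into three main categories: threshold-based detection, detection based on statistical/regression models, and detection based on performance features. Threshold-based methods set a number of performance thresholds as the conditions for abnormal load and match the load data monitored in real time against these conditions. If a condition is met, that is, if a performance metric of the service exceeds a configured threshold, the current load is considered abnormal. This approach relies on the experience of operations staff to set the thresholds, and when the load characteristics of the service change (for example after a program upgrade), the conditions must be revised accordingly to keep later detection accurate. The approach also tolerates bursts of traffic poorly: a normal rapid rise or fall in load leads to a high false positive rate.
One improvement is to replace fixed thresholds with adaptive ones. An adaptive threshold method periodically (for example every 24 hours) analyzes the characteristics of the load data and adjusts the thresholds accordingly. It suffers from the same high false positive rate as the fixed threshold method. Moreover, when the load oscillates over a short time the adaptive adjustment has no effect, so tolerance of bursts of requests does not improve.
Methods based on regression/statistical models apply regression analysis (for example linear regression) to the load data and build a regression model that captures the load trend; the trend is then used to predict the load over a coming period as the basis for anomaly detection. For online services with periodic load, the load data of different periods can be modeled independently and the models cross-compared to identify abnormal periods. The problem with these methods is that the model needs constant calibration and correction, and prediction accuracy directly determines detection accuracy.
Methods based on performance features use statistical methods to model the performance characteristics of the service and select certain performance feature parameters and metadata as the "fingerprint" of abnormal load, which is then used to recognize abnormal load. The load data of a service over a given period is first analyzed to compute a fingerprint, which is then matched against the load characteristics and performance indicators of the same service in other periods to decide whether the fingerprint represents abnormal load. Recognition usually relies on statistical methods (for example the Gaussian distribution) or data mining methods (for example clustering algorithms). Detection accuracy depends on the accuracy of the fingerprint features. Because the characteristics of a service keep changing throughout its lifetime, the fingerprints also need continual adjustment and correction, which makes full automation difficult.
In summary, existing methods that use load data to judge abnormal conditions of online services do not achieve high accuracy and suffer from a high false positive rate. They also depend heavily on the experience of operations staff and cannot provide fully automated monitoring.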
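Summary of the Invention
To overcome the shortcomings of existing abnormal load detection methods, the present invention proposes an abnormal load detection method for cloud computing online services that uses the historical load data of an online service and detects abnormal load through wavelet analysis and statistical analysis. Compared with existing methods, it achieves higher accuracy and better adaptability.
The proposed method comprises the following steps: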
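Step 1) Using fixed-period sampling, collect the information data of each load item of all hosts carrying a given online service, mainly CPU utilization, memory utilization, disk I/O rate and network I/O rate, denoted as
Figure PCTCN2015098770-appb-000001
Figure PCTCN2015098770-appb-000002
where
Figure PCTCN2015098770-appb-000003
denotes the load statistic at time point i, i = 1, 2, …, n; n is a positive integer; x denotes the utilization of any one of the host's CPU, memory, disk I/O or network I/O;
Step 2) Preprocess the collected load item information data of all hosts: process the information data of each load item of the current online service into a time series with a fixed time interval; if the data at some time point is empty, impute the data for that time point. The time series of each load item is stored as a tuple, denoted as
Figure PCTCN2015098770-appb-000004
Figure PCTCN2015098770-appb-000005
where
Figure PCTCN2015098770-appb-000006
where k is the ratio of the time series period to the sampling period. The collected load data is processed into time series with a fixed period, and the load item data of all hosts is then merged to obtain the time series of all load item data of the current service;
Step 3) Perform a discrete wavelet transform on each time series of the online service to obtain a coefficient matrix and a detail vector; perform statistical analysis on each coefficient vector of the coefficient matrix and calculate the probability that each coefficient vector contains an abnormal load;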
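Step 4) Compute the weighted average of the probabilities of all coefficient vectors to obtain the probability p that each time series contains an abnormal load, using the following formula:
Figure PCTCN2015098770-appb-000007
where w_i = e^(log i + 1)  (5)
The resulting time series anomaly probability is stored in the online service database;
Step 5) For each time series, compare the calculated probability with the confidence interval given by the confidence function to determine whether an abnormal load exists; if the probability falls within the confidence interval, the time series contains no anomaly;
Step 6) Combine the load information of all load items of the current online service with the calculated abnormal load probabilities to determine whether the online service has an abnormal load, specifically:
Step 6.1) From the result of step 5), find the time series of all load items of the online service that have an abnormal load;
Step 6.2) For every abnormal load item, record the time point corresponding to the last point of its time series as the time at which the anomaly occurred, and store the record in the online service database;
Step 7) Use the K-means clustering algorithm to find the hosting servers on which the current online service is abnormal.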
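The weighted combination in step 4) can be sketched as follows. This is a minimal sketch, not the patent's reference implementation: formula (4) is available here only as a figure, so the code assumes it is the usual weighted mean p = Σ w_i p_i / Σ w_i, and it assumes formula (5) reads w_i = e^(log i + 1) with a natural logarithm. The function name is illustrative.

```python
import math

def series_anomaly_probability(level_probs):
    """Weighted mean of the per-level probabilities p_1..p_L (assumed form of formula (4))."""
    if not level_probs:
        raise ValueError("at least one coefficient-vector probability is required")
    # Assumed reading of formula (5): w_i = e^(ln i + 1) = e * i, for i = 1..L.
    weights = [math.exp(math.log(i) + 1) for i in range(1, len(level_probs) + 1)]
    return sum(w * p for w, p in zip(weights, level_probs)) / sum(weights)
```

Under this reading the constant factor e cancels in the mean, so each level simply counts in proportion to its index i.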
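Step 2) specifically comprises:
Step 2.1) Select the information data of all load items of all hosts carrying a given online service, and process the data into time series with a fixed time interval, stored as tuples;
Step 2.2) Select all load information data of one load item. If the value of a load item of some host at some time point is empty, fill in that value using the mean-value method. For example, for the sequence {C_1, C_2, …, C_i, …, C_k, …, C_n} in which C_k is missing, first set C_k = 0 and then fill that time point with the value of C_k computed by formula (1):
Figure PCTCN2015098770-appb-000008
Step 2.3) Merge the information data of the load item across all hosts to obtain the time series of that load item, denoted as
Figure PCTCN2015098770-appb-000009
Figure PCTCN2015098770-appb-000010
The merge rule is that at time point t_i the value of the time series S is given by formula (2):
Figure PCTCN2015098770-appb-000011
where
Figure PCTCN2015098770-appb-000012
is the load value of host j at time t_i, and m is the number of hosts;
Step 2.4) If the current online service still has unprocessed load data, go to step 2.1); otherwise go to step 3);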
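The preprocessing in steps 2.1) to 2.3) can be sketched as follows. Formulas (1) and (2) are available here only as figures, so this sketch makes three labeled assumptions: a missing value is filled with the mean of the observed values, every k consecutive samples are averaged into one time series point, and host series are merged by averaging over the m hosts. All function names are illustrative.

```python
from statistics import mean

def impute_missing(samples):
    """Fill None entries with the mean of the observed values (assumed form of formula (1))."""
    observed = [v for v in samples if v is not None]
    if not observed:
        raise ValueError("cannot impute a series with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in samples]

def to_time_series(samples, k):
    """Average every k consecutive samples into one time series point (assumption)."""
    usable = len(samples) - len(samples) % k  # drop an incomplete trailing group
    return [mean(samples[i:i + k]) for i in range(0, usable, k)]

def merge_hosts(per_host_series):
    """Merge host series point by point by averaging over the hosts (assumed form of formula (2))."""
    return [mean(values) for values in zip(*per_host_series)]
```

With a 5-minute sampling period and a 15-minute time series period, k would be 3.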
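Step 3) specifically comprises:
Step 3.1) Select a time series of the online service that has not yet been processed and apply a one-dimensional discrete wavelet transform to it, using the Haar wavelet as the wavelet basis. According to the time series period T_l and the anomaly detection period T_s, set the transform level L such that T_l × 2^L ≥ T_s. This yields the coefficient matrix cA and the detail vector cD:
cA, cD = DWT([s_1, s_2, …, s_n], L, 'haar')  (3)
Step 3.2) Select the coefficient vectors from the coefficient matrix and apply a statistical analysis based on the normal distribution to each of them, with mean 0 and variance equal to the estimated variance of the load values over the past T_s × m period, where m is an empirical value. Calculate the probability p_i that the coefficient vector contains an abnormal load, where 1 ≤ i ≤ L. This probability is the maximum of the cumulative distribution probabilities of the load values at each time point in T_s. The cumulative probability under the normal distribution is computed as:
p_i = 2Φ(|x_i|) - 1, where
Figure PCTCN2015098770-appb-000013
Step 3.3) Select the detail vector and use it to determine the current trend of the load: if d[-1] < d[-2], the trend is falling and takes the value -1; if d[-1] > d[-2], the trend is rising and takes the value 1; otherwise the trend is flat and takes the value 0;
Step 3.4) Combine the anomaly probabilities of the coefficient vectors and, together with the current load trend, obtain the probability that the current load is abnormal;
Step 3.5) If there are still unprocessed time series, go to step 3.1); otherwise go to step 4);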
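Steps 3.1) and 3.2) can be illustrated with a small pure-Python sketch of a multi-level Haar decomposition and of the probability p_i = 2Φ(|x_i|) - 1. It rests on assumptions: Φ is taken to be the standard normal CDF applied to values divided by the estimated standard deviation (the definition of Φ is only given as a figure), each level returns both its pairwise sums and its pairwise differences, and which of these outputs corresponds to the patent's coefficient vectors is left open. Function names are illustrative.

```python
import math

def haar_step(values):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    if len(values) % 2:
        values = values[:-1]  # drop the unpaired trailing point
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in zip(values[0::2], values[1::2])]
    detail = [(a - b) / s for a, b in zip(values[0::2], values[1::2])]
    return approx, detail

def haar_levels(series, levels):
    """Apply haar_step repeatedly; the element count halves at every level."""
    out, current = [], list(series)
    for _ in range(levels):
        current, detail = haar_step(current)
        out.append((current, detail))
    return out

def level_probability(values, sigma):
    """Largest 2*Phi(|x|/sigma) - 1 over the vector, Phi being the standard normal CDF."""
    # For z >= 0, 2*Phi(z) - 1 equals erf(z / sqrt(2)).
    return max(math.erf(abs(x) / (sigma * math.sqrt(2.0))) for x in values)
```

A value about 1.96 standard deviations from zero gives a probability of roughly 0.95 under this reading.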
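Step 5) specifically comprises:
Step 5.1) From all load items of the online service, take the time series of a load item that has not yet been processed, together with its abnormal load probability;
Step 5.2) Compute the standard deviation t of the time series and substitute it into the confidence function as a parameter to obtain the confidence interval. The confidence function is defined as:
Figure PCTCN2015098770-appb-000014
where c is the confidence coefficient and d is the relaxation coefficient; both c and d are empirical values;
Step 5.3) Take the anomaly probability of the time series and compare it with the confidence interval (0, G(t)); if the probability falls within the confidence interval, the current load item is not abnormal; otherwise it is abnormal;
Step 5.4) If there is still unprocessed data, go to step 5.1); otherwise go to step 6);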
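The comparison in steps 5.2) and 5.3) can be sketched as follows. The confidence function G(t) is available here only as a figure, so the sketch does not reproduce it: the caller passes it in as `confidence_fn`, a hypothetical parameter name. The use of the population standard deviation and the treatment of a probability of exactly 0 as normal are also assumptions.

```python
from statistics import pstdev

def is_abnormal(series, probability, confidence_fn):
    """Return True when the anomaly probability falls outside the interval (0, G(t))."""
    t = pstdev(series)        # standard deviation t of the time series (population form assumed)
    upper = confidence_fn(t)  # G(t), supplied by the caller
    # Assumption: a probability of exactly 0 counts as normal despite the open lower bound.
    return not (0.0 <= probability < upper)
```

Any callable taking t and returning the upper bound can be used, which keeps the decision logic separate from the choice of c and d.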
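Step 7) specifically comprises:
Step 7.1) Retrieve the anomaly status data of all load items of the online service;
Step 7.2) Determine whether the service is abnormal; if not, stop; otherwise go to step 7.3);
Step 7.3) Select the data of the abnormal load items from all hosting servers of the service and normalize it;
Step 7.4) Treat the load item data of each hosting server as a vector and cluster the vectors with the K-means algorithm, using Euclidean distance;
Step 7.5) Compare the standard deviations of the two clusters and take the one with the larger standard deviation as the abnormal cluster; all hosting servers in it are considered abnormal. The standard deviation is computed as follows:
For each cluster, compute the standard deviation of the time series of every load item in it, then take the mean of these standard deviations as the standard deviation of the cluster;
Step 7.6) If there are still unprocessed online services, return to step 1); otherwise stop.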
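Steps 7.4) and 7.5) can be sketched with a small pure-Python K-means for two clusters. This is a sketch under assumptions: each server contributes one already-normalized load series as its vector, the two initial centers are the first vector and the vector farthest from it (the patent does not specify seeding), and the population standard deviation is used. Function names are illustrative.

```python
import math
from statistics import mean, pstdev

def two_means(vectors, iterations=20):
    """Plain K-means with K = 2 and Euclidean distance; returns a label (0 or 1) per vector."""
    first = vectors[0]
    second = max(vectors, key=lambda v: math.dist(v, first))  # deterministic seeding (assumption)
    centers = [list(first), list(second)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min((0, 1), key=lambda c: math.dist(v, centers[c])) for v in vectors]
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

def abnormal_servers(series_by_server):
    """Return the servers in the cluster whose mean per-series standard deviation is larger."""
    names = list(series_by_server)
    labels = two_means([series_by_server[n] for n in names])

    def cluster_std(c):
        stds = [pstdev(series_by_server[n]) for n, lab in zip(names, labels) if lab == c]
        return mean(stds) if stds else 0.0

    bad = max((0, 1), key=cluster_std)
    return [n for n, lab in zip(names, labels) if lab == bad]
```

For example, two servers with nearly flat series and one with a strongly oscillating series end up in separate clusters, and only the oscillating server is reported.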
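Technical features and beneficial effects of the present invention:
The present invention uses the periodicity and variation characteristics of online service load data to identify abnormal load. The method rests mainly on the following principle: the frequency of user access to an online service approximately follows a normal distribution; normal changes in access frequency do not cause large load changes within a short time, whereas load changes caused by abnormal traffic or program errors have a large amplitude within a short time. Whether the current load is abnormal can therefore be judged by analyzing the rate of change of the load and its distribution.
To observe the variation of the load better, the present invention uses wavelet analysis to analyze the load time series at multiple time scales. The discrete wavelet transform (DWT) decomposes the time series into oscillations at several time scales; each time scale is analyzed independently, and the individual results are then combined to reach a more precise conclusion.
The time series at each time scale is analyzed statistically. Assuming that the load variation at every time scale follows a normal distribution, the probability density function of the normal distribution gives the probability that the current load state is abnormal. Combining the results from all time scales gives the final anomaly probability.
To adapt to the service, the present invention provides a variant of the Sigmoid function that computes the abnormal load confidence interval under different load characteristics. With this confidence interval and the anomaly probability, it can be decided whether the current load is abnormal.
Because it uses wavelet analysis, the present invention improves on the accuracy of statistical analysis methods while remaining adaptive; it applies to different online services and continues to work when the service program is upgraded or the load oscillates normally (user traffic varies periodically).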
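Brief Description of the Drawings
Fig. 1 is a flowchart of the overall steps of the proposed method.
Fig. 2 is a detailed flowchart of preprocessing the online service load data (step 2) in this embodiment.
Fig. 3 is a flowchart of calculating the probability that each load item of the online service is abnormal (step 3) in this embodiment.
Fig. 4 is a flowchart of determining whether each load item of the online service is abnormal (step 5) in this embodiment.
Fig. 5 is a flowchart of finding the abnormal hosting servers (step 7) in this embodiment.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote identical or similar elements or elements with identical or similar functions throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and shall not be construed as limiting it.
The abnormal load detection method for cloud computing online services proposed by the present invention, which uses the historical load data of an online service and detects abnormal load through wavelet analysis and statistical analysis, is described in detail below with reference to the drawings and an embodiment.
The flow of the proposed method is shown in Fig. 1 and comprises the following steps: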
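Step 1) Using fixed-period sampling, collect the information data of each load item of all hosts carrying a given online service, mainly CPU utilization, memory utilization, disk I/O rate and network I/O rate, denoted as
Figure PCTCN2015098770-appb-000015
Figure PCTCN2015098770-appb-000016
where
Figure PCTCN2015098770-appb-000017
denotes the load statistic at time point i, i = 1, 2, …, n; n is a positive integer; x denotes the utilization of any one of the host's CPU, memory, disk I/O or network I/O;
This embodiment is illustrated with CPU utilization. CPU utilization is sampled every 5 minutes by default, and each data point represents the average CPU utilization over the preceding 5 minutes. The sequence of data points is stored as a tuple, denoted as
Figure PCTCN2015098770-appb-000018
Figure PCTCN2015098770-appb-000019
The collected load information data of each load item is recorded in the online service database (for example a MySQL database). The record format is shown in Table 1.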
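Table 1. Format of the load item data records, with examples
Field | Description | Type | Length | Example
MID | Server identifier | String | 64 | Z1V3_109452
Service | Identifier of the owning service | String | 64 | WebServer1
Time | Timestamp | String | 64 | 2014-10-31 13:10:10
CPU | CPU utilization | Float | 4 | 0.9231
Mem | Memory utilization | Float | 4 | 0.9231
Disk | Disk I/O utilization | Float | 4 | 0.9231
Net | Network I/O utilization | Float | 4 | 0.9231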
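The record layout of Table 1 can be mirrored in code as a small typed record. This is only an illustrative sketch: the class name `LoadRecord`, the helper `parse_record` and the choice to parse the timestamp into a `datetime` are assumptions, while the field order and the timestamp layout follow the table's example row.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LoadRecord:
    mid: str        # MID: server identifier
    service: str    # Service: identifier of the owning service
    time: datetime  # Time: sample timestamp
    cpu: float      # CPU utilization
    mem: float      # memory utilization
    disk: float     # disk I/O utilization
    net: float      # network I/O utilization

def parse_record(row):
    """Build a LoadRecord from a 7-field row in the column order of Table 1."""
    mid, service, time, cpu, mem, disk, net = row
    return LoadRecord(mid, service, datetime.strptime(time, "%Y-%m-%d %H:%M:%S"),
                      float(cpu), float(mem), float(disk), float(net))
```

Parsing the timestamp early makes it straightforward to bucket samples into fixed time series periods in step 2).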
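Step 2) Preprocess the collected load item information data of all hosts: process the information data of each load item of the current service into a time series with a fixed time interval; if the data at some time point is empty, impute the data for that time point. The time series of each load item is stored as a tuple, denoted as
Figure PCTCN2015098770-appb-000020
Figure PCTCN2015098770-appb-000021
where
Figure PCTCN2015098770-appb-000022
where k is the ratio of the time series period to the sampling period (for example, in this embodiment, for CPU utilization, denoted as
Figure PCTCN2015098770-appb-000023
The collected load data is processed into time series with a fixed period (the default time series period in this embodiment is 15 minutes),
Figure PCTCN2015098770-appb-000024
and the load item data of all hosts is then merged to obtain the time series of all load item data of the current service. The detailed flow is shown in Fig. 2 and comprises: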
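Step 2.1) Select the information data of all load items of all hosts carrying a given online service, and process the data into time series with a fixed time interval, stored as tuples;
Step 2.2) Select all load information data of one load item. If the value of a load item of some host at some time point is empty, fill in that value using the mean-value method. For example, for the sequence {C_1, C_2, …, C_i, …, C_k, …, C_n} in which C_k is missing, first set C_k = 0 and then fill that time point with the value of C_k computed by formula (1):
Figure PCTCN2015098770-appb-000025
Step 2.3) Merge the information data of the load item across all hosts to obtain the time series of that load item, denoted as
Figure PCTCN2015098770-appb-000026
Figure PCTCN2015098770-appb-000027
In this embodiment, the merge rule is that at time point t_i the value of the time series S is given by formula (2):
Figure PCTCN2015098770-appb-000028
where
Figure PCTCN2015098770-appb-000029
is the load value of host j at time t_i, and m is the number of hosts;
Step 2.4) If the current online service still has unprocessed load data, go to step 2.1); otherwise go to step 3);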
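Step 3) Perform a discrete wavelet transform on each time series to obtain a coefficient matrix and a detail vector; perform statistical analysis on each coefficient vector of the coefficient matrix and calculate the probability that each coefficient vector contains an abnormal load. The specific steps are as follows:
Step 3.1) Select a time series of the online service that has not yet been processed and apply a one-dimensional discrete wavelet transform to it, using the Haar wavelet as the wavelet basis. According to the time series period T_l and the anomaly detection period T_s, set the transform level L such that T_l × 2^L ≥ T_s. This yields the coefficient matrix cA and the detail vector cD:
cA, cD = DWT([s_1, s_2, …, s_n], L, 'haar')  (3)
In this embodiment, the time series period is 15 minutes and anomaly detection runs once every 12 hours. For a one-dimensional discrete wavelet transform of level L (L = 6 in this example), the original time series is decomposed into L coefficient vectors (forming the coefficient matrix) and one detail vector. For example, applying a level-L discrete wavelet transform to a time series yields L coefficient vectors cA[1], cA[2], …, cA[L] and one detail vector cD. The period P_i of the level-i coefficient vector cA[i] and the period P_(i+1) of the level-(i+1) coefficient vector cA[i+1] satisfy P_i × 2 = P_(i+1). Likewise, the number of elements N_i of cA[i] and the number of elements N_(i+1) of cA[i+1] satisfy N_i / 2 = N_(i+1). The observation resolution of the detail vector at level i+1 is therefore only half that at level i. This property is what lets discrete wavelet analysis observe a time series at different time scales.
Step 3.2) Select the coefficient vectors from the coefficient matrix and apply a statistical analysis based on the normal distribution to each of them, with mean 0 and variance equal to the estimated variance of the load values over the past T_s × m period, where m is an empirical value. Calculate the probability p_i that the time series contains an abnormal load, where 1 ≤ i ≤ L; this probability is the maximum of the cumulative distribution probabilities of the load values at each time point in T_s. The cumulative probability under the normal distribution is computed as:
p_i = 2Φ(|x_i|) - 1, where
Figure PCTCN2015098770-appb-000030
In this embodiment, m = 3.
Step 3.3) Select the detail vector and use it to determine the current trend of the load: if d[-1] < d[-2], the trend is falling and takes the value -1; if d[-1] > d[-2], the trend is rising and takes the value 1; otherwise the trend is flat and takes the value 0;
Step 3.4) Combine the anomaly probabilities of the coefficient vectors and, together with the current load trend, obtain the probability that the current load is abnormal;
Step 3.5) If there are still unprocessed time series, go to step 3.1); otherwise go to step 4);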
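The choice of transform level in step 3.1) and the trend rule in step 3.3) reduce to a few lines of arithmetic. The sketch below assumes that the smallest L satisfying T_l × 2^L ≥ T_s is chosen, which matches this embodiment: 15 × 2^5 = 480 minutes is shorter than the 720-minute detection period, while 15 × 2^6 = 960 minutes is not, so L = 6. Function names are illustrative.

```python
def transform_level(series_period_min, detection_period_min):
    """Smallest L with series_period * 2**L >= detection_period (assumed selection rule)."""
    level = 0
    while series_period_min * 2 ** level < detection_period_min:
        level += 1
    return level

def trend(detail):
    """Trend from the last two detail values: -1 falling, 1 rising, 0 flat."""
    if detail[-1] < detail[-2]:
        return -1
    if detail[-1] > detail[-2]:
        return 1
    return 0
```

Both periods must be expressed in the same unit, minutes in this sketch.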
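Step 4) Compute the weighted average of the probabilities of all coefficient vectors to obtain the probability p that each time series contains an abnormal load, using the following formula:
Figure PCTCN2015098770-appb-000031
where w_i = e^(log i + 1)  (5)
The resulting time series anomaly probability is stored in the online service database. The record format of the anomaly data is shown in Table 2.
Table 2. Storage format of the time series anomaly probability
Field | Description | Type | Length | Example
Service | Identifier of the owning service | String | 64 | WebServer1
TimeBegin | Start timestamp | String | 64 | 2014-10-31 13:10:10
TimeEnd | End timestamp | String | 64 | 2014-10-31 13:10:10
Deviation | Variance of the time series | Float | 4 | 8.9231
Prob | Anomaly probability | Float | 4 | 0.9231
Trend | Trend | Integer | 4 | 1
CI | Confidence interval | Float | 4 | 0.9231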
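Step 5) For each time series, compare the calculated probability with the confidence interval given by the confidence function to determine whether an abnormal load exists; if the probability falls within the confidence interval, the time series contains no anomaly. Specifically:
Step 5.1) From all load items of the online service, take the time series of a load item that has not yet been processed, together with its abnormal load probability;
Step 5.2) Compute the standard deviation t of the time series and substitute it into the confidence function as a parameter to obtain the confidence interval. The confidence function is defined as:
Figure PCTCN2015098770-appb-000032
where c is the confidence coefficient and d is the relaxation coefficient; both c and d are empirical values. In this embodiment they are set to c = 0.6 and d = 200;
Step 5.3) Take the anomaly probability of the time series and compare it with the confidence interval (0, G(t)); if the probability falls within the confidence interval, the current load item is not abnormal; otherwise it is abnormal;
Step 5.4) If there is still unprocessed data, go to step 5.1); otherwise go to step 6);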
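Step 6) Combine the load information of all load items of the current online service with the calculated abnormal load probabilities to determine whether the online service has an abnormal load, specifically:
Step 6.1) From the result of step 5), find the time series of all load items of the online service that have an abnormal load;
Step 6.2) For every abnormal load item, record the time point corresponding to the last point of its time series as the time at which the anomaly occurred, and store the record in the online service database. The record format in this embodiment is shown in Table 3:
Table 3. Storage format of abnormal load records
Field | Description | Type | Length | Example
Service | Identifier of the owning service | String | 64 | WebServer1
TimeBegin | Start timestamp | String | 64 | 2014-10-31 13:10:10
TimeEnd | End timestamp | String | 64 | 2014-10-31 13:10:10
Prob | Anomaly probability | Float | 4 | 0.9231
CI | Confidence interval | Float | 4 | 0.9231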
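Step 7) Use the K-means clustering algorithm to find the hosting servers on which the current online service is abnormal.
This step specifically comprises:
Step 7.1) Retrieve the anomaly status data of all load items of the online service;
Step 7.2) Determine whether the service is abnormal; if not, stop; otherwise go to step 7.3);
Step 7.3) Select the data of the abnormal load items from all hosting servers of the service and normalize it;
Step 7.4) Treat the load item data of each hosting server as a vector and cluster the vectors with the K-means algorithm, using Euclidean distance;
Step 7.5) Compare the standard deviations of the two clusters and take the one with the larger standard deviation as the abnormal cluster; all hosting servers in it are considered abnormal. The standard deviation is computed as follows:
For each cluster, compute the standard deviation of the time series of every load item in it, then take the mean of these standard deviations as the standard deviation of the cluster.
Step 7.6) If there are still unprocessed online services, return to step 1); otherwise stop.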
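In the description of the present invention, it should be understood that orientation or position terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial" and "circumferential" are based on the orientations or positions shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they shall therefore not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.
In the present invention, unless explicitly specified and defined otherwise, the terms "mounted", "coupled", "connected", "fixed" and the like shall be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediate medium; and it may be internal communication between two elements or an interaction between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
In the present invention, unless explicitly specified and defined otherwise, the first feature being "on" or "under" the second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Moreover, the first feature being "over", "above" or "on top of" the second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second feature. The first feature being "below", "under" or "beneath" the second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second feature.
In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such expressions in this specification do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in one or more embodiments or examples. In addition, where they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace and vary the embodiments within the scope of the present invention.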

Claims (5)

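  1. An abnormal load detection method for cloud computing online services, which uses the historical load data of an online service and detects abnormal load of the online service through wavelet analysis and statistical analysis, the method comprising the following steps:
    Step 1) Using fixed-period sampling, collect the information data of each load item of all hosts carrying a given online service, mainly CPU utilization, memory utilization, disk I/O rate and network I/O rate, denoted as
    Figure PCTCN2015098770-appb-100001
    Figure PCTCN2015098770-appb-100002
    where
    Figure PCTCN2015098770-appb-100003
    denotes the load statistic at time point i, i = 1, 2, …, n; n is a positive integer; x denotes the utilization of any one of the host's CPU, memory, disk I/O or network I/O;
    Step 2) Preprocess the collected load item information data of all hosts: process the information data of each load item of the current online service into a time series with a fixed time interval; if the data at some time point is empty, impute the data for that time point; the time series of each load item is stored as a tuple, denoted as
    Figure PCTCN2015098770-appb-100004
    Figure PCTCN2015098770-appb-100005
    where
    Figure PCTCN2015098770-appb-100006
    where k is the ratio of the time series period to the sampling period. The collected load data is processed into time series with a fixed period, and the load item data of all hosts is then merged to obtain the time series of all load item data of the current service;
    Step 3) Perform a discrete wavelet transform on each time series of the online service to obtain a coefficient matrix and a detail vector; perform statistical analysis on each coefficient vector of the coefficient matrix and calculate the probability that each coefficient vector contains an abnormal load;
    Step 4) Compute the weighted average of the probabilities of all coefficient vectors to obtain the probability p that each time series contains an abnormal load, using the following formula:
    Figure PCTCN2015098770-appb-100007
    where w_i = e^(log i + 1)    (5)
    The resulting time series anomaly probability is stored in the online service database;
    Step 5) For each time series, compare the calculated probability with the confidence interval given by the confidence function to determine whether an abnormal load exists; if the probability falls within the confidence interval, the time series contains no anomaly;
    Step 6) Combine the load information of all load items of the current online service with the calculated abnormal load probabilities to determine whether the online service has an abnormal load, specifically:
    Step 6.1) From the result of step 5), find the time series of all load items of the online service that have an abnormal load;
    Step 6.2) For every abnormal load item, record the time point corresponding to the last point of its time series as the time at which the anomaly occurred, and store the record in the online service database;
    Step 7) Use the K-means clustering algorithm to find the hosting servers on which the current online service is abnormal.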
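  2. The method according to claim 1, wherein step 2) specifically comprises:
    Step 2.1) Select the information data of all load items of all hosts carrying a given online service, and process the data into time series with a fixed time interval, stored as tuples;
    Step 2.2) Select all load information data of one load item. If the value of a load item of some host at some time point is empty, fill in that value using the mean-value method; for example, for the sequence {C_1, C_2, …, C_i, …, C_k, …, C_n} in which C_k is missing, first set C_k = 0 and then fill that time point with the value of C_k computed by formula (1):
    Figure PCTCN2015098770-appb-100008
    Step 2.3) Merge the information data of the load item across all hosts to obtain the time series of that load item, denoted as
    Figure PCTCN2015098770-appb-100009
    Figure PCTCN2015098770-appb-100010
    The merge rule is that at time point t_i the value of the time series S is given by formula (2):
    Figure PCTCN2015098770-appb-100011
    where
    Figure PCTCN2015098770-appb-100012
    is the load value of host j at time t_i, and m is the number of hosts;
    Step 2.4) If the current online service still has unprocessed load data, go to step 2.1); otherwise go to step 3);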
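  3. The method according to claim 1, wherein step 3) specifically comprises:
    Step 3.1) Select a time series of the online service that has not yet been processed and apply a one-dimensional discrete wavelet transform to it, using the Haar wavelet as the wavelet basis. According to the time series period T_l and the anomaly detection period T_s, set the transform level L such that T_l × 2^L ≥ T_s. This yields the coefficient matrix cA and the detail vector cD:
    cA, cD = DWT([s_1, s_2, …, s_n], L, 'haar')       (3)
    Step 3.2) Select the coefficient vectors from the coefficient matrix and apply a statistical analysis based on the normal distribution to each of them, with mean 0 and variance equal to the estimated variance of the load values over the past T_s × m period, where m is an empirical value; calculate the probability that the time series contains an abnormal load, this probability being the maximum of the cumulative distribution probabilities of the load values at each time point in T_s; the cumulative probability under the normal distribution is computed as:
    p_i = 2Φ(|x_i|) - 1, where
    Figure PCTCN2015098770-appb-100013
    Step 3.3) Select the detail vector and use it to determine the current trend of the load: if d[-1] < d[-2], the trend is falling and takes the value -1; if d[-1] > d[-2], the trend is rising and takes the value 1; otherwise the trend is flat and takes the value 0;
    Step 3.4) Combine the anomaly probabilities of the coefficient vectors and, together with the current load trend, obtain the probability that the current load is abnormal;
    Step 3.5) If there are still unprocessed time series, go to step 3.1); otherwise go to step 4);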
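  4. The method according to claim 1, wherein step 5) specifically comprises:
    Step 5.1) From all load items of the online service, take the time series of a load item that has not yet been processed, together with its abnormal load probability;
    Step 5.2) Compute the standard deviation t of the time series and substitute it into the confidence function as a parameter to obtain the confidence interval. The confidence function is defined as:
    Figure PCTCN2015098770-appb-100014
    where c is the confidence coefficient and d is the relaxation coefficient; both c and d are empirical values;
    Step 5.3) Take the anomaly probability of the time series and compare it with the confidence interval (0, G(t)); if the probability falls within the confidence interval, the current load item is not abnormal; otherwise it is abnormal;
    Step 5.4) If there is still unprocessed data, go to step 5.1); otherwise go to step 6);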
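  5. The method according to claim 1, wherein step 7) specifically comprises:
    Step 7.1) Retrieve the anomaly status data of all load items of the online service;
    Step 7.2) Determine whether the service is abnormal; if not, stop; otherwise go to step 7.3);
    Step 7.3) Select the data of the abnormal load items from all hosting servers of the service and normalize it;
    Step 7.4) Treat the load item data of each hosting server as a vector and cluster the vectors with the K-means algorithm, using Euclidean distance;
    Step 7.5) Compare the standard deviations of the two clusters and take the one with the larger standard deviation as the abnormal cluster; all hosting servers in it are considered abnormal. The standard deviation is computed as follows:
    For each cluster, compute the standard deviation of the time series of every load item in it, then take the mean of these standard deviations as the standard deviation of the cluster;
    Step 7.6) If there are still unprocessed online services, return to step 1); otherwise stop.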
PCT/CN2015/098770 2015-07-16 2015-12-24 一种面向云计算在线业务的异常负载检测方法 WO2017008451A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/786,426 US10581961B2 (en) 2015-07-16 2017-10-17 Method for detecting abnormal load in cloud computing oriented online service

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510419286.1 2015-07-16
CN201510419286.1A CN105071983B (zh) 2015-07-16 2015-07-16 一种面向云计算在线业务的异常负载检测方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/786,426 Continuation US10581961B2 (en) 2015-07-16 2017-10-17 Method for detecting abnormal load in cloud computing oriented online service

Publications (1)

Publication Number Publication Date
WO2017008451A1 true WO2017008451A1 (zh) 2017-01-19

Family

ID=54501270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/098770 WO2017008451A1 (zh) 2015-07-16 2015-12-24 一种面向云计算在线业务的异常负载检测方法

Country Status (3)

Country Link
US (1) US10581961B2 (zh)
CN (1) CN105071983B (zh)
WO (1) WO2017008451A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115618A (zh) * 2020-09-22 2020-12-22 南方电网海南数字电网研究院有限公司 一种基于矩阵图及置信度的电力设备故障诊断方法及系统
CN113298128A (zh) * 2021-05-14 2021-08-24 西安理工大学 基于时间序列聚类的云服务器异常检测方法
CN116820057A (zh) * 2023-08-30 2023-09-29 四川远方云天食品科技有限公司 一种基于物联网的火锅底料生产监测方法和系统

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105071983B (zh) * 2015-07-16 2017-02-01 清华大学 一种面向云计算在线业务的异常负载检测方法
CN105262647A (zh) * 2015-11-27 2016-01-20 广州神马移动信息科技有限公司 一种异常指标检测方法及装置
CN105610647A (zh) * 2015-12-30 2016-05-25 华为技术有限公司 一种探测业务异常的方法和服务器
CN106209426B (zh) * 2016-06-28 2019-05-21 北京北信源软件股份有限公司 一种基于d-s证据理论的服务器负载状态评估分析方法和系统
CN106886478A (zh) * 2017-02-22 2017-06-23 郑州云海信息技术有限公司 一种数据过滤方法及监控服务器
CN108733812B (zh) * 2018-05-21 2021-09-14 华东师范大学 基于全局信息的时间序列数据中异常数据点的识别方法
CN108923952B (zh) * 2018-05-31 2021-11-30 北京百度网讯科技有限公司 基于服务监控指标的故障诊断方法、设备及存储介质
CN109101390B (zh) * 2018-06-29 2021-08-24 平安科技(深圳)有限公司 基于高斯分布的定时任务异常监控方法、电子装置及介质
CN109039809A (zh) * 2018-07-17 2018-12-18 中国电子科技集团公司电子科学研究院 一种网闸集群异常的检测方法、装置及内网服务器
CN110764975B (zh) * 2018-07-27 2021-10-22 华为技术有限公司 设备性能的预警方法、装置及监控设备
CN109104493A (zh) * 2018-09-04 2018-12-28 南京群顶科技有限公司 一种云资源池业务负载感知与自处理装置及方法
CN109347653B (zh) * 2018-09-07 2021-06-04 创新先进技术有限公司 一种指标异常发现方法和装置
CN110209467B (zh) * 2019-05-23 2021-02-05 华中科技大学 一种基于机器学习的弹性资源扩展方法和系统
CN111654327A (zh) * 2019-11-08 2020-09-11 国网辽宁省电力有限公司电力科学研究院 一种面向光缆纤芯远程管理控制的业务特征提取方法
CN111190756B (zh) * 2019-11-18 2023-04-28 中山大学 一种基于调用链数据的根因定位算法
US11132342B2 (en) * 2019-12-02 2021-09-28 Alibaba Group Holding Limited Periodicity detection and period length estimation in time series
CN111522845B (zh) * 2020-04-08 2022-07-01 北京航空航天大学 一种基于时间序列预测的流计算系统水印发放方法
CN112052109B (zh) * 2020-08-28 2022-03-04 西安电子科技大学 基于日志分析的云服务平台事件异常检测方法
CN112654060B (zh) * 2020-12-18 2023-03-24 中国计量大学 一种装置异常检测方法及系统
CN116809652B (zh) * 2023-03-28 2024-04-26 材谷金带(佛山)金属复合材料有限公司 一种热轧机控制系统的异常分析方法及系统
CN116643908B (zh) * 2023-07-19 2024-03-15 深圳市同泰怡信息技术有限公司 一种基于飞腾多路服务器的自动故障报警方法
CN116826977B (zh) * 2023-08-28 2023-11-21 青岛恒源高新电气有限公司 一种光储直柔微电网智能管理系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101252482A (zh) * 2008-04-07 2008-08-27 华为技术有限公司 网络流量异常检测方法和装置
JP2015075896A (ja) * 2013-10-08 2015-04-20 日本電信電話株式会社 フロー集約装置及び方法
CN105071983A (zh) * 2015-07-16 2015-11-18 清华大学 一种面向云计算在线业务的异常负载检测方法

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9843596B1 (en) * 2007-11-02 2017-12-12 ThetaRay Ltd. Anomaly detection in dynamically evolving data and systems
CN101494567B (zh) * 2008-08-29 2011-04-13 北京理工大学 一种基于负载预测的分布式拒绝服务攻击检测方法
US8868474B2 (en) * 2012-08-01 2014-10-21 Empire Technology Development Llc Anomaly detection for cloud monitoring
CN104521182B (zh) * 2012-08-08 2016-08-24 英派尔科技开发有限公司 用于云监视的实时压缩数据收集方法及数据中心
US20160164721A1 (en) * 2013-03-14 2016-06-09 Google Inc. Anomaly detection in time series data using post-processing
US9614742B1 (en) * 2013-03-14 2017-04-04 Google Inc. Anomaly detection in time series data
US9355007B1 (en) * 2013-07-15 2016-05-31 Amazon Technologies, Inc. Identifying abnormal hosts using cluster processing
US20150081880A1 (en) * 2013-09-17 2015-03-19 Stackdriver, Inc. System and method of monitoring and measuring performance relative to expected performance characteristics for applications and software architecture hosted by an iaas provider
US10395032B2 (en) * 2014-10-03 2019-08-27 Nokomis, Inc. Detection of malicious software, firmware, IP cores and circuitry via unintended emissions
US10142353B2 (en) * 2015-06-05 2018-11-27 Cisco Technology, Inc. System for monitoring and managing datacenters

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101252482A (zh) * 2008-04-07 2008-08-27 华为技术有限公司 网络流量异常检测方法和装置
JP2015075896A (ja) * 2013-10-08 2015-04-20 日本電信電話株式会社 フロー集約装置及び方法
CN105071983A (zh) * 2015-07-16 2015-11-18 清华大学 一种面向云计算在线业务的异常负载检测方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, NING ET AL.: "Abnormal Detection and Location Method of Network Traffic Based on Wavelet Analysis", JOURNAL OF CHINESE COMPUTER SYSTEMS, vol. 31, no. 1, 31 January 2010 (2010-01-31) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115618A (zh) * 2020-09-22 2020-12-22 南方电网海南数字电网研究院有限公司 一种基于矩阵图及置信度的电力设备故障诊断方法及系统
CN113298128A (zh) * 2021-05-14 2021-08-24 西安理工大学 基于时间序列聚类的云服务器异常检测方法
CN113298128B (zh) * 2021-05-14 2024-04-02 西安理工大学 基于时间序列聚类的云服务器异常检测方法
CN116820057A (zh) * 2023-08-30 2023-09-29 四川远方云天食品科技有限公司 一种基于物联网的火锅底料生产监测方法和系统
CN116820057B (zh) * 2023-08-30 2023-12-01 四川远方云天食品科技有限公司 一种基于物联网的火锅底料生产监测方法和系统

Also Published As

Publication number Publication date
CN105071983B (zh) 2017-02-01
CN105071983A (zh) 2015-11-18
US10581961B2 (en) 2020-03-03
US20180041573A1 (en) 2018-02-08

Similar Documents

Publication Publication Date Title
WO2017008451A1 (zh) 一种面向云计算在线业务的异常负载检测方法
WO2021179572A1 (zh) 运维系统异常指标检测模型优化方法、装置及存储介质
US11106190B2 (en) System and method for predicting remaining lifetime of a component of equipment
US11669083B2 (en) System and method for proactive repair of sub optimal operation of a machine
US20160217378A1 (en) Identifying anomalous behavior of a monitored entity
JP2018530803A (ja) コンピュータ環境における根本原因分析および修復のために機械学習原理を活用する装置および方法
WO2018071005A1 (en) Deep long short term memory network for estimation of remaining useful life of the components
TWI662424B (zh) 領先輔助參數的選擇方法以及結合關鍵參數及領先輔助參數進行設備維護預診斷的方法
CN114978956B (zh) 智慧城市网络设备性能异常突变点检测方法及装置
CN111262750B (zh) 一种用于评估基线模型的方法及系统
CN109934301B (zh) 一种电力负荷聚类分析方法、装置和设备
CN113918433A (zh) 一种自适应的智慧网络设备性能指标异常检测装置及方法
US11847619B2 (en) System-state monitoring method and device and storage medium
CN112966017A (zh) 一种时间序列中不定长的异常子序列检测方法
CN116030955B (zh) 基于物联网的医疗设备状态监测方法及相关装置
CN117010442A (zh) 设备剩余寿命预测模型训练方法、剩余寿命预测方法及系统
US20230034061A1 (en) Method for managing proper operation of base station and system applying the method
CN115495274B (zh) 基于时序数据的异常处理方法、网络设备和可读存储介质
WO2020220438A1 (zh) 一种针对虚拟机不同类型的业务并发量预测方法
CN113283157A (zh) 智能冲压压力机部件生命周期预测系统、方法、终端、介质
CN117555501B (zh) 基于边缘计算的云打印机运维数据处理方法以及相关装置
CN117407264B (zh) 内存老化剩余时间的预测方法、装置、计算机设备及介质
CN111400284B (zh) 一种基于性能数据建立动态异常探测模型的方法
TWI573027B (zh) Customer Experience and Equipment Profitability Analysis System and Its Method
CN116628573A (zh) 作业分类方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15898175

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15898175

Country of ref document: EP

Kind code of ref document: A1