WO2010044770A1 - Trend determination and identification - Google Patents

Trend determination and identification

Info

Publication number
WO2010044770A1
Authority
WO
WIPO (PCT)
Prior art keywords
subset
performance data
trend
processor
measure
Prior art date
Application number
PCT/US2008/079739
Other languages
French (fr)
Inventor
Mustafa Uysal
Virginia Smith
Arif A. Merchant
Original Assignee
Hewlett-Packard Development Company, L.P.
Priority date
Filing date
Publication date
Application filed by Hewlett-Packard Development Company, L.P. filed Critical Hewlett-Packard Development Company, L.P.
Priority to US13/123,595 priority Critical patent/US20110231582A1/en
Priority to EP08877461A priority patent/EP2347340A4/en
Priority to PCT/US2008/079739 priority patent/WO2010044770A1/en
Priority to CN200880131557.1A priority patent/CN102187327B/en
Publication of WO2010044770A1 publication Critical patent/WO2010044770A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/008: Reliability or availability analysis
    • G06F 11/30: Monitoring
    • G06F 11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409: Recording or statistical evaluation of computer activity for performance assessment
    • G06F 11/3452: Performance evaluation by statistical analysis
    • G06F 2201/00: Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F 2201/81: Threshold
    • G06F 2201/87: Monitoring of transactions
    • G06F 2201/88: Monitoring involving counting
    • G06F 2201/885: Monitoring specific for caches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A system comprises a processor and an alert module coupled to the processor. The processor monitors performance data; determines a subset of the performance data, the subset correlated with a measure of underperformance; determines a trend of the subset, the trend correlated with the measure; and identifies an occurrence of the trend. The alert module outputs an alert based on the identification.

Description

TREND DETERMINATION AND IDENTIFICATION
BACKGROUND
[0001] In information processing environments, a vast variety of performance data is available. Performance data is collected by system performance monitors at the hardware level, operating system level, database level, middleware level, and application level. Collecting and using the large amount of performance data available is an onerous task requiring significant resources. In some cases, collecting and using performance data negatively impacts performance, and hence the performance data itself. Efficient collection and use of performance data is desirable.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] For a detailed description of the embodiments of the invention, reference will now be made to the accompanying drawings in which:
[0003] Figure 1A shows a system for trend determination and identification in accordance with at least some embodiments;
[0004] Figure 1B shows a system for trend determination and identification in accordance with at least some embodiments;
[0005] Figure 1C shows a stack providing performance data for trend determination and identification;
[0006] Figure 2 shows a system having a computer readable medium for trend determination and identification in accordance with at least some embodiments; and
[0007] Figure 3 shows a method of trend determination and identification in accordance with at least some embodiments.
NOTATION AND NOMENCLATURE
[0008] Certain terms are used throughout the following claims and description to refer to particular components. As one having ordinary skill in the art will appreciate, different entities may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms "including" and "comprising" are used in an open-ended fashion, and thus should be interpreted to mean "including, but not limited to ...". Also, the term "couple" or "couples" is intended to mean an optical, wireless, indirect electrical, or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through an indirect electrical connection via other devices and connections, through a direct optical connection, etc. Additionally, the term "system" refers to a collection of two or more hardware components, and may be used to refer to an electronic device.
DETAILED DESCRIPTION
[0009] The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims, unless otherwise specified. In addition, one having ordinary skill in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
[0010] Trend determination and identification is disclosed. Self-tuning predictive performance models, based on machine learning, utilize performance data to monitor system performance levels, control the monitoring levels at various layers so that the variety and the detail of the performance data collected are decided dynamically, and determine potential service level objective violations. As such, the models capture performance data in different deployment scenarios, configurations, and workloads. The models tune and refine themselves to increase predictive performance. Furthermore, each piece of the multitude of performance data is available to be collected, but excessive and unnecessary monitoring is avoided, saving time and resources. Consequently, implementation of the models results in fewer violations as well as a time and resource advantage over competitors.
[0011] Referring to Figure 1A, a system 100 comprises a processor 102 and an alert module 104 coupled to the processor 102. Referring to Figure 1B, in at least one embodiment, the system 100 is a computer. As such, the processor 102 is a computer processor and the alert module 104 is a computer display. Many processors and alert modules are possible. For example, in at least one embodiment the processor 102 comprises a plurality of computer processors and the alert module 104 comprises a light-emitting diode coupled to an audio speaker.
[0012] The processor 102 preferably monitors performance data. Figure 1C shows a stack 199 providing performance data 189 for trend determination and identification. The stack 199 comprises various layers of hardware and software from which the performance data 189 is measured. The performance data 189 is preferably collected by system performance monitors at the hardware layer 197, operating system layer 195, middleware layer 193, and applications layer 191. Each of these layers provides multiple categories of performance data. Hardware layer 197 provides hardware performance data 187 such as hardware performance counters, etc. Operating system layer 195 provides operating system performance data 185 such as I/O/sec, memory allocation, page faults, page hits, resident memory size, CPU utilization, packets/sec, etc. Middleware layer 193 provides middleware performance data 183 such as queries/sec, tuples read, page hits in buffer cache, disk I/O, table scans, requests/sec, connections, etc. Applications layer 191 provides application performance data such as response time, outstanding requests, previous transactions, etc. Many categories of performance data are possible. In at least one embodiment, the performance data is collected from a network. As such, hardware layer 197 provides hardware performance data 187 for the hardware of the entire network. Similarly, the other layers provide performance data for the entire network. In at least one embodiment, the performance data comprises application metrics and operating system metrics. However, monitoring any type of performance data is possible.
[0013] The processor 102 preferably constructs a model of SLO compliance based on the monitored performance data. Let S = {SLO compliance, SLO violation} be the set of possible states for a given SLO. At any time t, the state of an SLO, St, may be in one of these two states. Let Mt denote a vector of values, [m0, m1, m2, ..., mn]t, collected by the processor 102 using the performance indicators being monitored. The processor 102 preferably constructs a model F(M, k, Δ) that maps the input vector [Mt-k, Mt-k+1, ..., Mt] to St+Δ, the state of the SLO at time t+Δ. In at least one embodiment, the thresholds k and Δ are parameters. In at least one other embodiment, the parameter k is infinite and the processor 102 uses all the available history of the performance indicator values to construct the model F(M, k, Δ). There are a variety of machine learning techniques that the processor 102 uses to construct the model F(M, k, Δ). For example, machine learning techniques used in processor 102 include, but are not limited to, naïve Bayes classifiers, support vector machines, decision trees, Bayesian networks, or neural networks. For the details of these techniques, refer to T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001. In at least one embodiment, the processor 102 constructs the model F(M, k, Δ) as a classifier C, approximating the function F(M, k, Δ), based on a given training set containing the past observations of the performance indicators and the observed state of the SLO metrics.
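As a concrete illustration of the classifier construction in paragraph [0013], the following Python sketch trains a classifier on windows of metric vectors to predict the SLO state Δ samples ahead. The window size, horizon, and choice of scikit-learn's decision tree are illustrative assumptions rather than requirements of the embodiments.

    # Illustrative sketch only: window size k, horizon delta, and the use of
    # scikit-learn's DecisionTreeClassifier are assumptions, not prescribed choices.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def make_training_set(metrics, slo_states, k, delta):
        """Turn a series of metric vectors M_t and observed SLO states S_t into
        ([M_{t-k}, ..., M_t], S_{t+delta}) training pairs."""
        X, y = [], []
        for t in range(k, len(metrics) - delta):
            X.append(np.concatenate(metrics[t - k : t + 1]))  # window of k+1 vectors
            y.append(slo_states[t + delta])                   # future SLO state
        return np.array(X), np.array(y)

    # metrics: list of n-element metric vectors; slo_states: 0 = compliance, 1 = violation
    # X, y = make_training_set(metrics, slo_states, k=5, delta=3)
    # classifier = DecisionTreeClassifier().fit(X, y)   # the classifier C
    # classifier.predict(X[-1:])                        # anticipated state at t + delta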
[0014] In at least one embodiment, the processor 102 combines values of the performance indicators with the directionality of these values over time. Let Dt = [{+,=,-}1, {+,=,-}2, {+,=,-}3, ..., {+,=,-}n]t be a directionality vector, indicating the directional difference between Mt and Mt-1. Each element βj in Dt indicates whether the corresponding metric j in Mt has increased ({+} value), decreased ({-} value), or stayed the same ({=} value). In at least one embodiment, the processor 102 constructs a model F(M, k, Δ) that maps the input vector [Mt, Dt-k, Dt-k+1, ..., Dt] to St+Δ, the state of the SLO at time t+Δ.
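A minimal sketch of computing the directionality vector Dt described in paragraph [0014]; encoding {+}, {=}, and {-} as +1, 0, and -1 is an assumption made for illustration.

    # Assumed encoding: +1 for {+}, 0 for {=}, -1 for {-}.
    def directionality(m_curr, m_prev):
        """Return D_t, the per-metric directional difference between M_t and M_{t-1}."""
        return [(v > p) - (v < p) for v, p in zip(m_curr, m_prev)]

    # directionality([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])  ->  [1, 0, -1]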
[0015] While monitoring each piece of performance data is possible, the cost of monitoring would be prohibitive as the amount of performance data increases. As such, the processor 102 determines a subset of the performance data correlated with a measure of underperformance. In at least one embodiment, the measure of underperformance is based on a service level objective ("SLO"). An SLO is preferably a portion of a service level agreement ("SLA") between a service provider and a customer. SLOs are agreed means of measuring the performance of the service provider and are helpful in managing expectations and avoiding disputes between the two parties. In at least one embodiment, the SLA is the entire agreement that specifies the SLOs, what service is to be provided and how the service is supported, as well as times, locations, costs, performance, and responsibilities of the parties involved. The SLOs are specific measurable characteristics of the SLA, e.g., availability, throughput, frequency, response time, and quality. For example, an SLO between a website hosting service and the owner of a website may be that 99% of transactions submitted be completed in under one second, and the measure of underperformance tracks the SLO exactly. Expressed in words, the subset of performance data correlated with the measure of underperformance may be, for example, a tripling of website traffic in less than ten minutes.
[0016] In at least one embodiment, processor 102 selects the subsets of the performance indicators using a feature selection technique. The processor 102 selects M*, a subset of M, such that the difference between their corresponding models F*(M*) and F(M) is minimal, with respect to the training set. The processor 102 preferably uses a greedy algorithm that eliminates a single metric m at each step, such that |F(M - m) - F(M)| is minimal (a sketch of this elimination appears after paragraph [0018] below).
[0017] In at least one embodiment, the subset corresponds to one SLO. However, in at least one other embodiment, the SLO is composed of one or more performance indicators that are combined to produce an SLO achievement value. As such, an SLO may depend on multiple components, each of which has a performance indicator measurement. The weights applied to the performance indicator measurements when used to calculate the SLO achievement value depend on the nature of the service and which components are given priority by the service provider and the customer. Preferably, in such an embodiment, each of the multiple components corresponds to its own subset of performance data. In this way, the measure of underperformance is a combination of sub-measures of underperformance. In at least one embodiment, the correlation value between the subset and the measure of underperformance must be above a programmable threshold. As such, the selection of elements of performance data to include in the subset is neither over-inclusive nor under-inclusive.
[0018] If the subset is appropriately correlated with the measure of underperformance, the subset may be monitored to anticipate the measure. If the measure corresponds with an SLO violation, then a breach of the SLA can be anticipated.
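The following is a hedged sketch of the greedy backward elimination of paragraph [0016]. The caller-supplied error function and the fixed target subset size are assumptions, since the embodiments leave both open.

    # Hedged sketch: model_error(subset) is assumed to return the training-set
    # error of a model fitted on `subset`; the stopping rule is a fixed size.
    def greedy_select(metric_names, model_error, target_size):
        subset = list(metric_names)
        while len(subset) > target_size:
            base = model_error(subset)  # error of F(M) on the current subset
            # drop the metric m for which |F(M - m) - F(M)| is minimal
            victim = min(subset, key=lambda m: abs(
                model_error([x for x in subset if x != m]) - base))
            subset.remove(victim)
        return subset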
[0019] The processor 102 determines a trend of the subset of performance data, the trend also correlated with the measure of underperformance. Preferably, the processor 102 determines a trend correlated with an SLO violation itself. Determining a trend of the subset of performance data comprises determining that one element of the subset is behaving in a certain fashion, another element is behaving in a certain fashion, etc., where each behavior could be independent of each other behavior and each behavior need not occur simultaneously. The behaviors comprise a linear, exponential, arithmetic, geometric, etc., increase, decrease, oscillation, random movement, etc. The behaviors also include directionality. For example, the two behaviors {n1 = 1, n2 = 2, n3 = 3} and {n1 = 3, n2 = 2, n3 = 1}, where nx is the xth value of the element, are different behaviors even though each behavior contains the same values. The former behavior is a tripling in website traffic while the latter behavior is a reduction of website traffic by a third. In at least one embodiment, the behaviors can also be expressed as thresholds, for example, {1 < n1 < 2, 2 < n2 < 3, 3 < n3 < 4}. Specifically, the first value for the element is between 1 and 2, the second value is between 2 and 3, etc. As an example, a trend can be determined by determining that one element is increasing and another element is decreasing simultaneously over a particular period of time. Note that the behaviors of the elements need not always occur simultaneously. A number of adjustable parameters can be used to increase the correlation between a trend and a measure of underperformance, which allows for a more accurate prediction of the measure of underperformance. Such parameters comprise any or all of: the number of elements of performance data used for the subset, the number of samples collected for each element, the rate of recording of each element, the rate of change of an element, the rate of change of the entire trend, and correlations between different elements of the performance data themselves, e.g., whether change in one element causes change in another element. Many adjustable parameters and combinations of parameters are possible. In at least one embodiment, the trend is a combination of sub-trends of the subset. For example, the processor determines different subsets of performance data that, when each subset is behaving in its own particular way, will result in an SLO violation, but when less than all subsets exhibit their behavior, will not result in an SLO violation.
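To make the threshold-band behaviors of paragraph [0019] concrete, a small sketch follows. Representing a behavior as a list of (low, high) bands and combining per-element behaviors with a conjunction are illustrative choices, not the only way the embodiments could express a trend.

    # Behaviors as threshold bands, e.g. {1 < n1 < 2, 2 < n2 < 3, 3 < n3 < 4};
    # the (low, high) tuple representation is an illustrative assumption.
    def matches_bands(values, bands):
        """True if each successive value n_x falls inside its (low, high) band."""
        return all(low < v < high for v, (low, high) in zip(values, bands))

    def opposing_trend(rising_series, falling_series):
        """Example trend: one element strictly increases while another decreases
        over the same window (behaviors need not be simultaneous in general)."""
        rising = all(a < b for a, b in zip(rising_series, rising_series[1:]))
        falling = all(a > b for a, b in zip(falling_series, falling_series[1:]))
        return rising and falling

    # matches_bands([1.5, 2.4, 3.7], [(1, 2), (2, 3), (3, 4)])  ->  True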
[0020] In at least one embodiment, the processor 102 ceases to monitor the performance data except for the subset after determining the trend. Because monitoring itself is an added overhead that uses system resources, it is advantageous to keep the amount of system resources dedicated to monitoring at a minimum. As such, ceasing to monitor performance data that has little or no correlation to the measure of underperformance is preferable. By monitoring the subset, the processor 102 is still able to identify an occurrence of the trend. After such identification, in at least one embodiment, the processor 102 monitors a second subset of the performance data. Preferably, the second subset comprises at least one element not in the subset. System administrators prefer to study various data sources to determine the root cause of SLO violations after the fact, and this dynamic control of the collection of diagnostics information (when, and what kinds of, more detailed monitoring and instrumentation to turn on as the second subset) assists system administrators in the event that an SLO violation occurs. However, it is an inefficient use of resources to collect the same level of diagnostic information during normal operation. If a violation does occur, the processor 102 preferably refines the subset of performance data automatically. Many methods of refinement are possible.
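A hypothetical control step for the monitoring policy of paragraph [0020]: only the correlated subset is collected, and a broader diagnostic (second) subset is switched on once the trend is identified. All names here are illustrative stand-ins, not parts of any disclosed interface.

    # Hypothetical step function; collect() and trend_identified() stand in for
    # the monitoring and trend-matching machinery described above.
    def monitor_step(collect, active, diagnostic_subset, trend_identified):
        """Sample only the active metrics; widen to the second subset on a trend."""
        sample = collect(active)
        if trend_identified(sample):
            # the second subset comprises at least one element not in the subset
            active = active | set(diagnostic_subset)
        return sample, active

    # active = set(correlated_subset)  # all other performance data ceases to be monitored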
[0021] Machine learning techniques determine and refine the trends that establish correlation between performance data and measures of underperformance. Because the machine learning techniques create succinct representations of correlations from a diverse set of data, the techniques are ideal for determining which performance metrics lead to underperformance and which performance metrics can be safely ignored. As such, the system 100 is self-refining. Specifically, instances of SLO violations provide positive examples for the training of the machine learning models while normal operating conditions, without SLO violations, provide the negative examples for training. As such, the subset of performance data correlated with the underperformance can be adjusted automatically, and if a highly correlated subset suddenly or gradually becomes uncorrelated for any reason, the subset can be adjusted to maintain a high correlation. In this way, a steady supply of positive and negative examples allows for self-refining (a sketch appears after paragraph [0023] below). Manual refining is also possible.
[0022] The alert module 104 preferably outputs an alert based on the identification of a trend. In at least one embodiment, the processor 102 sends a signal to the alert module 104 to output the alert. In at least one embodiment, the alert is a combination of alerts comprising a visual alert, an audio alert, an email alert, etc. Many alerting methods are possible. Preferably, the measure of underperformance is a future measure of underperformance and the alert is output prior to occurrence of the future measure of underperformance. In at least one embodiment, the future measure of underperformance is based on an SLO.
[0023] Referring to Figure 2, in various embodiments, a computer-readable medium 988 comprises volatile memory (e.g., random access memory, etc.), non-volatile storage (e.g., read only memory, Flash memory, hard disk drive, CD ROM, etc.), or combinations thereof. The computer-readable medium comprises software 984 (which includes firmware) executed by the processor 982. One or more of the actions described in this document are performed by the processor 982 during execution of the software. Preferably, the computer-readable medium 988 stores a software program 984 that, when executed by the processor 982, causes the processor 982 to monitor performance data and determine a subset of the performance data, the subset correlated with a measure of underperformance. Preferably, the processor 982 determines a trend of the subset, the trend correlated with the measure. In at least one embodiment, the processor 982 is further caused to cease to monitor the performance data except for the subset after determining the trend. The processor 982 preferably identifies an occurrence of the trend. In at least one embodiment, the processor 982 is further caused to monitor a second subset of the performance data after identifying the occurrence of the trend, the second subset comprising at least one element not in the subset. The processor 982 preferably outputs an alert based on the identification. In at least one embodiment, the alert is a signal to an alert module 104.
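Returning to the self-refining training of paragraph [0021], the sketch below folds each newly observed window into the training set, labeling SLO violations as positive examples and normal operation as negative. Refitting on every new observation and the decision-tree model are simplifying assumptions.

    # Sketch under assumptions: refit-on-every-observation, decision-tree model.
    from sklearn.tree import DecisionTreeClassifier

    def refine(windows, labels, new_window, violation_observed):
        """Fold one observed (window, label) pair into the training set and refit;
        violations are positive examples, normal operation negative."""
        windows.append(new_window)
        labels.append(1 if violation_observed else 0)
        return DecisionTreeClassifier().fit(windows, labels)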
[0024] Figure 3 illustrates a method 300, beginning at 302 and ending at 316, of trend determination and identification in accordance with at least some embodiments. One or more of the steps described in this document are performed during the method. At 304, performance data is monitored. At 306, a subset of the performance data is determined, the subset correlated with a measure of underperformance. At 308, a trend of the subset is determined, the trend correlated with the measure. In at least one embodiment, the performance data ceases to be monitored, except for the subset after determining the trend, at 310. At 312, an occurrence of the trend is identified. At 314, an alert is output based on the identification. In at least one embodiment, the alert is a signal to an alert module.
[0025] The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those having ordinary skill in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims

What is claimed is:
1. A system, comprising: a processor; and an alert module coupled to the processor; wherein the processor monitors performance data; of the performance data, determines a subset that is correlated with a measure of underperformance; of the subset, determines a trend that is correlated with the measure; and identifies an occurrence of the trend; and wherein the alert module outputs an alert based on the identification.
2. The system of claim 1, wherein the processor ceases to monitor the performance data except for the subset after determining the trend.
3. The system of claim 2, wherein the processor monitors a second subset of the performance data after identifying the occurrence of the trend, the second subset comprising at least one element not in the subset.
4. The system of claim 1, wherein the measure is a combination of sub-measures of performance.
5. The system of claim 1, wherein the trend is a combination of sub-trends of the subset.
6. The system of claim 1, wherein the measure of underperformance is based on a service level objective.
7. The system of claim 1, wherein the performance data comprises application metrics, operating system metrics, middleware metrics, and hardware metrics.
8. The system of claim 7, wherein the middleware metrics are selected from the group consisting of queries per second, tuples read, page hits in buffer cache, disk input/output, page hits, requests per second, connections, and table scans.
9. The system of claim 7, wherein the operating system metrics are selected from the group consisting of input/output operations per second, memory allocation, page faults, page hits, resident memory size, central processing unit utilization, and packets transferred per second.
10. The system of claim 7, wherein the application metrics are selected from the group consisting of previous transactions, response time, and outstanding requests.
11. A computer-readable medium storing a software program that, when executed by a processor, causes the processor to: monitor performance data; of the performance data, determine a subset that is correlated with a measure of underperformance; of the subset, determine a trend that is correlated with the measure; identify an occurrence of the trend; and output an alert based on the identification.
12. The computer-readable medium of claim 11, further causing the processor to cease to monitor the performance data except for the subset after determining the trend.
13. The computer-readable medium of claim 11, further causing the processor to monitor a second subset of the performance data after identifying the occurrence of the trend, the second subset comprising at least one element not in the subset.
14. A method, comprising: monitoring performance data; of the performance data, determining a subset that is correlated with a measure of underperformance; of the subset, determining a trend that is correlated with the measure; identifying an occurrence of the trend; and outputting an alert based on the identification.
15. The method of claim 14, further comprising ceasing to monitor the performance data except for the subset after determining the trend.
16. The system of claim 1, wherein the measure of underperformance is a future measure of underperformance.
17. The system of claim 16, wherein the future measure of underperformance is based on a service level objective.
PCT/US2008/079739 2008-10-13 2008-10-13 Trend determination and identification WO2010044770A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US13/123,595 US20110231582A1 (en) 2008-10-13 2008-10-13 Trend determination and identification
EP08877461A EP2347340A4 (en) 2008-10-13 2008-10-13 Trend determination and identification
PCT/US2008/079739 WO2010044770A1 (en) 2008-10-13 2008-10-13 Trend determination and identification
CN200880131557.1A CN102187327B (en) 2008-10-13 2008-10-13 Trend is determined and is identified

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2008/079739 WO2010044770A1 (en) 2008-10-13 2008-10-13 Trend determination and identification

Publications (1)

Publication Number Publication Date
WO2010044770A1 true WO2010044770A1 (en) 2010-04-22

Family

ID=42106748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/079739 WO2010044770A1 (en) 2008-10-13 2008-10-13 Trend determination and identification

Country Status (4)

Country Link
US (1) US20110231582A1 (en)
EP (1) EP2347340A4 (en)
CN (1) CN102187327B (en)
WO (1) WO2010044770A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9262346B2 * 2010-06-21 2016-02-16 Hewlett Packard Enterprise Development LP Prioritizing input/outputs at a host bus adapter
US8930489B2 * 2011-10-11 2015-01-06 Rackspace US, Inc. Distributed rate limiting of handling requests
US8782504B2 (en) 2012-04-11 2014-07-15 Lsi Corporation Trend-analysis scheme for reliably reading data values from memory
US9400731B1 (en) * 2014-04-23 2016-07-26 Amazon Technologies, Inc. Forecasting server behavior
US11068827B1 (en) 2015-06-22 2021-07-20 Wells Fargo Bank, N.A. Master performance indicator
US20170102681A1 (en) * 2015-10-13 2017-04-13 Google Inc. Coordinating energy use of disparately-controlled devices in the smart home based on near-term predicted hvac control trajectories
US10261806B2 (en) * 2017-04-28 2019-04-16 International Business Machines Corporation Adaptive hardware configuration for data analytics
US11500874B2 (en) * 2019-01-23 2022-11-15 Servicenow, Inc. Systems and methods for linking metric data to resources
US20220283833A1 (en) * 2019-07-09 2022-09-08 Nippon Telegraph And Telephone Corporation Spp server, virtual machine connection control system, spp server connection control method and program
US11799741B2 (en) * 2019-10-29 2023-10-24 Fannie Mae Systems and methods for enterprise information technology (IT) monitoring
US11817994B2 (en) * 2021-01-25 2023-11-14 Yahoo Assets Llc Time series trend root cause identification

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5796633A (en) * 1996-07-12 1998-08-18 Electronic Data Systems Corporation Method and system for performance monitoring in computer networks
US6405327B1 (en) * 1998-08-19 2002-06-11 Unisys Corporation Apparatus for and method of automatic monitoring of computer performance
US20030110007A1 (en) 2001-07-03 2003-06-12 Altaworks Corporation System and method for monitoring performance metrics
US7062685B1 (en) 2002-12-11 2006-06-13 Altera Corporation Techniques for providing early failure warning of a programmable circuit
US7131037B1 (en) * 2002-06-05 2006-10-31 Proactivenet, Inc. Method and system to correlate a specific alarm to one or more events to identify a possible cause of the alarm
US20080016412A1 (en) 2002-07-01 2008-01-17 Opnet Technologies, Inc. Performance metric collection and automated analysis

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5506955A (en) * 1992-10-23 1996-04-09 International Business Machines Corporation System and method for monitoring and optimizing performance in a data processing system
US6636486B1 (en) * 1999-07-02 2003-10-21 Excelcom, Inc. System, method and apparatus for monitoring and analyzing traffic data from manual reporting switches
US6892236B1 (en) * 2000-03-16 2005-05-10 Microsoft Corporation System and method of generating computer system performance reports
US7065566B2 (en) * 2001-03-30 2006-06-20 Tonic Software, Inc. System and method for business systems transactions and infrastructure management
US6975962B2 (en) * 2001-06-11 2005-12-13 Smartsignal Corporation Residual signal alert generation for condition monitoring using approximated SPRT distribution
US6823382B2 (en) * 2001-08-20 2004-11-23 Altaworks Corporation Monitoring and control engine for multi-tiered service-level management of distributed web-application servers
US7007084B1 (en) * 2001-11-07 2006-02-28 At&T Corp. Proactive predictive preventative network management technique
US7603340B2 (en) * 2003-09-04 2009-10-13 Oracle International Corporation Automatic workload repository battery of performance statistics
US7583587B2 (en) * 2004-01-30 2009-09-01 Microsoft Corporation Fault detection and diagnosis
US7698113B2 (en) * 2005-06-29 2010-04-13 International Business Machines Corporation Method to automatically detect and predict performance shortages of databases
US8200659B2 (en) * 2005-10-07 2012-06-12 Bez Systems, Inc. Method of incorporating DBMS wizards with analytical models for DBMS servers performance optimization
US7562140B2 (en) * 2005-11-15 2009-07-14 Cisco Technology, Inc. Method and apparatus for providing trend information from network devices
US7822417B1 (en) * 2005-12-01 2010-10-26 At&T Intellectual Property Ii, L.P. Method for predictive maintenance of a communication network
US7890315B2 (en) * 2005-12-29 2011-02-15 Microsoft Corporation Performance engineering and the application life cycle
US7467067B2 (en) * 2006-09-27 2008-12-16 Integrien Corporation Self-learning integrity management system and related methods
US8195478B2 (en) * 2007-03-07 2012-06-05 Welch Allyn, Inc. Network performance monitor

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5796633A (en) * 1996-07-12 1998-08-18 Electronic Data Systems Corporation Method and system for performance monitoring in computer networks
US6405327B1 (en) * 1998-08-19 2002-06-11 Unisys Corporation Apparatus for and method of automatic monitoring of computer performance
US20030110007A1 (en) 2001-07-03 2003-06-12 Altaworks Corporation System and method for monitoring performance metrics
US7131037B1 (en) * 2002-06-05 2006-10-31 Proactivenet, Inc. Method and system to correlate a specific alarm to one or more events to identify a possible cause of the alarm
US20080016412A1 (en) 2002-07-01 2008-01-17 Opnet Technologies, Inc. Performance metric collection and automated analysis
US7062685B1 (en) 2002-12-11 2006-06-13 Altera Corporation Techniques for providing early failure warning of a programmable circuit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2347340A4 *

Also Published As

Publication number Publication date
CN102187327A (en) 2011-09-14
US20110231582A1 (en) 2011-09-22
CN102187327B (en) 2015-09-09
EP2347340A4 (en) 2012-05-02
EP2347340A1 (en) 2011-07-27

Similar Documents

Publication Publication Date Title
US20110231582A1 (en) Trend determination and identification
US10963330B2 (en) Correlating failures with performance in application telemetry data
US7502971B2 (en) Determining a recurrent problem of a computer resource using signatures
US7693982B2 (en) Automated diagnosis and forecasting of service level objective states
Tang et al. Fault-aware, utility-based job scheduling on Blue Gene/P systems
Chen et al. Distributed autonomous virtual resource management in datacenters using finite-Markov decision process
US20080195369A1 (en) Diagnostic system and method
US20170286252A1 (en) Workload Behavior Modeling and Prediction for Data Center Adaptation
US8874642B2 (en) System and method for managing the performance of an enterprise application
US8285841B2 (en) Service quality evaluator having adaptive evaluation criteria
US9858106B2 (en) Virtual machine capacity planning
TW201636839A (en) Method and apparatus of realizing resource provisioning
US20100238814A1 (en) Methods and Apparatus to Characterize and Predict Network Health Status
US8321362B2 (en) Methods and apparatus to dynamically optimize platforms
EP2742662A2 (en) Application performance analysis that is adaptive to business activity patterns
US10616078B1 (en) Detecting deviating resources in a virtual environment
US8930773B2 (en) Determining root cause
CN105893385A (en) Method and device for analyzing user behavior
JP6658507B2 (en) Load estimation system, information processing device, load estimation method, and computer program
Rao et al. Online capacity identification of multitier websites using hardware performance counters
US7962692B2 (en) Method and system for managing performance data
Rao et al. Online measurement of the capacity of multi-tier websites using hardware performance counters
CN110928750B (en) Data processing method, device and equipment
US11556451B2 (en) Method for analyzing the resource consumption of a computing infrastructure, alert and sizing
US9755925B2 (en) Event driven metric data collection optimization

Legal Events

Date Code Title Description
WWE  WIPO information: entry into national phase
     Ref document number: 200880131557.1
     Country of ref document: CN
121  EP: the EPO has been informed by WIPO that EP was designated in this application
     Ref document number: 08877461
     Country of ref document: EP
     Kind code of ref document: A1
WWE  WIPO information: entry into national phase
     Ref document number: 2008877461
     Country of ref document: EP
NENP Non-entry into the national phase
     Ref country code: DE
WWE  WIPO information: entry into national phase
     Ref document number: 13123595
     Country of ref document: US