US20120209568A1 - Multiple modeling paradigm for predictive analytics - Google Patents

Multiple modeling paradigm for predictive analytics

Publication number
US20120209568A1
Authority
US
United States
Prior art keywords
performance metric
model
value
threshold
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US13/026,351
Inventor
Karla K. Arndt
James M. Caffrey
Keyur Patel
Aspen L. Payton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US13/026,351
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors' interest (see document for details). Assignors: PATEL, KEYUR; CAFFREY, JAMES M.; ARNDT, KARLA K.; PAYTON, ASPEN L.
Publication of US20120209568A1
Application status is Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/008 Reliability or availability analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452 Performance evaluation by statistical analysis

Abstract

Techniques are described for monitoring a performance metric. A multiple modeling approach is used to improve predictive analysis by avoiding the issuance of warnings during spikes which occur as a part of normal system processing. This approach increases the accuracy of predictive analytics on a monitored computing system, does not require creating rules defining periodic processing cycles, reduces the amount of data required to perform predictive modeling, and reduces the amount of CPU required to perform predictive modeling.

Description

    BACKGROUND
  • Embodiments of the invention are directed to techniques which may be used as part of a predictive modeling analysis. More specifically, embodiments of the invention provide methods and systems for evaluating performance metrics of a computing system using a multiple modeling paradigm.
  • In large scale computing deployments, one common resiliency problem is solving what is referred to as “soft failures,” where a computing system does not crash, but simply stops working correctly or slows down to a point of being effectively non-functional. Predictive analysis is a technique used to identify when a current set of sampled metrics for a computing system indicates that a future event is likely to occur (e.g., to predict when a soft failure is likely to occur). Predictive analysis tools rely on historical data to derive a model of expected system behavior.
  • An important aspect of such tools is the capability to avoid false positives. A false positive occurs when the predictive analysis tool detects a problem and warns a user, but the behavior is actually normal system behavior. False positives can significantly reduce a user's confidence in the predictive analytics tool. In large computer systems, many tasks or jobs may be running whose behavior is “spikey,” meaning the activity rate may vary drastically depending on workload and time of day, day of week, etc. Predictive analytic tools analyze historical data collected on a system and use machine learning algorithms to identify abnormal behavior on a system. For example, regular periodic processing (weekly, bi-weekly, monthly, etc.) can cause normal spikes in activity that could be erroneously identified as abnormal behavior by the predictive analytic tools. Jobs or processes which exhibit “spikey” behavior tend to generate false positives, because the spikes tend to exceed consumption thresholds set using average consumption rates. Further, the timing of a spike may not follow a pattern that is detectable by pattern recognition algorithms due to a varying number of days in the month, weekends, holidays, etc.
  • SUMMARY
  • One embodiment of the invention includes a method for monitoring a performance metric. This method may generally include determining a value of a performance metric for a current sampling period. Upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, the value of the performance metric is evaluated according to a second model of expected behavior of the performance metric. And upon determining the value of the performance metric passes a threshold derived from the second model, an alert message is generated.
  • Another embodiment of the invention includes a computer-readable storage medium storing an application, which, when executed on a processor, performs an operation for monitoring a performance metric. The operation itself may generally include determining a value of a performance metric for a current sampling period. Upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, the value of the performance metric is evaluated according to a second model of expected behavior of the performance metric. And upon determining the value of the performance metric passes a threshold derived from the second model, an alert message is generated.
  • Still another embodiment of the invention includes a system having a processor and a memory storing an application program, which, when executed on the processor, performs an operation for monitoring a performance metric. The operation itself may generally include determining a value of a performance metric for a current sampling period. Upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, the value of the performance metric is evaluated according to a second model of expected behavior of the performance metric. And upon determining the value of the performance metric passes a threshold derived from the second model, an alert message is generated.
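The two-stage evaluation summarized in these embodiments can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `Thresholds` and `evaluate_sample` are hypothetical, and "passes a threshold" is assumed here to mean "exceeds" (a metric that alarms on low values would invert the comparisons).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Hypothetical container for the two model-derived thresholds."""
    normal: float  # derived from the first ("normal-standard") model
    spike: float   # derived from the second ("spike-standard") model


def evaluate_sample(value: float, thresholds: Thresholds) -> Optional[str]:
    """Return an alert message, or None when no alert is warranted."""
    # Stage 1: a value within the first model's threshold is ordinary activity.
    if value <= thresholds.normal:
        return None
    # Stage 2: the value passed the first threshold, so it is evaluated
    # against the second model. Within that threshold it is an expected spike.
    if value <= thresholds.spike:
        return None
    # The value passed both thresholds: generate an alert message.
    return f"ALERT: metric value {value} exceeds spike threshold {thresholds.spike}"
```

Under these assumptions, a value between the two thresholds is treated as a normal spike and produces no alert; only a value beyond the second threshold does.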
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • So that the manner in which the above recited aspects are attained and can be understood in detail, a more particular description of embodiments of the invention, briefly summarized above, may be had by reference to the appended drawings.
  • It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
  • FIG. 1 illustrates an example computing infrastructure in which embodiments of the invention may be implemented.
  • FIG. 2 illustrates an example computing system used to monitor performance metrics using a multiple modeling paradigm, according to one embodiment of the invention.
  • FIG. 3 illustrates a method for using a multiple modeling paradigm to monitor “spikey” computing jobs or processes, according to one embodiment of the invention.
  • FIG. 4 illustrates a method for using a multiple modeling paradigm to perform a predictive analysis, according to one embodiment of the invention.
  • FIGS. 5A-5B illustrate an example data set monitored by a predictive analysis tool configured to use a multiple modeling paradigm, according to one embodiment of the invention.
  • DETAILED DESCRIPTION
  • As noted, some computing tasks can regularly exhibit so-called “spikey” behavior, where the amount of computing resources consumed by the computing task suddenly and dramatically changes. For example, the amount of storage space, memory allocation, or CPU utilization, etc., can suddenly increase (or spike) as part of normal program operation. However, the same can occur when a process crashes (or otherwise operates abnormally). Accordingly, jobs or processes which exhibit “spikey” behavior make it challenging to determine whether a current spike in resource consumption (e.g., memory usage or processor utilization) indicates that something has gone wrong with a system function or with one of the jobs running on the system. That is, the problem could be rooted in something other than the job. For example, a communication device problem could cause a sudden increase in transaction response times. Thus, it is difficult for a predictive analysis tool to discern between a periodic spike in behavior resulting from the normal operations of a “spikey” job or process and an error condition that results in spikes in resource consumption.
  • Modeling these types of periodic behaviors frequently requires long-term retention of large volumes of historical data. Running modeling algorithms against very large amounts of data can consume unacceptable amounts of limited system resources both in terms of storage allocations and the time required to run the analysis against the historical data (which cuts into time available for regular computing tasks).
  • Embodiments of the invention provide methods and systems for evaluating performance metrics of a computing system using a multiple modeling paradigm. In one embodiment, system data for modeling a performance metric is stored as multiple groups: one group representing “standard” activity for a performance metric and one (or more) additional groups representing “spike” activity for the performance metric. The groups are modeled separately to allow for one prediction representing the “normal-standard” or expected value of the performance metric and for one (or more) predictions representing a “spike-standard” value of the metric expected during a spike. Doing so avoids issuing an erroneous exception when spikes occur, but still allows valid exceptions to be thrown when the value of the performance metric is outside of the modeled “spike-standard” value during a spike.
  • This approach greatly reduces the data retention requirements for the predictive analysis tool. Specifically, data used to model the “normal-standard” or standard value may be maintained for a shorter retention period. That is, historical data used to model resource consumption for non-spike periods (i.e., the “normal-standard”) may be based on a relatively short time-window (e.g., a period of one month), while data used to model spike periods (i.e., the “spike-standard”) can reach back over a much longer period (e.g., a period of one year) in order to retain a representative sample. However, because spikes occur less frequently, storing longer periods of modeling data for the “spike-standard” does not require dedicating unacceptable amounts of storage resources to the predictive analysis tool. Further, storing the data in this manner also reduces overall processing time, since the “normal-standard” model is not based on a long history of the values sampled for the performance metric.
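The differing retention windows might be applied as in the sketch below. The function `prune` and the constants are hypothetical; the one-month and one-year figures come from the examples in the text and are approximated here as 30 and 365 days.

```python
from datetime import datetime, timedelta

# Assumed retention windows, approximating the "one month" and "one year"
# examples given in the description.
NORMAL_RETENTION = timedelta(days=30)
SPIKE_RETENTION = timedelta(days=365)


def prune(history, now, retention):
    """Keep only the (timestamp, value) samples no older than `retention`."""
    cutoff = now - retention
    return [(ts, value) for ts, value in history if ts >= cutoff]
```

Because the spike history only receives samples taken during spikes, the longer window applies to a comparatively small data set.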
  • Furthermore, in one embodiment, users can identify specific periods for modeling expected spikes in addition to the “normal-standard” and “spike-standard” periods. For example, assume a user creates a recurring job or processing task performed on the 1st day of each month, and also executes jobs or tasks that result in transient spikes in the relevant performance metric. In such a case, the predictive analysis tool could also create a model for the spikes that are known to occur at the beginning of each month, in addition to the “normal-standard” and transient “spike-standard” models. Doing so might be useful in cases where the known spike periods result in resource consumption levels that would still generate a false positive according to the “spike-standard” model.
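One way to picture a user-identified spike period is a lookup that prefers a threshold tied to a known day of the month and otherwise falls back to the transient spike-standard threshold. The function name and the day-of-month keying are assumptions made for illustration; the text does not specify how such periods are represented.

```python
from datetime import date
from typing import Dict


def select_spike_threshold(
    day: date,
    spike_threshold: float,
    scheduled_thresholds: Dict[int, float],
) -> float:
    """Pick the threshold used for spike evaluation on a given day.

    `scheduled_thresholds` maps a day of the month (e.g., 1) to a threshold
    modeled from spikes known to recur on that day. Days without an entry
    use the general "spike-standard" threshold.
    """
    return scheduled_thresholds.get(day.day, spike_threshold)
```

With a hypothetical entry for day 1, a month-start batch job is judged against its own, typically higher, threshold.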
  • In the following, reference is made to embodiments of the invention. However, it should be understood that the invention is not limited to specific described embodiments. Instead, any combination of the following features and elements, whether related to different embodiments or not, is contemplated to implement and practice the invention. Furthermore, although embodiments of the invention may achieve advantages over other possible solutions and/or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the invention. Thus, the following aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s). Likewise, reference to “the invention” shall not be construed as a generalization of any inventive subject matter disclosed herein and shall not be considered to be an element or limitation of the appended claims except where explicitly recited in a claim(s).
  • As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus or device.
  • A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
  • The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • Embodiments of the invention may be provided to end users through a cloud computing infrastructure. Cloud computing generally refers to the provision of scalable computing resources as a service over a network. More formally, cloud computing may be defined as a computing capability that provides an abstraction between the computing resource and its underlying technical architecture (e.g., servers, storage, networks), enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. Thus, cloud computing allows a user to access virtual computing resources (e.g., storage, data, applications, and even complete virtualized computing systems) in “the cloud,” without regard for the underlying physical systems (or locations of those systems) used to provide the computing resources.
  • Typically, cloud computing resources are provided to a user on a pay-per-use basis, where users are charged only for the computing resources actually used (e.g., an amount of storage space consumed by a user or a number of virtualized systems instantiated by the user). A user can access any of the resources that reside in the cloud at any time, and from anywhere across the Internet. In context of the present invention, a user may access monitor applications or related data present in a cloud environment. For example, the monitoring application could monitor an amount of shared memory (or other resources) available to multiple virtual machine instances in a cloud-based server deployment.
  • The flowchart and block diagrams in the Figures illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • Further, particular embodiments of the invention are described using an example of monitoring of a performance metric of a computing system over a data communications network. However, it should be understood that the techniques described herein for modeling a performance metric using data for multiple models may be adapted to a variety of purposes in addition to modeling performance metrics on computing systems. Further, in many cases, the predictive analysis tool may be executing on the computing system being monitored. That is, the predictive analysis tool may monitor resource performance metrics on a local computing system as well as resources and performance metrics on remote systems.
  • FIG. 1 illustrates an example computing infrastructure 100 in which embodiments of the invention may be implemented. As shown, the computing infrastructure 100 includes a monitoring system 105 and server systems 130 1-2, each connected to a communications network 120. In this example, the monitoring system 105 communicates over the network 120 to monitor the ongoing state of the server systems 130. As one example, the monitoring system 105 could be configured to monitor the consumption of shared resources on each of the servers 130. Of course, the monitoring system 105 could be configured to monitor a variety of performance metrics related to the function of the server systems 130 (as well as performance metrics of the monitoring system 105), e.g., CPU utilization, shared (or dedicated) storage consumption, virtual storage consumption, error message traffic, system message (console) traffic, latching (latches held/released), transaction response times, disk I/O response times, disk I/O activity (reads, writes, etc.). Further, one of ordinary skill in the art will recognize that the particular metrics may be selected as needed in a particular case.
  • FIG. 2 illustrates an example computing system 200 that includes a monitoring application 222 used to monitor performance metrics using a multiple modeling paradigm, according to one embodiment of the invention. As shown, the computing system 200 includes, without limitation, a central processing unit (CPU) 205, a network interface 215, an interconnect 220, a memory 225, and storage 230. The computing system 200 may also include an I/O device interface 210 connecting I/O devices 212 (e.g., keyboard, display and mouse devices) to the computing system 200.
  • In general, the CPU 205 retrieves and executes programming instructions stored in the memory 225. Similarly, the CPU 205 stores and retrieves application data residing in the memory 225. The interconnect 220 provides a communication path for transmitting programming instructions and application data between the CPU 205, I/O device interface 210, storage 230, network interface 215, and memory 225. CPU 205 is included to be representative of a single CPU, multiple CPUs, a CPU having multiple processing cores, and the like. And the memory 225 is generally included to be representative of a random access memory. The storage 230 may be a hard disk drive or solid state storage device (SSD). Further, although shown as a single unit, the storage 230 may be a combination of fixed and/or removable storage devices, such as fixed disc drives, floppy disc drives, tape drives, removable memory cards, optical storage, network attached storage (NAS), or a storage area-network (SAN).
  • Illustratively, the memory 225 stores the monitoring application 222, along with first model thresholds 224 and second model thresholds 226. Storage 230 contains a sampled metric history 232 for a first model, a sampled metric history 234 for a second model, and optional date/time spike pattern data 236. In one embodiment, the monitoring application 222 is configured to generate an alarm (e.g., an alert message sent to a system administrator) when a performance metric exceeds (or in the appropriate case falls below) the thresholds specified by the first model thresholds 224 and second model thresholds 226. Further, the monitoring application 222 may be configured to derive values for thresholds 224, 226 using the sampled metric history 232 for the first model and using the sampled metric history 234 for the second model. For example, the first model thresholds 224 may provide an estimated maximum (or minimum) value for a performance metric based on sampled values not associated with a spike period. Accordingly, the sampled metric history 232 for the first model, i.e., for the “normal-standard” value, may include sample data covering a relatively recent history of sampled metric values (e.g., a period of four weeks).
  • At the same time, the sampled metric history 234 for the second model thresholds 226 may include data covering a relatively longer history of sampled metric values. However, the sampled metric history 234 for the second model is limited to data values sampled during periods where a spike in the performance metric is being observed. That is, the sampled metric history 234 is used to determine the appropriate “spike-standard” threshold.
  • In one embodiment, the monitoring application 222 initially establishes a base activity level of a performance metric using several hours of data collection (or days, as appropriate). If a spike occurs during this time it may slightly skew the calculation of expected activity for the “normal-standard” threshold, but generally not enough to affect the overall outcome.
  • Assume, e.g., a data collector samples a metric every 30 minutes for a two-week period and metrics are stored in a historical data file (i.e., as the sampled metric history 232 for the first model). Based on the data sampled over the two-week period, the first model threshold 224 is identified. At this point, until sufficient historical data has been collected, any spikes that occur would be identified as abnormal behavior and result in a warning. Accordingly, in one embodiment, any performance metric values that would trigger a warning are diverted to a separate historical data file for spike activity to avoid any additional skew to the calculations of standard normal behavior, i.e., performance metric values sampled during an observed spike period are diverted to sampled metric history 234 for the second model.
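The diversion of samples into two histories could look like the following sketch. The function `record_sample` and the use of in-memory lists in place of historical data files are assumptions for illustration only.

```python
from typing import List


def record_sample(
    value: float,
    first_threshold: float,
    normal_history: List[float],
    spike_history: List[float],
) -> str:
    """Store a sample in the history matching its classification.

    A value that would trigger a warning under the first model is diverted
    to the spike history so it does not skew the "normal-standard" data.
    """
    if value > first_threshold:
        spike_history.append(value)
        return "spike"
    normal_history.append(value)
    return "normal"
```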
  • Thus, the sampled metric history 234 stores metric values which are high enough to cause a warning when compared to standard or “normal-standard” behavior. Once a sufficient number of metric values have been stored in the sampled metric history 234, predictive modeling algorithms may be used to determine a “spike-standard” value. For example, a sufficient amount of data may be considered to be spike data collected over a four-to-six week period which includes data for at least 3 occurrences of spike behavior. Of course, this training period may be adjusted depending on the requirements of the monitored system. That is, the monitoring application 222 may calculate a value which represents a metric value expected to be seen during a spike. Note, such a value may be specified as a single threshold (with some tolerance such as an expected standard deviation and variance), but may also be specified as a normal operating range, or in other forms appropriate for the particular performance metric.
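A sufficiency check along the lines of this example might be sketched as below. The text does not define what counts as one "occurrence" of spike behavior; this sketch assumes that spike samples separated by no more than one sampling interval belong to the same occurrence. All function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def count_spike_occurrences(timestamps, sampling_interval):
    """Count runs of spike samples.

    Assumption: a gap longer than one sampling interval between consecutive
    spike samples starts a new occurrence.
    """
    count = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > sampling_interval:
            count += 1
        previous = ts
    return count


def has_enough_spike_data(
    timestamps,
    collection_start,
    now,
    sampling_interval=timedelta(minutes=30),
    min_period=timedelta(weeks=4),
    min_occurrences=3,
):
    """True once the collection period and the occurrence count both suffice."""
    long_enough = now - collection_start >= min_period
    occurrences = count_spike_occurrences(timestamps, sampling_interval)
    return long_enough and occurrences >= min_occurrences
```

The defaults mirror the figures in the example (30-minute sampling, at least four weeks, at least 3 occurrences) and would be adjusted to the monitored system.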
  • At this point, the sampled metric history 232 and 234 data may be used to generate predictions of expected performance metric values. The first prediction represents the standard normal metric value and the additional predictions represent a normal metric value during a spike in activity. Based on these models, thresholds 224, 226 for each type of behavior can be programmatically generated. The sensitivity of the thresholds 224, 226 may be fine tuned by user configurable parameters.
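As an illustration of generating a threshold programmatically from a sampled history, the sketch below uses the mean plus a multiple of the standard deviation. The text does not name a specific modeling algorithm, so this statistic is a stand-in chosen for simplicity, and `sensitivity` is a hypothetical name for the user-configurable tuning parameter.

```python
from statistics import mean, pstdev
from typing import Sequence


def derive_threshold(samples: Sequence[float], sensitivity: float = 3.0) -> float:
    """Derive an upper threshold from a history of sampled metric values.

    Assumed model: mean + sensitivity * population standard deviation.
    A larger `sensitivity` tolerates more deviation before a value passes
    the threshold.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return mean(samples) + sensitivity * pstdev(samples)
```

The same function could be applied once to the normal history and once to the spike history to obtain the two thresholds.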
  • Once the first and second model thresholds 224, 226 have been established, if an observed sample value for the performance metric exceeds the first model threshold 224, this sampled value may be stored in the sampled metric history 234 (and used to refresh the predictive model more frequently until regular activity has resumed). Further, if a subsequently observed sample value for the performance metric exceeds the second model threshold 226 during a spike period, then an alert message may be generated, e.g., a warning to the system operator so that action may be taken to prevent further complications. Of course, a broad variety of other actions could be triggered when the performance metric exceeds (or falls below) the first model threshold 224, the second model threshold 226, or both.
  • While the approach described above eliminates the need for user knowledge and configuration in advance of a resource spike, in some cases users may consistently schedule jobs or processing tasks in such a manner that certain spike periods may be predicted. In such a case, the monitoring application 222 may be configured to create additional thresholds and sampled metric histories to model spike periods associated with specific jobs or tasks. For example, date/time spike pattern data may specify when a specific spike is expected to occur and recur. Further still, the first and second thresholds may be dynamic once established. That is, once set to an initial value, subsequent sample values during both “normal” and “spike” periods may be used to update the thresholds over time.
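As a rough sketch of how scheduled spike periods might be represented, the snippet below checks a timestamp against a weekly pattern. The tuple layout `(weekday, start_hour, end_hour)`, the name `in_expected_spike`, and the example window are hypothetical; the source says only that pattern data may specify when a spike is expected to occur.

```python
from datetime import datetime

# Hypothetical weekly pattern: (weekday, start_hour, end_hour), Monday == 0.
EXPECTED_SPIKE_WINDOWS = [(0, 2, 4)]  # e.g. a batch job on Mondays, 02:00-04:00

def in_expected_spike(timestamp, windows=EXPECTED_SPIKE_WINDOWS):
    """Return True if timestamp falls inside a scheduled spike window."""
    return any(
        timestamp.weekday() == weekday and start <= timestamp.hour < end
        for weekday, start, end in windows
    )
```

A monitoring application could use such a check to select the threshold and sampled metric history associated with a specific job.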
  • FIG. 3 illustrates a method 300 for using a multiple modeling paradigm to monitor “spikey” computing jobs or processes, according to one embodiment of the invention. As shown, the method 300 begins at step 305, where a monitoring application begins monitoring a performance metric associated with a set of computing jobs or tasks for a training period. As noted, examples of a monitored performance metric can include a variety of aspects of a computing system, grid, cluster, network, etc., including, e.g., system utilization, processor (or processor core) utilization, shared (or dedicated) storage consumption, virtual storage consumption, error message traffic, system message (console) traffic, latching (latches held/released), transaction response times, disk I/O response times, disk I/O activity (reads, writes, etc.). The training period allows the monitoring system to determine both a first threshold (i.e., the normal-standard) and a second threshold (i.e., the spike-standard).
  • At step 310, during the training period, the monitoring system suppresses any alarms when a sampled value of a monitored performance metric exceeds the value for the first threshold. This occurs because, while the first threshold (i.e., the normal-standard threshold) may be established relatively quickly (i.e., over a period of a few hours or days), the second threshold cannot be determined until enough spike data has been observed. Further, once established, the first threshold is used to identify the periods from which the data used to model the second threshold is collected. For example, data for the second threshold may be limited to periods where the monitored performance metric exceeds the first threshold (i.e., during a period of spike activity). At step 315, the monitoring system determines whether enough spike data has been observed to determine the second threshold (i.e., the spike-standard threshold). Depending on the frequency and duration of spike periods, the training period may last for a period of weeks or months. After observing a representative sample of spike periods, the monitoring system determines a second model threshold for distinguishing between normal spikes in performance and events that may require user intervention (step 320).
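The training steps of method 300 might be sketched as below, under the assumption that a "spike occurrence" is a contiguous run of samples above the first threshold. The function name `collect_spike_data` and the parameter `min_spikes` are hypothetical, and the sketch takes an already established first threshold as input.

```python
def collect_spike_data(samples, first_threshold, min_spikes=3):
    """Walk training samples; return (spike_values, spike_count, trained).

    No alarm is raised here: during training, values above the first
    threshold are only recorded (step 310). trained reports whether
    enough spike occurrences were seen (step 315).
    """
    spike_values = []
    spike_count = 0
    in_spike = False
    for value in samples:
        if value > first_threshold:
            spike_values.append(value)
            if not in_spike:
                spike_count += 1  # a new spike period begins
            in_spike = True
        else:
            in_spike = False
    return spike_values, spike_count, spike_count >= min_spikes
```

Once `trained` is true, the collected `spike_values` would feed whatever model produces the second threshold (step 320).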
  • FIG. 4 illustrates a method 400 for using a multiple modeling paradigm to perform a predictive analysis, according to one embodiment of the invention. As shown, the method 400 begins at step 405 where the monitoring application determines a value of a performance metric for a current sampling period. Of course, the sampling frequency may be set as appropriate for the particular performance metric monitored by the monitoring system.
  • At step 410, the monitoring system determines whether the value of the performance metric sampled at step 405 exceeds (or in the appropriate case falls below) the threshold for the first model. If not, then the system returns to step 405 until reaching the time for the next sample period. Otherwise, in cases where the sampled performance metric value exceeds the threshold, the system begins evaluating the monitored metric using the second model. Note, in one embodiment, when a spike is observed, the sampling frequency may be increased (relative to the sampling frequency during non-spike periods) in order to monitor the performance metric more closely during a spike-period.
  • At step 420, if the performance metric exceeds the second threshold determined using the second model (i.e., the spike-standard threshold), then at step 425 an alarm message may be sent regarding the performance metric. Otherwise, if the evaluation of the performance metric indicates that the performance metric, while experiencing a spike, is experiencing a “normal” spike, then the system returns to step 405 to wait for the next sampling period.
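The decision flow of method 400 can be summarized in a short sketch. The function name `evaluate_sample`, the status strings, and the interval values are assumptions chosen for illustration; only the two-stage comparison and the shorter sampling interval during a spike come from the description above.

```python
def evaluate_sample(value, first_threshold, second_threshold,
                    normal_interval=60.0, spike_interval=10.0):
    """Return (status, seconds_until_next_sample) for one sampled value."""
    if value <= first_threshold:
        # Step 410: within normal behavior, wait for the next sample period.
        return "normal", normal_interval
    if value > second_threshold:
        # Steps 420-425: exceeds the spike-standard threshold, raise an alarm.
        return "alarm", spike_interval
    # A spike, but a "normal" one: keep sampling more frequently.
    return "normal_spike", spike_interval
```

For metrics where a low value signals trouble, the comparisons would be reversed, as the description notes with "or in the appropriate case falls below".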
  • An example of the multiple-modeling approach is shown in FIGS. 5A-5B. More specifically, FIGS. 5A-5B illustrate example data sets monitored by a predictive analysis tool configured to use a multiple modeling paradigm, according to one embodiment of the invention. FIG. 5A shows samples 500 of a metric value obtained over a two-week period. In this example, a value of roughly 100 is generally obtained for the performance metric, except during two spikes 510, 515. Assume for this example that the spike 510 results from normal activity of the computing system being monitored (via the performance metric) and that spike 515 results from a crash or other system malfunction. If a threshold for an alarm were set to roughly 150, then an alarm would be generated by both spike 510 (a false positive) and spike 515 (an actual problem). Accordingly, as described above, a multiple modeling approach may be used to maintain one model for so-called “normal” values of the performance metric and a separate model for spike periods. This result is illustrated in FIG. 5B.
  • As shown in FIG. 5B, data 550 for the performance metric is captured for a larger time period than shown in FIG. 5A. Additionally, a first threshold 555 is set to roughly 150 and a second threshold 560 is set to roughly 425. During a training period 565, data for spikes 580, 585 is used to determine the value for the second threshold 560. Once training is complete, a spike 575₁ does not generate an alarm, as it does not exceed the second threshold 560. In contrast, a spike 575₂ does exceed the second threshold 560, and does result in an alarm.
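The difference between a single alarm threshold and the two-threshold approach can be seen with a few numbers. In the sketch below the sample values are invented for illustration and are not read from FIGS. 5A-5B; only the approximate threshold values 150 and 425 come from the text.

```python
# Illustrative samples: baseline near 100, one ordinary spike, one extreme spike.
samples = [100, 105, 300, 98, 500, 102]

FIRST_THRESHOLD = 150   # roughly the value given for threshold 555
SECOND_THRESHOLD = 425  # roughly the value given for threshold 560

# Single-model monitoring alarms on every value above the one threshold.
single_model_alarms = [v for v in samples if v > FIRST_THRESHOLD]

# Dual-model monitoring alarms only when a spike also exceeds the
# spike-standard threshold.
dual_model_alarms = [
    v for v in samples if v > FIRST_THRESHOLD and v > SECOND_THRESHOLD
]
```

With these values, the single threshold flags both spikes, while the dual-model check flags only the value of 500.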
  • Thus, advantageously, the multiple modeling approach described above improves predictive analysis by avoiding the issuance of warnings during spikes which occur as a part of normal system processing. This approach increases the accuracy of predictive analytics on a monitored computing system, does not require creating rules defining periodic processing cycles, reduces the amount of data required to perform predictive modeling, and reduces the amount of CPU required to perform predictive modeling.
  • While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (21)

1. A computer-implemented method for monitoring a performance metric, the method comprising:
determining a value of a performance metric for a current sampling period;
upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, evaluating the value of performance metric according to a second model of expected behavior of the performance metric; and
upon determining the value of the performance metric passes a threshold derived from the second model, generating an alert message.
2. The method of claim 1, further comprising, upon determining the value of the performance metric has not passed the threshold derived from the first model, updating the first model based on the sampled value of the performance metric.
3. The method of claim 1, further comprising, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, updating the second model based on the sampled value of the performance metric.
4. The method of claim 1, further comprising, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, increasing a sampling frequency of the sampling period.
5. The method of claim 1, wherein the performance metric corresponds to a usage of a shared resource.
6. The method of claim 1, wherein the performance metric corresponds to one of processor utilization, storage resource consumption, and memory consumption.
7. The method of claim 1, wherein the first threshold is derived by training the first model over a specified first training period, and wherein the second model is trained by sampling the performance metric when the performance metric value exceeds the first threshold.
8. A computer-readable storage medium storing an application, which, when executed on a processor, performs an operation for monitoring a performance metric, the operation comprising:
determining a value of a performance metric for a current sampling period;
upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, evaluating the value of performance metric according to a second model of expected behavior of the performance metric; and
upon determining the value of the performance metric passes a threshold derived from the second model, generating an alert message.
9. The computer-readable storage medium of claim 8, wherein the operation further comprises, upon determining the value of the performance metric has not passed the threshold derived from the first model, updating the first model based on the sampled value of the performance metric.
10. The computer-readable storage medium of claim 8, wherein the operation further comprises, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, updating the second model based on the sampled value of the performance metric.
11. The computer-readable storage medium of claim 8, wherein the operation further comprises, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, increasing a sampling frequency of the sampling period.
12. The computer-readable storage medium of claim 8, wherein the performance metric corresponds to a usage of a shared resource.
13. The computer-readable storage medium of claim 8, wherein the performance metric corresponds to one of processor utilization, storage resource consumption, and memory consumption.
14. The computer-readable storage medium of claim 8, wherein the first threshold is derived by training the first model over a specified first training period, and wherein the second model is trained by sampling the performance metric when the performance metric value exceeds the first threshold.
15. A system, comprising:
a processor; and
a memory storing an application program, which, when executed on the processor, performs an operation for monitoring a performance metric, the operation comprising:
determining a value of a performance metric for a current sampling period,
upon determining the value of the performance metric passes a threshold derived from a first model of expected behavior of the performance metric, evaluating the value of performance metric according to a second model of expected behavior of the performance metric, and
upon determining the value of the performance metric passes a threshold derived from the second model, generating an alert message.
16. The system of claim 15, wherein the operation further comprises, upon determining the value of the performance metric has not passed the threshold derived from the first model, updating the first model based on the sampled value of the performance metric.
17. The system of claim 15, wherein the operation further comprises, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, updating the second model based on the sampled value of the performance metric.
18. The system of claim 15, wherein the operation further comprises, upon determining the value of the performance metric has passed the threshold derived from the first model of expected behavior of the performance metric, increasing a sampling frequency of the sampling period.
19. The system of claim 15, wherein the performance metric corresponds to a usage of a shared resource.
20. The system of claim 15, wherein the performance metric corresponds to one of processor utilization, storage resource consumption, and memory consumption.
21. The system of claim 15, wherein the first threshold is derived by training the first model over a specified first training period, and wherein the second model is trained by sampling the performance metric when the performance metric value exceeds the first threshold.
US13/026,351 2011-02-14 2011-02-14 Multiple modeling paradigm for predictive analytics Pending US20120209568A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/026,351 US20120209568A1 (en) 2011-02-14 2011-02-14 Multiple modeling paradigm for predictive analytics

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US13/026,351 US20120209568A1 (en) 2011-02-14 2011-02-14 Multiple modeling paradigm for predictive analytics
DE201211000797 DE112012000797T5 (en) 2011-02-14 2012-02-08 Multiple modeling paradigm for predictive analytics
PCT/IB2012/050569 WO2012110918A1 (en) 2011-02-14 2012-02-08 Multiple modeling paradigm for predictive analytics
CN201280008552.6A CN103354924B (en) 2011-02-14 2012-02-08 For monitoring the performance of the method and system
JP2013553058A JP6025753B2 (en) 2011-02-14 2012-02-08 Computer-implemented method, computer-readable storage medium, and system for monitoring performance metrics
GB201307559A GB2499535B (en) 2011-02-14 2012-02-08 Multiple modeling paradigm for predictive analytics
US13/686,389 US20130086431A1 (en) 2011-02-14 2012-11-27 Multiple modeling paradigm for predictive analytics

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/686,389 Continuation US20130086431A1 (en) 2011-02-14 2012-11-27 Multiple modeling paradigm for predictive analytics

Publications (1)

Publication Number Publication Date
US20120209568A1 true US20120209568A1 (en) 2012-08-16

Family

ID=46637561

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/026,351 Pending US20120209568A1 (en) 2011-02-14 2011-02-14 Multiple modeling paradigm for predictive analytics
US13/686,389 Pending US20130086431A1 (en) 2011-02-14 2012-11-27 Multiple modeling paradigm for predictive analytics

Country Status (6)

Country Link
US (2) US20120209568A1 (en)
JP (1) JP6025753B2 (en)
CN (1) CN103354924B (en)
DE (1) DE112012000797T5 (en)
GB (1) GB2499535B (en)
WO (1) WO2012110918A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124669A1 (en) * 2011-11-10 2013-05-16 Eric Paul Anderson System for monitoring eleastic cloud-based computing systems as a service
US20140068298A1 (en) * 2012-09-05 2014-03-06 Nvidia Corporation System and process for accounting for aging effects in a computing device
US8782197B1 (en) * 2012-07-17 2014-07-15 Google, Inc. Determining a model refresh rate
US8874589B1 (en) 2012-07-16 2014-10-28 Google Inc. Adjust similar users identification based on performance feedback
US8886799B1 (en) 2012-08-29 2014-11-11 Google Inc. Identifying a similar user identifier
US8886575B1 (en) 2012-06-27 2014-11-11 Google Inc. Selecting an algorithm for identifying similar user identifiers based on predicted click-through-rate
US8914500B1 (en) 2012-05-21 2014-12-16 Google Inc. Creating a classifier model to determine whether a network user should be added to a list
US20150095719A1 (en) * 2013-10-01 2015-04-02 Samsung Sds Co., Ltd. Data preprocessing device and method thereof
US20150149850A1 (en) * 2013-11-25 2015-05-28 Comcast Cable Communication, Llc Device Performance Monitoring
US9053185B1 (en) 2012-04-30 2015-06-09 Google Inc. Generating a representative model for a plurality of models identified by similar feature data
US9065727B1 (en) 2012-08-31 2015-06-23 Google Inc. Device identifier similarity models derived from online event signals
US20150281008A1 (en) * 2014-03-25 2015-10-01 Emulex Corporation Automatic derivation of system performance metric thresholds
US20160127204A1 (en) * 2014-03-07 2016-05-05 Hitachi, Ltd. Performance evaluation method and information processing device
WO2016150395A1 (en) * 2015-03-24 2016-09-29 Huawei Technologies Co., Ltd. Adaptive, anomaly detection based predictor for network time series data
US20160285783A1 (en) * 2015-03-26 2016-09-29 Vmware, Inc. Methods and apparatus to control computing resource utilization of monitoring agents
US20160357232A1 (en) * 2015-06-05 2016-12-08 Apple Inc. Predictive control systems and methods
US9588813B1 (en) * 2013-06-07 2017-03-07 Amazon Technologies, Inc. Determining cost of service call
US9600774B1 (en) * 2013-09-25 2017-03-21 Amazon Technologies, Inc. Predictive instance suspension and resumption
US9921934B1 (en) * 2011-10-14 2018-03-20 Amazon Techologies, Inc. Storage process metrics
EP3333707A1 (en) * 2016-12-09 2018-06-13 British Telecommunications public limited company Autonomic method for managing a computing system
US10089165B2 (en) * 2016-04-06 2018-10-02 International Business Machines Corporation Monitoring data events using calendars

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8732534B2 (en) * 2010-09-17 2014-05-20 Oracle International Corporation Predictive incident management
US10043194B2 (en) 2014-04-04 2018-08-07 International Business Machines Corporation Network demand forecasting
US10361924B2 (en) 2014-04-04 2019-07-23 International Business Machines Corporation Forecasting computer resources demand
US9385934B2 (en) 2014-04-08 2016-07-05 International Business Machines Corporation Dynamic network monitoring
US9665460B2 (en) * 2015-05-26 2017-05-30 Microsoft Technology Licensing, Llc Detection of abnormal resource usage in a data center
EP3323047A4 (en) * 2015-07-14 2019-03-27 Sios Technology Corporation Distributed machine learning analytics framework for the anaylsis of streaming data sets from a computer enviroment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3795008A (en) * 1972-04-12 1974-02-26 B Kolsrud Method for the discrete sampling of co-related values of two or more variables
US20030079160A1 (en) * 2001-07-20 2003-04-24 Altaworks Corporation System and methods for adaptive threshold determination for performance metrics
US20060293777A1 (en) * 2005-06-07 2006-12-28 International Business Machines Corporation Automated and adaptive threshold setting
US20070083513A1 (en) * 2005-10-12 2007-04-12 Ira Cohen Determining a recurrent problem of a computer resource using signatures

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142746A (en) * 1999-11-11 2001-05-25 Nec Software Chubu Ltd Load monitor device for computer system
US7610377B2 (en) * 2004-01-27 2009-10-27 Sun Microsystems, Inc. Overload management in an application-based server
JP2005316808A (en) * 2004-04-30 2005-11-10 Nec Software Chubu Ltd Performance monitoring device, performance monitoring method and program
US8320256B2 (en) * 2006-09-13 2012-11-27 International Business Machines Corporation Method, computer program product and system for managing usage of marginal capacity of computer resources
JP2009003742A (en) * 2007-06-22 2009-01-08 Hitachi Electronics Service Co Ltd Task delay prediction system
US8214308B2 (en) * 2007-10-23 2012-07-03 Sas Institute Inc. Computer-implemented systems and methods for updating predictive models
US7966152B2 (en) * 2008-04-23 2011-06-21 Honeywell International Inc. System, method and algorithm for data-driven equipment performance monitoring
EP2330510A4 (en) * 2008-09-18 2015-08-12 Nec Corp Operation management device, operation management method, and operation management program
CN101882107A (en) * 2010-06-28 2010-11-10 山东中创软件商用中间件股份有限公司 Method and device for automatically testing WEB (World Wide Web) application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Threshold | Definition of threshold by Merriam-Webster Dictionary *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9921934B1 (en) * 2011-10-14 2018-03-20 Amazon Techologies, Inc. Storage process metrics
US8447851B1 (en) * 2011-11-10 2013-05-21 CopperEgg Corporation System for monitoring elastic cloud-based computing systems as a service
US20130124669A1 (en) * 2011-11-10 2013-05-16 Eric Paul Anderson System for monitoring eleastic cloud-based computing systems as a service
US9053185B1 (en) 2012-04-30 2015-06-09 Google Inc. Generating a representative model for a plurality of models identified by similar feature data
US8914500B1 (en) 2012-05-21 2014-12-16 Google Inc. Creating a classifier model to determine whether a network user should be added to a list
US8886575B1 (en) 2012-06-27 2014-11-11 Google Inc. Selecting an algorithm for identifying similar user identifiers based on predicted click-through-rate
US8874589B1 (en) 2012-07-16 2014-10-28 Google Inc. Adjust similar users identification based on performance feedback
US8782197B1 (en) * 2012-07-17 2014-07-15 Google, Inc. Determining a model refresh rate
US8886799B1 (en) 2012-08-29 2014-11-11 Google Inc. Identifying a similar user identifier
US9065727B1 (en) 2012-08-31 2015-06-23 Google Inc. Device identifier similarity models derived from online event signals
US9823990B2 (en) * 2012-09-05 2017-11-21 Nvidia Corporation System and process for accounting for aging effects in a computing device
US20140068298A1 (en) * 2012-09-05 2014-03-06 Nvidia Corporation System and process for accounting for aging effects in a computing device
US9588813B1 (en) * 2013-06-07 2017-03-07 Amazon Technologies, Inc. Determining cost of service call
US9600774B1 (en) * 2013-09-25 2017-03-21 Amazon Technologies, Inc. Predictive instance suspension and resumption
US9588832B2 (en) * 2013-10-01 2017-03-07 Samsung Sds Co., Ltd. Data preprocessing device and method associated with a failure risk level of a target system
CN104516808A (en) * 2013-10-01 2015-04-15 三星Sds株式会社 Data preprocessing device and method thereof
US20150095719A1 (en) * 2013-10-01 2015-04-02 Samsung Sds Co., Ltd. Data preprocessing device and method thereof
US9251034B2 (en) * 2013-11-25 2016-02-02 Comcast Cable Communications, Llc Device performance monitoring
US9960984B2 (en) 2013-11-25 2018-05-01 Comcast Cable Communications, Llc Device performance monitoring
US20150149850A1 (en) * 2013-11-25 2015-05-28 Comcast Cable Communication, Llc Device Performance Monitoring
US9712404B2 (en) * 2014-03-07 2017-07-18 Hitachi, Ltd. Performance evaluation method and information processing device
US20160127204A1 (en) * 2014-03-07 2016-05-05 Hitachi, Ltd. Performance evaluation method and information processing device
US20150281008A1 (en) * 2014-03-25 2015-10-01 Emulex Corporation Automatic derivation of system performance metric thresholds
CN107409075A (en) * 2015-03-24 2017-11-28 华为技术有限公司 Adaptive, anomaly detection based predictor for network time series data
EP3259881A4 (en) * 2015-03-24 2018-03-14 Huawei Technologies Co. Ltd. Adaptive, anomaly detection based predictor for network time series data
WO2016150395A1 (en) * 2015-03-24 2016-09-29 Huawei Technologies Co., Ltd. Adaptive, anomaly detection based predictor for network time series data
US20160285783A1 (en) * 2015-03-26 2016-09-29 Vmware, Inc. Methods and apparatus to control computing resource utilization of monitoring agents
US20160357232A1 (en) * 2015-06-05 2016-12-08 Apple Inc. Predictive control systems and methods
US10089165B2 (en) * 2016-04-06 2018-10-02 International Business Machines Corporation Monitoring data events using calendars
EP3333707A1 (en) * 2016-12-09 2018-06-13 British Telecommunications public limited company Autonomic method for managing a computing system

Also Published As

Publication number Publication date
CN103354924B (en) 2016-05-11
GB201307559D0 (en) 2013-06-12
GB2499535B (en) 2014-12-10
JP2014507727A (en) 2014-03-27
CN103354924A (en) 2013-10-16
US20130086431A1 (en) 2013-04-04
JP6025753B2 (en) 2016-11-16
GB2499535A (en) 2013-08-21
DE112012000797T5 (en) 2013-11-14
WO2012110918A1 (en) 2012-08-23

Similar Documents

Publication Publication Date Title
Kavulya et al. An analysis of traces from a production mapreduce cluster
Fu et al. Exploring event correlation for failure prediction in coalitions of clusters
Oliner et al. Carat: Collaborative energy diagnosis for mobile devices
US7730364B2 (en) Systems and methods for predictive failure management
US9392022B2 (en) Methods and apparatus to measure compliance of a virtual computing environment
US8447851B1 (en) System for monitoring elastic cloud-based computing systems as a service
US20130081005A1 (en) Memory Management Parameters Derived from System Modeling
Tan et al. Prepare: Predictive performance anomaly prevention for virtualized cloud systems
US20150081882A1 (en) System and method of alerting on ephemeral resources from an iaas provider
US8364813B2 (en) Administering incident pools for event and alert analysis
Dean et al. Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems
US8880943B2 (en) Restarting event and alert analysis after a shutdown in a distributed processing system
US8730816B2 (en) Dynamic administration of event pools for relevant event and alert analysis during event storms
US8392385B2 (en) Flexible event data content management for relevant event and alert analysis within a distributed processing system
CN104113585B (en) A method and apparatus for generating a state indicative of load balancing hardware-level interrupt
US8805999B2 (en) Administering event reporting rules in a distributed processing system
US9298525B2 (en) Adaptive fault diagnosis
US8713366B2 (en) Restarting event and alert analysis after a shutdown in a distributed processing system
US8904209B2 (en) Estimating and managing power consumption of computing devices using power models
US8756462B2 (en) Configurable alert delivery for reducing the amount of alerts transmitted in a distributed processing system
Bruneo et al. Workload-based software rejuvenation in cloud systems
US8688769B2 (en) Selected alert delivery in a distributed processing system
US9413773B2 (en) Method and apparatus for classifying and combining computer attack information
US9213621B2 (en) Administering event pools for relevant event analysis in a distributed processing system
Ibidunmoye et al. Performance anomaly detection and bottleneck identification

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ARNDT, KARLA K.;CAFFREY, JAMES M.;PATEL, KEYUR;AND OTHERS;SIGNING DATES FROM 20110117 TO 20110208;REEL/FRAME:025800/0914

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER