US20070225926A1 - Method and apparatus for quantitatively determining severity of degradation in a signal - Google Patents

Publication number
US20070225926A1
US20070225926A1 US11/389,578 US38957806A US2007225926A1 US 20070225926 A1 US20070225926 A1 US 20070225926A1 US 38957806 A US38957806 A US 38957806A US 2007225926 A1 US2007225926 A1 US 2007225926A1
Authority
US
United States
Prior art keywords
signal
slope
value
degradation
cumulative function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US11/389,578
Other versions
US7269536B1 (en)
Inventor
Kenny Gross
Keith Whisnant
Gregory Cumberford
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle America Inc
Original Assignee
Sun Microsystems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems Inc
Priority to US11/389,578 (US7269536B1)
Assigned to SUN MICROSYSTEMS, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GROSS, KENNY C., WHISNANT, KEITH A., CUMBERFORD, GREGORY A.
Application granted
Publication of US7269536B1
Publication of US20070225926A1
Assigned to Oracle America, Inc. MERGER AND CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: Oracle America, Inc., ORACLE USA, INC., SUN MICROSYSTEMS, INC.
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01R MEASURING ELECTRIC VARIABLES; MEASURING MAGNETIC VARIABLES
    • G01R31/00 Arrangements for testing electric properties; Arrangements for locating electric faults; Arrangements for electrical testing characterised by what is being tested not provided for elsewhere
    • G01R31/28 Testing of electronic circuits, e.g. by signal tracer
    • G01R31/317 Testing of digital circuits
    • G01R31/31708 Analysis of signal quality

Definitions

  • the system determines the severity of degradation in the signal from the shape of the MCF curve (step 310 ).
  • the system determines the severity of degradation from the shape of the MCF curve by computing the slope of the MCF curve, wherein an increase in the slope of the MCF curve indicates an increasing severity of degradation in the signal.
  • one embodiment of the present invention computes the slope of the MCF curve using a predetermined window size, which contains a predetermined number of successive data values. This computation can be performed using a linear interpolation or a linear regression using these data values. Note that the number of successive data values used to compute the slope should be carefully chosen. When a larger number is used, the computation can reduce the effect of noisiness in the MCF curve but can lose some responsiveness. On the other hand, when a smaller number is used, the computation result is more instantaneous but will lose some smoothness. It is therefore desirable to constantly adjust the number of data values used to compute the slope based on the frequency of the SPRT alarms, wherein the number can be gradually reduced as the frequency increases.
  • the degradation in a signal can show up in different forms which would result in different behaviors in the MCF curve and the associated slope of the MCF curve.
  • different forms of degradation will cause the MCF curve to show two types of slope behavior: (1) the slope increases continuously with time/observations; or (2) the slope increases abruptly from a smaller value to a larger value and remains at the larger value.
  • FIG. 4A illustrates two phases of degradation in a signal with different degrees of severity in accordance with an embodiment of the present invention. Note that the first phase of the degradation 402 occurs around 2000 to 3000 observations with a higher degree of severity (a more rapid drift upward), whereas the second phase of the degradation 404 occurs around 6000 to 8000 observations with a lower degree of severity (a less rapid drift upward).
  • FIG. 4B illustrates the corresponding MCF curve of the signal in FIG. 4A in accordance with an embodiment of the present invention.
  • In FIG. 4B, there is a concurrent first phase of slope increase around 2000 to 3000 observations.
  • the MCF curve demonstrates a normal linear behavior
  • the signal returns to normal, and hence the slope of the corresponding MCF curve returns to the same small constant value as before 2000 observations.
  • the slope of the MCF curve in FIG. 4B again increases continuously, indicating the degradation is reoccurring.
  • the slopes between 6000 and 8000 observations increase at a slower rate than the rate between 2000 and 3000 observations. Note that the slope can be used as a quantitative metric for the degree of severity of the degradation.
  • FIG. 5A illustrates a step function degradation in a signal in accordance with an embodiment of the present invention.
  • the step function degradation 500 jumps up to a risky level abruptly and remains at the risky level.
  • FIG. 5B illustrates the corresponding MCF curve of the signal in FIG. 5A in accordance with an embodiment of the present invention.
  • the slope increases abruptly from a smaller value to a significantly larger value at around 4000 observations, which is when the step function degradation 500 in the signal kicks in. The slope then remains at the larger value until the end of degradation 500 at around 6000 observations, and drops back down to the same smaller value for the signal before degradation 500.
  • the slope of the MCF curve provides a quantitative metric associated with the degree of degradation or “risk” for the monitored system.
  • the system is subject to the dependency on the magnitude, noisiness, or units of the original telemetry signals.
  • the advantage of integrating an MCF approach with a SPRT alarm frequency is that the slope of the MCF curve removes any dependency on the magnitude, noisiness, or units for the original signal under surveillance, and provides a dimensionless, quantitative metric for the degree of severity in the original signal.
  • the slope of the MCF curve can be computed and analyzed automatically, thereby freeing humans from the tedious task of monitoring the telemetry signals for the appearance of degradation.

Landscapes

  • Physics & Mathematics (AREA)
  • Nonlinear Science (AREA)
  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Testing And Monitoring For Control Systems (AREA)

Abstract

One embodiment of the present invention provides a system that determines a severity of degradation in a signal. During operation, the system receives signal values for the signal, wherein the signal values are received with a constant sampling interval. Next, for each received signal value, the system applies a Sequential Probability Ratio Test (SPRT) to the signal value. If the SPRT generates an alarm on the signal value, the system increments a cumulative counter which records a running total number of the SPRT alarms. Upon receiving each signal value, the system updates a cumulative function using a value in the cumulative counter. Next, the system determines the severity of degradation in the signal from the shape of the cumulative function.

Description

  • BACKGROUND
  • 1. Field of the Invention
  • The present invention relates to techniques for proactively detecting impending problems in computer systems. More specifically, the present invention relates to a method and an apparatus for quantitatively determining the severity of degradation in a signal in a computer system.
  • 2. Related Art
  • Modern computer server systems are typically equipped with a significant number of sensors which monitor signals during the operation of the computer systems. Results from this monitoring process can be used to generate time series data for these signals which can subsequently be analyzed to determine how well a computer system is operating. One particularly useful application of this analysis process is for “proactive fault-monitoring,” to identify leading indicators of component or system failures before the failures actually occur.
  • Unfortunately, all existing proactive fault-monitoring systems have a serious limitation: they can only indicate that there are anomalies in the monitored signals, but provide no information on the degree or the severity of the degradation. For example, existing proactive fault-monitoring systems can flag a component of a system as either at risk or not at risk, but cannot determine the level of the risk.
  • However, it is of tremendous interest to service engineers to know the degree or severity of degradation in the monitored systems. A quantitative indicator of the amount of degradation allows the service engineer to make appropriate decisions based on the actual health of the system with high confidence. For example, if a system is scheduled for shutdown for preventive maintenance on Saturday night and a warning flag is generated on Friday afternoon, it would be extremely beneficial for the service engineer to know if the detected degradation is of extremely low severity, so that the system can be allowed to operate safely until the scheduled outage time. On the other hand, if there is no scheduled shutdown in the near future and a warning flag is generated, the service engineer may want to shut down the system immediately if he/she knows that the severity of the detected degradation is extremely high.
  • Hence, what is needed is a method and an apparatus for quantitatively determining the severity of degradation in a signal when the degradation is detected.
  • SUMMARY
  • One embodiment of the present invention provides a system that determines a severity of degradation in a signal. During operation, the system receives signal values for the signal, wherein the signal values are received with a constant sampling interval. Next, for each received signal value, the system applies a Sequential Probability Ratio Test (SPRT) to the signal value. If the SPRT generates an alarm on the signal value, the system increments a cumulative counter which records a running total number of the SPRT alarms. Upon receiving each signal value, the system updates a cumulative function using a value in the cumulative counter. Next, the system determines the severity of degradation in the signal from the shape of the cumulative function.
  • In a variation on this embodiment, the system determines the severity of degradation in the signal from the shape of the cumulative function by computing the slope of the cumulative function.
  • In a further variation on this embodiment, the slope of the cumulative function indicates the degree of severity of degradation in the signal.
  • In a further variation on this embodiment, an increase in the slope of the cumulative function indicates an increasing severity of degradation in the signal.
  • In a further variation on this embodiment, the system computes the slope of the cumulative function by: (1) selecting a predetermined number of successive data values in the cumulative function; and (2) computing the slope using the predetermined number of successive data values.
  • In a further variation on this embodiment, if the signal is degrading, the slope of the cumulative function: (1) increases continuously with time or observations; or (2) increases abruptly from a smaller value to a larger value and remains at the larger value.
  • In a variation on this embodiment, if the signal is not degrading, the cumulative function changes linearly with received signal values.
  • In a variation on this embodiment, if the SPRT does not generate an alarm on the signal value, the cumulative function value does not change.
  • BRIEF DESCRIPTION OF THE FIGURES
  • FIG. 1 illustrates a real-time telemetry system in accordance with an embodiment of the present invention.
  • FIG. 2A illustrates an exemplary plot of an Inter-Arrival Time (IAT) as a function of a cumulative number of SPRT alarms for a monitored signal with no degradation in accordance with an embodiment of the present invention.
  • FIG. 2B illustrates the associated mean cumulative function (MCF) for the signal represented in FIG. 2A in accordance with an embodiment of the present invention.
  • FIG. 3 presents a flowchart illustrating the process of determining the severity of degradation in a signal in accordance with an embodiment of the present invention.
  • FIG. 4A illustrates two phases of degradation in a signal with different degrees of severity in accordance with an embodiment of the present invention.
  • FIG. 4B illustrates the corresponding MCF curve of the signal in FIG. 4A in accordance with an embodiment of the present invention.
  • FIG. 5A illustrates a step function degradation in a signal in accordance with an embodiment of the present invention.
  • FIG. 5B illustrates the corresponding MCF curve of the signal in FIG. 5A in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION
  • The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
  • The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs).
  • Real-Time Telemetry System
  • FIG. 1 illustrates a real-time telemetry system 100 in accordance with an embodiment of the present invention. Real-time telemetry system 100 contains server 102. Server 102 can generally include any computational node including a mechanism for servicing requests from a client for computational and/or data storage resources. In the present embodiment, server 102 is a uniprocessor or multiprocessor server that is being monitored by real-time telemetry system 100.
  • Note that the present invention is not limited to the computer server system illustrated in FIG. 1. In general, the present invention can be applied to any type of computer system. This includes, but is not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a personal organizer, a device controller, and a computational engine within an appliance.
  • Real-time telemetry system 100 also contains telemetry device 104, which gathers telemetry signals 106 from the various sensors and monitoring tools within server 102, and directs telemetry signals 106 to a local or a remote location that contains fault-detecting tool 108.
  • Note that telemetry signals 106 gathered by telemetry device 104 can include signals associated with physical and/or software performance parameters measured through sensors within the computer system. The physical parameters can include, but are not limited to: distributed temperatures within the computer system, relative humidity, cumulative or differential vibrations within the computer system, fan speed, acoustic signals, currents, voltages, time-domain reflectometry (TDR) readings, and miscellaneous environmental variables. The software parameters can include, but are not limited to: load metrics, CPU utilization, idle time, memory utilization, disk activity, transaction latencies, and other performance metrics reported by the operating system.
  • Fault-detecting tool 108 monitors and analyzes telemetry signals 106 in real time. Specifically, fault-detecting tool 108 detects anomalies in telemetry signals 106 and predicts probabilities of faults and failures in server 102. In one embodiment of the present invention, fault-detecting tool 108 is a Continuous System Telemetry Harness (CSTH). In one embodiment of the present invention, the CSTH performs a Sequential Probability Ratio Test (SPRT) on telemetry signals 106. Note that the SPRT provides a technique for monitoring noisy process variables and detecting the incipience or onset of anomalies in such processes with high sensitivity. In one embodiment of the present invention, telemetry device 104 and fault-detecting tool 108 are both embedded in server 102, which is being monitored.
  • SPRT and False Alarm Probability (FAP)
  • One embodiment of the present invention uses a SPRT to analyze monitored telemetry signals from a system. The SPRT is a binary hypothesis test that analyzes process observations sequentially to determine whether or not the signal is consistent with normal behavior. When the SPRT reaches a decision about current process behavior (i.e., the signal is behaving normally or abnormally), it reports the decision and continues to process observations. In particular, the SPRT generates warning flags/alarms when anomalies are detected in the monitored signals.
  • Note that the SPRT can generate alarms even when the monitored signals contain no degradation. In such a case, the frequency of SPRT alarms is typically very low and less than a pre-assigned “false alarm probability” (FAP). The FAP specifies the probability of making a failure hypothesis when in fact a non-failure hypothesis holds. Note that the FAP cannot be zero, for mathematical reasons.
  • False alarms do not present any problem as long as their frequency is smaller than the FAP, which is specified when initializing the SPRT. However, when the frequency of SPRT alarms exceeds the FAP, a problem is signaled for the monitored component, system, or process. For example, when the FAP is set to 0.01, about 1 out of 100 observations, on average, will produce a false alarm. When the frequency of SPRT alarms is more than 0.01, this indicates that there is a problem in the monitored component, system, or process.
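  • The text above does not spell out the internals of the SPRT. As a minimal sketch only, the following assumes Wald's classic formulation for a Gaussian signal with known standard deviation, testing a normal mean against a degraded mean. The function name `sprt_alarms`, its parameters, and the policy of restarting the test after each decision are illustrative assumptions, not the implementation described in this document. Here `alpha` plays the role of the pre-assigned FAP.

```python
import math


def sprt_alarms(values, mean0, mean1, sigma, alpha=0.01, beta=0.01):
    """Return one boolean per value: True where the SPRT raises an alarm.

    Assumed model (not from the source): Gaussian noise with known sigma,
    H0: mean == mean0 (normal), H1: mean == mean1 (degraded).
    alpha is the false alarm probability, beta the missed alarm probability.
    """
    upper = math.log((1.0 - beta) / alpha)  # crossing it accepts H1: alarm
    lower = math.log(beta / (1.0 - alpha))  # crossing it accepts H0: normal
    llr = 0.0  # running log-likelihood ratio
    alarms = []
    for x in values:
        # Log-likelihood ratio increment for a Gaussian mean shift.
        llr += (mean1 - mean0) * (x - (mean0 + mean1) / 2.0) / sigma ** 2
        if llr >= upper:
            alarms.append(True)
            llr = 0.0  # report the decision, then keep processing
        else:
            alarms.append(False)
            if llr <= lower:
                llr = 0.0  # "normal" decision reached; restart the test
    return alarms
```

Under these assumptions, a healthy signal should raise alarms at a rate below `alpha`, while a sustained shift toward `mean1` should raise them far more often.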
  • Inter-Arrival Time (IAT)
  • The time between successive SPRT alarms is referred to as the inter-arrival time (IAT). The IAT is an exponentially-distributed random variable when there is no degradation in the monitored signal. Note that the IAT can be measured in different time scales (e.g., seconds, minutes, or hours), depending upon the sampling rate of the monitored signal. Moreover, IAT measurement is not limited to time. The distance between successive SPRT alarms can also be measured in terms of: number of cycles, number of incidents, or number of observations. FIG. 2A illustrates an exemplary plot of an IAT as a function of a cumulative number of SPRT alarms for a monitored signal with no degradation in accordance with an embodiment of the present invention. The y-value of each point in FIG. 2A represents the number of observations between successive SPRT alarms (202), which follows a random process. The horizontal axis of FIG. 2A represents the cumulative number of SPRT alarms (204).
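  • As an illustration, the IAT measured in number of observations can be derived from a per-observation alarm sequence as sketched below. The helper name `inter_arrival_times` and the boolean-list input format are assumptions made for this example.

```python
def inter_arrival_times(alarms):
    """Number of observations between each pair of successive alarms.

    `alarms` is assumed to hold one boolean per observation,
    True where the SPRT raised an alarm.
    """
    alarm_indices = [i for i, fired in enumerate(alarms) if fired]
    # Difference between each alarm position and the one before it.
    return [later - earlier
            for earlier, later in zip(alarm_indices, alarm_indices[1:])]


# Alarms at observations 1, 4, 5 and 7 give three inter-arrival times.
iats = inter_arrival_times(
    [False, True, False, False, True, True, False, True])
print(iats)  # [3, 1, 2]
```

If alarms occur at a roughly constant rate r per observation, the mean of these values is approximately 1/r, so shrinking IATs correspond to a rising alarm frequency.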
  • Mean Cumulative Function (MCF)
  • We introduce the “Mean Cumulative Function” (MCF), which represents the cumulative number of SPRT alarms as a function of time or number of observations. To compute a MCF, one only needs to keep track of a running total number of the SPRT alarms for each new observation or sampling time. If a SPRT alarm is generated for a newly received sample value, the MCF is incremented by one. Otherwise, the MCF maintains its previous value for this sample value.
  • FIG. 2B illustrates the associated MCF for the SPRT alarms represented in FIG. 2A in accordance with an embodiment of the present invention. The vertical axis represents the cumulative number of SPRT alarms (204) and the horizontal axis represents time or sequence of observations (206). Note that for the signal in FIG. 2A (which has no apparent degradation), the associated IAT follows a random process, while the associated MCF versus time/observation plot changes linearly with time/observation (see also “Applied Reliability,” 2nd Edition, Chapter 10, Tobias, P. A., and Trindade, D. C., New York: Van Nostrand Reinhold, 1995). Consequently, the slope of the MCF curve for a signal with no degradation is nearly constant.
  • On the other hand, if degradation suddenly appears in a monitored signal, the frequency of the SPRT alarms starts increasing dramatically, which in turn causes the MCF value to increase rapidly. As a result, the slope of the MCF curve, which measures the rate of the MCF change with time/observation, increases as well. Hence, the slope of an MCF curve can provide a quantitative measure of the frequency of SPRT alarms, which can be used as an indicator of the degree of severity of degradation in the original monitored signal.
  • Determine the Severity of Degradation in a Signal
  • FIG. 3 presents a flowchart illustrating the process of determining the severity of degradation in a signal in accordance with an embodiment of the present invention.
  • The process starts by receiving a signal, wherein the signal values are received with a constant sampling interval (step 300).
  • Next, for each received signal value, the process applies the SPRT to the signal value (step 302).
  • The system next determines if the SPRT generates an alarm on the signal value (step 304). If so, the system increments an associated MCF value which keeps track of a running total number of the SPRT alarms (step 306). If the SPRT does not generate an alarm on the signal value, the MCF value for the current signal value assumes the previous MCF value computed for the previous signal value (step 308). The system then updates a MCF curve for the received signal value using the MCF value (step 309).
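  • Steps 300 through 309 can be sketched as a small streaming tracker. This is an illustration under stated assumptions rather than the disclosed implementation: the class name `McfTracker` is hypothetical, and `raises_alarm` is a hypothetical callable that stands in for the SPRT and returns True when an alarm is generated for a signal value.

```python
class McfTracker:
    """Maintain the MCF curve as signal values arrive one at a time."""

    def __init__(self, raises_alarm):
        # Hypothetical stand-in for the SPRT: value -> bool.
        self.raises_alarm = raises_alarm
        self.counter = 0   # running total of SPRT alarms
        self.curve = []    # MCF value recorded for each received signal value

    def update(self, value):
        # Steps 302 and 304: apply the test and check for an alarm.
        if self.raises_alarm(value):
            # Step 306: increment the running total.
            self.counter += 1
        # Step 308: with no alarm, the counter keeps its previous value.
        # Step 309: update the MCF curve for this signal value.
        self.curve.append(self.counter)
        return self.counter
```

  • A simple threshold such as `lambda v: v > 3.0` can be passed as `raises_alarm` to exercise the tracker, although such a threshold is only a placeholder and is not itself a SPRT.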
  • Next, the system determines the severity of degradation in the signal from the shape of the MCF curve (step 310). In one embodiment of the present invention, the system determines the severity of degradation from the shape of the MCF curve by computing the slope of the MCF curve, wherein an increase in the slope of the MCF curve indicates an increasing severity of degradation in the signal.
  • Note that because the IAT in time/observations between successive SPRT alarms can be noisy, the associated MCF curve can also appear “choppy” in response. In order to reduce the effect of noisiness in the MCF curve, one embodiment of the present invention computes the slope of the MCF curve using a predetermined window size, which contains a predetermined number of successive data values. This computation can be performed using a linear interpolation or a linear regression using these data values. Note that the number of successive data values used to compute the slope should be carefully chosen. When a larger number is used, the computation can reduce the effect of noisiness in the MCF curve but can lose some responsiveness. On the other hand, when a smaller number is used, the computation result is more instantaneous but will lose some smoothness. It is therefore desirable to constantly adjust the number of data values used to compute the slope based on the frequency of the SPRT alarms, wherein the number can be gradually reduced as the frequency increases.
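  • A windowed slope computation of this kind can be sketched with an ordinary least-squares fit over the most recent MCF values. The function names, the `target_alarms` sizing rule, and the default bounds below are assumptions made for this example. The text above only states that the window can shrink as the alarm frequency increases, and does not prescribe a particular rule.

```python
def windowed_slope(mcf, window):
    """Least-squares slope of the last `window` MCF values versus index."""
    if window < 2 or len(mcf) < window:
        raise ValueError("window must be >= 2 and no longer than the curve")
    ys = mcf[-window:]
    x_mean = (window - 1) / 2
    y_mean = sum(ys) / window
    sxx = sum((x - x_mean) ** 2 for x in range(window))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    return sxy / sxx


def adaptive_window(slope, target_alarms=20, min_window=50, max_window=2000):
    """Hypothetical rule: size the window to span about `target_alarms` alarms.

    A larger slope (more frequent alarms) gives a smaller window, which is
    clamped to the range [min_window, max_window].
    """
    if slope <= 0:
        return max_window
    return max(min_window, min(max_window, round(target_alarms / slope)))
```

  • Under these assumed defaults, a slope of 0.01 alarms per observation maps to a window of 2000 values, while a slope of 0.1 maps to a window of 200 values.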
  • Note that degradation in a signal can show up in different forms, which result in different behaviors in the MCF curve and in the associated slope of the MCF curve. In general, however, the different forms of degradation cause the MCF curve to show one of two types of slope behavior: (1) the slope increases continuously with time/observations; or (2) the slope increases abruptly from a smaller value to a larger value and remains at the larger value.
  • FIG. 4A illustrates two phases of degradation in a signal with different degrees of severity in accordance with an embodiment of the present invention. Note that the first phase of the degradation 402 occurs around 2000 to 3000 observations with a higher degree of severity (a more rapid drift upward), whereas the second phase of the degradation 404 occurs around 6000 to 8000 observations with a lower degree of severity (a less rapid drift upward).
  • FIG. 4B illustrates the corresponding MCF curve of the signal in FIG. 4A in accordance with an embodiment of the present invention. Note that in FIG. 4B there is a concurrent first phase of slope increase around 2000 to 3000 observations. Before 2000 observations, the MCF curve demonstrates a normal linear behavior, and after 3000 observations, the signal returns to normal, and hence the slope of the corresponding MCF curve returns to the same small constant value as before 2000 observations. During the second phase of the degradation, the slope of the MCF curve in FIG. 4B again increases continuously, indicating that the degradation is recurring. However, the slopes between 6000 and 8000 observations increase at a slower rate than the rate between 2000 and 3000 observations. Note that the slope can be used as a quantitative metric for the degree of severity of the degradation.
  • FIG. 5A illustrates a step function degradation in a signal in accordance with an embodiment of the present invention. Instead of a gradual but increasing degradation as shown in FIG. 4A, the step function degradation 500 jumps up to a risky level abruptly and remains at the risky level. FIG. 5B illustrates the corresponding MCF curve of the signal in FIG. 5A in accordance with an embodiment of the present invention. As seen in FIG. 5B, the slope increases abruptly from a smaller value to a significantly larger value at around 4000 observations, which is when the step function degradation 500 in the signal kicks in. The slope then remains at the larger value until the end of degradation 500 at around 6000 observations, and drops back down to the same smaller value for the signal before degradation 500. Once again, the slope of the MCF curve provides a quantitative metric associated with the degree of degradation or “risk” for the monitored system.
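  • The abrupt slope change described for a step function degradation can be reproduced on synthetic data. The sketch below does not use the data behind FIG. 5A or FIG. 5B. The alarm pattern (one alarm every 100 observations when normal, one every 5 observations between observations 4000 and 6000) and all function names are assumptions chosen only to make the behavior visible.

```python
def synthetic_alarms(n=8000, start=4000, end=6000,
                     normal_every=100, degraded_every=5):
    """Deterministic, made-up alarm flags with a degraded interval."""
    return [
        (i % degraded_every == 0) if start <= i < end
        else (i % normal_every == 0)
        for i in range(n)
    ]


def segment_slope(mcf, lo, hi):
    """Average MCF slope (alarms per observation) between two observations."""
    return (mcf[hi] - mcf[lo]) / (hi - lo)


# Build the MCF curve as a running total of alarms.
mcf = []
total = 0
for alarm in synthetic_alarms():
    total += alarm
    mcf.append(total)

before = segment_slope(mcf, 0, 3999)     # small, nearly constant slope
during = segment_slope(mcf, 4000, 5999)  # abruptly larger slope
after = segment_slope(mcf, 6000, 7999)   # back near the original slope
```

  • In this synthetic case the slope during the degraded interval is roughly twenty times the slope before and after it, which illustrates how the slope can act as a quantitative severity metric.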
  • Note that when a fault-detection system attempts to establish criteria for detecting degradation based directly on the original telemetry signals, those criteria generally depend on the magnitude, noisiness, or units of the original telemetry signals. The advantage of integrating an MCF approach with a SPRT alarm frequency is that the slope of the MCF curve removes any dependency on the magnitude, noisiness, or units of the original signal under surveillance, and provides a dimensionless, quantitative metric for the degree of severity of degradation in the original signal. Furthermore, the slope of the MCF curve can be computed and analyzed automatically, thereby freeing humans from the tedious task of monitoring the telemetry signals for the appearance of degradation.
  • Note that we have assumed that a departure from stationarity in a signal is an indication of the degradation, which is the case for many monitored telemetry signals in computing systems. Moreover, we have assumed that the farther the signal deviates from its nominal value and the faster it departs from its nominal value, the more severe the degradation is.
  • The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.

Claims (18)

1. A method for determining a severity of degradation in a signal, comprising:
receiving signal values for the signal, wherein the signal values are received with a constant sampling interval;
for each received signal value,
applying a Sequential Probability Ratio Test (SPRT) to the signal value;
if the SPRT generates an alarm on the signal value, incrementing a cumulative counter which records a running total number of the SPRT alarms for the signal; and
updating a cumulative function for the received signal value using a value in the cumulative counter; and
computing the slope of the cumulative function; and
determining the severity of degradation in the signal from the computed slope of the cumulative function,
wherein the slope of the cumulative function indicates the degree of severity of degradation in the signal.
2-3. (canceled)
4. The method of claim 1, wherein an increase in the slope of the cumulative function indicates an increasing severity of degradation in the signal.
5. The method of claim 1, wherein computing the slope of the cumulative function involves:
selecting a predetermined number of successive data values in the cumulative function; and
computing the slope using the predetermined number of successive data values.
6. The method of claim 1, wherein if the signal is degrading, the slope of the cumulative function:
increases continuously with time or observations; or
increases abruptly from a smaller value to a larger value and remains at the larger value.
7. The method of claim 1, wherein if the signal is not degrading, the cumulative function changes linearly with received signal values.
8. The method of claim 1, wherein if the SPRT does not generate an alarm on the signal value, the cumulative function value does not change.
9. A computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for determining a severity of degradation in a signal, comprising:
receiving signal values for the signal, wherein the signal values are received with a constant sampling interval;
for each received signal value,
applying a Sequential Probability Ratio Test (SPRT) to the signal value;
if the SPRT generates an alarm on the signal value, incrementing a cumulative counter which records a running total number of the SPRT alarms for the signal; and
updating a cumulative function for the received signal value using a value in the cumulative counter;
computing the slope of the cumulative function; and
determining the severity of degradation in the signal from the computed slope of the cumulative function,
wherein the slope of the cumulative function indicates the degree of severity of degradation in the signal.
10-11. (canceled)
12. The computer-readable storage medium of claim 9, wherein an increase in the slope of the cumulative function indicates an increasing severity of degradation in the signal.
13. The computer-readable storage medium of claim 9, wherein computing the slope of the cumulative function involves:
selecting a predetermined number of successive data values in the cumulative function; and
computing the slope using the predetermined number of successive data values.
14. The computer-readable storage medium of claim 9, wherein if the signal is degrading, the slope of the cumulative function:
increases continuously with time or observations; or
increases abruptly from a smaller value to a larger value and remains at the larger value.
15. The computer-readable storage medium of claim 9, wherein if the signal is not degrading, the cumulative function changes linearly with received signal values.
16. The computer-readable storage medium of claim 9, wherein if the SPRT does not generate an alarm on the signal value, the cumulative function value does not change.
17. An apparatus that determines a severity of degradation in a signal, comprising:
a receiving mechanism configured to receive signal values for the signal, wherein the signal values are received with a constant sampling interval;
a SPRT mechanism configured to apply a Sequential Probability Ratio Test (SPRT) to each received signal value;
wherein if the SPRT generates an alarm on the received signal value, the SPRT mechanism is configured to increment a cumulative counter which records a running total number of the SPRT alarms;
an updating mechanism configured to update a cumulative function for the received signal value using a value in the cumulative counter;
a computing mechanism configured to compute the slope of the cumulative function; and
a determination mechanism configured to determine the severity of degradation in the signal from the computed slope of the cumulative function,
wherein the slope of the cumulative function indicates the degree of severity of degradation in the signal.
18-19. (canceled)
20. The apparatus of claim 17, wherein
an increase in the slope of the cumulative function indicates an increasing severity of degradation in the signal.
21. The apparatus of claim 17, wherein while computing the slope of the cumulative function, the determination mechanism is configured to:
select a predetermined number of successive data values in the cumulative function; and to
compute the slope using the predetermined number of successive data values.

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/389,578 US7269536B1 (en) 2006-03-23 2006-03-23 Method and apparatus for quantitatively determining severity of degradation in a signal

Publications (2)

Publication Number Publication Date
US7269536B1 (en) 2007-09-11
US20070225926A1 (en) 2007-09-27

Family

ID=38473329

Family Applications (1)

Application Number Title Priority Date Filing Date
US11/389,578 Active US7269536B1 (en) 2006-03-23 2006-03-23 Method and apparatus for quantitatively determining severity of degradation in a signal

Country Status (1)

Country Link
US (1) US7269536B1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090232007A1 (en) * 2008-03-17 2009-09-17 Comcast Cable Holdings, Llc Method for detecting video tiling
US20110134918A1 (en) * 2008-03-17 2011-06-09 Comcast Cable Communications, Llc Representing and Searching Network Multicast Trees

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11307569B2 (en) 2019-02-21 2022-04-19 Oracle International Corporation Adaptive sequential probability ratio test to facilitate a robust remaining useful life estimation for critical assets
CN112183344B (en) * 2020-09-28 2021-06-01 广东石油化工学院 Large unit friction fault analysis method and system based on waveform and dimensionless learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060036403A1 (en) * 2001-04-10 2006-02-16 Smartsignal Corporation Diagnostic systems and methods for predictive condition monitoring

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090232007A1 (en) * 2008-03-17 2009-09-17 Comcast Cable Holdings, Llc Method for detecting video tiling
US20110134918A1 (en) * 2008-03-17 2011-06-09 Comcast Cable Communications, Llc Representing and Searching Network Multicast Trees
US8259594B2 (en) * 2008-03-17 2012-09-04 Comcast Cable Holding, Llc Method for detecting video tiling
US8599725B2 (en) 2008-03-17 2013-12-03 Comcast Cable Communications, Llc Representing and searching network multicast trees
US9130830B2 (en) 2008-03-17 2015-09-08 Comcast Cable Holdings, Llc Method for detecting video tiling
US9160628B2 (en) 2008-03-17 2015-10-13 Comcast Cable Communications, Llc Representing and searching network multicast trees
US9769028B2 (en) 2008-03-17 2017-09-19 Comcast Cable Communications, Llc Representing and searching network multicast trees

Also Published As

Publication number Publication date
US7269536B1 (en) 2007-09-11

Similar Documents

Publication Publication Date Title
US7577542B2 (en) Method and apparatus for dynamically adjusting the resolution of telemetry signals
US7975175B2 (en) Risk indices for enhanced throughput in computing systems
US7571347B2 (en) Method and apparatus for providing fault-tolerance in parallel-processing systems
US8340923B2 (en) Predicting remaining useful life for a computer system using a stress-based prediction technique
US7890813B2 (en) Method and apparatus for identifying a failure mechanism for a component in a computer system
US7870440B2 (en) Method and apparatus for detecting multiple anomalies in a cluster of components
US9292473B2 (en) Predicting a time of failure of a device
US7162393B1 (en) Detecting degradation of components during reliability-evaluation studies
US9152530B2 (en) Telemetry data analysis using multivariate sequential probability ratio test
WO2008127909A2 (en) Using emi signals to facilitate proactive fault monitoring in computer systems
US11307569B2 (en) Adaptive sequential probability ratio test to facilitate a robust remaining useful life estimation for critical assets
US7949497B2 (en) Machine condition monitoring using discontinuity detection
US20110196820A1 (en) Robust Filtering And Prediction Using Switching Models For Machine Condition Monitoring
US7269536B1 (en) Method and apparatus for quantitatively determining severity of degradation in a signal
US7668696B2 (en) Method and apparatus for monitoring the health of a computer system
US20080010556A1 (en) Estimating the residual life of a software system under a software-based failure mechanism
Gross et al. Early detection of signal and process anomalies in enterprise computing systems.
US7249287B2 (en) Methods and apparatus for providing alarm notification
US8214693B2 (en) Damaged software system detection
US7085681B1 (en) Symbiotic interrupt/polling approach for monitoring physical sensors
US7191096B1 (en) Multi-dimensional sequential probability ratio test for detecting failure conditions in computer systems
CN115222278A (en) Intelligent inspection method and system for robot
CN108197717B (en) Equipment maintenance system and method signal-based
US11228606B2 (en) Graph-based sensor ranking
US7483816B2 (en) Length-of-the-curve stress metric for improved characterization of computer system reliability

Legal Events

Date Code Title Description
AS Assignment

Owner name: SUN MICROSYSTEMS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GROSS, KENNY C.;WHISNANT, KEITH A.;CUMBERFORD, GREGORY A.;REEL/FRAME:017689/0863;SIGNING DATES FROM 20060301 TO 20060307

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Free format text: PAYER NUMBER DE-ASSIGNED (ORIGINAL EVENT CODE: RMPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

AS Assignment

Owner name: ORACLE AMERICA, INC., CALIFORNIA

Free format text: MERGER AND CHANGE OF NAME;ASSIGNORS:ORACLE USA, INC.;SUN MICROSYSTEMS, INC.;ORACLE AMERICA, INC.;REEL/FRAME:037302/0843

Effective date: 20100212

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 12