US20090037777A1 - Use of operational configuration parameters to predict system failures - Google Patents
- Publication number
- US20090037777A1 (application US11/830,802)
- Authority
- US
- United States
- Prior art keywords
- configuration
- operational
- parameters
- parameter
- operational configuration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/008—Reliability or availability analysis
Definitions
- FIG. 1A shows an example of a system constructed in accordance with at least some illustrative embodiments
- FIG. 1B shows a block diagram of the system of FIG. 1A , constructed in accordance with at least some illustrative embodiments;
- FIGS. 2A and 2B show different memory and memory bus configurations, suitable for use with the system of FIGS. 1A and 1B , in accordance with at least some illustrative embodiments;
- FIGS. 3A, 3B and 3C show the flow of operational configuration data, collected, processed and used to identify systems at risk of future failures, in accordance with at least some illustrative embodiments;
- FIG. 4 shows a method for collecting and processing operational configuration data and for generating reference values, in accordance with at least some illustrative embodiments.
- FIG. 5 shows a method for using operational configuration data to identify systems at risk of future failures, in accordance with at least some illustrative embodiments.
- system refers to a collection of two or more hardware and/or software components, and may be used to refer to an electronic device, such as, for example, a computer, a portion of a computer, a combination of computers, etc.
- software includes any executable code capable of running on a processor, regardless of the media used to store the software. Thus, code stored in non-volatile memory, and sometimes referred to as “embedded firmware,” is included within the definition of software.
- software can include more than one program, and may be used to refer to a single program executing on a single processor, multiple programs executing on a single processor, and multiple programs executing on multiple processors.
- a timing and level window (sometimes referred to as an “eye specification” or “eye mask”) can be identified. This window allows the selection of optimal values that result in interface signals between the components that operate in the middle of the “eye,” i.e., that result in read operations with the least likelihood of failure when compared against other configurations of the interface signals.
- the calibration process described above is used to compensate for variations in the performance of electronic components that can result from variations in the manufacturing process of such components.
- the degree of compensation is indicative of the degree of variation of the performance of the component from a target performance. More significantly, an anomalous compensation value, i.e., one that deviates significantly from the norm as compared to other similar components, may be indicative of a defective component, even if the component does not fail a functional test. Such a defect may reflect a weakness in the structure of the component that may later result in an actual functional failure. Such a defect is sometimes referred to as a “latent” defect.
- FIGS. 1A and 1B show an illustrative computer system 100 suitable for providing the configuration parameter data needed to predict a failure of computer system 100 .
- the illustrative computer system 100 comprises a chassis 180 , a display 140 , and an input device 170 .
- the computer system 100 comprises a processing logic 102 , volatile storage 110 , and non-volatile storage 164 .
- Processing logic 102 may be implemented in hardware (e.g., a microprocessor), software (e.g., microcode), or a combination of hardware and software.
- Volatile storage 110 comprises a computer-readable medium such as random access memory (RAM).
- Non-volatile storage 164 comprises a computer-readable medium such as flash RAM, read-only memory (ROM), a hard disk drive, a floppy disk (e.g., floppy 194 ), a compact disk read-only memory (CD-ROM, e.g., CD 196 ), and combinations thereof.
- the computer-readable media of both volatile storage 110 and non-volatile storage 164 comprise, for example, software that is executed by processing logic 102 and provides the computer system with some or all of the functionality described herein.
- the computer system 100 also comprises a network interface (Net I/F) 162 that enables the computer system 100 to receive information via a local area network and/or a wired or wireless wide area network, represented in the example of FIG. 1A by Ethernet jack 192 .
- a video interface (Video I/F) 142 couples to the display 140 .
- a user interacts with the station via the input device 170 (e.g., a keyboard) and/or pointing device 172 (e.g., a mouse), which couples to a peripheral interface 168 .
- the display 140 together with the input device 170 and/or the pointing device 172 , may operate together as a user interface.
- Computer system 100 may be a bus-based computer, with a variety of busses interconnecting the various elements shown in FIG. 1B through a series of hubs or bridges, including memory controller hub (MCH) 104 (sometimes referred to as a “north bridge”) and interface controller hub (ICH) 106 (sometimes referred to as a “south bridge”).
- The busses of the illustrative example of FIG. 1B include: front-side bus 103 coupling processing logic 102 to MCH 104; accelerated graphics port (AGP) bus 141 coupling video interface 142 to MCH 104; peripheral component interconnect (PCI) bus 161 coupling network interface 162, non-volatile storage 164, peripheral interface 168 and ICH 106 to each other; PCI express (PCIe) bus 151 coupling one or more PCI express devices 152 to ICH 106; and memory bus 111 coupling MCH 104 to dual inline memory modules (DIMMs) 120 and 130 within volatile storage 110.
- the peripheral interface 168 accepts signals from the input device 170 and other input devices such as a pointing device 172 , and transforms the signals into a form suitable for communication on PCI bus 161 .
- the video interface 142 may comprise a graphics card or other suitable video interface that accepts information from the AGP bus 141 and transforms it into a form suitable for the display 140 .
- the processing logic 102 gathers information from other system elements, including input data from the peripheral interface 168 , and program instructions and other data from non-volatile storage 164 or volatile storage 110 , or from other systems (e.g., a server used to store and distribute copies of executable code) coupled to a local area network or a wide area network via the network interface 162 .
- the processing logic 102 executes the program instructions (e.g., collection software 200 ) and processes the data accordingly.
- the program instructions may further configure the processing logic 102 to send data to other system elements, such as information presented to the user via the video interface 142 and the display 140 .
- the network interface 162 enables the processing logic 102 to communicate with other systems via a network (e.g., the Internet).
- Volatile storage 110 may serve as a low-latency temporary store of information for the processing logic 102, and non-volatile storage 164 may serve as a long-term (but higher latency) store of information (e.g., extensible markup language (XML) database 210).
- the processing logic 102 operates in accordance with one or more programs stored on non-volatile storage 164 or received via the network interface 162 .
- the processing logic 102 may copy portions of the programs into volatile storage 110 for faster access, and may switch between programs or carry out additional programs in response to user actuation of the input device 170 .
- the additional programs may be retrieved from non-volatile storage 164 or may be retrieved or received from other locations via the network interface 162 .
- One or more of these programs executes on computer system 100 , causing the computer system to perform at least some functions disclosed herein.
- the memory bus 111 comprises a double data rate, version 2 (DDR2) memory bus, which couples MCH 104 to the individual memories 122 and 132 within DIMMs 120 and 130 through buffers 124 and 134, as shown in FIG. 2A.
- Buffers 124 and 134 within the DIMMs forward signals from DDR2 memory bus 111 to DDR2 busses 126 and 136 within DIMMs 120 and 130, respectively, and also conversely from DDR2 busses 126 and 136 to DDR2 memory bus 111.
- MCH 104 writes a series of training sequences to DIMMs 120 and 130, reading each sequence back multiple times, using different timing delays on the read strobe (e.g., the DQS signal generated by the DDR DIMMs) for each read of the data from the DIMMs. After a range of delay values for the strobe that result in error-free data being read back has been identified, a delay value in the middle of the window is selected as the configuration value for the read strobe.
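As a concrete illustration of this calibration step, the following minimal Python sketch scans candidate strobe delays, identifies the error-free window, and selects the delay in its middle. The delay range, the simulated passing window, and all function names are invented for illustration; in a real system the read-back check would exercise the DDR2 hardware rather than a lookup.

```python
# Illustrative sketch of read-strobe delay calibration (hypothetical values).
# read_back_ok() stands in for the hardware step of writing a training
# sequence and reading it back at a given strobe delay; here the passing
# window is simply simulated.

PASSING_WINDOW = range(9, 22)  # delays (in taps) that read back error-free

def read_back_ok(delay: int) -> bool:
    """Simulated check: True if the training sequence reads back correctly."""
    return delay in PASSING_WINDOW

def calibrate_read_strobe(delay_range=range(0, 32)) -> int:
    """Scan all candidate delays, find the error-free window, pick its middle."""
    passing = [d for d in delay_range if read_back_ok(d)]
    if not passing:
        raise RuntimeError("no error-free strobe delay found")
    # Select the delay in the middle of the passing window (the "eye").
    return passing[len(passing) // 2]

if __name__ == "__main__":
    print(calibrate_read_strobe())
```

The selected middle-of-window value is exactly the kind of operational configuration parameter that the disclosure later collects and compares across systems.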
- memory bus 111 comprises fully buffered DIMM (FBD) lanes that couple the MCH 104 of FIG. 1B to advanced memory buffer (AMB) chips 128 and 138, respectively, within DIMMs 120 and 130, as shown in FIG. 2B.
- the AMB chips 128 and 138 couple to each of the individual memories 122 and 132 of DIMMs 120 and 130 via DDR2 busses 126 and 136 as shown.
- AMB chips 128 and 138 also calibrate the DDR2 interface using training sequences as described above, and as a result an optimal operational configuration of the AMB interface to the individual memories 122 and 132 is also determined.
- Those of ordinary skill will recognize that other interfaces may similarly be calibrated using training sequences (e.g., the interface to PCI express devices 152 ), and all such interfaces are within the scope of the present disclosure.
- the parameters may be used as a basis for statistically determining a range of normal values for such parameters, what constitutes a statistically significant deviation (statistical outliers) relative to such a range of normal values, and what deviations correlate to future failures of the computer system 100 .
- the results of the statistical processing can be used to identify computer systems, such as computer system 100 that are at risk of future failures.
- FIGS. 3A through 3C illustrate how operational configuration parameters determined during the initialization of multiple computer systems 100 are collected within each computer system, are further collected from the computer system 100 and stored on another similar computer system (hereinafter, the “Aggregating System”), are statistically processed to determine ranges of normal operation as well as limits defining statistical outliers, and are used to generate reference values used to subsequently identify computer systems at risk of future failure.
- collection software 200 executes on processing logic 102 and collects one or more operational configuration parameters determined at initialization.
- the parameters that are collected are determined based upon the content of XML parser script 220 , which includes a list of the parameters to be collected by collection software 200 , information as to where the parameters can be located within the system (e.g., the PCI address of a register within MCH 104 of FIG. 1 ), and information describing the format of the data.
- the collected operational configuration parameter data, as well as the XML description of the data are saved within XML database (XML DB) 210 .
- collection software 200 does not need to include hard-coded descriptions of the collected parameters. If the format of the parameter data later changes, or if the list of parameters selected for collection changes, XML parser script 220 can be easily modified without the need to change collection software 200 .
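A minimal sketch of this data-driven collection scheme, assuming an invented XML schema: the embedded document plays the role of XML parser script 220, and a dictionary stands in for the system's configuration registers. The parameter names, addresses, offsets, and values are all hypothetical, not taken from the disclosure.

```python
# Hypothetical sketch of parser-script-driven collection: the collector has
# no hard-coded knowledge of the parameters; everything comes from the script.
import xml.etree.ElementTree as ET

PARSER_SCRIPT = """
<parameters>
  <parameter name="read_strobe_delay" address="0xB0:0x00.0" offset="0x62" format="uint8"/>
  <parameter name="vref_trim"         address="0xB0:0x00.0" offset="0x63" format="uint8"/>
</parameters>
"""

# Stand-in for reading device configuration registers (e.g., PCI config space).
REGISTERS = {("0xB0:0x00.0", "0x62"): 15, ("0xB0:0x00.0", "0x63"): 7}

def collect(script: str) -> dict:
    """Collect exactly the parameters named in the script."""
    collected = {}
    for p in ET.fromstring(script).findall("parameter"):
        value = REGISTERS[(p.get("address"), p.get("offset"))]
        collected[p.get("name")] = value
    return collected

if __name__ == "__main__":
    print(collect(PARSER_SCRIPT))
```

Because the parameter list, locations, and formats live entirely in the script, changing what is collected requires only editing the XML, not the collector, which is the flexibility the passage above describes.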
- Aggregating System 300 comprises an architecture similar to that of computer system 100 shown in FIG. 1B , but executes processing software 400 on processing logic 302 .
- XML database 410 on Aggregating System 300 stores the operational configuration parameters for multiple computer systems such as computer system 100 , as well as the results of one or more statistical analyses of the configuration data from multiple systems.
- processing software 400 of Aggregating System 300 communicates with computer system 100 by interacting with collection software 200 , thus gaining access to the operational configuration parameters associated with computer system 100 and stored in XML database 210 .
- processing software 400 interacts with other software to access XML database 210, and in still other illustrative embodiments processing software 400 accesses XML database 210 through one or more shared network file systems made accessible by an operating system (not shown) executing on computer system 100.
- the physical connection between computer system 100 and Aggregating System 300 may be a dedicated wired connection (e.g., a universal serial bus (USB) connection), a dedicated wireless connection (e.g., a Bluetooth connection), a wired network connection (e.g., an Ethernet connection), or a wireless network connection (e.g., a WiFi connection).
- the process of collecting operational configuration parameters shown in FIG. 3A , and of transferring the collected parameters to Aggregating System 300 as shown in FIG. 3B is repeated for multiple computer systems 100 .
- Statistical outliers and system failures are identified, tracked and saved within XML database 410 .
- a set of reference values 412 are created and saved on non-volatile storage 364 , as shown in FIG. 3C .
- the set of reference values comprises a statistical reference parameter value (e.g., the mean value of an operational configuration parameter), and a tolerance value above and below the reference parameter value.
- the tolerance value establishes a limit beyond which an anomalous operational configuration value will be treated as indicative of an unacceptable risk of a future system failure.
- the tolerance value may be based, for example, on such statistical measures as the standard deviation of a dataset of parameter values (e.g., a tolerance value of three-sigma referenced to the mean value of a parameter). If more than one operational configuration parameter is identified as a good predictor of future failures of the computer system, a subset of one or more parameters with better correlations than other parameters are selected as the parameters used in the comparison with the reference values.
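The statistics above can be sketched as follows; the sample population and the three-sigma tolerance are illustrative values, not data from the disclosure.

```python
# Sketch of deriving reference values from a population of collected
# parameter values: the mean serves as the statistical reference, and
# three standard deviations serve as the tolerance band.
from statistics import mean, stdev

# Invented strobe-delay values collected from many similar systems.
population = [14, 15, 15, 16, 14, 15, 16, 15, 14, 15]

def reference_values(values, n_sigma=3.0):
    """Return (reference, tolerance) as the mean and an n-sigma band."""
    return mean(values), n_sigma * stdev(values)

def is_outlier(value, reference, tolerance):
    """True if the value deviates from the reference by more than the tolerance."""
    return abs(value - reference) > tolerance

ref, tol = reference_values(population)
print(is_outlier(15, ref, tol))   # a typical system
print(is_outlier(25, ref, tol))   # an anomalous (at-risk) system
```

A system whose parameter lands outside the band is the "statistical outlier" the disclosure treats as indicative of an unacceptable risk of future failure.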
- a computer system 100 that generates at initialization one or more operational configuration parameter values that exceed the statistical norm by more than the tolerance values is identified as defective, and not, for example, shipped to the customer.
- the reference values are generated once from a sample set of computer systems 100 and used for all subsequent evaluations of computer systems 100 without further updates to the reference values.
- the reference values are periodically updated as new data is collected from computer systems 100 and the systems are evaluated.
- the reference values are periodically updated, but data is maintained separately for distinct production runs of the manufacture of computer systems 100 , or for a particular production facility. Other combinations of updating, combining and segregating reference values will become apparent to those of ordinary skill in the art, and all such techniques are within the scope of the present disclosure.
- the system can be evaluated in a non-invasive manner without interrupting the normal operation of computer system 100 , once it is shipped to the customer and placed into operation. This is due at least in part to the fact that it is not necessary to run dedicated test programs, nor to generate and/or write test patterns that can disrupt the configuration of the system and destroy information in memory.
- the operational configuration parameter values are determined automatically at power-up as part of the normal initialization of the system, without the need for additional user intervention. The data is generated at that time and is thus available for later collection as described herein.
- a technical service representative can evaluate the computer system 100 by establishing a remote connection between computer system 100 and Aggregating System 300. Once the remote connection is established, the collection software is executed on computer system 100 and the current operational configuration parameters are collected and sent to Aggregating System 300. An archived copy of the reference data for the original production run, of which computer system 100 was a part, is retrieved and used to determine if the computer system 100 is utilizing operational configuration parameter values that exceed the norm by more than the tolerance value, or if changes over time in the operational configuration parameter values (referenced to archived values for the particular computer system 100) exceed the norm for changes of the same parameter for the production group by more than a tolerance value established for the production group. In this manner, the technical service representative can determine if a latent failure is developing, allowing the system to be repaired or replaced before an actual failure occurs.
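The drift check described above can be sketched as follows, with invented parameter names and group data: a system's change relative to its own archived values is judged against the norm for changes across its production group.

```python
# Sketch of the over-time drift check: flag a parameter when this system's
# change (current minus archived) is anomalous compared to the changes
# observed across the production group. All values are illustrative.
from statistics import mean, stdev

def drift_outliers(current, archived, group_deltas, n_sigma=3.0):
    """Return the parameters whose change over time is anomalous for the group."""
    flagged = []
    for name, value in current.items():
        delta = value - archived[name]
        ref = mean(group_deltas[name])
        tol = n_sigma * stdev(group_deltas[name])
        if abs(delta - ref) > tol:
            flagged.append(name)
    return flagged

current  = {"read_strobe_delay": 21}   # value collected at the service call
archived = {"read_strobe_delay": 15}   # value archived at manufacture
# Typical per-system changes seen across the production group:
group_deltas = {"read_strobe_delay": [0, 1, -1, 0, 1, 0, -1, 1, 0, 0]}

print(drift_outliers(current, archived, group_deltas))
```

A drift of six taps against a group that typically drifts by at most one is flagged, suggesting a developing latent failure even though the system still functions.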
- FIG. 4 shows a method 500 for collecting and processing operational configuration parameter values of a computer system, and for generating one or more sets of reference values, in accordance with at least some illustrative embodiments.
- the operational configuration parameters are collected and saved in a local database (e.g., XML DB 210 ) on the computer system (block 502 ).
- the collected parameters are subsequently transferred from the local database to a centralized collection database (e.g., XML DB 410 ), where the parameters are combined with previously collected parameters from other similar computer systems and statistically processed (e.g., calculating mean values and standard deviations of the data collected), as shown in block 504 of FIG. 4 .
- Reference values, including statistical values such as the mean and standard deviation for each parameter, as well as tolerance values for each parameter, are generated (block 506).
- Another computer system is then selected (block 508) and the data collection and processing (blocks 502 through 506) are repeated.
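A minimal sketch of the loop of blocks 502 through 508, with stand-ins for the local and centralized databases; the parameter values and helper names are invented.

```python
# Sketch of method 500 (FIG. 4): collect each system's parameters, aggregate
# them centrally, and regenerate reference values as systems are added.
from statistics import mean, stdev

central_db = {"read_strobe_delay": []}   # stand-in for XML DB 410

def collect_from(system):
    """Block 502 stand-in: parameters read from one system's local database."""
    return {"read_strobe_delay": system["delay"]}

def process(system):
    """Blocks 502-506: collect, aggregate, and regenerate reference values."""
    for name, value in collect_from(system).items():
        central_db[name].append(value)
    refs = {}
    for name, values in central_db.items():
        if len(values) >= 2:   # need at least two samples for a deviation
            refs[name] = (mean(values), 3.0 * stdev(values))
    return refs

systems = [{"delay": d} for d in (14, 15, 16, 15, 14)]
for s in systems:            # block 508: select the next system and repeat
    refs = process(s)
print(refs)
```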
- FIG. 5 shows a method for using operational configuration parameter values to identify computer systems at risk for future failures, in accordance with at least some illustrative embodiments.
- the operational configuration parameters are collected and saved in a local database (e.g., XML DB 210 ) on the computer system (block 552 ).
- the collected parameters are subsequently transferred to a centralized collection database (e.g., XML DB 410 ) from the local database (block 554 ) where one or more selected parameters are compared to a set of reference values that includes tolerance values referenced to a set of statistical norms for each parameter (block 556 ). If none of the selected parameters differs from the statistical norm for the parameter by more than a tolerance value also associated with the selected parameter (block 558 ), no action is necessary (block 560 ) ending the method (block 564 ).
- Otherwise, the computer system that generated the out-of-tolerance operational configuration parameter value is identified as being at risk for a future failure, and is flagged as defective (block 562), ending the method (block 564).
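The comparison of blocks 556 through 562 can be sketched as follows; the reference and tolerance values are invented, as are the parameter names.

```python
# Sketch of method 550 (FIG. 5): compare a system's selected parameters to
# reference values and flag the system if any parameter exceeds its tolerance.

# Hypothetical (reference, tolerance) pairs from a prior production run.
reference = {"read_strobe_delay": (14.9, 2.2), "vref_trim": (7.1, 1.5)}

def evaluate(parameters, reference):
    """Blocks 556-562: return True if the system is at risk of future failure."""
    for name, value in parameters.items():
        ref, tol = reference[name]
        if abs(value - ref) > tol:   # block 558: tolerance exceeded
            return True              # block 562: flag as defective
    return False                     # block 560: no action necessary

print(evaluate({"read_strobe_delay": 15, "vref_trim": 7}, reference))  # healthy
print(evaluate({"read_strobe_delay": 20, "vref_trim": 7}, reference))  # at risk
```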
Abstract
Description
- The increase in the demand for computers and other digital systems, such as cell phones and personal digital assistants (PDAs), has resulted in a corresponding increase in the competition between systems manufacturers for market share. In order to better compete, manufacturers of such systems continue to develop techniques for testing systems in order to identify systems that have failed during the manufacturing process, thus improving the quality of the delivered products and furthering the reputation of the manufacturer. But such a reputation can also be damaged by systems that function at delivery, but fail shortly afterward, a failure sometimes referred to as “infant mortality.” These infant mortality failures also result in additional expenditures in the form of warranty related servicing, repairs and/or replacements. To avoid shipping systems with such “latent” failures, techniques have been developed to test systems and identify what are sometimes referred to as “statistical outliers,” wherein a given system passes a functional test, but differs significantly in its test results as compared to the statistical norm for a given group of manufactured systems. Depending on the nature and degree of the difference in the test results, it may be possible to correlate such statistical anomalies to future failures.
- But the use of statistical testing anomalies as predictors of future failures can require that highly specialized testing be performed on each system during various stages of production, sometimes by sophisticated and expensive testing systems, adding to the overall production time and cost. Further, because each type of system may have a different configuration or design, unique testing systems and programs may have to be developed for each platform, adding to the overall product development time and cost as well.
- For a detailed description of exemplary embodiments of the invention, reference will now be made to the accompanying drawings in which:
- FIG. 1A shows an example of a system constructed in accordance with at least some illustrative embodiments;
- FIG. 1B shows a block diagram of the system of FIG. 1A, constructed in accordance with at least some illustrative embodiments;
- FIGS. 2A and 2B show different memory and memory bus configurations, suitable for use with the system of FIGS. 1A and 1B, in accordance with at least some illustrative embodiments;
- FIGS. 3A, 3B and 3C show the flow of operational configuration data, collected, processed and used to identify systems at risk of future failures, in accordance with at least some illustrative embodiments;
- FIG. 4 shows a method for collecting and processing operational configuration data and for generating reference values, in accordance with at least some illustrative embodiments; and
- FIG. 5 shows a method for using operational configuration data to identify systems at risk of future failures, in accordance with at least some illustrative embodiments.
- Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, companies may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . .” Also, the term “couple” or “couples” is intended to mean either an indirect, direct, optical or wireless electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, through an indirect electrical connection via other devices and connections, through an optical electrical connection, or through a wireless electrical connection. Additionally, the term “system” refers to a collection of two or more hardware and/or software components, and may be used to refer to an electronic device, such as, for example, a computer, a portion of a computer, a combination of computers, etc. Further, the term “software” includes any executable code capable of running on a processor, regardless of the media used to store the software. Thus, code stored in non-volatile memory, and sometimes referred to as “embedded firmware,” is included within the definition of software. Also, the term “software” can include more than one program, and may be used to refer to a single program executing on a single processor, multiple programs executing on a single processor, and multiple programs executing on multiple processors.
- The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be illustrative of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
- Many of today's digital systems achieve the high levels of performance required of such systems by dynamically adjusting parameters that control the timing and levels of signals of interfaces between different components of the system, particularly high-speed interfaces. Such calibration is sometimes accomplished by causing one component of the system to transmit and/or write a “training sequence” to another component within the system, and then reading back the sequence multiple times using differing configurations for the interface signals controlling the read operation. These training sequences are predetermined patterns designed to generate the types of signal transitions most likely to cause a system with marginal timing or marginal logic levels to fail (e.g., an alternating ones and zeros pattern). By identifying the range of configuration values that result in successful reads of the training sequence, a timing and level window (sometimes referred to as an “eye specification” or “eye mask”) can be identified. This window allows the selection of optimal values that result in interface signals between the components that operate in the middle of the “eye,” i.e., that result in read operations with the least likelihood of failure when compared against other configurations of the interface signals.
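A minimal sketch of this two-dimensional search, assuming a simulated pass/fail map: every (timing delay, signal level) setting is tested with the training sequence, and the setting in the middle of the passing "eye" is selected. The ranges and the rectangular passing region are invented for illustration.

```python
# Sketch of eye-window identification over two dimensions, timing and level.
# passes() stands in for running the training sequence at one setting; here
# it is simulated with a fixed rectangular eye.

def passes(delay: int, level: int) -> bool:
    """Simulated training-sequence result at one (delay, level) setting."""
    return 8 <= delay <= 20 and 3 <= level <= 9

def eye_center(delays=range(32), levels=range(16)):
    """Find all passing settings and pick the one in the middle of the eye.

    Taking the median of each axis independently assumes a roughly
    rectangular eye, which keeps the sketch simple.
    """
    passing = [(d, v) for d in delays for v in levels if passes(d, v)]
    d_vals = sorted({d for d, _ in passing})
    v_vals = sorted({v for _, v in passing})
    return d_vals[len(d_vals) // 2], v_vals[len(v_vals) // 2]

print(eye_center())
```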
- The calibration process described above is used to compensate for variations in the performance of electronic components that can result from variations in the manufacturing process of such components. The degree of compensation is indicative of the degree of variation of the performance of the component from a target performance. More significantly, an anomalous compensation value, i.e., one that deviates significantly from the norm as compared to other similar components, may be indicative of a defective component, even if the component does not fail a functional test. Such a defect may reflect a weakness in the structure of the component that may later result in an actual functional failure. Such a defect is sometimes referred to as a “latent” defect.
- By collecting the calibrated configuration parameter values from a large number of similar systems, it is possible to determine a statistical norm for the parameters within a population of systems, and to correlate statistically significant variations from the norm, for each parameter, to later occurring systems failures. Once the correlation is established, anomalous parameter variations (sometimes referred to as “statistical outliers”) may be used to identify systems that are at risk of future failures. By identifying systems that are likely to fail in the future, a system manufacturer can avoid shipping products with latent defects that may later cause increased customer dissatisfaction, and further avoid the increased costs of doing business that can result from warranty repairs and replacements.
-
FIGS. 1A and 1B show anillustrative computer system 100 suitable for providing the configuration parameter data needed to predict a failure ofcomputer system 100. As shown, theillustrative computer system 100 comprises achassis 180, adisplay 140, and aninput device 170. Thecomputer system 100 comprises aprocessing logic 102,volatile storage 110, andnon-volatile storage 164.Processing logic 102 may be implemented in hardware (e.g., a microprocessor), software (e.g., microcode), or a combination of hardware and software.Volatile storage 110 comprises a computer-readable medium such as random access memory (RAM). Non-volatilestorage 164 comprises a computer-readable medium such as flash RAM, read-only memory (ROM), a hard disk drive, a floppy disk (e.g., floppy 194), a compact disk read-only memory (CD-ROM, e.g., CD 196), and combinations thereof. - The computer-readable media of both
volatile storage 110 andnon-volatile storage 164 comprise, for example, software that is executed byprocessing logic 102 and provides the computer system with some or all of the functionality described herein. Thecomputer system 100 also comprises a network interface (Net I/F) 162 that enables thecomputer system 100 to receive information via a local area network and/or a wired or wireless wide area network, represented in the example ofFIG. 1A by Ethernetjack 192. A video interface (Video I/F) 142 couples to thedisplay 140. A user interacts with the station via the input device 170 (e.g., a keyboard) and/or pointing device 172 (e.g., a mouse), which couples to aperipheral interface 168. Thedisplay 140, together with theinput device 170 and/or thepointing device 172, may operate together as a user interface. -
Computer system 100 may be a bus-based computer, with a variety of busses interconnecting the various elements shown inFIG. 1B through a series of hubs or bridges, including memory controller hub (MCH) 104 (sometimes referred to as a “north bridge”) and interface controller hub (ICH) 106 (sometimes referred to as a “south bridge”). The busses of the illustrative example ofFIG. 1B include: front-side bus 103coupling processing logic 102 toMCH 104; accelerated graphics port (AGP)bus 141coupling video interface 142 toMCH 104; peripheral component interconnect (PCI)bus 161coupling network interface 162,non-volatile storage 164,peripheral interface 168 andICH 106 to each other; PCI express (PCIe)bus 151 coupling one or more PCIexpress devices 152 toICH 106; andmemory bus 111coupling MCH 104 to dual inline memory modules (DIMMs) 120 and 130 withinvolatile storage 110. - The
peripheral interface 168 accepts signals from the input device 170 and other input devices such as a pointing device 172, and transforms the signals into a form suitable for communication on PCI bus 161. The video interface 142 may comprise a graphics card or other suitable video interface that accepts information from the AGP bus 141 and transforms it into a form suitable for the display 140. The processing logic 102 gathers information from other system elements, including input data from the peripheral interface 168, and program instructions and other data from non-volatile storage 164 or volatile storage 110, or from other systems (e.g., a server used to store and distribute copies of executable code) coupled to a local area network or a wide area network via the network interface 162. The processing logic 102 executes the program instructions (e.g., collection software 200) and processes the data accordingly. The program instructions may further configure the processing logic 102 to send data to other system elements, such as information presented to the user via the video interface 142 and the display 140. The network interface 162 enables the processing logic 102 to communicate with other systems via a network (e.g., the Internet). Volatile storage 110 may serve as a low-latency temporary store of information for the processing logic 102, and non-volatile storage 164 may serve as a long-term (but higher-latency) store of information (e.g., extensible markup language (XML) database 210). - The
processing logic 102, and hence the computer system 100 as a whole, operates in accordance with one or more programs stored on non-volatile storage 164 or received via the network interface 162. The processing logic 102 may copy portions of the programs into volatile storage 110 for faster access, and may switch between programs or carry out additional programs in response to user actuation of the input device 170. The additional programs may be retrieved from non-volatile storage 164, or may be retrieved or received from other locations via the network interface 162. One or more of these programs executes on computer system 100, causing the computer system to perform at least some of the functions disclosed herein. - In at least some illustrative embodiments of the
computer system 100 of FIG. 1B, the memory bus 111 comprises a double data rate, version 2 (DDR2) memory bus, which couples MCH 104 to the individual memories of DIMMs 120 and 130 through buffers, as shown in FIG. 2A. The buffers provide the interface between DDR2 memory bus 111 and DDR2 busses 126 and 136 within DIMMs 120 and 130. MCH 104, as part of the initialization sequence of computer system 100, writes a series of training sequences to DIMMs 120 and 130, reading each sequence back multiple times and using a different timing delay on the read strobe (e.g., the DQS signal generated by the DDR DIMMs) for each read of the data from the DIMMs. After a range of strobe delay values that results in error-free data being read back has been identified, a delay value in the middle of that window is selected as the configuration value for the read strobe. Those of ordinary skill in the art will recognize that other signals and transactions may be similarly calibrated using training sequences as described, and further that other characteristics, such as the voltage level of the signals, may also be similarly adjusted to determine an optimal operational configuration of the interface. All such signals, transactions and configurations are within the scope of the present disclosure. - In other illustrative embodiments,
memory bus 111 comprises fully buffered DIMM (FBD) lanes that couple the MCH 104 of FIG. 1B to advanced memory buffer (AMB) chips 128 and 138, respectively, within DIMMs 120 and 130, as shown in FIG. 2B. The AMB chips 128 and 138 couple to each of the individual memories within DIMMs 120 and 130, and the interfaces between the AMB chips 128 and 138 and the individual memories may be calibrated using training sequences similar to those described above. - Once operational configuration parameters have been determined for an interface, the parameters may be used as a basis for statistically determining a range of normal values for such parameters, what constitutes a statistically significant deviation (statistical outliers) relative to such a range of normal values, and what deviations correlate to future failures of the
computer system 100. Once the initial statistical processing is complete, the results of the statistical processing can be used to identify computer systems, such as computer system 100, that are at risk of future failures. -
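The strobe-delay training described above may be sketched, purely for illustration, as follows. This is not the MCH's actual firmware: `write_pattern` and `read_with_delay` are hypothetical stand-ins for the platform-specific training hooks, and the number of delay steps and repeated reads are assumed values.

```python
def train_read_strobe(write_pattern, read_with_delay, pattern, delay_steps=64):
    """Sweep the read-strobe (DQS) delay across its range, identify the
    window of delays that reads the training pattern back error-free,
    and return the delay at the middle of that window."""
    write_pattern(pattern)
    # A delay "passes" only if several consecutive reads all match.
    passing = [d for d in range(delay_steps)
               if all(read_with_delay(d) == pattern for _ in range(4))]
    if not passing:
        raise RuntimeError("no error-free strobe delay found")
    # Select the midpoint of the error-free window as the operational value.
    return (passing[0] + passing[-1]) // 2
```

The midpoint choice maximizes timing margin on both sides of the window, which is why a value determined this way is itself a sensitive indicator of marginal hardware: a window that drifts off-center hints at a developing fault.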
FIGS. 3A through 3C illustrate how operational configuration parameters determined during the initialization of multiple computer systems 100 are collected within each computer system, are further collected from the computer system 100 and stored on another similar computer system (hereinafter, the “Aggregating System”), are statistically processed to determine ranges of normal operation as well as limits defining statistical outliers, and are used to generate reference values used to subsequently identify computer systems at risk of future failure. - Referring to the illustrative embodiment of
FIG. 3A, after computer system 100 completes its initialization sequence, collection software 200 executes on processing logic 102 and collects one or more operational configuration parameters determined at initialization. The parameters that are collected are determined based upon the content of XML parser script 220, which includes a list of the parameters to be collected by collection software 200, information as to where the parameters can be located within the system (e.g., the PCI address of a register within MCH 104 of FIG. 1B), and information describing the format of the data. In at least some illustrative embodiments, the collected operational configuration parameter data, as well as the XML description of the data, are saved within XML database (XML DB) 210. By using XML parser script 220 to describe the data to be collected, collection software 200 does not need to include hard-coded descriptions of the collected parameters. If the format of the parameter data later changes, or if the list of parameters selected for collection changes, XML parser script 220 can be easily modified without the need to change collection software 200. - Referring to
FIG. 3B, once the operational configuration parameters have been collected and saved to XML database 210, communication is established between Aggregating System 300 and computer system 100, allowing the collected parameters within XML database 210 to be transferred to, and stored within, XML database 410, which is maintained on non-volatile storage 364 of Aggregating System 300. Aggregating System 300 comprises an architecture similar to that of computer system 100 shown in FIG. 1B, but executes processing software 400 on processing logic 302. Further, XML database 410 on Aggregating System 300 stores the operational configuration parameters for multiple computer systems such as computer system 100, as well as the results of one or more statistical analyses of the configuration data from multiple systems. - In the illustrative embodiment of
FIG. 3B, processing software 400 of Aggregating System 300 communicates with computer system 100 by interacting with collection software 200, thus gaining access to the operational configuration parameters associated with computer system 100 and stored in XML database 210. In other illustrative embodiments, processing software 400 interacts with other software to access XML database 210, and in still other illustrative embodiments processing software 400 accesses XML database 210 through one or more shared network file systems made accessible by an operating system (not shown) executing on computer system 100. Further, the physical connection between computer system 100 and Aggregating System 300 may be a dedicated wired connection (e.g., a universal serial bus (USB) connection), a dedicated wireless connection (e.g., a Bluetooth connection), a wired network connection (e.g., an Ethernet connection), or a wireless network connection (e.g., a WiFi connection). Other mechanisms and connections for transferring the operational configuration parameters from computer system 100 to Aggregating System 300 will become apparent to those skilled in the art, and all such mechanisms and connections are within the scope of the present disclosure. - In at least some illustrative embodiments, the process of collecting operational configuration parameters shown in
FIG. 3A, and of transferring the collected parameters to Aggregating System 300 as shown in FIG. 3B, is repeated for multiple computer systems 100. Statistical outliers and system failures are identified, tracked and saved within XML database 410. When enough data has been collected to identify at least one statistically significant correlation between system failures and statistical outliers associated with a given parameter, a set of reference values 412 is created and saved on non-volatile storage 364, as shown in FIG. 3C. The set of reference values comprises a statistical reference parameter value (e.g., the mean value of an operational configuration parameter) and a tolerance value above and below the reference parameter value. The tolerance value establishes a limit beyond which an anomalous operational configuration value will be treated as indicative of an unacceptable risk of a future system failure. The tolerance value may be based, for example, on such statistical measures as the standard deviation of a dataset of parameter values (e.g., a tolerance value of three sigma referenced to the mean value of a parameter). If more than one operational configuration parameter is identified as a good predictor of future failures of the computer system, a subset of one or more parameters with better correlations than the other parameters is selected for use in the comparison with the reference values. A computer system 100 that generates at initialization one or more operational configuration parameter values that exceed the statistical norm by more than the tolerance values is identified as defective and is not, for example, shipped to the customer. - In at least some illustrative embodiments, the reference values are generated once from a sample set of
computer systems 100 and used for all subsequent evaluations of computer systems 100 without further updates to the reference values. In other illustrative embodiments, the reference values are periodically updated as new data is collected from computer systems 100 and the systems are evaluated. In still other illustrative embodiments, the reference values are periodically updated, but data is maintained separately for distinct production runs of the manufacture of computer systems 100, or for a particular production facility. Other combinations of updating, combining and segregating reference values will become apparent to those of ordinary skill in the art, and all such techniques are within the scope of the present disclosure. - By using operational configuration parameter values to evaluate the risk of a future failure of the system, the system can be evaluated in a non-invasive manner, without interrupting the normal operation of
computer system 100 once it is shipped to the customer and placed into operation. This is due, at least in part, to the fact that it is not necessary to run dedicated test programs, nor to generate and/or write test patterns that can disrupt the configuration of the system and destroy information in memory. The operational configuration parameter values are determined automatically at power-up as part of the normal initialization of the system, without the need for additional user intervention. The data is generated at that time and is thus available for later collection as described herein. - In at least some illustrative embodiments, a technical service representative can evaluate the
computer system 100 by establishing a remote connection between computer system 100 and Aggregating System 300. Once the remote connection is established, the collection software is executed on computer system 100 and the current operational configuration parameters are collected and sent to Aggregating System 300. An archived copy of the reference data for the original production run, of which computer system 100 was a part, is retrieved and used to determine whether computer system 100 is utilizing operational configuration parameter values that exceed the norm by more than the tolerance value, or whether changes over time in the operational configuration parameter values (referenced to archived values for the particular computer system 100) exceed the norm for changes of the same parameter across the production group by more than a tolerance value established for the production group. In this manner, the technical service representative can determine whether a latent failure is developing, allowing the system to be repaired or replaced before an actual failure occurs. -
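The reference-value construction and the two checks described above (an absolute deviation check against the production-group norm, and a drift check against the group's normal change over time) may be sketched as follows. This is a minimal illustration under assumed conventions: the function names are hypothetical, and three sigma is used only because the specification offers it as one example of a tolerance.

```python
import statistics

def make_reference(values, sigmas=3.0):
    """Build a reference set for one parameter: the mean of the values
    collected from many systems, plus a tolerance band (three standard
    deviations here, as one possible choice)."""
    mean = statistics.mean(values)
    tolerance = sigmas * statistics.pstdev(values)
    return mean, tolerance

def exceeds_tolerance(value, mean, tolerance):
    """True if a parameter value deviates from the statistical norm by
    more than the tolerance -- treated as a risk of future failure."""
    return abs(value - mean) > tolerance

def drift_exceeds_norm(current, archived, group_mean_change, group_tolerance):
    """True if this system's change over time in a parameter (relative
    to its own archived value) deviates from the production group's
    normal change by more than the group's tolerance for that change."""
    return abs((current - archived) - group_mean_change) > group_tolerance
```

A service representative's remote check then reduces to calling `exceeds_tolerance` on the freshly collected values and `drift_exceeds_norm` on the difference from the archived values for the same system.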
FIG. 4 shows a method 500 for collecting and processing operational configuration parameter values of a computer system, and for generating one or more sets of reference values, in accordance with at least some illustrative embodiments. Referring to FIGS. 3A through 3C and FIG. 4, after completing initialization of a computer system (e.g., computer system 100) and determining a set of operational configuration parameters for the computer system, the operational configuration parameters are collected and saved in a local database (e.g., XML DB 210) on the computer system (block 502). The collected parameters are subsequently transferred from the local database to a centralized collection database (e.g., XML DB 410), where the parameters are combined with previously collected parameters from other similar computer systems and statistically processed (e.g., calculating mean values and standard deviations of the data collected), as shown in block 504 of FIG. 4. - If enough data has been collected from enough systems to identify a statistically significant correlation between statistical outlier parameter values and identified system failures (block 506), reference values (including statistical values such as the mean and standard deviation for each parameter, as well as tolerance values for each parameter) are generated and saved in the centralized database (block 510), completing the method. If a correlation is not identified (block 506), another computer system is selected (block 508) and the data collection and processing (
blocks 502 through 506) is repeated. -
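The decision at block 506 requires some measure of association between outlier status and observed failures. The specification leaves the statistic unspecified; a phi coefficient over a 2×2 outlier/failure contingency table is one minimal sketch of such a test (the function name and the interpretation threshold are illustrative assumptions, not the patented method):

```python
from math import sqrt

def phi_correlation(records):
    """records: (was_outlier, later_failed) boolean pairs, one per
    system. Returns the phi coefficient of the 2x2 contingency table;
    values near +1 indicate that outlier parameter values strongly
    track subsequent failures."""
    a = sum(1 for o, f in records if o and f)          # outlier, failed
    b = sum(1 for o, f in records if o and not f)      # outlier, ok
    c = sum(1 for o, f in records if not o and f)      # normal, failed
    d = sum(1 for o, f in records if not o and not f)  # normal, ok
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return 0.0 if denom == 0 else (a * d - b * c) / denom
```

In the flow of FIG. 4, block 506 would compare such a coefficient (or an equivalent significance test) against a chosen threshold before generating reference values at block 510.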
FIG. 5 shows a method for using operational configuration parameter values to identify computer systems at risk of future failures, in accordance with at least some illustrative embodiments. Referring to FIGS. 3A through 3C and FIG. 5, after completing initialization of a computer system (e.g., computer system 100) and determining a set of operational configuration parameters for the computer system, the operational configuration parameters are collected and saved in a local database (e.g., XML DB 210) on the computer system (block 552). The collected parameters are subsequently transferred to a centralized collection database (e.g., XML DB 410) from the local database (block 554), where one or more selected parameters are compared to a set of reference values that includes tolerance values referenced to a set of statistical norms for each parameter (block 556). If none of the selected parameters differs from the statistical norm for that parameter by more than the tolerance value also associated with the selected parameter (block 558), no action is necessary (block 560), ending the method (block 564). If at least one parameter exceeds its corresponding statistical norm by more than the corresponding tolerance value, the computer system that generated the operational configuration parameter value is identified as being at risk of a future failure and is flagged as defective (block 562), ending the method (block 564). - The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. 
For example, although the embodiments described collect operational configuration parameters from bus bridges and device interfaces, other sources may be used to obtain operational configuration parameters, such as, for example, advanced configuration and power interface registers associated with restoring the state of
computer system 100 upon exiting a global sleep state (e.g., ACPI S3 restore registers). Also, although the illustrative embodiments described utilize XML databases, other embodiments may utilize non-XML databases. Further, although computer systems are described in the illustrative embodiments, those of ordinary skill in the art will recognize that the systems and methods described may be implemented in a wide variety of digital systems, including cellular telephones, PDAs, digital televisions, MP3 players, and digital cameras, just to name a few examples. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Claims (22)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/830,802 US7877645B2 (en) | 2007-07-30 | 2007-07-30 | Use of operational configuration parameters to predict system failures |
Publications (2)
Publication Number | Publication Date |
---|---|
US20090037777A1 true US20090037777A1 (en) | 2009-02-05 |
US7877645B2 US7877645B2 (en) | 2011-01-25 |
Family
ID=40339285
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/830,802 Active 2029-01-20 US7877645B2 (en) | 2007-07-30 | 2007-07-30 | Use of operational configuration parameters to predict system failures |
Country Status (1)
Country | Link |
---|---|
US (1) | US7877645B2 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9317350B2 (en) * | 2013-09-09 | 2016-04-19 | International Business Machines Corporation | Method and apparatus for faulty memory utilization |
US20150074450A1 (en) * | 2013-09-09 | 2015-03-12 | International Business Machines Corporation | Hard disk drive (hdd) early failure detection in storage systems based on statistical analysis |
US9396200B2 (en) | 2013-09-11 | 2016-07-19 | Dell Products, Lp | Auto-snapshot manager analysis tool |
US9720758B2 (en) | 2013-09-11 | 2017-08-01 | Dell Products, Lp | Diagnostic analysis tool for disk storage engineering and technical support |
US9317349B2 (en) * | 2013-09-11 | 2016-04-19 | Dell Products, Lp | SAN vulnerability assessment tool |
US10223230B2 (en) | 2013-09-11 | 2019-03-05 | Dell Products, Lp | Method and system for predicting storage device failures |
US9454423B2 (en) | 2013-09-11 | 2016-09-27 | Dell Products, Lp | SAN performance analysis tool |
US9436411B2 (en) | 2014-03-28 | 2016-09-06 | Dell Products, Lp | SAN IP validation tool |
US9734458B2 (en) | 2014-04-28 | 2017-08-15 | International Business Machines Corporation | Predicting outcome based on input |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5528516A (en) * | 1994-05-25 | 1996-06-18 | System Management Arts, Inc. | Apparatus and method for event correlation and problem reporting |
US5539592A (en) * | 1994-10-05 | 1996-07-23 | International Business Machines Corporation | System and method for monitoring friction between head and disk to predict head disk interaction failure in direct access storage devices |
US20030084381A1 (en) * | 2001-11-01 | 2003-05-01 | Gulick Dale E. | ASF state determination using chipset-resident watchdog timer |
US20040168108A1 (en) * | 2002-08-22 | 2004-08-26 | Chan Wai T. | Advance failure prediction |
US7370241B2 (en) * | 2002-09-17 | 2008-05-06 | International Business Machines Corporation | Device, system and method for predictive failure analysis |
US20050216800A1 (en) * | 2004-03-24 | 2005-09-29 | Seagate Technology Llc | Deterministic preventive recovery from a predicted failure in a distributed storage system |
US7526684B2 (en) * | 2004-03-24 | 2009-04-28 | Seagate Technology Llc | Deterministic preventive recovery from a predicted failure in a distributed storage system |
US7539907B1 (en) * | 2006-05-05 | 2009-05-26 | Sun Microsystems, Inc. | Method and apparatus for determining a predicted failure rate |
US20080189578A1 (en) * | 2007-02-05 | 2008-08-07 | Microsoft Corporation | Disk failure prevention and error correction |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090300278A1 (en) * | 2008-05-29 | 2009-12-03 | Advanced Micro Devices, Inc. | Embedded Programmable Component for Memory Device Training |
US9015536B1 (en) * | 2011-08-31 | 2015-04-21 | Amazon Technologies, Inc. | Integration based anomaly detection service |
US10216560B2 (en) | 2011-08-31 | 2019-02-26 | Amazon Technologies, Inc. | Integration based anomaly detection service |
US9436535B2 (en) | 2011-08-31 | 2016-09-06 | Amazon Technologies, Inc. | Integration based anomaly detection service |
US20130073908A1 (en) * | 2011-09-21 | 2013-03-21 | Toshiba Tec Kabushiki Kaisha | Maintenance device and maintenance method |
US20130326289A1 (en) * | 2012-06-05 | 2013-12-05 | Infineon Technologies Ag | Method and system for detection of latent faults in microcontrollers |
US8954794B2 (en) * | 2012-06-05 | 2015-02-10 | Infineon Technologies Ag | Method and system for detection of latent faults in microcontrollers |
US20140115378A1 (en) * | 2012-10-24 | 2014-04-24 | Kinpo Electronics, Inc. | System and method for restoring network configuration parameters |
US9892014B1 (en) * | 2014-09-29 | 2018-02-13 | EMC IP Holding Company LLC | Automated identification of the source of RAID performance degradation |
US9836385B2 (en) * | 2014-11-24 | 2017-12-05 | Syntel, Inc. | Cross-browser web application testing tool |
US20160147641A1 (en) * | 2014-11-24 | 2016-05-26 | Syntel, Inc. | Cross-browser web application testing tool |
KR20220006351A (en) * | 2020-07-08 | 2022-01-17 | 현대자동차주식회사 | Ethernet unit and method for controlling thereof |
KR102357224B1 (en) | 2020-07-08 | 2022-02-03 | 현대자동차주식회사 | Ethernet unit and method for controlling thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEYER, JOHN E.;WADE, MARK A.;COVINGTON, ROBERT R.;AND OTHERS;REEL/FRAME:019720/0325;SIGNING DATES FROM 20070729 TO 20070730 Owner name: HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P., TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MEYER, JOHN E.;WADE, MARK A.;COVINGTON, ROBERT R.;AND OTHERS;SIGNING DATES FROM 20070729 TO 20070730;REEL/FRAME:019720/0325 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
AS | Assignment |
Owner name: HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP, TEXAS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P.;REEL/FRAME:037079/0001 Effective date: 20151027 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552) Year of fee payment: 8 |
|
AS | Assignment |
Owner name: OT PATENT ESCROW, LLC, ILLINOIS Free format text: PATENT ASSIGNMENT, SECURITY INTEREST, AND LIEN AGREEMENT;ASSIGNORS:HEWLETT PACKARD ENTERPRISE DEVELOPMENT LP;HEWLETT PACKARD ENTERPRISE COMPANY;REEL/FRAME:055269/0001 Effective date: 20210115 |
|
AS | Assignment |
Owner name: VALTRUS INNOVATIONS LIMITED, IRELAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:OT PATENT ESCROW, LLC;REEL/FRAME:060005/0600 Effective date: 20220504 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 12 |