US20050193284A1 - Electronic device, failure prediction method, and computer product - Google Patents

Publication number
US20050193284A1
Authority
US
United States
Prior art keywords
environmental load
error
failure
temperature
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US10/875,917
Other versions
US7469189B2 (en)
Inventor
Akihiro Yasuo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Assigned to FUJITSU LIMITED (assignment of assignors interest; see document for details). Assignor: YASUO, AKIHIRO
Publication of US20050193284A1
Application granted
Publication of US7469189B2
Legal status: Expired - Fee Related
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/24 Marginal checking or other specified testing methods not covered by G06F11/26, e.g. race tests

Definitions

  • the present invention relates to a technology for predicting a failure of a part that is prone to aged deterioration in an electronic device by directly detecting a signal of impending failure of the part.
  • a blade server that realizes high-density mounting includes a larger number of parts, such as CPUs on the scale of several hundred, than a conventional server, which may lead to a higher rate of part failure.
  • Although a method such as dual operation of a system is employed to avoid system failure, a system failure can still occur owing to a complex failure that interacts with a latent failure in the system.
  • the failure prediction based on the statistical method compares a result of measurement of operation status of an electronic device using a sensor with an operation model of constituent parts to predict a failure.
  • the operation model is created based on performance data obtained from each of the parts and is periodically updated. By compensating for minute disturbances that appear as noise between the operation model and the actual measurements, it is possible to determine whether a result is within an acceptable range or is a sign of an impending failure. For instance, in the case of a hard disk device, a failure can be predicted by comparing a measured response time with a response time calculated from the operation model.
  • Another technique, an extension of the statistical method, adds a redundant hardware structure to an electronic circuit or a part of interest and applies to the redundant structure a greater load than that of normal operation of the circuit. When the redundant circuit breaks down, the technique predicts that a breakdown of the circuit may be imminent (for example, see Japanese Patent Laid-Open Publication No. H2-87079 and Japanese Patent Laid-Open Publication No. H7-128384).
  • the accuracy of the statistical method depends on the quality of the operation model, and it is difficult to model all operations of a complex, large-scale semiconductor device.
  • the accuracy of the statistical method is also dependent on setting a threshold value when determining a difference between an actual operation and the operation model, and it is also extremely difficult to set a proper threshold value.
  • the method using the redundant circuit is no better than the statistical method, because considerable error arises from variation among parts, subtle differences in the test environment, and so on. Furthermore, while a system is running, it is not easy to replace a part of questionable life expectancy when the circuit of interest is still operating normally.
  • the electronic device includes an environmental load applying unit that applies a higher environmental load on the part than an environmental load in a normal operation, an error detecting unit that detects an error in the part with the higher environmental load applied, and a failure predicting unit that predicts a failure of the part based on the error detected.
  • the failure prediction method includes applying a higher environmental load on the part than an environmental load in a normal operation, detecting an error in the part with the higher environmental load applied, and predicting the failure of the part based on the error detected.
  • the computer program according to still another aspect of the present invention realizes the method according to the above aspect on a computer.
  • the computer readable recording medium stores the computer program according to the above aspect.
  • FIG. 1A is a graph of part characteristic curves in response to an environmental load
  • FIG. 1B is a block diagram of an electronic device according to an embodiment of the present invention.
  • FIG. 2 is a block diagram of a blade server according to a first embodiment of the present invention
  • FIG. 3 is a graph of an error rate of a hard disk in response to a temperature load
  • FIG. 4 is a table of an example of data stored in a secure operation value storing unit
  • FIG. 5 is a table of an example of data stored in a threshold value history storing unit
  • FIG. 6 is a flowchart of a process procedure of failure prediction in the blade server according to the first embodiment
  • FIG. 7 is a block diagram of an electronic device according to a second embodiment of the present invention.
  • FIG. 8 is a block diagram of an electronic device according to a third embodiment of the present invention.
  • FIG. 9 is a block diagram of a blade server according to a fourth embodiment of the present invention.
  • FIG. 10 is a graph of an error ratio of a memory module in response to an applied voltage
  • FIG. 11 is a graph of an error status in a CPU in response to an applied voltage
  • FIG. 12 is a block diagram of an electronic device according to a fifth embodiment of the present invention.
  • FIG. 13 is a schematic of a computer that executes a failure prediction program for the electronic devices according to the first embodiment to the fifth embodiment.
  • In the failure prediction method according to the present invention, a failure is predicted by exploiting the tendency of an electronic part to fail when a high stress is applied.
  • FIG. 1A is a graph of part characteristic curves in response to an environmental load.
  • An electronic part tends to malfunction when a high load is applied in environmental conditions such as temperature, humidity, voltage, etc.
  • the environmental conditions for the electronic parts mounted on an apparatus are usually arranged so that they stay within the values stipulated for secure operation. (For instance, if the environmental temperature of the part is anticipated to exceed the stipulated value, a fan or the like is provided for cooling.)
  • A failure due to aged deterioration is assessed to have occurred in the electronic part when the characteristics of the part deteriorate gradually and eventually exceed the threshold value for malfunctioning under the operating environment.
  • the aged deterioration of parts is progressing (from curve S to curve T) even though no failure occurs under normal operating environment conditions.
  • normal operation is not affected even when the deterioration has progressed almost to the point of breakdown of the part.
  • If the deterioration progresses beyond that point, however, normal operation is adversely affected, leading to a cascading breakdown, first of the entire apparatus and then of the entire system (curve U and curve V).
  • In the failure prediction method, a high environmental load is applied on the part and malfunctioning under the high load is detected. In this way, a part that still functions normally under normal conditions in spite of aged deterioration is detected, and failure of the apparatus and the system due to the malfunctioning part can be avoided.
  • FIG. 1B is a block diagram of an electronic device according to an embodiment of the present invention.
  • An electronic device 10 includes parts 11 , an environment load applying unit 12 , an error detecting unit 13 , and a control unit 14 .
  • the parts 11 realize the functions of the electronic device 10 and are the parts for which failure prediction is to be carried out.
  • the environment load applying unit 12 is a functional part that applies environment load in the form of temperature, humidity, voltage, etc. on the parts 11 .
  • the error detecting unit 13 is a functional part that detects error in the device part to which load is applied. Error detection can be carried out by providing a detecting unit externally or by monitoring error signals such as parity/ECC error, etc. output by the parts 11 .
  • the control unit 14 controls the environment load applying unit 12 to apply environmental load on the parts 11 , and based on the error detected by the error detecting unit 13 , carries out failure prediction for the parts 11 .
  • the control unit 14 may issue an instruction for environmental load application test under normal operating conditions. Alternatively, when a problem is encountered in the apparatus, the control unit 14 may switch to a test mode and issue the instruction for environmental load application test.
  • the control unit 14 controls the environmental load applying unit 12 to apply a greater load than under normal operating conditions.
  • the control unit 14 then carries out failure prediction for the parts 11 based on the error detected by the error detecting unit 13. Consequently, it is possible to directly detect the extent of deterioration of a part that is approaching breakdown due to aged deterioration but still functions normally under normal conditions.
  • FIG. 2 is a block diagram of a blade server according to a first embodiment of the present invention.
  • a blade server 100 includes a hard disk 110 , a temperature setting unit 120 , an error detecting unit 130 , and a control unit 140 .
  • the hard disk 110 is the part to which a temperature load is applied by the temperature setting unit 120 and for which failure prediction is to be carried out.
  • FIG. 3 is a graph of an error rate of a hard disk 110 in response to a temperature load.
  • a maximum temperature and a minimum temperature for absolute rating, and a maximum temperature and a minimum temperature for secure operation are stipulated as operating conditions for all electronic parts such as the hard disk 110 .
  • the absolute rating is a value which when exceeded results in the possibility of a breakdown of the part.
  • the secure operation temperature value is a temperature value which when exceeded does not guarantee a normal operation.
  • a normal part may not function normally in an environment outside the secure operation temperature range, but within that range its normal functioning is guaranteed. In contrast, a part that is approaching breakdown due to aged deterioration functions normally in a normal operating environment but may malfunction in an environment that is within the secure operation temperature range and close to the secure operation temperature values.
  • For a hard disk 110 that has not deteriorated (curve S), the error rate within the secure operation temperature range is below a warning error rate (Erw). If the aged deterioration progresses and the characteristics of the hard disk 110 deteriorate, the characteristic curve shifts first to curve T and then to curve U, and a SMART (Self-Monitoring, Analysis and Reporting Technology) function of the hard disk 110 assesses this as a warning situation.
  • the SMART function is a self-diagnostic function of the hard disk 110 , which logs errors that occur. When multiple errors occur, the SMART function decides that the hard disk 110 needs to be replaced.
  • the error rate is measured under the operating condition of the minimum secure operation temperature of 5° C. or the maximum secure operation temperature of 55° C.
  • failure prediction can be done at a step earlier (represented by curve T) than the failure prediction step of the SMART function.
  • the temperature setting unit 120 is a functional part that applies the temperature load on the hard disk 110 and is, for instance, a temperature correcting circuit such as a Peltier element.
  • the error detecting unit 130 is a functional part that detects malfunctioning in the hard disk 110 that results from the application of the temperature load.
  • the control unit 140 is a functional part that controls the temperature setting unit 120 to apply the temperature load on the hard disk 110 based on the temperature measured by a temperature sensor disposed near the hard disk 110 , and carries out failure prediction by calculating the error rate of the hard disk 110 based on the error detected by the error detecting unit 130 .
  • the control unit 140 includes a temperature controller 141 , a secure operation value storing unit 142 , an error data collecting unit 143 , a secure operation value testing unit 144 , a non-secure operation value testing unit 145 , a threshold value history storing unit 146 , and a temperature load test controller 147 .
  • the temperature controller 141 controls the temperature setting unit 120 to set the temperature of the hard disk 110 to the temperature setting specified by the secure operation value testing unit 144 or the non-secure operation value testing unit 145 .
  • the secure operation value storing unit 142 stores secure operation temperature values of the hard disk 110 .
  • FIG. 4 is a table of an example of data stored in the secure operation value storing unit 142 in which the minimum secure operation temperature is 5° C. and the maximum secure operation temperature is 55° C.
  • the error data collecting unit 143 calculates the error rate of the hard disk 110 based on the error detected by the error detecting unit 130 .
  • the error rate calculated by the error data collecting unit 143 is used by the secure operation value testing unit 144 and the non-secure operation value testing unit 145 in the failure prediction for the hard disk 110 .
  • the secure operation value testing unit 144 carries out a secure temperature value application test to determine the error rate when a temperature load of the secure operation value is applied on the hard disk 110. If the determined error rate exceeds the warning error rate, the secure operation value testing unit 144 notifies the user that there is a high possibility of breakdown of the hard disk 110. To be more specific, the secure operation value testing unit 144 carries out the secure operation value test by applying the maximum secure operation temperature of 55° C. on the hard disk 110.
  • the secure operation value testing unit 144 is able to carry out precise failure prediction for the hard disk 110 by determining the error rate while applying the maximum secure operation temperature of 55° C. on the hard disk 110, and by notifying the user of a high possibility of breakdown of the hard disk 110 if the determined error rate exceeds the warning error rate.
  • the secure operation value testing unit 144 determines the error rate by applying the maximum secure operation temperature of 55° C. on the hard disk 110 .
  • the secure operation value testing unit 144 can also determine the error rate by applying the minimum secure operation temperature of 5° C. on the hard disk 110 . Further, apart from the maximum and minimum secure operation temperatures, a temperature that is close to the secure operation value within the secure operation temperature range can also be applied.
  • the error rate can also be determined by applying both the maximum secure operation temperature of 55° C. and the minimum secure operation temperature of 5° C., in which case the user is notified of a high possibility of breakdown of the hard disk 110 if either of the determined error rates exceeds the warning error rate.
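The secure temperature value application test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 5° C. and 55° C. limits come from the text, while `secure_operation_test`, `aged_disk`, and the numeric warning error rate are hypothetical names and values that do not appear in the patent.

```python
from typing import Callable, Iterable, List

# Secure operation temperature values given in the text (FIG. 4).
MIN_SECURE_TEMP_C = 5.0
MAX_SECURE_TEMP_C = 55.0


def secure_operation_test(
    measure_error_rate: Callable[[float], float],
    warning_error_rate: float,
    test_temps_c: Iterable[float] = (MIN_SECURE_TEMP_C, MAX_SECURE_TEMP_C),
) -> List[float]:
    """Return the test temperatures at which the measured error rate
    exceeds the warning error rate (an empty list means no warning)."""
    return [t for t in test_temps_c if measure_error_rate(t) > warning_error_rate]


def aged_disk(temp_c: float) -> float:
    # Hypothetical stand-in for setting the disk temperature and collecting
    # the error rate; a real device would measure this.
    return 1e-6 + 4e-8 * max(0.0, temp_c - 25.0) ** 2


# Hypothetical warning error rate (Erw); the patent gives no number.
failing_temps = secure_operation_test(aged_disk, warning_error_rate=2e-5)
if failing_temps:
    print("High possibility of hard disk breakdown at:", failing_temps)
```

Passing a single temperature in `test_temps_c` corresponds to testing only the maximum or only the minimum secure operation value.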
  • the non-secure operation value testing unit 145 carries out a non-secure temperature value application test that determines, when a temperature load exceeding the secure operation temperature values is applied on the hard disk 110, a threshold temperature at which the error rate reaches a failure threshold.
  • the non-secure operation value testing unit 145 carries out the failure prediction for the hard disk 110 based on a relation between a previously measured threshold temperature, a current measured threshold temperature and the maximum secure operation temperature of 55° C.
  • the non-secure operation value testing unit 145 determines the previously measured threshold temperature from a previous non-secure temperature value application test, and the current measured threshold temperature from a current non-secure temperature value application test.
  • the non-secure operation value testing unit 145 compares the absolute value of the difference between the previously measured threshold temperature and the current measured threshold temperature with the absolute value of the difference between the current measured threshold temperature and the maximum secure operation temperature of 55° C. If the former is greater, the non-secure operation value testing unit 145 assesses that the hard disk 110 is likely to malfunction during the next non-secure temperature value application test and notifies the user of this fact.
  • the non-secure operation value testing unit 145 is able to carry out precise failure prediction for the hard disk 110 by determining, when a temperature load exceeding the maximum secure operation temperature of 55° C. is applied on the hard disk 110, the threshold temperature at which the error rate reaches the failure threshold, and by carrying out failure prediction based on the relation between the previously measured threshold temperature, the current measured threshold temperature, and the maximum secure operation temperature.
  • a part does not immediately malfunction when a load exceeding a secure operation value is applied. Therefore, by first determining a threshold value exceeding the secure operation value in which the part works normally, and measuring this threshold, which varies with aged deterioration of the part, malfunctioning can be predicted within a time range in which operation recovery is possible.
  • the threshold temperature is determined by applying a temperature load exceeding the maximum secure operation temperature of 55° C. on the hard disk 110 .
  • the threshold temperature can also be determined by applying a temperature load exceeding the minimum secure operation temperature of 5° C. on the hard disk 110 .
  • the threshold value history storing unit 146 stores a history of the threshold temperature values determined by the non-secure operation value testing unit 145 . For instance, as shown in FIG. 3 , the temperatures A, B, etc., which are points at which the characteristic curves cut the line representing the failure threshold, are stored as the threshold temperatures in the threshold value history storing unit 146 . The threshold temperatures stored in the threshold value history storing unit 146 are used by the non-secure operation value testing unit 145 for carrying out failure prediction.
  • FIG. 5 is a table of an example of data stored in the threshold value history storing unit 146 .
  • the threshold temperatures determined by the non-secure temperature value application test are stored sequentially in the threshold value history storing unit 146 .
  • the threshold value of the previous non-secure temperature value application test is 80° C.
  • the threshold temperature determined from the current non-secure temperature value application test is 70° C.
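Using the example values above (80° C. previously, 70° C. now, and a maximum secure operation temperature of 55° C.), the comparison made by the non-secure operation value testing unit can be worked through as follows. The function name `threshold_drift_warning` is a hypothetical label chosen for this sketch, and the 62° C. value in the second call is an invented follow-up measurement, not data from the patent.

```python
def threshold_drift_warning(
    previous_threshold_c: float,
    current_threshold_c: float,
    secure_limit_c: float = 55.0,
) -> bool:
    """Return True when the drop in threshold temperature since the last test
    is larger than the remaining margin above the secure operation value."""
    drift = abs(previous_threshold_c - current_threshold_c)
    margin = abs(current_threshold_c - secure_limit_c)
    return drift > margin


# Values from the text: |80 - 70| = 10 is not greater than |70 - 55| = 15,
# so no warning is issued for this test.
print(threshold_drift_warning(80.0, 70.0))

# Invented follow-up measurement: |70 - 62| = 8 is greater than |62 - 55| = 7,
# so the user would be notified.
print(threshold_drift_warning(70.0, 62.0))
```

The rule amounts to asking whether one more drop of the same size would carry the threshold temperature into the secure operation range.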
  • the temperature load test controller 147 applies the temperature load on the hard disk 110 and controls all the temperature load tests. To be more specific, the temperature load test controller 147 carries out temperature load tests and failure prediction by transferring control between the functional parts and transferring data between the functional parts and the storage unit.
  • FIG. 6 is a flowchart of a process procedure of failure prediction in the blade server 100 according to the first embodiment. The failure prediction is carried out at fixed intervals.
  • the secure operation value testing unit 144 first sets the temperature of the hard disk 110 to a secure operation temperature of 55° C. and carries out the secure operation test (Step S 601 ).
  • the secure operation value testing unit 144 then assesses whether the error rate of the hard disk 110 is greater than the warning error rate (Step S 602 ). If the error rate is greater than the warning error rate, the secure operation value testing unit 144 notifies the user of the possibility of malfunctioning of the hard disk 110 under normal operating conditions (Step S 603 ).
  • the non-secure operation value testing unit 145 determines, by applying a non-secure operation temperature, the load threshold, that is, the threshold temperature, at which the hard disk 110 malfunctions (Step S 604 ).
  • the non-secure operation value testing unit 145 then compares the absolute value of the difference between the previously measured threshold temperature and the current measured threshold temperature with the absolute value of the difference between the current measured threshold temperature and the secure operation temperature (Step S 605 ). If the former is greater, the non-secure operation value testing unit 145 assesses that the hard disk 110 is likely to malfunction during the next non-secure temperature value application test and notifies the user of this fact (Step S 606 ).
  • Otherwise, the non-secure operation value testing unit 145 assesses that the hard disk 110 will function normally in the next non-secure temperature value application test. The process ends here.
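The procedure of Steps S 601 to S 606 can be sketched end to end as below. This is a minimal sketch under several assumptions that the patent does not state: the threshold temperature is searched for by raising the temperature in fixed steps, the search stops at a hypothetical upper limit standing in for the absolute rating, and the error-rate models, rates, and function names are invented for illustration.

```python
from typing import Callable, List, Optional


def find_threshold_temp(
    measure_error_rate: Callable[[float], float],
    failure_error_rate: float,
    start_c: float = 55.0,
    limit_c: float = 90.0,   # hypothetical cap, standing in for the absolute rating
    step_c: float = 1.0,     # hypothetical step size
) -> Optional[float]:
    """Step S604: raise the temperature above the secure operation value until
    the error rate reaches the failure threshold; None if it never does."""
    temp = start_c + step_c
    while temp <= limit_c:
        if measure_error_rate(temp) >= failure_error_rate:
            return temp
        temp += step_c
    return None


def run_failure_prediction(
    measure_error_rate: Callable[[float], float],
    history: List[float],
    warning_error_rate: float,
    failure_error_rate: float,
    secure_max_c: float = 55.0,
) -> List[str]:
    """Run one periodic test and return the notices for the user."""
    notices: List[str] = []
    # S601-S603: secure operation test at the maximum secure temperature.
    if measure_error_rate(secure_max_c) > warning_error_rate:
        notices.append("possible malfunction under normal operating conditions")
    # S604: determine the current threshold temperature.
    current = find_threshold_temp(measure_error_rate, failure_error_rate, secure_max_c)
    if current is None:
        return notices
    # S605-S606: compare with the previously stored threshold, if any.
    if history and abs(history[-1] - current) > abs(current - secure_max_c):
        notices.append("likely to malfunction at the next non-secure test")
    history.append(current)  # threshold value history (FIG. 5)
    return notices


# Hypothetical error-rate models (errors per million operations).
history = [80.0]
first = run_failure_prediction(lambda t: max(0.0, t - 50.0) * 2.0, history, 20.0, 40.0)
second = run_failure_prediction(lambda t: max(0.0, t - 42.0) * 2.0, history, 20.0, 40.0)
print(first, second, history)
```

With these invented models the first run finds a threshold of 70° C. and raises no notice, while the second run, on a more deteriorated model, raises both notices.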
  • the secure operation value testing unit 144 carries out the secure operation temperature value application test and the non-secure operation value testing unit 145 carries out the non-secure operation temperature value application test, thereby realizing a precise failure prediction for the hard disk 110 .
  • precise failure prediction is achieved as it is based on application of a greater temperature load than that required for a normal operation on the hard disk 110 by the temperature setting unit 120 controlled by the control unit 140 , and detection of malfunctioning of the hard disk 110 by the error detecting unit 130 .
  • Both the secure operation temperature value application test and the non-secure operation temperature value application test are carried out in the first embodiment of the present invention. However, either one of these tests can be carried out.
  • In the secure operation temperature value application test, a value close to the secure operation temperature values and within the secure operation temperature range can also be applied.
  • the temperature setting unit 120 of the blade server 100 carries out the temperature load test by applying a temperature load on the hard disk 110 .
  • the electronic parts themselves generate heat. This generated heat of the electronic part may also be applied as a temperature load.
  • a temperature load can be applied on the electronic parts by controlling the fan or a heat pump provided for cooling the electronic parts.
  • a temperature load can be applied by slowing down or stopping the fan.
  • the temperature load tests are carried out by applying temperature load by means of controlling the cooling of the electronic part.
  • FIG. 7 is a block diagram of an electronic device according to a second embodiment of the present invention.
  • An electronic device 200 includes an electronic part 210 , a cooling unit 220 , an error detecting unit 230 , and a control unit 240 .
  • the electronic part 210 is a part that realizes the device functions, and for which failure prediction is to be carried out.
  • the cooling unit 220 is a functional part that cools the electronic part 210.
  • the error detecting unit 230 is a functional part that detects malfunctioning in the electronic part 210 .
  • the control unit 240 applies a temperature load on the electronic part 210 by controlling the cooling unit 220 and carries out failure prediction for the electronic part 210 based on the error detected by the error detecting unit 230 .
  • the control unit 240 includes a temperature controller 241 , a secure operation value storing unit 242 , an error data collecting unit 243 , a secure operation value testing unit 244 , a non-secure operation value testing unit 245 , a threshold value history storing unit 246 , and a temperature load test controller 247 .
  • the temperature controller 241 sets the temperature of the electronic part 210 to a predetermined value by controlling the cooling unit 220 .
  • the secure operation value storing unit 242 stores secure operation temperature values of the electronic part 210.
  • the error data collecting unit 243 calculates error data of the electronic part 210 based on the error detected by the error detecting unit 230.
  • the secure operation value testing unit 244 carries out a secure temperature value application test.
  • the non-secure operation value testing unit 245 carries out a non-secure temperature value application test.
  • the threshold value history storing unit 246 stores a history of threshold temperature values determined by the non-secure operation value testing unit 245 .
  • the temperature load test controller 247 controls all the temperature load tests.
  • the temperature of the electronic part 210 is set to a predetermined value by controlling the cooling unit by means of the temperature controller 241 . Consequently, temperature load tests can be carried out without external application of heat on the electronic part 210 .
  • temperature load is applied on an electronic part by controlling a cooling unit.
  • heat load can be applied by increasing the processing load of the electronic part.
  • the temperature load tests are carried out by applying temperature load by increasing the processing load of the electronic part.
  • FIG. 8 is a block diagram of an electronic device according to a third embodiment of the present invention.
  • An electronic device 300 includes a communication processing part 310, a test data applying unit 320, an error detecting unit 330, a control unit 340, and a test data separating unit 350.
  • the communication processing part 310 is a part that carries out communication processing, and for which failure prediction is to be carried out.
  • the test data applying unit 320 is a processing unit that adds test data to regular data in order to increase the processing load of the communication processing part 310 .
  • the error detecting unit 330 is a functional unit that detects error generated in the communication processing part 310 .
  • the control unit 340 includes a temperature controller 341, a secure operation value storing unit 342, an error data collecting unit 343, a secure operation value testing unit 344, a non-secure operation value testing unit 345, a threshold value history storing unit 346, and a temperature load test controller 347.
  • the temperature controller 341 sets the temperature of the communication processing part 310 to a predetermined value by controlling the test data applying unit 320.
  • the secure operation value storing unit 342 stores secure operation temperature values of the communication processing parts.
  • the error data collecting unit 343 calculates error data based on the error detected by the error detecting unit 330 .
  • the secure operation value testing unit 344 carries out a secure operation temperature value application test.
  • the non-secure operation value testing unit 345 carries out a non-secure operation temperature value application test.
  • the threshold value history storing unit 346 stores threshold temperature values determined by the non-secure operation value testing unit 345 .
  • the temperature load test controller 347 controls all the temperature load tests.
  • the test data separating unit 350 retrieves communication data by separating the test data that is added to the communication data by the test data applying unit in order to increase the processing load of the communication processing part 310 .
  • the temperature of the communication processing part 310 is set to a predetermined value by controlling the test data applying unit to adjust the processing load of the communication processing part 310 . Consequently, temperature load tests can be carried out without external application of heat on the communication processing part 310 .
  • failure prediction is carried out by applying a temperature load on an electronic part. It is also possible to carry out failure prediction by applying an environmental load other than a temperature load to an electronic part.
  • voltage is one of the operating conditions of an electronic part.
  • a maximum voltage and a minimum voltage for absolute rating, and a maximum voltage and a minimum voltage for secure operation are stipulated as operating conditions for all electronic parts.
  • the voltage applied on an electronic part is varied and failure prediction is carried out for the electronic part under a high voltage condition or a low voltage condition.
  • FIG. 9 is a block diagram of a blade server according to a fourth embodiment of the present invention.
  • a blade server 400 includes a memory module 410 , a variable power source 420 , an error detecting unit 430 , and a control unit 440 .
  • the memory module 410 is a part for which failure prediction is to be carried out.
  • the memory module 410 has an error-correcting function which uses error-correcting code (ECC).
  • ECC error correction one-bit error is automatically corrected by the ECC and the result is reported to the blade server 400 .
  • a two-bit error occurs, it fails to be corrected and is reported as a memory access error.
  • FIG. 10 is a graph of an error ratio of the memory module 410 in response to an applied voltage. If the memory module 410 is operated within the secure operation value range, errors are unlikely to occur. However, as the characteristics of the device deteriorate with age, the frequency of memory errors increases.
  • the curve S shown in FIG. 10 represents the characteristic curve during normal operation. Neither one-bit nor two-bit errors occur within the secure operation value range. As aged deterioration progresses, the characteristic curve comes to resemble the curve T. Under such circumstances, only one-bit errors (point B) occur within the secure operation value range; two-bit errors do not. Therefore, the blade server 400 itself does not fail.
  • If the deterioration continues further, the characteristic curve resembles the curve U and a two-bit error (point C) occurs. Therefore, if the time of failure can be predicted before the deterioration progresses that far, the failure can be prevented.
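  • As a minimal sketch of how the curves S, T, and U map to observable ECC events, the following Python fragment classifies a memory module from the error counts seen during one voltage load test. The names `MemoryState` and `classify_memory` are hypothetical and do not appear in the specification; only the distinction between corrected one-bit errors and uncorrectable two-bit errors is taken from the description.

```python
from enum import Enum


class MemoryState(Enum):
    """Hypothetical labels for the three characteristic curves of FIG. 10."""
    NORMAL = "curve S"          # no errors in the secure operation range
    DETERIORATING = "curve T"   # only corrected one-bit errors
    FAILING = "curve U"         # uncorrectable two-bit errors


def classify_memory(one_bit_errors: int, two_bit_errors: int) -> MemoryState:
    """Classify a module from the ECC error counts of one load test."""
    if two_bit_errors > 0:
        return MemoryState.FAILING
    if one_bit_errors > 0:
        return MemoryState.DETERIORATING
    return MemoryState.NORMAL
```

  • For instance, a test run that reports three corrected one-bit errors and no two-bit errors would be classified as `MemoryState.DETERIORATING`, the stage at which replacement can still be planned.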
  • the time of failure can be predicted, by the procedure explained by the flow chart shown in FIG. 6 , by calculating the time of progress from point A to point B, or from point A′ to point C of FIG. 10 .
  • the variable power source 420 applies voltage load on the memory module 410 by varying the voltage.
  • the error detecting unit 430 detects the error that occurs in the memory module 410 .
  • the control unit 440 controls the variable power source 420 to apply varying voltage on the memory module 410 and carries out failure prediction.
  • the control unit 440 includes a voltage controller 441 , a secure operation value storing unit 442 , an error data collecting unit 443 , a secure operation value testing unit 444 , a non-secure operation value testing unit 445 , a threshold value history storing unit 446 , and a voltage load test controller 447 .
  • the voltage controller 441 controls the variable power source 420 .
  • the secure operation value storing unit 442 stores secure operation voltage values of the memory module 410 .
  • the error data collecting unit 443 calculates an error rate of the memory module 410 based on the error detected by the error detecting unit 430 .
  • the secure operation value testing unit 444 carries out a secure operation voltage value application test.
  • the non-secure operation value testing unit 445 carries out a non-secure operation voltage value application test.
  • the threshold value history storing unit 446 stores a history of threshold voltage values determined by the non-secure operation value testing unit 445 .
  • the voltage load test controller 447 controls all the voltage load tests.
  • precise failure prediction is achieved as it is based on application of a voltage load on the memory module 410 by the variable power source 420 controlled by the control unit 440 , and detection of error under the voltage load in the memory module 410 by the error detecting unit 430 .
  • a voltage load is applied on a memory module in the fourth embodiment.
  • a voltage load may be applied on a CPU to carry out failure prediction for the CPU.
  • FIG. 11 is a graph of an error status in a CPU in response to an applied voltage.
  • the CPU outputs an alarm signal when an error occurs in its internal functioning. This alarm is output when a bus parity error or sequence error occurs inside the CPU, and not when it is functioning normally (curve S).
  • failure prediction is carried out by applying an environmental load on an electronic part. Failure prediction can also be carried out by applying an environmental load on mechanical parts.
  • Error detection in the mechanical part can be carried out from the detection of error related to increased or decreased number of rotations (that is, when the number of rotations is not the stipulated value and there is a fluctuation in the number of rotations).
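  • The rotation check described above can be sketched as follows. This is an illustration under stated assumptions: the function name `rotation_error` and the `tolerance_rpm` parameter are hypothetical, since the description only says that an error is a rotation count that departs from the stipulated value.

```python
from typing import Iterable


def rotation_error(samples_rpm: Iterable[float],
                   stipulated_rpm: float,
                   tolerance_rpm: float) -> bool:
    """Return True if any sampled rotation count deviates from the
    stipulated value by more than the (assumed) tolerance."""
    return any(abs(s - stipulated_rpm) > tolerance_rpm for s in samples_rpm)
```

  • A real monitoring unit would choose the tolerance from the fan's data sheet; the value is left to the caller here because the specification does not give one.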
  • FIG. 12 is a block diagram of an electronic device according to a fifth embodiment of the present invention.
  • An electronic device 500 includes a fan 510 , a variable power source 520 , a rotation count monitoring unit 530 , and a control unit 540 .
  • the fan 510 is a part for which failure prediction is to be carried out.
  • the variable power source 520 is a power source that applies voltage of varying values on the fan 510 .
  • the rotation count monitoring unit 530 monitors the number of rotations of the fan 510 and detects the error.
  • the control unit 540 controls the variable power source 520 to vary the voltage value applied on the fan 510 and carries out failure prediction.
  • the control unit 540 includes a voltage controller 541, a secure operation value storing unit 542, an error data collecting unit 543, a secure operation value testing unit 544, a non-secure operation value testing unit 545, a threshold value history storing unit 546, and a voltage load test controller 547.
  • the voltage controller 541 controls the variable power source 520 .
  • the secure operation value storing unit 542 stores secure operation voltage values of the fan 510 .
  • the error data collecting unit 543 calculates error data of the fan 510 based on an error in rotation count detected by the rotation count monitoring unit 530 .
  • the secure operation value testing unit 544 carries out a secure operation voltage value application test.
  • the non-secure operation value testing unit 545 carries out a non-secure operation voltage value application test.
  • the threshold value history storing unit 546 stores a history of threshold voltages determined by the non-secure operation value testing unit 545 .
  • the voltage load test controller 547 controls all the voltage load tests.
  • precise failure prediction is achieved as it is based on application of a voltage load on the fan 510 by the variable power source 520 controlled by the control unit 540 , and detection of error under the voltage load by the rotation count monitoring unit 530 .
  • Failure prediction for a fan is carried out by applying a voltage load to it in the fifth embodiment of the present invention.
  • failure prediction can be carried out for any motor-driven mechanical part or for a non-electronic device which includes a motor-driven mechanical part.
  • failure prediction can be carried out by increasing the access frequency of the hard disk or by increasing the load on the drive system, apart from failure prediction by application of environmental load such as temperature load, voltage load, etc.
  • FIG. 13 is a schematic of a computer that executes a failure prediction program for the electronic devices according to the first embodiment to the fifth embodiment.
  • a computer 600 includes a CPU 610, a random access memory (RAM) 620, a read-only memory (ROM) 630, an input/output (I/O) interface 640, and a non-volatile memory 650.
  • the CPU 610 is a processing device that executes the failure prediction program.
  • the RAM 620 is a storage unit that stores intermediate results of the failure prediction program.
  • the ROM 630 is a storage unit that stores the failure prediction program, secure operation values, etc.
  • the I/O interface 640 is an interface for inputting values measured by a sensor, or for outputting settings to a temperature setting unit or a variable power source.
  • the non-volatile memory 650 stores data that is stored in a threshold value history storing unit, etc.
  • the CPU 610 , RAM 620 , ROM 630 , non-volatile memory 650 , etc. may be used as constituent parts of an electronic device as well as exclusively for executing the failure prediction program.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Testing Electric Properties And Detecting Electric Faults (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An electronic device that includes a part prone to aged deterioration includes an environmental load applying unit that applies a higher environmental load on the part than an environmental load in a normal operation, an error detecting unit that detects an error in the part with the higher environmental load applied, and a failure predicting unit that predicts a failure of the part based on the error detected.

Description

    BACKGROUND OF THE INVENTION
  • 1) Field of the Invention
  • The present invention relates to a technology for predicting a failure of a part that is prone to aged deterioration in an electronic device by directly detecting a signal of impending failure of the part.
  • 2) Description of the Related Art
  • The possibility of a system failure due to the breakdown of constituent parts has increased in recent electronic devices, such as information apparatuses and communication apparatuses, because of the large scale and high degree of integration of such systems. For instance, a blade server that realizes high-density mounting includes more parts, such as CPUs on the scale of several hundred, than a conventional server, which may lead to a higher rate of part failure. Although methods such as dual operation of a system are employed to avoid system failure, system failures still occur when a complex failure interacts with a latent failure in the system.
  • Hence, it is extremely important to predict a failure of a part to avoid the system failure, and a technology to predict a part failure using a statistical method has been developed. The failure prediction based on the statistical method compares a result of measurement of operation status of an electronic device using a sensor with an operation model of constituent parts to predict a failure.
  • The operation model is created based on performance data obtained from each of the parts, and is periodically updated. By compensating for minute disturbances that appear as noise between the operation model and a result of actual measurement, it is possible to determine whether the result is within an acceptable range or is a sign of a coming failure. For instance, in the case of a hard disk device, it is possible to predict a failure by comparing a measured response time with a response time calculated from the operation model.
  • Another technique, as an extension of the statistical method, adds a redundant hardware structure to an electronic circuit or a part of interest, and applies to the redundant structure a greater load than that of normal operation of the circuit. When the redundant circuit breaks down, the technique predicts that a breakdown of the circuit of interest may be imminent (for example, see Japanese Patent Laid-Open Publication No. H2-87079 and Japanese Patent Laid-Open Publication No. H7-128384).
  • However, the accuracy of the statistical method depends on the quality of the operation model, and it is difficult to model all operations of a complex, large-scale semiconductor device. Besides, the accuracy of the statistical method also depends on the threshold value set for judging a difference between the actual operation and the operation model, and it is extremely difficult to set a proper threshold value.
  • The method using the redundant circuit is no better than the statistical method, since it suffers from considerable error caused by variation among parts, subtle differences in the test environment, etc. Furthermore, from the standpoint of running a system, it is not easy to replace a part of questionable life expectancy while the circuit of interest is operating normally.
  • SUMMARY OF THE INVENTION
  • It is an object of the present invention to solve at least the problems in the conventional technology.
  • The electronic device according to one aspect of the present invention includes an environmental load applying unit that applies a higher environmental load on the part than an environmental load in a normal operation, an error detecting unit that detects an error in the part with the higher environmental load applied, and a failure predicting unit that predicts a failure of the part based on the error detected.
  • The failure prediction method according to another aspect of the present invention includes applying a higher environmental load on the part than an environmental load in a normal operation, detecting an error in the part with the higher environmental load applied and predicting the failure of the part based on the error detected.
  • The computer program according to still another aspect of the present invention realizes the method according to the above aspect on a computer.
  • The computer readable recording medium according to still another aspect of the present invention stores the computer program according to the above aspect.
  • The other objects, features, and advantages of the present invention are specifically set forth in or will become apparent from the following detailed description of the invention when read in conjunction with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1A is a graph of part characteristic curves in response to an environmental load;
  • FIG. 1B is a block diagram of an electronic device according to an embodiment of the present invention;
  • FIG. 2 is a block diagram of a blade server according to a first embodiment of the present invention;
  • FIG. 3 is a graph of an error rate of a hard disk in response to a temperature load;
  • FIG. 4 is a table of an example of data stored in a secure operation value storing unit;
  • FIG. 5 is a table of an example of data stored in a threshold value history storing unit;
  • FIG. 6 is a flowchart of a process procedure of failure prediction in the blade server according to the first embodiment;
  • FIG. 7 is a block diagram of an electronic device according to a second embodiment of the present invention;
  • FIG. 8 is a block diagram of an electronic device according to a third embodiment of the present invention;
  • FIG. 9 is a block diagram of a blade server according to a fourth embodiment of the present invention;
  • FIG. 10 is a graph of an error ratio of a memory module in response to an applied voltage;
  • FIG. 11 is a graph of an error status in a CPU in response to an applied voltage;
  • FIG. 12 is a block diagram of an electronic device according to a fifth embodiment of the present invention; and
  • FIG. 13 is a schematic of a computer that executes a failure prediction program for the electronic devices according to the first embodiment to the fifth embodiment.
  • DETAILED DESCRIPTION
  • Exemplary embodiments of an electronic device, a failure prediction method, and a computer product according to the present invention are explained in detail with reference to the accompanying drawings. In the present embodiments, the present invention is applied to a blade server.
  • A concept of failure prediction according to an embodiment of the present invention is explained first with reference to FIG. 1A and FIG. 1B. In the failure prediction method according to the present invention, failure is predicted by exploiting the tendency of an electronic part to fail when a high stress is applied.
  • FIG. 1A is a graph of part characteristic curves in response to an environmental load. An electronic part tends to malfunction when a high load is applied in environmental conditions such as temperature, humidity, voltage, etc.
  • For this reason, the environmental conditions for the electronic parts mounted on an apparatus are usually arranged so that they stay within the values stipulated for secure operation. (For instance, if the environmental temperature of a part is anticipated to exceed the stipulated value, a fan or the like is provided for cooling.)
  • To test the parts that have a tendency to malfunction under high stress, these parts are subjected to a burn-in test as a part of the product release test. In the burn-in test, temperatures exceeding those for secure operation are applied and all the parts that develop errors are assessed to be defective.
  • A failure due to aged deterioration is assessed to have occurred in an electronic part when the characteristics of the part deteriorate gradually and eventually cross the threshold value for malfunctioning under the operating environment.
  • In other words, the aged deterioration of parts progresses (from curve S to curve T) even though no failure occurs under normal operating environment conditions. Operation under normal conditions is not affected even if the deterioration progresses almost to the point of breakdown of the part. However, if the deterioration progresses further, operation under normal conditions is adversely affected, leading to a cascading breakdown, first of the entire apparatus, and then of the entire system (curve U and curve V).
  • Therefore, if the aged deterioration of the part can be detected, then the failure prediction can be done for the part. In the failure prediction method according to the present invention, a high load of environmental condition is applied on the part and malfunctioning under high load is detected. In this way, the part that is normally functioning under normal conditions in spite of aged deterioration is detected and the failure of the apparatus and system due to a malfunctioning part can be avoided.
  • FIG. 1B is a block diagram of an electronic device according to an embodiment of the present invention. An electronic device 10 includes parts 11, an environmental load applying unit 12, an error detecting unit 13, and a control unit 14.
  • The parts 11 realize the function of the electronic device 10 and are the parts for which failure prediction is to be carried out. The environmental load applying unit 12 is a functional part that applies an environmental load in the form of temperature, humidity, voltage, etc. on the parts 11.
  • The error detecting unit 13 is a functional part that detects errors in the part to which the load is applied. Error detection can be carried out by providing a detecting unit externally or by monitoring error signals, such as parity/ECC errors, output by the parts 11.
  • The control unit 14 controls the environmental load applying unit 12 to apply an environmental load on the parts 11, and based on the error detected by the error detecting unit 13, carries out failure prediction for the parts 11. The control unit 14 may issue an instruction for an environmental load application test under normal operating conditions. Alternatively, when a problem is encountered in the apparatus, the control unit 14 may switch to a test mode and issue the instruction for the environmental load application test.
  • Thus, in the electronic device 10 according to the present embodiment, the control unit 14 controls the environmental load applying unit 12 to apply a greater load than under normal operating conditions. The control unit 14 then carries out failure prediction for the parts 11 based on the error detected by the error detecting unit 13. Consequently, it is possible to directly detect the extent of deterioration of a part that is approaching breakdown due to aged deterioration but still functions normally under normal conditions.
  • FIG. 2 is a block diagram of a blade server according to a first embodiment of the present invention. A blade server 100 includes a hard disk 110, a temperature setting unit 120, an error detecting unit 130, and a control unit 140.
  • The hard disk 110 is the part to which a temperature load is applied by the temperature setting unit 120 and for which failure prediction is to be carried out. FIG. 3 is a graph of an error rate of the hard disk 110 in response to a temperature load.
  • Generally, a maximum temperature and a minimum temperature for absolute rating, and a maximum temperature and a minimum temperature for secure operation are stipulated as operating conditions for all electronic parts such as the hard disk 110. The absolute rating is a value which when exceeded results in the possibility of a breakdown of the part. The secure operation temperature value is a temperature value which when exceeded does not guarantee a normal operation.
  • In other words, a normal part may not function normally in an environment outside the secure operation temperature range, whereas within that range normal functioning of the part is guaranteed. However, a part that is approaching breakdown due to aged deterioration functions normally in a normal operating environment, but may malfunction in an environment that is within the secure operation temperature range yet close to its limits.
  • Thus, by periodically applying to the hard disk 110 a temperature load equal to a secure operation temperature limit, or a temperature close to that limit, it can be determined whether the hard disk 110 is approaching a breakdown due to aged deterioration.
  • Under normal conditions, the error rate of the hard disk 110 within the secure operation temperature range is below a warning error rate (Erw) (curve S). If the aged deterioration progresses and the characteristics of the hard disk 110 deteriorate, the characteristic curve is represented first by curve T and then by curve U, and a SMART (Self-Monitoring, Analysis and Reporting Technology) function of the hard disk 110 assesses this as a warning situation.
  • The SMART function is a self-diagnostic function of the hard disk 110, which logs errors that occur. When multiple errors occur, the SMART function decides that the hard disk 110 needs to be replaced.
  • However, since the time span between the assessment by the SMART function as a warning situation and a complete breakdown of the part (represented by curve V) is short, there is a high possibility of a breakdown of the entire system before the disk is replaced.
  • Therefore, in the blade server 100 according to the present embodiment, the error rate is measured under the operating condition of the minimum secure operation temperature of 5° C. or the maximum secure operation temperature of 55° C. Thus, failure prediction can be done at a step earlier (represented by curve T) than the failure prediction step of the SMART function.
  • The temperature setting unit 120 is a functional part that applies the temperature load on the hard disk 110, for instance, a temperature correcting circuit such as a Peltier element. The error detecting unit 130 is a functional part that detects malfunctioning in the hard disk 110 that results from the application of the temperature load.
  • The control unit 140 is a functional part that controls the temperature setting unit 120 to apply the temperature load on the hard disk 110 based on the temperature measured by a temperature sensor disposed near the hard disk 110, and carries out failure prediction by calculating the error rate of the hard disk 110 based on the error detected by the error detecting unit 130.
  • The control unit 140 includes a temperature controller 141, a secure operation value storing unit 142, an error data collecting unit 143, a secure operation value testing unit 144, a non-secure operation value testing unit 145, a threshold value history storing unit 146, and a temperature load test controller 147.
  • The temperature controller 141 controls the temperature setting unit 120 to set the temperature of the hard disk 110 to the temperature setting specified by the secure operation value testing unit 144 or the non-secure operation value testing unit 145.
  • The secure operation value storing unit 142 stores secure operation temperature values of the hard disk 110. FIG. 4 is a table of an example of data stored in the secure operation value storing unit 142 in which the minimum secure operation temperature is 5° C. and the maximum secure operation temperature is 55° C.
  • The error data collecting unit 143 calculates the error rate of the hard disk 110 based on the error detected by the error detecting unit 130. The error rate calculated by the error data collecting unit 143 is used by the secure operation value testing unit 144 and the non-secure operation value testing unit 145 in the failure prediction for the hard disk 110.
  • The secure operation value testing unit 144 carries out a secure temperature value application test to determine the error rate when a temperature load of the secure operation value is applied on the hard disk 110. If the determined error rate exceeds the warning error rate, the secure operation value testing unit 144 notifies the user that there is a high possibility of a breakdown of the hard disk 110. To be more specific, the secure operation value testing unit 144 carries out the secure operation value test by applying the maximum secure operation temperature of 55° C. on the hard disk 110.
  • The secure operation value testing unit 144 is able to carry out precise failure prediction for the hard disk 110 by determining the error rate while applying the maximum secure operation temperature of 55° C. on the hard disk 110, and notifying the user of a high possibility of a breakdown of the hard disk 110 if the determined error rate exceeds the warning error rate.
  • In the description given above, the secure operation value testing unit 144 determines the error rate by applying the maximum secure operation temperature of 55° C. on the hard disk 110. However, the secure operation value testing unit 144 can also determine the error rate by applying the minimum secure operation temperature of 5° C. on the hard disk 110. Further, apart from the maximum and minimum secure operation temperatures, a temperature that is close to the secure operation value within the secure operation temperature range can also be applied.
  • Further, the error rate can be determined by applying both the maximum secure operation temperature of 55° C. and the minimum secure operation temperature of 5° C., and if either of the determined error rates exceeds the warning error rate, the user is notified of the high possibility of a breakdown of the hard disk 110.
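  • The secure temperature value application test at both limits can be sketched as follows. This is only an illustration: the function name `secure_operation_test` and the callable `measure_error_rate`, which stands in for the temperature setting unit 120 and the error data collecting unit 143, are hypothetical. The 55° C. and 5° C. values are the secure operation temperatures given in the description.

```python
from typing import Callable, List, Sequence


def secure_operation_test(measure_error_rate: Callable[[float], float],
                          warning_error_rate: float,
                          test_temps_c: Sequence[float] = (55.0, 5.0)
                          ) -> List[float]:
    """Apply each secure operation temperature limit and return the
    temperatures at which the error rate exceeded the warning error rate.
    A non-empty result means the user should be notified."""
    return [t for t in test_temps_c
            if measure_error_rate(t) > warning_error_rate]
```

  • Returning the offending temperatures rather than a single flag is a design choice of this sketch; it lets the notification say which limit triggered the warning.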
  • The non-secure operation value testing unit 145 carries out a non-secure temperature value application test that determines, when a temperature load exceeding the secure operation temperature values is applied on the hard disk 110, a threshold temperature at which the error rate reaches a failure threshold.
  • The non-secure operation value testing unit 145 carries out the failure prediction for the hard disk 110 based on a relation between a previously measured threshold temperature, a current measured threshold temperature and the maximum secure operation temperature of 55° C. The non-secure operation value testing unit 145 determines the previously measured threshold temperature from a previous non-secure temperature value application test, and the current measured threshold temperature from a current non-secure temperature value application test.
  • The non-secure operation value testing unit 145 compares an absolute value obtained from the difference between the previously measured threshold temperature and the current measured threshold temperature with an absolute value obtained from the difference between the current measured threshold temperature and the maximum secure operation temperature of 55° C. If the absolute value obtained from the difference between the previously measured threshold temperature and the current measured threshold temperature is greater, the non-secure operation value testing unit 145 assesses that the hard disk 110 is likely to malfunction during the next non-secure temperature value application test and notifies the user of this fact.
  • The non-secure operation value testing unit 145 is able to carry out precise failure prediction for the hard disk 110 by determining, when a temperature load exceeding the maximum secure operation temperature of 55° C. is applied on the hard disk 110, the threshold temperature at which the error rate reaches the failure threshold, and carrying out failure prediction based on the relation between the previously measured threshold temperature, the current measured threshold temperature, and the maximum secure operation temperature.
  • Generally, a part does not immediately malfunction when a load exceeding a secure operation value is applied. Therefore, by first determining a threshold value, beyond the secure operation value, up to which the part still works normally, and then tracking this threshold, which varies with aged deterioration of the part, malfunctioning can be predicted within a time range in which operation recovery is possible.
  • In the description given above, the threshold temperature is determined by applying a temperature load exceeding the maximum secure operation temperature of 55° C. on the hard disk 110. However, the threshold temperature can also be determined by applying a temperature load below the minimum secure operation temperature of 5° C. on the hard disk 110. Further, it is also possible to apply temperature loads beyond both limits, that is, above the maximum secure operation temperature of 55° C. and below the minimum secure operation temperature of 5° C., and failure prediction can be carried out by determining the threshold temperatures in both cases. If malfunctioning of the hard disk 110 during the next non-secure temperature value application test is predicted based on either of the threshold temperatures, the user is notified of the fact.
  • The threshold value history storing unit 146 stores a history of the threshold temperature values determined by the non-secure operation value testing unit 145. For instance, as shown in FIG. 3, the temperatures A, B, etc., which are points at which the characteristic curves cut the line representing the failure threshold, are stored as the threshold temperatures in the threshold value history storing unit 146. The threshold temperatures stored in the threshold value history storing unit 146 are used by the non-secure operation value testing unit 145 for carrying out failure prediction.
  • FIG. 5 is a table of an example of data stored in the threshold value history storing unit 146. The threshold temperatures determined by the non-secure temperature value application test are stored sequentially in the threshold value history storing unit 146. For instance, the threshold value of the previous non-secure temperature value application test is 80° C., and the threshold temperature determined from the current non-secure temperature value application test is 70° C.
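  • As a worked illustration of the comparison made by the non-secure operation value testing unit 145, the criterion can be written with the symbols below, which are introduced here for convenience and do not appear in the specification: the previously measured threshold temperature, the current measured threshold temperature, and the maximum secure operation temperature of 55° C. The first evaluation uses the FIG. 5 values; the second uses a hypothetical later measurement of 60° C.

```latex
% Warning criterion (symbols are notation assumed for this note)
\[
  \lvert T_{\mathrm{prev}} - T_{\mathrm{cur}} \rvert
  \;>\;
  \lvert T_{\mathrm{cur}} - T_{\mathrm{max}} \rvert
\]

% FIG. 5 values: T_prev = 80, T_cur = 70, T_max = 55 (degrees C)
\[
  \lvert 80 - 70 \rvert = 10, \qquad
  \lvert 70 - 55 \rvert = 15, \qquad
  10 \not> 15 \;\Rightarrow\; \text{no warning}
\]

% Hypothetical next test: T_prev = 70, T_cur = 60
\[
  \lvert 70 - 60 \rvert = 10, \qquad
  \lvert 60 - 55 \rvert = 5, \qquad
  10 > 5 \;\Rightarrow\; \text{warning}
\]
```

  • One way to read the criterion, offered as an interpretation rather than as a statement from the specification: if the threshold keeps falling by the same amount per test interval, a warning is raised when the next threshold would drop below the maximum secure operation temperature.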
  • The temperature load test controller 147 applies the temperature load on the hard disk 110 and controls all the temperature load tests. To be more specific, the temperature load test controller 147 carries out temperature load tests and failure prediction by transferring controls between the functional parts and transferring data between the functional parts and the storage unit.
  • A process sequence of failure prediction in the blade server 100 according to the first embodiment of the present invention is explained next. FIG. 6 is a flowchart of a process procedure of failure prediction in the blade server 100 according to the first embodiment. The failure prediction is carried out at fixed intervals.
  • The secure operation value testing unit 144 first sets the temperature of the hard disk 110 to a secure operation temperature of 55° C. and carries out the secure operation test (Step S601).
  • The secure operation value testing unit 144 then assesses whether the error rate of the hard disk 110 is greater than the warning error rate (Step S602). If the error rate is greater than the warning error rate, the secure operation value testing unit 144 notifies the user of the possibility of malfunctioning of the hard disk 110 under normal operating conditions (Step S603).
  • On the other hand, if the error rate is not greater than the warning error rate, the non-secure operation value testing unit 145 determines, by applying a non-secure operation temperature, the load threshold, that is, the threshold temperature, at which the hard disk 110 malfunctions (Step S604).
  • The non-secure operation value testing unit 145 then compares two quantities: the absolute value of the difference between the previously measured threshold temperature and the currently measured threshold temperature, and the absolute value of the difference between the currently measured threshold temperature and the secure operation temperature (Step S605). If the former is greater, the non-secure operation value testing unit 145 assesses that the hard disk 110 is likely to malfunction during the next non-secure temperature value application test and notifies the user of this fact (Step S606).
  • If the absolute value obtained from the difference between the previously measured threshold temperature and the current measured threshold temperature is smaller, the non-secure operation value testing unit 145 assesses that the hard disk 110 will function normally in the next non-secure temperature value application test. The process ends here.
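  • The decision logic of Steps S602 to S606 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the function name `predict_failure`, its parameters, the returned strings, and the handling of a missing history or of equal differences are assumptions introduced here. Only the 55° C. secure operation temperature and the comparison of the two absolute differences come from the description.

```python
from typing import Optional

SECURE_TEMP_C = 55.0  # secure operation temperature given in the description


def predict_failure(
    error_rate_at_secure: float,
    warning_error_rate: float,
    previous_threshold_c: Optional[float],
    current_threshold_c: float,
    secure_temp_c: float = SECURE_TEMP_C,
) -> str:
    """Return a verdict string following Steps S602 to S606 (hypothetical API)."""
    # Steps S602/S603: the error rate already exceeds the warning rate
    # at the secure operation temperature.
    if error_rate_at_secure > warning_error_rate:
        return "warn: may malfunction under normal operating conditions"

    # Assumption: with no stored history there is nothing to compare against.
    if previous_threshold_c is None:
        return "ok"

    # Step S605: compare the change since the previous test with the
    # margin that remains above the secure operation temperature.
    drop = abs(previous_threshold_c - current_threshold_c)
    margin = abs(current_threshold_c - secure_temp_c)

    # Step S606: warn only when the change is strictly greater (equality is
    # treated as "ok" here, which the description does not specify).
    if drop > margin:
        return "warn: likely to malfunction by the next test"
    return "ok"


if __name__ == "__main__":
    # FIG. 5 values: previous threshold 80 C, current threshold 70 C.
    print(predict_failure(0.0, 0.01, 80.0, 70.0))
```

  • With the FIG. 5 values (previous threshold 80° C., current threshold 70° C.) and the secure operation temperature of 55° C., the change is 10° C. and the remaining margin is 15° C., so this sketch reports no impending failure.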
  • Thus, the secure operation value testing unit 144 carries out the secure operation temperature value application test and the non-secure operation value testing unit 145 carries out the non-secure operation temperature value application test, thereby realizing a precise failure prediction for the hard disk 110.
  • Thus, in the first embodiment of the present invention, precise failure prediction is achieved as it is based on application of a greater temperature load than that required for a normal operation on the hard disk 110 by the temperature setting unit 120 controlled by the control unit 140, and detection of malfunctioning of the hard disk 110 by the error detecting unit 130.
  • Consequently, a failure of the entire blade server 100 system caused by a malfunctioning of the hard disk 110 can be avoided. Moreover, the cost incurred for replacement as well as maintenance time can be cut down since the life of the hard disk 110 can be assessed.
  • Both the secure operation temperature value application test and the non-secure operation temperature value application test are carried out in the first embodiment of the present invention. However, either one of these tests can be carried out. In the secure operation temperature value application test, a value close to the secure operation temperature values and within the secure operation temperature range can be applied.
  • In the first embodiment explained above, the temperature setting unit 120 of the blade server 100 carries out the temperature load test by applying a temperature load on the hard disk 110. However, in electronic devices in general, the electronic parts themselves generate heat. This generated heat of the electronic part may also be applied as a temperature load.
  • To be more specific, a temperature load can be applied on the electronic parts by controlling the fan or a heat pump provided for cooling the electronic parts. For instance, a temperature load can be applied by slowing down or stopping the fan. In the electronic device explained in a second embodiment of the present invention, the temperature load tests are carried out by applying temperature load by means of controlling the cooling of the electronic part.
  • FIG. 7 is a block diagram of an electronic device according to a second embodiment of the present invention. An electronic device 200 includes an electronic part 210, a cooling unit 220, an error detecting unit 230, and a control unit 240.
  • The electronic part 210 is a part that realizes the device functions, and for which failure prediction is to be carried out. The cooling unit 220 is a functional part that cools the electronic part 210. The error detecting unit 230 is a functional part that detects malfunctioning in the electronic part 210.
  • The control unit 240 applies a temperature load on the electronic part 210 by controlling the cooling unit 220 and carries out failure prediction for the electronic part 210 based on the error detected by the error detecting unit 230.
  • The control unit 240 includes a temperature controller 241, a secure operation value storing unit 242, an error data collecting unit 243, a secure operation value testing unit 244, a non-secure operation value testing unit 245, a threshold value history storing unit 246, and a temperature load test controller 247. The temperature controller 241 sets the temperature of the electronic part 210 to a predetermined value by controlling the cooling unit 220. The secure operation value storing unit 242 stores secure operation temperature values of the electronic part 210. The error data collecting unit 243 calculates error data of the electronic part 210 based on the error detected by the error detecting unit 230. The secure operation value testing unit 244 carries out a secure temperature value application test. The non-secure operation value testing unit 245 carries out a non-secure temperature value application test. The threshold value history storing unit 246 stores a history of threshold temperature values determined by the non-secure operation value testing unit 245. The temperature load test controller 247 controls all the temperature load tests.
  • Thus, in the second embodiment of the present invention, the temperature of the electronic part 210 is set to a predetermined value by controlling the cooling unit by means of the temperature controller 241. Consequently, temperature load tests can be carried out without external application of heat on the electronic part 210.
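  • One way to picture how a temperature controller could raise the part temperature through the cooling unit alone is a simple step controller on the fan duty. The sketch below is an assumption made for illustration: the description does not specify a control law, and the name `next_fan_duty`, the duty range of 0.0 to 1.0, and the step size are hypothetical.

```python
def next_fan_duty(
    current_temp_c: float,
    target_temp_c: float,
    duty: float,
    step: float = 0.05,
) -> float:
    """Return the next fan duty in the range 0.0 (stopped) to 1.0 (full speed).

    Hypothetical step controller: slow the fan while the part is cooler than
    the target load temperature, speed it up while the part is hotter.
    """
    if current_temp_c < target_temp_c:
        duty -= step  # less cooling, so self-heating raises the temperature
    elif current_temp_c > target_temp_c:
        duty += step  # more cooling, so the temperature falls
    # Clamp to the physically meaningful range.
    return min(1.0, max(0.0, duty))


if __name__ == "__main__":
    duty = 0.5
    for temp in (50.0, 58.0, 66.0, 71.0):  # illustrative sensor readings
        duty = next_fan_duty(temp, 70.0, duty)
        print(f"temp={temp:.0f} C -> duty={duty:.2f}")
```

  • Stopping the fan corresponds to a duty of 0.0, which matches the remark above that a temperature load can be applied by slowing down or stopping the fan.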
  • In the second embodiment explained above, temperature load is applied on an electronic part by controlling a cooling unit. However, if the heat generation is not enough for conducting the temperature load test, heat load can be applied by increasing the processing load of the electronic part. In the electronic device explained in a third embodiment of the present invention, the temperature load tests are carried out by applying temperature load by increasing the processing load of the electronic part.
  • FIG. 8 is a block diagram of an electronic device according to a third embodiment of the present invention. An electronic device 300 includes a communication processing part 310, a test data applying unit 320, an error detecting unit 330, a control unit 340, and a test data separating unit 350.
  • The communication processing part 310 is a part that carries out communication processing, and for which failure prediction is to be carried out. The test data applying unit 320 is a processing unit that adds test data to regular data in order to increase the processing load of the communication processing part 310. The error detecting unit 330 is a functional unit that detects error generated in the communication processing part 310.
  • The control unit 340 includes a temperature controller 341, a secure operation value storing unit 342, an error data collecting unit 343, a secure operation value testing unit 344, a non-secure operation value testing unit 345, a threshold value history storing unit 346, and a temperature load test controller 347. The temperature controller 341 sets the temperature of the communication processing part 310 to a predetermined value by controlling the test data applying unit 320. The secure operation value storing unit 342 stores secure operation temperature values of the communication processing part 310.
  • The error data collecting unit 343 calculates error data based on the error detected by the error detecting unit 330. The secure operation value testing unit 344 carries out a secure operation temperature value application test. The non-secure operation value testing unit 345 carries out a non-secure operation temperature value application test. The threshold value history storing unit 346 stores threshold temperature values determined by the non-secure operation value testing unit 345. The temperature load test controller 347 controls all the temperature load tests.
  • The test data separating unit 350 retrieves communication data by separating the test data that is added to the communication data by the test data applying unit in order to increase the processing load of the communication processing part 310.
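  • The pairing of a test data applying unit and a test data separating unit can be illustrated with a tagged framing scheme. The description does not define a data format, so everything below is an assumption for illustration: the frame layout (one tag byte plus a two-byte length), the tag values, and the function names `apply_test_data` and `separate_test_data` are hypothetical.

```python
import struct
from typing import List

REGULAR, TEST = 0, 1  # hypothetical frame tags
HEADER = ">BH"        # tag (1 byte) + payload length (2 bytes), big-endian


def _frame(tag: int, payload: bytes) -> bytes:
    """Prefix a payload with its tag and length."""
    return struct.pack(HEADER, tag, len(payload)) + payload


def apply_test_data(regular: List[bytes], filler: bytes, copies: int) -> bytes:
    """Interleave `copies` test frames after every regular frame to raise load."""
    out = bytearray()
    for message in regular:
        out += _frame(REGULAR, message)
        for _ in range(copies):
            out += _frame(TEST, filler)
    return bytes(out)


def separate_test_data(stream: bytes) -> List[bytes]:
    """Drop the test frames and return only the regular communication data."""
    regular: List[bytes] = []
    pos = 0
    header_size = struct.calcsize(HEADER)
    while pos < len(stream):
        tag, length = struct.unpack_from(HEADER, stream, pos)
        pos += header_size
        if tag == REGULAR:
            regular.append(stream[pos:pos + length])
        pos += length  # skip the payload whether it was kept or discarded
    return regular


if __name__ == "__main__":
    messages = [b"hello", b"world"]
    loaded = apply_test_data(messages, b"\x00" * 16, copies=3)
    print(len(loaded), separate_test_data(loaded) == messages)
```

  • In this sketch the `copies` parameter plays the role of the quantity that a temperature controller would adjust: more test frames per regular frame means more processing and therefore more self-heating.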
  • Thus, in the third embodiment of the present invention, the temperature of the communication processing part 310 is set to a predetermined value by controlling the test data applying unit to adjust the processing load of the communication processing part 310. Consequently, temperature load tests can be carried out without external application of heat on the communication processing part 310.
  • In the first embodiment to the third embodiment, failure prediction is carried out by applying a temperature load on an electronic part. It is also possible to carry out failure prediction by applying environmental loads to an electronic part other than temperature load.
  • For instance, voltage is one of the operating conditions of an electronic part. A maximum voltage and a minimum voltage for absolute rating, and a maximum voltage and a minimum voltage for secure operation are stipulated as operating conditions for all electronic parts. In a blade server according to a fourth embodiment of the present invention, the voltage applied on an electronic part is varied and failure prediction is carried out for the electronic part under a high voltage condition or a low voltage condition.
  • FIG. 9 is a block diagram of a blade server according to a fourth embodiment of the present invention. A blade server 400 includes a memory module 410, a variable power source 420, an error detecting unit 430, and a control unit 440.
  • The memory module 410 is a part for which failure prediction is to be carried out. The memory module 410 has an error-correcting function which uses error-correcting code (ECC). In ECC error correction, one-bit error is automatically corrected by the ECC and the result is reported to the blade server 400. However, when a two-bit error occurs, it fails to be corrected and is reported as a memory access error.
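  • The behaviour described here, in which a one-bit error is corrected and a two-bit error is only detected, is that of a single-error-correcting, double-error-detecting (SECDED) code. The toy example below uses an extended Hamming code over 4 data bits purely to illustrate that classification; it is not the code used by the memory module 410, and the function names `encode_secded` and `classify` are hypothetical.

```python
from typing import List


def encode_secded(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is the overall parity bit,
    indices 1..7 are Hamming(7,4) positions (parity at 1, 2, 4)."""
    code = [0] * 8
    for pos, bit in zip((3, 5, 6, 7), range(4)):
        code[pos] = (nibble >> bit) & 1
    for p in (1, 2, 4):
        # Parity bit p makes the count of set bits among positions whose
        # index includes bit p even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def classify(code: List[int]) -> str:
    """Classify a received word as 'ok', 'corrected' (one-bit error),
    or 'uncorrectable' (two-bit error)."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        return "ok"
    if not overall_ok:
        return "corrected"      # odd number of flips: treated as a single error
    return "uncorrectable"      # even parity but nonzero syndrome: double error


if __name__ == "__main__":
    word = encode_secded(0b1011)
    one_bit = word[:]
    one_bit[5] ^= 1
    two_bit = one_bit[:]
    two_bit[2] ^= 1
    print(classify(word), classify(one_bit), classify(two_bit))
```

  • Counting how often each of the two error classes occurs under a given load is one plausible basis for the error rate that the error data collecting unit calculates.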
  • FIG. 10 is a graph of an error ratio of a memory module 410 in response to an applied voltage. If the memory module 410 is operated within the secure operation value range, errors are unlikely to occur. However, when the characteristics of the device deteriorate with age, the frequency of occurrence of memory errors increases.
  • The curve S shown in FIG. 10 represents a characteristic curve during normal operation. Neither one-bit nor two-bit errors occur within the secure operation value range. As deterioration with age progresses, the characteristic curve comes to resemble the curve T. Under such circumstances, one-bit errors (point B) occur even within the secure operation value range, but two-bit errors do not. Therefore, there is no failure as such of the blade server 400.
  • However, if the deterioration continues further, the characteristic curve comes to resemble the curve U, and two-bit errors (point C) occur. Therefore, if the time of failure can be predicted before the deterioration progresses that far, the failure can be prevented.
  • In other words, the time of failure can be predicted, by the procedure explained by the flow chart shown in FIG. 6, by calculating the time of progress from point A to point B, or from point A′ to point C of FIG. 10.
  • The variable power source 420 applies voltage load on the memory module 410 by varying the voltage. The error detecting unit 430 detects the error that occurs in the memory module 410.
  • The control unit 440 controls the variable power source 420 to apply varying voltage on the memory module 410 and carries out failure prediction. The control unit 440 includes a voltage controller 441, a secure operation value storing unit 442, an error data collecting unit 443, a secure operation value testing unit 444, a non-secure operation value testing unit 445, a threshold value history storing unit 446, and a voltage load test controller 447. The voltage controller 441 controls the variable power source 420. The secure operation value storing unit 442 stores secure operation voltage values of the memory module 410. The error data collecting unit 443 calculates an error rate of the memory module 410 based on the error detected by the error detecting unit 430. The secure operation value testing unit 444 carries out a secure operation voltage value application test. The non-secure operation value testing unit 445 carries out a non-secure operation voltage value application test. The threshold value history storing unit 446 stores a history of threshold voltage values determined by the non-secure operation value testing unit 445. The voltage load test controller 447 controls all the voltage load tests.
  • Thus, in the fourth embodiment of the present invention, precise failure prediction is achieved as it is based on application of a voltage load on the memory module 410 by the variable power source 420 controlled by the control unit 440, and detection of error under the voltage load in the memory module 410 by the error detecting unit 430.
  • A voltage load is applied on a memory module in the fourth embodiment. However, a voltage load may be applied on a CPU to carry out failure prediction for the CPU.
  • FIG. 11 is a graph of an error status in a CPU in response to an applied voltage. The CPU outputs an alarm signal when an error occurs in its internal functioning. This alarm is output when a bus parity error or sequence error occurs inside the CPU, and not when it is functioning normally (curve S).
  • However, if the characteristics of the CPU deteriorate, the setup and hold margin becomes insufficient, and if the internal timing then malfunctions, errors such as bus parity errors and sequence errors occur.
  • Thus, when the internal characteristics that change gradually due to aged deterioration exceed a threshold value, error is output (curve U) even in a digital circuit. Consequently, the failure time can be predicted by varying the voltage load at curve T, and measuring the value at which error occurs.
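  • Measuring the value at which an error occurs amounts to stepping the load through a series of values and recording the first one at which the error rate reaches the failure threshold. The sketch below assumes hypothetical callables for applying the load and reading the error rate; `find_threshold`, `FakePart`, and the millivolt figures are illustrative and do not come from the description.

```python
from typing import Callable, Iterable, Optional


def find_threshold(
    loads: Iterable[float],
    apply_load: Callable[[float], None],
    measure_error_rate: Callable[[], float],
    failure_rate: float,
) -> Optional[float]:
    """Apply each load in order and return the first load at which the
    measured error rate reaches `failure_rate`, or None if it never does."""
    for load in loads:
        apply_load(load)
        if measure_error_rate() >= failure_rate:
            return load
    return None


class FakePart:
    """Stand-in for real hardware: errors appear at or above `fail_at_mv`."""

    def __init__(self, fail_at_mv: int) -> None:
        self.fail_at_mv = fail_at_mv
        self.mv = 0

    def set_voltage_mv(self, mv: float) -> None:
        self.mv = mv

    def error_rate(self) -> float:
        return 1.0 if self.mv >= self.fail_at_mv else 0.0


if __name__ == "__main__":
    part = FakePart(fail_at_mv=1930)
    sweep = range(1800, 2201, 50)  # 1800 mV to 2200 mV in 50 mV steps
    print(find_threshold(sweep, part.set_voltage_mv, part.error_rate, 0.5))
```

  • Passing a descending sequence of loads would cover the low voltage condition in the same way. The value returned is what a threshold value history storing unit would record for comparison with the next test.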
  • In the first embodiment to the fourth embodiment according to the present invention, failure prediction is carried out by applying an environmental load on an electronic part. Failure prediction can also be carried out by applying an environmental load on mechanical parts.
  • For instance, conditions are stipulated for mechanical parts that are driven by a motor, such as a fan, which has a stipulated maximum number of rotations and minimum number of rotations. Error detection in the mechanical part can be carried out from the detection of error related to increased or decreased number of rotations (that is, when the number of rotations is not the stipulated value and there is a fluctuation in the number of rotations).
  • FIG. 12 is a block diagram of an electronic device according to a fifth embodiment of the present invention. An electronic device 500 includes a fan 510, a variable power source 520, a rotation count monitoring unit 530, and a control unit 540.
  • The fan 510 is a part for which failure prediction is to be carried out. The variable power source 520 is a power source that applies voltage of varying values on the fan 510. The rotation count monitoring unit 530 monitors the number of rotations of the fan 510 and detects the error.
  • The control unit 540 controls the variable power source 520 to vary the voltage value applied on the fan 510 and carries out failure prediction. The control unit 540 includes a voltage controller 541, a secure operation value storing unit 542, an error data collecting unit 543, a secure operation value testing unit 544, a non-secure operation value testing unit 545, a threshold value history storing unit 546, and a voltage load test controller 547. The voltage controller 541 controls the variable power source 520. The secure operation value storing unit 542 stores secure operation voltage values of the fan 510. The error data collecting unit 543 calculates error data of the fan 510 based on an error in rotation count detected by the rotation count monitoring unit 530. The secure operation value testing unit 544 carries out a secure operation voltage value application test. The non-secure operation value testing unit 545 carries out a non-secure operation voltage value application test. The threshold value history storing unit 546 stores a history of threshold voltages determined by the non-secure operation value testing unit 545. The voltage load test controller 547 controls all the voltage load tests.
  • Thus, in the fifth embodiment of the present invention, precise failure prediction is achieved as it is based on application of a voltage load on the fan 510 by the variable power source 520 controlled by the control unit 540, and detection of error under the voltage load by the rotation count monitoring unit 530.
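  • A rotation count check of the kind attributed to the rotation count monitoring unit 530 might look like the sketch below. The description only says that an error is detected from rotation counts that depart from the stipulated value or fluctuate, so the function name `rotation_error`, the use of a sample window, and the fluctuation limit are assumptions for illustration.

```python
from typing import Sequence


def rotation_error(
    samples_rpm: Sequence[float],
    min_rpm: float,
    max_rpm: float,
    max_fluctuation_rpm: float,
) -> bool:
    """Return True if any sample leaves the stipulated range or the spread
    of the samples exceeds the allowed fluctuation (hypothetical criteria)."""
    if not samples_rpm:
        return False  # nothing measured, nothing to flag
    out_of_range = any(s < min_rpm or s > max_rpm for s in samples_rpm)
    fluctuating = (max(samples_rpm) - min(samples_rpm)) > max_fluctuation_rpm
    return out_of_range or fluctuating


if __name__ == "__main__":
    # Illustrative readings taken while a reduced voltage is applied.
    print(rotation_error([3000, 3010, 2995], 2500, 3500, 100))
    print(rotation_error([3000, 2400, 3000], 2500, 3500, 100))
```

  • Repeating this check at each applied voltage gives the error data from which a threshold voltage for the fan could be determined.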
  • Failure prediction for a fan is carried out by applying a voltage load to it in the fifth embodiment of the present invention. However, failure prediction can be carried out for any motor-driven mechanical part or for a non-electronic device which includes a motor-driven mechanical part.
  • In a device that is a combination of mechanical and electronic parts, such as a hard disk drive, failure prediction can be carried out by increasing the access frequency of the hard disk or by increasing the load on the drive system, apart from failure prediction by application of environmental load such as temperature load, voltage load, etc.
  • FIG. 13 is a schematic of a computer that executes a failure prediction program for the electronic devices according to the first embodiment to the fifth embodiment. A computer 600 includes a CPU 610, a random access memory (RAM) 620, a read-only memory (ROM) 630, an input/output (I/O) interface 640, and a non-volatile memory 650.
  • The CPU 610 is a processing device that executes the failure prediction program. The RAM 620 is a storage unit that stores intermediate results of the failure prediction program. The ROM 630 is a storage unit that stores the failure prediction program, secure operation values, etc. The I/O interface 640 is an interface for inputting values measured by a sensor, or for outputting settings to a temperature setting unit or a variable power source. The non-volatile memory 650 stores data that is stored in a threshold value history storing unit, etc.
  • The CPU 610, RAM 620, ROM 630, non-volatile memory 650, etc. may be used as constituent parts of an electronic device as well as exclusively for executing the failure prediction program.
  • According to the present invention, direct detection of an impending part failure is carried out. Consequently, a highly reliable failure prediction can be achieved.
  • Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art which fairly fall within the basic teaching herein set forth.

Claims (16)

1. An electronic device that includes a part prone to aged deterioration, comprising:
an environmental load applying unit that applies a higher environmental load on the part than an environmental load in a normal operation;
an error detecting unit that detects an error in the part with the higher environmental load applied; and
a failure predicting unit that predicts a failure of the part based on the error detected.
2. The electronic device according to claim 1, wherein the environmental load applying unit applies a first environmental load of a boundary secure operation value between a secure operation environment and non-secure operation environment of the electronic device or a second environmental load of a value that is close to the boundary secure operation value within a range of a secure operation of the electronic device to the part.
3. The electronic device according to claim 2, wherein the failure predicting unit predicts the failure if an error rate of the error detected is greater than a predetermined value.
4. The electronic device according to claim 1, wherein the environmental load applying unit applies an environmental load of a non-secure operation value that exceeds the boundary secure operation value to the part.
5. The electronic device according to claim 4, wherein the failure predicting unit predicts the failure based on a value of the environmental load with which an error rate of the error becomes a failure threshold value.
6. The electronic device according to claim 5, wherein
the failure predicting unit includes a threshold value history storing unit that stores a last value of the environmental load with which an error rate became the failure threshold value when the environmental load of a non-secure operation value was applied as a previous measured threshold value, and
the failure predicting unit predicts the failure if a first absolute value of a difference between the previous measured threshold value and a current measured threshold value that is a current value of the environmental load with which an error rate became the failure threshold value when the environmental load of a non-secure operation value was applied is greater than a second absolute value of a difference between the current measured threshold value and a secure operation value.
7. The electronic device according to claim 1, wherein the environmental load applying unit applies a temperature load to the part as the environmental load.
8. The electronic device according to claim 7, wherein the environmental load applying unit further includes a temperature setting unit that applies the temperature load on the part; and
a temperature controller that controls a temperature of the part using the temperature setting unit.
9. The electronic device according to claim 7, wherein the environmental load applying unit further includes a cooling unit that cools down the part; and
a temperature controller that controls a temperature of the part using the cooling unit.
10. The electronic device according to claim 7, wherein
the part is a communication processing part that processes communication data, and
the environmental load applying unit applies the environmental load on the communication processing part by adding other data to the communication data.
11. The electronic device according to claim 1, wherein the environmental load applying unit applies the environmental load by varying an applied voltage to the part.
12. The electronic device according to claim 11, wherein the part is a fan, and the error detecting unit detects the error by monitoring a rotation count of the fan.
13. The electronic device according to claim 1, wherein the part is a hard disk, and the environmental load applying unit applies the environmental load on the disk by increasing an access frequency of the hard disk.
14. A method of predicting a failure of a part prone to aged deterioration, comprising:
applying a higher environmental load on the part than an environmental load in a normal operation;
detecting an error in the part with the higher environmental load applied; and
predicting the failure of the part based on the error detected.
15. A computer program for predicting a failure of a part prone to aged deterioration, making a computer execute:
applying a higher environmental load on the part than an environmental load in a normal operation;
detecting an error in the part with the higher environmental load applied; and
predicting the failure of the part based on the error detected.
16. A computer readable recording medium for storing a computer program that makes a computer execute:
applying a higher environmental load on the part than an environmental load in a normal operation;
detecting an error in the part with the higher environmental load applied; and
predicting the failure of the part based on the error detected.
US10/875,917 2004-02-06 2004-06-23 Electronic device, failure prediction method, and computer product Expired - Fee Related US7469189B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2004-030570 2004-02-06
JP2004030570A JP4500063B2 (en) 2004-02-06 2004-02-06 Electronic device, prediction method, and prediction program

Publications (2)

Publication Number Publication Date
US20050193284A1 true US20050193284A1 (en) 2005-09-01
US7469189B2 US7469189B2 (en) 2008-12-23

Family

ID=34879199

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/875,917 Expired - Fee Related US7469189B2 (en) 2004-02-06 2004-06-23 Electronic device, failure prediction method, and computer product

Country Status (2)

Country Link
US (1) US7469189B2 (en)
JP (1) JP4500063B2 (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060282709A1 (en) * 2005-06-14 2006-12-14 Microsoft Corporation Hard disk drive condition reporting and error correction
US20070219747A1 (en) * 2006-03-07 2007-09-20 Hughes James E HDD throttle polling based on blade temperature
US20070268509A1 (en) * 2006-05-18 2007-11-22 Xerox Corporation Soft failure detection in a network of devices
US20070294072A1 (en) * 2006-06-16 2007-12-20 Ming-Shiahn Tsai Testing model
US20080080843A1 (en) * 2006-09-29 2008-04-03 Hibbard Gary D Systems and methods to improve consumer product reliability and lifetime of a hard disk drive by reducing its activity
US20080109862A1 (en) * 2006-11-07 2008-05-08 General Instrument Corporation Method and apparatus for predicting failures in set-top boxes and other devices to enable preventative steps to be taken to prevent service disruption
US20080147910A1 (en) * 2006-09-29 2008-06-19 Hibbard Gary D Provisional load sharing buffer for reducing hard disk drive (hdd) activity and improving reliability and lifetime
US20080270072A1 (en) * 2007-04-24 2008-10-30 Hiroshi Sukegawa Data remaining period management device and method
US7519880B1 (en) * 2005-07-05 2009-04-14 Advanced Micro Devices, Inc. Burn-in using system-level test hardware
US20090161243A1 (en) * 2007-12-21 2009-06-25 Ratnesh Sharma Monitoring Disk Drives To Predict Failure
TWI398767B (en) * 2007-06-07 2013-06-11 Delta Electronics Inc Electronic system and alarm device thereof
CN103440416A (en) * 2013-08-27 2013-12-11 西北工业大学 Blade machining process error prediction method based on extended error flow
US20170166240A1 (en) * 2015-12-14 2017-06-15 Hyundai Motor Company System for compensating for disturbance of motor for motor driven power steering
EP3188095A1 (en) * 2015-12-29 2017-07-05 Flytech Technology Co., Ltd. System for displaying and prompting life percentage of electronic device
US20180203769A1 (en) * 2014-03-06 2018-07-19 International Business Machines Corporation Reliability Enhancement in a Distributed Storage System

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4349408B2 (en) 2005-12-28 2009-10-21 日本電気株式会社 Life prediction monitoring apparatus, life prediction monitoring method, and life prediction monitoring program
US8548421B2 (en) * 2008-07-03 2013-10-01 Centurylink Intellectual Property Llc Battery charge reservation for emergency communications
JP5355031B2 (en) * 2008-10-23 2013-11-27 キヤノン株式会社 Image forming apparatus and image forming apparatus control method
JP5135401B2 (en) * 2010-09-10 2013-02-06 株式会社東芝 Information processing apparatus, failure sign diagnosis method and program
JP2015185120A (en) * 2014-03-26 2015-10-22 株式会社Nttファシリティーズ Information processing equipment, information processing method, and program
KR102467843B1 (en) * 2018-08-23 2022-11-16 삼성전자주식회사 Method and apparatus for monitoring secondary power device, and electronic system comprising the same apparatus
US11209808B2 (en) 2019-05-21 2021-12-28 At&T Intellectual Property I, L.P. Systems and method for management and allocation of network assets

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5557183A (en) * 1993-07-29 1996-09-17 International Business Machines Corporation Method and apparatus for predicting failure of a disk drive
US5923876A (en) * 1995-08-24 1999-07-13 Compaq Computer Corp. Disk fault prediction system
US6078455A (en) * 1997-06-13 2000-06-20 Seagate Technology, Inc. Temperature dependent disc drive parametric configuration
US6108586A (en) * 1997-03-31 2000-08-22 Hitachi, Ltd. Fraction defective estimating method, system for carrying out the same and recording medium
US6249890B1 (en) * 1998-06-05 2001-06-19 Seagate Technology Llc Detecting head readback response degradation in a disc drive
US20040085670A1 (en) * 2002-11-05 2004-05-06 Seagate Technology Llc Method for measuring pad wear of padded slider with MRE cooling effect

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0287079A (en) 1988-09-26 1990-03-27 Toshiba Corp Estimating apparatus of lifetime of electric component
JPH04286968A (en) * 1991-03-15 1992-10-12 Fujitsu Ltd Temperature margin testing system of electrical equipment
JP3069882B2 (en) 1993-11-05 2000-07-24 株式会社日立製作所 Semiconductor device
JPH09264930A (en) * 1996-03-29 1997-10-07 Ando Electric Co Ltd Ic tester and method for diagnosing it
JP2000275291A (en) * 1999-03-26 2000-10-06 Toshiba Corp Voltage margin test circuit and deterioration diagnostic device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7802019B2 (en) * 2005-06-14 2010-09-21 Microsoft Corporation Hard disk drive condition reporting and error correction
US20060282709A1 (en) * 2005-06-14 2006-12-14 Microsoft Corporation Hard disk drive condition reporting and error correction
US7519880B1 (en) * 2005-07-05 2009-04-14 Advanced Micro Devices, Inc. Burn-in using system-level test hardware
US20070219747A1 (en) * 2006-03-07 2007-09-20 Hughes James E HDD throttle polling based on blade temperature
US20070268509A1 (en) * 2006-05-18 2007-11-22 Xerox Corporation Soft failure detection in a network of devices
US7865089B2 (en) 2006-05-18 2011-01-04 Xerox Corporation Soft failure detection in a network of devices
US20070294072A1 (en) * 2006-06-16 2007-12-20 Ming-Shiahn Tsai Testing model
US20080080843A1 (en) * 2006-09-29 2008-04-03 Hibbard Gary D Systems and methods to improve consumer product reliability and lifetime of a hard disk drive by reducing its activity
US7974522B2 (en) 2006-09-29 2011-07-05 Hibbard Gary D Systems and methods to improve consumer product reliability and lifetime of a hard disk drive by reducing its activity
US20080147910A1 (en) * 2006-09-29 2008-06-19 Hibbard Gary D Provisional load sharing buffer for reducing hard disk drive (hdd) activity and improving reliability and lifetime
US7818474B2 (en) * 2006-09-29 2010-10-19 Hibbard Gary D Provisional load sharing buffer for reducing hard disk drive (HDD) activity and improving reliability and lifetime
WO2008057812A2 (en) * 2006-11-07 2008-05-15 General Instrument Corporation Method and apparatus for predicting failure in set-top boxes and other devices to enable preventative steps to be taken to prevent service disruption
WO2008057812A3 (en) * 2006-11-07 2008-07-03 Gen Instrument Corp Method and apparatus for predicting failure in set-top boxes and other devices to enable preventative steps to be taken to prevent service disruption
US20080109862A1 (en) * 2006-11-07 2008-05-08 General Instrument Corporation Method and apparatus for predicting failures in set-top boxes and other devices to enable preventative steps to be taken to prevent service disruption
US20080270072A1 (en) * 2007-04-24 2008-10-30 Hiroshi Sukegawa Data remaining period management device and method
US8000927B2 (en) 2007-04-24 2011-08-16 Kabushiki Kaisha Toshiba Data remaining period management device and method
TWI398767B (en) * 2007-06-07 2013-06-11 Delta Electronics Inc Electronic system and alarm device thereof
US20090161243A1 (en) * 2007-12-21 2009-06-25 Ratnesh Sharma Monitoring Disk Drives To Predict Failure
CN103440416A (en) * 2013-08-27 2013-12-11 西北工业大学 Blade machining process error prediction method based on extended error flow
US20180203769A1 (en) * 2014-03-06 2018-07-19 International Business Machines Corporation Reliability Enhancement in a Distributed Storage System
US10223207B2 (en) * 2014-03-06 2019-03-05 International Business Machines Corporation Reliability enhancement in a distributed storage system
US20170166240A1 (en) * 2015-12-14 2017-06-15 Hyundai Motor Company System for compensating for disturbance of motor for motor driven power steering
US10167011B2 (en) * 2015-12-14 2019-01-01 Hyundai Motor Company System for compensating for disturbance of motor for motor driven power steering
EP3188095A1 (en) * 2015-12-29 2017-07-05 Flytech Technology Co., Ltd. System for displaying and prompting life percentage of electronic device

Also Published As

Publication number Publication date
US7469189B2 (en) 2008-12-23
JP2005221413A (en) 2005-08-18
JP4500063B2 (en) 2010-07-14

Similar Documents

Publication Publication Date Title
US7469189B2 (en) Electronic device, failure prediction method, and computer product
US7870440B2 (en) Method and apparatus for detecting multiple anomalies in a cluster of components
US6982842B2 (en) Predictive disc drive failure methodology
US7373559B2 (en) Method and system for proactive drive replacement for high availability storage systems
US6986075B2 (en) Storage-device activation control for a high-availability storage system
US9891975B2 (en) Failure prediction system of controller
US8340923B2 (en) Predicting remaining useful life for a computer system using a stress-based prediction technique
US20150193325A1 (en) Method and system for determining hardware life expectancy and failure prevention
JP4929783B2 (en) Power monitoring device
CN111124827B (en) Monitoring device and monitoring method for equipment fan
WO2004025650A1 (en) Predictive disc drive failure methodology
JPH04271229A (en) Temperature abnormality detecting system
US11307569B2 (en) Adaptive sequential probability ratio test to facilitate a robust remaining useful life estimation for critical assets
JP2016161990A (en) Controller predicting life-span with error correcting function
US8234235B2 (en) Security and remote support apparatus, system and method
EP3726233A1 (en) Chip health monitor
JP2016146198A (en) Power supply system, control method, and control program
US10991169B2 (en) Method for determining a mean time to failure of an electrical device
JP7347953B2 (en) Equipment early warning monitoring device and equipment early warning monitoring method
US10837990B2 (en) Semiconductor device
KR20220098202A (en) Diagnostic devices, diagnostic methods and programs
EP3611523B1 (en) Apparatuses and methods involving adjustable circuit-stress test conditions for stressing regional circuits
JP2021043891A (en) Storage system and control method thereof
CN113567836B (en) Segmented prediction circuit aging system and method
JP2022155028A (en) Device state monitor, program and device state monitoring method

Legal Events

Date Code Title Description
AS Assignment
Owner name: FUJITSU LIMITED, JAPAN
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YASUO, AKIHIRO;REEL/FRAME:015518/0026
Effective date: 20040520

STCF Information on status: patent grant
Free format text: PATENTED CASE

FPAY Fee payment
Year of fee payment: 4

FPAY Fee payment
Year of fee payment: 8

FEPP Fee payment procedure
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees
Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee
Effective date: 20201223