JP5311473B2 - Computer system and re-installation method of CPU

Info

Publication number: JP5311473B2
Application number: JP2009012547A
Authority: JP
Inventors: 志保小酒井
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2009-01-23
Filing date: 2009-01-23
Publication date: 2013-10-09
Anticipated expiration: 2029-01-23
Also published as: JP2010170355A

Description

The present invention relates to a computer system and a CPU re-installation method, and more particularly to a computer system and a CPU re-installation method in which, when a failure occurs in a CPU, the CPU is detached and then re-installed.

In a large-scale system such as a supercomputer, when a failure occurs in a CPU (Central Processing Unit), the CPU is detached from the system, initialized, and then re-installed into the system.

As related techniques, Patent Documents 1 to 5 disclose methods of re-installing a CPU when a CPU failure occurs and techniques for analyzing the content of the failure.

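Patent Document 1: Japanese Unexamined Patent Application Publication No. H02-129730
Patent Document 2: Japanese Unexamined Patent Application Publication No. H06-051864
Patent Document 3: Japanese Unexamined Patent Application Publication No. H09-034852
Patent Document 4: Japanese Unexamined Patent Application Publication No. H09-128258
Patent Document 5: Japanese Patent No. 2790204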

However, none of the related techniques described above discloses a failure recurrence prevention measure that corresponds to the content of the failure in the failed CPU. The CPU has therefore been initialized and re-installed without any failure recurrence prevention measure, so the same failure is likely to recur.

A computer system according to the present invention is a computer system that, when a failure occurs in a CPU, detaches the CPU and then re-installs it. The computer system includes a diagnostic unit that analyzes the content of the failure that has occurred in the CPU, performs a failure recurrence prevention measure according to the result of the analysis, and then re-installs the CPU into the computer system.

A CPU re-installation method according to the present invention is a method in which, when a failure occurs in a CPU, the CPU is detached from a computer system and then re-installed. The method includes a step of analyzing the content of the failure that has occurred in the CPU, and a step of performing a failure recurrence prevention measure according to the result of the analysis.

According to the present invention, when a failed CPU is re-installed, a failure recurrence prevention measure is performed according to the result of analyzing the failure content. This makes it possible to provide a computer system and a CPU re-installation method that reduce the likelihood of the failure recurring in the re-installed CPU.

FIG. 1 is a block diagram showing the configuration of a computer system according to Embodiment 1.
FIG. 2 is a flowchart showing an operation example of the computer system according to Embodiment 1.
FIG. 3 is a diagram for explaining a set voltage determination method according to Embodiment 1.

Embodiment 1
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a computer system 1 according to Embodiment 1. The computer system 1 includes a diagnostic processor 10, a power control unit 20, a clock control unit 30, CPUs 40_1 to 40_n (hereinafter sometimes collectively referred to as the CPU 40), an MMU (Memory Management Unit) 70, and a cooling device 80.

The diagnostic processor 10, which serves as the diagnostic unit, has embedded software 11. The diagnostic processor 10 diagnoses the CPU 40 and the MMU 70 through a diagnostic path. The power control unit 20 supplies power to the diagnostic processor 10, the CPU 40, the MMU 70, the cooling device 80, and the like. The clock control unit 30 controls the clock supplied to the diagnostic processor 10, the CPU 40, the MMU 70, the cooling device 80, and the like.

The CPUs 40_1 to 40_n respectively have temperature sensors 50_1 to 50_n (hereinafter sometimes collectively referred to as the temperature sensor 50), which measure the temperature of each of the CPUs 40_1 to 40_n, and BISTs 60_1 to 60_n (hereinafter sometimes collectively referred to as the BIST 60), which cause each of the CPUs 40_1 to 40_n to execute a BIST (Built In Self Test).

As described later, when a failure occurs in the CPU 40, the embedded software 11 detaches that CPU 40 from the system 1, analyzes the content of the failure, and performs a failure recurrence prevention measure corresponding to the failure content. The embedded software 11 can change the voltage of the power supplied by the power control unit 20. It can also change the clock supplied by the clock control unit 30.

For example, when the embedded software 11 determines that the failure is an internal failure caused by a delay in an internal logic portion, it raises the voltage of the power supplied to that CPU 40 as the failure recurrence prevention measure. An effective failure recurrence prevention measure matched to the failure content can thus be performed.

Here, raising the supplied voltage too far may cause a temperature failure in the CPU 40. The embedded software 11 therefore judges the temperature condition using the temperature sensor 50 mounted on the CPU 40. If the temperature has a margin with respect to a predetermined set value, it raises the voltage; if there is no margin, it enhances the cooling first and then raises the voltage. A more appropriate failure recurrence prevention measure can thus be performed. As described later, the optimum voltage value of the supplied power is set to a value derived from a prior investigation of the relationship between voltage and temperature.

As other failure recurrence prevention measures, the embedded software 11 can lower the voltage when the failure is caused by noise, or lower the clock when the failure is an interface failure. In each case an effective measure matched to the failure content can be performed.

Furthermore, when performing a failure recurrence prevention measure, the performance of the CPU 40 can be taken into account according to a preset mode. The modes set in advance include, for example: i) re-install the CPU 40 without lowering its performance (and detach it if performance would drop); ii) re-install the CPU 40 even if its performance drops; and iii) detach the CPU 40 immediately. The embedded software 11 selects one of these modes and acts accordingly.
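The three modes can be pictured as a small enumeration with one decision function. The following is only an illustrative sketch: the names `Mode` and `should_reinstall` are hypothetical and do not appear in the source, and the function models only the mode descriptions given above (the flowchart example described later instead skips the clock reduction when performance degradation is not allowed).

```python
from enum import Enum, auto


class Mode(Enum):
    """Preset modes i) to iii); the identifiers are hypothetical."""
    KEEP_PERFORMANCE = auto()    # i) re-install only without a performance drop
    ALLOW_DEGRADATION = auto()   # ii) re-install even if performance drops
    DETACH_IMMEDIATELY = auto()  # iii) detach at once


def should_reinstall(mode: Mode, measure_degrades_performance: bool) -> bool:
    """Return True if the CPU should be re-installed under the given mode."""
    if mode is Mode.DETACH_IMMEDIATELY:
        return False
    if mode is Mode.KEEP_PERFORMANCE:
        # Mode i): detach when the measure would lower performance.
        return not measure_degrades_performance
    # Mode ii): re-install regardless of the performance drop.
    return True
```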

Next, an operation example of the system will be described in detail with reference to the flowchart shown in FIG. 2.

First, when a failure occurs in the CPU 40 and the embedded software 11 detects a failure interrupt from the CPU 40, it detaches that CPU 40 from the system 1 (step S101). Next, the embedded software 11 determines whether the CPU 40 can be re-installed into the system 1 (step S102). If re-installation is possible, it analyzes the content of the failure (step S103). If re-installation is not possible, the CPU 40 is detached from the system 1 (step S119) and the system 1 continues operating.

If the analysis in step S103 shows, for example, that the failure was caused by a delay (Yes in step S104), the embedded software 11 executes the BIST on that CPU 40 (step S105) and measures the temperature during BIST execution (step S106). The embedded software 11 then converts the measured temperature into the temperature during execution of a high-load JOB (step S107) and determines whether the converted temperature has a margin (step S108).

If the temperature has a margin (Yes in step S108), the embedded software 11 raises the voltage of the power supplied to that CPU 40 as the failure recurrence prevention measure (step S109). If the temperature has no margin, the embedded software 11 further determines whether the cooling of that CPU 40 can be enhanced (step S111). If it can (Yes in step S111), the embedded software 11 enhances the cooling (step S112) and then raises the voltage (step S113). The embedded software 11 then initializes and re-installs the CPU 40 (step S110), and the system 1 continues operating.

If the analysis in step S103 shows that the failure was not caused by a delay (No in step S104) but was caused, for example, by noise (Yes in step S114), the embedded software 11 lowers the voltage (step S115) and then performs initialization and re-installation (step S110).

If the analysis in step S103 shows that the failure was caused neither by a delay (No in step S104) nor by noise (No in step S114), the embedded software 11 determines, for example, whether the failure is an interface failure (step S116).

If the failure is an interface failure (Yes in step S116), the embedded software 11 further determines whether the set mode is, for example, a mode that permits a performance drop in the system 1 (step S117). If it is (Yes in step S117), the embedded software 11 lowers the clock of the CPU 40 (step S118) and performs initialization and re-installation (step S110). If the failure is not an interface failure (No in step S116), or if the set mode does not permit a performance drop in the system 1 (No in step S117), initialization and re-installation are performed without lowering the clock (step S110).
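The branching of steps S101 to S119 can be summarized as a small planning function. This is a minimal sketch rather than the embedded software itself: `Cause` and `plan_reinstall` are hypothetical names, the analysis result of step S103 is simply passed in as an argument, and the branch where cooling cannot be enhanced (No in step S111) is not described in the text, so the sketch assumes that the CPU is left detached in that case.

```python
from enum import Enum, auto
from typing import List


class Cause(Enum):
    """Result of the failure analysis in step S103 (hypothetical names)."""
    DELAY = auto()
    NOISE = auto()
    INTERFACE = auto()
    OTHER = auto()


def plan_reinstall(cause: Cause,
                   reinstallable: bool = True,
                   temp_has_margin: bool = True,
                   cooling_can_be_enhanced: bool = True,
                   allow_degradation: bool = False) -> List[str]:
    """Return the ordered list of steps taken for one failed CPU."""
    steps = ["S101 detach CPU"]
    if not reinstallable:                       # No in S102
        return steps + ["S119 keep CPU detached"]
    if cause is Cause.DELAY:                    # Yes in S104
        steps += ["S105 run BIST",
                  "S106 measure temperature",
                  "S107 convert to high-load temperature"]
        if temp_has_margin:                     # Yes in S108
            steps.append("S109 raise voltage")
        elif cooling_can_be_enhanced:           # Yes in S111
            steps += ["S112 enhance cooling", "S113 raise voltage"]
        else:
            # ASSUMPTION: the text does not describe this branch.
            return steps + ["S119 keep CPU detached"]
    elif cause is Cause.NOISE:                  # Yes in S114
        steps.append("S115 lower voltage")
    elif cause is Cause.INTERFACE and allow_degradation:  # Yes in S116, S117
        steps.append("S118 lower clock")
    steps.append("S110 initialize and re-install CPU")
    return steps
```

For example, `plan_reinstall(Cause.NOISE)` yields the detach step, the voltage reduction of step S115, and the re-installation of step S110, in that order.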

Next, the method of determining the voltage to be set when the voltage is raised as a failure recurrence prevention measure will be described. When the voltage is raised after a failure, it is preferably set so that the temperature during execution of a high-load JOB is at or below the temperature at which a temperature failure occurs. Because the temperature rise rate differs from one CPU 40 to another, the BIST is used to estimate the temperature during execution of a high-load JOB.

Specifically, the correlation between the voltage and the temperature of the CPU 40 during BIST execution and during high-load JOB execution is investigated in advance. The investigated correlation is held in a storage unit or the like (not shown). When a failure occurs, the embedded software 11 executes the BIST on the CPU 40 detached from the system 1 and measures the temperature of the CPU 40. The embedded software 11 then uses the held correlation to convert the temperature measured during BIST execution into the temperature during high-load JOB execution, and sets the voltage value so that the converted temperature stays within a range that does not cause a temperature failure. In other words, by executing the BIST when a failure occurs and estimating the high-load JOB temperature from the BIST temperature to determine the set voltage, the optimum voltage can be set without stopping the system 1.
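The conversion from a BIST temperature to a high-load JOB temperature can be illustrated with a lookup table. The source does not specify how the correlation is stored or evaluated, so everything below is an assumption made for illustration: the table values, the names `CORRELATION`, `to_high_load_temp` and `has_margin`, and the use of linear interpolation between table points.

```python
from typing import List, Tuple

# Hypothetical correlation at one supply voltage:
# (temperature during BIST, temperature during high-load JOB), in deg C.
CORRELATION: List[Tuple[float, float]] = [
    (40.0, 55.0),
    (50.0, 70.0),
    (60.0, 85.0),
]


def to_high_load_temp(bist_temp: float,
                      table: List[Tuple[float, float]] = CORRELATION) -> float:
    """Convert a BIST temperature to an estimated high-load JOB temperature.

    Linear interpolation between neighbouring table points is an assumption.
    """
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= bist_temp <= x1:
            return y0 + (y1 - y0) * (bist_temp - x0) / (x1 - x0)
    # Outside the investigated range no estimate is made.
    raise ValueError("BIST temperature outside the investigated range")


def has_margin(job_temp: float, set_value: float, margin: float) -> bool:
    """True if the estimated temperature stays `margin` below the set value."""
    return job_temp <= set_value - margin
```

With these made-up numbers, a BIST reading of 45.0 converts to 62.5, which leaves a margin of 10 degrees against a set value of 90.0.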

The set voltage determination method used when the voltage is raised as a failure recurrence prevention measure will be described in detail with reference to FIG. 3. FIG. 3 is a graph showing the correlation, investigated in advance, between the voltage and the temperature of the CPU 40 during BIST execution and during high-load JOB execution. FIG. 3(a) is a graph showing an example in which the temperature rise rate is A. FIG. 3(b) is a graph showing an example in which the temperature rise rate is B.

In FIG. 3, the voltage is preferably set so that the temperature during execution of a high-load JOB is at or below the temperature Tmax at which a failure occurs. For example, if the temperature measured when the BIST is executed at voltage V2 is T2, it can be determined from FIG. 3(a) and FIG. 3(b) that the temperature rise rate of that CPU 40 is A. In this case, setting the voltage to V4 or higher would exceed the temperature Tmax, so the voltage is preferably set lower than V4. If, on the other hand, the temperature measured when the BIST is executed at voltage V2 is T1, it can be determined from FIG. 3(a) and FIG. 3(b) that the temperature rise rate of that CPU 40 is B. In this case, the voltage V4 can be used as the set voltage.
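The selection described for FIG. 3 can be sketched in two steps: classify the CPU by its BIST temperature at a known voltage, then pick the highest voltage whose high-load JOB temperature stays at or below Tmax. The curve values below are invented for illustration and are not taken from FIG. 3; `CURVES`, `classify_rate` and `highest_safe_voltage` are hypothetical names, and choosing the nearest curve is an assumed matching rule.

```python
from typing import Optional

T_MAX = 90.0  # hypothetical failure temperature Tmax, in deg C
VOLTAGES = ["V1", "V2", "V3", "V4"]  # ascending order

# Hypothetical curves: rate -> voltage -> (BIST temp, high-load JOB temp).
CURVES = {
    "A": {"V1": (45.0, 60.0), "V2": (50.0, 70.0),
          "V3": (55.0, 82.0), "V4": (60.0, 95.0)},
    "B": {"V1": (40.0, 52.0), "V2": (44.0, 60.0),
          "V3": (48.0, 70.0), "V4": (52.0, 80.0)},
}


def classify_rate(bist_voltage: str, bist_temp: float) -> str:
    """Pick the curve whose BIST temperature at this voltage is nearest."""
    return min(CURVES,
               key=lambda rate: abs(CURVES[rate][bist_voltage][0] - bist_temp))


def highest_safe_voltage(rate: str) -> Optional[str]:
    """Highest voltage whose high-load JOB temperature is at or below T_MAX."""
    safe = [v for v in VOLTAGES if CURVES[rate][v][1] <= T_MAX]
    return safe[-1] if safe else None
```

With these made-up curves, a BIST reading of 50.0 at V2 is classified as rate A and limited to V3, while a reading of 44.0 at V2 is classified as rate B and may use V4, mirroring the two cases in the text.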

As described above, according to the present invention, the diagnostic unit 10 analyzes the content of a failure that has occurred in the CPU 40, performs a failure recurrence prevention measure according to the result of the analysis, and then re-installs the CPU 40 into the computer system 1. This reduces the likelihood of the failure recurring in the re-installed CPU 40.

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from its spirit.

1 Computer system
10 Diagnostic processor
11 Embedded software
20 Power control unit
30 Clock control unit
40_1 to 40_n CPU
50_1 to 50_n Temperature sensor
60_1 to 60_n BIST
70 MMU
80 Cooling device

Claims

A computer system that, when a failure occurs in a CPU (Central Processing Unit), detaches the CPU and then re-installs it, the computer system comprising:
a diagnostic unit that analyzes the content of the failure that has occurred in the CPU, performs a failure recurrence prevention measure according to the result of the analysis, and then re-installs the CPU into the computer system.

The computer system according to claim 1, further comprising a power control unit that controls the voltage of the power supplied to the CPU,
wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system, the diagnostic unit raises the voltage of the power supplied to the CPU as the failure recurrence prevention measure.

The computer system according to claim 2, further comprising:
a BIST unit that causes the CPU to execute a BIST (Built In Self Test); and
a temperature measuring unit that measures the temperature of the CPU,
wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system, the diagnostic unit causes the CPU to execute the BIST, measures the temperature, and sets, based on the measured temperature, the voltage value to which the voltage of the power supplied to the CPU is raised.

The computer system according to claim 3, wherein the change in CPU temperature during execution of the BIST and the change in CPU temperature during execution of a high-load JOB are measured in advance against changes in the voltage of the power supplied to the CPU, and the measured correlation between the voltage change of the power supplied to the CPU and the temperature change is held, and
wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system,
the diagnostic unit causes the CPU to execute the BIST and measures the temperature, converts the temperature measured during execution of the BIST into the CPU temperature during execution of the high-load JOB using the held correlation between the voltage change of the power and the temperature change, and sets the voltage value to which the voltage of the power supplied to the CPU is raised so that the converted temperature is lower than a predetermined temperature at which a failure occurs in the CPU.

The computer system according to claim 2, further comprising a cooling device that cools the CPU,
wherein, when raising the voltage of the power supplied to the CPU, the diagnostic unit determines whether the temperature measured during execution of the BIST has a margin with respect to a predetermined set value and, if the determination shows that there is no margin, raises the voltage of the power supplied to the CPU after enhancing the cooling of the CPU.

The computer system according to claim 1, further comprising a power control unit that controls the voltage of the power supplied to the CPU,
wherein, when the failure that has occurred in the CPU is caused by noise in the computer system, the diagnostic unit lowers the voltage of the power supplied to the CPU as the failure recurrence prevention measure.

The computer system as described, further comprising a clock control unit that controls the clock supplied to the CPU,
wherein, when the failure that has occurred in the CPU is an interface failure of the computer system, the diagnostic unit lowers the clock supplied to the CPU as the failure recurrence prevention measure.

The computer system according to claim 7, wherein a mode, determined in advance by a combination of the performance degradation of the CPU upon re-installation and the re-installation or detachment treatment of the CPU according to that performance degradation, is preset, and
wherein, when the failure that has occurred in the CPU is an interface failure of the computer system, the diagnostic unit selects, according to the set mode, whether to lower the clock supplied to the CPU.

A CPU re-installation method for re-installing a CPU (Central Processing Unit) after detaching the CPU from a computer system when a failure occurs in the CPU, the method comprising:
analyzing the content of the failure that has occurred in the CPU; and
performing a failure recurrence prevention measure according to the result of the analysis.

The CPU re-installation method according to claim 9, wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system,
the step of performing the failure recurrence prevention measure includes a step of raising the voltage of the power supplied to the CPU.

The CPU re-installation method according to claim 10, wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system,
the step of performing the failure recurrence prevention measure comprises:
causing the CPU to execute a BIST (Built In Self Test) and measuring the temperature; and
setting, based on the measured temperature, the voltage value to which the voltage of the power supplied to the CPU is raised.

The CPU re-installation method according to claim 11, further comprising a step of measuring in advance, against changes in the voltage of the power supplied to the CPU, the change in CPU temperature during execution of the BIST and the change in CPU temperature during execution of a high-load JOB, and holding the measured correlation between the voltage change of the power supplied to the CPU and the temperature change,
wherein, when the failure that has occurred in the CPU is caused by a delay in an internal logic portion of the computer system,
the step of performing the failure recurrence prevention measure comprises:
causing the CPU to execute the BIST and measuring the temperature;
converting the temperature measured during execution of the BIST into the CPU temperature during execution of the high-load JOB using the held correlation between the voltage change of the power and the temperature change; and
setting the voltage value to which the voltage of the power supplied to the CPU is raised so that the converted temperature is lower than a predetermined temperature at which a failure occurs in the CPU.

The CPU re-installation method according to claim 10, wherein the step of performing the failure recurrence prevention measure comprises:
determining, when raising the voltage of the power supplied to the CPU, whether the temperature measured during execution of the BIST has a margin with respect to a predetermined set value; and
if the determination shows that there is no margin, raising the voltage of the power supplied to the CPU after enhancing the cooling of the CPU.

The CPU re-installation method according to claim 9, wherein, when the failure that has occurred in the CPU is caused by noise in the computer system,
the step of performing the failure recurrence prevention measure
includes a step of lowering the voltage of the power supplied to the CPU.

The CPU re-installation method according to claim 9, wherein, when the failure that has occurred in the CPU is an interface failure in the computer system,
the step of performing the failure recurrence prevention measure
includes a step of lowering the clock supplied to the CPU.

The CPU re-installation method according to claim 15, further comprising a step of presetting a mode determined by a combination of the performance degradation of the CPU upon re-installation and the re-installation or detachment treatment of the CPU according to that performance degradation,
wherein, when the failure that has occurred in the CPU is an interface failure in the computer system,
the step of performing the failure recurrence prevention measure
includes a step of selecting, according to the set mode, whether to lower the clock supplied to the CPU.