JP2012178014A

JP2012178014A - Failure prediction/countermeasure method and client server system

Info

Publication number: JP2012178014A
Application number: JP2011040109A
Authority: JP
Inventors: Toshiie Kinoshita; 利家木下
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2011-02-25
Filing date: 2011-02-25
Publication date: 2012-09-13
Anticipated expiration: 2031-02-25
Also published as: JP5583052B2

Abstract

PROBLEM TO BE SOLVED: To limit the damage caused by a failure that occurs earlier than predicted, by collecting device information, calculating the degree of deterioration, and executing failure countermeasures at an early stage according to that degree. SOLUTION: The client server system includes a management server 100 connected via a network 500 to a maintenance target device 200. The management server 100 includes management means 70 that acquires information from the maintenance target device at a fixed cycle and manages the device. The management means 70 calculates the deterioration progress of the maintenance target device from the acquired information, and executes failure countermeasures on the device when the calculated deterioration progress reaches a predetermined level or when the acquired information indicates a predetermined event. The failure countermeasures are mirroring with a spare device 210 and switching operation to the spare device 210.

Description

The present invention relates to a failure prediction and countermeasure method and to a client server system. In particular, it relates to a failure prediction and countermeasure method that makes it possible to predict in advance a failure of a device to be maintained, such as an information processing device, and to implement countermeasures against that failure, and to a client server system to which the method is applied.

As a conventional technique for predicting device failures, the technique described in Patent Document 1, for example, is known. This conventional technique predicts a failure of a maintenance target device based on operating status information that is periodically transmitted from a monitoring terminal connected to that device, detects that the time of failure is near, and notifies the user.

Patent Document 1: JP 2009-217770 A

Because the conventional technique described above only informs the user that the time of failure is near, there is a risk that a fault will occur before the user actually takes action on the device. In addition, the conventional technique does not consider how to handle a failure that occurs earlier than predicted, and in such a case there is a risk of data loss or system downtime.

An object of the present invention is to solve the problems of the conventional technique described above and to provide a failure prediction and countermeasure method and a client server system that collect device information, calculate the degree of deterioration, and take failure countermeasures early according to that degree, thereby limiting the damage caused by a failure that occurs earlier than predicted.

According to the present invention, the object is achieved by a failure prediction and countermeasure method that predicts in advance a failure of a device to be maintained and implements countermeasures against the failure, wherein a management server is connected via a network to the device to be maintained; the management server has management means that acquires information on the device from the device at a fixed cycle and manages the device; and the management means calculates the deterioration progress of the device based on the acquired information and implements failure countermeasures on the device when the calculated deterioration progress reaches a predetermined level, or when the information acquired from the device indicates a predetermined event.

According to the present invention, when it is determined that there is a high risk of failure in a device to be maintained, such as an information processing device, measures to avoid the failure can be taken. This reduces the device administrator's workload and avoids a situation in which the device is allowed to fail even though the risk of failure had been predicted.

FIG. 1 is a block diagram showing the configuration of a client server system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of a blade module and the management server in the client server system according to the embodiment.
FIG. 3 is a diagram explaining the paths by which the management server acquires the information it uses to determine deterioration of a blade module and predict failures.
FIG. 4 is a sequence chart (part 1) explaining an example of the processing that determines deterioration of a monitoring target blade module and takes failure countermeasures in the client server system according to the embodiment.
FIG. 5 is a sequence chart (part 2) explaining the same example of processing.
FIG. 6 is a diagram explaining the information used for the primary deterioration determination for the primary failure countermeasure.
FIG. 7 is a diagram explaining the events for which the secondary failure countermeasure is implemented.
FIG. 8 is a sequence chart (part 1) explaining another example of the processing that determines deterioration of a monitoring target blade module and takes failure countermeasures in the client server system according to the embodiment.
FIG. 9 is a sequence chart (part 2) explaining the same other example of processing.

Embodiments of the failure prediction and countermeasure method and the client server system according to the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a client server system according to an embodiment of the present invention.

The client server system according to the embodiment shown in FIG. 1 comprises a server 300 having a plurality of blade modules 200, a management server 100 to which a management terminal 600 used by an administrator is connected, and a plurality of client terminals 400 used by users, all connected by a network 500 such as an intranet. Each of the blade modules 200 is an information processing device having at least an HDD 30, a RAM 40, and a CPU; it is connected to the client terminals 400 via the network 500 and provides various information processing services to users. Like the blade modules 200, the management server 100 is an information processing device comprising an HDD, a RAM, a CPU, and so on; it acquires information from the blade modules 200 and manages and controls them. Each client terminal 400 is an information processing device serving as a thin client, comprising at least an input device such as a keyboard, a display device, a memory, and a CPU; it connects to a blade module 200, has the blade module 200 execute various information processing, and receives the results.

FIG. 2 is a block diagram showing the functional configuration of the blade module 200 and the management server 100 in the client server system according to the embodiment. The blade modules in this client server system include blade modules to be monitored and spare blade modules; FIG. 2 shows one monitoring target blade module 200 and one spare blade module 210.

The server 300 in the client server system according to the embodiment has a plurality of monitoring target blade modules 200 and a smaller number of spare blade modules 210. A monitoring target blade module 200 is an information processing device that is used as a personal computer when accessed from outside, and is the device normally used when operating a server built from multiple blade modules. A spare blade module 210 is an information processing device that can likewise be used as a personal computer when accessed from outside; it is used for failure countermeasures for a monitoring target blade module 200 when the failure prediction and countermeasure method of the present invention is applied to a system that uses multiple blade modules.

As shown in FIG. 2, the monitoring target blade module 200 and the spare blade module 210 have the same functional configuration. Each module comprises a module main body 45, 46 containing an Agent 10, 11, an OS 20, 21, an HDD (or SSD) 30, 31, and a RAM with ECC 40, 41; a monitoring and control device BMC 50, 51; and a communication device NIC 60, 61. The management server 100 comprises a control unit 75 containing a Manager 70, and a communication device NIC 62. The blade modules 200 and 210 and the management server 100 operate, for example, under program control and are connected to one another via the network 500, such as an intranet.

The management server 100 is an information processing device for managing the blade modules 200 and 210. Using the installed Manager 70, it can acquire and manage information on the blade modules 200 and 210 under its management and can control those modules. The Manager 70 is software that manages the blade modules 200 and 210. It provides a power control function for the blade modules, performed by communicating with their BMCs 50 and 51; a function for requesting various information about the blade modules from the Agents 10 and 11; and a function for recording and managing information on the blade modules.

The Agents 10 and 11 are software installed on the blade modules 200 and 210 that assists the exchange of information between the Manager 70 of the management server 100 and the blade modules. The Agents 10 and 11 notify the Manager 70 of power state transitions of the blade modules 200 and 210 and, on request from the Manager 70, acquire information on the blade modules and send it to the Manager 70.

The HDDs (SSDs) 30 and 31 are auxiliary storage devices built into the blade modules 200 and 210 and hold S.M.A.R.T. information, which is self-diagnostic information. The RAMs with ECC 40 and 41 are main storage devices built into the blade modules; they can detect and correct a 1-bit error within 64 bits and can detect a 2-bit error within 64 bits. When an error is detected, the RAM with ECC 40, 41 passes that information to the OS 20, 21 as an ECC error log.

The NICs 60, 61, and 62 are devices that control communication between information processing devices over the network 500. In the blade modules 200 and 210, which have the BMCs 50 and 51, the NICs 60 and 61 control communication between the blade module main bodies 45, 46 and the BMCs 50, 51 on the one hand, and the management server 100 and external information processing devices such as the client terminals 400 on the other.

The BMCs 50 and 51 are monitoring and control devices built into the blade modules 200 and 210. Within a blade module, only the BMC runs on an independent power supply. The BMC provides a power control function for the blade module and a function for acquiring the voltage and temperature values of the blade module's board. Using IPMI commands, the BMC can also perform power control and information acquisition for its own blade module in response to instructions sent from an external information processing device via the network 500. The OSs 20 and 21 of the blade modules have IP addresses, which identify information processing devices on the network 500; the BMCs 50 and 51 have their own IP addresses, distinct from those of the OSs 20 and 21.

FIG. 3 is a diagram explaining the paths by which the management server 100 acquires the information it uses to determine deterioration of the blade module 200 and predict failures.

To predict failures of the monitoring target blade module 200, the management server 100 uses the following information about that module: (1) the S.M.A.R.T. information of the HDD (SSD) 30, (2) the ECC error log of the RAM with ECC 40, (3) the hardware monitor log, (4) the cumulative number of forced power-offs and forced resets of the blade module 200 performed by the Manager 70 of the management server 100, (5) the board voltage and temperature, and (6) the cumulative operating time.

The management server 100 obtains (1) the S.M.A.R.T. information of the HDD (SSD) 30, (2) the ECC error log of the RAM with ECC 40, and (3) the hardware monitor log through the Agent 10, by communication from the Manager 70 to the Agent 10. It obtains (5) the board voltage and temperature and (6) the cumulative operating time from the BMC 50, by communication from the Manager 70 to the BMC 50. Item (4), the cumulative number of forced power-offs and forced resets of the blade module 200, concerns operations that the Manager 70 itself performed, so the Manager 70 records it itself.
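The acquisition paths described for FIG. 3 can be summarized as a small lookup table. The following Python sketch is illustrative only: the key names and the `Source` enum are hypothetical labels chosen here, not identifiers from the patent.

```python
from enum import Enum


class Source(Enum):
    """Where the Manager 70 gets an item from (labels are illustrative)."""
    AGENT = "Agent 10"
    BMC = "BMC 50"
    MANAGER = "Manager 70 (own record)"


# Items (1) to (6) from the text, keyed by hypothetical short names.
ACQUISITION_PATH = {
    "smart_info": Source.AGENT,                # (1)
    "ecc_error_log": Source.AGENT,             # (2)
    "hardware_monitor_log": Source.AGENT,      # (3)
    "forced_off_reset_count": Source.MANAGER,  # (4)
    "board_voltage_temperature": Source.BMC,   # (5)
    "cumulative_operating_time": Source.BMC,   # (6)
}


def items_from(source):
    """Return the sorted names of all items obtained from the given source."""
    return sorted(k for k, v in ACQUISITION_PATH.items() if v is source)


print(items_from(Source.BMC))
```

Grouping the items by source in this way makes it easy to see that the BMC path still works when the blade module's OS is down, since the BMC runs on its own power supply and IP address.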

FIGS. 4 and 5 are sequence charts explaining an example of the processing that determines deterioration of the monitoring target blade module 200 and takes failure countermeasures in the client server system according to the embodiment. FIG. 6 is a diagram explaining the information used for the primary deterioration determination for the primary failure countermeasure, and FIG. 7 is a diagram explaining the events for which the secondary failure countermeasure is implemented. These are described next. The sequence shown in FIGS. 4 and 5 is an example of processing that applies failure countermeasures in two stages starting from normal operation. Because it is one continuous series of steps, FIGS. 4 and 5 are described as a single continuous sequence.

(1) Assume that a user has connected a client terminal 400 to one of the blade modules of the server 300 and is using that blade module. The blade module in use is the monitoring target blade module 200, and the spare blade module 210 is powered off (sequences A1, A2).

(2) In this state, the Manager 70 of the management server 100 requests, at fixed intervals, that the monitoring target blade module 200 send each item of information described above with reference to FIG. 3, acquires that information, and performs the primary deterioration determination for the monitoring target blade module 200 based on it (sequences A3 to A6).

The primary deterioration determination in sequence A6 uses the information shown in FIG. 6, which is described here.

The Manager 70 of the management server 100 acquires from the monitoring target blade module 200 each item listed in the "blade module information" column of FIG. 6 together with its value, listed in the "value" column. The Manager 70 then compares the acquired value with the threshold for that item, which it holds itself and which is listed in the "threshold" column. According to the result, it sets the item's variable, which is listed in the "variable" column and used for the deterioration determination, to either 0 or 1.

As shown in FIG. 6, examples of the blade module information that the Manager 70 of the management server 100 acquires from the monitoring target blade module 200 or holds itself are as follows, each given with its value and threshold:
S.M.A.R.T. attribute 5, number of reallocated bad sectors: value Vs5, threshold Ts5
S.M.A.R.T. attribute 7, magnetic head seek error rate: value Vs7, threshold Ts7
S.M.A.R.T. attribute 12, power on/off count: value Vs12, threshold Ts12
S.M.A.R.T. attribute 193, load/unload count: value Vs193, threshold Ts193
S.M.A.R.T. attribute 196, number of sector reallocation events: value Vs196, threshold Ts196
S.M.A.R.T. attribute 197, number of sectors pending reallocation: value Vs197, threshold Ts197
S.M.A.R.T. attribute 198, number of unrecoverable sectors: value Vs198, threshold Ts198
Cumulative number of forced power-offs and forced resets: value Vp, threshold Tp
Number of 1-bit ECC error log entries: value Ve1, threshold Te1
Board voltage: value Vv, thresholds TvL (minimum voltage threshold) and TvH (maximum voltage threshold)
Board temperature: value Vt, threshold Tt
Cumulative operating time: value Va, threshold Ta

The Manager 70 of the management server 100 compares the value of each item above with its threshold and, following the conditional expressions in the "condition for F = 0" and "condition for F = 1" columns of FIG. 6, sets the item's variable F to 0 or 1. The variables F for the individual items are Fs5, Fs7, Fs12, Fs193, Fs196, Fs197, Fs198, Fp, Fe1, Fv, Ft, and Fa.
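The threshold comparison that produces the variables F can be sketched as follows. The exact conditional expressions are given only in FIG. 6, which is not reproduced in the text, so this Python sketch assumes that F becomes 1 when a value strictly exceeds its threshold, and that the board voltage flag becomes 1 when the voltage lies outside the range from TvL to TvH. The function names, dictionary keys, and sample numbers are hypothetical.

```python
def exceeds(value, threshold):
    """Assumed rule: F = 1 when the value is above its threshold, else 0."""
    return 1 if value > threshold else 0


def voltage_flag(vv, tv_low, tv_high):
    """Assumed rule: Fv = 0 inside [TvL, TvH], 1 outside that range."""
    return 0 if tv_low <= vv <= tv_high else 1


def degradation_flags(values, thresholds):
    """Map each item name to its flag F (0 or 1).

    The key "v" is treated as the board voltage, whose threshold entry
    is a (TvL, TvH) pair; every other key has a single threshold.
    """
    flags = {}
    for name, value in values.items():
        if name == "v":
            low, high = thresholds["v"]
            flags[name] = voltage_flag(value, low, high)
        else:
            flags[name] = exceeds(value, thresholds[name])
    return flags


# Made-up sample readings, for illustration only.
values = {"s5": 12, "t": 45.0, "v": 11.2}
thresholds = {"s5": 10, "t": 70.0, "v": (11.4, 12.6)}
print(degradation_flags(values, thresholds))  # {'s5': 1, 't': 0, 'v': 1}
```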

In the determination of sequence A6, the Manager 70 of the management server 100 multiplies each variable F by a weight W (W = Ws5, Ws7, Ws12, Ws193, Ws196, Ws197, Ws198, Wp, We1, Wv, Wt, Wa; 0 < W ≤ 1) that is predetermined for that variable according to the importance of the item, and sums the products to obtain the deterioration progress E by equation (1). It then determines whether to perform the primary failure countermeasure described later according to whether E exceeds the primary deterioration determination value TE1.

E = Σ (F · W) …… (1)

(3) If the Manager 70 of the management server 100 determines in sequence A6 that the deterioration progress E does not exceed the primary deterioration determination value TE1, and that the primary failure countermeasure is therefore not to be performed, it returns to sequence A3 and repeats the operation from the request for information from the monitoring target blade module 200. When the deterioration progress E exceeds TE1 and the primary deterioration determination condition is satisfied, the Manager 70 transitions to a standby state in order to perform the primary failure countermeasure (sequence A8).
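Equation (1) and the comparison against TE1 amount to a weighted sum followed by a strict threshold test. A minimal Python sketch, assuming the flags and weights are held in dictionaries keyed by hypothetical item names; the sample weights and the value of TE1 are made up for illustration:

```python
def degradation_progress(flags, weights):
    """Compute E = sum of F * W over all items, as in equation (1)."""
    for name, w in weights.items():
        if not 0 < w <= 1:
            raise ValueError(f"weight for {name!r} must satisfy 0 < W <= 1")
    return sum(flags[name] * weights[name] for name in flags)


def needs_primary_countermeasure(e, te1):
    """The primary condition holds only when E strictly exceeds TE1."""
    return e > te1


# Made-up sample flags and weights.
flags = {"s5": 1, "e1": 1, "t": 0}
weights = {"s5": 0.5, "e1": 0.25, "t": 1.0}

e = degradation_progress(flags, weights)
print(e)                                      # 0.75
print(needs_primary_countermeasure(e, 0.7))   # True
```

Because each F is either 0 or 1, E is simply the total weight of the items that are currently past their thresholds, so TE1 can be read as "how much combined weight of warning signs is tolerated".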

(4) While in this primary failure countermeasure standby state, when the user finishes using the monitoring target blade module 200 and the Manager 70 of the management server 100 receives a power-off notification for the module from its Agent 10, the Manager 70 continues to wait for a fixed time to confirm that the notification was not caused by a restart or the like. If it receives a power-on notification from the monitoring target blade module 200 during this wait, it does nothing and remains in the primary failure countermeasure standby state until it receives another power-off notification. If it receives no power-on notification during the wait, the Manager 70 starts mirroring the monitoring target blade module 200 with the spare blade module 210 as the primary failure countermeasure (sequences A7, A9 to A11).
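The wait in step (4) is a simple check that no power-on notification follows the power-off notification within a fixed window. A minimal Python sketch, assuming notifications are available as (timestamp, kind) pairs; the function name, event representation, and the window length are hypothetical and not specified in the patent:

```python
def power_off_confirmed(events, off_time, wait_seconds):
    """Return True when no power-on notification arrives in the wait window.

    events is an iterable of (timestamp, kind) pairs, where kind is "ON"
    or "OFF". A power-on inside (off_time, off_time + wait_seconds] means
    the power-off was part of a restart, so mirroring should not start yet.
    """
    for timestamp, kind in events:
        if kind == "ON" and off_time < timestamp <= off_time + wait_seconds:
            return False
    return True


# Power-off at t=100 s with an assumed 30 s wait window.
print(power_off_confirmed([(105, "ON")], off_time=100, wait_seconds=30))  # False
print(power_off_confirmed([], off_time=100, wait_seconds=30))             # True
```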

(5) The Manager 70 communicates with the BMCs 50 and 51 of the monitoring target blade module 200 and the spare blade module 210, issues IPMI power-on commands, and powers on both modules (sequences A12 to A14).
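The patent does not say which tool the Manager 70 uses to send the IPMI power-on command. As one possible illustration, the sketch below builds the argument list for the widely used `ipmitool` CLI (`chassis power on` over the `lanplus` interface). The use of `ipmitool`, the helper name, the addresses, and the credentials are all assumptions made for this example; the command is only constructed, not executed.

```python
def ipmi_power_on_command(bmc_ip, user, password):
    """Build an ipmitool command line that asks a BMC to power on its host.

    Assumes the BMC is reachable over IPMI-over-LAN at its own IP address,
    which the text says is distinct from the IP address of the blade's OS.
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_ip, "-U", user, "-P", password,
        "chassis", "power", "on",
    ]


# Placeholder BMC addresses (documentation range) for modules 200 and 210.
for bmc_ip in ("192.0.2.50", "192.0.2.51"):
    print(" ".join(ipmi_power_on_command(bmc_ip, "admin", "secret")))
```

In a real deployment the list could be passed to `subprocess.run`, and the password would normally come from a protected store rather than the command line.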

(6) Then, on receiving power-on notifications from the Agents of both blade modules, the Manager 70 sends a data replication command to both blade modules 200 and 210 to start mirroring (sequences A15 to A17).

(7) On receiving the data replication command for starting mirroring, the monitoring target blade module 200 replicates the data in the areas of its HDD (SSD) 30 other than the system area to the HDD (SSD) 31 of the spare blade module 210. When replication starts, the Agent 10 of the monitoring target blade module 200 notifies the Manager 70 that replication has started (sequences A18, A19).

(8) When replication of the data outside the system area is complete, the Agent 11 of the spare blade module 210 notifies the Agent 10 of the monitoring target blade module 200 that replication is complete (sequence A20).

(9) After the replication completion notification has been exchanged, both blade modules 200 and 210 begin a restart: they notify the Manager 70 of power-off, power off, and then power on again (sequences A21 to A26).

(10) Before its own OS 20 and the OS 21 of the spare blade module 210 start, the monitoring target blade module 200 replicates the system area data to the spare blade module 210 and configures mirroring so that, from then on, every write to the HDD (SSD) 30 of the monitoring target blade module 200 is also made with identical content on the spare blade module 210 (sequences A27, A28).

(11) When the processing up to this point is complete, the Agent 11 of the spare blade module 210 notifies the Agent 10 of the monitoring target blade module 200 of the completion and the OS is started, which completes the mirroring setup (sequence A29).

(12) Both blade modules 200 and 210 notify the Manager 70 of power-on and from then on operate in the mirroring state, in which the data on their HDDs (SSDs) 30 and 31 is always identical (sequences A30 to A32, A34).

(13) The processing so far completes the primary failure countermeasure. The Manager 70 of the management server 100 continues to acquire the above information at a fixed cycle from the monitored blade module 200 for which the primary failure countermeasure has been completed, and performs a secondary deterioration determination based on the acquired information. In this determination, the Manager 70 decides to carry out the secondary failure countermeasure described later when the deterioration progress E described above becomes equal to or greater than the secondary deterioration determination value TE2 (TE2 > TE1), or when an event shown in FIG. 7 occurs (sequences A33 and A35 to A37).

Because the Manager 70 of the management server 100 uses the event information shown in FIG. 7 in the secondary deterioration determination of sequence A37, that event information is described here.

When an event occurs that is defined by the combination of the column naming the information acquired from the monitored blade module 200 and the column giving the content of that information, the Manager 70 of the management server 100 determines that secondary deterioration has occurred and decides to carry out the secondary failure countermeasure.

As shown in FIG. 7, the events are:
1. ECC error information: a 2-bit ECC error is detected.
2. S.M.A.R.T. attribute 1 (read error rate): the value exceeds the predetermined threshold (a S.M.A.R.T. error log entry is generated).
3. S.M.A.R.T. attribute 2 (hard disk throughput): the value exceeds the predetermined threshold (a S.M.A.R.T. error log entry is generated).
4. S.M.A.R.T. attribute 3 (spin-up time): the value exceeds the predetermined threshold (a S.M.A.R.T. error log entry is generated).
5. S.M.A.R.T. attribute 5 (number of reallocated bad sectors): the value has increased over the past several information acquisitions.
6. S.M.A.R.T. attribute 196 (number of sector reallocation events): the value has increased over the past several information acquisitions.
7. S.M.A.R.T. attribute 198 (number of unrecoverable sectors): the value has increased over the past several information acquisitions.
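The FIG. 7 event check can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Sample` record, the event labels, and the reading of "increased over the past several acquisitions" as "the newest value in the window is larger than the oldest" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# S.M.A.R.T. attribute IDs named in FIG. 7.
THRESHOLD_ATTRS = (1, 2, 3)      # event when the drive reports a threshold violation
TREND_ATTRS = (5, 196, 198)      # event when the value grows across recent acquisitions


@dataclass
class Sample:
    """One periodic acquisition from the monitored blade (hypothetical record)."""
    two_bit_ecc_error: bool = False
    smart_values: Dict[int, int] = field(default_factory=dict)
    threshold_exceeded: Dict[int, bool] = field(default_factory=dict)


def fig7_events(history: List[Sample]) -> List[str]:
    """Return labels of the FIG. 7 events present in the recent history window.

    `history` holds the last few acquisitions, oldest first.
    """
    if not history:
        return []
    latest = history[-1]
    events: List[str] = []
    if latest.two_bit_ecc_error:
        events.append("ecc-2bit")
    for attr in THRESHOLD_ATTRS:
        if latest.threshold_exceeded.get(attr, False):
            events.append(f"smart-{attr}-threshold")
    for attr in TREND_ATTRS:
        values = [s.smart_values[attr] for s in history if attr in s.smart_values]
        # Assumption: "increase" means newest value > oldest value in the window.
        if len(values) >= 2 and values[-1] > values[0]:
            events.append(f"smart-{attr}-increase")
    return events
```

Any non-empty result would count as "an event shown in FIG. 7 has occurred" in the secondary deterioration determination.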

(14) If, in the determination of sequence A37, the deterioration progress E has not exceeded the secondary deterioration determination value TE2 and none of the events described with reference to FIG. 7 has occurred, the Manager 70 of the management server 100 decides not to carry out the secondary failure countermeasure, returns to sequence A33, and repeats the operation starting from the request for information from the monitored blade module 200. When the deterioration progress E exceeds TE2 so that the secondary deterioration determination condition is satisfied, or when at least one of the events described with reference to FIG. 7 has occurred, the Manager 70 transitions to a standby state for carrying out the secondary failure countermeasure (sequence A39).
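The decision in steps (13) and (14) reduces to a simple predicate evaluated once per acquisition cycle. The sketch below is illustrative only; the function names and the sample values are hypothetical, and it uses "equal to or greater than TE2" as stated in step (13).

```python
from typing import Iterable, List, Optional, Tuple


def secondary_countermeasure_needed(e: float, te2: float, events: List[str]) -> bool:
    """True when E has reached TE2 or at least one FIG. 7 event was observed."""
    return e >= te2 or len(events) > 0


def first_cycle_entering_standby(
    cycles: Iterable[Tuple[float, List[str]]], te2: float
) -> Optional[int]:
    """Walk the periodic acquisitions (E, events) in order.

    Returns the index of the first cycle in which the Manager would move to
    the standby state for the secondary countermeasure, or None if the
    condition is never met and polling simply continues.
    """
    for index, (e, events) in enumerate(cycles):
        if secondary_countermeasure_needed(e, te2, events):
            return index
    return None
```

For example, with `te2=0.8` and cycles `[(0.5, []), (0.6, []), (0.85, [])]`, the third cycle (index 2) triggers the transition; an event in an earlier cycle would trigger it regardless of E.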

(15) When the user finishes using the monitored blade module 200 and the Manager 70 of the management server 100, in the above-described primary failure countermeasure standby state, receives a power OFF notification for the monitored blade module 200 from Agent 10 of that module, the Manager 70 keeps waiting for a fixed time in order to confirm that the power OFF notification was not caused by a restart or the like. If a power ON notification arrives from the monitored blade module 200 during this wait, the Manager 70 does nothing and remains in the primary failure countermeasure standby state until it receives a power OFF notification again. If no power ON notification arrives during the wait, the Manager 70 starts, as the secondary failure countermeasure, the process of switching from the monitored blade module 200 to the spare blade module 210. The switch changes the computer names and IP addresses of the monitored blade module 200 and the spare blade module 210, so that the spare blade module 210 becomes the blade module in operation. The blade module that had been the spare blade module 210 then takes the role of the monitored blade module 200, so that maintenance can be carried out (sequences A38 and A40 to A42).
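The wait in step (15), which separates a genuine shutdown from a restart, behaves like a debounce on the power notifications. The following is a rough sketch under assumptions: notifications are modelled as `(timestamp_seconds, "OFF" | "ON")` tuples, and `wait_seconds` and `now` are hypothetical parameters not given in the text.

```python
from typing import List, Optional, Tuple


def switchover_start_time(
    notifications: List[Tuple[float, str]], wait_seconds: float, now: float
) -> Optional[float]:
    """Return the time at which the switchover would begin, or None.

    `notifications` is sorted by timestamp. An "ON" that arrives within the
    wait after an "OFF" means the module merely restarted, so the Manager
    stays in standby. An "OFF" that is not followed by "ON" within
    `wait_seconds` starts the switchover.
    """
    off_time: Optional[float] = None
    for timestamp, kind in notifications:
        if off_time is not None and timestamp - off_time >= wait_seconds:
            # The wait had already expired before this notification arrived.
            return off_time + wait_seconds
        if kind == "OFF":
            off_time = timestamp
        elif kind == "ON":
            off_time = None  # restart detected: keep waiting in standby
    if off_time is not None and now - off_time >= wait_seconds:
        return off_time + wait_seconds
    return None
```

With a 30-second wait, the stream `[(0, "OFF"), (5, "ON"), (100, "OFF")]` observed at `now=200` yields 130: the first OFF is treated as a restart and the second as a real shutdown.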

(16) Before starting the switching process, the Manager 70 first communicates with BMC 50 of the monitored blade module 200 and issues an IPMI power ON command to power on the monitored blade module 200 (sequences A43 and A44).

(17) When the Manager 70 then receives the power ON notification from Agent 10 of the monitored blade module 200, it sends a mirroring-release command to both blade modules 200 and 210, and both modules end mirroring (sequences A45, A46, A48, and A49).

(18) The Manager 70 of the management server 100 then generates batch files that rewrite the computer names of both blade modules 200 and 210 and the IP addresses of their OSes and BMCs, and sends them to the two modules with a time offset, sending to blade module 200 first. Each batch file directs the receiving module to change its own computer name, rewrite the IP address of its own OS, rewrite the IP address of its own BMC, and reset its BMC (sequences A47 and A50).

(19) Agent 10 of the monitored blade module 200, on receiving its batch file from the Manager 70 of the management server 100, follows the contents of the file: it changes the computer name of the monitored blade module 200 to a dummy computer name, changes the IP addresses of OS 20 and BMC 50 to dummy IP addresses, and resets BMC 50. A number of dummy computer names and dummy IP addresses at least equal to the number of spare blades in the system is reserved in advance, and one of them is used (sequences A51 to A54).

(20) Agent 11 of the spare blade module 210, which receives its batch file from the Manager 70 of the management server 100 slightly later than blade module 200, follows the contents of the file: it changes the computer name of the spare blade module 210 to the original computer name of the monitored blade module 200, changes the IP addresses of OS 21 and BMC 51 to the original IP addresses that had been assigned to the monitored blade module 200, and resets BMC 51 (sequences A55 to A58).
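Steps (18) to (20) amount to planning two ordered lists of actions, one per module. The sketch below only builds that plan and does not touch any real system; the `Identity` record, the step labels, and the example names and addresses (taken from the documentation range 192.0.2.0/24) are hypothetical. Sending the monitored module's file first presumably keeps the two modules from holding the same name and address at the same moment, although the text does not state the reason.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]


@dataclass(frozen=True)
class Identity:
    """Network identity of one blade module (hypothetical record)."""
    computer_name: str
    os_ip: str
    bmc_ip: str


def batch_steps(target: Identity) -> List[Step]:
    """The four actions each batch file performs, in order."""
    return [
        ("set_computer_name", target.computer_name),
        ("set_os_ip", target.os_ip),
        ("set_bmc_ip", target.bmc_ip),
        ("reset_bmc", None),
    ]


def plan_identity_swap(
    monitored: Identity, dummy_pool: List[Identity]
) -> Tuple[List[Step], List[Step]]:
    """Return (steps for the monitored module, steps for the spare module).

    The monitored module takes a dummy identity from the reserved pool, and
    the spare module takes over the monitored module's original identity.
    The first list is meant to be sent first.
    """
    if not dummy_pool:
        raise ValueError("no dummy identity reserved")
    dummy = dummy_pool.pop(0)
    return batch_steps(dummy), batch_steps(monitored)
```

A call such as `plan_identity_swap(Identity("blade-a", "192.0.2.10", "192.0.2.110"), pool)` consumes one entry of `pool`, which is why the pool needs at least as many entries as there are spare blades.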

(21) The spare blade module 210, which has now become the monitored blade module, then notifies the Manager 70 of the management server 100 that the change of computer name and IP addresses is complete (sequence A59).

(22) On receiving this notification from the spare blade module 210, the Manager 70 of the management server 100 communicates with BMC 50 of blade module 200, which had been the monitored blade module until now, and instructs it via IPMI to perform a power OFF operation. BMC 50 turns off the power of blade module 200, and the processing here ends (sequences A60 to A62).
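The IPMI power operations in steps (16) and (22) could be issued with a command-line tool such as `ipmitool`. The text names only IPMI, so the use of `ipmitool`, the `lanplus` interface, and the credentials below are assumptions for illustration. The helper only builds the argument list and leaves execution to the caller.

```python
from typing import List


def ipmi_power_command(bmc_ip: str, user: str, password: str, action: str) -> List[str]:
    """Build an ipmitool invocation that powers a blade on or off via its BMC.

    Assumes ipmitool is installed and the BMC accepts IPMI-over-LAN
    (lanplus) sessions; the credentials are placeholders.
    """
    if action not in ("on", "off"):
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_ip, "-U", user, "-P", password,
        "chassis", "power", action,
    ]
```

Step (16) would correspond to `action="on"` against BMC 50 before the switch, and step (22) to `action="off"` after the spare module reports that its rename is complete. The returned list can be passed to `subprocess.run` on a host that has network access to the BMC.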

The processing described above completes failure prediction and switchover for the blade module.

FIGS. 8 and 9 are sequence charts illustrating another example of the processing that determines deterioration of the monitored blade module 200 and carries out a failure countermeasure in the client server system according to the embodiment of the present invention; this example is described next. The sequence shown in FIGS. 8 and 9 is an example in which a failure countermeasure is carried out only once, starting from normal operation. Because it is a single continuous process, FIGS. 8 and 9 are described as one continuous chart. In this example, the deterioration determination condition is the same as the secondary deterioration determination condition in the example of FIGS. 4 and 5: the deterioration progress E is equal to or greater than the secondary deterioration determination value TE2, or an event shown in FIG. 7 has occurred.

In the sequence shown in FIGS. 8 and 9, the processing in sequences B1 to B11, from monitoring of the blade module to the start of the failure countermeasure, is the same as the processing in sequences A1 to A11 in the example described with reference to FIGS. 4 and 5.

As for the specific failure countermeasure that follows, the data of the monitored blade module 200 is first replicated to the spare blade module 210 (sequences B12 to B30). This is the same operation as the data replication in sequences A12 to A27 and A29 to A31 of the primary failure countermeasure in the example described with reference to FIGS. 4 and 5.

In the sequence shown in FIGS. 8 and 9, after the data has been replicated, the blade module in operation is switched instead of mirroring being started (sequences B31 to B41). This is the same operation as the blade module switching performed as the secondary failure countermeasure in sequences A47 and A50 to A59 described with reference to FIGS. 4 and 5.

The example of FIGS. 8 and 9 was described with a deterioration determination condition identical to the secondary deterioration determination condition in the example of FIGS. 4 and 5, namely that the deterioration progress E is equal to or greater than the secondary deterioration determination value TE2 or that an event shown in FIG. 7 has occurred. Alternatively, the condition may be that E is equal to or greater than TE3 (TE2 > TE3 > TE1), the events may differ from those in the example of FIG. 7, or the condition may be solely that E is equal to or greater than TE3.

According to the embodiment of the present invention described above, when it is judged that a device to be maintained, such as an information processing device, is at high risk of failure, measures to avoid that failure can be taken. This reduces the device administrator's workload and avoids the situation in which a blade module is allowed to fail even though the risk of failure had been predicted.

In addition, according to the embodiment of the present invention, mirroring is performed only for a device judged to be at risk of failure, so fewer devices are needed than in an ordinary redundant configuration, and installation space, management effort, TCO, and the like can be reduced.

With ordinary failure prediction, an actual failure may occur earlier than the predicted failure date and time. According to the embodiment of the present invention, mirroring can be carried out early as a failure countermeasure, so the damage from a failure that occurs earlier than predicted can be limited.

With ordinary failure prediction, taking countermeasures such as device replacement early keeps the risk of an actual device failure low, but maintenance work becomes more frequent and costs rise. Taking such countermeasures late reduces the frequency of maintenance work, but raises the risk that the device fails before the predicted failure date and time. The embodiment of the present invention uses two-stage failure countermeasures, mirroring somewhat early as the first stage and replacing the device as the second stage, and so addresses both concerns.
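The two-stage policy can be viewed as a small state machine driven by the deterioration progress E and the two determination values. The sketch below is one possible reading, with hypothetical stage names and boolean event flags; it advances at most one stage per call because the switchover presupposes that the data has already been mirrored.

```python
from enum import Enum


class Stage(Enum):
    NORMAL = "normal operation"
    MIRRORED = "primary countermeasure done (mirroring with spare)"
    SWITCHED = "secondary countermeasure done (operation moved to spare)"


def next_stage(
    stage: Stage,
    e: float,
    te1: float,
    te2: float,
    primary_event: bool = False,
    secondary_event: bool = False,
) -> Stage:
    """Advance the countermeasure stage by at most one step.

    TE1 triggers mirroring and TE2 (> TE1) triggers the switchover; the
    event flags stand for the predetermined events of each determination.
    """
    if te2 <= te1:
        raise ValueError("TE2 must be greater than TE1")
    if stage is Stage.NORMAL and (e >= te1 or primary_event):
        return Stage.MIRRORED
    if stage is Stage.MIRRORED and (e >= te2 or secondary_event):
        return Stage.SWITCHED
    return stage
```

Choosing TE1 low makes mirroring start early at the cost of occupying a spare sooner, while TE2 controls how long the original module stays in service before it is replaced.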

Each process in the embodiment of the present invention described above can be implemented as a program and executed by a CPU of the system. These programs can be provided stored on a recording medium such as an FD, CD-ROM, or DVD, or provided as digital information over a network.

10, 11 Agent
20, 21 OS
30, 31 HDD (SSD)
40, 41 RAM with ECC
45, 46 Module main body
50, 51 BMC
60 to 62 NIC
70 Manager
75 Control unit
100 Management server
200, 210 Blade module
300 Server
400 Client terminal
500 Network
600 Administrator terminal

Claims

1. A failure prediction and countermeasure method for predicting in advance a failure of a device to be maintained and implementing a countermeasure against the failure, wherein
a management server is connected to the device to be maintained via a network,
the management server has management means for acquiring information on the device from the device to be maintained at a fixed cycle and managing the device, and
the management means calculates a deterioration progress of the device to be maintained based on the acquired information on the device to be maintained, and implements a failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined progress or when the information acquired from the device to be maintained indicates a predetermined event.

2. The failure prediction and countermeasure method according to claim 1, wherein the failure countermeasure for the device to be maintained is a countermeasure of switching operation from the device to be maintained to a spare device different from the device to be maintained.

3. The failure prediction and countermeasure method according to claim 1 or 2, wherein the information collected from the device to be maintained by the management means includes S.M.A.R.T. information, a hardware monitoring log, an ECC error log of RAM with ECC, board voltage and temperature values, and the cumulative number of forced power-offs and forced resets performed on the blade module.

4. A failure prediction and countermeasure method for predicting in advance a failure of a device to be maintained and implementing a countermeasure against the failure, wherein
a management server is connected to the device to be maintained via a network,
the management server has management means for acquiring information on the device from the device to be maintained at a fixed cycle and managing the device, and
the management means calculates a deterioration progress of the device to be maintained based on the acquired information on the device to be maintained, implements a first failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined first progress or when the information acquired from the device to be maintained indicates a predetermined event, and thereafter implements a second failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined second progress.

5. The failure prediction and countermeasure method according to claim 4, wherein the first failure countermeasure for the device to be maintained is a countermeasure of operating the device to be maintained mirrored with a spare device different from the device to be maintained, and the second failure countermeasure is a countermeasure of switching operation from the device to be maintained to a spare device different from the device to be maintained.

6. The failure prediction and countermeasure method according to claim 4 or 5, wherein the information collected from the device to be maintained by the management means includes S.M.A.R.T. information, a hardware monitoring log, an ECC error log of RAM with ECC, board voltage and temperature values, and the cumulative number of forced power-offs and forced resets performed on the blade module.

7. A client server system capable of predicting in advance a failure of a device to be maintained and implementing a countermeasure against the failure, comprising
a management server connected to the device to be maintained via a network, wherein
the management server has management means for acquiring information on the device from the device to be maintained at a fixed cycle and managing the device, and
the management means has means for calculating a deterioration progress of the device to be maintained based on the acquired information on the device to be maintained, and means for implementing a failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined progress or when the information acquired from the device to be maintained indicates a predetermined event.

8. A client server system capable of predicting in advance a failure of a device to be maintained and implementing a countermeasure against the failure, comprising
a management server connected to the device to be maintained via a network, wherein
the management server has management means for acquiring information on the device from the device to be maintained at a fixed cycle and managing the device, and
the management means has means for calculating a deterioration progress of the device to be maintained based on the acquired information on the device to be maintained, means for implementing a first failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined first progress or when the information acquired from the device to be maintained indicates a predetermined event, and means for thereafter implementing a second failure countermeasure for the device to be maintained when the calculated deterioration progress reaches a predetermined second progress.