JP5995265B2

JP5995265B2 - Information processing system, maintenance method and program

Info

Publication number: JP5995265B2
Application number: JP2012074131A
Authority: JP
Inventors: 久進藤
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2012-03-28
Filing date: 2012-03-28
Publication date: 2016-09-21
Anticipated expiration: 2032-03-28
Also published as: JP2013206105A

Description

The present invention relates to an information processing system, a maintenance method, and a program, and more particularly to an information processing system, a maintenance method, and a program that output instructions for maintenance work.

Patent Document 1 discloses a failure management system that uses the FRU table as initially configured to provide information including an accurate suspect ratio. According to that document, when the service processor detects a failure event, the failure management system searches failure history information using information identifying the failure event as a key and, if the failure event matches one that occurred in the past, extracts from the failure history information the failure factor part that has a trigger history of causing that failure event. When such a failure factor part is extracted, the system calculates a corrected suspect ratio by correcting the initial suspect ratio registered for that part in the FRU table according to the part's trigger frequency. The document further states that, when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information, the corresponding trigger frequency in the failure history information is incremented.

Patent Document 2 discloses an electronic device that calculates the recovery difficulty of an abnormal state that has occurred and sends a notification mail to a destination chosen according to that difficulty, so that the error is reported to an appropriate destination. According to that document, the electronic device includes a state detection unit that detects an abnormal state, a mail generation unit that generates a notification mail whose content depends on the detected abnormal state, a destination extraction unit that extracts a destination according to the detected abnormal state, and a communication control unit that controls transmission and retransmission of e-mail to the extracted destination. The state detection unit calculates the recovery difficulty of the detected abnormal state, the destination extraction unit extracts the destination of the notification mail based on the calculated recovery difficulty, the mail generation unit generates the notification mail for the extracted destination, and the communication control unit sends the generated notification mail to that destination.

Patent Document 1: JP 2011-175513 A
Patent Document 2: JP 2007-30444 A

The following analysis is given by the present invention. In recent years, with the downsizing of electronic equipment, distributed and parallel processing technologies have advanced, such as high-density server (blade, cluster) technology that connects many commodity processors over a high-speed network. As these technologies advance, structural design that reduces maintenance cost has become an important factor. At the same time, depending on the system configuration or form, even denser packaging is adopted, and periodic maintenance or maintenance after a failure can itself require advanced training and corresponding experience.

Even apart from the system configuration or form, failures can stem from design defects, initial defects or lot defects (quality variation) in the manufacturing process, degradation caused by the operating environment at the customer site, aging of components, and so on. Unless maintenance is adapted to these factors, it cannot be performed properly, which can prolong downtime, have side effects on systems that are operating normally, and cause serious damage to users.

As described above, proper maintenance requires training and experience matched to the difficulty of the maintenance work. The failure management system of Patent Document 1, however, focuses on providing an accurate suspect ratio for the failure factor part when a failure occurs, and cannot provide the difficulty of the maintenance work.

The electronic device of Patent Document 2 is described as calculating the recovery difficulty of an abnormal state when one is detected, but that recovery difficulty is used only to determine the destination of the notification mail that reports the abnormal state. Moreover, the recovery difficulty is simply read from a table (abnormal status area) that associates abnormal statuses with recovery difficulties, as in FIG. 4 of that publication. Such a device cannot cope with a system in which, as described above, not only the system configuration and form but also various other factors intertwine and must be considered for proper maintenance.

The present invention has been made in view of the above circumstances, and its object is to provide an information processing system, a maintenance method, and a program that support the handling of the wide variety of maintenance tasks of differing difficulty described above.

According to a first aspect of the present invention, there is provided an information processing system having replaceable parts including at least a processor and a memory, the system comprising: a failure factor analysis unit that, when an error code is output from one of the replaceable parts, accesses an information source storing information for identifying a failure factor part among the replaceable parts and failure history information of the failure factor part, and acquires failure factor part information with a suspect ratio corresponding to the error code and the failure history information of the failure factor part; and a maintenance level calculation unit that calculates a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information, wherein the system outputs maintenance work instruction information including the maintenance level.

According to a second aspect of the present invention, there is provided a maintenance method performed by an information processing system having replaceable parts including at least a processor and a memory, the method comprising the steps of: when an error code is output from one of the replaceable parts, accessing an information source storing information for identifying a failure factor part among the replaceable parts and failure history information of the failure factor part, and acquiring failure factor part information with a suspect ratio corresponding to the error code and the failure history information of the failure factor part; calculating a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information; and outputting maintenance work instruction information including the maintenance level. This method is tied to a specific machine, namely an information processing system that outputs, to maintenance personnel, maintenance work instruction information including the maintenance level.

According to a third aspect of the present invention, there is provided a program that causes a computer included in an information processing system having replaceable parts including at least a processor and a memory to execute: a process of acquiring, when an error code is output from one of the replaceable parts, failure factor part information with a suspect ratio and failure history information of the failure factor part; a process of calculating a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information; and a process of outputting maintenance work instruction information including the maintenance level. This program can be recorded on a computer-readable storage medium. That is, the present invention can also be embodied as a computer program product.

According to the present invention, it is possible to suitably support the handling of a wide variety of maintenance tasks of differing difficulty.

A diagram showing the configuration of one embodiment of the present invention.
A diagram showing the configuration of the information processing system of the first embodiment of the present invention.
A diagram showing the information held in the FRU table of the information processing system of the first embodiment of the present invention.
A flowchart showing the operation of the information processing system of the first embodiment of the present invention.
A diagram showing the flow of maintenance work.
An example of a server on which a plurality of systems are operated using logical partitions.
A diagram showing a configuration example of logical partitions.
A diagram showing the range affected when a failure occurs in the configuration of FIG. 7.
A diagram showing a circuit configuration that controls the lighting of the status display lamps (error / maintenance state) of the information processing system of the first embodiment of the present invention.
A diagram showing an example of maintenance reporting means and failure reporting means that can be mounted.
A diagram for explaining the error threshold change process in the configuration of FIG. 7.
A flowchart showing the determination flow at the time of error detection.

First, an outline of one embodiment of the present invention will be described with reference to the drawings. The drawing reference numerals in this outline are attached to the elements for convenience, as an example to aid understanding, and are not intended to limit the present invention to the illustrated form.

As shown in FIG. 1, in one embodiment the present invention can be realized by a configuration including a failure factor analysis unit 110, a maintenance level calculation unit 120, and a display unit 130 that outputs maintenance work instruction information including the maintenance level.

More specifically, the failure factor analysis unit 110 accesses an information source 100 that stores information for identifying a failure factor part among the replaceable parts included in the managed system and failure history information of the failure factor part. From an error code issued by a replaceable part of the managed system, it generates failure factor part information with a suspect ratio and the failure history information of the failure factor part.

The maintenance level calculation unit 120 then calculates a maintenance level indicating the difficulty of the maintenance work based on the failure factor part and the failure history information. For example, when failures occur frequently in a particular failure factor part, the maintenance level calculation unit 120 raises the maintenance level. As a result, maintenance is performed by personnel with correspondingly advanced training and experience, or is carried out offline rather than online.

[First Embodiment]

Next, a first embodiment of the present invention will be described in detail with reference to the drawings. FIG. 2 is a block diagram showing the configuration of the first embodiment of the present invention.

FIG. 2 shows an information processing system 11 connected to a failure information management server 12. The information processing system 11 includes a main storage device (hereinafter "MEM") 21, a plurality of processors (hereinafter "PROC") 22, a plurality of node controllers (hereinafter "NC") 23, a plurality of crossbar switches (hereinafter "Xbar") 24, a plurality of input/output devices (hereinafter "IO") 25, a service processor (hereinafter "SVP") 26, a data collection unit 40, a failure factor analysis unit 41, a maintenance level calculation unit 42, a console 43, and a configuration information analysis unit 44.

The information processing system 11 also includes, as information sources, an FRU (Field Replaceable Unit) table 30, a failure history storage unit A31, and a failure history storage unit B32.

When a failure is detected at one or more of the MEM 21, PROC 22, NC 23, Xbar 24, and IO 25, the error is reported to the SVP 26 via signal line e001.

Upon receiving the error report, the SVP 26 collects the failure information of the MEM 21, PROC 22, NC 23, Xbar 24, and IO 25 from its service log. Using the error indicator flag (EIF) contained in the failure information as a key, the SVP 26 then extracts the failure factor part (NAME), its suspect ratio (RATE), and related data from the FRU table 30 and registers them in the failure history storage unit A31.

The FRU table 30 registers, for each error indicator flag (EIF) representing an error code, the corresponding failure factor part (NAME), its suspect ratio (RATE), the manufacturing lot number or revision number (REV), the vendor ID (VID), and so on.

FIG. 3 shows an example of the FRU table 30. For example, the EIF "N0_EIF_0" is associated, as FRU[0], with NAME=N0_NCC, RATE=100, REV=A0001, and VID=000. This means that the error indicator flag (EIF) "N0_EIF_0" identifies the N0 node controller card "N0_NCC" as the failure factor part with a suspect ratio of 100%. Similarly, the EIF "N0_EIF_2" identifies port 0 of the N0 node controller "N0_P0", port 0 of the N1 node controller "N1_P0", and cable A (CABLE_A) as failure factor parts, with suspect ratios of 49%, 50%, and 1%, respectively.
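The lookup described for FIG. 3 can be sketched as a small in-memory table keyed by EIF. This is only an illustrative sketch: the `FruEntry` class, the `FRU_TABLE` dictionary, and the `lookup_suspects` function are hypothetical names, and the REV and VID values for "N0_EIF_2" are placeholders because the text does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FruEntry:
    name: str  # NAME: failure factor part
    rate: int  # RATE: suspect ratio in percent
    rev: str   # REV: manufacturing lot / revision number
    vid: str   # VID: vendor ID


# Entries for N0_EIF_0 follow the example in the text.
# REV and VID for N0_EIF_2 are not stated, so "?" is a placeholder.
FRU_TABLE = {
    "N0_EIF_0": [FruEntry("N0_NCC", 100, "A0001", "000")],
    "N0_EIF_2": [
        FruEntry("N0_P0", 49, "?", "?"),
        FruEntry("N1_P0", 50, "?", "?"),
        FruEntry("CABLE_A", 1, "?", "?"),
    ],
}


def lookup_suspects(eif):
    """Return the failure factor parts for an EIF, highest suspect ratio first."""
    return sorted(FRU_TABLE.get(eif, []), key=lambda e: e.rate, reverse=True)
```

Sorting by suspect ratio is a presentation choice made here for illustration; the text only states that the parts and their ratios are obtained from the EIF.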

The FRU table of FIG. 3 can also hold, for each EIF, an error count (Error Count) and a maintenance status flag (MN1). The SVP 26 updates this information as needed.

The failure history storage unit A31 stores history information on failures detected in the information processing system 11. Including manufacturing lots, vendor IDs, and the like in this history makes it possible to analyze, from the failure frequency, functions with little design margin and components that fail often in particular manufacturing lots. The failure history storage unit A31 has a table that manages an error count field for each EIF, so that the number of error occurrences for each EIF can be tracked.

The failure history storage unit B32 stores history information on failures detected in other information processing systems, provided by the failure information management server 12 via signal line n001. The failure history storage unit B32 likewise has a table that manages an error count field for each EIF, so that the number of error occurrences for each EIF can be tracked.

When the SVP 26 has finished updating the FRU table 30 and the failure history storage unit A31, the data collection unit 40 collects the data in the FRU table 30, the failure history storage unit A31, and the failure history storage unit B32, and outputs it to the failure factor analysis unit 41 and the configuration information analysis unit 44.

Based on the data output from the data collection unit 40, the failure factor analysis unit 41 analyzes the reported error against the past failure history, the failure history of other information processing systems, manufacturing lots, and so on, and generates failure factor part information with a suspect ratio and the failure history information of the failure factor part. The failure factor analysis unit 41 may also query an external server or the like, as needed, for information on the part identified as the failure factor part.

The maintenance level calculation unit 42 calculates the maintenance difficulty based on the failure factor part information with the suspect ratio and the failure history information of the failure factor part, both output from the failure factor analysis unit 41, together with the maintenance information obtained via signal line n002. It then outputs maintenance work instruction information that contains the maintenance difficulty information (maintenance level) along with the failure factor part information with the suspect ratio. For example, when the maintenance support information includes the number of failures caused by mistakes during maintenance (procedural errors) or by the maintenance itself (for example, excessive pushing in or pulling out), the maintenance level calculation unit 42 assigns a high maintenance level (high difficulty) to parts with many such maintenance-induced failures. The maintenance level can be obtained, for example, by using a predetermined formula to convert the failure factor part, the number of failures, the number of those failures caused by maintenance mistakes, and so on into a score, and comparing that score with predetermined thresholds for each level. The thresholds are preferably set for each failure factor part based on, for example, the FIT (Failure In Time) value of the component.
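The score-and-threshold approach described above can be sketched as follows. The text does not disclose the formula or the threshold values, so everything here is an assumption made for illustration: the function names, the weighting that counts maintenance-induced failures twice, the `part_weight` parameter, and the thresholds 5.0 and 10.0 are all hypothetical.

```python
def maintenance_score(failure_count, maintenance_error_count, part_weight=1.0):
    """Convert failure statistics for one failure factor part into a score.

    Hypothetical formula: failures caused by maintenance mistakes are
    weighted double, and the sum is scaled by a per-part weight.
    """
    return part_weight * (failure_count + 2 * maintenance_error_count)


def maintenance_level(score, thresholds=(5.0, 10.0)):
    """Map a score to a level: 1 (low), 2 (medium), 3 (high difficulty).

    `thresholds` stands in for the per-part thresholds that the text
    suggests deriving from values such as the component's FIT value.
    """
    level = 1
    for threshold in sorted(thresholds):
        if score >= threshold:
            level += 1
    return level
```

With these assumed values, a part with 4 failures of which 3 were maintenance-induced scores 10.0 and lands in the highest level, which matches the intent that parts prone to maintenance mistakes be handled by more experienced personnel.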

The console 43 outputs the maintenance work instruction information received from the maintenance level calculation unit 42, which contains the failure factor part information with the suspect ratio and the maintenance difficulty information (maintenance level).

Based on the data output from the data collection unit 40, the configuration information analysis unit 44 analyzes the configuration information in light of the operating mode and passes the result to the SVP 26. The analysis result includes, for example, information on whether operation can continue given the side effects of maintenance operations on the failure factor part (removing and attaching covers, moving or disconnecting cables, replacing modules). If the analysis determines that operation can continue, the SVP 26 shifts the MEM 21, PROC 22, NC 23, Xbar 24, and IO 25 into maintenance mode and changes the error count and related settings used during maintenance mode.

The failure information management server 12 is a server that is connected, via signal lines n003 and n004, to information processing systems including the information processing system 11, and that collects failure information. More specifically, the failure information management server 12 includes a failure information database (failure information DB) 35 that accumulates the collected failure information, and a maintenance information database (maintenance information DB) 36 that stores, as maintenance support information, information on maintenance work performed on those information processing systems, including the information processing system 11. The information stored in the failure information DB 35 is transferred at a predetermined timing, via signal line n001, to the failure history storage unit B32 of the information processing system 11.

The data collection unit 40, the failure factor analysis unit 41, the maintenance level calculation unit 42, and the configuration information analysis unit 44 of the information processing system 11 shown in FIG. 2 can each be realized by a computer program that causes a computer mounted in the information processing system 11 to execute the processes described above using its hardware.

Next, the operation of this embodiment will be described in detail with reference to the drawings. The following failures can occur in the information processing system 11.
・Between the MEM 21 and the PROC 22
・Between the PROC 22 and the NC 23
・Between the NC 23 and the Xbar 24
・Between the Xbar 24 and the IO 25
・Between the SVP 26 and any of the MEM 21, PROC 22, NC 23, Xbar 24, and IO 25
・Single-unit failure of the MEM 21, PROC 22, NC 23, Xbar 24, IO 25, or SVP 26
The following description covers the operation when a failure occurs between the Xbar 24, which connects the nodes of the information processing system 11, and the IO 25.

FIG. 4 is a flowchart showing the operation of the information processing system of the first embodiment of the present invention when an error is detected in either the Xbar 24 or the IO 25. Referring to FIG. 4, when a failure between the Xbar 24 and the IO 25 causes an error to be detected in either of them (step S001), the error is reported to the SVP 26 (step S002).

On receiving the error report, the SVP 26 collects the failure information of the Xbar 24 and the IO 25 from its service log (SVP log) (step S003).

Next, the SVP 26 searches the FRU table 30 for the matching data, using the error indicator flag (EIF) contained in the collected failure information as a key (step S004). The SVP 26 then stores the data retrieved from the FRU table 30 in the failure history storage unit A31 and starts the data collection unit 40. At this point, the SVP 26 increments by one the error count field of the matching entry in the FRU table 30.
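The registration and count update performed by the SVP 26 in this step can be sketched with plain dictionaries and lists. The structures `fru_table` and `history_a`, their field names, and the function `handle_error_report` are hypothetical stand-ins for the FRU table 30 and the failure history storage unit A31; the text does not specify their layout or what happens for an EIF that is missing from the FRU table.

```python
# Hypothetical in-memory stand-ins; field names are assumptions.
fru_table = {
    "N0_EIF_0": {"suspects": [("N0_NCC", 100)], "error_count": 0},
}
history_a = []  # stands in for the failure history storage unit A31


def handle_error_report(eif):
    """Look up the EIF, register the result in history A, bump the count.

    Returns True if this EIF was already present in history A before
    this registration, which is the condition later checked in step S005.
    An unknown EIF raises KeyError (behaviour not described in the text).
    """
    entry = fru_table[eif]
    seen_before = any(record["eif"] == eif for record in history_a)
    history_a.append({"eif": eif, "suspects": list(entry["suspects"])})
    entry["error_count"] += 1
    return seen_before
```

Computing `seen_before` ahead of the append mirrors the requirement that the later check refer to the state before the registration in step S004.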

The data collection unit 40 first checks whether data corresponding to the error indicator flag (EIF) identified in step S003 had already been registered in the failure history storage unit A31 before the registration in step S004 (step S005).

If data corresponding to the EIF identified in step S003 was registered in the failure history storage unit A31 (Yes in step S005), the data collection unit 40 checks whether data corresponding to that EIF is registered in the failure history storage unit B32 (step S006-1).

If data corresponding to the EIF identified in step S003 was not registered in the failure history storage unit A31 (No in step S005), the data collection unit 40 likewise checks whether data corresponding to that EIF is registered in the failure history storage unit B32 (step S006-2).

Depending on the results of steps S005, S006-1, and S006-2, one of steps S007 to S010 is performed. When the same failure history exists in either or both of the failure history storage unit A31 and the failure history storage unit B32, the data collection unit 40 increments by one the error occurrence count in the table that manages the error count field of each storage unit concerned.

First, when the same failure history exists in both the failure history storage unit A31 and the failure history storage unit B32 (Yes in both steps S005 and S006-1), the data collection unit 40 outputs both failure histories to the failure factor analysis unit 41. The failure factor analysis unit 41 compares and analyzes conditions such as the manufacturing lot and the vendor ID across both sets of failure history information, and determines whether the failure factor part and its suspect ratio need to be corrected (step S007).

If the same failure history exists in failure history storage unit A31 but not in failure history storage unit B32 (Yes in step S005, No in step S006-1), the data collection unit 40 outputs the failure history from failure history storage unit A31 to the failure factor analysis unit 41. The failure factor analysis unit 41 compares and analyzes conditions such as the manufacturing lot and vendor ID in the failure history information from failure history storage unit A31, and determines whether the failure factor part and its suspect ratio need to be corrected (step S008).

If the same failure history does not exist in failure history storage unit A31 but does exist in failure history storage unit B32 (No in step S005, Yes in step S006-2), the data collection unit 40 outputs the failure history from failure history storage unit B32 to the failure factor analysis unit 41. The failure factor analysis unit 41 compares and analyzes conditions such as the manufacturing lot and vendor ID in the failure history information from failure history storage unit B32, and determines whether the failure factor part and its suspect ratio need to be corrected (step S009).

If the same failure history exists in neither failure history storage unit A31 nor failure history storage unit B32 (No in both steps S005 and S006-1), the data collection unit 40 sends the FRU table data as is. Using the FRU table data, the failure factor analysis unit 41 outputs the failure factor part and its suspect ratio (step S010).
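The branching in steps S005 to S010 can be summarized as a small decision function. The following is a minimal Python sketch, not the implementation described in the patent; the names `Selection`, `select_analysis_step`, and `bump_error_count`, the dictionary layout of the history stores, and the sample EIF value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selection:
    step: str                 # flowchart step label used in the text
    sources: tuple            # history stores passed to the analysis unit
    use_fru_table_only: bool  # True when no matching history exists


def select_analysis_step(in_a31: bool, in_b32: bool) -> Selection:
    """Map the A31 lookup (S005) and the B32 lookup (S006-x) to a step."""
    if in_a31 and in_b32:
        return Selection("S007", ("A31", "B32"), False)
    if in_a31:
        return Selection("S008", ("A31",), False)
    if in_b32:
        return Selection("S009", ("B32",), False)
    return Selection("S010", (), True)


def bump_error_count(store: dict, eif: str) -> None:
    """Add 1 to the error occurrence count if the EIF already has a record."""
    if eif in store:
        store[eif]["error_count"] += 1


# Hypothetical store contents keyed by error indicator (EIF).
a31 = {"EIF-01": {"error_count": 2}}
b32 = {}
choice = select_analysis_step("EIF-01" in a31, "EIF-01" in b32)
bump_error_count(a31, "EIF-01")
bump_error_count(b32, "EIF-01")
print(choice.step, a31["EIF-01"]["error_count"])  # prints "S008 3"
```

Keeping the step selection separate from the counter update mirrors the text, where the count is incremented regardless of which of steps S007 to S009 is taken.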

The method of correcting the suspect ratio in steps S007 to S009 is described in detail in Patent Document 1.

Next, the maintenance level calculation unit 42 calculates the difficulty of maintenance based on the information obtained in steps S007 to S010 and the maintenance support information, and generates and outputs maintenance work instruction information that includes the maintenance difficulty information (maintenance level) together with the failure factor part information with suspect ratios and the maintenance support information (step S011). If the calculation shows that the maintenance level is high (high difficulty), or if a secondary effect of the maintenance on other devices is foreseen, the maintenance level calculation unit 42 generates an instruction such as one stating that online maintenance should not be performed.
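As an illustration of step S011, the sketch below assembles instruction information and disallows online maintenance when the level is high or side effects are expected. The text does not disclose how the maintenance level is computed, so the scoring rule, the three level names, and every identifier here (`SuspectPart`, `maintenance_level`, `build_instruction`, `base_difficulty`, `past_maintenance_errors`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SuspectPart:
    name: str
    suspect_ratio: float  # fraction between 0.0 and 1.0


def maintenance_level(base_difficulty: int, past_maintenance_errors: int) -> str:
    """Hypothetical scoring: a base difficulty plus a capped error penalty."""
    score = base_difficulty + min(past_maintenance_errors, 2)
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"


def build_instruction(parts, level: str, side_effects_expected: bool) -> dict:
    """Assemble instruction info; forbid online maintenance when risky."""
    online_allowed = level != "high" and not side_effects_expected
    return {
        "parts": sorted(parts, key=lambda p: p.suspect_ratio, reverse=True),
        "maintenance_level": level,
        "online_maintenance_allowed": online_allowed,
    }


parts = [SuspectPart("Xbar1", 0.3), SuspectPart("IO1", 0.7)]
instruction = build_instruction(parts, maintenance_level(3, 5), False)
print(instruction["maintenance_level"], instruction["online_maintenance_allowed"])
```

Sorting the parts by suspect ratio is a presentation choice in this sketch so that the most likely failure factor part appears first on the console.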

Finally, the console 43 displays the maintenance instruction output from the maintenance level calculation unit 42 (step S012). In accordance with the maintenance instruction, the maintenance engineer, for example, creates a maintenance schedule for switching to maintenance after operation is stopped (offline maintenance), and starts the maintenance work.

Next, the overall flow of the maintenance work is described. FIG. 5 shows the flow of the maintenance work. In the following, the information processing system 11 of this embodiment is assumed to be a server on which multiple systems run in logical partitions, as shown in FIG. 6. The NC 23, Xbar 24, and IO 25 are connected as shown in FIG. 7, forming partition 0 (PAR0) and partition 1 (PAR1), which are assigned to and run a first operating system (OS_0) and a second operating system (OS_1), respectively.

Here, upon detecting a failure between Xbar1 and IO1 in FIG. 8, the information processing system 11 issues a maintenance operation lock that provisionally prohibits maintenance operations, and indicates that a failure has been detected by, for example, blinking a failure status lamp. FIG. 9 shows a circuit configuration that controls the lighting of the error status lamp (EF display) 48 and the maintenance status lamp (MF display) 47 provided in the information processing system 11 (in FIG. 11, only device A 11A and device B 11B of the information processing system 11 are shown). When a failure (EF) is detected in either of the devices 11A and 11B included in the information processing system 11, the EF control unit 46 blinks the error status lamp (EF display) 48. This allows maintenance engineers and others to notice the failure even when they are not at the console.

Next, the information processing system 11 determines whether the failure can be corrected by, for example, the system's automatic correction function (step S101). If the failure is determined to be correctable (YES in step S101), the information processing system 11 performs the automatic correction and turns off the error status lamp (EF display) 48 (step S102).

If the failure is determined not to be correctable (NO in step S101), it is next determined whether the affected data can be retransmitted (step S103). If the data cannot be retransmitted (NO in step S103), the failure is determined to be unrecoverable, and the path between Xbar1 and IO1 is blocked.

If the affected data can be retransmitted (YES in step S103), it is determined whether to transition to the maintenance mode (step S104). If no other system is in operation and there is no risk of a secondary system failure caused by a maintenance operation error or the like, the transition to the maintenance mode is determined to be unnecessary, and maintenance is performed based on the maintenance instructions (NO in step S104). Specifically, if the number of occurrences of the detected failure (error count) is less than a predetermined threshold (NO in step S105), operation continues (step S106); otherwise, the failure is determined to be unrecoverable, and the path between Xbar1 and IO1 is blocked. In the example of FIG. 5, before the path between Xbar1 and IO1 is blocked, an error determination (step S107) is made as to whether to output a preventive maintenance notification (step S108).

On the other hand, when the first operating system (OS_0) and the second operating system (OS_1) are both in operation, as shown in FIG. 8, the system transitions to the maintenance mode in step S104. In this case, the information processing system 11 lights the maintenance status lamp (MF display) 47 shown in FIG. 9.
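The triage in steps S101 to S106 can be sketched as a single function. This is an illustrative reading of the flow described above, not the actual control logic; the `Action` enum, the `triage` function, and its parameter names are assumptions, and the preventive maintenance notification (steps S107 and S108) is reduced to a comment.

```python
from enum import Enum


class Action(Enum):
    AUTO_CORRECT = "auto-correct and turn off the EF lamp (S102)"
    BLOCK_LINK = "treat as unrecoverable and block Xbar1-IO1"
    CONTINUE = "continue operation (S106)"
    MAINTENANCE_MODE = "enter maintenance mode and light the MF lamp"


def triage(correctable: bool, retransmittable: bool,
           other_systems_running: bool, error_count: int,
           threshold: int) -> Action:
    """Follow the S101 -> S103 -> S104 -> S105 decisions in order."""
    if correctable:                # S101 YES
        return Action.AUTO_CORRECT
    if not retransmittable:        # S103 NO
        return Action.BLOCK_LINK
    if other_systems_running:      # S104 YES
        return Action.MAINTENANCE_MODE
    if error_count < threshold:    # S105 NO
        return Action.CONTINUE
    # In FIG. 5 the S107/S108 notification check precedes the blocking.
    return Action.BLOCK_LINK


print(triage(False, True, True, 1, 3).name)  # prints "MAINTENANCE_MODE"
```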

Having recognized from the lit maintenance status lamp 47 or the display on the console 43 that the system has entered the maintenance mode, the maintenance engineer makes an overall judgment based on the failure history, maintenance information, maintenance level (difficulty), and other information displayed on the console 43, and decides whether to perform maintenance during operation (online maintenance) (step S109).

For example, if the maintenance level is high (high difficulty), secondary effects of the maintenance on other devices are foreseen, so the maintenance engineer can forgo online maintenance and, in consultation with the customer, change to a maintenance schedule that eliminates secondary effects that could lead to a system outage (the work in the flowchart of FIG. 5 is suspended).

On the other hand, if the maintenance level is not high (low to medium difficulty) and online maintenance is judged to be possible, the maintenance engineer releases the maintenance lock and starts the maintenance work.

First, before replacing the failure factor part IO1, the maintenance engineer performs threshold change processing (step S110). FIG. 10 shows the locations whose error thresholds are changed when the failure factor part IO1 is replaced. In the example of FIG. 10, to minimize the effect of replacing IO1 on the logical partitions, the SVP 26 changes the error count thresholds for Xbar0-I0, Xbar0-I1, Xbar1-I0, and Xbar1-I1. Specifically, the error count thresholds are provisionally raised, or the error counts are disabled, so that even if an error occurs during the replacement work, disconnection of the system and output of a preventive maintenance notification are suppressed.

As a result, as shown in FIG. 11, parameters such as the thresholds are adjusted in the maintenance mode (S202, S204, S206), so even if a failure is detected at a location related to the replaced part (S203, S205, S207 in FIG. 11), error determinations 1 to 3 (S208 to S210) return a negative result. The subsequent error analysis (S211) likewise takes into account the fact that the system is in the maintenance mode, the configuration information, and these error determination results, and decides whether to suspend the maintenance and whether to display an error. These results are also accumulated in the maintenance information DB 36.

After the threshold change, if the number of occurrences of the detected failure (error count) is less than the changed threshold (NO in step S111), operation continues with the maintenance mode maintained (step S112). As a result, secondary effects of the maintenance on other systems are minimized.

On the other hand, if the number of occurrences of the detected failure (error count) exceeds the changed threshold, an error determination is made according to the flow shown in FIG. 11 (step S113). If the result is NG (maintenance suspended), an error determination notification is output to the console 43 or the like (step S116), the failure is determined to be unrecoverable, and the path between Xbar1 and IO1 is blocked. In this case, the maintenance status lamp (MF display) 47 and the error status lamp (EF display) 48 shown in FIG. 9 may also be lit. The number of error status lamps (EF display) 48 that are lit may be controlled according to the number of error locations, the number of errors, and so on.

If the result of the error determination is OK (maintenance can continue), operation continues with the maintenance mode maintained (step S114). As a result, secondary effects of the maintenance on other systems are minimized.
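The checks made while the maintenance mode is active (steps S111 to S116) reduce to the following sketch. The function name `maintenance_check` and its parameters are hypothetical, and the text does not say what happens when the error count equals the changed threshold; this sketch assumes that case also goes to the error determination of step S113.

```python
def maintenance_check(error_count: int, raised_threshold: int,
                      judgement_ok: bool) -> str:
    """Decide whether online maintenance continues after an error."""
    if error_count < raised_threshold:  # NO in step S111
        return "continue (S112)"
    # Step S113: error determination per FIG. 11 (result passed in here).
    if judgement_ok:
        return "continue (S114)"
    return "abort and notify (S116)"


print(maintenance_check(5, 1000, False))  # prints "continue (S112)"
```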

When the maintenance has been completed through the above process, the error count thresholds and other settings shown in FIG. 10 are restored, the maintenance mode is released, and the system returns to normal operation.
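Raising the thresholds before the replacement and restoring them afterwards is a natural fit for a scoped helper. The sketch below uses a Python context manager so the original values come back even if the work is aborted; the function name `raised_thresholds`, the dictionary representation, and the values 3 and 1000 are assumptions for illustration, while the link names are those listed for FIG. 10.

```python
from contextlib import contextmanager


@contextmanager
def raised_thresholds(thresholds: dict, links, temporary_value: int):
    """Temporarily raise error-count thresholds for the given links."""
    saved = {link: thresholds[link] for link in links}
    try:
        for link in links:
            thresholds[link] = temporary_value
        yield thresholds
    finally:
        thresholds.update(saved)  # restore even if the work is aborted


thresholds = {"Xbar0-I0": 3, "Xbar0-I1": 3, "Xbar1-I0": 3, "Xbar1-I1": 3}
with raised_thresholds(thresholds, list(thresholds), 1000) as active:
    during = dict(active)  # snapshot taken while IO1 is being replaced
print(during["Xbar1-I1"], thresholds["Xbar1-I1"])  # prints "1000 3"
```

Disabling the error counts outright, the alternative the text mentions, could be modeled the same way by saving and restoring an enable flag instead of a numeric threshold.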

As described above, according to this embodiment, maintenance work instruction information including the maintenance level is output, so the maintenance engineer can be guided to perform appropriate maintenance. In addition, even when the maintenance level is not high and maintenance is carried out, raising the error count threshold as appropriate, as in the maintenance workflow described above, reduces the effect on other systems in operation. Together, these shorten the mean time to recovery (MTTR), which has become an issue in recent years, reduce the burden on maintenance engineers, and keep the impact on customers to a minimum.

When the configuration is changed because of system degradation or expansion, reflecting the failure history and maintenance history information makes it possible to support system configuration and maintenance while avoiding, as far as possible, parts with a high incidence of failure. This minimizes the impact on the mean time between failures (MTBF), which in turn improves system availability and the reliability of the system as a whole.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment, and further modifications, substitutions, and adjustments can be made without departing from the basic technical idea of the present invention. For example, the embodiment was described using maintenance at the time of failure detection as an example, but the invention is equally applicable to part replacement during periodic inspection and to work performed as predictive maintenance.

The embodiment described above reduces the secondary effects of maintenance work by, for example, changing the error count threshold; alternatively, the secondary effects of maintenance may be minimized by, for example, temporarily changing the operation mode to lower the data transfer rate.

In the embodiment described above, the maintenance information DB 36 was described as storing information on maintenance work performed in other information processing systems, including the information processing system 11; it is also desirable to record the following kind of information.
- Video of maintenance and inspection operations captured by a maintenance recorder
This video is preferably tagged with, for example, the failed part and the error code, so that when a failure occurs the related video can be looked up and referenced from the failed part or similar information. The video may be collected from fixed-point web cameras, from small cameras worn by experienced maintenance engineers (for example, a small camera attached to glasses), and the like. In addition, these web cameras and small cameras are preferably provided with means for transmitting information covering the maintenance or inspection from start to completion to the failure information management server 12, and means for storing it. Maintenance engineers at remote locations may also be allowed to view the video in real time. Viewing is, of course, controlled based on the security level and the access policy for maintenance engineers.
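Tag-based lookup of maintenance video could be sketched as an inverted index from tags to clip identifiers. The text does not describe a storage format, so the `VideoIndex` class, its methods, and the sample clip IDs and error codes are hypothetical; access control based on security level is omitted here.

```python
from collections import defaultdict


class VideoIndex:
    """Inverted index from (tag kind, tag value) to clip identifiers."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, clip_id: str, part: str, error_code: str) -> None:
        self._by_tag[("part", part)].add(clip_id)
        self._by_tag[("error_code", error_code)].add(clip_id)

    def find(self, part=None, error_code=None) -> list:
        """Return clips matching every tag that was given, sorted by id."""
        wanted = []
        if part is not None:
            wanted.append(("part", part))
        if error_code is not None:
            wanted.append(("error_code", error_code))
        if not wanted:
            return []
        sets = [self._by_tag.get(tag, set()) for tag in wanted]
        return sorted(set.intersection(*sets))


index = VideoIndex()
index.add("clip-1", part="IO1", error_code="E100")
index.add("clip-2", part="IO1", error_code="E200")
print(index.find(part="IO1", error_code="E200"))  # prints "['clip-2']"
```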

In the embodiment described above, the information processing system 11 was described as operating on its own; however, multiple information processing systems may exchange failure information, and the maintenance level may be recalculated and presented to the maintenance engineer each time a failure report is received from another information processing system. For example, when a failure report for a certain part is received from another information processing system, the information processing system 11 can refer to its past failure history information and determine whether the same part or a related part is present.

In the embodiment described above, the parameters for calculating the maintenance level were described as being registered in advance, but they may be made available for inspection and correction as needed. For example, using information on the form of maintenance actually performed (online or offline) and the server type (rack-mount or blade) makes it possible to output more precise maintenance instructions. In addition, for maintenance that could in principle be performed online, it may be determined whether performing that maintenance could affect devices that are not subject to maintenance (not managed). Based on the result, if, for example, there is a risk of a fatal failure in which the effect on a device not subject to maintenance (not managed) brings the system down, it becomes possible to contact the customer and establish a maintenance schedule.

The disclosures of the above patent documents are incorporated herein by reference. Within the scope of the entire disclosure of the present invention (including the claims), and based on its basic technical idea, the embodiments and examples can be changed and adjusted. Various combinations and selections of the disclosed elements (including each element of each claim, each element of each embodiment or example, and each element of each drawing) are possible within the scope of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that a person skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical idea.

Description of symbols
11 Information processing system
11A, 11B Device
12 Failure information management server
21 Main memory (MEM)
22 Processor (PROC)
23 Node controller (NC)
24 Crossbar switch (Xbar)
25 Input/output device (IO)
26 Service processor (SVP)
30 FRU (Field Replaceable Unit) table
31 Failure history storage unit A
32 Failure history storage unit B
35 Failure information database (failure information DB)
36 Maintenance information database (maintenance information DB)
40 Data collection unit
41 Failure factor analysis unit
42 Maintenance level calculation unit
43 Console
44 Configuration information analysis unit
46 EF control unit
47 Maintenance status lamp (MF display)
48 Error status lamp (EF display)
100 Information source
110 Failure factor analysis unit
120 Maintenance level calculation unit
130 Display unit
e001, n001, n002, n003, n004 Signal line

Claims

1. An information processing system having replaceable parts including at least a processor and a memory, the information processing system comprising:
a failure factor analysis unit that, when an error code is output from a replaceable part, accesses an information source storing information for identifying a failure factor part among the replaceable parts and failure history information of the failure factor part, and acquires failure factor part information with a suspect ratio corresponding to the error code and the failure history information of the failure factor part; and
a maintenance level calculation unit that calculates a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information,
wherein the information processing system outputs maintenance work instruction information including the maintenance level.

2. The information processing system according to claim 1, wherein the information source includes a maintenance information database storing maintenance support information related to each failure factor part, and
the maintenance work instruction information that is output includes the maintenance support information.

3. The information processing system according to claim 2, wherein the maintenance information database further includes information on failures caused by mistakes during maintenance that occurred at each failure factor part, and
the maintenance level calculation unit calculates the maintenance level using the information on failures caused by mistakes during maintenance in addition to the failure factor part and the failure history information.

4. The information processing system according to any one of claims 1 to 3, wherein online maintenance work is instructed when the maintenance level is below a predetermined level, and
offline maintenance work is instructed when the maintenance level exceeds the predetermined level.

5. The information processing system according to claim 1, further comprising a maintenance operation lock function that, after the maintenance work instruction information including the maintenance level is output, prohibits online maintenance operations until a predetermined condition is satisfied.

6. The information processing system according to claim 1, further comprising a configuration information analysis unit that determines, based on configuration information of the system to be managed, whether there is another operating system or another logical partition in operation that is affected by the maintenance work, and notifies a service processor of the result,
wherein the service processor controls the other operating system or the other logical partition in operation that is affected by the maintenance work so as to suppress failure detection caused by the maintenance work.

7. The information processing system according to claim 1, wherein the information source includes moving image data in which video of each maintenance operation is recorded, and
the moving image data is played back in response to a request from a maintenance engineer.

8. A maintenance method performed by an information processing system having replaceable parts including at least a processor and a memory, the method comprising:
when an error code is output from a replaceable part, accessing an information source storing information for identifying a failure factor part among the replaceable parts and failure history information of the failure factor part, and acquiring failure factor part information with a suspect ratio corresponding to the error code and the failure history information of the failure factor part;
calculating a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information; and
outputting maintenance work instruction information including the maintenance level.

9. A program that causes a computer included in an information processing system having replaceable parts including at least a processor and a memory to execute:
a process of acquiring, when an error code is output from a replaceable part, the failure factor part information with the suspect ratio and the failure history information of the failure factor part;
a process of calculating a maintenance level indicating the difficulty of maintenance work based on the failure factor part with the suspect ratio and the failure history information; and
a process of outputting maintenance work instruction information including the maintenance level.