JP5023086B2 - Computer system

Info

Publication number: JP5023086B2
Application number: JP2009019314A
Authority: JP
Inventors: 和寛松下; 典雄荒城; 仁茂仲野谷
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-01-30
Filing date: 2009-01-30
Publication date: 2012-09-12
Anticipated expiration: 2029-01-30
Also published as: JP2010176464A

Description

The present invention relates to cause analysis performed when a system failure occurs in a computer system.

In a conventional computer system, when a system failure leaves the OS (operating system) unable to operate normally, dump processing means runs to collect failure information for cause analysis. Dump processing acquires the data held in a main storage device such as memory at the time of the system failure and saves it as a dump file on an auxiliary storage device such as a hard disk. The dump file saved by the dump processing means is used mainly by an analyst to perform failure analysis.

In general, however, when a system failure is caused by a hardware fault, the fault information is recorded not in memory but in a status register that each piece of hardware provides to hold its hardware state. Analysis of the dump file therefore cannot identify the failed hardware, and the only way to identify it has been to examine the operation of each piece of hardware one by one.

A computer system is made up of many pieces of hardware, including memory, a plurality of hard disks, an arithmetic processing unit, an input device, a display device, a DVD drive, and a plurality of PCI-connected devices. Even for a skilled analyst, examining the operation of each piece of hardware one by one takes time, and restoring the system has required a great deal of time.

In addition, the dump file is written as binary data, so the user cannot directly read the failure information in it. Referring to the data in the dump file requires special software and specialized knowledge of failure analysis, and ordinary users have had no choice but to ask system maintenance personnel or the like to perform the failure analysis.

The problem to be solved by the present invention is to provide a computer system that, when a system failure occurs due to a hardware fault, can automatically identify the failed hardware and notify the user.

To solve the above problem, the present invention provides a computer system comprising a plurality of hardware including a main storage device and an auxiliary storage device, each of the plurality of hardware having a status register that stores the state of that hardware, the computer system being characterized by comprising system failure processing means. When a system failure occurs, the system failure processing means acquires register information from the status registers, identifies the failed hardware from the register information and generates failure information, attaches identification information for specifying the position on the main storage device at which the failure information is stored, stores the failure information together with the identification information in the main storage device, and then has the dump processing means generate a dump file. At system failure analysis time, it searches the dump file and reads out and outputs the failure information recorded alongside the area on the main storage device in which the identification information is recorded.

According to the present invention, the dump file is generated by the dump processing means after the failure information has been generated and stored in the main storage device. At system failure analysis time, searching the dump file therefore allows the failed hardware to be identified automatically and reported to the user, which shortens the time to system recovery.

Alternatively, the place where the failure information is generated may be changed by providing system failure processing means configured as follows. When a system failure occurs, it acquires register information from the status registers, attaches identification information for specifying the position on the main storage device at which the register information is stored, stores the register information together with the identification information in the main storage device, and then has the dump processing means generate a dump file. At system failure analysis time, it searches the dump file, reads out the register information recorded alongside the area on the main storage device in which the identification information is recorded, identifies the failed hardware from that register information, and outputs failure information.

According to the present invention, when a system failure occurs due to a hardware fault, the failed hardware can be identified automatically and the user can be notified.

FIG. 1 is a functional configuration diagram of a computer system according to Embodiment 1.
FIG. 2 is a flowchart showing the processing flow of the OS.
FIG. 3 is a flowchart showing the processing flow of the failure information management function according to Embodiment 1.
FIG. 4 is a flowchart showing the processing flow of the failure location automatic notification function according to Embodiment 1.
FIG. 5 is a functional configuration diagram of a computer system according to Embodiment 2.
FIG. 6 is a flowchart showing the processing flow of the failure information management function according to Embodiment 2.
FIG. 7 is a flowchart showing the processing flow of the failure location automatic notification function according to Embodiment 2.

Embodiments of the computer system of the present invention are described below with reference to the drawings.

Embodiment 1 of the present invention is described with reference to FIGS. 1 to 4. FIG. 1 is a configuration diagram of the computer system of this embodiment. As shown in FIG. 1, the computer system of this embodiment is made up of a plurality of hardware comprising a memory 2 and a hard disk 4, together with an arithmetic processing unit, an input device, a display device, a DVD drive, a plurality of PCI-connected devices, and the like, which are not shown. Each piece of hardware has a status register that stores register information 6 indicating the state of that hardware.

The OS 10 of the computer system of this embodiment includes a system failure processing unit 12, which runs when a system failure occurs that would cause the OS 10 to stop operating, and a dump processing unit 14, which generates a dump file 41 from the data stored in the memory 2 and stores it on the hard disk 4.

Further, a failure information management function 18 and a failure location automatic notification function 16 are provided on the OS 10. The failure information management function 18 runs when a system failure occurs, and the failure location automatic notification function 16 runs at system startup.

The failure information management function 18 includes a register information acquisition unit 20, a failure location analysis unit 24, and an identification information adding unit 26.

The procedure for automatically reporting failure location information when a system failure occurs in the computer system of this embodiment, configured as described above, is explained with reference to FIGS. 2 to 4. FIG. 2 is a flowchart showing the processing flow of the OS 10 when a system failure occurs in this embodiment.

When a system failure occurs due to a hardware fault, the system failure processing unit 12, which is part of the OS 10, starts processing (S11). The system failure processing unit 12 calls the failure information management function 18 described later (S12) and waits for it to finish (S13). When the failure information management function 18 has finished, the dump processing unit 14 starts processing, generates the dump file 41 from the data stored in the memory 2, and stores it on the hard disk 4 (S14). When the dump file 41 has been stored, the OS 10 restarts (S15).
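The ordering of steps S11 to S15 can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the `bytearray` standing in for the memory 2, and the `dict` standing in for the hard disk 4 are all assumptions made for illustration. The point it shows is that the failure information management function must return before the dump is taken, so that whatever it wrote into memory ends up in the dump file.

```python
from typing import Callable, Dict, List


def handle_system_failure(
    manage_failure_info: Callable[[bytearray], None],
    memory: bytearray,
    disk: Dict[str, bytes],
    log: List[str],
) -> None:
    """Hypothetical model of the OS flow S11-S15 (names are illustrative)."""
    log.append("S11: system failure processing started")
    # S12-S13: a plain synchronous call models "call, then wait for completion".
    manage_failure_info(memory)
    log.append("S12-S13: failure information management finished")
    # S14: only now copy memory to auxiliary storage as the dump file.
    disk["dumpfile"] = bytes(memory)
    log.append("S14: dump file stored")
    # S15: a real OS would reboot here; the sketch only records the request.
    log.append("S15: restart requested")


def demo() -> Dict[str, bytes]:
    memory = bytearray(64)          # stand-in for memory 2
    disk: Dict[str, bytes] = {}     # stand-in for hard disk 4
    log: List[str] = []

    def write_note(mem: bytearray) -> None:
        # Placeholder for the failure information management function 18.
        mem[0:4] = b"NOTE"

    handle_system_failure(write_note, memory, disk, log)
    return disk
```

Calling `demo()` returns a disk image whose dump begins with the bytes written by the placeholder function, which is the property the S13 wait is there to guarantee.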

FIG. 3 is a flowchart showing the processing flow of the failure information management function 18. When a system failure occurs, the failure information management function 18, called by the system failure processing unit 12 of the OS 10, starts the processing of the register information acquisition unit 20. The register information acquisition unit 20 acquires the register information 6, which indicates whether each piece of hardware was normal or abnormal at the time of the system failure, from the status register of each piece of hardware (S21).

The failure location analysis unit 24 then finds, among the acquired register information 6, any entry whose value indicates an abnormality, and generates data that can identify the failed hardware, for example a character string giving the hardware name, as failure location information 21 (S22). The identification information adding unit 26 attaches to the failure location information 21 identification information 23 for specifying its position in the memory 2 (S23). The identification information 23 attached here is a unique value, such as a proper noun, for which no other identical value would be found when the dump file 41 is searched.

Next, the failure location information 21 and the identification information 23 are stored in an area of the memory 2 from which the dump processing unit 14 acquires data (S24). When the processing of the identification information adding unit 26 has finished, control returns to the system failure processing unit 12 of the OS 10, and the failure location information 21 and the identification information 23 stored in the memory 2 remain recorded in the dump file 41 generated by the dump processing unit 14.
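Steps S21 to S24 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the marker value, the convention that a register value of 1 means "abnormal", the two-byte length prefix, and all function names are hypothetical. The patent only requires that the identification information be a value unlikely to occur elsewhere in the dump and that the failure location information be recorded alongside it.

```python
from typing import Dict, List

# Hypothetical identification information 23: a byte string chosen to be
# unlikely to appear anywhere else in a memory dump.
MARKER = b"##HWFAULT-7f3a##"
ABNORMAL = 1  # assumed encoding: 1 = abnormal, 0 = normal


def find_failed_hardware(registers: Dict[str, int]) -> List[str]:
    """S22: pick out hardware whose status register shows an abnormality."""
    return [name for name, value in registers.items() if value == ABNORMAL]


def build_record(failed: List[str]) -> bytes:
    """S23: attach the marker to the failure location information.

    Assumed layout: marker, 2-byte big-endian length, ASCII hardware names.
    """
    payload = ",".join(failed).encode("ascii")
    return MARKER + len(payload).to_bytes(2, "big") + payload


def store_record(memory: bytearray, offset: int, record: bytes) -> None:
    """S24: write the record into a region that the dump will cover."""
    if offset + len(record) > len(memory):
        raise ValueError("record does not fit in the dumped region")
    memory[offset:offset + len(record)] = record


# S21 is modelled by a literal dict; real code would read device registers.
registers = {"memory": 0, "hard_disk_0": 1, "pci_device_2": 0}
memory = bytearray(256)
store_record(memory, 100, build_record(find_failed_hardware(registers)))
```

The length prefix is a design choice of this sketch: it lets a later reader know where the hardware names end without relying on a terminator byte that might also occur in unrelated memory contents.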

FIG. 4 is a flowchart showing the processing flow of the failure location automatic notification function 16. When the failure location automatic notification function 16 starts processing after the restart of the OS 10 in (S15) of FIG. 2, it first searches the dump file 41 for the identification information 23, and ends processing if the identification information 23 is not recorded (S31). If the identification information 23 is recorded, it reads the failure location information 21 recorded alongside the area in which the identification information 23 is recorded (S32). It then records the read failure location information 21 in a log or the like, or outputs it to the display device (S33). The user checks the output failure location information 21 and takes action to restore the system.
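The search-and-read procedure of FIG. 4 can be sketched in a few lines. As before, this is a hedged illustration: the marker value and the record layout (marker, two-byte length, ASCII names) are assumptions of the sketch, and `read_failure_info` and `notify` are hypothetical names.

```python
from typing import Optional

# Hypothetical identification information 23 (must match what was written
# into memory before the dump was taken).
MARKER = b"##HWFAULT-7f3a##"


def read_failure_info(dump: bytes) -> Optional[str]:
    """Search the dump for the marker and read the record next to it."""
    pos = dump.find(MARKER)
    if pos < 0:
        return None  # S31: marker not recorded, nothing to report
    start = pos + len(MARKER)
    if start + 2 > len(dump):
        return None  # truncated record
    length = int.from_bytes(dump[start:start + 2], "big")
    payload = dump[start + 2:start + 2 + length]
    if len(payload) < length:
        return None  # truncated record
    return payload.decode("ascii")  # S32: failure location information


def notify(dump: bytes) -> str:
    """Output step: format a message for a log or a display device."""
    info = read_failure_info(dump)
    if info is None:
        return "no hardware failure record in dump"
    return f"failed hardware: {info}"


# A synthetic dump: zero padding, one record, more zero padding.
sample_dump = (
    bytes(40) + MARKER + (11).to_bytes(2, "big") + b"hard_disk_0" + bytes(40)
)
```

Because the search is a plain byte scan, the notification function needs no knowledge of the dump file's internal format, which is why the uniqueness of the marker matters.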

As described above, according to this embodiment, the dump file 41 is generated by the dump processing unit 14 after the failure location information 21 has been generated and stored in the memory 2. At system failure analysis time, searching the dump file 41 therefore allows the failed hardware to be identified automatically and reported to the user, which shortens the time to system recovery.

Next, Embodiment 2 of the present invention is described with reference to FIGS. 5 to 7. FIG. 5 is a configuration diagram of the computer system for the case where the failure location is identified after the OS 10 has restarted. This embodiment differs from the configuration of Embodiment 1 in that the failure location analysis unit 24 is moved from the failure information management function 18 to the failure location automatic notification function 16; the rest of the configuration is the same as in Embodiment 1. Since the processing flow of the OS 10 in this embodiment is the same as in FIG. 2, the processing flows of the failure information management function 18 and the failure location automatic notification function 16 are described with reference to FIGS. 6 and 7.

FIG. 6 is a flowchart showing the processing flow of the failure information management function 18. When the failure information management function 18 is called by the system failure processing unit 12 of the OS 10 at the time of a system failure, it starts the processing of the register information acquisition unit 20. The register information acquisition unit 20 acquires the register information 6, which indicates whether the hardware was normal or abnormal at the time of the system failure, from the status registers (S41).

Next, the identification information adding unit 26 attaches to the register information 6 identification information 23 for specifying its position in the memory 2 (S42). The register information 6 and the identification information 23 are then stored in an area of the memory 2 from which the dump processing unit 14 acquires data (S43). When the processing of the identification information adding unit 26 has finished, control returns to the system failure processing unit 12 of the OS 10, and the register information 6 and the identification information 23 stored in the memory 2 remain recorded in the dump file 41.

FIG. 7 is a flowchart showing the processing flow of the failure location automatic notification function 16. When the failure location automatic notification function 16 is started by the restart of the OS 10, it first searches the dump file 41 for the identification information 23, and ends processing if the identification information 23 is not recorded (S51). If the identification information 23 is recorded, it reads the register information 6 recorded alongside the area in which the identification information 23 is recorded (S52). The failure location analysis unit 24 finds, among the read register information 6, any entry whose value indicates an abnormality and identifies the failure location (S53). The identified failure location is then recorded in a log or the like, or output to the display device (S54). This embodiment provides the same effects as Embodiment 1.
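The difference between the two embodiments is where the analysis happens. A minimal sketch of the Embodiment 2 arrangement follows, in which only raw register values are written at failure time and the mapping to hardware names is deferred until after restart. The marker value, the register table `HARDWARE_NAMES`, the 32-bit register width, the "1 = abnormal" convention, and the function names are all assumptions made for illustration.

```python
import struct
from typing import List

MARKER = b"##HWREGS-7f3a##"  # hypothetical identification information 23
# Hypothetical fixed ordering of status registers, known to both sides.
HARDWARE_NAMES = ["memory", "hard_disk_0", "hard_disk_1", "pci_device_0"]
ABNORMAL = 1  # assumed encoding: 1 = abnormal, 0 = normal


def pack_registers(values: List[int]) -> bytes:
    """Failure time (S42): marker + count + raw 32-bit values, no analysis."""
    return (
        MARKER
        + struct.pack(">H", len(values))
        + struct.pack(f">{len(values)}I", *values)
    )


def analyze_dump(dump: bytes) -> List[str]:
    """After restart (S51-S53): locate the record, then identify failures."""
    pos = dump.find(MARKER)
    if pos < 0:
        return []  # S51: no record in the dump
    start = pos + len(MARKER)
    try:
        (count,) = struct.unpack_from(">H", dump, start)        # S52
        values = struct.unpack_from(f">{count}I", dump, start + 2)
    except struct.error:
        return []  # record truncated; treated here as "nothing to report"
    # S53: the analysis that Embodiment 1 performs before the dump.
    return [n for n, v in zip(HARDWARE_NAMES, values) if v == ABNORMAL]


dump = bytes(16) + pack_registers([0, 0, 1, 0]) + bytes(16)
```

One practical consequence of this arrangement, visible in the sketch, is that the code running in the failed OS does less work, while the analysis logic can be changed later without changing what is written at failure time.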

Embodiments 1 and 2 have been described above, but the present invention is not limited to them and can be applied with the configuration changed as appropriate. For example, where a plurality of computer systems are connected by a network, the system may be configured so that, when a system failure occurs in one computer system and the network itself is not affected, the failure location information can be output to another computer system over the network.

2 Memory
4 Hard disk
6 Register information
10 OS
12 System failure processing unit
14 Dump processing unit
16 Failure location automatic notification function
18 Failure information management function
20 Register information acquisition unit
21 Failure location information
23 Identification information
24 Failure location analysis unit
26 Identification information adding unit
41 Dump file

Claims

A computer system comprising a plurality of hardware including a main storage device and an auxiliary storage device, and dump processing means for generating a dump file from information stored in the main storage device when a system failure occurs and storing the dump file in the auxiliary storage device, each of the plurality of hardware having a status register that stores the state of that hardware,
the computer system being characterized by comprising system failure processing means that, when a system failure occurs, acquires register information from the status registers, identifies the failed hardware from the register information and generates failure information, attaches identification information for specifying the position on the main storage device at which the failure information is stored, stores the failure information together with the identification information in the main storage device, and then causes the dump processing means to generate the dump file, and that, at system failure analysis time, searches the dump file and reads out and outputs the failure information recorded alongside the area on the main storage device in which the identification information is recorded.

A computer system comprising a plurality of hardware including a main storage device and an auxiliary storage device, and dump processing means for generating a dump file from information stored in the main storage device when a system failure occurs and storing the dump file in the auxiliary storage device, each of the plurality of hardware having a status register that stores the state of that hardware,
the computer system being characterized by comprising system failure processing means that, when a system failure occurs, acquires register information from the status registers, attaches identification information for specifying the position on the main storage device at which the register information is stored, stores the register information together with the identification information in the main storage device, and then causes the dump processing means to generate the dump file, and that, at system failure analysis time, searches the dump file, reads out the register information recorded alongside the area on the main storage device in which the identification information is recorded, identifies the failed hardware from the register information, and outputs failure information.