JP2006119778A

JP2006119778A - Information processing system, input/output device, method for use therewith for automatically sending data during system failure, and its program

Info

Publication number: JP2006119778A
Application number: JP2004305050A
Authority: JP
Inventors: Tomoaki Nagano; 知明長野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-10-20
Filing date: 2004-10-20
Publication date: 2006-05-11
Anticipated expiration: 2024-10-20
Also published as: JP4265521B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing system capable of monitoring the operation of an existing OS without modifying the OS at all.

SOLUTION: The OS alive monitoring mechanism 41 of an I/O device 4 is implemented using the interrupt mechanism that devices mounted on a conventional I/O bus already have. The I/O device 4 raises an interrupt and then detects whether the OS 10 accesses the register in the I/O device 4 that indicates the device's state, as the OS does when checking the cause of an interrupt. In this way it monitors whether the OS 10 responds to the interrupt. If there is no response from the OS 10, the I/O device 4 determines that the OS 10 has stopped. It then starts sending the necessary data, based on the settings the OS 10 made when initializing the I/O device 4 and on the information the device was handling under the direction of the OS 10 while the OS 10 was running.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to an information processing system, an input/output device, an automatic data transmission method used therewith in the event of a system failure, and a program therefor. More particularly, it relates to a method for acquiring failure information when a failure occurs in an information processing device.

When a failure occurs in a computer, the console messages and error messages of the OS (Operating System) are useful data for determining the location of the failure. These messages are often stored in a specific area of main memory. At any time it chooses, the OS requests an I/O (Input/Output) device to output the stored messages, and the messages are then output on a screen or through another computer.

In medium- and large-scale computers in particular, messages are often sent to a small computer connected through a communication line. In that case it is the small computer that actually outputs the messages as character strings.

When a failure occurs in the main computer, there are frequently messages waiting in the message area in memory for which no send instruction has been issued. The messages about the operation or error that caused the failure therefore never reach the console, and maintenance efficiency suffers.

For such cases, some medium- and large-scale computer systems let a dedicated diagnostic processor extract the data in the message area of main memory. The binary data can then be analyzed to recover the last messages. However, locating the message area prepared by the OS and analyzing the binary data taken from it demand a high level of skill from maintenance personnel. Not every maintenance engineer has that skill, so specialized analysts end up being required.

For such computer systems, a method has also been proposed in which the platform OS is restarted automatically after a failure so that processing in the application continues automatically (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 2002-244885

Conventionally, I/O buses mounted in computer systems, such as the AT (Advanced Technology) bus, the PCI (Peripheral Component Interconnect) bus and the PCI-X bus, provide a means for notifying the host device that a fault has occurred on the bus. They provide no means, however, for notifying the devices mounted on the bus of a failure in the host device. It is therefore difficult for a device mounted on the I/O bus to determine reliably whether the OS has stopped operating.

Furthermore, devices mounted on a conventional I/O bus perform their operations in response to instructions from the OS. Once the OS stops, operations such as data communication become impossible. Patent Document 1 does not solve this problem.

Accordingly, an object of the present invention is to solve the above problems. It provides an information processing system, an input/output device, an automatic data transmission method used therewith in the event of a system failure, and a program therefor, all of which can monitor the operation of the OS without any change to the existing OS.

Another object of the present invention is to provide an information processing system, an input/output device, an automatic data transmission method used therewith in the event of a system failure, and a program therefor, in which all the messages the OS has prepared for output are output even when a system failure occurs. The information needed for failure recovery can then be obtained easily.

An information processing system according to the present invention includes two parts. The first is an information processing device that holds the console messages and error messages of an OS (Operating System) in main memory as data useful for determining the location of a failure. The second is an input/output device that outputs those console messages and error messages.
The input/output device includes monitoring means that monitors the operating state of the OS by issuing to the information processing device a spurious interrupt, that is, an interrupt whose source device and cause cannot be identified.

Another information processing system according to the present invention has, in addition to the above configuration, system-failure data acquisition means in the input/output device. When the monitoring means detects that the OS has stopped, this means acquires the contents of a specific area of the main memory and sends them to the outside.

An input/output device according to the present invention outputs the OS (Operating System) console messages and error messages held in the main memory of an information processing device as data useful for determining the location of a failure.
It includes monitoring means that monitors the operating state of the OS by issuing to the information processing device a spurious interrupt whose source device and cause cannot be identified.

Another input/output device according to the present invention has, in addition to the above configuration, system-failure data acquisition means. When the monitoring means detects that the OS has stopped, this means acquires the contents of a specific area of the main memory and sends them to the outside.

An automatic data transmission method for use in the event of a system failure according to the present invention is used in an information processing system with two parts. The first is an information processing device that holds OS (Operating System) console messages and error messages in main memory as data useful for determining the location of a failure. The second is an input/output device that outputs those console messages and error messages.
In this method, the input/output device monitors the operating state of the OS by issuing to the information processing device a spurious interrupt whose source device and cause cannot be identified.

Another automatic data transmission method according to the present invention adds a further process to the above. When the monitoring process detects that the OS has stopped, the input/output device acquires the contents of a specific area of the main memory and sends them to the outside.

A program for the automatic data transmission method according to the present invention is used in an information processing system with two parts. The first is an information processing device that holds OS (Operating System) console messages and error messages in main memory as data useful for determining the location of a failure. The second is an input/output device that outputs those messages. The program causes a computer of the input/output device to monitor the operating state of the OS by issuing to the information processing device a spurious interrupt whose source device and cause cannot be identified.

Another program for the automatic data transmission method according to the present invention causes the computer of the input/output device to perform an additional process. When the monitoring process detects that the OS has stopped, the computer acquires the contents of a specific area of the main memory and sends them to the outside.

That is, in the information processing system of the present invention, a communication device mounted on an I/O (Input/Output) bus has two features. The first is an OS alive monitoring mechanism that detects that the OS (Operating System) has stopped operating because of a hardware (HW) failure, a panic or a similar cause. The second is a means that, once an OS stop is detected, acquires the data in a specific area of memory without any instruction from the OS.

The OS alive monitoring mechanism is implemented using the interrupt mechanism that devices mounted on a conventional I/O bus already have. Ordinary I/O devices have interrupt means for notifying the OS that an operation has completed or an error has occurred, and the I/O bus provides the means for delivering interrupts to the OS.

When an interrupt is delivered, the OS generally checks the cause by accessing a register in the I/O device that indicates the device's state. The information processing system of the present invention has means for detecting this access to the state register, and it uses that means to monitor whether the OS responds to the interrupt. If the OS does not respond, the I/O device determines that the OS has stopped and starts its post-stop operation.

For the post-stop operation, the device contains a circuit that sends the necessary data. It works from two sources: the settings the OS made when it initialized the I/O device, and the information the device was handling under OS instruction while the OS was running. When an OS stop is determined, this post-stop circuit is activated.

Thus, in the information processing system of the present invention, the communication device has a mechanism that detects that the OS has stopped and a mechanism that automatically outputs the data in the message area when the OS stops. Any messages not yet output when the OS stops are output automatically, which makes error analysis more efficient.

In other words, with these mechanisms the information processing system of the present invention outputs all the messages the OS has prepared for output even when a system failure occurs. This makes it easy to obtain the information needed for failure recovery.

Furthermore, the OS monitoring performed by the OS alive monitoring means works with any existing OS that supports I/O interrupts, without any change to that OS. The OS can be monitored simply by mounting the above communication device in the computer system.

With the configuration and operation described below, the information processing system of the present invention can monitor the operation of the OS without any change to the existing OS.

With the configuration and operation described below, another information processing system of the present invention outputs all the messages the OS has prepared for output even when a system failure occurs. The information needed for failure recovery can therefore be obtained easily.

Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an information processing system according to the embodiment. In FIG. 1, the information processing system consists of two groups of components:
- an information processing device (the computer main body), made up of a CPU (Central Processing Unit) 1 including an OS (Operating System) 10, a main memory 2 and an I/O (Input/Output) bridge 3;
- an I/O device 4 and another I/O device 5, connected to an I/O bus 100 that is itself connected to the I/O bridge 3.

Here, the other I/O device 5 is any suitable I/O card connected to the I/O bus 100, and it does not matter whether it is present. The device equipped with the mechanisms disclosed in this embodiment is the I/O device 4.

The I/O device 4 consists of the following components:
- an OS alive monitoring mechanism 41;
- system-failure data acquisition means 42;
- a DMA (Direct Memory Access) control unit 43;
- a normal operation control unit 44;
- a local memory 45;
- a data transmission source switching unit 46;
- a communication control unit 47, which is connected to a network (not shown).

In the I/O device 4, the OS alive monitoring mechanism 41 and the system-failure data acquisition means 42 are the circuits characteristic of this embodiment. The remaining components (the DMA control unit 43, normal operation control unit 44, local memory 45 and communication control unit 47) are circuits found on a typical LAN (Local Area Network) card, such as an Ethernet (registered trademark) card, and are well known.

In addition, the data transmission source switching unit 46 selects which of two units feeds the data to be sent to the network into the communication control unit 47: the system-failure data acquisition means 42 or the normal operation control unit 44. The operation of this circuit is also well known.

The OS alive monitoring mechanism 41 is implemented using the interrupt mechanism that devices mounted on a conventional I/O bus already have. Ordinary I/O devices have interrupt means for notifying the OS that an operation has completed or an error has occurred, and the I/O bus provides the means for delivering interrupts to the OS.

When an interrupt is delivered, the OS generally checks the cause by accessing a register in the I/O device that indicates the device's state. The information processing system according to this embodiment has means for detecting this access by the OS 10 to the state register, and it uses that means to monitor whether the OS 10 responds to the interrupt. If the OS 10 does not respond, the I/O device 4 determines that the OS 10 has stopped and starts its post-stop operation.

For the post-stop operation, the I/O device 4 contains a circuit that sends the necessary data. It works from two sources: the settings the OS 10 made when it initialized the I/O device 4, and the information the device was handling under instruction from the OS 10 while the OS 10 was running. When a stop of the OS 10 is determined, this post-stop circuit is activated.

Thus, in the information processing system according to this embodiment, the I/O device 4 has a mechanism that detects that the OS 10 has stopped and a mechanism that automatically outputs the data in the message area when the OS 10 stops. Any messages not yet output when the OS 10 stops are output automatically, which makes error analysis more efficient.

In other words, with these mechanisms the information processing system according to this embodiment outputs all the messages the OS 10 has prepared for output even when a system failure occurs. This makes it easy to obtain the information needed for failure recovery.

Furthermore, the monitoring of the OS 10 by the OS alive monitoring mechanism 41 works with any existing OS that supports I/O interrupts, without any change to that OS. The OS 10 can be monitored simply by mounting the above I/O device 4 in the computer system.

Next, an example of the present invention will be described with reference to the drawings. FIG. 2 is a block diagram showing the configuration of an information processing system according to an example of the present invention. In FIG. 2, the information processing system consists of two groups of components:
- a computer main body, made up of a CPU 1, a main memory 2 and an I/O bridge 3;
- an I/O device 4 and another I/O device 5, connected to a PCI (Peripheral Component Interconnect) bus 200 that is itself connected to the I/O bridge 3.

Here, the I/O device 5 is any suitable PCI card connected to the PCI bus 200, and it does not matter whether it is present. The device equipped with the mechanisms disclosed in this example is the I/O device 4.

In the I/O device 4, the OS alive monitoring mechanism 41 and the system-failure data acquisition means 42 are the circuits characteristic of this example. The other circuits in the I/O device 4 are found on a typical LAN card and are well known. They are the DMA control unit 43, the normal operation control unit 44, the local memory 45, and the LAN control unit 48, to which the message display device 6 is connected.

In addition, the data transmission source switching unit 46 selects which of two units feeds the data to be sent to the LAN (not shown) into the LAN control unit 48: the system-failure data acquisition means 42 or the normal operation control unit 44. The operation of this circuit is also well known.
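The sketch below is a rough illustration of the switching just described, modeling unit 46 as a two-input selector that flips once the OS stop signal arrives. It is a minimal sketch and not the patent's circuit. The class and method names (`DataSourceSwitch`, `on_os_stop`, `select`) are hypothetical.

```python
from enum import Enum


class Source(Enum):
    NORMAL = "normal operation control unit 44"
    FAILURE = "system-failure data acquisition means 42"


class DataSourceSwitch:
    """Hypothetical model of data transmission source switching unit 46."""

    def __init__(self) -> None:
        # Frames come from the normal path until the OS is judged to have stopped.
        self.selected = Source.NORMAL

    def on_os_stop(self) -> None:
        # Triggered by the OS stop signal: route failure-time data instead.
        self.selected = Source.FAILURE

    def select(self, normal_frame: bytes, failure_frame: bytes) -> bytes:
        # Return the frame that is passed on to LAN control unit 48.
        if self.selected is Source.FAILURE:
            return failure_frame
        return normal_frame
```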

FIG. 3 is a block diagram showing the configuration of the OS alive monitoring mechanism 41 of FIG. 2. In FIG. 3, the OS alive monitoring mechanism 41 consists of the following:
- an interrupt means 411;
- a timer circuit 412;
- an interrupt factor register 413;
- interrupt factor register access detection means 414;
- a timer circuit 415;
- a timeout determination unit 416.

The interrupt means 411 and the interrupt factor register 413 are in fact circuits shared with the normal operation control unit 44 shown in FIG. 2. The interrupt factor register 413 is the PCI Status register defined by the PCI bus specification, a standard register implemented on every PCI card.

FIG. 4 shows how data is stored in the main memory 2 of FIG. 2. FIG. 5 is a block diagram showing the configuration of the system-failure data acquisition means 42 of FIG. 2. In FIG. 4, the main memory 2 holds a descriptor, a message buffer and a buffer pointer.

In FIG. 5, the system-failure data acquisition means 42 consists of the following:
- a register group 421;
- a system-failure data acquisition control unit 422;
- a message storage buffer 423;
- a message comparison buffer 424;
- a difference output unit 425.

The register group 421 consists of four registers:
- a Buffer pointer address register, which holds the start address of the message buffer of the OS 10;
- a Buffer size register, which indicates the size of that message buffer;
- a Buffer pointer #1 register, which holds the buffer start address most recently notified by the OS 10;
- a Buffer pointer #2 register, which stores the result of checking the buffer pointer after the OS 10 has stopped.
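The register group can be pictured as a small record. The sketch below is hypothetical. The field names are invented for illustration. The rule in `has_unsent_messages`, which treats a changed buffer pointer as evidence of unsent messages, is an assumption and is not stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterGroup421:
    """Hypothetical model of register group 421 (field names are illustrative)."""

    buffer_pointer_address: int  # start address of the OS message buffer
    buffer_size: int             # size of the OS message buffer
    buffer_pointer_1: Optional[int] = None  # buffer start address last notified by the OS
    buffer_pointer_2: Optional[int] = None  # buffer pointer found after the OS stopped

    def has_unsent_messages(self) -> bool:
        # Assumption: if the pointer checked after the stop differs from the
        # one the OS last reported, messages were queued but never sent.
        if self.buffer_pointer_2 is None:
            return False  # the post-stop check has not been made yet
        return self.buffer_pointer_2 != self.buffer_pointer_1
```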

The message storage buffer 423 holds the contents of the message buffer as they were at the last message output request from the OS 10. The message comparison buffer 424 acquires and stores the data in the message buffer after the OS 10 has stopped. The difference output unit 425 compares the contents of the two buffers and outputs the difference between them.
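The sketch below shows one possible computation for the difference output unit 425. It assumes the message buffer is linear (it does not wrap), that the OS only appends text to it, and that unused space is zero-filled. The function name and these assumptions are illustrative; the text says only that the two buffers are compared and the difference is output.

```python
def message_difference(saved: bytes, current: bytes) -> bytes:
    """Return the bytes of `current` (buffer 424) not already in `saved` (buffer 423).

    Assumes a non-wrapping buffer that the OS only appends to, so everything
    from the first differing byte onward is new, unsent message text.
    """
    common = 0
    limit = min(len(saved), len(current))
    while common < limit and saved[common] == current[common]:
        common += 1
    # Drop zero-filled space that was never written.
    return current[common:].rstrip(b"\x00")
```

With a ring buffer the comparison would instead have to start from the Buffer pointer #1 position and wrap at the buffer size. This sketch does not attempt that.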

First, the operation of a conventional I/O device on the PCI bus will be described briefly. FIG. 10 shows the structure of a SCSI (Small Computer System Interface) card as an example of a typical PCI card. The following describes how data prepared in the main memory 6 is stored on the DISK 90.

The data in the main memory 6 may have been created by an application (not shown), by the OS 10 or by other software. To store it on the DISK 90, the OS 10 first prepares in the main memory 6 a "Descriptor", which is the command given to the SCSI card 8. The Descriptor contains fields including the following:
- the instruction (DISK Write: write to the DISK 90);
- the write data address (the start address in the main memory 6 of the data to be written to the DISK 90);
- the write data size (the size of the data to be written to the DISK 90);
- the address on the DISK (the destination address within the DISK 90).
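The Descriptor fields listed above can be sketched as a packed binary record. The layout below is hypothetical: little-endian, a 32-bit command, 64-bit addresses and a 32-bit size. The opcode value is also hypothetical. The text does not give a format, and actual host adapters define their own.

```python
import struct
from typing import NamedTuple

DISK_WRITE = 0x01  # hypothetical opcode for "write to DISK 90"

# Hypothetical little-endian layout: command (u32), write data address (u64),
# write data size (u32), destination address on the disk (u64); no padding.
DESCRIPTOR_FORMAT = "<IQIQ"


class Descriptor(NamedTuple):
    command: int
    data_address: int   # start of the write data in main memory 6
    data_size: int      # number of bytes to write
    disk_address: int   # destination inside DISK 90

    def pack(self) -> bytes:
        return struct.pack(DESCRIPTOR_FORMAT, *self)

    @classmethod
    def unpack(cls, raw: bytes) -> "Descriptor":
        return cls(*struct.unpack(DESCRIPTOR_FORMAT, raw))
```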

Next, the OS 10 notifies the SCSI card 8 of the address in the main memory 6 at which the Descriptor is stored (the Descriptor address). On receiving the Descriptor address, the SCSI card 8 uses the DMA control unit 82 to read the Descriptor by DMA and copy it into the local memory 85. Based on the information in the Descriptor, the SCSI card 8 then copies the write data from the main memory 6 into the local memory 85.

The SCSI card 8 uses the DISK control unit 86 to write the data copied into the local memory 85 onto the DISK 90. When this sequence of operations is complete, the SCSI card 8 sets a bit indicating "operation complete" in the Status register 84 and raises an interrupt to the OS 10.
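The FIG. 10 write sequence can be modeled in a few lines. This is a toy simulation under stated assumptions and not a driver:
- main memory and the disk are plain byte arrays;
- the descriptor layout, the opcode and the status bit value are hypothetical;
- `raise_irq` is a hypothetical callback that stands in for the interrupt line.

```python
import struct
from typing import Callable

DESC_FMT = "<IQIQ"   # hypothetical: command, data address, size, disk address
DISK_WRITE = 0x01    # hypothetical opcode
STATUS_DONE = 0x01   # hypothetical "operation complete" bit


class ScsiCardModel:
    """Toy model of the FIG. 10 write sequence (not a real driver)."""

    def __init__(self, main_memory: bytearray, disk: bytearray,
                 raise_irq: Callable[["ScsiCardModel"], None]) -> None:
        self.mem = main_memory
        self.disk = disk
        self.raise_irq = raise_irq
        self.status = 0        # stands in for Status register 84
        self.local = b""       # stands in for local memory 85

    def dma_read(self, addr: int, size: int) -> bytes:
        # Stand-in for DMA control unit 82 reading main memory.
        return bytes(self.mem[addr:addr + size])

    def start(self, descriptor_address: int) -> None:
        # 1. DMA-read the Descriptor and decode its fields.
        raw = self.dma_read(descriptor_address, struct.calcsize(DESC_FMT))
        command, data_addr, size, disk_addr = struct.unpack(DESC_FMT, raw)
        if command != DISK_WRITE:
            raise ValueError("only DISK Write is modelled here")
        if disk_addr + size > len(self.disk):
            raise ValueError("write past end of disk")
        # 2. Copy the write data from main memory into local memory.
        self.local = self.dma_read(data_addr, size)
        # 3. Write it to the disk (DISK control unit 86).
        self.disk[disk_addr:disk_addr + size] = self.local
        # 4. Set the "operation complete" bit, then interrupt the OS.
        self.status |= STATUS_DONE
        self.raise_irq(self)
```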

On receiving the interrupt, the OS 10 first searches for the device that raised it. To do so, it reads the Status registers of all devices that might raise interrupts and looks for a device whose register shows an interrupt cause. Here the SCSI card 8 has written the interrupt cause into the Status register 84. The OS 10 therefore identifies the SCSI card 8 as the interrupt source and, through the driver, determines that the write of the data to the DISK 90 has finished.
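The OS-side search can be sketched as follows. Each callable stands in for an I/O read of one device's Status register; this interface is hypothetical. A nonzero value is assumed to mean that a cause is pending. Every register is read, which is what allows the I/O device 4 to observe the access even when it has set no cause.

```python
from typing import Callable, Dict, List


def find_interrupt_sources(status_readers: Dict[str, Callable[[], int]]) -> List[str]:
    """Read every candidate device's Status register and return the devices
    that report a pending cause. An empty list means the interrupt could not
    be attributed to any device."""
    pending = []
    for name, read_status in status_readers.items():
        value = read_status()      # one I/O read per device, even after a hit
        if value != 0:             # assumption: nonzero means a cause is set
            pending.append(name)
    return pending
```

An empty result is the spurious-interrupt case on which the monitoring mechanism relies.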

The above is a brief description of how a typical PCI card operates. The OS alive monitoring mechanism 41, which is the first feature of this example, can be implemented using the interrupt operation described there. Its configuration is shown in FIG. 3. At "a" in FIG. 3, the OS alive monitoring mechanism 41 raises an interrupt to the OS 10. When it does so, however, it does not set any cause in the register that indicates the interrupt cause. A typical PCI card never raises an interrupt without setting an interrupt cause.

FIG. 6 is a flowchart showing the operation of the OS alive monitoring mechanism 41 in FIG. 2. The operation of the OS alive monitoring mechanism 41 is described below with reference to FIGS. 2 to 6.

To identify the interrupting device and the cause of the interrupt, the OS 10 examines every device that could be the interrupt source; in doing so, it issues an I/O read instruction that specifies the address of the interrupt factor register 413. The OS alive monitoring mechanism 41 detects this read by the OS 10 with the interrupt factor register access detection means 414, which uses the register address as the key.
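The address-keyed detection performed by the interrupt factor register access detection means 414 can be sketched as follows. It assumes that every I/O read decoded by the device is passed to `on_io_read`; the class name and the register address are hypothetical.

```python
from typing import Optional


class FactorRegisterAccessDetector:
    """Sketch of detection means 414: recognises an I/O read of the
    interrupt factor register 413 by comparing the read address."""

    def __init__(self, factor_reg_addr: int) -> None:
        self.factor_reg_addr = factor_reg_addr
        self.factor_value = 0  # never set: monitoring interrupts carry no factor
        self.detected = False

    def on_io_read(self, addr: int) -> Optional[int]:
        """Called for each I/O read the device decodes. Returns the factor
        register value when it is the target; other registers are outside
        the scope of this sketch and yield None."""
        if addr == self.factor_reg_addr:
            self.detected = True
            return self.factor_value
        return None

    def clear(self) -> None:
        # Re-arm before the next empty interrupt.
        self.detected = False


det = FactorRegisterAccessDetector(factor_reg_addr=0x40)  # hypothetical address
det.on_io_read(0x44)
print(det.detected)  # False
det.on_io_read(0x40)
print(det.detected)  # True
```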

Because no interrupt factor has been set, the OS 10 cannot identify the interrupt source device and ends its search. An interrupt of this type, for which neither the source device nor the cause can be identified even though an interrupt has occurred, is called an "empty interrupt" (often also called a spurious interrupt).

This embodiment deliberately generates such empty interrupts at regular intervals and monitors whether the OS 10 then examines the interrupt factor register 413 (steps S1 to S3 and S5 in FIG. 6); its distinguishing feature is that this reveals whether the OS 10 is operating normally.

If the OS 10 is not operating normally, the interrupt factor register 413 is not accessed even when an empty interrupt is generated. The OS alive monitoring mechanism 41 therefore uses the absence of any access to the interrupt factor register 413 within a set time after an empty interrupt as its trigger: it judges that the OS 10 has stopped and outputs an OS 10 stop signal (steps S1 to S3, S5, and S6 in FIG. 6). Conversely, when the OS 10 is operating normally, an empty interrupt is followed by an access to the interrupt factor register 413, and that access stops and resets the access timer.
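The monitoring loop can be approximated by a tick-driven state machine. This is a sketch under stated assumptions, not a reproduction of FIG. 6: time advances in abstract ticks, `interval_ticks` plays the role of the timer circuit 412, `timeout_ticks` the role of the timeout determination unit 416, and the string events returned by `tick` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OsAliveMonitor:
    """Tick-driven sketch of the OS alive monitoring loop.

    interval_ticks: ticks between empty interrupts (timer circuit 412's role).
    timeout_ticks: ticks to wait for a factor-register read before judging
    the OS stopped (timeout determination unit 416's role).
    """
    interval_ticks: int
    timeout_ticks: int
    waiting: bool = False      # an empty interrupt is outstanding
    waited: int = 0            # access timer: ticks since the empty interrupt
    os_stopped: bool = False   # latched OS stop signal
    countdown: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self.countdown = self.interval_ticks

    def tick(self, factor_register_read: bool) -> Optional[str]:
        """Advance one tick. factor_register_read reports whether the access
        detector saw the OS read the factor register during this tick.
        Returns "irq" when an empty interrupt is raised, "stopped" when the
        stop signal is asserted, otherwise None."""
        if self.os_stopped:
            return None
        if self.waiting:
            if factor_register_read:
                # The OS examined the register, so it is alive:
                # stop and reset the access timer, restart the interval.
                self.waiting = False
                self.waited = 0
                self.countdown = self.interval_ticks
                return None
            self.waited += 1
            if self.waited >= self.timeout_ticks:
                self.os_stopped = True
                return "stopped"
            return None
        self.countdown -= 1
        if self.countdown <= 0:
            # Raise the interrupt without setting any factor (empty interrupt).
            self.waiting = True
            self.waited = 0
            return "irq"
        return None


m = OsAliveMonitor(interval_ticks=3, timeout_ticks=2)
print([m.tick(False), m.tick(False), m.tick(False), m.tick(True)])
# [None, None, 'irq', None]
```

Keeping both values as constructor parameters mirrors the recommendation that the interrupt interval and the timeout be configurable per system.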

The interval at which the timer circuit 412 triggers empty interrupts should preferably be configurable, so that the empty interrupts of the OS alive monitoring mechanism 41 do not place a heavy load on the system. Likewise, because the time from an empty interrupt to the read of the interrupt factor register 413 differs from system to system, the timeout determination unit 416 should preferably be configurable so that the OS 10 stop signal can be output after any chosen elapsed time.

Next, the system failure data acquisition means 42, the second feature of this embodiment, is described. In this embodiment, the I/O device 4 contains, as the mechanism for transmitting data while the OS 10 is operating normally, a normal operation control unit 44 and the local memory 45 that it uses; alongside these, the I/O device 4 also contains the system failure data acquisition means 42, which outputs data when the OS 10 fails.

FIGS. 7 to 9 are flowcharts showing the operation of the system failure data acquisition means 42 in FIG. 2. The operation of the system failure data acquisition means 42 is described below with reference to FIGS. 2 to 5 and FIG. 7.

In general, the OS 10 writes console messages cyclically into a fixed area of the main memory 2. In this embodiment, this area is called the message buffer. Because the message buffer may need to be moved depending on how memory is being used, many OSs provide a buffer pointer that indicates the start address of the message buffer.
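To make the later buffer comparison concrete, the OS side can be modelled as below. The layout is an assumption for illustration only: main memory is a `bytearray`, the buffer pointer is a 64-bit little-endian value at `pointer_addr`, and relocating the buffer restarts writing at the head of the new area. The class name `OsConsoleLog` is hypothetical.

```python
import struct

PTR_FMT = "<Q"  # assumption: 64-bit little-endian buffer pointer


class OsConsoleLog:
    """Hypothetical OS-side model: console messages are written cyclically
    into a fixed area of main memory whose start address is published
    through a buffer pointer stored at pointer_addr."""

    def __init__(self, memory: bytearray, pointer_addr: int, base: int, size: int) -> None:
        self.mem = memory
        self.pointer_addr = pointer_addr
        self.size = size
        self.relocate(base)

    def relocate(self, new_base: int) -> None:
        # Move the message buffer: update the buffer pointer, restart at the head.
        self.base = new_base
        self.pos = 0
        struct.pack_into(PTR_FMT, self.mem, self.pointer_addr, new_base)

    def write(self, text: str) -> None:
        for byte in text.encode("ascii"):
            self.mem[self.base + self.pos] = byte
            self.pos = (self.pos + 1) % self.size  # wrap around: cyclic reuse


mem = bytearray(256)
log = OsConsoleLog(mem, pointer_addr=0, base=64, size=8)
log.write("ABCDEFGHIJ")                        # 10 bytes into an 8-byte area
print(bytes(mem[64:72]))                       # b'IJCDEFGH'
print(struct.unpack_from(PTR_FMT, mem, 0)[0])  # 64
```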

The I/O device 4 in this embodiment has a buffer pointer address register that indicates where in the main memory 2 this buffer pointer resides (see the register group 421 in FIG. 5). The OS 10 sets the storage address of the buffer pointer in this register only once, when it initializes the system (steps S11 and S21 in FIG. 7; steps S31 and S41 in FIG. 8). The size of the message buffer is also set during system initialization.
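A minimal model of the part of the register group 421 used in this section is shown below. Field names are hypothetical, and rejecting a second initialization is a design choice of this sketch that enforces the "set only once" rule, not behaviour the embodiment specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterGroup421:
    """Sketch of the registers of group 421 used in this section.
    Field names are hypothetical."""
    buffer_pointer_addr: Optional[int] = None  # where the OS keeps the buffer pointer
    buffer_size: Optional[int] = None          # message buffer size
    pointer1: Optional[int] = None             # buffer pointer #1: at the last output request
    pointer2: Optional[int] = None             # buffer pointer #2: after the OS stops

    def initialize(self, buffer_pointer_addr: int, buffer_size: int) -> None:
        """Written by the OS once, during system initialization."""
        if self.buffer_pointer_addr is not None:
            # Design choice of this sketch: enforce the "only once" rule.
            raise RuntimeError("register group 421 is already initialized")
        if buffer_size <= 0:
            raise ValueError("message buffer size must be positive")
        self.buffer_pointer_addr = buffer_pointer_addr
        self.buffer_size = buffer_size


regs = RegisterGroup421()
regs.initialize(buffer_pointer_addr=0x0800, buffer_size=4096)  # hypothetical values
print(regs.buffer_pointer_addr, regs.buffer_size)  # 2048 4096
```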

While the OS 10 is operating normally, it issues a message output request by notifying the I/O device of a descriptor address (step S12 in FIG. 7; step S32 in FIG. 8). On receiving the descriptor address, the normal operation control unit 44 first fetches the descriptor into the local memory 45 using the DMA control unit 43 (step S13 in FIG. 7; step S33 in FIG. 8). Next, the system failure data acquisition means 42 fetches the buffer pointer using the DMA control unit 43 and stores it in the buffer pointer #1 register of the register group 421 (step S22 in FIG. 7; step S42 in FIG. 8).

The normal operation control unit 44 then fetches the message whose output was requested, as specified by the descriptor, into the local memory 45 using the DMA control unit 43 (step S14 in FIG. 7; step S34 in FIG. 8). It sends the fetched message to the LAN control unit 48 via the data transmission source switching unit 46 (step S35 in FIG. 8), and the LAN control unit 48 delivers the message to the communication destination.

While the message is being transmitted, the system failure data acquisition means 42 uses the DMA control unit 43 to copy the entire contents of the message buffer at the address indicated by the buffer pointer into the message storage buffer 423 (step S23 in FIG. 7; step S43 in FIG. 8). Once both the transmission by the normal operation control unit 44 and the copy of all messages by the system failure data acquisition means 42 are complete, the normal operation control unit 44 interrupts the OS 10 (step S36 in FIG. 8) to report that message output has finished.
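The device-side capture performed on each normal output request (fetching the buffer pointer into buffer pointer #1, then copying the whole message buffer into the message storage buffer 423) can be sketched as follows. `dma_read` is a hypothetical stand-in for a DMA read through the DMA control unit 43, and the 64-bit little-endian pointer format is an assumption.

```python
import struct
from typing import Tuple, Union

PTR_FMT = "<Q"  # assumption: 64-bit little-endian buffer pointer
Memory = Union[bytes, bytearray]


def dma_read(mem: Memory, addr: int, length: int) -> bytes:
    """Hypothetical stand-in for a DMA read through DMA control unit 43."""
    return bytes(mem[addr:addr + length])


def snapshot_on_output_request(mem: Memory, buffer_pointer_addr: int,
                               buffer_size: int) -> Tuple[int, bytes]:
    """Return (buffer pointer #1, contents for message storage buffer 423)."""
    raw = dma_read(mem, buffer_pointer_addr, struct.calcsize(PTR_FMT))
    (pointer1,) = struct.unpack(PTR_FMT, raw)
    # Copy the whole message buffer, not just the message being output.
    save_buffer = dma_read(mem, pointer1, buffer_size)
    return pointer1, save_buffer


mem = bytearray(128)
struct.pack_into(PTR_FMT, mem, 0, 32)  # buffer pointer at address 0 -> buffer at 32
mem[32:40] = b"boot ok "
print(snapshot_on_output_request(mem, 0, 8))  # (32, b'boot ok ')
```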

When the OS alive monitoring mechanism 41 confirms that the OS 10 has stopped, the normal operation control unit 44 stops operating (steps S37 and S38 in FIG. 8). At the same time, the system failure data acquisition means 42 first uses the DMA control unit 43 to copy the contents of the buffer pointer into the buffer pointer #2 register of the register group 421 (step S44 in FIG. 8). It then uses the DMA control unit 43 to copy the entire contents of the message buffer, starting at the address indicated by the buffer pointer #2 register, into the message comparison buffer 424 (step S45 in FIG. 8).

If the contents of the buffer pointer #1 register and the buffer pointer #2 register match (step S46 in FIG. 8), the OS 10 was still using the same message buffer cyclically right up to the failure. The new messages are therefore located in the areas where the contents of the message storage buffer 423 and the message comparison buffer 424 differ when the two are compared. The system failure data acquisition means 42 sends the contents in which this difference was detected to the communication means (LAN control unit 48) (steps S47 and S48 in FIG. 8).

If instead the contents of the buffer pointer #1 register and the buffer pointer #2 register differ (step S46 in FIG. 8), the OS 10 moved the message buffer after its last message output request, so all new messages are stored from the start of the relocated message buffer. The system failure data acquisition means 42 therefore sends the data in the message comparison buffer 424 to the communication means (LAN control unit 48) (steps S50 and S51 in FIG. 8) and ends its operation.
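The decision at step S46 and the two output paths can be sketched as below. This is an illustrative byte-level comparison under assumed data types, not the circuit of the difference output unit 425: buffers are `bytes` of equal length, and each differing run is returned as an `(offset, data)` pair taken from the message comparison buffer 424. The function names are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, bytes]  # (offset within the buffer, bytes to send)


def differing_runs(old: bytes, new: bytes) -> List[Run]:
    """Contiguous runs where new differs from old, taken from new."""
    if len(old) != len(new):
        raise ValueError("buffers must be the same size")
    runs: List[Run] = []
    start = None
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b and start is None:
            start = i                         # a differing run begins
        elif a == b and start is not None:
            runs.append((start, bytes(new[start:i])))
            start = None                      # the run has ended
    if start is not None:
        runs.append((start, bytes(new[start:])))
    return runs


def failure_output(pointer1: int, save_buffer: bytes,
                   pointer2: int, compare_buffer: bytes) -> List[Run]:
    """Choose what to send after the OS has stopped (step S46 onward)."""
    if pointer1 == pointer2:
        # Same buffer reused cyclically: new messages are where contents differ.
        return differing_runs(save_buffer, compare_buffer)
    # Buffer relocated after the last request: send the whole comparison buffer.
    return [(0, bytes(compare_buffer))]


print(failure_output(64, b"ABCDEFGH", 64, b"IJCDEFGH"))  # [(0, b'IJ')]
```

One limitation of a plain byte comparison: if a new message happens to overwrite bytes with identical values, those bytes do not appear in the difference, and after a wrap-around the runs come out in buffer order rather than in the order the OS wrote them.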

In this way, the mechanism described above allows this embodiment to output every message that the OS 10 had prepared for output, even when a system failure occurs, which makes it easier to obtain the information needed for failure recovery.

Furthermore, in this embodiment the monitoring of the OS 10 by the OS alive monitoring mechanism 41 works with any OS that supports I/O interrupts, without any modification to the existing OS; the operation of the OS 10 can be monitored simply by installing the mechanism described above in the computer system.

FIG. 1 is a block diagram showing the configuration of an information processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of an information processing system according to one example of the present invention.
FIG. 3 is a block diagram showing the configuration of the OS alive monitoring device of FIG. 2.
FIG. 4 is a diagram showing how data is stored in the main memory of FIG. 2.
FIG. 5 is a block diagram showing the configuration of the system failure data acquisition means of FIG. 2.
FIG. 6 is a flowchart showing the operation of the OS alive monitoring mechanism of FIG. 2.
FIG. 7 is a flowchart showing the operation of the system failure data acquisition means of FIG. 2.
FIG. 8 is a flowchart showing the operation of the system failure data acquisition means of FIG. 2.
FIG. 9 is a flowchart showing the operation of the system failure data acquisition means of FIG. 2.
FIG. 10 is a block diagram showing the configuration of an information processing system according to a conventional example.

Explanation of symbols

1 CPU
2 Main memory
3 I/O bridge
4 I/O device
5 Other I/O devices
6 Message display device
10 OS
41 OS alive monitoring mechanism
42 System failure data acquisition means
43 DMA control unit
44 Normal operation control unit
45 Local memory
46 Data transmission source switching unit
47 Communication control unit
48 LAN control unit
100 I/O bus
200 PCI bus
411 Interrupt means
412, 415 Timer circuit
413 Interrupt factor register
414 Interrupt factor register access detection means
416 Timeout determination unit
421 Register group
422 System failure data acquisition control unit
423 Message storage buffer
424 Message comparison buffer
425 Difference output unit

Claims

An information processing system comprising: an information processing device that holds, in a main memory, console messages and error messages of an OS (Operating System) as data useful for determining the location of a failure; and an input/output device that outputs the console messages and the error messages,
wherein the input/output device has monitoring means for monitoring the operating state of the OS by issuing to the information processing device an empty interrupt whose interrupt source device and cause cannot be identified.

The information processing system according to claim 1, wherein the monitoring means generates the empty interrupt at predetermined time intervals.

The information processing system according to claim 1, wherein the monitoring means monitors whether the OS examines an interrupt factor register indicating an interrupt factor in response to the occurrence of the empty interrupt.

The information processing system according to any one of claims 1 to 3, wherein the input/output device includes system failure data acquisition means that, when the monitoring means detects that the OS has stopped, acquires the contents of a specific area of the main memory and sends the contents to the outside.

The information processing system according to claim 4, wherein the main memory includes a message buffer that holds the console messages and the error messages, and
the system failure data acquisition means sends out the difference between the contents of the message buffer at the time the OS last issued a message output request and the contents of the message buffer after the OS has stopped.

An input/output device that outputs, as data useful for determining the location of a failure, console messages and error messages of an OS (Operating System) held in the main memory of an information processing device,
the input/output device comprising monitoring means for monitoring the operating state of the OS by issuing to the information processing device an empty interrupt whose interrupt source device and cause cannot be identified.

The input/output device according to claim 6, wherein the monitoring means generates the empty interrupt at predetermined time intervals.

The input/output device according to claim 6, wherein the monitoring means monitors whether the OS examines an interrupt factor register indicating an interrupt factor in response to the occurrence of the empty interrupt.

The input/output device according to any one of claims 6 to 8, further comprising system failure data acquisition means that, when the monitoring means detects that the OS has stopped, acquires the contents of a specific area of the main memory and sends the contents to the outside.

The input/output device according to claim 9, wherein the main memory includes a message buffer that holds the console messages and the error messages, and
the system failure data acquisition means sends out the difference between the contents of the message buffer at the time the OS last issued a message output request and the contents of the message buffer after the OS has stopped.

An automatic data transmission method for use at the time of a system failure in an information processing system comprising an information processing device that holds, in a main memory, console messages and error messages of an OS (Operating System) as data useful for determining the location of a failure, and an input/output device that outputs the console messages and the error messages,
wherein the input/output device performs a process of monitoring the operating state of the OS by issuing to the information processing device an empty interrupt whose interrupt source device and cause cannot be identified.

The automatic data transmission method according to claim 11, wherein the process of monitoring the operating state of the OS generates the empty interrupt at predetermined time intervals.

The automatic data transmission method according to claim 12, wherein the process of monitoring the operating state of the OS monitors whether the OS examines an interrupt factor register indicating an interrupt factor in response to the occurrence of the empty interrupt.

The automatic data transmission method according to any one of claims 11 to 13, wherein, when the input/output device detects in the process of monitoring the operating state of the OS that the OS has stopped, the input/output device executes a process of acquiring the contents of a specific area of the main memory and sending the contents to the outside.

The automatic data transmission method according to claim 14, wherein the main memory is provided with a message buffer that holds the console messages and the error messages, and
the process of acquiring the contents of the specific area and sending them to the outside sends out the difference between the contents of the message buffer at the time the OS last issued a message output request and the contents of the message buffer after the OS has stopped.

A program of an automatic data transmission method for use at the time of a system failure in an information processing system comprising an information processing device that holds, in a main memory, console messages and error messages of an OS (Operating System) as data useful for determining the location of a failure, and an input/output device that outputs the console messages and the error messages, the program executing a process of monitoring the operating state of the OS by issuing to the information processing device an empty interrupt whose interrupt source device and cause cannot be identified.

The program according to claim 16, which causes a computer of the input/output device to execute a process of acquiring the contents of a specific area of the main memory and sending the contents to the outside when a stopped state of the OS is detected by the process of monitoring the operating state of the OS.