JP2015225522A

JP2015225522A - System and failure processing method

Info

Publication number: JP2015225522A
Application number: JP2014110314A
Authority: JP
Inventors: 努長岡; Tsutomu Nagaoka
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2014-05-28
Filing date: 2014-05-28
Publication date: 2015-12-14
Anticipated expiration: 2034-05-28
Also published as: JP6357879B2

Abstract

PROBLEM TO BE SOLVED: To provide a system and a failure processing method that can stop a portion where a failure has occurred without stopping the operation of the whole system when the failure occurs in a portion constituting the system.

SOLUTION: A system includes: control means for controlling the whole system; a switch that communicates with the control means in accordance with the PCI Express communication standard; a device that communicates with the switch in accordance with the PCI Express communication standard; and a monitoring unit that communicates with the switch in accordance with a predetermined communication standard, monitors the device for a failure, and, if a failure is detected, transmits information for controlling the switch to the switch via communication in accordance with the predetermined communication standard, thereby preventing information on the failure from being transferred from the switch to the control means via communication in accordance with the PCI Express communication standard.

Description

The present invention relates to a system and a failure processing method.

Patent Document 1 discloses a computer system in which a server chassis having at least one server and an IO slot expansion device equipped with a PCIe switch are connected by a PCIe cable. The server has an arithmetic unit, a storage unit, and an interface. The IO slot expansion device has a PCIe switch, an IO slot expansion device controller connected to the PCIe switch, and a PCI card slot connected to the PCIe switch. A PCI card having an HBA and HBA ports is mounted in the PCI card slot, and the HBA has an identifier area that holds an identifier assigned to each HBA port. The PCIe switch has a PCIe switch register. The IO slot expansion device controller has an assigned identifier table, which manages the identifiers assigned to the HBA ports of the PCI card mounted in the PCI card slot, and a PCIe switch register update control unit. When the IO slot expansion device is powered on, the PCIe switch register update control unit copies the assigned identifier table held by the IO slot expansion device controller into the identifier storage area of the PCIe switch register. When the server is powered on, an HBA identifier update control unit of the interface acquires, via the PCIe cable, the assigned identifier table copied into the identifier storage area of the PCIe switch register, saves the acquired table in the storage unit of the server, looks up in the saved table the assigned identifier corresponding to the PCI card slot and the HBA port on the PCI card, and updates the identifier recorded in the identifier area of the HBA to the assigned identifier obtained from the saved table.

Patent Document 1: JP 2012-150623 A

An object of the present invention is to provide a system and a failure processing method that, when a failure occurs in a part constituting the system, can stop the part where the failure has occurred without stopping the operation of the entire system.

To achieve the above object, the system according to claim 1 includes: control means for controlling the entire system; a switch that communicates with the control means in accordance with the PCI Express communication standard; a device that communicates with the switch in accordance with the PCI Express communication standard; and a monitoring unit that communicates with the switch in accordance with a predetermined communication standard, monitors the device for a failure, and, when a failure is detected, transmits information for controlling the switch to the switch via communication in accordance with the predetermined communication standard, thereby preventing information on the failure from being transferred from the switch to the control means via communication in accordance with the PCI Express communication standard.

The invention according to claim 2 is the invention according to claim 1, wherein the monitoring unit and the device are connected by a transmission path that carries a predetermined signal, and the monitoring unit, after transmitting the information for controlling the switch to the switch via communication in accordance with the predetermined communication standard, transmits a signal for stopping the device via the transmission path.

The invention according to claim 3 is the invention according to claim 1 or claim 2, wherein the device collects predetermined log information and transmits it to the control means via communication in accordance with the PCI Express communication standard, after the monitoring unit has transmitted the information for controlling the switch to the switch via communication in accordance with the predetermined communication standard and before the monitoring unit transmits the signal for stopping the device via the transmission path.

The invention according to claim 4 is the invention according to any one of claims 1 to 3, wherein the information for controlling the switch is information for enabling the Uncorrectable Error Mask of the AER register held in the configuration register, standardized in PCI Express, that stores register information of the switch.

The invention according to claim 5 is the invention according to any one of claims 1 to 4, wherein the predetermined communication standard is the I²C communication standard.

The invention according to claim 6 is the invention according to any one of claims 1 to 5, wherein the device includes a temperature control unit that controls the device's own temperature, and the failure of the device is a failure of the temperature control unit.

To achieve the above object, the failure processing method according to claim 7 includes: a step of detecting a failure of a device by a monitoring unit, the device communicating with a switch in accordance with the PCI Express communication standard, and the monitoring unit monitoring the device for a failure and communicating with the switch in accordance with a predetermined communication standard; a step in which the monitoring unit transmits information for controlling the switch to the switch via communication in accordance with the predetermined communication standard; and a step of preventing, by means of the information for controlling the switch, information on the failure from being transferred from the switch, via communication in accordance with the PCI Express communication standard, to control means that controls the entire system.

According to the inventions of claim 1 and claim 7, when a failure occurs in a part constituting the system, the part where the failure has occurred is stopped without stopping the operation of the entire system.

According to the invention of claim 2, the failed device is disconnected from the system, in contrast to a case where the monitoring unit does not stop the device via the transmission path after transmitting the information for controlling the switch to the switch via communication in accordance with the predetermined communication standard.

According to the invention of claim 3, the control means acquires log information, such as the details of the failure, before the device is stopped. This is in contrast to a case where the device does not collect predetermined log information and transmit it to the control means via PCI Express communication in the interval after the monitoring unit has transmitted the information for controlling the switch and before it transmits the signal for stopping the device via the transmission path.

According to the invention of claim 4, a system down is prevented more simply than in a case where the information for controlling the switch is not information for enabling the Uncorrectable Error Mask of the AER register held in the configuration register, standardized in PCI Express, that stores register information of the switch.

According to the invention of claim 5, the information for controlling the switch is transmitted to the switch over a more general-purpose communication path than in a case where the I²C communication standard is not used as the predetermined communication standard.

According to the invention of claim 6, an excessive temperature rise of the device is suppressed, in contrast to a case where the failure of the device is not a failure of the device's own temperature control unit.

FIG. 1 is a schematic configuration diagram showing an example of the configuration of a computer according to the embodiment.
FIG. 2 is a schematic configuration diagram showing an example of the configuration of an accelerator board according to the embodiment.
FIG. 3 is a diagram for explaining the operation of a computer according to the prior art when a failure occurs.
FIG. 4 is a diagram for explaining the procedure of the failure processing method of the computer according to the embodiment.
FIG. 5 is a diagram for explaining the operation of the computer according to the embodiment when a failure occurs.
FIG. 6 is a diagram for explaining the configuration register and the AER register of the PCIe switch according to the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The following description takes as an example a form in which the system according to the present invention is applied to a computer that accompanies an image forming apparatus or the like and executes image processing and similar tasks. An image to be formed by an image forming apparatus is usually processed by software running on the CPU (central processing unit) of the image forming apparatus and supplied to the image forming unit of the image forming apparatus.

A high-speed image forming apparatus, on the other hand, may be provided with hardware dedicated to image processing functions in addition to software processing; that is, dedicated hardware may be added to accelerate processing. Such hardware is referred to below as a "device". The term "device" is not limited to a dedicated circuit such as an ASIC and may also refer to a printed wiring board on which such a dedicated circuit is mounted. In that case, a computer extended with image processing hardware may be provided separately from the image forming apparatus. The following description of the system according to the present embodiment takes as an example a form in which the present invention is applied to such a computer.

Various means, such as USB (Universal Serial Bus), can be used to extend hardware. The computer according to the present embodiment is described using a form in which the hardware is extended by the PCI Express (registered trademark) communication standard (Peripheral Component Interconnect Express, hereinafter sometimes written "PCIe"). That is, inside the computer according to the present embodiment, at least some of the parts constituting the computer are connected via the PCIe communication standard. The term "computer" in the present embodiment is a concept that includes both hardware and an operating system (hereinafter sometimes referred to as the "OS"), and means hardware that operates under the control of the OS.

In the PCIe communication standard, devices that exchange data are connected point-to-point by a pair of serial transmission paths (a lane) capable of transmitting data at 2.5 Gbps (bits per second) or 5.0 Gbps, and the data communication network forms a tree structure with the root complex at its apex. The PCIe communication standard speeds up data transmission between devices by running multiple lanes in parallel. Data transmission between connected devices is carried out using packets.
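As a rough illustration of how running lanes in parallel scales the link rate, the following sketch multiplies the per-lane signaling rate by the lane count. The function name and the lane counts are illustrative assumptions. The optional 8b/10b factor reflects the line encoding used by the 2.5 Gbps and 5.0 Gbps generations of PCI Express; it is not mentioned in the description above.

```python
def link_rate_gbps(lane_rate_gbps: float, lanes: int, encoded: bool = False) -> float:
    """Return the one-direction rate of a PCIe link in Gbps.

    lane_rate_gbps: per-lane signaling rate (2.5 or 5.0 in the description).
    lanes:          number of lanes run in parallel (hypothetical values).
    encoded:        if True, apply the 8b/10b factor (8 payload bits per
                    10 transmitted bits) to get the usable data rate.
    """
    if lanes < 1:
        raise ValueError("a link needs at least one lane")
    raw = lane_rate_gbps * lanes
    return raw * 8 / 10 if encoded else raw


if __name__ == "__main__":
    for lanes in (1, 4, 8):
        print(lanes, link_rate_gbps(2.5, lanes), link_rate_gbps(5.0, lanes))
```

The point of the sketch is only that the link rate grows linearly with the number of lanes; protocol overhead above the line encoding is ignored.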

FIG. 1 is a schematic configuration diagram showing an example of the configuration of a computer 10 as the system according to the present embodiment. As shown in FIG. 1, the computer 10 includes a main CPU 12, a DDR (Double Data Rate) memory 14, a root complex 16 (labeled "Root Complex" in FIG. 1), and accelerator boards 20 and 22. In addition to the configuration shown in FIG. 1, the computer 10 also includes a board for connecting to the image forming apparatus and other components, which are not shown.

As shown in FIG. 1, in the computer 10 according to the present embodiment, the root complex 16, located at the top of the tree structure, is connected to the main CPU 12, which governs the operation of the entire apparatus, and to the DDR memory 14, which serves as system memory for the various processes of the main CPU 12. The main CPU 12 runs the OS that exercises overall control of the computer 10.

The root complex 16 is also connected to the accelerator boards 20 and 22, which are devices used to implement image processing functions such as image compression and decompression. A root complex is a communication device that reads the setting information of each PCIe-compliant device, connected via a PCIe communication path, from that device's configuration space, and controls the allocation of address space to each device and the transfer of data to each device. The root complex 16 of the computer 10 according to the present embodiment controls the allocation of address space to the accelerator boards 20 and 22 and the transfer of data, that is, the transfer of packets.

As shown in FIG. 1, the accelerator board 20 includes a PCIe switch 18 (labeled "PCIe Switch" in FIG. 1), and N DRPs (Dynamically Reconfigurable Processors) 28-1 to 28-N and DDR memories 30-1 to 30-N connected to the PCIe switch 18. In the following, the DRPs 28-1 to 28-N are written "DRP 28" when they need not be distinguished from one another, and the DDR memories 30-1 to 30-N are written "DDR memory 30" when they need not be distinguished from one another.

The PCIe switch 18 is a communication device that connects two or more ports and routes packets between the ports. That is, the PCIe switch 18 controls the transfer of data, that is, the transfer of packets, between the root complex 16 and each DRP 28. As described later, the PCIe switch 18 also has a configuration register that stores its own information and state.

The PCIe switch 18 according to the present embodiment also has an AER (Advanced Error Reporting) function. When, for example, a failure occurs in the computer 10, information on the failure is recorded in the AER register. The AER register is a 32-bit register standardized as an option of the PCIe communication standard and forms part of the configuration register described above.
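To make the role of a 32-bit error mask concrete, the following sketch models the relation between an uncorrectable error status word and its mask. The bit position used for Surprise Down (bit 5) comes from the PCI Express Base Specification rather than from this description, and the function and constant names are hypothetical.

```python
# Bit position from the PCI Express Base Specification (AER Uncorrectable
# Error Status / Mask registers); it is not stated in the patent text.
SURPRISE_DOWN = 1 << 5

# A mask with every bit set, i.e. "all F" for a 32-bit register.
MASK_ALL = 0xFFFF_FFFF


def is_reported(status: int, mask: int, error_bit: int) -> bool:
    """Return True if an error would be reported upstream.

    An error is reported only when its status bit is set and the
    corresponding mask bit is clear.
    """
    return bool(status & error_bit) and not (mask & error_bit)


if __name__ == "__main__":
    status = SURPRISE_DOWN
    print(is_reported(status, 0x0000_0000, SURPRISE_DOWN))  # unmasked
    print(is_reported(status, MASK_ALL, SURPRISE_DOWN))     # masked
```

Setting every mask bit suppresses reporting for all uncorrectable error types at once, which is simpler than selecting an individual bit but also hides unrelated errors on the same port.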

The DRP 28 is a processor that executes image processing functions. Along with a dedicated hardware circuit, the DRP 28 carries firmware (hereinafter sometimes referred to as "FW") that performs tasks such as controlling the image processing. The DDR memory 30 is memory used for purposes such as executing the FW.

Like the accelerator board 20, the accelerator board 22 includes a PCIe switch 24, to which components that execute image processing functions, such as DRPs (not shown), are also connected. The computer 10 according to the present embodiment is described using a form in which two accelerator boards 20 and 22 are connected to the root complex 16, but the configuration is not limited to this; three or more accelerator boards may be provided depending on the image processing capability required.

FIG. 2 is a configuration diagram showing the accelerator board 20 according to the present embodiment in more detail. As shown in FIG. 2, the accelerator board 20 includes the PCIe switch 18, DRPs 28-1 and 28-2, a CPLD (Complex Programmable Logic Device) 32, a fan 34 (labeled "FAN" in FIG. 2), and a temperature sensor 36 (labeled "TEMP" in FIG. 2). The accelerator board 20 shown in FIG. 2 illustrates a form in which two DRPs are connected to the PCIe switch 18; the DDR memories 14 and 30 and the root complex 16 of FIG. 1 are omitted from the figure.

The fan 34 shown in FIG. 2 cools the DRP 28-1 (hereinafter written simply "DRP 28") mounted on the accelerator board 20 and suppresses the temperature rise of the DRP 28, which is a heat source. The temperature sensor 36 is placed in contact with the DRP 28 or in its vicinity and detects the temperature of the DRP 28.

The CPLD 32 mainly controls the fan 34, based on the temperature detected by the temperature sensor 36, so that the temperature of the DRP 28 stays within a predetermined range. The CPLD 32 also monitors the fan 34 for abnormalities, that is, for failures such as the fan 34 stopping. In other words, the CPLD 32 functions as a monitoring and control unit that controls the temperature of the DRP 28 by controlling the operation of the fan 34 and monitors the fan 34 for failures.
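The two roles of the CPLD 32 described above, temperature control and fan monitoring, can be sketched as follows. This is only an illustrative model: the thresholds, the linear duty ramp, and the use of a tachometer reading to detect a stopped fan are assumptions, since the description says only that the temperature is kept within "a predetermined range" and that a fan stop is detected.

```python
from dataclasses import dataclass

# Hypothetical limits of the "predetermined range" (degrees Celsius).
TEMP_LOW_C = 50.0
TEMP_HIGH_C = 70.0
MIN_DUTY = 30    # hypothetical idle fan duty in percent
MAX_DUTY = 100


@dataclass
class FanStatus:
    rpm: int  # hypothetical tachometer reading; 0 means the fan is not turning


def fan_duty(temp_c: float) -> int:
    """Return the fan duty (percent), ramping linearly between the limits."""
    if temp_c <= TEMP_LOW_C:
        return MIN_DUTY
    if temp_c >= TEMP_HIGH_C:
        return MAX_DUTY
    fraction = (temp_c - TEMP_LOW_C) / (TEMP_HIGH_C - TEMP_LOW_C)
    return int(MIN_DUTY + (MAX_DUTY - MIN_DUTY) * fraction)


def fan_failed(commanded_duty: int, status: FanStatus) -> bool:
    """Report Fan-Fail when the fan is commanded to run but is not turning."""
    return commanded_duty > 0 and status.rpm == 0


if __name__ == "__main__":
    duty = fan_duty(60.0)
    print(duty, fan_failed(duty, FanStatus(rpm=0)))
```

Keeping control and monitoring as separate functions mirrors the description, in which the same unit both drives the fan and watches it for failure.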

On the accelerator board 20, the main CPU 12 and the PCIe switch 18, and the PCIe switch 18 and the DRP 28, are connected via PCIe interfaces, that is, by PCIe communication paths. The DRP 28 and the CPLD 32, the CPLD 32 and the fan 34, and the CPLD 32 and the temperature sensor 36 are connected via interfaces that use level (amplitude) signals, such as a CMOS interface or a TTL interface; that is, they are connected by transmission paths that carry level signals. The accelerator board 20 according to the present embodiment uses transmission paths with a CMOS interface as an example.

As shown in FIG. 2, in the computer 10 according to the present embodiment, unlike a typical connection made only through PCIe interfaces, the PCIe switch 18 and the CPLD 32 are connected by a communication path conforming to the I²C (Inter-Integrated Circuit) communication standard. The PCIe switch 18 and the CPLD 32 therefore communicate over a path that is separate from both the communication via the PCIe interface and the signal transmission over the transmission paths described above. Details of the I²C communication path are described later.

When hardware has been extended by a predetermined communication standard and a serious failure of the hardware itself (for example, a risk of excessive temperature rise) is detected by some method, it may be desirable to notify a host system such as the main CPU of the failure and then immediately reset (stop) the hardware. In a system according to the prior art, however, immediately stopping the failed hardware could cause the OS to hang up and the system to go down.

An example of how such a system down occurs is described with reference to FIG. 3. FIG. 3 is a diagram for explaining the operation of the computer 10 according to the prior art when a failure occurs.

The accelerator board 20a of the prior-art computer 10 shown in FIG. 3 is the accelerator board 20 of the present embodiment shown in FIG. 2 with the I²C communication path removed, so identical components are given the same reference numerals and their description is omitted. The following description takes as an example a case where the failure is the fan 34 stopping. The labels [1] to [5] in FIG. 3 correspond to the locations where the operations of steps [1] to [5] below occur.

[1]: The CPLD 32 detects that the fan 34 has stopped (hereinafter sometimes referred to as "Fan-Fail").
[2]: The CPLD 32 notifies the DRP 28 of the Fan-Fail.
[3]: The FW of the DRP 28 notifies the main CPU 12, that is, the OS as the host system, of the Fan-Fail.
[4]: The CPLD 32 resets the DRP 28, that is, stops its operation. This is because, with the fan 34 stopped, the temperature of the DRP 28 can no longer be controlled and may rise abnormally.
[5]: The main CPU 12 receives a packet reporting an uncorrectable error (hereinafter sometimes referred to as an "Uncorrectable Error"), and the OS hangs up (the main CPU 12 stops and the system goes down).

The reason the OS hangs up in step [5] is as follows.

The DRP 28 and the PCIe switch 18 normally confirm each other's presence by sending and receiving packets over the PCIe interface. When the DRP 28 is reset, however, packets can suddenly no longer be sent or received over the PCIe interface (the link between the DRP 28 and the PCIe switch 18 goes down). The PCIe switch 18 therefore detects an absence in the downstream direction, that is, it detects a so-called Surprise Down Error.

When the PCIe switch 18 detects the Surprise Down Error, it forwards a packet reporting an Uncorrectable Error to the main CPU 12. An Uncorrectable Error is raised when the error that occurred cannot be corrected by hardware, and because no way of handling it is defined either, the OS hangs up.

As described above, with the prior-art accelerator board 20a the OS may hang up when a failure occurs. When the OS hangs up, not only does the computer 10 stop operating, but log information (the history of processing performed in the computer 10, warnings issued, and so on) can no longer be collected from the devices connected inside the computer 10. As a result, the cause of the OS hang-up cannot be determined and it is unclear how to respond.

In the present invention, therefore, the device that has detected the failure sends information based on the failure to the PCIe switch over an I²C communication path, which is separate from the PCIe communication path, and this information forcibly rewrites the Uncorrectable Error Mask of the AER register of the PCIe switch so that the mask is enabled. As a result, the PCIe switch does not forward the Uncorrectable Error message packet to the host system, and an OS hang-up or system down is avoided.

Next, with reference to FIGS. 4 to 6, the failure processing method used when a failure in which the fan 34 stops operating (Fan-Fail) occurs in the computer 10, which serves as the system according to the present embodiment, and the operation of the computer 10 will be described.

FIG. 4 shows the procedure of the failure processing method used when Fan-Fail occurs in the computer 10 according to the present embodiment, and FIG. 5 shows the parts of the computer 10 in which the operation corresponding to each step of the method takes place. That is, [1] to [5] in FIG. 5 correspond to steps S1 to S5 in FIG. 4, respectively.

Referring to FIGS. 4 and 5, in step S1 the CPLD 32 detects Fan-Fail of the fan 34 via a transmission path using a CMOS interface.

In the next step S2, the CPLD 32 notifies the DRP 28 of the Fan-Fail via a transmission path using a CMOS interface.

In the next step S3, the CPLD 32 forcibly rewrites, via the I²C port, the Uncorrectable Error MASK register in the corresponding AER register of the PCIe switch 18 so that the MASK is enabled. Specifically, it sends the rewrite information while specifying the address of the Uncorrectable Error MASK within the AER register (for example, 0xFBC), and forcibly writes all-F to the Uncorrectable Error MASK. This prohibits the PCIe switch 18 from forwarding the Uncorrectable Error message packet to the main CPU 12.
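The register write of step S3 can be sketched as follows. This is a minimal Python simulation, not the logic of the CPLD 32: the `FakeI2CSwitch` class, the device address 0x38, and the little-endian frame layout are assumptions made for illustration. Only the register address 0xFBC and the all-F value come from the description.

```python
import struct

UE_MASK_ADDR = 0xFBC   # example address given in the description
ALL_F = 0xFFFFFFFF     # all bits set: every uncorrectable error is masked


class FakeI2CSwitch:
    """Hypothetical stand-in for the I2C-accessible registers of the switch."""

    def __init__(self):
        # The MASK is disabled (all zero) during normal operation.
        self.registers = {UE_MASK_ADDR: 0x00000000}

    def i2c_write(self, dev_addr, payload):
        # Assumed frame: 2-byte register address, then 4-byte value, little-endian.
        reg_addr, value = struct.unpack("<HI", payload)
        self.registers[reg_addr] = value


def enable_ue_mask(bus, dev_addr=0x38):
    """Force the Uncorrectable Error MASK to all-F over the side-band bus."""
    bus.i2c_write(dev_addr, struct.pack("<HI", UE_MASK_ADDR, ALL_F))


switch = FakeI2CSwitch()
enable_ue_mask(switch)
print(hex(switch.registers[UE_MASK_ADDR]))
```

Because the write travels over I²C rather than PCIe, it does not depend on the state of the PCIe link to the failing device.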

Here, the AER register is described in more detail with reference to FIG. 6. FIG. 6 is a diagram showing an example of the configuration of the PCIe switch 18 according to the present embodiment.

As shown in FIG. 6, the PCIe switch 18 has a plurality of ports (three in the example of FIG. 6) P1, P2, and P3 (hereinafter collectively referred to as "ports P"), and for each port has a PCIe configuration register (hereinafter sometimes simply called a configuration register) C1, C2, or C3 (hereinafter collectively referred to as "configuration registers C"). In addition, as described above, the PCIe switch 18 has an interface 40 conforming to the I²C communication standard for communicating with the CPLD 32.

When a failure occurs, each of the ports P detects the failure and records failure information in its configuration register C. More specifically, the failure information is recorded in the AER register, which is part of the configuration register C. Here, failure information is information indicating the nature of the failure; for example, when the link to the device connected to a port P is lost, a Surprise Down Error is recorded in the AER register. In addition, when a port detects a failure corresponding to an uncorrectable error, it sends an Uncorrectable Error message packet addressed to the main CPU 12.
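The per-port behavior described above, together with the effect of the MASK, can be modeled roughly as below. This is an illustrative sketch only: the `Port` class and its attribute names are hypothetical, and the bit position used for Surprise Down Error (bit 5) follows the AER register layout of the PCI Express specification rather than anything stated in this document.

```python
SURPRISE_DOWN = 1 << 5  # Surprise Down Error bit (per the PCIe AER layout)


class Port:
    """Hypothetical model of one switch port and its AER registers."""

    def __init__(self, name):
        self.name = name
        self.ue_status = 0  # Uncorrectable Error Status: records what happened
        self.ue_mask = 0    # Uncorrectable Error MASK: 0 means reporting enabled
        self.sent = []      # message packets sent toward the main CPU 12

    def on_link_down(self):
        # The failure is always recorded in the AER register.
        self.ue_status |= SURPRISE_DOWN
        # The message packet is sent only if the error is not masked.
        if not (self.ue_mask & SURPRISE_DOWN):
            self.sent.append((self.name, "Uncorrectable Error"))


normal = Port("P1")
normal.on_link_down()

masked = Port("P2")
masked.ue_mask = 0xFFFFFFFF  # all-F, as written in step S3
masked.on_link_down()

print(len(normal.sent), len(masked.sent))
```

In this model the masked port still records the error locally, so the failure information remains available even though nothing is forwarded upstream.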

Referring again to FIGS. 4 and 5, in the next step S4 the FW of the DRP 28 notifies the OS of the Fan-Fail via the PCIe interface.

In the next step S5, the CPLD 32 resets the DRP 28, that is, stops its operation, via the transmission path using the CMOS interface. At this point, because the Uncorrectable Error MASK in the AER register of the PCIe switch 18 was enabled in step S3, no Uncorrectable Error message packet is forwarded to the main CPU 12. Note that while the computer 10 according to the present embodiment is operating normally, the Uncorrectable Error MASK is disabled; that is, Uncorrectable Error message packets are sent to the main CPU 12.

With the above procedure, the DRP 28 is disconnected from the system, and a temperature rise in the DRP 28 caused by the fan 34 stopping is avoided. Moreover, because the main CPU 12 never receives the Uncorrectable Error packet, the OS does not hang (the system does not go down).
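The reason the ordering of steps S1 to S5 matters can be illustrated with a toy model. The sketch below is not the embodiment itself: the `Computer` class and its fields are hypothetical, and it assumes that resetting the DRP 28 drops its PCIe link and that an unmasked Uncorrectable Error message hangs the OS, as in the related-art case described earlier.

```python
class Computer:
    """Toy model of computer 10; all names here are illustrative."""

    def __init__(self):
        self.ue_mask_enabled = False  # MASK is disabled in normal operation
        self.drp_running = True
        self.os_notified = False
        self.os_hung = False
        self.trace = []

    def _reset_drp(self):
        # Assumption: the reset drops the PCIe link, which the switch sees as
        # a Surprise Down Error; if unmasked, the message hangs the OS.
        self.drp_running = False
        if not self.ue_mask_enabled:
            self.os_hung = True

    def handle_fan_fail(self, enable_mask=True):
        self.trace.append("S1 CPLD detects Fan-Fail")
        self.trace.append("S2 CPLD notifies DRP")
        if enable_mask:
            self.ue_mask_enabled = True
            self.trace.append("S3 CPLD enables Uncorrectable Error MASK")
        self.os_notified = True
        self.trace.append("S4 DRP FW notifies OS")
        self._reset_drp()
        self.trace.append("S5 CPLD resets DRP")


with_mask = Computer()
with_mask.handle_fan_fail()

without_mask = Computer()
without_mask.handle_fan_fail(enable_mask=False)

print(with_mask.os_hung, without_mask.os_hung)
```

In both runs the DRP ends up stopped; only the run that skips step S3 leaves the modeled OS hung.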

Although the above embodiment describes the case where the DRP 28 is reset, a step of collecting log information may be added to the reset of the DRP 28. In that case, a step such as "the FW of the DRP 28 collects the necessary logs and sends them to the main CPU 12, which serves as the host system" may be added between step S4 and step S5.

In the above embodiment, step S3 was described using an example in which the CPLD 32 uses the I²C port to forcibly rewrite the Uncorrectable Error MASK register in the corresponding AER register of the PCIe switch 18 so that the MASK is enabled. However, the invention is not limited to this. For example, a driver for rewriting the registers of the PCIe switch 18 may be installed on the main CPU 12, and the main CPU 12 may rewrite the Uncorrectable Error MASK register. In this case, after completing the rewrite, or after additionally collecting the log information, the main CPU 12 notifies the CPLD 32 that the register rewrite and log collection are complete, and the CPLD 32 then resets the DRP 28.

In the above embodiment, the present invention was described using an example in which a failure related to temperature control occurs in a computer that executes image processing and the like. However, the invention is not limited to this and may be applied to a serious failure in other equipment, for example a case in which an abnormality occurs in the laser output of the laser drawing section of an image forming apparatus.

In the above embodiment, an example was described in which the fan 34 stopping is detected as the failure. However, the invention is not limited to this and may be applied, for example, to a form in which the temperature measured by the temperature sensor 36 exceeding a predetermined threshold is detected as the failure.
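The two detection conditions mentioned above (fan stop and over-temperature) could be combined in a monitor along the following lines. This is a hedged sketch: the function name, the return values, and the 85.0 °C default threshold are assumptions chosen for illustration, since the document only says that the threshold is predetermined.

```python
FAN_FAIL = "fan-fail"
OVER_TEMPERATURE = "over-temperature"


def detect_failure(fan_running, temperature_c, threshold_c=85.0):
    """Return the kind of failure detected, or None if there is none.

    threshold_c is a hypothetical value; the document does not specify it.
    """
    if not fan_running:
        return FAN_FAIL
    if temperature_c > threshold_c:  # failure only when the threshold is exceeded
        return OVER_TEMPERATURE
    return None


print(detect_failure(True, 60.0))
print(detect_failure(False, 60.0))
print(detect_failure(True, 90.0))
```

Either result would then trigger the same handling sequence of steps S2 to S5.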

10 Computer
12 Main CPU
14, 30 DDR memory
16 Root complex
20, 20a, 22 Accelerator board
18, 24 PCIe switch
28 DRP
32 CPLD
34 Fan
36 Temperature sensor
40 I²C interface
C1, C2, C3 Configuration registers
P1, P2, P3 Ports

Claims

A system comprising:
control means for controlling the entire system;
a switch capable of communicating with the control means according to the PCI Express communication standard;
a device capable of communicating with the switch according to the PCI Express communication standard; and
a monitoring unit that communicates with the switch according to a predetermined communication standard, monitors whether the device has a failure, and, when a failure is detected, transmits information for controlling the switch to the switch via communication according to the predetermined communication standard, thereby suppressing transfer of information regarding the failure from the switch to the control means via communication according to the PCI Express communication standard.

The system according to claim 1, wherein the monitoring unit and the device are connected by a transmission path that transmits a predetermined signal, and the monitoring unit transmits a signal for stopping the device via the transmission path after transmitting the information for controlling the switch to the switch via communication according to the predetermined communication standard.

The system according to claim 1 or 2, wherein, after the monitoring unit transmits the information for controlling the switch via communication according to the predetermined communication standard and before it transmits the signal for stopping the device via the transmission path, the device collects predetermined log information and transmits it to the control means via communication according to the PCI Express communication standard.

The system according to any one of claims 1 to 3, wherein the information for controlling the switch is information for enabling the Uncorrectable Error MASK of the AER register held in the configuration register, standardized by PCI Express, that stores register information of the switch.

The system according to claim 1, wherein the predetermined communication standard is the I²C communication standard.

The system according to any one of claims 1 to 5, wherein the device includes a temperature control unit that controls its own temperature, and the failure of the device is a failure of the temperature control unit.

A failure processing method comprising:
monitoring, by a monitoring unit that communicates with a switch according to a predetermined communication standard, whether a device that communicates with the switch according to the PCI Express communication standard has a failure, and detecting a failure of the device;
transmitting, by the monitoring unit, information for controlling the switch to the switch via communication according to the predetermined communication standard; and
suppressing, based on the information for controlling the switch, transfer of information regarding the failure from the switch to control means that controls the entire system via communication according to the PCI Express communication standard.