JP2012194790A

JP2012194790A - Fault detection method, control device and multiprocessor system

Info

Publication number: JP2012194790A
Application number: JP2011058310A
Authority: JP
Inventors: Naoki Fujimoto; 直樹藤本
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2011-03-16
Filing date: 2011-03-16
Publication date: 2012-10-11
Anticipated expiration: 2031-03-16
Also published as: JP5371123B2

Abstract

PROBLEM TO BE SOLVED: To solve the problem in which detecting a fault that occurs on a communication path in a device whose hardware is divided into multiple partitions is time consuming, since it is difficult to detect the location of the fault and necessary to identify a faulty part by manually changing parts.

SOLUTION: A control device in this invention includes: communication means that communicates with an external device using a first interface; detection means that detects the occurrence of a communication fault; first count means that counts the number of faults detected by the detection means; and count control means that controls the first count means to count up and controls count means in the external device to count up using a second interface, when the detection means detects the fault.

Description

The present invention relates to a failure detection method for appropriately handling the location of a failure that occurs in a system in which a plurality of components or devices are connected, to a control device used in the method, and to a system including the control device.

In recent years, the shift from single-processor systems to multiprocessor systems has accelerated in order to improve application execution speed through parallel processing. The number of processor modules installed in a multiprocessor system is also increasing, and the paths within the system are becoming more complicated.

When the paths become complicated in this way, it becomes difficult to identify the location of a failure when one occurs. Various techniques for making failure locations easier to identify have been developed. For example, Patent Document 1 discloses a technique for a system in which a plurality of data terminal devices (DTE) are each connected via line termination devices (DCE): when a failure occurs, the technique makes it easy to determine whether the failure occurred between DCE and DCE or between DTE and DCE.

Patent Document 2 discloses a technique for a system in which a central control unit accesses a group of storage devices over a plurality of paths. Rather than blocking a failed path based solely on the number of failures, the technique changes the blocking condition according to the number of remaining usable paths.

Patent Document 3 discloses, as a related technique, sharing a device among a plurality of processor systems that run different operating systems. In general, in such a shared device system, each processor system individually holds failure information about failures that occurred while it was using the shared device. When a failure occurs, the suspected failure location is determined and handled based on that failure information alone.

Patent Document 1: JP 2003-18329 A
Patent Document 2: JP 2000-148655 A
Patent Document 3: JP 2004-246779 A

Conventionally, a device whose hardware is divided into a plurality of partitions has determined the suspect part using only failure information, such as access error logs, acquired within each partition. With this approach, when a failure occurs on the path from a partition to the shared device, the failure location is difficult to identify and the faulty part has to be isolated by manually replacing parts, so determining the suspect takes time. Moreover, in a shared device system with many partitions and complicated paths, the conventional approach sometimes cannot identify the failure location at all, and every part on the suspected path has to be replaced.

The techniques of the above patent documents do not provide an effective solution to this situation, and a new failure detection method and a control device for it have therefore been needed.

An object of the present invention is to provide a failure detection method that makes it easier to narrow down suspected devices when a failure occurs on a communication path, a control device used in the failure detection method, and a multiprocessor system.

The multiprocessor system of the present invention comprises a plurality of hardware-divided partitions and a shared device accessed from the plurality of partitions. The plurality of partitions and the shared device are connected by a first interface used for data communication and a second interface used for failure detection. The plurality of partitions and the shared device each comprise counting means for counting failures. Each of the plurality of partitions comprises control means that, when a failure occurs in the data communication, performs control to change the count values of the counting means of the partition involved in the data communication and of the shared device.

The control device of the present invention comprises: communication means that communicates with an external device using a first interface; detection means that detects that a failure has occurred in the communication; first counting means that counts the failures detected by the detection means; and count control means that, when the detection means detects a failure, performs control to count up the first counting means and, using a second interface, performs control to count up counting means of the external device.

In another aspect, the control device of the present invention comprises: communication means that receives information on a communication failure from another control device and, based on the received information, communicates with an external device using a first interface; detection means that detects that a failure has occurred in the communication that the communication means performs based on the information on the communication failure; and count control means that, using a second interface, performs control to count up counting means of the external device when a failure is detected in that communication, and performs control to count down the counting means of the external device when no failure is detected in that communication.

The failure processing method of the present invention comprises: a communication step of communicating with an external device using a first interface; a detection step of detecting that a failure has occurred in the communication; a count step of counting the failures detected in the detection step; and a count control step of, when the failure is detected, performing control using a second interface to count up counting means of the external device.

According to the present invention, it is possible to provide a failure detection method that makes it easier to narrow down suspected devices when a failure occurs on a communication path, a control device used in the failure detection method, and a multiprocessor system.

FIG. 1 is a block diagram showing the configuration of a communication system according to Embodiment 1.
FIG. 2 is a block diagram showing the configuration of a multiprocessor system according to Embodiment 2.
FIG. 3 is a sequence diagram showing the failure handling operation of the multiprocessor system according to Embodiment 2.
FIG. 4 is a block diagram showing the configuration of a modification of the multiprocessor system according to Embodiment 2.

(Embodiment 1)
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a communication system having a failure handling function according to the present invention.

The communication system according to Embodiment 1 of the present invention consists broadly of three parts: a partition 100, a partition 200, and a partition 300.

The partition 100 is a single block that groups the functional modules described below. It may be a device or a component, or it may be a hardware-divided partition. The partition 100 includes a communication module 101, a failure detection module 102, a count control module 103, and a count module 104.

The communication module 101 communicates with the communication module 301 in the partition 300, for example by sending and receiving signals and data, over the control/data interface 801, which is the first interface. As necessary, it also communicates with the communication module 201 in the partition 200 in the same way.

The failure detection module 102 detects that a failure has occurred in communication performed by the communication module 101. For example, it determines that a communication failure has occurred when failure information is sent from the destination partition, or when a reply to a request is not returned within a fixed time, and it then notifies the count control module 103 that a failure has occurred.
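
The two detection conditions named above (failure information received from the destination, or no reply within a fixed time) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the specification: the class `FailureDetector`, its method names, the `notify` callback standing in for the count control module 103, and the timeout value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FailureDetector:
    """Hypothetical model of failure detection module 102."""
    timeout: float                    # assumed reply deadline, in seconds
    notify: Callable[[str], None]     # stands in for count control module 103
    pending: Dict[int, float] = field(default_factory=dict)  # request id -> send time

    def on_request_sent(self, request_id: int, now: float) -> None:
        self.pending[request_id] = now

    def on_reply(self, request_id: int) -> None:
        # A reply arrived in time, so the request is no longer outstanding.
        self.pending.pop(request_id, None)

    def on_error_info(self, request_id: int) -> None:
        # Condition 1: the destination partition sent failure information.
        self.pending.pop(request_id, None)
        self.notify(f"error-info:{request_id}")

    def poll(self, now: float) -> None:
        # Condition 2: no reply was returned within the fixed time.
        expired = [rid for rid, sent in self.pending.items() if now - sent > self.timeout]
        for rid in expired:
            del self.pending[rid]
            self.notify(f"timeout:{rid}")


events: List[str] = []
detector = FailureDetector(timeout=1.0, notify=events.append)
for request_id in (1, 2, 3):
    detector.on_request_sent(request_id, now=0.0)
detector.on_reply(1)         # request 1 succeeds
detector.on_error_info(2)    # request 2 fails with an error report
detector.poll(now=2.0)       # request 3 times out
```

In this run, only requests 2 and 3 produce notifications; request 1 is cleared by its reply.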

When the count control module 103 receives a notification from the failure detection module 102 that a failure has occurred in communication performed by the communication module 101, it performs control to count up the count modules belonging to the partitions involved in that communication. For example, when a failure occurs between the communication module 101 and the communication module 301, the partitions involved in the communication are the partition 100 and the partition 300. Accordingly, the count control module 103 performs control to count up the count module 104 belonging to the partition 100 and the count module 302 belonging to the partition 300. This control is performed over a diagnostic interface 802 provided separately from the control/data interface 801 that carries the communication. A dedicated interface may also be provided between the count module 104 and the count control module 103, so that count-up control within the same partition and count-up control of an external partition use different interfaces. Including the case where such a dedicated interface is provided within the same partition, an interface used for count-up control is defined as the second interface, and an interface used for communication is defined as the first interface.
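
As a rough sketch of this count-up control, the following models the diagnostic interface as a simple table of named counters. The classes `DiagnosticInterface` and `CountControl` and the counter names are hypothetical; the specification does not define how counters are addressed over the diagnostic interface 802.

```python
from typing import Dict, Iterable


class DiagnosticInterface:
    """Hypothetical stand-in for diagnostic interface 802; counters are addressed by name."""

    def __init__(self, counter_names: Iterable[str]) -> None:
        self.counters: Dict[str, int] = {name: 0 for name in counter_names}

    def count_up(self, name: str) -> None:
        self.counters[name] += 1


class CountControl:
    """Hypothetical model of count control module 103."""

    def __init__(self, own_counter: str, diag: DiagnosticInterface) -> None:
        self.own_counter = own_counter
        self.diag = diag

    def on_failure(self, peer_counter: str) -> None:
        # Count up every counter that belongs to a partition involved in the
        # failed communication: this partition's counter and the peer's counter.
        for name in (self.own_counter, peer_counter):
            self.diag.count_up(name)


diag = DiagnosticInterface(["counter_104", "counter_204", "counter_302"])
control_103 = CountControl("counter_104", diag)
control_103.on_failure("counter_302")   # failure between modules 101 and 301
```

The counter of the uninvolved partition 200 stays at zero, which is what later lets the counts separate involved parts from uninvolved ones.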

The count module 104 counts the communication failures. Counting in the count module 104 is performed under the count-up control of the count control module 103 described above. When a failure occurs in communication from the partition 200 to the partition 100, the count module 104 may instead be counted up by the count control module 203 belonging to the partition 200.

Next, the partition 200 will be described. The partition 200 has the same configuration as the partition 100. The partition 200 includes a communication module 201, a failure detection module 202, a count control module 203, and a count module 204.

The communication module 201 communicates with the communication module 301 in the partition 300 over the control/data interface 801, which is the first interface. As necessary, it also communicates with the communication module 101 in the partition 100.

The failure detection module 202 detects that a failure has occurred in communication performed by the communication module 201. It can use the same detection methods as the failure detection module 102.

When a failure occurs in communication performed by the communication module 201, the count control module 203 performs control to count up the count modules belonging to the partitions involved in that communication. As with the count control module 103, this control is performed over the diagnostic interface 802 provided separately from the control/data interface 801 that carries the communication. As in the partition 100, a dedicated interface may also be provided between the count module 204 and the count control module 203, so that count-up control within the same partition and count-up control of an external partition use different interfaces.

The count module 204 counts the communication failures. Counting in the count module 204 is performed under the count-up control of the count control module 203. When a failure occurs in communication from the partition 100 to the partition 200, the count module 204 may instead be counted up by the count control module 103 belonging to the partition 100.

Next, the partition 300 will be described. The partition 300 is a single block that groups the functional modules described below, and may be, for example, a shared device. The partition 300 includes a communication module 301 and a count module 302.

The communication module 301 communicates with the communication module 101 belonging to the partition 100 and with the communication module 201 belonging to the partition 200 over the control/data interface 801.

The count module 302 counts the communication failures under the count-up control that a count control module in an external partition performs over the diagnostic interface 802. As necessary, the counted value is read out to a count control module under read control performed by that count control module.

With this configuration, when communication failures occur in communication among these partitions, the count value of the count module belonging to the partition that is causing the failures grows larger than the count values of the count modules belonging to the other partitions. The failure location can therefore be identified easily.
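
A small worked example shows why the counts separate. It assumes a made-up log of three failed communications, all involving the partition 300, and applies the rule that both partitions involved in each failed communication are counted up. The log contents and variable names are illustrative only.

```python
from collections import Counter

# Each tuple names the two partitions involved in one failed communication
# (an assumed log, for illustration only).
failed_communications = [
    ("partition_100", "partition_300"),
    ("partition_200", "partition_300"),
    ("partition_100", "partition_300"),
]

counts = Counter()
for source, destination in failed_communications:
    counts[source] += 1
    counts[destination] += 1

# Rank partitions from most to least suspect by count value.
ranking = counts.most_common()
prime_suspect = ranking[0][0]
```

Here the partition 300 ends with a count of 3, against 2 and 1 for the partitions 100 and 200, so it is the first candidate to examine.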

The above description covers the case where the count value of a count module is counted up when a failure occurs. The count value of a count module may additionally be counted down when no failure occurs.
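
The count-down variation can be sketched as below. The class `FailureCounter` and the function `record` are hypothetical, and clamping the count at zero is an assumption made here; the specification does not say what happens when a counter at zero is counted down.

```python
class FailureCounter:
    """Hypothetical count module that supports both count-up and count-down."""

    def __init__(self) -> None:
        self.value = 0

    def count_up(self) -> None:
        self.value += 1

    def count_down(self) -> None:
        # Assumption: the count never drops below zero (not stated in the source).
        self.value = max(0, self.value - 1)


def record(counter: FailureCounter, succeeded: bool) -> None:
    # Count down after a communication without failure, count up after a failure.
    if succeeded:
        counter.count_down()
    else:
        counter.count_up()


healthy, faulty = FailureCounter(), FailureCounter()
for ok in [False, True, True, True]:      # one transient failure, then successes
    record(healthy, ok)
for ok in [False, False, True, False]:    # mostly failures
    record(faulty, ok)
```

With count-down, a part that was merely on the path of a transient failure drifts back toward zero, while a part that keeps failing retains a high count.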

(Embodiment 2)
Embodiment 2 relates to a multiprocessor system that incorporates the failure handling function according to Embodiment 1. It is described below with reference to the drawings. Description of parts that overlap with Embodiment 1 is partly omitted.

FIG. 2 shows the configuration of a multiprocessor system 1000 according to Embodiment 2. The multiprocessor system 1000 is hardware-divided into two partitions, a partition 100 and a partition 200. Broadly, the multiprocessor system 1000 consists of these two partitions, a relay device 410, and an optical drive 310. The partition 100 and the partition 200 correspond to the partition 100 and the partition 200 in FIG. 1, respectively, and the optical drive 310 corresponds to the partition 300 in FIG. 1. The relay device 410 relays communication between the partitions 100 and 200 and the partition 300 (here, the optical drive 310), and is arranged on a motherboard 400. Each block is described in detail below.

The partition 100 includes a processor module 110 and a service processor 120. The processor module 110 includes a processor that executes and controls instructions, a main memory that stores data, and the like. Based on a request executed by the processor module 110, a request signal is transmitted to the optical drive 310 via the service processor 120.

The service processor 120 is a processor that assists the processor module 110, and performs the failure detection processing described below, startup processing for the processor module 110, and the like. The service processor 120 internally includes a communication unit 121, a failure detection unit 122, a count control unit 123, and a counter 124.

The communication unit 121 communicates with the communication unit of an external device. This communication includes not only communication with external devices initiated by the service processor 120 itself but also forwarding of communication from the processor module 110. The communication unit 121 receives a request signal sent from the processor in the processor module 110 and forwards it to the device to which the request is addressed.

The failure detection unit 122 detects whether a failure has occurred in communication performed by the communication unit 121. The failure detection unit 122 receives an access error packet, which is one form of failure information, from the communication unit 121 and analyzes the packet to identify the communication path on which the failure occurred. It also measures the time elapsed since the communication unit 121 output a request signal, and if no reply signal is returned from the destination of the request signal within a fixed time, it determines that a failure has occurred on that communication path. When the failure detection unit 122 detects, by these or other methods, that a failure has occurred in communication performed by the communication unit 121, it notifies the count control unit 123 that a communication failure has occurred. The information in this notification is selected as needed from items such as the time of the failure, the number of occurrences, the severity of the failure, information on the communication path on which the failure occurred, and the address information of the counters of the devices that require count-up control.
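
One way to picture the notification is as a record whose fields are all optional, since only the items that are needed are selected for each notification. The dataclass below is a hypothetical layout: the field names, types, and the sample values (including the counter addresses) are invented for illustration and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureNotification:
    """Hypothetical record passed from failure detection unit 122 to count control unit 123."""
    occurred_at: Optional[float] = None       # time of the failure
    occurrence_count: Optional[int] = None    # number of occurrences
    severity: Optional[str] = None            # how severe the failure is
    path: List[str] = field(default_factory=list)               # devices on the failed path
    counter_addresses: List[int] = field(default_factory=list)  # counters to count up


note = FailureNotification(
    occurred_at=1300233600.0,
    severity="recoverable",
    path=["service_processor_120", "relay_410", "optical_drive_310"],
    counter_addresses=[0x124, 0x412, 0x313],   # made-up addresses
)
```

Fields that are not needed for a given failure, such as `occurrence_count` here, are simply left unset.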

When the count control unit 123 receives a notification of a communication failure from the failure detection unit 122, it performs control to count up the counters belonging to the partition and external devices involved in that communication. For example, when a failure occurs between the service processor 120 and the optical drive 310, the partition and external devices involved in the communication are the partition 100, the relay device 410, and the optical drive 310. Accordingly, the count control unit 123 performs control to count up the counter 124 of the service processor 120 and, using the diagnostic interface 802, performs control to count up the counter 313 and the counter 412 contained in the devices outside the service processor 120. The count control unit 123 also receives read instructions from a control unit (not shown) and performs control to read the stored value from each counter connected via the diagnostic interface 802. The values read from the counters are presented to the user, as necessary, through a user interface (not shown). In addition, the count control unit 123 performs control, as necessary, to count down the counters of the devices on the communication path.

The counter 124 holds the count of failures detected by the failure detection unit 122. The counter 124 is implemented with, for example, a nonvolatile ROM. The value held by the counter 124 is incremented under the count-up control of the count control unit 123.

Next, the partition 200 will be described. The partition 200 includes a processor module 210 and a service processor 220. Because the partition 200 has the same configuration as the partition 100, its detailed description is omitted: the processor module 210 has the same configuration as the processor module 110, and the service processor 220 has the same configuration as the service processor 120.

Next, the optical drive 310 will be described. The optical drive 310 is a shared device accessed from a plurality of partitions, and internally includes a communication unit 311, a failure detection unit 312, and a counter 313.

The communication unit 311 communicates with the communication unit of an external device according to a predetermined protocol. For example, it receives a request signal for reading or writing data from an external device, and a control unit (not shown) controls the reading or writing of data at the address specified by the request signal. The communication unit 311 returns the read data, or a reply signal indicating that the access has completed, to the sender of the request signal.

The failure detection unit 312 detects failures that occur in the optical drive 310 itself. It detects, as failures, communication errors that occur in the communication unit 311 and read or write failures that occur in the control unit (not shown), and creates failure information that summarizes the failure type, the time of the failure, and the like. The failure information created by the failure detection unit 312 is transmitted, as necessary, to a service processor installed in a hardware-divided partition.

The counter 313 is connected to the diagnostic interface 802 and increments its stored value under count-up control from a count control unit outside the optical drive 310. It also decrements its stored value by one under count-down control from a count control unit. The counter 313 is used to judge whether the optical drive 310 is the suspect in a failure.

Next, the relay device 410 will be described. The relay device 410 is arranged on the motherboard 400 to relay communication between each partition and the optical drive 310. The relay device 410 includes a switching control unit 411 and a counter 412.

The switching control unit 411 exclusively controls access from the service processor 120 or the service processor 220 to the optical drive 310. That is, by switching which partition the optical drive 310 is connected to, the switching control unit 411 ensures that only one of the partitions can use the optical drive 310 at a time.
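
The exclusive control can be pictured as a switch that forwards requests only from the partition currently connected to the drive. The class `SwitchingControl`, its methods, and the returned strings are hypothetical; the specification does not describe how a request from the non-connected partition is handled, so returning `None` here is an assumption.

```python
from typing import Optional


class SwitchingControl:
    """Hypothetical model of switching control unit 411."""

    def __init__(self) -> None:
        self.connected: Optional[str] = None   # partition currently connected to the drive

    def switch_to(self, partition: str) -> None:
        # Connecting one partition implicitly disconnects the previous one.
        self.connected = partition

    def forward(self, partition: str, request: str) -> Optional[str]:
        # Only requests from the connected partition reach the optical drive.
        # Assumption: other requests are dropped (modeled as None).
        if partition != self.connected:
            return None
        return f"drive<-{request}"


switch = SwitchingControl()
switch.switch_to("partition_100")
accepted = switch.forward("partition_100", "read")
rejected = switch.forward("partition_200", "read")
```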

The counter 412 is connected to the diagnostic interface 802 and increments its stored value under count-up control from a count control unit outside the relay device 410. It also decrements its stored value by one under count-down control from a count control unit. The counter 412 is used to judge whether the relay device 410 is the suspect in a failure.

Next, the operation of the multiprocessor system of the present invention will be described with reference to the drawings. FIG. 3 is a sequence diagram showing the flow of processing in the multiprocessor system of the present invention when a failure occurs on the path from the service processor 120 to the optical drive 310.

The service processor 120 belonging to the partition 100 issues a request (read or write) to the optical drive 310 and outputs a request signal (S101). The request signal is output from the communication unit 121 in the service processor 120. The service processor 120 then waits for a reply to the request (S102).

On receiving the request signal, the optical drive 310 performs the control and processing specified by the request and returns a reply signal (S103). If there is no failure on the path from the service processor 120 to the optical drive 310, the reply to the request from the service processor 120 is returned from the optical drive 310 to the service processor 120 via the relay device 410. If there is a failure on that path, the request does not reach the optical drive 310, and so no reply to the request is returned.

The failure detection unit 122 in the service processor 120 determines whether a failure has occurred in the communication performed by the communication unit 121 (S104). Specifically, when no reply to the request executed by the communication unit 121 is returned within a predetermined time, the failure detection unit 122 determines that a failure has occurred on the communication path from the service processor 120 to the optical drive 310. When the reply is returned within the predetermined time, it determines that no failure has occurred. When it determines that a failure has occurred, the failure detection unit 122 notifies the count control unit 123 of the failure. When it determines that no failure has occurred, the failure detection unit 122 performs no special processing and prepares for failure detection in the next communication.
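The timeout rule of S104 can be illustrated with a minimal sketch. It assumes, purely for illustration, that replies arrive on an in-process queue; the function name `failure_detected` and the use of `queue.Queue` are hypothetical and are not part of the described hardware.

```python
import queue


def failure_detected(reply_queue: queue.Queue, timeout_s: float) -> bool:
    """Return True when no reply arrives within timeout_s (the S104 rule)."""
    try:
        # Block until a reply is available or the predetermined time elapses.
        reply_queue.get(timeout=timeout_s)
    except queue.Empty:
        return True   # no reply in time: treat as a failure on the path
    return False      # reply arrived in time: no failure


# Hypothetical usage: one path that replies and one that stays silent.
healthy_path = queue.Queue()
healthy_path.put(b"reply")
silent_path = queue.Queue()
```

With these two queues, `failure_detected(healthy_path, 0.01)` reports no failure, while `failure_detected(silent_path, 0.01)` reports a failure after the timeout expires.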

When the failure detection unit 122 determines that the failure has occurred, the count control unit 123 in the service processor 120 performs count-up control of the counter 124 that it manages (S105). The count control unit 123 further uses the diagnostic interface 802 to perform count-up control of the counter 313 in the optical drive 310 and the counter 412 in the relay device 410 (S106). As a result, the counter 124, the counter 313, and the counter 412 each hold the value "1". The value of each counter indicates the number of failures that have occurred on paths passing through the device or component to which that counter belongs.
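As a rough sketch of S105 and S106, the counters can be modeled as entries in a dictionary. The key names and the helper `count_up` are assumptions made for this illustration; in the described system the counters 313 and 412 are reached over the diagnostic interface 802 rather than through shared memory.

```python
from typing import Dict, Iterable

# All counters start at zero before any failure is detected.
counters: Dict[str, int] = {
    "counter_124": 0,  # service processor 120
    "counter_224": 0,  # service processor 220
    "counter_313": 0,  # optical drive 310
    "counter_412": 0,  # relay device 410
}


def count_up(store: Dict[str, int], names: Iterable[str]) -> None:
    """Increment each named counter by one."""
    for name in names:
        store[name] += 1


# S105: the count control unit 123 counts up its own counter 124.
count_up(counters, ["counter_124"])
# S106: it then counts up the counters of the devices on the failed path.
count_up(counters, ["counter_313", "counter_412"])
```

After these two steps the counters 124, 313, and 412 each hold 1, and the counter 224 is still 0, matching the state described above.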

Next, the service processor 120 notifies the service processor 220 of the partition 200 that a failure has occurred on the path from the partition 100 to which the service processor 120 belongs, together with the value of the counter 124 of the service processor 120 (S107). More specifically, the count control unit 123 reads the value stored in the counter 124 in response to a read instruction from a control unit (not shown) and outputs the read count value to that control unit. The control unit generates diagnostic information by combining the value of the counter 124 received from the count control unit 123, the content of the request, information on which path or which device the failed request was addressed to, and the like. The diagnostic information describes the communication failure that occurred on the partition 100 side. On receiving the diagnostic information from the control unit, the communication unit 121 outputs it to the partition 200 using the control/data interface.
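One possible shape for the diagnostic information of S107 is sketched below. The text only states which kinds of information are combined; the field names, the string encoding of the path, and the helper `build_diagnostic_info` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticInfo:
    source_counter_value: int  # value read from the counter 124
    request_kind: str          # content of the failed request, e.g. "read"
    target_device: str         # device the request was addressed to
    failed_path: str           # hypothetical label for the failed path


def build_diagnostic_info(counter_value: int, request_kind: str,
                          target_device: str) -> DiagnosticInfo:
    """Combine the pieces listed for S107 into one record."""
    return DiagnosticInfo(
        source_counter_value=counter_value,
        request_kind=request_kind,
        target_device=target_device,
        failed_path=f"partition_100->{target_device}",
    )


info = build_diagnostic_info(1, "read", "optical_drive_310")
```

The receiving service processor can then recover the request destination and content from `info` in order to repeat the request over its own path.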

On receiving the diagnostic information, the service processor 220 of the partition 200 executes a request to the optical drive 310 (S108). More specifically, the communication unit 221 of the service processor 220 receives the diagnostic information output from the communication unit 121 of the service processor 120 and passes it to a control unit (not shown). The control unit identifies the request destination and the request content from the received diagnostic information and executes the request. When the request is executed, a request signal is output from the communication unit 221 to the optical drive 310. The service processor 220 then waits for a reply to the request (S109).

On receiving the request signal, the optical drive 310 performs the control and processing specified by the request and returns a reply signal (S110). If there is no failure on the path from the service processor 220 to the optical drive 310, the reply to the request executed by the service processor 220 is returned from the optical drive 310 to the service processor 220 via the relay device 410. If there is a failure on that path, the request does not reach the optical drive 310, and so no reply to the request is returned.

The failure detection unit 222 in the service processor 220 determines whether a failure has occurred in the communication that the communication unit 221 performed based on the diagnostic information (S111). Specifically, when no reply signal to the request signal output by the communication unit 221 is returned within a predetermined time, it determines that a failure has also occurred on the communication path from the service processor 220 to the optical drive 310. When the reply signal is returned within the predetermined time, it determines that there is no failure on the communication path from the service processor 220 to the optical drive 310. The failure detection unit 222 generates determination information that combines the determination result, the addresses of the counters of the external devices involved in the communication, and the like, and outputs it to the count control unit 223.

The count control unit 223 performs count-up or count-down control of the counters based on the received determination information. Specifically, when the determination result included in the determination information indicates that there is no failure on the communication path, the count control unit 223 performs count-down control that decrements the counter values of the devices involved in the communication (S112). Here, the count control unit 223 uses the diagnostic interface 802 to count down the counter 313 belonging to the optical drive 310 and the counter 412 belonging to the relay device 410.

Next, the count control unit 223 performs read control to read the values of the counter 313 and the counter 412. The count control unit 223 also reads the value of the counter 224 (S113). The values read from these counters are output to a control unit (not shown).

The control unit compares these counter values with the value of the counter 124 included in the diagnostic information received from the service processor 120 in S107, and thereby identifies the suspected component for the failure that occurred inside the multiprocessor system 1000 (S114). Specifically, as a result of the above series of processing, the counters of the devices in the multiprocessor system 1000 hold the following values: the counter 124 is "1", the counter 224 is "0", the counter 313 is "0", and the counter 412 is "0". Accordingly, the service processor 120, to which the counter 124 with the largest value belongs, is the suspected component for the communication failure.

On the other hand, when the determination result included in the determination information indicates that a failure has occurred on the communication path, the count control unit 223 performs count-up control that increments the counter values of the devices involved in the communication (S115). Here, the count control unit 223 uses the diagnostic interface 802 to count up the counter 313 belonging to the optical drive 310 and the counter 412 belonging to the relay device 410.

Next, the count control unit 223 performs read control to read the values of the counter 313 and the counter 412. The count control unit 223 also reads the value of the counter 224 (S116). The values read from these counters are output to a control unit (not shown). The control unit compares these counter values with the value of the counter 124 included in the diagnostic information received from the service processor 120 in S107, and thereby identifies the suspected component for the failure that occurred inside the multiprocessor system 1000 (S117). Specifically, as a result of the above series of processing, the counters of the devices in the multiprocessor system 1000 hold the following values: the counter 124 is "1", the counter 224 is "1", the counter 313 is "2", and the counter 412 is "2". Accordingly, either the optical drive 310, to which the counter 313 belongs, or the relay device 410, to which the counter 412 belongs, is the suspected component for the communication failure, since these counters hold the largest value. Which of the two is actually at fault may be determined by a comprehensive judgment that combines this result with separately acquired access error logs and other failure information, or both suspected components may simply be replaced.
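The two branches S112 to S114 and S115 to S117 can be sketched together as follows. The function `retry_and_judge` and the dictionary keys are hypothetical names. The sketch also assumes that the service processor 220 counts up its own counter 224 when its retry fails; the text gives the counter 224 as "1" in that branch but does not describe the step explicitly.

```python
from typing import Dict, List

# Counters of the devices on the part of the path shared by both partitions.
SHARED = ["counter_313", "counter_412"]


def retry_and_judge(counters: Dict[str, int], retry_failed: bool) -> List[str]:
    """Apply S112 or S115 to a copy of the counters and return the suspects.

    The suspects are the counters holding the largest value (S114 / S117).
    """
    updated = dict(counters)
    step = 1 if retry_failed else -1
    for name in SHARED:
        updated[name] += step          # S115 (count up) or S112 (count down)
    if retry_failed:
        updated["counter_224"] += 1    # assumption: 220 also counts itself up
    largest = max(updated.values())
    return sorted(name for name, value in updated.items() if value == largest)


# State after S105 and S106, before the service processor 220 retries.
after_s106 = {"counter_124": 1, "counter_224": 0,
              "counter_313": 1, "counter_412": 1}

suspects_if_retry_ok = retry_and_judge(after_s106, retry_failed=False)
suspects_if_retry_fails = retry_and_judge(after_s106, retry_failed=True)
```

When the retry succeeds, only the counter 124 remains at its maximum, which points to the service processor 120. When the retry fails, the counters 313 and 412 tie at the largest value, so the optical drive 310 and the relay device 410 are both suspects and further information is needed to separate them.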

With the above configuration, the suspected components and suspected devices that may have failed can be narrowed down more easily, which reduces the number of parts replaced and shortens the system downtime associated with parts replacement.

In the above description, the countdown is performed in response to the reply to the request made by the service processor 220 that received the diagnostic information, but the invention is not limited to this. The countdown may be performed every time communication performed by the communication unit 221 succeeds, or the countdown may be omitted altogether.

In the above description, the count control unit 223 decrements or increments, in S112 or S115, the counter values of the devices involved in the communication that are connected to the diagnostic interface 802, but the invention is not limited to this. The count control unit 223 may also decrement or increment the value of the counter 224 belonging to the service processor 220. Alternatively, the count control unit 223 may identify the counters to be counted down based on the diagnostic information obtained from a control unit (not shown) and count down only the identified counters. That is, the count control unit 223 may count down only the counters of the devices on the portion where the path from the partition 100 and the path from the partition 200 overlap.

In the above description, the diagnostic information sent from the service processor 120 to the service processor 220 includes the value of the counter 124 in the service processor 120, but the invention is not limited to this. The value of the counter 124 may instead be read together with the values of the counter 313 and the counter 412 when the read control is performed in S113 or S116.

In the above description, the counter belonging to each device is directly connected to the diagnostic interface 802, but the invention is not limited to this. For example, each device may include a control unit that counts its own counter up and down, with that control unit connected to the diagnostic interface. In this case, these in-device control units may increment or decrement the counter values based on count-up and count-down instruction signals sent over the diagnostic interface from the count control unit 123 or the count control unit 223.

In the above description, a counter is arranged inside each of the external devices, namely the optical drive and the relay device, but the invention is not limited to this. A separate storage unit that stores the count value of each device in the system may be provided instead. The storage unit is connected to the diagnostic interface and is divided into areas, each storing the count value of one device. The count control unit may then use the diagnostic interface to change the count value of the relevant device in the storage unit.
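A minimal sketch of this centralized variant is given below, assuming one storage area per device keyed by a device identifier. The class `CountStore`, its method names, and the identifiers are hypothetical; access over the diagnostic interface is modeled as plain method calls.

```python
from typing import Dict, Iterable


class CountStore:
    """Storage unit holding one count-value area per device."""

    def __init__(self, device_ids: Iterable[str]) -> None:
        self._areas: Dict[str, int] = {device: 0 for device in device_ids}

    def count_up(self, device_id: str) -> None:
        self._areas[device_id] += 1

    def count_down(self, device_id: str) -> None:
        self._areas[device_id] -= 1

    def read(self, device_id: str) -> int:
        return self._areas[device_id]


store = CountStore(["optical_drive_310", "relay_device_410"])
store.count_up("optical_drive_310")
store.count_up("relay_device_410")
store.count_down("relay_device_410")
```

Here the area for the optical drive 310 ends at 1 and the area for the relay device 410 returns to 0, so the devices themselves need no counter hardware.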

The shared device is not limited to an optical drive; various devices such as an auxiliary storage device or a shared interface may be used.

In the above description, the count control unit increments and decrements counter values, but the invention is not limited to this. Each counter may be a flag storage unit that stores either "0" or "1". In this case, the count control unit becomes a flag control unit, which raises or lowers the flags of the flag storage units on the communication path based on the detection result of the failure detection unit.
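The flag variant might be sketched as follows, under the assumption that a detected failure raises the flag of every device on the path and a successful communication lowers it. The function `set_flags` and the device identifiers are hypothetical.

```python
from typing import Dict, Iterable


def set_flags(flags: Dict[str, bool], path: Iterable[str],
              failure_detected: bool) -> None:
    """Raise (True) or lower (False) the flag of every device on the path."""
    for device in path:
        flags[device] = failure_detected


flags = {
    "service_processor_120": False,
    "relay_device_410": False,
    "optical_drive_310": False,
}

# Failure on the path from the partition 100: raise all flags on that path.
set_flags(flags, ["service_processor_120", "relay_device_410",
                  "optical_drive_310"], failure_detected=True)
# Successful retry from the partition 200: lower the flags on the shared part.
set_flags(flags, ["relay_device_410", "optical_drive_310"],
          failure_detected=False)

suspects = [device for device, raised in flags.items() if raised]
```

In this scenario only the flag of the service processor 120 remains raised. Unlike the counter variant, a single flag cannot record how many failures have accumulated on a device.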

The terms "suspected device" and "suspected component" in the above description both refer to the location that is presumed to be the cause of a failure when a failure occurs.

The above description is only one example of a multiprocessor system, and various modifications are possible. FIG. 4 is a block diagram of a modification of the multiprocessor system according to the present invention. In the multiprocessor system of FIG. 4, the hardware is divided into N partitions, from the partition 100-A to the partition 100-N. Each partition includes a plurality of processor modules and a service processor that assists these processor modules. The partitions are connected to L shared devices, from the shared device 310-A to the shared device 310-L, via M relay devices, from the relay device 410-A to the relay device 410-M. In the figure, the solid lines represent the control/data interface 801 and the dash-dot lines represent the diagnostic interface 802. The system may be extended in this way.

In addition, the present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit. For example, the following configurations can be adopted.

(1) A multiprocessor system comprising a plurality of hardware-divided partitions and a shared device accessed from the plurality of partitions, wherein the plurality of partitions and the shared device are connected by a first interface used for data communication and a second interface used for failure detection; the plurality of partitions and the shared device each comprise counting means for counting failures; and the plurality of partitions comprise control means that, when a failure occurs in the data communication, performs control to change the count values of the counting means of the partition involved in the data communication and of the shared device.
(2) A control device comprising: communication means for communicating with an external device using a first interface; detection means for detecting that a failure has occurred in the communication; first counting means for counting failures detected by the detection means; and count control means that, when a failure is detected by the detection means, performs control to count up the first counting means and performs, using a second interface, control to count up the counting means of the external device.
(3) The control device according to (2), wherein the count control means performs, using the second interface, control to count down the counting means of the external device when no failure is detected by the detection means.
(4) The control device according to (2), wherein the communication means communicates with the external device via a relay device, and the count control means, when a failure is detected by the detection means, performs control to count up the first counting means and performs, using the second interface, control to count up the counting means of the external device and of the relay device.
(5) A control device comprising: communication means for receiving information on a communication failure from another control device and communicating with an external device using a first interface based on the received information; detection means for detecting that a failure has occurred in the communication performed by the communication means based on the information on the communication failure; and count control means that performs, using the second interface, control to count up the counting means of the external device when a failure is detected in that communication, and performs, using the second interface, control to count down the counting means of the external device when no failure is detected in that communication.
(6) The control device according to (5), further comprising determination means for determining a suspected device, wherein the count control means further performs control to read count values from the counting means connected to the second interface, and the determination means determines the suspected device based on the count values read by the count control means.
(7) The control device according to (6), wherein the information on the communication failure includes a count value counted by the counting means of the other control device, and the determination means determines the suspected device by comparing the count value included in the information on the communication failure with the count values read by the count control means.
(8) The control device according to any one of (2) to (7), wherein the communication means communicates with the external device by outputting a request signal and receiving a reply signal to the request signal, and the detection means determines that a failure has occurred in the communication when no reply signal to the request signal is received within a predetermined time after the request signal is output.
(9) A failure detection method comprising: a communication step of communicating with an external device using a first interface; a detection step of detecting that a failure has occurred in the communication; a counting step of counting failures detected in the detection step; and a count control step of performing, using a second interface, control to count up the counting means of the external device when the failure is detected.
(10) The failure detection method according to (9), further comprising, after the count control step: a reading step of reading count values from the counting means connected to the second interface; and a determination step of determining a suspected device based on the count values read in the reading step.
(11) The control device according to (6) or (7), wherein the communication means receives from the external device, using the first interface, information on a failure detected by detection means of the external device itself, and the determination means determines the suspected device from the information on the failure and the read count values.
(12) A failure processing method comprising: an input step of receiving information on a communication failure from another communication device; a communication step of communicating with a predetermined external device using a first interface based on the information received in the input step; a determination step of determining whether a failure has occurred in the communication; and a count control step of performing, using a second interface, control to count up the counting means of the external device involved in the communication when it is determined in the determination step that a failure has occurred in the communication, and control to count down that counting means when it is determined that no failure has occurred in the communication.
(13) The failure processing method according to (12), further comprising, after the count control step: a reading step of reading count values from the counting means connected to the second interface; and a suspected device determination step of determining a suspected device based on the count values read in the reading step.
(14) The control device according to any one of (2) to (7), wherein the communication means relays a request signal from an external processor.

100 Partition
101 Communication module
102 Failure detection module
103 Count control module
104 Count module
110 Processor module
120 Service processor
121 Communication unit
122 Failure detection unit
123 Count control unit
124 Counter
200 Partition
201 Communication module
202 Failure detection module
203 Count control module
204 Count module
210 Processor module
220 Service processor
221 Communication unit
222 Failure detection unit
223 Count control unit
224 Counter
300 Partition
301 Communication module
302 Count module
310 Optical drive
311 Communication unit
312 Failure detection unit
313 Counter
400 Motherboard
410 Relay device
411 Switching control unit
412 Counter
801 Data interface
802 Diagnostic interface
1000 Multiprocessor system

Claims

A multiprocessor system comprising:
a plurality of hardware-divided partitions; and
a shared device accessed from the plurality of partitions, wherein
the plurality of partitions and the shared device are connected by a first interface used for data communication and a second interface used for failure detection,
the plurality of partitions and the shared device each comprise counting means for counting failures, and
the plurality of partitions comprise control means that, when a failure occurs in the data communication, performs control to change the count values of the counting means of the partition involved in the data communication and of the shared device.

A communication means for communicating with an external device using the first interface;
Detecting means for detecting that a failure has occurred in the communication;
First counting means for counting faults detected by the detecting means;
Count control means for performing control to count up the first count means when a failure is detected by the detection means, and for performing control to count up the count means of the external device using a second interface;
A control device comprising:

The count control means performs control to count down the count means of the external device using the second interface when no failure is detected by the detection means.
The control device according to claim 2.

The communication means communicates with the external device via a relay device,
The count control means performs control to count up the first count means when a failure is detected by the detection means, and performs control to count up the count means of the external device and the relay device using the second interface,
The control device according to claim 2.

Communication means for inputting information related to a communication failure from another control device, and communicating with an external device using the first interface based on the information related to the input communication failure;
Detecting means for detecting that a failure has occurred in communication performed by the communication means based on information on the communication failure;
Count control means for performing, using the second interface, control to count up the counting means of the external device when a failure is detected in the communication performed by the communication means based on the information related to the communication failure, and for performing, using the second interface, control to count down the counting means of the external device when no failure is detected in that communication;
A control device comprising:

It further comprises determination means for determining the suspected device,
The count control means further performs control to read a count value from the count means connected to the second interface,
The determination means determines the suspected device based on the count value read by the count control means;
The control device according to claim 5.

The information related to the communication failure includes a count value counted by the counting means of the other control device,
The determination unit determines the suspected device by comparing the count value included in the information regarding the communication failure and the count value read by the count control unit;
The control device according to claim 6.

The communication means outputs a request signal and communicates with the external device by inputting a reply signal for the request signal,
The detection means determines that a failure has occurred in the communication when a reply signal to the request signal is not input within a predetermined time after the request signal is output.
The control device according to claim 2.

A communication step of communicating with an external device using the first interface;
A detecting step for detecting that a failure has occurred in the communication;
A counting step for counting faults detected in the detection step;
A count control step of performing control to count up the counting means of the external device using the second interface when the failure is detected;
A fault detection method comprising:

A reading step of reading a count value from the counting means connected to the second interface after the count control step;
A determination step of determining the suspected device based on the count value read in the reading step;
The fault detection method according to claim 9, further comprising: