JP4584124B2

JP4584124B2 - Information processing apparatus, error processing method thereof, and control program

Info

Publication number: JP4584124B2
Application number: JP2005337972A
Authority: JP
Inventors: 富子山田
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2005-11-24
Filing date: 2005-11-24
Publication date: 2010-11-17
Anticipated expiration: 2025-11-24
Also published as: JP2007148467A

Description

The present invention relates to an information processing apparatus, an error processing method thereof, and a control program, and more particularly to an information processing apparatus that manages a transaction from the time it accepts the transaction issued by a processor or the like until it returns a reply by out-of-order processing, and to an error processing method and a control program therefor.

Information processing apparatuses that manage a transaction from the time a transaction issued by a processor or the like is accepted until a reply is returned are known.

There are two methods of managing such transactions. One returns replies in the order in which the transactions were issued. It presupposes FIFO (first-in, first-out) operation, so the transaction issue order and the reply order match (see, for example, Patent Documents 1 and 2).

The other is a method in which the transaction issue order and the reply order do not have to match; this is the out-of-order processing mentioned above.

In conventional transaction management, when a failure occurs in data, the failure is recovered by an error correction function such as ECC (error-correcting code).

However, when a failure occurs in the address control system, coherency must be guaranteed and data corruption must be prevented, so operation cannot continue and the system is brought down.

Moreover, even where operation could be continued by performing a retry, if a specific part of a buffer is faulty (hereinafter such a failure is referred to as a "fixed failure"), the retry may fail and bring the system down.

When out-of-order transactions are processed, multiple transactions are generally processed in an overlapped manner to improve performance. For this reason, the part that performs out-of-order processing uses a large number of buffers composed of hard macros and registers.

Consequently, a failure at even one location in these buffers leads to a system down, so where higher reliability is required, measures for continuing operation are necessary.

The inventions described in Patent Documents 1 and 2 relate to an apparatus that, for speed, issues a subsequent write transaction without waiting for the preceding write transaction to complete. When the preceding write transaction is retried, all subsequent write transactions are also retried in order to preserve the ordering guarantee for subsequent write transactions issued from the same source.

As another invention, a memory access control device is disclosed that holds all accepted transaction information in a buffer within the system control function until its processing completes and, on a failure, reissues the held transaction within the system control, thereby preventing a system down (see, for example, Patent Document 3).

As yet another invention, an apparatus is disclosed that, on receiving an interrupt, processes the interrupt first if the transaction being processed is retryable and then retries that transaction (see, for example, Patent Document 4).

Patent Document 1: JP 2001-216259 A (paragraph 0008, FIG. 1)
Patent Document 2: JP 2002-108836 A (paragraph 0008, FIG. 1)
Patent Document 3: Japanese Laid-Open Patent Publication No. 01-140357 (page 3, upper left column, lines 8 to 14, FIG. 1)
Patent Document 4: JP 2002-091939 A (paragraph 0014, FIG. 1)

The inventions described in Patent Documents 1 and 2 concern transaction management in which the transaction issue order and the reply order match, and they presuppose FIFO operation. For that reason they retry all subsequent write transactions in order to preserve the ordering guarantee for subsequent write transactions issued from the same source.

In contrast, the present invention relates to out-of-order transaction management. In out-of-order transaction management, unlike FIFO, it is not necessarily required to retry every subsequent transaction issued from the same source in order to guarantee ordering. The present invention aims to reduce system downs by retrying the transaction that was being processed at the location where the failure occurred.

Therefore, the inventions described in Patent Documents 1 and 2 differ entirely from the present invention in object, configuration, and effect.

The invention described in Patent Document 3 does not address coherency guarantees or countermeasures against fixed failures in buffers, so it is difficult for it to avoid a system down while guaranteeing coherency, or to avoid frequent retries caused by a fixed failure in a buffer.

The invention described in Patent Document 4 aims to speed up interrupt processing, and its object differs entirely from that of the present invention.

Accordingly, an object of the present invention is to provide an information processing apparatus, an error processing method thereof, and a control program that, when a failure occurs in transaction management, can continue the processing from transaction acceptance to reply return while maintaining the coherency guarantee, and that can reduce system downs caused by fixed failures in a buffer.

To solve the above problem, an information processing apparatus according to the present invention manages a transaction from the time it accepts the transaction from an issuer until it returns a reply by out-of-order processing, and is characterized by including transaction control means that detects failures relating to transaction management in a buffer storing accepted transactions, distinguishing retryable failures from non-retryable failures, and that, when a retryable failure is detected, retries the affected transaction and also retries other transactions that have a coherency relationship with it.

An error processing method according to the present invention is an error processing method in an information processing apparatus that manages a transaction from the time it accepts the transaction from an issuer until it returns a reply by out-of-order processing, and is characterized by including a transaction control step of detecting failures relating to transaction management in a buffer storing accepted transactions, distinguishing retryable failures from non-retryable failures, and, when a retryable failure is detected, retrying the affected transaction and also retrying other transactions that have a coherency relationship with it.

A control program according to the present invention is a control program for an error processing method in an information processing apparatus that manages a transaction from the time it accepts the transaction from an issuer until it returns a reply by out-of-order processing, and is characterized by causing a computer to execute transaction control processing that detects failures relating to transaction management in a buffer storing accepted transactions, distinguishing retryable failures from non-retryable failures, and that, when a retryable failure is detected, retries the affected transaction and also retries other transactions that have a coherency relationship with it.

According to the present invention, a failure in the buffer storing accepted transactions is detected, and the affected transaction is retried together with other transactions that have a coherency relationship with it.

Because the present invention includes the above configuration, when a failure occurs in transaction management, the processing from transaction acceptance to reply return can continue while the coherency guarantee is maintained, and system downs caused by fixed failures in the buffer can be reduced.

Specifically, within the system control unit, which manages transactions from the time a transaction issued by a processor or the like is accepted until a reply is returned by out-of-order processing, operation can continue with coherency maintained even when a failure occurs in a specific part of a buffer composed of hard macros and registers, and degraded operation is also performed, so failures that lead immediately to a system down can be reduced. The invention is therefore useful as a countermeasure against intermittent hardware failures and fixed failures in specific buffer entries. In addition, a failure that occurs anywhere other than in the information used when retrying to the processor or the like, such as the transaction identifier, is likely to allow continued operation, so the range of hardware failures that can be recovered is widened.

Embodiments of the present invention are described below with reference to the accompanying drawings. FIG. 1 is a configuration diagram of an example of an information processing apparatus according to the present invention. Referring to the figure, the example apparatus includes a system control unit 1, a plurality of processors 2 (2-1 to 2-n, where n is a positive integer), a plurality of devices 3 (3-1 to 3-m, where m is a positive integer), a first memory 4, and a second memory 5.

The processors 2, the devices 3, the first memory 4, and the second memory 5 are connected to the system control unit 1.

The system control unit 1 accepts transactions from the processors 2, the first memory 4, or the devices 3 and performs control.

This embodiment describes an example of the processing from acceptance of a transaction issued by a processor 2 until a reply is returned by out-of-order processing. The invention is not limited to transactions issued by the processors 2, however, and can also be applied to transactions issued by other components such as the first memory 4 or the devices 3.

The system control unit 1 includes an out-of-order transaction control unit 11 (hereinafter referred to as the transaction control unit 11) and a subsequent retry unit 12. The transaction control unit 11 includes a buffer 21, a failure detection unit 22, a reply control unit 23, and a failure counter monitoring unit 24.

The failure detection unit 22 includes a retryable failure unit 31 and a non-retryable failure unit 32, and the reply control unit 23 includes a retry unit 41.

The retryable failure unit 31 of the failure detection unit 22 exchanges signals with the buffer 21 and sends a signal based on what it receives to the subsequent retry unit 12 and to the retry unit 41 in the reply control unit 23. The non-retryable failure unit 32 only receives signals from the buffer 21.

Signals from the buffer 21 are sent to the retry unit 41 of the reply control unit 23, and signals from the reply control unit 23 are sent to the subsequent retry unit 12.

The buffer 21 and the failure counter monitoring unit 24 exchange signals with each other.

The failure detection unit 22 detects failures in the system control unit 1, distinguishing them by means of the retryable failure unit 31 and the non-retryable failure unit 32. When a failure occurs, the retry unit 41 of the reply control unit 23 issues, to the processors 2-1 to 2-n, retries for the transaction in which the failure occurred and for other transactions being processed.

The subsequent retry unit 12 retries subsequent transactions issued by the processors until retry processing of the transaction in which the failure was detected is complete. The failure counter monitoring unit 24 monitors a failure counter 52, described later, provided in the buffer 21.

Next, an example of the configuration of the buffer 21 is described. FIG. 2 shows an example of the configuration of the buffer 21. Referring to the figure, the buffer 21 includes a plurality of entries (entries 0 to p, where p is a positive integer), and each entry includes an in-use flag 50, information 51, and a failure counter 52. One entry corresponds to one transaction.
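
The entry layout of FIG. 2 can be pictured with a minimal Python sketch. The class names `Entry` and `Buffer`, the field names, and the `allocate` and `release` helpers are assumptions made for illustration; the patent describes hardware built from hard macros and registers and does not define a software interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """One buffer entry; one entry corresponds to one transaction."""
    in_use: bool = False         # in-use flag 50
    info: Optional[dict] = None  # information 51 (contents are hypothetical)
    failure_count: int = 0       # failure counter 52


class Buffer:
    """Buffer 21 modelled as a fixed-size list of entries (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(size)]

    def allocate(self, info: dict) -> Optional[int]:
        """Store a transaction in the first free entry; return its index, or None if full."""
        for index, entry in enumerate(self.entries):
            if not entry.in_use:
                entry.in_use = True
                entry.info = info
                return index
        return None

    def release(self, index: int) -> None:
        """Free an entry once its reply has been returned; the failure counter is kept."""
        entry = self.entries[index]
        entry.in_use = False
        entry.info = None
```

Because allocation only ever looks at the in-use flag, an entry whose flag is left set is never handed out again, which is the property the degradation scheme described later relies on.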

The failure counter 52 stores the number of failures that have occurred. To prevent a system down caused by frequent retries under a fixed failure, the failure counter monitoring unit 24 monitors the count held in the failure counter 52. If the number of failures in a fixed period is within the tolerated number, it resets the failure counter 52; if the number exceeds the tolerated range, it treats the failure as a fixed failure and degrades the buffer 21, that is, takes the affected part out of service.

The transaction control unit 11 manages each transaction by a unique identifier from the time it accepts the transaction issued by a processor 2 until it returns the reply by out-of-order processing.

When a failure occurs in the multi-entry buffer 21 in the transaction control unit 11, the functions that issue a retry to the processor 2 and recover from the failure while maintaining coherency are configured as follows.

The function of issuing a retry to the processor on a failure is made up of the retryable failure unit 31 and the non-retryable failure unit 32 of the failure detection unit 22, the retry unit 41, and the subsequent retry unit 12. The degradation function for failures is made up of the failure counter 52 in the buffer 21 and the failure counter monitoring unit 24.

The failure detection unit 22 for the buffer 21 is divided into the retryable failure unit 31, which handles failures for which operation can continue by returning a retry to the processor 2, and the non-retryable failure unit 32, which handles failures for which continued operation is impossible.

An example of a failure detected by the non-retryable failure unit 32 is a failure in the transaction identifier issued by the processor 2. In out-of-order processing, the reply is returned to the processor 2 using the identifier that was attached when the processor 2 issued the transaction, so if that identifier is faulty, the retry cannot be delivered correctly.
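
The split between retryable and non-retryable failures can be sketched as a classification on which part of an entry is faulty. This is a minimal Python illustration, not the patented circuit: the field name `transaction_id`, the set `REPLY_CRITICAL_FIELDS`, and the function `classify_failure` are hypothetical, and the only rule taken from the text is that a faulty transaction identifier cannot be retried.

```python
from enum import Enum


class FailureClass(Enum):
    RETRYABLE = "retryable"          # handled by retryable failure unit 31
    NON_RETRYABLE = "non_retryable"  # handled by non-retryable failure unit 32


# Fields without which a retry cannot be addressed to the issuer.
# Only the transaction identifier is named in the text; the field name is an assumption.
REPLY_CRITICAL_FIELDS = frozenset({"transaction_id"})


def classify_failure(faulty_field: str) -> FailureClass:
    """Classify a detected failure by the entry field in which it occurred."""
    if faulty_field in REPLY_CRITICAL_FIELDS:
        return FailureClass.NON_RETRYABLE
    return FailureClass.RETRYABLE
```

For example, `classify_failure("address")` yields `FailureClass.RETRYABLE`, while a fault in `"transaction_id"` is non-retryable because the retry could not be routed back to the right transaction.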

The retry unit 41 resides in the reply control unit 23, which responds to transactions accepted from the processor 2. When the retryable failure unit 31 of the failure detection unit 22 detects a failure in the buffer 21, the retry unit 41 is notified by the retryable failure unit 31 and issues a retry, as one kind of reply, to the processor 2 for the transaction held in the buffer 21 entry where the failure was detected.

If the transaction in which the failure was detected is one that requires a coherency guarantee, the retry unit 41 also issues retries to the processor 2 for other transactions stored in the buffer 21 in order to maintain coherency.

The subsequent retry unit 12 is a function for retrying transactions newly issued by the processor 2 between the occurrence of a failure and the end of retry processing for the transaction in which the failure was detected. It is enabled when the failed transaction requires a coherency guarantee and is released when retry processing of the failed transaction ends.

While it is enabled, the subsequent retry unit 12 guarantees coherency as follows: a new transaction issued by the processor 2 is accepted and processed normally if it is not subject to the coherency guarantee, and is retried if it is.

A failure counter 52 is provided for each entry of the buffer 21 and is incremented by one each time a failure occurs. The failure counter monitoring unit 24 monitors the failure counters 52 in the buffer 21. The number of failures at which an entry is to be degraded is set in advance, and when the value of a failure counter 52 exceeds that number, the entry is degraded.

The number of failures tolerated within a fixed period is also set, and the failure counter 52 is reset when its value is at or below that number. An entry is degraded by setting its in-use flag 50 to "in use" and storing in its information 51 invalid data that does not affect other control, for example all zeros. Because an entry whose in-use flag 50 is set to "in use" is never newly selected, a function equivalent to degradation is obtained without any special circuit.

Next, the error processing method of the system control unit 1 is described in detail. FIG. 3 is a flowchart showing an example of the processing for the transaction in which a failure is detected.

Referring to FIG. 3, the failure detection unit 22 detects a failure in the buffer 21 (S101). If the failure is detected by the retryable failure unit 31 ("YES" at S102), the system is not brought down. Instead, the failure counter 52 of the buffer 21 entry in which the failure occurred is incremented (S103), and the retry unit 41 reads the information needed to issue a retry to the processor 2 or the like and performs retry processing (S104).

When the retry has been issued to the processor 2, if the subsequent retry unit 12 is enabled ("YES" at S105), the subsequent retry unit 12 is released (S106) and processing ends.

How the entry of a transaction whose retry processing has finished is reused depends on the purpose of the buffer and the location of the failure: the entry may be usable immediately, may be reset and used after a fixed period, or may be used after other buffers are also reset.

If the failure is not detected by the retryable failure unit 31 ("NO" at S102), that is, if it is detected by the non-retryable failure unit 32, the system is brought down (S109).

When the retryable failure unit 31 detects a failure ("YES" at S102), it is also checked whether the transaction in the buffer 21 entry where the failure was detected is subject to the coherency guarantee (S107).

Unless it can be determined that the transaction is clearly not subject to the coherency guarantee, it is treated as subject to the coherency guarantee ("YES" at S107) and the subsequent retry unit 12 is enabled (S108).

An example of a case in which it cannot be determined that the transaction is clearly not subject to the coherency guarantee is a failure in the very information used to decide whether the transaction is subject to the guarantee.

If the subsequent retry unit 12 is not enabled ("NO" at S105), or if it can be determined that the transaction is clearly not subject to the coherency guarantee ("NO" at S107), processing ends.
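
The FIG. 3 flow for the failed transaction can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the names `Controller`, `Entry`, `handle_failure`, and `SystemDown` are hypothetical, the event list stands in for signals sent to the processor, and the relative order of S107/S108 and S103/S104 is an assumption, since the text describes them as parallel branches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class SystemDown(Exception):
    """Raised for a failure that cannot be retried (S109)."""


@dataclass
class Entry:
    txn_id: int
    # True or False when known; None when the deciding information is itself faulty.
    coherency_target: Optional[bool]
    failure_count: int = 0  # failure counter 52


@dataclass
class Controller:
    subsequent_retry_enabled: bool = False           # state of subsequent retry unit 12
    events: List[str] = field(default_factory=list)  # trace of actions, for illustration


def handle_failure(ctrl: Controller, entry: Entry, retryable: bool) -> None:
    """Process one detected failure (S101) for the entry in which it occurred."""
    if not retryable:                                # S102 "NO"
        raise SystemDown(f"non-retryable failure in transaction {entry.txn_id}")  # S109
    entry.failure_count += 1                         # S103
    if entry.coherency_target is not False:          # S107: anything not clearly "no"
        ctrl.subsequent_retry_enabled = True         # S108
        ctrl.events.append("enable_subsequent_retry")
    ctrl.events.append(f"retry:{entry.txn_id}")      # S104
    if ctrl.subsequent_retry_enabled:                # S105
        ctrl.subsequent_retry_enabled = False        # S106
        ctrl.events.append("release_subsequent_retry")
```

Treating `None` the same as `True` at S107 reflects the conservative rule in the text: a transaction is assumed to need the coherency guarantee unless it can clearly be shown not to.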

FIG. 4 is a flowchart showing an example of the processing applied, when the retryable failure unit 31 detects a failure in the buffer 21, to transactions under out-of-order processing other than the transaction in which the failure was detected.

Referring to FIG. 4, when the retryable failure unit 31 detects a failure (S201), the retry unit 41 checks whether the transaction in which the failure was detected is subject to the coherency guarantee (S202).

If the transaction in which the failure was detected is not subject to the coherency guarantee ("NO" at S202), the out-of-order transactions being processed continue with normal processing (S205).

If the transaction in which the failure was detected is subject to the coherency guarantee ("YES" at S202), every out-of-order transaction being processed is checked to see whether it is subject to the coherency guarantee (S203). A transaction that is not subject to the guarantee ("NO" at S203) continues with normal processing (S205); a transaction that is subject to the guarantee ("YES" at S203) undergoes retry processing (S204).

For retry processing (S204), a retry flag (not shown) is provided in the information 51 of each entry of the buffer 21. When a failure occurs and a transaction is determined to need retry processing, the retry flag of that transaction's entry is set. Once the conditions needed to send the normal reply are satisfied, the reply is replaced with a retry, the retry is issued to the processor, and processing ends. If data accompanies the reply, the data is discarded and processing ends.
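
The FIG. 4 handling of the other in-flight transactions, including the deferred substitution of a retry for the normal reply, can be sketched as follows. The names `InFlight`, `mark_for_retry`, and `build_reply`, and the tuple used to represent a reply, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class InFlight:
    txn_id: int
    coherency_target: bool
    retry_flag: bool = False      # retry flag kept in information 51
    data: Optional[bytes] = None  # reply data, if any


def mark_for_retry(failed_is_coherency_target: bool, in_flight: Iterable[InFlight]) -> None:
    """Set the retry flag on in-flight transactions that must be retried."""
    if not failed_is_coherency_target:   # S202 "NO": everything continues normally (S205)
        return
    for txn in in_flight:
        if txn.coherency_target:         # S203 "YES"
            txn.retry_flag = True        # S204: retry is issued later, at reply time


def build_reply(txn: InFlight) -> Tuple[str, int, Optional[bytes]]:
    """Called once the conditions for the normal reply are satisfied."""
    if txn.retry_flag:
        return ("RETRY", txn.txn_id, None)   # reply replaced by a retry; data discarded
    return ("REPLY", txn.txn_id, txn.data)
```

Deferring the substitution to reply time means a flagged transaction follows its normal path through the pipeline, and only the final reply differs.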

FIG. 5 is a flowchart showing an example of the processing when a new transaction is issued by the processor 2 while a failure detected by the retryable failure unit 31 is being handled. When a transaction is accepted from the processor 2 (S301), it is checked whether the subsequent retry unit 12 is enabled (S302).

If the subsequent retry unit 12 is disabled ("NO" at S302), the transaction is accepted and normal processing such as out-of-order processing is performed (S305).

If the subsequent retry unit 12 is enabled ("YES" at S302), it is checked whether the transaction received from the processor 2 is subject to the coherency guarantee (S303). If it is ("YES" at S303), retry processing is performed (S304); if it is not ("NO" at S303), the transaction is accepted and normal processing such as out-of-order processing continues (S305).
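
The admission decision of FIG. 5 reduces to a two-input rule, shown here as a minimal Python sketch. The function name `admit` and the string results are hypothetical.

```python
def admit(subsequent_retry_enabled: bool, is_coherency_target: bool) -> str:
    """Decide what to do with a newly received transaction (S301)."""
    if not subsequent_retry_enabled:   # S302 "NO"
        return "accept"                # S305: normal out-of-order processing
    if is_coherency_target:            # S303 "YES"
        return "retry"                 # S304
    return "accept"                    # S303 "NO" -> S305
```

Only the combination of an enabled subsequent retry unit and a transaction subject to the coherency guarantee results in a retry; every other combination is accepted, which keeps unrelated traffic flowing while the failure is being handled.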

FIG. 6 is a flowchart showing an example of the monitoring of the failure counters 52 of the buffer 21. The failure counter monitoring unit 24 checks the failure counter 52 in each entry of the buffer 21 at fixed intervals (S401). If the value of a failure counter 52 exceeds the degradation threshold ("YES" at S402), the corresponding entry of the buffer 21 is degraded (S405).

This prevents frequent retries caused by a fixed failure. The degradation threshold is the number of failures at which degradation begins, and it can be set to any value to suit the requirements of the application.

An entry is degraded by setting the in-use flag 50 of the buffer 21 entry so that the entry is never newly used and, at the same time, writing invalid data such as all zeros into the entry's information 51 so that other control is not affected.

If the failure counter 52 is at or below the degradation threshold ("NO" at S402), its value is compared with the resettable count, which is set as the number of failures tolerated within the fixed period (S403). If the value is at or below the resettable count ("NO" at S403), the failure counter 52 is reset (S404). This prevents buffer capacity from shrinking because entries are degraded on account of intermittent failures.

On the other hand, when the resettable count is exceeded ("YES" in S403), the failure counter 52 is not reset.
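The monitoring flow of FIG. 6 (S401 to S405) can be sketched as a single periodic pass over the buffer entries. This is only an illustrative sketch: the `Entry` class, the `monitor_pass` function, and the parameter names `degeneration_threshold` and `resettable_count` are hypothetical, and the sketch assumes the resettable count is configured to be smaller than the degeneration threshold count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """Hypothetical model of one entry of the buffer 21."""
    in_use: bool = False   # in-use flag 50
    info: int = 0          # information 51 (modeled as an integer; an assumption)
    fault_count: int = 0   # failure counter 52


def monitor_pass(entries: List[Entry],
                 degeneration_threshold: int,
                 resettable_count: int) -> List[int]:
    """Run one monitoring pass (S401) and return the indices degenerated by it."""
    degenerated = []
    for index, entry in enumerate(entries):
        if entry.fault_count > degeneration_threshold:  # S402: "YES"
            entry.in_use = True                         # S405: keep the entry unusable
            entry.info = 0                              # S405: write invalid all-zero data
            degenerated.append(index)
        elif entry.fault_count <= resettable_count:     # S403: "NO"
            entry.fault_count = 0                       # S404: forgive intermittent faults
        # S403 "YES": the counter is left as is and keeps accumulating
    return degenerated
```

With a degeneration threshold of 4 and a resettable count of 2, an entry with five failures is degenerated, an entry with one failure has its counter cleared, and an entry with three failures keeps its counter so that further failures can push it over the threshold.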

Next, the control program for the error processing method will be described in detail. FIG. 1 shows a first memory 4 and a second memory 5. The first memory 4 stores a control program or data for performing processing other than the error processing according to the present invention. The second memory 5, on the other hand, stores a control program for performing the error processing according to the present invention.

The control program for performing the error processing according to the present invention is a program for causing a computer (the system control unit 1 of FIG. 1 in this embodiment) to execute the processing shown in the flowcharts of FIGS. 3 to 6.

The system control unit 1 reads the control program for performing the error processing from the second memory 5 and controls the transaction control unit 11 and the subsequent retry unit 12 according to that program. The details of this control have already been described, so the description is not repeated here.

In this embodiment, the error processing method is implemented by a control program, but it can also be implemented by a hardware sequence.

The present invention is effective in a system in which a processor has a retry function and out-of-order transactions are processed, particularly when a large number of buffers are used, and it is expected to be used in information processing apparatuses that require high reliability. It is applicable not only to transactions from a processor but also to transactions from an I/O card or a memory, provided that the issuer has a retry function and out-of-order processing is performed.

FIG. 1 is a block diagram of an example of the information processing apparatus according to the present invention.
FIG. 2 is a diagram illustrating an example of the configuration of the buffer 21.
FIG. 3 is a flowchart illustrating an example of processing for a transaction in which a failure has been detected.
FIG. 4 is a flowchart illustrating an example of processing for transactions, other than the transaction in which the failure was detected, that are undergoing out-of-order processing when a failure is detected by the retryable failure unit 31 of the buffer 21.
FIG. 5 is a flowchart illustrating an example of processing when a new transaction is issued from the processor 2 while a failure detected by the retryable failure unit 31 is being processed.
FIG. 6 is a flowchart illustrating an example of the operation of monitoring the failure counter 52 of the buffer 21.

Explanation of symbols

1 System control unit
2 Processor
3 Device
4 First memory
5 Second memory
11 Out-of-order transaction control unit
12 Subsequent retry unit
23 Reply control unit
24 Failure counter monitoring unit
31 Retryable failure unit
32 Non-retryable failure unit
41 Retry unit
50 In-use flag
51 Information
52 Failure counter

Claims

An information processing apparatus that manages a transaction from when the transaction is received from an issuer until a reply is returned by out-of-order processing, the apparatus comprising:
transaction control means that detects a failure related to transaction management in a buffer storing accepted transactions while distinguishing a retryable failure from a non-retryable failure, and that, when a retryable failure is detected, retries the transaction and also retries another transaction having a coherency relationship with that transaction.

The information processing apparatus according to claim 1, wherein another transaction having a coherency relationship with the transaction related to the failure detection is a transaction being processed when the failure occurs.

The information processing apparatus according to claim 1, wherein the other transaction having a coherency relationship with the transaction related to the failure detection is a newly issued transaction.

The information processing apparatus according to claim 1, wherein the buffer is provided with a failure counter that records the number of failure occurrences of an entry corresponding to a transaction from the issuer, and the transaction control means degenerates the entry when the value of the failure counter exceeds a certain threshold.

An error processing method in an information processing apparatus that manages a transaction from when the transaction is received from an issuer until a reply is returned by out-of-order processing, the method comprising:
a transaction control step of detecting a failure related to transaction management in a buffer storing accepted transactions while distinguishing a retryable failure from a non-retryable failure, and, when a retryable failure is detected, retrying the transaction and also retrying another transaction having a coherency relationship with that transaction.

The error processing method according to claim 5, wherein the other transaction having a coherency relationship with the transaction related to the failure detection is a transaction being processed when the failure occurs.

The error processing method according to claim 5, wherein the other transaction having a coherency relationship with the transaction related to the failure detection is a newly issued transaction.

The error processing method according to claim 5, wherein the buffer is provided with a failure counter that records the number of failure occurrences of an entry corresponding to a transaction from the issuer, and the transaction control means degenerates the entry when the value of the failure counter exceeds a certain threshold.

A control program for an error processing method in an information processing apparatus that manages a transaction from when the transaction is received from an issuer until a reply is returned by out-of-order processing, the control program causing a computer to execute:
transaction control processing of detecting a failure related to transaction management in a buffer storing accepted transactions while distinguishing a retryable failure from a non-retryable failure, and, when a retryable failure is detected, retrying the transaction and also retrying another transaction having a coherency relationship with that transaction.

The control program according to claim 9, wherein the other transaction having a coherency relationship with the transaction related to the failure detection is a transaction being processed when the failure occurs.

The control program according to claim 9, wherein another transaction having a coherency relationship with the transaction related to the failure detection is a newly issued transaction.

The control program according to claim 9, wherein the buffer is provided with a failure counter that records the number of failure occurrences of an entry corresponding to a transaction from the issuer, and the transaction control means degenerates the entry when the value of the failure counter exceeds a certain threshold.