JP2012221101A

JP2012221101A - Inter-processing part mismatching detection method under consideration of reboot due to failure and shared device and cluster system

Info

Publication number: JP2012221101A
Application number: JP2011084550A
Authority: JP
Inventors: Eiju Kita; 栄寿喜多
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-04-06
Filing date: 2011-04-06
Publication date: 2012-11-12
Anticipated expiration: 2031-04-06
Also published as: JP5464449B2

Abstract

PROBLEM TO BE SOLVED: To detect inconsistencies between processing units caused by a reboot, using a method that does not require a host to manage message identifiers or to change a message identifier for each process. SOLUTION: In a shared device including a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, the first processing unit, on receiving a transmission request for a first message from the first host, transmits a request to the second processing unit; the second processing unit transmits the message to the second host and transmits a reply to the first processing unit; and the first processing unit determines whether the reply corresponds to the transmission request for the first message by comparing the boot and reboot counts of the first processing unit.

Description

The present invention relates to detecting inconsistencies between processing units in an apparatus that includes a plurality of processing units.

In large computer systems such as mainframes, cluster systems in which a plurality of hosts are coupled to one another are used to improve scalability and availability.

In a cluster system, the hosts perform processing while exchanging data with one another. One way to realize this inter-host communication is for the hosts to exchange messages via a shared device that has interfaces to a plurality of hosts.

The shared device used for this purpose contains a plurality of processing units and a storage unit shared by those processing units. The message communication function works as follows: a processing unit that receives a message transmission request from a host issues a request to the other processing unit, and the processing unit that receives the request transmits the message to its host. When the transmission succeeds, it returns a reply to the original processing unit, and the processing unit that receives the reply returns a reply to the requesting host.

Now consider a case in which a processing unit that has received a request from a host reboots due to a failure immediately after issuing a request to the other processing unit, and then receives a different transmission request from the host after the reboot. In this case, if the processing unit issues a request for the new transmission request and the reply to the earlier request then arrives from the other processing unit, the two having crossed each other, the processing unit mistakes that reply for the reply to the later request.

One conceivable way to solve this problem is to use the technique described in Patent Document 1. In that technique, a message identifier is added to each processing request, and each process can be identified by referring to the message identifier.

Patent Document 1: JP 2003-085060 A

As described above, each process can be identified by using a technique such as that described in Patent Document 1.

However, with the technique of Patent Document 1 and the like, the message identifier must be changed for every process, which complicates the processing needed to assign and check message identifiers.

In addition, the technique of Patent Document 1 and the like requires each server acting as a host to manage message identifiers. Moreover, it does not take into account a reboot of a processing unit in the shared device, and so does not directly solve the problem that, depending on when the reboot occurs, messages crossing each other cause an inconsistency between the processing units.

Accordingly, an object of the present embodiment is to provide a method for detecting inconsistencies between processing units that takes reboots due to failure into account, together with a shared device and a cluster system, which can detect inconsistencies between processing units caused by a reboot without making the hosts manage message identifiers and without changing a message identifier for every process.

According to a first aspect of the present invention, there is provided a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein: when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; the first processing unit that has received the reply determines whether the reply corresponds to the transmission request for the first message; and the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, in the storage unit, with the request and with the reply, respectively.

According to a second aspect of the present invention, there is provided an inconsistency detection method performed by a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein: when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; the first processing unit that has received the reply determines whether the reply corresponds to the transmission request for the first message; and the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, in the storage unit, with the request and with the reply, respectively.

According to the present invention, using the value of the boot counter area in the shared device as the request factor makes it possible to associate a request with its reply even if a reboot occurs. Inconsistencies between processing units caused by a reboot can therefore be detected without making the hosts manage message identifiers and without changing a message identifier for every process.

FIG. 1 is a diagram showing the basic configuration of an embodiment of the present invention.
FIG. 2 is a flowchart showing the basic operation of the boot count registration unit in the embodiment of the present invention.
FIG. 3 is a flowchart showing the basic operation of the host request processing unit in the embodiment of the present invention.
FIG. 4 is a flowchart showing the basic operation of the other-processing-unit request processing unit in the embodiment of the present invention.
FIG. 5 is a flowchart showing the basic operation of the other-processing-unit reply processing unit in the embodiment of the present invention.

Next, an embodiment of the present invention is described in detail with reference to the drawings.

Referring to FIG. 1, the present embodiment includes a shared device 1000, a first host 2000, and a second host 3000.

In the present embodiment, the first host 2000 and the second host 3000 perform processing while exchanging data with each other via the shared device 1000. The exchanged data may conform to any standard or protocol; the embodiment can be implemented in conformity with any communication method.

The shared device 1000 includes a storage unit 110, a first processing unit 120, and a second processing unit 130.

The first processing unit 120 is directly connected to the first host 2000 but not to the second host 3000. Similarly, the second processing unit 130 is directly connected to the second host 3000 but not to the first host 2000. The connection to each host can use any method; the connection method is not particularly limited.

The storage unit 110 is a storage device divided into a plurality of areas according to use. Specifically, it has a boot counter area, a request area, a reply area, and a communication data area, and one of each is provided for every processing unit. In the present embodiment, the areas for the first processing unit 120 are therefore the first boot counter area 111, the first request area 113, the first reply area 115, and the first communication data area 117, and the areas for the second processing unit 130 are the second boot counter area 112, the second request area 114, the second reply area 116, and the second communication data area 118.
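The area layout described above can be sketched as a small data structure. This is only an illustrative model: the class and field names, the use of the reference numerals 120 and 130 as dictionary keys, and the initial value 0 for every area are assumptions, not details given in the source.

```python
from dataclasses import dataclass, field


@dataclass
class UnitAreas:
    """Areas of storage unit 110 reserved for one processing unit."""
    boot_counter: int = 0   # boot counter area (111 or 112); initial 0 is assumed
    request: int = 0        # request area (113 or 114); 0 is treated as "empty"
    reply: int = 0          # reply area (115 or 116); 0 is treated as "empty"
    comm_data: bytes = b""  # communication data area (117 or 118)


@dataclass
class SharedStorage:
    """Storage unit 110: one set of areas per processing unit."""
    areas: dict = field(
        default_factory=lambda: {120: UnitAreas(), 130: UnitAreas()}
    )


storage = SharedStorage()
storage.areas[120].boot_counter = 1  # e.g. unit 120 has booted once
```

Keying the areas by processing unit keeps the "one of each area per processing unit" rule explicit and makes it clear which unit owns which area.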

The specific use of each of these areas is described later.

The first processing unit 120 and the second processing unit 130 have local memories 121 and 131, respectively (referred to as "LM" where appropriate in the following description and drawings).

The first processing unit 120 and the second processing unit 130 also have, respectively, boot count registration units 122 and 132, host request processing units 123 and 133, other-processing-unit request processing units 124 and 134, and other-processing-unit reply processing units 125 and 135.

The boot count registration units 122 and 132 register an incremented boot count in the boot counter area of the storage unit 110 when the first processing unit 120 or the second processing unit 130 boots or reboots.

The host request processing units 123 and 133 issue a request to the other processing unit when they receive a transmission request from the host to which their own processing unit is connected.

When the other-processing-unit request processing units 124 and 134 receive a request from the host request processing unit of the other processing unit, they transmit the message to the host to which their own processing unit is connected and, after the transmission succeeds, return a response (hereinafter called a "reply" where appropriate) to the requesting host request processing unit.

When the other-processing-unit reply processing units 125 and 135 receive a reply from the other-processing-unit request processing unit of the other processing unit, they return a reply to the host to which their own processing unit is connected.

Next, the operation of each unit of the present embodiment is described in detail with reference to the flowcharts of FIGS. 2 to 5.

FIG. 2 is a flowchart showing the operation of the boot count registration units 122 and 132. Each boot count registration unit operates when the processing unit that contains it boots or reboots.

Referring to FIG. 2, at boot and at reboot, the boot count registration units 122 and 132 read the value of the boot counter area assigned to their own processing unit in the storage unit 110 (step A1).

Next, the value read is compared with the maximum value the boot counter can take (step A2). If the two values are equal (Yes in step A2), 1 is registered in the boot counter area (step A3). That is, the boot counter value after processing is 1.

If the two values are not equal (No in step A2), 1 is added to the current value of the boot counter area (step A4). That is, the boot counter value after processing is the current boot counter value + 1.

This completes the operation of the boot count registration units 122 and 132.
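Steps A1 to A4 amount to a wrap-around increment that skips 0. A minimal sketch follows; the function name and the 16-bit maximum are assumptions, since the source does not state the width of the boot counter. Wrapping to 1 rather than 0 is consistent with steps C2 and D2, where a value of 0 in a request or reply area means that nothing has been written.

```python
MAX_BOOT_COUNT = 0xFFFF  # assumed maximum; the source does not give the counter width


def next_boot_count(current: int, max_value: int = MAX_BOOT_COUNT) -> int:
    """Return the value to register in the boot counter area at (re)boot."""
    if current == max_value:  # step A2: Yes
        return 1              # step A3: wrap to 1, never to 0
    return current + 1        # step A4: ordinary increment


print(next_boot_count(1))               # 2
print(next_boot_count(MAX_BOOT_COUNT))  # 1
```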

FIG. 3 is a flowchart showing the operation of the host request processing units 123 and 133.

Referring to FIG. 3, the host request processing units 123 and 133 wait until a communication request is received from the host (No in step B1).

When a communication request is received (Yes in step B1), the unit stores the communication message accompanying the request in the communication data area assigned to the processing unit to which the destination host is connected (step B2).

Next, the unit reads the value of the boot counter area assigned to its own processing unit in the storage unit 110 (step B3) and stores that value in the request area of the storage unit 110 assigned to the processing unit to which the destination host is connected (step B4). It then returns to step B1 and waits for the next communication request from the host.
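Steps B2 to B4 can be sketched as a single function acting on a dictionary that stands in for the storage unit 110. The dictionary layout, the function name, and the helper `new_areas` are hypothetical; the waiting loop of step B1 is omitted, so the function represents one pass after a request has arrived.

```python
def new_areas() -> dict:
    """Hypothetical per-unit areas; 0 marks an empty request or reply area."""
    return {"boot_counter": 0, "request": 0, "reply": 0, "comm_data": b""}


def handle_host_request(storage: dict, self_id: int, peer_id: int,
                        message: bytes) -> None:
    """One pass of FIG. 3 after a communication request has been received."""
    storage[peer_id]["comm_data"] = message    # step B2: message for the peer unit
    count = storage[self_id]["boot_counter"]   # step B3: read own boot count
    storage[peer_id]["request"] = count        # step B4: post it as the request


storage = {120: new_areas(), 130: new_areas()}
storage[120]["boot_counter"] = 1
handle_host_request(storage, self_id=120, peer_id=130, message=b"hello")
```

Note that the value written to the request area is the requester's own boot count, so the request itself carries the information later used for matching.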

FIG. 4 is a flowchart showing the operation of the other-processing-unit request processing units 124 and 134.

Referring to FIG. 4, the other-processing-unit request processing units 124 and 134 monitor the value of the request area assigned to their own processing unit in the storage unit 110 (step C1).

They wait until a value other than 0 is written in that request area (No in step C2).

When a value other than 0 is written (Yes in step C2), the unit temporarily saves the written value in the LM of its own processing unit and, having saved it, clears the request area to 0 (step C3).

Next, the unit reads the communication message stored in the communication data area assigned to its own processing unit in the storage unit 110 (step C4).

It then transmits the communication message read in step C4 to the host connected to it (step C5) and waits until a response (reply) is received from the host (No in step C6).

When the response from the host is received (Yes in step C6), the unit stores the value it saved from the request area to the LM in the reply area of the storage unit 110 assigned to the processing unit that issued the request (step C7). It then returns to step C1 and resumes monitoring the request area.
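One polling pass through steps C1 to C7 might look like the following sketch. The dictionary layout, the function name, and the `send_to_host` callback are assumptions; the callback is assumed to return only after the host has responded, which collapses steps C5 and C6 into one call.

```python
def new_areas() -> dict:
    """Hypothetical per-unit areas; 0 marks an empty request or reply area."""
    return {"boot_counter": 0, "request": 0, "reply": 0, "comm_data": b""}


def poll_request_area(storage: dict, self_id: int, requester_id: int,
                      send_to_host) -> bool:
    """One pass of FIG. 4. Returns True if a request was handled."""
    value = storage[self_id]["request"]       # step C1: look at own request area
    if value == 0:                            # step C2: No, nothing written yet
        return False
    saved_in_lm = value                       # step C3: keep the value in local memory
    storage[self_id]["request"] = 0           # step C3: clear the request area
    message = storage[self_id]["comm_data"]   # step C4: read the message
    send_to_host(message)                     # steps C5, C6: send, wait for the host
    storage[requester_id]["reply"] = saved_in_lm  # step C7: echo the value back
    return True


storage = {120: new_areas(), 130: new_areas()}
storage[130]["comm_data"] = b"hello"
storage[130]["request"] = 1                   # posted earlier by unit 120
delivered = []
handled = poll_request_area(storage, 130, 120, delivered.append)
```

The replying unit never interprets the value; it only copies it from its request area to the requester's reply area.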

FIG. 5 is a flowchart showing the operation of the other-processing-unit reply processing units 125 and 135.

Referring to FIG. 5, the other-processing-unit reply processing units 125 and 135 monitor the value of the reply area assigned to their own processing unit in the storage unit 110 (step D1).

They wait until a value other than 0 is written in that reply area (No in step D2).

When a value other than 0 is written (Yes in step D2), the unit temporarily saves the written value in the LM of its own processing unit and, having saved it, clears the reply area to 0 (step D3).

Next, the temporarily saved value is compared with the value of the boot counter area assigned to the unit's own processing unit in the storage unit 110 (step D4). If the two values are equal (Yes in step D4), the unit reports completion of the communication to the host connected to it (step D5) and returns to step D1.

If the two values are not equal (No in step D4), the unit returns to step D1 without doing anything. That is, in this case completion of the communication is not reported to the host connected to it.
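Steps D1 to D5 reduce to comparing the echoed value with the current boot count. The sketch below models one polling pass; the dictionary layout, the function name, and the `report_completion` callback are assumptions introduced for illustration.

```python
def poll_reply_area(storage: dict, self_id: int, report_completion):
    """One pass of FIG. 5.

    Returns None if no reply is pending, True if the reply matched the
    current boot count, and False if it was stale and ignored.
    """
    value = storage[self_id]["reply"]               # step D1: look at own reply area
    if value == 0:                                  # step D2: No, nothing written yet
        return None
    saved_in_lm = value                             # step D3: keep the value in local memory
    storage[self_id]["reply"] = 0                   # step D3: clear the reply area
    if saved_in_lm == storage[self_id]["boot_counter"]:  # step D4
        report_completion()                         # step D5: tell the host
        return True
    return False                                    # D4: No, reply predates a reboot


reports = []
storage = {120: {"boot_counter": 2, "reply": 1}}    # reply issued before a reboot
stale = poll_reply_area(storage, 120, lambda: reports.append("done"))
storage[120]["reply"] = 2                           # reply for the current boot
fresh = poll_reply_area(storage, 120, lambda: reports.append("done"))
print(stale, fresh, reports)  # False True ['done']
```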

The above is the operation of each unit in the present embodiment.

Next, the operation of the present embodiment is described in detail using a specific example.

This example takes the case in which, during message communication from the first host 2000 to the second host 3000, the first processing unit 120 reboots due to a failure partway through the processing. For the sake of explanation, assume that the value of the first boot counter area 111 at this time is 1 (step S4).

The first host 2000 requests message communication from the first processing unit 120. The host request processing unit 123 of the first processing unit 120, which is monitoring for communication requests from the host, recognizes that the request has been received (Yes in step B1).

The host request processing unit 123 stores the communication message accompanying the received request in the second communication data area 118, which is assigned to the second processing unit 130, the processing unit connected to the destination host (step B2).

Next, in order to issue a request to the second processing unit 130, it reads the value of the first boot counter area 111 (step B3) and stores the value read in the second request area 114 (step B4).

Next, the other-processing-unit request processing unit 134 of the second processing unit 130, which is monitoring the second request area 114 assigned to it, confirms that a value other than 0 has been written in the second request area 114 (Yes in step C2).

The other-processing-unit request processing unit 134 temporarily saves the written value in the LM 131 and clears the second request area 114 to 0 (step C3).

Next, it reads the communication message stored in the second communication data area 118 (step C4), transmits the message to the second host 3000 (step C5), and waits for a response from the second host 3000 (step C6).

When the response from the second host 3000 is received (Yes in step C6), it stores the value saved from the second request area 114 to the LM 131 in the first reply area 115 (step C7).

Next, the other-processing-unit reply processing unit 125 of the first processing unit 120, which is monitoring the first reply area 115 assigned to it, recognizes that a value other than 0 has been written in the first reply area 115 (Yes in step D2), temporarily saves the written value in the LM 121, and clears the first reply area 115 to 0 (step D3).

Next, it compares the value temporarily saved in the LM 121 with the value of the first boot counter area 111 (step D4). Normally the two values are equal (Yes in step D4), so it returns a communication-completion response to the first host 2000 (step D5).

Now suppose that the first processing unit 120 reboots immediately after issuing the request to the second processing unit 130. Immediately after the reboot, the boot count registration unit 122 updates the value of the first boot counter area 111 to 2 (step A1, No in step A2, step A4).

If a reply is then received from the second processing unit 130, the comparison between the value temporarily saved in the LM 121 and the value of the first boot counter area 111 (step D4) finds that the saved value is 1, the value before the reboot, whereas the first boot counter area 111 now holds 2, the value after the reboot. The comparison therefore fails (No in step D4), and no communication-completion response is returned to the first host 2000.
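The two outcomes of this example, with and without a reboot, can be traced with a few variables. This is a simplified sketch rather than an implementation: the variable names borrow the reference numerals from the description, the function name is hypothetical, and polling, message delivery, and the host response are left out.

```python
def completion_reported(reboot_after_request: bool) -> bool:
    """Trace the example; return True if host 2000 is told the send completed."""
    boot_counter_111 = 1                  # value assumed in the example
    request_114 = boot_counter_111        # steps B3, B4: unit 120 posts its boot count
    if reboot_after_request:
        boot_counter_111 += 1             # step A4: the reboot raises the counter to 2
    lm_131 = request_114                  # step C3: unit 130 saves the value in its LM
    reply_115 = lm_131                    # step C7: unit 130 echoes it as the reply
    lm_121 = reply_115                    # step D3: unit 120 saves the reply value
    return lm_121 == boot_counter_111     # step D4: compare with current boot count


print(completion_reported(False))  # True: 1 == 1, completion is reported
print(completion_reported(True))   # False: 1 != 2, the stale reply is ignored
```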

The embodiment of the present invention described above provides the following effects.

The first effect is that normal completion of communication between processing units can be confirmed reliably.

This is because using the value of the boot counter area as the request factor allows a request to be matched with its reply even if a reboot occurs. A mismatch between request and reply can arise only when a reboot occurs.

The second effect is that the processing is lightweight.

This is because once the value of the boot counter area, which serves as the request factor, has been incremented during boot processing (including reboot processing), it need not be incremented again until the next reboot. Compared with a method that increments the request factor every time a communication is issued, the same effect is obtained with less processing.

The shared device according to the embodiment of the present invention can be realized in hardware, but it can also be realized by a computer reading, from a computer-readable recording medium, a program that causes the computer to function as the shared device, and executing that program.

Likewise, the inconsistency detection method according to the embodiment of the present invention can be realized in hardware, but it can also be realized by a computer reading, from a computer-readable recording medium, a program that causes the computer to execute the method, and executing that program.

The embodiment described above is a preferred embodiment of the present invention, but it does not limit the scope of the present invention to that embodiment alone; the invention can be practiced with various modifications that do not depart from its gist.

In addition, the present embodiment has the following advantages over general techniques.

In general techniques, the message identifier is changed for every processing request. The present embodiment differs in that the identifier (the boot count) is not changed for every processing request: it is incremented only when the device boots or reboots, so keeping identifier updates to a minimum reduces the load.

Also, general techniques use a boot count, for example, to limit the number of times each boot device may boot. In contrast, the present embodiment uses the boot count to prevent communications that cross between processing units from being mismatched.

Part or all of the above embodiment can also be described as in the following supplementary notes, but is not limited to them.

(Supplementary Note 1) A shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein:
when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; the first processing unit that has received the reply determines whether the reply corresponds to the transmission request for the first message; and
the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, in the storage unit, with the request and with the reply, respectively.

(Supplementary Note 2) The shared device according to Supplementary Note 1, wherein,
in the determination, if the boot and reboot counts of the first processing unit associated with the request and with the reply, respectively, match, the reply is judged to be valid and reply processing is performed toward the first host, and if they do not match, the reply is judged to be invalid and is discarded.

(Supplementary Note 3) The shared device according to Supplementary Note 1 or 2, wherein
the first processing unit records its own boot and reboot count in the storage unit when transmitting the request; the second processing unit, on receiving the request, reads the recorded boot and reboot count and stores it itself; the second processing unit, when transmitting the reply, records the boot and reboot count it has stored in the storage unit; and the first processing unit makes the determination by comparing the boot and reboot count that it recorded in the storage unit with the boot and reboot count that the second processing unit recorded in the storage unit.

(Supplementary Note 4) The shared device according to any one of Supplementary Notes 1 to 3, wherein
the first processing unit also operates as the second processing unit, and the second processing unit also operates as the first processing unit.

(Supplementary Note 5) A cluster system comprising a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, together with the first host and the second host connected to the shared device, wherein
the first host and the second host communicate via the shared device, and
the shared device is the shared device according to any one of Supplementary Notes 1 to 4.

(Supplementary Note 6) An inconsistency detection method performed by a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein:
when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; and the first processing unit that has received the reply determines whether or not the reply corresponds to the transmission request for the first message; and
the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, on the storage unit, with the request and with the reply, respectively.

(Supplementary Note 7) The inconsistency detection method according to Supplementary Note 6, wherein,
in the determination, if the boot and reboot counts of the first processing unit associated with the request and with the reply match, the reply is judged to be valid and reply processing is performed for the first host, and if they do not match, the reply is judged not to be valid and is discarded.

(Supplementary Note 8) The inconsistency detection method according to Supplementary Note 6 or 7, wherein
the first processing unit records its own boot and reboot count in the storage unit when transmitting the request; the second processing unit, on receiving the request, reads the recorded boot and reboot count and stores it itself; the second processing unit, when transmitting the reply, records the boot and reboot count it has stored in the storage unit; and the first processing unit makes the determination by comparing the boot and reboot count that it recorded in the storage unit with the boot and reboot count that the second processing unit recorded in the storage unit.

(Supplementary Note 9) The inconsistency detection method according to any one of Supplementary Notes 6 to 8, wherein
the first processing unit also operates as the second processing unit, and the second processing unit also operates as the first processing unit.

(Supplementary Note 10) An inconsistency detection method performed by a cluster system comprising a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, together with the first host and the second host connected to the shared device, wherein
the first host and the second host communicate via the shared device, and
the shared device performs the inconsistency detection method according to any one of Supplementary Notes 6 to 8.
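The exchange described in Supplementary Notes 1 to 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the names `SharedStorage`, `FirstUnit`, `SecondUnit`, and `run_scenario` are hypothetical, the storage unit is reduced to two fields, and the message payload and the hosts are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SharedStorage:
    """Assumed stand-in for the storage unit shared by both processing units."""
    request_boot_count: Optional[int] = None  # recorded by the first unit
    reply_boot_count: Optional[int] = None    # recorded by the second unit


class FirstUnit:
    def __init__(self, storage: SharedStorage) -> None:
        self.storage = storage
        self.boot_count = 1  # counts the initial boot

    def reboot(self) -> None:
        self.boot_count += 1  # every reboot increments the count

    def send_request(self) -> None:
        # Record the current boot count together with the request.
        self.storage.request_boot_count = self.boot_count

    def accept_reply(self) -> bool:
        # Valid only if the count recorded with the reply matches the
        # count this unit recorded with its request.
        return self.storage.reply_boot_count == self.storage.request_boot_count


class SecondUnit:
    def __init__(self, storage: SharedStorage) -> None:
        self.storage = storage
        self.saved_count: Optional[int] = None  # kept in local memory

    def receive_request(self) -> None:
        self.saved_count = self.storage.request_boot_count

    def send_reply(self) -> None:
        self.storage.reply_boot_count = self.saved_count


def run_scenario() -> List[bool]:
    storage = SharedStorage()
    first, second = FirstUnit(storage), SecondUnit(storage)
    results = []

    # Normal exchange: counts match, so the reply is accepted.
    first.send_request(); second.receive_request(); second.send_reply()
    results.append(first.accept_reply())

    # The first unit reboots after its request was picked up and then
    # issues a new request; the late reply carries the old count.
    first.send_request(); second.receive_request()
    first.reboot()
    first.send_request()
    second.send_reply()
    results.append(first.accept_reply())

    # The second unit now handles the new request; counts match again.
    second.receive_request(); second.send_reply()
    results.append(first.accept_reply())
    return results
```

In this sketch the second scenario returns `False`, which corresponds to discarding the reply in Supplementary Note 2. No message identifier is managed by either host; the boot count alone distinguishes replies issued before and after a reboot.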

The present invention is suitable for communication between hosts in a cluster system that connects a plurality of hosts, in large computer systems such as mainframes.

110 Storage unit
111 First boot counter area
112 Second boot counter area
113 First request area
114 Second request area
115 First reply area
116 Second reply area
117 First communication data area
118 Second communication data area
120 First processing unit
121, 131 Local memory
122, 132 Boot count registration unit
123, 133 Host request processing unit
124, 134 Other-processing-unit request processing unit
125, 135 Other-processing-unit reply processing unit
130 Second processing unit
1000 Shared device
2000 First host
3000 Second host

Claims

A shared device comprising a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein:
when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; and the first processing unit that has received the reply determines whether or not the reply corresponds to the transmission request for the first message; and
the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, on the storage unit, with the request and with the reply, respectively.

The shared device according to claim 1, wherein,
in the determination, if the boot and reboot counts of the first processing unit associated with the request and with the reply match, the reply is judged to be valid and reply processing is performed for the first host, and if they do not match, the reply is judged not to be valid and is discarded.

The shared device according to claim 1 or 2, wherein
the first processing unit records its own boot and reboot count in the storage unit when transmitting the request; the second processing unit, on receiving the request, reads the recorded boot and reboot count and stores it itself; the second processing unit, when transmitting the reply, records the boot and reboot count it has stored in the storage unit; and the first processing unit makes the determination by comparing the boot and reboot count that it recorded in the storage unit with the boot and reboot count that the second processing unit recorded in the storage unit.

The shared device according to any one of claims 1 to 3, wherein
the first processing unit also operates as the second processing unit, and the second processing unit also operates as the first processing unit.

A cluster system comprising a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, together with the first host and the second host connected to the shared device, wherein
the first host and the second host communicate via the shared device, and
the shared device is the shared device according to claim 1.

An inconsistency detection method performed by a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, wherein:
when the first processing unit receives a transmission request for a first message from the first host, it transmits a request to the second processing unit; the second processing unit that has received the request transmits the message to the second host and transmits a reply to the first processing unit; and the first processing unit that has received the reply determines whether or not the reply corresponds to the transmission request for the first message; and
the determination is made by comparing the boot and reboot counts of the first processing unit that are associated, on the storage unit, with the request and with the reply, respectively.

The inconsistency detection method according to claim 6, wherein,
in the determination, if the boot and reboot counts of the first processing unit associated with the request and with the reply match, the reply is judged to be valid and reply processing is performed for the first host, and if they do not match, the reply is judged not to be valid and is discarded.

The inconsistency detection method according to claim 6 or 7, wherein
the first processing unit records its own boot and reboot count in the storage unit when transmitting the request; the second processing unit, on receiving the request, reads the recorded boot and reboot count and stores it itself; the second processing unit, when transmitting the reply, records the boot and reboot count it has stored in the storage unit; and the first processing unit makes the determination by comparing the boot and reboot count that it recorded in the storage unit with the boot and reboot count that the second processing unit recorded in the storage unit.

The inconsistency detection method according to any one of claims 6 to 8, wherein
the first processing unit also operates as the second processing unit, and the second processing unit also operates as the first processing unit.

An inconsistency detection method performed by a cluster system comprising a shared device having a first processing unit connected to a first host, a second processing unit connected to a second host, and a storage unit used by the first processing unit and the second processing unit, together with the first host and the second host connected to the shared device, wherein
the first host and the second host communicate via the shared device, and
the shared device performs the inconsistency detection method according to claim 6.