JP5148441B2

JP5148441B2 - Communication path redundancy and switching method in computer interconnection network, server device realizing the method, server module thereof, and program thereof

Info

Publication number: JP5148441B2
Application number: JP2008253793A
Authority: JP
Inventors: 毅小倉
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-09-30
Filing date: 2008-09-30
Publication date: 2013-02-20
Anticipated expiration: 2028-09-30
Also published as: JP2010086227A

Description

The present invention relates to a technique for making the internal communication of a PC cluster system highly reliable. In the field of PC clustering, in which multiple computers such as PCs (personal computers) are connected to build a single high-speed computer system, the technique makes the interconnection network between the computers redundant and thereby improves the fault tolerance of the communication paths.

PC clustering is a technology that connects many small computers such as PCs, which are inexpensive but have low processing capacity, to economically build a system with processing capacity equivalent to that of a single large, high-speed computer system. PC clustering is used in a wide range of fields as a method for building scientific computing systems, various servers in data centers and the like, and large-capacity storage systems.

In PC clustering, the physical transmission medium used to connect PCs is called an interconnection network. There are various types of interconnection networks, and each is used where the characteristics of its communication functions fit best. Specific examples include Myrinet, Fibre Channel, and InfiniBand. In addition, a single type of interconnection network can usually run several types of communication schemes (protocols) in its upper layers. FIG. 1 shows an overview of a PC cluster system.

Among these interconnection networks, InfiniBand (Non-Patent Document 1) is the most widely used in recent PC cluster systems because of its high speed, low cost, and ease of use. Various protocols such as TCP/IP, SDP, and MPI can run in the upper layers of InfiniBand, and one representative protocol among them is uDAPL (User Direct Access Programming Library) (Non-Patent Document 2).

uDAPL is a software framework that provides a unified API (application programming interface) and a data transfer protocol so that the RDMA (remote direct memory access) function built into some interconnection networks, such as InfiniBand, can be used through a common software interface that does not depend on the type of interconnection network. RDMA extends DMA (direct memory access), a long-established technique for speeding up input/output (I/O) processing within a single computer, to communication between remote computers connected by a LAN (local area network) or an interconnection network.

In DMA, when data is input from or output to an external device directly attached to a computer, the controller of the external device accesses the computer's main memory directly and transfers the data with almost no use of the CPU. This reduces the processing load on the computer and speeds up data I/O.

Likewise, RDMA allows data to be transferred between the main memories of remote computers with almost no use of the CPU, which enables very low-load, high-speed communication between computers. In fact, uDAPL is one of the fastest of the protocols that run in the upper layers of InfiniBand, and its importance is growing.

The essential mode of operation in RDMA is that the main memory area of the computer that is the target of data transmission or reception is published to the communication partner in advance, and communication starts after a data transfer path has been established. In this respect RDMA differs from other common data transfer schemes, in which the sender transmits data without knowing where the receiver will finally store it, and the receiver decides the storage location ad hoc each time it receives data.

Schemes of the latter kind allow connectionless communication, in which no communication path is established between the computers in advance, whereas RDMA presupposes connection-oriented communication, in which a communication path is established between the computers beforehand.

In addition, the memory area that is the target of the send or receive operation must be specified explicitly.

Accordingly, uDAPL is an API that reflects this essential mode of operation of RDMA: it centers on a connection setup mechanism and a mechanism that abstracts the main memory area targeted by a data transfer. An outline of the uDAPL API follows.

First, FIG. 2 shows the main objects used in the uDAPL API. The figure assumes that host A and host B (hereinafter, a small computer such as a PC is called a host), each equipped with an InfiniBand HCA (Host Channel Adapter), are connected through an InfiniBand switch and that data is transferred by RDMA between the main memories of these hosts. Here, host A is the RDMA initiator, that is, the side that initiates the RDMA data transfer.

As described above, uDAPL uses connection-oriented communication. The EP (End Point) in FIG. 2 is an object that abstracts an endpoint of a connection between the application processes that communicate. A connection is established between EPs, and only one-to-one connections are possible. When an application process starts a communication operation, it specifies one particular EP among those it holds. This identifies the connection used for that communication (and the peer EP).

In this specification, a single physical line between a host's HCA and an InfiniBand switch is called a communication link, and a logical communication path established between EPs is called a connection.

An LMR (Local Memory Region) is an object that abstractly represents the area of main memory targeted by a communication operation. One LMR represents a single contiguous virtual memory area. An application process starts a communication operation by specifying all or part of the virtual contiguous memory area that the LMR represents. Before an LMR is used, the actual physical area of main memory to which it is mapped has already been determined. Specifying an LMR therefore means specifying the main memory area targeted by the communication operation.

An RMR (Remote Memory Region) is an object for configuring restrictions on access from remote hosts to the main memory area mapped to an LMR in the local host. An RMR can be bound to all or part of an LMR (the figure shows one RMR bound to the entire LMR), and the access restrictions set on the RMR apply to the main memory area mapped to the bound part of the LMR. In other words, an RMR provides a memory window that cuts an arbitrary area out of an LMR so that access restrictions can be set flexibly.

A PZ (Protection Zone) is an object that groups objects and provides a means of excluding operations from objects outside the group. Each EP, LMR, and RMR is associated with exactly one PZ when it is created. A communication operation executed through an EP can access only the main memory areas mapped to LMRs and RMRs that belong to the same PZ as that EP.

An EVD (Event Dispatcher) is an object that reports the completion of uDAPL operations as events. When an EP is created, it is associated with three EVDs: an EVD for message reception completion (receive EVD); an EVD for completion of message transmission, RDMA read, RDMA write, and the mapping of an RMR to an LMR (request EVD); and an EVD for notification of connection establishment between EPs (connect EVD). The application process learns that operations performed through the EP have completed from the event notifications of these EVDs.

Each EP, LMR, RMR, PZ, and EVD object is associated with one InfiniBand HCA when it is created.
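The relationships among these objects can be sketched as a small data model. The following Python sketch is not the uDAPL API, which is a C interface; every class and field name here is hypothetical and only mirrors the description above: each object belongs to one PZ, an LMR represents one contiguous region, an RMR is a window on all or part of an LMR, and an EP may touch only memory objects in its own PZ.

```python
from dataclasses import dataclass

# Hypothetical model of the objects described above; not the real uDAPL C API.


@dataclass(frozen=True)
class PZ:
    name: str


@dataclass
class LMR:
    pz: PZ
    buffer: bytearray  # stands in for the main-memory region the LMR is mapped to


@dataclass
class RMR:
    pz: PZ
    lmr: LMR
    offset: int               # start of the window inside the LMR
    length: int               # size of the window
    remote_read: bool = True  # access restriction attached to the window

    def __post_init__(self):
        end = self.offset + self.length
        if self.offset < 0 or self.length < 0 or end > len(self.lmr.buffer):
            raise ValueError("RMR window must lie inside its LMR")


@dataclass
class EP:
    pz: PZ

    def can_access(self, region) -> bool:
        # An EP may operate only on LMRs and RMRs in the same Protection Zone.
        return region.pz == self.pz


pz_a, pz_b = PZ("a"), PZ("b")
lmr = LMR(pz_a, bytearray(8192))
window = RMR(pz_a, lmr, offset=4096, length=4096)  # RMR bound to part of the LMR

print(EP(pz_a).can_access(window))  # True: same PZ
print(EP(pz_b).can_access(lmr))     # False: different PZ
```

The EVDs and the HCA association are left out of this sketch to keep it short.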

This concludes the description of the main objects used in the uDAPL API. Next, the procedure for actually communicating with them is described. For example, when host A reads data in the main memory of host B by RDMA and stores it in its own main memory, the procedure is as in (1) to (12) below. FIG. 3 outlines this procedure.

(1) Host A requests information about the memory area of host B that it is to access by RDMA (the target memory area). The request message itself is stored in a request transmission buffer in host A's main memory, and host A starts a message send operation by specifying the LMR mapped to this buffer and an EP. (This is an ordinary message transmission that does not use the RDMA function. Note that host A does not know where in host B's main memory the transmitted message will be stored.)

(2) Host A confirms that the message transmission completed normally through the send-completion event notification from the request EVD.

(3) Host A starts an operation to receive memory information from host B by specifying the LMR mapped to the memory information reception buffer and an EP.

(4) Host A waits, through notification from the receive EVD, for reception of the memory information from host B to complete.

(5) Host B starts an operation to receive the memory information request from host A by specifying the LMR mapped to the request reception buffer in its own main memory and an EP.

(6) Host B waits, through notification from the receive EVD, for reception of the memory information request from host A to complete.

(7) On receiving the memory information request from host A, host B checks the content of the request stored in the request reception buffer and starts an operation to send information about the main memory area to which it permits RDMA access from host A (the RDMA area), specifically the triplet of start address, length, and access key (rmr_triplet). The content to be sent is stored in a memory information transmission buffer in main memory, and the message is sent by specifying the LMR mapped to this buffer and an EP. (This is also an ordinary message transmission, not RDMA.)

(8) Host B waits for transmission of the message to complete through the send-completion event notification from the request EVD. Once completion is confirmed, host B returns to step (5). (Host B is not aware of the start or end of the RDMA operation itself.)

(9) When host A confirms, through the event notification from the receive EVD, that it has received the memory information from host B, it proceeds to start the RDMA operation in step (10), using the memory information (rmr_triplet) from host B stored in the memory information reception buffer.

(10) Host A designates its own RDMA area in main memory as the data reception area by using the LMR mapped to it, designates the target main memory area of host B with the rmr_triplet, specifies an EP, and starts the RDMA read operation. After starting this operation, host A waits for the RDMA read completion notification on the request EVD, the same EVD that was used in (2), since RDMA read completions are reported through the request EVD.

(11) The actual RDMA data transfer is carried out by the InfiniBand physical layer and link layer. If the data in the RDMA area in host B's main memory is larger than the maximum data length handled by InfiniBand (4 Kbytes), the data is divided into multiple packets and transferred to host A. Host A stores the received packets in the RDMA area in its main memory. On both hosts A and B, this processing takes place with almost no CPU involvement.

(12) When host A confirms completion of the RDMA read through notification from the EVD, it concludes that all data in the RDMA area corresponding to the LMR on host B has been transferred to the RDMA area corresponding to the LMR on host A. To continue RDMA transfers using another RDMA area, host A returns to (1).

This is an outline of the RDMA communication procedure in uDAPL. The description above covers RDMA read, but RDMA write, in which the initiator sends data to the peer host, follows the same procedure except that the data flows in the opposite direction.
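The message flow of this procedure can be illustrated with a toy simulation. The Python sketch below does not call uDAPL or any InfiniBand library; `Host`, `send`, and `rdma_read` are hypothetical helpers, addresses are modeled as offsets into a byte array, and event notification through EVDs is omitted. It shows only the order of operations: an ordinary request message, an ordinary reply carrying the (start address, length, access key) triplet, and then an RDMA read that is split into packets of at most 4 Kbytes.

```python
MAX_PAYLOAD = 4096  # maximum data length per packet (4 Kbytes), as in step (11)


class Host:
    """Toy host: main memory, an inbox for ordinary messages, exposed RDMA areas."""

    def __init__(self, name, memory_size):
        self.name = name
        self.memory = bytearray(memory_size)
        self.inbox = []    # ordinary (non-RDMA) messages land here
        self.exposed = {}  # access key -> (start, length) of a permitted RDMA area


def send(dst, message):
    # Ordinary message send: the sender does not choose where the data is stored.
    dst.inbox.append(message)


def rdma_read(initiator, target, triplet, local_start):
    """Copy the target area named by the triplet into initiator memory.

    Returns the number of packets used.
    """
    start, length, key = triplet
    if target.exposed.get(key) != (start, length):
        raise PermissionError("triplet does not match an exposed RDMA area")
    packets = 0
    for off in range(0, length, MAX_PAYLOAD):
        end = min(off + MAX_PAYLOAD, length)
        chunk = target.memory[start + off:start + end]
        initiator.memory[local_start + off:local_start + off + len(chunk)] = chunk
        packets += 1
    return packets


host_a = Host("A", 16384)
host_b = Host("B", 16384)
host_b.memory[0:10000] = b"x" * 10000
host_b.exposed[0x1234] = (0, 10000)  # host B permits RDMA access to this area

send(host_b, "memory-info-request")  # steps (1)-(2): A sends the request
request = host_b.inbox.pop(0)        # steps (5)-(6): B receives it
send(host_a, (0, 10000, 0x1234))     # steps (7)-(8): B replies with the triplet
triplet = host_a.inbox.pop(0)        # steps (3)-(4), (9): A receives the triplet
packets = rdma_read(host_a, host_b, triplet, local_start=0)  # steps (10)-(12)

print(packets)  # 3 packets for 10000 bytes
```

In this sketch host B takes no part in `rdma_read` beyond having exposed the area, which corresponds to the remark in step (8) that host B is not aware of the RDMA operation itself.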

This specification covers communication schemes built on an API that has a connection setup mechanism and a mechanism for abstracting the main memory area targeted by a data transfer. At present, uDAPL is the only concrete example.

Non-Patent Document 1: Tom Shanley, MindShare, Inc., InfiniBand Network Architecture, PC System Architecture Series, Addison-Wesley (2003), ISBN 0-321-11765-4.
Non-Patent Document 2: uDAPL: User Direct Access Programming Library, Version 2.0, DAT Collaborative.

Conventional uDAPL up to version 1.2 provided no mechanism for improving the fault tolerance of communication paths, that is, no mechanism for making communication paths redundant or for switching paths when a failure occurs. The current uDAPL version 2.0 states that these functions are optionally supported as HA (High Availability) functions (Non-Patent Document 2, pages 58-62). However, although the API specifies how to request redundancy in an EP's connection settings (Non-Patent Document 2, pages 231 and 233), it does not specify details such as the behavior in combination with LMRs, and much remains unclear.

At present, therefore, the only way to make the interconnection network fault tolerant is for users themselves to implement, at the application software level, an architecture that explicitly accounts for fault tolerance. Two obvious methods can be considered.

(Method 1) The same data is always sent redundantly over all of the redundant communication paths, but only the data on one of them is actually used. When a failure occurs, the data in use is switched to the data on another communication path.

(Method 2) The communication paths are made redundant, but data is sent over only one of them. When a failure occurs, the communication path in use is switched and data is sent only over the new path.

FIG. 4 illustrates (Method 1). In this example, host A reads data from a disk, which is an external device, into its main memory; the data is then transferred by RDMA to the main memory of host B; and host B distributes the received data to a network through a NIC. To improve the fault tolerance of the RDMA communication between hosts A and B, redundancy is provided by two communication paths corresponding to connections 1 and 2.

The main memories of hosts A and B contain RDMA areas 1 and 2, which correspond to communication paths 1 and 2. Each RDMA area is divided into multiple segments, and one LMR is mapped to each segment. The set of LMRs mapped to the segments in RDMA area 1 is called LMR group 1, and the set mapped to the segments in RDMA area 2 is called LMR group 2. In other words, FIG. 4 is an example in which multiple instances of the RDMA area and LMR shown in FIG. 2 are held for each communication path.

Host A reads data from the disk in units equal to the segment size in main memory and stores them in the segments in order. Every piece of data on the disk is read twice; one copy is stored in a segment in RDMA area 1 and the other in a segment in RDMA area 2. In this way the same data is stored redundantly in RDMA areas 1 and 2.

Using LMR groups 1 and 2, the data stored in RDMA areas 1 and 2 is transferred from host A to host B over both communication paths 1 and 2 at all times (either host may be the RDMA initiator). Data in RDMA area 1 of host A is transferred over communication path 1 to RDMA area 1 of host B, and data in RDMA area 2 of host A is transferred over communication path 2 to RDMA area 2 of host B.

On each communication path, one RDMA operation transfers the data in one segment, and the transfer continues by repeating this operation. This data transfer proceeds in parallel with reading data from the disk.

In this way, all data read from the disk is duplicated and transferred between the hosts over the redundant communication paths.

In parallel with receiving data from host A, host B distributes the data from one of connections 1 and 2 to the network through the NIC.

Suppose, for example, that a failure occurs on communication path 1 while data from that path is being distributed to the network, and communication over it becomes impossible. Host B then switches the data that its NIC distributes to the network to the data from communication path 2. In this way, data distribution to the network continues without interruption.

As this example shows, in (Method 1) data transfer between the hosts is not interrupted even if a failure occurs on some of the redundant communication paths between them. In the example of FIG. 4, normal system operation can be maintained simply by switching, on host B's NIC, the communication path that supplies the data distributed to the network. Because the path over which data flows does not need to be switched, (Method 1) has the advantage that it needs none of the copying or moving of data between RDMA areas that accompanies path switching in (Method 2), described later.

However, because (Method 1) always transfers data redundantly, it increases the processing load on the hosts and the consumption of the various resources used for communication. In the example of FIG. 4, the following operations are always duplicated:
・Reading data from the disk on host A
・Sending data on host A
・Receiving data on host B
Because the send processing on host A and the receive processing on host B are performed by RDMA, the load they place on the host CPUs is small. Reading data from the disk, however, can increase the processing load on the host. In addition, the performance of data I/O with external devices such as disks often dominates overall system performance, so increasing the consumption of resources in this part of the system, such as the I/O bus, is often unwise when the goal is to improve overall system performance. The problem becomes more serious the faster and larger the data communication a system performs. These are the problems of (Method 1).
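The trade-off of (Method 1) can be made concrete with a small counting model. The Python sketch below is an illustration only; the function `method1` and its parameters are hypothetical, and paths are modeled as plain in-memory copies rather than RDMA transfers. It counts disk reads and segment transfers when every segment is read twice and sent over both paths, and it switches the source that feeds the NIC when path 1 fails.

```python
def method1(segments, fail_path1_at=None):
    """Simulate (Method 1): duplicate every segment over two paths.

    fail_path1_at is the index of the first segment that path 1 can no longer
    carry (None means path 1 never fails).
    Returns (delivered segments, disk reads, segment transfers).
    """
    disk_reads = 0
    transfers = 0
    delivered = []
    source = 1  # host B initially distributes the data arriving on path 1
    for i, seg in enumerate(segments):
        # Host A reads each segment twice, once into each RDMA area.
        area1, area2 = bytes(seg), bytes(seg)
        disk_reads += 2

        path1_up = fail_path1_at is None or i < fail_path1_at
        received = {}
        if path1_up:
            received[1] = area1
            transfers += 1
        received[2] = area2  # path 2 carries the same data all the time
        transfers += 1

        if source == 1 and not path1_up:
            source = 2  # only the source of the data fed to the NIC changes
        delivered.append(received[source])
    return delivered, disk_reads, transfers


segments = [b"seg0", b"seg1", b"seg2", b"seg3"]
delivered, disk_reads, transfers = method1(segments, fail_path1_at=2)

print(delivered == segments)  # True: distribution is not interrupted
print(disk_reads)             # 8: two disk reads per segment
print(transfers)              # 6: both paths until the failure, then path 2 only
```

Delivery continues without any copy between RDMA areas, but the disk is read twice for every segment, which is the cost the text identifies.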

FIG. 5, in contrast, illustrates (Method 2). As in FIG. 4, host A reads data from an external disk into its main memory, the data is transferred by RDMA to the main memory of host B, and host B distributes the received data to the network through the NIC. The communication path between hosts A and B is duplicated as in the example of FIG. 4.

The difference from FIG. 4 is that data is sent over only one of the two communication paths. In normal operation, data read from the disk on host A is stored in RDMA area 1, the area for communication path 1 in main memory, and is then transferred to host B by RDMA over communication path 1. Host B distributes the data received in its RDMA area 1 for communication path 1 to the network through the NIC. In FIG. 5, solid arrows show the data flow in this state.

If a failure occurs on communication path 1 during this operation, hosts A and B switch the main memory RDMA area and the LMR group they use from those for communication path 1 to those for communication path 2. After the switch, as the dashed arrows in the figure show, data read from the disk is transferred to host B over communication path 2 and distributed to the network.

Because (Method 2) does not duplicate the data transfer itself, it does not have the problem of (Method 1), namely increased host processing load and resource consumption.

However, (Method 2) requires an operation to switch the path over which data flows when a failure occurs. At that time, data may have to be copied (or moved) between RDMA areas in main memory, or read again from the external device into the RDMA area that is newly used. For example, as shown in FIG. 6, consider the case where, when a failure occurs on communication path 1, RDMA area 1 of host A still holds data whose transfer to host B has not completed. To be able to transfer this data to host B over communication path 2, host A must perform one of (A) and (B) below.

(A) Copy (or move) the untransferred data in RDMA area 1 to RDMA area 2 (FIG. 7).

(B) Read the same data as the untransferred data in RDMA area 1 from the disk again and store it in RDMA area 2 (FIG. 8).

If, as in the figure, a failure occurs on communication path 1 partway through the RDMA transfer of one data transfer unit (one segment in this example) and that RDMA transfer terminates abnormally, host A completes either (A) or (B) and then restarts the RDMA transfer from the beginning of that segment. The application process cannot observe the progress of an individual RDMA operation and so cannot determine exactly how much data was transferred before the abnormal termination; the transfer must therefore be redone from the start.

While (A) or (B) is being executed, data transfer between hosts A and B is stopped. To keep data distribution from host B to the network from being interrupted, host A must therefore execute the operation as fast as possible. However, particularly when the system handles high-speed, large-volume data, the amount of data to be copied (moved) or re-read tends to be large, and executing the operation quickly is not easy.
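Option (A) can be sketched as follows to show where the cost arises. This is a hypothetical Python model, not code from the patent: `failover_by_copy` and its parameters are invented for illustration, and the RDMA areas are plain byte arrays. A segment whose transfer was interrupted counts as not transferred, so the copy and the restart both begin at the start of that segment.

```python
def failover_by_copy(area1, completed_segments, segment_size):
    """Option (A): copy untransferred data from RDMA area 1 into RDMA area 2.

    completed_segments counts only segments whose RDMA transfer finished; a
    segment interrupted partway is treated as untransferred because the
    application cannot tell how much of it arrived.
    Returns (area2, bytes copied, index of the segment to restart from).
    """
    start = completed_segments * segment_size  # first byte not known to be delivered
    area2 = bytearray(len(area1))
    area2[start:] = area1[start:]              # the copy that stalls the transfer
    bytes_copied = len(area1) - start
    return area2, bytes_copied, completed_segments


# Four 4-byte segments; path 1 fails while the second segment is in flight.
area1 = bytearray(b"AAAABBBBCCCCDDDD")
area2, bytes_copied, restart = failover_by_copy(area1, completed_segments=1, segment_size=4)

print(bytes_copied)  # 12: three segments must be copied before transfer resumes
print(restart)       # 1: resume from the beginning of the interrupted segment
```

The number of bytes copied grows with the amount of untransferred data, which is why the stall becomes harder to hide as data volumes increase.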

An object of the present invention is to solve the problems of the conventional obvious path redundancy methods described above while retaining fast switching of the communication path when a failure occurs on it.

上記課題を解決するための手段として、外部装置とのデータの入出力、および、ホスト間でのＲＤＭＡ転送の対象となる主メモリの領域は単一の物を用意する一方で、冗長化した通信経路毎にその通信経路上で通信を行うためのＬＭＲを用意し、上記単一の主メモリ領域に対して、各通信リンクに対応したＬＭＲを重ねてマッピングする方式を発明した。 As a means of solving the above problem, the inventors devised a method in which a single main memory area is prepared as the target of data input/output with external devices and of RDMA transfer between hosts, while an LMR for communicating over each redundant communication path is prepared per path, and the LMRs corresponding to the respective communication links are mapped, overlapping one another, onto that single main memory area.

本発明の概要を図９に示す。本発明は、冗長化された通信経路（ここでは２つの通信経路）毎にそれぞれ対応したＬＭＲ群を用意する点は［発明が解決しようとする課題］において説明した従来の（方式１）、（方式２）と同様であるが、主メモリ上のＲＤＭＡ領域は単一の領域を用意する点が異なる。 FIG. 9 shows an outline of the present invention. The present invention is the same as the conventional (Method 1) and (Method 2) described in [Problems to be Solved by the Invention] in that an LMR group is prepared for each of the redundant communication paths (two paths here), but differs in that a single RDMA area is prepared in main memory.

正常動作時には、［発明が解決しようとする課題］で述べた従来の（方式２）と同様に、冗長化された複数の通信経路のうちの１つだけを使用してホスト間でのデータ転送を行う。図９の例で言えば、ホストＡでは、主メモリ上のセグメントと同じサイズのデータがディスクから読み出されセグメントに格納されていくとともに、ＬＭＲ群１を用いて通信経路１経由でＲＤＭＡによりそれらのデータがホストＢへ転送される（ＲＤＭＡのイニシエータはどちらのホストでもよい）。ホストＢでは、ＬＭＲ群１を用いて通信経路１から受信したデータを主メモリに格納するとともに、主メモリ上のデータをＮＩＣ経由でネットワークへ配信する。 During normal operation, as in the conventional (Method 2) described in [Problems to be Solved by the Invention], data is transferred between hosts using only one of the redundant communication paths. In the example of FIG. 9, on host A, data is read from the disk in units equal to the size of a segment in main memory and stored into the segments, and that data is transferred to host B by RDMA over communication path 1 using LMR group 1 (either host may act as the RDMA initiator). On host B, the data received over communication path 1 using LMR group 1 is stored in main memory, and the data in main memory is distributed to the network via the NIC.

ここで、図１０に示すように、通信経路１に障害が発生すると、ホストＡ、および、ホストＢはＲＤＭＡに使用するＬＭＲ群を１から２へ切り替える。そして、まだホストＢへ転送していない主メモリ上のセグメントから、ＲＤＭＡによるデータ転送を再開する。 Here, as shown in FIG. 10, when a failure occurs in the communication path 1, the host A and the host B switch the LMR group used for RDMA from 1 to 2. Then, the data transfer by RDMA is resumed from the segment on the main memory that has not been transferred to the host B yet.

同図のように１回のＲＤＭＡ動作におけるデータ転送単位（本例では１つのセグメント）内のデータのＲＤＭＡ転送の途中で通信経路１に障害が発生し、当該ＲＤＭＡ転送が途中で異常終了した場合は、使用するＬＭＲ群を１から２へ切り替えた後、当該セグメントの先頭から再びＲＤＭＡ転送を実行する。これは、各回のＲＤＭＡの進行状況はアプリケーションプロセスでは把握できず、異常終了するまでに転送したデータ量を正確に把握することができないため、初めから実行し直す必要があるためである。 If, as shown in the same figure, a failure occurs on communication path 1 partway through the RDMA transfer of one data transfer unit (one segment in this example) of a single RDMA operation, and that RDMA transfer terminates abnormally, the LMR group in use is switched from 1 to 2 and the RDMA transfer is then executed again from the beginning of that segment. This is because the application process cannot observe the progress of an individual RDMA operation and therefore cannot know exactly how much data was transferred before the abnormal termination, so the transfer must be executed again from the start.

ホストＡにおけるディスクから主メモリへのデータ読み出し、および、ホストＢにおける主メモリからネットワークへのデータ配信動作は、ＬＭＲ群の切り替えの影響を受けずに継続が可能である。 Data reading operation from the disk to the main memory in the host A and data distribution operation from the main memory to the network in the host B can be continued without being affected by the switching of the LMR group.
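
The switching behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patent's implementation and not uDAPL code: `Host`, `PathDown`, `rdma_transfer`, `send_all`, and the `fail_path1_at` failure-injection parameter are all hypothetical names, and an LMR group is modelled simply as a per-path list of indices into the one shared RDMA area.

```python
from dataclasses import dataclass, field


class PathDown(Exception):
    """Hypothetical signal that an RDMA transfer terminated abnormally."""


@dataclass
class Host:
    rdma_area: list                       # the single RDMA area, one entry per segment
    lmr_groups: dict = field(init=False)  # path number -> per-segment mapping

    def __post_init__(self):
        n = len(self.rdma_area)
        # Both groups map the very same segments; nothing is duplicated.
        self.lmr_groups = {1: list(range(n)), 2: list(range(n))}


def rdma_transfer(src, dst, group, seg, path_ok):
    """Move one whole segment over the path tied to LMR group `group`."""
    if not path_ok[group]:
        raise PathDown(group)
    i = src.lmr_groups[group][seg]
    j = dst.lmr_groups[group][seg]
    dst.rdma_area[j] = src.rdma_area[i]


def send_all(src, dst, path_ok, fail_path1_at=None):
    """Send every segment; on failure switch the LMR group and retry the segment."""
    group, seg = 1, 0
    while seg < len(src.rdma_area):
        if seg == fail_path1_at:
            path_ok[1] = False            # inject a failure on path 1 (simulation only)
        try:
            rdma_transfer(src, dst, group, seg, path_ok)
        except PathDown:
            other = 2 if group == 1 else 1
            if not path_ok[other]:
                raise RuntimeError("no usable communication path")
            group = other                 # the only action needed: change the LMR group
            continue                      # retry the same segment from its beginning
        seg += 1
    return group
```

Because both LMR groups refer to the same segments, the retry after the switch needs no copy between RDMA areas and no re-read from disk; only the value of `group` changes.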

本発明では、ディスクからのデータ読み出しは、各データについて１度だけ、単一のＲＤＭＡ領域に対して行えばよい。また、ネットワークへのデータ配信においても、単一のＲＤＭＡ領域からＮＩＣへのデータ読み出しは各データについて１度だけ行えばよい。したがって、［発明が解決しようとする課題］で述べた（方式１）の問題、すなわち、外部装置と主メモリとの間のデータ転送を冗長して行うことに起因する、ホストの処理負荷の増大や各種リソース消費量の増大という問題はない。 In the present invention, data reading from the disk need only be performed once for each piece of data in a single RDMA area. In data distribution to the network, data reading from a single RDMA area to the NIC need only be performed once for each data. Therefore, the problem of (method 1) described in [Problems to be solved by the invention], that is, the increase in processing load on the host due to redundant data transfer between the external device and the main memory. There is no problem of increasing the consumption of various resources.

また、本発明は、障害発生時においてデータを流す通信経路を変更する点では［発明が解決しようとする課題］で述べた（方式２）と同じである。しかし、本発明では、通信経路の切り替えのために必要な動作はＬＭＲ群の切り替えだけである。図９の例では、ホストＡにおけるディスクから主メモリへのデータ読み出し、および、ホストＢにおける主メモリからネットワークへのデータ配信動作は、上記のＬＭＲ群の切り替えの影響を受けずに、単一のＲＤＭＡ領域を使用したまま常時継続できる。したがって、［発明が解決しようとする課題］で述べた（方式２）の問題、すなわち、ＲＤＭＡ領域間でのデータのコピー（移動）やディスクからのデータの再読み出しのような、高速実行が困難となり得る処理が介在する、という問題もない。 The present invention is also the same as (Method 2) described in [Problems to be Solved by the Invention] in that the communication path carrying the data is changed when a failure occurs. In the present invention, however, the only operation needed to switch the communication path is switching the LMR group. In the example of FIG. 9, the reading of data from disk into main memory on host A and the distribution of data from main memory to the network on host B can continue at all times, still using the single RDMA area, without being affected by the LMR group switch. Consequently, the problem of (Method 2) described in [Problems to be Solved by the Invention], namely that the switch involves processing that can be difficult to execute quickly, such as copying (moving) data between RDMA areas or re-reading data from disk, does not arise either.

本発明によれば、通信経路上で障害が発生した場合の迅速な通信経路の切り替え動作を維持しながら、［発明が解決しようとする課題］で述べた従来の自明な通信経路冗長化方式が持つ問題を解決することが可能である。 According to the present invention, the problems of the conventional trivial communication path redundancy methods described in [Problems to be Solved by the Invention] can be solved while maintaining rapid switching of the communication path when a failure occurs on a communication path.

［発明を実施するための最良の形態１］ [Best Mode 1 for Carrying Out the Invention]
以下、本発明を実施するための最良の形態１（以下、形態１と記載）について説明する。本形態１は、［特許請求の範囲］に記載の請求項１、３、４の方法に基づいており、また、それを具現化したサーバ装置であるため請求項５にも対応する。本形態１の装置全体の構成を図１１に示す。 Best mode 1 for carrying out the present invention (hereinafter, the first embodiment) is described below. The first embodiment is based on the methods of claims 1, 3, and 4 in [Claims] and, being a server device that embodies them, also corresponds to claim 5. FIG. 11 shows the overall configuration of the device of the first embodiment.

図１１に示す相互結合網冗長型マルチメディアサーバ装置が本形態に係るサーバ装置である。映像データなどの時間連続性を持つマルチメディアデータを外部装置へ配信する機能、および、外部装置から受信したマルチメディアデータを蓄積する機能を有する。外部装置と直接通信する通信サーバモジュール１〜Ｍ_１、マルチメディアデータが格納される蓄積装置１〜Ｍ_２、蓄積装置を制御する蓄積サーバモジュール１〜Ｍ_２、サーバ内データベース、通信サーバモジュールと蓄積サーバモジュール間でのマルチメディアデータ転送に使用されるサーバモジュール間ネットワークスイッチ、および、通信サーバモジュールと蓄積サーバモジュールとサーバ内データベースを接続するローカルエリアネットワークで構成される。 The interconnection network redundant multimedia server device shown in FIG. 11 is the server device according to this embodiment. It has a function of distributing multimedia data with time continuity, such as video data, to external devices, and a function of storing multimedia data received from external devices. It consists of communication server modules 1 to M_1, which communicate directly with external devices; storage devices 1 to M_2, in which multimedia data is stored; storage server modules 1 to M_2, which control the storage devices; an in-server database; network switches between server modules, used for multimedia data transfer between the communication server modules and the storage server modules; and a local area network connecting the communication server modules, the storage server modules, and the in-server database.

通信サーバモジュールと蓄積サーバモジュールは汎用の計算機から成るモジュールで、相互結合網冗長型マルチメディアサーバ装置はこれらをクラスタ化したものである。すなわち、通信サーバモジュールと蓄積サーバモジュールは請求項１に記載の計算機に対応し、相互結合網冗長型マルチメディアサーバ装置は請求項１に記載のサーバ装置に対応する。 The communication server module and the storage server module are modules composed of general-purpose computers, and the interconnected network redundant multimedia server apparatus is a cluster of these. That is, the communication server module and the storage server module correspond to the computer described in claim 1, and the interconnection network redundant multimedia server device corresponds to the server device described in claim 1.

相互結合網冗長型マルチメディアサーバ装置は、端末側ネットワークを介して端末やマルチメディア情報入力装置などの外部装置と接続されている。端末は、同サーバ装置からマルチメディアデータの配信を受ける装置である。また、マルチメディア情報入力装置は、同サーバ装置へ蓄積したいマルチメディアデータを送信する装置である。 The interconnected network redundant multimedia server device is connected to an external device such as a terminal or a multimedia information input device via a terminal-side network. The terminal is a device that receives multimedia data from the server device. The multimedia information input device is a device that transmits multimedia data to be stored in the server device.

相互結合網冗長型マルチメディアサーバ装置内のサーバモジュール間ネットワークスイッチは、請求項１に記載の相互結合網に対応する。本形態１では、同スイッチはＩｎｆｉｎｉＢａｎｄスイッチで、上位レイヤにｕＤＡＰＬを用いるものとする。このｕＤＡＰＬは請求項１に記載の「計算機間のコネクション設定機構および当該計算機の主メモリ領域の抽象化機構を持つ通信方式」に対応する。 The network switch between server modules in the interconnection network redundant multimedia server apparatus corresponds to the interconnection network according to claim 1. In the first embodiment, the switch is an InfiniBand switch and uDAPL is used for an upper layer. This uDAPL corresponds to “a communication system having a connection setting mechanism between computers and an abstraction mechanism of a main memory area of the computer”.

本形態１では、上記スイッチを２つ用いることにより相互結合網を冗長化している。すなわち、請求項１に記載のＤの値を２とした場合に対応する。 In the first embodiment, the interconnection network is made redundant by using two switches. That is, this corresponds to the case where the value of D described in claim 1 is 2.

１つのスイッチ、そこから伸びる通信リンク、および、それらの通信リンクによって当該スイッチに接続された全てのＨＣＡからなる１つのサーバモジュール間ネットワークを系と呼び、同図における２つの系をそれぞれ０系、１系とする。また、この０系あるいは１系内において、ある１台の通信サーバモジュールとある１台の蓄積サーバモジュールを接続する通信リンクと同スイッチからなる通信経路を、０系通信経路、１系通信経路、のように呼ぶ。 A single network between server modules, consisting of one switch, the communication links extending from it, and all HCAs connected to that switch by those links, is called a system; the two systems in the figure are called the 0-system and the 1-system. Within the 0-system or the 1-system, a communication path consisting of the switch and the communication links connecting one communication server module and one storage server module is called a 0-system communication path or a 1-system communication path, respectively.

各系の内部では任意のＨＣＡ間で通信が可能であるが、系をまたがる通信はできないものとする。 It is assumed that communication between arbitrary HCAs is possible within each system, but communication across systems is not possible.

各通信サーバモジュールは、端末側ネットワークとの接続のためのネットワークインタフェースカード（ＮＩＣ）を１枚、および、０系、１系の各サーバモジュール間ネットワークスイッチとの接続のためにＩｎｆｉｎｉＢａｎｄホストアダプタカード（ＨＣＡ）を２枚搭載している。 Each communication server module is equipped with one network interface card (NIC) for connection to the terminal-side network and two InfiniBand host adapter cards (HCAs) for connection to the 0-system and 1-system network switches between server modules.

各蓄積サーバモジュールは、０系、１系の各サーバモジュール間ネットワークスイッチとの接続のためのＩｎｆｉｎｉＢａｎｄＨＣＡを２枚、および、自身が管理する蓄積装置との接続のためのファイバチャネルホストバスアダプタ（ＨＢＡ）を１枚搭載している。 Each storage server module is equipped with two InfiniBand HCAs for connection to the 0-system and 1-system network switches between server modules and one Fibre Channel host bus adapter (HBA) for connection to the storage device it manages.

相互結合網冗長型マルチメディアサーバ装置内のローカルエリアネットワークは、通信サーバモジュール、蓄積サーバモジュール、サーバ内データベースを接続するネットワークである。このネットワークは前記０系、１系のサーバモジュール間ネットワークのように計算機間のコネクション設定機構や当該計算機の主メモリ領域の抽象化機構を持つ通信方式を実行可能である必要はなく、ＬＡＮ（Local Area Network）等で用いられるイーサネットなどの一般的なネットワークでよい。 The local area network in the interconnection network redundant multimedia server device is a network connecting the communication server modules, the storage server modules, and the in-server database. Unlike the 0-system and 1-system networks between server modules, this network need not be able to run a communication scheme that has a connection setting mechanism between computers and an abstraction mechanism for the computers' main memory areas; a general-purpose network such as the Ethernet used in LANs (Local Area Networks) suffices.

相互結合網冗長型マルチメディアサーバ装置内のサーバ内データベースは、同サーバ装置に格納されている番組の格納場所や、各蓄積装置の使用状況に関するデータを保持するデータベースである。同サーバ装置の通信サーバモジュール、および、蓄積サーバモジュールはこのローカルエリアネットワークを介してこのサーバ内データベースにアクセスし、上記の情報を得る。このサーバ内データベースを実現する装置としては、前記のデータを保持できるだけの記憶装置を内部に持ち、かつ、通信サーバモジュールや蓄積サーバモジュールとの通信が可能な物であればよく、汎用の計算機等で実現するのが簡明である。このサーバ内データベースは内部に、番組格納情報テーブル、および、蓄積装置使用状況テーブルの２つのテーブルを持つ。これらを図１２に示す。 The in-server database in the interconnection network redundant multimedia server device is a database that holds data on where the programs stored in the server device are located and on the usage status of each storage device. The communication server modules and storage server modules of the server device access the in-server database via the local area network to obtain this information. Any device that has internal storage large enough to hold the above data and can communicate with the communication server modules and storage server modules can implement the in-server database; implementing it with a general-purpose computer or the like is the simplest. The in-server database holds two tables: a program storage information table and a storage device usage status table. These are shown in FIG. 12.

番組格納情報テーブルは、相互結合網冗長型マルチメディアサーバ装置に格納されている各番組のマルチメディアデータがどの蓄積装置に格納されているかを示すテーブルである。図１２の例ではＩＤが１の番組が蓄積装置５に、ＩＤが２の番組が蓄積装置Ｍ_２に格納されていることを示している。新しい番組が同サーバ装置に格納されるたびに、テーブル内にこのようなエントリが作成されていく。また、蓄積装置使用状況テーブルは、同サーバ装置の各蓄積装置の空き容量を示すテーブルである。蓄積装置１からＭ_２の各蓄積装置の空き容量が記憶されている。新しい番組が同サーバ装置に格納されるたびに、テーブル内の空き容量を表す数値が更新されていく。 The program storage information table shows in which storage device the multimedia data of each program stored in the interconnection network redundant multimedia server device is held. The example of FIG. 12 shows that the program with ID 1 is stored in storage device 5 and the program with ID 2 is stored in storage device M_2. Each time a new program is stored in the server device, such an entry is created in the table. The storage device usage status table shows the free capacity of each storage device of the server device; it records the free capacity of each of storage devices 1 to M_2. Each time a new program is stored in the server device, the free-capacity values in the table are updated.

本形態１において、マルチメディアデータの配信は以下のように行われる。マルチメディアデータの配信を要求する端末は、サーバ装置内の任意の１台の通信サーバモジュールに対し、端末側ネットワークを介してマルチメディアデータの配信要求を発行する。配信要求には、要求元の端末を特定する要求元ＩＤ、および、配信を要求するマルチメディアデータを特定する番組ＩＤが情報として含まれる（要求元ＩＤや番組ＩＤの管理方法についてはここでは規定しない）。 In the first embodiment, multimedia data is distributed as follows. A terminal requesting distribution of multimedia data issues a distribution request to any one communication server module in the server device via the terminal-side network. The distribution request contains a request source ID identifying the requesting terminal and a program ID identifying the multimedia data whose distribution is requested (how request source IDs and program IDs are managed is not specified here).

この配信要求を受信した通信サーバモジュールは、ローカルエリアネットワークを介してサーバ内データベースにアクセスし、同データベース内の番組格納情報テーブルの当該番組ＩＤを含むエントリを検索して、当該番組のマルチメディアデータが格納されている蓄積装置番号を得る。 The communication server module that receives the distribution request accesses the in-server database via the local area network, searches the program storage information table in the database for the entry containing that program ID, and obtains the number of the storage device in which the multimedia data of the program is stored.

そして、サーバモジュール間ネットワークのうち正常に動作している１つの系を使用して、上記蓄積装置番号の蓄積装置を管理する蓄積サーバモジュールから当該番組のマルチメディアデータを受信し、これを要求元ＩＤに相当する端末へ配信する。 Then, using one normally operating system of the network between server modules, it receives the multimedia data of the program from the storage server module that manages the storage device with that number, and distributes it to the terminal corresponding to the request source ID.

また、マルチメディアデータの蓄積は以下のように行われる。マルチメディアデータの蓄積を要求するマルチメディア情報入力装置は、サーバ装置内の任意の１台の通信サーバモジュールに対し、端末側ネットワークを介してマルチメディアデータの蓄積要求を発行する。蓄積要求には、要求元のマルチメディア情報入力装置を特定する要求元ＩＤ、および、蓄積するマルチメディアデータに新規に割り振る番組ＩＤが情報として含まれる。 Further, the storage of multimedia data is performed as follows. A multimedia information input device that requests storage of multimedia data issues a request to store multimedia data to any one communication server module in the server device via the terminal-side network. The accumulation request includes, as information, a request source ID that identifies the request source multimedia information input device and a program ID that is newly allocated to the multimedia data to be accumulated.

蓄積要求を受信した通信サーバモジュールは、ローカルエリアネットワークを介してサーバ内データベースにアクセスし、同データベース内の蓄積装置使用状況テーブルを参照して、空き容量が最も多い蓄積装置を当該マルチメディアデータの格納先蓄積装置に決定する。 The communication server module that receives the storage request accesses the in-server database via the local area network, refers to the storage device usage status table in the database, and selects the storage device with the most free capacity as the storage destination for the multimedia data.

そして、サーバモジュール間ネットワークのうち正常に動作している１つの系を使用して、上記蓄積装置を管理する蓄積サーバモジュールに対し、マルチメディア情報入力装置から受信したマルチメディアデータを送信する。 Then, the multimedia data received from the multimedia information input device is transmitted to the storage server module that manages the storage device, using one normally operating system among the network between server modules.

蓄積サーバモジュールは、受信したマルチメディア情報を自身が管理する蓄積装置へ格納する。当該番組の全てのマルチメディア情報の蓄積装置への格納が完了すると、蓄積サーバモジュールは、サーバ内データベースにアクセスし、当該番組ＩＤとそれを格納した蓄積装置番号からなる新しいエントリを番組格納情報テーブルへ追加し、さらに、番組格納後の当該蓄積装置の空き容量を蓄積装置使用状況テーブルに記録し、処理を終了する。 The storage server module stores the received multimedia information in the storage device it manages. When all the multimedia information of the program has been stored in the storage device, the storage server module accesses the in-server database, adds to the program storage information table a new entry consisting of the program ID and the number of the storage device that holds it, records the free capacity of that storage device after the program was stored in the storage device usage status table, and ends the processing.
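
The role of the two tables in the distribution and storage flows above can be sketched as follows. This is a minimal in-memory model under assumptions: the class name `InServerDatabase`, its method names, and the capacity figures are hypothetical, and the source does not say how the database is actually implemented or accessed.

```python
class InServerDatabase:
    """Toy model of the program storage information and usage status tables."""

    def __init__(self, free_capacity):
        self.program_location = {}                # program ID -> storage device number
        self.free_capacity = dict(free_capacity)  # storage device number -> free capacity

    def locate(self, program_id):
        """Distribution: find the storage device that holds a program."""
        return self.program_location[program_id]

    def choose_device(self):
        """Storage: pick the storage device with the most free capacity."""
        return max(self.free_capacity, key=self.free_capacity.get)

    def record_stored(self, program_id, device, free_after):
        """Storage completed: add the entry and record the remaining capacity."""
        self.program_location[program_id] = device
        self.free_capacity[device] = free_after


db = InServerDatabase({1: 500, 2: 800, 3: 300})   # illustrative capacities
device = db.choose_device()                       # device 2 has the most room
db.record_stored(program_id=7, device=device, free_after=650)
print(db.locate(7))
```

In the flow described above the communication server module performs the lookup and the device selection, while the storage server module performs the final update once the whole program has been written.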

なお、１番組のマルチメディア情報は固定長のセグメントに分割されて蓄積装置に格納されるものとする。蓄積装置からの読み出しや蓄積装置への書き込み、通信サーバモジュールと蓄積サーバモジュール間の転送は、いずれもこのセグメントを単位に行う。 Note that the multimedia information of one program is divided into fixed-length segments and stored in the storage device. Reading from the storage device, writing to the storage device, and transfer between the communication server module and the storage server module are all performed in units of this segment.

また、通信サーバモジュールと蓄積サーバモジュールとの間のデータ転送はｕＤＡＰＬのＡＰＩを介したＲＤＭＡにより実行する。１回のＲＤＭＡ操作で上記セグメントの１つが両者の間で転送される。この転送動作を繰り返すことにより当該番組のマルチメディアデータの転送が行われる。なお、本形態１においては、通信サーバモジュールがＲＤＭＡのイニシエータであるものとする。 Data transfer between the communication server module and the storage server module is performed by RDMA via uDAPL API. One of the segments is transferred between the two in a single RDMA operation. By repeating this transfer operation, the multimedia data of the program is transferred. In the first embodiment, it is assumed that the communication server module is an RDMA initiator.

なお、端末側ネットワークとのデータの送受信時の形式については特に規定しない。 Note that the format used when transmitting / receiving data to / from the terminal-side network is not particularly specified.

以上が本形態１の装置構成、および、動作の概要の説明である。これらを踏まえ、以下に本発明技術を含んだ詳細な説明を行う。まず、図１３に通信サーバモジュールの内部構造の詳細を示す。 The above is the description of the apparatus configuration and the outline of the operation of the first embodiment. Based on these, a detailed description including the technique of the present invention will be given below. First, FIG. 13 shows details of the internal structure of the communication server module.

ＲＤＭＡ領域は、番組のマルチメディア情報を格納するためのバッファで、オペレーティングシステムが提供する仕組みを用いて主メモリに確保した領域である。Ｎ個の領域に分かれており、各領域にマルチメディア情報の１つのセグメントを格納する。 The RDMA area is a buffer for storing the multimedia information of programs, allocated in main memory using a mechanism provided by the operating system. It is divided into N areas, each of which stores one segment of multimedia information.

配信動作の場合はＲＤＭＡ転送によって蓄積サーバモジュールから受信したセグメントが格納され、端末側ネットワークへ配信されるまでのバッファとして機能する。 In the case of the distribution operation, the segment received from the storage server module by RDMA transfer is stored and functions as a buffer until it is distributed to the terminal side network.

蓄積動作の場合は端末側ネットワークから受信したマルチメディア情報が格納され、ＲＤＭＡ転送によってセグメント単位で蓄積サーバモジュールへ転送されるまでのバッファとして機能する。 In the accumulation operation, the multimedia information received from the terminal-side network is stored and functions as a buffer until it is transferred to the accumulation server module in segments by RDMA transfer.

読み出しポインタ（ｒｄ＿ｐｔｒ）、書き込みポインタ（ｗｒ＿ｐｔｒ）、フルフラグ（ｆｕｌｌ＿ｆｌａｇ）、エンプティフラグ（ｅｍｐｔｙ＿ｆｌａｇ）は、上記ＲＤＭＡ領域へのセグメントの格納状況を管理するための、ソフトウェアプログラム変数である。 A read pointer (rd_ptr), a write pointer (wr_ptr), a full flag (full_flag), and an empty flag (empty_flag) are software program variables for managing the storage status of segments in the RDMA area.

ｒｄ＿ｐｔｒは、ＲＤＭＡ領域に既に格納されているセグメントを読み出す際に、読み出し元となるバッファ番号を保持するポインタである。配信動作の場合は、ＮＩＣ経由で端末側ネットワークへ配信するセグメントを指し、蓄積動作の場合は、ＲＤＭＡで蓄積サーバモジュールへ転送するセグメントを指す。 rd_ptr is a pointer that holds a buffer number that is a read source when a segment already stored in the RDMA area is read. In the case of the distribution operation, it indicates a segment that is distributed to the terminal side network via the NIC, and in the case of the storage operation, it indicates a segment that is transferred to the storage server module by RDMA.

ｗｒ＿ｐｔｒは、ＲＤＭＡ領域へセグメントを書き込む際に、書き込み先となるバッファ番号を保持するポインタである。配信動作の場合は、蓄積サーバモジュールからＲＤＭＡで転送されてきたセグメントの格納先バッファを指し、蓄積動作の場合は、端末側ネットワークから受信したマルチメディアデータの格納先バッファを指す。 wr_ptr is a pointer that holds a buffer number as a write destination when writing a segment to the RDMA area. In the case of the distribution operation, it indicates the storage destination buffer of the segment transferred by the RDMA from the storage server module, and in the case of the storage operation, it indicates the storage destination buffer of the multimedia data received from the terminal side network.

ｆｕｌｌ＿ｆｌａｇは、全てのバッファにセグメントが格納されていれば値１、それ以外の場合は値０を保持するフラグである。 The full_flag is a flag that holds a value of 1 if segments are stored in all buffers, and a value of 0 in other cases.

ｅｍｐｔｙ＿ｆｌａｇは、どのバッファにもセグメントが格納されていない場合は値１、それ以外の場合は値０を保持するフラグである。 empty_flag is a flag that holds a value of 1 if no segment is stored in any buffer, and a value of 0 otherwise.

これらのポインタとフラグを用いて、Ｎ個のバッファからなる有限長のＲＤＭＡ領域を仮想的なリングバッファとして使用することにより、セグメント長がＮよりも長い番組データを扱うことができる。 Using these pointers and flags, the finite-length RDMA area consisting of N buffers is used as a virtual ring buffer, which makes it possible to handle program data whose length in segments exceeds N.
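
The pointer and flag handling can be sketched as a small ring buffer. The variable names `rd_ptr`, `wr_ptr`, `full_flag`, and `empty_flag` come from the description; the class name `SegmentRing`, the method names, and the exact update rules are assumptions made for illustration, since the text does not spell them out.

```python
class SegmentRing:
    """N segment buffers used as a virtual ring buffer."""

    def __init__(self, n):
        self.buffers = [None] * n
        self.rd_ptr = 0        # buffer number to read next
        self.wr_ptr = 0        # buffer number to write next
        self.full_flag = 0     # 1 when every buffer holds a segment
        self.empty_flag = 1    # 1 when no buffer holds a segment

    def write(self, segment):
        """Store one segment; returns False if the ring is full."""
        if self.full_flag:
            return False
        self.buffers[self.wr_ptr] = segment
        self.wr_ptr = (self.wr_ptr + 1) % len(self.buffers)
        self.empty_flag = 0
        # After a write, equal pointers mean the writer caught up with the reader.
        self.full_flag = int(self.wr_ptr == self.rd_ptr)
        return True

    def read(self):
        """Take the oldest segment; returns None if the ring is empty."""
        if self.empty_flag:
            return None
        segment = self.buffers[self.rd_ptr]
        self.rd_ptr = (self.rd_ptr + 1) % len(self.buffers)
        self.full_flag = 0
        # After a read, equal pointers mean the reader caught up with the writer.
        self.empty_flag = int(self.rd_ptr == self.wr_ptr)
        return segment
```

The two flags are needed because `rd_ptr == wr_ptr` alone is ambiguous: it holds both when the ring is completely full and when it is completely empty.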

要求送信バッファは、ＲＤＭＡ領域と同様の方法で主メモリ上に確保した領域で、ＲＤＭＡ転送の実行前に蓄積サーバモジュールに対して送信するメモリ情報要求を格納するバッファである。 The request transmission buffer is an area secured on the main memory in the same manner as the RDMA area, and is a buffer for storing a memory information request to be transmitted to the storage server module before executing the RDMA transfer.

メモリ情報受信バッファは、上記と同様の方法で主メモリ上に確保した領域で、上記メモリ情報要求に対して蓄積サーバモジュールから送られてきたメモリ情報を受信するためのバッファである。 The memory information reception buffer is an area allocated in main memory in the same way as above; it is a buffer for receiving the memory information that the storage server module sends in response to the memory information request.

上記ＲＤＭＡ領域、要求送信バッファ、メモリ情報受信バッファを全て合わせたものが請求項１に記載のデータ格納手段に対応する。したがって、これらは全て、冗長化した通信経路の数には関係なく１つだけ確保する。 A combination of the RDMA area, the request transmission buffer, and the memory information reception buffer corresponds to the data storage means according to claim 1. Therefore, only one of them is ensured regardless of the number of redundant communication paths.

データ転送用ＬＭＲ群０、１はｕＤＡＰＬのオブジェクトであり、それぞれ、冗長化した２つの０系、１系通信経路用のＬＭＲである。図９のＬＭＲ群１、２に相当する物である。主メモリ上のＲＤＭＡ領域の各バッファ［ｉ］（０≦ｉ≦Ｎ−１）に対して、０系用のＬＭＲ［０］［ｉ］と１系用のＬＭＲ［１］［ｉ］の２つのＬＭＲがマッピングされている。 Data transfer LMR groups 0 and 1 are uDAPL objects; they are the LMRs for the two redundant communication paths, the 0-system path and the 1-system path, respectively, and correspond to LMR groups 1 and 2 in FIG. 9. For each buffer [i] (0 ≦ i ≦ N−1) of the RDMA area in main memory, two LMRs are mapped: LMR[0][i] for the 0-system and LMR[1][i] for the 1-system.

本形態１では、マルチメディアデータの転送だけでなくＲＤＭＡに伴うメモリ情報の送受信においても２つの通信経路を用いて信頼性の向上を図る。そのために、上記要求送信バッファ、および、メモリ情報受信バッファについても、それらにマッピングされた０系、１系通信経路用のＬＭＲを用意する。Ｓ−ＬＭＲが要求送信バッファ、Ｒ−ＬＭＲがメモリ情報受信バッファに対応したＬＭＲである。 In the first embodiment, not only the transfer of multimedia data but also the transmission / reception of memory information associated with RDMA use two communication paths to improve reliability. For this purpose, LMRs for the 0-system and 1-system communication paths mapped to the request transmission buffer and the memory information reception buffer are prepared. S-LMR is an LMR corresponding to a request transmission buffer, and R-LMR is an LMR corresponding to a memory information reception buffer.

１つのデータ転送用ＬＭＲ群と１つのメモリ情報交換用ＬＭＲ群を合わせた物が請求項１に記載のメモリ領域抽象化手段に対応し、本形態１では２つのメモリ領域抽象化手段を保持した場合に対応する。 The combination of one LMR group for data transfer and one LMR group for memory information exchange corresponds to the memory area abstraction means according to claim 1, and in the first embodiment, two memory area abstraction means are held. Corresponds to the case.

ｒｅｑｕｅｓｔＥＶＤは、蓄積サーバモジュールへ送信するメモリ情報要求の送信完了、および、１回のＲＤＭＡｒｅａｄ／ｗｒｉｔｅ動作の完了を通知するｕＤＡＰＬオブジェクトである。０系、１系通信経路毎に用意する。 requestEVD is a uDAPL object that notifies completion of the transmission of a memory information request to the storage server module and completion of each RDMA read/write operation. One is prepared for each of the 0-system and 1-system communication paths.

ｒｅｃｅｉｖｅＥＶＤは、蓄積サーバモジュールからのメモリ情報の受信完了を通知するｕＤＡＰＬオブジェクトである。０系、１系通信経路毎に用意する。 receiveEVD is a uDAPL object that notifies completion of the reception of memory information from the storage server module. One is prepared for each of the 0-system and 1-system communication paths.

ＥＰは、蓄積サーバモジュールとの通信を行うために確立するコネクションの端点となるｕＤＡＰＬオブジェクトである。本形態１では、０系、１系の２つの通信経路上で通信サーバモジュールと蓄積サーバモジュール間のコネクションを確立することでコネクションを冗長化し、両者間の通信の高信頼化を図る。そのために０系、１系用の２つのＥＰを用いる。すなわち、本形態１は、合計２本の冗長な通信経路を保有する請求項１に記載の経路冗長化手段の一例を実現している。 The EP is a uDAPL object that is an end point of a connection established for communication with the storage server module. In the present embodiment 1, the connection is made redundant by establishing the connection between the communication server module and the storage server module on the two communication paths of the 0 system and the 1 system, and the communication between the two is made highly reliable. Therefore, two EPs for 0 system and 1 system are used. In other words, the present embodiment 1 realizes an example of the path redundancy means according to claim 1 that has a total of two redundant communication paths.

各系のＥＰは、その生成時において、それぞれ図中のｒｅｑｕｅｓｔＥＶＤ、ｒｅｃｅｉｖｅＥＶＤと関連づけられているものとする。これにより、ある系のＥＰを経由して実行する動作の完了通知は同じ系内の当該ＥＶＤを通して行われる。 It is assumed that the EPs of each system are associated with the request EVD and the receive EVD in the drawing at the time of generation. Thereby, the completion notification of the operation to be executed via the EP of a certain system is performed through the EVD in the same system.

本形態１では、ＬＭＲ群とＥＰについては、同じ系内の物どうしの組み合わせでしか使用しない。このため、各系のＬＭＲ群とＥＰはその系内の１つのＰＺに関連づけて生成し、異なる系に属するこれらのリソースを混在させた処理が行えないようにしている。 In the first embodiment, the LMR group and the EP are used only in a combination of things in the same system. For this reason, the LMR group and EP of each system are generated in association with one PZ in the system so that processing in which these resources belonging to different systems are mixed cannot be performed.
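
The per-system resources described above (data transfer LMRs, S-LMR, R-LMR, EP, and the PZ that ties them together) can be laid out as in the following sketch. It is a plain data model, not uDAPL code: the classes `LMR` and `EP`, the functions `build_system` and `post_rdma`, and the buffer names are hypothetical, and the same-system check is written out by hand here purely to illustrate the effect that associating resources with one PZ per system is meant to achieve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMR:
    pz: int       # protection zone of the owning system (0 or 1)
    target: str   # name of the main-memory buffer this LMR is mapped onto


@dataclass(frozen=True)
class EP:
    pz: int       # protection zone of the owning system (0 or 1)


def build_system(k, n):
    """Create the resources of system k for an RDMA area of n buffers."""
    return {
        "ep": EP(k),
        "data": [LMR(k, f"rdma[{i}]") for i in range(n)],   # LMR[k][i]
        "S-LMR": LMR(k, "request_send_buffer"),
        "R-LMR": LMR(k, "memory_info_receive_buffer"),
    }


def post_rdma(ep, lmr):
    """Allow an operation only when the EP and LMR belong to the same system."""
    if ep.pz != lmr.pz:
        raise PermissionError("LMR and EP belong to different systems")
    return lmr.target


systems = [build_system(k, n=4) for k in (0, 1)]
# LMR[0][2] and LMR[1][2] are distinct objects mapped onto the same buffer.
print(systems[0]["data"][2].target, systems[1]["data"][2].target)
```

Each buffer exists once in main memory, while every system holds its own LMR for it, so switching systems changes which LMR and EP are used but not where the data lives.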

通信経路監視部は、通信サーバモジュールと蓄積サーバモジュールを結ぶ各通信経路の通信可否状態を監視するソフトウェアモジュールである。０系、１系の各通信経路に対応して１つずつ存在する。また、その内部に、当該通信経路の状態を記録するためのソフトウェアプログラム変数であるステータス変数を持つ。請求項４に記載の通信経路障害・回復検出手段に対応する。 The communication path monitoring unit is a software module that monitors the communication availability status of each communication path connecting the communication server module and the storage server module. One exists corresponding to each communication path of the 0 system and the 1 system. In addition, it has a status variable that is a software program variable for recording the state of the communication path. This corresponds to the communication path failure / recovery detecting means according to claim 4.

各系の通信経路監視部は、当該通信経路の対向の蓄積サーバモジュール上の通信経路監視部と相互に定期的にメッセージ交換を行い、事前に設定した規定時間（請求項３に記載の「メッセージ交換の規定時間」に対応するもの）に基づいて、当該通信経路における障害発生や障害からの回復を検出する。 The communication path monitoring unit of each system periodically exchanges messages with the communication path monitoring unit on the storage server module at the opposite end of the communication path and, based on a preset specified time (corresponding to the "specified time for message exchange" in claim 3), detects the occurrence of a failure on the path and recovery from a failure.

メッセージ交換が上記規定時間以内に終了できなかった場合、当該通信経路に障害が発生して使用不可能になったとみなし、ステータス変数に値「断」を設定する。メッセージ交換が上記規定時間以内に終了できた場合は、当該通信経路が使用可能であるとみなし、ステータス変数の値を「正常」に設定する。 If a message exchange does not complete within the specified time, the communication path is regarded as having failed and become unusable, and the status variable is set to "disconnected". If the message exchange completes within the specified time, the path is regarded as usable and the status variable is set to "normal".

なお、このメッセージ交換は、ｕＤＡＰＬのＡＰＩで提供されるメッセージｓｅｎｄ／ｒｅｃｅｉｖｅやＲＤＭＡではなく、ＩＣＭＰｅｃｈｏｒｅｑｕｅｓｔ／ｒｅｐｌｙのような軽量な通信を使用する。 Note that this message exchange uses lightweight communication such as ICMP echo request / reply instead of message send / receive and RDMA provided by the uDAPL API.
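
The status update rule of the communication path monitoring unit can be sketched as below. The class `PathMonitor` and its `probe` callable are hypothetical: `probe` stands for one message exchange and is assumed to return the round-trip time in seconds, or `None` when no reply arrives. The strings "normal" and "down" stand in for the two status values in the description, and the actual exchange (for example an ICMP echo) is deliberately left out.

```python
class PathMonitor:
    """Tracks whether one communication path is currently usable."""

    def __init__(self, probe, limit_s):
        self.probe = probe        # performs one message exchange (assumed interface)
        self.limit_s = limit_s    # specified time for message exchange
        self.status = "normal"    # status variable

    def poll(self):
        """Run one exchange and update the status variable."""
        rtt = self.probe()
        if rtt is not None and rtt <= self.limit_s:
            self.status = "normal"   # exchange finished in time: path usable
        else:
            self.status = "down"     # no reply, or reply too late: path failed
        return self.status


# Scripted replies for illustration: in time, lost, too late, in time again.
replies = iter([0.01, None, 0.5, 0.02])
monitor = PathMonitor(lambda: next(replies), limit_s=0.1)
print([monitor.poll() for _ in range(4)])
```

Because every poll overwrites the status, the same mechanism reports recovery as well as failure: the last poll in the example returns the path to "normal".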

ｕＤＡＰＬには通信経路の障害をアプリケーションソフトウェアに対して割り込み通知する機能がない。しかし、後述するように、ＥＶＤを経由して各種の処理の完了通知を待つ際、待ち時間の規定時間（請求項３に記載の「データ転送の規定時間」に対応するもの）を設定することができる。したがって、データ転送の際にこの規定時間を用いて通信経路の障害を検出することも可能である。 uDAPL has no function for notifying application software of a communication path failure by interrupt. However, as described later, when waiting via an EVD for the completion notification of an operation, a time limit on the wait (corresponding to the "specified time for data transfer" in claim 3) can be set. It is therefore also possible to detect a communication path failure during data transfer by using this specified time.

これに対し、本形態１は、請求項３、４に記載の方法、すなわち、データ転送の規定時間と通信経路監視部の２つを併用して通信経路の状態を判断する方法を用いている。これは以下の２点において前者より優れている。 On the other hand, the present embodiment 1 uses the method according to claims 3 and 4, that is, a method for judging the state of the communication path by using both the specified time for data transfer and the communication path monitoring unit. . This is superior to the former in the following two points.

・マルチメディアデータの１つのセグメントのデータサイズは非常に大きく、その転送にかかる時間のばらつき（ジッタ）も大きくなる傾向がある。したがって、データ転送の規定時間を元に通信経路の障害検出を行う方法では、このジッタを吸収できるだけの十分に長い規定時間を設定しなければ、データ転送時の単なる処理遅れと通信経路障害を混同する可能性が生じる。しかし、これでは迅速な障害検出が行えない。請求項３、４に記載の方法では、上記のような誤判断なく、しかも、障害検出の迅速化が図れる。 - The data size of one segment of multimedia data is very large, and the variation (jitter) in the time its transfer takes also tends to be large. A method that detects communication path failures from the specified time for data transfer alone must therefore set a specified time long enough to absorb this jitter, or it may mistake a mere processing delay during data transfer for a communication path failure. A specified time that long, however, rules out rapid failure detection. The method of claims 3 and 4 avoids such misjudgment and still speeds up failure detection.

・データ転送動作とは独立に動作する通信可否状態の監視機構により、障害検出だけでなく障害からの回復を検出できる。 -A communication availability monitoring mechanism that operates independently of the data transfer operation can detect not only a failure but also a recovery from the failure.
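
One way to picture the combined use of the data transfer time limit and the path monitoring status is the decision function below. It is a guess at a reasonable policy rather than the procedure of the claims, whose details are given elsewhere in the description: the function `on_transfer_timeout`, its return values, and the "normal"/"down" status strings are all hypothetical.

```python
def on_transfer_timeout(active, status):
    """Decide what to do when waiting for a transfer completion times out.

    active: number of the system currently in use (0 or 1)
    status: mapping from system number to "normal" or "down",
            as maintained by the communication path monitoring units
    """
    if status[active] == "normal":
        # The monitor still sees the path alive, so the timeout is most
        # likely transfer jitter rather than a path failure: keep waiting.
        return ("wait", active)
    standby = 1 - active
    if status[standby] == "normal":
        # The active path is down and the other one is usable: switch.
        return ("switch", standby)
    # Neither path is usable.
    return ("abort", active)


print(on_transfer_timeout(0, {0: "normal", 1: "normal"}))
print(on_transfer_timeout(0, {0: "down", 1: "normal"}))
```

Under this policy the data transfer time limit can be kept short, because a timeout alone never triggers a switch; the independently maintained status decides whether the delay counts as a failure.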

以上が通信サーバモジュールの内部構造の詳細説明である。次に、図１４に蓄積サーバモジュールの内部構造の詳細を示す。 The above is the detailed description of the internal structure of the communication server module. Next, FIG. 14 shows details of the internal structure of the storage server module.

蓄積サーバモジュールの０系ＥＰと通信サーバモジュールの０系ＥＰとの間、および、蓄積サーバモジュールの１系ＥＰと通信サーバモジュールの１系ＥＰとの間でコネクションが確立されている。 Connections are established between the 0 system EP of the storage server module and the 0 system EP of the communication server module, and between the 1 system EP of the storage server module and the 1 system EP of the communication server module.

The RDMA area in main memory is a buffer for storing the multimedia information of a program. In a distribution operation, it holds segments read from the storage device until they are transmitted to the communication server module by RDMA. In an accumulation operation, it holds segments received from the communication server module by RDMA transfer until they are written to the storage device.

本ＲＤＭＡ領域も通信サーバモジュールにおけるそれと同様に、ｒｄ＿ｐｔｒ、ｗｒ＿ｐｔｒ、ｆｕｌｌ＿ｆｌａｇ、ｅｍｐｔｙ＿ｆｌａｇを用いて仮想的なリングバッファとして使用する。 This RDMA area is also used as a virtual ring buffer by using rd_ptr, wr_ptr, full_flag, and empty_flag, as in the communication server module.

メモリ情報送信バッファは、通信サーバモジュールからのメモリ情報要求に対して送信するメモリ情報を格納するバッファである。 The memory information transmission buffer is a buffer for storing memory information to be transmitted in response to a memory information request from the communication server module.

要求受信バッファは、通信サーバモジュールからのメモリ情報要求を受信するためのバッファである。 The request reception buffer is a buffer for receiving a memory information request from the communication server module.

これら以外の構成要素については、通信サーバモジュールの対応する構成要素と同じあるため説明は省略する。 Since the other constituent elements are the same as the corresponding constituent elements of the communication server module, description thereof will be omitted.

以上が通信サーバモジュール、および、蓄積サーバモジュールの内部構造の詳細説明である。以上のように構成した相互結合網冗長型マルチメディアサーバ装置の動作の詳細を以下に説明する。 The above is the detailed description of the internal structure of the communication server module and the storage server module. Details of the operation of the interconnection network redundant multimedia server apparatus configured as described above will be described below.

なお、以下では、同サーバ装置から端末へのマルチメディアデータの配信動作だけを説明する。蓄積動作の場合はマルチメディアデータの転送方向が逆になるだけで、本質的な差異は存在しない。 In the following, only the multimedia data distribution operation from the server apparatus to the terminal will be described. In the case of the accumulation operation, only the transfer direction of the multimedia data is reversed, and there is no essential difference.

Form 1 does not specify when each component in the communication and storage server modules is created; it assumes that every component exists before any operation starts. Accordingly, the connections between the communication server module and the storage server module shown in FIGS. 13 and 14 connect the peer server modules of the distribution operation described below, and they are likewise assumed to exist in advance.

For simplicity, the EVDs for notifying connection establishment between EPs are omitted from FIGS. 13 and 14, and the RMRs corresponding to the data transfer LMRs are omitted from FIG. 14.

First, FIG. 15 is a flowchart of the processing of the communication server module in Form 1. The processing is described following steps S1 to S15 in the figure.

（Ｓ１）端末から番組の配信要求を受信する。 (S1) A program distribution request is received from the terminal.

(S2) Access the in-server database and search its program storage information table, using the program ID contained in the distribution request as the key, to obtain the storage device number of the storage device in which the program is stored.

（Ｓ３）Ｓ２で得た蓄積装置番号の蓄積装置を管理する蓄積サーバモジュールに対し、当該番組ＩＤを含む配信開始通知を発行する。 (S3) A distribution start notification including the program ID is issued to the storage server module that manages the storage device having the storage device number obtained in S2.

（Ｓ４）当該蓄積サーバモジュールからの前処理完了通知を受信するまで待つ。この前処理完了通知は、蓄積サーバモジュール側が各種リソースの初期化を完了し、当該番組のセグメントの転送が可能となったことを示す。 (S4) Wait until a preprocessing completion notification is received from the storage server module. This pre-processing completion notification indicates that the storage server module side has completed initialization of various resources, and the segment of the program can be transferred.

(S5) Set the initial delivery time to the terminal, that is, the time at which delivery of the program to the terminal starts. For example, if the initial delivery time is set to the time at which the distribution request from the terminal was received, delivery starts as soon as the first segment of the program is received from the storage server module.

（Ｓ６）ｒｄ＿ｐｔｒ、および、ｗｒ＿ｐｔｒの値をそれぞれ、主メモリ内のバッファ［０］を指すよう、０に初期化する。また、上記バッファがまだ空（セグメントが一切格納されていない）であるため、ｆｕｌｌ＿ｆｌａｇを０、ｅｍｐｔｙ＿ｆｌａｇを１に初期化する。 (S6) The values of rd_ptr and wr_ptr are each initialized to 0 to point to the buffer [0] in the main memory. Since the buffer is still empty (no segment is stored), full_flag is initialized to 0 and empty_flag is initialized to 1.

（Ｓ７）冗長化された２つの通信経路のうち、使用する通信経路の系番号を表す変数Ｉを０に初期化する（最初は０系通信経路から使用する）。 (S7) Of the two redundant communication paths, a variable I representing the system number of the communication path to be used is initialized to 0 (initially used from the 0 system communication path).

（Ｓ８）後述する「端末へのセグメント配信処理」を起動する。 (S8) “Segment distribution process to terminal” described later is started.

(S9) Write the contents of the memory information request to be sent to the storage server module into the request transmission buffer in main memory. By including, for example, a number that identifies the segment to be transferred, this request identifies the LMR of the storage server module that is the target of the RDMA read operation. This specification does not define the format of the request.

（Ｓ１０）ｆｕｌｌ＿ｆｌａｇの値が０になるまで待つ。すなわち、主メモリ上にバッファの空きがあり、蓄積サーバモジュールからのセグメントの受信が可能になるまで待つ。 (S10) Wait until the value of full_flag becomes 0. That is, the process waits until there is an empty buffer in the main memory and a segment can be received from the storage server module.

（Ｓ１１）蓄積サーバモジュールから当該番組の１つのセグメントを受信する。本ステップの詳細は後述する。 (S11) One segment of the program is received from the storage server module. Details of this step will be described later.

(S12) After receiving one segment, update wr_ptr so that it points to the buffer in which the next segment will be stored. If wr_ptr already points to the last buffer [N1], set it to the first buffer [0]; otherwise advance it to the next buffer.

(S13) Immediately after a segment is received, it is certain that not all buffers in main memory are empty, so set empty_flag to 0.

（Ｓ１４）Ｓ１２でｗｒ＿ｐｔｒの値を更新した結果、もしｒｄ＿ｐｔｒの値と等しくなれば、主メモリ上の全バッファがセグメントデータによって占有されたことを意味するため、ｆｕｌｌ＿ｆｌａｇの値を１にセットする。そうでなければ何もしない。 (S14) As a result of updating the value of wr_ptr in S12, if it becomes equal to the value of rd_ptr, it means that all the buffers in the main memory are occupied by the segment data, so the value of full_flag is set to 1. Otherwise it does nothing.

（Ｓ１５）Ｓ１１で蓄積サーバモジュールから受信したセグメントが当該番組の最終セグメントであれば、処理を終了する。そうでなければＳ９へ戻る。 (S15) If the segment received from the storage server module in S11 is the last segment of the program, the process ends. Otherwise, return to S9.
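The pointer and flag handling of S6 and S10 to S14 can be sketched as a small ring buffer class. This is only an illustrative sketch: the class name `SegmentRing`, the `num_buffers` parameter, and the use of exceptions in place of waiting are assumptions made here, and the sketch is single-threaded with no locking. The `get` method shows the mirror-image update that a reader of the same ring would perform.

```python
class SegmentRing:
    """Fixed set of buffers tracked by rd_ptr, wr_ptr, full_flag and empty_flag."""

    def __init__(self, num_buffers):
        # num_buffers is a hypothetical parameter; the text only names the last buffer.
        self.buffers = [None] * num_buffers
        self.rd_ptr = 0        # S6: both pointers start at buffer [0]
        self.wr_ptr = 0
        self.full_flag = 0     # S6: no segment is stored yet
        self.empty_flag = 1

    def _next(self, ptr):
        # Wrap from the last buffer back to buffer [0]; otherwise advance by one.
        return 0 if ptr == len(self.buffers) - 1 else ptr + 1

    def put(self, segment):
        """Writer side: store one segment (S11 to S14)."""
        if self.full_flag:
            # The text waits here (S10); this sketch raises instead of blocking.
            raise BufferError("ring is full")
        self.buffers[self.wr_ptr] = segment
        self.wr_ptr = self._next(self.wr_ptr)   # S12
        self.empty_flag = 0                     # S13
        if self.wr_ptr == self.rd_ptr:          # S14: writer caught up with reader
            self.full_flag = 1

    def get(self):
        """Reader side: take one segment, mirroring the writer's updates."""
        if self.empty_flag:
            raise BufferError("ring is empty")
        segment = self.buffers[self.rd_ptr]
        self.rd_ptr = self._next(self.rd_ptr)
        self.full_flag = 0
        if self.rd_ptr == self.wr_ptr:          # reader caught up with writer
            self.empty_flag = 1
        return segment
```

The two flags are needed because `rd_ptr == wr_ptr` alone cannot distinguish a completely full ring from a completely empty one.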

FIG. 16 is a flowchart of the “segment distribution process to terminal” started in S8 of FIG. 15. The process is described following steps S20 to S27 in the figure.

（Ｓ２０）現在の時刻が、設定されている「端末への配信時刻」を過ぎるまで待つ。 (S20) Wait until the current time passes the set “delivery time to terminal”.

（Ｓ２１）主メモリのｅｍｐｔｙ＿ｆｌａｇを参照し、その値が０、すなわち、配信すべきセグメントがあることを確認する。ｅｍｐｔｙ＿ｆｌａｇの値が１であれば０になるまで待つ。 (S21) Referring to empty_flag in the main memory, it is confirmed that the value is 0, that is, there is a segment to be distributed. If the value of empty_flag is 1, it waits until it becomes 0.

（Ｓ２２）バッファ［ｒｄ＿ｐｔｒ］に格納されているセグメントを、Ｓ１で受信した配信要求内の要求元ＩＤで特定される端末へ配信する。本明細書では、端末側ネットワークへ配信する際のデータ形式については規定しない。 (S22) The segment stored in the buffer [rd_ptr] is distributed to the terminal specified by the request source ID in the distribution request received in S1. In this specification, the data format for distribution to the terminal-side network is not specified.

(S23) After delivering one segment, update rd_ptr so that it points to the buffer holding the next segment. If rd_ptr already points to the last buffer [N1], set it to the first buffer [0]; otherwise advance it to the next buffer.

（Ｓ２４）セグメントの配信直後は、主メモリのバッファのうちセグメントが格納されていない空のバッファが少なくとも１つはあるので、ｆｕｌｌ＿ｆｌａｇの値を０にセットする。 (S24) Immediately after the segment is delivered, there is at least one empty buffer in the main memory in which no segment is stored, so the value of full_flag is set to 0.

（Ｓ２５）Ｓ２３でｒｄ＿ｐｔｒの値を更新した結果、もしｗｒ＿ｐｔｒの値と等しくなれば、主メモリ上の全バッファが空となったことを意味するため、ｅｍｐｔｙ＿ｆｌａｇの値を１にセットする。そうでなければ何もしない。 (S25) As a result of updating the value of rd_ptr in S23, if it becomes equal to the value of wr_ptr, it means that all the buffers in the main memory are empty, so the value of empty_flag is set to 1. Otherwise it does nothing.

（Ｓ２６）現在設定されている「端末への配信時刻」に、端末での１セグメントの再生時間を加えたものを、次の配信時刻として設定する。 (S26) A value obtained by adding the playback time of one segment at the terminal to the currently set “delivery time to terminal” is set as the next delivery time.

（Ｓ２７）Ｓ２２で配信したセグメントが当該番組の最終セグメントであれば、処理を終了する。そうでなければＳ２０へ戻る。 (S27) If the segment distributed in S22 is the last segment of the program, the process is terminated. Otherwise, the process returns to S20.
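The pacing implied by S20 and S26 can be illustrated with a short calculation. The function name and parameters below are hypothetical, and times are plain numbers in arbitrary units; the sketch only shows that each delivery time is derived from the previously scheduled time rather than from the moment the previous segment was actually sent.

```python
def delivery_schedule(initial_time, segment_playback_time, num_segments):
    """Return the scheduled delivery time of each segment (S20 and S26)."""
    times = []
    delivery_time = initial_time                 # initial delivery time set in S5
    for _ in range(num_segments):
        times.append(delivery_time)              # S20: deliver once this time has passed
        delivery_time += segment_playback_time   # S26: next time = current + playback time
    return times
```

For example, `delivery_schedule(100, 2, 3)` yields `[100, 102, 104]`. Because the schedule is built from the set delivery time, a late delivery of one segment does not shift the times of the following segments.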

FIG. 17 shows S11 of FIG. 15, “segment reception processing from storage server module,” in detail. The process is described following steps S30 to S44 in the figure.

(S30) Send the contents of the request transmission buffer as a memory information request to the storage server module over the I-system communication path, using S-LMR[I]. The request is sent not by RDMA but by the simple message send mechanism provided by uDAPL.

（Ｓ３１）Ｓ３０のメッセージｓｅｎｄが事前に設定した規定時間内に正常に終了すれば、Ｓ３５へ移る。規定時間内に終了しなければＳ３２へ移る。 (S31) If the message send in S30 ends normally within a predetermined time set in advance, the process proceeds to S35. If not completed within the specified time, the process proceeds to S32.

（Ｓ３２）Ｉ系通信経路のステータス変数の値を確認する。その値が「正常」なら、メッセージｓｅｎｄの処理遅れの可能性があるため、Ｓ３１へ戻る。「正常」でなければ（「断」であれば）Ｓ３３へ移る。 (S32) The status variable value of the I-system communication path is confirmed. If the value is “normal”, there is a possibility that the processing of the message send is delayed, and the process returns to S31. If it is not “normal” (if “off”), the process proceeds to S33.

（Ｓ３３）Ｉ系通信経路に障害が発生したと判断し、代わりに使用できる通信経路が他にあるか否かを判断する。なければここで処理を終了（異常終了）する。あればＳ３４へ移る。 (S33) It is determined that a failure has occurred in the I-system communication path, and it is determined whether there are other communication paths that can be used instead. If not, the process ends here (abnormal end). If so, the process proceeds to S34.

（Ｓ３４）新しく使用する通信経路の系番号を決定する。ステータスが「正常」な通信経路の系番号のうち、最小の番号を変数Ｉにセットする。その後、Ｓ３０に戻り、新しい通信経路上でメモリ情報要求を送信し直す。 (S34) The system number of the communication path to be newly used is determined. The smallest number among the system numbers of the communication paths whose status is “normal” is set in the variable I. Thereafter, the process returns to S30, and the memory information request is transmitted again on the new communication path.

（Ｓ３５）Ｒ−ＬＭＲ［Ｉ］を用いてＩ系通信経路を介してメモリ情報を蓄積サーバモジュールから受信し、内容をメモリ情報受信バッファに格納する。このメモリ情報の受信はＲＤＭＡではなく、ｕＤＡＰＬで用意されている単純なメッセージｒｅｃｅｉｖｅ機構を用いる。 (S35) The memory information is received from the storage server module via the I-system communication path using R-LMR [I], and the contents are stored in the memory information reception buffer. This memory information is received by using a simple message receive mechanism prepared by uDAPL, not by RDMA.

（Ｓ３６）Ｓ３５のメッセージｒｅｃｅｉｖｅが事前に設定した規定時間内に正常に終了すれば、Ｓ４０へ移る。規定時間内に終了しなければＳ３７へ移る。 (S36) If the message receive in S35 ends normally within the specified time set in advance, the process proceeds to S40. If not finished within the specified time, the process proceeds to S37.

（Ｓ３７）Ｉ系通信リンクのステータス変数の値を確認する。その値が「正常」なら、メッセージｒｅｃｅｉｖｅの処理遅れの可能性があるため、Ｓ３６へ戻る。「正常」でなければ（「断」であれば）Ｓ３８へ移る。 (S37) The status variable value of the I-system communication link is confirmed. If the value is “normal”, there is a possibility that the message receive processing may be delayed, and the process returns to S36. If it is not “normal” (if “off”), the process proceeds to S38.

（Ｓ３８）Ｉ系通信経路に障害が発生したと判断し、代わりに使用できる通信経路が他にあるか否かを判断する。なければここで処理を終了（異常終了）する。あればＳ３９へ移る。 (S38) It is determined that a failure has occurred in the I-system communication path, and it is determined whether there are other communication paths that can be used instead. If not, the process ends here (abnormal end). If there is, move to S39.

（Ｓ３９）新しく使用する通信経路の系番号を決定する。ステータスが「正常」な通信経路の系番号のうち、最小の番号を変数Ｉにセットする。その後、Ｓ３５に戻り、新しい通信リンク上でメモリ情報を受信し直す。 (S39) The system number of the communication path to be newly used is determined. The smallest number among the system numbers of the communication paths whose status is “normal” is set in the variable I. Thereafter, the process returns to S35, and the memory information is received again on the new communication link.

(S40) To receive a segment of the program from the storage server module, designate LMR[I][rd_ptr] as the receiving LMR and issue an RDMA read whose target is the storage server module buffer address given in the memory information received in S36.

（Ｓ４１）Ｓ４０のＲＤＭＡｒｅａｄが事前に設定した規定時間内に正常に終了しセグメントをバッファ［Ｉ］［ｒｄ＿ｐｔｒ］に受信できれば、図１５のＳ１２へ移る。規定時間内に終了しなければＳ４２へ移る。 (S41) If the RDMA read in S40 ends normally within a predetermined time set in advance and the segment can be received in the buffer [I] [rd_ptr], the process proceeds to S12 in FIG. If not completed within the specified time, the process proceeds to S42.

（Ｓ４２）Ｉ系通信経路のステータス変数の値を確認する。その値が「正常」なら、ＲＤＭＡｒｅａｄの処理遅れの可能性があるため、Ｓ４１へ戻る。「正常」でなければ（「断」であれば）Ｓ４３へ移る。 (S42) The value of the status variable of the I-system communication path is confirmed. If the value is “normal”, there is a possibility that RDMA read processing is delayed, and the process returns to S41. If it is not “normal” (if “off”), the process proceeds to S43.

（Ｓ４３）Ｉ系通信経路に障害が発生したと判断し、代わりに使用できる通信経路が他にあるか否かを判断する。なければここで処理を終了（異常終了）する。あればＳ４４へ移る。 (S43) It is determined that a failure has occurred in the I-system communication path, and it is determined whether there are other communication paths that can be used instead. If not, the process ends here (abnormal end). If there is, the process proceeds to S44.

（Ｓ４４）新しく使用する通信経路の系番号を決定する。ステータスが「正常」な通信経路の系番号のうち、最小の番号を変数Ｉにセットする。その後、Ｓ４０に戻り、新しい通信経路上でＲＤＭＡｒｅａｄをやり直す。 (S44) The system number of the communication path to be newly used is determined. The smallest number among the system numbers of the communication paths whose status is “normal” is set in the variable I. Thereafter, the process returns to S40, and RDMA read is performed again on the new communication path.
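The three step groups S30 to S34, S35 to S39, and S40 to S44 share one pattern: issue an operation, wait up to the specified time, consult the status variable on timeout, and switch paths only if the path is reported as failed. A minimal sketch of that pattern follows. The callables `start_op` and `wait_done` are hypothetical stand-ins for the uDAPL send, receive, or RDMA read and the timed completion wait; they are not uDAPL API names.

```python
def run_with_failover(start_op, wait_done, path_status, current, timeout):
    """Run one operation, switching paths on failure.

    start_op(i): hypothetical callable that issues the operation on path i.
    wait_done(i, timeout): hypothetical callable, True if it completed in time.
    path_status: list of "normal" / "off" values kept by the path monitors.
    Returns the system number of the path on which the operation completed.
    """
    while True:
        start_op(current)                        # S30 / S35 / S40
        switched = False
        while not switched:
            if wait_done(current, timeout):      # S31 / S36 / S41
                return current
            if path_status[current] == "normal":
                continue                         # S32: likely a processing delay, wait again
            healthy = [i for i, s in enumerate(path_status) if s == "normal"]
            if not healthy:                      # S33: no alternative path remains
                raise ConnectionError("no usable communication path")
            current = healthy[0]                 # S34: smallest "normal" system number
            switched = True                      # reissue the operation on the new path
```

Consulting the status variable before switching is what lets the specified time stay short: a timeout on a path that the monitor still reports as “normal” simply leads to another wait.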

図１８は、図１３における０系、１系通信経路監視部の処理を示したフローチャートである。各系の通信経路監視部は互いに独立に、また、図１５、図１６、図１７に示した通信サーバモジュールの動作とも独立に、それぞれ同図のように動作する。同図のＳ５０〜Ｓ６０に従って通信経路監視部の処理を説明する。 FIG. 18 is a flowchart showing the processing of the 0-system and 1-system communication path monitoring unit in FIG. The communication path monitoring units of each system operate independently of each other and independently of the operations of the communication server modules shown in FIGS. 15, 16, and 17, respectively. The processing of the communication path monitoring unit will be described according to S50 to S60 of the same figure.

（Ｓ５０）自通信経路の状態を表すステータス変数の値を「正常」に初期化する。 (S50) The value of the status variable indicating the state of the own communication path is initialized to “normal”.

(S51) Send an ICMP echo request packet to the communication path monitoring unit for the same system on the opposite host.

(S52) If an ICMP echo reply packet is received from that opposite communication path monitoring unit within the specified time set in advance, judge that the own communication path has no problem and return to S51 to continue monitoring. Otherwise, proceed to S53.

(S53) Judge that a failure has occurred in the own communication path and change the value of the status variable to “off”.

(S54) To detect recovery of the own communication path, send an ICMP echo request packet to the opposite communication path monitoring unit of S51.

(S55) If an ICMP echo reply packet is received from that opposite communication path monitoring unit within the specified time set in advance, judge that the own communication path has recovered and proceed to S56. Otherwise, return to S54 and continue the recovery detection operation.

（Ｓ５６）自通信経路のＥＰと対向ホスト上のＥＰとの間のコネクションを切断する。 (S56) The connection between the EP of the own communication path and the EP on the opposite host is disconnected.

（Ｓ５７）自通信経路のＥＰを消去する。 (S57) The EP of the own communication path is deleted.

（Ｓ５８）自通信経路の新しいＥＰを生成する。 (S58) A new EP for the own communication path is generated.

（Ｓ５９）Ｓ５８で生成した新しいＥＰを用いて、自通信経路における対向のＥＰと新規にコネクションを確立する。 (S59) Using the new EP generated in S58, a new connection is established with the opposite EP in the communication path.

（Ｓ６０）自通信経路のステータス変数の値を「正常」に戻し、Ｓ５１へ移って通常の回線監視動作を行う。 (S60) The value of the status variable of the own communication path is returned to “normal”, and the process proceeds to S51 to perform a normal line monitoring operation.

InfiniBand has no function for explicitly cancelling a communication operation once it has been issued on a communication path. For this reason, steps S56 to S59 do not reuse a recovered communication path as it is; they delete the EP and its connection and then create new ones. This prevents a communication operation that was in progress at the moment the failure occurred from being erroneously restarted after recovery from the failure.
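The monitoring loop of S50 to S60 behaves as a two-state machine. The sketch below models one probe cycle per call so that the transitions are easy to follow. The `probe` and `reconnect` callables are hypothetical placeholders for the ICMP echo exchange with its timed reply wait and for the EP teardown and re-creation; no real networking is performed.

```python
class PathMonitor:
    """Two-state monitor for one communication path (S50 to S60)."""

    def __init__(self, probe, reconnect):
        # probe(): hypothetical, True if an echo reply arrived within the specified time.
        # reconnect(): hypothetical, stands in for S56 to S59
        # (disconnect, delete the EP, create a new EP, establish a new connection).
        self.probe = probe
        self.reconnect = reconnect
        self.status = "normal"          # S50

    def step(self):
        """Run one probe cycle and return the resulting status."""
        alive = self.probe()            # S51/S52 when normal, S54/S55 when off
        if self.status == "normal":
            if not alive:
                self.status = "off"     # S53
        elif alive:
            self.reconnect()            # S56 to S59
            self.status = "normal"      # S60
        return self.status
```

The status is set back to “normal” only after `reconnect` has run, so a data transfer that reads the status variable never sees a recovered path whose EP has not yet been rebuilt.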

FIG. 19 is a flowchart of the processing of the storage server module in Form 1. The processing of the storage server module is described following steps S70 to S81 in the figure.

(S70) Receive the distribution start notification that the communication server module sent in S3 of FIG. 15.

（Ｓ７１）ｒｄ＿ｐｔｒ、および、ｗｒ＿ｐｔｒの値をそれぞれ、主メモリ内のバッファ［０］を指すよう、０に初期化する。また、上記バッファがまだ空（セグメントが一切格納されていない）であるため、ｆｕｌｌ＿ｆｌａｇを０、ｅｍｐｔｙ＿ｆｌａｇを１に初期化する。 (S71) The values of rd_ptr and wr_ptr are each initialized to 0 to indicate the buffer [0] in the main memory. Since the buffer is still empty (no segment is stored), full_flag is initialized to 0 and empty_flag is initialized to 1.

（Ｓ７２）冗長化された２つの通信リンクのうち、使用する方の通信経路を示す変数Ｉを０に初期化する（最初は０系通信リンクから使用する）。 (S72) Of the two redundant communication links, the variable I indicating the communication path to be used is initialized to 0 (initially used from the 0-system communication link).

（Ｓ７３）後述する「蓄積装置からのセグメント読み出し処理」を起動する。 (S73) “Segment read processing from storage device” described later is started.

（Ｓ７４）ｅｍｐｔｙ＿ｆｌａｇの値が０になるまで待つ。すなわち、蓄積装置から読み出され主メモリ上のバッファに格納されたセグメントが存在し、通信サーバモジュールへのセグメントの送信が可能になるまで待つ。 (S74) Wait until the value of empty_flag becomes 0. That is, it waits until there is a segment read from the storage device and stored in the buffer on the main memory, and the segment can be transmitted to the communication server module.

（Ｓ７５）バッファ［Ｉ］［ｒｄ＿ｐｔｒ］内のセグメントが当該番組の最初のセグメントであればＳ７６へ、そうでなければＳ７７へ移る。 (S75) If the segment in the buffer [I] [rd_ptr] is the first segment of the program, the process proceeds to S76, and if not, the process proceeds to S77.

（Ｓ７６）通信サーバモジュールに対して前処理完了通知を発行する。 (S76) A pre-processing completion notification is issued to the communication server module.

（Ｓ７７）通信サーバモジュールへ当該番組の１つのセグメントを送信する。本ステップの詳細は後述する。 (S77) One segment of the program is transmitted to the communication server module. Details of this step will be described later.

（Ｓ７８）１つのセグメントを送信したら、ｒｄ＿ｐｔｒがその次に送信するセグメントの格納先のバッファを指すよう、値を更新する。すでに最後尾のバッファ［Ｎ１］を指していれば先頭バッファ［０］を、そうでなければ、１つ先のバッファを指すように値を更新する。 (S78) When one segment is transmitted, the value is updated so that rd_ptr points to the storage buffer of the segment to be transmitted next. If the last buffer [N1] has already been indicated, the value is updated so that the first buffer [0] is indicated, otherwise the next buffer is indicated.

(S79) Immediately after a segment is sent, it is certain that not all buffers in main memory are full, so set full_flag to 0.

（Ｓ８０）Ｓ７８でｒｄ＿ｐｔｒの値を更新した結果、もしｗｒ＿ｐｔｒの値と等しくなれば、主メモリ上の全バッファが空となったことを意味するため、ｅｍｐｔｙ＿ｆｌａｇの値を１にセットする。そうでなければ何もしない。 (S80) As a result of updating the value of rd_ptr in S78, if it becomes equal to the value of wr_ptr, it means that all the buffers in the main memory have become empty, so the value of empty_flag is set to 1. Otherwise it does nothing.

（Ｓ８１）Ｓ７７で通信サーバモジュールへ送信したセグメントが当該番組の最終セグメントであれば、処理を終了する。そうでなければＳ７４へ戻る。 (S81) If the segment transmitted to the communication server module in S77 is the last segment of the program, the process ends. Otherwise, the process returns to S74.

FIG. 20 is a flowchart of the “segment read processing from storage device” started in S73 of FIG. 19. The process is described following steps S90 to S95 in the figure.

（Ｓ９０）ｆｕｌｌ＿ｆｌａｇの値が０になるまで待つ。すなわち、主メモリ上にバッファの空きがあり、蓄積装置から読み出すセグメントの格納が可能になるまで待つ。 (S90) Wait until the value of full_flag becomes zero. That is, it waits until there is an empty buffer in the main memory and storage of the segment read from the storage device becomes possible.

（Ｓ９１）蓄積装置から当該番組のセグメントを１つ読み出し、主メモリ上のバッファ［ｗｒ＿ｐｔｒ］に格納する。 (S91) One segment of the program is read from the storage device and stored in the buffer [wr_ptr] on the main memory.

（Ｓ９２）１つのセグメントをバッファに格納したら、ｗｒ＿ｐｔｒがその次に蓄積装置から読み出したセグメントの格納先のバッファを指すよう、値を更新する。すでに最後尾のバッファ［Ｎ１］を指していれば先頭バッファ［０］を、そうでなければ、１つ先のバッファを指すように値を更新する。 (S92) When one segment is stored in the buffer, the value is updated so that wr_ptr points to the storage buffer of the next segment read from the storage device. If the last buffer [N1] has already been indicated, the value is updated so that the first buffer [0] is indicated, otherwise the next buffer is indicated.

（Ｓ９３）蓄積装置から読み出したセグメントを主メモリ上のバッファへ格納した直後は、全バッファが空でないことが確実であるため、ｅｍｐｔｙ＿ｆｌａｇの値を０にセットする。 (S93) Immediately after storing the segment read from the storage device in the buffer on the main memory, it is certain that all the buffers are not empty, so the value of empty_flag is set to 0.

(S94) If updating wr_ptr in S92 made it equal to rd_ptr, all buffers in main memory are now occupied by segment data, so set full_flag to 1. Otherwise, do nothing.

（Ｓ９５）Ｓ９１で蓄積装置から読み出したセグメントが当該番組の最終セグメントであれば、処理を終了する。そうでなければＳ９０へ戻る。 (S95) If the segment read from the storage device in S91 is the last segment of the program, the process ends. Otherwise, the process returns to S90.

FIG. 21 shows S77 of FIG. 19, “segment transmission process to communication server module,” in detail. The process is described following steps S100 to S111 in the figure.

（Ｓ１００）Ｒ−ＬＭＲ［Ｉ］を用いてＩ系通信経路を介してメモリ情報要求を通信サーバモジュールから受信し、内容を要求受信バッファに格納する。このメモリ情報要求の受信はＲＤＭＡではなく、ｕＤＡＰＬで用意されている単純なメッセージｒｅｃｅｉｖｅ機構を用いる。 (S100) A memory information request is received from the communication server module via the I-system communication path using R-LMR [I], and the contents are stored in the request reception buffer. The memory information request is received not by RDMA but by using a simple message receive mechanism prepared by uDAPL.

（Ｓ１０１）Ｓ１００のメッセージｒｅｃｅｉｖｅが事前に設定した規定時間内に正常に終了すれば、Ｓ１０５へ移る。規定時間内に終了しなければＳ１０２へ移る。 (S101) If the message receive in S100 ends normally within a predetermined time set in advance, the process proceeds to S105. If not completed within the specified time, the process proceeds to S102.

（Ｓ１０２）Ｉ系通信経路のステータス変数の値を確認する。その値が「正常」なら、メッセージｒｅｃｅｉｖｅの処理遅れの可能性があるため、Ｓ１０１へ戻る。「正常」でなければ（「断」であれば）Ｓ１０３へ移る。 (S102) The status variable value of the I-system communication path is confirmed. If the value is “normal”, there is a possibility that the process of message receive may be delayed, and the process returns to S101. If it is not “normal” (if “off”), the process proceeds to S103.

（Ｓ１０３）Ｉ系通信経路に障害が発生したと判断し、代わりに使用できる通信経路が他にあるか否かを判断する。なければここで処理を終了（異常終了）する。あればＳ１０４へ移る。 (S103) It is determined that a failure has occurred in the I-system communication path, and it is determined whether there are other communication paths that can be used instead. If not, the process ends here (abnormal end). If there is, the process proceeds to S104.

(S104) Determine the system number of the communication path to use next: set variable I to the smallest system number among the communication paths whose status is “normal”. Then return to S100 and receive the memory information request again on the new communication path.

（Ｓ１０５）通信サーバモジュールへ送信するメモリ情報の内容をメモリ情報送信バッファへ書き込む。メモリ情報には、バッファ［ｒｄ＿ｐｔｒ］に対応するｒｍｒ＿ｔｒｉｐｌｅｔ（開始アドレス、レングス、アクセスキー）が含まれる。 (S105) The contents of the memory information to be transmitted to the communication server module are written into the memory information transmission buffer. The memory information includes rmr_triplet (start address, length, access key) corresponding to the buffer [rd_ptr].

（Ｓ１０６）主メモリ上のバッファ［ｒｄ＿ｐｔｒ］に蓄積装置から読み出したセグメントが格納された状態になるまで待つ。 (S106) Wait until the segment read from the storage device is stored in the buffer [rd_ptr] on the main memory.

（Ｓ１０７）Ｓ−ＬＭＲ［Ｉ］を用いてＩ系通信経路を介して、バッファ［ｒｄ＿ｐｔｒ］に関するメモリ情報を通信サーバモジュールへ送信する。このメモリ情報の送信はＲＤＭＡではなく、ｕＤＡＰＬで用意されている単純なメッセージｓｅｎｄ機構を用いる。 (S107) The memory information regarding the buffer [rd_ptr] is transmitted to the communication server module via the I-system communication path using S-LMR [I]. The memory information is transmitted using a simple message send mechanism prepared by uDAPL instead of RDMA.

(S108) If the message send of S107 completes normally within the specified time set in advance, proceed to S78 of FIG. 19. Otherwise, proceed to S109.

(S109) Check the value of the status variable of the I-system communication path. If it is “normal”, the message send may merely be delayed, so return to S108. If it is not “normal” (that is, “off”), proceed to S110.

(S110) Judge that a failure has occurred in the I-system communication path and determine whether another usable communication path exists. If none exists, end the process here (abnormal end). If one exists, proceed to S111.

(S111) Determine the system number of the communication path to use next: set variable I to the smallest system number among the communication paths whose status is “normal”. Then return to S107 and send the memory information again on the new communication path.
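S105 states that the memory information carries an rmr_triplet (start address, length, access key) for the buffer to be read. One way such a message could be represented is sketched below. The field widths, byte order, and the names `RmrTriplet`, `pack`, and `unpack` are assumptions for illustration only; the actual message format and the uDAPL structure layout are not given in the text above.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: 64-bit address, 64-bit length, 32-bit key, network byte order.
_FORMAT = "!QQI"


@dataclass(frozen=True)
class RmrTriplet:
    start_address: int   # where the segment buffer begins on the storage side
    length: int          # size of the segment buffer in bytes
    access_key: int      # key that authorizes the remote read

    def pack(self) -> bytes:
        """Serialize the triplet for the message send of the memory information."""
        return struct.pack(_FORMAT, self.start_address, self.length, self.access_key)

    @classmethod
    def unpack(cls, data: bytes) -> "RmrTriplet":
        """Rebuild the triplet on the receiving side before issuing the RDMA read."""
        return cls(*struct.unpack(_FORMAT, data))
```

Sending the triplet in an ordinary message, and only then letting the peer issue the RDMA read, keeps the storage side in control of which buffer is exposed for each segment.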

なお、図１４における０系、１系通信経路監視部は、通信サーバモジュールのそれらと同様に、それぞれが独立に、また、図１９、図２０、図２１に示した蓄積サーバモジュールの動作とも独立に動作する。その動作は、図１８に示した通信サーバモジュールの通信経路監視部の動作と同じである。 Note that the 0-system and 1-system communication path monitoring units in FIG. 14 are independent of each other, and independent of the operation of the storage server modules shown in FIGS. To work. The operation is the same as the operation of the communication path monitoring unit of the communication server module shown in FIG.

Form 1 uses two inter-server-module network switches and therefore has two connections. When D (≥ 2) inter-server-module network switches are used, D connections are provided, and each module correspondingly has D of each component, such as the data transfer LMR groups.

[Best Mode for Carrying Out the Invention 2]
Best mode 2 for carrying out the present invention (hereinafter, Form 2) is described below. Form 2 is based on the methods of claims 2, 3, and 4 in [Claims], and because it is a server apparatus that embodies those methods, it also corresponds to claim 6. FIG. 22 shows the overall configuration of the server apparatus of Form 2.

The difference from Form 1 shown in FIG. 11 is the presence of a communication link that connects the 0 system and the 1 system. This link corresponds to the inter-interconnection-network communication link of claim 2. In Form 2, it enables communication between the 0-system HCA and the 1-system HCA.

FIG. 23 shows the internal structure of the communication server module and the storage server module of FIG. 22 in detail.

The data storage means on the communication server module corresponds to that of claim 1 and is a simplified, combined representation of the buffer, the request transmission buffer, and the memory information reception buffer of FIG. 13. Memory area abstraction means 1 to 4 correspond to the memory area abstraction means of claim 1 and are a simplified, combined representation of the data transfer LMR group and the memory information exchange LMR group of FIG. 13. Communication path monitoring units 1 to 4 correspond to the communication path failure/recovery detection means of claim 4. As in FIG. 13, a PZ and the EVDs are provided for each connection, but they are not shown. Likewise, a read pointer, a write pointer, a full flag, and an empty flag are provided as in FIG. 13, but they are not shown.

The data storage means on the storage server module corresponds to that of claim 1 and is a simplified, combined representation of the buffer, the memory information transmission buffer, and the request reception buffer of FIG. 14. As with the communication server module, the remaining parts are drawn with some elements of FIG. 14 omitted.

Embodiment 2 differs from Embodiment 1 in that it has a total of four connections, including ones that cross between the system 0 and system 1 communication links of the communication server module and the storage server module. Both computers accordingly have four each of the memory area abstraction means, EPs, and communication path monitoring units, as well as of the PZs and EVDs that are not shown.

This gives inter-computer communication higher fault tolerance than in Embodiment 1. In Embodiment 1, if two links are cut, for example the system 0 communication link of the communication server module and the system 1 communication link of the storage server module, the two computers can no longer communicate. In Embodiment 2, communication can continue even in this case over connection 3, which uses the system 1 communication link of the communication server module and the system 0 communication link of the storage server module.
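
The double-failure case above can be checked with a short sketch. It is illustrative only: the function names are hypothetical, and the row-major numbering of connections is an assumption chosen so that connection 3 pairs the system 1 link of the communication server module with the system 0 link of the storage server module, as stated above. The actual numbering is the one given in FIG. 23.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# (communication server module link, storage server module link)
Connection = Tuple[int, int]


def build_connections(d: int, cross_links: bool) -> Dict[int, Connection]:
    """Number the connections from 1.

    Without cross links (Embodiment 1), link i pairs only with link i,
    giving D connections. With cross links (Embodiment 2), every pair of
    links is used, giving D * D connections. Row-major numbering is an
    assumption of this sketch.
    """
    pairs = [
        (a, b) for a, b in product(range(d), repeat=2) if cross_links or a == b
    ]
    return {number: pair for number, pair in enumerate(pairs, start=1)}


def usable(
    connections: Dict[int, Connection],
    failed_comm: Set[int],
    failed_storage: Set[int],
) -> List[int]:
    """Return the numbers of connections whose two end links are both intact."""
    return [
        number
        for number, (a, b) in connections.items()
        if a not in failed_comm and b not in failed_storage
    ]


# System 0 link of the communication server module and system 1 link of
# the storage server module are both cut.
embodiment1 = build_connections(2, cross_links=False)
embodiment2 = build_connections(2, cross_links=True)
print(usable(embodiment1, {0}, {1}))  # prints []
print(usable(embodiment2, {0}, {1}))  # prints [3]
```

With D = 2, the first call leaves no usable connection, while the second leaves exactly one, which matches the comparison between the two embodiments.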

The in-server database is also exactly the same as in Embodiment 1 and holds the same information as that shown in FIG. 12.

The operation flow of each part of FIG. 23 is almost the same as in Embodiment 1, so its detailed description is omitted.

In Embodiment 2, the communication path corresponding to connection 1 is called communication path 1, the one corresponding to connection 2 is called communication path 2, and likewise for communication paths 3 and 4.

Numbers are assigned to the resources corresponding to each communication path (connections, EPs, memory area abstraction means, and communication path monitoring units) as follows. Taking the communication server module, which is the RDMA initiator, as the reference, numbers 1 to 4 are assigned as in FIG. 23, and on the storage server module each resource is given the same number as its counterpart on the communication server module. As a result, in the communication path switching process (the processing contained in FIGS. 17 and 21 of Embodiment 1), the communication path in use can be kept synchronized at all times between the communication server module and the storage server module by the same method as in FIGS. 17 and 21 of Embodiment 1.

Embodiment 2 has two communication links and therefore four connections. Needless to say, when there are D (≧ 2) communication links, D² connections are provided, and each computer correspondingly has D² memory area abstraction means and the like.

As is clear from the above description of the embodiments, the interconnection-network-redundant multimedia server device of the present invention has the following advantages.

First, when the server device distributes the multimedia information of a program, each segment of the multimedia information needs to be read only once, into a single RDMA area. Likewise, when data is delivered to the network, each piece of data needs to be read from the single RDMA area to the NIC only once.

Therefore, the problem of (Method 1) described in [Problems to Be Solved by the Invention], namely the increase in host processing load and in consumption of various resources caused by redundantly transferring data between external devices and main memory, is avoided. The same holds when the multimedia information of a program is stored into the server device.

Furthermore, the present invention is the same as (Method 2) described in [Problems to Be Solved by the Invention] in that it changes the communication link that carries the data when a communication link fails. In the present invention, however, the only operation needed to switch communication links is switching the LMR group. This switch imposes almost no load and completes almost instantly.
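
Why the switch is cheap can be illustrated with a small model. This is a sketch under stated assumptions and does not use the uDAPL API: `MemoryRegionHandle` is a hypothetical stand-in for one LMR group, and `RedundantBuffer` is a hypothetical stand-in for the single data storage area that all LMR groups refer to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryRegionHandle:
    """Hypothetical stand-in for one LMR group: a per-path view of the buffer."""

    path: int
    buffer: bytearray  # reference to the shared buffer, not a copy


@dataclass
class RedundantBuffer:
    """One buffer in main memory, registered once per communication path."""

    size: int
    paths: int
    buffer: bytearray = field(init=False)
    handles: List[MemoryRegionHandle] = field(init=False)
    active: int = 0

    def __post_init__(self) -> None:
        self.buffer = bytearray(self.size)
        # Every handle refers to the same underlying buffer.
        self.handles = [
            MemoryRegionHandle(path, self.buffer) for path in range(self.paths)
        ]

    def switch_to(self, path: int) -> MemoryRegionHandle:
        """Switch paths by selecting another handle; no data is copied."""
        self.active = path
        return self.handles[path]


shared = RedundantBuffer(size=8, paths=2)
shared.handles[0].buffer[0:4] = b"data"  # data arrived over path 0
handle = shared.switch_to(1)             # path 0 fails, use path 1
print(bytes(handle.buffer[0:4]))         # prints b'data'
```

In this model the data written while path 0 was in use is already visible through the handle for path 1, so the switch amounts to changing an index. Method 2, by contrast, has to copy data between separate RDMA areas or reread it from disk.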

Therefore, when the server device distributes the multimedia data of a program, the delivery of data from main memory to the network on host B is almost unaffected by the LMR group switch described above and continues normally without interruption. Likewise, when the multimedia information of a program is stored into the server device, the time during which segments cannot be transferred from host B to host A is very short. Consequently, none of the program's multimedia information sent from the multimedia information input device is lost.

(Method 2) described in [Problems to Be Solved by the Invention] requires copying (moving) data between RDMA areas and rereading data from disk. To carry out the delivery described above and the reception of multimedia information from outside without interruption, these operations would have to be performed at high speed, which is very difficult. The present invention resolves this problem as well.

As described above, the present invention solves the problems of the conventional, obvious communication link redundancy schemes described in [Problems to Be Solved by the Invention] while retaining rapid switching of communication links when a failure occurs on a communication link.

The device of the present invention can also be implemented by a computer and a program, and the program can be recorded on a recording medium or provided over a network.

FIG. 1 is a diagram showing a configuration example of a PC cluster system, which is background art.
FIG. 2 is a diagram showing the main objects used in the uDAPL API, which is background art.
FIG. 3 is a diagram showing the procedure for communication between two computers using uDAPL RDMA read.
FIG. 4 is a diagram explaining (Method 1), an example of the prior art.
FIG. 5 is a diagram explaining the outline of (Method 2), an example of the prior art.
FIG. 6 is a diagram explaining the operation of (Method 2), an example of the prior art, in detail.
FIG. 7 is a diagram explaining the operation of (Method 2), an example of the prior art, in detail.
FIG. 8 is a diagram explaining the operation of (Method 2), an example of the prior art, in detail.
FIG. 9 is a diagram explaining the outline of the means for solving the problem of the present invention.
FIG. 10 is a diagram explaining the operation of the means for solving the problem of the present invention in detail.
FIG. 11 is a diagram showing the overall configuration of the device of Embodiment 1 of the present invention.
FIG. 12 is a diagram explaining the information managed by the in-server database in Embodiment 1 of the present invention.
FIG. 13 is a block diagram showing the configuration of the communication server module in Embodiment 1 of the present invention.
FIG. 14 is a block diagram showing the configuration of the storage server module in Embodiment 1 of the present invention.
FIG. 15 is a flowchart showing the processing of the communication server module in Embodiment 1 of the present invention.
FIG. 16 is a flowchart showing the segment delivery processing to a terminal by the communication server module in Embodiment 1 of the present invention.
FIG. 17 is a flowchart showing the details of step 11 of FIG. 14.
FIG. 18 is a flowchart showing the processing of the communication path monitoring units of the communication server module and the storage server module in Embodiment 1 of the present invention.
FIG. 19 is a flowchart showing the processing of the storage server module in Embodiment 1 of the present invention.
FIG. 20 is a flowchart showing the segment read processing from the storage device by the storage server module in Embodiment 1 of the present invention.
FIG. 21 is a flowchart showing the details of step 77 of FIG. 18.
FIG. 22 is a diagram showing the overall configuration of the device of Embodiment 2 of the present invention.
FIG. 23 is a block diagram showing the configuration of the communication server module and the storage server module in Embodiment 2 of the present invention.

Claims

A communication path redundancy and switching method in a server device in which a plurality of computers are connected by an interconnection network capable of running a communication scheme that has an inter-computer connection setup mechanism and a mechanism for abstracting main memory areas of the computers, the interconnection network being made redundant into D (≧ 2) systems, and each computer having D communication links for connecting to the respective systems, wherein
each of the computers has:
path redundancy means for setting up a connection between its own communication link i (1 ≦ i ≦ D) and communication link i of the opposing computer, thereby securing a total of D redundant communication paths;
one data storage means secured in main memory as a storage area for data to be communicated; and
D memory area abstraction means, each corresponding to one of the communication paths and all corresponding to the data storage means,
and wherein, when a failure occurs on a communication path while the data held in the data storage means is being transferred to or from the opposing computer via one of the memory area abstraction means over the communication path corresponding to it, the communication path used for the data transfer is switched and the data transfer is continued merely by switching the memory area abstraction means in use to one corresponding to another communication path.

The communication path redundancy and switching method according to claim 1, wherein,
in a server device to which inter-interconnection-network communication links are added so that communication is possible between arbitrary systems of the interconnection network made redundant into D (≧ 2) systems,
for each of the computers, the path redundancy means is modified to set up connections between each communication link i (1 ≦ i ≦ D) of the computer itself and all communication links of the opposing computer, thereby securing a total of D² redundant communication paths, and the memory area abstraction means are modified so that D² of them are provided, one for each communication path.

The communication path redundancy and switching method according to claim 1 or 2, wherein
each of the computers has, for each communication path, communication path failure detection means having a function of exchanging messages with the opposing computer via that communication path, with a time slightly longer than the time normally required for data transfer set as the specified time for data transfer and an even shorter time set as the specified time for message exchange, such that if the message exchange completes within its specified time, the normal state is recorded as the state of the communication path and the same processing is repeated, and if it does not, the abnormal state is recorded as the state of the communication path and the processing ends, and
if a data transfer does not complete within its specified time, the state of the communication path is consulted; if the state is normal, completion of the data transfer is awaited until the next specified time elapses, and if the state is abnormal, the data transfer is continued by switching to another communication path that is in the normal state.

The communication path redundancy and switching method according to claim 3, wherein
each of the computers has communication path failure/recovery detection means, obtained by modifying the communication path failure detection means so that it continues the message exchange even after recording the abnormal state as the state of a communication path and, when the message exchange again completes within the specified time, returns the state of that communication path to normal,
thereby making it possible to reuse a communication path on which communication has recovered after a failure.

A server device in which a plurality of computers are connected by an interconnection network capable of running a communication scheme that has an inter-computer connection setup mechanism and a mechanism for abstracting main memory areas of the computers, the interconnection network being made redundant into D (≧ 2) systems, and each computer having D communication links for connecting to the respective systems, wherein
each of the computers comprises:
path redundancy means for setting up a connection between its own communication link i (1 ≦ i ≦ D) and communication link i of the opposing computer, thereby securing a total of D redundant communication paths;
one data storage means secured in main memory as a storage area for data to be communicated; and
D memory area abstraction means, each corresponding to one of the communication paths and all corresponding to the data storage means,
and wherein, when a failure occurs on a communication path while the data held in the data storage means is being transferred to or from the opposing computer via one of the memory area abstraction means over the communication path corresponding to it, the communication path used for the data transfer is switched and the data transfer is continued merely by switching the memory area abstraction means in use to one corresponding to another communication path.

The server device according to claim 5, wherein,
in a server device to which inter-interconnection-network communication links are added so that communication is possible between arbitrary systems of the interconnection network made redundant into D (≧ 2) systems,
for each of the computers, the path redundancy means is modified to set up connections between each communication link i (1 ≦ i ≦ D) of the computer itself and all communication links of the opposing computer, thereby securing a total of D² redundant communication paths, and the memory area abstraction means are modified so that D² of them are provided, one for each communication path.

A communication server module in a server device in which a plurality of computers are connected by an interconnection network capable of running a communication scheme that has an inter-computer connection setup mechanism and a mechanism for abstracting main memory areas of the computers, the interconnection network being made redundant into D (≧ 2) systems, each computer having D communication links for connecting to the respective systems, and the plurality of computers consisting of a communication server module that communicates directly with external devices and a storage server module that controls a storage device, the communication server module comprising:
path redundancy means for setting up a connection between its own communication link i (1 ≦ i ≦ D) and communication link i of the storage server module, thereby securing a total of D redundant communication paths;
one data storage means secured in main memory as a storage area for data to be communicated; and
D memory area abstraction means, each corresponding to one of the communication paths and all corresponding to the data storage means,
wherein, when a failure occurs on a communication path while the data held in the data storage means is being transferred to or from the opposing computer via one of the memory area abstraction means over the communication path corresponding to it, the communication path used for the data transfer is switched and the data transfer is continued merely by switching the memory area abstraction means in use to one corresponding to another communication path.

A storage server module in a server device in which a plurality of computers are connected by an interconnection network capable of running a communication scheme that has an inter-computer connection setup mechanism and a mechanism for abstracting main memory areas of the computers, the interconnection network being made redundant into D (≧ 2) systems, each computer having D communication links for connecting to the respective systems, and the plurality of computers consisting of a communication server module that communicates directly with external devices and a storage server module that controls a storage device, the storage server module comprising:
path redundancy means for setting up a connection between its own communication link i (1 ≦ i ≦ D) and communication link i of the communication server module, thereby securing a total of D redundant communication paths;
one data storage means secured in main memory as a storage area for data to be communicated; and
D memory area abstraction means, each corresponding to one of the communication paths and all corresponding to the data storage means,
wherein, when a failure occurs on a communication path while the data held in the data storage means is being transferred to or from the opposing computer via one of the memory area abstraction means over the communication path corresponding to it, the communication path used for the data transfer is switched and the data transfer is continued merely by switching the memory area abstraction means in use to one corresponding to another communication path.

A program for causing a computer to function as the communication server module according to claim 7.

A program for causing a computer to function as the storage server module according to claim 8.