JP2005242500A

JP2005242500A - Read request arbitration control system and its method

Info

Publication number: JP2005242500A
Application number: JP2004048861A
Authority: JP
Inventors: Shusaku Uchibori; 修作内堀
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2004-02-25
Filing date: 2004-02-25
Publication date: 2005-09-08
Anticipated expiration: 2024-02-25
Also published as: JP4019056B2

Abstract

PROBLEM TO BE SOLVED: To provide a read request arbitration control system that improves the effective performance of data transfer in a NUMA type distributed shared memory parallel computer.
SOLUTION: In general, ordering need not be guaranteed between read requests in data transfer within a NUMA type distributed shared memory parallel computer. A read request arbitration unit 10 is installed on the data transfer path; it calculates the access latency of each request from the request destination physical address and reorders the read requests so that a request whose destination physical address has a longer access latency is issued with priority. The transfer of read requests is thereby made more efficient, and the effective data transfer performance improves as a result.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a read request arbitration control system and method, and more particularly to a read request arbitration control scheme in a NUMA (Non-Uniform Memory Access) type distributed shared memory parallel computer.

One example of a conventional shared memory system is the UMA (Uniform Memory Access) system shown in FIG. 17. In this system the memory access latency is taken to be the same regardless of the physical address, and the control unit 805 returns the replies to the read requests that the processors 801 to 804 issue to the main storage unit 806 in the order in which those read requests were issued.

Arbitration between read requests is therefore unnecessary. As shown in FIG. 18, the read request arbitration unit 11 registers each read request 110 in the read request registration table 102 and issues the read requests in the order of registration. This so-called FIFO (First In First Out) method caused no problems.

In recent years, NUMA (Non-Uniform Memory Access) systems such as the one shown in FIG. 19 have appeared because of their cost advantages. Referring to the figure, the system includes a plurality of processor nodes 20 and 21 and an inter-processor connection device 40 that connects these nodes. The processor nodes 20 and 21 have the same configuration; for example, the processor node 20 includes a plurality of processors 200a and 200b, a processor node connection unit 202, and a main storage unit 201.

In this system the memory access latency differs depending on the physical address. With the conventional FIFO method, in which read requests are issued in request order, the efficiency of read request transfer therefore deteriorates and the effective performance of inter-node data transfer decreases.

Patent Document 1 discloses a multiprocessor system intended to equalize the memory access time of the processors. An arbitration unit that arbitrates between requests is provided on the data transfer path. Based on the transmission time accumulated from the request source (and attached to the request), the arbitration unit issues with priority the request with the longest transmission time, that is, the request that has waited longest since it was issued by its request source. In other words, this technique aims to improve the performance of requests from a plurality of request sources to a single request destination.

Patent Document 1: JP-A-2-294867
Patent Document 2: JP-A-2003-216489

In the NUMA type distributed shared memory parallel computer shown in FIG. 4 described above, the memory access latency differs depending on the physical address, so the conventional FIFO method of issuing read requests in request order degrades the efficiency of read request transfer and lowers the effective performance of inter-node data transfer. There is a further problem: systems are often built by combining existing products of other companies with components developed in-house, and the other companies' products are difficult to change. Unless the method is general-purpose, it is therefore difficult to improve performance through the in-house developed portion alone.

Moreover, the technique of Patent Document 1 improves the performance of requests from a plurality of request sources to a single request destination; it cannot improve performance when a single request source issues requests to a plurality of request destinations. The technique also requires special priority information to be attached to each request, so applying it in practice requires changes to the system, and it therefore cannot be applied generally to NUMA type distributed shared memory parallel computers.

An object of the present invention is to provide a read request arbitration control system and method that can be applied generally to NUMA type distributed shared memory parallel computers and that improve the effective data transfer performance by making data transfer more efficient.

A read request arbitration control system according to the present invention is a read request arbitration control system in a distributed shared memory parallel computer whose system is composed of a plurality of nodes each having one or more processors and a main storage device, and is characterized by including control means for performing arbitration control of read requests in accordance with the access latency for the request destination address of each read request.

A read request arbitration control method according to the present invention is a read request arbitration control method in a distributed shared memory parallel computer whose system is composed of a plurality of nodes each having one or more processors and a main storage device, and is characterized by including a control step of performing arbitration control of read requests in accordance with the access latency for the request destination address of each read request.

A program according to the present invention is a program for causing a computer to execute a read request arbitration control method in a distributed shared memory parallel computer whose system is composed of a plurality of nodes each having one or more processors and a main storage device, and is characterized by including processing that performs arbitration control of read requests in accordance with the access latency for the request destination address of each read request.

The operation of the present invention is as follows. In general, ordering need not be guaranteed between read requests in data transfer within a NUMA type distributed shared memory parallel computer. The access latency of each request is therefore calculated from its request destination physical address, and the read requests are issued with their ordering changed so that a request whose destination physical address has a longer latency is given priority. This makes the transfer of read requests more efficient and, as a result, improves the data transfer performance.

A first effect of the present invention is that read request transfer becomes more efficient and the effective data transfer performance improves. This is because read requests are issued with their ordering changed so that a request with a longer access latency to its request destination physical address is given priority, in keeping with the address-dependent access latency that characterizes data transfer within a NUMA type distributed shared memory parallel computer.

A second effect of the present invention is that it can be applied to NUMA type distributed shared memory parallel computers without any particular restriction and thereby improves the effective data transfer performance. This is because the ordering is changed only among read requests, for which ordering generally need not be guaranteed in data transfer within a computer.

A third effect of the present invention is that the effective data transfer performance can be improved by providing the read request arbitration unit at any suitable point, such as in a processor node, in an I/O node, or in the inter-node connection device of a NUMA type distributed shared memory parallel computer. This is because the unit can be placed anywhere on the data transfer path where arbitration between read requests is possible.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the present invention, namely a NUMA type distributed shared memory parallel computer. As shown in the figure, the computer consists of a plurality of processor nodes 20 to 23, a plurality of I/O nodes 30 to 33, and an inter-node connection device 40 that connects these nodes. In this example there are four processor nodes and four I/O nodes, but any number can be used in practice.

FIG. 2 shows the detailed configuration of each block of FIG. 1. The processor node 20 consists of one or more processors 200a to 200b, a main storage unit 201, and a processor node control unit 202; the processors 200 and the processor node control unit 202 are connected by a system bus 210. The I/O node 30 consists of an I/O node control unit 300.

The inter-node connection device 40 consists of an inter-node connection control unit 400. In this embodiment, the read request arbitration unit 10 according to the present invention is connected on the data transfer path, that is, between the processor node 20 and the inter-node connection device 40 and between the I/O node 30 and the inter-node connection device 40. Because the read request arbitration unit 10 only needs to be located on the data transfer path, it can be provided in each node as in this embodiment, in the inter-node connection control unit 400 as shown in FIG. 3, or in both.

The distributed shared memory is distributed over the main storage units 201 of the processor nodes 20 to 23. A first example of the distributed arrangement, shown in FIG. 4, stacks the main memories of the processor nodes to form the main storage space of the system; that is, the upper bits of the memory address designate the processor node.

In this embodiment, addresses 000 to 3FF (hexadecimal) of the main storage space are allocated to the processor node 20, addresses 400 to 7FF to the processor node 21, addresses 800 to BFF to the processor node 22, and addresses C00 to the final address FFF to the processor node 23, so that the upper 2 bits of the address designate the processor node.
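The stacked layout of this example amounts to a short address decode. The following is a minimal sketch, assuming a 12-bit address (the range 000 to FFF hexadecimal given in the text); the function name `node_for_stacked_address` and the constant names are illustrative and do not appear in the original.

```python
# Illustrative sketch: decode the owning processor node for the stacked
# layout of FIG. 4. Names and the 12-bit width are assumptions.
PROCESSOR_NODES = (20, 21, 22, 23)  # reference numerals of the processor nodes
ADDRESS_BITS = 12                   # assumption: addresses 0x000..0xFFF


def node_for_stacked_address(address: int) -> int:
    """Return the processor node designated by the upper 2 address bits."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the 0x000-0xFFF main storage space")
    # Shifting out the lower 10 bits leaves the upper 2 bits (0..3).
    return PROCESSOR_NODES[address >> (ADDRESS_BITS - 2)]
```

Each node owns one contiguous 1024-address block, so consecutive 256-byte reads tend to land on the same node until a block boundary is crossed.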

A second example of the distributed arrangement, disclosed in Patent Document 2 and shown in FIG. 5, provides each of the processor nodes 20 to 23 with an inter-node interleave control unit and interleaves between the main storage units of the processor nodes; that is, bits in the middle of the address designate the node.

In this example, the main storage space is interleaved across the nodes in 256-byte units. Addresses 000 to 0FF, 400 to 4FF, 800 to 8FF, and C00 to CFF (hexadecimal) are allocated to the processor node 20; addresses 100 to 1FF, 500 to 5FF, 900 to 9FF, and D00 to DFF to the processor node 21; addresses 200 to 2FF, 600 to 6FF, A00 to AFF, and E00 to EFF to the processor node 22; and addresses 300 to 3FF, 700 to 7FF, B00 to BFF, and F00 to the final address FFF to the processor node 23. The 2 bits immediately above the lower 8 bits of the address designate the processor node.
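The interleaved layout can be decoded in the same way, using the two bits just above the 256-byte offset. This is a minimal sketch under the same 000 to FFF address range; `node_for_interleaved_address` and `INTERLEAVE_SHIFT` are hypothetical names introduced for illustration.

```python
# Illustrative sketch: decode the owning processor node for the 256-byte
# interleaved layout of FIG. 5. Names are assumptions, not from the original.
PROCESSOR_NODES = (20, 21, 22, 23)  # reference numerals of the processor nodes
INTERLEAVE_SHIFT = 8                # 256-byte unit = lower 8 address bits


def node_for_interleaved_address(address: int) -> int:
    """Return the node designated by the 2 bits above the lower 8 bits."""
    if not 0 <= address <= 0xFFF:
        raise ValueError("address outside the 0x000-0xFFF main storage space")
    # Drop the 256-byte offset, then keep only bits 9 and 8.
    return PROCESSOR_NODES[(address >> INTERLEAVE_SHIFT) & 0b11]
```

With this layout, consecutive 256-byte blocks rotate through nodes 20, 21, 22, 23 and then wrap around.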

FIG. 6 shows the detailed configuration of the read request arbitration unit 10 according to the present invention shown in FIG. 2. In the figure, the physical-address latency search table 101 holds in advance the memory access latency for each physical address, that is, the time from the issue of a request until its reply returns. When the request destination physical address 111 is input, the access latency 112 is output.

In this example, when the main storage space is distributed over the processor nodes as shown in FIG. 4, the physical-address latency search table 101 consists, as shown in FIG. 7, of entries 120 that hold access latencies and a selection circuit 121. One entry is needed per processor node, four in total; for example, access latencies of 200 ns, 400 ns, 600 ns, and 800 ns are held in advance for the processor nodes 20 to 23, respectively. The selection circuit 121 selects an entry using the upper 2 bits of the 11-bit request destination physical address 111, that is, bit 11 and bit 10.

Likewise, when the main storage space is interleaved across the processor nodes as shown in FIG. 5, the physical-address latency search table 101 consists, as shown in FIG. 8, of entries 120 that hold access latencies and a selection circuit 122. Here too, one entry is needed per processor node, four in total; for example, access latencies of 200 ns, 400 ns, 600 ns, and 800 ns are held in advance for the processor nodes 20 to 23, respectively. The selection circuit 122 selects an entry using the 2 bits immediately above the lower 8 bits of the 11-bit request destination physical address 111, that is, bit 9 and bit 8.
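The two table variants differ only in which address bits select the entry. The sketch below models that as one class with a pluggable selection function; it is an illustration, not the circuit of FIGS. 7 and 8. The class name `LatencySearchTable` is hypothetical, and the shifts assume that bit 0 is the least significant bit, so that bits 11 and 10 are the top two bits of the 000 to FFF range.

```python
from typing import Callable, Sequence


class LatencySearchTable:
    """Sketch of table 101: latency entries (120) plus a selector (121/122)."""

    def __init__(self, entries_ns: Sequence[int], select: Callable[[int], int]):
        self.entries_ns = list(entries_ns)  # one access latency per node, in ns
        self.select = select                # maps an address to an entry index

    def lookup(self, address: int) -> int:
        """Return the preset access latency for the given physical address."""
        return self.entries_ns[self.select(address)]


# Example latencies from the text for processor nodes 20 to 23.
ENTRIES_NS = [200, 400, 600, 800]

# Stacked layout (FIG. 7): entry chosen by bits 11 and 10.
stacked_table = LatencySearchTable(ENTRIES_NS, lambda a: (a >> 10) & 0b11)

# Interleaved layout (FIG. 8): entry chosen by bits 9 and 8.
interleaved_table = LatencySearchTable(ENTRIES_NS, lambda a: (a >> 8) & 0b11)
```

Keeping the selector separate from the entries mirrors the split between the entries 120 and the selection circuits 121 and 122 described above.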

As shown in detail in FIG. 9, the read request registration table 102 consists of a plurality of entries. Each entry stores the read request information of an ordinary read request 110, such as the request destination physical address and the request data length; the access latency 112 obtained from the request destination physical address 111 through the physical-address latency search table 101; and the registration time 113 output from the timer 103 when the read request is registered.

As shown in detail in FIG. 10, for each entry of the read request registration table 102 the arbitration means 100 calculates the waiting time 131 since registration by subtracting the request registration time 130 from the current time 115 indicated by the timer 103, adds the access latency 132 to obtain the latency 133 from the registration time, and selects and issues the read request 116 whose latency 133 from the registration time is the longest. The timer 103 holds the current time.
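The selection rule can be written compactly. The symbols are introduced here for illustration and are not in the original: for entry i, t_now is the current time 115, t_reg,i the request registration time 130, W_i the waiting time 131, L_i the access latency 132, and T_i the latency 133 from the registration time.

```latex
W_i = t_{\mathrm{now}} - t_{\mathrm{reg},i}, \qquad
T_i = W_i + L_i, \qquad
i^{*} = \arg\max_{i \,\in\, \text{occupied entries}} T_i
```

The request in entry i* is the one issued. Because W_i grows while a request waits, a request to a low-latency destination is eventually selected even if high-latency requests keep arriving.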

The configuration of the embodiment has been described in detail above. The NUMA type distributed shared memory parallel computer of FIG. 1; the processor node 20, I/O node 30, and inter-node connection device 40 of FIG. 2; and the distributed arrangement schemes of the distributed shared memory in FIGS. 4 and 5 are well known to those skilled in the art and are not directly related to the present invention, so their detailed configurations are omitted.

Next, the operation of the read request arbitration unit 10 shown in FIG. 6 is described. It consists of two operations: registering a read request 110, once it occurs, in an entry of the read request registration table 102; and arbitrating, by the arbitration means 100, among the read requests registered in the read request registration table 102 and issuing a request 116.

First, the registration operation, from the occurrence of a read request 110 to its registration in an entry of the read request registration table 102, is described with reference to FIG. 11. When a read request 110 is given to the read request arbitration unit 10, its read request information, such as the request destination physical address 111 and the request data length, is registered in an empty entry of the read request registration table 102 (FIG. 11(a)).

The time 113 at that moment is obtained from the timer 103 and registered in the same entry as the registration time (FIG. 11(b)). In addition, the request destination physical address 111 of the read request 110 is given to the physical-address latency search table 101 (FIG. 11(c)), which outputs the access latency 112 to that physical address, that is, the time from the issue of the read request until its reply returns. This value is registered in the same entry as the access latency (FIG. 11(d)).
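The registration steps can be sketched in software as follows. This is a minimal illustration, not the hardware of FIG. 11: the `Entry` fields, the `register` function, and the use of `None` for an empty entry are assumptions, and the latency lookup is passed in as a plain function.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Entry:
    address: int        # request destination physical address (111)
    length: int         # request data length in bytes
    registered_at: int  # registration time (113) from the timer, in ns
    latency_ns: int     # access latency (112) from the latency table, in ns


def register(table: List[Optional[Entry]], address: int, length: int,
             now_ns: int, lookup_latency: Callable[[int], int]) -> int:
    """Store a read request in the first empty entry and return its index."""
    for slot, entry in enumerate(table):
        if entry is None:
            # Request information, registration time, and looked-up latency
            # are all written to the same entry.
            table[slot] = Entry(address, length, now_ns, lookup_latency(address))
            return slot
    # The original leaves this case to flow control toward the request source.
    raise RuntimeError("no empty entry in the read request registration table")
```

Looking up the latency once at registration means the arbitration step only needs additions and comparisons later.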

When the read request registration table 102 has no empty entry, flow control or the like is applied to the request source; this is well known to those skilled in the art and is not directly related to the present invention, so its detailed configuration is omitted.

Next, the operation in which the arbitration means 100 arbitrates among the read requests registered in the read request registration table 102 and issues a request 116 is described with reference to FIG. 10. When a read request 116 becomes issuable as a result of flow control or the like, the request registration time 130 of each registered entry is subtracted from the current time 115 obtained from the timer 103 to calculate the waiting time 131 from the registration of the request to the current time (FIG. 10(a)).

The access latency 132 is then added to the waiting time 131 to obtain the latency 133 from the registration time until the read reply is received (FIG. 10(b)). The arbitration means 100 calculates this latency 133 from the registration time for every entry that is not empty, selects the request with the longest latency, and issues it as the read request 116 (FIG. 10(c)). The issued read request 116 is deleted from the entry it occupied.
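The issue operation can be sketched as a single function that scans the occupied entries, picks the one with the longest waiting time plus access latency, and frees its entry. The tuple layout, the name `issue_next`, and the tie-breaking rule (the earlier entry wins) are assumptions made for illustration; the original does not say how ties are resolved.

```python
from typing import List, Optional, Tuple

# Each occupied entry holds (address, registered_at_ns, latency_ns);
# an empty entry holds None. This layout is an assumption for the sketch.
Slot = Optional[Tuple[int, int, int]]


def issue_next(table: List[Slot], now_ns: int) -> Optional[int]:
    """Issue the request with the longest (waiting time + access latency).

    Returns the address of the issued request, or None if nothing is queued.
    """
    best_slot = None
    best_total = -1
    for slot, entry in enumerate(table):
        if entry is None:
            continue  # empty entries take no part in arbitration
        _, registered_at, latency = entry
        total = (now_ns - registered_at) + latency  # latency 133
        if total > best_total:  # on a tie the earlier entry is kept
            best_slot, best_total = slot, total
    if best_slot is None:
        return None
    address = table[best_slot][0]
    table[best_slot] = None  # the issued request is deleted from its entry
    return address
```

Calling `issue_next` once per issue opportunity drains the table in order of decreasing total latency, with waiting time preventing low-latency requests from being postponed indefinitely.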

The operation of the embodiment is now described with a concrete example. Suppose that in the I/O node 30 shown in FIG. 3 a 4-Kbyte read request occurs in the I/O node control unit 300, and that the I/O node control unit 300 issues it to the arbitration unit 10 as 16 read requests of 256 bytes each. Under flow control, the arbitration unit 10 is assumed to issue one request every 25 ns.

When the main storage space is distributed over the processor nodes as shown in FIG. 4, the conventional FIFO method of issuing in request order issues read requests at 25 ns intervals starting at 0 ns, as shown in FIG. 12, and the reply to the last request returns after 1275 ns.

By contrast, with the arbitration unit 10 according to the present invention, which has the physical-address latency search table shown in FIG. 7, the ordering is changed so that requests with a longer access latency to the request destination physical address are given priority, as shown in FIG. 13. Read requests are again issued at 25 ns intervals starting at 0 ns, and the reply to the last request returns after 875 ns. The reply thus returns 400 ns earlier than with the conventional method, which shows that read request transfer is more efficient and the effective data transfer performance is improved.
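The effect of reordering can be explored with a small model. The sketch below makes several assumptions that are not stated in the original: the 4-Kbyte read starts at address 000, all 16 requests are registered before the first issue (so waiting times are equal and the rule reduces to longest latency first), and a reply arrives exactly one table latency after its request is issued. It is a simplified model and does not attempt to reproduce every detail of the timing charts in FIGS. 12 and 13.

```python
ISSUE_INTERVAL_NS = 25                # one request issued every 25 ns
LATENCY_NS = [200, 400, 600, 800]     # example latencies for nodes 20 to 23


def latency_stacked(address: int) -> int:
    """Latency for the stacked layout: entry chosen by bits 11 and 10."""
    return LATENCY_NS[(address >> 10) & 0b11]


def last_reply_time(addresses, latency_of) -> int:
    """Issue in the given order and return when the latest reply arrives."""
    return max(i * ISSUE_INTERVAL_NS + latency_of(a)
               for i, a in enumerate(addresses))


# Sixteen 256-byte reads covering 0x000..0xFFF (assumed start address 0x000).
requests = [i * 0x100 for i in range(16)]

fifo_order = requests
# Longest latency first; sorted() is stable, so equal-latency requests
# keep their original relative order.
priority_order = sorted(requests, key=latency_stacked, reverse=True)

fifo_last = last_reply_time(fifo_order, latency_stacked)
priority_last = last_reply_time(priority_order, latency_stacked)
```

Under this model the latency-first order completes earlier than the FIFO order, because the slowest destinations are started first and their latency overlaps with the issue of the remaining requests.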

Similarly, when the main storage space is interleaved across the processor nodes as shown in FIG. 5, the conventional FIFO method of issuing in request order issues read requests at 25 ns intervals starting at 0 ns, as shown in FIG. 14, and the reply to the last request returns after 1275 ns.

By contrast, with the arbitration unit 10 according to the present invention, which has the physical-address latency search table shown in FIG. 8, the ordering is changed so that requests with a longer access latency to the request destination physical address are given priority, as shown in FIG. 15. Read requests are again issued at 25 ns intervals starting at 0 ns, and the reply to the last request returns after 875 ns. The reply thus returns 400 ns earlier than with the conventional method, which shows that read request transfer is more efficient and the effective data transfer performance is improved.

In another embodiment of the present invention, the basic configuration is as described above, but the physical-address latency search table 101 is refined further: the access latency for each physical address, which was set in advance in the preceding embodiment, is set dynamically on the basis of actual read requests and their replies.

FIG. 16 shows this configuration. When the arbitration means 100 issues a read request 116, the access destination physical address and the request issue time 140 obtained from the timer 103 are registered in the latency calculation means 104 (FIG. 16(a)). When the read reply 117 to the read request 116 is returned, the reply arrival time 141 obtained from the timer 103 is registered (FIG. 16(b)). The access latency 118 is then calculated by subtracting the request issue time 140 from the reply arrival time 141 (FIG. 16(c)). Finally, the access latency corresponding to the physical address 119 in the physical-address latency search table 101 is updated with the access latency 118 (FIG. 16(d)).
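The four steps of FIG. 16 can be sketched as a small class that tracks one outstanding read request. The class name `LatencyCalculator`, the dictionary standing in for table 101, and the `key_of` function that maps an address to a table entry are assumptions for illustration only.

```python
from typing import Callable, Dict, Optional


class LatencyCalculator:
    """Sketch of latency calculation means 104 for one outstanding request."""

    def __init__(self, latency_table: Dict[int, int],
                 key_of: Callable[[int], int]):
        self.latency_table = latency_table  # stands in for table 101
        self.key_of = key_of                # address -> table entry index
        self.address: Optional[int] = None
        self.issued_at: Optional[int] = None

    def on_issue(self, address: int, now_ns: int) -> None:
        """FIG. 16(a): record the destination address and the issue time."""
        self.address, self.issued_at = address, now_ns

    def on_reply(self, now_ns: int) -> int:
        """FIG. 16(b)-(d): measure the latency and update the table entry."""
        if self.address is None or self.issued_at is None:
            raise RuntimeError("reply received with no request outstanding")
        latency = now_ns - self.issued_at                        # (c)
        self.latency_table[self.key_of(self.address)] = latency  # (d)
        self.address = self.issued_at = None
        return latency
```

Because the table is overwritten with measured values, it needs no accurate initial contents; the first replies populate it.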

In this example, the access latency for each physical address, which was preset in the preceding embodiment, is thus updated dynamically on the basis of actual read requests and their replies. No initial setting is required, and data transfer is optimized dynamically in response to dynamic configuration changes of the NUMA type distributed shared memory parallel computer and to the load of the jobs being executed, which improves the effective data transfer performance.

In this configuration, a plurality of latency calculation means 104 may be provided so that the access latency can be set dynamically for a plurality of read requests. The physical-address latency search table 101 may also be updated with the average latency of several requests rather than that of a single request.
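One way to realize the averaging variant is to keep the most recent measurements per table entry and return their mean. The original only says that an average over several requests may be used; the sliding window, its size, the default value for entries without measurements, and the name `AveragedLatencyTable` are assumptions of this sketch.

```python
from collections import defaultdict, deque


class AveragedLatencyTable:
    """Sketch: latency table whose entries are averages of recent samples."""

    def __init__(self, window: int = 4):
        # One bounded queue of recent latency samples per table entry.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, key: int, latency_ns: float) -> None:
        """Add a measured latency; the oldest sample drops out when full."""
        self._samples[key].append(latency_ns)

    def lookup(self, key: int, default_ns: float = 0.0) -> float:
        """Return the mean of the recorded samples, or the default if none."""
        samples = self._samples.get(key)
        if not samples:
            return default_ns
        return sum(samples) / len(samples)
```

Averaging smooths out one-off delays, at the cost of reacting more slowly to genuine changes in load or configuration.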

It is evident that the operation of each embodiment described above can be implemented by recording the operating procedure in advance on a recording medium such as a ROM and having a computer read and execute it.

FIG. 1 is a configuration diagram of a NUMA type distributed shared memory parallel computer system to which the present invention is applied.
FIG. 2 is a diagram showing a specific example of each part of FIG. 1.
FIG. 3 is a diagram showing a modification of FIG. 2.
FIG. 4 is a diagram showing an example in which the main memories of the processor nodes are stacked to form the main storage space of the system.
FIG. 5 is a diagram showing an example in which the main storage space of the system is formed by interleaving between the main memories of the processor nodes.
FIG. 6 is a diagram showing details of an example of the read request arbitration unit 10 of FIG. 1.
FIG. 7 is a diagram explaining an example of the physical-address latency search table 101.
FIG. 8 is a diagram explaining another example of the physical-address latency search table 101.
FIG. 9 is a diagram explaining an example of the read request registration table 102.
FIG. 10 is a diagram explaining an example of the arbitration means 100.
FIG. 11 is a diagram explaining the operation of the read request arbitration unit 10.
FIG. 12 is a diagram showing the time relationship between read requests and read replies under the FIFO method with the conventional distributed arrangement scheme.
FIG. 13 is a diagram showing the time relationship between read requests and read replies under the present invention with the distributed arrangement scheme.
FIG. 14 is a diagram showing the time relationship between read requests and read replies under the FIFO method with the conventional interleaved distributed arrangement scheme.
FIG. 15 is a diagram showing the time relationship between read requests and read replies under the present invention with the interleaved distributed arrangement scheme.
FIG. 16 is a diagram showing details of another example of the read request arbitration unit 10 of FIG. 1.
FIG. 17 is a diagram showing an example of a conventional UMA system.
FIG. 18 is a diagram showing an example of the issue of read requests by the conventional FIFO method.
FIG. 19 is a diagram showing an example of a conventional NUMA system.

Explanation of symbols

10 Node request arbitration unit
20 to 23 Processor nodes
30 to 33 I/O nodes
40 Inter-node connection device
100 Arbitration means
101 Physical address latency search table
102 Read request registration table
103 Timer
200a, 200b Processors
201 Main memory unit
202 Processor node control unit
300 I/O node control unit

Claims

1. A read request arbitration control system in a distributed shared memory parallel computer comprising a plurality of nodes each having one or more processors and a main storage device, the system comprising control means for performing arbitration control of a read request according to the access latency for the request destination address of the read request.

2. The read request arbitration control system according to claim 1, wherein the control means preferentially issues a read request having a longer access latency.

3. The read request arbitration control system according to claim 1, wherein the control means comprises a table in which the addresses of read request destinations and the latencies corresponding to those addresses are stored in advance, and means for calculating the latency, in response to the occurrence of the read request, by referring to the table using the request destination address.

4. The read request arbitration control system according to claim 3, further comprising means for updating the latency of the table based on an actual response to the read request.

5. A read request arbitration control method in a distributed shared memory parallel computer comprising a plurality of nodes each having one or more processors and a main storage device, the method comprising a control step of performing arbitration control of a read request according to the access latency for the request destination address of the read request.

6. The read request arbitration control method according to claim 5, wherein the control step preferentially issues a read request having a longer access latency.

7. The read request arbitration control method according to claim 5 or 6, wherein the control step comprises a step of calculating the latency, in response to the generation of the read request, by referring to a table in which each read request destination address and the latency corresponding to that address are stored in advance.

8. The read request arbitration control method according to claim 7, further comprising the step of updating the latency of the table based on an actual response to the read request.

9. A computer-readable program for causing a computer to execute a read request arbitration control method in a distributed shared memory parallel computer comprising a plurality of nodes each having one or more processors and a main storage device, the program comprising processing for performing arbitration control of a read request according to the access latency for the request destination address of the read request.
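The arbitration described in the claims (look up the latency for a request destination address in a table, preferentially issue the request with the longer latency, and update the table from actual responses) can be illustrated with a minimal software sketch. This is not the claimed hardware: the class names `LatencyTable` and `ReadRequestArbiter`, the address-range table layout, the FIFO tie-break among equal latencies, and the overwrite-on-update policy are all assumptions made for illustration, and the sketch does not model the timer 103.

```python
import heapq
import itertools


class LatencyTable:
    """Hypothetical table mapping address ranges to an expected access latency."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end_exclusive, latency); layout is assumed.
        self._ranges = [list(r) for r in ranges]

    def _find(self, address):
        for entry in self._ranges:
            if entry[0] <= address < entry[1]:
                return entry
        raise KeyError(f"no latency entry covers address {address:#x}")

    def lookup(self, address):
        """Return the stored latency for the range containing the address."""
        return self._find(address)[2]

    def update(self, address, measured_latency):
        """Overwrite the stored latency with a measured one (assumed policy)."""
        self._find(address)[2] = measured_latency


class ReadRequestArbiter:
    """Issues pending read requests in order of decreasing expected latency."""

    def __init__(self, table):
        self.table = table
        self._pending = []              # heap of (-latency, arrival_seq, address)
        self._seq = itertools.count()   # arrival order, used as a tie-break

    def register(self, address):
        """Record a new read request, keyed by its looked-up latency."""
        latency = self.table.lookup(address)
        # heapq is a min-heap, so the latency is negated to pop the longest first.
        heapq.heappush(self._pending, (-latency, next(self._seq), address))

    def issue(self):
        """Return the address of the next request to issue, or None if idle."""
        if not self._pending:
            return None
        _, _, address = heapq.heappop(self._pending)
        return address

    def record_reply(self, address, measured_latency):
        """Feed an observed latency back into the table."""
        self.table.update(address, measured_latency)


# Example with made-up address ranges and latencies (arbitrary time units).
table = LatencyTable([(0x0000, 0x1000, 10), (0x1000, 0x2000, 40), (0x2000, 0x3000, 25)])
arbiter = ReadRequestArbiter(table)
for addr in (0x0100, 0x1100, 0x2100):
    arbiter.register(addr)
order = [arbiter.issue() for _ in range(3)]
print([hex(a) for a in order])  # ['0x1100', '0x2100', '0x100']
```

The request to the range with the longest stored latency is issued first even though it arrived second, which is the reordering the claims describe. The latency is fixed at registration time in this sketch, so a later table update affects only requests registered afterwards.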