JP2013114538A

JP2013114538A - Information processing apparatus, information processing method and control program

Info

Publication number: JP2013114538A
Application number: JP2011261512A
Authority: JP
Inventors: Kosuke Haruki; 耕祐春木
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2011-11-30
Filing date: 2011-11-30
Publication date: 2013-06-10

Abstract

PROBLEM TO BE SOLVED: To shorten memory access time and thereby improve execution performance.
SOLUTION: An information processing apparatus includes a plurality of processors capable of parallel processing and a memory shared by the plurality of processors. Allocation means allocates work groups, each composed of a plurality of threads and each having a memory access range that can be described in advance, to the processors for execution by referring to the described access ranges.

Description

Embodiments described herein relate generally to an information processing apparatus, an information processing method, and a control program.

Multi-thread processors that execute the instructions of a plurality of threads in parallel are conventionally known (see, for example, Patent Document 1).
The GPU (Graphics Processing Unit) is a well-known example of such a multi-thread processor, and in recent years GPGPU (General-Purpose computing on Graphics Processing Units), which applies the parallel computing capability of the GPU to general-purpose computation, has been proposed.

For example, CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) are known as execution platforms for GPGPU.
On these platforms, large-scale computations can be executed at high speed by operating the many arithmetic processors mounted on the GPU in parallel.

Patent Document 1: JP 2010-277371 A

In programming on a GPGPU execution platform such as those described above, the factor that most strongly affects execution performance is not the computation time on the arithmetic processors but the access time that each arithmetic processor spends accessing memory to read and write data.

The present invention has been made in view of the above, and an object thereof is to provide an information processing apparatus, an information processing method, and a control program capable of reducing memory access time and thereby improving execution performance.

The information processing apparatus according to the embodiment includes a plurality of processors capable of parallel processing and a memory shared by the plurality of processors.
Allocation means allocates work groups, each composed of a plurality of threads and each having a memory access range that can be described in advance, to the plurality of processors for execution by referring to the respective described access ranges.

FIG. 1 is a diagram illustrating an example of a schematic configuration of an information processing apparatus according to the embodiment.
FIG. 2 is an explanatory diagram of a schematic configuration of the GPU.
FIG. 3 is an explanatory diagram of the detailed configuration of an arithmetic unit.
FIG. 4 is a conceptual explanatory diagram of the execution model of an application program on the GPU.
FIG. 5 is an explanatory diagram of the VRAM usage of each work group.
FIG. 6 is an explanatory diagram of an example of the API specification.
FIG. 7 is a processing flowchart of the embodiment.

Next, embodiments will be described with reference to the drawings.
FIG. 1 is a diagram illustrating an example of a schematic configuration of an information processing apparatus according to the embodiment.
The information processing apparatus 10 broadly comprises: an MPU 11, which is a general-purpose processor; a GPU 12, which is a graphics processor; a north bridge 13 that performs interface operations for circuits communicating via relatively high-speed buses; a ROM 15 that is connected to the north bridge 13 via a memory bus 14, functions as a recording medium, and stores various control programs and the like; a RAM 16 that is connected to the north bridge 13 via the memory bus 14 and stores various data as a work area or the like; a south bridge 17 that is connected to the north bridge 13 and performs interface operations for circuits communicating via relatively low-speed buses; an HDD 18 that is connected to the south bridge 17 and functions as an external storage device; and a communication interface (IF) unit 20 that is connected to the south bridge 17 via a PCI bus 19 and performs communication interface operations.
A VRAM 22 is connected to the GPU 12 via a cache 21.
In the above configuration, the MPU 11 is configured as a so-called microcomputer and includes a CPU, a ROM, a RAM, and the like (not shown).
The GPU 12 includes a large number of arithmetic units for performing repetitions of simple, formulaic operations such as matrix operations at high speed.

The GPU 12 will now be described in detail with reference to FIG. 2.
FIG. 2 is an explanatory diagram of a schematic configuration of the GPU.
The GPU 12 includes a plurality of arithmetic units 31 and a controller 32 that controls the execution of threads on the arithmetic units 31. Each arithmetic unit 31 is connected to the cache 21 via a high-speed graphic bus 33 conforming to, for example, the PCI Express 2.0 standard. The VRAM 22, which is shared by the plurality of arithmetic units 31, is in turn connected via the cache 21.
The VRAM 22 broadly comprises a general-purpose memory area 34 and a constant memory area 35 that stores various constants.

FIG. 3 is an explanatory diagram of the detailed configuration of an arithmetic unit.
Each arithmetic unit 31 constituting the GPU 12 includes: a plurality of arithmetic elements 41; a local memory 42 constituting a memory area shared by all of the arithmetic elements 41; private memories 43, each associated with one arithmetic element 41 and constituting a memory area used exclusively by that arithmetic element 41; and a bus 44 that interconnects the arithmetic elements 41 and the local memory 42. The bus 44 and the high-speed graphic bus 33 are connected to each other.

The description now returns to the configuration of the information processing apparatus.
The north bridge 13 performs communication interface operations for relatively wide-band, high-speed buses.
The ROM 15 stores data that must be held in a nonvolatile manner, such as control programs.
The RAM 16 constitutes the so-called main memory and temporarily stores various data.
The south bridge 17 performs communication interface operations with devices connected to relatively narrow-band, low-speed buses, such as an HDD or an optical disk drive having a mechanical drive unit.
The HDD 18 operates at low speed but stores a large amount of data.
The communication interface (IF) unit 20 performs communication interface operations when exchanging data with other information processing apparatuses and the like via a communication network (not shown) conforming to the Ethernet (registered trademark) standard.

Before describing the operation of the embodiment, the data to be processed will be described.
In this embodiment, the arithmetic units 31 that execute threads are arranged, where possible, to execute thread groups (roughly, tasks) whose memory access areas (memory access ranges) are close to one another at times close to one another.
In the GPU 12, an execution model called SPMD (Single Program, Multiple Data), in which one application program is executed on a plurality of data items, is often applied.

An application program is composed of small programs called threads (work items) that are executed in parallel. When threads are executed, a plurality of threads are grouped into one work group and executed in units of work groups. Consequently, executing one application program causes a plurality of work groups to be executed in parallel.

Assignment to the arithmetic units 31 is made in units of work groups. The threads constituting a work group assigned to an arithmetic unit 31 can share the local memory 42 of that arithmetic unit 31. In other words, a plurality of work groups assigned to the same arithmetic unit 31 share the local memory 42.
Furthermore, each thread constituting a work group is assigned to an arithmetic element 41, and a thread assigned to an arithmetic element 41 can use the private memory 43, which can be referenced only by the thread assigned to that arithmetic element 41.

FIG. 4 is a conceptual explanatory diagram of the execution model of an application program on the GPU.
As shown in FIG. 4, the application program APL is conceptually composed of a plurality of work groups WG, each identified by a two-dimensional index (Wx, Wy) and each executable in parallel. Each work group WG is likewise composed of a plurality of threads TH, each executable in parallel.
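The execution model of FIG. 4 can be pictured with a short sketch. The function name `enumerate_threads` and the assumption that threads inside a work group are also indexed in two dimensions are illustrative choices, not part of the patent text; the loops here run sequentially, whereas on the GPU the work groups and threads would run in parallel.

```python
from itertools import product
from typing import Iterator, Tuple

Index2D = Tuple[int, int]


def enumerate_threads(
    groups: Index2D, threads_per_group: Index2D
) -> Iterator[Tuple[Index2D, Index2D]]:
    """Yield ((Wx, Wy), (Tx, Ty)) for every thread of every work group.

    groups            -- hypothetical number of work groups in x and y
    threads_per_group -- hypothetical number of threads per group in x and y
    """
    for wy, wx in product(range(groups[1]), range(groups[0])):
        for ty, tx in product(
            range(threads_per_group[1]), range(threads_per_group[0])
        ):
            yield (wx, wy), (tx, ty)


# A 2 x 2 grid of work groups, each holding 4 x 2 threads.
work = list(enumerate_threads(groups=(2, 2), threads_per_group=(4, 2)))
```

Each distinct `(Wx, Wy)` pair in `work` corresponds to one work group WG, and the entries that share it correspond to that group's threads TH.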

FIG. 5 is an explanatory diagram of the VRAM usage of each work group.
FIG. 5 illustrates a case in which three work groups reference the memory.
As shown in FIG. 5, the area of the general-purpose memory area 34 of the VRAM 22 referenced by the work group WG1, to which group ID = 100 is assigned, partially overlaps the area of the general-purpose memory area 34 of the VRAM 22 referenced by the work group WG3, to which group ID = 300 is assigned.

In contrast, the reference area of the work group WG2, to which group ID = 200 is assigned, does not overlap the areas referenced by either of the work groups WG1 and WG3 and is a separate area.
As described above, the reference area of the work group WG1 (group ID = 100) in the general-purpose memory area 34 of the VRAM 22 partially overlaps the reference area of the work group WG3 (group ID = 300). Therefore, even if the work groups WG1 and WG3 are simply assigned to separate arithmetic units 31 capable of operating in parallel, they cannot access the same memory address when they are executed simultaneously. In other words, because exclusive control prevents different arithmetic units 31 from fetching the same data at the same time, the effective execution efficiency is lowered.

In such a case, therefore, assigning the work groups WG1 and WG3 to the same arithmetic unit 31 makes it possible to perform an optimal assignment in the memory access space (access optimization in the spatial direction).

On the other hand, the reference area of the work group WG1 (group ID = 100) in the general-purpose memory area 34 of the VRAM 22 does not overlap the reference area of the work group WG2 (group ID = 200).
Consequently, if the work groups WG1 and WG2 are simply assigned to the same arithmetic unit 31, for example with the work group WG2 executed on that arithmetic unit 31 immediately after the work group WG1, the arithmetic unit 31 must reload data after finishing the work group WG1 and before processing the work group WG2, and the utilization efficiency of the cache 21 is lowered.

In such a case, therefore, assigning the work groups WG1 and WG2 to separate arithmetic units 31 capable of processing concurrently makes it possible to perform an optimal assignment along the time axis (access optimization in the time direction).
For these reasons, this embodiment adopts a configuration in which the runtime module is notified of the reference area (memory access range) of each work group so that access optimization in the spatial and time directions can be performed for the work groups.
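The overlap relationship that drives these decisions can be sketched for contiguous (raster-style) ranges as follows. The `AccessRange` type, the `overlap` helper, and the concrete addresses are assumptions made for illustration; the patent only states that WG1 and WG3 partially overlap and that WG2 is disjoint from both.

```python
from typing import NamedTuple


class AccessRange(NamedTuple):
    start: int  # first address referenced by the work group
    size: int   # number of consecutive addresses referenced


def overlap(a: AccessRange, b: AccessRange) -> int:
    """Return how many addresses two contiguous ranges have in common."""
    lo = max(a.start, b.start)
    hi = min(a.start + a.size, b.start + b.size)
    return max(0, hi - lo)


# Hypothetical addresses mirroring FIG. 5.
wg1 = AccessRange(start=0, size=64)    # group ID = 100
wg2 = AccessRange(start=256, size=64)  # group ID = 200
wg3 = AccessRange(start=32, size=64)   # group ID = 300
```

With these values, `overlap(wg1, wg3)` is positive while `overlap(wg1, wg2)` is zero, which is exactly the distinction the runtime needs once it has been told each work group's reference area.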

In this case, reference areas can be assigned to the work groups WG1 to WG3 in two ways: a raster-format assignment, in which a memory area with consecutive addresses is assigned as the reference area, and a tile-format assignment, in which the addresses are discontinuous but a tile-shaped (rectangular) memory area in a conceptually two-dimensional memory space is assigned as the reference area (in the linear memory space, this appears as a plurality of memory regions of equal capacity spaced a predetermined number of addresses apart).
Accordingly, this embodiment adopts an API specification that can handle both the case where the reference area is assigned in raster format and the case where it is assigned in tile format.

FIG. 6 is an explanatory diagram of an example of the API specification.
FIG. 6 shows how the reference area notification function is written when the OpenCL standard is followed.
When the reference area is notified in raster format, the parameters of the reference area notification function FN are: a kernel name parameter for identifying the kernel, which manages system resources and the interaction between hardware and software components; a reference area allocation format parameter set to "TYPE_RASTER", which corresponds to the raster format; a reference area start address parameter (start_position); a reference area size parameter (size); and a NULL parameter which, for the parameter slot that is unused but kept so that the format of the reference area notification function FN stays constant, indicates that no corresponding parameter exists.

When the reference area is notified in tile format, the parameters of the reference area notification function FN are: a kernel name parameter for identifying the kernel; a reference area allocation format parameter set to "TYPE_TILE", which corresponds to the tile format; a reference area start address parameter (start_position); a reference area horizontal size parameter (h_size), which gives the number of consecutive addresses corresponding to the width of the reference area when it is expressed as a two-dimensional memory space; and a reference area vertical size parameter (v_size), which corresponds to the height of the reference area when it is expressed as a two-dimensional memory space.
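The two parameter sets can be pictured as one descriptor that is expanded into address intervals. This is only a sketch: the patent names the function FN and its parameters but not a concrete signature, so the `ReferenceArea` class, the `to_intervals` helper, and in particular the `pitch` argument (the address distance between successive tile rows, which is not among the listed parameters) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TYPE_RASTER = "TYPE_RASTER"
TYPE_TILE = "TYPE_TILE"


@dataclass(frozen=True)
class ReferenceArea:
    kernel_name: str
    area_type: str                # TYPE_RASTER or TYPE_TILE
    start_position: int
    size: int                     # `size` for raster, `h_size` for tile
    v_size: Optional[int] = None  # None plays the role of the NULL parameter


def to_intervals(area: ReferenceArea, pitch: int) -> List[Tuple[int, int]]:
    """Expand a reference area into half-open [begin, end) address intervals.

    `pitch` is an assumed row stride of the two-dimensional memory space.
    """
    if area.area_type == TYPE_RASTER:
        return [(area.start_position, area.start_position + area.size)]
    if area.area_type == TYPE_TILE:
        if area.v_size is None:
            raise ValueError("a tile-format area needs v_size")
        return [
            (area.start_position + row * pitch,
             area.start_position + row * pitch + area.size)
            for row in range(area.v_size)
        ]
    raise ValueError(f"unknown area type: {area.area_type}")


raster = ReferenceArea("kernel_a", TYPE_RASTER, start_position=100, size=16)
tile = ReferenceArea("kernel_a", TYPE_TILE, start_position=100, size=4, v_size=3)
```

A raster area yields a single interval, while a tile area yields `v_size` equally sized intervals spaced `pitch` addresses apart, matching the description of the tile format as equal-capacity regions separated by a fixed address distance.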

Here, the reference area start address parameter is determined by the work group WG and the number of threads TH (number of items) contained in that work group WG. It is calculated from the work group ID built-in variable g_idx, which corresponds to the index identifying the work group WG, and the thread count (item count) built-in variable g_num, which gives the number of threads TH contained in the work group identified by g_idx. The built-in variables g_idx and g_num are two-dimensional data in the x and y directions for identifying the threads constituting a work group, and the dimension is selected by an index, with [0] denoting the x direction and [1] denoting the y direction.

In the example of FIG. 6,
start_position = g_num[0] * g_idx[0] + g_num[1] * g_idx[1].
The reference area size parameter (size) is a constant.
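As a worked instance of the formula above, the following sketch evaluates it for one work group. The function name and the sample values of g_num and g_idx are made up for illustration.

```python
from typing import Sequence


def start_position(g_num: Sequence[int], g_idx: Sequence[int]) -> int:
    """Evaluate the start address formula given for the FIG. 6 example.

    g_num -- thread (item) counts in the x ([0]) and y ([1]) directions
    g_idx -- work group index in the x ([0]) and y ([1]) directions
    """
    return g_num[0] * g_idx[0] + g_num[1] * g_idx[1]


# Hypothetical values: 16 threads per direction, work group at index (2, 3).
example = start_position(g_num=(16, 16), g_idx=(2, 3))  # 16*2 + 16*3
```

Because the result depends only on built-in variables, the runtime can in principle compute each waiting work group's start address before that work group runs.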

Next, the operation of the embodiment will be described.
FIG. 7 is a processing flowchart of the embodiment.
First, the controller 32 acquires the memory access area of the task group at the head of the queue (step S10).

Next, the controller 32 determines whether the queue is full (step S11).
If the queue is full (step S11; Yes), the process returns to step S10 and enters a standby state.
If the queue is not full, that is, if the queue has room (step S11; No), the controller 32 determines whether there is a task group waiting to be scheduled and whether another task group can be loaded into the queue (step S12).

If there is no task group waiting to be scheduled (step S12; No), the process returns to step S10 and enters a standby state.
If there is a task group waiting to be scheduled and another task group can be loaded into the queue (step S12; Yes), the controller 32 calculates the memory access area of the waiting task group (step S13).

Next, the controller 32 determines whether the calculated memory access area overlaps the memory access area of the group at the head of the queue (step S14).
If the calculated memory access area of the task group overlaps the memory access area of the group at the head of the queue (step S14; Yes), the controller 32 adds the task group to the list of groups whose memory access areas overlap (step S15), and the process proceeds to step S17.

If, on the other hand, the calculated memory access area of the task group does not overlap the memory access area of the group at the head of the queue (step S14; No), the controller 32 adds the task group to the list of groups whose memory access areas do not overlap (step S16).
The controller 32 then determines whether the total number of task groups contained in the list of groups whose memory access areas overlap and the list of groups whose memory access areas do not overlap has exceeded a predetermined threshold used for deciding whether to load task groups into the queue (step S17).

If the total number of task groups contained in both lists has not exceeded the predetermined threshold (step S17; No), the controller 32 again determines whether there is a task group waiting to be scheduled and whether another task group can be loaded into the queue (step S18).
If there is a task group waiting to be scheduled (step S18; Yes), the process returns to step S13 and the same processing is repeated.

If there is no task group waiting to be scheduled (step S18; No), the controller 32 takes, from the task groups contained in the list of groups whose memory access areas overlap, the task group with the largest overlapping area and loads it into the queue in the time direction (step S19). That is, it is queued so that it is processed on the same arithmetic unit 31 (or arithmetic element 41) after the work group at the head of the queue has been processed.

The controller 32 also takes, from the task groups contained in the list of groups whose memory access areas do not overlap, the task group with the closest address and loads it into the queue in the spatial direction (step S20). That is, it is queued so that it is processed on an arithmetic unit 31 (or arithmetic element 41) different from the one that processes the task group at the head of the queue.
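Steps S13 to S20 can be summarized in a compact sketch. It is a simplification under stated assumptions: ranges are contiguous, "closest address" is taken to mean the smallest gap between ranges, exceeding the threshold in step S17 is assumed to lead directly to the selection steps (the text describes only the "No" branch), and the names `Group`, `gap`, and `pick_next` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Group(NamedTuple):
    group_id: int
    start: int  # first address of the memory access area
    size: int   # number of consecutive addresses


def overlap(a: Group, b: Group) -> int:
    """Number of addresses shared by two contiguous access areas."""
    return max(0, min(a.start + a.size, b.start + b.size) - max(a.start, b.start))


def gap(a: Group, b: Group) -> int:
    """Address distance between two non-overlapping access areas."""
    if a.start <= b.start:
        return b.start - (a.start + a.size)
    return a.start - (b.start + b.size)


def pick_next(
    head: Group, waiting: List[Group], threshold: int
) -> Tuple[Optional[Group], Optional[Group]]:
    """Return (group for the same unit, group for a different unit)."""
    overlapping: List[Group] = []
    disjoint: List[Group] = []
    for group in waiting:                                # S13, S18 loop
        if overlap(head, group) > 0:                     # S14
            overlapping.append(group)                    # S15
        else:
            disjoint.append(group)                       # S16
        if len(overlapping) + len(disjoint) > threshold: # S17 (assumed exit)
            break
    same_unit = max(overlapping, key=lambda g: overlap(head, g), default=None)  # S19
    other_unit = min(disjoint, key=lambda g: gap(head, g), default=None)        # S20
    return same_unit, other_unit


head = Group(100, start=0, size=64)
waiting = [
    Group(200, start=256, size=64),
    Group(300, start=32, size=64),
    Group(400, start=48, size=64),
    Group(500, start=128, size=64),
]
same_unit, other_unit = pick_next(head, waiting, threshold=10)
```

In this example the group with ID 300 shares the most addresses with the head group and is chosen for the same unit, while the group with ID 500 is the nearest non-overlapping group and is chosen for a different unit.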

As described above, according to this embodiment, work groups or threads whose memory access areas (memory reference areas) overlap, and which therefore cannot access the memory at the same time, are assigned so that they are queued in the time direction. The cache hit rate can thus be improved, and processing efficiency can be improved.

Work groups or threads whose memory access areas (memory reference areas) do not overlap, and which can therefore access the memory at the same time, are queued in the spatial direction, that is, assigned to different arithmetic units 31 or different arithmetic elements 41. Work groups or threads that can access the memory simultaneously can thus be processed in parallel, and processing efficiency can be improved.

The control program executed by the information processing apparatus of this embodiment is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

The control program executed by the information processing apparatus of this embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The control program may likewise be provided or distributed via a network such as the Internet.

The control program of the information processing apparatus of this embodiment may also be provided by being incorporated in advance in a storage medium such as a ROM.
The control program executed by the information processing apparatus of this embodiment has a module configuration including the units described above (reference means and allocation means). In terms of actual hardware, a CPU (processor) reads the control program from a storage medium such as a ROM or from the recording medium described above and executes it, whereby these units are loaded onto the main storage device and the reference means and the allocation means are generated on the main storage device.

Although several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS: 10: information processing apparatus; 11: MPU; 12: GPU; 15: ROM (recording medium); 21: cache; 22: VRAM; 31: arithmetic unit (processor); 32: controller (reference means, allocation means); 33: high-speed graphic bus; 34: general-purpose memory area (memory); 41: arithmetic element (processor); 42: local memory (memory); 43: private memory; 44: bus; APL: application program; FN: reference area notification function; TH: thread; WG, WG1 to WG3: work group; WGG: set of work groups.

The information processing apparatus according to the embodiment includes a plurality of processors capable of parallel processing and a memory shared by the plurality of processors.
The memory access range of a work group composed of a plurality of threads can be described in advance in a reference area notification function. The allocation means acquires the described access range by referring to the reference area notification function and, based on the acquired access range, allocates the work group to one of the plurality of processors for execution.

Claims

1. An information processing apparatus comprising:
a plurality of processors capable of parallel processing;
a memory shared by the plurality of processors; and
allocation means for allocating work groups, each composed of a plurality of threads and each having an access range of the memory that can be described in advance, to the plurality of processors for execution by referring to the respective described access ranges.

2. The information processing apparatus according to claim 1, wherein
the processors are connected to the memory via a cache, and
the allocation means successively allocates, to the same processor, a plurality of work groups whose access ranges at least partially overlap.

3. The information processing apparatus according to claim 2, wherein
the allocation means preferentially allocates to the same processor, among a plurality of work groups whose access ranges at least partially overlap, the work group having the largest overlapping range.

4. The information processing apparatus according to any one of claims 1 to 3, wherein
the allocation means allocates a plurality of work groups whose access ranges do not overlap to different processors so that they are processed in parallel.

5. The information processing apparatus according to claim 4, wherein
the allocation means preferentially allocates to a processor, among a plurality of work groups whose access ranges do not overlap, the work group having the closest address.

6. The information processing apparatus according to any one of claims 1 to 5, wherein
the processors are configured as arithmetic units constituting a GPU, and
the memory is configured as a VRAM shared by the arithmetic units.

7. The information processing apparatus according to any one of claims 1 to 5, wherein
the processors are configured as arithmetic elements constituting each of a plurality of arithmetic units included in a GPU, and
the memory is configured as a local memory allocated to each arithmetic unit.

8. An information processing method executed in an information processing apparatus comprising a plurality of processors capable of parallel processing and a memory shared by the plurality of processors, the method comprising:
a reference step of referring, for work groups each composed of a plurality of threads and each having an access range of the memory that can be described in advance, to the described access ranges; and
an allocation step of allocating each of the work groups to one of the plurality of processors for execution based on the referenced access ranges.

9. A control program for controlling, by a computer, an information processing apparatus comprising a plurality of processors capable of parallel processing and a memory shared by the plurality of processors, the control program causing the computer to function as:
reference means for referring, for work groups each composed of a plurality of threads and each having an access range of the memory that can be described in advance, to the described access ranges; and
allocation means for allocating each of the work groups to one of the plurality of processors for execution based on the referenced access ranges.