JP2014164664A - Task parallel processing method and device and program

Info

Publication number: JP2014164664A
Application number: JP2013037130A
Authority: JP
Inventors: Takahiro Hisamura; 孝寛久村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-02-27
Filing date: 2013-02-27
Publication date: 2014-09-08

Abstract

PROBLEM TO BE SOLVED: To shorten the access time to data shared between processors. SOLUTION: In a task parallel processing method, when a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, the first processor waits to execute the first task until the output data of the second task can be referenced by the first processor. Once the output data of the second task can be referenced, the first processor acquires the output data from its storage source and stores it in a first local memory owned by the first processor. The first processor then executes the first task by referencing the output data stored in the first local memory, and stores the output data of the first task in the first local memory. When a third task, which is a following task of the first task, is assigned to a processor other than the first processor, the output data of the first task is made referenceable by processors other than the first processor.

Description

本発明は、タスク並列処理方法、装置及びプログラムに関し、特に、有線回線及び無線回線を用いて受信した信号を処理するタスク並列処理方法、装置及びプログラムに関する。 The present invention relates to a task parallel processing method, apparatus, and program, and more particularly, to a task parallel processing method, apparatus, and program for processing signals received using a wired line and a wireless line.

複数のプロセッサが協調して並列処理を行うためには、複数のプロセッサは何らかの手段でデータを共有する必要がある。例えば、ひとつのＬＳＩ（ＬａｒｇｅＳｃａｌｅＩｎｔｅｇｒａｔｉｏｎ）に複数のプロセッサが搭載されたマルチコアプロセッサにおいて、一般的なデータ共有手段は、複数のプロセッサからアクセス可能な共有メモリを用いることである。 In order for a plurality of processors to perform parallel processing in cooperation, the plurality of processors need to share data by some means. For example, in a multi-core processor in which a plurality of processors are mounted on one LSI (Large Scale Integration), a common data sharing means is to use a shared memory accessible from the plurality of processors.

複数のプロセッサからアクセス可能な共有メモリは、通常、アクセス調停のために、アクセスに時間がかかる。そこで、各プロセッサに高速なメモリを配置するのが一般的である。各プロセッサの高速なメモリの実現手段としては、ローカルメモリ又はキャッシュ（メモリ）のいずれかが挙げられる。 The shared memory accessible from a plurality of processors usually takes time to access due to access arbitration. Therefore, it is common to arrange a high-speed memory in each processor. As a means for realizing a high-speed memory of each processor, either a local memory or a cache (memory) can be cited.

キャッシュは、共有メモリの部分的なコピー（以下、「部分的コピー」という。）を保存しておく一時記憶手段であり、キャッシュの動作はキャッシュ自身によって自動的に制御される。一方、ローカルメモリは単なる一時記憶手段である。 The cache is temporary storage means for storing a partial copy of the shared memory (hereinafter referred to as “partial copy”), and the operation of the cache is automatically controlled by the cache itself. On the other hand, the local memory is merely temporary storage means.

マルチコアプロセッサにおいて、各プロセッサがキャッシュをもつ場合には、複数のキャッシュのデータの整合性（コヒーレント）を保つ必要がある。コヒーレントを維持する手段としては、キャッシュ自身にそのような機能を追加するハードウェア的な方法と、ソフトウェア的な方法と、二つが考えられる。一般的にはハードウェア的な方法が使われることが多い。ハードウェア的な方法はキャッシュの回路規模を大きくし、キャッシュの消費電力を増やすというデメリットがある。 In a multi-core processor, when each processor has a cache, it is necessary to maintain data consistency (coherent) of the plurality of caches. There are two methods for maintaining coherence: a hardware method for adding such a function to the cache itself, and a software method. In general, hardware methods are often used. The hardware method has the demerit of increasing the cache circuit scale and increasing the power consumption of the cache.

一方、ソフトウェア的な方法は、そのデメリットを回避することができるものの、各プロセッサは共有データにアクセスする際に所定のソフトウェア的手順を順守する必要がある。この所定のソフトウェア的手順は、マルチコアプロセッサ上で動作するソフトウェア全体の構成にも影響を与えるものである。ソフトウェア的な方法のひとつの例が特許文献５に報告されている。 On the other hand, the software method can avoid the disadvantages, but each processor needs to follow a predetermined software procedure when accessing the shared data. This predetermined software procedure also affects the configuration of the entire software operating on the multi-core processor. One example of a software method is reported in Patent Document 5.

キャッシュは、共有メモリの部分的コピーを自動的に管理するので、非常に便利な一時記憶手段である。しかしながら、プロセッサがデータをキャッシュにリード要求してから、そのデータをキャッシュがプロセッサへ渡すまでにかかる時間は、そのデータがキャッシュに存在するか否かで、大きく変わる。つまり、キャッシュのデータ読み出しにかかる時間はばらつく。 Since the cache automatically manages a partial copy of the shared memory, it is a very convenient temporary storage means. However, the time taken from when the processor requests the cache to read data until the cache delivers the data to the processor varies greatly depending on whether or not the data exists in the cache. That is, the time required for reading data from the cache varies.

そのため、キャッシュのデータ読み出し時間のばらつきを回避するために、ローカルメモリを積極的に使う、という分野がある。特に、リアルタイム性を重視するような分野で、ローカルメモリをもつマルチコアプロセッサが使われる。 Therefore, there is a field in which local memory is actively used in order to avoid variations in cache data read time. In particular, a multi-core processor having a local memory is used in a field where real-time characteristics are important.

ローカルメモリはキャッシュと異なり、ローカルメモリのデータ読み出しにかかる時間は短く一定である。但し、ローカルメモリの容量は限られているため、マルチコアプロセッサで動作するソフトウェアを設計する開発者が、ローカルメモリにどのようなデータを配置するかを厳密に設計する必要がある。 Unlike a cache, a local memory has a data read time that is both short and constant. However, because the capacity of a local memory is limited, a developer designing software for a multi-core processor must decide precisely which data to place in the local memory.

ローカルメモリにどのようなデータを配置するかは、マルチコアプロセッサでどのように並列処理を行うかという問題と密接にかかわっている。そのため、ソフトウェア開発者が、並列処理とデータ配置とを合わせて検討してきた。 What kind of data is arranged in the local memory is closely related to the problem of how to perform parallel processing in a multi-core processor. For this reason, software developers have studied both parallel processing and data placement.

従来から、複数のプロセッサがアクセスする可能性があるデータを共有メモリに配置し、ひとつのプロセッサだけがアクセスするデータをローカルメモリに配置する、という手法が一般的に使われている（例えば、図１１）。この手法はデータ配置を固定的に決めるものである。この方法は非常にシンプルで考えやすいが、デメリットがある。それは、共有メモリに配置されたデータへのアクセスには時間がかかる、というデメリットである。 Conventionally, a method of arranging data that can be accessed by a plurality of processors in a shared memory and arranging data accessed by only one processor in a local memory has been generally used (for example, FIG. 11). This method determines the data arrangement in a fixed manner. This method is very simple and easy to think, but has its disadvantages. This is a disadvantage that it takes time to access data arranged in the shared memory.

このデメリットを回避するために、共有メモリの部分的コピーをローカルメモリに置くという手法が考えられる。ところが、この場合には、部分的コピーの管理が非常に煩雑になる、という別の問題が発生する。この手法を実行するには、マルチコアプロセッサでの並列処理と、各プロセッサのローカルメモリ上の部分的コピーの管理と、マルチプロセッサ間における部分的コピーの整合性維持と、を合わせて考える必要がある。部分的コピーの整合性維持とは、複数プロセッサのローカルメモリに同じ部分的コピーを置いて、それらをそれぞれ異なる値で書き換えることが無いようにすること、である。 To avoid this disadvantage, a partial copy of the shared memory could be placed in the local memory. In that case, however, another problem arises: managing the partial copies becomes very complicated. To carry out this approach, parallel processing on the multi-core processor, management of the partial copies in each processor's local memory, and maintenance of the consistency of the partial copies across processors must all be considered together. Maintaining the consistency of partial copies means ensuring that the same partial copy is never placed in the local memories of multiple processors and then rewritten there with different values.

つまり、並列処理において、ローカルメモリ上に部分的コピーを置くためには、ローカルメモリ上の部分的コピーの管理と整合性維持と並列処理とを統合的に扱う必要がある。しかしながら、これらを統合的に扱うことは、プログラマにとって大きな負担であり、非常に工数がかかるため、現実的ではなかった。 In other words, in order to place a partial copy on the local memory in parallel processing, it is necessary to handle the partial copy management, consistency maintenance, and parallel processing on the local memory in an integrated manner. However, it is not practical to handle these in an integrated manner because it is a heavy burden on the programmer and takes a lot of man-hours.

尚、関連技術として、以下の特許文献１〜４がある。特許文献１には、処理性能を落とさずに消費電力を低減し、また、実時間処理を要求するプログラムを実行するに際しても、時間制約を遵守しつつ、電力を低減することを目的とするマルチプロセッサシステム及びコンパイラに関する技術が開示されている。 Related art includes Patent Documents 1 to 4 below. Patent Document 1 discloses a technique relating to a multiprocessor system and a compiler that aims to reduce power consumption without degrading processing performance and, when executing a program that requires real-time processing, to reduce power while still observing the time constraints.

特許文献２には、メモリへデータを効率よく配置するためのメモリ管理方法に関する技術が開示されている。特に、特許文献１にかかるプロセッサは、メモリの記憶領域を複数の異なるサイズのブロックに分割し、タスクの実行時に使用されるデータに適合するサイズのブロックを選択し、選択されたブロックに、タスクの実行時に使用されるデータを格納するものである。 Patent Document 2 discloses a technique related to a memory management method for efficiently arranging data in a memory. In particular, the processor according to Patent Document 1 divides the storage area of a memory into a plurality of blocks of different sizes, selects a block whose size fits the data used when a task is executed, and stores that data in the selected block.

特許文献３には、共有メモリと複数のプロセッサとを有し、各プロセッサがローカルメモリを有する処理システムに関する技術が開示されている。特許文献３にかかるプロセッサは、共有メモリからプログラムの実行等に関連するローカルメモリへデータをコピーする。 Patent Document 3 discloses a technique related to a processing system having a shared memory and a plurality of processors, each processor having a local memory. The processor according to Patent Document 3 copies data from a shared memory to a local memory related to program execution or the like.

特許文献４には、プロセッサ間排他制御機能を有しない複数個のプロセッサと共有メモリとを装備したマルチプロセッサシステムに関する技術が開示されている。特許文献４にかかるマルチプロセッサシステムでは、あるプロセッサが他系プロセッサからの処理要求を受けると、送信バッファ内のデータに基づいた処理を行い、その後、処理要求元のプロセッサに処理終了を通知する。 Patent Document 4 discloses a technique related to a multiprocessor system including a plurality of processors having no interprocessor exclusive control function and a shared memory. In the multiprocessor system according to Patent Document 4, when a certain processor receives a processing request from another processor, it performs processing based on the data in the transmission buffer, and then notifies the processing requesting processor of the end of processing.

特開２００６−２９３７６８号公報 JP 2006-293768 A
特開２００８−２１７１３４号公報 JP 2008-217134 A
特開２００６−２２１６３８号公報 JP 2006-221638 A
特開２０００−２３５５５３号公報 JP 2000-235553 A
国際公開第２００９／０５７７６２号 International Publication No. 2009/057762

以上説明したように、共有メモリとローカルメモリとを備えるマルチコア・プロセッサ上での並列処理において、「複数のプロセッサからアクセスされる可能性がある共有データを共有メモリに配置する」という従来から一般的に使われている共有データ配置方法は、「共有メモリに配置されたデータへのアクセスには時間がかかる」、という課題を持っている。この課題は、共有データの数が多い場合に並列処理全体の処理時間が増加する、という課題につながる。その理由は、共有メモリに格納された変数等の共有データに対しては、タスクの実行中に複数回のアクセスが発生し得るからである。尚、特許文献１乃至５にはこれらを解決する手段が開示されていない。 As described above, in parallel processing on a multi-core processor having a shared memory and local memories, the commonly used shared data placement method, which places shared data that may be accessed by multiple processors in the shared memory, has the problem that accessing data placed in the shared memory takes time. This in turn means that the processing time of the parallel processing as a whole increases when there is a large amount of shared data, because shared data such as variables stored in the shared memory may be accessed many times during the execution of a task. Patent Documents 1 to 5 do not disclose means for solving these problems.

本発明は、上述した問題点を考慮してなされたものであり、プロセッサ間で共有するデータへのアクセス時間を短縮するためのタスク並列処理方法、装置及びプログラムを提供することを目的とする。 The present invention has been made in view of the above-described problems, and an object of the present invention is to provide a task parallel processing method, apparatus, and program for shortening access time to data shared between processors.

本発明の第１の態様にかかるタスク並列処理方法は、
第１プロセッサに割り当てられた第１タスクにおける先行タスクである第２タスクが当該第１プロセッサ以外に割り当てられている場合に、当該第２タスクによる出力データが当該第１プロセッサにより参照可能となるまで、当該第１タスクの実行を待機し、
前記第２タスクによる出力データが前記第１プロセッサにより参照可能となった後、当該出力データの格納元から当該出力データを取得して前記第１プロセッサが有する第１ローカルメモリへ格納し、
前記第１ローカルメモリに格納された出力データを参照して、当該第１プロセッサが前記第１タスクを実行し、当該第１タスクによる出力データを当該第１ローカルメモリに格納し、
前記第１タスクにおける後続タスクである第３タスクが当該第１プロセッサ以外のプロセッサに割り当てられている場合に、前記第１タスクによる出力データを当該第１プロセッサ以外のプロセッサから参照可能な状態とする。 The task parallel processing method according to the first aspect of the present invention includes:
when a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, waiting to execute the first task until the output data of the second task can be referenced by the first processor;
after the output data of the second task can be referenced by the first processor, acquiring the output data from its storage source and storing it in a first local memory of the first processor;
executing, by the first processor, the first task with reference to the output data stored in the first local memory, and storing the output data of the first task in the first local memory; and
when a third task, which is a subsequent task of the first task, is assigned to a processor other than the first processor, making the output data of the first task referenceable from processors other than the first processor.

本発明の第２の態様にかかるタスク並列処理装置は、
ローカルメモリを有する複数のプロセッサを備え、
前記複数のプロセッサのうち第１プロセッサは、
第１ローカルメモリを有し、
当該第１プロセッサに割り当てられた第１タスクにおける先行タスクである第２タスクが当該第１プロセッサ以外に割り当てられている場合に、当該第２タスクによる出力データが参照可能となるまで、当該第１タスクの実行を待機し、
前記第２タスクによる出力データが参照可能となった後、当該出力データの格納元から当該出力データを取得して前記第１ローカルメモリへ格納し、
前記第１ローカルメモリに格納された出力データを参照して、前記第１タスクを実行し、当該第１タスクによる出力データを当該第１ローカルメモリに格納し、
前記第１タスクにおける後続タスクである第３タスクが当該第１プロセッサ以外のプロセッサに割り当てられている場合に、前記第１タスクによる出力データを当該第１プロセッサ以外のプロセッサから参照可能な状態とする。 The task parallel processing device according to the second aspect of the present invention
comprises a plurality of processors each having a local memory,
wherein a first processor of the plurality of processors
has a first local memory;
when a second task, which is a preceding task of a first task assigned to the first processor, is assigned to a processor other than the first processor, waits to execute the first task until the output data of the second task can be referenced;
after the output data of the second task can be referenced, acquires the output data from its storage source and stores it in the first local memory;
executes the first task with reference to the output data stored in the first local memory, and stores the output data of the first task in the first local memory; and
when a third task, which is a subsequent task of the first task, is assigned to a processor other than the first processor, makes the output data of the first task referenceable from processors other than the first processor.

本発明の第３の態様にかかるタスク並列処理プログラムは、
第１プロセッサに割り当てられた第１タスクにおける先行タスクである第２タスクが当該第１プロセッサ以外に割り当てられている場合に、当該第２タスクによる出力データが当該第１プロセッサにより参照可能となるまで、当該第１タスクの実行を待機する処理と、
前記第２タスクによる出力データが前記第１プロセッサにより参照可能となった後、当該出力データの格納元から当該出力データを取得して前記第１プロセッサが有する第１ローカルメモリへ格納する処理と、
前記第１ローカルメモリに格納された出力データを参照して、当該第１プロセッサが前記第１タスクを実行し、当該第１タスクによる出力データを当該第１ローカルメモリに格納する処理と、
前記第１タスクにおける後続タスクである第３タスクが当該第１プロセッサ以外のプロセッサに割り当てられている場合に、前記第１タスクによる出力データを当該第１プロセッサ以外のプロセッサから参照可能な状態とする処理と、
をコンピュータに実行させる。 The task parallel processing program according to the third aspect of the present invention causes a computer to execute:
a process of, when a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, waiting to execute the first task until the output data of the second task can be referenced by the first processor;
a process of, after the output data of the second task can be referenced by the first processor, acquiring the output data from its storage source and storing it in a first local memory of the first processor;
a process in which the first processor executes the first task with reference to the output data stored in the first local memory and stores the output data of the first task in the first local memory; and
a process of, when a third task, which is a subsequent task of the first task, is assigned to a processor other than the first processor, making the output data of the first task referenceable from processors other than the first processor.

本発明により、プロセッサ間で共有するデータへのアクセス時間を短縮するためのタスク並列処理方法、装置及びプログラムを提供することができる。 According to the present invention, it is possible to provide a task parallel processing method, apparatus, and program for shortening access time to data shared between processors.

本発明の実施形態にかかるマルチコアプロセッサの構成及びデータ配置の例を示すブロック図である。 FIG. 1 is a block diagram showing an example of the configuration and data arrangement of the multi-core processor according to the embodiment of the present invention.
本発明の実施形態で対象とするタスクフローグラフの例を示す図である。 FIG. 2 is a diagram showing an example of a task flow graph targeted by the embodiment of the present invention.
本発明の実施形態にかかるタスク並列処理方法の処理の流れを示すフローチャートである。 FIG. 3 is a flowchart showing the processing flow of the task parallel processing method according to the embodiment of the present invention.
本発明の実施形態にかかるプロセッサの処理の流れを示すフローチャートである。 FIG. 4 is a flowchart showing the processing flow of the processor according to the embodiment of the present invention.
本発明の実施形態の具体例１にかかるタスク割当てとデータ配置の例を示す図である。 FIG. 5 is a diagram showing an example of task allocation and data arrangement according to specific example 1 of the embodiment of the present invention.
本発明の実施形態の具体例１にかかるタスク処理の流れを示すフローチャートである。 FIG. 6 is a flowchart showing the flow of task processing according to specific example 1 of the embodiment of the present invention.
本発明の実施形態の具体例１にかかるタスク処理の流れを示すフローチャートである。 FIG. 7 is a flowchart showing the flow of task processing according to specific example 1 of the embodiment of the present invention.
本発明の実施形態の具体例２にかかるタスク割当てとデータ配置の例を示す図である。 FIG. 8 is a diagram showing an example of task allocation and data arrangement according to specific example 2 of the embodiment of the present invention.
本発明の実施形態の具体例２にかかるタスク処理の流れを示すフローチャートである。 FIG. 9 is a flowchart showing the flow of task processing according to specific example 2 of the embodiment of the present invention.
本発明の実施形態の具体例２にかかるタスク処理の流れを示すフローチャートである。 FIG. 10 is a flowchart showing the flow of task processing according to specific example 2 of the embodiment of the present invention.
関連技術にかかるマルチコアプロセッサにおけるデータ配置の例を示すブロック図である。 FIG. 11 is a block diagram showing an example of data arrangement in a multi-core processor according to the related art.

以下では、本発明を適用した具体的な実施の形態について、図面を参照しながら詳細に説明する。各図面において、同一要素には同一の符号が付されており、説明の明確化のため、必要に応じて重複説明は省略する。 Hereinafter, specific embodiments to which the present invention is applied will be described in detail with reference to the drawings. In the drawings, the same elements are denoted by the same reference numerals, and redundant description will be omitted as necessary for the sake of clarity.

まず、本発明の実施形態の説明をするにあたり、本実施形態が処理対象とする並列処理モデル及びトポロジカルソートに関して説明する。 First, in describing the embodiment of the present invention, a parallel processing model and a topological sort to be processed by the present embodiment will be described.

＜本発明が対象とする並列処理モデル＞ <Parallel processing model targeted by the present invention>
本発明の模範的な実施形態が扱う並列処理について説明する。本発明の模範的な実施形態は、非循環型有向グラフ（ＤＡＧ：directed acyclic graph）に属するタスクフローグラフで表現可能な並列処理を扱う。以降では、特に限定しない場合、「タスクフローグラフ」は非循環型有向グラフに属するものとする。 The parallel processing handled by the exemplary embodiment of the present invention will be described. The exemplary embodiment deals with parallel processing that can be represented by a task flow graph that is a directed acyclic graph (DAG). Hereinafter, unless otherwise specified, a "task flow graph" is assumed to be a directed acyclic graph.

タスクフローグラフは、並列処理を構成する計算処理（タスク）のデータ依存関係を表すグラフ構造である。並列処理を構成する計算処理をタスクと呼ぶことにする。タスクフローグラフにおいて、ノードはタスクを表し、ノード間のエッジはタスク間のデータ依存関係を表す。タスクフローグラフの例を図２に示す。図２には６個のタスクＴ１〜Ｔ６が存在する。タスクフローグラフを使うことにより、対象となる並列処理を構成するタスクのデータ依存関係を簡単に表現することができる。ここでいう「データ依存関係」とは、実行対象のタスクが使用するデータに関してデータを供給するタスクの役割を果たすデータ供給タスクと、当該データを使用するタスクの役割を果たすデータ使用タスクとの間に存在する実行順序（先行、後続）等の依存関係を示すものである。 A task flow graph is a graph structure representing the data dependencies among the calculation processes (tasks) that make up the parallel processing. A calculation process constituting the parallel processing is called a task. In a task flow graph, a node represents a task, and an edge between nodes represents a data dependency between tasks. An example of a task flow graph is shown in FIG. 2, which contains six tasks T1 to T6. Using a task flow graph makes it easy to express the data dependencies of the tasks constituting the target parallel processing. Here, a "data dependency" is a dependency, such as execution order (preceding, succeeding), that exists between a data supply task, which supplies data used by the task to be executed, and a data use task, which uses that data.

例えば、タスクＴ１の計算結果（データＡ）をタスクＴ２が使用する場合を想定する。この場合には、データＡに関して、タスクＴ１はデータ供給タスクであり、タスクＴ２はデータ使用タスクである。そして、このような場合に、データ供給タスクであるタスクＴ１からデータ使用タスクであるタスクＴ２に対してデータ依存関係が存在すると、言える。そして、タスクＴ１からタスクＴ２に対してデータ依存関係が存在するなら、タスクＴ２はタスクＴ１の後で実行されなければならない。この場合、タスクＴ１はタスクＴ２における先行タスクであり、タスクＴ２はタスクＴ１における後続タスクである。 For example, suppose that task T2 uses the calculation result (data A) of task T1. In this case, with respect to data A, task T1 is the data supply task and task T2 is the data use task, and a data dependency is said to exist from task T1 to task T2. If a data dependency exists from task T1 to task T2, task T2 must be executed after task T1. In this case, task T1 is a preceding task of task T2, and task T2 is a succeeding task of task T1.
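The data supply and data use relation can be sketched as a small edge table. This is only an illustration: the T1 to T2 edge carrying data A comes from the text, while the other edges, the data names B to D, and the helper names `predecessors` and `successors` are assumptions made for the example and are not taken from FIG. 2.

```python
# Hypothetical task flow graph: (data supply task, data use task) -> data name.
# Only the T1 -> T2 edge (data A) is taken from the text; the rest is assumed.
EDGES = {
    ("T1", "T2"): "A",
    ("T1", "T3"): "B",
    ("T2", "T4"): "C",
    ("T3", "T4"): "D",
}

def predecessors(task):
    """Preceding (data supply) tasks: tasks with an edge into `task`."""
    return sorted(src for (src, dst) in EDGES if dst == task)

def successors(task):
    """Succeeding (data use) tasks: tasks with an edge out of `task`."""
    return sorted(dst for (src, dst) in EDGES if src == task)
```

With this table, `predecessors("T2")` yields T1 and `successors("T1")` yields T2 and T3, matching the rule that a data use task must run after its data supply task.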

さらに、本発明の模範的な実施形態が扱う並列処理では、タスクフローグラフの各ノードには、トポロジカルソートにもとづいた順序番号が付与されているものとする。トポロジカルソートについては後述する。 Furthermore, in the parallel processing handled by the exemplary embodiment of the present invention, it is assumed that each node in the task flow graph is assigned a sequence number based on topological sort. The topological sort will be described later.

加えて、本発明の模範的な実施形態が扱う並列処理では、タスクフローグラフで表現可能な計算処理を複数のプロセッサ又は複数のコア（以下、プロセッサとコアとを含めて「プロセッサ」という）で並列に計算するものとし、タスクフローグラフの各タスクをどのプロセッサが実行するかが予め決定されているものとする。どのプロセッサがどのタスクを実行するかという情報は、タスクフローグラフの付属情報とする。一般的には、プロセッサの数よりもタスクの数が多いので、ひとつのプロセッサが複数のタスクを実行することになる。前述のとおり、タスクにはトポロジカルオーダにもとづいた順序番号が付与されているので、本発明の模範的な実施形態は、この順序番号をタスクの実行順序とみなして、各プロセッサに、順序番号が小さいタスクから順番にタスクを実行させる。本実施形態は、このようなタスクフローグラフで表現可能な計算処理の並列処理を扱う。 In addition, in the parallel processing handled by the exemplary embodiment of the present invention, calculation processing that can be expressed by a task flow graph is performed by a plurality of processors or a plurality of cores (hereinafter referred to as “processors” including the processor and the core). It is assumed that calculation is performed in parallel, and which processor executes each task of the task flow graph is determined in advance. Information about which processor executes which task is information attached to the task flow graph. Generally, since the number of tasks is larger than the number of processors, one processor executes a plurality of tasks. As described above, since tasks are given sequence numbers based on topological orders, the exemplary embodiment of the present invention regards this sequence number as the execution order of tasks, and each processor has a sequence number. The tasks are executed in order from the smallest task. The present embodiment handles parallel processing of computation processing that can be represented by such a task flow graph.
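As a rough sketch of this model, the per-processor execution order can be derived from the order numbers and the task-to-processor assignment. The concrete order numbers, the assignment, and the function name `run_queues` below are hypothetical values chosen for illustration.

```python
# Hypothetical order numbers (from a topological sort) and processor assignment.
ORDER = {"T1": 1, "T2": 2, "T3": 3, "T4": 4, "T5": 5, "T6": 6}
ASSIGNMENT = {"T1": "CPU21", "T2": "CPU21", "T3": "CPU22",
              "T4": "CPU22", "T5": "CPU21", "T6": "CPU22"}

def run_queues(order, assignment):
    """Group tasks by processor, each queue sorted by ascending order number."""
    queues = {}
    for task in sorted(order, key=order.get):
        queues.setdefault(assignment[task], []).append(task)
    return queues
```

Each processor then simply walks its own queue from front to back, which is the "smallest order number first" rule described above.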

＜トポロジカルソートについて＞ <About topological sort>
続いて、トポロジカルソートについて説明する。トポロジカルソートとは、非循環有向グラフの各ノードを順序付けして、どのノードもその出力エッジの先のノードよりも前にくるように並べることである。トポロジカルソートの典型的な利用例は、タスクのスケジューリングや、コンパイラにおける命令スケジューリングである。つまり、データ依存関係をもつタスクセットを非循環有向グラフで表現し、その非循環有向グラフをトポロジカルソートすることによって、タスクを実行すべき順序がわかることになる。トポロジカルソートで得られる順序は、ひとつだけとは限らず、複数の正しい順序が存在しうる。 Next, topological sort will be described. A topological sort orders the nodes of a directed acyclic graph so that every node comes before all of the nodes that its outgoing edges point to. Typical uses of topological sort are task scheduling and instruction scheduling in compilers. That is, by representing a set of tasks with data dependencies as a directed acyclic graph and topologically sorting that graph, the order in which the tasks should be executed is obtained. The order produced by a topological sort is not necessarily unique; multiple valid orders may exist.
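One common way to compute such an order is Kahn's algorithm, sketched below. The text does not say which algorithm is used, so this choice, the function names, and the small example graph in the usage note are assumptions for illustration.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return one valid topological order using Kahn's algorithm."""
    indegree = {n: 0 for n in nodes}
    outgoing = {n: [] for n in nodes}
    for src, dst in edges:
        outgoing[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in outgoing[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

def is_valid_order(order, edges):
    """Check that every edge goes from an earlier node to a later node."""
    pos = {n: i for i, n in enumerate(order)}
    return all(pos[src] < pos[dst] for src, dst in edges)
```

For a diamond-shaped graph with edges T1 to T2, T1 to T3, T2 to T4, and T3 to T4, both T1, T2, T3, T4 and T1, T3, T2, T4 pass `is_valid_order`, which illustrates that more than one correct order can exist.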

＜一般的なデータ配置方法の課題＞ <Problems of general data placement methods>
続いて、本発明の実施形態の説明の前に、関連技術にかかる一般的なデータ配置方法の課題をまとめておく。ローカルメモリ１１〜１４を有するＣＰＵ２１〜２４と、共有メモリ３０とを備えたマルチコアプロセッサ９００における、一般的なデータ配置方法の例を図１１に示す。図１１に示す関連技術にかかる方法は、データを非共有か共有かで分類し、共有データ９０１を共有メモリ３０に配置し、非共有データ９１１、９２１、９３１及び９４１をローカルメモリ１１、１２，１３及び１４にそれぞれ配置する。ここで、共有メモリ３０とローカルメモリ１１〜１４に配置されるデータには重複がない。つまり、共有データ９０１は複数のプロセッサが使用するデータであり、非共有データ９１１はＣＰＵ２１のみが使用するデータであり、非共有データ９２１はＣＰＵ２２のみが使用するデータであり、非共有データ９３１はＣＰＵ２３のみが使用するデータであり、非共有データ９４１はＣＰＵ２４のみが使用するデータである。 Before describing the embodiment of the present invention, the problems of the general data placement method according to the related art are summarized. FIG. 11 shows an example of a general data placement method in a multi-core processor 900 that includes CPUs 21 to 24 having local memories 11 to 14, and a shared memory 30. The related-art method shown in FIG. 11 classifies data as non-shared or shared, places the shared data 901 in the shared memory 30, and places the non-shared data 911, 921, 931, and 941 in the local memories 11, 12, 13, and 14, respectively. There is no overlap between the data placed in the shared memory 30 and the data placed in the local memories 11 to 14. That is, the shared data 901 is data used by a plurality of processors, the non-shared data 911 is used only by the CPU 21, the non-shared data 921 is used only by the CPU 22, the non-shared data 931 is used only by the CPU 23, and the non-shared data 941 is used only by the CPU 24.
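The fixed placement rule of the related art can be sketched as a classification by the number of processors that use each data item. The data names, the usage map `USERS`, and the function name `fixed_placement` are hypothetical and serve only to illustrate the rule.

```python
# Hypothetical usage map: data item -> set of CPUs that use it.
USERS = {
    "x": {"CPU21"},
    "y": {"CPU22"},
    "z": {"CPU21", "CPU23"},
    "w": {"CPU21", "CPU22", "CPU23", "CPU24"},
}

def fixed_placement(users):
    """Place data used by exactly one CPU in that CPU's local memory;
    place everything else in shared memory (related-art rule)."""
    placement = {}
    for item, cpus in users.items():
        if len(cpus) == 1:
            (cpu,) = cpus
            placement[item] = f"local:{cpu}"
        else:
            placement[item] = "shared"
    return placement
```

Under this rule, an item used by just two processors ("z") is treated the same as one used by all four ("w"): both end up in the slower shared memory.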

共有メモリは、ローカルメモリに比べて、一般的にはデータ読み書きに時間がかかる。したがって、共有メモリに多くの共有データを配置することは、共有データを使用する並列処理全体の計算時間を増加させることになる。さらに、関連技術にかかる方法では、二つ以上のプロセッサが使用する共有データは全て共有メモリに配置されることになり、並列動作させる処理の数を増やすためにプロセッサ数を増やすと、共有メモリにアクセスするプロセッサの数が増えるので、共有メモリへのアクセス回数が増えることになる。つまり、関連技術にかかる方法は、アクセス時間が大きい共有メモリに対してプロセッサが何度もアクセスする、という課題がある。 Reading and writing data in a shared memory generally takes longer than in a local memory. Placing a large amount of shared data in the shared memory therefore increases the computation time of the parallel processing as a whole. Furthermore, in the related-art method, all shared data used by two or more processors is placed in the shared memory. If the number of processors is increased in order to run more processes in parallel, the number of processors accessing the shared memory grows, and so does the number of accesses to the shared memory. In short, the related-art method has the problem that processors repeatedly access a shared memory whose access time is long.

＜本発明の模範的な実施形態＞ <An exemplary embodiment of the present invention>
本発明の模範的な実施形態について説明する。本実施形態は、ローカルメモリを備える複数のプロセッサで構成されたマルチコア・プロセッサ上での並列処理を提供する並列情報処理方法および装置である。図１は、本発明の実施形態にかかるマルチコアプロセッサ１００の構成及びデータ配置の例を示すブロック図である。マルチコアプロセッサ１００は、ローカルメモリ１１〜１４と、ＣＰＵ２１〜２４と、共有メモリ３０とを備える。各プロセッサ２１〜２４のローカルメモリ１１〜１４には、各プロセッサが割り当てられたタスクが参照するデータ（割当タスク参照用データ１１１、１２１、１３１及び１４１）を配置するとともに、全てのプロセッサからアクセス可能な共有メモリ３０に、複数プロセッサで共有すべきデータ（マルチタスク参照用データ３０１）を配置する。 An exemplary embodiment of the present invention will be described. The present embodiment is a parallel information processing method and apparatus that provides parallel processing on a multi-core processor composed of a plurality of processors, each having a local memory. FIG. 1 is a block diagram showing an example of the configuration and data arrangement of a multi-core processor 100 according to the embodiment of the present invention. The multi-core processor 100 includes local memories 11 to 14, CPUs 21 to 24, and a shared memory 30. The local memories 11 to 14 of the processors 21 to 24 hold the data referred to by the tasks assigned to each processor (assigned-task reference data 111, 121, 131, and 141), and the shared memory 30, which is accessible from all processors, holds the data to be shared by multiple processors (multitask reference data 301).

本実施形態は、タスクフローグラフとして表現可能なプログラムの任意の第１タスクの実行に関して、第１タスクを実行する前に、タスクフローグラフにおいて第１タスクへのエッジをもち、なおかつ第１タスクを実行する第１プロセッサ以外が実行する第２タスク群の出力データが参照可能状態となることを待ち（第１ステップ）、第１ステップの後で、第２タスク群の出力データを第１プロセッサのローカルメモリにコピーし（第２ステップ）、第１タスクを実行してその計算結果を第１プロセッサのローカルメモリに格納し（第３ステップ）、タスクフローグラフにおいて第１タスクからのエッジをもちなおかつ第１プロセッサ以外が実行する第３タスク群が存在する場合に、第１タスクの出力データを第３タスク群が参照可能な状態とする（第４ステップ）、ことを特徴とする、並列情報処理装置、及び方法である。尚、第１タスクに加え、第２タスク及び第３タスクも、タスクフローグラフとして表現可能なプログラムとして実装されているものとする。 In the present embodiment, the following steps are performed for the execution of an arbitrary first task of a program that can be expressed as a task flow graph. Before the first task is executed, the first processor, which executes the first task, waits until the output data of a second task group can be referenced, where the second task group consists of the tasks that have an edge to the first task in the task flow graph and are executed by a processor other than the first processor (first step). After the first step, the output data of the second task group is copied to the local memory of the first processor (second step). The first task is then executed and its calculation result is stored in the local memory of the first processor (third step). If there is a third task group, consisting of tasks that have an edge from the first task in the task flow graph and are executed by a processor other than the first processor, the output data of the first task is made referenceable by the third task group (fourth step). The present embodiment is a parallel information processing apparatus and method characterized by these steps. In addition to the first task, the second and third tasks are also assumed to be implemented as part of a program that can be expressed as a task flow graph.

言い換えると、本実施形態は、第１プロセッサに割り当てられた第１タスクにおける先行タスクである第２タスクが当該第１プロセッサ以外に割り当てられている場合に、当該第２タスクによる出力データが当該第１プロセッサにより参照可能となるまで、当該第１タスクの実行を待機し、第２タスクによる出力データが第１プロセッサにより参照可能となった後、当該出力データの格納元から当該出力データを取得して第１プロセッサが有する第１ローカルメモリへ格納し、第１ローカルメモリに格納された出力データを参照して、当該第１プロセッサが第１タスクを実行し、当該第１タスクによる出力データを当該第１ローカルメモリに格納し、第１タスクにおける後続タスクである第３タスクが当該第１プロセッサ以外のプロセッサに割り当てられている場合に、第１タスクによる出力データを当該第１プロセッサ以外のプロセッサから参照可能な状態とする、タスク並列処理方法、装置及びプログラムである。 In other words, the present embodiment is a task parallel processing method, apparatus, and program that operate as follows. When a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, execution of the first task waits until the output data of the second task can be referenced by the first processor. After the output data of the second task can be referenced by the first processor, the output data is acquired from its storage source and stored in a first local memory of the first processor. The first processor executes the first task with reference to the output data stored in the first local memory and stores the output data of the first task in the first local memory. When a third task, which is a subsequent task of the first task, is assigned to a processor other than the first processor, the output data of the first task is made referenceable from processors other than the first processor.

マルチコアプロセッサ１００は、タスクフローグラフとして表現可能なプログラムの任意のタスクを第１タスクとして、図３に示すように、上記の第１ステップ（Ｓ１）から第４ステップ（Ｓ４）を行う。 The multi-core processor 100 performs the first step (S1) to the fourth step (S4) as shown in FIG. 3 with an arbitrary task of a program that can be expressed as a task flow graph as the first task.
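Steps S1 to S4 can be sketched with threads standing in for processors. This is a minimal simulation under stated assumptions, not the patented implementation: the two-task graph, the processor names P0 and P1, the `published` dictionary (a stand-in for whatever storage source holds referenceable output), and the use of `threading.Event` as the "can be referenced" flag are all choices made for the example.

```python
import threading

# Hypothetical graph: T1 supplies data to T2; T1 runs on P0 and T2 on P1.
PREDS = {"T1": [], "T2": ["T1"]}
SUCCS = {"T1": ["T2"], "T2": []}
CPU_OF = {"T1": "P0", "T2": "P1"}
BODY = {"T1": lambda inputs: 21, "T2": lambda inputs: inputs["T1"] * 2}

published = {}                                  # stand-in for the storage source
ready = {t: threading.Event() for t in PREDS}   # "output can be referenced" flags
local_memory = {"P0": {}, "P1": {}}             # one local memory per processor

def run_task(task):
    cpu = CPU_OF[task]
    local = local_memory[cpu]
    remote_preds = [p for p in PREDS[task] if CPU_OF[p] != cpu]
    for p in remote_preds:                 # S1: wait until output is referenceable
        ready[p].wait()
    for p in remote_preds:                 # S2: copy output into local memory
        local[p] = published[p]
    inputs = {p: local[p] for p in PREDS[task]}
    local[task] = BODY[task](inputs)       # S3: execute, keep the result locally
    if any(CPU_OF[s] != cpu for s in SUCCS[task]):
        published[task] = local[task]      # S4: make output referenceable
        ready[task].set()

def run_all():
    """Run each processor's task list on its own thread and wait for both."""
    plan = {"P1": ["T2"], "P0": ["T1"]}
    threads = [threading.Thread(target=lambda ts=ts: [run_task(t) for t in ts])
               for ts in plan.values()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return local_memory
```

In this sketch T2 reads its input only from P1's local memory after the copy in S2, and T2's own output is never published because it has no successor on another processor.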

第１ステップは、第１タスクを実行可能な状態になるまで待ち合せるという処理である。ここで、第２タスク群は、第１プロセッサ以外で実行され、なおかつ第１タスクへ入力データを供給するタスク群である。そして、第２タスク群の出力データを第１タスクが参照可能な状態であることが確認されると、第１タスクを実行可能な状態となる。もし第２タスク群が存在しないならば、出力データを参照すべき第２タスク群は存在しないので、待ち合わせは不要となり、第１タスクは実行可能な状態となる。例えば、タスクフローグラフにおいて第１タスクへのエッジをもつタスクが全て第１プロセッサ（第１タスクを実行するプロセッサ）で実行されるなら、第２タスク群は存在しないことになる。また、第２タスク群は、１個以上のタスクであればよい。 The first step is a process of waiting until the first task is ready to be executed. Here, the second task group is a task group that is executed by other than the first processor and supplies input data to the first task. When it is confirmed that the first task can refer to the output data of the second task group, the first task can be executed. If the second task group does not exist, there is no second task group to which the output data should be referenced, so no waiting is required, and the first task is in an executable state. For example, if all tasks having an edge to the first task are executed by the first processor (processor that executes the first task) in the task flow graph, the second task group does not exist. Further, the second task group may be one or more tasks.
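The definition of the second task group, and the rule that no waiting is needed when it is empty, can be written down as a small sketch. The three-task graph, the processor names, and the function names `second_task_group` and `can_start` are hypothetical.

```python
# Hypothetical graph: T1 -> T2, T1 -> T3, T2 -> T3; T1 and T3 on P0, T2 on P1.
PREDS = {"T1": [], "T2": ["T1"], "T3": ["T1", "T2"]}
CPU_OF = {"T1": "P0", "T2": "P1", "T3": "P0"}

def second_task_group(task, preds, cpu_of):
    """Predecessors of `task` that run on a processor other than its own."""
    return [p for p in preds.get(task, []) if cpu_of[p] != cpu_of[task]]

def can_start(task, preds, cpu_of, referenceable):
    """True when every task in the second task group has referenceable output.
    An empty second task group means the task can start without waiting."""
    return all(p in referenceable
               for p in second_task_group(task, preds, cpu_of))
```

Here the second task group of T3 contains only T2, because T1 runs on the same processor as T3; T1 itself has an empty second task group and can start immediately.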

Put another way, assume that the first processor has been assigned in advance to execute the first task, that a processor other than the first processor has been assigned in advance to execute the second task, and that the second task is a preceding task of the first task. In this case, the first processor defers execution of the first task until the output data of the second task can be referenced by the first processor (S1).
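The waiting condition of the first step can be sketched as follows. This is a minimal illustration, not the implementation described in the patent: the names `second_task_group`, `wait_until_executable`, `preds`, `proc_of`, and `done_flags` are hypothetical, and a `threading.Event` per task stands in for whatever mechanism records that a task's output data has become referenceable.

```python
import threading

def second_task_group(task, preds, proc_of):
    """Predecessors of `task` that are assigned to a different processor."""
    return [p for p in preds.get(task, []) if proc_of[p] != proc_of[task]]

def wait_until_executable(task, preds, proc_of, done_flags, timeout=None):
    """Step 1 (S1): wait until every task in the second task group has published.

    Returns True once all flags are set, False if `timeout` (seconds) expires.
    """
    for p in second_task_group(task, preds, proc_of):
        if not done_flags[p].wait(timeout):
            return False
    return True

# Hypothetical example: "first" has one remote and one local predecessor.
preds = {"first": ["second", "local_pred"]}
proc_of = {"first": 1, "second": 2, "local_pred": 1}
done_flags = {"second": threading.Event()}

print(second_task_group("first", preds, proc_of))   # remote predecessors only
print(wait_until_executable("first", preds, proc_of, done_flags, timeout=0.01))
done_flags["second"].set()                           # processor 2 publishes its output
print(wait_until_executable("first", preds, proc_of, done_flags))
```

When the second task group is empty the loop body never runs, which mirrors the statement that no waiting is needed in that case.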

The second step is a process of copying the output data of the second task group, which serves as input data for the first task, to the local memory of the first processor. Although the output data of the second task group is stored in a location that the first processor can reference, the first processor can access its own local memory in less time than it can access that location. Copying the output data of the second task group to the local memory of the first processor therefore lets the first processor access the input data of the first task at high speed. In the second step, the location where the output data of the second task group is stored may be the shared memory, or it may be the local memory of another processor, provided that local memory is accessible from other processors. If no second task group exists, there is no output data to copy, so the copy is unnecessary.

Put another way, after the output data of the second task has become referenceable by the first processor, the first processor acquires that output data from its storage source and stores it in its own first local memory (S2).

The copy is also unnecessary when the output data to be copied is known to be present already in the local memory of the processor that executes the first task. For example, when several first tasks assigned to one processor share the same task as their second task group, the output data to be copied in the second step is the same for all of them. Once it has been copied in the earliest second step, the data is known to be in the local memory, so the copy can be skipped in every later second step.

Put another way, assume that a fourth task, executed after the first task, is also assigned to the first processor, and that the second task is a preceding task of the fourth task. In this case, the first processor executes the fourth task by referring to the output data of the second task that was stored in the first local memory before the first task was executed, without acquiring that output data again from its storage source.
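The second step, including the case where the copy can be skipped, might look like the sketch below. Memories are modeled as plain dictionaries, and `fetch_inputs`, `source`, and `local_memory` are illustrative names that do not appear in the patent.

```python
def fetch_inputs(input_names, source, local_memory):
    """Step 2 (S2): copy inputs that are not yet in local memory.

    Returns the names that were actually copied, so a caller can see
    when the copy was skipped.
    """
    copied = []
    for name in input_names:
        if name not in local_memory:          # already local: nothing to do
            local_memory[name] = source[name]
            copied.append(name)
    return copied

# Hypothetical scenario: the first task and a later fourth task on the same
# processor both consume the output of the second task.
source = {"second_out": [1, 2, 3]}            # e.g. shared memory
local_memory = {}

copied_for_first = fetch_inputs(["second_out"], source, local_memory)
copied_for_fourth = fetch_inputs(["second_out"], source, local_memory)
print(copied_for_first, copied_for_fourth)
```

The second call finds the data already in local memory and copies nothing, which corresponds to the fourth task reusing the data fetched for the first task.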

The third step is a process of executing the first task. The first step has confirmed that the task is executable and the second step has prepared its input data, so the first task is ready to run. In the third step, the first processor executes the first task using the input data in its local memory and stores the output data of the first task in the same local memory.

Put another way, the first processor executes the first task by referring to the output data stored in the first local memory (S3) and stores the output data of the first task in the first local memory.

The fourth step is a process of making the output data of the first task, which is stored in the local memory of the first processor, referenceable for the third task group, which uses that output data as input data. This process includes copying the output data of the first task to a location that processors other than the first processor can reference (for example, the shared memory) when the data is not already referenceable from them, and recording that execution of the first task has completed and that its output data is referenceable. This copy is the counterpart of the copy in the second step. If the copy source in the second step is the shared memory, the copy destination in the fourth step is also the shared memory. If the copy source in the second step is the local memory of some processor, the processors can access one another's local memories, so the copy in the fourth step is unnecessary. Through this series of operations, the processors responsible for the third task group can learn that the output data of the first task is referenceable. If no third task group exists, no processor other than the first processor uses the output data of the first task, so the copy in the fourth step is unnecessary; and because no task needs to know whether the output data is referenceable, there is also no need to record its referenceable state.

Put another way, when a third task, which is a subsequent task of the first task, is assigned to a processor other than the first processor, the first processor makes the output data of the first task referenceable from processors other than the first processor (S4).
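A possible shape for the fourth step is sketched below, again with dictionaries standing in for memories. The function name `publish_output` and the parameter `local_is_remotely_readable` are assumptions made for illustration; the latter models the case where processors can read one another's local memories, so no copy is needed.

```python
def publish_output(task, name, succs, proc_of, local_mem, shared_mem,
                   done_flags, local_is_remotely_readable=False):
    """Step 4 (S4): make `task`'s output referenceable by other processors.

    Returns False when there is no third task group (nothing to do).
    """
    third_group = [s for s in succs.get(task, []) if proc_of[s] != proc_of[task]]
    if not third_group:
        return False                          # no copy and no flag are needed
    if not local_is_remotely_readable:
        shared_mem[name] = local_mem[name]    # copy the data first ...
    done_flags[task] = True                   # ... then record that it is referenceable
    return True

# Hypothetical tasks: "first" feeds a task on another processor, "solo" does not.
succs = {"first": ["third"], "solo": ["next_local"]}
proc_of = {"first": 1, "third": 2, "solo": 1, "next_local": 1}
local_mem = {"out_first": 10, "out_solo": 20}
shared_mem, done_flags = {}, {}

print(publish_output("first", "out_first", succs, proc_of, local_mem, shared_mem, done_flags))
print(publish_output("solo", "out_solo", succs, proc_of, local_mem, shared_mem, done_flags))
print(shared_mem, done_flags)
```

The flag is set only after the copy, so a reader that sees the flag finds the data in place. On real hardware that ordering would have to be enforced by the platform, which this sketch does not model.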

Next, the first to fourth steps are described from the viewpoint of each processor. FIG. 4 is a flowchart showing the processing flow of a processor according to the embodiment of the present invention. First, each processor selects the tasks assigned to it in ascending order of sequence number (S21). The processor then carries out the first to fourth steps described above, in that order, for the selected task (S22). It then determines whether all tasks assigned to it have been processed (S23). Consequently, each processor repeats the first to fourth steps as many times as the number of tasks assigned to it.
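The per-processor loop of FIG. 4 can be sketched as follows. The names `run_processor`, `seq_no`, and `run_four_steps` are hypothetical; `run_four_steps` is a placeholder callable for the first to fourth steps of one task.

```python
def run_processor(assigned_tasks, seq_no, run_four_steps):
    """Process every assigned task in ascending sequence-number order."""
    executed = []
    for task in sorted(assigned_tasks, key=lambda t: seq_no[t]):   # S21
        run_four_steps(task)                                       # S22
        executed.append(task)
    return executed            # S23: the loop ends once all tasks are done

log = []
order = run_processor(["T4", "T3"], {"T3": 3, "T4": 4}, log.append)
print(order, log)
```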

In addition, each processor executes its tasks in order of sequence numbers derived from a topological order. Because this execution order is determined in advance from the data dependencies of the task flow graph, tasks assigned to the same processor do not need to confirm one another's completion. And because the output data of tasks assigned to the same processor is stored in the same local memory, those tasks do not need to pass output data to one another through the shared memory. Therefore, as described above, when no second task group exists, both the waiting in the first step and the copy in the second step are unnecessary. Likewise, when no third task group exists, both the copy and the recording of completion in the fourth step are unnecessary.
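The text does not say how the sequence numbers are computed, only that they follow a topological order. One common way to obtain such numbers is Kahn's algorithm, sketched here on a small hypothetical graph; the function name `sequence_numbers` and the task names are illustrative.

```python
from collections import deque

def sequence_numbers(preds):
    """Assign 1-based sequence numbers in a topological order (Kahn's algorithm).

    `preds` maps each task to the list of tasks it depends on.
    """
    tasks = sorted(preds)
    indegree = {t: len(preds[t]) for t in tasks}
    succs = {t: [] for t in tasks}
    for t in tasks:
        for p in preds[t]:
            succs[p].append(t)
    ready = deque(t for t in tasks if indegree[t] == 0)
    numbers = {}
    while ready:
        t = ready.popleft()
        numbers[t] = len(numbers) + 1
        for s in succs[t]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(numbers) != len(tasks):
        raise ValueError("task flow graph contains a cycle")
    return numbers

preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
numbers = sequence_numbers(preds)
print(numbers)
```

Every task receives a larger number than each of its predecessors, which is why a processor that runs its own tasks in ascending order never needs to check that an earlier task on the same processor has finished.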

<Specific Example 1 of the Embodiment of the Present Invention>
Specific Example 1 of the embodiment is described below. It is a parallel information processing apparatus and method that provide parallel processing on a multi-core processor comprising a plurality of processors, each having a local memory, and a shared memory for sharing data between the processors.

Specific Example 1 treats an arbitrary task of a program that can be expressed as a task flow graph as the first task and performs the first to fourth steps described above. In Specific Example 1, data is shared between processors through the shared memory. The copy source in the second step is therefore the shared memory, and the copy destination in the fourth step is also the shared memory.

In the first step, Specific Example 1 waits until the first task becomes executable (that is, until the output data of the second task group is referenceable). In the second step, it copies the output data of the second task group, which serves as input data for the first task, from the shared memory to the local memory of the first processor. In the third step, it executes the first task and stores the output data in the local memory. In the fourth step, if a third task group that uses the output data of the first task as input data exists, it copies the output data of the first task from the local memory of the first processor to the shared memory and records in the shared memory that the output data of the first task is referenceable.

Next, the first to fourth steps are described from the viewpoint of each processor. As stated above, in the exemplary embodiment of the present invention each processor executes its tasks in ascending order of sequence number. Each processor in Specific Example 1 thus repeats the first to fourth steps as many times as the number of tasks assigned to it.

In addition, each processor executes its tasks in order of sequence numbers derived from a topological order. Because this execution order is determined in advance from the data dependencies of the task flow graph, tasks assigned to the same processor do not need to confirm one another's completion. And because the output data of tasks assigned to the same processor is stored in the same local memory, those tasks do not need to pass output data to one another through the shared memory. Therefore, as described above, when no second task group exists, both the waiting in the first step and the copy in the second step are unnecessary. Likewise, when no third task group exists, both the copy and the recording of the referenceable state of the output data in the fourth step are unnecessary.

<Operation Example of Specific Example 1>
An operation example of Specific Example 1 is described next with reference to the drawings. FIG. 2 shows an example of a task flow graph handled by Specific Example 1. The six tasks T1 to T6 have the data dependencies shown in FIG. 2. Specifically, task T1 has no preceding task and has tasks T2 and T3 as subsequent tasks; it takes external data as input and outputs data A. Task T2 has task T1 as its preceding task and task T4 as its subsequent task; it takes data A output by task T1 as input and outputs data B. Task T3 has task T1 as its preceding task and tasks T4 and T5 as subsequent tasks; it takes data A output by task T1 as input and outputs data C. Task T4 has tasks T2 and T3 as preceding tasks and task T6 as its subsequent task; it takes data B output by task T2 and data C output by task T3 as input and outputs data D. Task T5 has task T3 as its preceding task and task T6 as its subsequent task; it takes data C output by task T3 as input and outputs data E. Task T6 has tasks T4 and T5 as preceding tasks and no subsequent task; it takes data D output by task T4 and data E output by task T5 as input and outputs data F.
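The dependencies of FIG. 2 as described in the text can be written down as a small data structure. The dictionary layout and the helper names `successors` and `inputs` are illustrative choices, and `"external"` is a placeholder label for the external input data.

```python
# Preceding tasks of each task, as described for FIG. 2.
PREDS = {
    "T1": [],
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T2", "T3"],
    "T5": ["T3"],
    "T6": ["T4", "T5"],
}

# Data item produced by each task.
OUTPUT = {"T1": "A", "T2": "B", "T3": "C", "T4": "D", "T5": "E", "T6": "F"}

def successors(task):
    """Subsequent tasks: every task that lists `task` as a preceding task."""
    return [s for s in PREDS if task in PREDS[s]]

def inputs(task):
    """Input data of `task`: outputs of its preceding tasks, or external data."""
    return [OUTPUT[p] for p in PREDS[task]] or ["external"]

for t in PREDS:
    print(t, "inputs:", inputs(t), "output:", OUTPUT[t], "successors:", successors(t))
```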

FIG. 5 shows an example of task assignment and data placement in the multi-core processor 100 according to Specific Example 1 of the embodiment of the present invention. FIG. 5 assumes four processors, but the number of processors is not limited to four. The six tasks T1 to T6 of FIG. 2 are assigned in advance to the four processors as shown in FIG. 5: tasks T1 and T2 to the CPU 21, tasks T3 and T4 to the CPU 22, task T5 to the CPU 23, and task T6 to the CPU 24. With the assignment fixed in this way, the placement of each task's input and output data follows from the basic idea of the embodiment, "place the input and output data of a task in the local memory of its processor", and is as shown in FIG. 5. That is, the local memory 11 of the CPU 21 holds external data 511, data A 512, and data B 513, which are the input and output data of tasks T1 and T2; the local memory 12 of the CPU 22 holds data A 521, data B 522, data C 523, and data D 524, which are the input and output data of tasks T3 and T4; the local memory 13 of the CPU 23 holds data C 531 and data E 532, which are the input and output data of task T5; and the local memory 14 of the CPU 24 holds data D 541, data E 542, and data F 543, which are the input and output data of task T6.
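The data placement of FIG. 5 follows mechanically from the task assignment and the dependencies of FIG. 2. The sketch below derives it; the function name `placement` and the dictionary encoding are assumptions, and reference numerals such as 511 or 521 are omitted.

```python
PREDS = {"T1": [], "T2": ["T1"], "T3": ["T1"],
         "T4": ["T2", "T3"], "T5": ["T3"], "T6": ["T4", "T5"]}
OUTPUT = {"T1": "A", "T2": "B", "T3": "C", "T4": "D", "T5": "E", "T6": "F"}
ASSIGNMENT = {21: ["T1", "T2"], 22: ["T3", "T4"], 23: ["T5"], 24: ["T6"]}

def placement(assignment, preds, output):
    """For each CPU, the set of data items its local memory must hold:
    the inputs and outputs of every task assigned to that CPU."""
    result = {}
    for cpu, tasks in assignment.items():
        items = set()
        for task in tasks:
            task_inputs = [output[p] for p in preds[task]] or ["external"]
            items.update(task_inputs)
            items.add(output[task])
        result[cpu] = items
    return result

local_data = placement(ASSIGNMENT, PREDS, OUTPUT)
for cpu in sorted(local_data):
    print(cpu, sorted(local_data[cpu]))
```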

Because Specific Example 1 shares data between processors through the shared memory, the data to be passed between tasks executed on different processors is placed in the shared memory 30, as shown in FIG. 5. That is, the shared memory 30 stores external data 311, data A 312, data B 313, data C 314, data D 315, data E 316, and data F 317. The following items have identical contents: external data 511 and external data 311; data A 512, data A 521, and data A 312; data B 513, data B 522, and data B 313; data C 523, data C 531, and data C 314; data D 524, data D 541, and data D 315; data E 532, data E 542, and data E 316; and data F 543 and data F 317.

FIGS. 6 and 7 are flowcharts showing the flow of task processing in Specific Example 1 (the execution of tasks T1 through T6). In FIGS. 6 and 7, the four processors 21 to 24 of Specific Example 1 execute their tasks in order of sequence numbers derived from a topological order, and each task is executed according to the first to fourth steps described above. According to the task flow graph of tasks T1 to T6 (FIG. 2), task T1 is the only task that computes without using the output data of any other task, so only task T1 needs no waiting in the first step and can run immediately. The operation in FIGS. 6 and 7 therefore starts with the CPU 21 processing task T1.

The CPU 21 first performs the first step for task T1. As noted above, task T1 requires no waiting in the first step, so the CPU 21 proceeds immediately to the second step. In the second step for task T1, the CPU 21 copies the input data of task T1 (external data 311) from the shared memory 30 to the local memory 11 (as external data 511) (S310). In the third step, the CPU 21 executes task T1 (S311) and writes its output data (data A 512) to the local memory 11. In the fourth step, because a task corresponding to the third task group exists (task T3, executed by the CPU 22), the CPU 21 copies the output data of task T1 (data A 512) from the local memory 11 to the shared memory 30 (as data A 312) (S312) and sets a flag (completion flag) indicating that the output data of task T1 has become referenceable (S313). The completion flag is placed in the shared memory 30. This completes the first to fourth steps for task T1.

When the processing of task T1 is finished, the CPU 21 processes task T2, following the execution order given by the sequence numbers. According to the task flow graph of FIG. 2, task T2 uses only the output data of task T1, so when task T2 is regarded as the first task, no task corresponds to the second task group. Because there is no second task group and task T1 has already been executed by the CPU 21, the waiting in the first step for task T2 is unnecessary, and the CPU 21 proceeds immediately to the next step. In the second step for task T2, no task corresponds to the second task group, so no copy is needed and the CPU 21 proceeds to the next step. In the third step, the CPU 21 executes task T2 (S314) and writes its output data (data B 513) to the local memory 11. In the fourth step, because a task corresponding to the third task group exists (task T4, executed by the CPU 22), the CPU 21 copies the output data of task T2 (data B 513) from the local memory 11 to the shared memory 30 (as data B 313) (S315) and sets a flag (completion flag) indicating that the output data of task T2 has become referenceable (S316). This completes the first to fourth steps for task T2.
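Which tasks count as the second and third task groups in this walkthrough can be derived from the graph of FIG. 2 and the assignment of FIG. 5. The following sketch does so; the helper names `second_group` and `third_group` are illustrative.

```python
PREDS = {"T1": [], "T2": ["T1"], "T3": ["T1"],
         "T4": ["T2", "T3"], "T5": ["T3"], "T6": ["T4", "T5"]}
CPU_OF = {"T1": 21, "T2": 21, "T3": 22, "T4": 22, "T5": 23, "T6": 24}

def second_group(task):
    """Preceding tasks that run on a different CPU (must be waited for and copied from)."""
    return [p for p in PREDS[task] if CPU_OF[p] != CPU_OF[task]]

def third_group(task):
    """Subsequent tasks that run on a different CPU (need the output published)."""
    return [s for s in PREDS if task in PREDS[s] and CPU_OF[s] != CPU_OF[task]]

for t in PREDS:
    print(t, "second group:", second_group(t), "third group:", third_group(t))
```

Task T2, for example, has an empty second task group because its only preceding task T1 runs on the same CPU 21, while its third task group is task T4 on the CPU 22.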

Next, the processing of task T3 by the CPU 22 is described. Once the CPU 21 has executed tasks T1 and T2, the output data of tasks T1 and T2 is referenceable. According to the task flow graph of FIG. 2, this allows the CPU 22 to execute task T3 and, after task T3 finishes, task T4. When task T3 is regarded as the first task, the task corresponding to the second task group is task T1. In the first step for task T3, the CPU 22 checks the flag (completion flag) indicating that the output data of task T1 has become referenceable and waits until task T3 is executable, that is, until the output data of task T1 is referenceable (S320). After the CPU 21 has completed the processing of task T1 (first to fourth steps), the completion flag for the output data of task T1 is set (S313), so the CPU 22 detects the completion flag of task T1 (S321), ends the wait, and proceeds to the next step. In the second step for task T3, because a task corresponding to the second task group exists (task T1), the CPU 22 copies the output data of task T1 (data A 312) from the shared memory 30 to the local memory 12 of the CPU 22 (as data A 521) (S322). In the third step, the CPU 22 executes task T3 (S323) and writes its output data (data C 523) to the local memory 12. In the fourth step, because a task corresponding to the third task group exists (task T5, executed by the CPU 23), the CPU 22 copies the output data of task T3 (data C 523) from the local memory 12 to the shared memory 30 (as data C 314) (S324) and sets a flag (completion flag) indicating that the output data of task T3 has become referenceable (S325). This completes the first to fourth steps for task T3.

Next, the processing of task T4 by the CPU 22 is described. When the processing of task T3 is finished, the CPU 22 processes task T4, following the execution order given by the sequence numbers. According to the task flow graph of FIG. 2, when task T4 is regarded as the first task, the task corresponding to the second task group is task T2. In the first step for task T4, the CPU 22 checks the flag (completion flag) indicating that the output data of task T2 has become referenceable and waits until task T4 is executable, that is, until the output data of task T2 is referenceable. After the CPU 21 has completed the processing of task T2 (first to fourth steps), the completion flag for the output data of task T2 is set (S316), so the CPU 22 detects the completion flag of task T2 (S326), ends the wait, and proceeds to the next step. In the second step for task T4, because a task corresponding to the second task group exists (task T2), the CPU 22 copies the output data of task T2 (data B 313) from the shared memory 30 to the local memory 12 of the CPU 22 (as data B 522) (S327). In the third step, the CPU 22 executes task T4 (S328) and writes its output data (data D 524) to the local memory 12. In the fourth step, because a task corresponding to the third task group exists (task T6, executed by the CPU 24), the CPU 22 copies the output data of task T4 (data D 524) from the local memory 12 to the shared memory 30 (as data D 315) (S329) and sets a flag (completion flag) indicating that the output data of task T4 has become referenceable (S330). This completes the first to fourth steps for task T4.

Next, the processing of task T5 by the CPU 23 is described. Once the CPU 22 has executed task T3, the output data of task T3 is referenceable. According to the task flow graph of FIG. 2, this allows the CPU 23 to execute task T5. When task T5 is regarded as the first task, the task corresponding to the second task group is task T3. In the first step for task T5, the CPU 23 checks the flag (completion flag) indicating that the output data of task T3 has become referenceable and waits until task T5 is executable, that is, until the output data of task T3 is referenceable (S340). After the CPU 22 has completed the processing of task T3 (first to fourth steps), the completion flag for the output data of task T3 is set (S325), so the CPU 23 detects the completion flag of task T3 (S341), ends the wait, and proceeds to the next step. In the second step for task T5, because a task corresponding to the second task group exists (task T3), the CPU 23 copies the output data of task T3 (data C 314) from the shared memory 30 to the local memory 13 of the CPU 23 (as data C 531) (S342). In the third step, the CPU 23 executes task T5 (S343) and writes its output data (data E 532) to the local memory 13. In the fourth step, because a task corresponding to the third task group exists (task T6, executed by the CPU 24), the CPU 23 copies the output data of task T5 (data E 532) from the local memory 13 to the shared memory 30 (as data E 316) (S344) and sets a flag (completion flag) indicating that the output data of task T5 has become referenceable (S345). This completes the first to fourth steps for task T5.

Next, the processing of task T6 by CPU 24 is described. Once CPU 22 has executed task T4 and CPU 23 has executed task T5, the output data of tasks T4 and T5 become referenceable, and, according to the task flow graph of FIG. 2, CPU 24 can then execute task T6. According to the task flow graph of FIG. 2, when task T6 is regarded as the first task, the tasks corresponding to the second task group are tasks T4 and T5. In the first step of task T6, CPU 24 refers to the flags (completion flags) indicating that the output data of tasks T4 and T5 have become referenceable, and waits until task T6 can be executed, that is, until the output data of tasks T4 and T5 are referenceable (S350). After the processing of task T4 by CPU 22 (first through fourth steps) and the processing of task T5 by CPU 23 (first through fourth steps) have completed, the referenceable flags (completion flags) for the output data of tasks T4 and T5 have been set (S330 and S345), so CPU 24 detects the completion flags of tasks T4 and T5 (S351 and S352), ends the wait, and proceeds to the next step. Next, in the second step of task T6, because tasks corresponding to the second task group exist (tasks T4 and T5), CPU 24 copies the output data of tasks T4 and T5 (data D 315 and data E 316) from the shared memory 30 to the local memory 14 of CPU 24 (as data D 541 and data E 542) (S353 and S354). Next, in the third step of task T6, CPU 24 executes task T6 (S355) and writes the output data of task T6 (data F 543) to the local memory 14. Next, in the fourth step of task T6, although no task corresponding to the third task group exists, the output data of task T6 can be regarded as the output data of the entire task flow graph of FIG. 2. CPU 24 therefore copies the output data of task T6 (data F 543) from the local memory 14 to the shared memory 30 (as data F 317) (S356) and sets a flag (completion flag) indicating that the output data of task T6 has become referenceable (S357). This completes the first through fourth steps of task T6.
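The wait, copy, execute, and publish sequence described for task T6 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: a dictionary stands in for the shared memory 30, one `threading.Event` per task stands in for its completion flag, the helper name `run_task` is hypothetical, and the input values and computations of tasks T4 and T5 are invented. Only the wait and copy performed by T6 are exercised; T4 and T5 start from preloaded inputs.

```python
import threading

# Hypothetical stand-ins: a dict plays shared memory 30, and one
# threading.Event per task plays that task's completion flag.
shared_memory = {}
done = {name: threading.Event() for name in ("T4", "T5", "T6")}

def run_task(name, remote_preds, publish, compute, local_memory):
    # Step 1: wait until every predecessor run elsewhere has set its flag.
    for pred in remote_preds:
        done[pred].wait()
    # Step 2: copy the predecessors' outputs from shared to local memory.
    for pred in remote_preds:
        local_memory[pred] = shared_memory[pred]
    # Step 3: compute from local memory only and keep the result locally.
    local_memory[name] = compute(local_memory)
    # Step 4: copy the result to shared memory, then set the flag.
    if publish:
        shared_memory[name] = local_memory[name]
        done[name].set()

# Invented input values; the real inputs of T4 and T5 are not modeled here.
local_12 = {"in": 3}   # local memory of CPU 22 (runs T4)
local_13 = {"in": 3}   # local memory of CPU 23 (runs T5)
local_14 = {}          # local memory of CPU 24 (runs T6)

threads = [
    # T6 is started first on purpose: it blocks in step 1 until T4 and T5 finish.
    threading.Thread(target=run_task, args=(
        "T6", ["T4", "T5"], True, lambda m: m["T4"] + m["T5"], local_14)),
    threading.Thread(target=run_task, args=(
        "T4", [], True, lambda m: m["in"] * 2, local_12)),
    threading.Thread(target=run_task, args=(
        "T5", [], True, lambda m: m["in"] + 1, local_13)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_memory["T6"])
```

T6 is published even though it has no successor, mirroring the text's treatment of its output as the output of the whole graph. Setting the flag only after the write to shared memory is what lets a waiting processor read the data safely.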

The multi-core processor 100 according to Specific Example 1 determines that processing of the entire task flow graph of FIG. 2 is complete by confirming that the output data of task T6, which is the output data of the entire task flow graph of FIG. 2, has become referenceable.

<Specific Example 2 of the Embodiment of the Present Invention>
Specific Example 2 of the embodiment is described next. Specific Example 2 is a parallel information processing apparatus and method that provide parallel processing on a multi-core processor configured so that the processors can mutually access the local memories that each of them has.

Specific Example 2 performs the first through fourth steps described above, taking an arbitrary task of a program that can be expressed as a task flow graph as the first task. In Specific Example 2, data is shared between processors not through a shared memory but by referring to the local memories of the processors. The copy source in the second step is therefore the local memory of some processor, and the copy in the fourth step is unnecessary.

In the first step, Specific Example 2 waits until the first task can be executed, that is, until the output data of the second task group is referenceable. In the second step, it copies the output data of the second task group, which is the input data of the first task, from the local memories of the processors that executed the second task group to the local memory of the first processor. In the third step, it executes the first task and stores its output data in the local memory. In the fourth step, if a third task group that uses the output data of the first task as input data exists, it records in the local memory of the first processor that the output data of the first task is referenceable.
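The four steps of Specific Example 2 can be sketched as follows. This is a minimal illustration under assumptions rather than the described hardware: the `Processor` class and `run_task` helper are hypothetical names, each referenceable flag is modeled as a `threading.Event` kept alongside the producer's local memory, and the inputs and computations of T4 and T5 are invented.

```python
import threading

class Processor:
    """A CPU with its own local memory; other CPUs may read it once flagged."""
    def __init__(self, produces):
        self.local = {}
        # Referenceable flags live with the producer, one per task it runs.
        self.ready = {task: threading.Event() for task in produces}

def run_task(me, name, remote_inputs, compute, has_remote_successor):
    # Step 1: wait until each remote predecessor's output is referenceable.
    for owner, task in remote_inputs:
        owner.ready[task].wait()
    # Step 2: copy straight from the producer's local memory (no shared memory).
    for owner, task in remote_inputs:
        me.local[task] = owner.local[task]
    # Step 3: execute on local data and store the output locally.
    me.local[name] = compute(me.local)
    # Step 4: no copy; only record that the output can now be referenced.
    if has_remote_successor:
        me.ready[name].set()

cpu72, cpu73, cpu74 = Processor(["T4"]), Processor(["T5"]), Processor(["T6"])
cpu72.local["in"] = 3   # invented inputs
cpu73.local["in"] = 3

jobs = [
    (cpu74, "T6", [(cpu72, "T4"), (cpu73, "T5")],
     lambda m: m["T4"] + m["T5"], False),
    (cpu72, "T4", [], lambda m: m["in"] * 2, True),
    (cpu73, "T5", [], lambda m: m["in"] + 1, True),
]
threads = [threading.Thread(target=run_task, args=job) for job in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(cpu74.local["T6"])
```

Compared with a shared-memory version, the only data movement left is the second-step copy, and its source is another processor's local memory.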

Next, the first through fourth steps are described from the viewpoint of each processor. As noted above, in the exemplary embodiment of the present invention each processor executes its tasks in ascending order of sequence number. That is, as in Specific Example 1, each processor in Specific Example 2 repeats the first through fourth steps once for each task assigned to it. Each processor in Specific Example 2 also executes its tasks in order according to sequence numbers based on the topological order, again as in Specific Example 1. Therefore, as in Specific Example 1, the following hold in Specific Example 2: tasks assigned to the same processor need not confirm one another's completion; when no second task group exists, the wait in the first step and the copy in the second step are unnecessary; and when no third task group exists, the copy in the fourth step and the recording of the output-data-referenceable state are unnecessary.
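The conditions under which each step can be skipped follow mechanically from the task flow graph and the task assignment, as the sketch below shows. FIG. 2 is not reproduced in this text, so the edge list is partly an assumption: only T3 to T5, T4 to T6, and T5 to T6 are stated explicitly, and the edges out of T1 and T2 are hypothetical. The function name `plan` is also illustrative.

```python
# Predecessors of each task. Only T3->T5, T4->T6 and T5->T6 are stated in
# the text; the remaining edges are assumptions made for illustration.
preds = {
    "T1": [],
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T2"],
    "T5": ["T3"],
    "T6": ["T4", "T5"],
}
# Task-to-CPU assignment as given for Specific Example 2.
assignment = {"T1": 71, "T2": 71, "T3": 72, "T4": 72, "T5": 73, "T6": 74}

def plan(preds, assignment):
    # Invert the predecessor lists to obtain each task's successors.
    succs = {task: [] for task in preds}
    for task, ps in preds.items():
        for p in ps:
            succs[p].append(task)
    steps = {}
    for task in preds:
        cpu = assignment[task]
        # Second task group: predecessors that run on another processor.
        remote_preds = [p for p in preds[task] if assignment[p] != cpu]
        # Third task group: successors that run on another processor.
        remote_succs = [s for s in succs[task] if assignment[s] != cpu]
        steps[task] = {
            "wait_and_copy": remote_preds,   # empty: steps 1 and 2 are skipped
            "publish": bool(remote_succs),   # False: step 4 is skipped
        }
    return steps

steps = plan(preds, assignment)
# Each CPU runs its own tasks in ascending sequence number.
schedule = {cpu: sorted(t for t in assignment if assignment[t] == cpu)
            for cpu in sorted(set(assignment.values()))}
for cpu, tasks in schedule.items():
    for task in tasks:
        print(cpu, task, steps[task])
```

Under these assumed edges, T2 needs no wait even though it depends on T1, because both run on CPU 71 and T1 has the smaller sequence number.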

<Operation Example of Specific Example 2>
Next, an operation example of Specific Example 2 is described with reference to the drawings. FIG. 8 shows an example of task assignment and data arrangement in the multi-core processor 100a according to Specific Example 2 of the embodiment of the present invention. FIG. 8 assumes four processors, but the number of processors is not limited to four. The operation of Specific Example 2 is described using the same task flow graph of FIG. 2 that was used for Specific Example 1. The six tasks T1 to T6 of FIG. 2 are assigned in advance to the four processors 71 to 74 of Specific Example 2 as shown in FIG. 8: tasks T1 and T2 to CPU 71, tasks T3 and T4 to CPU 72, task T5 to CPU 73, and task T6 to CPU 74. This assignment is equivalent to that of Specific Example 1. With the assignment fixed in this way, the arrangement of each task's input and output data follows from the basic idea of the embodiment, "place the input and output data of a task in the local memory of the processor," and is as shown in FIG. 8. In other words, the local memories 11 to 14 of FIG. 5 are replaced by the local memories 61 to 64, and the data arrangement within the local memories is equivalent to that of Specific Example 1. In Specific Example 2, the processors share data with one another through their local memories.
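The way the data arrangement follows from the task assignment can be illustrated with a short sketch. The CPU assignment and the data names (external data, data A to F) come from the text; the inputs listed for T2, T3, and T4 are assumptions, since FIG. 2 is not reproduced here, and the function name `placement` is hypothetical.

```python
# Each task maps to (input data, output data). The inputs of T2, T3 and T4
# are assumed; the others follow from the step-by-step description.
tasks = {
    "T1": (["external"], "A"),
    "T2": (["A"], "B"),        # assumed input
    "T3": (["A"], "C"),        # assumed input
    "T4": (["B"], "D"),        # assumed input
    "T5": (["C"], "E"),
    "T6": (["D", "E"], "F"),
}
assignment = {"T1": 71, "T2": 71, "T3": 72, "T4": 72, "T5": 73, "T6": 74}

def placement(tasks, assignment):
    """Place every task's inputs and outputs in its own CPU's local memory."""
    memory = {}
    for task, (inputs, output) in tasks.items():
        memory.setdefault(assignment[task], set()).update(inputs + [output])
    return memory

layout = placement(tasks, assignment)
for cpu in sorted(layout):
    print(cpu, sorted(layout[cpu]))
```

With these assumed inputs, the computed contents match the per-memory entries in the list of reference signs (for example, data A, B, C, and D in the local memory of the processor that runs T3 and T4).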

Next, FIGS. 9 and 10 are flowcharts showing the flow of task processing (the execution of tasks T1 through T6) according to Specific Example 2. There are two major differences between FIGS. 6 and 7 of Specific Example 1 and FIGS. 9 and 10 of Specific Example 2. First, in Specific Example 2 the copy of the second task group's output data in the second step of each task is unnecessary. Specifically, the copy processes corresponding to S310, S322, S327, S342, S353, S354, and S356 of FIGS. 6 and 7 are not needed in FIGS. 9 and 10. Second, in Specific Example 2 the destination of the copy performed for the third task group in the fourth step of each task is the local memory of the processor that executes the third task group. Specifically, the destination of the copy processes corresponding to S312, S315, S324, S329, and S344 of FIGS. 6 and 7 is the local memory of the processor that executes the succeeding task.

Conversely, the same result can be achieved by omitting the copy for the third task group in the fourth step of each task and instead performing the copy of the second task group's output data in the second step. That is, after executing the second task, the second processor makes the second local memory referenceable by the first processor, and after executing the first task, the first processor makes the first local memory referenceable by the second processor.
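The two variants differ only in which side performs the copy, as the following sketch illustrates. The function names `push` and `pull` and the sample value are hypothetical, dictionaries stand in for local memories, and flag handling is omitted for brevity.

```python
def push(producer_local, successor_locals, key):
    # Fourth-step copy: the producer writes its output into the local memory
    # of every processor that runs a succeeding task.
    for local in successor_locals:
        local[key] = producer_local[key]

def pull(consumer_local, producer_local, key):
    # Second-step copy: the consumer reads the output out of the producer's
    # local memory once it has become referenceable.
    consumer_local[key] = producer_local[key]

local_73 = {"E": 4}          # invented value for data E held by CPU 73
local_74_pushed = {}
local_74_pulled = {}

push(local_73, [local_74_pushed], "E")
pull(local_74_pulled, local_73, "E")
print(local_74_pushed == local_74_pulled)
```

Either way, the consuming processor ends up with the same data in its own local memory before it executes its task.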

<Other Embodiments of the Invention>
A parallel information processing method or apparatus according to another embodiment of the present invention can also be expressed as follows. With respect to the execution of an arbitrary first task of a program that can be expressed as a task flow graph, the method or apparatus is characterized by: waiting, before executing the first task, until the output data of a second task group becomes referenceable, the second task group consisting of tasks that have an edge to the first task in the task flow graph and are executed by a processor other than the first processor that executes the first task (first step); copying the output data of the second task group to the local memory of the first processor (second step); executing the first task and storing its calculation result in the local memory of the first processor (third step); and, when a third task group exists that consists of tasks that have an edge from the first task in the task flow graph and are executed by a processor other than the first processor, making the output data of the first task referenceable by the third task group (fourth step).

In the second step, it is desirable to omit the copy when it is clear that the output data of the second task group already exists in the local memory of the first processor. Furthermore, the copy source in the second step may be a shared memory, and the copy destination in the fourth step may also be the shared memory. Alternatively, the copy source in the second step may be the local memory of the processor that executes the second task group, and the copy in the fourth step may be omitted by making the local memories of the processors mutually referenceable.
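Omitting the second-step copy when the data is already local can be sketched as follows. The helper name `fetch_inputs`, the dictionary-based memories, and the sample value are assumptions made for illustration.

```python
def fetch_inputs(local_memory, source, needed):
    """Copy only the inputs that are not already present in local memory."""
    copied = []
    for key in needed:
        if key not in local_memory:
            local_memory[key] = source[key]
            copied.append(key)
    return copied

source = {"B": 7}    # storage source (shared memory or another local memory)
local_memory = {}    # local memory of the first processor

first = fetch_inputs(local_memory, source, ["B"])    # earlier task: copies B
second = fetch_inputs(local_memory, source, ["B"])   # later task: B is local
print(first, second)
```

The second call performs no copy because the earlier task on the same processor already brought the data into local memory.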

In this way, by placing shared data in the local memories of the multi-core processor to suit the target parallel processing and passing that data between processors, the number of accesses to the shared memory can be reduced, and the processing time of the parallel processing as a whole can be reduced as well. That is, the embodiment according to the present invention provides, by means of a multi-core processor having a shared memory and local memories and the software running on it, a parallel information processing apparatus or parallel information processing method with few shared-memory accesses and a short overall processing time. By making use of the local memories of the processors, the number of accesses to the shared memory can be reduced, the memory access cost lowered, and the overall processing time shortened.

The embodiment according to the present invention is applicable to parallel processing on a variety of processors, such as embedded processors and processors for general-purpose computers.

Furthermore, the present invention is not limited to the embodiments described above, and various modifications are of course possible without departing from the gist of the invention already described. For example, the embodiments above describe the present invention as a hardware configuration, but the invention is not limited to this. The present invention can also realize any of the processing by causing a CPU (Central Processing Unit) to execute a computer program.

In the above example, the program can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, DVD (Digital Versatile Disc), BD (Blu-ray (registered trademark) Disc), and semiconductor memories (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)). The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or optical fiber, or via a wireless communication path.

DESCRIPTION OF SYMBOLS
100 Multi-core processor
100a Multi-core processor
11 Local memory
12 Local memory
13 Local memory
14 Local memory
111 Assigned-task reference data
121 Assigned-task reference data
131 Assigned-task reference data
141 Assigned-task reference data
21 CPU
22 CPU
23 CPU
24 CPU
30 Shared memory
301 Multitask reference data
311 External data
312 Data A
313 Data B
314 Data C
315 Data D
316 Data E
317 Data F
511 External data
512 Data A
513 Data B
521 Data A
522 Data B
523 Data C
524 Data D
531 Data C
532 Data E
541 Data D
542 Data E
543 Data F
61 Local memory
62 Local memory
63 Local memory
64 Local memory
71 CPU
72 CPU
73 CPU
74 CPU
900 Multi-core processor
901 Shared data
911 Non-shared data
921 Non-shared data
931 Non-shared data
941 Non-shared data
T1 Task
T2 Task
T3 Task
T4 Task
T5 Task
T6 Task

Claims

A task parallel processing method comprising:
when a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, causing the first processor to wait to execute the first task until output data of the second task becomes referenceable by the first processor;
after the output data of the second task has become referenceable by the first processor, acquiring the output data from a storage source of the output data and storing it in a first local memory of the first processor;
causing the first processor to execute the first task with reference to the output data stored in the first local memory, and storing output data of the first task in the first local memory; and
when a third task, which is a succeeding task of the first task, is assigned to a processor other than the first processor, making the output data of the first task referenceable from a processor other than the first processor.

The task parallel processing method according to claim 1, wherein, when a preceding task of a fourth task that is executed on the first processor after the first task is the second task, the first processor executes the fourth task with reference to the output data of the second task that was stored in the first local memory before execution of the first task, without acquiring the output data from the storage source of the output data of the second task.

The task parallel processing method according to claim 1, wherein the storage source of the output data of the second task is a shared memory, and
after executing the first task, the first processor reads the output data of the first task from the first local memory and stores it in the shared memory.

The task parallel processing method according to claim 1, wherein the storage source of the output data of the second task is a second local memory of a second processor that executes the second task,
after executing the second task, the second processor makes the output data of the second task referenceable by processors other than the second processor, and
after executing the first task, the first processor makes the output data of the first task referenceable by processors other than the first processor.

The task parallel processing method according to claim 1, wherein the first task, the second task, and the third task are implemented as a program that can be expressed as a task flow graph.

A task parallel processing device comprising a plurality of processors each having a local memory, wherein
a first processor of the plurality of processors:
has a first local memory;
when a second task, which is a preceding task of a first task assigned to the first processor, is assigned to a processor other than the first processor, waits to execute the first task until output data of the second task becomes referenceable;
after the output data of the second task has become referenceable, acquires the output data from a storage source of the output data and stores it in the first local memory;
executes the first task with reference to the output data stored in the first local memory, and stores output data of the first task in the first local memory; and
when a third task, which is a succeeding task of the first task, is assigned to a processor other than the first processor, makes the output data of the first task referenceable from a processor other than the first processor.

The task parallel processing device according to claim 6, wherein,
when a preceding task of a fourth task that is executed on the first processor after the first task is the second task, the first processor executes the fourth task with reference to the output data of the second task that was stored in the first local memory before execution of the first task, without acquiring the output data from the storage source of the output data of the second task.

The task parallel processing device according to claim 6, wherein the storage source of the output data of the second task is a shared memory, and
after executing the first task,
the first processor reads the output data of the first task from the first local memory and stores it in the shared memory.

The task parallel processing device according to claim 6, wherein the storage source of the output data of the second task is a second local memory of a second processor that executes the second task,
after executing the second task, the second processor makes the output data of the second task referenceable by processors other than the second processor, and
after executing the first task, the first processor makes the output data of the first task referenceable by processors other than the first processor.

A task parallel processing program that causes a computer to execute:
a process of, when a second task, which is a preceding task of a first task assigned to a first processor, is assigned to a processor other than the first processor, causing the first processor to wait to execute the first task until output data of the second task becomes referenceable by the first processor;
a process of, after the output data of the second task has become referenceable by the first processor, acquiring the output data from a storage source of the output data and storing it in a first local memory of the first processor;
a process of causing the first processor to execute the first task with reference to the output data stored in the first local memory, and storing output data of the first task in the first local memory; and
a process of, when a third task, which is a succeeding task of the first task, is assigned to a processor other than the first processor, making the output data of the first task referenceable from a processor other than the first processor.