JP2014063278A

JP2014063278A - Synchronous processing circuit and synchronous processing method

Info

Publication number: JP2014063278A
Application number: JP2012207071A
Authority: JP
Inventors: Takeshi Soga; 武史曽我; Hiroshi Inoue; 弘士井上; Hiroshi Sasaki; 広佐々木
Original assignee: Kyushu University NUC
Current assignee: Kyushu University NUC
Priority date: 2012-09-20
Filing date: 2012-09-20
Publication date: 2014-04-10

Abstract

PROBLEM TO BE SOLVED: To provide a synchronous processing circuit that can perform not only whole-group synchronization but also partial synchronization with a simple hardware mechanism. SOLUTION: The synchronous processing circuit synchronizes some or all of a plurality of processing means using a plurality of nodes arranged in a tree structure. The nodes include a plurality of leaf nodes 3 at the lowest level, each corresponding to one of the processing means; a root node 7 at the highest level; and a plurality of internal nodes 5, which are the nodes other than the root node 7 and the leaf nodes 3. In the tree structure, a logical value is notified not only from a child node to its parent node but also from the parent node to the child node. Each internal node 5 includes an internal node selection section that selects whether to notify its parent node of the logical value obtained by performing a logical operation on the logical values notified from its child nodes, or of the negation of the logical value notified from the parent node.

Description

The present invention relates to a synchronization processing circuit and a synchronization processing method, and more particularly to a synchronization processing circuit and the like that uses a tree structure to determine the synchronization state of some or all of a plurality of processing means.

Advances in process, hardware, and software technology have made multi-core processors, which carry multiple processor cores (hereinafter abbreviated as “cores”) on one chip, widespread. Many-core processors with even more cores are also being studied; for example, commercial chips with 100 cores already exist.

To achieve high effective performance on a many-core processor or the like, it must be possible to exploit parallelism commensurate with the number of cores. Where extremely high parallelism is inherent, as in image processing or scientific computing, core utilization can be kept high. In many cases, however, sufficient parallelism cannot be extracted. To broaden the range of applications of many-core processors and promote their adoption, not only parallelism within a program but also parallelism between programs must therefore be actively exploited. In other words, an efficient multithreaded, multiprogram execution environment is extremely important for many-core processors. Concrete examples include embedded systems in which several pieces of software such as an OS, mail, games, and a compiler run at the same time, and data center or server applications in which multiple users submit different jobs.

One challenge in running several parallelized programs at the same time is making synchronization between cores more efficient. Synchronization is needed to keep cores in step with one another. It is performed, for example, to decide the access order when several cores use a shared resource such as memory or a bus, or when several cores need to start operating at the same time. Here, the set of cores that take part in a given synchronization is called a core group. In general, the number of cores used to run a parallelized program depends strongly on the degree of parallelism inherent in that program. In addition, when several programs run at the same time, each of several core groups may issue synchronization operations. A many-core processor therefore needs a synchronization mechanism that can operate independently and in parallel for a variety of core groups.

Barrier synchronization is one of the representative synchronization techniques. It achieves synchronization by embedding, in the program executed by each core to be synchronized, an instruction to stop until all the target cores have given permission to proceed. Barrier synchronization is implemented in many parallel computing environments, such as MPI, OpenMP, and pthread. In general, it can be realized without adding dedicated hardware.
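As a point of reference, the software barrier described above can be sketched with the `threading.Barrier` class from Python's standard library. The function name `run_with_barrier`, the worker body, and the default of four workers are illustrative assumptions and are not taken from the patent.

```python
import threading


def run_with_barrier(num_workers=4):
    """Run workers that all stop at a barrier before continuing (illustrative)."""
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    arrived = []      # workers that reached the synchronization point
    violations = []   # workers released before everyone had arrived

    def worker(worker_id):
        with lock:
            arrived.append(worker_id)
        # Blocks until all num_workers threads have called wait().
        barrier.wait()
        with lock:
            if len(arrived) != num_workers:
                violations.append(worker_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(arrived), violations
```

Calling `run_with_barrier(4)` should report that all four workers arrived and that none was released early. This is the purely software form of the barrier; it needs no dedicated hardware.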

Large-scale systems such as those used in the HPC field often implement a hardware barrier mechanism. For example, such systems implement hardware barrier functions that are small and fast relative to the size of the system (see Non-Patent Documents 1 to 3). Patent Documents 1 and 2 realize synchronization and the like using a tree network structure.

Research aimed at on-chip multi-core and many-core processors has also been carried out (see Non-Patent Documents 4 to 6). Villa et al. (see Non-Patent Document 4) speed up barriers by adding circuitry to the NoC used for inter-core data communication. Ito et al. (see Non-Patent Document 5) and Hoefler et al. (see Non-Patent Document 6) perform barrier synchronization using a dedicated network.

Patent Document 1: JP-T-2004-529414
Patent Document 2: JP-T-2004-536372

Non-Patent Document 1: Gara, A., and 13 others, Overview of the Blue Gene/L system architecture, IBM Journal of Research and Development, Vol. 49, No. 2, pp. 195-212, 2005.
Non-Patent Document 2: Habata, S., and 3 others, Hardware system of the Earth Simulator, Parallel Computing, Vol. 30, No. 12, pp. 1287-1313, 2004.
Non-Patent Document 3: Ajima, Y., and 2 others, Tofu: A 6d mesh/torus interconnect for exascale computers, Computer, Vol. 42, No. 11, pp. 36-40, 2009.
Non-Patent Document 4: Villa, O., and 2 others, Efficiency and scalability of barrier synchronization on NoC based many-core architectures, Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems (Altman, E., ed.), Atlanta, GA, USA, ACM, pp. 81-90, 2008.
Non-Patent Document 5: Ito, M., and 16 others, An 8640 MIPS SoC with Independent Power-Off Control of 8 CPUs and 8 RAMs by An Automatic Parallelizing Compiler, 2008 IEEE International Solid-State Circuits Conference Digest of Technical Papers (Fujino, L.C., ed.), San Francisco, CA, IEEE, S Digital Publishing Inc, pp. 90-598, 2008.
Non-Patent Document 6: Hoefler, T., and 3 others, Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters, 19th International Conference on Architecture and Computing Systems - ARCS06, pp. 343-350, 2006.

However, when many cores participate, barrier synchronization takes longer, and frequent use degrades software execution performance. In addition, the barrier completion times of the individual cores can vary widely, so that tuning done in advance no longer takes effect as expected.

For this reason, large-scale systems such as those used in the HPC field often implement a hardware barrier mechanism. However, the hardware barrier mechanisms of large-scale systems (see Non-Patent Documents 1 to 3 and Patent Documents 1 and 2) are designed on the premise that a multifunctional network controller is present to control a long-distance network connecting enclosures and chips. Their circuit scale is therefore large, and it is difficult to use them as they are as the hardware barrier of a many-core processor. Moreover, large-scale systems share the network used for the hardware barrier with other communication functions in order to reduce wiring cost. Barrier synchronization therefore cannot be expected to always run with the shortest latency.

The method of Villa et al. (see Non-Patent Document 4) shares the network with data traffic, so the latency of barrier synchronization is not fixed. Neither the method of Ito et al. (see Non-Patent Document 5) nor that of Hoefler et al. (see Non-Patent Document 6) has a function for grouping the cores in the network by means of a hardware circuit.

An object of the present invention is therefore to propose a synchronization processing circuit and the like that can be realized with a simple hardware mechanism and that can perform synchronization not only for the whole set of processing means but also for a part of it.

A first aspect of the present invention is a synchronization processing circuit that synchronizes some or all of a plurality of processing means using a plurality of nodes in a tree structure. The nodes of the tree structure include a plurality of leaf nodes located at the lowest level and corresponding respectively to the plurality of processing means, a root node located at the highest level, and a plurality of internal nodes other than the root node and the leaf nodes. In the tree structure, a logical value is notified not only from a child node to its parent node but also from the parent node to the child node. Each internal node includes internal node selection means for selecting whether to notify its parent node of the logical value obtained by performing a logical operation on the logical values notified from its child nodes, or of the negation of the logical value notified from the parent node.
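The selection made by an internal node can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name, the parameter names, the use of Python booleans for logical values, and the choice of AND as the logical operation are all hypothetical. The first aspect itself only requires some logical operation on the child values.

```python
def internal_node_upward(child_values, parent_value, fold_back):
    """Value an internal node notifies to its parent (illustrative sketch).

    child_values: logical values notified from the child nodes
    parent_value: logical value notified from the parent node
    fold_back:    True if this node is set to fold back (assumed flag name)
    """
    if fold_back:
        # Notify the negation of the value received from the parent.
        return not parent_value
    # Otherwise notify the result of a logical operation on the child values.
    # AND is assumed here; the claim does not fix the operation.
    return all(child_values)
```

With `fold_back` set, the value sent upward no longer depends on the node's children, so the subtree below the node can be synchronized separately from the rest of the tree.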

A second aspect of the present invention is the synchronization processing circuit of the first aspect, in which each leaf node includes leaf node selection means for selecting whether to notify its parent node of a logical value indicating whether the processing means corresponding to the leaf node has reached a synchronization point, or of the negation of the logical value notified from the parent node.

A third aspect of the present invention is the synchronization processing circuit of the second aspect, in which the circuit transitions between a barrier completion wait/acceptable state and a barrier completion notification wait state. A leaf node that notifies a logical value indicating whether its corresponding processing means has reached the synchronization point behaves as follows toward its parent node. In the barrier completion wait/acceptable state, it notifies a synchronous-state logical value as the logical value indicating that the corresponding processing means has reached the synchronization point, and notifies an asynchronous-state logical value, different from the synchronous-state logical value, as the logical value indicating that the synchronization point has not been reached. In the barrier completion notification wait state, when it is notified of the synchronous-state logical value by its parent node, it notifies the asynchronous-state logical value. The root node, and any internal node that notifies the negation of the logical value notified from its parent node, behave as follows. In the barrier completion wait/acceptable state, when all child nodes notify the synchronous-state logical value, the node notifies the synchronous-state logical value to its child nodes and transitions to the barrier completion notification wait state. In the barrier completion notification wait state, when the asynchronous-state logical value has been notified from all child nodes, the node transitions to the barrier completion wait/acceptable state.

A fourth aspect of the present invention is the synchronization processing circuit of the second or third aspect, further comprising fold-back storage means for storing information indicating whether each internal node performs fold-back, and mask storage means for storing information indicating whether each leaf node performs masking. Each leaf node includes leaf node selection means for selecting, according to the information stored in the mask storage means, whether to notify its parent node of a logical value indicating whether the processing means corresponding to the leaf node has reached the synchronization point, or of the negation of the logical value notified from the parent node. The internal node selection means of each internal node makes its selection according to the information stored in the fold-back storage means. The root node notifies its child nodes of the logical value obtained by performing a logical operation on the logical values notified from its child nodes.

A fifth aspect of the present invention is the synchronization processing circuit of any one of the first to fourth aspects, in which a plurality of the tree structures exist and each tree structure performs synchronization independently of the other tree structures.

A sixth aspect of the present invention is the synchronization processing circuit of any one of the first to fifth aspects, in which each internal node includes a configuration change synchronization mechanism for dynamically changing the configuration of the tree structure. Toward its parent node, the configuration change synchronization mechanism sends a synchronization request when the node itself and all of its child nodes are requesting synchronization. Toward its child nodes, it sends a logical value indicating that synchronization is not complete when, while a synchronization request from any child node is pending, a synchronization request has been issued to the parent node and has not yet completed; otherwise it sends a different logical value.

A seventh aspect of the present invention is a synchronization processing method in a synchronization processing circuit that synchronizes some or all of a plurality of processing means using a plurality of nodes in a tree structure. In the tree structure, a logical value is notified not only from a child node to its parent node but also from the parent node to the child node. The method includes a selection step in which internal node selection means of the synchronization processing circuit selects whether an internal node, that is, a node of the tree structure having both a parent node and child nodes, notifies its parent node of the logical value obtained by performing a logical operation on the logical values notified from its child nodes, or of the negation of the logical value notified from the parent node.

Each node is a hardware module. In the barrier completion wait/acceptable state, the logical operation in the root node and the internal nodes yields, for example, the logical value indicating that the synchronization point has been reached when all the logical values notified from the child nodes indicate arrival at the synchronization point, and a different logical value otherwise. An internal node that is not set to fold back thereby notifies its parent node that all of its child nodes within the core group have reached the synchronization point. The root node, or an internal node set to fold back, then determines that the processing in the processing means of the core group has reached the synchronization point, sends a synchronization completion notification signal to each leaf node, and can transition to the barrier completion notification wait state. In the barrier completion notification wait state, the logical operation yields, for example, the logical value indicating that the synchronization setting registers of the leaf nodes have been reset when all the logical values notified from the leaf nodes indicate that reset, and a different logical value otherwise. The root node, or an internal node set to fold back, thereby determines that all the synchronization setting registers of the processing means in the core group have been reset and can transition to the barrier completion wait/acceptable state.
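The state-dependent logical operation described above can be written out as a short sketch. The encoding (1 for "reached the synchronization point", 0 for "register reset"), the constant names, and the function name `combine` are assumptions made for illustration.

```python
W_SET = 0  # barrier completion wait/acceptable state (assumed encoding)
W_RST = 1  # barrier completion notification wait state (assumed encoding)


def combine(state, child_values):
    """State-dependent logical operation at a root or internal node.

    child_values holds 1 where a child reports arrival at the
    synchronization point and 0 where it reports a reset register.
    """
    if state == W_SET:
        # Report 1 only when every child has reached the synchronization point.
        return int(all(child_values))
    # In W_RST, keep reporting 1 until every child has reset to 0.
    return int(any(child_values))
```

Under this encoding the operation is an AND while waiting for the barrier to complete and an OR while waiting for the leaf registers to reset, so the result changes only when all children agree.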

The logical value obtained by the logical operation in the root node or in an internal node may also be used as the state of that node. In that case, each internal node and leaf node may be provided with parent node state storage means that stores the state of its immediate parent node, and when fold-back or masking is performed, the node may notify the negation of the logical value stored in the parent node state storage means.

In the synchronization processing circuit of the third aspect, an internal node that notifies the logical value obtained by performing a logical operation on the logical values notified from its child nodes may behave as follows toward its parent node. In the barrier completion wait/acceptable state, it notifies the synchronous-state logical value when all child nodes notify the synchronous-state logical value, and notifies the asynchronous-state logical value when any child node notifies the asynchronous-state logical value. In the barrier completion notification wait state, it notifies the asynchronous-state logical value when all child nodes notify the asynchronous-state logical value, and notifies the synchronous-state logical value when any child node notifies the synchronous-state logical value. The root node, and any internal node that notifies the negation of the logical value notified from its parent node, may behave as follows toward its child nodes. In the barrier completion wait/acceptable state, it notifies the asynchronous-state logical value when any child node notifies the asynchronous-state logical value; when all child nodes notify the synchronous-state logical value, it notifies the synchronous-state logical value and transitions to the barrier completion notification wait state. In the barrier completion notification wait state, it notifies the synchronous-state logical value when any child node notifies the synchronous-state logical value; when all child nodes notify the asynchronous-state logical value, it notifies the asynchronous-state logical value and transitions to the barrier completion wait/acceptable state.

According to each aspect of the present invention, the structure is simple, with a plurality of nodes connected as a tree, and enables a hardware barrier that is small in scale, low in latency, and free of variation. Simple additional functions (for example, fold-back and the mask of the third aspect) make it possible to partition the barrier network. Implementing several barrier circuits (the multiplexing of the fifth aspect) gives flexibility in the combinations of cores that take part in a barrier. A flexible hardware barrier mechanism can thus be realized that is suitable even for many-core processors, on which diverse processes are expected to run. Furthermore, as in the sixth aspect, the tree structure can apply the synchronization mechanism so that the nodes check in a distributed manner whether the configuration change has been carried out at every node, which allows the configuration to be changed dynamically. For a synchronization subtree network that does not need a configuration change, applying a new setting identical to the current one allows the barrier synchronization in progress to continue. This function therefore makes it possible to create new barrier groups dynamically within the barrier network.

Patent Documents 1 and 2, for example, describe realizing synchronization and the like using a tree network structure. However, they rely on pulse signals or the like, for example, and can be regarded as not performing fold-back from above. The technical feature of each aspect of the present invention lies in folding back from above.

A block diagram showing an overview of an example of the synchronization processing circuit according to the present invention.
A diagram showing an example of the hardware configuration of the leaf node 3 of FIG. 1.
A diagram showing the state transitions of the proposed barrier mechanism.
A diagram showing an example of the hardware configuration of the internal node 5 of FIG. 1.
A diagram showing an example in which a barrier network having eight leaf nodes 41₁, ..., 41₈ is divided into four.
A diagram showing an example of the state transitions of the barrier mechanism.
A diagram showing a hardware configuration that is an example of the synchronization circuit 31 in the internal node 5 of FIG. 4.
A diagram showing an example of division into two groups using two barrier networks, each consisting of eight nodes.
A diagram showing an example of the configuration change synchronization mechanism of a leaf node for realizing dynamic configuration changes.
A diagram showing an example of the configuration change synchronization mechanism of an internal node for realizing dynamic configuration changes.
A diagram showing the assignment of the cores that change node configurations in the experiment.
A diagram showing the evaluation program used to measure barrier synchronization time in the experiment.
A diagram showing what was measured in the experiment.

Embodiments of the present invention are described below with reference to the drawings. The present invention is not limited to these embodiments.

FIG. 1 is a block diagram showing an overview of an example of the synchronization processing circuit according to the present invention. The basic structure and operation of the hardware barrier mechanism according to the present invention (hereinafter abbreviated as “barrier mechanism”) are described with reference to FIG. 1.

This embodiment targets fixed core groups. Realizing the barrier mechanism of this embodiment requires the following hardware elements.

A bidirectional binary tree network dedicated to barriers is called a barrier network. FIG. 1 shows an example in which a core group consists of four cores. The cores 1₁, ..., 1₄ are processor cores (abbreviated as “cores”) and are an example of the “processing means” in the claims. (Hereinafter, a reference sign's subscript may be omitted when the sign refers to several of the elements or to any one of them.) FIG. 1 shows the case of four cores as an example.

The synchronization processing circuit includes leaf nodes 3 (an example of the “leaf nodes” in the claims), internal nodes 5 (an example of the “internal nodes” in the claims), a root node 7 (an example of the “root node” in the claims), fold-back storage units 9₁ and 9₂ (an example of the “fold-back storage means” in the claims), and mask storage units 11₁, ..., 11₄ (an example of the “mask storage means” in the claims).

A leaf node 3 is a hardware module added to each core. It has two kinds of 1-bit registers: a synchronization setting register and a synchronization state display register. The synchronization setting register indicates whether the core is waiting for barrier synchronization to complete. It is set to ‘1’ when the core's program execution reaches the synchronization point, and it is reset to ‘0’ when the synchronization state display register is reset. The synchronization state display register holds a copy of the barrier mechanism state register managed by the root node 7.

FIG. 2(a) is a diagram showing an example of the hardware configuration of the leaf node 3. The leaf node 3 supports a detach (mask) function. The mask function is a setting applied to a leaf node 3 and specifies whether the core 1 corresponding to that leaf node takes part in the barrier. In this embodiment, the mask storage unit 11 of FIG. 1 stores information indicating whether the mask function of each leaf node 3 is valid or invalid. The leaf node 3 includes a synchronization setting register 21, a synchronization state display register 23, a negation unit 25, and a selector 27 (an example of the “leaf node selection means” in the claims). According to the valid/invalid mask setting in the mask storage unit 11, the selector 27 selects either the value of the synchronization setting register 21 or the value of the synchronization state display register 23 negated by the negation unit 25, and notifies the selected value to the parent node. That is, inside the barrier mechanism, when the mask setting is valid, the leaf node 3 sends the value of the synchronization setting register 21 to its parent node (see FIG. 2(b)). When the setting is invalid, the leaf node 3 outputs the inverted value of the synchronization state display register 23 to its parent node, which removes its influence on the upper nodes (see FIG. 2(c)).
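The behavior of the selector 27 can be sketched as follows. The function and parameter names are hypothetical, and register values are modeled as the integers 0 and 1. The `mask_valid` flag follows the wording of this passage, where a valid setting means the core takes part in the barrier.

```python
def leaf_node_output(sync_setting, sync_state_display, mask_valid):
    """Value a leaf node sends to its parent node (illustrative sketch).

    sync_setting:       synchronization setting register 21 (0 or 1)
    sync_state_display: synchronization state display register 23 (0 or 1)
    mask_valid:         setting read from the mask storage unit 11
    """
    if mask_valid:
        # The core takes part in the barrier: forward its own register value.
        return sync_setting
    # The core is detached: send the inverted copy of the barrier state,
    # so the leaf always appears to agree with whatever the tree is waiting for.
    return 1 - sync_state_display
```

While the tree waits for arrivals (display register 0), a detached leaf reports 1. While the tree waits for resets (display register 1), it reports 0. Either way the detached leaf never holds up the other cores.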

The root node 7 is a hardware module that manages the execution state of synchronization processing in a core group. The core group managed by the root node 7 consists of its descendant leaf nodes, excluding those whose mask function is set to invalid in the mask storage unit 11. In the example of FIG. 1, when the mask mechanism is valid (that is, when the mask function is not in use), the core group managed by the root node 7 consists of the leaf nodes 3₁ to 3₄. The root node 7 aggregates the synchronization setting register values from the leaf nodes 3 under its management and determines whether all the cores 1 have reached the synchronization point. It contains a 1-bit synchronization status register.

The proposed barrier mechanism has two states, a barrier completion waiting/acceptable state and a barrier completion notification waiting state, and the synchronization status register indicates the current state. FIG. 3 shows the state transitions. The barrier completion waiting/acceptable state (W_SET in FIG. 3) is a state in which the cores 1 belonging to the core group are executing synchronization processing or can start the next barrier synchronization. In this state, the synchronization status register is reset to '0'. The barrier completion notification waiting state (W_RST in FIG. 3) is a state in which barrier synchronization has completed and the completion notification is being broadcast to all leaf nodes. In this state, the synchronization status register is set to '1'.

The internal node 5 is a hardware module located between the root node 7 and the leaf nodes 3. In this embodiment, its internal structure is the same as that of the root node 7. When the barrier network is divided to synchronize multiple core groups, the loopback function is enabled on the topmost internal node 5 of each core group, and that node operates as the root node of the core group.

FIG. 4(a) shows an example of the hardware configuration of the internal node 5. The internal node 5 supports a loopback function. The loopback function is a setting applied to the internal node 5 that specifies whether the node participates in the barrier of the upper node's core group. In this embodiment, the loopback storage unit 9 of FIG. 1 stores information indicating whether the loopback function of each internal node 5 is valid or invalid. The internal node 5 includes a synchronization circuit 31, a synchronization status display register 33, a negation operation unit 35, a selector 37 (an example of the "internal node selection means" in the claims), and a selector 39. In the barrier completion waiting/acceptable state, for example, the synchronization circuit 31 outputs the logical value indicating that the synchronization point has been reached if all the logical values notified from the child nodes indicate that the synchronization point has been reached, and outputs the other logical value otherwise. In the barrier completion notification waiting state, for example, the synchronization circuit 31 outputs the logical value indicating that the synchronization setting registers of the leaf nodes have been reset if all the logical values notified from the leaf nodes indicate such a reset, and outputs the other logical value otherwise. According to the loopback valid/invalid setting in the loopback storage unit 9, the selector 37 selects either the value of the synchronization status display register 33 negated by the negation operation unit 35 (see FIG. 4(b)) or the logical value output from the synchronization circuit 31 (see FIG. 4(c)), and notifies the parent node of the selected value. Likewise, the selector 39 selects either the logical value output from the synchronization circuit 31 (see FIG. 4(b)) or the logical value notified from the parent node (see FIG. 4(c)), and notifies the child nodes of the selected value.
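The two selectors of the internal node 5 can be modeled as a small pure function. This is a minimal sketch, not the circuit itself: the name `internal_node_outputs` and its parameters are hypothetical, and the sketch assumes that the synchronization status display register 33 holds the value most recently notified by the parent node.

```python
def internal_node_outputs(sync_out: int, from_parent: int,
                          status_reg: int, loopback: bool) -> tuple[int, int]:
    """Return (to_parent, to_children) for one internal node.

    sync_out    : logical value produced by the synchronization circuit 31
    from_parent : logical value currently notified by the parent node
    status_reg  : content of the synchronization status display register 33
                  (assumed here to mirror the parent's notification)
    loopback    : hypothetical flag mirroring the loopback storage unit 9
    """
    if loopback:
        # FIG. 4(b): act as the root of a subtree.
        to_parent = 1 - status_reg   # selector 37 picks the negated register 33
        to_children = sync_out       # selector 39 feeds the result back down
    else:
        # FIG. 4(c): act as an ordinary relay node.
        to_parent = sync_out         # selector 37 forwards the circuit output
        to_children = from_parent    # selector 39 forwards the parent's value
    return to_parent, to_children


looped = internal_node_outputs(sync_out=1, from_parent=0, status_reg=0, loopback=True)
relayed = internal_node_outputs(sync_out=1, from_parent=0, status_reg=0, loopback=False)
```

With loopback enabled, the children see their own completion result, while the parent only sees the negated status value.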

In this embodiment, the internal node 5 has the same hardware configuration as the root node 7, so the internal node 5 can provide the same functions as the root node 7. The barrier network is therefore divided into multiple subtrees by operating certain internal nodes as root nodes. FIG. 5 shows an example in which a barrier network with eight leaf nodes 41₁, ..., 41₈ is divided into four. The number in each leaf node 41 indicates the mask setting: '0' means the mask is valid and '1' means the mask is invalid. The numbers in the internal nodes 43₁ to 43₆ and the root node 45 indicate the loopback setting: '0' means loop back and '1' means do not loop back. The leaf nodes 41₁, 41₂, and 41₈ in FIG. 5 form core group A. The leaf nodes 41₃ and 41₄ form core group B. In core group B, the loopback function is applied to the internal node 43₂ that is their parent, so that it operates as a root node. The leaf nodes 41₅ and 41₆ form core group C. In core group C, the loopback function is applied to the internal node 43₃ that is their parent, so that it operates as a root node. The leaf node 41₇ forms core group D.

As shown in FIG. 4(b), when the loopback function is applied to an internal node, the result of the logical operation on the signals (synchronization setting register values) forwarded from the child nodes is not forwarded to the parent node but is instead sent to the child nodes as the synchronization completion notification signal. For example, when both child nodes reach the synchronization point, the corresponding synchronization setting registers are set to '1' and their logical AND is taken. The result is fed back to the child nodes as the synchronization completion notification signal. To the parent node, the internal node forwards the inverted value of the synchronization completion notification signal received from the parent node, thereby eliminating its influence on the upper nodes (those closer to the root node). By configuring multiple small barrier networks from subtrees in this way, each core group can perform barrier synchronization independently and in parallel.

First, the operation of the barrier mechanism when neither the loopback function nor the mask function is used is described with reference to FIGS. 6 and 7. FIG. 6 shows the sequence that starts in the barrier completion waiting/acceptable state, passes through the state transitions, and returns to the barrier completion waiting/acceptable state.

In FIG. 6, each internal node 53 and the root node 55 (circles in FIG. 6) contains two numbers: the number left of the slash is the synchronization setting register value, and the number right of the slash is the status output value. Each leaf node 51 (squares in FIG. 6) also contains two numbers: the number left of the slash is the synchronization setting register value, and the number right of the slash is the status display register value.

In FIG. 6(a), the initial state of the barrier mechanism is the barrier completion waiting/acceptable state, and the initial value of the synchronization setting register of every leaf node 51 is '0' (that is, the next barrier can be executed).

Suppose that, during synchronization processing, the core corresponding to the leaf node 51₃ reaches the synchronization point. The synchronization setting register of that core is then set to '1'. The synchronization setting register value of core 2 is sent to the internal node 53₂. If the synchronization setting register of the other child node has not yet been set, the synchronization wait continues (FIG. 6(b)).

Next, the core of the leaf node 51₄ reaches the synchronization point, and its synchronization setting register is set to '1'. The internal node 53₂ then propagates the synchronization setting register value to its parent node 55 (here, the root node) (FIG. 6(c)).

When the cores corresponding to the leaf nodes 51₁ and 51₂ reach the synchronization point, the internal node 53₁ propagates the synchronization setting register value to the root node 55. The root node 55 determines that all cores have reached the synchronization point. It then sends the synchronization completion notification signal to every leaf node and transitions to the barrier completion notification waiting state (FIG. 6(d)).

After receiving the barrier completion notification from the root node, each leaf node resets its synchronization setting register to '0', and this value propagates to the root node (FIG. 6(e)).

The root node 55 determines that the synchronization setting registers of all the leaf nodes 51 have been reset and returns the barrier mechanism to the barrier completion waiting/acceptable state. The next barrier synchronization can then be accepted (FIG. 6(a)).

FIG. 7 shows an example of the synchronization circuit 31 in the internal node 5 of FIG. 4. The synchronization circuit 31 includes a logical AND operation unit 61, a logical OR operation unit 63, a selector 65, and a synchronization storage unit 67. The AND operation unit 61 takes the logical AND of the synchronization setting register values input from the child nodes. The OR operation unit 63 takes the logical OR of the synchronization setting register values input from the child nodes. The selector 65 selects either the output of the AND operation unit 61 or the output of the OR operation unit 63 based on the value stored in the synchronization storage unit 67. The synchronization storage unit 67 stores the output value of the selector 65. As shown in FIG. 7, in the barrier completion waiting/acceptable state the internal node 5 takes the logical AND of the synchronization setting register values input from its two child nodes. In the barrier completion notification waiting state, it forwards the logical OR of these values to the parent node. As these values propagate toward the root, the root node can determine the status of all the leaf nodes.
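The behavior of the synchronization circuit 31 and the barrier cycle of FIG. 6 can be checked with a small simulation. The following is a minimal sketch under stated assumptions: the class `SyncCircuit`, the helper `propagate`, and the four-leaf tree layout are illustrative names and choices, the stored value '0' is taken to mean W_SET and '1' to mean W_RST, and timing between nodes is ignored.

```python
class SyncCircuit:
    """Model of the synchronization circuit 31 of FIG. 7 for two children."""

    def __init__(self) -> None:
        # Synchronization storage unit 67: 0 = W_SET, 1 = W_RST (assumed encoding).
        self.state = 0

    def step(self, left: int, right: int) -> int:
        and_out = left & right   # logical AND operation unit 61
        or_out = left | right    # logical OR operation unit 63
        # Selector 65: AND while waiting for arrivals, OR while waiting for resets.
        self.state = or_out if self.state else and_out
        return self.state


def propagate(nodes: tuple, leaves: list) -> int:
    """Evaluate a 4-leaf tree bottom-up and return the root's output."""
    n1, n2, root = nodes
    return root.step(n1.step(leaves[0], leaves[1]),
                     n2.step(leaves[2], leaves[3]))


nodes = (SyncCircuit(), SyncCircuit(), SyncCircuit())
# Leaf synchronization setting registers over time: cores arrive one by one,
# then reset their registers after the completion notification.
sequence = [
    [0, 0, 1, 0],  # one core has arrived
    [0, 0, 1, 1],  # its sibling has arrived
    [1, 1, 1, 1],  # all cores have arrived
    [0, 1, 1, 1],  # first core has reset
    [0, 0, 0, 1],  # three cores have reset
    [0, 0, 0, 0],  # all cores have reset
]
trace = [propagate(nodes, leaves) for leaves in sequence]
```

In this model the root output rises only when every leaf has arrived and falls only when every leaf has reset, which is the hysteresis that the AND/OR selection provides.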

With the loopback function and the mask function, the setting information can be changed to divide the barrier network logically, extending the mechanism so that multiple barrier synchronizations can run within a single barrier network.

As for the operation of each node, an internal node with the loopback function enabled behaves toward its child nodes like the root node in FIG. 6. An internal node with the loopback function disabled behaves like the internal node in FIG. 6. A leaf node whose mask function is valid behaves like the leaf node in FIG. 6.

An internal node with the loopback function enabled and a leaf node whose mask function is invalid notify the parent node as follows. If the core group of the parent node is in the barrier completion waiting/acceptable state, they send the logical value indicating that the synchronization point has been reached. If it is in the barrier completion notification waiting state, they send the logical value indicating that the synchronization setting register has been reset. This eliminates their influence on the upper nodes (those closer to the root node).
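Why the negated status value has no effect on the parent can be verified exhaustively. The sketch below assumes the AND/OR combination described for FIG. 7 and assumes that the detached node sees the same status value as the parent's state; the function names are hypothetical.

```python
def detached_output(group_status: int) -> int:
    """Value sent upward by a detached node: the negated status."""
    return 1 - group_status


def combine(state: int, a: int, b: int) -> int:
    """Parent-side combination: AND in W_SET (0), OR in W_RST (1)."""
    return (a | b) if state else (a & b)


def is_neutral() -> bool:
    """Check that the detached input never changes the parent's result."""
    return all(
        combine(state, sibling, detached_output(state)) == sibling
        for state in (0, 1)
        for sibling in (0, 1)
    )


neutral = is_neutral()
```

In W_SET the detached node sends '1', the identity element of AND, and in W_RST it sends '0', the identity element of OR, so the parent's result depends only on the participating sibling.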

This makes it possible, for example in a multi-program execution environment, to support synchronization processing that runs independently and in parallel for many different core groups.

Furthermore, as shown in FIG. 8, flexibility can be extended by multiplexing. Mask and loopback settings can form groups only in units of subtrees. Other group configurations are therefore handled by providing multiple barrier networks (multiplexing). FIG. 8 shows an example in which two barrier networks of eight nodes each are used to form two groups, one consisting of the leaf nodes 61₁, 61₃, 61₅, and 61₇ and the other consisting of the leaf nodes 61₂, 61₄, 61₆, and 61₈. The nodes in each network operate independently.

By combining the loopback function, the mask function, and multiplexing, the barrier mechanism of the present invention makes it possible to run a low-latency hardware barrier with little variation for various combinations of nodes. That is, it realizes a new, flexible hardware barrier scheme that can be used in many-core processors and the like. Specifically, it is a hardware barrier mechanism that (1) has low latency and small variation, (2) allows the barrier network to be divided, and (3) can be implemented with a small amount of circuitry. In addition, by providing several of these small barrier networks, a flexible hardware barrier mechanism can be realized that also suits many-core processors on which diverse processes are expected to run.

Next, an example of dynamic reconfiguration of the barrier mechanism is described with reference to FIGS. 9 and 10. FIG. 9 and FIG. 10 show examples of the configuration change processing synchronization mechanism of a leaf node and of an internal node, respectively, for realizing dynamic configuration changes. FIGS. 9 and 10 are examples with no flip-flop (FF) between nodes. When FFs are inserted, the output from the parent node is delayed, so the mechanism can be realized by adjusting the update timing of the update register accordingly.

In the barrier mechanism of FIGS. 9 and 10, the configuration is set in a distributed manner at each node, which allows configuration changes to be processed in parallel and speeds them up. The barrier mechanism also includes a configuration change processing synchronization mechanism that helps confirm the configuration change. This mechanism lets each node confirm whether the configuration change has been carried out on all nodes, and it is realized by applying the barrier synchronization mechanism. That is, (1) each node has a configuration setting input for requesting a configuration change and a configuration status output for confirming that the configuration change has been carried out on all nodes, (2) '1' is applied to the configuration setting input at the same time as the configuration setting value is supplied, and (3) the node waits until the configuration status output becomes '0'. This sequence confirms that the configuration has been applied on all nodes, after which barrier synchronization under the new configuration, as well as further reconfiguration, becomes possible.

For a synchronization subtree network that does not need a configuration change, applying a new setting identical to the current one allows the barrier synchronization in progress to continue. This function therefore makes it possible to create a new barrier group dynamically within the barrier network.

Like the barrier synchronization mechanism, the dynamic configuration change mechanism uses a combination of leaf nodes and internal nodes (including the root node) to check the configuration setting inputs of all nodes and broadcast the result. Unlike the barrier synchronization mechanism, it has no loopback or mask hardware and synchronizes all the nodes in the network. The state transition from synchronization completion confirmation to synchronization waiting is performed inside the dynamic configuration change mechanism. In addition, the synchronization request input signal is a pulse signal.

The leaf node of FIG. 9 includes a logical OR operation unit 73, a logical AND operation unit 75, and an update register 77. The leaf node of FIG. 9 sends '1' to the parent node while it is requesting synchronization (update request = '1' or update register = '1'). That is, the OR operation unit 73 computes the logical OR of the value stored in the update register 77 and the update request. The parent node returns '1' if synchronization has not completed and '0' if it has. The AND operation unit 75 computes the logical AND of the value sent to the parent node and the value returned from the parent node. The result of the AND operation unit 75 is the status output. The status output is '1' when the leaf node is requesting synchronization and the parent node is still processing the synchronization, and '0' otherwise. The update register 77 stores the result of the AND operation unit 75.
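The leaf node logic of FIG. 9 can be modeled as follows. This is a minimal sketch, assuming one evaluation per clock and no flip-flop between nodes; the class `ConfigLeaf` and its method names are hypothetical. Because the parent's reply depends combinationally on the leaf's request, the sketch exposes the request output separately from the clocked update.

```python
class ConfigLeaf:
    """Model of the configuration-change leaf node of FIG. 9."""

    def __init__(self) -> None:
        self.update_reg = 0  # update register 77

    def to_parent(self, update_request: int) -> int:
        # Logical OR operation unit 73: request stays asserted while pending.
        return self.update_reg | update_request

    def clock(self, update_request: int, from_parent: int) -> int:
        # Logical AND operation unit 75: '1' while requesting and not yet complete.
        status = self.to_parent(update_request) & from_parent
        self.update_reg = status  # latched for the next cycle
        return status


leaf = ConfigLeaf()
# Cycle 1: the update request pulse arrives; the parent reports "not complete".
s1 = leaf.clock(update_request=1, from_parent=1)
# Cycle 2: the pulse is gone, but the register keeps the request asserted.
pending = leaf.to_parent(update_request=0)
s2 = leaf.clock(update_request=0, from_parent=1)
# Cycle 3: the parent reports completion, so the status output drops to '0'.
s3 = leaf.clock(update_request=0, from_parent=0)
```

The update register is what turns the one-cycle request pulse into a level that is held until the parent signals completion.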

The internal node of FIG. 10 includes a logical OR operation unit 81, logical AND operation units 83, 85, 87, and 89, logical OR operation units 91 and 93, a negation operation unit 95, and an update register 97. To its child nodes, the internal node of FIG. 10 returns '1' when any child node is requesting synchronization and the synchronization request to the parent node has been issued but has not completed, and returns '0' otherwise. To the parent node, it sends '1' as a synchronization request when the node itself and all of its child nodes are requesting synchronization. The difference between the root node and an internal node is that the root node has no parent node to synchronize with, so '0' (synchronization complete) is always supplied as the value from the parent node.

That is, the OR operation unit 81 computes the logical OR of the value stored in the update register 97 and the update request. The AND operation unit 83 computes the logical AND of the values notified from the child nodes. The AND operation unit 85 computes the logical AND of the outputs of the OR operation unit 81 and the AND operation unit 83 and notifies the parent node of the result. The negation operation unit 95 computes the negation of the value notified to the parent node. The parent node returns '1' if synchronization has not completed and '0' if it has. The OR operation unit 93 computes the logical OR of the value notified from the parent node and the result of the negation operation unit 95. The AND operation unit 87 computes the logical AND of the result of the OR operation unit 93 and the output of the OR operation unit 81. The update register 97 stores the result of the AND operation unit 87. The OR operation unit 91 computes the logical OR of the values notified from the child nodes. The AND operation unit 89 computes the logical AND of the results of the OR operation units 91 and 93 and notifies the child nodes of the result.
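The gate-level description above translates directly into a pure function. This is a minimal sketch that follows the listed units one by one; the function name `config_internal_node` and its parameter names are hypothetical, and the example evaluates a root node, for which the value from the parent is always '0'.

```python
def config_internal_node(update_reg: int, update_request: int,
                         children: list, from_parent: int) -> tuple:
    """Return (to_parent, to_children, next_update_reg) for FIG. 10."""
    or81 = update_reg | update_request       # OR operation unit 81
    and83 = int(all(children))               # AND operation unit 83
    to_parent = or81 & and83                 # AND operation unit 85
    not95 = 1 - to_parent                    # negation operation unit 95
    or93 = from_parent | not95               # OR operation unit 93
    next_update_reg = or93 & or81            # AND operation unit 87
    or91 = int(any(children))                # OR operation unit 91
    to_children = or91 & or93                # AND operation unit 89
    return to_parent, to_children, next_update_reg


# Root node (from_parent is always 0), two children.
# Step 1: the root and one child are requesting; the other child is not yet.
first = config_internal_node(update_reg=0, update_request=1,
                             children=[1, 0], from_parent=0)
# Step 2: the second child joins; the root's own request is held in the register.
second = config_internal_node(update_reg=first[2], update_request=0,
                              children=[1, 1], from_parent=0)
```

In the first step the children are told '1' (not complete) and the register latches the pending request; in the second step every participant is requesting, so the children are told '0' (complete) and the register clears.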

The software procedure for changing the configuration is described next. Software interfaces with the dynamic configuration change mechanism through the barrier control register (BRC). The BRC contains the fields RET, MSK, SYN, and STT, each of which consists of multiple bits. Each bit corresponds to a barrier network number. RET, MSK, and SYN hold the loopback information, the mask information, and the synchronization control, respectively. On a write, STT selects the write target: 1 writes the configuration information (RET/MSK), and 0 writes the synchronization control (SYN). On a read, STT indicates the synchronization state: 1 indicates synchronization complete, and 0 indicates synchronization reset complete.

To change the configuration, software writes 1 to the STT bit and the configuration information to RET/MSK. In a barrier network that is being configured, a node must perform this write even if its own setting does not change. The STT bit is 1 while configuration is in progress. When configuration completes, the STT bit becomes 0, and synchronization set becomes possible from then on. After configuration, the barrier network is reset and waits for a synchronization set.

To operate another barrier network without waiting for configuration to complete, the write for the network under configuration must specify synchronization control and synchronization reset. Reading the BRC register alone does not reveal whether configuration is in progress, so software must keep track of this itself. If SYN = 0 is set at configuration time, a read-modify-write will not start barrier synchronization during configuration.

Next, the barrier mechanism of the present invention is evaluated by measuring the barrier synchronization time and by counting the circuit size. The evaluation used the RTL design of SMYLEref, a many-core architecture evaluation environment (see Nguyen Truong Son and four others, "Construction of an evaluation environment for the many-core architecture SMYLEref using FPGA," IPSJ SIG Technical Report, Vol. 198, No. 15, pp. 1-7, 2012). SMYLEref is an architecture in which multiple cores based on the MIPS R3000 architecture are connected by a bus to form a cluster, and many clusters are connected by a two-dimensional mesh on-chip network. In this experiment, a model with one cluster and 16 cores was created, and eight barrier networks were implemented for the 16 cores. A register was also implemented in each core for barrier control, and each core performs barrier synchronization by reading and writing this register. The register controls the configuration and synchronization of the leaf node used by its own core and of one of the 15 internal and root nodes (FIG. 11).

First, the barrier synchronization time measurement is described. FIG. 12 shows the evaluation program used to measure the barrier synchronization time. The program first writes to the register the mask information and loopback information of the leaf node and internal node that each core is responsible for (barrier config function). It then executes barrier synchronization (barrier function) twice. The barrier function is executed twice so that the function's code is loaded into the cache and so that all cores start the measured barrier function execution at the same time.

The barrier function performs the following steps: (1) read the register repeatedly until the status display becomes '0' (check barrier reset); (2) write synchronization setting = '1' to the register (barrier sync set); (3) read the register repeatedly until the status display becomes '1' (check barrier set); (4) write synchronization setting = '0' to the register (barrier sync reset).
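The four steps above can be sketched in software as follows. This is not the evaluation program of FIG. 12, which is not reproduced in the text; it is a hedged illustration in which `FakeBarrierRegister`, `read_status`, and `write_sync` are hypothetical names, and the fake register models a core group containing a single core so that the status simply follows the synchronization setting.

```python
class FakeBarrierRegister:
    """Single-core stand-in for the per-core barrier control register.

    With only one participant, the barrier completes as soon as the
    synchronization setting is written, so the status mirrors it.
    """

    def __init__(self) -> None:
        self.sync = 0
        self.log = []  # records the access order for inspection

    def read_status(self) -> int:
        self.log.append("read")
        return self.sync

    def write_sync(self, value: int) -> None:
        self.log.append(f"write {value}")
        self.sync = value


def barrier(reg: FakeBarrierRegister) -> None:
    while reg.read_status() != 0:   # (1) check barrier reset
        pass
    reg.write_sync(1)               # (2) barrier sync set
    while reg.read_status() != 1:   # (3) check barrier set
        pass
    reg.write_sync(0)               # (4) barrier sync reset


reg = FakeBarrierRegister()
barrier(reg)
barrier(reg)  # the barrier is reusable once the status has returned to '0'
```

Step (1) matters on the second call: it prevents a core from setting its flag again before the previous barrier has been fully reset across the group.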

The barrier synchronization time was measured as the number of clock cycles required to execute the barrier function the second time. The measurement was performed with the Xilinx RTL simulator ISIM (version 0.40d). For comparison, two test programs that do not use the hardware barrier were also measured. In these tests, the hardware barrier function is executed twice before the second barrier function execution so that all cores can execute the barrier function at the same time. The two programs use different barrier synchronization methods (Wilkinson, B. and one other, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Prentice Hall, Upper Saddle River, NJ, USA, 1999): one uses the master-slave method, intended for small node counts, and the other uses the butterfly method, intended for medium and larger node counts.

Table 1 shows the measurement results when all 16 cores participate in barrier synchronization. The unit is clock cycles. In the table, 'Type' is the kind of barrier measured: 'HW' is the hardware barrier, 'MS' is the master-slave method, and 'BUT' is the butterfly method. 'Start fastest', 'Start slowest', 'Completion fastest', and 'Completion slowest' are each given as the number of clock cycles elapsed relative to 'Start fastest' (FIG. 13). 'Start fastest' is the value for the core that began executing the barrier function earliest; because it is the origin, it is 0 for every type. 'Start slowest' is the relative elapsed cycle count of the core that began executing the barrier function last. 'Completion fastest' and 'Completion slowest' are the relative elapsed cycle counts of the cores that completed the barrier function first and last, respectively. 'Completion-Completion' is the difference in clock cycles between 'Completion fastest' and 'Completion slowest'. Taking 'Completion slowest' as the function execution latency and 'Completion-Completion' as the variation between cores, HW reduces latency to 1/87 and variation to 1/79 relative to MS, and reduces latency to 1/66 and variation to 1/30 relative to BUT.
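The column definitions can be expressed as a small calculation. The sketch below is illustrative only: the function name, the dictionary keys, and the cycle counts in the example are hypothetical and are not values from Table 1.

```python
def barrier_metrics(start_cycles, end_cycles):
    """Compute the Table 1 style columns from per-core absolute cycle counts.

    start_cycles[i] / end_cycles[i] are the cycles at which core i entered
    and left the barrier function (hypothetical inputs).
    """
    origin = min(start_cycles)                 # 'Start fastest' is the origin
    result = {
        "start_fastest": 0,
        "start_slowest": max(start_cycles) - origin,
        "completion_fastest": min(end_cycles) - origin,
        "completion_slowest": max(end_cycles) - origin,   # treated as latency
    }
    # 'Completion-Completion': spread between first and last core to finish
    result["completion_spread"] = (
        result["completion_slowest"] - result["completion_fastest"]
    )
    return result


if __name__ == "__main__":
    # Made-up cycle counts for three cores, for illustration only.
    print(barrier_metrics([100, 103, 101], [120, 125, 122]))
```

For the made-up inputs above, the latency is 25 cycles and the variation is 5 cycles; a ratio such as the 1/87 quoted in the text would be the quotient of two such latency values.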

Next, the circuit scale evaluation is described. The circuit scale was obtained with the Xilinx logic synthesis tool XST (ISE version 13.10.40). A Virtex-6 device (ML605 evaluation board) was specified as the target FPGA for synthesis.

Table 2 shows the synthesis result for the SMYLEref model used in the barrier synchronization time measurement, together with the synthesis result for the hardware barrier portion extracted from it. In the table, 'Type' is the resource counted: 'LUT' is LUTs, 'Reg' is registers, and 'RAM' is 36 Kbit RAMs (block RAMs). 'BAR' is the barrier synchronization portion and 'SMYLE' is the synthesis result for SMYLEref. 'BAR/SMYLE' is BAR as a percentage of SMYLE for LUT, Reg, and RAM respectively. BAR/SMYLE does not change as the number of cores increases, but decreases as the number of clusters increases. Considering also the area required for roughly 32 KByte of memory per core, the barrier hardware circuit is considered sufficiently small relative to the SMYLEref many-core processor.

As described above, this embodiment has presented a technique that realizes fast, stable, and flexible barrier synchronization by giving a small hardware barrier circuit a barrier network partitioning function and by multiplexing the network.

In the experiments, the barrier synchronization time evaluation showed that using the hardware barrier mechanism reduced latency to 1/66 and variation to 1/30 (relative to the butterfly software barrier), demonstrating the effectiveness of the proposed technique. In the circuit scale evaluation, the hardware barrier circuit accounted for 1.3% of the LUTs and 1.8% of the registers of the entire many-core processor, which is small.

1: core; 3: leaf node; 5: internal node; 9: turn-back storage unit; 11: mask storage unit; 21: synchronization setting register; 23: synchronization state display register; 25: negation operation unit; 27: selector; 31: synchronization circuit; 33: synchronization state display register; 35: negation operation unit; 37: selector; 41: leaf node; 43: internal node; 45: root node; 51: leaf node; 53: internal node; 55: root node; 61: logical product (AND) operation unit; 63: logical sum (OR) operation unit; 65: selector; 67: synchronization storage unit; 71: leaf node; 73: logical sum (OR) operation unit; 75: logical product (AND) operation unit; 77: update register; 81: logical sum (OR) operation unit; 83, 85, 87, 89: logical product (AND) operation units; 91, 93: logical sum (OR) operation units; 95: negation operation unit; 97: update register

Claims

1. A synchronization processing circuit that performs synchronization processing of some or all of a plurality of processing means using a plurality of nodes in a tree structure, wherein
the nodes of the tree structure include a plurality of leaf nodes positioned at the lowest level and each corresponding to one of the plurality of processing means, a root node positioned at the highest level, and a plurality of internal nodes other than the root node and the leaf nodes,
in the tree structure, logical values are notified not only from a child node to its parent node but also from a parent node to its child node, and
each internal node comprises internal node selection means for selecting whether to notify its parent node of a logical value obtained by performing a logical operation on the logical values notified from the child nodes of that internal node, or to notify the negation of the logical value notified from the parent node.

2. The synchronization processing circuit according to claim 1, wherein each leaf node comprises leaf node selection means for selecting whether to notify its parent node of a logical value indicating whether or not the processing means corresponding to that leaf node has reached the synchronization point, or to notify the negation of the logical value notified from the parent node.

3. The synchronization processing circuit according to claim 2, wherein
the synchronization processing circuit transitions between a barrier completion waiting/acceptable state and a barrier completion notification waiting state,
a leaf node that notifies the logical value indicating whether or not its corresponding processing means has reached the synchronization point,
in the barrier completion waiting/acceptable state, notifies a synchronization state logical value as the logical value indicating that the corresponding processing means has reached the synchronization point, and notifies an asynchronous state logical value, different from the synchronization state logical value, as the logical value indicating that it has not reached the synchronization point, and
in the barrier completion notification waiting state, notifies the asynchronous state logical value when the synchronization state logical value is notified from its parent node, and
the root node, and each internal node that notifies the negation of the logical value notified from its parent node,
when all of its child nodes notify the synchronization state logical value in the barrier completion waiting/acceptable state, notifies the synchronization state logical value to its child nodes and transitions to the barrier completion notification waiting state, and
when the asynchronous state logical value is notified from all of its child nodes in the barrier completion notification waiting state, transitions to the barrier completion waiting/acceptable state.

4. The synchronization processing circuit according to claim 2, further comprising:
turn-back storage means for storing information indicating whether or not each internal node performs turn-back; and
mask storage means for storing information indicating whether or not each leaf node is masked, wherein
the leaf node selection means of each leaf node selects, according to the information stored in the mask storage means, whether to notify the parent node of the logical value indicating whether or not the processing means corresponding to that leaf node has reached the synchronization point, or to notify the negation of the logical value notified from the parent node,
the internal node selection means of each internal node makes its selection according to the information stored in the turn-back storage means, and
the root node notifies its child nodes of a logical value obtained by performing a logical operation on the logical values notified from the child nodes.

5. The synchronization processing circuit according to claim 1, wherein a plurality of tree structures exist, and synchronization processing is performed independently of other tree structures in each tree structure.

6. The synchronization processing circuit according to claim 1, wherein
each of the internal nodes includes a configuration change processing synchronization mechanism for dynamically changing the configuration of the tree structure, and
the configuration change processing synchronization mechanism
sends a synchronization request to the parent node when the node itself and all of its child nodes are requesting synchronization toward the parent node, and
sends to a child node a logical value indicating that synchronization is not complete if a synchronization request made from any of the child nodes toward the parent node has not been completed, and otherwise sends a different logical value.

7. A synchronization processing method in a synchronization processing circuit that performs synchronization processing of some or all of a plurality of processing means using a plurality of nodes in a tree structure, wherein
in the tree structure, logical values are notified not only from a child node to its parent node but also from a parent node to its child node, and
the method includes a selection step in which internal node selection means included in the synchronization processing circuit selects whether an internal node, which has both a parent node and child nodes in the tree structure, notifies the parent node of a logical value obtained by performing a logical operation on the logical values notified from the child nodes of that internal node, or notifies the negation of the logical value notified from the parent node.