JP2012522280A

JP2012522280A - Single instruction multiple data (SIMD) processor having multiple processing elements interconnected by a ring bus

Info

Publication number: JP2012522280A
Application number: JP2011540254A
Authority: JP
Inventors: ハンノリースケ
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-03-30
Filing date: 2009-09-25
Publication date: 2012-09-20
Anticipated expiration: 2029-09-25
Also published as: JP5488609B2

Abstract

A single instruction multiple data (SIMD) processor having a plurality of processing elements includes a dividing unit and a comparison unit. The dividing unit divides the address of read-only parameter data in a data memory into a first part and a second part at a bit position corresponding to the number of processing elements. The comparison unit compares the number of shift operations performed on a ring bus for read-only parameter data fetched from the internal memory at the address given by the first part with the difference between the position of the processing element itself and the part of the global address of the read-only parameter data to be accessed that corresponds to the second part and specifies the ring-bus position of the processing element storing that data, so that the other processing elements can acquire the read-only parameter.

Description

The present invention relates to a data processing device, a data processing system, and a data processing method.

A processor that operates by single instruction multiple data (SIMD) processing has been proposed (Patent Document 1). One example of such a SIMD processor is described with reference to FIG. 15. FIG. 15 is a conceptual block diagram illustrating the SIMD architecture. As shown in FIG. 15, the SIMD architecture 90 includes a central processor (CP) 10, a plurality of processing elements (PEs) 11, ring buses 12 and 13, and connections 14. FIG. 15 shows 16 PEs 11, denoted PE_00 to PE_15.

The CP 10 includes a data memory (DMEM) 16 that stores parameters, and the PEs 11 use these parameters for processing. Each PE 11 has an internal memory (IMEM) 17 for storing parameters transferred from the CP 10. The CP 10 is connected to each PE 11 by the pipelined ring buses 12 and 13. The CP 10 and each PE 11 are connected to the ring buses 12 and 13 via the connections 14. Data is transferred between the CP 10 and each PE 11 in the clockwise direction via the ring bus 12 and in the counterclockwise direction via the ring bus 13. That is, data is transferred from the CP 10 to each PE 11 via the clockwise ring bus 12 and the counterclockwise ring bus 13.

When processing starts, each PE 11 fetches the parameters it needs from the DMEM 16 of the CP 10. Each PE 11 requests the parameters stored in the DMEM 16 of the CP 10 by one of the following general methods.

(1) Transfer on request
(2) Preloading

In the case of (1), transfer on request, each time a PE 11 needs a parameter, the CP 10 reads the parameter from the DMEM 16 and transfers it to the requesting PE 11. This sequence is disclosed, for example, in Non-Patent Document 1. However, if request packets are exchanged every time a PE 11 requests data, bus traffic increases significantly. If 16 PEs request data simultaneously or in succession, traffic on the ring bus increases significantly. Furthermore, there is a delay between a PE requesting data and receiving it, and the PE 11 must wait until the required data has been fetched before it can start processing. High parallel processing efficiency therefore cannot be expected.

The case where data is preloaded (case (2) above) is described with reference to FIG. 16. FIG. 16 shows the initial setting of parameters in the internal memories (IMEMs) 17 for parallel use by the PEs 11.

Before the parameters are used by the PEs 11, the CP 10 reads all parameters from the DMEM 16 once. The parameters are then broadcast to all PEs 11 so that they can be stored in the IMEM 17 of each PE 11. During program execution, each PE 11 can access its own IMEM 17 at any time to read the parameter it needs. However, because each PE holds all parameters in its own IMEM 17, each IMEM 17 requires a very large memory capacity, and the system therefore requires a very large area. In addition, preloading takes considerable time because a large amount of data must be transferred and written.

In a SIMD architecture, the PEs 11 can also be grouped to optimize the use of the IMEMs 17. FIG. 17 shows this system structure. The parameters are distributed across, and stored in, a plurality of IMEMs 17. In this arrangement, a PE may need to access a parameter that is not stored in its own IMEM 17 but in an adjacent IMEM 17. The mechanism disclosed in Patent Document 2 can be applied to the SIMD architecture described above.

Here, a plurality of PEs are grouped at compile time and share a common internal memory that all of them can access. An access indicator is set for every PE that is trying to access the internal memory at the same time. One of the PEs with an access indicator is selected, and the PEs trying to access the same address are identified. The parameter is then loaded from the internal memory and transferred to all PEs trying to access that address, and the access indicators of those PEs are cleared. This process is repeated until the access indicators of all PEs have been cleared. Because this method prevents multiple accesses to the same address, optimal access is achieved.

Patent Document 3 discloses a different approach for optimizing internal memory access by grouping adjacent processing elements, thereby optimizing the capability of the SIMD architecture. In this approach, two adjacent processing elements are grouped into a pair of processing elements at compile time. In these paired processing elements, the same address is assigned to both memory elements, which are connected to different data buses. This configuration allows, for example, one memory to be used for data acquisition and the other memory to be used for data output.

Patent Document 4 and Patent Document 5 disclose further approaches. In Patent Documents 4 and 5, the allocation is performed by the central processor itself. In Patent Document 5, a ring bus controller is provided to control the shifting of data on the ring bus. After data has been transferred to the ring bus, the central processor instructs the ring bus controller to shift the data on the ring bus. Under the control of the ring bus controller, the data moves along the ring bus by a predetermined amount. When the predetermined shift operation is complete, the ring bus controller notifies the central processor that the desired shift operation has finished. The central processor then instructs the processing elements (PEs) to fetch the data, and the PEs fetch the data they need.

Patent Document 1: U.S. Pat. No. 3,537,074
Patent Document 2: U.S. Pat. No. 7,363,472
Patent Document 3: U.S. Pat. No. 6,785,800
Patent Document 4: U.S. Pat. No. 5,828,894
Patent Document 5: European Patent Publication No. 0147857A2 (Japanese Unexamined Patent Publication No. 60-140456)

Non-Patent Document 1: Zvonko G. Vranesic, Michael Stumm, David M. Lewis, and Ron White, "Hector: A Hierarchically Structured Shared-Memory Multiprocessor," Computer, Vol. 24, No. 1, pp. 72-79, January 1991 (p. 75, lines 1 to 6)

The first method of transferring data (transfer on request) has the problem that access is very slow. One reason is that data must be transferred from the DMEM to an IMEM on every request. Another reason is that while data is being transferred to one IMEM, all other PEs must suspend execution and wait until their own data requests are fulfilled.

The second method of transferring data (preloading) is fast, but it requires a large memory space in the internal memories because the parameter data must be stored in the IMEM of every PE.

The method disclosed in Patent Document 2 aims to solve this problem of growing internal memory by storing the data in the internal memory of a PE group. Patent Document 2 also shows a general method for accessing the data. With this general method, however, addresses must be exchanged and compared between PEs before each memory access, and the address transfer and comparison between PEs consume extra control logic and extra processing time.

The method disclosed in Patent Document 3 has the disadvantage that it cannot reduce the amount of data in the internal memories. The method disclosed in Patent Document 4 has the disadvantage that it requires extra control logic to perform self-grouping. The method disclosed in Patent Document 5 has the disadvantages that it requires extra control logic to control the shift operation of the ring bus, and that the central processor must manage both the data input/output operations of the PEs and the ring bus shifts performed by the ring bus controller.

The methods disclosed in the patent and non-patent documents cited above are inefficient in terms of either time or area.

The present invention has been made in view of the above problems. Its object is to provide a data processing device, a data processing system, and a data processing method capable of efficiently transferring and fetching read-only parameters via one or more ring buses when the read-only parameters are stored in a distributed manner across a plurality of internal memories.

According to the present invention, it is possible to provide a data processing device, a data processing system, and a data processing method capable of efficiently reading data when the data is stored in a distributed manner across a plurality of internal memories.

The above and other objects, advantages, and features of the present invention will become more apparent from the following description of specific embodiments taken in conjunction with the accompanying drawings.
FIG. 1 is a conceptual block diagram illustrating the architecture of a data processing apparatus 900 according to an embodiment of the present invention.
FIG. 2 shows the relationship between the read-only parameters stored in the DMEM 106 and their addresses.
FIG. 3 shows one format of the global address 600 of each read-only parameter.
FIG. 4 shows the relationship between Addr_DMEM and Addr_IMEM.
FIG. 5 is a block diagram schematically showing the structure of a PE 101.
FIG. 6 is a conceptual diagram of the dividing process performed by the dividing unit 122.
FIG. 7 is a block diagram showing the dividing unit 122.
FIG. 8 shows the expected software emulation of the dividing unit, with the required clock cycles.
FIG. 9 is a block diagram showing the cmpmv unit 123.
FIG. 10 shows the expected software emulation of the compare/move unit, with the required clock cycles.
FIG. 11 is a flowchart showing the data processing method in each PE 101.
FIG. 12 shows the processing performed by the CP 100 to control the shift operation of the ring bus.
FIG. 13 is a block diagram showing the decoding loop of an H.264 video decoder.
FIG. 14 is a diagram showing a macroblock.
FIG. 15 is a conceptual block diagram showing the SIMD architecture of Patent Document 1.
FIG. 16 shows the initial setting of parameters in the internal memories (IMEMs).
FIG. 17 shows a system structure in which PEs can be grouped to optimize the use of the IMEMs.

(Embodiment 1)
A data processing apparatus according to an embodiment of the present invention is a processor that performs single instruction multiple data (SIMD) processing. The data processing apparatus is described with reference to FIG. 1. FIG. 1 is a conceptual block diagram illustrating the architecture of a data processing apparatus 900 according to an embodiment of the present invention. As shown in FIG. 1, this architecture includes a central processor (CP) 100, a data memory (DMEM) 106, processing elements (PEs) 101, internal memories (IMEMs) 107, a ring bus 102, a ring bus 103, connections 104, and shift registers 105.

The CP 100 has a data memory (DMEM) 106 that stores read-only parameters, and the PEs 101 use these read-only parameters for processing. A specific example in which 32 read-only parameters are used for processing is described here. That is, 32 read-only parameters are stored in the DMEM 106, and their addresses in the DMEM 106 are assumed to be "00" to "31", respectively. FIG. 2 shows the relationship between the read-only parameters in the DMEM 106 and their addresses Addr_DMEM in the DMEM 106.

The CP 100 is connected to the two ring buses 102 and 103 via a connection 104. The CP 100 reads the read-only parameters stored in the DMEM 106, and the parameters thus read are transferred via the ring buses 102 and 103.

FIG. 1 shows an example in which 16 PEs 101 are provided. In FIG. 1, the 16 PEs 101 carry the subscripts "00" to "15" for ease of explanation. That is, the 16 PEs 101 are identified as PE_00 to PE_15, respectively. The 16 PEs 101 operate in SIMD mode: when the CP 100 issues a single instruction, the PEs 101 execute it in parallel.

All PEs 101 are connected to the two ring buses 102 and 103 via connections 104. Shift registers 105 are provided on the ring buses 102 and 103, and the shift registers 105 on each ring bus are connected to one another. The number of shift registers 105 on each of the ring buses 102 and 103 equals the number of PEs 101. The ring bus 103 transfers data in the direction opposite to that of the ring bus 102: the ring bus 102 transfers data clockwise, and the ring bus 103 transfers data counterclockwise. The shift direction of the shift registers 105 on the ring bus 102 is therefore opposite to the shift direction of the shift registers 105 on the ring bus 103.

Each PE 101 is also connected to its own IMEM 107. Each IMEM 107 functions as a local data store. A single PE 101 is connected to a single IMEM 107, so there are 16 IMEMs 107, equal to the number of PEs 101. These IMEMs 107 store the read-only parameters needed for distributed parallel processing. A specific example in which each IMEM 107 stores two read-only parameters is described here, that is, an example with a total of 32 (16 x 2) read-only parameters.

First, the 32 parameters are transferred sequentially through the shift registers 105 provided on the ring bus 102. In the first clock cycle, the read-only parameter "01" stored at address "00" is read from the DMEM 106 and held in a shift register 105 on the ring bus 102. The CP 100 transfers the data read from the DMEM 106 to the nearest shift register 105. That is, the read-only parameter "01" is stored in the shift register 105 closest to the CP 100 on its downstream side. In the next clock cycle, the read-only parameter "01" is transferred to the next shift register 105, while the read-only parameter "02" stored at address "01" is read from the DMEM 106 of the CP 100 and held in the shift register 105.

By repeating this process, 16 read-only parameters come to be held in the shift registers 105. That is, each shift register 105 on the ring bus 102 holds one read-only parameter. Each IMEM 107 then stores the read-only parameter data held in its corresponding shift register 105, so that each IMEM 107 holds one read-only parameter. For example, the read-only parameter "01" is stored in the IMEM 107 of PE_00. Similarly, the read-only parameters "02" to "16" are stored in the IMEMs 107 of PE_01 to PE_15, respectively.

This process is performed twice, so that each IMEM 107 stores two read-only parameters. The read-only parameters "17" to "32" are transferred in the same way as described above. As a result, for example, the read-only parameters "01" and "17" are stored in sequence in the IMEM 107 of PE_00.
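The end result of the two preloading passes described above can be sketched as follows. This is only an illustrative model under stated assumptions: it reproduces the final contents of the IMEMs, not the cycle-by-cycle movement through the shift registers 105, and the names `preload`, `PARAMS_PER_IMEM`, and `imems` are hypothetical.

```python
PE_PER_GROUP = 16      # number of PEs 101 (and of IMEMs 107) in the example
PARAMS_PER_IMEM = 2    # each IMEM 107 stores two read-only parameters


def preload(dmem):
    """Distribute the DMEM contents over the IMEMs, one pass per IMEM slot.

    Assumption: in every pass, the k-th parameter of the pass ends up in the
    IMEM of PE_k, as the text states for parameters "01" to "16".
    """
    imems = [[] for _ in range(PE_PER_GROUP)]
    for rnd in range(PARAMS_PER_IMEM):
        batch = dmem[rnd * PE_PER_GROUP:(rnd + 1) * PE_PER_GROUP]
        for pe, value in enumerate(batch):
            imems[pe].append(value)
    return imems


# Parameters "01" to "32" stored at DMEM addresses 00 to 31.
dmem = [f"{n:02d}" for n in range(1, 33)]
imems = preload(dmem)
print(imems[0])   # contents of the IMEM of PE_00
```

In this model the IMEM of PE_00 ends up holding parameters "01" and "17", matching the example in the text.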

Next, the global address of each read-only parameter is described. FIG. 3 shows one format of the global address 600 of each read-only parameter. As shown in FIG. 3, the global address is divided into two parts. The upper bits 601 form the part representing Addr_IMEM, the address of the read-only parameter within the IMEM 107. This address Addr_IMEM can be calculated by the following equation.
Addr_IMEM = Addr_DMEM / PE_PER_GROUP ... (1)

Because the read-only parameters are stored in a distributed manner across the PE group, Addr_IMEM in the IMEM 107 is calculated by dividing Addr_DMEM of the DMEM 106 by the number of PEs 101 and discarding the remainder. Equivalently, Addr_IMEM can be obtained by looking at the upper bits of Addr_DMEM. For example, when Addr_DMEM is "27" and PE_PER_GROUP is "16", Addr_IMEM is "1". When PE_PER_GROUP is "16" and Addr_DMEM is in the range "00" to "15", Addr_IMEM is 0. When Addr_DMEM is in the range "16" to "31", Addr_IMEM is 1. FIG. 4 shows the relationship between Addr_DMEM and Addr_IMEM.

In this way, the address Addr_DMEM is divided by PE_PER_GROUP, the number of PEs 101, to calculate the address Addr_IMEM in the IMEM 107. The example above uses PE_PER_GROUP = 16, but PE_PER_GROUP may of course be a value other than 16.

The lower bits 602 form the part representing POS_IMEM, which indicates the position on the ring bus 102 of the IMEM that stores the read-only parameter. That is, POS_IMEM is the part of the global address of the read-only parameter to be accessed that specifies the position on the ring bus 102 at which that read-only parameter is stored.

POS_IMEM is calculated by a modulo operation on Addr_DMEM with PE_PER_GROUP (= 16 in this example), that is, as the remainder of the division. FIG. 4 shows the relationship between Addr_DMEM and POS_IMEM. The global address of each read-only parameter thus consists of the two parts 601 and 602.
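The quotient and remainder relationship between Addr_DMEM, Addr_IMEM, and POS_IMEM can be illustrated with a short sketch. The function name `locate` and the `table` variable are hypothetical and chosen only for illustration; the arithmetic follows equation (1) and the modulo rule stated above.

```python
PE_PER_GROUP = 16  # number of PEs in the group, as in the example


def locate(addr_dmem, pe_per_group=PE_PER_GROUP):
    """Return (Addr_IMEM, POS_IMEM) for a given Addr_DMEM."""
    addr_imem = addr_dmem // pe_per_group  # integer quotient, equation (1)
    pos_imem = addr_dmem % pe_per_group    # remainder of the same division
    return addr_imem, pos_imem


# Mapping for the 32 addresses of the example (cf. the relationship in FIG. 4).
table = {addr: locate(addr) for addr in range(32)}
print(locate(27))
```

For Addr_DMEM = 27 this gives Addr_IMEM = 1 and POS_IMEM = 11, since 27 = 1 x 16 + 11.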

The part 601 serves as the first operand, and the part 602 serves as the second operand. The part 601 is the upper part of the address and lies to the left of the dividing bit position. The part 602 is the lower part of the address and lies to the right of the dividing bit position.

The boundary 603 between the lower part 602 and the upper part 601 is determined by the number of PEs. That is, the boundary 603 that divides the address into two parts changes according to PE_PER_GROUP, the number of PEs in the PE group. Specifically, the dividing position is given by log2(PE_PER_GROUP).

For example, when the number of PEs is 16 (= 2^4), the bit position at which the global address is divided (the dividing position) corresponds to the fourth bit from the least significant side. The boundary 603 therefore lies between the fourth and fifth bits from the least significant side. The lower 4 bits represent POS_IMEM, and the bits above them represent Addr_IMEM. For example, if Addr_DMEM is represented by 16 bits, the upper 12 bits correspond to Addr_IMEM.
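When PE_PER_GROUP is a power of two, splitting the address at bit position log2(PE_PER_GROUP) gives the same result as the division and modulo operations. The following sketch checks this equivalence; the function name `split_address` is hypothetical, and the power-of-two restriction is an assumption made explicit in the code.

```python
def split_address(addr_dmem, pe_per_group):
    """Split Addr_DMEM at bit position log2(pe_per_group).

    Assumption: pe_per_group is a power of two, so that log2 is an integer.
    """
    if pe_per_group < 1 or pe_per_group & (pe_per_group - 1):
        raise ValueError("pe_per_group must be a power of two")
    shift = pe_per_group.bit_length() - 1       # log2(pe_per_group)
    addr_imem = addr_dmem >> shift              # upper part 601
    pos_imem = addr_dmem & (pe_per_group - 1)   # lower part 602
    return addr_imem, pos_imem


# Bit-level split and arithmetic split agree for 16 PEs on all 16-bit addresses.
assert all(split_address(a, 16) == divmod(a, 16) for a in range(1 << 16))
print(split_address(27, 16))
```

With 16 PEs the boundary sits between bits 4 and 5 from the least significant side, so a 16-bit address yields a 12-bit Addr_IMEM and a 4-bit POS_IMEM.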

Next, the structure of the PE 101 is described with reference to FIG. 5. FIG. 5 is a block diagram schematically showing the structure of the PE 101. As shown in FIG. 5, the PE 101 includes an arithmetic logic unit (ALU) 121 that performs various operations. The ALU 121 includes a dividing unit 122 and a compare/move unit 123. The dividing unit 122 performs the dividing process that splits Addr_DMEM into two parts. The compare/move (cmpmv) unit 123 performs the compare/move process, which compares the shift distance "shift" with the number of shifts performed on the ring buses 102 and 103 in order to move the read-only parameter.

The processing performed by the PE 101 is described in detail below. Of the processes performed by the PE 101, the process that divides Addr_DMEM into two parts (hereinafter also called the "dividing process") is described first.

FIG. 6 is a conceptual diagram of the dividing process performed by the dividing unit 122. The dividing process is based on Addr_DMEM and PE_PER_GROUP, both of which are input from the CP 100 to each dividing unit 122. Each dividing unit 122 then divides Addr_DMEM using log2(PE_PER_GROUP). Here, log2(PE_PER_GROUP) is given as a natural number.

Let DST0 and DST1 denote the two values obtained by dividing Addr_DMEM into two parts. Specifically, Addr_DMEM is divided at a dividing point determined by the number of PEs, which yields the two outputs DST0 and DST1. DST0 corresponds to Addr_IMEM, and DST1 corresponds to POS_IMEM.

These values can be calculated by the following equation (2).
(DST0, DST1) = split(Addr_DMEM, log2(PE_PER_GROUP)) ... (2)

For example, when PE_PER_GROUP is 2 to the power of n (where n is a natural number), log2(PE_PER_GROUP) is a natural number. In this case, DST0 equals (Addr_DMEM / PE_PER_GROUP) and corresponds to Addr_IMEM as given by equation (1).

Next, the structure of the dividing unit is described with reference to FIG. 7. FIG. 7 is a block diagram showing the dividing unit 122 in each PE 101.

Each PE 101 divides the input value (Addr_DMEM) into two parts. The following description assumes that Addr_DMEM is represented by 16 bits. In FIG. 7, SRC0 and SRC1 are transferred from the CP 100. SRC0 corresponds to the 16-bit Addr_DMEM, and SRC1 is the bit shift amount that represents PE_PER_GROUP. SRC0 is an unsigned value. Since the number of PEs in the PE group is 16 (= 2^4), the bit shift amount is 4. That is, the number of bits that indicates the number of PEs corresponds to the bit shift amount.

The bit right shifter 401 shifts the bits of SRC0 to the right by the bit shift amount. That is, SRC0 is shifted right by 4 bits, so that the upper 12 bits of Addr_DMEM are selected. The value obtained by shifting SRC0 to the right is output as DST0, which corresponds to Addr_IMEM. DST0 is thus calculated from SRC0 and SRC1: the value obtained by shifting SRC0 to the right by the number of bits (digits) given by SRC1 is DST0 (see FIG. 8). For example, when SRC0 is "1101101101001101" in binary notation, the upper 12 bits "110110110100" represent DST0.

In FIG. 7, all 16 bits of TMP0 are 1. Specifically, TMP0 is fixed at the maximum value that can be represented with the same number of bits as Addr_DMEM. In binary notation, TMP0 is “1111111111111111”.

The bit left shifter 402 shifts TMP0 to the left by the number of bit positions given by SRC1, which replaces the lower 4 bits of TMP0 with 0. As a result, the output TMP1 of the bit left shifter 402 is “1111111111110000” (see FIG. 8).

The inverter 403 inverts every bit of TMP1 and outputs the result as TMP2 (see FIG. 8). As a result, the output TMP2 of the inverter 403 is “0000000000001111”: the lower 4 bits are 1 and the upper 12 bits are 0.

The AND block 404 then calculates the logical product (bitwise AND) of SRC0 and TMP2 and outputs it as DST1 (see FIG. 8). At this point, the lower 4 bits of TMP2 are 1 and the upper 12 bits are 0, so the AND block 404 selects the lower 4 bits of SRC0. That is, the output DST1 of the AND block 404 equals the value of the lower 4 bits of SRC0. DST1 corresponds to POS_IMEM.

In this way, Addr_DMEM can be divided into two parts.
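The four blocks of FIG. 7 amount to a shift-and-mask split of the address. The following Python sketch mirrors that data flow under the 16-bit assumption used above; the function name `split_address` and the parameters `shift_bits` and `width` are illustrative assumptions, not names from the specification.

```python
def split_address(addr_dmem: int, shift_bits: int = 4, width: int = 16):
    """Split a global address (SRC0) into (DST0, DST1) = (Addr_IMEM, POS_IMEM).

    Hypothetical software model of the dividing unit 122; in hardware the
    four blocks operate on fixed-width signals.
    """
    all_ones = (1 << width) - 1                    # TMP0: every bit set
    tmp1 = (all_ones << shift_bits) & all_ones     # TMP1: left shifter 402, kept to `width` bits
    tmp2 = ~tmp1 & all_ones                        # TMP2: inverter 403, low `shift_bits` bits set
    dst0 = (addr_dmem & all_ones) >> shift_bits    # DST0: right shifter 401
    dst1 = addr_dmem & tmp2                        # DST1: AND block 404
    return dst0, dst1


addr_imem, pos_imem = split_address(0b1101101101001101)
print(f"{addr_imem:012b} {pos_imem:04b}")  # prints "110110110100 1101"
```

With the example value from the text, the upper 12 bits form Addr_IMEM and the lower 4 bits give POS_IMEM = 13.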

These values can also be used to obtain the shift distance “shift”. Each PE 101 calculates the shift distance “shift”, which defines the number of shifts on the ring bus. The shift distance “shift” is an integer representing the distance between the positions POS_own and POS_IMEM.

Here, the PE 101 that requests the read-only parameter, that is, the access-destination PE 101, is the PE itself, and its position is denoted POS_own. The position of the IMEM 107 that holds the requested read-only parameter, that is, the access-source IMEM, is denoted POS_IMEM.

Because the positions POS_own and POS_IMEM lie on the ring bus 102, they are represented by natural numbers such as “00” to “15”, as shown in FIG. 1. For example, the subscript attached to each PE in FIG. 1 represents its position.

POS_own is calculated by a modulo operation on the PE's own number PE_own and PE_PER_GROUP. The modulo operation by PE_PER_GROUP is needed in the general case, for example when the number of available PEs in the architecture, NO_OF_PE, is not equal to the number of PEs 101 in a group, PE_PER_GROUP. When the two numbers are equal, the modulo operation for calculating POS_own can be omitted; that is, POS_own is equal to PE_own.

The shift distance “shift” corresponds to the number of data transfers needed for the read-only parameter to reach POS_own on the ring bus 102 or 103. The shift distance “shift” can therefore be calculated by subtracting POS_IMEM from POS_own.

The shift distance “shift” is a signed integer corresponding to the number of data transfers needed for the data (the read-only parameter) to travel from POS_IMEM to POS_own. For example, when POS_own = 4 and POS_IMEM = 6, the shift distance “shift” is −2. When POS_own = 6 and POS_IMEM = 3, the shift distance “shift” is +3.

The shift distance “shift” is calculated in parallel by the plurality of PEs 101. Addr_DMEM and PE_PER_GROUP are sent from the CP 100 to each PE 101, and each PE 101 holds its POS_own in advance.

各シフト距離「シフト」は以下の式によって計算される。 Each shift distance “shift” is calculated by the following equation.

“shift” = POS_own − POS_IMEM
        = (PE_own % PE_PER_GROUP) − (Addr_DMEM % PE_PER_GROUP) ... (3)

ここで、「％」はモジュロ演算を意味している。 Here, “%” means a modulo operation.

As expressed by equation (3) above, the shift distance “shift” is calculated from the distance between POS_own and POS_IMEM. The absolute value of the shift distance “shift” defines the number of shifts needed to acquire the data, and its sign defines the direction of the shift.

That is, whether the data (the read-only parameter) is acquired from the ring bus 102 or from the ring bus 103 is determined by whether the sign of the shift distance “shift” is positive or negative. For example, when the sign is positive, the data is acquired from the ring bus 102, and when the sign is negative, the data is acquired from the ring bus 103.
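Equation (3) and the sign rule can be written out as a small Python sketch. The function names `shift_distance` and `select_bus` are illustrative assumptions; treating a distance of zero as "no ring transfer needed" follows from the definition of the shift distance as a number of transfers.

```python
def shift_distance(pe_own: int, addr_dmem: int, pe_per_group: int = 16) -> int:
    """Signed shift distance per equation (3)."""
    pos_own = pe_own % pe_per_group       # position of the requesting PE
    pos_imem = addr_dmem % pe_per_group   # position of the IMEM holding the data
    return pos_own - pos_imem


def select_bus(shift: int) -> str:
    """Map the sign of the shift distance to a ring bus (hypothetical helper)."""
    if shift == 0:
        return "own IMEM"                 # zero transfers: data is already local
    return "ring bus 102" if shift > 0 else "ring bus 103"


# The two examples from the text, using addresses whose low bits are 6 and 3.
print(shift_distance(4, 6), select_bus(shift_distance(4, 6)))  # prints "-2 ring bus 103"
print(shift_distance(6, 3), select_bus(shift_distance(6, 3)))  # prints "3 ring bus 102"
```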

Next, the structure of the cmpmv unit 123 is described with reference to FIG. 9. FIG. 9 is a block diagram showing the structure of the cmpmv unit 123 in each PE 101. The cmpmv unit 123 compares its input values and performs a transfer according to the comparison result.

リングバス１０２及び１０３上のシフトの回数がＳＲＣ２として入力される。ＳＲＣ２は符号なしの値、すなわち、正の値である。また、予め計算されたシフト距離「シフト」がＳＲＣ３として入力される。 The number of shifts on the ring buses 102 and 103 is input as SRC2. SRC2 is an unsigned value, that is, a positive value. Further, the shift distance “shift” calculated in advance is input as SRC3.

なお、シフト距離「シフト」は符号付きの値である。すなわち、シフト距離「シフト」の最上位ビット（ＭＳＢ）は符号を表す。例えば、シフト距離「シフト」の最上位ビットが１の場合、シフト距離「シフト」は負であり、最上位ビットが０の場合、シフト距離「シフト」は正である。すなわち、シフト距離「シフト」の最上位ビットは符号を表す符号ビットである。なお、シフト距離「シフト」は式（３）に基づいて各ＰＥ１０１により計算される。 The shift distance “shift” is a signed value. That is, the most significant bit (MSB) of the shift distance “shift” represents a sign. For example, when the most significant bit of the shift distance “shift” is 1, the shift distance “shift” is negative, and when the most significant bit is 0, the shift distance “shift” is positive. That is, the most significant bit of the shift distance “shift” is a sign bit representing a sign. The shift distance “shift” is calculated by each PE 101 based on Expression (3).

加算／減算部５０１は符号なしＳＲＣ２と符号ありＳＲＣ３の加算／減算を行う。この処理のために、ＳＲＣ３の符号ビットはインバータ５０２に入力される。インバータ５０２はＳＲＣ３の符号ビットを反転する。ＳＲＣ３の符号ビットが反転され、モード信号「モード」として出力される（図１０を参照）。反転されたビットは加算／減算部のモードを決定するモード信号「モード」となる。インバータ５０２は反転されたビットをモード信号「モード」として加算／減算部５０１に出力する。 The addition / subtraction unit 501 performs addition / subtraction of unsigned SRC2 and signed SRC3. For this process, the sign bit of SRC3 is input to inverter 502. Inverter 502 inverts the sign bit of SRC3. The sign bit of SRC3 is inverted and output as the mode signal “mode” (see FIG. 10). The inverted bit becomes a mode signal “mode” that determines the mode of the addition / subtraction unit. The inverter 502 outputs the inverted bit to the addition / subtraction unit 501 as the mode signal “mode”.

上述したように、シフト距離「シフト」が負の場合、符号ビットの値は１である。この場合、インバータ５０２は反転ビットの値を０に設定する。反転されたビットの値が０の場合、加算／減算部５０１は加算モードに移行する。すなわち、加算／減算部５０１はＳＲＣ２とＳＲＣ３との和を計算する。 As described above, when the shift distance “shift” is negative, the value of the sign bit is 1. In this case, the inverter 502 sets the value of the inversion bit to 0. When the value of the inverted bit is 0, the addition / subtraction unit 501 shifts to the addition mode. That is, the addition / subtraction unit 501 calculates the sum of SRC2 and SRC3.

一方、シフト距離「シフト」が正の場合、符号ビットの値は０である。この場合、インバータ５０２は反転ビットの値を１に設定する。そして、インバータ５０２は反転されたビットを加算／減算部５０１に出力する。反転されたビットの値が１の場合、加算／減算部５０１は減算モードに移行し、ＳＲＣ２とＳＲＣ３との差を計算する。すなわち、加算又は減算が実行され、ＴＭＰ３が出力される（図１０を参照）。 On the other hand, when the shift distance “shift” is positive, the value of the sign bit is 0. In this case, the inverter 502 sets the value of the inversion bit to 1. Then, the inverter 502 outputs the inverted bit to the addition / subtraction unit 501. When the value of the inverted bit is 1, the addition / subtraction unit 501 shifts to the subtraction mode and calculates the difference between SRC2 and SRC3. That is, addition or subtraction is executed and TMP3 is output (see FIG. 10).

上述したように、インバータ５０２は、モードを切り替える加算／減算部５０１のために使用される。詳細には、インバータ５０２はシフト距離「シフト」の符号ビットを受信する。そして、加算／減算部５０１はシフト距離「シフト」の符号、すなわち、最上位ビットＭＳＢにしたがって加算モードと減算モードとの間の切り換えを行う。すなわち、加算／減算部５０１はインバータ５０２の出力にしたがってモードを切り替えながら、加算モード及び減算モードを実行する。すなわち、加算／減算部５０１は排他的に加算又は減算を行う。したがって、加算／減算部５０１はＳＲＣ２とＳＲＣ３との和又は差をＴＭＰ３として出力する。 As described above, the inverter 502 is used for the addition / subtraction unit 501 for switching modes. Specifically, inverter 502 receives the sign bit of the shift distance “shift”. Then, the addition / subtraction unit 501 switches between the addition mode and the subtraction mode according to the sign of the shift distance “shift”, that is, the most significant bit MSB. That is, the addition / subtraction unit 501 executes the addition mode and the subtraction mode while switching the mode according to the output of the inverter 502. That is, the addition / subtraction unit 501 performs addition or subtraction exclusively. Therefore, the addition / subtraction unit 501 outputs the sum or difference of SRC2 and SRC3 as TMP3.

ＳＲＣ２とＳＲＣ３との和又は差はＴＭＰ３として判定部５０３に入力される。判定部５０３はＴＭＰ３が０であるかどうかを判定する。ＳＲＣ２とＳＲＣ３の絶対値が互いに等しい場合、ＴＭＰ３は０になる。詳細には、ＴＭＰ３の全てのビット値が０である場合、ＴＭＰ３は０となる。そして、ＴＭＰ３が０である場合、判定部５０３は、ＴＭＰ３が０であることを示す信号ＤＳＴ２を出力する。例えば、ＴＭＰ３＝０のときＤＳＴ２＝１となり、ＴＭＰ３が０以外の値であるときＤＳＴ２＝０となる。すなわち、ＴＭＰ３が０であるかどうかが決定され、ＤＳＴ２が出力される（図１０を参照）。このように、判定部５０３から、ＴＭＰ３が０であるかどうかを示す信号ＤＳＴ２が出力される。 The sum or difference between SRC2 and SRC3 is input to determination section 503 as TMP3. The determination unit 503 determines whether TMP3 is zero. If the absolute values of SRC2 and SRC3 are equal to each other, TMP3 is zero. Specifically, when all the bit values of TMP3 are 0, TMP3 is 0. When TMP3 is 0, the determination unit 503 outputs a signal DST2 indicating that TMP3 is 0. For example, DST2 = 1 when TMP3 = 0, and DST2 = 0 when TMP3 is a value other than zero. That is, it is determined whether TMP3 is 0, and DST2 is output (see FIG. 10). As described above, the determination unit 503 outputs the signal DST2 indicating whether TMP3 is 0 or not.

ＰＥ１０１はＤＳＴ２＝1への応答でリングバス１０２又は１０３から読み出し専用パラメータのデータを取得する。すなわち、読み出し専用パラメータを取得するタイミングが決定される。 The PE 101 acquires read-only parameter data from the ring bus 102 or 103 in response to DST2 = 1. That is, the timing for acquiring the read-only parameter is determined.

Next, the process by which the PE 101 determines from which of the ring buses 102 and 103 it should acquire the read-only parameter is described. For this process, SRC4 and SRC5 are input to the multiplexer 504. The multiplexer 504 also receives the sign bit of SRC3 via the input line “CTRL”.

The value of SRC4 is the current value on the clockwise ring bus 102, and the value of SRC5 is the current value on the counterclockwise ring bus 103. When the input line CTRL of the multiplexer 504 is 0, SRC4 passes through the multiplexer 504; when CTRL is 1, SRC5 passes through. In other words, according to the sign bit of SRC3, the multiplexer 504 determines the ring bus from which PE_own should take the read-only parameter (see FIG. 10).

例えば、ＳＲＣ３の符号が正である場合、ＳＲＣ４の値がＤＳＴ３として出力される。この場合、時計回りリングバス１０２が選択されたことになる。一方、ＳＲＣ３の符号が負である場合、ＳＲＣ５の値がＤＳＴ３として出力される。この場合、反時計回りリングバス１０３が選択されたことになる。 For example, when the sign of SRC3 is positive, the value of SRC4 is output as DST3. In this case, the clockwise ring bus 102 is selected. On the other hand, when the sign of SRC3 is negative, the value of SRC5 is output as DST3. In this case, the counterclockwise ring bus 103 is selected.

そして、ＤＳＴ２が１である場合、ＰＥ１０１は選択されたリングバスから読み出し専用パラメータを取得する。 When DST2 is 1, the PE 101 acquires the read-only parameter from the selected ring bus.
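The behaviour of the cmpmv unit 123 described above can be modelled in a few lines of Python. This is a sketch under stated assumptions: SRC3 is represented as an ordinary signed Python integer rather than a fixed-width word with an explicit sign bit, and the function name `cmpmv` and its return convention are illustrative.

```python
def cmpmv(src2: int, src3: int, src4, src5):
    """Model of the cmpmv unit 123.

    src2: unsigned number of shifts performed so far on the ring buses
    src3: signed, pre-calculated shift distance
    src4: current value on the clockwise ring bus 102
    src5: current value on the counterclockwise ring bus 103
    Returns (DST2, DST3).
    """
    sign_bit = 1 if src3 < 0 else 0       # MSB of the shift distance
    mode = sign_bit ^ 1                   # inverter 502: 0 = add, 1 = subtract
    # Adder/subtractor 501: the result is zero exactly when src2 == |src3|.
    tmp3 = src2 - src3 if mode == 1 else src2 + src3
    dst2 = 1 if tmp3 == 0 else 0          # decision unit 503
    dst3 = src4 if sign_bit == 0 else src5  # multiplexer 504, CTRL = sign bit
    return dst2, dst3


# A PE with shift distance -2 takes the value from ring bus 103 after two shifts.
print(cmpmv(1, -2, "cw", "ccw"))  # prints "(0, 'ccw')"
print(cmpmv(2, -2, "cw", "ccw"))  # prints "(1, 'ccw')"
```

The PE latches DST3 only when DST2 is 1, so the value selected while DST2 is 0 is simply ignored.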

図１１を参照して、分割部１２２及びｃｍｐｍｖ部１２３によって実行される処理動作を詳細に説明する。なお、以下の例は、全てのＰＥ１０１が並列処理において単一かつ同一読み出し専用パラメータを使用するもとして説明する。そのようなケースは非ブロック化フィルタを使用する画像処理等で発生する。 With reference to FIG. 11, the processing operation performed by the dividing unit 122 and the cmpmv unit 123 will be described in detail. The following example will be described assuming that all PEs 101 use a single and the same read-only parameter in parallel processing. Such a case occurs in image processing using a deblocking filter.

図１１は各ＰＥ１０１におけるデータ処理方法を示しているフローチャートである。すなわち、図１１に示されているデータ処理は各ＰＥ１０１で実行される。 FIG. 11 is a flowchart showing a data processing method in each PE 101. That is, the data processing shown in FIG. 11 is executed by each PE 101.

The CP 100 transfers to each PE 101 the address of the read-only parameter that is held in the DMEM 106 and is needed for the parallel processing. For example, when deblocking filter processing is executed in SIMD mode, the CP 100 transfers the Addr_DMEM of the read-only parameter needed for the parallel processing together with PE_PER_GROUP.

Then, the dividing unit 122 of each PE 101 calculates the Addr_IMEM of the read-only parameter (step S101). That is, each PE 101 obtains Addr_IMEM from Addr_DMEM and PE_PER_GROUP according to equation (1) above.

Next, the position on the ring buses 102 and 103 of the IMEM 107 that holds the required read-only parameter is calculated (step S102). That is, each PE 101 calculates POS_IMEM. As described above, POS_IMEM is calculated by a modulo operation on Addr_DMEM and PE_PER_GROUP.

Steps S101 and S102 are executed by the dividing unit 122. The processing that outputs DST0 in FIG. 7 corresponds to step S101, and the processing that outputs DST1 in FIG. 7 corresponds to step S102.

Next, each PE 101 calculates the shift distance “shift” (step S103).

“shift” = POS_own − POS_IMEM
        = (PE_own % PE_PER_GROUP) − (Addr_DMEM % PE_PER_GROUP) ... (3)

Next, each PE 101 transfers the address (Addr_IMEM) and a control signal to the IMEM 107 (step S104). Each PE 101 sends its IMEM 107 an instruction to fetch the read-only parameter corresponding to Addr_IMEM.

Then, the output of each IMEM 107 is sent to both ring buses 102 and 103 (step S105). Specifically, the PE 101 receives from its IMEM 107 the read-only parameter stored at the location Addr_IMEM in that IMEM 107 and transfers it to the ring buses 102 and 103.

次に、予め計算されたシフト距離「シフト」が０であるかどうかが判定される（ステップＳ１０６）。すなわち、各ＰＥ１０１は、それ自身のＩＭＥＭ１０７に読み出し専用パラメータが格納されているかどうかを判定する。予め計算されたシフト距離「シフト」が０である場合（ステップＳ１０６でＹＥＳ）、ＰＥ１０１はそれ自身のＩＭＥＭ１０７の出力を受け取る（ステップＳ１０７）。 Next, it is determined whether or not the pre-calculated shift distance “shift” is 0 (step S106). That is, each PE 101 determines whether a read-only parameter is stored in its own IMEM 107. If the pre-calculated shift distance “shift” is 0 (YES in step S106), the PE 101 receives the output of its own IMEM 107 (step S107).

詳細には、ＰＥ１０１はＰＥ１０１に対応するＩＭＥＭ１０７に格納されている読み出し専用パラメータを取得する。もちろん、読み出し専用パラメータはシフトレジスタ１０５から取得されてもいいし、又はＩＭＥＭ１０７から取得されてもよい。すなわち、シフト距離「シフト」が０に等しいＰＥ１０１については、読み出し専用パラメータはシフトされる前に取得される。そして、シフト距離「シフト」が０に等しいＰＥ１０１については、読み出し専用パラメータを取得するための処理は終了する（ステップＳ１０８）。 Specifically, the PE 101 acquires a read-only parameter stored in the IMEM 107 corresponding to the PE 101. Of course, the read-only parameter may be acquired from the shift register 105 or may be acquired from the IMEM 107. That is, for the PE 101 whose shift distance “shift” is equal to 0, the read-only parameter is acquired before being shifted. Then, for the PE 101 whose shift distance “shift” is equal to 0, the process for acquiring the read-only parameter ends (step S108).

予め計算されたシフト距離「シフト」が０でない場合（ステップＳ１０６でＮＯ）、読み出し専用パラメータはリングバス上でシフトされる。ｃｍｐｍｖ部１２３はリングバス１０２及び１０３上のシフト回数をシフト距離「シフト」の絶対値と比較する（ステップＳ１０９）。リングバス１０２及び１０３上のシフト回数がシフト距離「シフト」の絶対値より小さい場合、（ステップＳ１０９でＮＯ）、読み出し専用パラメータは再度、シフトされる。すなわち、読み出し専用パラメータは、リングバス１０２及び１０３上で行われたシフト回数が予め計算されたシフト距離「シフト」の絶対値に等しくなるまで繰り返しシフトされる。 If the pre-calculated shift distance “shift” is not 0 (NO in step S106), the read-only parameter is shifted on the ring bus. The cmpmv unit 123 compares the number of shifts on the ring buses 102 and 103 with the absolute value of the shift distance “shift” (step S109). If the number of shifts on the ring buses 102 and 103 is smaller than the absolute value of the shift distance “shift” (NO in step S109), the read-only parameter is shifted again. That is, the read-only parameter is repeatedly shifted until the number of shifts performed on the ring buses 102 and 103 becomes equal to the absolute value of the shift distance “shift” calculated in advance.

そして、シフト距離「シフト」がリングバス上のシフト回数と等しくなったとき（ステップＳ１０９でＹＥＳ）、シフト距離「シフト」が０より大きいかどうかを判定する。すなわち、シフト距離「シフト」の符号を判定する。 When the shift distance “shift” becomes equal to the number of shifts on the ring bus (YES in step S109), it is determined whether the shift distance “shift” is greater than zero. That is, the sign of the shift distance “shift” is determined.

符号が負である場合（ステップＳ１１０でＮＯ）、反時計回りリングバス１０３から読み出し専用パラメータのデータが取得される（ステップＳ１１１）。符号が正である場合（ステップＳ１１０でＹＥＳ）、時計回りリングバス１０２から読み出し専用パラメータのデータが取得される（ステップＳ１１２）。 If the sign is negative (NO in step S110), read-only parameter data is acquired from the counterclockwise ring bus 103 (step S111). If the sign is positive (YES in step S110), read-only parameter data is acquired from the clockwise ring bus 102 (step S112).

ここで、ステップＳ１０９〜Ｓ１１２はｃｍｐｍｖ部１２３によって実行される。図９に示されているＤＳＴ２を出力するステップを含む処理はステップＳ１０９に対応する。図９に示されているＤＳＴ３を出力するステップを含む処理はステップＳ１１０〜Ｓ１１２に対応する。 Here, steps S109 to S112 are executed by the cmpmv unit 123. The process including the step of outputting DST2 shown in FIG. 9 corresponds to step S109. The process including the step of outputting DST3 shown in FIG. 9 corresponds to steps S110 to S112.

上述の方法により、読み出し専用パラメータがリングバス１０２及び１０３を介して転送される。そして、各ＰＥ１０１は処理のために必要な読み出し専用パラメータを取得する。取得された読み出し専用パラメータは各ＰＥ１０１に組み込まれているレジスタに格納される。そして、各ＰＥ１０１は読み出し専用パラメータを使用して処理（例えば、非ブロック化フィルタ処理）を実行する。当然のことながら、各ＰＥ１０１はＳＩＭＤモードで処理を実行する。 Read-only parameters are transferred via the ring buses 102 and 103 by the method described above. Each PE 101 acquires a read-only parameter necessary for processing. The acquired read-only parameter is stored in a register incorporated in each PE 101. Each PE 101 performs processing (for example, deblocking filter processing) using the read-only parameter. As a matter of course, each PE 101 executes processing in the SIMD mode.

Next, the processing executed in the CP 100 is described with reference to FIG. 12. FIG. 12 shows the processing that the CP 100 executes to control the shift operation of the ring buses. First, the CP 100 determines whether all PEs 101 have already finished acquiring the read-only parameter (step S201). If all PEs 101 have already acquired the read-only parameter (YES in step S201), the processing in the CP 100 ends.

少なくともいずれかのＰＥ１０１が読み出し専用パラメータの取得を完了していない場合（ステップＳ２０１でＮＯ）、ＣＰ１００はリングバス１０２及び１０３上で読み出し専用パラメータを１回シフトする（ステップＳ２０２）。さらに、シフト回数をカウントしているシフトカウンタを１つ増大させる（ステップＳ２０３）。そして、ステップＳ２０１に戻り、全てのＰＥ１０１が読み出し専用パラメータの取得を完了するまで同様な処理を繰り返す。 If at least one of the PEs 101 has not acquired the read-only parameter (NO in step S201), the CP 100 shifts the read-only parameter once on the ring buses 102 and 103 (step S202). Further, the shift counter that counts the number of shifts is incremented by one (step S203). Then, the process returns to step S201, and the same processing is repeated until all the PEs 101 complete the acquisition of the read-only parameter.
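The interplay of the per-PE flow (steps S103 to S112) and the CP loop (steps S201 to S203) can be illustrated with a small Python simulation for the case in which all PEs read the same global address. This is a sketch under assumptions that are not stated in the text: the clockwise ring bus 102 is taken to move data from position i to position i + 1, the counterclockwise ring bus 103 from position i + 1 to position i, and NO_OF_PE equals PE_PER_GROUP. The function name `broadcast_read` is hypothetical.

```python
def broadcast_read(imem_outputs, addr_dmem):
    """Simulate all PEs fetching the parameter at one global address.

    imem_outputs[p] is the value that the IMEM at ring position p outputs
    for Addr_IMEM (step S105). Returns (values acquired per PE, shift count).
    """
    n = len(imem_outputs)
    pos_imem = addr_dmem % n
    shifts = [pe - pos_imem for pe in range(n)]   # equation (3), PE_own == POS_own
    cw = list(imem_outputs)                       # contents of ring bus 102
    ccw = list(imem_outputs)                      # contents of ring bus 103
    acquired = [None] * n
    done = [s == 0 for s in shifts]               # S106/S107: local hit
    for pe in range(n):
        if done[pe]:
            acquired[pe] = imem_outputs[pe]
    count = 0                                     # shift counter kept by the CP
    while not all(done):                          # S201
        cw = [cw[-1]] + cw[:-1]                   # S202: position i receives from i - 1
        ccw = ccw[1:] + [ccw[0]]                  #       position i receives from i + 1
        count += 1                                # S203
        for pe in range(n):
            if not done[pe] and count == abs(shifts[pe]):            # S109
                acquired[pe] = cw[pe] if shifts[pe] > 0 else ccw[pe]  # S110 to S112
                done[pe] = True
    return acquired, count


values = [100 + p for p in range(16)]             # made-up IMEM outputs
got, n_shifts = broadcast_read(values, 0x35)      # 0x35 % 16 == 5
print(got[0], got[15], n_shifts)                  # prints "105 105 10"
```

Note that this sketch applies equation (3) literally, so the distance is not wrapped around the ring and the loop runs for as many shifts as the largest absolute distance.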

次に、本実施形態の効果について説明する。 Next, the effect of this embodiment will be described.

(1) As in Patent Document 2, the read-only parameters are distributed over and stored in a PE group of 16 PEs, but they are stored in a different manner from Patent Document 2, and multiple PEs read them simultaneously using the same global address. This configuration eliminates the need to transfer address information between the PEs 101; that is, the position information of a read-only parameter does not have to be transferred between the PEs 101. Because each PE 101 is given the exact position information in advance, each PE 101 knows which PE 101 holds the required read-only parameter. The Addr_IMEM of the read-only parameter is calculated by each PE, and the distance between the PE 101 requesting the read-only parameter and the PE 101 holding it can be calculated in advance, in parallel, by the PEs 101. As a result, the efficiency of data processing is dramatically improved.

（２）読み出し専用パラメータがＩＭＥＭ１０７に分散されて格納されている場合であっても、アクセスのために必要な処理時間を短縮することができる。反対向きの転送方向を有する２つのリングバス１０２及び１０３がＰＥ１０１に接続されており、それによって処理時間を約半分に短縮することができる。すなわち、シフト回数の最大値をＰＥ１０１の数の半分に減らすことができる。したがって、図１に示されている例において、全てのＰＥ１０１が必要な読み出し専用パラメータを取得するためにリングバスは最大でも８回シフトされればよい。 (2) Even when read-only parameters are distributed and stored in the IMEM 107, the processing time required for access can be shortened. Two ring buses 102 and 103 having opposite transfer directions are connected to the PE 101, so that the processing time can be reduced to about half. That is, the maximum value of the number of shifts can be reduced to half of the number of PEs 101. Therefore, in the example shown in FIG. 1, the ring bus only needs to be shifted at most eight times in order to obtain the read-only parameters that all PEs 101 need.

（３）上述した方法により、他のＩＭＥＭ１０７に格納されているデータを使用して算術処理を行うことができる。すなわち、複数のＰＥ１０１が処理を実行するために必要な読み出し専用パラメータを他のＩＭＥＭ１０７に格納することができる。また、ＤＭＥＭ１０６の読み出し専用パラメータデータを複数のＩＭＥＭ１０７に分散して格納することができる。結果として、ＩＭＥＭ１０７の容量を減少させることができる。 (3) By the method described above, arithmetic processing can be performed using data stored in another IMEM 107. That is, read-only parameters necessary for the processing by the plurality of PEs 101 can be stored in the other IMEM 107. Further, the read-only parameter data of the DMEM 106 can be distributed and stored in the plurality of IMEMs 107. As a result, the capacity of the IMEM 107 can be reduced.

（４）分割部１２２の使用は１クロックサイクルでの分割処理を可能にする。図７に示されている分割部１２２の各機能部は１クロックサイクルで単一の動作として実行される。したがって、図８に示されているように、この新規の機能部は必要なクロックサイクルを４サイクルから１サイクルに短縮させることができる。分割部１２２の４つの機能が中間信号を遅れさせるバッファやレジスタを使用せずに、同一のクロックサイクルで処理されるという理由により、このクロックサイクルの短縮が実現される。 (4) Use of the division unit 122 enables division processing in one clock cycle. Each functional unit of the dividing unit 122 shown in FIG. 7 is executed as a single operation in one clock cycle. Therefore, as shown in FIG. 8, the new functional unit can shorten the required clock cycle from 4 cycles to 1 cycle. This shortening of the clock cycle is realized because the four functions of the dividing unit 122 are processed in the same clock cycle without using a buffer or a register for delaying the intermediate signal.

（５）図９に示されているｃｍｐｍｖ部１２３の各機能部も１クロックサイクルで単一の動作として実行される。したがって、図１０に示されているように、この新規の機能部は必要なクロックサイクルを４サイクルから１サイクルに短縮させることができる。ｃｍｐｍｖ部１２３の４つの機能が中間信号を遅れさせるバッファやレジスタを使用せずに、同一のクロックサイクルで処理されるという理由により、このクロックサイクルの短縮が実現される。 (5) Each functional unit of the cmpmv unit 123 shown in FIG. 9 is also executed as a single operation in one clock cycle. Therefore, as shown in FIG. 10, the new functional unit can shorten the required clock cycle from 4 cycles to 1 cycle. This shortening of the clock cycle is realized because the four functions of the cmpmv unit 123 are processed in the same clock cycle without using a buffer or a register for delaying the intermediate signal.
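Effect (3) above relies on the table in the DMEM 106 being spread over the IMEMs 107 so that the split of Addr_DMEM into Addr_IMEM and POS_IMEM finds each entry again. The Python sketch below shows one layout that is consistent with that split: the entry at global address A is placed in the IMEM at position A % PE_PER_GROUP, at local address A // PE_PER_GROUP. The function names `distribute` and `lookup` are illustrative assumptions, and the layout is inferred from the address split rather than stated explicitly in the text.

```python
def distribute(dmem_table, pe_per_group: int = 16):
    """Spread a DMEM table over pe_per_group IMEMs, interleaved by address."""
    imems = [[] for _ in range(pe_per_group)]
    for addr, value in enumerate(dmem_table):
        # Entries are appended in address order, so the entry lands at
        # local index addr // pe_per_group of IMEM number addr % pe_per_group.
        imems[addr % pe_per_group].append(value)
    return imems


def lookup(imems, addr_dmem: int):
    """Find an entry again from its global address."""
    n = len(imems)
    pos_imem = addr_dmem % n        # which IMEM on the ring
    addr_imem = addr_dmem // n      # local address inside that IMEM
    return imems[pos_imem][addr_imem]


table = list(range(1000, 1048))     # 48 made-up parameter entries
imems = distribute(table)
print(max(len(m) for m in imems))   # prints "3"
print(lookup(imems, 37))            # prints "1037"
```

In this example each IMEM holds 3 entries instead of all 48, which is the capacity reduction that effect (3) refers to.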

(Embodiment 2)
The data processing device that executes the single instruction multiple data (SIMD) processing described above can preferably be applied to a parallel image processor. A case in which the above architecture is used for an H.264 deblocking filter is described below.

FIG. 13 is a block diagram showing the decode loop 208 of an H.264 video decoder. The H.264 deblocking filter 201 is a closed-loop filter that operates within the decode loop 208 together with the inter prediction unit 203 and the intra prediction unit 205. The deblocking filter 201 is used as a low-pass filter (LPF).

The decode loop 208 further includes an adder 207, a selector 206, a reference frame memory 204, and a real frame memory 202. The adder 207 adds the error signal 200 to the reconstructed pixel values of the image decoded in the decode loop of the H.264 decoder. The decoder uses two techniques to decode an image: intra prediction and inter prediction. In inter prediction, pixel values of already decoded frames are used to decode the image. In intra prediction, by contrast, the data of already decoded neighboring macroblocks in the real frame is used to decode the macroblock currently being processed.

The choice between intra prediction and inter prediction is made in the H.264 video encoder. Together with the error signal, a signal for selecting either intra prediction or inter prediction is transferred to the H.264 decoder as side information in the H.264 stream. The real frame memory 202 is a frame memory for storing the real frame. The reference frame memory 204 is a memory for storing reference frames used in inter prediction. When encoding at a high compression ratio, the deblocking filter 201 mitigates the lossy, block-related degradation of the decoded image.

Here, macroblocks in the H.264 deblocking filter 201 are described with reference to FIG. 14. FIG. 14 is a diagram showing a macroblock.

For the deblocking filter 201, two pixels 303 that describe the same image content but lie in two different macroblocks 300 or sub-blocks 301 end up, after independent prediction and coding, with different decoded values on either side of the block boundary 302. The deblocking filter 201 reduces the difference between such decoded values according to an estimate of the magnitude of the difference.

この差は量子化によって生じているので、この差の大きさは量子化ノイズに関係している。それゆえ、２つのパラメータ「ａ」及び「Ｃ０」が導入される。パラメータ「ａ」及び「Ｃ０」は量子化ステップの大きさに比例し、かつノイズ分散の平方根に比例する。さらに、第３のパラメータ「β」が導入される。これら全てのパラメータはブロックエッジへの、フィルタの容認可能な影響を決定する。パラメータ「ａ」及び「Ｃ０」がブロックの大きさに関係するのに対し、パラメータ「β」はブロック境界３０２の近傍の信号の平坦性に関係し、したがって可視度に関係する。 Since this difference is caused by quantization, the magnitude of this difference is related to quantization noise. Therefore, two parameters “a” and “C0” are introduced. The parameters “a” and “C0” are proportional to the magnitude of the quantization step and proportional to the square root of the noise variance. Furthermore, a third parameter “β” is introduced. All these parameters determine the acceptable effect of the filter on the block edge. The parameters “a” and “C0” are related to the block size, whereas the parameter “β” is related to the flatness of the signal in the vicinity of the block boundary 302 and is therefore related to the visibility.

The deblocking filter is described here for the luminance component. As shown in FIG. 14, a single macroblock 300 contains 16 × 16 pixels 303. Sixteen filter operations are performed on a single edge 302 of the macroblock. FIG. 14 shows the macroblock structure used in the deblocking filter processing of an H.264 video decoder.

Each macroblock 300 is further divided into 16 sub-blocks 301. A single sub-block 301 contains 4 × 4 pixels 303. Each edge 302 runs between two adjacent sub-blocks 301. Processing one edge requires a total of eight pixels, four on each side of the edge.

When these 16 filter operations are mapped onto the 16 (NO_OF_PE) PEs 101 shown in FIG. 1, all 16 filter operations are processed in parallel by a single PE group (PE_PER_GROUP = NO_OF_PE = 16 PEs). In addition to the image data itself, the deblocking filter processing requires a table of read-only parameters (a, β, C0). Furthermore, for each edge, an address equal to the index into the table is required.
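
The mapping of the 16 filter operations of one edge onto 16 PEs can be sketched as follows. This is only a minimal illustration, assuming a vertical edge inside a single 16 × 16 macroblock and one pixel row per PE. The names `MB_SIZE`, `PIXELS_PER_SIDE`, and `pixels_for_pe` are hypothetical and do not appear in the specification.

```python
MB_SIZE = 16          # a macroblock is 16 x 16 pixels
PIXELS_PER_SIDE = 4   # four pixels on each side of an edge
NO_OF_PE = 16         # one PE per filter operation along the edge


def pixels_for_pe(pe_index, edge_x):
    """Return the (row, column) coordinates of the 8 pixels one PE filters.

    Assumptions of this sketch: the edge is vertical, lies inside the
    macroblock at column edge_x (4, 8 or 12), and PE number pe_index
    handles pixel row pe_index.
    """
    if not 0 <= pe_index < NO_OF_PE:
        raise ValueError("pe_index out of range")
    if edge_x not in (4, 8, 12):
        raise ValueError("only internal vertical edges are modelled")
    cols = range(edge_x - PIXELS_PER_SIDE, edge_x + PIXELS_PER_SIDE)
    return [(pe_index, c) for c in cols]
```

Under these assumptions, each of the 16 PEs works on its own row of eight pixels, so the 16 filter operations of the edge do not depend on one another and can run in a single PE group.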

For example, the read-only parameters a, β, and C0 required for the deblocking filter processing are transferred from the DMEM 106 and stored in a distributed manner across all the IMEMs of the PE group. When the data is decoded using intra prediction, the same read-only parameters will be read by all the PEs 101. Specifically, in the deblocking filter processing, the PEs 101 execute parallel processing by reading parameters of the same value. In this case, the CP 100 sends an instruction to read the same parameter set, and all 16 PEs 101 then read parameters of the same value and execute the processing in parallel. The above example has described a data processing method in which all the PEs 101 read parameters of the same value.
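
A distributed layout of the parameter table can be sketched as follows. This is a hedged illustration, not the layout fixed by the specification: it assumes that table entry i is held by the PE at ring position i mod NO_OF_PE, at local IMEM address i div NO_OF_PE, which is consistent with the upper/lower address split recited in the claims. The function names `distribute` and `locate` are hypothetical.

```python
def distribute(table, no_of_pe):
    """Spread a DMEM table over no_of_pe internal memories (IMEMs).

    Assumption of this sketch: entry i goes to the IMEM of the PE at
    position (i % no_of_pe), at local address (i // no_of_pe).
    """
    imems = [[] for _ in range(no_of_pe)]
    for i, value in enumerate(table):
        imems[i % no_of_pe].append(value)
    return imems


def locate(addr_dmem, no_of_pe):
    """Return (owning PE position, local IMEM address) for a DMEM address."""
    return addr_dmem % no_of_pe, addr_dmem // no_of_pe
```

With such a layout, when all PEs need the entry at the same DMEM address, exactly one PE holds it locally and the others have to receive it over the ring bus.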

Although the present invention has been disclosed and described with reference to its embodiments, the invention is not limited to these embodiments. It will be apparent to those skilled in the art that various changes can be made to the form and details of these embodiments without departing from the spirit and scope of the invention as defined by the claims.

Although the components that execute the various processes have been described as functional units or blocks, these functional units or blocks may be replaced with means. In the above description, processing elements using SIMD technology were described as an example, but the present invention can also be applied to other processing elements. For example, processing elements that execute parallel processing other than deblocking filter processing may be used.

As shown in FIG. 7, SRC0 is shifted to the right and TMP0 is shifted to the left, but these shift directions may be reversed. For example, when the overall structure of the address Addr_DMEM, the address Addr_IMEM, and the position POS_IMEM is inverted, the shift directions are also reversed. Here, the term "inverted" means that the least significant bit is placed on the left side and the most significant bit is placed on the right side. In this case, therefore, SRC0 is shifted to the left and TMP0 is shifted to the right.

Although an architecture including both the ring bus 102 and the ring bus 103 has been shown as Embodiment 1, an architecture including only the ring bus 102 may be adopted. In this case, the "shift" must be calculated in accordance with the shift direction of the ring bus 102. Switching between addition and subtraction then becomes unnecessary, as does the selection operation of the multiplexer 504. This architecture will require more shift operations on the ring bus 102, but the efficiency with which the distributed read-only parameters are used will still be sufficiently improved.
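
Why a single ring bus needs more shift operations can be illustrated with a small count. This sketch relies on a general property of rings rather than on the specification's comparison logic: it assumes that data wraps around the ring, that one shift moves data by one position, and that with two buses the shorter direction is always available. The function names are hypothetical.

```python
def shifts_one_ring(own_pos, target_pos, no_of_pe):
    """Shifts needed when data can travel in one direction only.

    Assumption: each shift moves data from position p to (p + 1) % no_of_pe.
    """
    return (own_pos - target_pos) % no_of_pe


def shifts_two_rings(own_pos, target_pos, no_of_pe):
    """Shifts needed when data can travel in either direction."""
    d = (own_pos - target_pos) % no_of_pe
    return min(d, no_of_pe - d)
```

For 16 PEs, the worst case under these assumptions is 15 shifts with one ring and 8 shifts with two rings.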

<Citation of literature>
This application claims priority based on international application PCT/JP2009/057020, filed on March 30, 2009, the entire disclosure of which is incorporated herein.

The present invention can be applied to a data processing apparatus, a data processing system, and a data processing method that execute parallel processing.

100 ... CP
101 ... PE
102 ... clockwise ring bus
103 ... counterclockwise ring bus
104 ... connection
105 ... shift register
106 ... DMEM
107 ... IMEM
121 ... ALU
122 ... dividing unit
123 ... cmpmv unit
201 ... deblocking filter
202 ... real frame memory
203 ... inter prediction unit
204 ... reference frame memory
205 ... intra prediction unit
206 ... switching unit
207 ... addition unit
208 ... decoding unit
300 ... macroblock
301 ... sub-block
302 ... edge
303 ... pixel
401 ... bit right shifter
402 ... bit left shifter
403 ... inverter
404 ... AND unit
501 ... addition/subtraction unit
502 ... sign bit inverter
503 ... determination unit
504 ... multiplexer
601 ... Addr_IMEM
602 ... POS_IMEM
603 ... boundary

Claims

A data processing apparatus for performing parallel processing by a plurality of processing elements,
each of the plurality of processing elements comprising an internal memory that stores read-only parameter data from a data memory in a distributed manner, so that the read-only parameter data is transferred in parallel from the internal memory of one processing element to another processing element via at least one ring bus,
the data processing apparatus comprising:
dividing means for dividing an address of the read-only parameter data in the data memory into a first part and a second part at a bit position corresponding to the number of the plurality of processing elements; and
comparing means for comparing the number of shift operations, on the at least one ring bus, of the read-only parameter data retrieved from the internal memory at an address corresponding to the first part with a difference between the position of the processing element itself and a part of a global address of the read-only parameter data to be accessed, the part of the global address specifying the position, on the at least one ring bus, of the processing element in which the read-only parameter data to be accessed is stored and corresponding to the second part, and for causing another processing element to acquire the read-only parameter data based on the comparison result.

The data processing apparatus according to claim 1, wherein
when the number of the plurality of processing elements is NO_PE, the bit position is determined by log₂(NO_PE),
the first part is the upper part of the address of the data memory and is located to the left of the bit position, and
the second part is the lower part of the address of the data memory and is located to the right of the bit position.

The data processing apparatus according to claim 1 or 2, wherein
the dividing means comprises:
logical right shift means for calculating a right shift value by shifting the address of the data memory to the right by the number of bits corresponding to the number of the plurality of processing elements;
logical left shift means for calculating a left shift value by shifting, to the left by the number of bits corresponding to the number of the plurality of processing elements, a fixed value that has the same number of bits as the address of the data memory and in which all bits are 1;
inverter means for calculating an inverted value by inverting the left shift value; and
AND means for calculating a logical product of the inverted value and the address of the data memory.
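
The division recited above can be sketched in software as follows. This is a minimal illustration under stated assumptions: the number of processing elements is a power of two, the address width `addr_bits` is 16, and the two results are taken to correspond to Addr_IMEM (first part) and POS_IMEM (second part). The function name `split_address` and the parameter `addr_bits` are hypothetical.

```python
def split_address(addr_dmem, no_of_pe, addr_bits=16):
    """Split a data memory address into (first part, second part).

    Mirrors the four recited steps: logical right shift, left shift of an
    all-ones value, inversion, and logical AND.
    """
    if no_of_pe <= 0 or no_of_pe & (no_of_pe - 1):
        raise ValueError("no_of_pe must be a power of two in this sketch")
    shift = no_of_pe.bit_length() - 1            # log2(no_of_pe)
    all_ones = (1 << addr_bits) - 1              # fixed value, every bit is 1

    right_shift_value = addr_dmem >> shift       # first part (upper bits)
    left_shift_value = (all_ones << shift) & all_ones
    inverted_value = ~left_shift_value & all_ones   # mask of the lower bits
    second_part = inverted_value & addr_dmem     # second part (lower bits)
    return right_shift_value, second_part
```

For example, with 16 processing elements the bit position is 4, so the address 0x25 splits into an upper part of 2 and a lower part of 5.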

The data processing apparatus according to any one of claims 1 to 3, wherein
the at least one ring bus includes two ring buses whose shift directions are opposite to each other.

The data processing apparatus according to claim 4, wherein
the comparing means comprises:
addition/subtraction means for performing an addition process or a subtraction process between the number of shift operations and the difference between the position of the processing element itself and the part of the global address;
means for switching the process of the addition/subtraction means between the addition process and the subtraction process according to the sign of the difference;
determination means for determining whether the output of the addition/subtraction means is zero; and
selecting means for selecting, from the two ring buses and according to the sign of the difference, the ring bus from which the read-only parameter data is taken,
the part of the global address is the part of the global address of the read-only parameter data to be accessed that specifies the position, on the at least one ring bus, of the processing element in which the read-only parameter data to be accessed is stored, and
the number of shift operations is given as an unsigned value, and the difference is given as a signed value.
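
The comparison recited above can be sketched as follows. This is a hedged illustration only: the sign convention (difference = own position minus target position) and the assignment of a non-negative difference to ring bus 102 and a negative difference to ring bus 103 are assumptions of this sketch, and the function name `compare_and_select` is hypothetical.

```python
def compare_and_select(shift_count, own_pos, target_pos):
    """Decide whether a PE should take data from a ring bus at this shift.

    shift_count is an unsigned shift counter; the difference is signed.
    Returns the name of the selected ring bus when the data has arrived,
    otherwise None.
    """
    diff = own_pos - target_pos            # signed difference
    if diff >= 0:
        result = shift_count - diff        # subtraction process
        ring = "ring_bus_102"              # assumed direction for diff >= 0
    else:
        result = shift_count + diff        # addition process
        ring = "ring_bus_103"              # assumed direction for diff < 0
    return ring if result == 0 else None   # zero means the data is here
```

In this sketch, a PE whose position equals the target position gets a match at shift count zero, which corresponds to reading the value from its own internal memory without any shift.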

A data processing method for performing parallel processing by a plurality of processing elements, each of the plurality of processing elements comprising an internal memory that stores read-only parameter data from a data memory in a distributed manner, so that the read-only parameter data is transferred in parallel from the internal memory of one processing element to another processing element via at least one ring bus, the data processing method comprising:
dividing an address of the read-only parameter data in the data memory into a first part and a second part at a bit position corresponding to the number of the plurality of processing elements; and
comparing the number of shift operations, on the at least one ring bus, of the read-only parameter data retrieved from the internal memory at an address corresponding to the first part with a difference between the position of the processing element itself and a part of a global address of the read-only parameter data to be accessed, the part of the global address specifying the position, on the at least one ring bus, of the processing element in which the read-only parameter data to be accessed is stored and corresponding to the second part, and causing another processing element to acquire the read-only parameter data according to the comparison result.

The data processing method according to claim 6, wherein
the dividing comprises:
calculating a right shift value by shifting the address of the data memory to the right by the number of bits corresponding to the number of the plurality of processing elements;
calculating a left shift value by shifting, to the left by the number of bits corresponding to the number of the plurality of processing elements, a fixed value that has the same number of bits as the address of the data memory and in which all bits are 1;
calculating an inverted value by inverting the left shift value; and
calculating a logical product of the inverted value and the address of the data memory.

The data processing method according to claim 6 or 7, wherein
the at least one ring bus includes two ring buses whose shift directions are opposite to each other.

The data processing method according to claim 8, wherein
the comparing comprises:
performing an addition process or a subtraction process between the number of shift operations and the difference between the position of the processing element itself and the part of the global address;
switching between the addition process and the subtraction process according to the sign of the difference;
determining whether the output of the addition process or the subtraction process is zero; and
selecting, from the two ring buses and according to the sign of the difference, the ring bus from which the parallel processing data is taken,
the part of the global address is the part of the global address of the read-only parameter data to be accessed that specifies the position, on the at least one ring bus, of the processing element in which the read-only parameter data to be accessed is stored, and
the number of shift operations is given as an unsigned value, and the difference is given as a signed value.

A data processing system for parallel processing, comprising:
a data memory for storing data;
a plurality of processing elements for performing parallel processing, the plurality of processing elements dividing an address of read-only parameter data in the data memory into a first part and a second part at a bit position corresponding to the number of the plurality of processing elements;
a plurality of internal memories, each provided for a corresponding one of the plurality of processing elements, that store the read-only parameter data from the data memory in a distributed manner;
at least one ring bus connected to the plurality of processing elements to transfer the read-only parameter data retrieved from the internal memory at an address corresponding to the first part; and
a central processor for counting the number of shift operations of the read-only parameter data on the at least one ring bus,
wherein the plurality of processing elements simultaneously place the read-only parameter data on the at least one ring bus and perform parallel processing of acquiring the read-only parameter data from the at least one ring bus based on a result of comparing the number of shift operations with a difference between the position of the processing element itself and a part of a global address, and
the part of the global address specifies the position, on the at least one ring bus, of the processing element in which the read-only parameter data to be accessed is stored, and is the part of the global address of the read-only parameter data to be accessed that corresponds to the second part.