JP5358315B2 - Parallel computing device

Info

Publication number: JP5358315B2
Application number: JP2009150019A
Authority: JP
Inventors: 新次郎豊田; 宣明宮川
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2009-06-24
Filing date: 2009-06-24
Publication date: 2013-12-04
Anticipated expiration: 2029-06-24
Also published as: JP2011008416A

Description

The present invention relates to a parallel computing device.

Some electronic computing devices that perform large-volume arithmetic processing are used for image processing. A system in which such a device performs arithmetic processing in real time is described here as an example.
Some systems assist safe driving by processing video from a camera mounted on an automobile and presenting the detection results to the driver. The electronic computing device (image processor) used in such a system is mounted on the automobile together with the video camera and analyzes the video from the camera (for example, 640 × 480 pixels at 10 frames/second). An image processor used for analysis such as pedestrian detection is said to need a computing capacity of 100 GOPS (giga (10⁹) operations per second) or more. To improve pedestrian detection accuracy, or to detect distant pedestrians, fine detail must be detected and judged from the video signal output by the camera. Resolving that detail requires a higher-resolution camera and therefore more pixels in the video signal. The amount of computation required for image analysis then grows roughly in proportion to the number of pixels handled, so a faster image processor is needed.
Moreover, to improve safety, pedestrians must be detected in less time, which requires a shorter detection cycle and more frames processed per second. Because the amount of computation required for image analysis grows in proportion to the number of frames handled, still faster image processors are required.

On the other hand, the power that can be supplied in an in-vehicle environment is limited, so operation at low power consumption (a few watts or less) is essential, and a processor cannot be made faster without regard to the increase in power consumption.
In recent years, the performance of general-purpose processors (general-purpose CPUs) used in personal computers (PCs) and the like has improved dramatically (to about 10 Gflops). Even so, this falls short of the performance required for the application above, and with power consumption of several tens of watts such processors cannot be used in vehicles.
As processors dedicated to high-performance image processing for PCs, semiconductor devices referred to as GPGPU (General Purpose computing on Graphics Processing Unit) are commercially available, for example from nVIDIA of the United States. The nominal computing performance of the processor in such a device is several hundred Gflops, which is presumed to be sufficient, but its power consumption is large (several tens of watts). In addition, although it is dedicated to image processing, it is designed for generating video data and is not suited to applications such as pedestrian detection.

Processors with low power consumption of a few watts include, for example, the SH4 series of embedded processors from Renesas Technology Corp., but their computing performance, at less than 2 Gflops, is too low.
Against this background, image processors have been developed with computer architectures that achieve very high arithmetic performance at low power consumption. One example is an image processor intended for in-vehicle image processing. It integrates a large number of element processor elements (element PEs), each with a low operating frequency and very low power consumption, on a single LSI and operates them in parallel. An example is IMAPCAR, developed by NEC Corporation (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Okazaki, Akinori, Koga, Hidano, "In-vehicle embedded image recognition processor IMAPCAR", NEC Technical Report, Vol. 60, No. 2/2007, pp. 17-20, 2007.
Non-Patent Document 2: Kyo, S.; Okazaki, S.; Arai, T., "An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems", Computer Architecture, 2005. ISCA '05. Proceedings. 32nd International Symposium on, 4-8 June 2005, pp. 134-145, 2005.

In general, the effective performance obtained when a parallel processor is actually run is considerably lower than the nominal value given by multiplying the total number of arithmetic units (ALUs) it contains by their operating frequency. The main cause is that the arithmetic units sit idle because the data they need cannot be supplied at the right time, or because results cannot be stored immediately. Raising the effective performance therefore requires a configuration that supplies data to the arithmetic units without interruption and, at the same time, stores results quickly in an appropriate location.

Architectures for operating many element PEs in parallel include the SIMD (Single Instruction Multi Data), MIMD (Multi Instruction Multi Data), and VLIW (Very Long Instruction Word) types. Because in-vehicle image processing requires both low cost and high performance, the IMAPCAR approach mentioned above, which combines the SIMD type with the VLIW type, is advantageous.
FIG. 12 shows a configuration example of a SIMD image processor. In a SIMD parallel computing device, all element PEs operate simultaneously on the same instruction, so they all read data at once and all output data at once. For example, if one LSI contains 128 element PEs running on a 100 MHz clock and each requests one 8-bit datum at the same time, the instantaneous data transfer rate required is 100 MHz × 1 B × 128 = 12.8 GB/second. This rate is not unattainable, but it raises cost and power consumption, including in the external circuitry. The configuration in the figure therefore provides a local memory for each element PE and connects each local memory to its element PE with a dedicated bus. With this configuration, the data transfer rate required between each element PE and its local memory is at most 100 MB/second.
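The bandwidth figures in this example can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function and constant names are chosen here for the example and do not come from the patent.

```python
# Figures taken from the example in the text.
NUM_PES = 128             # element PEs on one LSI
CLOCK_HZ = 100_000_000    # 100 MHz
BYTES_PER_ACCESS = 1      # one 8-bit datum per PE per clock


def shared_bus_rate(num_pes: int, clock_hz: int, bytes_per_access: int) -> int:
    """Peak bytes/second if every PE fetches from one shared memory each clock."""
    return num_pes * clock_hz * bytes_per_access


def per_pe_rate(clock_hz: int, bytes_per_access: int) -> int:
    """Peak bytes/second on one PE's dedicated local-memory bus."""
    return clock_hz * bytes_per_access


shared = shared_bus_rate(NUM_PES, CLOCK_HZ, BYTES_PER_ACCESS)
local = per_pe_rate(CLOCK_HZ, BYTES_PER_ACCESS)
print(f"shared bus: {shared / 1e9:.1f} GB/s")
print(f"per-PE bus: {local / 1e6:.0f} MB/s")
```

Giving each PE its own local memory does not reduce the total traffic; it spreads it over 128 short dedicated buses, each of which only has to sustain the per-PE rate.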

Meanwhile, the data to be processed must first be supplied from a device outside the LSI, and the results must be taken out of the LSI. In the earlier example (640 × 480 pixels, 10 frames/second), if the input image is monochrome at 8 bits/pixel, the input data rate is 640 × 480 × 10/s × 1 B ≈ 3.1 MB/second, so with eight input data lines the data can be input on a 3.1 MHz clock. An input rate of this order is easily achieved. In image processing the amount of output data is generally about the same as the input, so output is equally easy.
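As a quick check of the input-rate estimate, the same arithmetic can be written out as below. This is a sketch with variable names chosen for illustration; the exact product is 3,072,000 bytes/second, which the text rounds to 3.1 MB/second.

```python
WIDTH, HEIGHT = 640, 480
FRAMES_PER_SECOND = 10
BYTES_PER_PIXEL = 1       # 8-bit monochrome
DATA_LINES = 8            # parallel input data lines

bytes_per_second = WIDTH * HEIGHT * FRAMES_PER_SECOND * BYTES_PER_PIXEL
bits_per_second = bytes_per_second * 8
# With 8 lines, one byte is transferred per clock, so the clock equals the byte rate.
input_clock_hz = bits_per_second // DATA_LINES

print(f"input rate : {bytes_per_second / 1e6:.3f} MB/s")
print(f"input clock: {input_clock_hz / 1e6:.3f} MHz")
```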

According to the technique of Non-Patent Document 1, the element PE of IMAPCAR is a VLIW type with four sub-processors (hereinafter "SPEs" or "sub-PEs"; see FIG. 13). The four sub-PEs are functionally specialized: one handles data input/output with the local memory or addition (MEM/Add in the figure), one handles logical operations (Logic), one handles arithmetic and logic operations (ALU), and one handles multiplication (Multiplier). Twenty-eight general registers (GRs) are shared by the four sub-PEs. This configuration has the following three problems.
The first problem is that only one sub-PE can perform data input/output with the local memory. When memory accesses are concentrated, data input/output cannot keep up (the memory data bus has insufficient bandwidth); conversely, when there are few memory accesses or additions, this sub-PE sits idle.
The second problem is that one sub-PE is fixed to multiplication, yet multiplication is not very frequent in image processing, so this sub-PE (Multiplier) is likely to sit idle. Because of these two problems, it is difficult to keep all four sub-PEs active (with no idle time) at all times.
The third problem lies in the 28 GRs.
According to Section 5.3 of Non-Patent Document 2, an instruction executed by a sub-PE takes three operands: two sources and one destination. With 28 GRs, each operand specifier needs 5 bits, for a total of 15 bits. The four sub-PEs combined in the VLIW configuration therefore need 60 bits, which increases the capacity of the memory that stores the instruction code and drives up cost.
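The operand-field count in the third problem follows from the number of bits needed to address 28 registers. The sketch below reproduces that calculation; the helper name `operand_field_bits` is introduced here for illustration only.

```python
def operand_field_bits(num_registers: int) -> int:
    """Smallest number of bits that can address num_registers distinct registers."""
    return (num_registers - 1).bit_length()


GENERAL_REGISTERS = 28
OPERANDS_PER_INSTRUCTION = 3   # two sources, one destination
SUB_PES = 4

bits_per_operand = operand_field_bits(GENERAL_REGISTERS)
bits_per_instruction = bits_per_operand * OPERANDS_PER_INSTRUCTION
bits_per_vliw_word = bits_per_instruction * SUB_PES

print(bits_per_operand, bits_per_instruction, bits_per_vliw_word)
```

These 60 bits cover operand specifiers only; opcode bits would come on top of them.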

The present invention has been made to solve the above problems, and its object is to provide a parallel computing device with high effective efficiency.

To solve the above problems, the invention described in claim 1 is a parallel computing device comprising: a plurality of arithmetic processors (for example, the arithmetic processors PE2 in the embodiment) that perform arithmetic processing in parallel; and an instruction execution control unit (for example, the instruction execution control processor PE-I3 in the embodiment) that supplies a control signal to each of the arithmetic processors, wherein each arithmetic processor comprises a plurality of sub-processors (for example, the sub-processors (SPE) 2A to 2D in the embodiment), each sub-processor comprising a storage unit (for example, the local memories 11A to 11D in the embodiment) that holds a plurality of data or a plurality of operation results, and an arithmetic unit (for example, the ALUs 12A to 12C and 12DA in the embodiment) that performs arithmetic processing on data read from the storage unit and supplies the result to the storage unit; at least two of the plurality of sub-processors have the same structure, and the sub-processors having the same structure constitute a same-structure sub-processor group (for example, the sub-processor group SPE2G in the embodiment); and the instruction execution control unit comprises an exchange unit (for example, the exchange units 33A to 33C in the embodiment) that exchanges the control signals to be supplied to the sub-processors included in the same-structure sub-processor group and supplies them.

The invention described in claim 2 is the parallel computing device according to claim 1, wherein each sub-processor comprises a register unit (for example, Acc 13A to 13D in the embodiment) that is connected to one input of the arithmetic unit and stores information written to it, the register unit normally storing the operation result of the arithmetic unit and outputting the stored result; the same-structure sub-processor group comprises register reference units (for example, the register reference units 15A to 15C in the embodiment), which, when the outputs of the register units of the sub-processors in the same-structure sub-processor group are treated as inputs from at least some of the registers that supply input data to the other input of the arithmetic unit of a sub-processor, can change the correspondence between the register units and the registers that supply the input data; and, when exchanging the control signals supplied to the sub-processors of the same-structure sub-processor group, the instruction execution control unit controls the register reference units so that, in place of each register unit that was associated with a register supplying the input data before the exchange, the register unit of the sub-processor that, after the exchange, receives the control signal previously supplied to the sub-processor to which that register unit belongs is newly associated with that register.

According to the inventions described in claims 1 and 2, the parallel computing device comprises a plurality of arithmetic processors that perform arithmetic processing in parallel. The instruction execution control unit supplies a control instruction to each arithmetic processor. In each sub-processor, the storage unit holds a plurality of data or a plurality of operation results, and the arithmetic unit performs arithmetic processing on data read from the storage unit and supplies the result to the storage unit.
At least two of the plurality of sub-processors have the same structure, and these form a same-structure sub-processor group. In the instruction execution control unit, the exchange unit exchanges the control signals supplied to the sub-processors in the same-structure sub-processor group and supplies them.
As a result, the combination of data and processing can be exchanged without exchanging the data stored in the storage units and without preparing similar programs that differ only in their operands in order to switch the referenced data and perform the same processing.
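The effect described here, changing which data each operation is applied to without moving the data or rewriting the program, can be illustrated with a toy model. The sketch below is not the patented circuit: the `exchange` and `run` functions, the `routing` list, and the three example operations are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Op = Callable[[int], int]


def exchange(control_signals: Sequence[Op], routing: Sequence[int]) -> List[Op]:
    """Toy exchange unit: routing[spe] names the instruction slot sent to that SPE."""
    return [control_signals[slot] for slot in routing]


def run(local_data: Sequence[int], delivered: Sequence[Op]) -> List[int]:
    """Each SPE applies the operation it received to its own resident data."""
    return [op(x) for op, x in zip(delivered, local_data)]


# One program: three instruction slots for three identical SPEs.
program: List[Op] = [lambda x: x + 1, lambda x: x * 2, lambda x: -x]
# Data already resident in the three local memories; it is never moved.
local_data = [10, 20, 30]

straight = run(local_data, exchange(program, [0, 1, 2]))
rotated = run(local_data, exchange(program, [1, 2, 0]))
print(straight)
print(rotated)
```

The same `program` and the same `local_data` are used in both runs; only the routing differs, which is the property the paragraph above attributes to the exchange unit.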

FIG. 1 is a schematic block diagram showing a first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of PE 2 in the first embodiment.
FIG. 3 is a timing chart for inputting image data in the first embodiment.
FIG. 4 is a timing chart for outputting data from PE 2 to the outside in the first embodiment.
FIG. 5 is a diagram showing an example of a video signal input to the arithmetic processing in the first embodiment.
FIG. 6 is a diagram showing the configuration of the address register in the first embodiment.
FIG. 7 is a diagram showing the mapping of image data to the local memory in the first embodiment.
FIG. 8 is a schematic block diagram showing PE-I 3 in the first embodiment.
FIG. 9 is a diagram showing the instruction selection state in the exchange unit 31 in the first embodiment.
FIG. 10 is a diagram showing the operation of the register reference unit in the first embodiment.
FIG. 11 is a block diagram showing the configuration of PE 2 in the second embodiment.
FIG. 12 is a block diagram of a configuration example of a conventional SIMD image processor.
FIG. 13 shows a configuration example of a conventional element arithmetic processor.

(First embodiment)
An embodiment of the parallel computing device is described with reference to the drawings.
FIG. 1 is a schematic block diagram showing an embodiment of the present invention.
The parallel computing device 1 shown in this figure performs parallel arithmetic processing with a plurality of arithmetic processors (also called "element PEs"). Before the detailed description of this embodiment, the configuration of the parallel computing device 1 is outlined.
The parallel computing device 1 includes arithmetic processors (PE) 2-0 to 2-106 (collectively, "arithmetic processors (PE) 2") and an instruction execution control processor (PE-I) 3 that supplies control instructions to each PE 2.
The parallel computing device 1 also includes an input/output processor (IOP) 4, an instruction memory 5, a data input shift register 6, a data output shift register 7, and an external memory 9.

Each arithmetic processor 2 has four sub-processors (SPE) 2A to 2D.
SPEs 2A to 2D form a VLIW (Very Long Instruction Word) configuration in which each executes a different instruction. Every PE 2 has the same configuration, combining SPEs 2A to 2D. The 107 SPEs 2A across all PEs 2 are organized as SIMD (Single Instruction Multi Data), so every SPE 2A executes the same instruction. The same applies to SPE 2B, SPE 2C, and SPE 2D, each of which is likewise organized as SIMD.
SPEs 2A to 2D are a combination of SPEs with differing configurations. The description below takes as an example the combination of SPEs 2A to 2C, which perform arithmetic processing in the arithmetic processor 2, and SPE 2D, which in addition to arithmetic processing has components for functions such as external input/output. The set of SPEs 2A to 2C, which share the same configuration, is called the sub-processor group 2G.

PE-I 3 controls the execution order of the instructions of PE 2.
PE-I 3 controls processing in the PE 2 program that requires conditional branching, such as loops and subroutine calls. When the instructions of PE-I 3 and PE 2 are written in an assembler program, they are written as a VLIW instruction that executes five instructions in parallel: those of SPE 2A to SPE 2D and that of PE-I 3.
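The combination of a five-slot VLIW word with SIMD distribution across 107 PEs can be pictured with a small data model. This is a minimal sketch under stated assumptions: the `Bundle` class, the `broadcast` function, and the mnemonics (`loop`, `add`, and so on) are hypothetical and are not the instruction set of the device.

```python
from dataclasses import dataclass
from typing import Dict, List

NUM_PES = 107  # PE 2-0 through PE 2-106


@dataclass(frozen=True)
class Bundle:
    """One VLIW word: a control slot for PE-I 3 plus one slot per sub-processor."""
    pe_i: str
    spe_a: str
    spe_b: str
    spe_c: str
    spe_d: str


def broadcast(bundle: Bundle) -> List[Dict[str, str]]:
    """SIMD step: every PE receives the same four SPE instructions."""
    per_pe = {"A": bundle.spe_a, "B": bundle.spe_b,
              "C": bundle.spe_c, "D": bundle.spe_d}
    return [dict(per_pe) for _ in range(NUM_PES)]


word = Bundle(pe_i="loop", spe_a="add", spe_b="sub", spe_c="and", spe_d="mul")
issued = broadcast(word)
print(len(issued), issued[0])
```

Within one PE the four slots differ (the VLIW aspect), while across PEs each slot is identical (the SIMD aspect); the PE-I 3 slot is consumed by the controller itself and is not broadcast.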

IOP 4 handles input and output between PE 2 and the outside.
The program code executed by the SIMD+VLIW parallel computing device 1 is read in advance from the external memory 9 by IOP 4 before computation starts and written to the instruction memory 5 attached to PE-I 3. When IOP 4 then sends a computation start signal to PE-I 3, PE-I 3 reads from the instruction memory the instruction it executes itself and the four instructions to be executed by SPE 2A to SPE 2D, and computation begins. The data to be processed is taken in from outside by IOP 4 and distributed through the data input shift register 6 to processors PE 2-0 to PE 2-106. The results are read from each PE 2 by IOP 4 through the data output shift register 7, transferred to the external memory 9, and output outside PE 2.

The data input shift register 6 and the data output shift register 7 input and output the data on which the PEs 2 of the parallel computing device 1 operate. The data input shift register 6 uses its shift function to move the serially input data received through IOP 4 to the position of the PE 2 corresponding to each datum. The data output shift register 7 shifts the result computed by each PE 2 from the position of that PE 2 to IOP 4 and outputs it through IOP 4.
In this way, the plurality of PEs 2 can perform arithmetic processing and input/output processing in parallel.
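The way a serial stream is distributed to PE positions by shifting can be sketched as follows. The stage ordering (stage 0 nearest the input) and the eight-stage length are assumptions for illustration; the actual register has one position per PE and its orientation is not stated in this passage.

```python
from typing import List, Optional


def shift_in(samples: List[int], num_stages: int) -> List[Optional[int]]:
    """Clock samples serially into a shift register with num_stages stages.

    Stage 0 is assumed to be nearest the input, so the first sample
    clocked in ends up furthest from the input.
    """
    stages: List[Optional[int]] = [None] * num_stages
    for sample in samples:
        stages = [sample] + stages[:-1]   # one clock: everything moves one stage
    return stages


row = list(range(8))                      # eight pixels arriving in order 0..7
stages = shift_in(row, num_stages=8)
print(stages)
```

After as many clocks as there are stages, every stage holds exactly one sample, and all PEs can latch their own sample in the same cycle.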

FIG. 2 is a block diagram showing the configuration of PE 2 in the parallel computing device 1 of the present invention.
The figure shows, for the parallel computing device 1, PE 2-k, the PEs 2-(k-1) and 2-(k+1) arranged on either side of PE 2-k, PE-I 3, IOP 4, the instruction memory 5, the data input shift register 6, and the data output shift register 7. Components identical to those in FIG. 1 are given the same reference numerals.

Because PE 2-k has the same configuration as the adjacent PE 2-(k-1) and PE 2-(k+1), the configuration is described with reference to PE 2-k (hereinafter "PE 2" unless the adjacency relationship needs to be indicated).
PE 2 has a register 2M in addition to the four sub-processors (SPE) 2A to 2D. SPEs 2A to 2C have the same configuration and form the sub-processor group SPE 2G.

The register 2M is a storage area for temporarily storing data needed for computation. It contains 12 registers (R4 to R15) and is read and written by each SPE.

SPE 2A includes a local memory (LM) 11A, an ALU 12A, an Acc 13A, a selector (Sel) 14A, and a register reference unit (Sel) 15A. Similarly, SPE 2B includes a local memory (LM) 11B, an ALU 12B, an Acc 13B, a selector (Sel) 14B, and a register reference unit (Sel) 15B. SPE 2C includes a local memory (LM) 11C, an ALU 12C, an Acc 13C, a selector (Sel) 14C, and a register reference unit (Sel) 15C.

First, the configuration of SPEs 2A to 2C is described using SPE 2A as the representative.
In SPE 2A, the local memory 11A is, for example, a 512 × 8b (bit) memory that records the input data and results of arithmetic processing.
Acc 13A is an accumulator to which the result of ALU 12A is written and which is referenced by ALUs 12A to 12D.
The selector 14A selects an input to ALU 12A.
ALU 12A performs predetermined arithmetic processing on its input data. One input of ALU 12A is supplied with data from Acc 13A; the other is supplied with the data selected by the selector 14A.
The data selected by the selector 14A is one of the following: data in the register 2M; data in Acc 13A, Acc 13B, Acc 13C, or Acc 13D; or data stored in the local memory 11A.
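The accumulator-based datapath of SPE 2A (one ALU input fixed to the accumulator, the other chosen by a selector) can be modeled in a few lines. This is a minimal sketch: the `Spe` class, the `select` and `add_step` functions, the source tags, and the 8-bit accumulator width are assumptions for illustration and are not specified in this passage.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ACC_MASK = 0xFF  # assumption: 8-bit accumulator, matching the 8-bit local memory


@dataclass
class Spe:
    """One of the identical sub-processors (modeled on SPE 2A)."""
    local_memory: List[int] = field(default_factory=lambda: [0] * 512)
    acc: int = 0


def select(source: str, index: int, spe: Spe,
           reg_2m: Dict[int, int], accs: List[int]) -> int:
    """Selector 14A: choose the ALU's second input from one of three places."""
    if source == "reg":      # register 2M, R4..R15
        return reg_2m[index]
    if source == "acc":      # Acc 13A..13D, indexed 0..3
        return accs[index]
    if source == "lm":       # this SPE's own local memory
        return spe.local_memory[index]
    raise ValueError(f"unknown source: {source}")


def add_step(spe: Spe, operand: int) -> None:
    """ALU 12A: the first input is always the SPE's own accumulator."""
    spe.acc = (spe.acc + operand) & ACC_MASK


spe_a = Spe()
spe_a.local_memory[3] = 7
reg_2m = {n: 0 for n in range(4, 16)}   # twelve registers R4..R15
reg_2m[4] = 5
accs = [0, 9, 0, 0]                     # snapshot of Acc 13A..13D

add_step(spe_a, select("lm", 3, spe_a, reg_2m, accs))    # acc becomes 7
add_step(spe_a, select("reg", 4, spe_a, reg_2m, accs))   # acc becomes 12
add_step(spe_a, select("acc", 1, spe_a, reg_2m, accs))   # acc becomes 21
print(spe_a.acc)
```

Because one operand and the destination are implicitly the accumulator, an instruction only has to name the second source.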

The result of ALU 12A is normally written to Acc 13A. Using an instruction that transfers the data in Acc 13A to the register 2M, the result can be written to a selected one of registers R4 to R15 in the register 2M. Likewise, using an instruction that transfers the data in Acc 13A to the local memory 11A, the result can be written to a selected storage location in the local memory 11A.
The register reference unit 15A selects the data to output to the selector 14A according to a control signal from PE-I 3. The selectable data is that of Acc 13A to Acc 13C. By changing the selection made by the register reference unit 15A, without changing the selection condition in the selector 14A, the data of any of Acc 13A to Acc 13C can be referenced.
SPE 2B and SPE 2C have configurations corresponding to that of SPE 2A.

Next, SPE 2D, whose configuration differs from that of SPEs 2A to 2C, is described.
SPE 2D includes a local memory 11D, an arithmetic unit 12D, an Acc 13D, a selector 14D, a data input buffer (SIN_reg) 16, a data output buffer (SOUT_reg) 17, a PE output register L (PE-out-L) 18, and a PE output register R (PE-out-R) 19. SPE 2D thus has its own configuration, distinct from SPEs 2A to 2C; the absence of a register reference unit is a further difference from SPEs 2A to 2C.

The configuration of SPE 2D is described in comparison with SPE 2A.
The local memory 11D in SPE 2D corresponds to the local memory 11A.
The arithmetic unit 12D includes an ALU 12DA, a multiplier 12DM, and a selector 12DS. ALU 12DA corresponds to ALU 12A, and its output is input to Acc 13D through the selector 12DS. The multiplier 12DM receives the same input signals as ALU 12DA and multiplies the input data. The selector 12DS selectively outputs the result of either ALU 12DA or the multiplier 12DM to Acc 13D. Acc 13D corresponds to Acc 13A.
In addition to the register 2M and Acc 13A, Acc 13B, Acc 13C, and Acc 13D, which are the same as for the selector 14A, and the data stored in the local memory 11D, the inputs to the selector 14D include input data from the data input shift register 6 received through the data input buffer 16 and input data from the adjacent PE 2-(k-1) and PE 2-(k+1).

データ入力バッファ１６は、ＰＥ２に入力されたデータを各ＰＥ２にセットするデータ入力シフトレジスタ６からの入力を一時的に保持する。データ出力バッファ１７は、演算された結果をＰＥ２から出力する際に、データ出力シフトレジスタ７に出力するデータを一時的に記憶し、データ出力シフトレジスタ７によって読み出しが行われる。
ＰＥ出力レジスタＬ１８とＰＥ出力レジスタＲ１９は、隣接するＰＥ２に出力するデータを一時的に記憶する。
ＰＥ出力レジスタＬ１８やＰＥ出力レジスタＲ１９に記憶されたデータは、隣接するＰＥ２におけるＳＰＥ２Ｄから参照される。
ＰＥ出力レジスタＬ１８は、図の左隣に隣接するＰＥ２−（ｋ−１）にデータを出力する。ＰＥ出力レジスタＲ１９は、図の右隣に隣接するＰＥ２−（ｋ＋１）にデータを出力する。
このように、ＳＰＥ２Ｄは、乗算を含んだ演算処理とＰＥ２外部とのデータ交換処理を選択的に処理できる。 The data input buffer 16 temporarily holds the input from the data input shift register 6 that sets the data input to the PE2 in each PE2. The data output buffer 17 temporarily stores data to be output to the data output shift register 7 when the calculated result is output from the PE 2, and reading is performed by the data output shift register 7.
The PE output register L18 and the PE output register R19 temporarily store data to be output to the adjacent PE2.
The data stored in the PE output register L18 and the PE output register R19 is referred to from the SPE2D in the adjacent PE2.
The PE output register L18 outputs data to the adjacent PE2-(k-1) on the left side of the drawing. The PE output register R19 outputs data to the adjacent PE2-(k+1) on the right side of the drawing.
In this way, the SPE2D can selectively perform arithmetic processing, including multiplication, and data exchange with the outside of the PE2.

以上に示したＳＰＥ２Ａ〜２Ｄにより、ローカルメモリ１１Ａ〜１１Ｄは、各ＳＰＥに分割して配置される。この図に示した例では、その容量は512 x 8b（ビット）の単位で４分割されているものとする。ローカルメモリ１１Ａ〜１１Ｄの合計容量は、分割せずにっまとめて配置される2k x 8b（ビット）と同じ記憶容量であるが、分割して配置したことにより、それぞれのＳＰＥで同時にアクセスできるようになり、ＰＥ２におけるローカルメモリに対するデータ転送能力は、４倍になる。この構成により、全てのＳＰＥが待ち時間なしでメモリアクセスできるようになり、さらに、マルチポートメモリを使う場合に生じるアクセスの衝突問題を回避できる。ただし、これらの利点と引き換えに、2k x 8b（ビット）のメモリを１つ配置する場合よりも、ローカルメモリ１１Ａ〜１１Ｄの配置に必要な面積は若干増大する。 With the SPEs 2A to 2D described above, the local memories 11A to 11D are divided and arranged in each SPE. In the example shown in this figure, it is assumed that the capacity is divided into four in units of 512 × 8b (bits). The total capacity of the local memories 11A to 11D is the same storage capacity as 2k x 8b (bits) that are arranged together without being divided, but by being arranged separately, each SPE can be accessed simultaneously. Thus, the data transfer capability for the local memory in PE2 is quadrupled. With this configuration, all SPEs can access the memory without waiting time, and further, it is possible to avoid an access collision problem that occurs when using a multi-port memory. However, in exchange for these advantages, the area required for the arrangement of the local memories 11A to 11D is slightly increased as compared with the case where one 2k × 8b (bit) memory is arranged.

ローカルメモリ１１Ａ〜１１Ｄは、それぞれ内部にアドレスレジスタを備えている。ＳＰＥ２Ａ〜２Ｄにおいて、ローカルメモリ１１Ａ〜１１Ｄをアクセスする場合は、先ずアドレス情報を図示されないアドレスレジスタにセットし、次にメモリへの書き込みか、或いはメモリからの読み出しを行う。ローカルメモリ１１Ａ〜１１Ｄにアクセスする度にアドレスレジスタを設定し直すのでは効率が悪くなる。そこで、通常のメモリアクセス命令とは別に、アクセス後に自動的にアドレスレジスタに設定されたアドレスが１増える命令と、１減る命令とを用意する。このようなアドレスレジスタを用いることにより、メモリアクセス時に必要な命令数を減らすことができ、画像処理のようにメモリアクセスが多くなる処理では、メモリアクセスの効率を改善することができる。 Each of the local memories 11A to 11D includes an address register therein. In the SPEs 2A to 2D, when accessing the local memories 11A to 11D, address information is first set in an address register (not shown), and then writing to the memory or reading from the memory is performed. If the address register is reset every time the local memories 11A to 11D are accessed, the efficiency is deteriorated. Therefore, in addition to the normal memory access instruction, an instruction that automatically increments the address set in the address register after access and an instruction that decrements by 1 are prepared. By using such an address register, the number of instructions necessary for memory access can be reduced, and the efficiency of memory access can be improved in a process in which memory access increases, such as image processing.
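The post-access increment and decrement described above can be illustrated with a minimal Python sketch. The class name `LocalMemory`, the `step` parameter, and the wrap-around at 512 entries are assumptions made for illustration; the text only states that the address register is adjusted by 1 after the access.

```python
class LocalMemory:
    """Sketch of a 512 x 8-bit local memory with an internal address register."""

    SIZE = 512

    def __init__(self):
        self.cells = [0] * self.SIZE
        self.addr = 0  # address register

    def set_address(self, addr):
        self.addr = addr % self.SIZE

    def read(self, step=0):
        """Read at the current address, then move the address by step (+1, -1 or 0)."""
        value = self.cells[self.addr]
        self.addr = (self.addr + step) % self.SIZE  # wrap-around is an assumption
        return value

    def write(self, value, step=0):
        """Write at the current address, then move the address by step (+1, -1 or 0)."""
        self.cells[self.addr] = value & 0xFF
        self.addr = (self.addr + step) % self.SIZE


mem = LocalMemory()
mem.set_address(0)
for pixel in (10, 20, 30):
    mem.write(pixel, step=+1)  # no separate address-set instruction per access

mem.set_address(2)
readback = [mem.read(step=-1) for _ in range(3)]  # walks backwards over the pixels
```

In this sketch only two address-set operations are needed for six memory accesses, which is the saving in instruction count that the paragraph refers to.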

また、全てのＳＰＥには、ＡＬＵ１２Ａ〜１２ＤＡをそれぞれ備えているので、どのＳＰＥにおいても算術論理演算処理ができる。更に、ＳＰＥ２Ｄには乗算器（Mul）１２ＤＭを備えており、乗算処理が可能である。この構成により、各ＳＰＥはそれぞれ算術論理演算処理を並列に行えることから、コンパイラによる命令割付の自由度が大幅に向上する。 In addition, since all SPEs are respectively provided with ALUs 12A to 12DA, arithmetic logic operation processing can be performed in any SPE. Further, the SPE2D is provided with a multiplier (Mul) 12DM and can perform multiplication processing. With this configuration, each SPE can perform arithmetic logic operation processing in parallel, which greatly improves the degree of freedom of instruction assignment by the compiler.

また、各ＳＰＥ２は、ジェネラルレジスタ方式ではなく、アキュムレータ方式である。即ち、それぞれのＡＬＵ１２Ａ〜ＡＬＵ１２ＤＡの一方の入力は、アキュムレータ（Acc）に固定され、また、ＡＬＵ１２Ａ〜ＡＬＵ１２ＤＡの出力先はそれぞれのＡｃｃ１３Ａ〜１３Ｄである。ＡＬＵ１２Ａ〜ＡＬＵ１２ＤＡの他方の入力だけが、命令に応じてデータの参照先を指定できる。
ＰＥ２は、演算に必要なデータの一時記憶用に用いられるレジスタ２Ｍを備える。通常は、レジスタ２ＭからＡＬＵ１２Ａ〜１２ＤＡへデータが供給される。このアーキテクチャにより、このレジスタを参照する命令に必要なオペランドが４ビットと少なくでき、４つのＳＰＥ２Ａ〜２Ｄを独立して制御する場合でも、合計１６ビットで構成できるので非常にコンパクトになる。 Each SPE 2 is not a general register system but an accumulator system. That is, one input of each of the ALUs 12A to ALU12DA is fixed to the accumulator (Acc), and the output destinations of the ALUs 12A to ALU12DA are the respective Accs 13A to 13D. Only the other input of the ALU 12A to ALU 12DA can specify the data reference destination according to the instruction.
The PE 2 includes a register 2M used for temporary storage of data necessary for calculation. Normally, data is supplied from the register 2M to the ALUs 12A to 12DA. With this architecture, the operand of an instruction that refers to this register needs only 4 bits, and even when the four SPEs 2A to 2D are controlled independently, the operands total only 16 bits, which is very compact.
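As a rough illustration of the operand budget, the following Python sketch packs one 4-bit register number per SPE into a 16-bit field. The function names and the field order (SPE2A in the lowest bits) are hypothetical; the text gives only the widths.

```python
OPERAND_BITS = 4  # one 4-bit operand selects one of 16 registers


def pack_operands(regs):
    """Pack four register numbers (one per SPE, A to D) into a 16-bit field."""
    if len(regs) != 4 or any(not 0 <= r < 16 for r in regs):
        raise ValueError("expected four register numbers in the range 0-15")
    word = 0
    for i, r in enumerate(regs):
        word |= r << (OPERAND_BITS * i)  # assumed layout: SPE2A in bits 3-0
    return word


def unpack_operands(word):
    """Recover the four 4-bit operands from the packed field."""
    return [(word >> (OPERAND_BITS * i)) & 0xF for i in range(4)]


packed = pack_operands([3, 0, 15, 4])
```

Because one ALU input and the destination are fixed to the accumulator, no further register fields are needed, which is why four independent operands fit in 16 bits.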

続いて各ＳＰＥにおけるアキュムレータ（Ａｃｃ）選択制御処理について示す。
Ａｃｃ１３Ａ〜Ａｃｃ１３Ｄは、各ＡＬＵからの参照が可能な４つのレジスタ（レジスタR0〜R3）である。Ａｃｃ１３Ａ〜Ａｃｃ１３Ｄは、それぞれのＳＰＥに分散して配置されるが、他のＳＰＥから参照して読み出すことができる。
ただし、ＳＰＥ２Ａ〜２Ｃにおいて、Ａｃｃ１３Ａ〜Ａｃｃ１３ＣとＡｃｃ１３Ｄとは、参照方法が異なる。
Ａｃｃ１３Ａ〜Ａｃｃ１３Ｃは、レジスタ参照部１５Ａ〜１５Ｃによって選択され、次にＡｃｃ１３Ｄやレジスタ２Ｍなどと共にセレクタ１４Ａ〜１４Ｄによって選択されてＡＬＵ１２Ａ〜ＡＬＵ１２ＤＡの片側に入力される。
Ａｃｃ１３Ｄは、レジスタ２Ｍなどと共にセレクタ１４Ａ〜１４Ｄによって選択されてＡＬＵ１２Ａ〜１２ＤＡの片側に入力される点が異なる。 Next, accumulator (Acc) selection control processing in each SPE will be described.
Acc13A to Acc13D are four registers (registers R0 to R3) that can be referenced from each ALU. Although Acc13A to Acc13D are distributed among the respective SPEs, they can also be referenced and read from the other SPEs.
However, in SPE2A to 2C, Acc13A to Acc13C and Acc13D are referenced in different ways.
Acc13A to Acc13C are first selected by the register reference units 15A to 15C, then selected by the selectors 14A to 14D together with the Acc13D, the register 2M, and the like, and input to one side of the ALUs 12A to 12DA.
Acc13D differs in that it is selected by the selectors 14A to 14D together with the register 2M and the like and input to one side of the ALUs 12A to 12DA, without first passing through a register reference unit.

セレクタ１４Ａ〜１４Ｄの選択により、各ＡＬＵから参照されるレジスタR3は、常にＡｃｃ１３Ｄに対応しているが、レジスタR0〜R2は、レジスタ参照部１５Ａ〜１５Ｃが設定される状態により参照先を変更できるようになる。そのため、レジスタR0〜R2は、レジスタ参照部１５Ａ〜１５Ｃの設定によってＡｃｃ１３Ａ〜Ａｃｃ１３Ｃのいずれかに変化する。
レジスタR4〜R15へのデータの書き込みは、Ａｃｃ１３Ａ〜Ａｃｃ１３Ｄからのデータ転送命令によって行われる。同一レジスタへの同時書き込みは、コンパイラで容易に回避できる。また、レジスタR4〜R15へのデータの書き込みは上記の命令でしか実行できないので、同一レジスタからの読み出しと書き込みが同時に起こる場合でも、データのバイパスを設けるなどでタイミング問題を容易に回避できる。
なお、演算部１２Ｄ並びに乗算器１２ＤＭへの入力は、前述のＡＬＵ１２ＤＡの説明を参照する。 The register R3 referenced from each ALU always corresponds to Acc13D by the selection of the selectors 14A to 14D, but the registers R0 to R2 can change the reference destination depending on the state in which the register reference units 15A to 15C are set. It becomes like this. Therefore, the registers R0 to R2 change to any one of Acc13A to Acc13C depending on the setting of the register reference units 15A to 15C.
Data writing to the registers R4 to R15 is performed by a data transfer command from Acc13A to Acc13D. Simultaneous writing to the same register can be easily avoided by the compiler. In addition, since data writing to the registers R4 to R15 can be executed only by the above-mentioned instruction, even when reading and writing from the same register occur simultaneously, a timing problem can be easily avoided by providing a data bypass.
For the input to the arithmetic unit 12D and the multiplier 12DM, refer to the description of the ALU 12DA described above.

図を参照し、画像データを入出力する処理について示す。
図３は、画像データの入力処理のタイミングチャートである。
ＳＰＥ２Ｄには、ＰＥ２の外部とデータを入出力するための構成を有している。
入力画像データは、ＩＯＰ４によって取り込まれ、データ入力シフトレジスタ（Serial-in）６にシフトされながら設定される。ＰＥ２がリセット（初期化）された直後（時刻ｔ_ｉ０）はデータ入力シフトレジスタ６が空であることを示すフラグSin-emtyが１に、また、データ入力シフトレジスタ６にデータが準備できたことを示すフラグSin-rdyが０になっている。計算を開始するとＩＯＰ４はデータ入力シフトレジスタ６へデータを書き込む前にSin-emtyが1になるのを待つ（時刻ｔ_ｉ１）が、既にSin-emtyが１なので直ぐにデータ転送を始める（時刻ｔ_ｉ２）。 A process for inputting and outputting image data will be described with reference to the drawings.
FIG. 3 is a timing chart of image data input processing.
The SPE2D has a configuration for inputting / outputting data to / from the outside of the PE2.
Input image data is captured by the IOP 4 and set in the data input shift register (Serial-in) 6 while being shifted. Immediately after the PE2 is reset (initialized) (time t_i0), the flag Sin-emty, which indicates that the data input shift register 6 is empty, is 1, and the flag Sin-rdy, which indicates that data is ready in the data input shift register 6, is 0. When calculation starts, the IOP 4 waits for Sin-emty to become 1 before writing data to the data input shift register 6 (time t_i1), but since Sin-emty is already 1, it starts the data transfer immediately (time t_i2).

一方、ＰＥ−Ｉ３はSin-rdyが１になるのを待ってウェイト状態になる（時刻ｔ_ｉ２）。ＰＥ−Ｉ３がウェイト状態になると同時に、全てのＳＰＥがウェイト状態に入る。ＩＯＰ４はデータ入力シフトレジスタ６に１ライン分のデータ転送を終了するとSin-rdyを１にする（時刻ｔ_ｉ４）。するとＰＥ−Ｉ３はウェイト状態から抜け出して（時刻ｔ_ｉ５）、１クロックでデータ入力シフトレジスタ６からデータ入力レジスタ（SIN-reg）１６へデータを転送すると同時にSin-rdyをクリアし、更にSin-emtyを１にしてＩＯＰ４にデータ入力シフトレジスタ６が空になったことを知らせる（時刻ｔ_ｉ８）。データ入力バッファー（SIN-reg）１６にデータが用意できると、それをＳＰＥ２Ｄの処理により読み出されて、ローカルメモリに蓄積したり、或いは即座に計算処理に使ったりできる。 On the other hand, PE-I3 enters a wait state until Sin-rdy becomes 1 (time t_i2). As soon as PE-I3 enters the wait state, all SPEs enter the wait state. The IOP 4 sets Sin-rdy to 1 when it finishes transferring one line of data to the data input shift register 6 (time t_i4). PE-I3 then exits the wait state (time t_i5), transfers the data from the data input shift register 6 to the data input buffer (SIN-reg) 16 in one clock, clears Sin-rdy at the same time, and sets Sin-emty to 1 to inform the IOP 4 that the data input shift register 6 has become empty (time t_i8). Once data is ready in the data input buffer (SIN-reg) 16, the SPE2D can read it out and either store it in the local memory or use it immediately for calculation.

一方、ＩＯＰ４は１ライン分のデータを転送した後で他にするべき処理が無くなるとウェイト状態に入って、次のラインのデータを転送するためにSin-emtyが１になるのを待つ（時刻ｔ_ｉ７）。Sin-emtyが１になるとＩＯＰ４は再びデータ入力シフトレジスタ６へのデータ転送を開始する。 On the other hand, once the IOP 4 has transferred one line of data and has no other processing to do, it enters a wait state and waits for Sin-emty to become 1 so that it can transfer the next line of data (time t_i7). When Sin-emty becomes 1, the IOP 4 starts transferring data to the data input shift register 6 again.

Sin-emtyは、ＩＯＰ４が適当なタイミングでクリアする（時刻ｔ_ｉ３）。この手順により、ＩＯＰ４からのデータ書き込みとＰＥ２での計算処理とが同期し、更にＰＥ２での計算処理と画像データの入力作業とを並行させて実行できる。 Sin-emty is cleared by the IOP 4 at an appropriate timing (time t_i3). This procedure synchronizes the data writing from the IOP 4 with the calculation processing in the PE 2, and also allows the calculation processing in the PE 2 and the input of image data to be executed in parallel.
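The two-flag handshake for image input can be sketched in Python as follows. This is a simplified model, not the circuit itself: the class and method names are hypothetical, a whole line is moved in one call instead of being shifted in pixel by pixel, and a "wait" is represented by a method returning `False` or `None`.

```python
class InputPort:
    """Simplified model of the Sin-emty / Sin-rdy handshake."""

    def __init__(self):
        self.sin_emty = 1      # shift register is empty after reset
        self.sin_rdy = 0       # no line is ready after reset
        self.shift_reg = None  # data input shift register 6
        self.sin_reg = None    # data input buffer (SIN-reg) 16

    def iop_write_line(self, line):
        """IOP side: returns False (keep waiting) unless Sin-emty is 1."""
        if not self.sin_emty:
            return False
        self.sin_emty = 0            # cleared by the IOP during the transfer
        self.shift_reg = list(line)  # one full line shifted in
        self.sin_rdy = 1             # signal that the line is ready
        return True

    def pe_take_line(self):
        """PE-I side: returns None (keep waiting) until Sin-rdy is 1."""
        if not self.sin_rdy:
            return None
        self.sin_reg = self.shift_reg  # one-clock copy into SIN-reg
        self.sin_rdy = 0
        self.sin_emty = 1              # tell the IOP the shift register is free
        return self.sin_reg


port = InputPort()
waited = port.pe_take_line()              # nothing ready yet
first = port.iop_write_line([1, 2, 3])    # accepted
blocked = port.iop_write_line([4, 5, 6])  # refused, previous line not yet taken
taken = port.pe_take_line()
second = port.iop_write_line([4, 5, 6])   # accepted after the buffer copy
```

Once the line has been copied into SIN-reg, the IOP can refill the shift register while the PE works on the buffered line, which is how input and calculation overlap.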

図４は、ＩＯＰ４を介して外部へデータを出力する処理のタイミングチャートである。
ＰＥ２がリセットされた直後は、データ出力バッファ（SOUT-reg）１７が空であることを示すフラグSOUT-emtyが1に、また SOUT-regにデータが準備できたことを示すフラグSOUT-rdyが0になっている（時刻ｔｏ０）。計算を開始すると、ＩＯＰ４はSOUT-rdyが1になるのを待ってウェイト状態になる（時刻ｔｏ１）。
一方、ＰＥ−Ｉ３はSOUT-emtyが1になるのを待つウェイト状態になろうとするが、既にSOUT-emtyが1なのでウェイトには入らない（時刻ｔｏ１）。そして、ＰＥ−Ｉ３がSOUT-emtyをクリアすると同時に、ＳＰＥ２ＤがSOUT-regに出力データを蓄積し始める（時刻ｔｏ３）。
出力すべきデータの準備できると、ＰＥ−Ｉ３はSOUT-rdyを１にする（時刻ｔｏ４）。するとＩＯＰ４はウェイト状態から抜け出して（時刻ｔｏ７）、１クロックでSOUT-regからSerial-outへデータを転送すると同時にSOUT-rdyをクリアし、更にSOUT-emtyを1にしてＰＥ−Ｉ３にSOUT-regが空になったことを通知する（時刻ｔｏ８）。その後、ＩＯＰ４はSerial-outをシフトしながらデータを読み出して、LSIの外部へと転送する（時刻ｔｏ８）。この手順により、ＰＥ２でのデータ処理と、ＩＯＰ４によるLSI外部へのデータ出力とを同期させ、更にこれらの処理を並行して実行できる。 FIG. 4 is a timing chart of a process for outputting data to the outside via the IOP4.
Immediately after the PE2 is reset, the flag SOUT-emty, which indicates that the data output buffer (SOUT-reg) 17 is empty, is 1, and the flag SOUT-rdy, which indicates that data is ready in SOUT-reg, is 0 (time to0). When calculation starts, the IOP 4 enters a wait state until SOUT-rdy becomes 1 (time to1).
On the other hand, PE-I3 tries to enter a wait state waiting for SOUT-emty to become 1, but does not enter the wait state because SOUT-emty is already 1 (time to1). Then, at the same time as PE-I3 clears SOUT-emty, the SPE2D starts to accumulate output data in SOUT-reg (time to3).
When the data to be output is ready, PE-I3 sets SOUT-rdy to 1 (time to4). The IOP 4 then exits the wait state (time to7), transfers the data from SOUT-reg to Serial-out in one clock, clears SOUT-rdy at the same time, and sets SOUT-emty to 1 to notify PE-I3 that SOUT-reg has become empty (time to8). Thereafter, the IOP 4 reads the data while shifting Serial-out and transfers it to the outside of the LSI (time to8). This procedure synchronizes the data processing in the PE2 with the data output to the outside of the LSI by the IOP 4, and allows these processes to be executed in parallel.

この他に、隣り合ったＰＥ２間でデータを転送するために、ＰＥ出力レジスタＬレジスタ１８とＰＥ出力レジスタＲレジスタ１９が在る。左側即ちＰＥ２の番号が小さいＰＥ２へ渡したいデータは、ＳＰＥ２ＤがそれをＰＥ出力レジスタＬレジスタ１８に書くと、次の命令以降で左隣のＳＰＥ２Ｄが読み出せる。同様に、右側即ちＰＥ２の番号が大きいＰＥ２へ渡したいデータは、ＳＰＥ２ＤがそれをＰＥ出力レジスタＲレジスタ１９に書き込むと、次の命令以降で右隣のＳＰＥ２Ｄが読み出せる。
以上述べてきたように、図２に示す構成を用いることにより、ローカルメモリへのアクセスにおけるバンド幅の不足や各ＳＰＥを命令実行状態に保ち続けることに効果がある。 In addition, a PE output register L18 and a PE output register R19 are provided for transferring data between adjacent PEs 2. When the SPE2D writes data destined for the left side, that is, for the PE2 with the smaller PE2 number, to the PE output register L18, the SPE2D of the left-hand neighbor can read it from the next instruction onward. Similarly, when the SPE2D writes data destined for the right side, that is, for the PE2 with the larger PE2 number, to the PE output register R19, the SPE2D of the right-hand neighbor can read it from the next instruction onward.
As described above, the configuration shown in FIG. 2 is effective in relieving the shortage of bandwidth in accessing the local memory and in keeping each SPE in the instruction execution state.

さらに、画像処理用途に適用する際に、生じるローカルメモリへのアクセスにおけるバンド幅の不足の問題を解決する技術について説明する。
図５は、本発明の実施形態における演算処理に入力される映像信号の例を示す。
入力される映像信号は、図に示されるように640画素×480画素を２次元に配列した画素で構成される画面によって示される白黒画像を表している。１画素は、８ビットで構成され、各画素の明るさに応じた階調が示される。 Furthermore, a technique for solving the problem of insufficient bandwidth in access to a local memory that occurs when applied to an image processing application will be described.
FIG. 5 shows an example of a video signal input to the arithmetic processing in the embodiment of the present invention.
As shown in the figure, the input video signal represents a black and white image displayed by a screen composed of pixels in which 640 pixels × 480 pixels are two-dimensionally arranged. One pixel is composed of 8 bits, and a gradation corresponding to the brightness of each pixel is indicated.

続いて、画像処理特有の処理についての問題と対策について説明する。
並列計算装置１には、映像信号がビデオカメラから１ライン分の６４０画素を単位として、ラインごとに４８０回に分けてデータが入力される。入力されたデータは、ＰＥ２−０からＰＥ２−１０６の１０７個のＰＥ２に分割して格納され、各ＰＥ２が分担して演算処理をする。
ところで、画像処理では、特定の画素を定め、その画素に対し、２次元平面で示す座標軸方向に隣接する上下左右方向の画素のデータを参照することが多い。
頻繁に参照される上下左右の画素のデータを効率よくアクセスできないと、処理が煩雑になり処理速度が低下する。
最初に、水平方向に並んだ左右の画素を効率よく扱うための方法を説明する。
各ＰＥ２は、入力されたデータの内、分割された６画素分のデータを担当する（107×6 = 642）。しかし、自ら担当する６画素しかローカルメモリに格納しないと、頻繁に両隣のＰＥ２との間でデータの受け渡しが発生して効率が悪くなる。そこで各ＰＥ２は、両隣が格納すべきデータを１画素分ずつ重複して格納することにする。各ローカルメモリは、１ライン当たり８画素を記憶することにする。 Next, problems and countermeasures regarding processing unique to image processing will be described.
The video signal from the video camera is input to the parallel computing device 1 one line (640 pixels) at a time, in 480 separate transfers. The input data is divided and stored in the 107 PEs 2 from PE 2-0 to PE 2-106, and the PEs 2 share the arithmetic processing.
By the way, in image processing, a specific pixel is defined, and data of pixels in the vertical and horizontal directions adjacent to the pixel in the coordinate axis direction indicated by a two-dimensional plane is often referred to.
If the data of the upper, lower, left, and right pixels that are frequently referred to cannot be accessed efficiently, the processing becomes complicated and the processing speed decreases.
First, a method for efficiently handling the left and right pixels arranged in the horizontal direction will be described.
Each PE 2 is in charge of 6 pixels of the input data (107 × 6 = 642). However, if each PE 2 stored only its own 6 pixels in its local memory, data would frequently have to be exchanged with the adjacent PEs 2 on both sides, which is inefficient. Therefore, each PE 2 also stores, in duplicate, one pixel of the data belonging to each of its two neighbors. Each local memory thus stores 8 pixels per line.
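The horizontal partitioning with a one-pixel overlap can be sketched in Python as below. The function name `stored_columns` and the use of `None` for positions that fall outside the 640-pixel line are assumptions; the text does not say how the edge PEs fill those positions.

```python
WIDTH = 640        # pixels per line
PIXELS_PER_PE = 6  # pixels each PE is responsible for
NUM_PE = 107       # 107 * 6 = 642, enough to cover 640 pixels


def stored_columns(k):
    """Column indices held by PE k: its own 6 pixels plus one neighbor pixel per side."""
    first = k * PIXELS_PER_PE
    cols = range(first - 1, first + PIXELS_PER_PE + 1)
    # None marks a position outside the image (edge handling is assumed)
    return [c if 0 <= c < WIDTH else None for c in cols]


left_edge = stored_columns(0)
interior = stored_columns(1)
right_edge = stored_columns(NUM_PE - 1)
```

With this layout a horizontal neighbor lookup for any of a PE's own 6 pixels stays inside that PE's local memory, at the cost of storing 8 values per line instead of 6.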

続いて、垂直方向に並んだ上下の画素を効率よく扱うための方法を説明する。上下方向の画素は、ライン単位で扱われることからライン間の参照として説明する。
各ＳＰＥに配置されるローカルメモリ１１Ａ〜１１Ｄの容量は512×8b（ビット）なので、それぞれのＳＰＥが備える１つのローカルメモリには、一度に６４ライン（= 512÷8（画素））分のデータを格納できる。ＰＥ２内にローカルメモリは、４個あるので、最大で２５６ライン分のデータを同時に格納できるが、１画面は４８０ラインなので１画面分のデータまでを格納することはできない。
画像処理では画像を狭い領域ごとに区切って演算対象にすることが多いため、入力されるデータを例えばＳＰＥ２Ａのローカルメモリ１１Ａだけに順に格納すると、演算する時にＳＰＥ２Ａのローカルメモリ１１Ａだけが頻繁に読み出され、メモリを分散配置して得られたデータ転送能力を生かせないことになる。 Next, a method for efficiently handling the upper and lower pixels arranged in the vertical direction will be described. Since the pixels in the vertical direction are handled in units of lines, they will be described as a reference between lines.
Since the capacity of each of the local memories 11A to 11D arranged in the SPEs is 512 × 8b (bits), the one local memory included in each SPE can store 64 lines (= 512 ÷ 8 (pixels)) of data at a time. Since there are four local memories in the PE2, up to 256 lines of data can be stored at the same time; however, since one screen has 480 lines, a full screen of data cannot be stored.
In image processing, an image is often divided into small regions that are processed in turn. Therefore, if the input data is stored sequentially in, for example, only the local memory 11A of the SPE2A, only the local memory 11A of the SPE2A is read frequently during calculation, and the data transfer capability gained by distributing the memories goes unused.
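The capacity figures quoted above follow from simple arithmetic, reproduced here as a short Python check. The variable names are illustrative only.

```python
MEM_ENTRIES = 512          # 512 x 8-bit entries per local memory
PIXELS_PER_LINE_PER_PE = 8 # 6 own pixels + 2 overlap pixels, 8 bits each
NUM_MEMORIES = 4           # local memories 11A to 11D
IMAGE_LINES = 480

lines_per_memory = MEM_ENTRIES // PIXELS_PER_LINE_PER_PE  # lines in one local memory
lines_per_pe = lines_per_memory * NUM_MEMORIES            # lines in all four memories
fits_whole_frame = lines_per_pe >= IMAGE_LINES            # can one PE hold a full screen?
```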

また、一度に６５ライン分以上のデータを格納しようとすると、初めの６４ライン分と次の６４ライン分のデータを別々のメモリに格納しなければならず、データを格納するためのプログラムだけでなく、そのデータを使うプログラムも、条件判断処理を行うため複雑になってしまう。
このように、単にローカルメモリを分割して配置するだけでは、メモリアクセスのバンド幅を広げることはできず、また６５ライン分以上のデータを格納して処理しようとすると、処理プログラムが複雑化するという新たな問題が生じてしまう。 Further, if data for 65 lines or more is to be stored at a time, the data for the first 64 lines and the data for the next 64 lines must be stored in separate memories, and only a program for storing the data is required. In addition, the program that uses the data becomes complicated because the condition determination process is performed.
In this way, the bandwidth of memory access cannot be expanded simply by dividing and arranging the local memory, and the processing program becomes complicated when trying to store and process data for 65 lines or more. A new problem arises.

ここで、ローカルメモリへアクセスするためのアドレスレジスタについて説明する。
図６は、アドレスレジスタの構成を示す図である。この図に示されるアドレスレジスタには、モード０〜モード２として選択できる３つのモードを設定できる。
モード０では、ベースポインタＢＰと、４つのポインタＡＰ０〜ＡＰ３によって参照するメモリアドレスを定める。メモリアクセス命令は、そのオペランドにポインタＡＰ０〜ＡＰ３を一つ指定できる。例えばポインタＡＰ０を指定するとポインタＡＰ０の下位３ビットがメモリアドレスのビット2-0になり、ビット4-3は０固定、そしてＢＰの値がビット8-5になる。 Here, an address register for accessing the local memory will be described.
FIG. 6 is a diagram showing the configuration of the address register. In the address register shown in this figure, three modes that can be selected as mode 0 to mode 2 can be set.
In mode 0, a memory address to be referred to is determined by the base pointer BP and the four pointers AP0 to AP3. The memory access instruction can designate one pointer AP0 to AP3 as its operand. For example, when the pointer AP0 is designated, the lower 3 bits of the pointer AP0 become bits 2-0 of the memory address, the bits 4-3 are fixed to 0, and the value of BP becomes bits 8-5.

メモリアドレスを自動的に１増加させる命令を実行する場合は、ローカルメモリをアクセスした後にポインタＡＰ０が１増加する。ポインタＡＰ０の値が、７から８になると、下位３ビットが全て０になるので、結局最初のアドレスに戻ることになる。このモードは、画像処理において、１ラインごとに処理する場合に便利である。ラインを１だけ移動する場合は、ベースポインタＢＰの値を１だけ増減する。ベースポインタＢＰやポインタＡＰ０〜ＡＰ３を１だけ増減する命令が用意されている。このモードでは、例えばポインタＡＰ０はデータの読み出し番地を、ポインタＡＰ１は一時的にデータを書き込む格納番地をそれぞれ保持し、読み書きする命令でポインタＡＰ０とポインタＡＰ１とをオペランド指定して切り替えてそれぞれの番地を切り換えてアクセスできる。 When an instruction that automatically increments the memory address by 1 is executed, the pointer AP0 is incremented by 1 after the local memory is accessed. When the value of the pointer AP0 goes from 7 to 8, the lower 3 bits all become 0, so the address returns to the first address. This mode is convenient for line-by-line processing in image processing. To move by one line, the value of the base pointer BP is increased or decreased by 1. Instructions for increasing or decreasing the base pointer BP and the pointers AP0 to AP3 by 1 are provided. In this mode, for example, the pointer AP0 can hold the address from which data is read and the pointer AP1 the address at which data is temporarily stored; by specifying the pointer AP0 or the pointer AP1 as the operand of each read or write instruction, the program can switch between the two addresses.

モード１の動作も同様であるが、ポインタＡＰ０とＡＰ２の下位４ビットが割り当てられているので、データを２ラインずつアクセスする場合に有用である。
モード２ではポインタＡＰ０の値がそのまま９ビットのアドレスに変換できるので、５１２Ｂ（バイト）のメモリ空間をリニアにアクセスできる。 The operation in mode 1 is the same, but since the lower 4 bits of the pointers AP0 and AP2 are allocated, it is useful when data is accessed two lines at a time.
In mode 2, since the value of the pointer AP0 can be directly converted into a 9-bit address, the 512B (byte) memory space can be accessed linearly.
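The address formation for mode 0 and mode 2 can be written out as a small Python sketch. The function names are hypothetical, and masking BP to 4 bits and AP0 to 9 bits is inferred from the stated bit positions. Mode 1 is omitted because the text does not give its complete bit layout.

```python
def address_mode0(bp, ap):
    """Mode 0: BP -> bits 8-5, bits 4-3 fixed to 0, low 3 bits of AP -> bits 2-0."""
    return ((bp & 0xF) << 5) | (ap & 0x7)


def address_mode2(ap0):
    """Mode 2: the pointer AP0 is used directly as a 9-bit linear address."""
    return ap0 & 0x1FF


# Stepping AP from 0 to 8 in mode 0 stays within one 8-pixel line and wraps.
line_walk = [address_mode0(bp=1, ap=ap) for ap in range(9)]
```

In this sketch `line_walk` ends at the same address it started from, which corresponds to the wrap from 7 to 8 described for the pointer AP0, and changing `bp` by 1 moves the whole window to another line.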

図７は、本実施形態によるローカルメモリへの画像データのマッピング例を示す。
６５ライン分以上のデータを格納する場合にも処理プログラムを同一にするために、図に示すように画像データを１ライン毎にＳＰＥ２Ａのローカルメモリ１１Ａ、ＳＰＥ２Ｂのローカルメモリ１１Ｂ、ＳＰＥ２Ｃのローカルメモリ１１Ｃと順に格納することにする。 FIG. 7 shows an example of mapping of image data to the local memory according to the present embodiment.
To keep the processing program the same even when 65 or more lines of data are stored, the image data is stored line by line in the local memory 11A of the SPE2A, the local memory 11B of the SPE2B, and the local memory 11C of the SPE2C in turn, as shown in the figure.

最初に画像信号におけるライン０のデータをローカルメモリーに格納する場合の手順を示す。 First, the procedure for storing the data of line 0 in the image signal in the local memory is shown.

ＩＯＰ４によって取り込まれた画像信号は、データ入力シフトレジスタ６によって、ＳＰＥ２Ｄにおけるデータ入力バッファ１６に設定される。設定された画像信号のデータをＳＰＥ２ＤにおけるＡＬＵ１２ＤＡがデータ入力バッファ１６から読み出すと、データはＡｃｃ１３Ｄに書き込まれ、レジスタR3として参照できるようになる。ＳＰＥ２ＡにおけるＡＬＵ１２Ａは、レジスタR3に書き込まれたデータを参照し、Ａｃｃ１３Ａへ転送した後、ローカルメモリ１１Ａに書き込む。 The image signal captured by the IOP 4 is set in the data input buffer 16 in the SPE 2D by the data input shift register 6. When the ALU 12DA in the SPE 2D reads the set image signal data from the data input buffer 16, the data is written to the Acc 13D and can be referred to as the register R3. The ALU 12A in the SPE 2A refers to the data written in the register R3, transfers it to the Acc 13A, and then writes it in the local memory 11A.

続いて、ＳＰＥ２Ａのローカルメモリ１１Ａにライン０のデータを書き込むプログラムを使って、次のライン１のデータをＳＰＥ２Ｂのローカルメモリ１１Ｂに書き込む手順を説明する。
図８は、本発明の実施形態におけるＰＥ−Ｉ３を示す概略ブロック図である。この図に示されるＰＥ−Ｉ３は、内部に交換部３１を備える。
ＰＥ−Ｉ３は、命令メモリ５に登録されたプログラムに記述された命令を分解し、ＳＰＥ２Ａ用、ＳＰＥ２Ｂ用、ＳＰＥ２Ｃ用の命令（OPコード）と、その命令の交換を行うローテーション命令を抽出する。ローテーション命令はＰＥ−Ｉ３で実行される。ＳＰＥ２Ａ用、ＳＰＥ２Ｂ用、ＳＰＥ２Ｃ用の命令（OPコード）を、それぞれｓｐｅ−ａ用コード、ｓｐｅ−ｂ用コード、ｓｐｅ−ｃ用コードと示し、ローテーション命令をＲＯＴ命令と示す。
ＰＥ―Ｉ３における交換部３１は、これらのOPコードを、必要に応じてローテーション（交換）する。 Next, a procedure for writing the next line 1 data to the local memory 11B of the SPE2B using a program for writing the data of the line 0 to the local memory 11A of the SPE2A will be described.
FIG. 8 is a schematic block diagram showing the PE-I 3 in the embodiment of the present invention. PE-I3 shown by this figure is provided with the exchange part 31 inside.
The PE-I 3 disassembles the instructions described in the program registered in the instruction memory 5 and extracts instructions for SPE2A, SPE2B, and SPE2C (OP codes) and rotation instructions for exchanging the instructions. The rotation instruction is executed by PE-I3. The instructions (OP codes) for SPE2A, SPE2B, and SPE2C are indicated as spe-a code, spe-b code, and spe-c code, respectively, and the rotation instruction is indicated as a ROT instruction.
The exchange unit 31 in the PE-I 3 rotates (exchanges) these OP codes as necessary.

交換部３１は、命令セレクト部３２、命令選択部３３Ａ〜３３Ｃ、命令デコード部３４Ａ〜３４Ｃを備える。
交換部３１における命令セレクト部３２は、入力されるＲＯＴ命令に応じて、命令選択部３３Ａ〜３３Ｃの入力選択を制御する。命令セレクト部３２は、内部に２ビットのカウンタを備え、ＲＯＴ命令が実行されるとカウンタの値を１ずつ増加させる。命令セレクト部３２は、カウンタの値が２のときにＲＯＴ命令が実行されると値を０に戻し、０〜２の範囲で変化させる。命令セレクト部３２は、そのカウンタの値に応じて、命令選択部３３Ａ〜３３Ｃの入力を切り換えるセレクト信号を出力する。
命令選択部３３Ａ〜３３Ｃは、入力されるｓｐｅ−ａ用コード、ｓｐｅ−ｂ用コード、ｓｐｅ−ｃ用コードを命令セレクト部３２が出力するセレクト信号に応じて切り換える。
命令デコーダ部３４Ａ〜３４Ｃは、入力される命令コードに応じて各ＳＰＥ２Ａ〜２Ｃの制御信号を生成し出力する。 The exchange unit 31 includes an instruction select unit 32, instruction selection units 33A to 33C, and instruction decoding units 34A to 34C.
The instruction select unit 32 in the exchange unit 31 controls the input selection of the instruction selection units 33A to 33C according to the input ROT instruction. The instruction select unit 32 contains a 2-bit counter and increments the counter value by one each time the ROT instruction is executed. When the ROT instruction is executed while the counter value is 2, the instruction select unit 32 returns the value to 0, so that the value cycles through the range 0 to 2. The instruction select unit 32 outputs a select signal that switches the inputs of the instruction selection units 33A to 33C according to the counter value.
The instruction selection units 33A to 33C switch among the input spe-a code, spe-b code, and spe-c code in accordance with the select signal output from the instruction select unit 32.
The instruction decoding units 34A to 34C generate and output control signals for the SPEs 2A to 2C according to the input instruction codes.

交換部３１において命令選択部３３Ａは、ｓｐｅ−ａ用コードがＩ０入力に、ｓｐｅ−ｂ用コードがＩ２入力に、ｓｐｅ−ｃ用コードがＩ１入力に入力される。命令選択部３３Ｂは、ｓｐｅ−ａ用コードがＩ１入力に、ｓｐｅ−ｂ用コードがＩ０入力に、ｓｐｅ−ｃ用コードがＩ２入力に入力される。命令選択部３３Ｃは、ｓｐｅ−ａ用コードがＩ２入力に、ｓｐｅ−ｂ用コードがＩ１入力に、ｓｐｅ−ｃ用コードがＩ０入力に入力される。
命令選択部３３Ａ〜３３Ｃは、入力されるセレクト信号の値０〜２に応じて、対応するＩ０入力、Ｉ１入力、Ｉ２入力の各入力端子に入力されるコードを出力する。 In the exchanging unit 31, the instruction selecting unit 33A receives the code for spe-a at the I0 input, the code for spe-b at the I2 input, and the code for spe-c at the I1 input. In the instruction selector 33B, the code for spe-a is input to the I1 input, the code for spe-b is input to the I0 input, and the code for spe-c is input to the I2 input. In the instruction selecting unit 33C, the code for spe-a is input to the I2 input, the code for spe-b is input to the I1 input, and the code for spe-c is input to the I0 input.
The instruction selection units 33A to 33C output codes input to the corresponding input terminals of the I0 input, the I1 input, and the I2 input in accordance with the values 0 to 2 of the input select signal.

図を参照し、図８に示した交換部３１によって各ＳＰＥに供給される制御信号を交換する処理を説明する。
図９は、交換部３１における命令選択状態を示す図である。
命令セレクト部３２が出力するセレクト信号が０のとき、ＳＰＥ２Ａの制御信号にはｓｐｅ−ａ用コードが、ＳＰＥ２Ｂの制御信号にはｓｐｅ−ｂ用コードが、ＳＰＥ２Ｃの制御信号にはｓｐｅ−ｃ用コードが出力される。ＲＯＴ命令が実行され、セレクト信号が１になると、ＳＰＥ２Ａの制御信号にはｓｐｅ−ｃ用コードが、ＳＰＥ２Ｂの制御信号にはｓｐｅ−ａ用コードが、ＳＰＥ２Ｃの制御信号にはｓｐｅ−ｂ用コードが出力される。さらにＲＯＴ命令が実行され、セレクト信号が２になると、ＳＰＥ２Ａの制御信号にはｓｐｅ−ｂ用コードが、ＳＰＥ２Ｂの制御信号にはｓｐｅ−ｃ用コードが、ＳＰＥ２Ｃの制御信号にはｓｐｅ−ａ用コードが出力される。次にＲＯＴ命令が実行されるとセレクト信号の値は０に戻る。 A process of exchanging control signals supplied to each SPE by the exchanging unit 31 shown in FIG. 8 will be described with reference to the drawings.
FIG. 9 is a diagram illustrating an instruction selection state in the exchange unit 31.
When the select signal output from the instruction select unit 32 is 0, the spe-a code is output as the control signal of the SPE2A, the spe-b code as the control signal of the SPE2B, and the spe-c code as the control signal of the SPE2C. When the ROT instruction is executed and the select signal becomes 1, the spe-c code is output as the control signal of the SPE2A, the spe-a code as the control signal of the SPE2B, and the spe-b code as the control signal of the SPE2C. When the ROT instruction is executed again and the select signal becomes 2, the spe-b code is output as the control signal of the SPE2A, the spe-c code as the control signal of the SPE2B, and the spe-a code as the control signal of the SPE2C. When the ROT instruction is executed once more, the value of the select signal returns to 0.
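The rotation performed by the exchange unit can be modelled with a short Python sketch. The `WIRING` table follows the I0/I1/I2 connections described for the instruction selection units 33A to 33C; the class name `InstructionSelect` and keying the table by the SPE that each selector drives are assumptions made for readability.

```python
# (I0, I1, I2) inputs of the selector feeding each SPE, per the described wiring.
WIRING = {
    "SPE2A": ("spe-a", "spe-c", "spe-b"),  # instruction selection unit 33A
    "SPE2B": ("spe-b", "spe-a", "spe-c"),  # instruction selection unit 33B
    "SPE2C": ("spe-c", "spe-b", "spe-a"),  # instruction selection unit 33C
}


class InstructionSelect:
    """Model of the counter that cycles 0 -> 1 -> 2 -> 0 on each ROT instruction."""

    def __init__(self):
        self.counter = 0

    def rot(self):
        self.counter = (self.counter + 1) % 3

    def outputs(self):
        """Code stream delivered to each SPE for the current select value."""
        return {spe: inputs[self.counter] for spe, inputs in WIRING.items()}


sel = InstructionSelect()
state0 = sel.outputs()
sel.rot()
state1 = sel.outputs()
sel.rot()
state2 = sel.outputs()
sel.rot()
state3 = sel.outputs()  # back to the initial assignment
```

The three states reproduce the assignments listed for select values 0, 1, and 2, and a third ROT restores the original assignment.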

命令セレクト部３２におけるセレクト信号が０の状態で、ＳＰＥ２Ａがｓｐｅ−ａ用コードにしたがってライン０の読み込みを行った後に、ＰＥ−Ｉ３でＲＯＴ命令を実行すると、セレクト信号が１になる。そのためｓｐｅ−ａ用として記述されたプログラムのOPコードであっても、ＳＰＥ２Ｂで実行されるようになる。つまり、ライン０を読み込んだプログラムと全く同じプログラムを実行してライン１のデータをＳＰＥ２Ｂのローカルメモリ１１Ｂに書き込むことができる。同様に、ライン１の読み込み終了後にＲＯＴ命令を実行すると、ライン２のデータがＳＰＥ２Ｃのローカルメモリ１１Ｃに書き込むことができる。
次に、ＲＯＴ命令を実行してデータを読み込むと、ＳＰＥ２Ａでライン０のデータがライン３のデータで上書きされることになる。それが不都合な場合は、ＰＥ−Ｉ３でＲＯＴ命令を行うのと同時にＳＰＥ２ＡでベースポインタＢＰ（図６）を１増やす命令を実行しておく。このようにすることで、ＲＯＴ命令を実行しながら単純にループするプログラムで、図７のようにデータを格納することができる。 When the ROT instruction is executed in PE-I3 after the SPE 2A reads line 0 in accordance with the code for spe-a while the select signal in the instruction select unit 32 is 0, the select signal becomes 1. Therefore, even an OP code of a program described for spe-a is executed by SPE2B. That is, it is possible to write the data of line 1 in the local memory 11B of the SPE 2B by executing the same program as the program that has read line 0. Similarly, when the ROT instruction is executed after the reading of the line 1 is completed, the data of the line 2 can be written into the local memory 11C of the SPE 2C.
Next, when data is read by executing the ROT instruction, the data on line 0 is overwritten with the data on line 3 in SPE2A. If this is inconvenient, the ROT instruction is executed at PE-I3 and at the same time, the instruction to increase the base pointer BP (FIG. 6) by 1 is executed at SPE2A. In this way, data can be stored as shown in FIG. 7 by a program that simply loops while executing the ROT instruction.
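The storage loop can be sketched in Python under one simplifying assumption: the base pointer of the SPE that has just stored a line is incremented in the same iteration as the ROT instruction, so that its next visit writes to a fresh row. The function name `store_lines` is hypothetical, and the sketch does not model the limited bit width of BP.

```python
SPES = ("SPE2A", "SPE2B", "SPE2C")


def store_lines(num_lines):
    """Return {line: (spe, bp)} for a loop of 'store one line, bump BP, ROT'."""
    select = 0                     # ROT counter held in PE-I
    bp = {spe: 0 for spe in SPES}  # one base pointer per local memory
    placement = {}
    for line in range(num_lines):
        spe = SPES[select]         # SPE currently executing the spe-a code
        placement[line] = (spe, bp[spe])
        bp[spe] += 1               # assumed: BP increment issued alongside ROT
        select = (select + 1) % 3  # ROT instruction
    return placement


layout = store_lines(7)
```

Under this assumption line n lands in local memory n mod 3 at row n div 3, which is the interleaved layout the text attributes to FIG. 7, and the loop body never changes.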

なお、ＳＰＥ２ＤとＰＥ−Ｉ３の命令は、常にｓｐｅ−ｄ用コードとｐｅ−ｉ用コードであり変化しない。また、ＳＰＥ２Ａ〜ＳＰＥ２Ｃをローテート（交換）するためには、これらが全く同じ構成でなければならない。したがって、ＰＥ２の外部とのデータ入出力に必要な構成や、乗算器などは全てＳＰＥ２Ｄに集中して配置する。 Note that the instructions of SPE2D and PE-I3 are always code for spe-d and code for pe-i and do not change. Moreover, in order to rotate (exchange) SPE2A-SPE2C, these must have the completely same structure. Accordingly, the configuration necessary for data input / output with the outside of PE2 and the multipliers are all concentrated on SPE2D.

次に、図７のように格納された画像データを使って、画像処理が効率的に行えることを説明する。例えば３画素×３画素の窓を有するデジタルフィルタを画像に適用する場合を示す。
最初の処理では、ライン０（上段）、ライン１（中段）、ライン２（下段）からそれぞれ３画素ずつのデータ（合計9画素）を読み出して処理する。ライン０のデータはＳＰＥ２Ａ、ライン１のデータはＳＰＥ２Ｂ、ライン２のデータはＳＰＥ２Ｃの各ローカルメモリ１１Ａ〜１１Ｃにそれぞれ格納されていると仮定する。
ＳＰＥ２Ａ〜２Ｃでは、各ラインのデータから３画素ずつ分散して読み出せる。つまり、拡大されたメモリバンド幅を有効に使うことができる。１ライン分の処理が終わって次のラインの処理に移ると、ライン１が上段、ライン２が中段、ライン３が下段になる。１ライン分の処理の最後で、ＲＯＴ命令を実行すれば、spe-a用コードがＳＰＥ２Ｂで、spe-ｂ用コードがＳＰＥ２Ｃで、spe-ｃ用コードがＳＰＥ２Ａでそれぞれ実行されるようになるので、順次切り換えられるラインへの移動が簡単に行える。 Next, it will be described how image processing can be performed efficiently using image data stored as shown in FIG. For example, a case where a digital filter having a window of 3 pixels × 3 pixels is applied to an image is shown.
In the first process, 3 pixels of data (9 pixels in total) are read from each of line 0 (upper row), line 1 (middle row), and line 2 (lower row) and processed. It is assumed that the data of line 0, line 1, and line 2 are stored in the local memories 11A, 11B, and 11C of the SPE2A, SPE2B, and SPE2C, respectively.
The SPEs 2A to 2C can each read the three pixels of their own line, so the reads are distributed across the local memories. That is, the expanded memory bandwidth can be used effectively. When the processing for one line is finished and the processing moves to the next line, line 1 becomes the upper row, line 2 the middle row, and line 3 the lower row. If the ROT instruction is executed at the end of the processing for one line, the spe-a code is executed by the SPE2B, the spe-b code by the SPE2C, and the spe-c code by the SPE2A, so moving on to each successive line is simple.

That is, the instruction that accesses the upper-row data is written as spe-a code, but the data is stored in the local memory 11B of SPE2B. Because the spe-a code is now executed by SPE2B, which contains the local memory 11B, the data of line 1 ends up being processed as the upper-row data. The spe-c code needs care, however: it is executed by SPE2A and, left as is, would access line 0. Therefore, an instruction that increments the base pointer BP by 1 is executed in SPE2A at the same time as the ROT instruction. Thus, by adding only the ROT instruction and an instruction that increments the base pointer BP, a program consisting of a simple loop can perform image processing on data stored as shown in FIG. 7.
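The loop described above can be imitated in software. The following is only a minimal sketch, not the patented hardware: it assumes that the FIG. 7 layout places line n in local memory n mod 3 at slot n div 3, and the names `filter_rows`, `local_mem`, `select`, and `base_ptr` are hypothetical. The ROT step is modeled by advancing `select`, and the base pointer of the SPE that previously held the upper row is incremented at the same time.

```python
def filter_rows(image, kernel):
    """Apply a 3x3 window to 'image' using three rotating line memories."""
    height, width = len(image), len(image[0])

    # Assumed layout: line n is stored in local memory n % 3, slot n // 3.
    local_mem = [[], [], []]
    for n, line in enumerate(image):
        local_mem[n % 3].append(line)

    select = 0            # role r (0=spe-a, 1=spe-b, 2=spe-c) runs on SPE (r + select) % 3
    base_ptr = [0, 0, 0]  # one base pointer BP per SPE
    out = []
    for _ in range(height - 2):
        # Each role reads the line currently addressed in "its" SPE.
        rows = []
        for role in range(3):
            spe = (role + select) % 3
            rows.append(local_mem[spe][base_ptr[spe]])

        out.append([
            sum(kernel[r][c] * rows[r][x + c] for r in range(3) for c in range(3))
            for x in range(width - 2)
        ])

        # ROT: the SPE that held the upper row will run the spe-c code next,
        # so its base pointer must advance to the next line stored there.
        base_ptr[select] += 1
        select = (select + 1) % 3
    return out
```

No pixel data is moved between the three memories; only the role-to-SPE mapping and one base pointer change per output line.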

Finally, the functions of the three register reference units 15A to 15C in FIG. 2 are described with reference to the drawings.
FIG. 10 is a diagram illustrating the operation of the register reference units.
For a program to operate correctly even after the SPEs are rotated, the Acc of the SPE executing the spe-a code must be readable from the other SPEs as register R0. Likewise, the Acc of the SPE executing the spe-b code must be readable as register R1, and the Acc of the SPE executing the spe-c code as register R2.
To this end, as shown in the figure, the register reference units 15A to 15C select among Acc13A to Acc13C according to the value of the select signal output from the instruction select unit 32 and present them as registers R0 to R2.
When the select signal output from the instruction select unit 32 is 0, register R0 refers to Acc13A, register R1 to Acc13B, and register R2 to Acc13C. When the select signal is 1, R0 refers to Acc13B, R1 to Acc13C, and R2 to Acc13A. When the select signal is 2, R0 refers to Acc13C, R1 to Acc13A, and R2 to Acc13B.
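The three cases listed above amount to a cyclic shift of the accumulators by the select value. As an illustrative sketch only (the function name `register_view` is hypothetical and the accumulators are modeled as list elements), the mapping can be written as:

```python
def register_view(acc, select):
    """Return [R0, R1, R2] as seen through the register reference units.

    acc    -- [Acc13A, Acc13B, Acc13C]
    select -- select signal from the instruction select unit (0, 1, or 2)
    """
    return [acc[(i + select) % 3] for i in range(3)]


accs = ["Acc13A", "Acc13B", "Acc13C"]
for select in range(3):
    print(select, register_view(accs, select))
```

Written this way, register Ri always resolves to the accumulator of the SPE that is currently running the i-th code stream, which is the property the text requires.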

(Second Embodiment)
An embodiment of the parallel computing device is described with reference to the drawings.
FIG. 11 is a block diagram showing the configuration of PE2 in the second embodiment. Components identical to those in FIG. 2 are denoted by the same reference signs.
The parallel computing device 1a shown in this figure includes PE2a, PE-I3a, IOP4, the instruction memory 5, the data input shift register 6, and the data output shift register 7. Components identical to those in FIG. 1 are denoted by the same reference signs. PE2a is representative of the PEs 2a-0 to 2a-106, which all have the same configuration. IOP4 and the instruction memory 5 are omitted from the figure.

PE2a corresponds to PE2 in FIG. 2 but differs in part of its configuration. PE2a includes SPE2Aa to SPE2Da and a register 2M.
SPE2Aa to SPE2Da in PE2a correspond to SPE2A to SPE2D in FIG. 2, but include selectors 14XA to 14XD in place of the selectors 14A to 14D and the register reference units 15A to 15C.
The selectors 14XA to 14XC each integrate the corresponding selector 14A to 14C with the corresponding register reference unit 15A to 15C, and select one of their input signals according to a selection control signal. The inputs to the selector 14XA are Acc13A to Acc13D, the local memory 11A, and the register 2M, and the selector switches among them according to the selection control signal from PE-I3a. The selectors 14XB and 14XC operate in the same way as the selector 14XA.
The inputs to the selector 14XD are Acc13A to Acc13D, the local memory 11D, the register 2M, the data input buffer (SIN_reg) 16, and the input signal from the adjacent PE2a, and the selector switches among them according to the selection control signal from PE-I3a.

PE-I3a corresponds to PE-I3 in FIG. 2 and includes an exchange unit 31a. The exchange unit 31a includes the instruction select unit 32 and selection control units 35A to 35D.
The instruction selection decoding units 35A to 35D each combine the corresponding instruction selection unit 33A to 33D and instruction decoding unit 34A to 34D in FIG. 8. They switch among the input spe-a code, spe-b code, and spe-c code according to the select signal output from the instruction select unit 32, and generate and output control signals that control SPE2A to SPE2C according to the selected instruction code.
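The code switching performed per SPE can be pictured as the inverse of the register mapping. The sketch below is an assumption-laden illustration, not the circuit itself: `dispatch` is a hypothetical name, and it assumes that a select value of 1 corresponds to the state after one ROT instruction, in which the spe-a code runs on SPE2B, the spe-b code on SPE2C, and the spe-c code on SPE2A.

```python
def dispatch(codes, select):
    """Return the code streams seen by [SPE2A, SPE2B, SPE2C].

    codes  -- [spe-a code, spe-b code, spe-c code]
    select -- select signal from the instruction select unit (0, 1, or 2)
    """
    return [codes[(spe - select) % 3] for spe in range(3)]


codes = ["spe-a", "spe-b", "spe-c"]
for select in range(3):
    print(select, dispatch(codes, select))
```

Under this assumption, code stream r runs on SPE (r + select) mod 3, which is consistent with register Rr referring to the accumulator of that same SPE.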

With the above configuration, selection of the registers R0 to R3 within SPE2A to SPE2C is simplified: the switching is done in a single stage by the selectors 14XA to 14XC inside the SPEs.
As with the ROT instruction shown in the first embodiment, the instructions (OP codes) for SPE2A, SPE2B, and SPE2C can be exchanged, and the selection that supplies them to the ALUs 12A to 12C of the SPEs can be carried out in the same way.

According to the embodiments described above, in a parallel computer that adopts a VLIW architecture with a plurality of sub-PEs, dividing the local memory and placing one part in each sub-PE avoids the problem of multiple simultaneous accesses to the local memory while increasing the data transfer capability between the memory and the sub-PEs. Furthermore, by giving two or more sub-PEs exactly the same configuration and exchanging the control signals that control their operation, the same effect is obtained as if all the data stored in the local memories had been exchanged in a short time (one clock). These features greatly improve the data transfer capability between the sub-PEs and the local memories and make it possible to provide a parallel processor with high effective computing performance.

According to the embodiments of the present invention, the parallel computing device 1 includes a plurality of arithmetic processors (PEs) 2 that perform arithmetic processing in parallel. The instruction execution control unit 3 supplies a control instruction to each arithmetic processor 2. In the sub-processor SPE2A, the storage unit 11A holds a plurality of data or a plurality of calculation results. The ALU 12A performs arithmetic processing on data read from the storage unit 11A and supplies the result to the storage unit.
At least two of the plurality of sub-processors have the same structure. The sub-processors having the same structure form a same-structure processor group. In the instruction execution control unit, the exchange unit exchanges the control signals to be supplied to the sub-processors included in the same-structure sub-processor group and supplies the exchanged signals.
As a result, the pairing of data and processing can be exchanged by processing a single instruction, without exchanging the data stored in the storage units and without preparing similar programs that differ only in their operands in order to apply the same processing to different reference data.

The present invention is not limited to the embodiments described above and can be modified without departing from the spirit of the invention.
For example, in the description above the number of arithmetic processors 2 built into one parallel computing device 1 (LSI) is 107, but the invention is not limited to this and is applicable to any computer containing one or more arithmetic processors 2. Likewise, the number of sub-PEs parallelized in VLIW fashion is four per arithmetic processor 2, but the invention is not limited to this and is applicable to any system having a plurality of SPEs. Furthermore, the number of sub-PEs whose OP codes are exchanged by the ROT instruction is three, but the invention is not limited to this and is applicable to any system in which the OP codes of a plurality of sub-PEs are exchanged.

In the description above, all the arithmetic processors 2 form a SIMD configuration controlled by a single instruction control processor, but the invention is not limited to this. It is also applicable to architectures intermediate between SIMD and MIMD, in which the arithmetic processors 2 are divided into a plurality of groups and a control processor is provided for each group. In the configuration of the second embodiment shown in FIG. 11, the OP codes of the SPEs are exchanged before they are decoded; however, in cases such as when the number of control signal lines is smaller than the number of bits in the OP code, the same effect can be obtained by exchanging after the OP codes have been decoded.

In the configuration shown in FIG. 2, the selection of Acc13A to Acc13C is performed in two stages to make the explanation easier to follow. To reduce the number of circuit elements required and at the same time increase the operating speed of the circuit, a single-stage configuration such as that shown in FIG. 11 is preferable. In that case, as shown in FIG. 11, the data selection signals of the selectors 14XA to 14XD are themselves exchanged using the same instruction select signal as in FIG. 8.
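The difference between the two-stage and single-stage selection can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than a description of the actual circuits: the operand names `R0` to `R3`, `LM`, and `M`, the treatment of R3 as Acc13D, and the function names `two_stage` and `one_stage` are all hypothetical.

```python
ACC_NAMES = ["Acc13A", "Acc13B", "Acc13C"]


def two_stage(operand, select, sources):
    """FIG. 2 style: register reference first, then the per-SPE selector."""
    # Stage 1: the register reference units build the logical view R0..R2.
    view = {f"R{i}": sources[ACC_NAMES[(i + select) % 3]] for i in range(3)}
    # Stage 2: the selector chooses among R0..R2 and the remaining inputs.
    view.update({"R3": sources["Acc13D"], "LM": sources["LM"], "M": sources["M"]})
    return view[operand]


def one_stage(operand, select, sources):
    """FIG. 11 style: rotate the data-select signal, then select once."""
    if operand in ("R0", "R1", "R2"):
        physical = ACC_NAMES[(int(operand[1]) + select) % 3]
    elif operand == "R3":
        physical = "Acc13D"  # assumed: R3 is not part of the rotation
    else:
        physical = operand   # local memory or register 2M
    return sources[physical]
```

Both functions return the same value for every operand and select value; the single-stage form simply folds the rotation into the select signal instead of passing the data through a second multiplexer.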

2: Arithmetic processor (PE)
2A, 2B, 2C, 2D: Sub arithmetic processor (SPE, sub-processor)
2G: Sub arithmetic processor group (same-structure processor group)
3: Instruction execution control processor (PE-I, instruction execution control unit)
11A, 11B, 11C, 11D: Local memory (LM, storage unit)
12A, 12B, 12C, 12D: ALU (arithmetic unit)
13A, 13B, 13C, 13D: Acc (register unit)
31: Exchange unit

Claims

1. A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
an instruction execution control unit that supplies a control signal to each of the arithmetic processors,
wherein each arithmetic processor comprises a plurality of sub-processors, each having:
a storage unit that holds a plurality of data or a plurality of calculation results; and
an arithmetic unit that performs arithmetic processing on data read from the storage unit and supplies the result to the storage unit,
wherein at least two of the plurality of sub-processors have the same structure, and the sub-processors having the same structure form a same-structure sub-processor group, and
wherein the instruction execution control unit comprises an exchange unit that exchanges the control signals to be supplied to the sub-processors included in the same-structure sub-processor group and supplies the exchanged control signals.

2. The parallel computing device according to claim 1, wherein
each sub-processor includes a register unit that is connected to one input of the arithmetic unit and stores information written to it,
the register unit normally stores the calculation result of the arithmetic unit and outputs the stored calculation result,
the same-structure sub-processor group includes a register reference unit,
the register reference unit treats the outputs of the register units of the sub-processors in the same-structure sub-processor group as inputs from at least some of the registers that supply input data to the other input of the arithmetic unit of each sub-processor, and can change the association between the register units and the registers that supply the input data, and
the instruction execution control unit, when exchanging the control signals supplied to the sub-processors of the same-structure sub-processor group, controls the register reference unit so that each register that supplies the input data is newly associated, in place of the register unit associated with it before the exchange, with the register unit of the sub-processor that receives, after the exchange, the control signal previously supplied to the sub-processor to which that earlier register unit belongs.