JP6477185B2

JP6477185B2 - Arithmetic processing apparatus and control method of arithmetic processing apparatus

Info

Publication number: JP6477185B2
Application number: JP2015079382A
Authority: JP
Inventors: 和也浅野
Original assignee: Socionext Inc
Current assignee: Socionext Inc
Priority date: 2015-04-08
Filing date: 2015-04-08
Publication date: 2019-03-06
Anticipated expiration: 2035-04-08
Also published as: JP2016200906A

Description

The embodiments discussed herein relate to an arithmetic processing device and a control method for an arithmetic processing device.

In recent years, parallel processing by single instruction multiple data (SIMD) arithmetic processing devices (processors) has been used to execute processing for, for example, image recognition and big data analysis techniques.

In such a SIMD processor, for example, an instruction (command) is conveyed from a PE control unit to each processor element (PE) through a common command bus (common bus).

Each PE (each core in a multi-core arithmetic processing device) accesses a memory (a common resource) via, for example, an arbitration circuit (arbiter), and thereby executes parallel processing of the data stored in the memory.

Various SIMD arithmetic processing devices including a plurality of processor elements, and control methods for such devices, have been proposed in the past.

Japanese Laid-open Patent Publication No. 2005-148899; Japanese Laid-open Patent Publication No. 08-030577; Japanese Laid-open Patent Publication No. 06-036060

As described above, in a SIMD processor, for example, commands from the PE control unit are input to each PE via the common bus. In addition, each PE accesses the memory via, for example, a complex arbiter provided on the memory side, and a memory read request and acknowledge mechanism must be implemented on the PE side.

Consequently, the next command cannot be executed until all PEs have finished executing the current command, so the number of processing cycles increases and it becomes difficult to execute data processing at high speed. In addition, depending on the circuit scale, the operating frequency of the arbiter drops, which again makes it difficult to execute data processing at high speed.

According to one embodiment, there is provided an arithmetic processing device that includes a plurality of processor elements capable of simultaneously accessing a common resource, and that has a first bus and a second bus.

The first bus is connected in parallel to the plurality of processor elements and gives a common first instruction to each of the processor elements. The second bus is connected in series to the plurality of processor elements.

Among the plurality of processor elements, the second bus conveys a second instruction, given to a first processor element in a first cycle, to a second processor element connected in series to the first processor element in a second cycle that follows the first cycle.

The disclosed arithmetic processing device and control method have the effect of improving the speed at which a plurality of processor elements process a common resource while suppressing an increase in circuit scale.

FIG. 1 is a block diagram showing an example of an arithmetic processing device as related art.
FIG. 2 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 1.
FIG. 3 is a diagram for explaining an example in which the arithmetic processing device shown in FIG. 1 performs graphics processing.
FIG. 4 is a diagram for explaining an example of binary tree search processing.
FIG. 5 is a timing chart for explaining an example of the operation when the arithmetic processing device shown in FIG. 1 executes the binary tree search processing shown in FIG. 4.
FIG. 6 is a block diagram showing the arithmetic processing device of the first embodiment.
FIG. 7 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 6.
FIG. 8 is a diagram for explaining the selector in the processor element shown in FIG. 7.
FIG. 9 is a timing chart for explaining an example of the operation of the arithmetic processing device of the first embodiment.
FIG. 10 is a block diagram showing the arithmetic processing device of the second embodiment.
FIG. 11 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 10.
FIG. 12 is a diagram for explaining the selector in the processor element shown in FIG. 11.
FIG. 13 is a timing chart for explaining an example of the operation of the arithmetic processing device of the second embodiment.
FIG. 14 is a block diagram showing the arithmetic processing device of the third embodiment.
FIG. 15 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 14.
FIG. 16 is a diagram (part 1) for explaining the selectors in the processor element shown in FIG. 15.
FIG. 17 is a diagram (part 2) for explaining the selectors in the processor element shown in FIG. 15.
FIG. 18 is a timing chart for explaining an example of the operation of the arithmetic processing device of the third embodiment.
FIG. 19 is a block diagram showing an example of a semiconductor integrated circuit to which the arithmetic processing device of the present embodiments is applied.

Before the arithmetic processing device and its control method according to the present embodiments are described in detail, an example of a control method for an arithmetic processing device, and its problems, will be described with reference to FIGS. 1 to 5.

FIG. 1 is a block diagram showing an example of an arithmetic processing device as related art, namely an example of a SIMD arithmetic processing device (SIMD processor) that executes processing for, for example, image recognition or big data analysis techniques.

In FIG. 1, reference numerals 101a to 101n denote processor elements (PEs), 102 denotes a PE control unit, 103 denotes a memory (tree memory), 104 denotes a common command bus (common bus), 105 denotes a comparison target data bus, and 106 denotes an arbitration circuit (arbiter).

Note that FIG. 1 shows the tree memory 103 as the common resource that the plurality of PEs 101a to 101n can access simultaneously, but an arbiter that performs arbitration is likewise provided for, for example, an image memory (not shown).

The PE control unit 102 includes a fetch unit 121 that temporarily holds an instruction (command) and a program memory 122 that stores a program. Each of the PEs 101a to 101n receives a common command from the PE control unit 102 through the common bus 104, and accesses the memory 103 via the corresponding one of the buses 110a to 110n between that PE and the arbiter 106.

In FIG. 1, each of the buses 110a to 110n between the PEs 101a to 101n and the arbiter 106 is drawn as a single line, but each includes, for example, a bidirectional multi-bit (for example, 32-bit or 64-bit) address bus and data bus.

When, for example, the SIMD processor shown in FIG. 1 performs a binary tree search, the tree data of the binary trees (the data of nodes 0 to k in trees 0 to n) is stored in, for example, a single memory (tree data storage memory).

The comparison target data may be supplied from, for example, a separate memory (RAM: random access memory). The number of PEs is set according to the specification and can range from a few to several hundred; for processing in image recognition or big data analysis techniques, a number on the order of one hundred and several tens is conceivable.

FIG. 2 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 1. As shown in FIG. 2, each PE 101 (101a to 101n) includes a command buffer 111 that receives and holds a command from the common bus 104, and a processor core 112 that executes the command held in the command buffer 111.

The PE 101 outputs a RAM (memory) address and a RAM read request to the memory 103 via the bus 110 and the arbiter 106, and receives a RAM read ACK (acknowledge) and RAM read data from the memory 103 via the arbiter 106 and the bus 110.

FIG. 3 is a diagram for explaining an example in which the arithmetic processing device shown in FIG. 1 performs graphics processing. FIG. 3 illustrates graphics processing as an example in which a SIMD processor including the plurality of PEs 101a to 101n processes multiple data items with the same program.

As shown in FIG. 3, when graphics processing is performed, the pixel data of the pixels assigned to each of the PEs 101a to 101n (A1, A2, A3, ... to N1, N2, N3, ...) is, for example, fed to that PE in a fixed order.

Each of the PEs 101a to 101n simply performs a fixed operation (for example, A1*A2+A3 to N1*N2+N3) on the input data and outputs the result. In other words, when graphics processing is performed, the data can be supplied from outside (an external memory) by, for example, DMA (direct memory access), regardless of the internal state of each of the PEs 101a to 101n.
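
The fixed per-PE operation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pe_multiply_add`, the PE labels, and the sample pixel values are hypothetical and do not come from the specification.

```python
def pe_multiply_add(x1, x2, x3):
    """Fixed operation applied by every PE to its own data: x1 * x2 + x3."""
    return x1 * x2 + x3

# Hypothetical pixel data streamed to each PE in a fixed order (e.g. by DMA).
pixel_streams = {
    "PE_a": (2, 3, 4),   # A1, A2, A3
    "PE_n": (5, 6, 7),   # N1, N2, N3
}

# Every PE runs the same operation on different data (SIMD).
outputs = {pe: pe_multiply_add(*data) for pe, data in pixel_streams.items()}
```

Because the inputs of each PE do not depend on any PE's earlier results, the whole `pixel_streams` table can be prepared outside the PEs, which is the property the text attributes to graphics processing.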

FIG. 4 is a diagram for explaining an example of binary tree search processing. FIG. 4(a) shows the nodes of a binary tree, and FIG. 4(b) is a flowchart for explaining an example of the binary tree search processing (binary tree processing).

As shown in FIG. 4(a), a binary tree is a tree structure, which is a kind of data structure, in which each parent node has at most two child nodes. The node without a parent is the root node (node 0), and the remaining nodes are arranged in order as node 1, node 2, and so on.

Such binary trees are used, for example, in a technique called random forest, which is used in machine learning such as image recognition. A SIMD processor is applied in order to process a plurality of binary tree searches in parallel. In the following description, for simplicity, it is assumed that the SIMD processor shown in FIG. 6 is used and that the luminance of a certain pixel is compared with a threshold at each node (node 1, 2, 3, ...) of the binary tree search.

As shown in FIG. 4(b), when binary tree processing starts (START), i is set to 0, that is, to the root node (node 0), in step ST1, and the processing proceeds to step ST2. In step ST2, for node i, the determination pixel coordinates, the determination threshold, the branch destination information for each of the true and false determination results, and the determination result for the case of a final node are read from the tree memory (32). The branch destination information indicates the tree memory address at which the information of the next node is stored (NULL in the case of a final node).

The processing then proceeds to step ST3, in which the luminance data corresponding to the determination pixel coordinates is read from the image memory (31), and then to step ST4. Here, the tree memory (32) stores the information of each node of the binary tree, and the image memory (31) stores the luminance of each pixel.

In step ST4, the determination is executed, that is, it is determined whether the luminance read in step ST3 is greater than the threshold (luminance > threshold?). In step ST5, it is then determined whether the node is a final node, that is, whether the branch destination is null (branch destination == NULL).

If it is determined in step ST5 that the node is not a final node, the processing proceeds to step ST6, where the branch destination node is decided from the determination result, and then returns to step ST2 to repeat the same processing. If it is determined in step ST5 that the node is a final node, the processing proceeds to step ST7, where the determination result is output, and the processing ends (END).
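
The flow of steps ST1 to ST7 can be sketched as a short loop. This is a minimal sketch, not the patent's implementation. The `Node` layout, the dictionary-based memories, the sample values, and in particular the choice to store one final result per comparison outcome (`result_true` and `result_false`) are assumptions made for illustration; the specification only says that a final node carries a determination result.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Node:
    pixel: Tuple[int, int]          # determination pixel coordinates
    threshold: int                  # determination threshold
    branch_true: Optional[int]      # tree-memory address if luminance > threshold
    branch_false: Optional[int]     # tree-memory address otherwise (None = NULL)
    result_true: Optional[str] = None    # assumed final-node results
    result_false: Optional[str] = None

def search(tree_memory: Dict[int, Node], image_memory: Dict[Tuple[int, int], int]):
    i = 0                                          # ST1: start at the root node
    while True:
        node = tree_memory[i]                      # ST2: read node i from tree memory
        luminance = image_memory[node.pixel]       # ST3: read luminance from image memory
        decision = luminance > node.threshold      # ST4: luminance > threshold?
        branch = node.branch_true if decision else node.branch_false
        if branch is None:                         # ST5: branch destination == NULL?
            return node.result_true if decision else node.result_false  # ST7
        i = branch                                 # ST6: move to the branch destination

# Hypothetical three-node tree and three-pixel image.
tree_memory = {
    0: Node((0, 0), 100, branch_true=1, branch_false=2),
    1: Node((1, 0), 50, None, None, result_true="A", result_false="B"),
    2: Node((0, 1), 200, None, None, result_true="C", result_false="D"),
}
image_memory = {(0, 0): 120, (1, 0): 30, (0, 1): 250}
```

Note that the address read in ST2 of each iteration depends on the comparison made in the previous iteration, so it is known only inside the PE running the loop.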

Consider the case where one binary tree is assigned to each of the PEs 101a to 101n. The data of the root node (node 0) is uniquely determined because there is no choice to make, so the node 0 data can be supplied from outside. However, whether node 1 or node 2 is selected in the next step is information known only to the PE concerned, so it is difficult to supply the corresponding data from outside.

That is, when binary tree search is applied to a SIMD processor, each of the PEs 101a to 101n needs individual access to the image memory (31) and the tree memory (32), at least from the second step onward.

As a result, simultaneous accesses to the memory from the plurality of PEs 101a to 101n occur, and a complex arbiter (arbitration circuit) must be implemented on the memory side. Depending on the circuit scale, the operating frequency of the arbiter then drops, making it difficult to execute data processing at high speed.

Furthermore, in a SIMD processor it is difficult to execute the next instruction until all of the PEs 101a to 101n have completed the current instruction. It is therefore difficult to proceed to the next instruction while the accesses of the individual PEs 101a to 101n are being arbitrated for the memory access described above. The number of processing cycles thus increases, which again makes it difficult to execute data processing at high speed.

FIG. 5 is a timing chart for explaining an example of the operation when the arithmetic processing device shown in FIG. 1 executes the binary tree search processing shown in FIG. 4. The memory is assumed to be an ordinary single-port RAM.

As shown in FIG. 5, cmd1, a read command for the tree memory (32), is first issued to the common bus 104. The PEs 101a to 101n all attempt to execute the tree memory (32) read command at once, but the tree memory accepts an access from only one PE per cycle.

Consider the case where there are 16 PEs (processor elements) and the arbiter of the tree memory (32) accepts accesses in the order PE0 → PE1 → ... → PE15. PE0 has its memory access accepted in the first cycle, and then enters a wait state until the accesses of the other PEs finish.

PE1 has its memory access accepted in the cycle following that of PE0, and then enters a wait state. In this way, a total of 16 cycles elapse before the memory access of PE15 is accepted.

The next command, cmd2, a read command for the image memory (31), likewise takes a total of 16 cycles, so these two instructions take a total of 32 cycles.
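
The cycle count above can be reproduced with a small model of the related-art arbitration. This is a sketch under the assumptions stated in the text (one grant per cycle, fixed order PE0 to PE15, the next command starting only after every PE has been served); the function name `arbitrated_schedule` is hypothetical.

```python
def arbitrated_schedule(num_pes, commands):
    """Return ({(command, pe): grant_cycle}, total_cycles) when the memory
    grants one PE per cycle and commands are executed strictly in sequence."""
    schedule = {}
    cycle = 0
    for command in commands:
        for pe in range(num_pes):        # grants in the order PE0 -> PE1 -> ...
            schedule[(command, pe)] = cycle
            cycle += 1                   # all other PEs wait during this cycle
    return schedule, cycle

schedule, total_cycles = arbitrated_schedule(16, ["cmd1", "cmd2"])
```

In this model PE0 is granted cmd2 only in cycle 16, after PE15 has finished cmd1, which is why the two commands together take 32 cycles.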

Thus, with the arithmetic processing device of the related art, it is difficult to shorten the time each of the PEs 101a to 101n takes to access the memory, and it is difficult to improve, for example, the image recognition speed in image processing.

Embodiments of the arithmetic processing device and its control method are described in detail below with reference to the accompanying drawings. FIG. 6 is a block diagram showing the arithmetic processing device of the first embodiment, namely a first embodiment of a SIMD arithmetic processing device (SIMD processor) that executes processing for, for example, image recognition or big data analysis techniques.

In FIG. 6, reference numerals 1a to 1n denote processor elements (PEs), 2 denotes a PE (processor element) control unit, 31 denotes an image memory, 32 denotes a tree memory, 4 denotes a common command bus (common bus: first bus), and 5 denotes a cascade bus (second bus).

The cascade bus 5 is a bus for handing a command (second instruction), cycle by cycle, from one PE to the next PE connected in series, in the manner of a bucket brigade. In this specification, a bus that relays commands in this way is called a cascade bus for convenience.

The image memory 31 stores the information (for example, the luminance) of each pixel to be compared in binary tree search (binary tree processing), and the tree memory 32 stores the node information used in the binary tree processing.

One binary tree process is assigned to each of the PEs 1a to 1n. The common bus 4 is connected in parallel to the plurality of PEs 1a to 1n and gives the same common command (first instruction) to each of the PEs 1a to 1n.

The cascade bus 5 is connected in series to the plurality of PEs 1a to 1n so that, for example, a cascade command (second instruction) given to the PE 1a in a first cycle is conveyed to the PE 1b in the following second cycle.

The PE control unit 2 includes a fetch unit 21 that temporarily holds an instruction (command) and a program memory 22 that stores a program. Each of the PEs 1a to 1n receives common commands from the PE control unit 2 through the common bus 4 and receives cascade commands through the cascade bus 5.

As shown in FIG. 6, the PEs 1a to 1n are connected to the image memory 31 by buses 18a to 18n, respectively, and to the tree memory 32 by buses 19a to 19n, respectively.

Although each of the buses 18a to 18n and 19a to 19n is drawn as a single line, each includes, for example, a bidirectional multi-bit (for example, 32-bit or 64-bit) address bus and data bus.

In FIG. 6, a 1-bit bit field (cascade bit field) is set in both the common command (first instruction) sent from the PE control unit 2 through the common bus 4 and the cascade command (second instruction) sent through the cascade bus 5.

An example of the instruction format is shown in Table 1 below. Here, the instruction length is 33 bits; the most significant bit is assigned to the cascade bit field CBF (instruction type information), and the lower 32 bits serve as the instruction field of an ordinary 32-bit processor.

As shown in Table 1, a cascade bit field CBF of "1" indicates, for example, an instruction conveyed through the cascade bus 5 (cascade command), and a CBF of "0" indicates an instruction conveyed through the common bus 4 (common command).
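
The 33-bit format described above (CBF in bit 32, a 32-bit instruction field in bits 31 to 0) can be sketched as a pair of pack and unpack helpers. The function names and the sample instruction value are hypothetical; only the bit positions come from the text.

```python
CBF_BIT = 32                      # most significant bit of the 33-bit command
FIELD_MASK = (1 << 32) - 1        # lower 32 bits: ordinary instruction field

def encode_command(cbf, instruction):
    """Pack a 1-bit CBF and a 32-bit instruction into a 33-bit command word."""
    return ((cbf & 1) << CBF_BIT) | (instruction & FIELD_MASK)

def decode_command(word):
    """Split a 33-bit command word into (cbf, instruction)."""
    return (word >> CBF_BIT) & 1, word & FIELD_MASK

def is_cascade_command(word):
    """CBF = 1 marks a cascade command, CBF = 0 a common command."""
    return decode_command(word)[0] == 1

cascade_word = encode_command(1, 0xDEADBEEF)   # hypothetical memory-access command
common_word = encode_command(0, 0xDEADBEEF)    # same field, sent as a common command
```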

As described above, the PE control unit 2 and each of the PEs 1a to 1n are connected by two buses, the common bus 4 and the cascade bus 5. The PE control unit 2 issues commands for which the PEs 1a to 1n do not perform memory access (common commands) to the common bus 4, and issues commands that perform memory access (cascade commands) to the cascade bus 5.

For this purpose, the cascade bit field CBF is provided on the bus as a control signal (instruction type information), and when a command is issued to the cascade bus 5, the cascade bit field CBF is set to, for example, "1" (high level "H").

The CBF can be incorporated into a command in software, for example by the programmer who writes the program or by a compiler that translates source code written in a programming language.

Alternatively, instead of embedding the CBF in the command, the command type can be examined to determine whether the command is a common command to be conveyed to the PEs 1a to 1n through the common bus 4 or a cascade command to be conveyed to the PEs 1a to 1n through the cascade bus 5, and the command can be processed accordingly.

FIG. 7 is a block diagram showing an example of a processor element in the arithmetic processing device shown in FIG. 6. As shown in FIG. 7, each processor element 1 (1a to 1n) includes a command buffer 11, a processor core 12, and a selector (first selector) 13.

The selector 13 selects either the common command on the common bus 4 or the cascade command on the cascade bus 5 according to the cascade bit field (instruction type information) CBF, and the command selected by the selector 13 is held in the command buffer 11.

That is, the selector 13 selects the cascade command on the cascade bus 5 when the CBF is "1" and the common command on the common bus 4 when the CBF is "0", and the selected command is held in the command buffer 11.

The processor core 12 executes the command held in the command buffer 11. When the command output from the command buffer 11 is a memory access, the PE 1 outputs an address (RAM address) and a read enable (RAM read enable) to the memory and receives the corresponding read data (RAM read data).

The cascade bit field CBF, a control signal on the cascade bus 5, is observed. When the CBF is "1", the cascade command on the cascade bus 5, selected by the selector 13, is taken into the command buffer 11. When the CBF is "0", the common command on the common bus 4, selected by the selector 13, is taken into the command buffer 11.

The command buffer 11 buffers the command it has taken in and outputs the buffered command to the processor core 12 of its own PE (for example, the PE 1a). In the next cycle, the command buffered in the command buffer 11 is conveyed through the cascade bus 5 to the next-stage PE connected in series (for example, the PE 1b).

As a result, a command that performs memory access (cascade command), whose CBF is set to "1", is conveyed to the processor elements in order starting from the one closest to the PE control unit 2 (1a → 1b → 1c → ...), shifted by one cycle at each stage.
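
The one-cycle-per-stage relay can be modelled as a shift register of command buffers. This is a simplified sketch, assuming a single cascade command and ignoring the common bus; the function name `cascade_arrival_cycles` and the cycle numbering (the first PE latches the command in cycle 0) are assumptions for illustration.

```python
def cascade_arrival_cycles(num_pes):
    """Return {pe_index: cycle} at which one cascade command reaches each PE."""
    buffers = [None] * num_pes   # command buffer of each PE
    incoming = "cmd"             # command issued by the PE control unit
    arrival = {}
    cycle = 0
    while len(arrival) < num_pes:
        # Each PE latches what the previous stage held in the preceding cycle.
        buffers = [incoming] + buffers[:-1]
        incoming = None          # the control unit issues the command only once
        for pe, command in enumerate(buffers):
            if command is not None:
                arrival[pe] = cycle
        cycle += 1
    return arrival

arrival = cascade_arrival_cycles(16)
```

Since every PE sees the command in a different cycle, no two PEs start the same memory access in the same cycle.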

As described above, the PE 1 executes the command output from the PE control unit 2, and when the command is a memory access, the PE 1 outputs an address (RAM address) and a read enable (RAM read enable) to the memory.

Here, the address carries a meaningful value only in a cycle in which the read enable is asserted; in all other cycles, "0" is output. Because memory access commands are input through the cascade bus 5, the access cycles of different PEs are guaranteed not to overlap.

The address and read enable input to the memory can therefore be formed simply by taking the logical OR of the addresses and the logical OR of the read enables output from the PEs 1 (1a to 1n). On the PE 1 side, the read data can be received a fixed number of cycles (for example, one cycle) after the read enable is issued, so no acknowledge mechanism is needed.
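
The OR-based merging of the PE outputs can be sketched as follows. This is an illustrative model only; the function name `merge_memory_inputs` and the sample address are hypothetical. It relies on the property stated in the text that at most one PE asserts its read enable in any cycle and that idle PEs drive the address to 0.

```python
from functools import reduce
from operator import or_

def merge_memory_inputs(pe_outputs):
    """OR together the (address, read_enable) pairs driven by all PEs in one cycle."""
    address = reduce(or_, (addr for addr, _ in pe_outputs), 0)
    read_enable = reduce(or_, (enable for _, enable in pe_outputs), 0)
    return address, read_enable

# Cycle in which only the second PE holds the cascade command:
# the idle PEs drive address 0 and read enable 0.
cycle_outputs = [(0x000, 0), (0x040, 1), (0x000, 0), (0x000, 0)]
memory_inputs = merge_memory_inputs(cycle_outputs)
```

If two PEs drove non-zero addresses in the same cycle, the OR would corrupt the address, so this simple merge is valid only because the cascade bus staggers the accesses.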

By observing, for example, the CBF output by the final processor element 1n (the cascade end signal CTS), the PE control unit 2 can recognize that the memory accesses of all the processor elements 1a to 1n have finished, and can then issue a command to the common bus 4.

That is, as shown in FIG. 6, the cascade bit field CBF of the final processor element 1n may, for example, be input to the PE control unit 2 via a predetermined wire L1 as the cascade end signal CTS.

Thus, according to the arithmetic processing apparatus of the first embodiment, the predetermined processing can be performed without providing a complex arbiter on the memory side and without implementing a memory read request or acknowledge mechanism on the processor element (PE) side.

FIG. 8 is a diagram for explaining the selector in the processor element shown in FIG. 7. FIG. 8(a) shows the selector 13 and its connections, and FIG. 8(b) shows the truth table of the selector 13.

As shown in FIG. 8(a), the selector 13 receives the cascade command (A) via the cascade bus 5 and the common command (B) via the common bus 4, and selects one of them according to the value of the cascade bit field CBF (S).

Here, the cascade command (A), the common command (B), and the selected command (Y) are each, for example, 33 bits wide ([32:0]), and the CBF is the most significant bit ([32]) of the 33-bit command. This is merely an example; needless to say, the command format can be varied in many ways.

As shown in FIG. 8(b), when the cascade bit field (S) is "1", the selected command (Y[32:0]) output by the selector 13 is the cascade command (A[32:0]). When the cascade bit field (S) is "0", the selected command (Y[32:0]) is the common command (B[32:0]).
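The truth table of the selector 13 amounts to a two-input multiplexer, sketched below in Python. The names are hypothetical, and reading S from bit 32 of the cascade command in the usage lines is an assumption made for illustration; the text only states that S is the CBF value.

```python
CMD_WIDTH = 33                   # example width from the text: bits [32:0]
CMD_MASK = (1 << CMD_WIDTH) - 1


def cbf(command):
    """Cascade bit field: the most significant bit (bit 32) of a 33-bit command."""
    return (command >> 32) & 1


def selector13(s, a_cascade, b_common):
    """Y = A (cascade command) when S is 1, Y = B (common command) when S is 0."""
    return (a_cascade if s else b_common) & CMD_MASK


cascade_cmd = (1 << 32) | 0x1234   # CBF = 1
common_cmd = 0x5678                # CBF = 0
# Assumption for this usage: S is taken from the cascade command's CBF.
assert selector13(cbf(cascade_cmd), cascade_cmd, common_cmd) == cascade_cmd
assert selector13(0, cascade_cmd, common_cmd) == common_cmd
```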

FIG. 9 is a timing chart for explaining an example of the operation of the arithmetic processing apparatus of the first embodiment. It shows the waveforms when 16 processor elements (PE0 to PE15) are used and the tree memory 32 and the image memory 31 are accessed in succession. The memories are assumed to be ordinary single-port RAMs.

As is clear from a comparison of FIG. 9 with FIG. 5 described above, executing cmd1 and cmd2, the read commands to the image memory 31, previously required 32 cycles, whereas the first embodiment reduces this to 17 cycles.

That is, the command (instruction) cmd1, which accesses the tree memory 32, and the command cmd2, which accesses the image memory, target different memories and can therefore be issued on the cascade bus 5 in consecutive cycles.

Consequently, completing the command cmd1 still takes 16 cycles, as in the case of FIG. 5, but the command cmd2 effectively completes in one additional cycle, so the command execution time is greatly shortened.
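The 32-cycle and 17-cycle figures can be reproduced with a simple counting model. The Python sketch below is an assumption-laden illustration, not the timing of FIG. 5 or FIG. 9 themselves: it assumes one memory access per PE per cycle, no overlap between commands under arbitration, and, for the cascade case, that consecutive commands target different memories.

```python
def arbitrated_cycles(num_pes, num_commands):
    """Assumed model: each command occupies one cycle per PE and commands do not overlap."""
    return num_pes * num_commands


def cascaded_cycles(num_pes, num_commands):
    """Assumed model: the first command needs one cycle per PE; each following
    command, issued one cycle later to a different memory, finishes one cycle later."""
    return num_pes + (num_commands - 1)


print(arbitrated_cycles(16, 2))  # 32
print(cascaded_cycles(16, 2))    # 17
```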

In the operation of FIG. 9, the read commands for the tree memory 32 and the image memory 31 are stored in the program memory 22 of the PE control unit 2 with, for example, the CBF set to "1".

The fetch unit 21 of the PE control unit 2 examines the CBF and outputs these commands to the cascade bus 5. The CBF in each command is assumed to be set in advance by the compiler or the programmer, but logic that analyzes commands may instead be added to the fetch unit 21 of the PE control unit 2 so that the CBF is attached automatically.

In this embodiment, accesses to different memories (for example, the image memory 31 and the tree memory 32 in FIG. 6) can be issued as consecutive cascade commands, and this is not limited to the two memories for images and trees.

Furthermore, the accesses (instructions) that can be issued as consecutive cascade commands are not limited to memory accesses; they may be accesses to any common resource that can be accessed by the plurality of processor elements (1a to 1n).

Here, the memories (image memory 31 and tree memory 32) are merely one example of a common resource. Other applicable common resources include, for example, input/output interfaces and arithmetic units such as multipliers that would be costly to implement in each individual processor element.

As described above, the first embodiment can greatly reduce the number of instruction execution cycles compared with, for example, a scheme that performs arbitration for each memory access and waits until the accesses of all processor elements are complete.

That is, the complex arbitration circuit connected between the processor elements and the memory becomes unnecessary, and execution of the next command need not wait until the read access of a memory access has finished, so the processing speed of the second and subsequent selected commands can be greatly improved. This effect is obtained likewise in the second and third embodiments described below, and also for accesses to common resources other than memory.

FIG. 10 is a block diagram showing the arithmetic processing apparatus of the second embodiment, which handles the case where some of the processor elements finish their processing early. As is clear from a comparison of FIG. 10 with FIG. 6 described above, the second embodiment is described using the example in which the binary tree processing assigned to the second processor element 1b reaches a terminal node NN.

That is, as shown in FIG. 10, when the processor element 1b reaches the terminal node NN, its assigned binary tree processing ends early. In the second embodiment, for processing whose determination result is fixed early, a determination end flag is set so that the processor element 1b connected by the cascade bus 5 is skipped, further improving the processing speed.

As shown in FIG. 10, the binary tree processes (binary trees) assigned to the PEs 1a to 1n do not necessarily all have the same number of nodes. Moreover, a binary tree to process is not necessarily assigned to every one of the PEs 1a to 1n.

Therefore, a PE whose determination result was fixed early (data processing complete) or to which no binary tree processing was assigned (data processing unnecessary) is skipped by setting its determination end flag or invalid flag.

That is, the processing speed can be further improved by skipping, among the PEs 1a to 1n connected in series by the cascade bus 5, those for which data processing is unnecessary or already complete.

FIG. 11 is a block diagram showing an example of a processor element in the arithmetic processing apparatus shown in FIG. 10, and FIG. 12 is a diagram for explaining a selector in the processor element shown in FIG. 11.

Here, FIG. 12(a) shows the selector (second selector) 14 and its connections, and FIG. 12(b) shows the truth table of the selector 14. The selector (first selector) 13 is the same as that described with reference to FIGS. 7 and 8.

As shown in FIGS. 11 and 12(a), the selector 14 receives the output of the selector 13 and the output of the command buffer 11, and selects and outputs one of them according to the output (CS0) of the OR gate 153.

Here, the processor core 12 does not operate while the invalid flag is "1". The invalid flag is assumed to be set and cleared using a dedicated control signal (not shown), but a dedicated logic circuit may instead be added to the common bus 4 so that the invalid flag is set or cleared when an invalid-flag setting instruction is issued.

The command buffer 11 replaces the cascade command with a NOP (no-operation) command when the determination end flag or the invalid flag is "1". The determination end flag is set from the processor core 12 using a command received via the common bus 4 or the cascade bus 5, and it can be cleared from the processor core 12 using a command received via the common bus 4.

That is, the OR gate 153 takes the logical OR of the determination end flag (151) and the invalid flag (152), and outputs a control signal CS0 of "1" to the selector 14 when at least one of them is "1".

As shown in FIG. 12(b), when the selector 14 receives the control signal CS0 of "1" (S is "1") from the OR gate 153, it selects the command (B[32:0]) selected by the selector 13.

The command selected by the selector 13 (B[32:0], the command of the current cycle) is then conveyed, as the selected command (Y[32:0]) of the selector 14, to the next-stage processor element connected in series through the cascade bus 5. At this time, the control signal CS0 of "1" from the OR gate 153 is also input to the command buffer 11, which performs the replacement with the NOP command described above.

Specifically, as shown in FIG. 10, when the binary tree processing by the PE 1b ends early, the determination end flag (151) is set (to "1"), so the control signal CS0 output from the OR gate 153 becomes "1".

As a result, the current-cycle command (B[32:0]) selected by the selector 13 is conveyed as the selected command (Y[32:0]) to the PE 1c, the stage following the PE 1b, shortening the time needed to propagate the command over the cascade bus 5.

Likewise, when no binary tree processing is assigned to the PE 1b, the invalid flag (152) is set (to "1"), so the control signal CS0 output from the OR gate 153 becomes "1".

In this case too, the command selected by the selector 13 (B[32:0], the command of the current cycle) is conveyed as the selected command (Y[32:0]) to the PE 1c following the PE 1b, shortening the time needed to propagate the command over the cascade bus 5.

When neither the determination end flag (151) nor the invalid flag (152) is set and the control signal CS0 output from the OR gate 153 is "0", the selector 14 selects the output of the command buffer 11 (A[32:0]), and the same processing as in the first embodiment is performed.

That is, as shown in FIG. 12(b), when the selector 14 receives the control signal CS0 of "0" (S is "0") from the OR gate 153, it selects the command (A[32:0]) output from the command buffer 11.

The command output from the command buffer 11 (A[32:0], the command of the previous cycle) is then conveyed, as the selected command (Y[32:0]) of the selector 14, to the processor element connected in series through the cascade bus 5. In other words, when the control signal CS0 is "0", the processing is the same as in the first embodiment.
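The behaviour of the OR gate 153 and the selector 14 described above can be summarized in a short Python sketch. The function and parameter names are hypothetical, and the command values in the usage lines are arbitrary placeholders.

```python
def selector14(determination_end_flag, invalid_flag, a_buffered, b_current):
    """Second selector of the second embodiment.

    a_buffered: output of the command buffer 11 (command of the previous cycle)
    b_current:  output of the selector 13 (command of the current cycle)
    """
    cs0 = determination_end_flag | invalid_flag   # OR gate 153
    # CS0 = 1: bypass the buffer so the command skips this PE.
    # CS0 = 0: forward the buffered command, as in the first embodiment.
    return b_current if cs0 else a_buffered


PREVIOUS, CURRENT = 0x111, 0x222   # placeholder command words
assert selector14(0, 0, PREVIOUS, CURRENT) == PREVIOUS  # normal one-cycle delay
assert selector14(1, 0, PREVIOUS, CURRENT) == CURRENT   # determination finished
assert selector14(0, 1, PREVIOUS, CURRENT) == CURRENT   # PE set invalid
```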

FIG. 13 is a timing chart for explaining an example of the operation of the arithmetic processing apparatus of the second embodiment, for the case where the processor element PE1 has already finished its determination processing. The reference sign cmd1 denotes an access command to the tree memory 32, and cmd2 denotes an access command to the image memory 31. The memories are assumed to be ordinary single-port RAMs.

As is clear from a comparison of FIG. 13 with FIG. 9 described above, in the second embodiment the control signal CS0 of the selector (second selector) 14 is "1" in PE1, which has finished its determination processing.

Therefore, the second selector 14 outputs the command (B[32:0]) selected by the selector (first selector) 13 as the selected command (Y[32:0]), which passes through PE1 and is conveyed to the next-stage PE2.

That is, because the determination end flag of PE1 is set (S is "1"), the commands (cmd1, cmd2) received from PE0 through the cascade bus 5 are not buffered in the command buffer 11 but pass straight through PE1 to PE2.

Accordingly, PE2 receives each command in the cycle in which PE1 would originally have received it (had PE1 not finished its determination processing), and PE3 to PE15 likewise receive and process each command one cycle earlier than in the case shown in FIG. 9.

As a result, the processing that required 17 cycles in the first embodiment described with reference to FIG. 9 is shortened by one cycle, to 16 cycles, by skipping PE1. Needless to say, the reduction in processing cycles becomes more pronounced as the number of processor elements that have finished their determination processing or are set invalid increases.
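The effect of skipping can be illustrated with a Python sketch that computes the cycle in which each PE receives a cascade command. The model is an assumption for illustration: a PE that buffers the command adds one cycle for the stages after it, while a skipped PE adds none. The function name is hypothetical.

```python
def arrival_cycles(skip_flags, issue_cycle=0):
    """Cycle in which each PE sees a cascade command.

    skip_flags[i] is True when PE i has its determination end flag or
    invalid flag set, so the command passes through it without buffering.
    """
    cycles = []
    cycle = issue_cycle
    for skipped in skip_flags:
        cycles.append(cycle)
        if not skipped:
            cycle += 1   # the command buffer delays the next stage by one cycle
    return cycles


no_skip = arrival_cycles([False] * 16)
pe1_skipped = arrival_cycles([i == 1 for i in range(16)])
print(no_skip[15], pe1_skipped[15])  # 15 14: PE15 is reached one cycle earlier
```

In this model PE2 receives the command in cycle 1, the cycle in which PE1 would otherwise have been the only recipient, which matches the one-cycle saving described above.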

FIG. 14 is a block diagram showing the arithmetic processing apparatus of the third embodiment, which further shortens the execution cycles by operating a plurality of processor elements in parallel over independently operable paths.

Specifically, when processing the images from two cameras mounted on the left and right, for example, the image memory 31 becomes two independent image memories 311 and 312. The image memories 311 and 312 can be distinguished by, for example, the most significant bit (MSB) of the address (RAM address).

In FIG. 14, it is assumed that the MSB of the RAM address has already been set in, for example, the route selector register 16 shown in FIG. 15, so that the image memory accessed by each PE has already been designated.

That is, FIG. 14 shows the case where the first-stage PE 1a accesses the first image memory 311, reads two data items (the data to be compared), and performs comparison (determination) processing, while the second-stage PE 1b accesses the second image memory 312, reads two data items, and performs comparison processing.

The third embodiment described below uses two independent image memories 311 and 312 as an example, but the number of image memories is not limited to two. For example, with four independent image memories, the upper two bits of the RAM address can be used to designate the image memory accessed by each PE.

Furthermore, the image memory is merely an example of a common resource; the resource is not limited to memory as long as it is a common resource that a plurality of processor elements can access simultaneously. The common resource may be, for example, an input/output interface or an arithmetic unit such as a multiplier that would be costly to implement in each individual processor element. The third embodiment can also be applied to a plurality of such independent arithmetic units.

As shown in FIG. 14, in the arithmetic processing apparatus of the third embodiment, two cascade buses (51, 52), the same number as the independent image memories 311 and 312, connect the PE control unit 2 and the PEs 1a to 1n.

In addition, each of the PEs 1a to 1n is provided with independent buses: the buses 19a to 19n for accessing the tree memory 32, and buses for accessing the two image memories 311 and 312.

As noted above, however, FIG. 14 shows the state in which the PE 1a reads two data items via the bus 181a from the first image memory 311 already designated for it, and the PE 1b reads two data items via the bus 181b from the second image memory 312 already designated for it, each then performing comparison processing.

Each of the PEs 1a to 1n is assigned to one data processing task stored as tree information (for example, one per image). By selecting a cascade connection (cascade bus) on a path over which the PEs 1a to 1n can operate independently, the PEs 1a to 1n are operated in parallel to shorten the processing time.

As shown in FIG. 14, when there are two independently accessible image memories (311, 312), the uppermost bit of the RAM address (the most significant bit, MSB) can be used, for example, to select between the two image memories 311 and 312.

That is, each of the PEs 1a to 1n can identify from the MSB of the RAM address whether the RAM to be accessed is the first image memory 311 or the second image memory 312.

In this case, two cascade buses 51 and 52 are provided, one associated with each of the two image memories (RAMs) 311 and 312 to be accessed. Using the MSB of the RAM address to select between the first and second image memories 311 and 312 is merely an example, and the selection is not limited to this.
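Selecting an image memory from the upper address bits can be sketched in Python as follows. The 16-bit address width used in the usage lines is purely an assumption (the text does not give the RAM address width), and the function name is hypothetical.

```python
def image_memory_index(ram_address, address_bits, select_bits=1):
    """Index of the image memory selected by the top select_bits of the address.

    With select_bits=1 the MSB picks one of two memories (0 -> first, 1 -> second);
    with select_bits=2 the upper two bits pick one of four.
    """
    if not 0 <= ram_address < (1 << address_bits):
        raise ValueError("address does not fit in address_bits")
    return ram_address >> (address_bits - select_bits)


# Assumed 16-bit RAM address, two image memories.
assert image_memory_index(0x7FFF, 16) == 0   # MSB = 0: first image memory
assert image_memory_index(0x8000, 16) == 1   # MSB = 1: second image memory
# Four image memories selected by the upper two bits.
assert image_memory_index(0xC000, 16, select_bits=2) == 3
```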

Here, when the MSB (CS1) of the RAM address is "0", the even-numbered PEs 1a, 1c, ... (PE0, PE2, ..., PE14) access the first image memory 311 in response to the first cascade command conveyed through the first cascade bus 51.

When CS1 is "1", the odd-numbered PEs 1b, 1d, ... (PE1, PE3, ..., PE15) access the second image memory 312 in response to the second cascade command conveyed through the second cascade bus 52.

Each of the PEs 1a to 1n (PE0 to PE15) executes only the commands (instructions) from the cascade bus 51 or 52 that corresponds to the image memory 311 or 312 it accesses. Commands from the other cascade bus are not buffered in the command buffer 11 but are passed straight through to the next-stage PE connected in series.

An access command for the tree memory 32 may be conveyed through the first cascade bus 51 or the second cascade bus 52; alternatively, a third cascade bus separate from the first and second cascade buses may be provided to convey it.

In FIG. 14, each instruction (cascade command) sent from the PE control unit 2 through the first and second cascade buses 51 and 52 carries, separately from the CBF, a route select bit field (RSBF) for selecting one of the cascade bus routes.

An example of the instruction format of the cascade command is shown in Table 2 below. Here, the instruction length is 34 bits, and the most significant bit is assigned to the RSBF (Route Select Bit Field: route selection information). The lower 33 bits are the same as in Table 1 described above: the second-highest bit is assigned to the CBF (instruction type information), and the remaining lower 32 bits, from the third-highest bit down, form the instruction field of an ordinary 32-bit processor.

Here, for example, when the RSBF is set to "1", the CBF is also set to "1".

The RSBF can be embedded in a command in software by, for example, the programmer who writes the program or the compiler that translates program source code written in a programming language.

Alternatively, a logic circuit that analyzes the instruction (command) and its memory access destination may be added to the fetch unit 21 of the PE control unit 2 so that the RSBF is attached to the command automatically.

For an access to the tree memory 32, the RSBF is, for example, "0" and the CBF is "1"; in this case, the third embodiment takes the command from the first cascade bus 51.
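The 34-bit example format (RSBF in bit 33, CBF in bit 32, a 32-bit instruction field in bits 31 to 0) can be packed and unpacked as in the Python sketch below. The function names are hypothetical, and the instruction words in the usage lines are arbitrary placeholders rather than real opcodes.

```python
RSBF_BIT = 33            # most significant bit of the 34-bit command
CBF_BIT = 32             # second-highest bit
INSTR_MASK = 0xFFFFFFFF  # lower 32 bits: ordinary 32-bit instruction field


def encode_cascade_command(rsbf, cbf, instruction):
    """Pack RSBF, CBF and a 32-bit instruction into a 34-bit command word."""
    if rsbf and not cbf:
        # The text states that CBF is also "1" whenever RSBF is "1".
        raise ValueError("RSBF = 1 requires CBF = 1")
    return (rsbf << RSBF_BIT) | (cbf << CBF_BIT) | (instruction & INSTR_MASK)


def decode_cascade_command(command):
    """Return (rsbf, cbf, instruction) from a 34-bit command word."""
    return (command >> RSBF_BIT) & 1, (command >> CBF_BIT) & 1, command & INSTR_MASK


word = encode_cascade_command(1, 1, 0xDEADBEEF)   # placeholder instruction
assert word == 0x3DEADBEEF
assert decode_cascade_command(word) == (1, 1, 0xDEADBEEF)
# Tree-memory access in the example: RSBF = 0, CBF = 1.
assert decode_cascade_command(encode_cascade_command(0, 1, 0x10)) == (0, 1, 0x10)
```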

FIG. 15 is a block diagram showing an example of a processor element in the arithmetic processing apparatus shown in FIG. 14, and FIGS. 16 and 17 are diagrams for explaining the selectors in the processor element shown in FIG. 15.

Here, FIG. 16 shows the two selectors (first selector 13 and second selector 14) and their connections, FIG. 17(a) shows the truth table of the first selector 13, and FIG. 17(b) shows the truth table of the second selector 14. In FIGS. 17(a) and 17(b), the symbol "×" denotes a don't-care value.

As shown in FIG. 15, each PE 1 (1a to 1n) includes a route selector register 16. The route selector register 16 holds, for example, the most significant bit of the address used to access the image memories (311, 312) (the MSB of the RAM address), and outputs to the selectors 13 and 14 a control signal (route select signal) CS1 that selects between the cascade buses 51 and 52.

That is, each PE 1 (1a to 1n) sets the MSB (CS1) of the RAM address in the route selector register 16 before the cycle in which it accesses the image memory (311, 312).

As shown in FIGS. 15 and 16, the first selector 13 receives the first cascade command (A) via the first cascade bus 51 and the second cascade command (B) via the second cascade bus 52.

As described with reference to Table 2, the most significant bit (MSB) of each cascade command (for example, 34 bits) is used as the route selection information (RSBF), and the second-highest bit is used as the instruction type information (CBF).

That is, the RSBF and CBF of the first cascade command are input to the first selector 13 as S1 and S2, and the RSBF and CBF of the second cascade command are input to the first selector 13 as S3 and S4. The common command via the common bus 4 is input to the first selector 13 as C.

The first and second cascade buses 51 and 52 output from the PE control unit 2 are connected directly to the first-stage processor element 1a, and the PE control unit 2 issues the same command on both cascade buses 51 and 52. The first selector 13 selects one of the first cascade command, the second cascade command, and the common command based on S0 to S4.

Specifically, as shown in FIG. 17(a), when the first cascade command (A) has S1 = S2 = "1" and this matches S0 = "1", the first selector 13 selects the first cascade command (A) as Y and outputs it to the command buffer 11.

When the second cascade command (B) has S3 = S4 = "1" and the condition S0 = "0" is met, the first selector 13 selects the second cascade command (B) as Y and outputs it to the command buffer 11.

Furthermore, when S2 = "0" for the first cascade command (A) and S4 = "0" for the second cascade command (B), the first selector 13 selects the common command (C) from the common bus 4 as Y and outputs it to the command buffer 11.
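The three cases just described for the first selector 13 can be written out as a Python sketch. It covers only the combinations stated in the preceding paragraphs; the truth table of FIG. 17(a) is not reproduced in the text, so any other combination returns None here, which is an assumption of this sketch. All names are hypothetical.

```python
def rsbf(command):
    """Route select bit field: bit 33 of a 34-bit cascade command."""
    return (command >> 33) & 1


def cbf(command):
    """Cascade bit field: bit 32 of a 34-bit cascade command."""
    return (command >> 32) & 1


def first_selector(s0, a_first, b_second, c_common):
    """Select Y from the first cascade (A), second cascade (B) or common (C) command.

    s0 is the route select signal CS1 held in the route selector register 16.
    """
    s1, s2 = rsbf(a_first), cbf(a_first)
    s3, s4 = rsbf(b_second), cbf(b_second)
    if s1 == 1 and s2 == 1 and s0 == 1:
        return a_first
    if s3 == 1 and s4 == 1 and s0 == 0:
        return b_second
    if s2 == 0 and s4 == 0:
        return c_common
    return None  # combination not spelled out in the text (assumption)


A = (0b11 << 32) | 0xA   # RSBF = CBF = 1 on the first cascade bus
B = (0b11 << 32) | 0xB   # RSBF = CBF = 1 on the second cascade bus
C = 0xC                  # common command
assert first_selector(1, A, B, C) == A
assert first_selector(0, A, B, C) == B
assert first_selector(0, 0x1, 0x2, C) == C   # both CBFs are 0
```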

As shown in FIG. 17(b), the second selector 14 passes a cascade command A or B whose RSBF and CBF are both "1" (S1 = S2 = "1" or S3 = S4 = "1") and which does not correspond to S0 (CS1) on to the next-stage PE unchanged, as Y1 or Y0 respectively.

That is, when S0 is "1" and a cascade command (B) with S3 = S4 = "1" is input on the second cascade bus 52, that cascade command (B) is output unchanged (Y0) to the second cascade bus 52 of the next-stage PE.

When S0 is "0" and a cascade command (A) with S1 = S2 = "1" is input on the first cascade bus 51, that cascade command (A) is output unchanged (Y1) to the first cascade bus 51 of the next-stage PE.

In all other cases, the output (C) of the command buffer 11 is output as is (Y1, Y0) to the first and second cascade buses 51 and 52 of the next-stage PE.
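The pass-through behaviour of the second selector 14 can be sketched in Python as two independent output decisions. Treating Y1 and Y0 independently, so that the output that is not passing a command through carries the command buffer output C, is an assumption of this sketch; the text describes the pass-through cases and the "all other cases" rule but does not reproduce FIG. 17(b). All names are hypothetical.

```python
def rsbf(command):
    """Route select bit field: bit 33 of a 34-bit cascade command."""
    return (command >> 33) & 1


def cbf(command):
    """Cascade bit field: bit 32 of a 34-bit cascade command."""
    return (command >> 32) & 1


def second_selector(s0, a_first, b_second, c_buffered):
    """Return (Y1, Y0): the words driven onto the next-stage cascade buses 51 and 52."""
    s1, s2 = rsbf(a_first), cbf(a_first)
    s3, s4 = rsbf(b_second), cbf(b_second)
    # S0 = 0: a cascade command on bus 51 is passed through unchanged.
    y1 = a_first if (s0 == 0 and s1 == 1 and s2 == 1) else c_buffered
    # S0 = 1: a cascade command on bus 52 is passed through unchanged.
    y0 = b_second if (s0 == 1 and s3 == 1 and s4 == 1) else c_buffered
    return y1, y0


A = (0b11 << 32) | 0xA
B = (0b11 << 32) | 0xB
C = 0xC                   # output of the command buffer 11
assert second_selector(1, A, B, C) == (C, B)
assert second_selector(0, A, B, C) == (A, C)
assert second_selector(0, 0xA, B, C) == (C, C)   # no pass-through condition met
```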

FIG. 18 is a timing chart for explaining an example of the operation of the arithmetic processing apparatus of the third embodiment. It shows how the two independent image memories 311 and 312 are accessed over two independently operable paths (the first and second cascade buses 51 and 52).

In FIG. 18, the even-numbered processor elements (PE0, PE2, ..., PE14) access, for example, the first image memory 311, for which the MSB of the RAM address is "0", in response to the first cascade command transmitted through the first cascade bus 51.

The odd-numbered processor elements (PE1, PE3, ..., PE15) access, for example, the second image memory 312, for which the MSB of the RAM address is "1", in response to the second cascade command transmitted through the second cascade bus 52.

As a result, as shown in FIG. 18, the PEs that access the same image memory form a group that uses the same cascade bus, and their accesses can be processed concurrently with those of a group belonging to a different cascade bus, so the processing time can be shortened further.
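The effect of grouping can be illustrated with a small timing model. The sketch below is not taken from FIG. 18: it assumes that each cascade bus delivers its command to one PE of its group per cycle, in PE order, and that PEs outside the group forward the command without delay. The function name `arrival_cycles` and the cycle counts it produces are consequences of those assumptions only.

```python
def arrival_cycles(route_of_pe):
    """Map each PE index to the cycle in which its cascade command arrives.

    route_of_pe[i] is the cascade bus (route) that PE i listens to.
    Assumption: each bus serves one PE of its group per cycle, and PEs
    that do not belong to the group add no delay.
    """
    next_cycle = {}   # per-route count of cycles already consumed
    arrivals = {}
    for pe, route in enumerate(route_of_pe):
        arrivals[pe] = next_cycle.get(route, 0)
        next_cycle[route] = arrivals[pe] + 1
    return arrivals


# Sixteen PEs: even PEs on bus 51, odd PEs on bus 52.
two_buses = arrival_cycles([51 if pe % 2 == 0 else 52 for pe in range(16)])
# The same sixteen PEs sharing a single cascade bus, for comparison.
one_bus = arrival_cycles([51] * 16)

total_two = max(two_buses.values()) + 1
total_one = max(one_bus.values()) + 1
print(total_two, total_one)  # prints "8 16"
```

Under this model the two groups finish in half the cycles needed by a single chain, which is the kind of shortening the paragraph above describes.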

As described above, the number of independent image memories is not limited to two, and the common resource is not limited to an image memory as long as it is a resource that a plurality of PEs can access simultaneously.

FIG. 19 is a block diagram showing an example of a semiconductor integrated circuit to which the arithmetic processing unit of the present embodiment is applied, and schematically shows an SoC (System on a Chip) for digital cameras. As shown in FIG. 19, the SoC (semiconductor integrated circuit) 200 includes a SIMD block 201, a built-in memory 202, a memory IF (interface) 203, an image sensor IF 204, an image processing unit 205, a CPU 206, and a bus matrix 207.

Here, the SIMD block 201, the built-in memory 202, the memory IF 203, the image sensor IF 204, the image processing unit 205, and the CPU 206 are interconnected by the bus matrix 207. The image sensor IF 204 receives, for example, a signal produced when an image sensor converts an image formed by an external optical system, such as a lens, into an electrical signal. The arithmetic processing unit of each of the embodiments described above corresponds to the SIMD block 201.

That is, the PE control unit 2 in the SIMD block 201 shown in FIG. 19 is controlled, for example, by instructions from the CPU 206. The image memory 31 (311, 312) and the tree memory 32 in the SIMD block 201 read data from, for example, the built-in memory 202, which stores data fetched from an external large-capacity memory (not shown) via the memory IF 203.

It goes without saying that the application of the arithmetic processing unit of the present embodiment is not limited to an SoC that performs image processing; it can be applied widely as a semiconductor integrated circuit for use in various electronic devices.

The embodiments have been described above. All examples and conditions described herein are intended to aid understanding of the invention and of the concepts of the invention as applied to the technology, and the specifically described examples and conditions are not intended to limit the scope of the invention. Nor do such descriptions in the specification indicate the advantages or disadvantages of the invention. Although the embodiments of the invention have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the invention.

The following supplementary notes are further disclosed regarding the embodiments, including the examples above.
(Supplementary Note 1)
An arithmetic processing unit including a plurality of processor elements that can simultaneously access a common resource, the arithmetic processing unit comprising:
a first bus that is connected in parallel to the plurality of processor elements and gives a common first instruction to each of the processor elements; and
a second bus that is connected in series to the plurality of processor elements and that transmits a second instruction, given to a first processor element of the plurality of processor elements in a first cycle, to a second processor element connected in series to the first processor element in a second cycle following the first cycle.

(Supplementary Note 2)
The arithmetic processing unit according to Supplementary Note 1, wherein each of the processor elements includes:
a processor core that executes instructions; and
a first selector that selects one of the first bus and the second bus according to instruction type information indicating one of the first instruction and the second instruction.

(Supplementary Note 3)
The arithmetic processing unit according to Supplementary Note 2, wherein
the first instruction and the second instruction each include a bit field of the instruction type information indicating whether the instruction is the first instruction or the second instruction, and
the first selector
selects the second bus and outputs the second instruction when the bit field of the instruction type information indicates the second instruction, and
selects the first bus and outputs the first instruction when the bit field of the instruction type information indicates the first instruction.
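The selection rule of Supplementary Note 3 can be sketched as below for the case of a single second bus. The `Command` tuple, its `payload` field, and the function name `first_selector` are hypothetical, and the encoding (CBF = 1 marks a second, cascade instruction; CBF = 0 marks a first, common instruction) is an assumption inferred from the selector description in the third embodiment.

```python
from typing import NamedTuple


class Command(NamedTuple):
    cbf: int       # instruction type bit field: assumed 1 = second (cascade) instruction
    payload: str   # illustrative stand-in for the rest of the instruction


def first_selector(common_cmd, cascade_cmd):
    """Choose the instruction to pass on to the command buffer.

    common_cmd:  instruction present on the first (common) bus
    cascade_cmd: instruction present on the second (cascade) bus
    """
    # A valid cascade instruction on the second bus takes the second-bus path.
    if cascade_cmd.cbf == 1:
        return cascade_cmd
    # Otherwise the first bus is selected and the common instruction is output.
    return common_cmd
```

For instance, `first_selector(Command(0, "tree"), Command(1, "image"))` returns the cascade instruction, whereas an idle second bus (CBF = 0) lets the common instruction through.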

(Supplementary Note 4)
The arithmetic processing unit according to Supplementary Note 2 or 3, wherein each of the processor elements further includes
a command buffer that holds the instruction selected by the first selector.

(Supplementary Note 5)
The arithmetic processing unit according to Supplementary Note 4, wherein the first processor element
holds the instruction selected by the first selector in the command buffer in the first cycle, and
transmits the instruction held in the command buffer to the second processor element through the second bus in the second cycle.
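The hold-then-forward behavior of Supplementary Note 5 resembles a shift register built from the command buffers. The following sketch models that behavior only; the function name `step`, the four-PE chain, and the `"cmd0"` value are illustrative assumptions.

```python
def step(buffers, incoming):
    """Advance the cascade by one cycle.

    buffers[i] is the instruction held in the command buffer of PE i
    (None if empty). Each PE latches what its predecessor held in the
    previous cycle, and PE 0 latches the instruction newly placed on
    the second bus.
    """
    return [incoming] + buffers[:-1]


buffers = [None] * 4          # a chain of four PEs, all buffers empty
history = []
for cmd in ["cmd0", None, None, None]:
    buffers = step(buffers, cmd)
    history.append(list(buffers))

# history[k] shows where "cmd0" sits after cycle k: it advances by
# exactly one PE per cycle.
```

Each PE therefore sees the instruction one cycle after its predecessor, so accesses to the common resource triggered by the instruction are staggered rather than simultaneous.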

(Supplementary Note 6)
The arithmetic processing unit according to Supplementary Note 4, wherein each of the processor elements further includes
a second selector that selects one of the instruction selected by the first selector and the instruction held in the command buffer.

(Supplementary Note 7)
The arithmetic processing unit according to Supplementary Note 6, wherein,
when data processing by the first processor element is unnecessary or has been completed,
the second selector selects the instruction selected by the first selector, so that the first processor element is skipped, and transmits that instruction to the second processor element.
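The skip behavior of Supplementary Note 7 can be modeled as follows. The sketch assumes that a skipped PE forwards the instruction within the same cycle, because its second selector bypasses the command buffer, so only participating PEs add a cycle of delay. The function name `arrival_with_skip` and the `skip` flags are hypothetical.

```python
def arrival_with_skip(skip):
    """Return {pe_index: cycle} for the PEs that actually latch the instruction.

    skip[i] is True when PE i needs no processing or has already finished.
    Assumption: a skipped PE adds no delay to the cascade.
    """
    arrivals = {}
    cycle = 0
    for pe, skipped in enumerate(skip):
        if skipped:
            continue          # bypassed: the next PE sees the instruction at once
        arrivals[pe] = cycle
        cycle += 1            # a participating PE holds it for one cycle
    return arrivals
```

With `skip = [False, True, True, False]`, PE 3 receives the instruction one cycle after PE 0 instead of three cycles after it.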

(Supplementary Note 8)
The arithmetic processing unit according to Supplementary Note 6 or 7, wherein
the common resource includes a plurality of common resources that can operate independently, and
the second bus includes a plurality of second buses, the number of which corresponds to the number of the plurality of common resources.

(Supplementary Note 9)
The arithmetic processing unit according to Supplementary Note 8, wherein each of the processor elements further includes
a route selector register that specifies, among the plurality of common resources, the common resource to be accessed by that processor element.

(Supplementary Note 10)
The arithmetic processing unit according to Supplementary Note 9, wherein
the second instruction includes a bit field of route selection information that selects the route, among the plurality of second buses, through which that instruction is transmitted.

(Supplementary Note 11)
The arithmetic processing unit according to any one of Supplementary Notes 1 to 10, wherein
the arithmetic processing unit is a SIMD type arithmetic processing unit, and
the common resource is a memory.

(Supplementary Note 12)
The arithmetic processing unit according to Supplementary Note 11, wherein
the memory includes an image memory that stores image data, and a tree memory that stores node information used when a binary tree search is executed on the image data,
the first instruction accesses the tree memory, and
the second instruction accesses the image memory.

(Supplementary Note 13)
The arithmetic processing unit according to Supplementary Note 12, wherein each of the plurality of processor elements
performs, in parallel and at the same time, data processing based on access to the tree memory by the first instruction and on access to the image memory by the second instruction.

(Supplementary Note 14)
A semiconductor integrated circuit including the arithmetic processing unit according to any one of Supplementary Notes 1 to 13.

(Supplementary Note 15)
A control method of an arithmetic processing unit including a plurality of processor elements that can simultaneously access a common resource, the control method comprising:
giving a common first instruction to each of the processor elements through a first bus connected in parallel to the plurality of processor elements; and
transmitting, through a second bus connected in series to the plurality of processor elements, a second instruction given to a first processor element of the plurality of processor elements in a first cycle to a second processor element connected in series to the first processor element in a second cycle following the first cycle.

1, 1a to 1n, PE0 to PE15 processor element (PE)
2, 102 PE control unit
4, 104 common command bus (common bus: first bus)
5, 51, 52 cascade bus (second bus)
11, 111 command buffer
12 processor core
13 selector (first selector)
14 second selector
16 route selector register
21, 121 fetch unit
22, 122 program memory
31, 311, 312 image memory
32, 103 tree memory
106 arbiter

Claims

An arithmetic processing unit including a plurality of processor elements that can simultaneously access a common resource, the arithmetic processing unit comprising:
a first bus that is connected in parallel to the plurality of processor elements and gives a common first instruction to each of the processor elements; and
a second bus that is connected in series to the plurality of processor elements and that transmits a second instruction, given to a first processor element of the plurality of processor elements in a first cycle, to a second processor element connected in series to the first processor element in a second cycle following the first cycle.

The arithmetic processing unit according to claim 1, wherein each of the processor elements includes:
a processor core that executes instructions; and
a first selector that selects one of the first bus and the second bus according to instruction type information indicating one of the first instruction and the second instruction.

The arithmetic processing unit according to claim 2, wherein
the first instruction and the second instruction each include a bit field of the instruction type information indicating whether the instruction is the first instruction or the second instruction, and
the first selector
selects the second bus and outputs the second instruction when the bit field of the instruction type information indicates the second instruction, and
selects the first bus and outputs the first instruction when the bit field of the instruction type information indicates the first instruction.

The arithmetic processing unit according to claim 2 or 3, wherein each of the processor elements further includes
a command buffer that holds the instruction selected by the first selector.

The arithmetic processing unit according to claim 4, wherein the first processor element
holds the instruction selected by the first selector in the command buffer in the first cycle, and
transmits the instruction held in the command buffer to the second processor element through the second bus in the second cycle.

The arithmetic processing unit according to claim 4, wherein each of the processor elements further includes
a second selector that selects one of the instruction selected by the first selector and the instruction held in the command buffer.

The arithmetic processing unit according to claim 6, wherein,
when data processing by the first processor element is unnecessary or has been completed,
the second selector selects the instruction selected by the first selector, so that the first processor element is skipped, and transmits that instruction to the second processor element.

The arithmetic processing unit according to claim 6 or 7, wherein
the common resource includes a plurality of common resources that can operate independently, and
the second bus includes a plurality of second buses, the number of which corresponds to the number of the plurality of common resources.

The arithmetic processing unit according to claim 8, wherein each of the processor elements further includes
a route selector register that specifies, among the plurality of common resources, the common resource to be accessed by that processor element.

The arithmetic processing unit according to claim 9, wherein
the second instruction includes a bit field (RSBF) of route selection information that selects the route, among the plurality of second buses, through which that instruction is transmitted.

A control method of an arithmetic processing unit including a plurality of processor elements that can simultaneously access a common resource, the control method comprising:
giving a common first instruction to each of the processor elements through a first bus connected in parallel to the plurality of processor elements; and
transmitting, through a second bus connected in series to the plurality of processor elements, a second instruction given to a first processor element of the plurality of processor elements in a first cycle to a second processor element connected in series to the first processor element in a second cycle following the first cycle.