JP2009140514A

JP2009140514A - Semiconductor device

Info

Publication number: JP2009140514A
Application number: JP2009014308A
Authority: JP
Inventors: Yoshifumi Yoshikawa; 宜史吉川; Shigehiro Asano; 滋博浅野
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2009-01-26
Filing date: 2009-01-26
Publication date: 2009-06-25
Anticipated expiration: 2027-03-28
Also published as: JP4703735B2

Abstract

PROBLEM TO BE SOLVED: To provide a semiconductor device that performs operations efficiently without increasing manufacturing cost or power consumption. SOLUTION: The semiconductor device includes a first arithmetic engine that performs a first operation every cycle and outputs, every cycle, first data indicating the result of the first operation and a first valid signal indicating a first value or a second value; a second arithmetic engine that performs a second operation every cycle and outputs, every cycle, second data indicating the result of the second operation and a second valid signal indicating the first value or the second value; and an inter-engine buffer that is used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, and to which the first data or the second data is written when the first valid signal or the second valid signal indicates the first value. A code for the semiconductor device is generated. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a semiconductor device using dynamically reconfigurable circuit technology.

In recent years, even portable devices, which must be low in cost and power consumption, have gained increasingly complex and diverse functions and now require high performance. Achieving both high performance and low power consumption makes the development of dedicated hardware unavoidable, yet its development and manufacturing costs rise every year. Semiconductor devices using dynamically reconfigurable circuit technology have attracted attention as a way to reduce these costs (see, for example, Non-Patent Document 1).

A semiconductor device using dynamically reconfigurable circuit technology resembles an ordinary processor in that it performs operations according to instructions given by software or the like, but it differs in the following respect. While operating, such a device can read the arithmetic-unit setting that corresponds to an instruction from a storage device and change it. The contents of the storage device can be rewritten dynamically, so by rewriting the stored arithmetic-unit settings to suit how the device is being used, a single instruction can be made to perform a variety of operations. In short, it differs from an ordinary processor in that the correspondence between instructions and arithmetic-unit settings can be changed dynamically.

In an ordinary processor, where the correspondence between instructions and arithmetic-unit settings cannot be changed dynamically, the arithmetic-unit settings are encoded so that different settings correspond to different instructions, and this encoding is what is called an "instruction". If the number of settings the arithmetic unit can realize is increased to improve performance, the instruction bit width grows, and so does the size of the memory or other storage needed to hold the instructions. As a result, manufacturing cost rises, and so does the power consumed in decoding the encoded arithmetic-unit setting from an instruction.

In a semiconductor device using dynamically reconfigurable circuit technology, by contrast, the correspondence between instructions and arithmetic-unit settings can be changed dynamically. The instruction bit width needed to change the arithmetic-unit setting therefore grows little even when the number of realizable settings increases.
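The indirection described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `setting_store` and `execute`, and the use of Python operators as stand-ins for arithmetic-unit settings, are assumptions made only for illustration. The point it shows is that rewriting the stored setting changes what an unchanged instruction code does.

```python
import operator

# Hypothetical storage device: instruction code -> arithmetic-unit setting.
# Python operators stand in for hardware settings (an assumption of this sketch).
setting_store = {0: operator.add, 1: operator.sub}


def execute(instruction, a, b):
    """Apply whatever setting is currently bound to the instruction code."""
    return setting_store[instruction](a, b)


before = execute(0, 6, 3)        # instruction 0 currently means "add"
setting_store[0] = operator.mul  # dynamically rewrite the stored setting
after = execute(0, 6, 3)         # the same instruction 0 now means "multiply"
```

In this sketch the width of the instruction code depends only on how many entries the store holds at once, not on how many distinct settings could ever be loaded into it.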

A semiconductor device using dynamically reconfigurable circuit technology is therefore considered advantageous in manufacturing cost and power consumption compared with a semiconductor device, such as an ordinary processor, of equivalent arithmetic performance.

To raise the performance of a semiconductor device using dynamically reconfigurable circuit technology further, the device needs multiple arithmetic units whose setting changes can each be controlled independently. The arithmetic units must also be able to pass completed results to one another, and the settings for that data transfer must be changeable as well.

In such a semiconductor device, when one arithmetic process is carried out assembly-line fashion across multiple arithmetic units, an arithmetic unit may try to pass its result to another arithmetic unit that is not yet ready to receive it. In that case, the arithmetic unit trying to pass the result must be stopped, along with every arithmetic unit that performs an earlier step in the assembly line. This stopping is called pipeline interlock processing.

Conventional semiconductor devices using dynamically reconfigurable circuit technology do not implement a pipeline interlock mechanism. Instead, they are given many arithmetic units and many buffers for passing data between arithmetic units, so that situations requiring pipeline interlock processing rarely arise even when complex arithmetic processing occurs. Still more complex processing, for which pipeline interlock processing cannot be avoided, is divided into multiple arithmetic processes that are executed sequentially.

Non-Patent Document 1: "Reconfigurable System" (リコンフィギュラブルシステム), Ohmsha, pp. 141-208

Because no pipeline interlock mechanism is implemented, conventional semiconductor devices using dynamically reconfigurable circuit technology suffer from increased manufacturing cost: they need more arithmetic units and buffers to avoid pipeline interlock processing.

Moreover, dividing complex arithmetic processing into multiple arithmetic processes and executing them sequentially to avoid pipeline interlock processing prevents the operations from being performed efficiently.

One conceivable way to turn complex arithmetic processing into processing that needs no pipeline interlock is to have the arithmetic unit that passes data execute useless operations, unrelated to the actual processing, in order to adjust the timing of the data transfer. However, the useless operations increase power consumption, so this method is not adopted in semiconductor devices using dynamically reconfigurable circuit technology that are mounted in equipment requiring low power consumption.

The present invention has been made in view of the above, and its object is to provide a semiconductor device that performs operations efficiently without increasing manufacturing cost or power consumption.

A semiconductor device according to one aspect of the present invention comprises: a first arithmetic engine that performs a first operation every cycle and outputs, every cycle, first data indicating the result of the first operation and a first valid signal indicating a first value or a second value; a second arithmetic engine that performs a second operation every cycle and outputs, every cycle, second data indicating the result of the second operation and a second valid signal indicating the first value or the second value; and an inter-engine buffer that is used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, that permits writing of the first data or the second data if the first valid signal or the second valid signal indicates the first value, and that prohibits writing of the first data or the second data if the first valid signal or the second valid signal indicates the second value.

According to the present invention, a semiconductor device can be provided that performs operations efficiently without increasing manufacturing cost or power consumption.

FIG. 1 is a block diagram showing a semiconductor device according to an embodiment.
FIG. 2 is a diagram showing the inter-engine buffer.
FIG. 3 is a diagram showing an arithmetic engine.
FIG. 4 is a diagram showing a data register included in the inter-engine buffer.
FIG. 5 is a diagram showing the code transfer control unit.
FIG. 6 is a diagram showing the arrangement of code in the code memory.
FIG. 7 is a diagram showing an arithmetic unit included in the arithmetic engine.
FIG. 8 is a diagram showing an arithmetic operator included in the arithmetic unit.
FIG. 9 is a diagram showing an output controller included in the arithmetic engine.
FIG. 10 is a diagram showing another output controller included in the arithmetic engine.
FIG. 11 is a diagram showing an input controller included in the arithmetic engine.
FIG. 12 is a flowchart showing the code generation procedure.
FIG. 13 is a diagram showing a data dependency graph.
FIG. 14 is a flowchart showing the node scheduling procedure.
FIG. 15 is a flowchart showing the spill processing procedure.
FIG. 16 is a diagram showing the data dependency graph after node replacement.
FIG. 17 is a flowchart showing the code output procedure.
FIG. 18 is a timing chart for the semiconductor device when run according to the code generated from the data dependency graph.

FIG. 1 shows a semiconductor device 1 according to an embodiment. The semiconductor device 1 is a reconfigurable device that performs data processing as directed by an external device such as a processor. It has five arithmetic engines 11A to 11E, an inter-engine buffer 12, a code memory 13, a code transfer control device 14, and a data memory 15. "Data processing" here is a collective term for the series of operations formed by the individual operations that the arithmetic engines 11A to 11E perform.

The code memory 13 is connected to the arithmetic engines 11A to 11E and to the code transfer control device 14. The data memory 15 is connected to the input of the arithmetic engine 11A and to the output of the arithmetic engine 11E. The inter-engine buffer 12 is connected to the inputs of the arithmetic engines 11B to 11E and to the outputs of the arithmetic engines 11A to 11D.

The arithmetic engines 11A to 11E are data processing engines whose operation settings can be changed dynamically. Each engine changes its settings according to the code transferred from the code memory 13 before data processing starts, performs operations on the data input to it, and outputs data indicating the result to the inter-engine buffer 12 together with a Valid signal.

FIG. 2 shows the detailed configuration of the inter-engine buffer. The inter-engine buffer 12 has eight data registers 120A to 120H and is used to pass data between arithmetic engines. According to the code, each of the arithmetic engines 11A to 11E selects one of the data registers 120A to 120H to read from or write to. However, a write to the inter-engine buffer 12 is not performed when the Valid signal from the arithmetic engine 11A to 11D concerned is 0.
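The Valid-gated write described above can be sketched behaviorally in Python. This is only an illustrative model under stated assumptions: the class name `InterEngineBuffer`, its method names, and the initial register value of 0 are hypothetical and do not come from the patent, which describes hardware rather than software.

```python
class InterEngineBuffer:
    """Behavioral model of a buffer with eight data registers (120A to 120H)."""

    def __init__(self, n_registers=8):
        # Initial contents of 0 are an assumption of this sketch.
        self.registers = [0] * n_registers

    def write(self, index, data, valid):
        """Store data in the selected register only when the Valid signal is 1."""
        if valid == 1:
            self.registers[index] = data

    def read(self, index):
        """Return the current contents of the selected register."""
        return self.registers[index]


buf = InterEngineBuffer()
buf.write(2, 42, valid=1)  # accepted: Valid is 1
buf.write(2, 99, valid=0)  # ignored: Valid is 0, the register keeps 42
```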

The code memory 13 stores the code used by the arithmetic engines 11A to 11E. A processor or the like transfers the code from main storage to the code memory 13 before the semiconductor device 1 starts data processing.

After receiving a code transfer completion notification from a processor or the like, the code transfer control device 14 reads the code from the code memory 13 in order and transfers it to each of the arithmetic engines 11A to 11E.

The data memory 15 temporarily holds the data supplied to the semiconductor device 1 at the start of data processing, as well as intermediate or final results of data processing by the semiconductor device 1. Before data processing starts, an external device such as a processor writes the initial input data into the data memory 15. The semiconductor device 1 can also continue data processing by using an intermediate result held in the data memory 15 as input data again. The final result held in the data memory 15 is read out by an external device such as a processor and written to main storage.

In this embodiment there are five arithmetic engines 11A to 11E and eight data registers 120A to 120H, but these numbers may be changed according to the processing capability required of the semiconductor device 1.

FIG. 3 shows the detailed configuration of the arithmetic engine 11A. The arithmetic engine 11A has an input controller 110, arithmetic units 113A to 113E, data pipeline registers 114A to 114E, control pipeline registers 115A to 115E, an output controller 116, a final context ID latch 117, and a multiplexer 118.

The input controller 110, the arithmetic units 113A to 113E, and the output controller 116 are connected to the code memory 13 and the code transfer control device 14. The code sent from the code memory 13 to each of the arithmetic engines 11A to 11E is divided into the portions used by the input controller 110, the arithmetic units 113A to 113E, and the output controller 116, and each portion is stored in the internal storage of the corresponding component according to the tag value sent simultaneously from the code transfer control device 14.

The input controller 110 is also connected to the data memory 15. It interprets the code held in its internal storage in order, outputs an input A selection signal and an input B selection signal that indicate where the input data is to be read from, and reads data from the data memory 15.

The input controller 110 is also connected to the control pipeline register 115A and the multiplexer 118; interpreting the same code in order, it outputs a context ID and a Valid bit. When the Valid bit is 1, the multiplexer 118 selects the context ID output by the input controller 110; when the Valid bit is 0, it selects the value held in the final context ID latch 117. The selected value is set in the control pipeline register 115A together with the Valid bit, and it is also set in the final context ID latch 117. The final context ID latch 117 is 0 at the start of data processing.
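The selection rule for the multiplexer 118 and the final context ID latch 117 can be traced with a small Python sketch. The function names `select_context` and `run` and the sample input stream are assumptions of this sketch; only the rule itself (take the new ID when Valid is 1, otherwise reuse the latched ID, with the latch starting at 0) comes from the description.

```python
def select_context(valid_bit, new_context_id, latched_context_id):
    """Model of multiplexer 118: pass the new ID only when the Valid bit is 1."""
    return new_context_id if valid_bit == 1 else latched_context_id


def run(stream):
    """Feed (valid_bit, context_id) pairs through the multiplexer and latch."""
    latch = 0  # the final context ID latch is 0 at the start of data processing
    issued = []
    for valid_bit, context_id in stream:
        selected = select_context(valid_bit, context_id, latch)
        latch = selected  # the selected value is also stored back in the latch
        issued.append((valid_bit, selected))
    return issued


# Hypothetical stream: two invalid cycles sit between two valid ones.
issued = run([(1, 3), (0, 7), (0, 5), (1, 2)])
```

During the two invalid cycles the issued context ID stays at 3, which is why the settings downstream do not toggle while the Valid bit is 0.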

The control pipeline registers 115A to 115D are connected to the control pipeline registers 115B to 115E, respectively, and the control pipeline register 115E is connected to the output controller 116, forming a pipeline that controls the operation of the arithmetic units 113A to 113E. The context ID and Valid bit that the input controller 110 sets in the control pipeline register 115A in a given cycle are therefore passed on, one stage per cycle starting from the next cycle, through the control pipeline registers 115B to 115E and then to the output controller 116. This kind of transfer is called pipelined transfer.
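Pipelined transfer through the five control pipeline registers can be pictured as a shift register. The following Python sketch is a simplified model, not the patent's circuit: the function name `shift`, the list representation of registers 115A to 115E, and the use of `None` for empty stages are all assumptions made for illustration.

```python
def shift(stages, new_entry):
    """Advance the control pipeline by one cycle.

    stages[0] models register 115A and stages[-1] models register 115E,
    whose contents are what the output controller sees.
    """
    return [new_entry] + stages[:-1]


stages = [None] * 5  # empty pipeline; None as "empty" is this sketch's choice
for cycle in range(5):
    # Each entry is a (valid_bit, context_id) pair set by the input controller.
    stages = shift(stages, (1, cycle))
```

After five cycles the entry issued in the first cycle has reached the last stage, while the most recent entry sits in the first stage.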

The arithmetic units 113A to 113E are connected to the control pipeline registers 115A to 115E, respectively. The outputs of the arithmetic units 113A to 113D are connected through the data pipeline registers 114A to 114D to the inputs of the arithmetic units 113B to 113E, respectively, forming a data pipeline for operating on data. The input of the arithmetic unit 113A is connected to the data memory 15, and the output of the arithmetic unit 113E is connected to the inter-engine buffer 12 through the data pipeline register 114E.

The operation settings of the arithmetic units 113A to 113E are changeable. Every cycle, each unit reads a code from its internal storage, using the context ID set in its control pipeline register 115A to 115E as the address, and changes its operation setting according to the one piece of setting information selected by that code. It then operates, under the changed setting, on the data set on the input A data signal and input B data signal in that cycle and writes the result to its data pipeline register 114A to 114E. However, when the Valid bit set in the control pipeline register 115A to 115E is 0, the result is not written to the data pipeline register 114A to 114E.
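One cycle of a single arithmetic unit can be sketched in Python as follows. This is a behavioral illustration under assumptions: the function name `unit_step`, the dictionary standing in for the unit's internal storage, and the use of Python operators as settings are hypothetical and not part of the patent.

```python
import operator


def unit_step(settings, context_id, valid_bit, a, b, data_register):
    """Return the next value of the unit's data pipeline register.

    settings maps a context ID to an operation and stands in for the
    unit's internal storage (an assumption of this sketch).
    """
    if valid_bit == 0:
        return data_register  # no write: the register keeps its old value
    return settings[context_id](a, b)


settings = {0: operator.add, 1: operator.mul}
reg = unit_step(settings, 1, 1, 3, 4, data_register=0)    # valid: 3 * 4 is stored
held = unit_step(settings, 0, 0, 9, 9, data_register=reg)  # invalid: value is held
```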

Thus, in the arithmetic engine 11, when the Valid bit output from the input controller 110 is 0, the setting information indicated by the context ID saved in the final context ID latch 117 is used as the setting of the arithmetic units 113, so the setting of the arithmetic units 113 does not change while the Valid bit is 0. The values of the data pipeline registers 114 do not change while the Valid bit is 0 either, so the outputs of the arithmetic units 113 that take those values as input also stay unchanged. By reducing the signal-line transitions that occur while the Valid bit is 0 in this way, the semiconductor device 1 also makes the power consumed by the arithmetic units 113 and the data pipeline registers 114 smaller than in a conventional dynamically reconfigurable circuit that does not use pipeline interlock.

The output controller 116 is connected to the inter-engine buffer 12. It reads a code from its internal storage, using the context ID output by the control pipeline register 115E as the address, and outputs an output selection signal indicating the destination of the data according to that code. It also outputs the Valid bit set in the control pipeline register 115E unchanged as the Valid signal.

The arithmetic engines 11B to 11E differ from the arithmetic engine 11A in how they connect to the other components of the semiconductor device 1, but their internal configuration is the same. The input controller 110 and the arithmetic unit 113A of each of the arithmetic engines 11B to 11E are connected to the inter-engine buffer 12. The output controller 116 and the data pipeline register 114E of the arithmetic engine 11E are connected to the data memory 15.

Although each of the arithmetic engines 11A to 11E has been described as having five arithmetic units 113, this number may be changed according to the processing capability required of the semiconductor device 1, and it may differ from one arithmetic engine to another.

Next, the flow of processing in the semiconductor device 1 from start to finish is described. The processing falls into two broad parts: initialization before data processing, and the data processing itself.

First, the initialization is described.

An external device such as a processor stores the input data for the semiconductor device 1 in the data memory 15 and stores the code that defines the operation of the arithmetic engines 11A to 11E in the code memory 13.

When it has finished storing the code in the code memory 13, the external device, such as a processor, notifies the semiconductor device 1 of the completion of the code transfer with a pulse signal. On receiving the code transfer completion notification from the semiconductor device 1, the code transfer control device 14 reads the code from the code memory 13 in order and transfers it to each of the arithmetic engines 11A to 11E together with a tag indicating where it is to be stored.

The code transferred to the arithmetic engines 11A to 11E is divided into the portions used by the input controller 110, the arithmetic units 113A to 113E, and the output controller 116, and each portion is stored in the internal storage of the corresponding component.

When the code transfer control device 14 has finished transferring the code to the arithmetic engines 11A to 11E, it notifies the external device, such as a processor, that preparation for operation is complete.

After receiving the notification that preparation for operation is complete, the external device, such as a processor, notifies the semiconductor device 1 of the start of data processing with a pulse signal.
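The initialization handshake described above can be summarized as a small state machine. The Python sketch below is one possible reading, not the patent's implementation: the phase names, the event labels, and the choice to ignore out-of-order events are all assumptions introduced for illustration.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()          # external device is still writing code and input data
    DISTRIBUTING = auto()  # code transfer control device copies code to the engines
    READY = auto()         # preparation complete, waiting for the start pulse
    RUNNING = auto()       # data processing in progress


# (current phase, event) -> next phase. Event names are hypothetical labels.
TRANSITIONS = {
    (Phase.IDLE, "code_transfer_complete_pulse"): Phase.DISTRIBUTING,
    (Phase.DISTRIBUTING, "distribution_done"): Phase.READY,
    (Phase.READY, "start_pulse"): Phase.RUNNING,
}


def advance(phase, event):
    """Return the next phase; out-of-order events leave the phase unchanged
    (a choice made by this sketch, not stated in the source)."""
    return TRANSITIONS.get((phase, event), phase)


phase = Phase.IDLE
for event in ("code_transfer_complete_pulse", "distribution_done", "start_pulse"):
    phase = advance(phase, event)
```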

Next, the data processing is described.

The input controller 110 interprets the code held in its internal storage every cycle and outputs a context ID and a Valid bit every cycle according to that code. As described above, either the context ID output by the input controller 110 or the context ID held in the final context ID latch 117 is selected according to the Valid bit, and the selected context ID is passed, together with the Valid bit, to the arithmetic units 113A to 113E and the output controller 116 in pipelined fashion. The input controller 110 also outputs the input A selection signal and the input B selection signal every cycle according to the same code.

Every cycle, the data memory 15 and the inter-engine buffer 12 read out data according to the input A selection signal and input B selection signal output by the input controller 110 of each of the arithmetic engines 11A to 11E, and set that data on the input A data signal and input B data signal of the arithmetic unit 113A of that engine.

Every cycle, each of the arithmetic units 113A to 113E reads a code from its internal storage using the context ID as the address and changes its operation setting according to the one piece of setting information selected by that code. It then operates, under the changed setting, on the data set on the input A data signal and input B data signal in that cycle. The results are passed in pipelined fashion through the data pipeline registers 114A to 114E to the data memory 15 or the inter-engine buffer 12. However, when the Valid bit is 0, the result is not written to the data pipeline registers 114A to 114E.

Every cycle, the output controller 116 reads a code from its internal storage using the context ID as the address and outputs an output selection signal indicating the destination of the data according to that code. It also outputs the Valid bit set in the control pipeline register 115E unchanged as the Valid signal.

Every cycle, the data memory 15 and the inter-engine buffer 12 write the value set in the data pipeline register 114E to the location specified by the output selection signal. This write takes place only when the Valid signal is 1; when the Valid signal is 0, no write is performed.

When the input controllers 110 of the arithmetic engines 11A to 11E have finished interpreting all of the code, the semiconductor device 1 notifies the external device, such as a processor, that data processing is complete. This ends the data processing.

The external device, such as a processor, reads the results of the semiconductor device 1 from the data memory 15 once at least a prescribed number of cycles has elapsed after it receives the data processing completion notification. The prescribed number of cycles is the sum of the number of arithmetic units 113 in the arithmetic engine 11 that was last to finish interpreting its code and the number of cycles needed to write data to the data memory 15.
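The prescribed wait is a simple sum, shown here as a Python sketch. The function name `min_wait_cycles` is hypothetical, and the one-cycle write latency used in the example is an assumed figure; the source does not state how many cycles a write to the data memory 15 takes.

```python
def min_wait_cycles(num_arithmetic_units, memory_write_cycles):
    """Cycles the external device waits after the completion notification.

    Sum of the pipeline depth of the engine that finished last and the
    write latency of the data memory.
    """
    return num_arithmetic_units + memory_write_cycles


# Five arithmetic units as in the embodiment; a write latency of 1 cycle
# is an assumption made only for this example.
wait = min_wait_cycles(num_arithmetic_units=5, memory_write_cycles=1)
```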

次に、以上のような半導体装置１におけるデータ処理を実現する、データレジスタ１２０、コード転送制御ユニット１４、演算ユニット１１３、出力コントローラ１１６および入力コントローラ１１０について詳細に説明する。 Next, the data register 120, the code transfer control unit 14, the arithmetic unit 113, the output controller 116, and the input controller 110 that realize data processing in the semiconductor device 1 as described above will be described in detail.

データレジスタ１２０の実現例を図４に示す。データレジスタ１２０は、データラッチ１２００とＡＮＤロジック１２０１Ａ〜ＤとＯＲロジック１２０２とマルチプレクサ１２０３を有する。データはデータラッチ１２００に格納される。 An implementation example of the data register 120 is shown in FIG. 4. The data register 120 includes a data latch 1200, AND logics 1201A to 1201D, an OR logic 1202, and a multiplexer 1203. Data is stored in the data latch 1200.

ＡＮＤロジック１２０１Ａ〜Ｄは、それぞれ、デコーダＡ〜Ｄ出力とＶａｌｉｄＡ〜Ｄ信号のそれぞれのＡＮＤをＯＲロジック１２０２に入力する。これにより、演算エンジン１１Ａ〜Ｄの少なくとも一つからのＶａｌｉｄ信号が１であり、かつその出力選択信号がこのデータレジスタ１２０を選択する（即ちデコーダＡ〜Ｄ出力が１である）場合にのみ、ＯＲロジック１２０２は１を出力し、そうでない場合は０を出力する。ＯＲロジック１２０２からの出力信号はデータラッチ１２００のライトイネーブル信号として用いられる。このため、Ｖａｌｉｄ信号が０の場合にはデータレジスタ１２０にデータは書き込まれない。 The AND logics 1201A to 1201D each feed the AND of the corresponding decoder output (A to D) and the corresponding Valid signal (Valid A to D) to the OR logic 1202. As a result, the OR logic 1202 outputs 1 only when the Valid signal from at least one of the arithmetic engines 11A to 11D is 1 and that engine's output selection signal selects this data register 120 (that is, the corresponding decoder output is 1); otherwise it outputs 0. The output of the OR logic 1202 is used as the write enable signal for the data latch 1200. Consequently, no data is written to the data register 120 when the Valid signal is 0.

マルチプレクサ１２０３は、ＶａｌｉｄＡ〜Ｄ信号が１の場合に、演算エンジン１１Ａ〜Ｄからの書き込みデータＡ〜Ｄを選択する。例えば、ＶａｌｉｄＡ信号が１の場合には、演算エンジン１１Ａからの書き込みデータＡを選択する。ＶａｌｉｄＢ信号が１の場合には、演算エンジン１１Ｂからの書き込みデータBを選択する。ＶａｌｉｄＣ〜Ｄ信号についても同様である。マルチプレクサ１２０３により選択されたデータはデータラッチ１２００に書き込まれる。ＶａｌｉｄＡ〜Ｄ信号が全て０の場合や、ＶａｌｉｄＡ〜Ｄ信号のうち２つ以上が１である場合の動作は未定義である。ただし、ＶａｌｉｄＡ〜Ｄ信号が全て０の場合には、前述の通りデータラッチ１２００にはデータは書き込まれない。 The multiplexer 1203 selects the write data A to D from the arithmetic engines 11A to 11D when the Valid A to D signals are 1. For example, when the ValidA signal is 1, the write data A from the arithmetic engine 11A is selected. When the ValidB signal is 1, the write data B from the arithmetic engine 11B is selected. The same applies to the Valid C to D signals. Data selected by the multiplexer 1203 is written into the data latch 1200. The operation when the Valid A to D signals are all 0 or when two or more of the Valid A to D signals are 1 is undefined. However, when the Valid A to D signals are all 0, data is not written into the data latch 1200 as described above.
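The write path of one data register 120 can be modelled in a few lines. This is a behavioural sketch only: the function name and list-based signal encoding are assumptions, and the case the text leaves undefined (two or more Valid signals at 1 during a write) is turned into an explicit error here.

```python
def data_register_cycle(latched, decoder_out, valid, write_data):
    """Return the contents of data latch 1200 after one cycle.

    decoder_out, valid, write_data: 4-element lists for engines A to D.
    """
    # AND logics 1201A-D: decoder output AND Valid signal, per engine.
    enables = [d & v for d, v in zip(decoder_out, valid)]
    # OR logic 1202: the write enable of data latch 1200.
    if not any(enables):
        return latched  # no write; the latch keeps its value
    # Multiplexer 1203 selects by Valid; more than one Valid is undefined
    # in the text, so this sketch rejects it (an assumption).
    if sum(valid) != 1:
        raise ValueError("undefined: two or more Valid signals are 1")
    return write_data[valid.index(1)]
```

With `decoder_out=[1, 0, 0, 0]` and `valid=[1, 0, 0, 0]`, write data A replaces the latched value; with all Valid signals at 0 the latched value is returned unchanged.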

コード転送制御ユニット１４の実現例を図５に示す。コード転送制御ユニット１４は、メモリＩＤレジスタ１４０と、アドレスレジスタ１４１と、コードアドレスレジスタ１４２と、インクリメンタ１４３Ａ〜Ｃと、マルチプレクサ１４４Ａ〜Ｄと、Ｖａｌｉｄラッチ１４５と、比較器１４６を有する。 An implementation example of the code transfer control unit 14 is shown in FIG. 5. The code transfer control unit 14 includes a memory ID register 140, an address register 141, a code address register 142, incrementers 143A to 143C, multiplexers 144A to 144D, a Valid latch 145, and a comparator 146.

コードメモリ１３には、図６に示すように、演算エンジン１１Ａ〜Ｅが内部に有するコード格納用メモリの種類ごとに連続してコードが配置されているものとする。またコードの各々にはＥｎｄビットが付加されているものとする。Ｅｎｄビットの値は、メモリの種類ごとに連続するコードの最終に相当するコードの場合にのみ１となり、他のコードの場合は０となる。 As shown in FIG. 6, it is assumed that codes are continuously arranged in the code memory 13 for each type of code storage memory included in the arithmetic engines 11A to 11E. It is assumed that an End bit is added to each code. The value of the End bit is 1 only in the case of a code corresponding to the end of a continuous code for each type of memory, and 0 in the case of other codes.

このようなコードメモリ１３から演算エンジン１１Ａ〜１１Ｅに転送されるコードには、そのコードを格納するメモリの種類を示すメモリＩＤと、そのメモリのどの位置に格納するかを示すアドレスと、それが有効かを示すＶａｌｉｄビットとがコード転送制御ユニット１４により付加される。これらメモリＩＤ、アドレス、Ｖａｌｉｄビットの値は、それぞれメモリＩＤレジスタ１４０、アドレスレジスタ１４１、Ｖａｌｉｄラッチ１４５に格納される。なお、メモリＩＤ、アドレス、Ｖａｌｉｄビットの組をコードのタグと呼ぶ。 To each code transferred from the code memory 13 to the arithmetic engines 11A to 11E, the code transfer control unit 14 adds three items: a memory ID indicating the type of memory in which the code is to be stored, an address indicating the location within that memory, and a Valid bit indicating whether the code is valid. The values of the memory ID, the address, and the Valid bit are held in the memory ID register 140, the address register 141, and the Valid latch 145, respectively. The set of memory ID, address, and Valid bit is called the tag of the code.

コードアドレスレジスタ１４２には、コードメモリ１３からコードを読み出す際に用いられるアドレスが格納される。 The code address register 142 stores an address used when reading a code from the code memory 13.

プロセッサなどの外部装置は、半導体装置１にコード転送の完了を通知する前に、データ処理に必要なコードが格納されているコードメモリ１３の先頭アドレスを、予め外部バスを介してコードアドレスレジスタ１４２に書き込んでおく。 Before notifying the semiconductor device 1 that code transfer is complete, an external device such as a processor writes, via the external bus, the start address of the region of the code memory 13 that holds the codes required for data processing into the code address register 142.

半導体装置１がコード転送の完了通知を受けると、マルチプレクサ１４４Ａ〜Ｂにより初期値０が選択され、メモリＩＤレジスタ１４０およびアドレスレジスタ１４１に格納される。また、マルチプレクサ１４４Ｄにより１が選択されてＶａｌｉｄラッチ１４５にセットされる。 When the semiconductor device 1 receives the code transfer completion notification, the initial value 0 is selected by the multiplexers 144A-B and stored in the memory ID register 140 and the address register 141. Further, 1 is selected by the multiplexer 144D and set in the Valid latch 145.

次のサイクルから、メモリＩＤレジスタ１４０、アドレスレジスタ１４１、Ｖａｌｉｄラッチ１４５の値が、コードメモリ１３から送られるコードに付加されて、演算エンジン１１Ａ〜Ｅに転送される。また、サイクルの終了時に、アドレスレジスタ１４１、コードアドレスレジスタ１４２の値が、インクリメンタ１４３Ｂ〜Ｃによりそれぞれ１増やされる。 From the next cycle, the values of the memory ID register 140, the address register 141, and the valid latch 145 are added to the code sent from the code memory 13 and transferred to the arithmetic engines 11A to 11E. At the end of the cycle, the values of the address register 141 and the code address register 142 are incremented by 1 by the incrementers 143B to 143C, respectively.

コードメモリ１３からＥｎｄビットが１となるコードが転送されると、そのサイクルの終了時にはメモリＩＤレジスタ１４０の値がインクリメンタ１４３Ａにより１増やされ、またマルチプレクサ１４４Ｂにより０が選択されてアドレスレジスタ１４１の値は０にリセットされる。 When a code whose End bit is 1 is transferred from the code memory 13, the value of the memory ID register 140 is incremented by 1 by the incrementer 143A at the end of that cycle, and 0 is selected by the multiplexer 144B so that the value of the address register 141 is reset to 0.

以下、メモリＩＤレジスタ１４０の値が、有効なメモリＩＤの最大値＋１になるまで、同様の処理が繰り返される。メモリＩＤの値が有効なメモリＩＤの最大値＋１になると、Ｖａｌｉｄラッチ１４５に０がセットされ、演算エンジン１１Ａ〜Ｅへのコード転送は完了する。 Thereafter, the same processing is repeated until the value of the memory ID register 140 becomes the maximum value +1 of the effective memory ID. When the value of the memory ID becomes the maximum value +1 of the valid memory ID, 0 is set in the Valid latch 145, and the code transfer to the arithmetic engines 11A to 11E is completed.
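The tagging sequence performed by the code transfer control unit 14 can be sketched as a loop. This is a behavioural model under assumptions: the code memory is represented as a list of `(code, end_bit)` pairs, one code is transferred per loop iteration, and the function and variable names are illustrative.

```python
def tag_codes(code_memory, start_address, max_memory_id):
    """Return the (memory_id, address, valid, code) tuples sent to the engines."""
    memory_id = 0                 # memory ID register 140, initial value 0
    address = 0                   # address register 141, initial value 0
    code_address = start_address  # code address register 142
    valid = 1                     # Valid latch 145, set to 1 at start
    transferred = []
    while valid:
        code, end_bit = code_memory[code_address]
        transferred.append((memory_id, address, valid, code))
        code_address += 1         # incrementer 143C
        if end_bit:
            memory_id += 1        # incrementer 143A
            address = 0           # multiplexer 144B selects 0
        else:
            address += 1          # incrementer 143B
        # Comparator 146: stop once the memory ID passes the last valid ID.
        if memory_id == max_memory_id + 1:
            valid = 0
    return transferred
```

For a code memory holding `[("a", 0), ("b", 1), ("c", 1)]` with two memory types (IDs 0 and 1), the codes are tagged `(0, 0)`, `(0, 1)`, and `(1, 0)` as memory ID and address, all with Valid set to 1.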

演算ユニット１１３の実現例を図７に示す。演算ユニット１１３は、設定を動的に変更可能な演算器１１３０と、制御テーブルメモリ１１３１と、設定情報レジスタ１１３２Ａ〜Ｄと、マルチプレクサ１１３３を有する。 An implementation example of the arithmetic unit 113 is shown in FIG. 7. The arithmetic unit 113 includes a computing element 1130 whose settings can be changed dynamically, a control table memory 1131, setting information registers 1132A to 1132D, and a multiplexer 1133.

設定情報レジスタ１１３２Ａ〜Ｄには、データ処理において演算器１１３０で用いられる設定情報が保存されている。設定情報レジスタ１１３２の数は用途に応じて変更してよい。制御テーブルメモリ１１３１には、設定情報レジスタ１１３２Ａ〜Ｄの選択信号値がデータ処理で用いられるコンテキストＩＤの種類数分だけ、先頭から順に格納されている。 The setting information registers 1132A to 1132D store setting information used by the computing unit 1130 in data processing. The number of setting information registers 1132 may be changed according to the application. In the control table memory 1131, selection signal values of the setting information registers 1132 A to 1132 D are stored in order from the top by the number of types of context IDs used in data processing.

設定情報レジスタ１１３２Ａ〜Ｄおよび制御テーブルメモリ１１３１は、初期化においてコードメモリ１３から転送されるコードにより更新される。コード転送制御ユニット１４によりコードに付加されるＶａｌｉｄビットが１で、かつコードに付加されるメモリＩＤが、設定情報レジスタ１１３２Ａ〜Ｄ、制御テーブルメモリ１１３１を示すメモリＩＤと一致する場合に、メモリＩＤが一致する制御テーブルメモリ１１３１または設定情報レジスタ１１３２Ａ〜Ｄにコードが書き込まれる。制御テーブルメモリ１１３１にコードを書き込む場合には、コード転送制御ユニット１４によりコードに付加されるアドレスを書き込みアドレスとして用いる。 The setting information registers 1132A to 1132D and the control table memory 1131 are updated during initialization by the codes transferred from the code memory 13. When the Valid bit added to a code by the code transfer control unit 14 is 1 and the memory ID added to the code matches the memory ID of one of the setting information registers 1132A to 1132D or of the control table memory 1131, the code is written to whichever of these has the matching memory ID. When a code is written to the control table memory 1131, the address added to the code by the code transfer control unit 14 is used as the write address.

演算器１１３０の実現例を図８に示す。演算器１１３０は８ビットのＡＬＵとシフタをそれぞれ４つずつ備えており、３２ビットの２入力に対して、８ビット単位で異なる演算を行うよう設定できる。上述したように、演算器１１３０の設定は動的に変更可能である。この演算結果を３２ビットの出力の一つとする。また、演算器１１３０はクロスバーを備え、シフタからの８ビット出力４つの配置順を変更した結果を３２ビット出力の一つとする。 An implementation example of the computing element 1130 is shown in FIG. 8. The computing element 1130 has four 8-bit ALUs and four 8-bit shifters, and can be configured to perform a different operation on each 8-bit slice of its two 32-bit inputs. As described above, the settings of the computing element 1130 can be changed dynamically. The result of this operation is one of the 32-bit outputs. The computing element 1130 also has a crossbar, and the result of reordering the four 8-bit shifter outputs is the other 32-bit output.

この例では、演算器１１３０の設定情報は８ビット演算あたり、ＡＬＵの入力の一つを直値とするかどうかを決める入力モードを１ビット、直値を８ビット、ＡＬＵ設定を２ビット、シフト値を３ビット、クロスバー設定を２ビットとする計１６ビットからなる。演算器１１３０全体では６４ビットの設定情報となる。 In this example, the setting information of the computing element 1130 consists of 16 bits per 8-bit operation: 1 bit for the input mode that determines whether one ALU input is an immediate value, 8 bits for the immediate value, 2 bits for the ALU setting, 3 bits for the shift amount, and 2 bits for the crossbar setting. The computing element 1130 as a whole therefore has 64 bits of setting information (4 × 16).
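The bit budget above can be checked with a small packing sketch. The field widths come from the text; the field order within the 16-bit word, the lane order within the 64-bit word, and all names are assumptions made for illustration.

```python
# (name, width in bits) per 8-bit lane; widths from the text, order assumed.
FIELDS = (
    ("input_mode", 1),  # whether one ALU input is an immediate value
    ("immediate", 8),   # the immediate value
    ("alu_op", 2),      # ALU setting
    ("shift", 3),       # shift amount
    ("crossbar", 2),    # crossbar setting
)

def pack_lane(**values):
    """Pack one lane's fields into a 16-bit word, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def pack_setting(lanes):
    """Pack four lanes into one 64-bit setting word."""
    if len(lanes) != 4:
        raise ValueError("the computing element has four 8-bit lanes")
    word = 0
    for lane in lanes:
        word = (word << 16) | pack_lane(**lane)
    return word
```

The widths sum to 16 bits per lane, so four lanes fill exactly 64 bits, matching the total stated in the text.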

演算ユニット１１３は、入力コントローラ１１０から送信されるコンテキストＩＤをアドレスとして制御テーブルメモリ１１３１から値を読み出し、その値をマルチプレクサ１１３３の選択信号として設定情報レジスタ１１３２Ａ〜Ｄの一つを選択し、そこから設定情報を読み出して演算器１１３０に適用する。これによりコンテキストＩＤごとに演算の設定を変えるという動作が実現される。 The arithmetic unit 113 reads a value from the control table memory 1131 using the context ID sent from the input controller 110 as the address, uses that value as the selection signal of the multiplexer 1133 to select one of the setting information registers 1132A to 1132D, and reads the setting information from the selected register and applies it to the computing element 1130. This realizes the operation of changing the operation settings for each context ID.
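The two-level lookup performed inside an arithmetic unit 113 amounts to one indirection. The sketch below models the control table memory 1131 and the setting information registers 1132A to 1132D as plain lists; the function name and the example values are assumptions.

```python
def select_setting(context_id, control_table, setting_registers):
    """Return the setting information applied for a given context ID."""
    # Control table memory 1131: context ID -> register selection value.
    select = control_table[context_id]
    # Multiplexer 1133: pick one of the setting information registers.
    return setting_registers[select]
```

Because several context IDs may map to the same selection value, a handful of setting registers can serve a longer sequence of contexts, e.g. `select_setting(2, [0, 3, 0], ["add", "sub", "shl", "xor"])` returns `"add"`.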

演算エンジン１１Ｅが有する出力コントローラ１１６の実現例を図９に示す。演算エンジン１１Ｅが有する出力コントローラ１１６は、ベースアドレスレジスタ１１６０Ａ〜Ｂと、制御テーブルメモリ１１６１と、加算器１１６２とマルチプレクサ１１６３を有する。 An implementation example of the output controller 116 of the arithmetic engine 11E is shown in FIG. 9. The output controller 116 of the arithmetic engine 11E includes base address registers 1160A and 1160B, a control table memory 1161, an adder 1162, and a multiplexer 1163.

ベースアドレスレジスタ１１６０Ａ〜Ｂには、データメモリ１５への出力アドレスを計算する際に用いられるベースアドレスが格納される。ベースアドレスレジスタ１１６０の数は１以上の任意の数でよい。制御テーブルメモリ１１６１には、ベースアドレスレジスタ１１６０Ａ〜Ｂを選択するための選択信号値とオフセットとが対になって格納されている。ベースアドレスレジスタ１１６０Ａ〜Ｂと制御テーブルメモリ１１６１の初期設定は、演算ユニット１１３の設定情報レジスタ１１３２の初期化と同様の手法で行われる。 The base address registers 1160A and 1160B store the base addresses used to calculate output addresses in the data memory 15. The number of base address registers 1160 may be any number of 1 or more. The control table memory 1161 stores pairs consisting of a selection signal value for selecting one of the base address registers 1160A and 1160B and an offset. The base address registers 1160A and 1160B and the control table memory 1161 are initialized in the same way as the setting information registers 1132 of the arithmetic unit 113.

演算エンジン１１Ｅが有する出力コントローラ１１６は、入力コントローラ１１０から送信されるコンテキストＩＤをアドレスとして制御テーブルメモリ１１６１を参照し、アドレス計算に用いるベースアドレスが格納されているベースアドレスレジスタ１１６０Ａ〜Ｂを選択するための選択信号値と、オフセットを制御テーブルメモリ１１６１から読み出す。読み出された選択信号値はマルチプレクサ１１６３の選択信号となり、ベースアドレスレジスタ１１６０Ａ〜Ｂのいずれか一方が選択され、そこに格納されるベースアドレスが読み出される。読み出されたベースアドレスは、出力選択信号として外部に出力されるとともに、加算器１１６２においてオフセットとの加算が行われる。その加算結果は、選択されたベースアドレスレジスタ１１６０に書き戻される。ただし、Ｖａｌｉｄビットとして０が入力された場合には、ベースアドレスレジスタ１１６０の更新は行われない。 The output controller 116 of the arithmetic engine 11E refers to the control table memory 1161 using the context ID transmitted from the input controller 110 as an address, and selects the base address registers 1160A to 1160B that store base addresses used for address calculation. The selection signal value and the offset for reading are read from the control table memory 1161. The read selection signal value becomes the selection signal of the multiplexer 1163, and any one of the base address registers 1160A to 1160B is selected, and the base address stored therein is read. The read base address is output to the outside as an output selection signal, and the adder 1162 adds the offset. The addition result is written back to the selected base address register 1160. However, when 0 is input as the Valid bit, the base address register 1160 is not updated.
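The address generation in the output controller 116 of the arithmetic engine 11E behaves like a post-increment addressing mode. The class below is a behavioural sketch; the class name, method name, and list-based tables are assumptions.

```python
class OutputController11E:
    """Sketch of base-address selection with offset write-back."""

    def __init__(self, base_addresses, control_table):
        self.base = list(base_addresses)  # base address registers 1160A, 1160B
        self.table = control_table        # context ID -> (select, offset)

    def cycle(self, context_id, valid):
        """Return (output selection signal, Valid signal) for one cycle."""
        select, offset = self.table[context_id]  # control table memory 1161
        address = self.base[select]              # multiplexer 1163
        if valid:
            # Adder 1162: write base + offset back to the selected register.
            self.base[select] = address + offset
        # When Valid is 0 the base address register is left unchanged.
        return address, valid
```

Starting from base addresses `0x100` and `0x200` with table `[(0, 4), (1, 8)]`, two valid cycles with context ID 0 output `0x100` and then `0x104`, while a cycle with Valid 0 leaves the selected base address untouched.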

一方、演算エンジン１１Ａ〜１１Ｄが有する出力コントローラ１１６は、図１０に示すように、演算エンジン１１Ｅが有する出力コントローラ１１６とは異なる。演算エンジン１１Ａ〜１１Ｄが有する出力コントローラ１１６の制御テーブルメモリ１１６１には、出力に用いる演算エンジン間バッファ１２のデータレジスタ１２０Ａ〜Ｈの一つを選択するための選択信号の値が、データ処理に用いられるコンテキストＩＤの数に対応して記憶されている。 On the other hand, as shown in FIG. 10, the output controllers 116 of the arithmetic engines 11A to 11D differ from the output controller 116 of the arithmetic engine 11E. The control table memory 1161 of the output controller 116 in each of the arithmetic engines 11A to 11D stores, for each context ID used in data processing, the value of a selection signal that selects one of the data registers 120A to 120H of the inter-engine buffer 12 to be used for output.

演算エンジン１１Ａ〜１１Ｄが有する出力コントローラ１１６は、入力コントローラ１１０から送信されるコンテキストＩＤをアドレスとして制御テーブルメモリ１１６１から選択信号の値を読み出し、それを出力選択信号として出力する。 The output controller 116 included in the arithmetic engines 11A to 11D reads the value of the selection signal from the control table memory 1161 using the context ID transmitted from the input controller 110 as an address, and outputs it as an output selection signal.

図１１は演算エンジン１１Ａ〜Ｅが有する入力コントローラ１１０の実現例を示す図である。入力コントローラ１１０は、入力Ａ選択部１１００と、入力Ｂ選択部１１０１と、コンテキスト情報メモリ１１０２と、コンテキストＩＤラッチ１１０３と、データ処理終了ラッチ１１０４と、インクリメンタ１１０５と、マルチプレクサ１１０６と、ラッチ１１０７Ａ〜Ｂと、タイミングラッチ１１０８Ａ〜Ｂを有する。 FIG. 11 is a diagram illustrating an implementation example of the input controller 110 included in the calculation engines 11A to 11E. The input controller 110 includes an input A selection unit 1100, an input B selection unit 1101, a context information memory 1102, a context ID latch 1103, a data processing end latch 1104, an incrementer 1105, a multiplexer 1106, and latches 1107A to 1107A. B and timing latches 1108A-B.

入力Ａ選択部１１００と入力Ｂ選択部１１０１は、それぞれ入力Ａ選択信号と入力Ｂ選択信号を生成するための回路である。これらは、演算エンジン１１Ａが有する入力コントローラ１１０の場合は、演算エンジン１１Ｅが有する出力コントローラ１１６と同一のものであり、また演算エンジン１１Ｂ〜Ｅが有する入力コントローラ１１０の場合は、演算エンジン１１Ａ〜Ｄが有する出力コントローラ１１６と同一のもので、出力選択信号が入力Ａ〜Ｂ選択信号として用いられる。Ｖａｌｉｄ信号は出力されない。 The input A selection unit 1100 and the input B selection unit 1101 are circuits that generate the input A selection signal and the input B selection signal, respectively. In the input controller 110 of the arithmetic engine 11A, they are identical to the output controller 116 of the arithmetic engine 11E; in the input controllers 110 of the arithmetic engines 11B to 11E, they are identical to the output controllers 116 of the arithmetic engines 11A to 11D. In both cases the output selection signal is used as the input A or input B selection signal, and no Valid signal is output.

コンテキスト情報メモリ１１０２は、Ｖａｌｉｄビットとデータ処理終了ビットとを含むコンテキスト情報を、データ処理で用いられるコンテキストＩＤの数だけ保存している。 The context information memory 1102 stores context information including a Valid bit and a data processing end bit by the number of context IDs used in data processing.

コンテキストＩＤラッチ１１０３には、出力するコンテキストＩＤの値が格納される。コンテキストＩＤが出力されると、コンテキストＩＤラッチ１１０３の値はインクリメンタ１１０５により１だけ増やされる。 The context ID latch 1103 stores the value of the context ID to be output. When the context ID is output, the value of the context ID latch 1103 is incremented by 1 by the incrementer 1105.

データ処理終了ラッチ１１０４は、データ処理が完了したかどうかを示す信号を格納するラッチである。半導体装置１の初期状態においてこのラッチの値は、データ処理の完了を意味する１である。 The data processing end latch 1104 is a latch that stores a signal indicating whether or not the data processing is completed. In the initial state of the semiconductor device 1, the value of this latch is 1 which means the completion of data processing.

次に、入力コントローラ１１０の動作を説明する。 Next, the operation of the input controller 110 will be described.

プロセッサなどの外部装置から半導体装置１にデータ処理の開始が通知されると、データ処理終了ラッチ１１０４が０にセットされる。また、このサイクルではコンテキストＩＤラッチ１１０３は０を示しており、コンテキスト情報メモリ１１０２のアドレス０に格納されているコンテキスト情報が読み出される。この読み出されたコンテキスト情報が含んでいるＶａｌｉｄビットとデータ処理終了ビットとが、それぞれラッチ１１０７Ａ〜Ｂにセットされる。 When an external device such as a processor notifies the semiconductor device 1 of the start of data processing, the data processing end latch 1104 is set to 0. In this cycle the context ID latch 1103 holds 0, so the context information stored at address 0 of the context information memory 1102 is read. The Valid bit and the data processing end bit contained in the read context information are set in the latches 1107A and 1107B, respectively.

次のサイクルに、入力Ａ選択部１１００と入力Ｂ選択部１１０１は、コンテキストＩＤラッチ１１０３に格納されるコンテキストＩＤと、ラッチ１１０７Ａに格納されるＶａｌｉｄビットとに従って、それぞれ入力Ａ選択信号、入力Ｂ選択信号を出力する。またこれらコンテキストＩＤとＶａｌｉｄビットは、それらが演算ユニット１１３Ａに到着するタイミングと、入力Ａ選択信号、入力Ｂ選択信号によってそれぞれ読み出される入力Ａデータ、入力Ｂデータが演算ユニット１１３Ａに到着するタイミングとが等しくなるよう、タイミングラッチ１１０８Ａ〜Ｂによりタイミングが調整された後に、演算ユニット１１３Ａに出力される。また、サイクルの終わりに、コンテキストＩＤラッチ１１０３の値がインクリメンタ１１０５により１だけ増やされる。同様にして、コンテキスト情報メモリ１１０２からデータ処理終了ビットとして０が読み出される間、サイクルごとにコンテキストＩＤとＶａｌｉｄビットとが出力される。 In the next cycle, the input A selection unit 1100 and the input B selection unit 1101 output the input A selection signal and the input B selection signal, respectively, according to the context ID held in the context ID latch 1103 and the Valid bit held in the latch 1107A. The context ID and the Valid bit are output to the arithmetic unit 113A after their timing has been adjusted by the timing latches 1108A and 1108B, so that they arrive at the arithmetic unit 113A at the same time as the input A data and input B data read out by the input A and input B selection signals. At the end of the cycle, the value of the context ID latch 1103 is incremented by 1 by the incrementer 1105. In the same way, a context ID and a Valid bit are output every cycle for as long as 0 is read from the context information memory 1102 as the data processing end bit.

コンテキスト情報メモリ１１０２からデータ処理終了ビットとして１が読み出されると、コンテキストＩＤとＶａｌｉｄビットが出力されるとともに、データ処理終了ラッチ１１０４が１に設定される。 When 1 is read from the context information memory 1102 as the data processing end bit, the context ID and the Valid bit are output and the data processing end latch 1104 is set to 1.

次のサイクル以後、データ処理完了信号として１が出力され、また、データ処理終了ラッチ１１０４が１なので、Ｖａｌｉｄビットは０となる。この状態が、演算エンジン１１でデータ処理が完了した状態である。 After the next cycle, 1 is output as the data processing completion signal, and since the data processing end latch 1104 is 1, the Valid bit is 0. This state is a state where the data processing is completed in the arithmetic engine 11.
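The sequencing of the input controller 110 can be summarised by a simplified loop. This sketch collapses the one-cycle read delay and the timing latches into a single step per context, and represents the context information memory 1102 as a list of `(valid, end)` pairs; those simplifications and all names are assumptions.

```python
def run_input_controller(context_info):
    """Return the emitted (context_id, valid) pairs and the final end latch."""
    emitted = []
    context_id = 0  # context ID latch 1103 starts at 0
    done = 0        # data processing end latch 1104, cleared at start
    while not done:
        valid, end = context_info[context_id]  # context information memory 1102
        emitted.append((context_id, valid))    # sent towards arithmetic unit 113A
        context_id += 1                        # incrementer 1105
        if end:
            done = 1                           # end bit read: latch 1104 set to 1
    return emitted, done
```

For `[(1, 0), (0, 0), (1, 1)]` the controller emits context IDs 0, 1, and 2 with Valid bits 1, 0, and 1, then reports completion. The middle entry shows how a stored Valid bit of 0 inserts an idle cycle under software control.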

以上述べたように、演算エンジン１１Ａ〜Ｅはパイプライン式転送によってデータとＶａｌｉｄ信号の両者を同じタイミングで出力する。演算エンジン間バッファ１２はＶａｌｉｄ信号が１であるときに受け取ったデータのみをバッファ（データレジスタ１２０に相当する）に書き込む。 As described above, the arithmetic engines 11A to 11E output both the data and the Valid signal at the same timing by pipeline transfer. The inter-arithmetic engine buffer 12 writes only the data received when the Valid signal is 1 to the buffer (corresponding to the data register 120).

ここで、演算エンジン間バッファ１２において利用できるバッファが存在しなくなることが予見できるサイクルではＶａｌｉｄ信号として０が出力されるように、コンテキスト情報メモリ１１０２に格納されるコードに従って入力コントローラ１１０をソフトウェア制御すれば、パイプラインインターロック機構を有さない構成において、演算器の数や演算エンジン間バッファ１２が有するバッファの数を少なくしても、半導体装置１は効率よく演算を行うことが可能となる。 Here, the input controller 110 is controlled by software according to the code stored in the context information memory 1102 so that 0 can be output as the Valid signal in a cycle in which it can be predicted that there is no buffer available in the inter-arithmetic engine buffer 12. For example, in a configuration that does not have a pipeline interlock mechanism, the semiconductor device 1 can perform operations efficiently even if the number of arithmetic units and the number of buffers of the inter-arithmetic engine buffer 12 are reduced.

演算エンジン間バッファ１２の有するバッファの数が少ない構成とすると、これらバッファにおいて演算結果を一時的に格納する動作により消費される電力をパイプラインインターロックを用いない従来の動的リコンフィギュラブル回路技術を用いた半導体装置と比較して小さくすることが可能となる。また、製造コストの増大を抑えることができる。 If the inter-engine buffer 12 is configured with a small number of buffers, the power consumed by temporarily storing operation results in those buffers can be made smaller than in a conventional semiconductor device that uses dynamic reconfigurable circuit technology without pipeline interlock. An increase in manufacturing cost can also be suppressed.

さらに本実施形態において、演算エンジン１１Ａ〜Ｅは、直前のサイクルに出力したコンテキストＩＤを記憶する最終コンテキストＩＤラッチ１１７と、マルチプレクサ１１８を具備する。入力コントローラ１１０から出力されるＶａｌｉｄビットが０の場合には、最終コンテキストＩＤラッチ１１７に記憶されているコンテキストＩＤがマルチプレクサ１１８により選択され、そのＶａｌｉｄビットとともに演算ユニット１１３へパイプライン式転送によって出力されるよう制御する。演算ユニット１１３の出力結果は、Ｖａｌｉｄビットが０のときにはデータパイプラインレジスタ１１４には書き込まれない。このため、Ｖａｌｉｄビットが０である間は、演算ユニット１１３の入出力データ信号と設定信号は、最後にＶａｌｉｄビットが１であった状態から変化しない。 Further, in the present embodiment, each of the arithmetic engines 11A to 11E includes a last context ID latch 117, which stores the context ID output in the immediately preceding cycle, and a multiplexer 118. When the Valid bit output from the input controller 110 is 0, the multiplexer 118 selects the context ID stored in the last context ID latch 117, and that context ID is output to the arithmetic units 113 by pipelined transfer together with the Valid bit. The output of an arithmetic unit 113 is not written to the data pipeline register 114 when the Valid bit is 0. Therefore, while the Valid bit is 0, the input/output data signals and the setting signals of the arithmetic units 113 remain as they were when the Valid bit was last 1.
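The effect of the last context ID latch 117 and the multiplexer 118 on the stream of context IDs can be illustrated as follows. The function name and the assumed initial latch value of 0 are not from the text.

```python
def hold_last_context(stream, initial_context_id=0):
    """Replace the context ID with the last valid one whenever Valid is 0.

    stream: iterable of (context_id, valid) pairs from the input controller 110.
    """
    last = initial_context_id  # last context ID latch 117 (initial value assumed)
    forwarded = []
    for context_id, valid in stream:
        if valid:
            last = context_id  # a new context ID passes through and is latched
        # Multiplexer 118: when Valid is 0, forward the latched context ID.
        forwarded.append((last, valid))
    return forwarded
```

Given `[(0, 1), (1, 0), (2, 0), (3, 1)]`, the forwarded context IDs are 0, 0, 0, 3. The setting selected inside each arithmetic unit therefore does not toggle during idle cycles, which is the source of the power saving described above.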

したがって、演算エンジン間バッファ１２のみならず演算エンジン１１内の演算ユニット１１３およびデータパイプラインレジスタ１１４が消費する電力についても、パイプラインインターロックを用いない従来の動的リコンフィギュラブル回路と比較して小さくすることが可能となる。 Therefore, not only the inter-arithmetic engine buffer 12 but also the power consumed by the arithmetic unit 113 and the data pipeline register 114 in the arithmetic engine 11 are compared with the conventional dynamic reconfigurable circuit that does not use the pipeline interlock. It can be made smaller.

効率のよい演算と消費電力の削減は、前述の通り、入力コントローラ１１０をソフトウェア制御することによって達成される。そのためのコードは、別の半導体装置等により予め作成してコードメモリ１３に格納し、コード転送制御ユニット１４によりデータ処理の開始前に予めコンテキスト情報メモリ１１０２に転送して格納しておく必要がある。以下では、入力コントローラ１１０のコードを予め作成するコンパイラについて説明する。 As described above, efficient operation and reduced power consumption are achieved by controlling the input controller 110 in software. The code for this purpose must be created in advance, for example by another semiconductor device, stored in the code memory 13, and transferred to and stored in the context information memory 1102 by the code transfer control unit 14 before data processing starts. The following describes a compiler that creates the code for the input controller 110 in advance.

このコンパイラは、例えばコンピュータのプログラムとして実現することができ、演算エンジン１１Ａ〜Ｅの間でどのようにして演算結果の受け渡しが行われるかを示すデータ依存グラフ、演算エンジン１１Ａ〜Ｅのパイプライン段数、およびデータメモリの読み書きに要するサイクル数を入力して記憶する記憶部と、該記憶部に記憶された情報を参照し、入力コントローラ１１０が有するコンテキスト情報メモリ１１０２に格納されるコードと、演算エンジン１１Ａ〜Ｄが有する制御テーブルメモリ１１６１に格納されるコードを生成するコード生成部とを有する。データ依存グラフのデータは、例えばユーザが予め作成しておく。 This compiler can be realized, for example, as a computer program, and has a storage unit and a code generation unit. The storage unit receives and stores a data dependence graph showing how operation results are passed among the arithmetic engines 11A to 11E, the number of pipeline stages of the arithmetic engines 11A to 11E, and the number of cycles required to read and write the data memory. The code generation unit refers to the information stored in the storage unit and generates the code to be stored in the context information memory 1102 of each input controller 110 and the code to be stored in the control table memory 1161 of the arithmetic engines 11A to 11D. The data of the data dependence graph is prepared in advance, for example by the user.

本実施形態の半導体装置１は、ある与えられた性能を実現するのに必要な演算エンジン間バッファ１２のバッファ数を少なくできることを特徴の一つとしている。このため、コンパイラもまた、演算エンジン間バッファ１２のバッファ数が少ない場合においても正しくコードを生成できなくてはならない。 One feature of the semiconductor device 1 of the present embodiment is that the number of buffers of the inter-arithmetic engine buffers 12 necessary for realizing a given performance can be reduced. For this reason, the compiler must also be able to generate code correctly even when the number of buffers of the inter-arithmetic engine buffer 12 is small.

この目的のために、以下で詳しく説明するコンパイル手法においては、与えられた複数の演算を演算エンジン１１Ａ〜Ｅがどの順序で実行すべきかを次のように決定する。すなわち、その演算への入力となるデータが全く生成されていないような演算よりも、その演算への入力となるデータの一部が既に演算され、該データの一部が演算エンジン間バッファ１２に書き込まれているような演算が優先して実行されるようにする。そうすれば、演算エンジン間バッファ１２におけるデータの滞留時間を極力少なくできる。このようなコンパイル手法は、演算エンジン間バッファ１２が多くのバッファを有することを前提としていた従来のコンパイル手法とは異なるものである。 For this purpose, in the compiling method described in detail below, the order in which the operation engines 11A to 11E execute a plurality of given operations is determined as follows. That is, rather than an operation in which no data that is input to the operation is generated, a part of the data that is input to the operation is already calculated, and a part of the data is stored in the inter-engine buffer 12. The operation as written is preferentially executed. By doing so, the data retention time in the inter-engine buffer 12 can be reduced as much as possible. Such a compiling method is different from the conventional compiling method which presupposes that the inter-operation engine buffer 12 has many buffers.

以下、本実施形態に係るコンパイル手法を詳細に説明する。 Hereinafter, the compiling method according to the present embodiment will be described in detail.

コンパイラは、演算エンジン１１Ａ〜Ｅが有するコンテキスト情報メモリ１１０２に格納されるコードと、演算エンジン１１Ａ〜Ｄが有する制御テーブルメモリ１１６１に格納されるコードを出力する。このため、上述したコード生成部は、演算エンジン１１Ａ〜Ｅにより行われる複数の演算について、入力と出力のデータ依存関係を表すデータ依存グラフを解析する。前述の通り、データ依存グラフは、演算エンジン１１Ａ〜Ｅの間でどのようにして演算結果の受け渡しが行われるかを表している。 The compiler outputs code stored in the context information memory 1102 included in the operation engines 11A to 11E and code stored in the control table memory 1161 included in the operation engines 11A to 11D. For this reason, the above-described code generation unit analyzes a data dependence graph representing a data dependence relationship between input and output for a plurality of computations performed by the computation engines 11A to 11E. As described above, the data dependence graph represents how calculation results are transferred between the calculation engines 11A to 11E.

コード生成部は、ある演算への入力となるデータの一部が既に演算されており、該データの一部が演算エンジン間バッファ１２に書き込まれているような演算をデータ依存グラフから特定する特定部と、そのような演算が優先して実行されるように演算エンジン１１Ａ〜Ｅにより行われる演算の順序を決めるスケジューリング部と、この順序に従い、各サイクルにおいて演算エンジン１１Ａ〜Ｅの各々が演算を行うか否かを決定する決定部と、演算エンジン１１Ａ〜Ｅの各々が演算を行うならば、対応する入力コントローラ１１０からＶａｌｉｄビットとして１を出力し、演算を行なわないならば、対応する入力コントローラ１１０からＶａｌｉｄビットとして０を出力するためのコードを生成する生成部とを有する。また、コード生成部は、演算エンジン１１Ａ〜Ｅがサイクルごとにどの演算の設定を用いればよいかを規定するコードも生成する。 The code generation unit specifies from the data dependence graph an operation in which a part of data to be input to a certain operation has already been calculated and a part of the data is written in the inter-operation engine buffer 12 And a scheduling unit that determines the order of operations performed by the operation engines 11A to 11E so that such operations are executed with priority, and according to this order, each of the operation engines 11A to 11E performs operations in each cycle. If each of the calculation engines 11A to 11E performs a calculation, 1 is output as the Valid bit from the corresponding input controller 110, and if the calculation is not performed, the corresponding input controller And a generation unit that generates a code for outputting 0 as a Valid bit from 110. The code generation unit also generates a code that defines which calculation setting the calculation engines 11A to 11E should use for each cycle.

コード生成のより具体的な処理手順は、例えば、図１２のフローチャートに示す手順に従う。 A more specific processing procedure for code generation follows, for example, the procedure shown in the flowchart of FIG. 12.

図１３にデータ依存グラフの一例を示す。データ依存グラフの一つのノードは演算エンジン１１で実行される一つの演算に対応している。データ依存グラフの矢印は、矢印の元に接続するノードに対応する演算の結果が、矢印の先に接続するノードに対応する演算の入力として使われることを示している。本例のデータ依存グラフには次のような制約がある。すなわち、任意のノードの入力となるノードの数は高々２つとし、また任意のノードの出力は常に一つのノードの入力としてのみ用いられるものとする。データ依存グラフのノードには、その演算を実行する演算エンジン１１Ａ〜Ｅを識別するための、Ａ〜Ｅのいずれかのラベルが付与される。また、各ラベルのノードには、ＩＤ（例えば０から始まる番号）が付与される。 FIG. 13 shows an example of the data dependence graph. One node of the data dependence graph corresponds to one calculation executed by the calculation engine 11. The arrow in the data dependence graph indicates that the result of the operation corresponding to the node connected to the source of the arrow is used as the input of the operation corresponding to the node connected to the end of the arrow. The data dependence graph of this example has the following restrictions. That is, it is assumed that the number of nodes that are input to an arbitrary node is at most two, and the output of an arbitrary node is always used only as an input of one node. A label of any of A to E for identifying the calculation engines 11A to 11E that execute the calculation is given to the node of the data dependence graph. Further, an ID (for example, a number starting from 0) is assigned to each label node.

図１２のフローチャートに沿って、演算エンジン１１Ａ〜Ｅがサイクルごとにどの演算の設定を用いればよいかを図１３のデータ依存グラフから求める手法について説明する。 Following the flowchart of FIG. 12, the method of determining from the data dependence graph of FIG. 13 which operation setting each of the arithmetic engines 11A to 11E should use in each cycle is described below.

説明を容易にするため、演算エンジン１１Ａ〜Ｅのレイテンシを１とし、また演算エンジン間バッファ１２はデータレジスタ１２０Ａ〜Ｂの２つのみを有するものとする。なお、この手法はデータレジスタ１２０の数が２より大きい場合にも適用できる。また演算エンジン１１Ａ〜Ｅのレイテンシが２以上の場合にも適用できる。ここで、演算エンジン１１のレイテンシとは、演算エンジン１１が演算を完了するのに必要なサイクル数のことをいう。例えば、演算エンジン１１の演算ユニット１１３の数が５であれば、レイテンシは５である。 For ease of explanation, it is assumed that the latency of the operation engines 11A to 11E is 1, and the inter-operation engine buffer 12 has only two data registers 120A to 120B. This method can also be applied when the number of data registers 120 is greater than two. The present invention can also be applied when the latency of the arithmetic engines 11A to 11E is 2 or more. Here, the latency of the calculation engine 11 refers to the number of cycles necessary for the calculation engine 11 to complete the calculation. For example, if the number of arithmetic units 113 of the arithmetic engine 11 is five, the latency is five.

まずステップＳ０で初期化処理を行う。処理済ノード集合を空にし、グラフＧに図１３のデータ依存グラフをセットする。データレジスタ１２０Ａ〜Ｂの使用開始時刻、使用可能時刻をそれぞれ０とし、演算エンジン１１Ａ〜１１Ｅの使用可能時刻をＬとする。Ｌは、データメモリの読み書きに要するサイクル数であり、ここでは１とする。データ依存グラフのノードＮごとに、該ノードＮと同じノードに出力を行う別のノードＮ’を特定する。ノードＮ’をノードペアテーブルにおけるノードＮの項目に登録する。ノードＮ’が存在しない場合はノードＮの項目を空にする。ノードＮ’が存在する場合、「ノードＮとノードＮ’はペアである」と表現する。そして、スピルノードスタックを空にする。 First, initialization is performed in step S0. The processed node set is emptied, and the data dependence graph of FIG. 13 is set in graph G. The use start time and the available time of each of the data registers 120A and 120B are set to 0, and the available time of each of the arithmetic engines 11A to 11E is set to L. L is the number of cycles required to read or write the data memory and is assumed to be 1 here. For each node N of the data dependence graph, another node N' that outputs to the same node as node N is identified, and node N' is registered in the entry for node N in the node pair table. If no such node N' exists, the entry for node N is left empty. When node N' exists, node N and node N' are said to be a pair. Finally, the spill node stack is emptied.
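The construction of the node pair table in step S0 can be sketched as below. The graph is represented as a mapping from each node to the single node that consumes its output (or `None`), which relies on the restriction that a node's output feeds at most one node; the function name and the small example graph are hypothetical and do not reproduce FIG. 13.

```python
def build_node_pair_table(consumer_of):
    """Map each node N to its pair N' (another producer for the same consumer)."""
    # Group producers by the node that consumes their output.
    producers = {}
    for node, consumer in consumer_of.items():
        if consumer is not None:
            producers.setdefault(consumer, []).append(node)
    # A node has at most two inputs, so each producer has at most one pair.
    pairs = {}
    for node, consumer in consumer_of.items():
        others = [n for n in producers.get(consumer, []) if n != node]
        pairs[node] = others[0] if others else None  # empty entry if no pair
    return pairs
```

In a hypothetical graph where A0 and A1 both feed B0 while A2 alone feeds B1, the table pairs A0 with A1 and leaves the entry for A2 empty.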

ステップＳ１においては、グラフＧに優先処理ノードが存在するかをチェックする。優先処理ノードとは、処理済ノード集合に含まれないノードであって、入力ノードを２つ有し、かつ入力ノードの一つのみが処理済ノード集合に含まれるようなノードである。この優先処理ノードは、演算の入力となるデータの一部が既に演算され、その演算結果が演算エンジン間バッファ１２に書かれているような演算に相当する。そのような演算が優先的に処理されるようにするのが、このステップＳ１の特徴である。この時点では、処理済ノード集合は一つのノードすら含んでいないので、処理はステップＳ２に進む。 In step S1, it is checked whether a priority processing node exists in graph G. A priority processing node is a node that is not in the processed node set, has two input nodes, and has exactly one of those input nodes in the processed node set. Such a node corresponds to an operation for which part of the input data has already been computed and written to the inter-engine buffer 12. The characteristic feature of step S1 is that such operations are processed preferentially. At this point the processed node set does not contain even one node, so processing proceeds to step S2.

In step S2, it is checked whether a processable node exists in graph G. A processable node is a node that is not in the processed node set and either has all of its input nodes in the processed node set or has no input nodes. In this example, nodes A0 to A3 are processable, so the process proceeds to step S3.

In step S3, the deepest processable node N in graph G is determined. In this example, nodes A0 to A3 all have the same depth, so any one of them may be chosen. Here, node A0 is assumed to be chosen.
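The node classification of steps S1 to S3 can be sketched as below. The graph in `INPUTS` is an assumed reconstruction of FIG. 13 based on the walkthrough, all function names are hypothetical, and the depth measure (longest path from a node without inputs) is an assumption chosen because it makes B0 deeper than A2 and A3, which matches iteration I4.

```python
# Assumed reconstruction of the FIG. 13 graph: node -> list of its input nodes.
INPUTS = {
    "A0": [], "A1": [], "A2": [], "A3": [],
    "B0": ["A0", "A1"],
    "B1": ["A2", "A3"],
    "E0": ["B0", "B1"],
}


def priority_nodes(graph, inputs, processed):
    """Step S1: unprocessed two-input nodes with exactly one processed input."""
    return [n for n in graph
            if n not in processed
            and len(inputs[n]) == 2
            and sum(i in processed for i in inputs[n]) == 1]


def processable_nodes(graph, inputs, processed):
    """Step S2: unprocessed nodes whose inputs (if any) are all processed."""
    return [n for n in graph
            if n not in processed and all(i in processed for i in inputs[n])]


def depth(node, inputs):
    """Assumed depth measure: longest path from a node that has no inputs."""
    if not inputs[node]:
        return 0
    return 1 + max(depth(i, inputs) for i in inputs[node])


def deepest_processable(graph, inputs, processed):
    """Step S3: one deepest processable node, or None if there is none."""
    candidates = processable_nodes(graph, inputs, processed)
    if not candidates:
        return None
    return max(candidates, key=lambda n: depth(n, inputs))
```

With an empty processed set, only A0 to A3 are processable and no priority node exists; once A0 is processed, B0 becomes a priority processing node.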

In step S4, it is determined whether the processable node N obtained in step S3 is schedulable. Node N is schedulable if it has no output, or if at least one of the data registers 120 usable for its result output either is used for the result output of another node that is an input of node N or has a usable time that is not infinite. If node N has a pair node N' and N' is in the processed node set, the data register 120 used for the result output of N' cannot be used for the result output of N. At this point the processed node set is empty, so node A0 is schedulable, and the process proceeds to step S5.
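The schedulability test of step S4 can be written as a small predicate. This is a sketch, not the patented implementation: `is_schedulable` and its parameters are hypothetical, `result_reg` stands for the (node, register) pairs kept in the processed node set, and `usable_time` maps each data register to its usable time.

```python
import math


def is_schedulable(node, has_output, inputs, pairs, result_reg, usable_time):
    """Step S4: can `node` be given a data register for its result?"""
    if not has_output:
        return True
    # The register holding the pair node's result must not be overwritten.
    blocked = result_reg.get(pairs.get(node))
    # Registers that currently hold the results of this node's inputs.
    input_regs = {result_reg[i] for i in inputs[node] if i in result_reg}
    return any(
        reg in input_regs or usable_time[reg] != math.inf
        for reg in usable_time
        if reg != blocked
    )
```

For node A3 in iteration 8, the only candidate register is 120A (120B holds the result of its pair A2), and its usable time is infinite, so the predicate returns `False` and a spill is needed.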

In step S5, node N is scheduled. The scheduling in step S5 follows, for example, the flowchart shown in FIG. 14.

First, in step S5A, it is checked whether node N has any input node. In this example, node A0 has no input node, so the process proceeds to step S5B.

Next, in step S5B, the register R with the earliest usable time is selected from among the data registers 120 usable for the result output of node N. If node N has no output, register R is chosen arbitrarily. At this point the usable times of the data registers 120A and 120B are both 1, so either may be used. Here, the data register 120A is used.

Next, in step S5C, the executable time of node N plus the latency of the arithmetic engine 11 is compared with the usable time of register R. The executable time of node N is the later of the usable time of the data register 120 used for the result output of any input node of N and the usable time of the arithmetic engine 11. If node N has no input node, its executable time equals the usable time of the arithmetic engine 11. If node N has no output, the process always proceeds to step S5D. In this example, the usable time of the data register 120A is the smaller value, so the process proceeds to step S5D.

In step S5D, the execution time of node N is set to the executable time of node N obtained in step S5C, and the usable time of the arithmetic engine 11 is set to the execution time of node N plus 1. The use start time of register R is set to the execution time of node N plus the latency of the arithmetic engine 11, and the usable time of register R is set to infinity. The owner node of register R is set to N. If node N has no output, however, the use start time and the usable time of register R are not updated. In this example, the execution time of A0 is 1, the usable time of the arithmetic engine 11A is 2, the use start time of the data register 120A is 2, and the usable time of the data register 120A is infinite. The owner node of the data register 120A is A0.

Then, in step S5E, the pair consisting of node N and the register R to which N outputs its result is added to the processed node set. In this example, the pair of node A0 and the data register 120A is added to the processed node set.

When step S5 is thus completed, the process returns to step S1. The series of steps from the start of step S1 until the process returns to step S1 is called an iteration.

In the next iteration I1, priority processing node B0 is found in step S1, so the process proceeds from step S1 to step S6.

In step S6, graph G is first pushed onto the top of the graph stack. Next, of the input nodes of the priority processing node found in step S1, the one not in the processed node set is taken as N', and the maximal connected subgraph G' of graph G that contains N' and does not contain the priority processing node is determined. G' is called the priority processing graph. The priority processing graph G' is then set as graph G. In this example, the priority processing graph G' contains only node A1. The process returns from step S6 to step S1 and proceeds to iteration I2. In iteration I2, the process proceeds from step S1 to S2 and S3, and node A1 becomes the deepest processable node.
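One way to obtain the priority processing graph of step S6 is a breadth-first search over the undirected dependence edges that never enters the priority processing node. The sketch below is an illustration only: `priority_graph` is a hypothetical name, and `INPUTS` is an assumed reconstruction of the FIG. 13 graph inferred from the walkthrough.

```python
from collections import deque

# Assumed reconstruction of the FIG. 13 graph: node -> list of its input nodes.
INPUTS = {
    "A0": [], "A1": [], "A2": [], "A3": [],
    "B0": ["A0", "A1"],
    "B1": ["A2", "A3"],
    "E0": ["B0", "B1"],
}


def priority_graph(graph, inputs, start, excluded):
    """Nodes of `graph` connected to `start` without passing through `excluded`."""
    # Build the reverse edges (node -> nodes that consume its result).
    consumers = {n: [] for n in graph}
    for n in graph:
        for i in inputs[n]:
            if i in consumers:
                consumers[i].append(n)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        neighbours = [i for i in inputs[node] if i in consumers] + consumers[node]
        for other in neighbours:
            if other != excluded and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen
```

Starting from A1 with B0 excluded yields only A1, as in iteration I1; starting from B1 with E0 excluded yields A2, A3, and B1.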

In step S4, node A0, the pair of node A1, uses the data register 120A for its result output, so only the data register 120B is usable for the result output of node A1. The usable time of the data register 120B is 0, so node A1 is schedulable, and the process proceeds to step S5. In step S5, as with node A0, the process proceeds from step S5A to S5B, S5C, S5D, and S5E. The execution time of node A1 becomes 2, the usable time of the arithmetic engine 11A becomes 3, the use start time of the data register 120B becomes 3, the usable time of the data register 120B becomes infinite, and the owner node of the data register 120B becomes node A1. The pair of node A1 and the data register 120B is added to the processed node set. The process proceeds to iteration I3.

In iteration I3, the process proceeds from step S1 to S2. Graph G no longer contains any processable node, so the process proceeds to step S7.

In step S7, it is checked whether the graph stack is empty. If it is not empty, the process proceeds to step S8, where the graph at the top of the graph stack is popped and set as graph G. In this example, a graph was pushed onto the graph stack in iteration I1, so the process proceeds to step S8, and the graph is popped from the graph stack and set as graph G. Graph G is now the same as the graph shown in FIG. 13.

In step S10, it is checked whether the spill node stack is empty. If the spill node stack is not empty and a node present in graph G is at the top of the spill node stack, spill handling is required. At this point the spill node stack is empty, so the process proceeds to iteration I4.

In iteration I4, the process proceeds from step S1 to S2 and S3, and node B0 becomes the deepest processable node. Both data registers 120A and 120B are usable for the result output of node B0, and they are used for the result outputs of nodes A1 and A2, which are the inputs of node B0, so node B0 is schedulable.

In step S5, node B0 has input nodes, so the process first proceeds from step S5A to S5F. In step S5F, for the deepest processable node N obtained in step S3, the data register 120 used for the result output of each input node, and its use start time, are determined. The maximum of these use start times plus 1 is then set as the usable time of every data register so determined. In this example, the usable times of the data registers 120A and 120B both become 4.
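The register release rule of step S5F amounts to a single maximum. The following sketch uses hypothetical names (`release_input_registers`, `use_start`, `usable_time`) and reproduces the numbers of this example, where the input registers of B0 started being used at times 2 and 3.

```python
def release_input_registers(input_regs, use_start, usable_time):
    """Step S5F: input registers become reusable one cycle after the latest use start."""
    release = max(use_start[reg] for reg in input_regs) + 1
    for reg in input_regs:
        usable_time[reg] = release
    return release


use_start = {"120A": 2, "120B": 3}
usable_time = {"120A": float("inf"), "120B": float("inf")}
release_time = release_input_registers(["120A", "120B"], use_start, usable_time)
```

Here `release_time` is 4, so both registers become usable at time 4.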

In step S5B, the usable times of the data registers 120A and 120B are equal, so either may be selected. Here, the data register 120A is assumed to be selected. The process then proceeds from step S5C to S5D and S5E. The execution time of node B0 becomes 4, the usable time of the arithmetic engine 11B becomes 5, the use start time of the data register 120A becomes 5, the usable time of the data register 120A becomes infinite, and the owner node of the data register 120A becomes B0. The pair of node B0 and the data register 120A is added to the processed node set. The process proceeds to iteration I5.

In iteration I5, node E0 becomes the priority processing node in step S1, step S6 is performed, and the process proceeds to iteration I6.

In iteration I6, the process proceeds from step S1 to S2, S3, and S4, and node A2 is scheduled in step S5. In step S5, the process proceeds from step S5A to S5B. In step S5C, the usable time of the data register 120B is greater than or equal to the executable time of node A2 plus 1, so the process proceeds to step S5G.

In step S5G, the execution time of node N is set to the usable time of register R obtained in step S5C minus the latency of the arithmetic engine 11 that executes node N. The other values are determined as in step S5D. In this example, the execution time of A2 is 3, the usable time of the arithmetic engine 11A is 4, the use start time of the data register 120B is 4, and the usable time of the data register 120B is infinite. The owner node of the data register 120B is A2. The process proceeds to iteration 7.
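The timing arithmetic of steps S5C, S5D, and S5G can be condensed into one function. This is a sketch for a node that has an output; `place_node` and `LATENCY` are hypothetical names, and a latency of one cycle is assumed because it matches the numbers in the walkthrough.

```python
LATENCY = 1  # assumed arithmetic engine latency in cycles


def place_node(engine_usable, input_reg_usable, reg_usable):
    """Return (execution time, new engine usable time, register use start time)."""
    # S5C: the node can run once its engine and all input registers are ready.
    executable = max([engine_usable] + list(input_reg_usable))
    if reg_usable < executable + LATENCY:
        execution = executable            # S5D: the output register is free in time
    else:
        execution = reg_usable - LATENCY  # S5G: wait for the output register
    return execution, execution + 1, execution + LATENCY
```

For node A2, `place_node(3, [], 4)` takes the S5G branch and returns `(3, 4, 4)`; for node B0, `place_node(1, [4, 4], 4)` takes the S5D branch and returns `(4, 5, 5)`. In both branches the usable time of the output register is afterwards set to infinity.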

In iteration 7, the process proceeds from step S1 to S6, and graph G becomes a graph containing only node A3.

In iteration 8, the process proceeds from step S1 to S2 and S3, and step S4 attempts to schedule node A3. However, the usable times of the data registers 120A and 120B are both infinite, so node A3 is not schedulable. The process proceeds to step S9.

The spill procedure in step S9 is shown in the flowchart of FIG. 15. First, in step S9A, one of the data registers 120A and 120B is selected to be written back to the data memory 15. This write-back is called register spill processing. If the unschedulable node N has a pair node N', a data register 120 that is not used for the result output of N' is selected. If node N has no pair node N', any data register 120 may be selected. In this example, the data register 120A is selected.

In step S9B, the time at which the arithmetic engine 11E performs the register spill processing is determined. The usable time of the arithmetic engine 11E is compared with the use start time of the data register 120 to be written back to the data memory plus 1, and the larger value becomes the time of the register spill processing. A node (data save node) is newly added to the graph to represent the data save performed by the arithmetic engine 11E, and both the execution time of this node and the usable time of the data register 120 are set to the time of the register spill processing. In this example, the register spill processing is performed at time 6, and the execution time of the data save node E1 and the usable time of the data register 120A both become 6.

In step S9C, the pair consisting of the owner node of the data register 120 and the time of the register spill processing plus 2 × L is pushed onto the spill stack. In this example, the pair of node B0 and time 8 is pushed onto the spill stack. The process proceeds to step S5.
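The spill timing of steps S9B and S9C can be checked with a few lines. The sketch below uses hypothetical names (`plan_spill`, `memory_cycles` for L) and the values of this example: the arithmetic engine 11E is usable from time 1, and the data register 120A, owned by B0, started being used at time 5.

```python
def plan_spill(engine_e_usable, reg_use_start, owner, memory_cycles=1):
    """Return the spill time (S9B) and the spill stack entry (S9C)."""
    # S9B: wait for both the save engine and the register contents.
    spill_time = max(engine_e_usable, reg_use_start + 1)
    # S9C: the saved value can be restored 2 x L cycles after the spill.
    stack_entry = (owner, spill_time + 2 * memory_cycles)
    return spill_time, stack_entry


spill_time, stack_entry = plan_spill(engine_e_usable=1, reg_use_start=5, owner="B0")
```

This gives a spill time of 6 and the stack entry `("B0", 8)`, in agreement with the values stated above.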

In step S5, the use start time of the data register 120A has been updated to 6, so the execution time of A3 becomes 5, the usable time of the arithmetic engine 11A becomes 6, the use start time of the data register 120A becomes 6, and the usable time of the data register 120A becomes infinite. The owner node of the data register 120A becomes A3. The process proceeds to iteration 9.

In iteration 9, the process proceeds from step S1 to S2, S7, and S8. In step S10, the spill stack is not empty, but the graph G popped from the graph stack in step S8 does not contain node B0, which is at the top of the spill stack, so the process proceeds to iteration 10.

In iteration 10, the process proceeds from step S1 to S2, S3, S4, and S5. The execution time of node B1 becomes 7, the usable time of the arithmetic engine 11B becomes 8, the use start time of the data register 120B becomes 8, and the usable time of the data register 120B becomes infinite. The owner node of the data register 120B becomes B1. The process proceeds to iteration 11.

In iteration 11, the process proceeds from step S1 to S2, S7, and S8. In step S10, it is determined that spill handling is required, and the process proceeds to step S11.

In step S11, a pair of a node and a time is first popped from the top of the spill stack. The usable time of the arithmetic engine 11A is set to the popped time. Further, the maximal connected subgraph (that is, the priority processing graph) G' that contains the popped node and does not contain the output of that node is determined from graph G. This priority processing graph G' is replaced with a node of the arithmetic engine 11A for data restoration. In this example, the usable time of the arithmetic engine 11A becomes 8, and the updated data dependence graph is as shown in FIG. 16. The process proceeds to iteration 12.

In iteration 12, the process proceeds from step S1 to S2, S3, S4, and S5, and the data restoration node A4 generated in iteration 11 is scheduled. The execution time of node A4 becomes 8, the usable time of the arithmetic engine 11A becomes 9, the use start time of the data register 120A becomes 9, and the usable time of the data register 120A becomes infinite. The owner node of the data register 120A becomes A4. The process proceeds to iteration 13.

In iteration 13, the process proceeds from step S1 to S2, S3, S4, and S5. The execution time of node E0 becomes 10, and the usable time of the arithmetic engine 11E becomes 11.

In the next iteration 14, the process proceeds from step S1 to S2 and S7. The graph stack is empty, so the process proceeds to step S12, which finally outputs the code.

The code output procedure in step S12 is shown in the flowchart of FIG. 17. First, in step S12A, a variable C indicating an address is initialized to 0.

In step S12B, the context information stored at address C of the context information memory 1102 of each of the arithmetic engines 11A to 11E is initialized. This initialization sets the Valid bit and the data processing end bit to 0.

In step S12C, all nodes N in the processed node set whose execution time is C are found. If at least one such node exists, the process proceeds to step S12E; if none exists, the process proceeds to step S12F.

In step S12E, for each node N found in step S12C, the Valid bit stored at address C of the context information memory 1102 of the arithmetic engine 11 that executes node N is updated to 1. If node N has an output, a value indicating register R is stored, as the value of the selection signal for selecting one of the data registers 120A to 120H of the inter-arithmetic-engine buffer 12, at address C of the control table memory 1161 of the arithmetic engine 11 that executes node N. All nodes N found in step S12C are then deleted from the processed node set.

In step S12F, it is determined whether the processed node set is empty. If it is not empty, the process proceeds to step S12H, where address C is updated to C + 1, and then returns to step S12B. If it is empty, the process proceeds to step S12G.

In step S12G, the data processing end bit stored at address C of the context information memory 1102 of the input controller 110 of each of the arithmetic engines 11A to 11E is set to 1, and code generation is complete.
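The loop of steps S12A to S12H can be sketched as follows. This is an illustration under assumptions, not the patented code generator: `emit_context_bits` is a hypothetical name, each context information memory is modeled as a plain list indexed by address C, the register selection signal written in step S12E is omitted, and the schedule lists only the nodes whose engine and execution time are stated in the walkthrough (the data save node is left out).

```python
def emit_context_bits(schedule, engines):
    """schedule maps node -> (engine, execution time); returns per-engine bit lists."""
    remaining = dict(schedule)            # stands in for the processed node set
    valid = {e: [] for e in engines}
    end = {e: [] for e in engines}
    address = 0                           # S12A
    while True:
        for e in engines:                 # S12B: initialize both bits to 0
            valid[e].append(0)
            end[e].append(0)
        due = [n for n, (_, t) in remaining.items() if t == address]  # S12C
        for node in due:                  # S12E: mark the cycle valid, drop the node
            valid[remaining[node][0]][address] = 1
            del remaining[node]
        if not remaining:                 # S12F
            break
        address += 1                      # S12H
    for e in engines:                     # S12G: flag the end of data processing
        end[e][address] = 1
    return valid, end


SCHEDULE = {
    "A0": ("11A", 1), "A1": ("11A", 2), "A2": ("11A", 3), "A3": ("11A", 5),
    "A4": ("11A", 8), "B0": ("11B", 4), "B1": ("11B", 7), "E0": ("11E", 10),
}
VALID, END = emit_context_bits(SCHEDULE, ["11A", "11B", "11E"])
```

With this schedule, the Valid bits of the arithmetic engine 11A are set at addresses 1, 2, 3, 5, and 8, and the data processing end bit of every engine is set at address 10.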

FIG. 18 shows a timing chart of the semiconductor device 1 operating according to the code generated from the data dependence graph of FIG. 13 by the compilation method described above. In FIG. 18, each cycle in which an arithmetic engine 11 outputs an operation result together with a Valid bit of 1 is marked with the label of the corresponding operation (see FIG. 13). FIG. 18 also shows the cycles in which the values of the data registers 120 change.

The present invention is not limited to the embodiment exactly as described above; in the implementation stage, the constituent elements may be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be deleted from the full set of constituent elements shown in the embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

1 ... Semiconductor device;
11A to 11E ... Arithmetic engine;
110 ... Input controller;
1100 ... Input A selection unit;
1101 ... Input B selection unit;
1102 ... Context information memory;
1103 ... Context ID latch;
1104 ... Data processing end latch;
1105 ... Incrementer;
1106 ... Multiplexer;
1107A to 1107B ... Latch;
1108A to 1108B ... Timing latch;
113A to 113E ... Arithmetic unit;
1130 ... Computing unit;
1131 ... Control table memory;
1132A to 1132D ... Setting information register;
1133 ... Multiplexer;
114A to 114E ... Data pipeline register;
115A to 115E ... Control pipeline register;
116 ... Output controller;
1160A to 1160B ... Base address register;
1161 ... Control table memory;
1162 ... Adder;
1163 ... Multiplexer;
117 ... Final context ID latch;
118 ... Multiplexer;
12 ... Inter-arithmetic-engine buffer;
120A to 120H ... Data register;
1200 ... Data latch;
1201A to 1201D ... AND logic;
1202 ... OR logic;
1203 ... Multiplexer;
13 ... Code memory;
14 ... Code transfer control device;
140 ... Memory ID register;
141 ... Address register;
142 ... Code address register;
143A to 143C ... Incrementer;
144A to 144D ... Multiplexer;
145 ... Valid latch;
146 ... Comparator;
15 ... Data memory

Claims

A compiler that generates a first code and a second code used in a semiconductor device, the semiconductor device comprising:
a first arithmetic engine that performs a first operation for each cycle and outputs, for each cycle, first data indicating a result of the first operation and a first valid signal indicating a first value or a second value;
a second arithmetic engine that performs a second operation for each cycle and outputs, for each cycle, second data indicating a result of the second operation and a second valid signal indicating the first value or the second value; and
an inter-arithmetic-engine buffer that is used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, permits writing of the first data or the second data if the first valid signal or the second valid signal indicates the first value, and prohibits writing of the first data or the second data if the first valid signal or the second valid signal indicates the second value,
the first arithmetic engine comprising:
a first storage unit that stores the first code for determining a value of the first valid signal; and
a first controller that obtains the value of the first valid signal from the first code and outputs the value for each cycle,
the second arithmetic engine comprising:
a second storage unit that stores the second code for determining a value of the third valid signal; and
a second controller that obtains the value of the second valid signal from the second code and outputs the value for each cycle,
the compiler comprising:
a determination unit that determines, for each cycle, whether each of the first arithmetic engine and the second arithmetic engine performs an operation, from a data dependence graph showing the dependence between the first data and the second data passed between the first arithmetic engine and the second arithmetic engine; and
a code generation unit that generates, as the first code, a code that causes the first controller to output the first value if the first arithmetic engine performs an operation and to output the second value if the first arithmetic engine does not perform an operation, and generates, as the second code, a code that causes the second controller to output the first value if the second arithmetic engine performs an operation and to output the second value if the second arithmetic engine does not perform an operation.

The compiler according to claim 1, wherein the determination unit comprises:
a specifying unit that identifies, from the data dependence graph, an operation for which part of the input data has already been computed and written to the inter-arithmetic-engine buffer; and
a scheduling unit that determines the order of the operations of the first and second arithmetic engines so that the operation identified by the specifying unit is executed with priority.

A code generation method for generating a first code and a second code used in a semiconductor device, the semiconductor device comprising:
a first arithmetic engine that performs a first operation for each cycle and outputs, for each cycle, first data indicating a result of the first operation and a first valid signal indicating a first value or a second value;
a second arithmetic engine that performs a second operation for each cycle and outputs, for each cycle, second data indicating a result of the second operation and a second valid signal indicating the first value or the second value; and
an inter-arithmetic-engine buffer that is used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, permits writing of the first data or the second data if the first valid signal or the second valid signal indicates the first value, and prohibits writing of the first data or the second data if the first valid signal or the second valid signal indicates the second value,
the first arithmetic engine comprising:
a first storage unit that stores the first code for determining a value of the first valid signal; and
a first controller that obtains the value of the first valid signal from the first code and outputs the value for each cycle,
the second arithmetic engine comprising:
a second storage unit that stores the second code for determining a value of the third valid signal; and
a second controller that obtains the value of the second valid signal from the second code and outputs the value for each cycle,
the code generation method comprising:
determining, by a determination unit, for each cycle, whether each of the first arithmetic engine and the second arithmetic engine performs an operation, from a data dependence graph showing the dependence between the first data and the second data passed between the first arithmetic engine and the second arithmetic engine; and
generating, by a code generation unit, as the first code, a code that causes the first controller to output the first value if the first arithmetic engine performs an operation and to output the second value if the first arithmetic engine does not perform an operation, and generating, as the second code, a code that causes the second controller to output the first value if the second arithmetic engine performs an operation and to output the second value if the second arithmetic engine does not perform an operation.

A first operation is performed for each cycle, and first data indicating a result of the first operation and a first valid signal indicating the first value or the second value are output for each cycle. 1 arithmetic engine,
The second calculation is performed for each cycle, and the second data indicating the result of the second calculation and the second valid signal indicating the first value or the second value are calculated for each cycle. A second computing engine to output;
Used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, wherein the first valid signal or the second valid signal is If the first value is indicated, the first data or the second data can be written, and if the first valid signal or the second valid signal indicates the second value. A buffer between operation engines that prohibits writing of the first data or the second data,
The first calculation engine is:
A first storage for storing a first code for determining a value of the first valid signal;
A first controller that obtains a value of the first valid signal from the first code and outputs the value for each cycle;
The second calculation engine is:
A second storage unit for storing a second code for determining a value of the third valid signal;
Generating a value of the second valid signal from the second code, and generating a first controller and a second code used in a semiconductor device including a second controller that outputs the value for each cycle There is a code generation program that
the code generation program causing a computer to execute:
a procedure for determining, for each cycle, whether each of the first arithmetic engine and the second arithmetic engine performs an operation, from a data dependence graph showing the dependence of the first data and the second data passed between the first arithmetic engine and the second arithmetic engine; and
a procedure for generating, as the first code, a code that causes the first controller to output the first value if the first arithmetic engine performs an operation and to output the second value if the first arithmetic engine does not perform an operation, and for generating, as the second code, a code that causes the second controller to output the first value if the second arithmetic engine performs an operation and to output the second value if the second arithmetic engine does not perform an operation.
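The determining procedure can be pictured with a small scheduling sketch. This is not the scheduling method of the specification; it is a simple earliest-cycle placement under assumptions made for the example: each operation is pre-assigned to engine 1 or 2, each engine performs at most one operation per cycle, and a result becomes usable in the cycle after it is produced. The function name `determine_activity` and the graph below are hypothetical.

```python
def determine_activity(ops, deps):
    """Decide, per cycle, whether each engine performs an operation.

    ops:  dict mapping operation name -> engine id (1 or 2)
    deps: dict mapping operation name -> list of operations it depends on
    Returns (cycle_of, activity), where activity[engine] is a per-cycle list of booleans.
    """
    cycle_of = {}
    busy = set()            # (engine, cycle) slots already taken
    remaining = list(ops)   # keeps the order in which operations were given
    while remaining:
        progressed = False
        for name in list(remaining):
            preds = deps.get(name, [])
            if all(p in cycle_of for p in preds):
                # Earliest cycle after every predecessor has produced its data.
                cycle = max((cycle_of[p] + 1 for p in preds), default=0)
                while (ops[name], cycle) in busy:
                    cycle += 1
                busy.add((ops[name], cycle))
                cycle_of[name] = cycle
                remaining.remove(name)
                progressed = True
        if not progressed:
            raise ValueError("dependence graph contains a cycle")
    n_cycles = max(cycle_of.values()) + 1
    engines = sorted(set(ops.values()))
    activity = {e: [(e, c) in busy for c in range(n_cycles)] for e in engines}
    return cycle_of, activity


# Hypothetical dependence graph: a -> b -> c -> d, and a -> d.
ops = {"a": 1, "b": 2, "c": 1, "d": 2}
deps = {"b": ["a"], "c": ["b"], "d": ["a", "c"]}
cycle_of, activity = determine_activity(ops, deps)
print(cycle_of)     # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
print(activity[1])  # [True, False, True, False]
print(activity[2])  # [False, True, False, True]
```

The per-cycle activity lists are exactly the input the generating procedure needs: each `True` becomes the first value and each `False` the second value in the corresponding code.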

A compiler that generates a first code and a second code used in a reconfigurable device, the reconfigurable device comprising:
a first setting information register that stores first setting information identifiable by a first setting ID;
a first arithmetic engine that, for each cycle, reads the first setting information from the first setting information register in accordance with the first setting ID, performs a first operation while changing its setting in accordance with the first setting information, and outputs first data indicating a result of the first operation and a first valid signal indicating a first value or a second value;
a second setting information register that stores second setting information identifiable by a second setting ID;
a second arithmetic engine that, for each cycle, reads the second setting information from the second setting information register in accordance with the second setting ID, performs a second operation while changing its setting in accordance with the second setting information, and outputs second data indicating a result of the second operation and a second valid signal indicating the first value or the second value; and
an inter-arithmetic engine buffer that is used to pass the first data and the second data between the first arithmetic engine and the second arithmetic engine, that permits writing of the first data or the second data if the first valid signal or the second valid signal indicates the first value, and that prohibits writing of the first data or the second data if the first valid signal or the second valid signal indicates the second value,
wherein the first arithmetic engine includes:
a first storage unit that stores the first code for determining a value of the first valid signal; and
a first controller that obtains the value of the first valid signal from the first code and outputs the value for each cycle,
and the second arithmetic engine includes:
a second storage unit that stores the second code for determining a value of the second valid signal; and
a second controller that obtains the value of the second valid signal from the second code and outputs the value for each cycle,
the compiler comprising:
a determination unit that determines, for each cycle, whether each of the first arithmetic engine and the second arithmetic engine performs an operation, from a data dependence graph showing the dependence of the first data and the second data passed between the first arithmetic engine and the second arithmetic engine; and
a code generation unit that generates, as the first code, a code that causes the first controller to output the first value if the first arithmetic engine performs an operation and to output the second value if the first arithmetic engine does not perform an operation, and that generates, as the second code, a code that causes the second controller to output the first value if the second arithmetic engine performs an operation and to output the second value if the second arithmetic engine does not perform an operation.
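To make the target device of this compiler more concrete, the following Python sketch models one arithmetic engine and the inter-arithmetic engine buffer at cycle granularity. It is a behavioral toy, not the circuit of the specification: the classes `Engine` and `InterEngineBuffer`, the use of Python functions as "setting information", and the encoding of the first value as 1 and the second value as 0 are all assumptions of the example.

```python
FIRST_VALUE, SECOND_VALUE = 1, 0  # assumed encoding of the valid signal


class Engine:
    def __init__(self, settings, setting_ids, valid_code):
        self.settings = settings        # setting information register: ID -> operation
        self.setting_ids = setting_ids  # setting ID used in each cycle
        self.valid_code = valid_code    # stored code: valid value for each cycle

    def step(self, cycle, operand):
        # Read the setting selected for this cycle and apply it.
        operation = self.settings[self.setting_ids[cycle]]
        # Output the data together with the valid signal taken from the code.
        return operation(operand), self.valid_code[cycle]


class InterEngineBuffer:
    def __init__(self):
        self.entries = []

    def write(self, data, valid):
        if valid == FIRST_VALUE:
            self.entries.append(data)  # writing is permitted
        # Otherwise (second value) writing is prohibited and the data is dropped.


engine1 = Engine(
    settings={0: lambda x: x + 1, 1: lambda x: x * 2},  # hypothetical settings
    setting_ids=[0, 1, 0],
    valid_code=[FIRST_VALUE, SECOND_VALUE, FIRST_VALUE],
)
buffer = InterEngineBuffer()
for cycle, operand in enumerate([10, 20, 30]):
    data, valid = engine1.step(cycle, operand)
    buffer.write(data, valid)
print(buffer.entries)  # [11, 31]
```

In cycle 1 the engine still produces a value (40), but because the code holds the second value for that cycle the buffer ignores it, which is the behavior the generated codes are meant to control.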

The compiler according to claim 5, wherein the determination unit comprises:
a specifying unit that identifies, from the data dependence graph, an operation for which a part of the input data has already been calculated and that part has been written in the inter-arithmetic engine buffer; and
a scheduling unit that determines the order of operations of the first and second arithmetic engines so that the operation identified by the specifying unit is preferentially executed.
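The priority rule of this claim can be sketched as follows. The sketch interprets "a part of the input data" as at least one, but not all, of an operation's inputs being present in the buffer; that reading, the function names `identify_partially_fed` and `prioritize`, and the sample data are assumptions of the example rather than details taken from the specification.

```python
def identify_partially_fed(inputs, in_buffer):
    """Return operations with some, but not all, inputs already in the buffer."""
    return {
        op
        for op, needed in inputs.items()
        if any(d in in_buffer for d in needed)
        and not all(d in in_buffer for d in needed)
    }


def prioritize(pending, inputs, in_buffer):
    """Order pending operations so that partially fed operations come first."""
    preferred = identify_partially_fed(inputs, in_buffer)
    # sorted() is stable, so the original order is kept within each group.
    return sorted(pending, key=lambda op: op not in preferred)


# Hypothetical operations and the data items each one consumes.
inputs = {"mul": ["x", "y"], "add": ["p", "q"], "sub": ["x", "z"]}
in_buffer = {"x"}  # data already written in the inter-arithmetic engine buffer
order = prioritize(["add", "mul", "sub"], inputs, in_buffer)
print(order)  # ['mul', 'sub', 'add']
```

A plausible motivation for such a rule is that completing operations whose inputs already occupy the buffer lets those entries be consumed sooner, which keeps the required buffer capacity small.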