JP4330582B2

JP4330582B2 - Pipelined loop structure by MAP compiler

Info

Publication number: JP4330582B2
Application number: JP2005502182A
Authority: JP
Inventors: Hammes, Jeffrey
Original assignee: SRC Computers, Inc.
Priority date: 2002-10-31
Filing date: 2003-10-17
Publication date: 2009-09-16
Anticipated expiration: 2023-10-17
Also published as: WO2004042503A3; AU2003284288A1; WO2004042503A2; JP2006510125A; EP1573461A2; CA2498866A1; EP1573461A4

Description

Cross-Reference to Related Patent Applications
The present invention contains subject matter related to that of U.S. Patent Application Serial No. 10/285,299, "Process For Converting Programs In High-Level Programming Languages To A Unified Executable For Hybrid Computing Platforms", filed October 31, 2002 and assigned to SRC Computers, Inc. of Colorado Springs, Colorado, the assignee of the present invention. The disclosure of that application is incorporated herein by reference.

Copyright Notice / Permission
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the reproduction by anyone of the patent document or the patent disclosure as it appears in the United States Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights. The following notice applies to the software and data described below, including the drawings where applicable: (C) 2002 SRC Computers, Inc.

Background of the Invention
Field of the Invention
The present invention relates to pipelined loop structures produced by reconfigurable hardware compilers. More specifically, it relates to compiling pipelined loop structures that have a variable number of loop cycles and clock latencies of variable length.

Related Background
As instruction processors continue to gain processing power rapidly, they are used more and more often for compute-intensive calculations that were once done exclusively on supercomputers. Nevertheless, some compute-intensive tasks remain impractical on today's instruction processors, including, for example, compute-intensive image processing and hydrodynamic simulation.

Reconfigurable computing is attracting growing attention within computing technology. Traditional general-purpose computing is characterized by computer code executed sequentially on one or more general-purpose processors. Reconfigurable computing is characterized by programming reconfigurable hardware, such as a field programmable gate array (FPGA), to execute logic routines.

Reconfigurable computing offers significant performance gains in compute-intensive processing. For example, reconfigurable hardware can be programmed with a logic configuration that has more parallelism and pipelining than a conventional instruction processor. The reconfigurable hardware can also be programmed with a custom logic configuration that is very efficient at the tasks assigned by the program. In addition, dividing a program's processing requirements between the instruction processor and the reconfigurable hardware can increase the overall processing power of the computer.

For example, a software program written in a high-level language such as C or Fortran can be converted by a MAP compiler into software that runs on reconfigurable hardware. The MAP compiler can convert loop structures of the high-level language into a form that exploits the parallelism and pipelining of the reconfigurable hardware.

Unfortunately, existing MAP compilers work only on a small subset of all loop structures: those in which, among other requirements, the number of loop iterations before the loop terminates is predetermined and the loop period is one clock. There remains a need, therefore, for a compiler that can compile loop structures in which the loop does not terminate after a predetermined number of iterations and in which the loop has a period greater than one clock.

Summary of the Invention
Accordingly, one embodiment of the present invention is a pipelined loop structure of a control-flow dataflow graph. The structure includes a loop body that processes input values and generates output values in successive iterations of the loop body, where the output values are captured by a circulate node coupled to the loop body. The structure further includes a loop valid node, coupled to the loop body, that determines the final loop iteration, and an output value storage node coupled to the circulate node. The output value storage node ignores output values generated after the loop valid node determines that the final loop iteration has occurred.

Another embodiment of the present invention includes a pipelined loop structure of a control-flow dataflow graph that includes a loop body that processes input values and generates output values in successive iterations of the loop body, where the output values are captured by a circulate node coupled to the loop body. The structure further includes a loop driver node coupled to the circulate node. The loop driver node sets the period of the loop, that is, the number of clocks that occur between the executions of two consecutive loop iterations.

Additional novel features are set forth in part in the description that follows, and in part will become apparent to those skilled in the art on examining the following specification or may be learned by practicing the invention. The features and advantages of the invention may be realized and attained by means of the instrumentalities, combinations, and methods particularly pointed out in the appended claims.

Detailed Description of the Invention
In a simple loop function, the loop iterates a fixed, predetermined number of times and stops after the final loop iteration. In contrast, a more complex loop function may not terminate after a fixed number of iterations but may instead iterate an unpredictable number of times until a condition is met. These more complex loop functions can keep running after the final loop iteration, which makes it difficult for the output storage node to capture the output value of the final loop iteration rather than an output that follows the final value.

The present invention includes pipelined loop structures, and methods of loop pipelining, that cover loop functions that iterate an unpredictable number of times until a condition is met. One embodiment of the invention includes a loop valid node that receives information generated for each loop iteration and determines whether that information indicates the final loop iteration. For example, the loop valid node processes the information generated for each loop iteration to determine whether a condition that requires the loop to terminate has been met. When the condition is met, the loop valid node can inform other nodes, such as the end node and the output value storage nodes, that the next output value from the loop is the final loop iteration output value.

Many pipelined loop functions also need a period of more than one clock per iteration. Such loop functions may be incompatible with a pipelined loop structure that operates only at a rate of one input value or one output value per clock cycle. The present invention provides a loop driver node that can adjust the period so that one or more clock cycles elapse between successive values entering the loop body. In one embodiment of the invention, the loop driver node accepts a period value "D", which represents the number of clock cycles that elapse between successive inputs and/or outputs of the loop function.

Referring to FIG. 1, one embodiment of a pipelined loop structure 100 according to the present invention is shown. Pipelined loop structure 100 begins at start node 102, which signals load scalar nodes 104 and 106 and loop driver node 108 to begin executing the loop function. Loop driver node 108 then signals circulate nodes 110, 112, 114 to load the initial values from load scalar nodes 104 and 106 and present them to loop body 116. On each iteration of loop body 116, circulate nodes 110, 112, 114 capture the output values generated by that iteration and prepare to send them as the input values for the next iteration of loop body 116.
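
The data flow just described can be mimicked in software. The following is a minimal Python sketch, not the MAP compiler's implementation: the `CirculateNode` class, the `run_pipelined_loop` helper, and the Fibonacci-style `fib_body` are hypothetical names chosen for illustration. It shows circulate nodes being loaded with initial values, presenting them to the loop body, and capturing each iteration's outputs as the inputs of the next iteration.

```python
class CirculateNode:
    """Holds one loop-carried value between iterations (illustrative model)."""

    def __init__(self) -> None:
        self.value = None

    def load(self, initial) -> None:
        # Corresponds to loading the initial value from a load scalar node.
        self.value = initial

    def capture(self, new_value) -> None:
        # Corresponds to capturing the loop body's output for the next iteration.
        self.value = new_value


def run_pipelined_loop(initial_values, body, iterations):
    """Run `body` repeatedly, feeding its outputs back through circulate nodes."""
    nodes = [CirculateNode() for _ in initial_values]
    for node, initial in zip(nodes, initial_values):
        node.load(initial)
    for _ in range(iterations):
        outputs = body(*[node.value for node in nodes])
        for node, out in zip(nodes, outputs):
            node.capture(out)
    return [node.value for node in nodes]


def fib_body(a, b):
    """Hypothetical loop body with two loop-carried scalars."""
    return b, a + b


final_values = run_pipelined_loop([0, 1], fib_body, 5)
```

In this sketch timing is ignored entirely; the loop driver's role in pacing the circulate nodes is a separate concern.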

As described in more detail below, loop driver node 108 can accept an input labeled "D", where the D value represents the number of additional clock cycles that occur between loop iterations. For example, D = 0 gives one clock cycle per iteration, and D = 1 gives two clock cycles per iteration.

The D value may be fixed for all iterations of the loop function or, in more complex loop function operation, may vary between loop iterations. The D value may be entered manually by the programmer or calculated automatically from an analysis of the loop function. Once the loop function has started, loop driver node 108 uses the D value to determine the rate at which it activates other nodes in pipelined loop structure 100, such as circulate nodes 110, 112, 114.
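
As a rough illustration of how D sets the pacing, the sketch below computes the clock on which each iteration would start, given a D value per iteration. It is an assumption-laden model rather than the loop driver's actual logic: the function names are hypothetical, clocks are numbered from 0, and an iteration with value D is taken to occupy D + 1 clocks, as in the D = 0 and D = 1 examples above.

```python
def clocks_per_iteration(d: int) -> int:
    """One base clock plus D additional clocks."""
    if d < 0:
        raise ValueError("D must be non-negative")
    return d + 1


def iteration_start_clocks(d_values):
    """Clock index on which each iteration starts, for a per-iteration list of D."""
    starts = []
    clock = 0
    for d in d_values:
        starts.append(clock)
        clock += clocks_per_iteration(d)
    return starts


fixed_d = iteration_start_clocks([1, 1, 1])      # D fixed at 1
varying_d = iteration_start_clocks([4, 0, 1])    # D varies between iterations
```

With D fixed at 1 the iterations start two clocks apart; with a varying D the spacing follows each iteration's own value.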

Loop termination in pipelined loop structure 100 can be initiated by loop valid node 118, which is connected to circulate node 114. In one embodiment, a loop termination signal, which can be represented by a single-bit value, is input to loop valid node 118 to determine whether a condition indicating the end of the loop has been met. Loop valid node 118 can send an "invalid" output signal (also called a "false" signal) to circulate node 114 and latch itself into a state in which it keeps sending the invalid output signal until the loop function is restarted.

After circulate node 114 receives the invalid output signal from loop valid node 118, the signal is passed to end node 120. End node 120 can then trigger output value storage nodes 122, 124 so that they are ready to capture the final loop iteration output values from the final iteration of loop body 116. With this mechanism, output value storage nodes 122, 124 can capture the final loop iteration output values even if the loop continues to free-run after the final iteration.
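
The termination mechanism can be sketched in software as follows. This is a simplified model under stated assumptions, not the patented circuit: the class names mirror the nodes in the text, but the loop body (doubling a value), the bottom-test condition (`x >= 20`), and the rule that the storage node keeps the value from the iteration in which the valid signal first goes false are all choices made for illustration. The end node and circulate node that relay the signal are folded into a direct call.

```python
class LoopValidNode:
    """Latches into the 'invalid' state once termination is first signalled."""

    def __init__(self) -> None:
        self.latched_invalid = False

    def step(self, terminate: bool) -> bool:
        if terminate:
            self.latched_invalid = True
        return not self.latched_invalid   # True means the loop is still valid


class OutputValueStorageNode:
    """Keeps the final iteration's value and ignores everything after it."""

    def __init__(self) -> None:
        self.value = None
        self.frozen = False

    def step(self, value, loop_valid: bool) -> None:
        if self.frozen:
            return                        # output of a free-running iteration
        self.value = value
        if not loop_valid:
            self.frozen = True            # this was the final iteration


def simulate(total_iterations: int = 8) -> int:
    loop_valid = LoopValidNode()
    storage = OutputValueStorageNode()
    x = 1
    for _ in range(total_iterations):     # the loop free-runs past termination
        x = x * 2                         # hypothetical loop body
        still_valid = loop_valid.step(terminate=(x >= 20))
        storage.step(x, still_valid)
    return storage.value


captured = simulate()
```

Even though the simulated loop keeps producing 64, 128, and so on, the storage node retains the value from the terminating iteration.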

Once the loop has terminated and the final loop iteration output values are stored in output value storage nodes 122, 124, those values are latched by latch-and node 126 and then distributed through output node 128. In pipelined loop structure 100, end node 120 is also coupled to latch-and node 126 and can tell node 126 when to capture the values from output value storage nodes 122, 124.

Referring to FIG. 2, a timing diagram is shown for timing signals that may be included in loop driver node 108 of FIG. 1. The CLOCK signal is an input supplied from the system clock signal; it triggers the start of each clock cycle. The START signal triggers the start of the loop and is received from start node 102. The CIRC_TRIGGER signal tells the circulate nodes that the loop is starting; it is an output that circulate nodes 110, 112, 114 use to load their initial values. The LOOP_STARTING signal tells nodes that need a reset pulse to clear their state for a new loop execution. The LEADING signal tells periodic-input nodes to load their initial values. Finally, the ACTIVE_LAST signal goes high on the last clock of each iteration. It is used to indicate to the nodes of pipelined loop structure 100 that they have valid inputs.

Loop-carried scalar variables can create cycles in the pipelined loop structure of a control-flow dataflow graph. A cycle increases the number of clock cycles between loop iterations, so the D value must be raised accordingly to keep the loop body and the circulate nodes synchronized, ensuring that the correct loop body output values are captured and that each new loop iteration can start.

FIG. 3 shows an example of part of a pipelined loop structure 300 in which the D value of loop driver node 308 must be set to at least 4, representing four additional clock cycles per loop iteration. As in FIG. 1 above, pipelined loop structure 300 begins at start node 302, which signals load scalar nodes 304, 306 and loop driver node 308. In this example, a value of D = 4 is input to loop driver node 308, which sets the rate of the loop structure to one loop iteration every five clock cycles. The value D = 4 is chosen based on the inherent clock cycle latency of the multiply macro embodied in MULT node 314. With D = 4 input to loop driver node 308, circulate nodes 310, 312 input values to MULT node 314 every five clock cycles.

In general, the D value is proportional to the longest of the paths between the outputs and inputs of the circulate nodes in the pipelined loop structure. FIG. 3 shows a simple example in which the outputs of circulate nodes 310, 312 all go to MULT node 314, and MULT node 314 feeds its result directly back to nodes 310, 312 as their input. Some examples of more complex loop functions and their pipelined loop structures follow.

FIG. 4 shows a pipelined loop structure 400 of a control-flow dataflow graph that has a first function (F1) node 414 with a latency of 4 clock cycles per loop iteration and a second function (F2) node 416 with a latency of 6 clock cycles per loop iteration. Pipelined loop structure 400 begins at start node 402, which signals load scalar nodes 404, 406 and loop driver node 408. In this example, the D value is chosen based on the longest latency among the loop functions in pipelined loop structure 400. Because the second function (F2) node has the longest latency, 6 clock cycles per iteration, the D value is 6. Circulate nodes 410, 412 receive signals from the loop driver that are timed according to the D value, and therefore input values to first function (F1) node 414 and second function (F2) node 416 every seven clock cycles.

FIG. 5 shows a more complex pipelined loop structure 500 that has many circulate paths between circulate nodes 516, 518, 520, 522, 524 and loop function bodies 526, 528, 530, 532, 534, 536, 538, 540. In this example, execution of the loop function of structure 500 begins at start node 502, which signals load scalar nodes 504, 506, 508, 510, 512 and loop driver node 514. The D value input to loop driver node 514 can be determined in the following manner.

Pipelined loop structure 500 has circulate nodes that can be divided into those that are involved in a cycle and those that are not. For the circulate nodes involved in cycles, the circulate paths in the pipelined loop structure can be described as follows.

1. C1 → D1 → D6 → C1
2. C1 → D0 → D6 → C1
3. C1 → D2 → C2 → C3 → D4 → D6 → C1
Here, C0, C1, C2, C3 and C4 are the labels of circulate nodes 516, 518, 520, 522 and 524, respectively, and D0, D1, D2, D3, D4, D5, D6 and D7 are the labels of loop function nodes 526, 528, 530, 532, 534, 536, 538 and 540, respectively.

When determining the D value, circulate nodes that are not involved in a cycle can be ignored, because they can be pushed down into the loop body by inserting delays on all of their inputs. In this example, circulate (C4) node 524 is not involved in any cycle of pipelined loop structure 500 and is ignored when the D value is determined.

For the remaining circulate (C0, C1, C2, C3) nodes 516, 518, 520, 522, a table such as Table 1, shown in FIG. 6, is built to show which loop function bodies a value passes through to travel from one circulate node to another, or back to the same circulate node. For example, cell (C0, C0) identifies the loop function bodies that a value must pass through after leaving circulate (C0) node 516 before it returns to where it started. In pipelined loop structure 500, C0 has no circulate path back to itself, so the cell is left empty. In contrast, there is a circulate path by which a value can leave circulate (C1) node 518 and return to where it started, and that path is represented in cell (C1, C1) as D1 + D6.

A clock latency is determined for each of the loop function bodies D0 through D6, and these latencies are applied to Table 1 to find the circulate path with the longest latency. The longest latency value is then used to set the minimum value of D, which is input to loop driver node 514 to determine the period of the whole pipelined loop structure 500.
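
The procedure above amounts to summing the latencies along each circulate path and taking the maximum. The sketch below applies it to the three paths listed for FIG. 5. It rests on several assumptions: the latency numbers are invented for illustration (the text does not give them), circulate nodes are treated as adding no latency of their own, and the minimum D is taken to equal the longest path latency, following the FIG. 4 example where a 6-clock latency leads to D = 6.

```python
# Loop function bodies on each circulate path that starts and ends at C1.
# Circulate nodes (C1, C2, C3) are omitted because this sketch assumes they
# contribute no latency.
CYCLES = [
    ["D1", "D6"],          # C1 -> D1 -> D6 -> C1
    ["D0", "D6"],          # C1 -> D0 -> D6 -> C1
    ["D2", "D4", "D6"],    # C1 -> D2 -> C2 -> C3 -> D4 -> D6 -> C1
]

# Hypothetical clock latencies; real values would come from each node's macro.
HYPOTHETICAL_LATENCY = {"D0": 2, "D1": 3, "D2": 1, "D4": 4, "D6": 2}


def cycle_latency(cycle, latency):
    """Total latency of the loop function bodies along one circulate path."""
    return sum(latency[node] for node in cycle)


def minimum_d(cycles, latency):
    """Smallest D that covers the slowest circulate path (assumed rule)."""
    return max(cycle_latency(cycle, latency) for cycle in cycles)


per_cycle = [cycle_latency(c, HYPOTHETICAL_LATENCY) for c in CYCLES]
d_min = minimum_d(CYCLES, HYPOTHETICAL_LATENCY)
```

With these invented latencies the third path dominates, so it alone determines the minimum D.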

Stateful nodes: In the pipelined loop structure of a control-flow dataflow graph, stateful nodes need additional support to handle issues specific to them, such as clearing the node's state, telling the node when each iteration occurs, and telling the node when its inputs are valid. FIG. 7 shows how three signals from loop driver node 708 can be used to convey this information.

The example pipelined loop structure 700 shown in FIG. 7 looks like the other pipelined loop structure examples except for the presence of stateful node 716. The loop function is started by having start node 702 signal load scalar nodes 704, 706 and loop driver node 708. Load scalar nodes 704, 706 load initial values into circulate nodes 710, 712, 714, while loop driver node 708 sends activation signals to circulate nodes 710, 712, 714 at a rate determined by the period of the loop. Circulate nodes 710, 712, 714 are coupled to one or more loop bodies (not shown), which in turn are coupled to stateful node 716.

As noted above, loop driver node 708 provides three signals to convey information to stateful node 716. The first of these, called the "valid" signal, reaches stateful node 716 by way of circulate node 714, which is coupled to loop driver node 708. If the stateful node is inside a conditional, the valid signal may additionally pass through the conditional expression.

Depending on how the conditional is constructed in the loop function, stateful node 716 may ignore the valid signal. The valid signal may be ignored when the stateful node's conditionality is handled by giving the node an explicit predicate input rather than by placing the node inside a conditional test. As an illustration, consider two ways of using an accumulator to sum all the values of an array that are greater than 42.

Compare with the following:

In the second approach, the loop structure built by the compiler is simpler because no conditional data flow has to be built. Moreover, in the first approach a value is assigned to "res" only when the conditional expression is true, whereas in the second approach a value is assigned to "res" on every iteration. Consequently, when the accumulator is built using the second approach, the stateful node needs no valid signal input and the signal may be ignored. If a valid signal is desired, the stateful node may be designed with a one-bit input to accept it.
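
The two approaches discussed above can be sketched as follows. This is an illustrative reconstruction in Python, not the listing from the original specification: the function names and the `accumulate` helper are hypothetical, and only the variable name `res` and the threshold 42 come from the text. The first function places the accumulation inside a conditional test; the second passes the test result to the accumulator as an explicit predicate, so `res` is assigned on every iteration.

```python
def sum_with_conditional(values, threshold=42):
    """First approach: the accumulator sits inside a conditional test."""
    res = 0
    for v in values:
        if v > threshold:
            res = res + v          # res is assigned only when the test is true
    return res


def accumulate(value, res, predicate):
    """Hypothetical accumulator with an explicit predicate input."""
    return res + value if predicate else res


def sum_with_predicate(values, threshold=42):
    """Second approach: no conditional data flow around the accumulator."""
    res = 0
    for v in values:
        res = accumulate(v, res, v > threshold)   # res is assigned every iteration
    return res


sample = [10, 50, 42, 43, 100]
total_conditional = sum_with_conditional(sample)
total_predicated = sum_with_predicate(sample)
```

Both functions return the same sum; they differ only in whether the conditional lives around the accumulator or inside it, which is the distinction that lets the second form do without a valid signal.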

The second signal to stateful node 716 is the "starting" signal, which is used to clear the node's internal state. This signal may be generated by loop driver 708 at its loop_starting output. Because the signal from loop driver 708 may pass through delays before it reaches stateful node 716, stateful node 716 is not connected to the code block's "code_block_reset" signal. When the code block is entered, the loop may still be free-running from a previous execution of the block. The "code_block_reset" signal does not pass through those delays, so using it could reset the node and cause it to start processing values that are still arriving from the previous execution of the code block.

The third signal input to stateful node 716 is a signal that goes high on the last clock cycle of each loop iteration. This signal may originate from loop driver node 708 as its "active_last" signal. When stateful node 716 sees this signal go high, it assumes that there is valid data on its inputs.

Normally, stateful node 716 is unconcerned with loop termination. When the loop's termination condition is met, the corresponding results are captured and the loop keeps running. In some cases, however, stateful node 716 must hold its state until the next time the loop's code block is executed, and the node then needs to know when the loop has terminated. In that case the macro uses its "valid" input and does not reset when it sees the "starting" signal, because its state is meant to be preserved across code block invocations.

FIG. 8 shows an example timing diagram of signals that may be used for stateful node 716. In this example, the "valid" signal is high during the first iteration, because the loop is a bottom-test loop that executes at least one iteration. Thereafter, a high "valid" signal indicates that the loop has not yet terminated and, if the node is inside a conditional, that its conditional branch is taken. The "starting" signal goes high for one clock before the loop begins and can be used to clear the state of stateful node 716. The "active_last" signal goes high on the last clock cycle of each loop iteration and keeps doing so even after the loop has terminated. Data inputs to stateful node 716 are assumed to be valid when the "active_last" signal is high.
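
How a stateful node might consume these three signals can be sketched clock by clock. The model below is an assumption-based illustration, not the macro's real behavior: the `StatefulAccumulator` class and `drive_node` helper are hypothetical, the node is taken to be a simple accumulator, the "starting" pulse is modeled as a single clock before the first iteration, and "valid" is held high for a chosen number of iterations and low afterwards while the loop free-runs.

```python
class StatefulAccumulator:
    """Hypothetical stateful node that sums its data input."""

    def __init__(self) -> None:
        self.total = 0

    def clock(self, data: int, valid: bool, starting: bool, active_last: bool) -> None:
        if starting:
            self.total = 0              # clear state for a new loop execution
        if active_last and valid:
            self.total += data          # inputs count only on the last clock


def drive_node(node, values, d, valid_iterations):
    """Feed one value per iteration, with D + 1 clocks per iteration."""
    node.clock(0, valid=False, starting=True, active_last=False)
    for i, value in enumerate(values):
        valid = i < valid_iterations    # later iterations are free-running
        for clk in range(d + 1):
            node.clock(value, valid=valid, starting=False, active_last=(clk == d))
    return node.total


node = StatefulAccumulator()
node.total = 99                         # stale state from an earlier execution
result = drive_node(node, [1, 2, 3, 4, 5], d=1, valid_iterations=3)
```

The stale state is cleared by the "starting" pulse, each value is counted once despite being present for two clocks, and the values from the free-running iterations are ignored.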

The "leading" signal provides the proper synchronization for periodic-input nodes. Some nodes cannot accept a new input on every clock cycle. For example, an integer multiply might reuse a single on-chip multiplier and accept inputs only every three clocks. This issue is orthogonal to latency, which is the number of clock delays between a set of inputs and its corresponding output. If a node cannot accept inputs on every clock, its inputs must be paced correctly, and there must be synchronization that establishes when the node receives new inputs. This is the function provided by the "leading" signal, which can be connected to the "valid in" input of such a node. The D value of the loop driver node must also be set to slow the loop at least enough for the periodic-input nodes to operate correctly.
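
The constraint on D for periodic-input nodes can be expressed as a small calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, a node that accepts inputs every k clocks is taken to need D of at least k - 1 (so that an iteration spans k clocks), and the "leading" signal is modeled as going high on the first clock of each iteration, which the text does not state explicitly.

```python
def min_d_for_periodic_input(accept_interval: int) -> int:
    """Smallest D that spaces iterations at least `accept_interval` clocks apart."""
    if accept_interval < 1:
        raise ValueError("accept_interval must be at least 1")
    return accept_interval - 1


def choose_d(latency_bound_d: int, accept_intervals) -> int:
    """Combine the latency-driven D with the periodic-input constraints."""
    candidates = [latency_bound_d]
    candidates += [min_d_for_periodic_input(a) for a in accept_intervals]
    return max(candidates)


def leading_trace(d: int, clocks: int):
    """Assumed 'leading' waveform: high on the first clock of each iteration."""
    return [clk % (d + 1) == 0 for clk in range(clocks)]


d_for_multiplier = choose_d(latency_bound_d=0, accept_intervals=[3])
trace = leading_trace(d_for_multiplier, 7)
```

For the three-clock multiplier mentioned above, this gives D = 2, and the modeled "leading" pulse recurs every third clock.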

At least two kinds of stateful nodes may be used with the present invention. In one kind, the node's latency is constant regardless of the loop's iteration period (that is, regardless of the value of the D input to the loop driver node). In the other kind, the latency of the stateful node varies with the loop's iteration period. For example, a stateful node that receives N data items before it begins producing output consumes more clock cycles before its first result is produced when the loop is slowed by the loop driver node. Which of these behaviors a stateful node exhibits is specified by its information file entry. A node writer may choose to write a stateful node that functions correctly only when the loop is not slowed, that is, when D = 0; in that case the node's information file entry must specify that fact.
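The distinction between the two kinds of stateful node can be sketched as a small data model. The Python example below is only an illustration: the actual layout of an information file entry is not given here, so the fields `pipeline_latency`, `items_before_output`, and `requires_undelayed_loop` are hypothetical, and the formula assumes one data item arrives per iteration with an iteration period of D + 1 clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatefulNodeInfo:
    """Hypothetical stand-in for a stateful node's information file entry."""
    name: str
    pipeline_latency: int                  # clocks from last required input to first output
    items_before_output: int = 1           # N; 1 gives a latency independent of D
    requires_undelayed_loop: bool = False  # node is correct only when D = 0


def first_result_latency(info: StatefulNodeInfo, d: int) -> int:
    """Clocks from the first input to the first result for loop delay d."""
    if d < 0:
        raise ValueError("D must be non-negative")
    if info.requires_undelayed_loop and d != 0:
        raise ValueError(f"{info.name} functions correctly only when D = 0")
    # The remaining N - 1 items arrive one per iteration, i.e. every d + 1 clocks.
    return info.pipeline_latency + (info.items_before_output - 1) * (d + 1)


fixed = StatefulNodeInfo("accumulator", pipeline_latency=2)
window = StatefulNodeInfo("window4", pipeline_latency=2, items_before_output=4)
strict = StatefulNodeInfo("strict", pipeline_latency=1, requires_undelayed_loop=True)
```

Under these assumptions `fixed` reports the same latency for any D, `window` grows from 5 clocks at D = 0 to 11 clocks at D = 2, and `strict` is rejected for any nonzero D.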

The words “comprise,” “comprising,” “include,” and “including,” when used in this specification and in the appended claims, are intended to specify the presence of stated features, integers, components, or steps, but they do not preclude the presence or addition of one or more other features, integers, components, steps, or groups.

FIG. 1 shows an example of a pipelined loop structure of a control-flow dataflow graph according to an embodiment of the present invention.
FIG. 2 shows an example timing diagram for a loop driver node with two clock cycles between loop iterations.
FIG. 3 shows an example of a pipelined loop structure of a control-flow dataflow graph having a loop with a loop-carried scalar cycle.
FIG. 4 shows an example of a pipelined loop structure of a control-flow dataflow graph having a loop with a loop-carried scalar cycle that uses two or more circular nodes.
FIG. 5 shows an example of a pipelined loop structure of a control-flow dataflow graph having many loop-carried scalar cycles.
FIG. 6 shows an example of a chart representing the paths between circular nodes of the loop structure shown in FIG. 5.
FIG. 7 shows an example of a pipelined loop structure of a control-flow dataflow graph that includes a stateful node.
FIG. 8 shows an example timing diagram for a stateful node macro.

Claims

1. A reconfigurable computer system including a pipelined loop structure of a control-flow dataflow graph, the system comprising:
a multi-adaptive processor, wherein the multi-adaptive processor includes a field programmable gate array; and
a multi-adaptive processor compiler capable of generating code executable on the multi-adaptive processor, wherein the multi-adaptive processor compiler generates code that forms a pipelined loop structure, the pipelined loop structure continues to iterate after a final condition is met, and the pipelined loop structure comprises:
a loop body that processes input values and generates output values in successive iterations of the loop body, wherein the output values are captured by a circular node coupled to the loop body;
a loop valid node coupled to the loop body to determine a final loop iteration;
an output value storage node coupled to the circular node, wherein the output value storage node stores a last output value, the last output value is latched for distribution, and the output value storage node ignores subsequent output values generated after the loop valid node determines that the final loop iteration has occurred, thereby storing the final loop value based on the final loop iteration; and
a loop driver node for adjusting the period of each iteration of the loop body such that one or more clock cycles elapse between values input to the loop body.

2. The pipelined loop structure of claim 1, wherein the loop valid node outputs a loop valid end signal upon determining that the final loop iteration has occurred.

3. The pipelined loop structure of claim 2, wherein the loop valid node outputs the loop valid end signal at each loop iteration until the loop is resumed after the final loop iteration occurs.

4. The pipelined loop structure of claim 2, wherein the loop valid end signal includes a data bit.

5. The pipelined loop structure of claim 1, including an end node coupled to the loop valid node and the output value storage node.

6. The pipelined loop structure of claim 5, wherein the end node includes an end input for receiving a loop valid end signal from the loop valid node.

7. The pipelined loop structure of claim 6, wherein the end node includes an end output for sending a storage node end signal to the output value storage node.

8. The pipelined loop structure of claim 1, wherein the loop driver node is coupled to the circular node.

9. The pipelined loop structure of claim 8, wherein clock latency is based on a period value input to the loop driver node.

10. A reconfigurable computer system including a multi-adaptive processor and a multi-adaptive compiler, wherein the multi-adaptive compiler is capable of converting high-level instructions into code executable on the multi-adaptive processor, the code including instructions for forming a pipelined loop structure, wherein the pipelined loop structure iterates an unpredictable number of times after a final condition is satisfied, and the pipelined loop structure comprises:
a loop body that processes input values and generates output values in successive iterations of the loop body, wherein the output values are captured by a circular node coupled to the loop body and, in response to the final condition being met, the output value is latched and distributed to an output node and subsequent output values are ignored; and
a loop driver node coupled to the circular node, wherein the loop driver node sets a period for each iteration of the loop body such that one or more clock periods elapse before the loop body processes a second input value, and the loop driver node outputs a signal associated with a stateful functional unit.

11. The pipelined loop structure of claim 10, wherein the loop driver node outputs a CIRC_TRIGGER signal to inform the circular node that a loop has started.

12. The pipelined loop structure of claim 10, wherein the loop driver node outputs a START signal for triggering the start of a loop.

13. The pipelined loop structure of claim 10, wherein the loop driver node outputs a LOOP_STARTING signal for clearing the state of a node that requires a reset pulse.

14. The pipelined loop structure of claim 10, wherein the loop driver node outputs a LEADING signal to tell a periodic-input node to load a value.

15. The pipelined loop structure of claim 10, wherein a period value is equal to the period of the longest loop-carried scalar cycle in the pipelined loop structure.

16. The pipelined loop structure of claim 10, wherein the period is based on a period value input to the loop driver node.

17. The pipelined loop structure of claim 10, including a loop valid node coupled to the loop body to determine a final loop iteration.

18. The pipelined loop structure of claim 17, wherein the loop valid node outputs a loop valid end signal upon determining that the final loop iteration has occurred.

19. The pipelined loop structure of claim 18, including an output value storage node coupled to the circular node.

20. The pipelined loop structure of claim 19, wherein the output value storage node ignores output values generated after the loop valid node determines that the final loop iteration has occurred.

21. The pipelined loop structure of claim 20 including an end node coupled to the loop valid node and the output value storage node.