JP2003223476A

JP2003223476A - Hardware acceleration system for function simulation

Info

Publication number: JP2003223476A
Application number: JP2002334637A
Authority: JP
Inventors: Srihari Cadambi; カダンビスリハリ; Pranav Ashar; アシャープラナブ
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2001-12-05
Filing date: 2002-11-19
Publication date: 2003-08-08
Also published as: JP2006268873A

Abstract

PROBLEM TO BE SOLVED: To provide a hardware acceleration system that achieves low cost, high performance, short turnaround time, and high scalability.

SOLUTION: This hardware acceleration system for functional simulation comprises a generic circuit board having a logic chip and memory. The circuit board can be plugged into a computing device. The system is adapted so that the computing device directs DMA (direct memory access) transfers between the circuit board and a memory associated with the computing device. The circuit board can be configured with a simulation processor, and the simulation processor can be programmed for at least one circuit design.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a hardware acceleration system for functional simulation of circuits, and more particularly to a hardware acceleration system and method that use a simulation processor, and to a method of compiling a netlist for the simulation processor.

[0002]

2. Description of the Related Art

(1) References

The following papers provide useful background information, for which reason they are incorporated herein by reference in their entirety. They are cited in the remainder of this specification by the reference number in angle brackets assigned to each (for example, <4> denotes the paper by J. Abke et al. numbered 4).

[0003]
<1> http://www.quickturn.com/products/speedsim.htm.
<2> http://www.quickturn.com/products/palladium.htm.
<3> 2001. http://www.quickturn.com/products/CoBALTUltra.htm.
<4> Joerg Abke and Erich Barke. A new placement method for direct mapping into LUT-based FPGAs. In International Conference on Field Programmable Logic and Applications (FPL 2001), pages 27-36, Belfast, Northern Ireland, August 2001.
<5> Semiconductor Industry Association. International technology roadmap for semiconductors. 1999. http://public.itrs.net.
<6> Jonathan Babb, Russ Tessier, and Anant Agarwal. Virtual wires: Overcoming pin limitations in FPGA-based logic emulators. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1993.
<7> Jonathan Babb, Russ Tessier, Matthew Dahl, Silvina Hanono, David Hoki, and Anant Agarwal. Logic emulation with virtual wires. In IEEE Transactions on CAD of Integrated Circuits and Systems, June 1997.
<8> Steve Carlson. A new generation of verification acceleration. June. http://www.tharas.com.
<9> M. Chiang and R. Palkovic. LCC simulators speed development of synchronous hardware. In Computer Design, pages 87-92, March 1986.
<10> Seth C. Goldstein, Herman Schmit, Matt Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer. Piperench: A coprocessor for streaming multimedia acceleration. In The 26th Annual International Symposium on Computer Architecture, pages 28-39, May 1999.
<11> S. Hauck and G. Borriello. Logic partition orderings for multi-FPGA systems. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 32-38, Monterey, CA, February 1995.
<12> Chandra Mulpuri and Scott Hauck. Runtime and quality tradeoffs in FPGA placement and routing. In International Symposium on Field Programmable Gate Arrays, pages 29-36, Napa, CA, February 2001.
<13> Alberto Sangiovanni-Vincentelli and Jonathan Rose. Synthesis methods for field-programmable gate arrays. In Proceedings of the IEEE, Vol. 81, No. 7, pages 1057-83, July 1993.
<14> E. Shriver and K. Sakallah. Ravel: Assigned-delay compiled-code logic simulation. In International Conference on Computer-Aided Design (ICCAD), pages 364-368, 1992.
<15> D. Thomas and P. Moorby. The Verilog Hardware Description Language, 3rd Edition. Kluwer Academic Publishers, 1996.
<16> S. Trimberger. Scheduling designs into a time-multiplexed FPGA. In Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays, February 1998.
<17> S. Trimberger, D. Carberry, A. Johnson, and J. Wong. A time-multiplexed FPGA. In IEEE Symposium on FPGAs for Custom Computing Machines (FCCM) 1997, February 1997.
<18> Keith Westgate and Don McInnis. Reducing simulation time with cycle simulation. 2000. http://www.quickturn.com/tech/cbs.htm.
<19> J. Cong and Y. Ding. An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table based FPGA Designs. In IEEE Transactions on CAD, pages 1-12, January 1994.
<20> F. Corno, M. S. Reorda, and G. Squillero. RT-level ITC99 Benchmarks and First ATPG Results. In IEEE Design and Test of Computers, pages 44-53, July 2000.
<21> Xilinx. Virtex-II 1.5v Field Programmable Gate Array: Advance Product Specification. Xilinx Application Databook, October 2001. http://www.xilinx.com/partinfo/databook.htm.

[0004] (2) Introduction

a) The verification gap

Over the past decade, new applications and processing demands have emerged one after another, and the complexity and density of integrated circuits (ICs) have grown substantially as a result. At the same time, mounting market pressure has made shorter design cycles essential, and reliance on fully automated design methodologies is expected to increase. Functional verification is an important part of such methodologies and plays a critical role in determining a design's overall time to market. Designers must perform an enormous amount of functional verification before committing time and money to fabrication. More than 60% of human and computer resources are consumed in a typical design process (reference <1>), and 85% of that goes to functional verification (reference <5>). Chip complexity and density have grown at a rapid pace over the past several years, a pace expected to continue for the next decade, but circuit verification capability has not kept up. In other words, the performance of CAD tools for functional verification has not improved enough to keep pace with circuit complexity.

[0005] The resulting "functional verification gap" has been addressed to some extent by hardware-assisted simulators and dedicated hardware emulators. Dedicated emulators far outperform software simulators, but they also cost far more. Software simulation itself was based on event-driven simulation until recently, but it has advanced rapidly since cycle-based logic simulation appeared a few years ago.

[0006] b) Cycle-based simulation

Unlike conventional event-driven simulation, cycle-based simulation is very well suited to functional verification. An event-driven simulator updates the output of each gate that has an event on one of its inputs and then schedules future events for every gate affected by the update. This is efficient for circuits with low activity rates, because the gates that must be updated in each cycle are only a small fraction of the total. The approach also lets an event-driven simulator model and simulate gate delays. In large circuits with high activity rates, however, it increases memory usage and slows the simulation down.
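The event-driven scheme described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any simulator cited here: the netlist format, the function name `event_driven_sim`, and the three-gate example circuit are assumptions made for the sketch. Only gates whose inputs change are evaluated, and each gate's delay determines when its output event is scheduled.

```python
import heapq
from operator import and_, or_

def event_driven_sim(gates, initial_values, stimuli):
    """Hypothetical event-driven simulator.

    gates:          {gate name: (function, input nets, output net, delay)}
    initial_values: {net: 0 or 1}, a consistent starting state for every net
    stimuli:        [(time, net, value)] events applied to primary inputs
    Returns the final net values and the number of gate evaluations.
    """
    values = dict(initial_values)
    # Fan-out map: which gates read each net.
    fanout = {}
    for name, (_, inputs, _, _) in gates.items():
        for net in inputs:
            fanout.setdefault(net, []).append(name)

    queue = list(stimuli)          # pending events, ordered by time
    heapq.heapify(queue)
    evaluations = 0
    while queue:
        time, net, value = heapq.heappop(queue)
        if values[net] == value:
            continue               # value unchanged, so nothing to propagate
        values[net] = value
        # Re-evaluate only the gates driven by the net that changed and
        # schedule their outputs after the gate delay.
        for name in fanout.get(net, []):
            fn, inputs, output, delay = gates[name]
            evaluations += 1
            new_value = fn(*(values[i] for i in inputs))
            heapq.heappush(queue, (time + delay, output, new_value))
    return values, evaluations

# Example circuit (an assumption): y = (a AND b) OR c, z = NOT d.
GATES = {
    "g1": (and_, ["a", "b"], "n1", 2),
    "g2": (or_, ["n1", "c"], "y", 1),
    "g3": (lambda d: 1 - d, ["d"], "z", 1),
}
INITIAL = {"a": 0, "b": 1, "c": 0, "d": 0, "n1": 0, "y": 0, "z": 1}

final, evaluations = event_driven_sim(GATES, INITIAL, [(0, "a", 1)])
```

With a single input event on `a`, only `g1` and `g2` are evaluated and `g3` is never touched, which is why the method is efficient when activity is low. The event queue, on the other hand, grows with activity, which is the memory cost the text refers to.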

[0007] Cycle-based simulation provides a fast, less memory-intensive method of functional verification. It is characterized by the following:

[0008] - Values are computed only at clock edges; intermediate gate results are not computed. Instead, the outputs in each clock cycle are computed as a Boolean logic function of that cycle's inputs.

[0009] - Combinational timing delays are ignored.

[0010] - Simulation is usually two-valued (states 0 and 1) or four-valued (states 0, 1, x, and z). A full-featured event-driven simulator must support as many as 28 states.
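As an illustration of four-valued simulation, the sketch below evaluates NOT, AND, and OR over the states 0, 1, x, and z. The source does not define the gate semantics; the rules here follow the usual Verilog convention, in which a high-impedance input to a gate is treated as unknown, and the function names are assumptions.

```python
# Four-valued logic states: "0", "1", "x" (unknown), "z" (high impedance).
STATES = ("0", "1", "x", "z")

def v_not(a):
    """NOT: defined only for 0 and 1; x and z both yield x."""
    return {"0": "1", "1": "0"}.get(a, "x")

def v_and(a, b):
    """AND: a 0 on either input forces 0, even if the other input is unknown."""
    if a == "0" or b == "0":
        return "0"
    if a == "1" and b == "1":
        return "1"
    return "x"

def v_or(a, b):
    """OR: a 1 on either input forces 1, even if the other input is unknown."""
    if a == "1" or b == "1":
        return "1"
    if a == "0" and b == "0":
        return "0"
    return "x"

# Truth table for AND over all sixteen input pairs.
AND_TABLE = {(a, b): v_and(a, b) for a in STATES for b in STATES}
```

A two-valued simulator needs only the four 0/1 rows of such a table and can pack many signals into one machine word, which is one reason restricting the state set helps performance.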

[0011] Cycle-based simulators therefore gain performance by concentrating on functional verification. On practical circuits, a cycle-based simulator can be as much as 10 times faster than an event-driven simulator while using only about one fifth of the memory (reference <18>). For example, SpeedSim, a commercial cycle simulator from Quickturn/Cadence, can simulate a 1.5-million-gate netlist at 15 vectors per second on a standard UltraSparc workstation. For netlists of 50,000 to 100,000 gates, the speed is typically about 400 to 500 vectors per second. As a result, such simulators are becoming increasingly popular for design verification.

[0012] c) Hardware-assisted cycle-based simulation

Cycle-based simulation can be accelerated further with dedicated hardware. Because cycle-based simulation contains a substantial amount of concurrency (that is, instruction-level parallelism) that conventional microprocessors cannot exploit, hardware acceleration can be expected to yield large gains. The advent of electrically reconfigurable field-programmable gate arrays (FPGAs) has made inexpensive hardware solutions possible. Reconfigurability allows logic circuits to be emulated on an FPGA, and spatial parallelism allows the concurrency to be exploited. Such techniques accelerate functional verification considerably, shortening both the design time and the time to market of complex designs.

[0013] A single FPGA can emulate several logic designs, but its size is limited, and a large circuit cannot be accommodated all at once. A circuit that requires more resources than the FPGA provides simply does not fit.

[0014] The workaround that comes to mind first is to use multiple FPGAs. However, multi-FPGA emulation systems do not solve the problem in terms of either scalability or cost-effectiveness. For example, a system of 10 FPGAs is of no use once the design exceeds the combined capacity of the 10 FPGAs. In addition, the limited number of pins connecting the FPGAs becomes a bottleneck that lowers logic utilization, so that some FPGAs are only partially used. These pins also drive relatively slow on-board interconnect wires, which further reduces emulation speed (reference <11>). These problems are alleviated to some extent by the Virtual Wires concept developed at MIT (references <6>, <7>). Nevertheless, some emulation vendors (such as Axis) still use multiple FPGAs and custom-designed hardware in their systems, and their products cost from hundreds of thousands to millions of dollars.

[0015] Another approach to emulation is to time-multiplex a large design onto a physically small FPGA. The circuit is emulated one portion at a time rather than as a whole: each portion fits in a single FPGA, and the FPGA is reconfigured repeatedly. This removes both the pin constraint and the high cost of multi-FPGA solutions, but performance suffers from the overhead of reconfiguring the FPGA. Most general-purpose FPGAs are not designed for frequent reconfiguration and provide only a few I/O pins for it. Configuration bandwidth is therefore very low, and each reconfiguration incurs a long delay. Special FPGA architectures with additional on-chip storage for multiple configuration contexts have been devised (references <16>, <17>), but they are not commercially available and are not scalable.
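The reconfiguration overhead can be made concrete with a first-order cost model. The sketch below is an assumption made for illustration, not a formula from the source: it treats the time to load one configuration as the bitstream size divided by the configuration bandwidth (pins times configuration clock), and it charges one reconfiguration plus one evaluation for each portion of the design in every simulated cycle. The function names are hypothetical, and any numbers passed in are the caller's own.

```python
def reconfiguration_seconds(bitstream_bits, config_pins, config_clock_hz):
    """Time to load one configuration through a narrow configuration port.

    Assumes one bit per pin per configuration clock cycle.
    """
    if config_pins <= 0 or config_clock_hz <= 0:
        raise ValueError("pins and clock must be positive")
    return bitstream_bits / (config_pins * config_clock_hz)

def seconds_per_simulated_cycle(partitions, reconfig_s, evaluate_s):
    """Each simulated cycle visits every partition once: load it, then run it."""
    return partitions * (reconfig_s + evaluate_s)

def overhead_fraction(reconfig_s, evaluate_s):
    """Share of each simulated cycle spent reconfiguring rather than evaluating."""
    return reconfig_s / (reconfig_s + evaluate_s)
```

Under this model the reconfiguration term dominates whenever loading a configuration takes much longer than evaluating it, which is the situation the paragraph describes for general-purpose FPGAs with few configuration pins.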

[0016] (3) Background of the technology and related work

This section discusses several aspects of related work in light of the background and the prior art.

[0017] (4) Simulation techniques

In event-driven simulation, a change in the value on a net is treated as an event. Events are managed dynamically by an event scheduler. The scheduler schedules events and, in response to each scheduled event, updates every net whose value has changed. It also schedules the future events that the scheduled events cause (reference <15>). The main advantage of event-driven scheduling is flexibility: an event-driven simulator can simulate both synchronous and asynchronous models with arbitrary timing delays. Its disadvantage is low simulation performance, because it is inherently serial and uses a large amount of memory.

[0018] Levelized compiled-code logic simulators, from which cycle-based simulators are derived, can deliver far higher simulation performance than event-driven simulators, because they eliminate most of the run-time overhead of ordering and propagating events. They do this by evaluating every component once per clock cycle in topological order, so that all inputs to a component hold their latest values by the time the component is evaluated. The main drawback of cycle-based simulators is that they cannot simulate arbitrary gate delays (reference <14> is a notable exception).
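The levelized evaluation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the netlist is a dictionary keyed by each gate's output net, the names `levelize` and `simulate_cycle` are hypothetical, and the example circuit (a one-bit full adder) is not from the source. The ordering is computed once, before simulation, and every gate is then evaluated exactly once per cycle with no event queue.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from operator import and_, or_, xor

def levelize(gates):
    """Return gate outputs in topological order (the static "compile" step).

    gates: {output net: (function, [input nets])}
    """
    # A gate depends only on inputs that are themselves gate outputs;
    # primary inputs are already known at the start of the cycle.
    deps = {out: [i for i in ins if i in gates] for out, (_, ins) in gates.items()}
    return list(TopologicalSorter(deps).static_order())

def simulate_cycle(gates, order, inputs):
    """Evaluate every gate once, in levelized order, for one clock cycle."""
    values = dict(inputs)
    for out in order:
        fn, ins = gates[out]
        values[out] = fn(*(values[i] for i in ins))
    return values

# One-bit full adder, deliberately listed out of dependency order.
FULL_ADDER = {
    "cout": (or_, ["c1", "c2"]),
    "sum": (xor, ["s1", "cin"]),
    "c2": (and_, ["s1", "cin"]),
    "c1": (and_, ["a", "b"]),
    "s1": (xor, ["a", "b"]),
}

ORDER = levelize(FULL_ADDER)
result = simulate_cycle(FULL_ADDER, ORDER, {"a": 1, "b": 1, "cin": 0})
```

Because the order is fixed in advance, the inner loop has no scheduling decisions and no data-dependent branching, but it also has no notion of time within a cycle, which is why arbitrary gate delays cannot be represented.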

[0019] Until a few years ago, event-driven simulators were generally preferred to cycle-based simulators, because the activity rate of most circuits was in the range of 10 to 20% (reference <9>). The performance of an event-driven simulator depends on circuit activity rather than circuit size. The circuit is not statically compiled as a whole; simulation proceeds by interpretation, and only the gates and nets affected by circuit activity are updated. In cycle-based simulation, by contrast, the entire circuit is statically compiled before simulation begins, and every gate in the circuit is evaluated in every cycle. Another reason for the early popularity of event-driven simulators was that they could check circuit functionality and timing at the same time. Now that static timing analysis tools are available, functionality and timing can be verified separately.

[0020] In recent years, applications that rely on pipelining and parallel execution (in multimedia and networking) have emerged, and they use circuits with much higher activity rates. When gate delays are not needed (that is, in functional verification), cycle-based simulators are a better fit than event-driven simulators. A cycle-based simulator simulates the entire circuit, yet it outperforms an event-driven simulator because it uses less memory and is amenable to parallelization (references <14>, <18>).

[0021] The approach of the present invention concerns a scalable hardware accelerator for cycle-based simulation that uses a generic board carrying a single commercial FPGA. The remainder of this section discusses other FPGA-based hardware accelerators, including commercial products that could compete in this field.

[0022] a) Single-FPGA systems

Using a single FPGA for logic emulation has two major problems.

[0023] - Lack of scalability: a design that does not fit in the FPGA cannot be emulated all at once. Emulating such a design one portion at a time requires repeated reconfiguration, which is very slow on commercial FPGAs.

[0024] - Long compile times: the conventional FPGA tool flow is complex and takes hours to days for large designs. The added simulation overhead has a serious impact on design time and time to market.

[0025] In reference <17>, the authors present a time-multiplexed FPGA architecture that holds multiple contexts and switches between them quickly. A large circuit that does not fit in the FPGA can be partitioned into suitably sized portions, each of which is stored in the FPGA. This avoids repeated, cumbersome reconfiguration, but it is limited by the amount of context storage provided in the FPGA. Moreover, commercial FPGAs cannot store and switch among multiple contexts, so a special-purpose FPGA has to be built.

[0026] b) Multi-FPGA systems

Emulation systems typically consist of many interconnected commercial FPGAs. This makes it possible to emulate large designs, but because few pins are available for inter-FPGA communication, the utilization of each FPGA drops sharply: the pin shortage leaves the FPGAs only partially filled, and capacity is wasted. Reference <6> proposes a novel technique called "Virtual Wires", in which each physical pin is time-multiplexed and mapped to several "virtual pins" in the design. This requires some additional time-multiplexing hardware, and the design as a whole must be emulated at a clock rate lower than the FPGA clock rate. Even so, the Virtual Wires concept is very effective for multi-FPGA systems.
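The pin time-multiplexing idea can be sketched as a simple round-robin assignment of logical signals to (pin, time slot) pairs. This is an illustration only: the actual Virtual Wires scheduling in reference <6> accounts for signal dependencies and is not reproduced here, the function names are hypothetical, and the clock estimate is a first-order assumption that one emulated cycle needs as many FPGA clock cycles as there are time slots.

```python
import math

def assign_virtual_wires(signals, physical_pins):
    """Map each logical signal to a (pin, time slot) pair, round-robin."""
    if physical_pins <= 0:
        raise ValueError("at least one physical pin is required")
    slots = math.ceil(len(signals) / physical_pins)
    schedule = {
        signal: (index % physical_pins, index // physical_pins)
        for index, signal in enumerate(signals)
    }
    return schedule, slots

def emulation_clock_hz(fpga_clock_hz, slots):
    """First-order assumption: one emulated cycle takes `slots` FPGA cycles."""
    return fpga_clock_hz / max(slots, 1)

# Ten inter-FPGA signals shared over four physical pins (hypothetical sizes).
SIGNALS = [f"net{i}" for i in range(10)]
schedule, slots = assign_virtual_wires(SIGNALS, physical_pins=4)
```

With more signals than pins, several signals share a pin in different time slots, so the logic inside each FPGA can be filled more completely at the cost of a proportionally slower emulated clock.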

[0027] c) Commercial products

(1) Quickturn/Cadence

Quickturn (now absorbed into Cadence) sold cycle-based simulators, simulation accelerators, and emulators. SpeedSim is a (software) cycle-based Verilog simulator that translates HDL directly into native machine code. Its performance is enhanced by symmetric multiprocessing (SMT) and simultaneous test (ST) techniques, which allow multiple test vectors to be simulated within a single design (reference <1>).

[0028] Palladium (reference <2>) is one of Quickturn's comprehensive verification products for simulation acceleration, testbench generation, and in-circuit emulation. It is built from custom ASICs designed specifically for simulation and emulation. CoBALT (reference <3>), a much larger emulation system than Palladium and also from Quickturn, scales to a maximum of 112 million gates. All of these products require entirely proprietary systems and are therefore very expensive (in the multimillion-dollar range).

[0029] (2) Tharas Systems

Tharas Systems offers a comparatively affordable verification acceleration system called "Hammer". The Hammer hardware consists of a high-bandwidth backplane connected to a board carrying several proprietary custom-built ASICs. Each ASIC can evaluate part of an RTL or gate-level design and also provides a non-blocking interconnect to all the other ASICs on the board (reference <8>). The system scales to a maximum of 8 million gates and costs hundreds of thousands of dollars.

[0030] (3) IKOS

IKOS (http://www.ikos.com) sells the emulation systems VirtuaLogic and VStation. VirtuaLogic consists of hardware made up of several FPGAs connected to one another using the Virtual Wires concept (reference <6>). VStation is a larger emulator that connects to a workstation through IKOS's proprietary interface, the "Transaction Interface Portal". IKOS systems mainly target the emulation market.

[0031] (4) AXIS

The Xtreme simulation acceleration system sold by AXIS (http://www.axiscorp.com) also consists of several FPGAs. Combined with the software simulator "Xcite", the AXIS system supports "hot swapping" between hardware and software: hardware-accelerated simulation is used at first, and when a design bug is found, the entire design is swapped efficiently into software for debugging.

[0032] (5) Others

Avery Design Systems sells a product called "SimCluster", which distributes Verilog simulation efficiently across multiple CPUs. It can be licensed on its own or used together with third-party Verilog simulators. Logic Express offers a product called "SOC-V20", which also consists of several FPGAs and hard-wired logic for simulation acceleration.

【００３３】[0033]

【発明が解決しようとする課題】従来のハードウェア支
援シミュレータの使用や専用ハードウェア・エミュレー
タのパフォーマンスはソフトウェア・シミュレータのそ
れを大幅に上回るが、コストも大幅に高い。Conventional hardware-assisted simulators and dedicated hardware emulators substantially outperform software simulators, but they also cost substantially more.

【００３４】従来のハードウェア支援サイクル・ベース
・シミュレーションでは、再構成可能性によりＦＰＧＡ
上で論理回路をエミュレートすることが可能になり、さ
らには空間パラレリズムの使用による並行性の処理が可
能になる。しかし、単一のＦＰＧＡで数個の論理設計を
エミュレートすることはできるが、サイズに限界があ
り、大規模な回路では１個を一度に収容することはでき
ない。In conventional hardware-assisted cycle-based simulation, reconfigurability makes it possible to emulate logic circuits on an FPGA, and spatial parallelism allows concurrency to be handled. However, although a single FPGA can emulate several logic designs, its size is limited, and a large circuit cannot be accommodated all at once.

【００３５】また、複数ＦＰＧＡによるエミュレーショ
ン・システムは、スケーラビリティとコスト効果の両面
でこの問題に対処できない。また、ＦＰＧＡを接続する
ピンの数が限られていることがボトルネックとなって論
理利用率が低下し、一部のＦＰＧＡは部分的にしか使用
されないという結果になる。さらに、これらのピンは比
較的低速なオンボード相互接続ワイヤを使用しており、
これもエミュレーション速度を低下させる要因とな
る。、一部のエミュレーション・ベンダ（Ａｘｉｓな
ど）はシステム内で複数のＦＰＧＡと専用設計のハード
ウェアを使用しており、製品価格は数十万ドル〜数百万
ドルにも上っている。Emulation systems built from multiple FPGAs also fail to address this problem in terms of both scalability and cost-effectiveness. The limited number of pins connecting the FPGAs becomes a bottleneck that lowers logic utilization, so that some FPGAs are only partially used. In addition, these pins use relatively slow on-board interconnect wires, which further reduces emulation speed. Some emulation vendors (such as Axis) use multiple FPGAs and custom-designed hardware in their systems, and their products cost hundreds of thousands to millions of dollars.

【００３６】物理的に小規模なＦＰＧＡ上で大規模な設
計を時分割し、回路全体ではなく、部分的にエミュレー
ションを行う方法では、ピンの制約も、複数ＦＰＧＡの
採用によるソリューションの高コストの問題も解消され
るが、ＦＰＧＡの再構成に要するオーバーヘッドのため
にパフォーマンスが低下する。ほとんどの汎用ＦＰＧＡ
は頻繁な再構成を見越した設計がなされていないので、
再構成のために用意されているＩ／Ｏピンはごく少数し
かない。そのため、構成の帯域幅は非常に小さく、再構
成時には大幅な遅延が発生すると共に、スケーラビリテ
ィに問題がある。Time-multiplexing a large design onto a physically small FPGA, so that the circuit is emulated a portion at a time rather than as a whole, removes both the pin constraint and the high cost of multi-FPGA solutions, but performance suffers from the overhead of reconfiguring the FPGA. Most general-purpose FPGAs are not designed for frequent reconfiguration, so only a few I/O pins are provided for it. The configuration bandwidth is therefore very small, reconfiguration incurs a large delay, and scalability is a problem.

【００３７】イベント・ドリブン・シミュレーションの
欠点は、本質的に直列であり、メモリ使用量が大きいた
めに、低レベルのシミュレーション・パフォーマンスし
か実現されないことである。The disadvantage of event-driven simulation is that it is inherently serial and, due to the large memory usage, provides only low levels of simulation performance.

【００３８】均一化されたコンパイル済みコード論理シ
ミュレータ（サイクル・ベース・シミュレータはこれか
ら派生されたもの）では、イベント・ドリブン・シミュ
レータよりもはるかに高いシミュレーション・パフォー
マンスを得られる可能性があるが、任意のゲート遅延に
よってシミュレートできないという欠点がある。Levelized compiled-code logic simulators (from which cycle-based simulators are derived) can potentially deliver much higher simulation performance than event-driven simulators, but have the drawback that they cannot simulate with arbitrary gate delays.

【００３９】論理エミュレーションのための単一ＦＰＧ
Ａを使用することには、２つの大きな問題がある。There are two major problems with using a single FPGA for logic emulation.

【００４０】・スケーラビリティの欠如：ＦＰＧＡに収
まらない設計は、全体を１度にエミュレートすることは
できない。このような設計を一部分ずつエミュレートす
るためには再構成を繰り返し実行する必要があり、市販
のＦＰＧＡでは非常に時間がかかる。Lack of scalability: A design that does not fit in the FPGA cannot be emulated in its entirety at once. Emulating such a design one portion at a time requires repeated reconfiguration, which is very time-consuming on commercially available FPGAs.

【００４１】・長いコンパイル時間：従来のＦＰＧＡツ
ール・フローは複雑で、大規模な設計では数時間から数
日間の時間を要する。また、これによるシミュレーショ
ン・オーバーヘッドの増大により、設計時間と市場投入
までの期間に重大な影響が生じる。Long compile time: Traditional FPGA tool flows are complex, requiring hours to days for large designs. In addition, the increased simulation overhead will have a significant impact on design time and time to market.

【００４２】複数ＦＰＧＡシステムであるエミュレーシ
ョン・システムは、典型的には、相互接続された多数の
市販ＦＰＧＡで構成され、大規模な設計をエミュレート
することが可能になるが、ＦＰＧＡ間通信に使用できる
ピンが少ないため、各ＦＰＧＡの利用率は大幅に低下す
る。ピンが少ないためにＦＰＧＡは部分的にしか埋めら
れず、無駄が生じるのである。An emulation system built from multiple FPGAs typically consists of many interconnected off-the-shelf FPGAs and can emulate large designs, but because few pins are available for inter-FPGA communication, the utilization of each FPGA drops significantly. With so few pins, each FPGA can be only partially filled, which is wasteful.

【００４３】本発明は、従来技術に関連する欠点を解消
し、上記問題のいくつかを解決することを目指すもので
ある。具体的には、本発明による手法は少なくとも、
(i)低コスト、(ii)ハイ・パフォーマンス、(iii)低いタ
ーンアラウンド・タイム、（iv）高いスケーラビリテ
ィ、という４つの利点を提供する。すなわち、本発明は、
シミュレータ並みのコスト、スケーラビリティ、ターン
アラウンド・タイムを実現する一方で、非常に優れたパ
フォーマンスを達成するハードウェア・アクセラレーシ
ョン・システムを提供することを目的とする。The present invention aims to overcome the drawbacks of the prior art and to solve some of the problems described above. Specifically, the approach of the present invention offers at least four advantages: (i) low cost, (ii) high performance, (iii) low turnaround time, and (iv) high scalability. That is, the object of the present invention is to provide a hardware acceleration system that achieves very high performance while offering cost, scalability, and turnaround time comparable to those of a simulator.

【００４４】[0044]

【課題を解決するための手段】上記の目的を達成するた
めに、本発明は、論理チップを内蔵する汎用回路基板とメ
モリで構成される、機能シミュレーションのためのハー
ドウェア・アクセラレーション・システムを提供する。
この回路基板は、コンピューティング・デバイスにプラ
グ接続できる。このシステムは、コンピューティング・
デバイスが、回路基板とそのコンピューティング・デバ
イスに関連するメモリの間のＤＭＡ転送を誘導するよう
に適応される。回路基板はさらに、シミュレーション・
プロセッサを使用して構成することも可能である。シミ
ュレーション・プロセッサは、少なくとも１つの回路設
計用としてプログラムすることができる。To achieve the above object, the present invention provides a hardware acceleration system for functional simulation comprising a general-purpose circuit board that includes a logic chip and a memory. The circuit board can be plugged into a computing device. The system is adapted so that the computing device directs DMA transfers between the memory associated with the circuit board and the memory associated with the computing device. The circuit board can further be configured with a simulation processor. The simulation processor can be programmed for at least one circuit design.

【００４５】他の特定の機能増強では、ＦＰＧＡにシミ
ュレーション・プロセッサがマッピングされる。In another particular enhancement, the simulation processor is mapped to the FPGA.

【００４６】他の特定の機能増強では、シミュレート対
象の回路のネットリストがシミュレーション・プロセッ
サ用にコンパイルされる。In another particular enhancement, a netlist of simulated circuits is compiled for the simulation processor.

【００４７】他の特定の機能増強では、シミュレーショ
ン・プロセッサは、少なくとも１つの要素プロセッサ
と、前記少なくとも１つの要素プロセッサに対応する１
つ以上のレジスタを有する少なくとも１つのレジスタ・
ファイルとをさらに備える。In another particular enhancement, the simulation processor further comprises at least one element processor and at least one register file having one or more registers corresponding to said at least one element processor.

【００４８】他の特定の機能増強では、シミュレーショ
ン・プロセッサは、少なくとも１つのメモリ・バンクを
有する分散メモリ・システムをさらに備える。In another particular enhancement, the simulation processor further comprises a distributed memory system having at least one memory bank.

【００４９】他の特定の機能増強では、前記少なくとも
１つのメモリ・バンクが、要素プロセッサ・セットとそ
れに関連するレジスタのために使用される。In another particular enhancement, said at least one memory bank is used for the element processor set and its associated registers.

【００５０】他の特定の機能増強では、レジスタをメモ
リ・バンク上にスピル処理することができる。In another particular enhancement, the registers can be spilled onto a memory bank.

【００５１】他の特定の機能増強では、システムは、前
記少なくとも１つの要素プロセッサを他の要素プロセッ
サと接続する相互接続システムをさらに備える。In another particular enhancement, the system further comprises an interconnection system connecting said at least one element processor with another element processor.

【００５２】他の特定の機能増強では、要素プロセッサ
は任意の２入力ゲートをシミュレートできる。In another particular enhancement, the element processor can simulate any two-input gate.

【００５３】他の特定の機能増強では、要素プロセッサ
はＲＴレベルのシミュレーションを実行できる。In another particular enhancement, the element processor can perform RT-level simulations.

【００５４】他の特定の機能増強では、接続はレジスタ
を介して行われる。In another particular enhancement, the connection is made via a register.

【００５５】他の特定の機能増強では、相互接続ネット
ワークはパイプライン処理される。In another particular enhancement, the interconnection network is pipelined.

【００５６】他の特定の機能増強では、レジスタ・ファ
イルは関連の要素プロセッサに近接して配置される。In another particular enhancement, the register file is located close to the associated element processor.

【００５７】他の特定の機能増強では、分散メモリ・シ
ステムは各レジスタ・ファイルに対応する専用ポートを
有する。In another particular enhancement, the distributed memory system has a dedicated port associated with each register file.

【００５８】他の特定の機能増強では、システムは、ネ
ットリストが基板上のメモリに収まらない場合は、ネッ
トリストのパーティションを１度に１つずつ処理するこ
とができる。In another particular enhancement, the system can process the partitions of the netlist one at a time if the netlist does not fit in on-board memory.

【００５９】他の特定の機能増強では、システムは、ネ
ットリストのパーティションを順次シミュレートするこ
とにより、ネットリスト全体をシミュレートできる。In another particular enhancement, the system can simulate the entire netlist by sequentially simulating partitions of the netlist.

【００６０】他の特定の機能増強では、システムは、回
路の試験に使用されるシミュレーション・ベクトルのサ
ブセットを処理することができる。In another particular enhancement, the system is capable of processing a subset of simulation vectors used to test the circuit.

【００６１】他の特定の機能増強では、システムは、各
サブセットを順次シミュレートすることにより、シミュ
レーション・ベクトル・セット全体をシミュレートでき
る。In another particular enhancement, the system can simulate the entire simulation vector set by sequentially simulating each subset.

【００６２】他の特定の機能増強では、アクセラレーシ
ョン・システムは、設計内の全レジスタの状態を交換す
る能力を有する汎用ソフトウェア・シミュレータと交互
に使用されることができる。In another particular enhancement, the acceleration system can be used interchangeably with a general purpose software simulator capable of exchanging the states of all registers in the design.

【００６３】他の特定の機能増強では、２値化、４値化
いずれのシミュレーションもシミュレーション・プロセ
ッサ上で実行できる。In another particular enhancement, both 2-valued and 4-valued simulation can be performed on the simulation processor.

【００６４】他の特定の機能増強では、システムはイン
ターフェース及び演算コードをさらに備え、前記演算コ
ードはシミュレーション・ベクトルに関連する読み取
り、書き込み、その他の演算を指定する。In another particular enhancement, the system further comprises an interface and opcode, said opcode specifying read, write, and other operations associated with the simulation vector.

【００６５】他の特定の機能増強では、シミュレーショ
ン・プロセッサは、少なくとも１つの演算論理回路
（ＡＬＵ：ａｒｉｔｈｍｅｔｉｃｌｏｇｉｃｕｎ
ｉｔ）、ゼロかそれ以上の符号付き乗算子、各々が前記
ＡＬＵ及び前記乗算子と関連付けられた少なくとも１つ
のレジスタを有する分散レジスタ・システムをさらに備
える。In another particular enhancement, the simulation processor further comprises at least one arithmetic logic unit (ALU), zero or more signed multipliers, and a distributed register system in which at least one register is associated with each of the ALUs and multipliers.

【００６６】他の特定の機能増強では、システムは各Ａ
ＬＵ用のキャリー・レジスタ・ファイルを備え、当該レ
ジスタの幅は対応するレジスタの幅と同じである。In another particular enhancement, the system comprises a carry register file for each ALU, whose width is the same as the width of the corresponding register.

【００６７】他の特定の機能増強では、システムが、レ
ジスタを接続するパイプライン処理されたキャリーチェ
ーン相互接続を備える。In another particular enhancement, the system comprises pipelined carry chain interconnects connecting registers.

【００６８】他の態様では、回路の論理シミュレーショ
ンを実行する方法が提供され、当該方法は、シミュレー
ション・プロセッサ用の命令セットを生成するための回
路に対応するネットリストをコンパイルするステップ
と、命令をシミュレーション・プロセッサに対応するオ
ンボード・メモリにロードするステップと、オンボード
・メモリ上にシミュレーション・ベクトル・セットを転
送するステップと、シミュレート対象のネットリストに
対応する命令セットをシミュレーション・プロセッサが
構成されているＦＰＧＡ上にストリーミングするステッ
プと、命令セットを実行して結果ベクトル・セットを生
成するステップと、結果ベクトルをホスト・コンピュー
タ上に転送するステップとを備える。In another aspect, a method of performing logic simulation of a circuit is provided, the method comprising: compiling a netlist corresponding to the circuit to generate a set of instructions for a simulation processor; loading the instructions into on-board memory corresponding to the simulation processor; transferring a set of simulation vectors onto the on-board memory; streaming the set of instructions corresponding to the netlist to be simulated onto the FPGA on which the simulation processor is configured; executing the set of instructions to generate a set of result vectors; and transferring the result vectors to a host computer.

【００６９】本発明のさらに他の態様では、シミュレー
ション・プロセッサ用の回路のネットリストをコンパイ
ルする方法が提供され、前記方法は、グラフのノードが
設計内のハードウェア・ブロックに対応する有向グラフ
として回路の設計を表現するステップと、スケジュール
される準備が完了したノードのレディフロント・サブセ
ットを生成するステップと、レディフロント・セットに
対してトポロジカル・ソートを実行するステップと、現
在に至るまで未選択のノードを選択するステップと、命
令を完了し、使用できる要素プロセッサがない場合は新
しい命令に進むステップと、選択されたノードに対応す
る演算を実行するために、関連する解放されたレジスタ
数が最も多い要素プロセッサを選択するステップと、オ
ペランドをレジスタから選択された要素プロセッサまで
経路指定するステップと、未選択のノードがなくなるま
で反復するステップとを備える。In yet another aspect of the present invention, a method of compiling a netlist of a circuit for a simulation processor is provided, the method comprising: representing the design of the circuit as a directed graph whose nodes correspond to hardware blocks in the design; generating a ready-front subset of nodes that are ready to be scheduled; performing a topological sort on the ready-front set; selecting a node that has not yet been selected; completing the instruction and proceeding to a new instruction if no element processor is available; selecting, to perform the operation corresponding to the selected node, the element processor with the largest number of associated free registers; routing the operands from registers to the selected element processor; and repeating until no unselected nodes remain.
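The compilation steps listed above can be sketched as a simple list scheduler. This is a minimal illustration under stated assumptions, not the SimPLE compiler itself: the netlist encoding (a dict from node name to its fan-in names), the function name `schedule`, and the alphabetical ordering used in place of the topological sort of the ready front are all hypothetical, and operand routing, register freeing, and spilling are not modeled.

```python
def schedule(netlist, num_pes, regs_per_pe):
    """Pack netlist nodes into instructions, at most one node per PE per instruction."""
    free_regs = [regs_per_pe] * num_pes      # free registers in each PE's register file
    done = set()                             # nodes computed by earlier instructions
    remaining = set(netlist)
    program = []                             # each instruction maps PE index -> node
    while remaining:
        # Ready front: nodes whose fan-in is a primary input or already computed.
        ready = sorted(n for n in remaining
                       if all(p in done or p not in netlist for p in netlist[n]))
        if not ready:
            raise ValueError("netlist contains a combinational cycle")
        instruction = {}
        for node in ready:
            idle = [pe for pe in range(num_pes) if pe not in instruction]
            if not idle:
                break                        # no PE available: close this instruction
            pe = max(idle, key=lambda i: free_regs[i])   # PE with the most free registers
            if free_regs[pe] == 0:
                raise RuntimeError("out of registers; spilling is not modeled here")
            free_regs[pe] -= 1               # the result occupies one register
            instruction[pe] = node
        program.append(instruction)
        done.update(instruction.values())
        remaining.difference_update(instruction.values())
    return program

# Hypothetical four-gate netlist; "a", "b", "c" are primary inputs.
netlist = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"], "g4": ["g3", "a"]}
for i, instr in enumerate(schedule(netlist, num_pes=2, regs_per_pe=4)):
    print(i, instr)
```

With two element processors, `g1` and `g2` share the first instruction; with a single one, every gate needs its own instruction, which is why the number of generated instructions depends on the number of processors available.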

【００７０】[0070]

【発明の実施の形態】以下、本発明の実施の形態につい
て図面を参照して詳細に説明する。BEST MODE FOR CARRYING OUT THE INVENTION Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

【００７１】１．ハードウェア・アクセラレーション・
システムここでは、本発明による手法を利用する実装例
であるオーバーオール・ハードウェア・アクセラレーシ
ョン・システムについて説明する。1. Hardware Acceleration System: This section describes an overall hardware acceleration system, which is an example implementation that uses the approach of the present invention.

【００７２】ＳｉｍＰＬＥｉｎｔｅｒｍｅｄｉａｔｅ
ａｒｃｈｉｔｅｃｔｕｒｅ（ＳｉｍＰＬＥ中間アーキ
テクチャー：以下、ＳｉｍＰＬＥと称する）２．６（例
えば、図２〜４に示すもの）は、シミュレーション・プ
ロセッサに関連する本発明による手法の実装例である。
なお、ここで説明される特定のアーキテクチャと実装は
例示にすぎず、いかなる形であれ権利請求される発明を
限定するものと解釈してはならないことは明らかであ
る。当該技術に精通する当業者には、本発明による手法
の範囲から逸脱することなく、多数の代替実装が可能で
あることが理解されるであろう。さらに、これらの例は
ＦＰＧＡ（ＦｉｅｌｄＰｒｏｇｒａｍｍａｂｌｅＧ
ａｔｅＡｒｒａｙ）２．２を使用して説明されている
が、任意の論理チップを使用できることは言うまでもな
い。The SimPLE intermediate architecture (hereinafter, SimPLE) 2.6 (shown, for example, in FIGS. 2 to 4) is an example implementation of the present approach as it relates to a simulation processor. The specific architecture and implementation described here are examples only and must not be construed as limiting the claimed invention in any way. Those skilled in the art will appreciate that numerous alternative implementations are possible without departing from the scope of the present approach. Furthermore, although these examples are described using an FPGA (Field Programmable Gate Array) 2.2, any logic chip can of course be used.

【００７３】ＦＰＧＡ２．２上でネットリストを時分割
することは、通常、多大な構成上のオーバーヘッドを伴
う。それは、ほとんどのＦＰＧＡは構成ビット用として
ごくわずかなピンしか備えていないためである。本発明
では、この構成帯域幅の問題を、シミュレーション・プ
ロセッサの概念を導入することによって解決する。以下
では、「ＳｉｍＰＬＥ」と称する、こうしたシミュレー
ション・プロセッサの例について説明する。Time-multiplexing a netlist on FPGA 2.2 normally incurs a large configuration overhead, because most FPGAs provide only a few pins for configuration bits. The present invention solves this configuration bandwidth problem by introducing the concept of a simulation processor. An example of such a simulation processor, called "SimPLE", is described below.

【００７４】ＳｉｍＰＬＥ２．６は、ネットリストのコ
ンパイル先となる仮想概念である。ＦＰＧＡ２．２上で
１度構成された後、ＳｉｍＰＬＥ２．６は、ＳｉｍＰＬ
Ｅコンパイラと呼ばれるコンパイラ例を使用して、様々
な回路設計に対応してプログラムされる（すなわち、
ＳｉｍＰＬＥ上で異なるネットリストをシミュレートで
きる）。ＳｉｍＰＬＥ用の命令は、ＦＰＧＡのデータＩ
／Ｏピンを使用するため、構成帯域幅が小さくても影響
はない。SimPLE 2.6 is a virtual concept to which netlists are compiled. Once configured on FPGA 2.2, SimPLE 2.6 is programmed for different circuit designs using an example compiler called the SimPLE compiler (that is, different netlists can be simulated on SimPLE). Because SimPLE instructions use the FPGA's data I/O pins, the small configuration bandwidth has no effect.

【００７５】（１）オーバーオール・システム例ここで説明するオーバーオール・ハードウェア・アクセ
ラレーション・システムは、市販のＦＰＧＡ、メモリ、
ＰＣＩコントローラ、ＤＭＡコントローラを有する汎用
ＰＣＩ基板で構成されるので、当然ながらどんなコンピ
ューティング・システムにもプラグ接続が可能である。
基板はホストのメモリに直接アクセスでき、その動作は
ホストによって制御されると想定する。したがってホス
トは、メイン・メモリと、ＦＰＧＡがアクセス可能な基
板上のメモリとの間のＤＭＡ転送を誘導できる。さら
に、本発明による手法を使用する場合は、基板メモリ
は、ＦＰＧＡか、（ＰＣＩインターフェースを介して）
自身に随時アクセスするホストのいずれかとシングル
ポート接続されるだけでよい。(1) Example of Overall System: The overall hardware acceleration system described here consists of a general-purpose PCI board with a commercially available FPGA, memory, a PCI controller, and a DMA controller, so it can be plugged into any computing system. The board is assumed to have direct access to the host's memory, with its operation controlled by the host. The host can therefore direct DMA transfers between main memory and the on-board memory accessible to the FPGA. Furthermore, with the approach of the present invention, the board memory need only be single-ported, since at any given time it is accessed either by the FPGA or (through the PCI interface) by the host.

【００７６】図２は、本発明による手法のシミュレーシ
ョン技法を示す。回路のコンパイル済み（コンパイラ
２．５による）ＳｉｍＰＬＥ命令はシミュレーション・
ベクトル・セットと共に、ＤＭＡを使用してオンボード
・メモリ２．１に転送される。各命令は、ＳｉｍＰＬＥ
２．６内の各要素プロセッサ（ＰＥ：ｐｒｏｃｅｓｓ
ｉｎｇｅｌｅｍｅｎｔ）２．３１〜２．３４の演算を
指定する。各命令は、ネットリストの１つのスライスを
表す。すべての命令を実行すると、１つのシミュレーシ
ョン・ベクトルのネットリスト全体がシミュレートされ
る。そのため、各シミュレーション・ベクトルについ
て、すべての命令が基板のメモリからＦＰＧＡ２．２に
ストリーミングされ、その結果ベクトルが戻されてオン
ボード・メモリ２．１に格納される。ＳｉｍＰＬＥ命令
が基板上のＦＰＧＡメモリ・バスよりも大きい場合は、
細かく時分割され、ＦＰＧＡ２．２上のエクストラ・ハ
ードウェアを使って再編成される。すべてのシミュレー
ション・ベクトルの処理が完了したら、結果ベクトルが
ＤＭＡ処理（ＤＭＡ／ＰＣＩコントローラ２．３によ
る）により基板からホスト２．４に戻される。これによ
り、必要であればまた別のシミュレーション・ベクトル
をシミュレートできるようになる。シミュレーション全
体のホスト制御は、ＡＰＩ３．１（図３に図示）を介し
て行われる。FIG. 2 illustrates the simulation technique of the present approach. The SimPLE instructions compiled for the circuit (by compiler 2.5) are transferred, together with a set of simulation vectors, to on-board memory 2.1 using DMA. Each instruction specifies the operation of every element processor (PE: processing element) 2.31 to 2.34 in SimPLE 2.6, and represents one slice of the netlist. Executing all of the instructions simulates the entire netlist for one simulation vector. Thus, for each simulation vector, all of the instructions are streamed from the board memory to FPGA 2.2, and the resulting vector is written back to on-board memory 2.1. If a SimPLE instruction is wider than the FPGA-memory bus on the board, it is time-sliced into smaller pieces and reassembled using extra hardware on FPGA 2.2. Once all simulation vectors have been processed, the result vectors are returned from the board to host 2.4 by DMA (through DMA/PCI controller 2.3). Further simulation vectors can then be simulated if needed. Host control of the entire simulation is performed through API 3.1 (shown in FIG. 3).
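The per-vector streaming described for FIG. 2 can be mimicked in software. The following is only a behavioral sketch: the instruction encoding (a list of `(destination, operation, source_a, source_b)` tuples) and the function name `run_vectors` are assumptions made for illustration and do not reflect the actual SimPLE instruction format.

```python
from operator import and_, xor

def run_vectors(program, vectors, outputs):
    """Simulate every vector by streaming the whole program once per vector."""
    results = []
    for vector in vectors:                      # one pass over the program per simulation vector
        signals = dict(vector)                  # primary inputs seed the signal values
        for instruction in program:             # stream all instructions in order
            # All PEs in one instruction read the values from before the instruction.
            updates = {dst: op(signals[a], signals[b])
                       for dst, op, a, b in instruction}
            signals.update(updates)
        results.append({name: signals[name] for name in outputs})
    return results

# A half adder expressed as a single instruction that keeps two PEs busy.
half_adder = [[("sum", xor, "a", "b"), ("carry", and_, "a", "b")]]
vectors = [{"a": 0, "b": 1}, {"a": 1, "b": 1}]
print(run_vectors(half_adder, vectors, ["sum", "carry"]))
# [{'sum': 1, 'carry': 0}, {'sum': 0, 'carry': 1}]
```

The outer loop corresponds to processing one simulation vector after another, and the inner loop to streaming every instruction through the FPGA for that vector.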

【００７７】シミュレーション速度を定量化するため、
ユーザ・サイクル、プロセッサ・サイクル（文献＜１６
＞で定義されるものと同様）、及びＦＰＧＡサイクルを
定義する。ＦＰＧＡサイクルは、ＳｉｍＰＬＥ２．６が
構成されたＦＰＧＡ２．２のクロック期間である。プロ
セッサ・サイクルは、ＳｉｍＰＬＥ２．６が動作する速
度である。これは、１つのＳｉｍＰＬＥ命令が完了する
までの所要時間として定義される。通常、１つの命令は
１回のＦＰＧＡサイクルで完了するので、プロセッサ・
サイクルはＦＰＧＡサイクルと同一である。ただし、命
令が時分割される場合（すなわち、ＳｉｍＰＬＥ命令が
ＦＰＧＡメモリ・バスより大きい場合）、プロセッサ・
サイクルはＦＰＧＡサイクルより長くなる。例えば、Ｓ
ｉｍＰＬＥ命令がＦＰＧＡメモリ・バスの２倍の幅を有
する場合、プロセッサ・サイクルはＦＰＧＡサイクルの
２倍となる。最後に、ユーザ・サイクルは、単一のシミ
ュレーション・ベクトルのネットリストを完全にシミュ
レートする（すなわちすべての命令を処理する）ために
要する時間である。To quantify simulation speed, we define the user cycle, the processor cycle (similar to those defined in reference <16>), and the FPGA cycle. The FPGA cycle is the clock period of FPGA 2.2 on which SimPLE 2.6 is configured. The processor cycle is the rate at which SimPLE 2.6 operates, defined as the time taken to complete one SimPLE instruction. Normally an instruction completes in one FPGA cycle, so the processor cycle equals the FPGA cycle. If instructions are time-sliced (that is, if a SimPLE instruction is wider than the FPGA-memory bus), however, the processor cycle is longer than the FPGA cycle. For example, if a SimPLE instruction is twice as wide as the FPGA-memory bus, the processor cycle is twice the FPGA cycle. Finally, the user cycle is the time taken to simulate the netlist completely (that is, to process all instructions) for a single simulation vector.

【００７８】次に、シミュレーション速度を定量化す
る。命令幅がＩＷのＳｉｍＰＬＥアーキテクチャをター
ゲットとする場合に、ＳｉｍＰＬＥコンパイラ２．５は
１つのネットリストにつきＮ個の命令を生成すると想定
する。ＦＰＧＡメモリ・バス幅がＢＷ、ＦＰＧＡクロッ
ク・サイクルがＦＣの場合、ユーザ・サイクルＵＣ及び
シミュレーション速度Ｒは以下によって得られる。Ｕ_Ｃ＝Ｎ×[Ｉ_Ｗ／Ｂ_Ｗ]×Ｆ_Ｃ（１）Ｒ＝1/Ｕ_ｃ（２）Next, the simulation speed is quantified. Assume that, when targeting a SimPLE architecture with instruction width I_W, the SimPLE compiler 2.5 generates N instructions for a netlist. If the FPGA-memory bus width is B_W and the FPGA clock cycle is F_C, the user cycle U_C and the simulation rate R are given by:
U_C = N × [I_W / B_W] × F_C   (1)
R = 1 / U_C   (2)
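As a worked example of equations (1) and (2), the short sketch below computes the user cycle and the simulation rate. It assumes that the square brackets in equation (1) denote rounding up to a whole number of bus transfers, which matches the time-slicing description but is not stated explicitly; the function names and the sample numbers are hypothetical.

```python
def user_cycle(n_instructions, instr_width, bus_width, fpga_cycle_s):
    """U_C = N * ceil(I_W / B_W) * F_C, in seconds (ceiling is an assumption)."""
    slices = -(-instr_width // bus_width)    # integer ceiling: FPGA cycles per instruction
    return n_instructions * slices * fpga_cycle_s

def simulation_rate(n_instructions, instr_width, bus_width, fpga_cycle_s):
    """R = 1 / U_C, in simulation vectors per second."""
    return 1.0 / user_cycle(n_instructions, instr_width, bus_width, fpga_cycle_s)

# Hypothetical numbers: 1000 instructions of 3080 bits, an 1100-bit bus, a 10 ns FPGA clock.
uc = user_cycle(1000, 3080, 1100, 10e-9)
rate = simulation_rate(1000, 3080, 1100, 10e-9)
print(f"user cycle = {uc * 1e6:.1f} us, rate = {rate:.0f} vectors/s")
```

Halving the number of instructions, or narrowing the instruction so that it fits in fewer bus transfers, raises the rate proportionally.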

【００７９】従って、シミュレーション速度は、(i)コ
ンパイラによって生成される命令の個数、(ii)命令幅、
及び(iii)ＦＰＧＡクロック・サイクル、を減少させる
ことにより高めることができる。Therefore, the simulation speed can be increased by reducing (i) the number of instructions generated by the compiler, (ii) the instruction width, and (iii) the FPGA clock cycle.

【００８０】超大規模回路のコンパイルによってオンボ
ード・メモリに収まらない命令が過剰に生成された場
合、命令は細かく分割されて個別にＤＭＡ処理される。
この処理は全体的なパフォーマンスに影響を及ぼすが、
ＳｉｍＰＬＥ２．６のスケーラビリティは維持される。
ただし、オンボード・メモリをアップグレードすれば、
パフォーマンスの損失を生じさせずにスケーラビリティ
を実現することができる。また、メモリ量を調整するこ
とにより、超大規模ネットリストをシミュレートするこ
とも可能になる。例えば、２５６ＭＢのＳＤＲＡＭを備
える基板であれば、５，０００万ゲート・ネットリスト
の命令をすべて保持できる。If compiling a very large circuit produces more instructions than fit in on-board memory, the instructions are split into smaller chunks that are transferred by DMA separately. This affects overall performance, but the scalability of SimPLE 2.6 is preserved. Upgrading the on-board memory, however, provides scalability without any loss of performance. Adjusting the amount of memory also makes it possible to simulate very large netlists. For example, a board with 256 MB of SDRAM can hold all of the instructions for a 50-million-gate netlist.

【００８１】本発明による手法（特にＳｉｍＰＬＥ）
の１つの目標は、汎用論理チップ（例えば、ＦＰＧＡ基
板など）を使用できる安価なハードウェア・アクセラ
レータを作成することである。この基板は市販のＦＰＧ
Ａ、メモリ、ＰＣＩインターフェースで構成されるの
で、事実上あらゆるコンピューティング・システムとの
互換性を有する「プラグ・アンド・プレイ」である。こ
の基板はメイン・メモリに直接アクセスできると想定さ
れるが、その動作はホスト２．４のＣＰＵによって制御
される。One goal of the present approach (and of SimPLE in particular) is to create an inexpensive hardware accelerator that can use general-purpose logic chips (for example, an FPGA board). Because the board consists of a commercially available FPGA, memory, and a PCI interface, it is "plug-and-play" compatible with virtually any computing system. The board is assumed to have direct access to main memory, but its operation is controlled by the CPU of host 2.4.

【００８２】図３は、本発明の技法のもう１つの例を示
す。ゲート回路（ＭＵＬＴＩ−ＭＩＬＬＩＯＮＧＡＴ
ＥＣＩＲＣＵＩＴ）３．２のコンパイル済み命令（Ｓ
ｉｍＰＬＥ命令）はシミュレーション・ベクトル・セッ
トと共に、ＤＭＡを使用してオンボード・メモリ２．１
に転送される。それ以降の各シミュレーション・ベクト
ルについては、すべての命令が１ユーザ・サイクル（す
なわち１シミュレーション・サイクル）に相当するＦＰ
ＧＡ２．２を介してストリーミングされ、対応する結果
ベクトルが基板メモリに戻されて格納される。すべての
シミュレーション・ベクトルの処理が完了したら、結果
ベクトルはＤＭＡ処理によりホストのメイン・メモリ
３．３に戻される。まだ試験ベクトルが残っている場合
は、この時点でそれらも同様にシミュレートできる。FIG. 3 shows another example of the technique of the present invention. The compiled instructions (SimPLE instructions) for a gate-level circuit (MULTI-MILLION GATE CIRCUIT) 3.2 are transferred, together with a set of simulation vectors, to on-board memory 2.1 using DMA. Then, for each simulation vector, all of the instructions are streamed through FPGA 2.2, which corresponds to one user cycle (that is, one simulation cycle), and the corresponding result vector is stored back in board memory. Once all simulation vectors have been processed, the result vectors are returned to the host's main memory 3.3 by DMA. If test vectors remain, they can be simulated in the same way at this point.

【００８３】超大規模回路のコンパイルによってオンボ
ード・メモリに収まらない命令が過剰になった場合は、
命令を細かく分割し、個別にＤＭＡ処理する。これによ
って全体的なパフォーマンスが影響されるが、ＳｉｍＰ
ＬＥ２．６のスケーラビリティは維持される。ただし、
オンボード・メモリ２．１をアップグレードすれば、パ
フォーマンスの損失を生じさせずにスケーラビリティを
実現することができる。例えば、２５６ＭＢのＤＲＡＭ
を備える基板であれば、２，０００万ゲート・ネットリ
ストのシミュレーションが可能である。If compiling a very large circuit produces more instructions than fit in on-board memory, the instructions are split into smaller chunks and transferred by DMA separately. This affects overall performance, but the scalability of SimPLE 2.6 is preserved. Upgrading on-board memory 2.1, however, provides scalability without any loss of performance. For example, a board with 256 MB of DRAM can simulate a 20-million-gate netlist.

【００８４】以降の項では、命令とシミュレーション・
ベクトルの転送プロセスと、ハードウェア・シミュレー
ションの実行に必要なインターフェース・ソフトウェア
について説明する。The following subsections describe how instructions and simulation vectors are transferred, and the interface software needed to run hardware simulation.

【００８５】ａ）命令の転送ほとんどのＳｉｍＰＬＥ構成は大規模なＶｉｒｔｅｘ−
２ＦＰＧＡに楽に収まるが、中には命令語が大きいも
のもある。例えば、プロセッサ×６４、レジスタ×６
４、レジスタ読み取りポート×２、１６Ｋメモリ・ブロ
ック×３２を有するシミュレーション・プロセッサは、
１個の命令につき３０８０ビットを必要とする。最大規
模Ｖｉｒｔｅｘ−２ＦＰＧＡのデータ・ピンアウトは
１１００程度である。そのため、命令を時分割して、複
数プロセッサ・サイクルでＦＰＧＡに転送する必要があ
る。ＨＤＬジェネレータがこの処理を担当し、命令の時
分割を可能にする専用ハードウェアを生成する。このエ
クストラ・ハードウェアはＳｉｍＰＬＥアーキテクチャ
の一部を成し、基板上に存在するＦＰＧＡパッケージに
固有なハードウェアとして使用される。a) Instruction Transfer: Most SimPLE configurations fit comfortably on a large Virtex-2 FPGA, but some have large instruction words. For example, a simulation processor with 64 processors, 64 registers, 2 register read ports, and thirty-two 16K memory blocks requires 3080 bits per instruction. The data pinout of the largest Virtex-2 FPGA is about 1100. The instruction must therefore be time-sliced and transferred to the FPGA over multiple processor cycles. The HDL generator takes care of this by generating dedicated hardware that enables instructions to be time-sliced. This extra hardware forms part of the SimPLE architecture and is specific to the FPGA package present on the board.
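The time-slicing of a wide instruction can be illustrated with the figures quoted above. This is a sketch only: it treats the data pinout as exactly 1100 bits, represents an instruction as a string of "0" and "1" characters, and uses hypothetical helper names; the real slicing is done by hardware produced by the HDL generator, whose details are not given here.

```python
def slice_instruction(bits, bus_width):
    """Split an instruction bit string into bus-width words (last word zero-padded)."""
    words = [bits[i:i + bus_width] for i in range(0, len(bits), bus_width)]
    words[-1] = words[-1].ljust(bus_width, "0")
    return words

def reassemble(words, instr_width):
    """Model of the on-FPGA logic: concatenate the words and drop the padding."""
    return "".join(words)[:instr_width]

instruction = "10" * 1540                  # a 3080-bit instruction (arbitrary bit pattern)
words = slice_instruction(instruction, 1100)
print(len(words))                          # 3 bus transfers per instruction
assert reassemble(words, 3080) == instruction
```

Under these assumptions a 3080-bit instruction needs three bus transfers, so the processor cycle would be three FPGA cycles.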

【００８６】ｂ）シミュレーション・ベクトルの転送シミュレート対象のネットリストの１次入力から成る値
セットは、シミュレーション・ベクトルを表す。ネット
リストの機能性を検証する際には、数個のシミュレーシ
ョン・ベクトルが使用されるのが一般的である。シミュ
レーションでは、各ベクトルについて出力ベクトルまた
は結果ベクトルが計算される。そのため、ＳｉｍＰＬＥ
２．６は３種類の「基板レベル」の命令を処理しなけれ
ばならない。この３種類とは、シミュレーション・ベク
トルを表す命令、ＳｉｍＰＬＥコンパイラ２．５によっ
て生成された実際のＳｉｍＰＬＥ命令を表す命令、及び
結果ベクトルの読み取りが行われる専用命令である。b) Transfer of Simulation Vectors: The set of values for the primary inputs of the netlist being simulated constitutes a simulation vector. Several simulation vectors are typically used to verify the functionality of a netlist. For each vector, simulation computes an output vector, or result vector. SimPLE 2.6 must therefore handle three kinds of "board-level" instructions: instructions that carry a simulation vector, instructions that carry an actual SimPLE instruction generated by the SimPLE compiler 2.5, and special instructions that read back a result vector.

【００８７】１次入力（ＰＩ：ｐｒｉｍａｒｙｉｎ
ｐｕｔ）は、オンボード・メモリからＳｉｍＰＬＥ２．
６内のローカル・スクラッチパッド・メモリに書き込ま
れ、要素プロセッサによってアクセスされる。同様に、
１次出力（ＰＯ：ｐｒｉｍａｒｙｏｕｔｐｕｔ）はＳ
ｉｍＰＬＥ２．６内の要素プロセッサ２．３１〜２．３
４によってスクラッチパッド・メモリに書き込まれ、オ
ンボード・メモリから読み出される。Primary inputs (PIs) are written from on-board memory into the local scratchpad memory inside SimPLE 2.6, where they are accessed by the element processors. Similarly, primary outputs (POs) are written into the scratchpad memory by element processors 2.31 to 2.34 in SimPLE 2.6 and are subsequently read out through the on-board memory.

【００８８】ゲート・レベルの大規模回路は、数百個の
シミュレーション・ベクトル・ビットを有する。これら
のシミュレーション・ベクトルの転送もやはり、時分割
する必要がある。命令を時分割する場合とは異なり、シ
ミュレーション・ベクトルに必要な時分割の度合いは、
ネットリストによって決まる。ＳｉｍＰＬＥアーキテク
チャはシミュレート対象のネットリストに非依存でなけ
ればならないので、シミュレーション・ベクトルを時分
割するための専用ハードウェアをＳｉｍＰＬＥ２．６上
に存在させることはできない。そのため、次項で説明す
るＳｉｍＰＬＥインターフェース・ソフトウェアがこの
時分割を担当する。各サイクルでは、入力シミュレーシ
ョン・ベクトルはオンボード・メモリからＳｉｍＰＬＥ
２．６の（ＦＰＧＡ２．２上の）スクラッチパッド・メ
モリに直接ロードされる。A large gate-level circuit has several hundred simulation vector bits. The transfer of these simulation vectors must also be time-sliced. Unlike the time-slicing of instructions, the degree of time-slicing needed for simulation vectors depends on the netlist. Because the SimPLE architecture must be independent of the netlist being simulated, dedicated hardware for time-slicing simulation vectors cannot reside on SimPLE 2.6. The SimPLE interface software, described in a later subsection, therefore handles this time-slicing. In each cycle, input simulation vector bits are loaded directly from on-board memory into the scratchpad memory of SimPLE 2.6 (on FPGA 2.2).

【００８９】スクラッチパッド・メモリにロードできる
最大ビット数は、総メモリ帯域幅に等しい。シミュレー
ション・ベクトルが最大メモリ帯域幅よりも長い場合
は、インターフェース・ソフトウェアがシミュレーショ
ン・ベクトルをメモリ帯域幅に等しい語数に分割する。
各シミュレーション・ベクトルには、それを識別する適
切な演算コードが付加される。The maximum number of bits that can be loaded into the scratchpad memory is equal to the total memory bandwidth. If a simulation vector is longer than the maximum memory bandwidth, the interface software splits it into words whose length equals the memory bandwidth. An appropriate opcode identifying it is attached to each simulation vector.

【００９０】１次出力についても同様の手順が行われ、
メモリ帯域幅に等しいペースでＦＰＧＡ２．２からオフ
ロードされる。The same procedure applies to the primary outputs, which are offloaded from FPGA 2.2 at a rate equal to the memory bandwidth.

【００９１】ｃ）ＳｉｍＰＬＥインターフェース・ソフ
トウェアインターフェース・ソフトウェアは、ユーザが指定した
シミュレーション・ベクトルとコンパイラによって生成
されたＳｉｍＰＬＥ命令を入力として、基板レベルの命
令を生成する。これらの命令は、ＦＰＧＡ２．２基板に
備えられるＡＰＩを使用して、ＤＭＡ処理によりオンボ
ード・メモリ２．１上に転送される。c) SimPLE Interface Software: The interface software takes the user-specified simulation vectors and the compiler-generated SimPLE instructions as input and generates board-level instructions. These instructions are transferred onto on-board memory 2.1 by DMA, using the API provided with the FPGA 2.2 board.

【００９２】基板レベルの命令は、入力シミュレーショ
ン・ベクトル、出力シミュレーション・ベクトル、実際
のシミュレーション・プロセッサ命令の３つを区別す
る。この３つのケースの区別は、３種類の演算コードを
使用して行われる。演算コード・ビットは、基板レベル
の命令を生成するために、入力シミュレーション・ベク
トル・ビットまたはＳｉｍＰＬＥ命令ビットの前に埋め
込まれる。演算コードが出力シミュレーション・ベクト
ルを示す場合は、残りの命令ビットはトライステート・
バスを使用してＳｉｍＰＬＥ２．６から読み取られる。Board-level instructions distinguish three cases: input simulation vectors, output simulation vectors, and actual simulation processor instructions. The three cases are distinguished using three different opcodes. To form a board-level instruction, the opcode bits are prepended to the input simulation vector bits or the SimPLE instruction bits. If the opcode indicates an output simulation vector, the remaining instruction bits are read from SimPLE 2.6 over a tri-state bus.
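The opcode scheme above can be sketched as follows. The two-bit opcode values, the bit-string representation, and the function name `board_program` are assumptions chosen for illustration; the text states only that three distinct opcodes are placed in front of the payload bits.

```python
# Hypothetical 2-bit opcodes; the actual encoding is not specified in the text.
OP_INPUT_VECTOR = "00"   # payload: primary-input bits bound for scratchpad memory
OP_SIMPLE_INSTR = "01"   # payload: one compiler-generated SimPLE instruction
OP_READ_OUTPUT = "10"    # no payload: the remaining bits are driven by SimPLE on readback

def board_program(input_vector_words, simple_instructions, n_output_words):
    """Build the board-level instruction stream for one simulation vector."""
    stream = [OP_INPUT_VECTOR + word for word in input_vector_words]
    stream += [OP_SIMPLE_INSTR + instr for instr in simple_instructions]
    stream += [OP_READ_OUTPUT] * n_output_words
    return stream

print(board_program(["1010"], ["1111", "0000"], 1))
# ['001010', '011111', '010000', '10']
```

The input words come first so that the primary inputs are in scratchpad memory before any SimPLE instruction runs, and the read-back opcodes come last, once the outputs have been computed.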

【００９３】インターフェース・ソフトウェアは、適切
な演算コード・ビットを埋め込むことに加えて、１次入
力及び出力ベクトルの編成も行う。シミュレーション・
ベクトルはユーザによって順に指定される。ただし、シ
ミュレーション・ベクトルはＳｉｍＰＬＥ２．６のスク
ラッチパッド・メモリ・ブロックに直接転送されるの
で、ビットはメモリ構成に基づいて再編成される。さら
に、ＳｉｍＰＬＥから送出されるＰＯも同様に再編成さ
れ、最終的な結果ベクトルが生成される。In addition to embedding the appropriate opcode bits, the interface software also organizes the primary input and output vectors. Simulation vectors are specified by the user in order. However, because the simulation vectors are transferred directly into the scratchpad memory blocks of SimPLE 2.6, their bits are reorganized according to the memory layout. The POs emitted by SimPLE are likewise reorganized to produce the final result vector.

【００９４】２．アーキテクチャここでは、単一の汎用ＦＰＧＡ２．２を使用して大規模
な設計をシミュレートする際に遭遇される問題に焦点を
当てる。ＦＰＧＡ２．２は通常、数百万個のゲートを有
するネットリストをエミュレートするのには小さすぎ
る。そのため、最初にネットリストをデバイスに収まる
サイズのパーティションに分割する必要がある。その後
は、ＦＰＧＡ２．２を繰り返し再構成することにより、
パーティションが順次シミュレートされる。このソリュ
ーションはネットリストのサイズに合わせて拡張可能で
あるが、（構成帯域幅が小さいことにより）ＦＰＧＡ
２．２内の再構成オーバーヘッドが増大するため、実用
には適さない。2. Architecture: This section focuses on the problems encountered when simulating a large design using a single general-purpose FPGA 2.2. FPGA 2.2 is usually too small to emulate a netlist with millions of gates. The netlist must therefore first be divided into partitions small enough to fit on the device. The partitions are then simulated one after another by repeatedly reconfiguring FPGA 2.2. This solution scales with netlist size, but it is impractical because of the large reconfiguration overhead of FPGA 2.2 (caused by its small configuration bandwidth).

【００９５】The present invention solves this configuration bandwidth problem by introducing the concept of a simulation processor for logic simulation (SimPLE). SimPLE 2.6 is a virtual concept to which a netlist is compiled. Once configured on the FPGA, SimPLE 2.6 is programmed for different designs (or different parts of a design) using the SimPLE compiler 2.5. Because instructions for SimPLE use the data I/O pins of the FPGA 2.2, the small configuration bandwidth has no effect.

【００９６】(1) SimPLE architecture
SimPLE 2.6 is based on the VLIW architecture model. Such an architecture can exploit the abundant inherent parallelism of gate-level netlist simulation. FIG. 4 shows a template of SimPLE 2.6. The template consists of a large array of very simple, interconnected functional units, namely element processors 2.31 to 2.34.

【００９７】Each element processor 2.31 to 2.34 can simulate an arbitrary two-input gate, so many gates can be evaluated simultaneously in each cycle. The template includes a distributed register file system 4.2 for storing intermediate signal values, which provides very high accessibility at high clock speeds. Because the number of registers is limited by hardware constraints (the FPGA 2.2 does not have a large number of registers), a distributed memory system 4.1 is introduced as a second level of the memory hierarchy so that registers can be spilled. Registers can thus be loaded from and stored to memory. Reference numeral 4.3 denotes a crossbar.
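As an illustration of how one element processor can stand in for any two-input gate, the sketch below models a PE as a 4-bit truth-table lookup and evaluates several PEs against the same register snapshot in one cycle. The names `make_gate` and `evaluate_cycle` and the truth-table encoding are assumptions made for this example; the specification does not describe the PE at this level.

```python
def make_gate(truth_table: int):
    """Build a 2-input gate from a 4-bit truth table.

    Bit k of truth_table is the output for inputs (a, b) with k = 2*a + b.
    (Hypothetical encoding, chosen for this sketch only.)
    """
    def gate(a: int, b: int) -> int:
        return (truth_table >> ((a << 1) | b)) & 1
    return gate


AND = make_gate(0b1000)
OR = make_gate(0b1110)
XOR = make_gate(0b0110)
NAND = make_gate(0b0111)


def evaluate_cycle(ops, registers):
    """Evaluate one cycle: each entry of ops is (gate, src_a, src_b) for one PE.

    All PEs read the same register snapshot, mimicking parallel evaluation.
    """
    return [gate(registers[a], registers[b]) for gate, a, b in ops]


outputs = evaluate_cycle([(AND, 0, 1), (OR, 0, 1), (XOR, 0, 1), (NAND, 0, 1)], [1, 1])
```

With both source registers holding 1, the four PEs above produce the AND, OR, XOR, and NAND of the same pair of values in a single step.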

【００９８】The presence of multiple memory banks also allows fast simultaneous accesses. Only the total memory size limits the number of intermediate signal values that can be stored, and in recent FPGAs 2.2 this size can be very large. For example, the total size of the block RAM in a large Virtex-II is about 3.5 million bits. FIG. 5 shows the maximum number of intermediate values required by typical netlists under an ASAP schedule, assuming no resource constraints. The memory available on the FPGA far exceeds the maximum memory required to store intermediate values. The scheme of the present invention therefore provides a scalable, fast, and inexpensive solution to the problem of single-FPGA logic simulation.

【００９９】In summary, SimPLE 2.6 is characterized by the following parameters.

【０１００】・The number of element processors (PEs). Each element processor may be a single gate or a more complex gate (for example, a combination of NAND, NOR, OR, NOR, and so on). This number is referred to here as the "width" of SimPLE.

【０１０１】・The number of registers in each register file. In the implementation according to the invention, the register files are distributed, with each element processor having its own register file. Such a distributed register file system allows faster access than a large, general-purpose multiported register file.

【０１０２】・The number of read ports on each register file.

【０１０３】・The size of each memory bank.

【０１０４】・The span, or number of ports (in terms of PEs), of each memory bank. The number of ports on a memory bank equals the number of PEs, that is, the bank span, so each of those PEs can access the memory bank simultaneously.

【０１０５】・The size of each memory word. This is the unit of memory access.

【０１０６】・The memory latency, that is, the number of cycles required to perform one memory load or memory store.

【０１０７】・The interconnect latency. This refers to the extra registers inserted to pipeline the interconnect between two PEs (shown as crossbar 4.3 in FIG. 4). When an instance of SimPLE 2.6 is placed and routed on the FPGA 2.2, the interconnect frequently lies on the critical path. Inserting registers therefore raises the overall clock speed at some cost in computational efficiency.
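The configurable parameters listed above can be gathered into a single record, which is how a compiler might describe one SimPLE instance. The sketch below is only an illustration: the class name `SimPLEConfig`, the field names, the derived `num_banks` formula, and the sample values are all assumptions introduced here, not figures from the specification.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class SimPLEConfig:
    num_pes: int               # number of element processors (the "width")
    regs_per_file: int         # registers in each per-PE register file
    read_ports: int            # read ports on each register file
    bank_size_words: int       # size of each memory bank, in words
    bank_span: int             # PEs (ports) served by each memory bank
    word_size_bits: int        # size of a memory word, the unit of access
    memory_latency: int        # cycles per memory load or store
    interconnect_latency: int  # pipeline registers on the interconnect

    @property
    def total_registers(self) -> int:
        return self.num_pes * self.regs_per_file

    @property
    def num_banks(self) -> int:
        # Assumption: banks tile the PE array, each serving bank_span PEs.
        return ceil(self.num_pes / self.bank_span)

    @property
    def total_memory_bits(self) -> int:
        return self.num_banks * self.bank_size_words * self.word_size_bits


# Illustrative values only.
cfg = SimPLEConfig(num_pes=64, regs_per_file=32, read_ports=2,
                   bank_size_words=256, bank_span=8, word_size_bits=8,
                   memory_latency=2, interconnect_latency=1)
```

A library of such records would give a compiler something concrete to choose between when it trades registers and memory against PE count.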

【０１０８】Unlike the configurable parameters above, the following characteristics of SimPLE 2.6 are fixed.

【０１０９】・Each PE is a simple two-input gate.

【０１１０】・Each register file can be written only by its own element processor, or directly from memory during a "memory load".

【０１１１】・Each register file has one extra read port, through which memory stores are performed.

【０１１２】・A full interconnect (crossbar 4.3) connects every read port of every register file (except the read port used for memory stores) to the inputs of every PE in the system.

【０１１３】(2) Advantages of SimPLE
SimPLE 2.6 has several inherent advantages over software cycle-based simulation and over hardware emulators, whether FPGA-based or not.

【０１１４】a) Parallelism
In SimPLE 2.6, multiple element processors execute simultaneously within a single cycle, which exploits the large amount of parallelism present in cycle-based simulation. This is not possible with a conventional processor, that is, with a software implementation.

【０１１５】b) Access to registers and memory
The architecture model of the simulation processor provides easy access to far more registers than a conventional CPU can access. This is important because registers can be accessed in a single cycle. In addition, because the memory banks are close by, memory accesses are fast when register spills occur.

【０１１６】c) Configurability
Because SimPLE 2.6 is a virtual architecture configured on a general-purpose FPGA 2.2, the compiler has the flexibility to target the most suitable SimPLE configuration. For example, some applications may need more registers and memory, while others may benefit from more element processors. To handle these cases, several SimPLE configurations can be compiled into a single library, from which the compiler selects the best one. This scheme also removes the burden of running the FPGA 2.2 place-and-route process each time.

【０１１７】d) Scalability
Like a software solution, SimPLE 2.6 is transparent to the size of the netlist. The netlist is compiled into a set of instructions, and any number of instructions can be executed on SimPLE 2.6. A larger SimPLE 2.6 gives better performance, but a small SimPLE 2.6 can still simulate the netlist.

【０１１８】e) Configuration bandwidth
By using the data I/O pins for instructions, SimPLE 2.6 avoids the problem caused by the small configuration bandwidth of FPGAs.

【０１１９】f) Netlist partitioning
If the netlist is too large, it is divided so that each part fits in the board memory and the parts are transferred individually, which preserves scalability.

【０１２０】The number of instructions generated grows with the size of the netlist. For a large netlist there may be too many instructions to fit in the board memory, but this does not prevent simulation. The simulation proceeds as follows.

【０１２１】The instruction set is divided into subsets, each small enough to fit in the board memory. Dividing the instructions in this way is equivalent to partitioning the netlist itself. The instruction subsets are transferred to the board memory individually by DMA. As the first subset is streamed through the FPGA 2.2, the corresponding part of the netlist is simulated. The second subset then replaces the first in the board memory and the process continues. The state of the netlist being simulated is preserved between subsets.

【０１２２】Example: A large instruction set I is divided into I1 and I2, each of which fits in the board memory. First, the simulation vector set T and I1 are transferred to the board memory by DMA. All instructions in I1 are streamed through the FPGA 2.2 for the first simulation vector t1 in T. Next, I2 is transferred to the board memory by DMA, replacing I1, and all instructions in I2 are streamed through the FPGA 2.2. This completes the simulation of vector t1. Note that performing DMA during the simulation affects performance, but the scalability of the approach is preserved.
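The instruction-subset scheme can be sketched in software as follows. This is a simplified model under stated assumptions: an instruction is represented as a hypothetical tuple `(dst, op, src_a, src_b)`, the netlist state is a dictionary of signal values, and each loop iteration in `simulate_vector` stands in for one DMA transfer of a subset into board memory. None of these names come from the specification.

```python
from operator import and_, or_, xor


def partition(instructions, capacity):
    """Split the instruction list into subsets of at most `capacity` entries."""
    return [instructions[i:i + capacity]
            for i in range(0, len(instructions), capacity)]


def stream(subset, state):
    """Execute one subset; `state` persists across subsets."""
    for dst, op, a, b in subset:
        state[dst] = op(state[a], state[b])


def simulate_vector(instructions, vector, capacity):
    """Simulate one input vector, one board-memory-sized subset at a time."""
    state = dict(vector)
    for subset in partition(instructions, capacity):  # one "DMA transfer" each
        stream(subset, state)
    return state


instrs = [("n1", and_, "a", "b"), ("n2", xor, "n1", "c"), ("out", or_, "n2", "a")]
t1 = {"a": 1, "b": 1, "c": 0}
split_result = simulate_vector(instrs, t1, capacity=2)   # I1 = 2 instrs, I2 = 1
whole_result = simulate_vector(instrs, t1, capacity=3)   # everything fits
```

Because the state dictionary is carried from one subset to the next, splitting the instruction set changes how many transfers occur but not the simulated result.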

【０１２３】g) Partitioning of simulation vectors
A large simulation vector set can be divided into smaller blocks, each of which is simulated individually on the board. For simulation to proceed, both the simulation vectors and the instructions must fit in the board memory. The invention of claim 1 addresses the case in which the instructions do not fit in memory.

【０１２４】If the simulation vectors do not fit, they can be divided into blocks and each block simulated individually. For example, if a design has 1,000,000 vectors but the on-board memory can hold only 500,000 vectors (in addition to the instructions), the simulation vector set is divided into two blocks of 500,000 vectors each, and each block is simulated individually. This does not significantly degrade performance.
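The blocking arithmetic in this example can be written out as a short helper. The function name `vector_blocks` and the choice to return index ranges are assumptions made for this sketch.

```python
def vector_blocks(num_vectors, capacity):
    """Return (start, end) index ranges, end exclusive, one per block.

    `capacity` is the number of vectors the board memory can hold
    alongside the instructions.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [(start, min(start + capacity, num_vectors))
            for start in range(0, num_vectors, capacity)]


blocks = vector_blocks(1_000_000, 500_000)
```

For the figures in the text this yields two blocks of 500,000 vectors each; a vector count that is not a multiple of the capacity simply leaves a shorter final block.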

【０１２５】h) Register visibility
The primary outputs of a simulation do not reflect the state of internal registers. To make internal registers visible, the technique of the invention loads and stores them from specific locations in the SimPLE memory. After simulation, board-level instructions extract the register values from those memory locations. Two points should be noted. (a) The location at which a register actually resides in the SimPLE 2.6 memory is unimportant; any location can be used. As long as the compiler and tools know where the register is stored, its value can be extracted with a board-level instruction and made visible. (b) Board-level instructions are distinct from the instructions generated by the compiler. They perform four functions: (i) placing simulation vectors into the FPGA 2.2, (ii) placing compiler instructions into the FPGA 2.2, (iii) retrieving results from the FPGA, and (iv) retrieving register values from the FPGA 2.2.

【０１２６】i) Interface to a general-purpose simulator
The simulation processor can interface with a general-purpose software simulator. In the approach of the invention, this is done by swapping the state of the design. For example, partway through an event-driven simulation in a software simulator, the user can swap the entire state of the circuit being simulated into SimPLE 2.6, run functional simulation for a large number of vectors, and then swap the final state back into the software simulator. SimPLE 2.6 can thus act as a transparent back-end accelerator for the software simulator.

【０１２７】Note that the state swap can be carried out using the register visibility technique.

【０１２８】j) Two-valued and four-valued simulation
To perform four-valued simulation, each wire in the simulation processor is made 2 bits wide. A 2-bit wire can represent the four states 0, 1, X, and Z. The overall architecture of the simulation processor is unchanged.
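A small sketch of what a 2-bit wire could look like in software follows. The specific bit codes for 0, 1, X, and Z are assumptions (the text does not give an encoding), and the gate semantics follow the common HDL convention in which a high-impedance input to a gate is treated as unknown.

```python
# Hypothetical 2-bit codes for the four signal states.
ZERO, ONE, X, Z = 0b00, 0b01, 0b10, 0b11


def and4(a, b):
    """Four-valued AND: a 0 on either input dominates; otherwise unknown unless both are 1."""
    if a == ZERO or b == ZERO:
        return ZERO
    if a == ONE and b == ONE:
        return ONE
    return X


def not4(a):
    """Four-valued NOT: X and Z both yield X."""
    if a == ZERO:
        return ONE
    if a == ONE:
        return ZERO
    return X


def nand4(a, b):
    return not4(and4(a, b))
```

Only the per-gate evaluation changes; the array of PEs, the register files, and the interconnect keep the same shape with wider wires.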

【０１２９】3. Architecture for RTL circuits
As shown in FIG. 22, the approach of the invention extends readily to RTL circuits. The simulation processor architecture for accelerating the simulation of RT-level circuits includes an array of arithmetic logic units (ALUs), one of which is shown as 22.1, each b bits wide and capable of addition, subtraction, sign extension, comparison, and bitwise Boolean operations.

【０１３０】It also includes an array of signed multipliers (one of which is shown as 22.2), each producing a b-bit result. A distributed register file system 22.3 is located close to the element processors. This distributed register file system 22.3 has a small number of read and write ports and an access time equal to the interconnect latency. A main interconnect system 22.4, consisting of b-bit crossbar lines, connects all of the distributed register files. For each ALU there is a separate bit-wide carry register file 22.5 that holds the carry values produced by ALU operations. A pipelined carry-chain crossbar interconnect system 22.6 connects the bit-wide carry register files 22.5 to one another, allowing pipelined carry propagation between ALUs. A distributed memory system (distributed register file system 22.3) is placed close to the ALUs 22.1. The interface from this architecture to external memory resides on the board and consists of instructions and opcodes that specify the reading and writing of vectors and operations.

【０１３１】4. Compiler
(1) Definitions
Before the compiler is described in detail, several commonly used terms are defined.

【０１３２】A "design" is the gate-level netlist to be simulated. A design may be a completely self-contained piece of hardware, or it may be part of a larger netlist whose simulation needs to be accelerated. A set of values for the primary inputs of the design constitutes a simulation vector. Several simulation vectors are typically used to verify the functionality of a design. For each vector, one output vector, or result vector, is obtained.

【０１３３】A design is represented by a directed graph. The nodes of the graph correspond to hardware functional blocks in the design. A node may have multiple inputs but at most one output. Input ports of the design are nodes with no inputs, and output ports of the design are nodes with no outputs. Wires (also called "nets") interconnect the nodes. Each wire has a single source (driver) and multiple destinations (fanouts), called "pins".

【０１３４】In the description of the compiler, a node is said to be "scheduled" when it is assigned to a particular functional resource (element processor). For a node to be scheduled, an element processor (PE) must be free to perform the node's operation, and at least one register accessible to that PE must be free to store the node's output. In addition, the node's inputs must be successfully connected to their sources using the register file interconnect and the register ports. This last requirement is called "input routing".

【０１３５】A node is always scheduled after all of its sources have been scheduled in earlier steps. Specifically, if the interconnect latency is L, a node can be scheduled in the current step only if all of its sources were scheduled at least L time steps earlier.

【０１３６】A node is said to be "ready" at a time step if it can be scheduled at that time step. In general, a node is ready if all of its sources have been scheduled in earlier steps. In SimPLE, however, which has interconnect and memory latency constraints, further conditions must be met for a node to be ready. With interconnect latency IL and memory latency ML, a node N is considered ready at time step T only if the following conditions hold.

【０１３７】・Every source node of N has already been scheduled at some time Ts, where T >= Ts + IL.

【０１３８】・For every source node of N that is loaded from memory, the load was performed at some time step Tls, where T >= Tls + IL + ML.
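The two readiness conditions translate directly into a predicate. In the sketch below, `sched_time` maps each scheduled node to its time step Ts and `load_time` maps each reloaded source to the time step Tls of its load; both dictionaries, and the function name `is_ready`, are assumptions introduced for illustration.

```python
def is_ready(t, sources, sched_time, load_time, il, ml):
    """Return True if a node with the given sources is ready at time step t.

    il: interconnect latency, ml: memory latency.
    """
    for s in sources:
        # Condition 1: the source was scheduled at Ts with t >= Ts + IL.
        if s not in sched_time or t < sched_time[s] + il:
            return False
        # Condition 2: a source reloaded from memory at Tls needs t >= Tls + IL + ML.
        if s in load_time and t < load_time[s] + il + ml:
            return False
    return True


sched_time = {"a": 0, "b": 3}
load_time = {"a": 4}          # "a" was spilled and reloaded at step 4
```

With IL = 1 and ML = 2, a node fed by "a" and "b" is not ready at step 6, because the reload of "a" completes its latency only at step 7.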

【０１３９】The set of nodes that are ready at any point in the scheduling process is called the "ready front". The ready front consists of two types of nodes. The first type is the set of nodes whose sources are live registers. The second type is the set of nodes some of whose source registers have been spilled to memory. Such nodes are called "nodes with stored inputs".

【０１４０】The length of a schedule is its total number of time steps, which is also the number of instructions generated. For a given design and set of compiled instructions, the "utilization" is the fraction of processors in the schedule that are performing an operation, a memory load, or a memory store. Because architectural constraints force some processors to be idle, utilization is always below 100%.
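Utilization as defined here is a simple ratio of busy PE slots to all PE slots in the schedule. The sketch below assumes a schedule is stored as a list of time steps, each a list with one entry per PE and `None` marking an idle PE; that representation is hypothetical.

```python
def utilization(schedule, num_pes):
    """Fraction of PE slots doing an operation, a memory load, or a memory store."""
    if not schedule:
        return 0.0
    busy = sum(1 for step in schedule for slot in step if slot is not None)
    return busy / (len(schedule) * num_pes)


# Two time steps (two instructions) on a 4-PE machine; None marks an idle PE.
schedule = [["op", "op", "load", None],
            ["op", None, None, "store"]]
u = utilization(schedule, num_pes=4)
```

Here five of the eight slots are busy, giving a utilization of 0.625, and the schedule length is 2.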

【０１４１】(2) Scheduling algorithm
The compiler schedules the design under resource constraints. It maps nodes onto element processors and the wires interconnecting the nodes onto registers. Registers are allocated so as to minimize overall register usage while obeying the register port constraints. When a register file is full, the compiler chooses registers to spill to memory; these registers are reloaded when needed. The scheduling algorithm is deterministic and very fast (reference <10>).

【０１４２】First, the netlist is topologically sorted, and buffers are then inserted at several places to resolve constraints, as described in detail in section f) below. The nodes are then scheduled into individual instructions. FIG. 6 shows the overall flow of the algorithm. The following sections describe each part of the flow.
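The first step of the flow, a topological sort of the netlist graph, can be sketched with Kahn's algorithm. The text does not say which sorting algorithm the compiler uses, so this choice is an assumption, as is the representation of the netlist as a dictionary mapping each node to the list of nodes driving its inputs.

```python
from collections import deque


def topological_sort(sources):
    """Order nodes so that every node appears after all of its sources.

    sources: dict mapping each node to the list of nodes that drive its inputs.
    """
    fanouts = {n: [] for n in sources}
    pending = {n: len(srcs) for n, srcs in sources.items()}
    for n, srcs in sources.items():
        for s in srcs:
            fanouts[s].append(n)

    queue = deque(n for n, count in pending.items() if count == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for f in fanouts[n]:
            pending[f] -= 1
            if pending[f] == 0:
                queue.append(f)

    if len(order) != len(sources):
        raise ValueError("netlist contains a combinational cycle")
    return order


netlist = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "a"], "out": ["g2"]}
order = topological_sort(netlist)
```

Input ports (nodes with no sources) come first, and each gate follows the gates that drive it, which is the order in which a scheduler can consider nodes.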

【０１４３】a) Node scheduling
Compilation schedules every node in the design while obeying all of the architectural constraints. Scheduling a node consists of the following steps.

【０１４４】・Node selection. A node is selected from the ready front for scheduling. This choice affects the order in which future nodes are selected and is critical to obtaining a compact schedule.

【０１４５】・Input routing. A node in the ready front can be scheduled at a particular time step only if all of its inputs can be routed. Whether a value stored in a register file can be routed to a PE input depends on the interconnect and on the number of register read ports available. With a full crossbar interconnect, data can be transferred directly between the register file of one PE and the input of another. Because the number of register ports is limited, however, the number of values that can be read from a given register file in one time step is also limited.

【０１４６】・PE assignment. Once input routing is complete, the node is scheduled on the element processor that has the fewest registers in use. This is a greedy scheme that aims to minimize register utilization.

【０１４７】・Register allocation. After PE assignment, a free register in the register file of the element processor on which the node is placed is allocated to store the node's output. A free register is guaranteed to be available, because a node is never assigned to a PE that has no free register.
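The PE assignment and register allocation steps above can be sketched as a greedy choice over per-PE register counts. The function names, the `used` list (registers in use per PE), and the `busy` set (PEs already taken in the current time step) are assumptions made for this illustration; input routing is assumed to have succeeded already.

```python
def assign_pe(used, capacity, busy):
    """Pick the free PE with the fewest registers in use, or None.

    A PE whose register file is full is never chosen, which is what
    guarantees a free register for the node's output.
    """
    candidates = [p for p in range(len(used))
                  if p not in busy and used[p] < capacity]
    if not candidates:
        return None
    return min(candidates, key=lambda p: used[p])


def schedule_node(used, capacity, busy):
    """Assign a PE for this time step and allocate one register for the output."""
    pe = assign_pe(used, capacity, busy)
    if pe is not None:
        busy.add(pe)
        used[pe] += 1
    return pe


used = [3, 1, 4, 1]   # registers in use in each PE's file (capacity 4)
busy = set()
picks = [schedule_node(used, 4, busy) for _ in range(4)]
```

In this run PE 2 is skipped because its register file is full, and the fourth request returns `None` because no PE remains free in the time step.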

【０１４８】b) Node selection heuristic
The goal of the technique is to minimize the schedule length and maximize utilization together, while keeping the selection process fast by means of heuristics. The more nearly optimal the node selection heuristic, the longer the compiler takes to run.

【０１４９】The technique evaluates a node N as a scheduling candidate by looking at the following two properties.

【０１５０】・The number of registers freed by scheduling N. Prioritizing nodes that free many registers is a simple greedy strategy for minimizing register utilization.

【０１５１】・The fanout of N. Nodes with large fanout increase the number of nodes that can be scheduled at future time steps.

【０１５２】Nodes that free many registers and have high fanout are therefore given priority. FIG. 7 illustrates the node selection process.
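A minimal sketch of this selection rule follows. The text names the two criteria but not how they are weighted against each other, so the lexicographic ordering used here (registers freed first, fanout as tie-breaker) is an assumption, as are the function and parameter names.

```python
def select_node(ready_front, regs_freed, fanout):
    """Pick the ready node that frees the most registers; break ties by fanout."""
    return max(ready_front, key=lambda n: (regs_freed[n], fanout[n]))


ready_front = ["n1", "n2", "n3"]
regs_freed = {"n1": 1, "n2": 2, "n3": 2}
fanout = {"n1": 9, "n2": 1, "n3": 4}
chosen = select_node(ready_front, regs_freed, fanout)
```

Here "n2" and "n3" both free two registers, and "n3" wins on fanout; "n1" loses despite its large fanout because it frees only one register.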

【０１５３】c) Storing registers to memory
A node cannot be scheduled in a time step if no free register is available for it. A time step may also be empty if the nodes in the ready front do not satisfy the interconnect latency constraint. In these situations, a store operation is scheduled on each free element processor whose register file is full.

【０１５４】A live register is released from a full register file by storing its value in scratchpad memory. Such a live register holds the output of an already scheduled node N, some of whose fanouts remain to be scheduled. Here N is chosen simply according to the number of its fanout nodes that are in the ready front. The first available node with no fanouts in the ready front is stored. If no node in the register file satisfies this condition, the node with the fewest fanouts in the ready front is selected and stored to memory. FIG. 8 shows the process of storing registers.
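The victim-selection rule for a full register file can be sketched as below. The data structures (a list of live registers named after the nodes that produced them, a fanout map, and a set for the ready front) and the function name `select_spill` are assumptions made for this illustration.

```python
def select_spill(live_registers, fanouts, ready_front):
    """Choose which live register to store to scratchpad memory.

    Prefer the first register whose producing node has no fanout in the
    ready front; otherwise take the one with the fewest such fanouts.
    """
    def ready_fanouts(n):
        return sum(1 for f in fanouts[n] if f in ready_front)

    for n in live_registers:
        if ready_fanouts(n) == 0:
            return n
    return min(live_registers, key=ready_fanouts)


live = ["r1", "r2", "r3"]
fanouts = {"r1": ["a", "b"], "r2": ["c"], "r3": ["a", "d"]}
victim_a = select_spill(live, fanouts, ready_front={"a"})
victim_b = select_spill(live, fanouts, ready_front={"a", "c"})
```

When only "a" is ready, "r2" has no ready fanout and is stored. When "a" and "c" are both ready, every register has one ready fanout and the first minimum, "r1", is taken.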

【０１５５】d) Loading registers from memory
If an input of a node N has been scheduled and temporarily stored in memory, it must be loaded before N can be scheduled. Nodes with stored inputs are selected when element processors remain available after all candidate nodes in the ready front without stored inputs have been scheduled. The inputs of the selected node are loaded from memory, and the node becomes a candidate for scheduling at a future time step. A node N is selected from the list of ready nodes with stored inputs according to the following factors.

【０１５６】・The number of registers freed by placing N. The larger this number, the better a candidate N is for having its inputs loaded and being scheduled.

【０１５７】・The number of ready fanouts of the stored input. This number directly affects how many nodes can be scheduled once the input is loaded. A stored node with many ready nodes among its fanouts is a strong candidate for loading.

【０１５８】FIG. 9 shows the process of loading the inputs of nodes in the ready front. The load is scheduled first, and the nodes that become ready are then scheduled in future time steps.

【０１５９】e) Handling user-specified registers
Registers in the netlist being simulated require special handling. The present technique distinguishes between user cycles and processor cycles, in a manner similar to the definitions given in reference <16>.

【０１６０】A processor cycle refers to the rate at which SimPLE 2.6 operates. It can be defined as the time required to complete one SimPLE instruction. It is identical to the clock cycle of SimPLE 2.6 on FPGA 2.2, except when the instruction word is time-multiplexed (that is, when a SimPLE instruction has more bits than there are FPGA data I/O pins). When the instruction word is time-multiplexed, the effective speed of operation decreases. For example, suppose a netlist is compiled into N instructions, the instruction word size is I, the available pinout of FPGA 2.2 is P, and the clock speed of FPGA 2.2 is C. The time-multiplexing factor is then F = I/P, and the processor clock speed is C/F. A user cycle, by contrast, is the time required to completely simulate the netlist for one vector. In the example above, the user clock speed is C/(F*N).

【０１６１】When an input of a gate G in the netlist is a user register, the value to be used in evaluating the gate is the register's value from the previous user cycle. When the register is the output of a gate G in the netlist, the value to be stored in the register is the value computed by G in the current user cycle. However, wherever the register value from the previous user cycle is needed, that value must still be available during the current user cycle. A user register R is therefore scheduled as follows.

【０１６２】・R is split into two nodes, D_R and Q_R. D_R represents the input of R, and Q_R represents the output of R.

【０１６３】・A scheduling constraint is imposed on D_R: D_R must be scheduled at a later time step than Q_R.

【０１６４】・When D_R is scheduled, the value of its input is stored to memory. This is the value of R for the current user cycle (to be used in the next user cycle).

【０１６５】・When Q_R is scheduled, the value of its input is loaded from memory. This is the value of R from the previous user cycle (used in the current user cycle). FIG. 10 shows how the compiler handles user registers.
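The effect of splitting R into Q_R and D_R can be illustrated with a toy simulation. This is only a sketch under assumptions: the one-entry `memory` dictionary stands in for the scratchpad slot that holds R, and the register's input is taken to be NOT(R) purely as an example circuit.

```python
def simulate_toggle(cycles, initial=0):
    """Simulate a register R whose input is NOT(R) for some user cycles.

    Returns the value of R observed in each user cycle.
    """
    memory = {"R": initial}  # hypothetical scratchpad slot for R
    observed = []
    for _ in range(cycles):
        # Q_R is scheduled first: load R's value from the previous user cycle.
        q_r = memory["R"]
        observed.append(q_r)
        # Combinational logic evaluated within this user cycle.
        d_input = 1 - q_r
        # D_R is scheduled at a later time step: store the new value of R,
        # which becomes visible only in the next user cycle.
        memory["R"] = d_input
    return observed


print(simulate_toggle(4))  # [0, 1, 0, 1]
```

Because the store for D_R happens after the load for Q_R, every gate evaluated in a user cycle sees the previous value of R, which is the behavior the scheduling constraint is meant to guarantee.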

【０１６６】f) Handling primary inputs (PIs) and primary outputs (POs)
Gate-level designs can have a large number of PIs and POs, sometimes amounting to thousands of bits. To speed up the loading of PIs and the storing of POs, individual bits are not addressed to arbitrary locations in the SimPLE memory. Instead, all PIs are loaded sequentially from consecutive memory locations. Likewise, all POs are stored sequentially to consecutive memory locations. In addition, when loading from or storing to memory outside FPGA 2.2 (that is, the board memory), the PIs and POs are grouped (by external software) into words of the memory word size (the unit in which memory can be read and written). This allows loads and stores to proceed at one word per cycle, which is far faster than loading individual bits.

【０１６７】These assumptions simplify the input-output interface of SimPLE 2.6, but they also impose constraints on the compiler. First, the compiler is more tightly constrained in where it can place PIs and POs, because the scratchpad memory is divided into banks. Each bank spans a narrow range of PEs and can be accessed only by those PEs. The compiler must therefore assign each PI or PO to a specific memory bank based on its index.

【０１６８】Furthermore, because a PO represents a memory store, it must be placed in the same PE as its direct source (albeit at a later time step) so that the register can be stored. Since the PO must also be stored in a specific memory bank, this imposes a constraint on the PO's direct source: the source must be placed within reach of the particular memory bank in which the PO is stored.

【０１６９】These constraints can make some netlists unschedulable. For example, when a PI is shorted directly to a PO (which can occur in some netlists after optimization), the PI and PO may be forced into different memory banks because their indices differ. Such anomalies are resolved by inserting buffers. This costs some resources but increases scheduling flexibility.

【０１７０】Within SimPLE 2.6, the PIs and POs are organized into memory banks as illustrated in FIG. 11. Each memory bank has a portion dedicated to PIs, a portion dedicated to POs, and a general-purpose portion used for register spills during simulation. This organization allows each PE to read primary input bits (or write primary output bits) at the maximum memory bandwidth. It also avoids addressing bits to arbitrary memory locations. Assembling the PIs is easily done by the interface software.

【０１７１】(3) Compilation results and analysis
The results are analyzed using a mix of industry-standard, ISCAS, and other representative benchmarks. Specifically, the results in this work use four industry-standard benchmarks (NEC1-4), the integer and microcode units (IU and UCODE) of the PicoJava(R) processor, and six large gate-level combinational and sequential netlists selected from ISCAS89, ITC99 (reference <20>), a common bus, and a USB controller. The benchmarks range in size from 31,000 to 430,000 two-input gates.

【０１７２】a) Storage requirements
Registers and memory are used to store the temporary values generated during simulation. If the registers and memory are insufficient, a circuit with too many temporary values cannot be simulated using SimPLE 2.6. In recent FPGAs, however, the amount of memory is very large. FIG. 12 shows that the storage required when targeting a SimPLE architecture with 48 processors, 64 registers, and 2 read ports per register file is comfortably provided by the memory available on FPGA 2.2.

【０１７３】b) Complexity of instruction generation
For a netlist with n nodes, the number of nodes in the ready front is O(n). To select one node from the ready front, the heuristics described under (2) Scheduling algorithm, b) Node selection heuristics, require the number of registers freed, the fanout, and the number of fanouts in the ready front. All of these can be computed in advance. Selecting one node therefore takes O(n) time. The present technique reduces this to constant time as follows. At the start of a time step, the heuristic values of all nodes in the ready front are precomputed, and the nodes are inserted into a table indexed by heuristic value. The i-th entry of the table contains all nodes in the ready front whose heuristic value is i. Selecting a node therefore takes O(1) time. FIG. 13 shows the speed of the compiler running on a 440 MHz UltraSparc 10.

【０１７４】c) Effect of SimPLE parameters on compilation efficiency
Next, the effect of key SimPLE parameters on the number of instructions generated by the compiler is evaluated. The size of each memory bank was fixed at 16 Kbits and the memory word size at 4 bits, both of which can be accommodated by the block RAMs on a Virtex-II FPGA. The memory and interconnect latencies were varied according to the instruction size. Pipelining the interconnect and memory improves the clock speed of FPGA 2.2 but reduces compilation efficiency. Experiments showed that an interconnect and memory latency of 2 cycles is required to obtain an adequate clock speed on FPGA 2.2. This latency is expressed in FPGA cycles. Therefore, when a processor cycle is longer than one FPGA cycle (that is, when SimPLE instructions require time-multiplexing), the compiler assumes that the interconnect and memory latencies are both 1, because successive instructions are then separated by one processor cycle, which corresponds to two or more FPGA cycles.

【０１７５】FIG. 14 shows how the average number of instructions generated by the compiler varies with the number of processors, registers, and register read ports in SimPLE 2.6; that is, it shows the effect of increasing the number of register ports on compilation efficiency. The X-axis is labeled P-r, where P is the number of processors and r is the number of registers in the SimPLE implementation.

【０１７６】Notably, with 32 or more processors, increasing the number of register ports beyond 2 has almost no effect on the average number of instructions. This is explained by the fact that all netlists are mapped to 2-LUTs at compile time, and that with 32 processors there is enough parallelism to minimize the duplication of values on a processor (duplicated values on a single processor would require the use of multiple read ports). FIG. 15 shows that extra read ports also increase the number of CLBs consumed (as estimated for a Xilinx Virtex-II FPGA); that is, it shows the effect of increasing register ports on Virtex-II CLB utilization. The X-axis is labeled P-r, where P is the number of processors and r is the number of registers in the SimPLE implementation.

【０１７７】The discussion is therefore restricted to SimPLE architectures with 2 read ports. The memory configuration and the interconnect and memory latencies are likewise fixed at the values given above.

【０１７８】5. FPGA synthesis
SimPLE 2.6 must be configured on FPGA 2.2 before simulation. This needs to be done only once; once configuration is complete, any number of simulations can be run. Configuration bits for several SimPLE architectures can also be generated in advance and stored in a library. The time spent placing and routing SimPLE 2.6 on FPGA 2.2 therefore does not affect simulation speed. The FPGA clock speed, however, does. It is thus important to place and route SimPLE 2.6 on FPGA 2.2 in a way that also achieves a high clock speed. This section describes the placement and routing procedure for FPGA 2.2.

【０１７９】An HDL generator produces a behavioral description of SimPLE 2.6 for a given set of parameters (number of processors, memory size, and so on). Where necessary, it can also generate the extra hardware needed to time-multiplex SimPLE instructions. This description is synthesized using Synopsys FPGA Express and is mapped, placed, and routed onto a Virtex-II FPGA using Xilinx Foundation 4.1i.

【０１８０】(1) FPGA placement and routing technique for SimPLE
Placement on FPGA 2.2 is critical to successful routing. Previous work has shown that good placement before routing reduces congestion and significantly improves clock speed (references <12>, <4>). The present technique uses a regularity-driven scheme to achieve good placement. Each instance of SimPLE 2.6 is inherently highly regular, since its element processors, memory blocks, and register files are all identical to one another. FIG. 16 shows the hierarchy of SimPLE 2.6 with all of its regular units.

【０１８１】The disclosed FPGA placement and routing technique consists of four steps: (i) identifying the best repeating unit in the design, (ii) compactly pre-placing the repeating unit as a single (relatively placed) hard macro, (iii) placing the entire design using the macro, and (iv) performing the final overall routing.

【０１８２】Of the several macros possible in FIG. 16, experiments showed that the largest macro (that is, the one at the top of the hierarchy) is best. A large macro gives the best compaction and has relatively few I/Os. Once identified, the macro is synthesized, mapped to FPGA CLBs, and placed. The complete SimPLE description is then instantiated in terms of the macro and is mapped, placed, and routed. No optimization is performed across the boundaries of the pre-placed macros. The whole macro flow is fully automated using scripts that interact with the FPGA tools.

【０１８３】Table 1, shown in FIG. 17, compares FPGA clock speeds with and without the macro strategy of the disclosed technique. All experiments used the latest Xilinx Foundation 4.1i. With the disclosed technique, improvements of up to 3x were observed. Compacting the structure shown in FIG. 16 into a macro improves the distribution of the components placed on FPGA 2.2 and makes the clock speed less sensitive to the number of registers in a PE.

【０１８４】The simulation speed can be calculated from Equation 2 using the number of compiled instructions, the instruction width, and the FPGA clock cycle. FIG. 18 shows the simulation speed (vectors per second) of various SimPLE architectures for FPGA memory bus widths of 256 and 1024 bits. As the figure shows, with a 1024-bit FPGA memory bus, the 48-processor architecture is best. Wider is not always better, because wider architectures have more instructions that require time-multiplexing. With the narrower FPGA memory bus, the architectures had roughly the same simulation speed. This indicates that when the FPGA memory bus is narrow, the benefit of a wider architecture is offset by the penalty of wider instructions.

【０１８５】6. Experiments, analysis, and discussion
This section describes the speedups actually obtained when SimPLE 2.6 is implemented on a large Virtex-II FPGA 2.2, and when a first prototype of the present technique is implemented on a general-purpose board.

【０１８６】(1) Speedup on Virtex-II
Based on the experimental results, a SimPLE processor was synthesized on an 8-million-gate Virtex-II FPGA (XV2V8000) with 48 element processors, 64 registers per element processor, 2 register read ports per register file, a distributed memory system consisting of 16-Kbit banks each spanning 2 element processors, a memory word size of 4 bits, and an interconnect latency of 2. The Xilinx Foundation tools were used.

【０１８７】a) Comparison with cycle-based simulation
The Ver Verilog compiler and Cyco were used as the cycle-based simulator. Ver reads structural Verilog and produces an intermediate form called IVF. Cyco reads the IVF and generates straight-line C code representing the structural Verilog. FIG. 19 shows the experimental tool flow for cycle-based simulation and for SimPLE. The C code was compiled and run on an UltraSparc 10 system with a 440 MHz Sparc V9 processor and 1 GB of RAM. It should be noted that compiling the generated C code took a long time (several hours). The short compile time is thus another advantage of SimPLE 2.6.

【０１８８】FIG. 20 shows the speedup obtained by a SimPLE 2.6 with 48 processors and 64 registers running at 100 MHz (the speed is limited to this value because most boards run at 100 MHz) over the cycle-based simulator running on a 440 MHz UltraSparc workstation. For each benchmark, the right-hand column shows the speedup achieved with a 1024-bit FPGA memory bus, and the smaller left-hand column shows the speedup with a 256-bit FPGA memory bus. The speedup ranges from 200x to 3000x with the 1024-bit FPGA memory bus and drops to 75x to 1000x with the 256-bit bus.

【０１８９】b) Comparison with zero-delay event-driven simulation
ModelSim version 5.3e with zero gate delays was used for this comparison. Each benchmark was optimized in exactly the same way as for SimPLE 2.6 and then loaded into ModelSim for event-driven simulation. Here again, a 440 MHz UltraSparc 10 was used for the simulation. FIG. 21 shows the speedups obtained on the same benchmarks. The speedup ranges from 300x to 6000x with the 1024-bit FPGA memory bus and drops to 75x to 1500x with the 256-bit bus.

【０１９０】(2) Speedup using the prototype
The prototype was implemented on a general-purpose FPGA board (ADCRC1000) from AlphaData (www.alphadata.co.uk). This board carries a Xilinx Virtex-E 2000 FPGA with a 128-bit FPGA memory bus. We have a fully working simulation environment, with a graphical user interface that lets the user compile and simulate a netlist and view selected signals, and it was used for the experiments. Speedups were measured on the small prototype board for two designs. One is a 400,000-gate sequential benchmark; the other is part of the pipelined datapath of the PicoJava(R) processor. For both designs, the prototype board was 30 times faster than ModelSim and 12 times faster than the cycle-based simulator.

【０１９１】(3) Reasons for the speedup
The speedup has four main causes: (i) parallelism, (ii) the large number of registers and the large amount of memory in SimPLE, (iii) the high bandwidth between the FPGA and the board memory, and (iv) the high FPGA clock speed. A superscalar processor using dynamic parallelism techniques typically executes two to three instructions per cycle. SimPLE, by contrast, can execute as many instructions per cycle as it has element processors. SimPLE also has a large number of registers (32 or more dedicated to each element processor), which reduces the number of memory operations.

【０１９２】The efficiency of the simulation process is further increased by the high bandwidth between the FPGA and the board memory, which allows even wide SimPLE instructions to be transferred quickly. Finally, the regularity of the SimPLE architecture enables a fast implementation on the FPGA. As FPGAs grow in size, larger SimPLE architectures can be implemented, yielding further speedups.

【０１９３】Other modifications and variations will be apparent to those skilled in the art from the disclosure herein. Thus, while only certain embodiments of the invention have been specifically described herein, it will be apparent that numerous modifications may be made without departing from the spirit and scope of the invention.

【０１９４】[0194]

[Effects of the Invention]
As described above, the present invention makes it possible to realize a hardware acceleration system with four advantages: (i) low cost, (ii) high performance, (iii) low turnaround time, and (iv) high scalability. That is, it achieves cost, scalability, and turnaround time comparable to those of a simulator while delivering far superior performance.

[Brief description of drawings]

[FIG. 1] A diagram comparing the cost and performance of a system to which the present invention is applied with those of conventional simulators and emulators.

[FIG. 2] A diagram illustrating the concept of simulating a large netlist on a single FPGA using the SimPLE intermediate architecture according to an embodiment of the present invention.

[FIG. 3] A block diagram showing the overall system configuration according to the present invention.

[FIG. 4] A diagram showing an example architectural model of SimPLE with element processors, memory banks, a wide register file with read ports, and a crossbar.

[FIG. 5] A diagram showing the maximum number of intermediate values of a netlist when it is scheduled using the ASAP heuristic.

[FIG. 6] A flowchart showing an example compiler that performs scheduling and instruction generation.

[FIG. 7] A diagram showing an example of node selection for scheduling.

[FIG. 8] A diagram showing an example of spilling a register to memory.

[FIG. 9] A diagram showing an example of loading the inputs of a node in the ready front.

[FIG. 10] A diagram showing an example of handling user-specified registers.

[FIG. 11] A diagram showing the assignment of primary input and primary output bits to specific slots in the memory system.

[FIG. 12] A diagram showing the storage requirements of a SimPLE implementation.

[FIG. 13] A diagram showing the compilation speed for a SimPLE implementation.

[FIG. 14] A diagram showing the effect of increasing register ports on compilation efficiency.

[FIG. 15] A diagram showing the effect of increasing register ports on Virtex-II CLB utilization.

[FIG. 16] A diagram showing the hierarchy of a SimPLE implementation, with the largest repeating unit.

[FIG. 17] A diagram showing the improvement in SimPLE FPGA clock speed obtained with regularity-driven placement.

[FIG. 18] A diagram showing the simulation speed (vectors per second) of various SimPLE implementations.

[FIG. 19] A diagram showing the tool flow for simulating a gate-level netlist using SimPLE and using a software implementation of cycle-based simulation.

[FIG. 20] A diagram showing the speedup of SimPLE over a cycle-based simulator.

[FIG. 21] A diagram showing the speedup of SimPLE over ModelSim.

[FIG. 22] A diagram showing the architecture of an RTL-level circuit.

[Explanation of symbols]

2.1 Onboard memory
2.2 FPGA
2.3 DMA/PCI controller
2.4 Host
2.5 SimPLE compiler
2.6 SimPLE
2.31-2.34 PE
3.1 API
3.2 Gate circuit
3.3 Main memory
4.1 Distributed memory system
4.2 Distributed register file system
4.3 Crossbar

Continuation of front page

(72) Inventor: Pranav Ashar, c/o NEC USA, Inc., 4 Independence Way, Princeton, New Jersey 08540, United States

F-term (reference): 5B046 AA08 BA03 JA04

Claims

[Claims]

1. A hardware acceleration system for functional simulation, comprising a general-purpose circuit board having a logic chip and a memory, wherein the general-purpose circuit board can be plugged into a computing device, the computing device is adapted to direct DMA transfers between the general-purpose circuit board and a memory associated with the computing device, the general-purpose circuit board is configured using a simulation processor, and the simulation processor is programmable for at least one circuit design.

2. The hardware acceleration system according to claim 1, wherein the simulation processor is mapped to an FPGA.

3. The hardware acceleration system according to claim 1, wherein the netlist of the circuit to be simulated is compiled for the simulation processor.

4. The hardware acceleration system according to claim 1, wherein the simulation processor comprises at least one element processor and at least one register file having one or more registers corresponding to the at least one element processor.

5. The hardware acceleration system of claim 4, wherein the simulation processor further comprises a distributed memory system having at least one memory bank.

6. The hardware acceleration system of claim 5, wherein the at least one memory bank serves a set of element processors and their associated registers.

7. The hardware acceleration system of claim 5, wherein the registers can be spilled to the memory bank.

8. The hardware acceleration system of claim 4, further comprising an interconnection system connecting the at least one element processor to another element processor.

9. The hardware acceleration system according to claim 4, wherein the element processor can simulate any two-input gate.

10. The hardware acceleration system according to claim 4, wherein the element processor is capable of performing RT-level simulation.

11. The hardware acceleration system according to claim 8, wherein the connection by the interconnection system is made through a register.

12. The hardware acceleration system of claim 11, wherein the interconnection system is pipelined.

13. The hardware acceleration system of claim 8, wherein the register file is located close to an associated element processor.

14. The hardware acceleration system according to claim 5, wherein the distributed memory system has a dedicated port corresponding to each register file.

15. The hardware acceleration system of claim 3, wherein the system can process partitions of the netlist one at a time if the netlist does not fit in the onboard memory.

16. The hardware acceleration system of claim 15, wherein the system is capable of simulating an entire netlist by sequentially simulating its partitions.

17. The hardware acceleration system of claim 3, wherein the system is capable of processing a subset of simulation vectors used in testing a circuit.

18. The hardware acceleration system of claim 17, wherein the system is capable of simulating an entire simulation vector set by sequentially simulating each subset.

19. The hardware acceleration system of claim 1, wherein the hardware acceleration system can be used interchangeably with a general purpose software simulator capable of exchanging the states of all registers in a design.

20. The hardware acceleration system according to claim 1, wherein both two-valued and four-valued simulations can be executed on the simulation processor.

21. The hardware acceleration system of claim 1, further comprising an interface and opcodes, the opcodes specifying read, write, and other operations associated with simulation vectors.

22. The hardware acceleration system according to claim 1, wherein the simulation processor further comprises at least one arithmetic logic unit (ALU), zero or more signed multipliers, and a distributed register system having at least one register associated with each of the ALU and the multiplier.

23. The hardware acceleration system of claim 22, wherein the system comprises a carry register file for each ALU, the width of the register being the same as the width of the corresponding register.

24. The hardware acceleration system of claim 23, further comprising pipelined carry chain interconnects connecting registers.

25. A method of performing a functional simulation of a circuit, comprising:
a) compiling a netlist corresponding to the circuit to generate an instruction set for a simulation processor;
b) loading the instructions into an onboard memory corresponding to the simulation processor;
c) transferring a simulation vector set to the onboard memory;
d) streaming the instruction set corresponding to the netlist to be simulated to an FPGA on which the simulation processor is configured;
e) executing the instruction set to generate a result vector set; and
f) transferring the result vectors to a host computer.

26. The method for performing a functional simulation of a circuit according to claim 25, wherein if an instruction is wider than the bus connecting the onboard memory to the FPGA, the instruction is time-multiplexed.

27. A method of compiling a netlist of a circuit for a simulation processor, comprising:
a) representing the design of the circuit as a directed graph in which the nodes of the graph correspond to hardware blocks in the design;
b) generating a ready-front subset of nodes that are ready for scheduling;
c) performing a topological sort on the ready-front set;
d) selecting a node that has not yet been selected;
e) completing the instruction and proceeding to a new instruction if no element processor is available;
f) selecting the element processor with the largest number of free associated registers to perform the operation corresponding to the selected node;
g) routing the operands from registers to the selected element processor; and
h) repeating steps d to g until no unselected nodes remain.

28. The method of claim 27, wherein the node is selected based on a selection heuristic that favors the node whose scheduling releases the largest number of registers and that has the largest fanout.

29. The method of compiling a netlist of a circuit according to claim 27, wherein when the register file is full, a register to be spilled is selected and stored in memory so that it can be reloaded as needed.

30. The method of claim 29, wherein if no registers are available in step f, registers are spilled to a memory bank.

31. The method for compiling a netlist of a circuit according to claim 29, wherein the register selected for spilling is the output register of a node that was previously scheduled based on a selection heuristic favoring the maximum number of registers released by scheduling the node and the maximum fanout of the node.
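
As an illustration only, the scheduling loop recited in claim 27 can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the SimPLE compiler itself: the function name `compile_netlist`, the `preds` dictionary format, and the tuple layout of an instruction are all hypothetical. Registers are never released, the selection heuristic of claim 28 and the spilling of claims 29 to 31 are not modelled, and a plain sort by node name stands in for the topological sort of step c.

```python
def compile_netlist(preds, num_pes, regs_per_pe):
    """Schedule a netlist DAG onto element processors (illustrative sketch).

    preds maps each node name to a tuple of the nodes that feed it
    (hypothetical input format). Returns a list of instructions, each a
    list of (element_processor, node, operand_locations) tuples.
    """
    if num_pes < 1 or regs_per_pe < 1:
        raise ValueError("need at least one element processor and one register")
    free = [regs_per_pe] * num_pes   # free registers per element processor
    location = {}                    # node -> (element processor, register index)
    done = set()                     # nodes computed by earlier instructions
    remaining = set(preds)
    program = []
    while remaining:
        # Steps b and c: the ready front holds nodes whose operands are all
        # available; sorting by name is a stand-in for the topological sort.
        ready = sorted(n for n in remaining if all(p in done for p in preds[n]))
        if not ready:
            raise ValueError("netlist graph contains a cycle")
        idle = set(range(num_pes))
        instruction = []
        for node in ready:           # step d: pick a not-yet-selected node
            if not idle:             # step e: instruction full, start a new one
                break
            # Step f: element processor with the most free registers
            # (ties go to the lowest index).
            pe = max(idle, key=lambda p: (free[p], -p))
            if free[pe] == 0:
                raise RuntimeError("register files full; spilling is not modelled")
            # Step g: record where each operand is read from.
            operands = [location[p] for p in preds[node]]
            location[node] = (pe, regs_per_pe - free[pe])
            free[pe] -= 1
            idle.remove(pe)
            instruction.append((pe, node, operands))
        program.append(instruction)
        scheduled = {node for _, node, _ in instruction}
        done |= scheduled            # step h: repeat until nothing remains
        remaining -= scheduled
    return program
```

For example, calling `compile_netlist({"a": (), "b": (), "g": ("a", "b")}, num_pes=2, regs_per_pe=2)` places the two inputs in the first instruction and the gate `g` in the second, with the operands of `g` routed from the registers that hold `a` and `b`.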