JP4801210B2

JP4801210B2 - System for designing expansion processors

Info

Publication number: JP4801210B2
Application number: JP2010133346A
Authority: JP
Inventors: Earl A. Killian; Ricardo E. Gonzalez; Ashish B. Dixit; Monica Lam; Walter D. Lichtenstein; Christopher Rowen; John Ruttenberg; Robert P. Wilson; Albert Ren-Rui Wang; Dror Eliezer Maydan; Wen Kian Jiang; Richard Rudell
Original assignee: Tensilica Inc
Current assignee: Tensilica Inc
Priority date: 1999-02-05
Filing date: 2010-06-10
Publication date: 2011-10-26
Anticipated expiration: 2020-02-04
Also published as: JP2009015867A; JP2010238256A

Description

The present invention is directed to microprocessor systems. More particularly, it is directed to the design of application solutions that include one or more processors, where the processors in the system are configured and enhanced at processor design time to improve their suitability for a particular application. The invention is directed to a system that lets application developers rapidly develop instruction extensions, such as new instructions added to an existing instruction set architecture, including new instructions that manipulate user-defined processor state. The system also lets developers immediately evaluate the impact of an extension on application execution time and on processor cycle time.

Processors have traditionally been difficult to design and to modify. For this reason, most systems that contain processors use ones that were designed and verified once for general-purpose use and then used by many applications over time. As a result, a processor's suitability for any particular application is often not ideal. It would frequently be worthwhile to modify the processor so that it executes a particular application's code better, for example faster, with less power, or at lower cost. However, modifying an existing processor design is difficult, and therefore time-consuming, costly, and risky, so it is not typically done.

To better understand why it is difficult to make a conventional processor configurable, consider how such a processor is developed. First, the instruction set architecture (ISA) is developed. This is essentially done once and then used by many systems for a decade or more. For example, the instruction set of the Intel Pentium processor can trace its lineage back to the 8008 and 8080 microprocessors introduced in the 1970s. In this process, the ISA instructions, syntax, and so on are developed according to predetermined ISA design criteria, and software development tools for the ISA, such as an assembler, debugger, and compiler, are built. Next, a simulator for the ISA is developed, various benchmarks are run to evaluate the ISA's effectiveness, and the ISA is revised according to the results. At some point the ISA is judged satisfactory, and the ISA process ends with a fully developed ISA specification, an ISA simulator, an ISA verification suite, and a development suite including, for example, an assembler, debugger, and compiler.

Next, processor design begins. Because a processor can have a useful life of many years, this process is also relatively infrequent: typically a processor is designed once and used by several systems over several years. Given the ISA, its verification suite and simulator, and various processor development goals, the processor's microarchitecture is designed, simulated, and revised. Once the microarchitecture is finalized, it is implemented in a hardware description language (HDL), and a microarchitecture verification suite is developed and used to verify the HDL implementation (more on this later). Then, in contrast to the largely manual steps described so far, automated design tools may synthesize a circuit from the HDL description and place and route its components. The layout may then be revised to optimize chip area and timing. Alternatively, additional manual steps may create a floorplan from the HDL description, convert the HDL into circuits, and then verify and lay out those circuits by both manual and automated means. Finally, automated tools verify that the layout matches the circuit, and the circuit is verified against the layout parameters.

After processor development is complete, the overall system is designed. Unlike ISA and processor design, system design (which may now include the design of the chip that contains the processor) is quite common, and systems are typically designed continually. Each system is used for a particular application for a relatively short time, one or two years. The overall system architecture is designed based on predetermined system goals such as cost, performance, power, and functionality; on the specifications of existing processors; and on the specifications of chip foundries, which are usually closely tied to processor vendors. A processor is selected to meet the design goals, and a chip foundry is selected, a choice closely tied to the processor selection.

Next, given the selected processor, ISA, and foundry, along with the previously developed simulation, verification, and development tools (and the standard cell library for the selected foundry), an HDL implementation of the system is designed. A verification suite is developed for this system HDL implementation, and the implementation is verified. The system implementation is then synthesized, placed and routed onto circuit boards, and its layout and timing are reoptimized. Finally, the boards are designed and laid out, the chips are fabricated, and the boards are assembled.

Another difficulty with conventional processor design is that it is not practical simply to design a processor with more features so that it covers every application. Any given application needs only a particular set of features, and a processor with features the application does not need is more expensive, consumes more power, and is harder to fabricate. Moreover, not all application targets can be known when the processor is first designed. If the processor modification process could be automated and made reliable, system designers' ability to create application solutions would be significantly enhanced.

As an example, consider a device designed to transmit and receive data over a channel using a complex protocol. Because the protocol is complex, the processing cannot reasonably be done entirely in hardwired logic, for example combinational logic; instead, a programmable processor is introduced into the system for protocol processing. Programmability also allows bug fixes, and later upgrades to the protocol, to be made by loading new software into instruction memory. However, the conventional processor was probably not designed for this particular application (the application may not even have existed when the processor was designed). As a result, there may be operations the application must perform that take many instructions on that processor but could be done in one or a few instructions with some additional processor logic.

Because processors cannot easily be enhanced, many system designers do not attempt such enhancements. Instead, they choose to implement an inefficient pure-software solution on an available general-purpose processor. The inefficiency produces a solution that may be slower, may require more power, or may be more expensive; for example, a larger and more powerful processor may be needed to run the program fast enough. Other designers choose to handle some of the processing in special-purpose hardware designed for the application, such as a coprocessor, and then have programmers code accesses to that hardware at various points in the program. However, the time needed to transfer data between the processor and such special-purpose hardware limits the usefulness of this approach. Only fairly large units of work can be accelerated enough that the time saved by the special-purpose hardware exceeds the extra time required to move data to and from it.
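The break-even condition in the previous paragraph can be stated as a simple inequality. This is a first-order model introduced for illustration: it assumes that transfer and computation do not overlap. Let t_sw be the time to perform one unit of work in software on the processor, t_hw the time the special-purpose hardware needs for the same work, and t_xfer the time to move operands to the hardware and results back. Offloading pays off only when:

```latex
\[
  t_{hw} + t_{xfer} < t_{sw}
  \quad\Longleftrightarrow\quad
  t_{sw} - t_{hw} > t_{xfer}
\]
```

Because t_xfer is paid on every hand-off, the saving t_sw - t_hw must come from a fairly large unit of work. Small bit-level operations rarely clear this bar.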

In the communication channel example, the protocol may require encryption, error correction, or compression/decompression. Such processing often operates on individual bits rather than on the processor's larger words. The circuitry for the computation itself may be rather modest. However, the processor must extract each bit, process the bits one at a time, and then repack them, and this adds considerable overhead. As a very specific example, consider Huffman decoding using the rules shown in Table 1 (similar encoding is used in the MPEG compression standard). Both the decoded value and the code length must be computed, so that the code's bits can be shifted off to find the start of the next element to be decoded in the stream.

There are many ways to code this for a conventional instruction set, but all of them require many instructions because there are many tests to perform. Combinational logic needs only a single gate delay, whereas every software implementation takes multiple processor cycles. For example, an efficient conventional implementation using the MIPS instruction set may require six logical operations, six conditional branches, an arithmetic operation, and associated register loads. With a more advantageously designed instruction set the coding is better, but it is still expensive in time: one logical operation, six conditional branches, an arithmetic operation, and associated register loads.
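Table 1 is not reproduced in this text, so the sketch below substitutes a small hypothetical prefix code. This is an assumption for illustration only, not the code in Table 1. The sketch shows the test-and-branch style of software decoding described above. Each `if` corresponds to a test and conditional branch that a conventional processor executes one after another. The decoder returns both the value and the length, so the caller can shift off the consumed bits and find the next element.

```python
# Hypothetical prefix code standing in for Table 1 (not reproduced here).
# It is complete and prefix-free; the maximum code length is 5 bits.
#   00 -> 0    01 -> 1    100 -> 2    101 -> 3    1100 -> 4
#   1101 -> 5  1110 -> 6  11110 -> 7  11111 -> 8


def decode_branchy(window: int) -> tuple[int, int]:
    """Decode one symbol from an 8-bit window (next stream bits, MSB first).

    Returns (value, length). Each test below models a compare-and-branch
    that a conventional instruction set must execute sequentially.
    """
    if not (window & 0x80):                # 0x: two-bit codes
        return (window >> 6) & 1, 2
    if not (window & 0x40):                # 10x: three-bit codes
        return 2 + ((window >> 5) & 1), 3
    if not (window & 0x20):                # 110x: four-bit codes
        return 4 + ((window >> 4) & 1), 4
    if not (window & 0x10):                # 1110
        return 6, 4
    return 7 + ((window >> 3) & 1), 5      # 1111x: five-bit codes


def decode_stream(bits: str) -> list[int]:
    """Decode a string of '0'/'1' characters, shifting off `length` bits per symbol."""
    values, pos = [], 0
    while pos < len(bits):
        # Take the next 8 bits, zero-padding at the end of the stream.
        window = int(bits[pos:pos + 8].ljust(8, "0"), 2)
        value, length = decode_branchy(window)
        values.append(value)
        pos += length  # the length locates the start of the next element
    return values


stream = "00" "01" "100" "101" "1100" "1101" "1110" "11110" "11111"
print(decode_stream(stream))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Even for this small code, a symbol can take up to five sequential tests before its value and length are known.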

Because this is so expensive in processor resources, a 256-entry lookup table is typically used instead of coding the process as a series of bit-by-bit comparisons. However, a 256-entry lookup table takes significant space and can also take several cycles to access. For longer Huffman codes the table size becomes prohibitive, which leads to more complex and slower code.
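The lookup-table alternative can be sketched as follows. The sketch reuses the same hypothetical prefix code, which stands in for Table 1 as an assumption. Every possible 8-bit window is precomputed to its (value, length) pair. A code word of n bits therefore fills 2^(8 - n) entries, one for each possible setting of the bits that follow it.

```python
# Same hypothetical prefix code as a stand-in for Table 1 (an assumption).
CODE_TABLE = [("00", 0), ("01", 1), ("100", 2), ("101", 3), ("1100", 4),
              ("1101", 5), ("1110", 6), ("11110", 7), ("11111", 8)]


def build_lut(table, width=8):
    """Build a 2**width-entry table mapping every window to (value, length)."""
    lut = [None] * (1 << width)
    for code, value in table:
        n = len(code)
        base = int(code, 2) << (width - n)    # code word left-aligned in the window
        for suffix in range(1 << (width - n)):
            lut[base | suffix] = (value, n)   # any trailing bits decode the same way
    return lut


LUT = build_lut(CODE_TABLE)  # 256 entries for an 8-bit window


def decode_lut(window: int) -> tuple[int, int]:
    """One table access replaces the chain of compare-and-branch tests."""
    return LUT[window & 0xFF]


print(len(LUT), decode_lut(0b11110000))  # 256 (7, 5)
```

The table doubles in size with every extra bit of window. A code with a maximum length of 16 bits would need 2^16 = 65,536 entries, which illustrates why long Huffman codes make this approach prohibitive.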

One possible solution to the problem of accommodating application-specific requirements is a configurable processor: one whose instruction set and architecture can easily be modified and extended to enhance and customize its functionality. The simplest kind of configurability is a binary choice, in which a feature is either present or absent. For example, a processor might be offered either with or without floating-point hardware.

Flexibility can be improved by configuration choices with finer gradations. For example, the processor may let the system designer specify the number of registers in the register file, the memory width, the cache size, the cache associativity, and so on. However, these options still fall short of the level of customizability that system designers want. In the Huffman decoding example above, for instance, the system designer might like to include a specific instruction to perform the decoding, such as huff8 t1, t0, although no such instruction is known in the prior art. Here the most significant eight bits of the result are the decoded value and the least significant eight bits are the length. In contrast to the software implementations described above, a direct hardware implementation of Huffman decoding is quite simple. The decoding logic amounts to roughly thirty gates for the combinational logic function alone, excluding instruction decode and the like, which is less than 0.1% of a typical processor's gate count. It can be computed in a single cycle by a dedicated processor instruction, an improvement of 4 to 20 times over using only general-purpose instructions.
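The sketch below is a behavioral model of what a huff8 t1, t0 instruction could compute. The value/length bit layout of the result comes from the text. Three things are assumptions for illustration only: where the input window sits in t0 (here, its low eight bits), the use of the same hypothetical prefix code in place of Table 1, and the function name. In hardware, this function would be a block of combinational logic evaluated in one cycle. The branches here only specify its truth table.

```python
def huff8(t0: int) -> int:
    """Behavioral model of a hypothetical huff8 t1, t0 instruction.

    Assumption: the low 8 bits of t0 hold the next 8 stream bits, MSB first.
    The returned t1 packs the decoded value in bits 15..8 and the code
    length in bits 7..0, matching the layout described in the text.
    """
    w = t0 & 0xFF
    if not (w & 0x80):
        value, length = (w >> 6) & 1, 2
    elif not (w & 0x40):
        value, length = 2 + ((w >> 5) & 1), 3
    elif not (w & 0x20):
        value, length = 4 + ((w >> 4) & 1), 4
    elif not (w & 0x10):
        value, length = 6, 4
    else:
        value, length = 7 + ((w >> 3) & 1), 5
    return (value << 8) | length


t1 = huff8(0b11110000)
value, length = t1 >> 8, t1 & 0xFF  # unpack the two 8-bit fields
print(hex(t1), value, length)       # 0x705 7 5
```

Packing both results into one register lets software obtain the value and the shift amount from a single instruction.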

Previous attempts at configurable processor generation generally fall into two categories: logic synthesis used with parameterized hardware descriptions, and automatic retargeting of compilers and assemblers from abstract machine descriptions. The first category includes synthesizable processor hardware designs such as the Synopsys DW8051 processor, the ARM/Synopsys ARM7-S, the Lexra LX-4080, the ARC configurable RISC core, and, to some extent, the Synopsys synthesizable/configurable PCI bus interface.

Of these, the Synopsys DW8051 consists of a binary-compatible implementation of an existing processor architecture together with a small number of synthesis parameters. Examples are 128 or 256 bytes of internal RAM, a ROM address range determined by the parameter rom_addr_size, an optional internal timer, a variable number (0 to 2) of serial ports, and an interrupt unit supporting either 6 or 13 sources. Although the DW8051 implementation can be varied somewhat, its instruction set architecture cannot be changed at all.
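To make the limited scope of this kind of configurability concrete, here is a minimal sketch of a parameter record and consistency check for DW8051-style options. The field names and defaults are hypothetical. The legal values for RAM size, serial ports, and interrupt sources come from the text. The 1..16 range for rom_addr_size is an assumption based on the 8051's 64 KB program address space. None of the parameters touches the instruction set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DW8051Config:
    """Illustrative model of DW8051-style synthesis parameters (names hypothetical)."""
    internal_ram_bytes: int = 256
    rom_addr_size: int = 16       # ROM address bits; 1..16 is an assumed range
    internal_timer: bool = True   # optional timer: any boolean is legal
    serial_ports: int = 1
    interrupt_sources: int = 13


def validate(cfg: DW8051Config) -> list[str]:
    """Return a list of problems; an empty list means the configuration is legal."""
    errors = []
    if cfg.internal_ram_bytes not in (128, 256):
        errors.append("internal RAM must be 128 or 256 bytes")
    if not 1 <= cfg.rom_addr_size <= 16:
        errors.append("rom_addr_size must be in 1..16")
    if not 0 <= cfg.serial_ports <= 2:
        errors.append("serial ports must be 0, 1 or 2")
    if cfg.interrupt_sources not in (6, 13):
        errors.append("interrupt unit supports 6 or 13 sources")
    return errors


print(validate(DW8051Config()))  # []
print(validate(DW8051Config(internal_ram_bytes=512, serial_ports=3)))
```

Every option selects among a few fixed variants of the same hardware. Nothing here can add an instruction or generate the software tools that a new instruction would need.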

The ARM/Synopsys ARM7-S processor consists of a binary-compatible implementation of an existing architecture and microarchitecture. It has two configurable parameters: the choice of a high-performance or a low-performance multiplier, and the inclusion of debug and in-circuit emulation logic. Changes to the ARM7-S instruction set architecture are possible, but because they yield subsets of existing non-configurable processor implementations, no new software is required.

The Lexra LX-4080 processor is a configurable variant of the standard MIPS architecture and has no software support for instruction set extensions. Its options include:
- a custom engine interface that allows the MIPS ALU opcodes to be extended with application-specific operations;
- an internal hardware interface with register sources, a register or 16-bit-wide immediate source, and destination and stall signals;
- a simple memory management unit option;
- three MIPS coprocessor interfaces;
- a flexible local memory interface to cache, scratchpad RAM, or ROM;
- a bus controller that connects peripheral functions and memory to the processor's own local bus;
- a write buffer of configurable depth.

The ARC configurable RISC core has three configurable aspects.

Its user interface provides on-the-fly gate-count estimation based on target technology and clock speed, along with settings for instruction cache configuration, instruction set extensions, a timer option, a scratchpad memory option, and memory controller options.

Its instruction set offers selectable options such as:
- local scratchpad RAM with block moves to memory;
- special registers;
- up to 16 extra condition code choices;
- a 32×32-bit scoreboarded multiply block;
- a single-cycle 32-bit barrel shift/rotate block;
- a normalize (find first bit) instruction;
- writing results directly to a command buffer rather than the register file;
- a 16-bit MUL/MAC block with a 36-bit accumulator;
- sliding-pointer access to local SRAM using linear arithmetic.

Finally, user instructions are defined by manually editing the VHDL source code. The ARC design has no facility for implementing an instruction set description language, and it does not generate software tools specific to the configured processor.

The Synopsys configurable PCI interface includes the following:
- a GUI or command-line interface for installation, configuration, and synthesis activities;
- checks that the prerequisite user actions have been taken at each step;
- installation of selected design files (Verilog vs. VHDL) based on the configuration;
- configuration functions, such as setting user parameters, prompting for configuration values with checks on the validity of combinations, and generating HDL without requiring manual editing of the HDL source files;
- synthesis functions, such as a user interface that analyzes the technology library to select I/O pads, technology-independent constraints and synthesis scripts, pad insertion, prompts for technology-specific pads, and conversion of technology-independent formulas into technology-dependent scripts.

The configurable PCI bus interface is notable for implementing parameter consistency checking, configuration-based installation, and automatic modification of HDL files.

Furthermore, conventional synthesis techniques can select different mappings based on the user's goal specifications, so that the mapping is optimized for speed, power, area, or target components. In this respect, the prior art offers no way to obtain feedback on the effect of reconfiguring the processor in these ways without taking the design through the entire mapping process. Such feedback could be used to reconfigure the processor further until the system design goals are met.

The second category of prior work in configurable processor generation, the automatic retargeting of compilers and assemblers, encompasses a rich area of academic research. See, for example:
- Hanono et al., "Instruction Selection, Resource Allocation, and Scheduling in the AVIV Retargetable Code Generator" (representation of machine instructions used for the automatic creation of code generators);
- Fauth et al., "Describing Instruction Set Processors Using nML";
- Ramsey et al., "Machine Descriptions to Build Tools for Embedded Systems";
- Aho et al., "Code Generation Using Tree Matching and Dynamic Programming" (algorithms that match the transformations associated with each machine instruction, such as add, load, store, and branch, against a sequence of program operations represented in a machine-independent intermediate form, using methods such as pattern matching);
- Cattell, "Formalization and Automatic Derivation of Code Generators" (abstract description of machine architectures used for compiler research).
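The sketch below illustrates the idea behind tree-pattern instruction selection with dynamic programming, in the spirit of Aho et al. It is greatly simplified: patterns are matched naively at each node, not with the authors' tree-matching automata. The MIPS-like mnemonics and their costs are hypothetical. Each instruction is described by the expression-tree pattern it implements. The selector finds the cheapest cover of the program tree, and each subtree's optimum is computed once and memoized. Retargeting, in this view, means swapping the PATTERNS table while the selector stays unchanged.

```python
from functools import lru_cache

# Hypothetical instruction patterns: (mnemonic, cost, pattern tree).
# "_" is a hole: a subtree that must first be computed into a register.
# Leaf patterns ("const",) and ("reg",) match on node kind only.
PATTERNS = [
    (None,   0, ("reg",)),                            # value already in a register
    ("li",   1, ("const",)),                          # load immediate
    ("add",  1, ("add", "_", "_")),
    ("addi", 1, ("add", "_", ("const",))),            # add immediate
    ("lw",   1, ("load", "_")),
    ("lwo",  1, ("load", ("add", "_", ("const",)))),  # load base + offset
]


def match(pat, node):
    """Return the uncovered subtrees if `pat` matches at `node`, else None."""
    if pat == "_":
        return [node]
    if pat[0] != node[0]:
        return None
    if node[0] in ("const", "reg"):
        return []
    if len(pat) != len(node):
        return None
    holes = []
    for p, n in zip(pat[1:], node[1:]):
        h = match(p, n)
        if h is None:
            return None
        holes.extend(h)
    return holes


@lru_cache(maxsize=None)
def best(node):
    """Cheapest (cost, instruction sequence) computing `node` into a register."""
    options = []
    for name, cost, pat in PATTERNS:
        holes = match(pat, node)
        if holes is None:
            continue
        total, seq = cost, ()
        for h in holes:  # optimal cover of each remaining subtree
            c, s = best(h)
            total += c
            seq += s
        options.append((total, seq + (name,) if name else seq))
    return min(options)


tree = ("load", ("add", ("reg", "p"), ("const", 8)))  # load from address p + 8
print(best(tree))  # (1, ('lwo',))
```

Here the single base-plus-offset load beats the two-instruction sequence addi followed by lw, because its pattern covers a larger piece of the tree.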

Once a processor has been designed, its operation must be verified. Processors typically execute instructions from a stored program using a pipeline, with each stage suited to one phase of instruction execution. Changing or adding an instruction, or changing the configuration, may therefore require widespread changes in the processor logic so that each of the pipeline stages performs the appropriate action for each such instruction. Configuring a processor requires that it be re-verified, and that the verification be adapted to the changes and additions. This is not a simple task. Processors are complex logic devices with extensive internal data and control state, and the combinatorics of control, data, and program make verification a demanding art. Adding to this difficulty is the difficulty of developing appropriate verification tools. Because verification is not automated in the prior art, its flexibility, speed, and reliability are far from optimal.

Furthermore, even once a processor is designed and verified, it is of little use if it cannot be programmed easily. Processors are typically programmed with the help of extensive software tools, including compilers, assemblers, linkers, debuggers, simulators, and profilers. When the processor changes, the software tools must change as well. There is no point in adding an instruction that cannot be compiled, assembled, simulated, or debugged. The cost of the software changes associated with processor modifications and enhancements has been a major obstacle in prior-art processor design.

Thus, conventional processor design is difficult enough that processors are generally not designed or modified for specific applications. At the same time, considerable improvements in system efficiency are possible if processors can be configured or extended for specific applications. The efficiency and effectiveness of the design process could also be improved if feedback on implementation characteristics such as power consumption and speed could be used to refine the processor design. Moreover, in the prior art, a great deal of effort is needed after any processor modification to verify that the modified processor operates correctly. Finally, although the prior art provides limited processor configurability, it does not provide for generating software development tools tailored to the configured processor.

A system meeting the above criteria would certainly improve on the prior art, but further improvement is possible. For example, there is a need for a processor system with instructions that access or modify processor state, that is, information stored in special registers. Without such instructions, the range of instructions that can be created is significantly restricted, and so is the achievable performance improvement.

In addition, inventing new application-specific instructions involves complex tradeoffs among cycle-count reduction, added hardware resources, and impact on CPU cycle time. Another challenge is to obtain efficient hardware implementations of new instructions without involving application developers in the often intricate details of high-performance microprocessor implementation.
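The cycle-count versus cycle-time part of this tradeoff can be made explicit with the standard first-order model of execution time. This model is assumed here for illustration. It treats run time as the cycle count N multiplied by the cycle time t_cyc. Suppose a new instruction reduces the cycle count to N' but its logic lengthens the cycle time to t'_cyc. It pays off only when:

```latex
\[
  N' \, t'_{cyc} < N \, t_{cyc}
  \quad\Longleftrightarrow\quad
  \frac{N'}{N} < \frac{t_{cyc}}{t'_{cyc}}
\]
```

With hypothetical numbers: cutting the cycle count by 20% while lengthening the cycle time by 10% gives 0.8 × 1.1 = 0.88 of the original run time, a 12% gain. The same instruction with a 30% cycle-time penalty gives 0.8 × 1.3 = 1.04, a net loss. Added hardware area is a separate cost that this inequality does not capture.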

The system described above gives users the flexibility to design the processor best suited to their applications. To understand this problem more fully, consider the approach software designers typically use to tune the performance of their applications. A software designer generally thinks of a possible improvement and modifies the software to use it. The designer then recompiles the sources to produce an executable that incorporates the improvement and evaluates the result. Depending on the outcome, the designer may keep or discard the improvement. The whole cycle can typically be completed in just a few minutes, so users can experiment freely, trying ideas quickly and keeping or discarding them.

In some cases, evaluating a potential idea rigorously is complicated, because users may want to test it under a wide variety of conditions. In such cases they keep multiple versions of the compiled application: the original version and a version containing the potential improvement. Potential improvements may also interact, so users may keep more than two copies of the application, each using a different subset of the improvements. By keeping multiple versions, users can easily and repeatedly test the different versions under different conditions.

Users of configurable processors would like to develop hardware and software jointly and interactively, in the same way that software developers tune software on conventional processors. Consider a user who adds custom instructions to a configurable processor. The user wants to add candidate instructions to the processor interactively, then test and evaluate them on a particular application. With prior systems this is difficult, for three reasons.

First, after proposing a candidate instruction, the user must wait an hour or more before obtaining a compiler and simulator that can use it.

Second, if the user wants to experiment with many candidate instructions, the user must build and maintain a separate software development system for each one. A software development system can be very large, so keeping many versions can quickly become unmanageable.

Finally, the software development system is built for the processor as a whole, which makes it difficult to divide development work among different engineers. Consider two developers working on a particular application: one is responsible for determining the processor's cache characteristics, and the other for adding custom instructions. Although their work is related, each task is separable enough that each developer could work in isolation. The cache developer might first propose a particular configuration. The other developer starts from that configuration, tries several instructions, and builds a software development system for each candidate instruction. If the cache developer then changes the proposed cache configuration, the second developer must rebuild every one of those systems, because each of them assumed the original cache configuration. With many developers on a project, coordinating the different configurations quickly becomes unmanageable.

The present invention solves these problems of the prior art. Its object is to provide a system that configures a processor automatically by generating, from the same configuration specification, both a description of the processor's hardware implementation and a set of software development tools for programming the processor.

Another object of the present invention is to provide such a system that can optimize the hardware implementation and the software tools for various performance criteria.

Another object of the present invention is to provide such a system that allows various kinds of configurability for the processor, including extensibility, binary selection, and parameter modification.

Another object of the present invention is to provide such a system that can describe the instruction set architecture of a processor in a language that is readily implemented in hardware.

Another object of the present invention is to provide a system and method for developing and implementing instruction set extensions that modify processor state.

Another object of the present invention is to provide a system and method for developing and implementing instruction set extensions that modify processor registers.

Another object of the present invention is to enable a user to customize a processor configuration by adding new instructions and to evaluate their characteristics within a few minutes.

The above objects are achieved by providing an automatic processor generation system that uses a description of customized processor instruction set options and extensions, written in a standardized language, to generate a configuration definition of the target instruction set, a hardware description language description of the circuitry needed to implement that instruction set, and development tools such as a compiler, assembler, debugger, and simulator that can be used to develop software for the processor and to verify the processor. The implementation of the processor circuitry can be optimized for various criteria such as area, power consumption, and speed. Once a processor configuration has been developed, it can be tested and fed back into the system, where it is modified to optimize the processor implementation iteratively.

To develop an automatic processor generation system according to the present invention, an instruction set architecture (ISA) description language is defined, and configurable processor/system configuration tools and development tools such as an assembler, linker, compiler, and debugger are developed. This is part of the development process because, although most of these tools are standard, they must be made to configure themselves automatically from the ISA description. This part of the design process is generally carried out by the designer or manufacturer of the automatic processor design tool itself.

The automatic processor generation system according to the present invention operates as follows. A user, for example a system designer, develops a configurable instruction set architecture: using the ISA definition and the previously developed tools, a configurable ISA is developed that meets predetermined ISA design objectives. The development tools and simulator are then configured for this ISA. Using the configured simulator, benchmarks are run to evaluate the effectiveness of the configurable ISA, and the core is revised based on the results. Once the configurable ISA is satisfactory, a verification suite is developed for it.

Alongside these software aspects of the process, the system handles the hardware aspects by developing a configurable processor. Using system objectives such as cost, performance, power, and functionality, together with information on available processor fabrication, the system designs an overall system architecture that takes into account the configurable ISA options, extensions, and processor feature selections. Using the overall system architecture, development software, simulator, configurable ISA, and processor HDL implementation, the system configures the processor ISA, HDL implementation, software, and simulator, and the system HDL is designed for a system-on-a-chip. Further, based on the system architecture and the chip foundries' specifications, a chip foundry is selected by evaluating the foundries' capabilities with respect to the system HDL (rather than in connection with processor selection, as in the prior art). Finally, using the foundry's standard cell library, the configuration system synthesizes the circuit, places and routes it, and provides the ability to re-optimize layout and timing. Then, if the design is not a single-chip design, the circuit board layout is designed, the chips are fabricated, and the boards are assembled.

As the above shows, several techniques are used to enable extensive automation of the processor design process. The first technique used to address these problems is to design and implement specific mechanisms that are less flexible than arbitrary modification or extension but that nevertheless allow significant functional improvement. Limiting the arbitrariness of changes limits the problems associated with it.

The second technique is to write a single description of a change and automatically generate the modifications and extensions of all affected components. Processors designed in the prior art did not do this, because doing something once by hand is often cheaper than writing a tool to do it automatically and then using that tool once. Automation pays off when a task is repeated many times.
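As an illustration of the second technique, a single configuration record can be rendered both as Verilog parameters for the hardware and as C preprocessor macros for the software tools. This is a minimal sketch, assuming hypothetical parameter names and output formats; it is not the format used by any particular product.

```python
from typing import Dict

# One hypothetical configuration record: the single source of truth.
CONFIG: Dict[str, int] = {
    "DCACHE_SIZE_KB": 8,
    "ICACHE_SIZE_KB": 16,
    "NUM_INTERRUPTS": 4,
    "HAS_MUL16": 1,
}


def to_verilog(cfg: Dict[str, int]) -> str:
    """Render the configuration as Verilog localparam declarations for the HDL."""
    return "\n".join(f"localparam {name} = {value};" for name, value in sorted(cfg.items()))


def to_c_header(cfg: Dict[str, int]) -> str:
    """Render the same configuration as C macros for the compiler, simulator and runtime."""
    lines = ["#ifndef CORE_CONFIG_H", "#define CORE_CONFIG_H"]
    lines += [f"#define {name} {value}" for name, value in sorted(cfg.items())]
    lines.append("#endif /* CORE_CONFIG_H */")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_verilog(CONFIG))
    print(to_c_header(CONFIG))
```

Because both outputs are derived from `CONFIG`, changing one value updates the hardware view and the software view consistently. Writing the two generators costs more than editing two files by hand once, but the cost is recovered as soon as the configuration changes repeatedly.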

The third technique is to build databases that support estimation and automatic configuration for subsequent user evaluation.

Finally, the fourth technique is to provide the hardware and software in a form amenable to configuration. In an embodiment of the present invention, some of the hardware and software is not written directly in standard hardware and software languages, but in languages enhanced by a preprocessor that supports configuration database queries, substitution, conditionals, replication, and other modifications, and from which standard hardware and software language code is generated. The core processor design is thus done with hooks into which enhancements can be linked.
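The kind of preprocessor described above can be sketched as a small line-based expander that evaluates conditionals, replicates blocks, and substitutes values taken from a configuration dictionary. The description does not specify the preprocessor's syntax; the `#if NAME`, `#repeat NAME` and `${NAME}` directives below are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List

_SUBST = re.compile(r"\$\{(\w+)\}")


def _substitute(line: str, env: Dict[str, int]) -> str:
    """Replace ${NAME} with its value from the configuration (KeyError if undefined)."""
    return _SUBST.sub(lambda m: str(env[m.group(1)]), line)


def _find_block_end(lines: List[str], start: int, open_kw: str, close_kw: str) -> int:
    """Return the index of the directive that closes the block opened at lines[start]."""
    depth = 0
    for i in range(start, len(lines)):
        word = lines[i].strip().split(" ", 1)[0]
        if word == open_kw:
            depth += 1
        elif word == close_kw:
            depth -= 1
            if depth == 0:
                return i
    raise ValueError(f"unterminated {open_kw} block at line {start + 1}")


def preprocess(lines: List[str], env: Dict[str, int]) -> List[str]:
    """Expand #if/#endif, #repeat/#endrepeat and ${NAME} against a config dict."""
    out: List[str] = []
    i = 0
    while i < len(lines):
        stripped = lines[i].strip()
        if stripped.startswith("#if "):
            end = _find_block_end(lines, i, "#if", "#endif")
            if env.get(stripped.split()[1], 0):  # keep the block only if the option is set
                out += preprocess(lines[i + 1:end], env)
            i = end + 1
        elif stripped.startswith("#repeat "):
            end = _find_block_end(lines, i, "#repeat", "#endrepeat")
            for k in range(env[stripped.split()[1]]):  # one copy per instance, index in ${i}
                out += preprocess(lines[i + 1:end], {**env, "i": k})
            i = end + 1
        else:
            out.append(_substitute(lines[i], env))
            i += 1
    return out


if __name__ == "__main__":
    template = [
        "module core;",
        "#if HAS_MUL16",
        "  mul16 u_mul (.a(a), .b(b), .p(p));",
        "#endif",
        "#repeat NUM_TIMERS",
        "  timer t${i} ();",
        "#endrepeat",
        "endmodule",
    ]
    print("\n".join(preprocess(template, {"HAS_MUL16": 1, "NUM_TIMERS": 2})))
```

The same expander works for HDL and software sources alike, which is what lets one configuration drive both.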

To illustrate these techniques, consider the addition of application-specific instructions. By restricting the method to instructions that take register and constant operands and produce a register result, the operation of an instruction can be specified with combinational (stateless, feedback-free) logic alone. The input specifies the opcode assignment, the instruction name, the assembler syntax, and the combinational logic for the instruction. From this input, the tools generate:
- instruction decode logic so that the processor recognizes the new opcode;
- a functional unit that performs the combinational logic function on the register operands;
- inputs to the processor's instruction scheduling logic so that the instruction issues only when its operands are valid;
- assembler modifications that accept the new opcode and its operands and generate the correct machine code;
- compiler modifications that add new intrinsic functions giving access to the new instruction;
- simulator modifications that accept the machine code as the new instruction and perform the specified logic function; and
- diagnostic generators that produce both directed and random code sequences that include the new instruction and check its results.
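Several of the generated pieces listed above (assembler, decoder, simulator, random diagnostics) can be sketched as functions derived from one instruction description. This is a hedged sketch: the `add4` instruction, its opcode 0x5A and the 24-bit field layout op[23:16] r[15:12] s[11:8] t[7:4] are all hypothetical, and the scheduling logic and compiler intrinsic are not shown.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MASK32 = 0xFFFFFFFF


@dataclass(frozen=True)
class InstrSpec:
    """Single description of a register-to-register instruction (hypothetical format)."""
    name: str
    opcode: int                            # 8-bit opcode field
    semantics: Callable[[int, int], int]   # combinational function of the two source registers


def add4(a: int, b: int) -> int:
    """Four independent 8-bit additions packed in a 32-bit word; carries never cross lanes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        result |= (((a >> shift) + (b >> shift)) & 0xFF) << shift
    return result


SPEC = InstrSpec("add4", 0x5A, add4)


def assemble(spec: InstrSpec, text: str) -> int:
    """Assembler: 'add4 a3, a4, a5' -> 24-bit word op[23:16] r[15:12] s[11:8] t[7:4]."""
    mnemonic, operands = text.split(None, 1)
    if mnemonic != spec.name:
        raise ValueError(f"unknown mnemonic {mnemonic!r}")
    r, s, t = (int(x.strip().lstrip("a")) for x in operands.split(","))
    return (spec.opcode << 16) | (r << 12) | (s << 8) | (t << 4)


def decode(spec: InstrSpec, word: int) -> Optional[Tuple[int, int, int]]:
    """Decoder: return (r, s, t) if the word carries this opcode, else None."""
    if (word >> 16) & 0xFF != spec.opcode:
        return None
    return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF


def simulate(spec: InstrSpec, regs: List[int], word: int) -> None:
    """Simulator: execute the instruction on a 16-entry register file in place."""
    fields = decode(spec, word)
    if fields is None:
        raise ValueError("illegal instruction")
    r, s, t = fields
    regs[r] = spec.semantics(regs[s], regs[t]) & MASK32


def random_diagnostics(spec: InstrSpec, count: int, seed: int = 0) -> List[Tuple[str, int, int, int]]:
    """Diagnostic generator: random operands (for a3, a4) paired with the expected a2 result."""
    rng = random.Random(seed)
    tests = []
    for _ in range(count):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        tests.append((f"{spec.name} a2, a3, a4", a, b, spec.semantics(a, b) & MASK32))
    return tests
```

Every function reads the same `SPEC`, so changing the opcode or the semantics in one place changes the assembler, decoder, simulator and diagnostics together.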

All of the above techniques are used to add application-specific instructions. The input is restricted to the input and output operands and the logic that evaluates them. The change is described in one place, and all hardware and software changes are derived from that description. This shows how a single input can be used to enhance multiple components.

Because the tradeoffs between the processor and the rest of the system logic can be made very late in the design process, the result is a system that meets the application's requirements far better than the prior art. The system is also superior to many of the prior approaches described above in that its configuration can be applied to many more representations. A single source can be used for the entire ISA encoding, software tools and high-level simulation can be included in the configurable package, and the flow can be designed for iteration to find the best combination of configuration values. Furthermore, whereas the previous approaches focused only on hardware configuration or only on software configuration, without a single user interface for control or a measurement system for user-directed refinement, the present invention provides a complete flow for configuring processor hardware and software, including feedback from hardware design results and software performance that helps select the optimal configuration.

According to one aspect of the present invention, these objects are achieved by providing an automated processor design system that uses a description of customized processor instruction set extensions, written in a standardized language, to develop a configurable definition of the target instruction set, a hardware description language description of the circuitry needed to implement the instruction set, and development tools such as a compiler, assembler, debugger, and simulator that can be used to develop and verify applications for the processor. The standardized language can handle instruction set extensions that modify processor state or that use configurable processors. By providing a constrained domain of extensions and optimizations, the process can be automated to a high degree, enabling fast and reliable development.

The above objects are further achieved by another aspect of the present invention, which provides a system in which a user can maintain multiple sets of possible instructions or state (hereinafter, sets of possible configurable instructions or state are collectively called "processor enhancements") and can easily switch among processor enhancements when evaluating them on an application.

Using the methods described here, the user selects and builds a base processor configuration. The user then creates a new set of user-defined processor enhancements and places them in a file directory. Next, the user invokes a tool that processes the user enhancements and converts them into a form usable by the base software development tools. This conversion is very fast because it involves only the user-defined enhancements and does not build an entire software system. The user then invokes the base software development tools and tells them to use the processor enhancements in the new directory dynamically. Preferably, the location of the directory is given to the tools either as a command-line option or through the environment. To simplify the process further, the user can use standard software makefiles. These let the user modify the processor instructions and then, with a single make command, process the enhancements and use the base software development system to rebuild and evaluate the application with the new processor enhancements.
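The directory lookup described above, where a command-line option or an environment variable tells a base tool where the enhancements live, might look like the following sketch. The option name `--enhancements`, the variable `PROC_ENHANCEMENT_DIR` and the `.desc` file suffix are hypothetical; the description only says that a command-line option or the environment is used. Here the command-line option takes precedence, which is a design assumption.

```python
import argparse
import os
import sys
from pathlib import Path
from typing import List, Mapping, Optional, Sequence

ENV_VAR = "PROC_ENHANCEMENT_DIR"  # hypothetical environment variable name


def resolve_enhancement_dir(argv: Sequence[str], environ: Mapping[str, str]) -> Optional[Path]:
    """Command-line option wins; otherwise use the environment; otherwise base config only."""
    parser = argparse.ArgumentParser(prog="tool", add_help=False)
    parser.add_argument("--enhancements", help="directory of user-defined processor enhancements")
    args, _other_tool_flags = parser.parse_known_args(list(argv))
    chosen = args.enhancements or environ.get(ENV_VAR)
    return Path(chosen) if chosen else None


def enhancement_files(directory: Optional[Path], suffix: str = ".desc") -> List[Path]:
    """Enhancement descriptions the tool should load dynamically (suffix is hypothetical)."""
    if directory is None:
        return []
    return sorted(p for p in directory.iterdir() if p.suffix == suffix)


if __name__ == "__main__":
    print(resolve_enhancement_dir(sys.argv[1:], os.environ))
```

Because the base tools only read the directory at run time, switching to another set of enhancements means pointing the option or variable at a different directory; nothing in the base software development system is rebuilt.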

The present invention overcomes the three limitations of the conventional approach. Given a new set of possible enhancements, the user can evaluate them in a matter of moments. The user can keep many versions of possible enhancements by creating a new directory for each set. Because a directory contains only the description of the new enhancements and not an entire software system, the storage required is minimal. Finally, the new enhancements are decoupled from the rest of the configuration: once the user has created a directory containing a possible set of new enhancements, that directory can be used with any base configuration.

A block diagram of a processor that executes an instruction set according to a preferred embodiment of the present invention.
A block diagram of the pipeline used in the processor according to this embodiment.
The configuration manager of the GUI according to this embodiment.
The configuration editor of the GUI according to this embodiment.
The different types of configurability according to this embodiment.
A block diagram of the processor configuration flow of this embodiment.
A block diagram of the instruction set simulator according to this embodiment.
A block diagram of an emulation board for use with a processor configured according to this embodiment.
A block diagram of the logical architecture of a configurable processor according to this embodiment.
A block diagram of the addition of a multiplier to the architecture of FIG. 9.
A block diagram of the addition of a multiply-accumulate unit to the architecture of FIG. 9.
The memory configuration of this embodiment.
The memory configuration of this embodiment.
The addition of a user-defined functional unit to the architecture of FIG. 8.
The addition of a user-defined functional unit to the architecture of FIG. 8.
A block diagram of the flow of information between system components of another preferred embodiment.
A block diagram showing how custom code is generated for the software development tools of this embodiment.
A block diagram of the generation of the various software modules used in another preferred embodiment of the present invention.
A block diagram of the pipeline structure of a configurable processor according to this embodiment.
A gate register implementation according to this embodiment.
The additional logic required to implement a state register in this embodiment.
The coupling of the next-state outputs of a state from the various semantic blocks and selection blocks to the input of the state register according to this embodiment.
Logic corresponding to the semantic logic according to this embodiment.
The logic for the bits of a state when those bits are mapped to bits of a user register in this embodiment.

In general, the automatic processor generation process begins with a configurable processor definition and user-specified modifications to it, together with a user-specified application for which the processor is to be configured. This information is used to generate a configured processor that reflects the user's modifications, and to generate software development tools for it, such as a compiler, simulator, assembler, and disassembler. The application is then recompiled using the new software development tools. The recompiled application is simulated with the simulator to produce a software profile describing the performance of the configured processor when running the application, and the configured processor is evaluated for silicon area, power consumption, speed, and so on to produce a hardware profile characterizing the processor circuit implementation. The software and hardware profiles are fed back to the user, enabling further configuration iterations so that the processor can be optimized for this particular application.
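The feedback step above combines two profiles: the software profile gives a cycle count, and the hardware profile gives the clock rate, area and power. A minimal sketch of that combination follows, assuming the goal is the fastest configuration within area and power budgets; the field names and the example numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """One processor configuration with its software and hardware profiles."""
    name: str
    cycles: int        # software profile: cycles to run the application (from the simulator)
    clock_mhz: float   # hardware profile: achievable clock frequency
    area_mm2: float    # hardware profile: silicon area
    power_mw: float    # hardware profile: power consumption

    @property
    def runtime_ms(self) -> float:
        # cycles / (MHz * 1e6) seconds, times 1e3 for milliseconds
        return self.cycles / (self.clock_mhz * 1e3)


def pick(candidates: List[Candidate], max_area_mm2: float, max_power_mw: float) -> Optional[Candidate]:
    """Fastest candidate that meets the area and power budgets, or None if none fits."""
    feasible = [c for c in candidates
                if c.area_mm2 <= max_area_mm2 and c.power_mw <= max_power_mw]
    return min(feasible, key=lambda c: c.runtime_ms, default=None)
```

For example, a hypothetical base configuration needing 2,000,000 cycles at 200 MHz runs in 10 ms, while a variant with a MAC unit needing 1,200,000 cycles at a slower 180 MHz runs in about 6.7 ms. Neither profile alone shows which is faster, and the area budget decides whether the faster one is allowed at all.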

As shown in FIG. 1, the automatic processor generation system 10 according to a preferred embodiment of the present invention has four major components: a user configuration interface 20, through which a user who wants to design a processor enters configurability and extensibility options and other design constraints; a suite of software development tools 30 that can be customized for a processor designed to the criteria selected by the user; a parameterized, extensible description of the hardware implementation of the processor 40; and a build system 50 that receives input data from the user interface, generates a customized, synthesizable hardware description of the requested processor, and modifies the software development tools to accommodate the selected design. Preferably, the build system 50 also generates diagnostic tools for verifying the hardware and software designs, and estimators for estimating hardware and software characteristics.

"Hardware implementation description," as used here and in the appended claims, means one or more descriptions that describe aspects of the physical implementation of a processor design and that, alone or together with one or more other descriptions, facilitate the manufacture of a chip according to that design. Thus, the components of a hardware implementation description may be at varying levels of abstraction, from relatively high-level ones such as hardware description languages, through netlists and microcode, down to mask descriptions. In this embodiment, however, the main components of the hardware implementation description are written in HDL, netlists, and scripts.

Further, "HDL" as used here and in the appended claims is intended to refer to the general class of hardware description languages used to describe microarchitectures and the like, and is not intended to refer to any particular example of such a language.

In this embodiment, the basis for processor configuration is the architecture 60 shown in FIG. 2. Many elements of the architecture are basic features that the user cannot modify directly. These include the processor control section 62, the align and decode section 64 (although part of this section is based on the user-specified configuration), the ALU and address generation section 66, the branch logic and instruction fetch 68, and the processor interface 70. Other units are part of the basic processor but are user-configurable. These include the interrupt control section 72, the data and instruction address watch sections 74 and 76, the window register file 78, the data and instruction caches and tags 80, the write buffer 82, and the timers 84. The remaining sections shown in FIG. 2 are optionally included by the user.

The central component of the processor configuration system 10 is the user configuration interface 20. This is a module that preferably provides the user with a graphical user interface (GUI) through which processor functionality can be selected, including reconfiguration of the compiler, regeneration of the assembler, disassembler, and instruction set simulator (ISS), and preparation of the input that starts full processor synthesis, placement, and routing. It also gives the user rapid estimates of processor area, power consumption, cycle time, application performance, and code size for further iteration and enhancement of the processor configuration. Preferably, the GUI also accesses a configuration database to obtain default values and to check user input for errors.
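The configuration database lookup described above, which supplies default values and checks user input, can be sketched as a table of allowed values per option. The allowed ranges below follow the configuration option list given later in this description; the short option names and the default values are assumptions made for illustration.

```python
from typing import Collection, Dict, List, Mapping, Tuple, Union

Value = Union[int, str]

# name -> (allowed values, default). Allowed values follow the option list in this
# description; the option names and the defaults are assumptions.
OPTION_DB: Dict[str, Tuple[Collection[Value], Value]] = {
    "num_interrupts": (range(0, 33), 0),
    "high_priority_levels": (range(0, 15), 0),
    "num_timers": (range(0, 4), 0),
    "icache_kb": ((1, 2, 4, 8, 16), 4),
    "dcache_kb": ((1, 2, 4, 8, 16), 4),
    "write_buffer_entries": ((4, 8, 16, 32), 4),
    "read_width_bits": ((32, 64, 128), 32),
    "byte_order": (("little", "big"), "little"),
}


def complete_and_check(user: Mapping[str, Value]) -> Dict[str, Value]:
    """Fill unspecified options with defaults; report unknown names and out-of-range values."""
    errors: List[str] = [f"unknown option {name!r}" for name in user if name not in OPTION_DB]
    config: Dict[str, Value] = {}
    for name, (allowed, default) in OPTION_DB.items():
        value = user.get(name, default)
        if value not in allowed:
            errors.append(f"{name}={value!r} is not one of the allowed values")
        config[name] = value
    if errors:
        raise ValueError("; ".join(errors))
    return config
```

All errors are collected before raising, so the GUI could report every problem in one pass instead of one at a time.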

To use the automatic processor generation system 10 according to this embodiment to design a processor 60, the user enters design parameters into the user configuration interface 20. The automatic processor generation system 10 may be a stand-alone system running on a computer system under the user's control. Preferably, however, the system 10 runs mainly on a system under the control of the manufacturer of the automatic processor generation system 10, with user access provided over a communications network. For example, the GUI may be provided through a web browser with data entry screens written in HTML and Java. This has several advantages, such as preserving the confidentiality of any proprietary back-end software and simplifying maintenance and updates of the back-end software. In this case, to access the GUI, the user may first log on to the system 10 to establish his or her identity.

Once the user has access, the system displays a configuration manager screen 86 as shown in FIG. 3. The configuration manager 86 is a directory listing all of the configurations accessible to the user. The configuration manager 86 in FIG. 3 shows that the user has two configurations, "justintr" and "highprio"; the first has already been built, that is, finalized for production, and the second has yet to be built. From this screen 86, the user may build, delete, or edit a selected configuration, generate a report specifying which configuration and extension options have been selected for it, or create a new configuration. For configurations that have been built, such as "justintr", a suite of software development tools 30 customized for that configuration can be downloaded.

Creating a new configuration or editing an existing one brings up the configuration editor 88 shown in FIG. 4. The configuration editor 88 has an "Options" section menu on the left listing the various general aspects of processor 60 that can be configured and extended. When an options section is selected, a screen with the configuration options for that section is displayed on the right; these options can be set with pull-down menus, memo boxes, check boxes, radio buttons, and the like, as is known in the art. Although the user can select options and enter data in any order, data is preferably entered section by section in sequence, because there are logical dependencies between sections. For example, for the options in the "Interrupts" section to be displayed properly, the number of interrupts must first be selected in the "ISA Options" section.
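The dependency between sections described above can be sketched as a small graph that yields a valid order of data entry, plus a function that derives the "Interrupts" section's fields from the count chosen in "ISA Options". The section names come from the option list in this description; only the Interrupts-on-ISA-Options edge comes from the text, and the per-interrupt field names are hypothetical. The sketch uses `graphlib`, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# Section -> sections whose answers it depends on (only Interrupts -> ISA Options is from the text).
SECTION_DEPS: Dict[str, Set[str]] = {
    "Objectives": set(),
    "ISA Options": set(),
    "Processor Cache & Memory": set(),
    "Peripheral Components": set(),
    "Interrupts": {"ISA Options"},
    "System Memory Addresses": set(),
    "TIE Instructions": set(),
    "Target CAD Environment": set(),
}


def entry_order(deps: Dict[str, Set[str]]) -> List[str]:
    """An order in which sections can be filled in so that every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


def interrupt_fields(num_interrupts: int) -> List[Tuple[str, Tuple[str, ...]]]:
    """Fields the Interrupts section shows, derived from the ISA Options interrupt count."""
    fields: List[Tuple[str, Tuple[str, ...]]] = []
    for i in range(num_interrupts):
        fields.append((f"interrupt{i}.source", ("external", "software")))
        fields.append((f"interrupt{i}.priority_level", ()))  # level choices not listed here
    return fields
```

Until the interrupt count is known, `interrupt_fields` has nothing to show, which is why the editor prefers that "ISA Options" be completed before "Interrupts".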

In this embodiment, the following configuration options are available for each section:

Goals
  Technology for estimation
    Target ASIC technology: 0.18, 0.25, 0.35 micron
    Target operating conditions: typical, worst case
  Implementation goals
    Target speed: any
    Gate count: any
    Target power: any
    Goal prioritization: speed, area, power
ISA Options
  Numeric options
    MAC16 with 40-bit accumulator: yes, no
    16-bit multiplier: yes, no
  Exception options
    Number of interrupts: 0-32
    High-priority interrupt levels: 0-14
    Enable debugging: yes, no
    Number of timers: 0-3
  Other
    Byte order: little endian, big endian
    Number of registers available for register-window calls: 32, 64
Processor Cache & Memory
  Processor interface read width (bits): 32, 64, 128
  Write buffer entries (address/value pairs): 4, 8, 16, 32
  Processor cache
    Instruction/data cache size (kB): 1, 2, 4, 8, 16
    Instruction/data cache line size (bytes): 16, 32, 64
Peripheral Components
  Timers
    Timer interrupt numbers
    Timer interrupt levels
  Debugging support
    Number of instruction address breakpoint registers: 0-2
    Number of data address breakpoint registers: 0-2
    Debug interrupt level
    Trace port: yes, no
    On-chip debug module: yes, no
    Full scan: yes, no
  Interrupts
    Source: external, software
    Priority level
System Memory Addresses
  Vector and address calculation method: XTOS, manual
  Configuration parameters
    RAM size, start address: any
    ROM size, start address: any
    XTOS: any
  Configuration-specific addresses
    User exception vector: any
    Kernel exception vector: any
    Register window overflow/underflow vector base: any
    Reset vector: any
    XTOS start address: any
    Application start address: any
TIE Instructions
  (define ISA extensions)
Target CAD Environment
  Simulation
    Verilog®: yes, no
  Synthesis
    Design Compiler®: yes, no
  Place & Route
    Apollo®: yes, no

In addition, system 10 provides options for adding other features, such as a 32-bit integer multiply/divide unit or a floating-point unit, a memory management unit, on-chip RAM and ROM options, cache associativity, enhanced DSP and coprocessor instruction sets, write-back caches, multiprocessor synchronization, compiler-directed speculation, and support for additional CAD packages. Whatever configuration options are available for a given configurable processor, they are preferably listed in a definition file (for example, the definition file shown in Appendix A) that system 10 uses for syntax checking and similar purposes once the user has selected the appropriate options.
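The syntax check against the definition file can be pictured as validating each selected option against its permitted domain. The sketch below is a minimal illustration assuming a hypothetical in-memory table, OPTION_DOMAINS, that covers a few of the options listed above; it does not reproduce the format of the Appendix A definition file, and the option names are invented for the example.

```python
from typing import Dict, List, Union

Value = Union[bool, int]

# Hypothetical option table: each option maps either to a set of allowed
# discrete values or to an inclusive (min, max) integer range.
OPTION_DOMAINS: Dict[str, object] = {
    "MAC16": {True, False},
    "Mul16": {True, False},
    "NumInterrupts": (0, 32),
    "HighPriorityInterruptLevels": (0, 14),
    "NumTimers": (0, 3),
    "ICacheSizeKB": {1, 2, 4, 8, 16},
    "ICacheLineBytes": {16, 32, 64},
}


def check_config(config: Dict[str, Value]) -> List[str]:
    """Return error messages; an empty list means every option is legal."""
    errors = []
    for name, value in config.items():
        domain = OPTION_DOMAINS.get(name)
        if domain is None:
            errors.append(f"unknown option {name!r}")
        elif isinstance(domain, tuple):
            lo, hi = domain
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not (is_int and lo <= value <= hi):
                errors.append(f"{name}={value!r} outside range {lo}..{hi}")
        elif value not in domain:
            errors.append(f"{name}={value!r} not one of {sorted(domain)}")
    return errors
```

For example, check_config({"NumTimers": 4}) reports that four timers exceed the 0-3 range, which is the kind of error the definition file lets the system catch before building anything.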

From the foregoing, automatic processor configuration system 10 offers the user two general kinds of configurability 300, as shown in FIG. 5: extensibility 302, which lets the user define arbitrary functions and structures from scratch, and modifiability 304, which lets the user choose from a predetermined, constrained set of options. Within modifiability, the system allows binary selection 306 of whether certain features, e.g., MAC16 or DSP, should be added to processor 60, and parameter specification 308 of other processor features, e.g., the number of interrupts and the cache size.

Many of the configuration options above are well known to those skilled in the art. Others, however, deserve special mention. For example, the RAM and ROM options allow the designer to include a scratchpad or firmware on processor 10 itself. Processor 10 can fetch instructions from these memories and read and write data in them. The size and placement of the memories are configurable. In this embodiment, each of these memories is accessed as an additional set of a set-associative cache, and a hit in the memory can be detected by comparison with a single tag.
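One way to picture a local RAM that behaves as an extra set of a set-associative cache is a lookup that compares the address tag against every cache way and, in parallel, against one fixed tag for the local memory. This is a sketch under stated assumptions: the line size, set count, and base address are hypothetical, the local RAM is assumed to be exactly one way in size and aligned to it, and only hit detection is modeled, not data access or refill.

```python
from typing import List, Optional, Tuple

LINE_BYTES = 16                              # hypothetical line size
NUM_SETS = 64                                # hypothetical number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 4
INDEX_BITS = NUM_SETS.bit_length() - 1       # 6
WAY_BYTES = LINE_BYTES * NUM_SETS            # 1 kB covered by one way


def split(addr: int) -> Tuple[int, int]:
    """Return (tag, set index) for an address."""
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index


class CacheWithLocalRam:
    def __init__(self, num_ways: int, local_base: int):
        if local_base % WAY_BYTES:
            raise ValueError("local RAM base must be aligned to the way size")
        # tags[way][set] holds the stored tag, or None for an invalid line.
        self.tags: List[List[Optional[int]]] = [
            [None] * NUM_SETS for _ in range(num_ways)
        ]
        # The local RAM acts as one extra way whose tag never changes,
        # so a single comparison detects a hit anywhere inside it.
        self.local_tag = local_base >> (OFFSET_BITS + INDEX_BITS)

    def lookup(self, addr: int) -> str:
        tag, index = split(addr)
        if tag == self.local_tag:
            return "local"
        for way, way_tags in enumerate(self.tags):
            if way_tags[index] == tag:
                return f"way{way}"
        return "miss"
```

Because the local memory's tag is constant, it needs no tag RAM of its own and never misses or gets evicted, which is what makes it usable as scratchpad or firmware storage.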

System 10 provides separate configuration options for interrupts (implementing level-1 interrupts) and for high-priority interrupts (implementing level 2-15 interrupts and non-maskable interrupts), because each high-priority interrupt level requires three special registers, which makes high-priority interrupts more expensive.

The MAC16 with 40-bit accumulator option (shown at 90 in FIG. 2) adds a 16-bit multiplier/adder with a 40-bit accumulator, eight 16-bit operand registers, and a set of compound instructions that combine multiply, accumulate, operand load, and address update. The operand registers can be loaded with pairs of 16-bit values from memory in parallel with multiply/accumulate operations, so the unit can sustain algorithms that perform two loads and one multiply/accumulate per cycle.
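The arithmetic of a multiply/accumulate step into a 40-bit accumulator can be sketched as follows. This is an illustrative model only: it assumes signed 16-bit operands and two's-complement wraparound at 40 bits, and it does not reproduce the MAC16 instruction set, its operand registers, or its compound load and address-update behavior.

```python
ACC_BITS = 40
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac16(acc: int, a: int, b: int) -> int:
    """acc += a * b, with a and b signed 16-bit and acc wrapping at 40 bits."""
    product = to_signed(a, 16) * to_signed(b, 16)   # always fits in 32 bits
    return to_signed((acc + product) & ACC_MASK, ACC_BITS)


def dot16(xs, ys) -> int:
    """Dot product of two 16-bit sequences using repeated MAC steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac16(acc, x, y)
    return acc
```

The eight guard bits above a 32-bit product are the point of the 40-bit width: several hundred full-scale products can be accumulated before the sum overflows.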

The on-chip debug module (shown at 92 in FIG. 2) is used to access the internal, software-visible state of processor 60 through JTAG port 94. Module 92 provides support for generating exceptions that put processor 60 into debug mode; access to all program-visible registers and memory locations; execution of any instruction that processor 60 is configured to execute; modification of the PC to jump to a desired location in the code; and a utility for returning to the normal operating mode, all triggered from outside processor 60 via JTAG port 94.

Once processor 10 enters debug mode, it waits for an indication from outside that a valid instruction has been scanned in via JTAG port 94. The processor then executes that instruction and waits for the next valid instruction. Once a hardware implementation of processor 10 has been manufactured, module 92 can be used to debug the system. Execution of processor 10 can be controlled by a debugger running on a remote host. The debugger interfaces with the processor via JTAG port 94 and uses the capabilities of on-chip debug module 92 to determine and control the state of processor 10, as well as to control the execution of instructions.
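The debug-mode protocol just described (wait for a valid scanned-in instruction, execute it, wait again) can be modeled as a small loop. This is a behavioral sketch only: the JTAG transport is replaced by a hypothetical queue of (valid, instruction) pairs, and the "instructions" are hypothetical Python callables that act on a register dictionary.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

Regs = Dict[str, int]
# A scanned-in instruction; returning "resume" leaves debug mode.
Instr = Callable[[Regs], Optional[str]]


def debug_mode(regs: Regs, scan_chain: Deque[Tuple[bool, Instr]]) -> str:
    """Execute valid scanned-in instructions one at a time until told to resume."""
    while scan_chain:
        valid, instr = scan_chain.popleft()
        if not valid:
            continue              # nothing valid scanned in yet; keep waiting
        if instr(regs) == "resume":
            return "normal"       # return to the normal operating mode
    return "debug"                # still waiting for the next valid instruction
```

A debugger on the remote host would build such a sequence: write registers, change the PC, then resume.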

Up to three 32-bit counter/timers 84 may be configured. This requires a 32-bit register that increments on every clock cycle, plus, for each configured timer, a compare register and a comparator that compares the compare register contents with the current clock register count, for use with interrupts and similar functions. The counter/timers can be configured as edge-triggered and can generate normal or high-priority internal interrupts.
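The count/compare mechanism can be sketched cycle by cycle as below. The sketch assumes one free-running 32-bit count register shared by all timers and one compare register per timer, and models each interrupt as an edge that fires on the single cycle in which the count equals that timer's compare value. Class and attribute names are illustrative.

```python
from typing import List

MASK32 = (1 << 32) - 1


class TimerBlock:
    def __init__(self, compare_values: List[int]):
        self.count = 0                                          # shared 32-bit cycle counter
        self.compare = [v & MASK32 for v in compare_values]     # one per timer

    def tick(self) -> List[int]:
        """Advance one clock cycle; return indices of timers whose interrupt fires."""
        self.count = (self.count + 1) & MASK32                  # wraps at 2**32
        return [i for i, cmp in enumerate(self.compare) if cmp == self.count]
```

Because equality holds for exactly one cycle per wrap of the counter, the comparator output is naturally an edge, which matches the edge-triggered configuration described above.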

The speculation option provides greater compiler scheduling flexibility by allowing loads to be moved speculatively above control flow, to points where, in the original program, the load would not necessarily have been executed. Because loads may raise exceptions, such load motion could introduce exceptions into a valid program that originally raised none. Speculative loads prevent these exceptions from occurring when the load would not have been executed, but still deliver the exception if the data is actually needed. Instead of raising an exception on a load error, a speculative load resets a valid bit on the destination register (new processor state associated with this option).
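The valid-bit mechanism can be illustrated with a toy register file. A speculative load that would fault clears the destination's valid bit instead of trapping, and the exception surfaces only if a later instruction actually consumes that register. The memory model, the register names, and the SpeculationFault exception are hypothetical.

```python
from typing import Dict


class SpeculationFault(Exception):
    """Raised when a register invalidated by a speculative load is consumed."""


class RegFile:
    def __init__(self, memory: Dict[int, int]):
        self.memory = memory          # address -> value; a missing address faults
        self.value: Dict[str, int] = {}
        self.valid: Dict[str, bool] = {}

    def load_speculative(self, dest: str, addr: int) -> None:
        if addr in self.memory:
            self.value[dest] = self.memory[addr]
            self.valid[dest] = True
        else:
            self.valid[dest] = False  # defer the fault: no exception yet

    def use(self, reg: str) -> int:
        """Non-speculative read; raises the deferred exception if needed.

        In this toy model a register that was never loaded also counts as invalid.
        """
        if not self.valid.get(reg, False):
            raise SpeculationFault(f"deferred load fault in {reg}")
        return self.value[reg]
```

If the branch that originally guarded the load is not taken, the invalid register is simply never read and no exception occurs.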

When multiple processors are used in a system, some form of communication and synchronization between the processors is required, although core processor 60 preferably has some basic pipeline synchronization capability. In some cases, self-synchronizing communication techniques such as input and output queues are used. In other cases, a shared-memory model is used for communication, and instruction-set support for synchronization must be provided because shared memory alone does not provide the required semantics. For example, additional load and store instructions with acquire and release semantics can be added. These are useful for controlling the ordering of memory references in a multiprocessor system where memory locations may be used both for synchronization and for data, so that precise ordering between synchronization references must be maintained.

Where a shared-memory model is used for communication in this way, the required instruction-set support for synchronization is provided by the multiprocessor synchronization option.

Perhaps the most notable configuration option is the TIE instruction definition, from which the designer-defined instruction execution unit 96 is built. The TIE® (Tensilica Instruction Set Extensions) language, developed by Tensilica, Inc. of Santa Clara, California, allows users to describe custom functionality for their applications in the form of extensions and new instructions that extend the base ISA. Moreover, because of its generality, TIE may also be used to describe the parts of the ISA that the user cannot change. In this way, the entire ISA can be used to generate the software development tools 30 and the hardware implementation description 40 uniformly. TIE uses a number of building blocks to describe the attributes of new instructions, as follows:

Instruction fields
Instruction opcodes
Instruction operands
Instruction classes
Instruction semantics
Constant tables

The instruction field statement, field, is used to improve the readability of TIE code. A field is a subset or concatenation of other fields that are grouped together and referenced by name. The complete set of bits in an instruction is the highest-level superset field, inst, which can be divided into smaller fields. For example,

defines two 4-bit fields, x and y, as subfields of the top-level field inst (bits 8-11 and 12-15, respectively), and an 8-bit field xy as the concatenation of the x and y fields.
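The field definitions just described can be mirrored in a few lines of Python. The sketch assumes, as with Verilog's {x, y} concatenation, that the first-named field x occupies the high-order bits of xy; the helper names are illustrative and are not part of TIE.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract word[hi:lo] (inclusive), like a Verilog part-select."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def field_x(inst: int) -> int:
    return bits(inst, 11, 8)                       # x = inst[11:8]


def field_y(inst: int) -> int:
    return bits(inst, 15, 12)                      # y = inst[15:12]


def field_xy(inst: int) -> int:
    return (field_x(inst) << 4) | field_y(inst)    # xy = {x, y}, x in the high nibble
```

For inst = 0xA5C3 this gives x = 0x5, y = 0xA, and xy = 0x5A. Note that xy is not simply inst[15:8] (which would be 0xA5): the nibbles come out swapped, one reason named fields make TIE code easier to read.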

The opcode statement defines opcodes that encode particular fields. Instruction fields intended to specify operands used by opcodes defined in this way, such as registers or immediate constants, must first be defined with a field statement and then with an operand statement.

For example,

defines two new opcodes, acs and adsel, based on the predefined opcode CUST0 (4'b0000 denotes the 4-bit binary constant 0000). The TIE specification of the preferred core ISA includes the following statements as part of its base definition:

Thus, from the definitions of acs and adsel, the TIE compiler generates instruction decoding logic given, respectively, by the following:

The instruction operand statement, operand, identifies registers and immediate constants. Before a field can be defined as an operand, however, it must already have been defined as a field as described above. If the operand is an immediate constant, the value of the constant can be generated from the operand, or it can be taken from a predefined constant table defined as described below. For example, to encode an immediate operand, the following TIE code

defines an 18-bit field named offset, which holds a signed number, and an operand offset4, which is four times the number stored in the offset field. The last part of the operand statement actually describes the circuitry used to perform the computation, written in the subset of Verilog® HDL that describes combinational circuits, as will be apparent to those skilled in the art.

Here, the wire statement defines a set of 32 logical wires named t. The first assign statement after the wire statement specifies that the signal driving these wires is the offset4 constant shifted right by two bits, and the second assign statement specifies that the lower 18 bits of t are placed in the offset field. The very first assign statement directly specifies the value of the offset4 operand as 14 copies of the sign bit of offset (bit 17) concatenated with offset, followed by a left shift of two bits.
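Both directions of the offset4 operand can be checked numerically. The sketch assumes a 32-bit datapath, as the 32-bit wire t suggests: decoding sign-extends the 18-bit field with 14 copies of bit 17 and shifts left by two, and encoding shifts right by two and keeps the low 18 bits. The function names are illustrative.

```python
MASK32 = (1 << 32) - 1
MASK18 = (1 << 18) - 1


def decode_offset4(offset: int) -> int:
    """offset4 = {{14{offset[17]}}, offset} << 2, as a 32-bit pattern."""
    offset &= MASK18
    sign = (offset >> 17) & 1
    extended = ((0x3FFF << 18) if sign else 0) | offset   # 14 sign copies + 18 bits
    return (extended << 2) & MASK32


def encode_offset4(offset4: int) -> int:
    """t = offset4 >> 2; offset = t[17:0]."""
    t = (offset4 & MASK32) >> 2
    return t & MASK18


def as_signed32(value: int) -> int:
    return value - (1 << 32) if value & (1 << 31) else value
```

Only multiples of four that fit in an 18-bit signed value times four survive the encode/decode round trip, which is exactly the range the operand is meant to express.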

For constant-table operands, the TIE code

uses the table statement to define an array of constants, prime (the number following the table name is the number of elements in the table), and uses the operand field s as an index into table prime to encode the value of the operand prime_s (note the use of Verilog® statements in defining the indexing).
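A table operand decodes by indexing the table with the field value and encodes by the inverse search. The sketch assumes a hypothetical 16-entry table holding the first sixteen primes (the actual table contents are not given in the text) and treats s as a 4-bit field.

```python
from typing import List

# Hypothetical contents of table prime (16 entries, so s is 4 bits wide).
PRIME: List[int] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]


def decode_prime_s(s: int) -> int:
    """prime_s = prime[s]"""
    return PRIME[s & 0xF]


def encode_prime_s(prime_s: int) -> int:
    """Find the index s with prime[s] == prime_s; reject values not in the table."""
    try:
        return PRIME.index(prime_s)
    except ValueError:
        raise ValueError(f"{prime_s} is not in table prime") from None
```

The encoder's rejection path corresponds to an assembler error: an immediate that is not in the table simply cannot be expressed by this operand.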

The instruction class statement, iclass, associates opcodes with operands in a common format. All instructions defined in an iclass statement have the same format and operand usage. Before an instruction class can be defined, its components must first be defined as fields and then as opcodes and operands. For example, building on the code used in the earlier example that defined the opcodes acs and adsel, the additional statements

use the operand statement to define three register operands, art, ars, and arr (again, note the use of Verilog® statements in the definitions). Then the iclass statement

specifies that the opcodes adsel and acs belong to a common class of instructions, viterbi, which take the two register operands art and ars as inputs and write their output to the register operand arr.

The instruction semantics statement, semantic, describes the behavior of one or more instructions using the same subset of Verilog® used to encode operands. By defining multiple instructions in a single semantic statement, some common expressions can be shared and the hardware implementation can be made more efficient. The variables allowed in a semantic statement are the operands of the opcodes in the statement's opcode list, plus a single-bit variable for each opcode in that list. Each such variable has the same name as its opcode and evaluates to 1 when that opcode is detected. It is used in the computation section (the Verilog® subset section) to indicate the presence of the corresponding instruction.

For example, TIE code defining a new instruction ADD8_4, which adds the four 8-bit operands in one 32-bit word to the corresponding four 8-bit operands in another 32-bit word, and a new instruction MIN16_2, which selects the minimum of each of the two 16-bit operands in one 32-bit word and the corresponding 16-bit operand in another 32-bit word, might read as follows:

Here, op2, CUST0, arr, art, and ars are predefined as described above, and the opcode and iclass statements function as described above.

The semantic statement specifies the computation performed by the new instructions. As will be readily apparent to those skilled in the art, the second line of the semantic statement specifies the computation performed by the new ADD8_4 instruction, the third and fourth lines specify the computation performed by the new MIN16_2 instruction, and the last line of the section specifies the result written to the arr register.
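The behavior of the two instructions, and the way a single semantic block can produce both results and select one with the single-bit opcode variables, can be modeled in Python. Several details are assumptions not stated in the text: lane additions wrap modulo 256, the 16-bit minimum is unsigned (Verilog's default for plain wires), and lanes are numbered from the least-significant end. The parameters op_add8_4 and op_min16_2 stand in for the opcode variables ADD8_4 and MIN16_2.

```python
def add8_4(ars: int, art: int) -> int:
    """Four independent 8-bit additions; carries do not cross lane boundaries."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (ars >> shift) & 0xFF
        b = (art >> shift) & 0xFF
        result |= ((a + b) & 0xFF) << shift
    return result


def min16_2(ars: int, art: int) -> int:
    """Two independent unsigned 16-bit minimums."""
    result = 0
    for lane in range(2):
        shift = 16 * lane
        a = (ars >> shift) & 0xFFFF
        b = (art >> shift) & 0xFFFF
        result |= min(a, b) << shift
    return result


def semantic(op_add8_4: int, op_min16_2: int, ars: int, art: int) -> int:
    """One shared block: both results are computed, and the one-hot opcode bits
    select which one drives arr (an AND-OR multiplexer in hardware)."""
    sum4 = add8_4(ars, art)
    min2 = min16_2(ars, art)
    return (sum4 if op_add8_4 else 0) | (min2 if op_min16_2 else 0)
```

Sharing one block lets hardware reuse common logic (here, both instructions read the same ars and art and write the same arr), which is the efficiency the semantic statement is meant to expose.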

Returning to the discussion of user input interface 20, once the user has entered all of the desired configuration and extension options, build system 50 takes over. As shown in FIG. 5, build system 50 receives the configuration specification made up of the parameters set by the user and the extensions designed by the user, and combines them with additional parameters defining the core processor architecture, e.g., features that the user cannot modify, to create a single configuration specification 100 describing the entire processor. For example, in addition to the configuration settings selected by the user, build system 50 might add parameters specifying the number of physical address bits for the processor's physical address space, the location of the first instruction executed by processor 60 after reset, and so on.

Tensilica's Xtensa® Instruction Set Architecture (ISA) Reference Manual, Revision 1.0, is incorporated herein by reference for the purpose of showing examples of instructions that can be implemented as core instructions in the configurable processor and instructions that become available through the selection of configuration options.

Configuration specification 100 also includes an ISA package containing TIE language statements that specify the base ISA, any additional packages the user has selected, such as coprocessor package 96 (see FIG. 2) or a DSP package, and any TIE extensions supplied by the user. In addition, configuration specification 100 may contain a number of statements that set flags indicating whether certain structural features are to be included in processor 60. For example,

indicates that the processor includes on-chip debug module 92, interrupt facility 72, and exception handling, but does not include the high-priority interrupt facility.

Using configuration specification 100, the following can be generated automatically, as described later:
- instruction decoding logic for processor 60;
- illegal instruction detection logic for processor 60;
- the ISA-specific portion of assembler 110;
- ISA-specific support routines for compiler 108;
- the ISA-specific portion of disassembler 100 (used by the debugger);
- the ISA-specific portion of simulator 112.

Generating these automatically is useful because an important configuration capability is specifying which packages of instructions are included. For some of these items it would be possible to handle a configured instruction with conditional code in each tool, but this is cumbersome. More importantly, automatic generation allows system designers to easily add instructions for their own systems.
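Conceptually, generated decoding logic asserts one signal per opcode, each the AND of equality tests on the fields that the opcode and its parent opcodes constrain, and illegal-instruction detection is simply the case where no opcode signal is asserted. The sketch below models that idea for acs and adsel derived from CUST0. All field positions, the CUST0 encoding, and the op2 values chosen for acs and adsel are hypothetical, since the actual encodings are not reproduced in this text.

```python
from typing import Dict, List, Optional

# Hypothetical field positions (hi, lo) within a 24-bit instruction word.
FIELDS = {"op0": (3, 0), "op1": (19, 16), "op2": (23, 20)}

# Each opcode lists the field values it requires and may refine a parent opcode.
OPCODES: Dict[str, dict] = {
    "CUST0": {"parent": None, "match": {"op0": 0b0000, "op1": 0b0100}},  # hypothetical
    "acs": {"parent": "CUST0", "match": {"op2": 0b0000}},                # hypothetical
    "adsel": {"parent": "CUST0", "match": {"op2": 0b0001}},              # hypothetical
}
LEAF_OPCODES = ["acs", "adsel"]


def get_field(inst: int, name: str) -> int:
    hi, lo = FIELDS[name]
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def matches(inst: int, opcode: str) -> bool:
    """AND together every field test for the opcode and all of its parents."""
    name: Optional[str] = opcode
    while name is not None:
        spec = OPCODES[name]
        if any(get_field(inst, f) != v for f, v in spec["match"].items()):
            return False
        name = spec["parent"]
    return True


def decode(inst: int) -> Optional[str]:
    """Return the matching opcode, or None, which flags an illegal instruction."""
    hits: List[str] = [op for op in LEAF_OPCODES if matches(inst, op)]
    return hits[0] if hits else None
```

Because the same OPCODES table could also drive an assembler's mnemonic table and a disassembler's lookup, this illustrates why generating every tool from one specification keeps them consistent.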

In addition to accepting configuration specification 100 as input from the designer, build system 50 can also accept goals and determine the configuration automatically. The designer can specify goals for processor 60; for example, clock speed, area, cost, typical power consumption, and maximum power consumption may be goals. Because some goals conflict (e.g., performance can often be increased only by increasing area, power consumption, or both), build system 50 also accepts a priority ordering of the goals. Build system 50 then consults search engine 106 to determine the set of available configuration options, and determines how to set each option using an algorithm that attempts to satisfy all of the input goals simultaneously.

Search engine 106 includes a database with entries describing the effects of configuration settings on various metrics. An entry can specify that a particular setting has an additive, multiplicative, or limiting effect on a metric. An entry can also be marked as requiring other configuration options as prerequisites, or as incompatible with other options. For example, a simple branch prediction option can specify a multiplicative or additive effect on cycles per instruction (CPI, a determinant of performance), a limit on clock speed, an additive effect on area, and an additive effect on power. It can be marked as incompatible with a more elaborate branch predictor, and as depending on the instruction fetch queue size being set to at least two entries. The values of these effects may be functions of parameters such as the branch prediction table size. In general, database entries can be expressed as functions to be evaluated.
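A database entry of this kind can be represented as a record of per-metric effects plus prerequisite and incompatibility lists. Evaluating a candidate configuration then means checking those constraints, applying additive and multiplicative effects to baseline metrics, and taking the tightest limit. Every option name, number, and the baseline below are invented for illustration, and parameter-dependent effects are omitted (they could be modeled by making the values functions).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    add: Dict[str, float] = field(default_factory=dict)    # metric += value
    mul: Dict[str, float] = field(default_factory=dict)    # metric *= value
    limit: Dict[str, float] = field(default_factory=dict)  # metric = min(metric, value)
    requires: List[str] = field(default_factory=list)
    conflicts: List[str] = field(default_factory=list)


DB = {  # hypothetical entries
    "fetch_queue_2": Entry(add={"area": 0.02}),
    "branch_pred_simple": Entry(mul={"cpi": 0.9}, limit={"mhz": 300.0},
                                add={"area": 0.05, "power": 2.0},
                                requires=["fetch_queue_2"],
                                conflicts=["branch_pred_fancy"]),
    "branch_pred_fancy": Entry(mul={"cpi": 0.8}, limit={"mhz": 250.0},
                               add={"area": 0.2, "power": 6.0}),
}
BASELINE = {"cpi": 1.4, "mhz": 350.0, "area": 1.0, "power": 50.0}


def evaluate(chosen: Set[str]) -> Dict[str, float]:
    for name in chosen:
        e = DB[name]
        missing = [r for r in e.requires if r not in chosen]
        clash = [c for c in e.conflicts if c in chosen]
        if missing or clash:
            raise ValueError(f"{name}: missing {missing}, conflicts {clash}")
    metrics = dict(BASELINE)
    entries = [DB[n] for n in sorted(chosen)]
    # Apply all additive effects, then multiplicative ones, then limits,
    # so the result does not depend on the order options were chosen in.
    for e in entries:
        for m, v in e.add.items():
            metrics[m] += v
    for e in entries:
        for m, v in e.mul.items():
            metrics[m] *= v
    for e in entries:
        for m, v in e.limit.items():
            metrics[m] = min(metrics[m], v)
    return metrics
```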

Various algorithms are possible for finding the configuration settings that come closest to meeting the input goals. For example, a simple knapsack packing algorithm considers each option in order sorted by value divided by cost, and accepts any option setting that increases value while keeping cost within a specified limit. So, for example, to maximize performance while keeping power below a specified value, the options are sorted by performance divided by power, and each option that increases performance and can be configured without exceeding the power limit is accepted. More sophisticated knapsack algorithms provide some amount of backtracking.
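The simple knapsack heuristic can be sketched as a single greedy pass: sort options by performance gained per unit of power, then accept each one that still fits under the power budget. The option names and numbers are hypothetical, and effects are assumed to be additive and independent, which real configuration options often are not.

```python
from typing import List, Tuple

Option = Tuple[str, float, float]   # (name, performance gain, power cost)


def greedy_select(options: List[Option], power_budget: float) -> List[str]:
    chosen: List[str] = []
    used = 0.0
    # Highest performance per unit of power first; zero-cost options go first.
    ranked = sorted(
        options,
        key=lambda o: o[1] / o[2] if o[2] > 0 else float("inf"),
        reverse=True,
    )
    for name, gain, cost in ranked:
        if gain > 0 and used + cost <= power_budget:
            chosen.append(name)
            used += cost
    return chosen
```

With options mul16 (30, 5), mac16 (50, 20), bigger_icache (40, 10), and fpu (60, 40) and a budget of 35, the pass accepts mul16, bigger_icache, and mac16 and rejects fpu. Because it never revisits a decision, it can miss better combinations, which is what the backtracking variants address.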

A very different kind of algorithm for determining the configuration from the goals and the design database is based on simulated annealing. A random initial set of parameters is used as the starting point, and changes to individual parameters are then accepted or rejected by evaluating a global utility function. Improvements in the utility function are always accepted, while negative changes are accepted probabilistically, based on a threshold that declines as the optimization proceeds. In this system, the utility function is constructed from the input goals. For example, given the goals performance > 200, power < 100, and area < 4, with priorities in the order power, area, performance, a utility function can be used that rewards decreases in power consumption until power falls below 100 and is neutral thereafter, rewards decreases in area until area falls below 4 and is neutral thereafter, and rewards increases in performance until performance exceeds 200 and is neutral thereafter.

The utility function also has components that reduce the area utility when power is out of specification, and reduce the performance utility when power or area is out of specification.
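A compact simulated-annealing loop with a utility function shaped as described can be sketched as follows: each goal is rewarded until it is met and neutral afterwards, and out-of-specification higher-priority metrics penalize the lower-priority terms. The utility weights, the penalty amounts, the cooling schedule, and the toy on/off configuration space with additive effects are all assumptions; the utility function actually used in the embodiment is not reproduced here.

```python
import math
import random
from typing import Dict, Tuple

# Toy configuration space: option -> (performance, power, area) deltas (hypothetical).
EFFECTS = {
    "mac16": (40.0, 15.0, 0.8),
    "mul16": (25.0, 6.0, 0.3),
    "icache": (30.0, 10.0, 1.2),
    "fpu": (50.0, 35.0, 1.5),
    "dsp": (45.0, 25.0, 1.0),
}
BASE = (120.0, 40.0, 1.0)
NAMES = sorted(EFFECTS)


def metrics(config: Dict[str, bool]) -> Tuple[float, float, float]:
    perf, power, area = BASE
    for name, on in config.items():
        if on:
            dp, dw, da = EFFECTS[name]
            perf, power, area = perf + dp, power + dw, area + da
    return perf, power, area


def utility(perf: float, power: float, area: float) -> float:
    # Goals: perf > 200, power < 100, area < 4; priority power > area > perf.
    u_power = -max(0.0, power - 100.0)   # neutral once power is within spec
    u_area = -max(0.0, area - 4.0)       # neutral once area is within spec
    u_perf = -max(0.0, 200.0 - perf)     # neutral once performance is met
    if power > 100.0:
        u_area -= 10.0                   # power out of spec devalues area
    if power > 100.0 or area > 4.0:
        u_perf -= 10.0                   # power or area out of spec devalues perf
    return 100.0 * u_power + 10.0 * u_area + u_perf


def anneal(steps: int = 2000, seed: int = 0) -> Dict[str, bool]:
    rng = random.Random(seed)
    config = {n: rng.random() < 0.5 for n in NAMES}     # random starting point
    current = utility(*metrics(config))
    best, best_u = dict(config), current
    for step in range(steps):
        temp = max(1e-3, 50.0 * (1 - step / steps))     # threshold falls over time
        name = rng.choice(NAMES)
        config[name] = not config[name]                  # change one parameter
        delta = utility(*metrics(config)) - current
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current += delta                             # accept the change
            if current > best_u:
                best, best_u = dict(config), current
        else:
            config[name] = not config[name]              # reject: undo it
    return best
```

A utility of zero means every goal is met, so the search can stop caring about further gains; this mirrors the "neutral thereafter" behavior described above.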

These and other algorithms can be used to search for configurations that satisfy the specified goals. What matters is that the configurable processor design is described in a design database containing prerequisites, incompatible option specifications, and the effects of the configuration options on various metrics.

The examples shown so far are general and use hardware goals that do not depend on the particular algorithm running on processor 60. The algorithms described can also be used to select the best configuration for a particular user program. For example, the user program can be run on a cache-accurate simulator to measure the number of cache misses for different types of caches with different characteristics, such as different sizes, line sizes, and set associativities. The results of these simulations can be added to the database used by search algorithm 106 described above, to help select hardware implementation description 40.
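Measuring misses for several cache shapes can be done with a small trace-driven simulator. The sketch below is an illustrative LRU set-associative model run on made-up address traces; real profiling would use a trace from the user program, and the model ignores writes, prefetching, and timing.

```python
from collections import OrderedDict
from typing import Iterable, List


def count_misses(trace: Iterable[int], size_bytes: int, line_bytes: int, ways: int) -> int:
    """Count misses of an LRU set-associative cache on a sequence of byte addresses."""
    num_sets = size_bytes // (line_bytes * ways)
    sets: List[OrderedDict] = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        s = sets[line % num_sets]
        tag = line // num_sets
        if tag in s:
            s.move_to_end(tag)            # mark as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used line
            s[tag] = True
    return misses
```

Sweeping size_bytes, line_bytes, and ways over the values the configuration options allow, and recording the miss counts, yields exactly the kind of per-program data the search database can use. For example, alternating accesses to addresses 0 and 1024 thrash a 1 kB direct-mapped cache but fit in a 2-way cache of the same size.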

Similarly, user algorithms can be profiled for the presence of certain instructions that can optionally be implemented in hardware. For example, if a user algorithm spends a significant amount of time performing multiplications, search engine 106 might automatically suggest including a hardware multiplier. Such algorithms need not be limited to considering a single user algorithm: the user can supply a set of algorithms to the system, and search engine 106 can select a configuration that is useful, on average, across that set of user programs.

In addition to selecting preconfigured characteristics of processor 60, the search algorithms can also be used to automatically select, or to suggest to the user, possible TIE extensions. Given the input goals and, perhaps, example user programs written in the C programming language, these algorithms suggest potential TIE extensions. For stateless TIE extensions, compiler-like tools are equipped with pattern matchers. These pattern matchers walk expression nodes bottom-up, searching for multiple-instruction patterns that could be replaced by a single instruction. For example, suppose the user's C program contains the following statements:
x = (y + z) << 2;
x2 = (y2 + z2) << 2;
The pattern matcher discovers that, at two different places, the user adds two numbers and shifts the result left by two bits. The system then adds to the database the possibility of generating a TIE instruction that adds two numbers and shifts the result left by two bits.
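The candidate-discovery step can be sketched with expression trees written as nested tuples, which is a hypothetical stand-in for a compiler's intermediate representation. Walking each tree bottom-up, every operator node that has an operator child is abstracted into a two-level shape. In that shape, operands become `_` slots and constants are kept because a shift amount matters. The sketch then counts how often each shape recurs:

```python
from collections import Counter

def slot(x):
    """Keep integer constants; any other operand becomes an '_' slot."""
    return x if isinstance(x, int) else "_"

def two_level_shape(node):
    """Shape of an operator node plus its operator children."""
    op, *args = node
    parts = []
    for a in args:
        if isinstance(a, tuple):
            parts.append((a[0],) + tuple(slot(x) for x in a[1:]))
        else:
            parts.append(slot(a))
    return (op,) + tuple(parts)

def postorder(node):
    """Yield operator nodes bottom-up (children before parents)."""
    if isinstance(node, tuple):
        for child in node[1:]:
            yield from postorder(child)
        yield node

def candidate_patterns(trees, min_count=2):
    """Count multi-operation shapes; keep those seen at least min_count times."""
    counts = Counter()
    for tree in trees:
        for node in postorder(tree):
            if any(isinstance(c, tuple) for c in node[1:]):
                counts[two_level_shape(node)] += 1
    return {shape: n for shape, n in counts.items() if n >= min_count}
```

For the two statements above, written as `("<<", ("+", "y", "z"), 2)` and `("<<", ("+", "y2", "z2"), 2)`, the only surviving candidate is the add-then-shift-by-2 shape, seen twice.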

System 50 keeps track of many potential TIE instructions, together with a count of how many times each one appears. Using a profiling tool, system 50 also keeps track of how often each instruction is executed over the entire run of the algorithm. Using a hardware estimator, system 50 also keeps track of how expensive it would be to implement each potential TIE instruction in hardware. These numbers are fed into a heuristic search algorithm that selects the set of potential TIE instructions that maximizes the input goals, such as performance, code size, and hardware complexity.
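The heuristic search itself is not specified. As one hypothetical stand-in, a greedy selection ranks candidates by estimated cycles saved per unit of hardware cost and adds them while an area budget allows. The field names and the greedy rule are assumptions, and the result is not guaranteed to be optimal:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    dynamic_count: int       # executions of the replaced pattern (from profiling)
    cycles_saved_each: int   # cycles saved per execution by the fused instruction
    area: int                # estimated hardware cost, assumed positive

def select_tie_instructions(candidates, area_budget):
    """Greedy benefit-per-area selection under an area budget."""
    def ratio(c):
        return c.dynamic_count * c.cycles_saved_each / c.area
    chosen, used = [], 0
    for c in sorted(candidates, key=ratio, reverse=True):
        if used + c.area <= area_budget:
            chosen.append(c.name)
            used += c.area
    return chosen
```

A greedy ratio rule is a common knapsack heuristic. A real system might combine several goals (performance, code size, area) into one objective before ranking.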

Similar but more powerful algorithms are used to find potential TIE instructions that have state. Several different algorithms are used to detect different kinds of opportunities. One algorithm uses a compiler-like tool to scan the user program and detect whether it requires more registers than are available in hardware. As is known to those skilled in the art, this can be detected by counting the number of register spills and restores in the compiled version of the user code. The compiler-like tool then suggests to the search engine a coprocessor with additional hardware registers 98 that supports only the operations used in the portions of the user's code that have many spills and restores. The tool is responsible for supplying the database used by search engine 106 with an estimate of the coprocessor's hardware cost and an estimate of how much the user's algorithm would improve. As described above, search engine 106 makes a global decision as to whether the suggested coprocessor 98 leads to a worthwhile configuration.
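A minimal sketch of the spill-based detection follows. It assumes the compiler can report, for each code region, the number of spill stores and restore loads it inserted together with the set of operations used in that region. The report format and the threshold are hypothetical. Regions at or above the threshold are grouped, and the union of their operations is what the suggested coprocessor would need to support:

```python
def suggest_coprocessor(regions, spill_threshold=8):
    """regions: dict of region name -> (spills, restores, ops_used).
    Returns (hot_regions, ops_to_support), or None if no region qualifies."""
    hot = sorted(name for name, (spills, restores, _) in regions.items()
                 if spills + restores >= spill_threshold)
    if not hot:
        return None
    ops = set()
    for name in hot:
        ops |= regions[name][2]
    return hot, sorted(ops)
```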

Alternatively, or in addition, a compiler-like tool checks whether the user program uses bit-mask operations to guarantee that certain variables never exceed certain limits. In this situation, the tool suggests to search engine 106 a coprocessor 98 that uses data types matching the user's limits (for example, 12-bit, 20-bit, or any other size of integer). In a third algorithm, used in another embodiment for user programs written in C++, the compiler-like tool discovers that much time is spent operating on user-defined abstract data types. If all operations on the data type are suitable for TIE, the algorithm suggests to search engine 106 that all operations on that data type be implemented by a TIE coprocessor.
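The bit-mask check can be illustrated with a small hypothetical helper. A mask of the form 2^n - 1 applied to a variable bounds it to n bits, which suggests an n-bit data type for the coprocessor. Any other mask is treated as not implying a simple width:

```python
def width_from_mask(mask):
    """Return n if mask == 2**n - 1 (a contiguous run of low-order ones),
    otherwise None."""
    if mask <= 0 or mask & (mask + 1):
        return None
    return mask.bit_length()
```

For instance, `x & 0xFFF` implies a 12-bit value and `x & 0xFFFFF` a 20-bit value, matching the example widths in the text.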

To generate the instruction decode logic of processor 60, one signal is generated for each opcode defined in the configuration specification. This code is generated simply by rewriting the opcode declarations into corresponding HDL statements.
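As a hypothetical illustration only, assume a declaration format `opcode NAME FIELD=VALUE [PARENT]`. This syntax is an assumption, not the configuration language itself. Each declaration becomes a Verilog continuous assignment that compares the named instruction field with the value. For a sub-opcode, the assignment also ANDs in the parent opcode's decode signal:

```python
def decode_assigns(declarations):
    """Translate hypothetical 'opcode NAME FIELD=VALUE [PARENT]' lines into
    Verilog decode assignments, one signal per opcode."""
    out = []
    for line in declarations:
        words = line.split()
        if not words or words[0] != "opcode":
            continue
        name = words[1]
        field, value = words[2].split("=")
        cond = f"({field} == {value})"
        if len(words) > 3:                 # sub-opcode: qualify with parent signal
            cond = f"{words[3]} & {cond}"
        out.append(f"assign {name} = {cond};")
    return out
```

For example, the hypothetical declarations `opcode OPGRP op0=4'b0000` and `opcode ADDSH op2=4'b1000 OPGRP` become `assign OPGRP = (op0 == 4'b0000);` and `assign ADDSH = OPGRP & (op2 == 4'b1000);`.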

The generation of register interlock and pipeline stall signals is also automated. This logic is likewise generated from information in the configuration specification. Based on the register-usage information in each instruction's iclass statement and on the instruction's latency, the generated logic inserts a stall (or bubble) whenever a source operand of the current instruction depends on the destination operand of a previous instruction that has not yet completed. The mechanism that implements this stall functionality is part of the core hardware.
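The stall condition can be sketched under simplifying assumptions: a single in-order pipeline in which each instruction's result becomes readable a fixed number of cycles (its latency) after issue. The per-instruction tuples are a hypothetical stand-in for iclass register-usage information:

```python
def stall_cycles(program):
    """program: list of (dests, srcs, latency) per instruction, in issue order.
    Returns the number of bubbles inserted before each instruction so that no
    source register is read before its producer's result is ready."""
    ready_at = {}      # register -> first cycle its newest value can be read
    cycle = 0          # issue cycle of the current instruction
    bubbles = []
    for dests, srcs, latency in program:
        earliest = max([ready_at.get(r, 0) for r in srcs], default=0)
        stall = max(0, earliest - cycle)
        bubbles.append(stall)
        cycle += stall
        for r in dests:
            ready_at[r] = cycle + latency
        cycle += 1
    return bubbles
```

In the example used below, an instruction that reads `a2` immediately after a 2-cycle producer of `a2` receives one bubble, while an independent instruction receives none.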

The illegal-instruction detection logic is generated by NORing together the individually generated instruction signals, each ANDed with the field restrictions of that instruction.
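The rule amounts to illegal = NOR over all opcodes of (decode signal AND field-restriction check). A small Python evaluator shows the logic, using a hypothetical two-opcode encoding in which the low nibble selects the opcode:

```python
def is_illegal(word, opcodes):
    """opcodes: dict name -> (decode_fn, fields_ok_fn), each taking the raw
    instruction word. Illegal when no opcode both decodes and satisfies its
    field restrictions, i.e. a NOR of the ANDed terms."""
    return not any(decode(word) and fields_ok(word)
                   for decode, fields_ok in opcodes.values())

# Hypothetical encoding: low nibble selects the opcode, and ADDX additionally
# requires bits 23:20 to be zero.
OPCODES = {
    "ADDX": (lambda w: (w & 0xF) == 0x0, lambda w: ((w >> 20) & 0xF) == 0),
    "SUBX": (lambda w: (w & 0xF) == 0x1, lambda w: True),
}
```

Under this encoding, `0x100000` decodes as ADDX but violates its field restriction, so it is flagged as illegal.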

The instruction decode signals and the illegal-instruction signal are available as outputs of the decode module and as inputs to the hand-written processor logic.

To generate other processor functions, this embodiment uses a Verilog® description of configurable processor 60 enhanced with a Perl-based preprocessor language. Perl is a full-featured language that includes complex control structures, subroutines, and I/O facilities. In an embodiment of the present invention, a preprocessor called TPP (as the source listed in Appendix B shows, TPP is itself a Perl program) scans its input and identifies certain lines as preprocessor code written in the preprocessor language (Perl, in the case of TPP). In TPP these lines are prefixed with a semicolon. TPP then builds a program consisting of the extracted lines together with statements that generate the text of the other lines. The non-preprocessor lines may contain embedded expressions, which are replaced by the values produced when the TPP program runs. The resulting program is then executed to produce source code, namely the Verilog® code that describes the detailed processor logic 40 (as explained below, TPP is also used to configure the software development tools 30).
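The mechanism can be mimicked in a few lines of Python. This is a sketch, not TPP: preprocessor lines start with `;` as in TPP but contain Python rather than Perl, and text lines mark embedded expressions with backticks (an assumed convention). A `; # end` comment line returns text emission to the outer indentation level after a loop or conditional:

```python
def tpp_compile(template):
    """Turn a TPP-style template into Python source. Lines starting with ';'
    are copied as code; other lines become _emit(...) calls in which each
    `expr` segment is replaced by str(expr)."""
    code, indent = [], 0
    for line in template.splitlines():
        if line.startswith(";"):
            stmt = line[1:]
            if stmt.startswith(" "):
                stmt = stmt[1:]
            code.append(stmt)
            stripped = stmt.strip()
            if stripped:  # text lines follow the indentation of the last code line
                base = len(stmt) - len(stmt.lstrip())
                indent = base + 4 if stripped.endswith(":") else base
        else:
            parts = line.split("`")
            pieces = [repr(p) if i % 2 == 0 else f"str({p})"
                      for i, p in enumerate(parts)]
            code.append(" " * indent + "_emit(" + " + ".join(pieces) + ")")
    return "\n".join(code)

def tpp_run(template, **config):
    """Compile and execute the template; keyword arguments become variables."""
    out = []
    env = dict(config, _emit=out.append)
    exec(tpp_compile(template), env)
    return "\n".join(out)
```

For example, the template `; n = 3`, `; for i in range(n):`, ``wire int_`i`;``, `; # end`, ``assign count = `n`;`` expands to three `wire` lines followed by `assign count = 3;`.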

Used in this context, TPP is a powerful preprocessing language. In addition to evaluating embedded expressions that depend on configuration specification 100, as described above, it allows constructs such as configuration-specification queries, conditional expressions, and iterative structures to be included in the Verilog® code. For example, a TPP assignment can be based on a database query. In such an assignment, config_get_value is the TPP function used to query configuration specification 100, IsaMemoryOrder is a flag set in configuration specification 100, and $endian is a TPP variable to be used later when generating the Verilog® code.

TPP conditional expressions can likewise be used to select which Verilog® code is emitted.

Iterative loops can be implemented with TPP constructs. In such a loop, $i is the TPP loop index variable and $ninterrupts is the number of interrupts specified for processor 60 (obtained from configuration specification 100 using config_get_value).

Finally, TPP code can be embedded in Verilog® expressions. In the example considered here:
$ninterrupts defines the number of interrupts and determines the width, in bits, of the xtscenflop module (a flip-flop primitive module);
srInterruptEn is the output of the flip-flop, declared as a wire with the appropriate number of bits;
srDataIn_W is the input of the flip-flop, of which only the relevant bits, determined by the number of interrupts, are connected;
srInterruptEnWEn is the write enable of the flip-flop;
cReset is the clear input of the flip-flop;
CLK is the input clock of the flip-flop.

For example, given such an input together with the corresponding declaration, TPP generates the final Verilog® code.
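The following Python sketch illustrates the idea behind the embedded-expression example. The number of interrupts is read from a configuration, here a plain dict standing in for configuration specification 100. That number then fixes both the width of srInterruptEn and the slice of srDataIn_W connected to the flip-flop. The configuration key, the `#(n)` parameter syntax, the instance name, and the xtscenflop port names are all hypothetical:

```python
def config_get_value(config, key):
    # Hypothetical stand-in for TPP's configuration query: a dict lookup.
    return config[key]

def interrupt_enable_flop(config):
    """Emit the wire declaration and flip-flop instance for the
    interrupt-enable register, sized by the configured interrupt count."""
    n = config_get_value(config, "NumInterrupts")
    return (
        f"wire [{n - 1}:0] srInterruptEn;\n"
        f"xtscenflop #({n}) srInterruptEn_reg (.q(srInterruptEn), "
        f".d(srDataIn_W[{n - 1}:0]), .en(srInterruptEnWEn), "
        f".clr(cReset), .clk(CLK));"
    )
```

With four interrupts configured, the declaration becomes `wire [3:0] srInterruptEn;` and the instance connects `srDataIn_W[3:0]`.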

The HDL description 114 generated in this way is used at block 122 to synthesize the hardware for the processor implementation, for example with Design Compiler® from Synopsys. Once the design has been placed and routed, the result can be used at block 132 for wire back-annotation and timing verification, for example with PrimeTime® from Synopsys. The product of this process is a hardware profile, which the user can use to provide further input to configuration capture routine 20 for another configuration iteration.

As described with respect to logic synthesis block 122, one result of configuring processor 60 is a set of customized HDL files from which a specific gate-level implementation can be obtained using any of a number of commercial synthesis tools. One such tool is Design Compiler® from Synopsys. To ensure a correct, high-performance gate-level implementation, this embodiment provides the scripts needed to automate the synthesis process in the customer's environment. The challenge in providing such scripts is to support a variety of user synthesis methodologies and different implementation objectives. To address the first challenge, this embodiment breaks the scripts into smaller, functionally complete scripts. For example, it provides a read script that reads all HDL files relevant to a particular processor configuration 60, a timing-constraint script that sets the timing requirements specific to processor 60, and a script that writes out the synthesis results in a form usable for placement and routing of the gate-level netlist. To address the second challenge, this embodiment provides a script for each implementation objective.
For example, it provides a script that achieves a faster cycle time, a script that achieves minimum silicon area, and a script that achieves minimum power consumption.

Scripts are also used in other phases of processor configuration. For example, once the HDL of processor 60 has been written, a simulator can be used to verify the correct operation of processor 60, as described above with respect to block 132. This is often done by running many test programs, or diagnostics, on the simulated processor 60. Running a test program on the simulated processor 60 can require many steps: generating an executable image of the test program, generating a representation of that image that simulator 112 can read, creating a place where simulation results can be collected for later analysis, analyzing the simulation results, and so on. In the prior art, this was done with a number of throwaway scripts. These scripts had some built-in knowledge of the simulation environment, such as which HDL files should be included, where those files are located in the directory structure, and which files the test bench requires. In the present design, the preferred approach is to write script templates that are configured by parameter substitution. The configuration mechanism also uses TPP to generate the list of files needed for simulation.
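A minimal sketch of a configuration-independent run-script template, using Python's `string.Template` for the parameter substitution. The template text, parameter names, simulator command line, and HDL file names are hypothetical. The file list depends on the selected options, as in the text:

```python
from string import Template

RUN_TEMPLATE = Template(
    "#!/bin/sh\n"
    "# Run diagnostic $test on configuration $config\n"
    "$simulator -f $filelist +image=$build_dir/$test.hex > $log_dir/$test.log\n"
)

def simulation_file_list(config):
    """List the HDL files for a configuration; optional units add files."""
    files = ["core.v", "decode.v", "regfile.v", "testbench.v"]
    if config.get("mac16"):
        files.append("mac16.v")
    return [f"{config['hdl_dir']}/{name}" for name in files]

def render_run_script(config, test):
    """Fill the template with configuration-time values for one test."""
    return RUN_TEMPLATE.substitute(
        test=test,
        config=config["name"],
        simulator=config["simulator"],
        filelist=f"{config['build_dir']}/files.f",
        build_dir=config["build_dir"],
        log_dir=config["log_dir"],
    )
```

Because the paths and tool names are supplied at configuration time, the same template serves every configuration. The regression scripts described next can be built the same way.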

Furthermore, the verification process of block 132 often requires the designer to write other scripts for running a series of test programs. These are often used to run regression suites, which give the designer confidence that a given change to the HDL model has not introduced new bugs. These regression scripts were also often throwaway scripts, with many built-in assumptions about file names, locations, and so on. As with the run script for a single test program described above, the regression script is written as a template. This template is configured by substituting actual values for the configuration-time parameters.

The final step in converting the RTL description into a hardware implementation is to use place-and-route (P&R) software to convert the abstract netlist into a geometric representation. The P&R software analyzes the connectivity of the netlist and decides where to place the cells. It then tries to route the connections between all the cells. Clock nets usually deserve special attention and are routed as a last step. Both of these processes can be helped by giving the tools some information, such as which cells are expected to be placed close together (known as soft grouping), the relative placement of cells, and which nets are expected to have small propagation delays.

To facilitate this process, and to ensure that the desired performance goals (cycle time, area, and power consumption) are met, the configuration mechanism generates a set of scripts or input files for the P&R software. These scripts contain information such as that described above, for example the relative placement of cells. They also contain information such as how many power and ground connections are needed and how they should be distributed along the boundary. The scripts are generated by querying a database that holds information on how many soft groups to create, which cells each soft group should contain, and which nets are timing-critical. These parameters change depending on which options have been selected. The scripts must be configurable according to the tools used for placement and routing.

Optionally, the configuration mechanism can request more information from the user and pass it to the P&R scripts. For example, the interface can ask the user for the desired aspect ratio of the final layout, how many levels of buffering should be inserted in the clock tree, which side the input and output pins should be on, the relative or absolute placement of those pins, the power and ground straps, and so on. These parameters are then passed to the P&R scripts to generate the desired layout.

Even more sophisticated scripts can be used, for example to allow more complex clock trees. One common optimization for reducing power consumption is to gate the clock signal. However, this makes clock-tree synthesis much more difficult, because it is very hard to balance the delays of all branches. The configuration interface can ask the user which cells to use for the clock tree, and it can perform some or all of the clock-tree synthesis. It does this by knowing where the gated clocks are in the design and estimating the delay from each gating cell to the clock inputs of the flip-flops. The interface then gives the clock-tree synthesis tool constraints that make the delays of the clock buffers match the delays of the gating cells. In the current implementation, this is done by a general-purpose Perl script. The script reads the gated-clock information generated by the configuration agent according to the options selected. It is run after the design has been placed and routed, but before final clock-tree synthesis is performed.

Further improvements can be made to the profiling process described above. In particular, we describe a process that lets the user obtain similar hardware profile information almost instantly, without spending the time needed to run these CAD tools. The process has several steps.

The first step of this process is to partition the set of all configuration options into groups of orthogonal options, such that the effect of one group's options on the hardware profile is independent of the options in every other group. For example, the impact of the MAC16 unit on the hardware profile is independent of every other option, so an option group containing only the MAC16 option is formed. A more complex example is an option group containing the interrupt options, the high-level interrupt options, and the timer options, because their impact on the hardware profile is determined by the particular combination of these options.

The second step is to characterize the hardware-profile impact of each option group. This is done by obtaining the hardware-profile impact of various combinations of options in the group. For each combination, the profile is obtained with the previously described process, in which an actual implementation is derived and its hardware profile is measured. This information is stored in an estimation database.

The last step is to derive, using curve-fitting and interpolation techniques, specific formulas that compute the hardware profile for a particular combination of options in an option group. Different formulas are used depending on the nature of the options. For example, because each additional interrupt vector adds roughly the same logic to the hardware, we model its hardware impact with a linear function. In another example, a timer unit requires the high-priority interrupt option, so the formula for the hardware impact of the timer option is a conditional formula involving several options.
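The two kinds of formula can be sketched as follows. The gate counts are made-up characterization points, and the linear model is fitted by ordinary least squares. The timer rule encodes the stated dependency on the high-priority interrupt option, with hypothetical constants:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical characterization data: interrupt vectors -> measured gates.
VECTORS = [1, 4, 8, 16]
GATES = [1200, 1500, 1900, 2700]
A, B = fit_line(VECTORS, GATES)

def interrupt_gates(num_vectors):
    """Linear model: each extra vector adds about the same logic."""
    return A + B * num_vectors

def timer_gates(num_timers, high_priority_interrupts):
    """Conditional model: timers are only legal with high-priority interrupts."""
    if num_timers == 0:
        return 0
    if not high_priority_interrupts:
        raise ValueError("timer option requires the high-priority interrupt option")
    return 800 + 350 * num_timers  # hypothetical fitted constants
```

With these points the fit is exact (1100 gates plus 100 per vector), so `interrupt_gates(10)` evaluates to about 2100 gates without running any CAD tool.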

It is useful to provide quick feedback on how architectural choices might affect the run-time performance and code size of applications. Several sets of benchmark programs from a number of application domains are chosen. For each domain, a database is built in advance that estimates how different architectural design decisions affect the run-time performance and code size of the applications in that domain. As the user changes the architectural design, the database is queried for the application domain, or domains, of interest to the user. The results of the evaluation are presented to the user, who thereby obtains an estimate of the trade-off between software benefit and hardware cost.

The quick-evaluation system can easily be extended to suggest to the user how to change the configuration to optimize the processor further. One example is to associate with each configuration option a set of numbers indicating the incremental impact of that option on various cost metrics, such as area, delay, and power. The quick-evaluation system makes it easy to compute the incremental cost impact of a given option: this requires only two calls to the evaluation system, one with the option and one without it. The difference between the two evaluated costs gives the incremental impact of the option. For example, the incremental area impact of the MAC16 option is computed by evaluating the area cost of two configurations, one with and one without the MAC16 option. This difference is then displayed next to the MAC16 option in the interactive configuration system. Such a system can guide the user toward an optimal solution through a series of single-step improvements.
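The with/without computation can be sketched as below. Here `evaluate` is a hypothetical stand-in for the quick-evaluation system. It is just a table of per-option area estimates plus one interaction term, so the incremental impact of an option can depend on what else is selected:

```python
BASE_AREA = 30000
OPTION_AREA = {"MAC16": 6000, "MUL32": 4000, "TIMER": 1500, "HIPRI_INT": 1000}

def evaluate(config):
    """Hypothetical quick estimator: base area plus per-option areas, with a
    discount when MAC16 and MUL32 are assumed to share multiplier hardware."""
    area = BASE_AREA + sum(OPTION_AREA[o] for o in config)
    if {"MAC16", "MUL32"} <= config:
        area -= 1500
    return area

def incremental_impact(config, option):
    """Cost with the option minus cost without it, all else unchanged."""
    return evaluate(config | {option}) - evaluate(config - {option})
```

In this toy model MAC16 costs 6000 gates on its own but only 4500 when MUL32 is already present. This is why the displayed number would be recomputed as the configuration changes.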

Turning to the software side of the automatic processor configuration process, this embodiment of the present invention configures the software development tools 30 so that they are specific to the processor. The configuration process starts from software tools 30 that can be ported to a variety of different systems and instruction set architectures. Such retargetable tools have been widely studied and are well known. This embodiment uses the GNU family of tools, which is free software, including for example the GNU C compiler, the GNU assembler, the GNU linker, the GNU profiler, and various utility programs. These tools 30 are then configured automatically, by generating parts of the software directly from the ISA description and by using TPP to modify hand-written parts of the software.

The GNU C compiler is configured in several different ways. Given the core ISA description, much of the compiler's machine-dependent logic can be written by hand. This part of the compiler is common to all configurations of the configurable processor instruction set, and retargeting it by hand allows fine tuning for the best results. However, even for this hand-coded part of the compiler, some code is generated automatically from the ISA description. In particular, the ISA description defines the sets of constant values that can be used in the immediate fields of various instructions. For each immediate field, a predicate function is generated that tests whether a particular constant value can be encoded in that field. The compiler uses these predicate functions when generating code for processor 60. Automating this aspect of compiler configuration removes an opportunity for inconsistency between the ISA description and the compiler, and it allows the constants in the ISA to be changed with minimal effort.
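A sketch of predicate generation follows. It assumes the ISA description gives each immediate field either a signed bit width or an explicit table of encodable constants; the field names and values are hypothetical. `make_predicate` builds a Python check, and `emit_c_predicate` emits the equivalent C function, of the kind a compiler's machine description could call:

```python
def field_range(bits):
    """Inclusive range of a signed immediate field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def make_predicate(spec):
    """spec is ('range', bits) for a signed field or ('table', values)."""
    kind, arg = spec
    if kind == "range":
        lo, hi = field_range(arg)
        return lambda v: lo <= v <= hi
    values = frozenset(arg)
    return lambda v: v in values

def emit_c_predicate(name, spec):
    """Emit C source for an 'is this constant encodable?' predicate."""
    kind, arg = spec
    if kind == "range":
        lo, hi = field_range(arg)
        body = f"  return v >= {lo} && v <= {hi};"
    else:
        cases = "".join(f"  case {v}:\n" for v in sorted(arg))
        body = f"  switch (v) {{\n{cases}    return 1;\n  default:\n    return 0;\n  }}"
    return f"int {name}_immediate_p(long v)\n{{\n{body}\n}}\n"
```

Because both forms are derived from the same field specification, changing a constant in the ISA description changes the compiler's predicate automatically.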

Several aspects of the compiler are configured by preprocessing with TPP. For configuration options controlled by parameter selection, the corresponding parameters in the compiler are set by TPP. For example, the compiler has a variable indicating whether target processor 60 uses big-endian or little-endian byte ordering, and this variable is set automatically by a TPP command that reads the endianness from configuration specification 100. TPP is also used to conditionally enable or disable the hand-coded parts of the compiler that generate code for optional ISA packages, depending on whether the corresponding package is enabled in configuration specification 100. For example, the code that generates multiply/accumulate instructions is included in the compiler only if the configuration specification includes MAC16 option 90.

The compiler is also configured to support designer-defined instructions specified in the TIE language. There are two levels of this support. At the lowest level, designer-defined instructions are available in the code being compiled as macros, intrinsics, or inline (extern) functions. This embodiment of the present invention generates a C header file that defines inline functions as "inline assembly" code, a standard feature of the GNU C compiler. Given the designer-defined opcodes and their operands, generating this header file is a simple translation into the GNU C compiler's inline assembly syntax. An alternative implementation creates a header file containing C preprocessor macros that specify the inline assembly instructions. Yet another alternative uses TPP to add intrinsics directly to the compiler.
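Generating such a header is essentially string translation. This sketch assumes a designer-defined instruction that reads two general registers and writes one. The instruction name, the lowercase assembler mnemonic, and the generic `"r"` register constraint are hypothetical; a real target would use its own register-class constraint letters:

```python
def inline_asm_wrapper(opcode, ctype="int"):
    """Emit a GNU C static inline function that wraps a three-register
    designer-defined instruction using extended inline assembly."""
    fn = opcode.lower()
    return (
        f"static inline {ctype} {fn}({ctype} a, {ctype} b)\n"
        "{\n"
        f"  {ctype} result;\n"
        f'  __asm__ ("{fn} %0, %1, %2" : "=r" (result) : "r" (a), "r" (b));\n'
        "  return result;\n"
        "}\n"
    )
```

A user program that includes the generated header can then call `addsh2(x, y)` like an ordinary C function, and the compiler emits the single instruction in place of the call.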

A second level of support for designer-defined instructions is provided by having the compiler automatically recognize opportunities to use them. These instructions can be defined directly by the user or created automatically during the configuration process. Before the user application is compiled, the TIE code is automatically examined and converted into equivalent C functions. This is the same step that can be used to enable fast simulation of TIE instructions. The equivalent C functions are partially compiled into the tree-based intermediate representation used by the compiler, and this representation of each TIE instruction is stored in a database. When the user application is compiled, part of the compilation process is a pattern matcher. The user application is compiled into the tree-based representation, and the pattern matcher walks each tree of the user program bottom-up. At each step of the walk, the pattern matcher checks whether the intermediate representation rooted at the current point matches any of the TIE instructions in the database. If it does, the match is recorded. After each tree has been walked, the set of maximally sized matches is selected, and each maximal match in the tree is replaced by the equivalent TIE instruction.
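The matching step can be sketched with expression trees written as nested tuples, a hypothetical stand-in for the compiler's intermediate representation. TIE patterns use `?`-prefixed names for operands. A bottom-up walk records the largest pattern that matches at each node. A top-down rewrite then replaces only the outermost (maximal) matches, so a smaller pattern nested inside a larger match is not used separately:

```python
def match(pattern, node, bindings):
    """Structurally match pattern against node; '?x' names bind subtrees."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings[pattern] == node
        bindings[pattern] = node
        return True
    if isinstance(pattern, tuple) and isinstance(node, tuple):
        return (len(pattern) == len(node) and pattern[0] == node[0]
                and all(match(p, n, bindings) for p, n in zip(pattern[1:], node[1:])))
    return pattern == node

def size(pattern):
    """Number of operator nodes in a pattern."""
    return 1 + sum(size(p) for p in pattern[1:]) if isinstance(pattern, tuple) else 0

def best_matches(tree, tie_db):
    """Bottom-up walk recording, per node, the largest matching TIE pattern."""
    ordered = sorted(tie_db.items(), key=lambda kv: -size(kv[1][0]))
    found = {}
    def walk(node):
        if isinstance(node, tuple):
            for child in node[1:]:
                walk(child)
            for name, (pattern, operands) in ordered:
                b = {}
                if match(pattern, node, b):
                    found[id(node)] = (name, [b[v] for v in operands])
                    break
    walk(tree)
    return found

def rewrite(node, found):
    """Top-down: replace each maximal match by its TIE instruction."""
    if not isinstance(node, tuple):
        return node
    if id(node) in found:
        name, args = found[id(node)]
        return (name, *[rewrite(a, found) for a in args])
    return (node[0], *[rewrite(c, found) for c in node[1:]])
```

For `x = (y + z) << 2` with a hypothetical add-and-shift instruction and a plain add instruction in the database, the shift node is rewritten to the add-and-shift instruction. The nested add is then consumed by that match and not replaced on its own.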

The algorithm described above automatically recognizes opportunities to use stateless TIE instructions. Additional approaches can be used to automatically recognize opportunities to use TIE instructions that have state. A previous section described algorithms for automatically selecting potential TIE instructions with state, and the same algorithms are used to automatically apply TIE instructions in C or C++ applications.

If a TIE coprocessor has been defined to have more registers but a limited set of operations, regions of code are scanned to see whether they suffer from register spilling and whether they use only the available operations. When such regions are found, their code is automatically changed to use the coprocessor instructions and registers 98. Conversion operations are generated at the region boundaries to move data into and out of the coprocessor 98.

Similarly, if a TIE coprocessor has been defined to operate on integers of a different size, regions of code are examined to see whether all data in the region is accessed as if it were of that size. For matching regions, the code is changed and glue code is added at the boundaries. Likewise, if a TIE coprocessor 98 has been defined to implement a C++ abstract data type, all operations on that data type are replaced with TIE coprocessor instructions.

Note that automatically suggesting TIE instructions and automatically using TIE instructions are each useful on their own. The user can also apply suggested TIE instructions manually through the intrinsic mechanism. Conversely, the algorithms for using TIE instructions can be applied to manually designed TIE instructions or coprocessors 98.

Regardless of whether designer-defined instructions are introduced through inline functions or by automatic recognition, the compiler needs to know their potential side effects so that it can optimize and schedule them. To improve performance, conventional compilers optimize user code to maximize desired characteristics such as run-time performance, code size, or power consumption. As is well known to those skilled in the art, such optimizations include rearranging instructions and replacing certain instructions with other, semantically equivalent instructions. To optimize well, the compiler must know how every instruction affects the different parts of the machine. Two instructions that read and write different parts of the machine state can be freely reordered, whereas instructions that access the same part of the state may not be. In conventional processors, the state read and/or written by each instruction is hardwired into the compiler, sometimes in the form of a table.

In one embodiment of the present invention, TIE instructions are conservatively assumed to read and write all of the state of the processor 60. This allows the compiler to generate correct code, but it limits the compiler's ability to optimize code that contains TIE instructions. In another embodiment, a tool automatically reads the TIE definitions and determines, for each TIE instruction, which state that instruction reads or writes. The tool then modifies the tables used by the compiler's optimizer so that they accurately model the effect of each TIE instruction.
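A minimal sketch of such an effects table follows. It assumes each element of non-register machine state is assigned one bit in a mask, and that ordinary register dependences are handled separately by the compiler. The instruction names and state bits are hypothetical. The conservative embodiment corresponds to giving a TIE instruction all-ones masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per element of machine state tracked by the optimizer (hypothetical). */
enum {
    ST_MEM = 1 << 0,   /* memory */
    ST_SAR = 1 << 1,   /* shift-amount register */
    ST_ACC = 1 << 2,   /* a designer-defined TIE state register */
};
#define ST_ALL 0xFFFFFFFFu   /* conservative default: touches everything */

typedef struct {
    const char *name;
    uint32_t reads, writes;   /* state bits read and written */
} Effects;

/* Rows the compiler would hardwire for core instructions... */
static const Effects LOAD       = { "load",       ST_MEM, 0 };
static const Effects STORE      = { "store",      0,      ST_MEM };
static const Effects SET_SHIFT  = { "set_shift",  0,      ST_SAR };
/* ...and rows for TIE instructions: precise if a tool scanned the TIE
   definition, all-ones under the conservative embodiment. */
static const Effects TIE_MAC    = { "tie_mac",    ST_ACC, ST_ACC };
static const Effects TIE_OPAQUE = { "tie_opaque", ST_ALL, ST_ALL };

/* Two instructions may be swapped only if neither writes state that the
   other reads or writes (no RAW, WAR, or WAW dependence on machine state). */
static bool can_reorder(const Effects *a, const Effects *b) {
    return (a->writes & (b->reads | b->writes)) == 0 &&
           (b->writes & a->reads) == 0;
}
```

With precise rows, a load can move across `tie_mac`. The conservative `tie_opaque` row blocks every such move, which is the optimization cost the conservative embodiment accepts.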

As with the compiler, the machine-dependent portion of the assembler 110 includes both automatically generated parts and hand-coded parts configured with TPP. Some features common to all configurations are supported by hand-written code. However, the primary task of the assembler 110 is to encode machine instructions, and the instruction encoding and decoding software can be generated automatically from the ISA description.

Because instruction encoding and decoding are useful in several different software tools, this embodiment of the present invention groups the software that performs these tasks into a separate software library. The library is generated automatically from the information in the ISA description.

The library defines:
- an enumeration of the opcodes;
- a function (stringToOpcode) that efficiently maps opcode mnemonic strings onto members of that enumeration;
- tables that specify, for each opcode, the instruction length (instructionLength), the number of operands (numberOfOperand), the operand fields, the operand types, i.e., register or immediate (operandType), the binary encoding (encodeOpcode), and the mnemonic string (opcodeName).

For each operand field, the library provides an accessor function that encodes the corresponding bits of the instruction word (fieldSetFunction) and one that decodes them (fieldGetSetFunction). All of this information is readily available in the ISA description, so generating the library software is simply a matter of translating it into executable C code.

For example, the instruction encodings are recorded in a C array variable in which each entry is the encoding of a particular instruction. Each entry is produced by setting every opcode field to the value the ISA description specifies for that instruction. The encodeOpcode function simply returns the array value for a given opcode.
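A minimal sketch of what such a generated library might look like follows, for a hypothetical three-opcode ISA with 24-bit instructions. The function names `stringToOpcode`, `encodeOpcode`, `opcodeName`, `instructionLength`, and `numberOfOperand` follow the text. The opcode set, encodings, field positions, and the `field_r_set`/`field_r_get` style accessor names are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Enumeration of opcodes (hypothetical three-instruction ISA). */
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_COUNT, OP_INVALID = -1 } Opcode;

/* Per-opcode tables, as a generator might emit them from the ISA description. */
static const char *const opcodeNameTable[OP_COUNT] = { "add", "sub", "and" };
static const int instructionLengthTable[OP_COUNT]  = { 3, 3, 3 };   /* bytes */
static const int numberOfOperandTable[OP_COUNT]    = { 3, 3, 3 };
/* Encoding with every opcode field set and every operand field zero. */
static const uint32_t encodeTable[OP_COUNT] = { 0x800000, 0xC00000, 0x100000 };

const char *opcodeName(Opcode op)  { return opcodeNameTable[op]; }
int instructionLength(Opcode op)   { return instructionLengthTable[op]; }
int numberOfOperand(Opcode op)     { return numberOfOperandTable[op]; }
uint32_t encodeOpcode(Opcode op)   { return encodeTable[op]; }

/* Maps a mnemonic to its opcode; a generator might emit a hash table instead. */
Opcode stringToOpcode(const char *s) {
    for (int i = 0; i < OP_COUNT; i++)
        if (strcmp(s, opcodeNameTable[i]) == 0) return (Opcode)i;
    return OP_INVALID;
}

/* Set/get accessors for the 4-bit operand fields r, s, t (bits 15:12, 11:8, 7:4). */
#define FIELD_ACCESSORS(name, shift)                                      \
    static uint32_t name##_set(uint32_t insn, uint32_t v) {                \
        return (insn & ~(0xFu << (shift))) | ((v & 0xFu) << (shift));      \
    }                                                                      \
    static uint32_t name##_get(uint32_t insn) { return (insn >> (shift)) & 0xFu; }

FIELD_ACCESSORS(field_r, 12)
FIELD_ACCESSORS(field_s, 8)
FIELD_ACCESSORS(field_t, 4)
```

Encoding `sub` with operand fields 3, 4, and 5 starts from `encodeOpcode(OP_SUB)` and applies the three set accessors, giving `0xC03450`.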

The library also provides a function that decodes the opcode of a binary instruction (decodeInstruction). This function is generated as a series of nested switch statements. The outermost switch tests the subopcode field at the top of the opcode hierarchy, and each nested switch tests a subopcode field progressively lower in the hierarchy. The code generated for this function therefore has the same structure as the opcode hierarchy itself.
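The nested-switch shape can be sketched for a hypothetical opcode hierarchy. In it, a 4-bit `op0` field selects a group. Group 0 is refined by `op1` and then `op2`, and group 2 uses the `r` field as its subopcode. The field positions and values are assumptions chosen only to illustrate the structure.

```c
#include <stdint.h>

typedef enum { OP_INVALID = -1, OP_ADD, OP_SUB, OP_AND, OP_L32I, OP_S32I } Opcode;

/* Extracts the 4-bit field whose least significant bit is at position lo. */
static unsigned field(uint32_t insn, int lo) { return (insn >> lo) & 0xFu; }

/* One switch per level of the (hypothetical) opcode hierarchy. */
Opcode decodeInstruction(uint32_t insn) {
    switch (field(insn, 0)) {            /* op0: top of the hierarchy */
    case 0x0:
        switch (field(insn, 16)) {       /* op1 */
        case 0x0:
            switch (field(insn, 20)) {   /* op2 */
            case 0x8: return OP_ADD;
            case 0xC: return OP_SUB;
            case 0x1: return OP_AND;
            }
            break;
        }
        break;
    case 0x2:
        switch (field(insn, 12)) {       /* r field used as a subopcode */
        case 0x2: return OP_L32I;
        case 0x6: return OP_S32I;
        }
        break;
    }
    return OP_INVALID;                   /* no leaf reached */
}
```

Because a generator emits one `switch` per level, adding a subopcode in the ISA description only adds a `case` at the matching depth.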

Given this library for encoding and decoding instructions, the assembler 110 is easy to implement. For example, the assembler's instruction-encoding logic is quite simple.
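A sketch of how the assembler's encoding step could sit on top of such a library follows. Expression evaluation and text parsing are omitted. The small table inside stands in for the generated library, and its mnemonics, encodings, and 4-bit field positions are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the generated library: one row per opcode (hypothetical values). */
typedef struct {
    const char *mnemonic;
    uint32_t encoding;      /* opcode fields set, operand fields zero */
    int nOperands;
    int fieldShift[3];      /* bit position of each 4-bit operand field */
} OpcodeInfo;

static const OpcodeInfo opcodes[] = {
    { "add", 0x800000, 3, { 12, 8, 4 } },
    { "sub", 0xC00000, 3, { 12, 8, 4 } },
};

/* Encodes `mnemonic` with the given operand values into *out.
   Returns 0 on success, -1 for an unknown mnemonic or a bad operand count/value. */
int assembleInstruction(const char *mnemonic, const int *operands, int n, uint32_t *out) {
    for (size_t i = 0; i < sizeof opcodes / sizeof opcodes[0]; i++) {
        const OpcodeInfo *op = &opcodes[i];
        if (strcmp(op->mnemonic, mnemonic) != 0) continue;   /* stringToOpcode */
        if (n != op->nOperands) return -1;
        uint32_t insn = op->encoding;                       /* encodeOpcode */
        for (int k = 0; k < n; k++) {                       /* field set accessor */
            if (operands[k] < 0 || operands[k] > 15) return -1;
            insn |= (uint32_t)operands[k] << op->fieldShift[k];
        }
        *out = insn;
        return 0;
    }
    return -1;                                              /* unknown mnemonic */
}
```

Each step maps onto one library facility: mnemonic lookup, the base encoding, and one field-set call per operand. This is why the assembler needs little configuration-specific code of its own.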

Implementing the disassembler 110, which translates binary instructions into a readable form closely resembling assembly code, is equally straightforward.
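A comparable sketch for the disassembler follows. A linear mask-and-match table lookup stands in for the generated decodeInstruction, and the field accessors supply the operands. The table contents, the `aN` register naming, and the `.word` fallback for unknown words are assumptions for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *mnemonic;
    uint32_t mask, match;   /* opcode bits that identify the instruction */
    int nOperands;
    int fieldShift[3];      /* bit position of each 4-bit register field */
} OpcodeInfo;

/* Hypothetical stand-in for the generated library's tables. */
static const OpcodeInfo opcodes[] = {
    { "add", 0xFF000F, 0x800000, 3, { 12, 8, 4 } },
    { "sub", 0xFF000F, 0xC00000, 3, { 12, 8, 4 } },
};

/* Writes a readable form such as "sub a3, a4, a5" into buf.
   Words that match no opcode are printed as raw data. */
void disassemble(uint32_t insn, char *buf, size_t cap) {
    for (size_t i = 0; i < sizeof opcodes / sizeof opcodes[0]; i++) {
        const OpcodeInfo *op = &opcodes[i];
        if ((insn & op->mask) != op->match) continue;        /* decode */
        int len = snprintf(buf, cap, "%s", op->mnemonic);    /* opcodeName */
        for (int k = 0; k < op->nOperands && len >= 0 && (size_t)len < cap; k++) {
            unsigned v = (insn >> op->fieldShift[k]) & 0xFu; /* field get accessor */
            len += snprintf(buf + len, cap - len, "%s a%u", k ? "," : "", v);
        }
        return;
    }
    snprintf(buf, cap, ".word 0x%06x", (unsigned)insn);
}
```

The same routine can serve both a standalone disassembler and a debugger. The encode/decode information lives in one library rather than in each tool.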

This disassembler algorithm is used in a standalone disassembler tool and also in the debugger 130 to support debugging of machine code.

The linker is less sensitive to the configuration than the compiler and the assembler 110. Much of the linker is standard. Even its machine-dependent portions depend primarily on the core ISA description and can be hand-coded for a particular core ISA. Parameters such as endianness are set from the configuration specification 100 using TPP.

The memory map of the target processor 60 is the other aspect of the configuration that the linker needs, and the parameters that specify it are likewise inserted into the linker using TPP. In this embodiment of the invention, the GNU linker is driven by a set of linker scripts, and it is these linker scripts that contain the memory map information. The advantage of this approach is flexibility. If the memory map of the target system differs from the one specified when the processor 60 was configured, additional linker scripts can be generated later without reconfiguring the processor 60 and without rebuilding the linker. Accordingly, this embodiment includes a tool that constructs new linker scripts with different memory map parameters.
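A sketch of the core of such a tool follows, assuming the memory map is given as a list of named regions. It emits a GNU ld `MEMORY` command. The region names, attribute letters, and addresses are hypothetical. A complete script would also need a `SECTIONS` command that places output sections into these regions.

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;              /* region name referenced by SECTIONS */
    const char *attrs;             /* ld attribute letters, e.g. "rx" */
    unsigned long origin, length;
} MemRegion;

/* Writes a GNU ld MEMORY block for the given regions into buf.
   Returns 0 on success, -1 if buf is too small. */
int emitMemoryBlock(const MemRegion *r, int n, char *buf, size_t cap) {
    int k = snprintf(buf, cap, "MEMORY\n{\n");
    if (k < 0 || (size_t)k >= cap) return -1;
    size_t used = (size_t)k;
    for (int i = 0; i < n; i++) {
        k = snprintf(buf + used, cap - used,
                     "  %s (%s) : ORIGIN = 0x%08lx, LENGTH = 0x%08lx\n",
                     r[i].name, r[i].attrs, r[i].origin, r[i].length);
        if (k < 0 || (size_t)k >= cap - used) return -1;
        used += (size_t)k;
    }
    k = snprintf(buf + used, cap - used, "}\n");
    if (k < 0 || (size_t)k >= cap - used) return -1;
    return 0;
}
```

A new board with a different memory layout then only needs a new region list. The processor and the linker binary stay as they are.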

The debugger 130 provides mechanisms for observing the state of a program as it runs, single-stepping execution one instruction at a time, setting breakpoints, and performing other standard debugging tasks. The program being debugged can run either on a hardware implementation of the configured processor or on the ISS 126, and the debugger presents the same interface to the user in both cases. When the program runs on a hardware implementation, a small monitor program is included on the target system. It controls execution of the user's program and communicates with the debugger through a serial port. When the program runs on the simulator 126, the simulator 126 itself performs these functions.

The debugger 130 depends on the configuration in several ways. To support disassembling machine code from within the debugger 130, the debugger is linked with the instruction encoding/decoding library described above. Several parts are generated by scanning the ISA description to find which registers exist in the processor 60. These are the part of the debugger 130 that displays the processor's register state, and the parts of the debug monitor program and the ISS 126 that supply this information to the debugger 130.

The other software development tools 30 are standard and need not be changed for each processor configuration. The profile viewer and various utility programs fall into this category. These tools may need to be retargeted once to operate on files in the binary format shared by all configurations of the processor 60. Beyond that, they depend neither on the ISA description nor on any other parameter in the configuration specification 100.

The configuration specification is also used to configure a simulator called the ISS 126, shown in FIG. 13. The ISS 126 is a software application that models the functional behavior of the configurable processor instruction set. Its counterparts, processor hardware model simulators such as Synopsys VCS and the Cadence Verilog-XL and NC simulators, model the hardware itself. The ISS instead models the CPU abstractly, at the level of instruction execution. The ISS 126 runs much faster than a hardware simulation because it does not need to model every signal transition of every gate and register in the complete processor design.

The ISS 126 allows programs generated for the configured processor 60 to be executed on a host computer. It accurately reproduces the processor's reset and interrupt behavior, which allows development of low-level programs such as device drivers and initialization code. This is particularly useful when porting native code to an embedded application.

The ISS 126 can be used to identify potential problems, such as architectural assumptions and memory-ordering issues, without having to download code to the actual embedded target.

In this embodiment, ISS semantics are expressed textually in a C-like language, from which C operator building blocks are constructed that turn instructions into functions. For example, rudimentary interrupt functionality, such as the interrupt register, bit setting, interrupt levels, and vectors, is modeled using this language.

The configurable ISS 126 is used for the following four purposes or goals as part of the system design and verification process:
- debugging software applications before hardware becomes available;
- debugging system software (e.g., compilers and operating system components);
- comparing against HDL simulation for hardware design verification. The ISS serves as the reference implementation of the ISA. The ISS and the processor HDL are both run on diagnostics and applications during processor design verification, and the traces from the two are compared; and
- analyzing software application performance (this may be part of the configuration process, or it may be used for further application tuning after a processor configuration has been selected).

All of these goals require that the ISS 126 be able to load and decode programs produced with the configurable assembler 110 and linker. They also require that the ISS execution of instructions be semantically equivalent to the corresponding hardware execution and to the compiler's expectations. For these reasons, the ISS 126 derives its decoding and execution behavior from the same ISA files used to define the hardware and the system software.

For the first and last goals listed above, it is important that the ISS 126 run as fast as possible for the accuracy required. The ISS 126 therefore allows dynamic control of the level of simulation detail. For example, cache details are not modeled unless requested, and cache modeling can be switched on and off dynamically. In addition, parts of the ISS 126 (e.g., the cache and pipeline models) are configured before the ISS 126 is compiled, so that the ISS 126 makes very few configuration-dependent behavior choices at run time.

For the first and last goals, it is also important that the ISS 126 provide operating system services to the application when those services are not available from the OS of the system under design (the target). Equally, these services should be provided by the target OS when that OS is a relevant part of the debugging process. In this way, the system provides a design for flexibly moving these services between the ISS host and the simulation target. The current design relies on a combination of two mechanisms. The first is ISS dynamic control, under which trapping of SYSCALL instructions can be switched on and off. The second is a special SIMCALL instruction used to request host OS services.

The last goal requires the ISS 126 to model some aspects of processor and system behavior that lie below the level specified by the ISA. In particular, the ISS cache model is constructed by generating C code for the model from a Perl script that extracts parameters from the configuration database 100. Details of instruction pipeline behavior (e.g., interlocks based on register usage and functional-unit availability requirements) are also derived from the configuration database 100. In the current implementation, a special pipeline description file specifies this information in a Lisp-like syntax.
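The kind of C such a generator might emit for a direct-mapped cache model is sketched below. The two constants stand in for values a script would extract from the configuration database. They, the direct-mapped organization, and the simple hit/miss counting are assumptions rather than the actual ISS cache model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values a generator script would substitute from the configuration
   database (hypothetical). */
#define CACHE_LINE_BYTES 16u
#define CACHE_LINES      64u     /* 1 KB direct-mapped cache */

typedef struct {
    bool valid[CACHE_LINES];
    uint32_t tag[CACHE_LINES];
    unsigned long hits, misses;
} CacheModel;

/* Simulates one access; returns true on a hit and fills the line on a miss. */
static bool cacheAccess(CacheModel *c, uint32_t addr) {
    uint32_t line  = addr / CACHE_LINE_BYTES;
    uint32_t index = line % CACHE_LINES;
    uint32_t tag   = line / CACHE_LINES;
    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return true;
    }
    c->valid[index] = true;          /* allocate on miss, evicting the old line */
    c->tag[index] = tag;
    c->misses++;
    return false;
}
```

Because the geometry is fixed at compile time, the division and modulo reduce to shifts and masks. This reflects the general point that configuring the ISS before compilation leaves few configuration decisions for run time.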

The third goal requires precise control of interrupt behavior. For this purpose, a special non-architectural register in the ISS 126 is used to suppress interrupt enables.

The ISS 126 provides several interfaces to support the different goals for its use:
- a batch or command-line mode (generally used in connection with the first and last goals);
- a command-loop mode, which provides non-symbolic debugging capabilities such as breakpoints, watchpoints, and stepping, frequently used for all four goals;
- a socket interface that allows the ISS 126 to be used by a software debugger as an execution back end (this interface must be configured to read and write the register state of the particular configuration selected); and
- a scriptable interface that allows very detailed debugging and performance analysis. In particular, this interface may be used to compare application behavior across configurations. For example, at any breakpoint the state from a run on one configuration may be compared with the state from a run on another configuration, or transferred into it.

The simulator 126 also has both hand-coded and automatically generated portions. The hand-coded portions are conventional, except for instruction decode and execution, which are created from tables generated from the ISA description language. These tables decode an instruction in steps. Decoding starts from the primary opcode found in the instruction word to be executed and indexes into a table with the value of that field. It continues in this way until a leaf opcode is found, i.e., an opcode that is not defined in terms of other opcodes. The table then gives a pointer to code translated from the TIE code specified in that instruction's semantics declaration, and this code is executed to simulate the instruction.
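A minimal sketch of this table walk follows. It assumes a two-level hierarchy in which a primary table is indexed by a 4-bit `op0` field. Each entry either points to a subtable (here indexed by `op2`) or is a leaf carrying a pointer to the instruction's semantic function. The state layout, field positions, 3-byte PC increment, and the two semantic functions are hypothetical stand-ins for code translated from TIE.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t ar[16]; uint32_t pc; } CpuState;
typedef void (*SemanticFn)(CpuState *, uint32_t insn);

typedef struct DecodeEntry {
    SemanticFn exec;                  /* non-NULL: leaf opcode */
    const struct DecodeEntry *sub;    /* non-NULL: index a further table */
    int subShift;                     /* position of the next 4-bit field */
} DecodeEntry;

static unsigned fld(uint32_t insn, int shift) { return (insn >> shift) & 0xFu; }

/* Semantics as they might be translated from TIE (hypothetical). */
static void semAdd(CpuState *s, uint32_t i) { s->ar[fld(i, 12)] = s->ar[fld(i, 8)] + s->ar[fld(i, 4)]; }
static void semSub(CpuState *s, uint32_t i) { s->ar[fld(i, 12)] = s->ar[fld(i, 8)] - s->ar[fld(i, 4)]; }

static const DecodeEntry op2Table[16] = {
    [0x8] = { semAdd, NULL, 0 },
    [0xC] = { semSub, NULL, 0 },
};
static const DecodeEntry op0Table[16] = {
    [0x0] = { NULL, op2Table, 20 },
};

/* Walks from the primary opcode down to a leaf and executes its semantics.
   Returns 0 on success, -1 for an illegal instruction. */
static int step(CpuState *s, uint32_t insn) {
    const DecodeEntry *e = &op0Table[fld(insn, 0)];
    while (e->sub)
        e = &e->sub[fld(insn, e->subShift)];
    if (!e->exec) return -1;
    e->exec(s, insn);
    s->pc += 3;                       /* assumed 24-bit instructions */
    return 0;
}
```

Only the tables and the semantic functions change from one configuration to the next. The `step` loop itself can stay hand-written.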

The ISS 126 can profile the execution of the program being simulated. This profiling uses a program counter sampling technique known in the industry. At regular intervals, the simulator 126 samples the PC (program counter) of the simulated processor and builds a histogram of the number of samples in each region of code. The simulator 126 also counts how many times each edge in the call graph is executed by incrementing a counter each time a call instruction is simulated. When the simulation is complete, the simulator 126 writes an output file containing both the histogram and the call-graph edge counts, in a format that a standard profile viewer can read. The simulated program 118 need not be modified with instrumentation code, as it would be with standard profiling techniques. As a result, the profiling overhead does not affect the simulation results, and the profiling is completely non-invasive.
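A minimal sketch of the two counters described above follows. It assumes fixed-size code regions (one histogram bucket per 64 bytes above a hypothetical text base), a sample taken every 100 simulated instructions, and a small fixed-capacity edge table keyed by call site and target. All constants and function names are assumptions, and writing the output file in a profile-viewer format is omitted.

```c
#include <stdint.h>

#define TEXT_BASE     0x40000000u  /* hypothetical start of the code region */
#define BUCKET_SIZE   64u          /* bytes of code per histogram bucket */
#define NBUCKETS      1024u
#define SAMPLE_PERIOD 100u         /* sample once per 100 simulated instructions */
#define MAX_EDGES     256

typedef struct { uint32_t from, to; unsigned long count; } Edge;

typedef struct {
    unsigned long icount;          /* simulated instructions so far */
    unsigned long hist[NBUCKETS];  /* PC samples per code region */
    Edge edges[MAX_EDGES];         /* call-graph edge counts */
    int nEdges;
} Profile;

/* Adds one PC sample to the histogram bucket covering pc. */
static void samplePc(Profile *p, uint32_t pc) {
    if (pc < TEXT_BASE) return;
    uint32_t b = (pc - TEXT_BASE) / BUCKET_SIZE;
    if (b < NBUCKETS) p->hist[b]++;
}

/* Called by the simulator after every instruction; samples at regular intervals. */
static void onInstruction(Profile *p, uint32_t pc) {
    if (++p->icount % SAMPLE_PERIOD == 0) samplePc(p, pc);
}

/* Called whenever the simulator executes a call instruction. */
static void recordCall(Profile *p, uint32_t callSite, uint32_t target) {
    for (int i = 0; i < p->nEdges; i++)
        if (p->edges[i].from == callSite && p->edges[i].to == target) {
            p->edges[i].count++;
            return;
        }
    if (p->nEdges < MAX_EDGES)
        p->edges[p->nEdges++] = (Edge){ callSite, target, 1 };
}
```

All bookkeeping lives in the simulator, so the simulated program's instruction stream and timing are untouched. This is what makes the profiling non-invasive.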

Preferably, the system makes hardware processor emulation available in addition to software processor emulation. For this purpose, the present embodiment provides an emulation board. As shown in FIG. 6, the emulation board 200 uses a complex programmable logic device (CPLD) 202, such as an Altera Flex 10K200E, to emulate a processor configuration 60 in hardware. Once programmed with the processor netlist generated by the system, the CPLD device 202 is functionally equivalent to the final ASIC product. It offers a physical implementation of the processor 60 that is cycle-accurate and runs much faster than other simulation methods (such as the ISS 126 or HDL). However, it cannot reach the high frequency targets that the final ASIC device can achieve.

This board lets designers evaluate various processor configuration options and begin software development and debugging early in the design cycle. It can also be used for functional verification of the processor configuration.

To ease software development, debugging, and verification, the emulation board 200 provides several on-board resources. These include the CPLD device 202 itself, EPROM 204, SRAM 206, synchronous SRAM 208, flash memory 210, and two RS232 serial channels 212. The serial channels 212 provide a communication link to a UNIX® or PC host for downloading and debugging user programs. The configuration of the processor 60, in the form of a CPLD netlist, is downloaded into the CPLD 202 either through a dedicated serial link to the device's configuration port 214 or through a dedicated configuration ROM 216.

The resources on the board 200 are also configurable to some extent. The memory map of the various memory elements on the board can be changed easily, because the mapping is done through an easily reprogrammed programmable logic device (PLD) 217. In addition, the caches 218 and 228 used by the processor core can be expanded by using larger memory devices and sizing the tag buses 222 and 224 that connect to the caches 218 and 228 accordingly.

Using the board to emulate a particular processor configuration involves several steps:
- Obtain a set of RTL files that describe the particular configuration of the processor.
- Synthesize a gate-level netlist from the RTL description using one of many commercial synthesis tools, for example FPGA Express from Synopsys.
- Use the gate-level netlist to obtain a CPLD implementation with tools typically provided by the device vendor, for example Maxplus2 from Altera Corporation.
- Download the implementation onto the CPLD chip on the emulation board, again using a programmer provided by the CPLD vendor.

Because one purpose of the emulation board is to support rapid prototype implementation for debugging, it is important that the CPLD implementation process outlined in the previous paragraph be automatic. To this end, the files delivered to the user are customized by grouping all relevant files into a single directory. A fully customized synthesis script is provided to synthesize the particular processor configuration for the particular FPGA device selected by the customer. A fully customized implementation script for use by the vendor tools is also generated. Together, these synthesis and implementation scripts ensure a functionally correct implementation with optimal performance.

Functional correctness is achieved by including commands in the script for three purposes:
- to read in all RTL files relevant to the particular processor configuration;
- to assign chip pin locations based on the I/O signals in the processor configuration; and
- to obtain specific logic implementations for certain critical portions of the processor logic, such as gated clocks.

The script also improves the performance of the implementation by assigning detailed timing constraints to all processor I/O signals and by giving special treatment to certain critical signals. One example of a timing constraint is assigning a specific input delay to a signal to account for that signal's delay on the board. An example of critical-signal treatment is assigning clock signals to dedicated global wires to achieve low clock skew on the CPLD chip.

Preferably, the system also configures a verification suite for the configured processor 60. Verification of most complex designs such as microprocessors follows this flow:
- build a testbench to stimulate the design and compare its outputs, either within the testbench or against an external model such as the ISS 126;
- write diagnostics to generate the stimulus;
- measure verification coverage using schemes such as finite-state-machine coverage, HDL line coverage, a declining bug rate, or the number of vectors run on the design; and
- if coverage is insufficient, write more diagnostics, possibly using tools that generate them, to exercise the design further.

The present invention uses a somewhat similar flow, but modifies every component of it to account for the configurability of the design. This methodology consists of the following steps:
- Build a testbench for the particular configuration. The testbench is configured using an approach similar to the one described for the HDL, and it supports all of the options and extensions supported there, such as cache sizes, bus interface, clocking, and interrupt generation.
- Run self-checking diagnostics on the particular configuration of the HDL. The diagnostics themselves can be configured to adapt to specific hardware components, and the choice of which diagnostics to run also depends on the configuration.
- Run pseudo-randomly generated diagnostics and compare the processor state after each instruction executes against the ISS 126.
- Measure verification coverage using coverage tools that measure functional coverage as well as line coverage. In addition, run monitors and checkers alongside the diagnostics to look for illegal states. All of these can be configured for the particular configuration specification.
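The lockstep comparison in the third step can be sketched with two toy models standing in for the ISS 126 and the HDL simulation. Everything here is hypothetical: the four-register machine, the three-operation random instruction generator (a simple linear congruential generator), and an injectable bug in the model under test. The bug shows a divergence being caught at the first instruction that exposes it.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t r[4]; } State;      /* toy architectural state */
typedef struct { int op, rd, rs, rt; } Insn;  /* op: 0 = add, 1 = sub, 2 = xor */

/* Linear congruential generator: deterministic, so a failing seed can be replayed. */
static uint32_t lcg(uint32_t *seed) {
    *seed = *seed * 1664525u + 1013904223u;
    return *seed >> 16;
}

static Insn randomInsn(uint32_t *seed) {
    Insn i;
    i.op = (int)(lcg(seed) % 3);
    i.rd = (int)(lcg(seed) % 4);
    i.rs = (int)(lcg(seed) % 4);
    i.rt = (int)(lcg(seed) % 4);
    return i;
}

/* Reference model, standing in for the ISS. */
static void execRef(State *s, Insn i) {
    uint32_t a = s->r[i.rs], b = s->r[i.rt];
    s->r[i.rd] = i.op == 0 ? a + b : i.op == 1 ? a - b : a ^ b;
}

/* Model under test, standing in for the HDL simulation; with buggySub set,
   sub computes b - a instead of a - b (an injected bug). */
static void execDut(State *s, Insn i, int buggySub) {
    if (i.op == 1 && buggySub) {
        s->r[i.rd] = s->r[i.rt] - s->r[i.rs];
        return;
    }
    execRef(s, i);
}

/* Runs n random instructions on both models and compares the full state after
   every instruction. Returns the index of the first mismatch, or -1 if none. */
static long runLockstep(uint32_t seed, long n, int buggySub) {
    State ref = { { 1, 2, 3, 4 } };
    State dut = ref;
    for (long k = 0; k < n; k++) {
        Insn i = randomInsn(&seed);
        execRef(&ref, i);
        execDut(&dut, i, buggySub);
        if (memcmp(&ref, &dut, sizeof ref) != 0) return k;
    }
    return -1;
}
```

The state is compared after every instruction, not only at the end of the run. So when the bug is exercised, the returned index points at the instruction that caused the first divergence.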

All verification components are configurable. Configurability is implemented using TPP.

The test bench is a Verilog® model of the system in which the configured processor 60 is placed. In the case of the present invention, these test benches include:
- caches, a bus interface, and external memory;
- an external interrupt mechanism and bus-error generation; and
- clock generation.

Since almost all of the above features are configurable, the test bench itself must support configurability. For example, the cache size and the number of external interrupts are adjusted automatically based on the configuration.

The test bench provides stimulus to the device under test, processor 60. It does so by supplying assembly-level instructions (from the diagnostics) that are preloaded into memory. The test bench also generates signals that control the behavior of processor 60, e.g., interrupts. The frequency and timing of these external signals are controllable and are generated automatically by the test bench.

Diagnostics are configurable in two ways. First, a diagnostic uses TPP to determine what to test. For example, a diagnostic written to test software interrupts needs to know how many software interrupts exist in order to generate the correct assembly code.

Second, the processor configuration system 10 must decide which diagnostics are appropriate for the configuration. For example, a diagnostic written to test the MAC unit does not apply to a processor 60 that lacks that unit. In this embodiment, this is done using a database containing information about each diagnostic. For each diagnostic, the database may record:
- that the diagnostic should be used when a certain option is selected;
- whether the diagnostic cannot be run when interrupts are present;
- whether the diagnostic requires a special library or handler to run; and
- whether the diagnostic cannot be run in co-simulation with the ISS 126.
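The database lookup described above can be pictured as a simple filter. The sketch below is a hypothetical model, not the embodiment's actual database: the `Config` and `DiagInfo` structures and their field names are assumptions chosen to mirror the four kinds of information listed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical configuration flags; a real configuration has many more. */
typedef struct {
    bool has_mac;          /* MAC unit present */
    bool interrupts_on;    /* test bench raises external interrupts */
    bool cosim_iss;        /* running in lock-step with the ISS */
    bool has_handler_lib;  /* special handler library available */
} Config;

/* One database record per diagnostic, mirroring the four fields above. */
typedef struct {
    const char *name;
    bool needs_mac;        /* use only if the MAC option is selected */
    bool no_interrupts;    /* cannot run while interrupts are raised */
    bool needs_handler;    /* needs a special library or handler */
    bool no_cosim;         /* cannot run under ISS co-simulation */
} DiagInfo;

static bool diag_applicable(const DiagInfo *d, const Config *c) {
    if (d->needs_mac && !c->has_mac) return false;
    if (d->no_interrupts && c->interrupts_on) return false;
    if (d->needs_handler && !c->has_handler_lib) return false;
    if (d->no_cosim && c->cosim_iss) return false;
    return true;
}

/* Stores pointers to the applicable diagnostics in out; returns how many. */
static size_t select_diags(const DiagInfo *db, size_t n, const Config *c,
                           const DiagInfo **out) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (diag_applicable(&db[i], c)) out[k++] = &db[i];
    return k;
}
```

With a configuration that has no MAC unit and raises interrupts, only diagnostics with neither restriction are selected.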

Preferably, the processor hardware description includes three types of test tools: test generator tools, monitors and coverage tools (or checkers), and a co-simulation mechanism. Test generator tools produce series of processor instructions in an intelligent manner; they are pseudo-random test generators. This embodiment uses two types internally: a specially developed one called RTPG, and one based on an external tool called VERA (VSG). Both have configurability built around them: based on the instructions valid for a given configuration, they generate sequences of instructions. These tools can also handle instructions newly defined in TIE, and those newly defined instructions are generated randomly for testing. This embodiment also includes monitors and checkers that measure design-verification coverage.
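A minimal sketch of the configuration-aware idea behind such generators follows. It assumes a hypothetical opcode table in which each entry names the configuration option it requires; the mnemonics and option bits are illustrative, and the xorshift generator merely stands in for whatever PRNG RTPG or VERA actually use. Only instructions legal for the configuration, including a TIE-defined instruction such as SAD when enabled, can be drawn.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcode table: base-ISA entries plus optional extensions. */
typedef struct {
    const char *mnemonic;
    uint32_t requires_option;   /* 0 = base ISA, else a bit in the config mask */
} OpInfo;

enum { OPT_MUL = 1u, OPT_TIE_SAD = 2u };

static const OpInfo kOps[] = {
    {"ADD", 0}, {"SUB", 0}, {"L32I", 0}, {"S32I", 0},
    {"MUL16S", OPT_MUL}, {"SAD", OPT_TIE_SAD},
};

/* xorshift32: small deterministic PRNG, so a failing run can be replayed
   from its seed. The seed should be nonzero. */
static uint32_t xorshift32(uint32_t *s) {
    uint32_t x = *s;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *s = x;
}

/* Fills out[0..n) with mnemonics that are legal for config_mask. */
static void gen_sequence(uint32_t config_mask, uint32_t seed,
                         const char **out, size_t n) {
    const OpInfo *legal[sizeof kOps / sizeof kOps[0]];
    size_t m = 0;
    for (size_t i = 0; i < sizeof kOps / sizeof kOps[0]; i++)
        if (kOps[i].requires_option == 0 ||
            (config_mask & kOps[i].requires_option))
            legal[m++] = &kOps[i];
    for (size_t i = 0; i < n; i++)
        out[i] = legal[xorshift32(&seed) % m]->mnemonic;
}
```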

Monitors and coverage tools run alongside the regression runs. The coverage tools track what the diagnostics are doing and which HDL functions and logic they exercise. All of this information is gathered across regression runs and analyzed later for hints about which parts of the logic need more testing. This embodiment uses several configurable functional coverage tools. For example, depending on the configuration, a particular finite state machine does not necessarily include all of its states. For that configuration, the functional coverage tool must therefore not try to check those states or transitions. This is achieved by making the tools configurable through TPP.
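To illustrate how a configured coverage tool can skip states that do not exist in a given configuration, here is a hypothetical sketch for a made-up cache-controller state machine whose write-back state exists only when a write-back cache is configured. Coverage is reported only over the configured states; the state names and the configuration flag are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical cache-controller FSM; ST_WB exists only with a write-back cache. */
enum { ST_IDLE, ST_FILL, ST_WB };

typedef struct {
    uint32_t expected;   /* bit i set: state i exists in this configuration */
    uint32_t seen;       /* bit i set: state i was visited during regression */
} FsmCoverage;

static int popcount32(uint32_t v) {
    int n = 0;
    while (v) { v &= v - 1; n++; }   /* clear lowest set bit */
    return n;
}

static void cov_init(FsmCoverage *c, bool writeback_cache) {
    c->expected = (1u << ST_IDLE) | (1u << ST_FILL);
    if (writeback_cache) c->expected |= 1u << ST_WB;
    c->seen = 0;
}

/* Called by the monitor each cycle with the FSM's current state. */
static void cov_sample(FsmCoverage *c, int state) { c->seen |= 1u << state; }

/* Percentage of configured states visited; unconfigured states are ignored. */
static int cov_percent(const FsmCoverage *c) {
    int want = popcount32(c->expected);
    int got = popcount32(c->seen & c->expected);
    return want ? 100 * got / want : 100;
}
```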

Similarly, there are monitors that check for illegal conditions occurring in the HDL simulation. Such illegal states can reveal bugs. For example, on a tri-state bus, two drivers should never be on at the same time. These monitors are configurable, adding or removing checks based on whether specific logic is included in the configuration.
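The tri-state check mentioned above reduces to asking whether more than one enable bit is set. This hypothetical per-cycle monitor sketch assumes the driver enables are packed into a bit mask and that a second mask lists the drivers present in the configuration; both encodings are assumptions for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if more than one enable bit is set (bus contention). */
static int tristate_conflict(uint32_t enables) {
    return (enables & (enables - 1)) != 0;   /* clears the lowest set bit */
}

/* Hypothetical per-cycle check; drivers_present masks out drivers that
   this configuration does not include. Returns 1 if legal, 0 otherwise. */
static int check_bus(uint64_t cycle, uint32_t enables, uint32_t drivers_present) {
    uint32_t active = enables & drivers_present;
    if (tristate_conflict(active)) {
        fprintf(stderr, "cycle %llu: bus contention, enables=0x%08x\n",
                (unsigned long long)cycle, (unsigned)active);
        return 0;
    }
    return 1;
}
```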

The co-simulation mechanism connects the HDL to the ISS 126 and is used to check that the processor state in the HDL and in the ISS 126 is the same at the end of each instruction. It is configurable to the extent that it knows which features each configuration includes and which state must be compared. For example, the data breakpoint feature adds special registers, and the mechanism must know to compare these new special registers.
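A hypothetical sketch of the per-instruction comparison, assuming a simplified state snapshot. Here `dbreaka` stands in for a register added by the data-breakpoint option and is compared only when that option is configured; the register set and struct layout are illustrative, not the actual co-simulation interface.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical architectural state captured at each instruction boundary. */
typedef struct {
    uint32_t pc;
    uint32_t ar[16];
    uint32_t sar;
    uint32_t dbreaka;   /* present only if the data-breakpoint option is configured */
} ProcState;

/* Compares HDL and ISS state, checking only registers that exist in this
   configuration. Returns the number of mismatches found. */
static int cosim_compare(const ProcState *hdl, const ProcState *iss, int has_dbreak) {
    int bad = 0;
    if (hdl->pc != iss->pc) { fprintf(stderr, "PC mismatch\n"); bad++; }
    for (size_t i = 0; i < 16; i++)
        if (hdl->ar[i] != iss->ar[i]) { fprintf(stderr, "AR%zu mismatch\n", i); bad++; }
    if (hdl->sar != iss->sar) { fprintf(stderr, "SAR mismatch\n"); bad++; }
    if (has_dbreak && hdl->dbreaka != iss->dbreaka) {
        fprintf(stderr, "DBREAKA mismatch\n");
        bad++;
    }
    return bad;
}
```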

Instruction semantics specified via TIE can be translated into functionally equivalent C functions, both for use in the ISS 126 and for system designers to use in testing and verification. The instruction semantics in the configuration database 106 are translated into C functions by a tool that builds a parse tree using standard parser tools and then walks the tree, emitting the corresponding expressions in C. The translation requires a pre-pass that assigns a bit width to every expression and rewrites the parse tree to simplify some translations. These translators are relatively simple compared with others, such as HDL-to-C translators or C-to-assembly compilers, and can be written by those skilled in the art starting from the TIE and C language specifications.

The benchmark application source code 118 is compiled and assembled using a compiler configured with the configuration file 100 and the assembler/disassembler 100, and is simulated with the sample data set 124 to obtain a software profile 130. The software profile 130 is also provided to the user configuration capture routine as feedback to the user.

Being able to obtain both hardware and software cost/benefit characterizations for any choice of configuration parameters opens new opportunities for the designer to optimize the system further. In particular, it lets the designer select the configuration parameters that optimize the overall system according to some figure of merit. One possible process is a greedy strategy that repeatedly selects or deselects configuration parameters. At each step, the parameter with the best effect on overall system performance and cost is selected, and the step is repeated until no single parameter change can improve system performance and cost. Other extensions include considering a group of configuration parameters at a time, or using more sophisticated search algorithms.
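The greedy strategy can be sketched as a local search over on/off parameters. This assumes a caller-supplied figure-of-merit function (hypothetical here) that folds profiled performance and estimated cost into one number where higher is better; `additive_merit` is only a toy model for demonstration.

```c
#include <stdint.h>

/* Figure of merit for a configuration bit mask (higher is better). */
typedef double (*MeritFn)(uint32_t config, void *ctx);

/* Greedy search: repeatedly toggle the single parameter with the largest
   improvement; stop when no single toggle improves the merit. */
static uint32_t greedy_configure(uint32_t start, int nparams, MeritFn merit, void *ctx) {
    uint32_t cur = start;
    double cur_score = merit(cur, ctx);
    for (;;) {
        int best = -1;
        double best_score = cur_score;
        for (int p = 0; p < nparams; p++) {
            double s = merit(cur ^ (1u << p), ctx);
            if (s > best_score) { best_score = s; best = p; }
        }
        if (best < 0) return cur;          /* local optimum reached */
        cur ^= 1u << best;
        cur_score = best_score;
    }
}

/* Toy model: each parameter contributes a fixed gain (benefit minus cost). */
typedef struct { const double *gain; int n; } AdditiveModel;

static double additive_merit(uint32_t config, void *ctx) {
    const AdditiveModel *m = ctx;
    double s = 0.0;
    for (int p = 0; p < m->n; p++)
        if (config & (1u << p)) s += m->gain[p];
    return s;
}
```

For example, with per-parameter gains of 5, -2 and 3, starting from all parameters off, the search enables parameters 0 and 2 and then stops. Because it only ever changes one parameter at a time, it can stall at a local optimum, which is why the text mentions considering groups of parameters or more sophisticated searches.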

In addition to finding optimal configuration parameter choices, this process can also be used to construct optimal processor extensions. Because the space of possible processor extensions is very large, it is important to limit the number of extension candidates. One technique is to analyze the application software and consider only instruction extensions that can improve system performance or cost.

Having covered the operation of the automated processor configuration system according to this embodiment, examples of applying the system to processor microarchitecture configuration are now described. The first example shows the advantages of the present invention when applied to image compression.

Motion estimation is an important component of many image compression algorithms, including MPEG video and H.263 video-conferencing applications. Video image compression exploits the similarity between successive frames to reduce the storage required for each frame. In the simplest case, each block of the image to be compressed can be compared with the corresponding block (same X, Y position) of a reference image (the image immediately before or after the one being compressed). Compressing the differences between frames is generally more bit-efficient than compressing individual images. In video sequences, distinct image features often move from frame to frame, so the closest correspondence between frames is often not at exactly the same X, Y position but somewhat offset. If a significant part of the image moves between frames, it may be necessary to identify and compensate for that motion before computing the difference. This means that the densest representation is achieved by encoding the difference between successive images, including, for distinct features, the X and Y offsets of the sub-images used in the computed difference. The offset of the position used to compute the image difference is called the motion vector.

The most computationally intensive task in this type of image compression is determining the most appropriate motion vector for each block. A common metric for selecting a motion vector is to find the vector with the lowest average per-pixel difference between each block of the image being compressed and a set of candidate blocks in the previous image. The candidate blocks are all blocks in a neighborhood around the location of the block being compressed. The image size, the block size, and the neighborhood size all affect the execution time of the motion estimation algorithm.

Simple block-based motion estimation compares each sub-image of the image to be compressed against a reference image. The reference image may precede or follow the subject image in the video sequence. In either case, the reference image is known to be available to the decompression system before the subject image is decompressed. The comparison of one block of the image under compression with the candidate blocks of the reference image is described below.

For each block in the subject image, a search is performed around the corresponding position in the reference image. Usually each color component of the image (e.g., YUV) is analyzed separately; sometimes motion estimation is performed on only one component, typically luminance. The average per-pixel difference is computed between the subject block and every possible block within the search zone of the reference image. The difference is the absolute value of the difference between pixel values. The average is proportional to the sum over the N² pixels of the pair of blocks (where N is the block dimension). The block of the reference image that yields the smallest average pixel difference defines the motion vector for that block of the subject image.

The following example shows a simple form of the motion estimation algorithm and optimizes it by using TIE to add a small application-specific functional unit. This optimization yields a speedup of more than a factor of 10, making processor-based compression feasible for many video applications. It demonstrates how a configurable processor combines the ease of programming in a high-level language with the efficiency of special-purpose hardware.

This example uses two matrices, OldB and NewB, to represent the old and new images, respectively. The image size is determined by NX and NY, and the block size by BLOCKX and BLOCKY; the image therefore consists of (NX/BLOCKX) × (NY/BLOCKY) blocks. The search region of a block is determined by SEARCHX and SEARCHY. The best motion vectors and values are stored in VectX, VectY, and VectB. The best motion vectors and values computed by the base (reference) implementation are stored in BaseX, BaseY, and BaseB; these are used to check the vectors computed by the implementation that uses the instruction extension. These basic definitions are captured in the following C code segment.

The motion estimation algorithm consists of three nested loops:
1. for each source block in the old image,
2. for each target block of the new image in the region surrounding the source block,
3. compute the absolute difference between each pixel pair.
The complete code for this algorithm is given below.

Reference software implementation
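The patent's listing is not reproduced in this text. As a stand-in, the following is a minimal C sketch of the three nested loops described above (blocks, then search offsets, then pixels). The function names, the row-major one-byte-per-pixel layout, and the policy of skipping candidates that fall outside the image are assumptions, not the original code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>

/* Best motion vector found for one block. */
typedef struct { int dx, dy; unsigned sad; } MotionVec;

/* For the block at (bx, by) in oldb, try every offset within +/-searchx,
   +/-searchy in newb and keep the one with the smallest sum of absolute
   differences. Images are nx*ny, row-major, one byte per pixel. */
static MotionVec best_vector(const uint8_t *oldb, const uint8_t *newb, int nx, int ny,
                             int bx, int by, int blockx, int blocky,
                             int searchx, int searchy) {
    MotionVec best = {0, 0, UINT_MAX};
    for (int dy = -searchy; dy <= searchy; dy++) {
        for (int dx = -searchx; dx <= searchx; dx++) {
            int x0 = bx + dx, y0 = by + dy;
            if (x0 < 0 || y0 < 0 || x0 + blockx > nx || y0 + blocky > ny)
                continue;                        /* candidate leaves the image */
            unsigned sad = 0;
            for (int y = 0; y < blocky; y++)     /* innermost: pixel pairs */
                for (int x = 0; x < blockx; x++)
                    sad += (unsigned)abs(oldb[(by + y) * nx + bx + x] -
                                         newb[(y0 + y) * nx + x0 + x]);
            if (sad < best.sad) { best.dx = dx; best.dy = dy; best.sad = sad; }
        }
    }
    return best;
}

/* Outer loop over all source blocks; out receives one vector per block,
   in row-major block order. */
static void motion_estimate(const uint8_t *oldb, const uint8_t *newb, int nx, int ny,
                            int blockx, int blocky, int searchx, int searchy,
                            MotionVec *out) {
    for (int by = 0; by + blocky <= ny; by += blocky)
        for (int bx = 0; bx + blockx <= nx; bx += blockx)
            *out++ = best_vector(oldb, newb, nx, ny, bx, by,
                                 blockx, blocky, searchx, searchy);
}
```

Dividing the returned sum by the number of pixels in a block gives the per-pixel average the text describes; since every candidate has the same pixel count, comparing sums selects the same vector.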

While the base implementation is simple, it fails to exploit much of the inherent parallelism of this block-to-block comparison. The configurable processor architecture provides two main tools that allow a significant speedup of this application.

First, the instruction set architecture includes a powerful funnel-shift primitive that allows rapid extraction of unaligned fields from memory. This lets the inner pixel-comparison loop fetch groups of adjacent pixels from memory efficiently, and the loop can then be rewritten to operate on four pixels (bytes) at a time. In particular, for the purposes of this example, it is desirable to define a new instruction that computes the absolute differences of four pixel pairs at once. Before defining this new instruction, however, the algorithm must be re-implemented so that it can use such an instruction.

The presence of this instruction improves the inner-loop pixel-difference computation enough that loop unrolling becomes attractive as well. The C code for the inner loop is rewritten to take advantage of the new sum-of-absolute-differences instruction and efficient shifting, so that portions of four overlapping blocks of the reference image can be compared in the same loop. SAD(x, y) is a new intrinsic function corresponding to the added instruction. SRC(x, y) shifts the concatenation of x and y right by the shift amount held in the SAR register.
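SRC can be emulated in portable C as below, where a static variable stands in for the SAR register (an assumption for illustration). The hypothetical helper `load_unaligned` shows the use the text describes: extracting four bytes at an arbitrary offset from two aligned little-endian words with one funnel shift, with SAR set to eight times the byte offset.

```c
#include <stdint.h>
#include <stddef.h>

static unsigned sar;   /* emulated SAR shift-amount register */

/* Right-shifts the 64-bit concatenation x:y by SAR and keeps the low 32 bits. */
static uint32_t SRC(uint32_t x, uint32_t y) {
    uint64_t cat = ((uint64_t)x << 32) | y;
    return (uint32_t)(cat >> sar);
}

/* Reads the aligned little-endian word with word index i. */
static uint32_t word_le(const uint8_t *p, size_t i) {
    const uint8_t *q = p + 4 * i;
    return (uint32_t)q[0] | (uint32_t)q[1] << 8 |
           (uint32_t)q[2] << 16 | (uint32_t)q[3] << 24;
}

/* Returns the 4 bytes starting at byte offset off using only aligned reads
   plus SRC. Note: always reads the following word too, so the buffer needs
   one extra aligned word past the last access. */
static uint32_t load_unaligned(const uint8_t *p, size_t off) {
    sar = 8u * (unsigned)(off & 3);
    return SRC(word_le(p, off / 4 + 1), word_le(p, off / 4));
}
```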

Fast version of motion estimation using the SAD instruction

This implementation uses the following SAD function to emulate the new instruction:
Sum of absolute differences of 4 bytes
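The emulation function itself is not reproduced in this text. A minimal sketch of what such an emulation could look like follows, assuming the instruction returns the plain sum of the four byte-lane absolute differences of its two 32-bit operands, with no separate accumulator input.

```c
#include <stdint.h>

/* Sum of absolute differences of the four byte lanes of x and y. */
static unsigned SAD(uint32_t x, uint32_t y) {
    unsigned sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int a = (int)((x >> (8 * lane)) & 0xFF);
        int b = (int)((y >> (8 * lane)) & 0xFF);
        sum += (unsigned)(a > b ? a - b : b - a);
    }
    return sum;
}
```

For example, `SAD(0x01020304, 0x04030201)` is 3 + 1 + 1 + 3 = 8.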

To debug this new implementation, the following test program compares the motion vectors and values computed by the new implementation with those computed by the base implementation.
Main test

This simple test program is used throughout the development process. One important convention must be followed here: the main program must return 0 if an error is detected, and 1 otherwise.
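A hypothetical sketch of the comparison at the heart of such a test program, following the stated return convention (0 on error, 1 otherwise). The flattened one-entry-per-block array layout and the function name are assumptions, since the original listing is not reproduced here.

```c
#include <stdio.h>

/* Returns 1 if the new vectors match the base vectors for all blocks,
   0 (and a diagnostic message) on the first mismatch. */
static int check_vectors(const int *vectx, const int *vecty, const unsigned *vectb,
                         const int *basex, const int *basey, const unsigned *baseb,
                         int nblocks) {
    for (int i = 0; i < nblocks; i++) {
        if (vectx[i] != basex[i] || vecty[i] != basey[i] || vectb[i] != baseb[i]) {
            printf("block %d: got (%d,%d,%u), expected (%d,%d,%u)\n",
                   i, vectx[i], vecty[i], vectb[i], basex[i], basey[i], baseb[i]);
            return 0;
        }
    }
    return 1;
}
```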

TIE allows new instructions to be specified rapidly. The configurable processor generator can fully implement these instructions in both the hardware implementation and the software development tools. Hardware synthesis integrates the new functions optimally into the hardware data path. The configurable processor software environment fully supports the new instructions in the C and C++ compilers, assembler, symbolic debugger, profiler, and cycle-accurate instruction set simulator. Rapid regeneration of hardware and software makes application-specific instructions a fast and reliable tool for accelerating applications.

This example uses TIE to implement a simple instruction that performs pixel differencing, absolute value, and accumulation on four pixels in parallel. This single instruction performs eleven basic operations (which might require separate instructions on a conventional processor) as one atomic operation. The complete description follows.

This description represents the minimum steps required to define a new instruction. First, a new opcode must be defined for the instruction. In this case, the new opcode SAD is defined as a sub-opcode of CUST0. As noted above, CUST0 is predefined as follows.

It is easy to see that QRST is a top-level opcode, CUST0 is a sub-opcode of QRST, and SAD in turn is a sub-opcode of CUST0. This hierarchical organization allows the opcode space to be grouped and managed logically. One important point to remember is that CUST0 (and CUST1) are defined as opcode space reserved for users to add new instructions. Users should preferably stay within this allocated opcode space to ensure that the TIE description remains reusable in the future.

The second step in this TIE description is to define a new instruction class containing the new instruction SAD. This is where the operands of the SAD instruction are defined. In this case, SAD takes three register operands: the destination register arr and the source registers ars and art. As mentioned previously, arr is defined as the register indexed by the r field of the instruction, and ars and art as the registers indexed by its s and t fields.
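To make the field indexing concrete, the sketch below extracts the fields of a 24-bit three-register instruction. The bit layout (op0 in bits 3:0, t in 7:4, s in 11:8, r in 15:12, op1 in 19:16, op2 in 23:20) is an assumption based on the Xtensa RRR format and is not stated in this text; under that assumption, op0 would carry the top-level opcode and op1 and op2 the nested sub-opcodes.

```c
#include <stdint.h>

/* Fields of an assumed 24-bit RRR-format instruction word. */
typedef struct { unsigned op0, t, s, r, op1, op2; } RRR;

static RRR decode_rrr(uint32_t insn) {
    RRR f;
    f.op0 = insn & 0xF;          /* top-level opcode */
    f.t   = (insn >> 4) & 0xF;   /* selects art */
    f.s   = (insn >> 8) & 0xF;   /* selects ars */
    f.r   = (insn >> 12) & 0xF;  /* selects arr (destination) */
    f.op1 = (insn >> 16) & 0xF;  /* sub-opcode */
    f.op2 = (insn >> 20) & 0xF;  /* sub-sub-opcode */
    return f;
}
```

An instruction would then execute as `ar[f.r] = op(ar[f.s], ar[f.t])` on the address register file.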

The last block in this description gives the formal semantic definition of the SAD instruction. It uses a subset of Verilog HDL to describe combinational logic. This block defines precisely how the ISS simulates the SAD instruction and how additional circuitry is synthesized and added to the configurable processor hardware to support the new instruction.
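Written as a formula, the operation the semantic block defines can be summarized as follows. This is an illustrative restatement, not the TIE text itself, and assumes $x_i$ and $y_i$ denote byte $i$ of the values held in ars and art:

```latex
\mathrm{arr} = \sum_{i=0}^{3} \left| x_i - y_i \right|
```

Counting the work: 4 subtractions, 4 absolute values, and 3 additions to combine the four lane results, which gives 4 + 4 + 3 = 11, the eleven basic operations mentioned above.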

The TIE description is then debugged and verified using the tools described above. After verifying that the TIE description is correct, the next step is to estimate the new instruction's effect on hardware size and performance. As described above, this can be done using, for example, Design Compiler®. When Design Compiler® finishes, the user can view detailed area and timing reports.

Once the TIE description has been confirmed to be correct and efficient, it is time to configure and build a configurable processor that supports the new SAD instruction. This is done using the GUI as described above.

Next, the motion estimation code is compiled for the configurable processor, and the instruction set simulator is used to verify that the program is correct and, more importantly, to measure its performance. This is done in three steps: run the test program on the simulator; run only the base implementation to obtain its instruction count; and run only the new implementation to obtain its instruction count.

The following is the simulation output of the second step.

The following is the simulation output of the last step.

The two reports show a speedup of about 4 times. Note that the configurable processor's instruction set simulator can provide much other useful information as well.

After confirming the program's correctness and performance, the next step is to run the test program on the Verilog simulator as described above. Those skilled in the art can gather the details of this process from the makefile in Appendix C (related files are also shown in Appendix C). The purpose of this simulation is to further confirm the correctness of the new implementation and, more importantly, to make this test program part of the regression tests for this configured processor.

Finally, the processor logic can be synthesized using, for example, Design Compiler®, and placed and routed using, for example, Apollo®.

For clarity and simplicity, this example has used a simplified view of video compression and motion estimation. In practice, standard compression algorithms have many additional nuances. For example, MPEG-2 typically performs motion estimation and compensation at sub-pixel resolution. Two adjacent rows or columns of pixels can be averaged to produce a set of pixels interpolated at a virtual position midway between them. User-defined instructions of the configurable processor are useful here too, since a parallel pixel-averaging instruction is easily implemented in three or four lines of TIE code. Averaging between pixels within a row uses the efficient alignment operations of the processor's standard instruction set.
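The averaging step can also be done four pixels at a time with ordinary word operations. The sketch below uses a well-known SIMD-within-a-register formulation that computes the rounded-up average (a + b + 1) >> 1 in each byte lane; the rounding choice is an assumption, and a TIE instruction would compute the same result directly in hardware.

```c
#include <stdint.h>

/* Per-byte rounded-up average of four packed 8-bit pixels.
   Uses a | b == (a & b) + (a ^ b); masking with 0xFE before the shift
   keeps bits from crossing byte lanes, and no lane ever borrows. */
static uint32_t avg4_round_up(uint32_t a, uint32_t b) {
    return (a | b) - (((a ^ b) & 0xFEFEFEFEu) >> 1);
}
```

For example, averaging the packed rows `0x00FF0102` and `0x00FF0203` gives `0x00FF0203`.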

Thus, adding a simple sum-of-absolute-differences instruction costs a few hundred gates but improves motion estimation performance by more than a factor of 10. This acceleration represents a significant improvement in the cost and power efficiency of the final system. Furthermore, the seamless extension of the software development tools to include the new motion estimation instruction allows rapid prototyping, performance analysis, and release of the complete software application solution. The solution of the present invention makes application-specific processor configuration simple, reliable, and complete, and dramatically enhances the cost, performance, functionality, and power efficiency of the final system product.

As an example focusing on adding functional hardware units, consider the basic configuration shown in FIG. 6. This configuration includes the processor control functions, the program counter (PC), branch selection, instruction memory or cache and instruction decoder, and the basic integer data path, which includes the main register file, bypass multiplexers, pipeline registers, ALU, address generator, and data memory for the cache.

The HDL is written so that the multiplier logic is present conditionally on the "multiplier" parameter being set, and the multiplier unit is added as a new pipeline stage as shown in FIG. 7 (changes to exception handling may be needed if precise exceptions are to be supported). Of course, instructions that use the multiplier are preferably added along with the new unit.

As a second example, a complete coprocessor for digital signal processing, such as a multiply-accumulate unit, may be added to the basic configuration as shown in FIG. 8. This entails changes to processor control: adding decode control signals for multiply-accumulate operations, including decoding of register sources and destinations from the extended instructions; adding appropriate pipeline delays to the control signals; extending the register destination logic; adding control for the register bypass multiplexers for moves from the accumulator registers; and including the multiply-accumulate unit as a possible source of instruction results. In addition, it requires adding a multiply-accumulate unit with additional accumulator registers, a multiply-accumulate array, and source-select multiplexers for the main register sources. Adding the coprocessor also requires extending the register bypass multiplexers to take sources from the accumulator registers, and extending the load/alignment multiplexers to take sources from the multiplier results. Again, the system preferably adds instructions for using the new functional unit along with the actual hardware.

Another option that is particularly useful for digital signal processors is a floating-point unit. For example, a functional unit implementing the IEEE 754 single-precision floating-point standard may be added, together with instructions for accessing it. Such a floating-point unit may be used in digital signal processing applications such as audio compression and decompression.
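The behavior such a unit must reproduce can be illustrated with a short Python reference model of IEEE 754 single-precision (binary32) addition, of the kind a simulator or test bench might use. This is a sketch only; `to_f32` and `fadd_s` are hypothetical names. Computing the sum of two binary32 values in binary64 and rounding once to binary32 yields the correctly rounded single-precision result, because binary64 carries more than twice the precision of binary32. Note that `struct.pack` raises `OverflowError` for values beyond the binary32 range, so overflow to infinity is not modeled here.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fadd_s(a: float, b: float) -> float:
    """Single-precision add: both operands and the result are binary32 values."""
    return to_f32(to_f32(a) + to_f32(b))
```

For instance, `fadd_s(1.0, 2.0 ** -24)` returns `1.0`: the exact sum lies halfway between two binary32 neighbors, and round-to-nearest-even selects 1.0.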

As a further example of the system's flexibility, consider the 4 kB memory interface shown in FIG. 9. Using the configurability of the present invention, the coprocessor registers and data path may be wider or narrower than the main integer register file and data path, and the local memory width may be varied so that it equals the widest memory access width of the processor or coprocessor (memory addressing for reads and writes is adjusted accordingly). For example, FIG. 10 shows a local memory system in which the processor supports 32-bit loads and stores while the coprocessor supports 128-bit loads and stores, with the processor/coprocessor combination addressing the same array. This can be implemented using TPP code.

Here, SBytes is the total memory size, accessed either as B1-byte words at byte address A1 over data bus D1 under control of write signal W1, or through the corresponding parameters B2, A2, D2, and W2. Only one set of signals, as defined by Select, is active in a given cycle. The TPP code implements the memory as a collection of memory banks. The width of each bank is given by the minimum access width, and the number of banks is given by the ratio of the maximum to the minimum access width. The first for loop instantiates each memory bank and its associated write signals, namely the write enable and the write data. The second for loop collects the data read from all banks onto a single bus.
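The banking rule just described can be checked with a small behavioral model in Python rather than TPP. This is a sketch under stated assumptions: the class name `BankedMemory` is hypothetical, accesses are assumed to be naturally aligned, bytes are assumed to be little-endian within an access, and only the two access widths (narrow and wide) are supported. The bank width equals the narrow access width and the bank count equals the ratio of the widths, as in the text.

```python
from typing import List, Tuple


class BankedMemory:
    """Hypothetical model: one local memory shared by a narrow (processor) port and a
    wide (coprocessor) port, built from banks as wide as the narrower access."""

    def __init__(self, size_bytes: int, narrow_bytes: int, wide_bytes: int) -> None:
        if wide_bytes % narrow_bytes or size_bytes % wide_bytes:
            raise ValueError("widths and size must divide evenly")
        self.narrow = narrow_bytes                   # bank width = minimum access width
        self.wide = wide_bytes
        self.nbanks = wide_bytes // narrow_bytes     # bank count = max width / min width
        rows = size_bytes // wide_bytes
        self.banks: List[List[int]] = [[0] * rows for _ in range(self.nbanks)]

    def _locate(self, addr: int, width: int) -> Tuple[int, int]:
        """Return (row, first bank) for a naturally aligned access."""
        if width not in (self.narrow, self.wide) or addr % width:
            raise ValueError("unsupported width or misaligned address")
        row, offset = divmod(addr, self.wide)
        return row, offset // self.narrow

    def write(self, addr: int, data: int, width: int) -> None:
        row, first = self._locate(addr, width)
        lane_bits = 8 * self.narrow
        for i in range(width // self.narrow):        # write-enable each covered bank
            self.banks[first + i][row] = (data >> (lane_bits * i)) & ((1 << lane_bits) - 1)

    def read(self, addr: int, width: int) -> int:
        row, first = self._locate(addr, width)
        lane_bits = 8 * self.narrow
        value = 0
        for i in range(width // self.narrow):        # gather bank outputs onto one bus
            value |= self.banks[first + i][row] << (lane_bits * i)
        return value
```

With the FIG. 10 proportions, `BankedMemory(4096, 4, 16)` has four 32-bit banks: a 128-bit coprocessor store writes all four banks of one row, while a 32-bit processor load reads a single bank.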

FIG. 11 shows an example in which a user-defined instruction is included in the basic configuration. As shown in the figure, a simple instruction can be added to the processor pipeline with the same timing and interface as the ALU. An instruction added in this way must not cause any stalls or exceptions, must not contain any state, must use only the two normal source register values and the instruction word as inputs, and must produce only a single output value. Such restrictions are unnecessary, however, if the TIE language provides for specifying processor state.
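An instruction obeying these restrictions is a pure function of two source values and the instruction word. The Python sketch below illustrates this with a lane-wise (SIMD-style) addition. The operation, the function names, and the use of instruction bit 20 to select 16-bit instead of 8-bit lanes are hypothetical choices made for illustration only.

```python
MASK32 = 0xFFFF_FFFF


def parallel_add(ars: int, art: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane; carries do not cross lane boundaries."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane = ((ars >> shift) & mask) + ((art >> shift) & mask)
        result |= (lane & mask) << shift      # wrap within the lane
    return result


def execute_simple(inst: int, ars: int, art: int) -> int:
    """Stateless ALU-style instruction: two sources plus the instruction word in,
    one 32-bit value out, no stalls and no exceptions.
    Hypothetical encoding: bit 20 of inst selects 16-bit lanes, otherwise 8-bit lanes."""
    lane_bits = 16 if (inst >> 20) & 1 else 8
    return parallel_add(ars & MASK32, art & MASK32, lane_bits)
```

Because the result depends only on its inputs, such a function can be placed in the ALU stage without any interlock or state-management logic.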

FIG. 12 shows another example of a user-defined unit implemented under this system. The functional unit shown in the figure, an 8/16-bit parallel data unit extension of the ALU, is generated from the following ISA code.

Of particular interest in another aspect of the invention is the designer-defined instruction execution unit 96, because it is here that TIE-defined instructions that modify processor state are decoded and executed. In this aspect of the invention, a number of building blocks are added to the language so that additional processor state, which new instructions can read and write, can be declared. The "state" statement is used to declare additional processor state. The declaration begins with the keyword state. The next section of the state statement describes the size of the state in bits and how its bits are indexed. The following section is the state name, which is used to identify the state in other description sections. The last section of the "state" statement is a list of attributes associated with that state. For example,

defines three new processor states: DATA, KEYC, and KEYD. State DATA is 64 bits wide, with its bits indexed from 63 down to 0. KEYC and KEYD are both 28-bit states. DATA has a coprocessor-number attribute, cpn, which indicates the coprocessor to which DATA belongs.

The attribute "autopack" indicates that state DATA is automatically placed in registers of the user register file, so that software tools can read and write the value of DATA.

The user_register section specifies how state is mapped onto registers in the user register file. A user_register section begins with the keyword user_register, followed by a number giving the register number, and ends with an expression giving the state bits to be placed in that register. For example,

specifies that the low word of DATA is mapped to the first user register and the high word to the second. The next two user register entries hold the values of KEYC and KEYD. Clearly, the state information used in this section must agree with that in the state section; this consistency can be checked automatically by a computer program.
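The consistency check mentioned above, together with the way a mapping is used to move state in and out of user registers, can be sketched in Python. Everything here is hypothetical: the names `Slice`, `check_mapping`, `rur`, and `wur` are illustrative, the user registers are assumed to be numbered from 0, and `rur`/`wur` only mimic what the RUR and WUR instructions do with the mapping. They are not Xtensa APIs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Slice:
    state: str
    hi: int   # inclusive bit range state[hi:lo]
    lo: int


STATES: Dict[str, int] = {"DATA": 64, "KEYC": 28, "KEYD": 28}   # name -> width in bits

# Assumed register numbering: DATA low word, DATA high word, then KEYC and KEYD.
USER_REGISTERS: Dict[int, Slice] = {
    0: Slice("DATA", 31, 0),
    1: Slice("DATA", 63, 32),
    2: Slice("KEYC", 27, 0),
    3: Slice("KEYD", 27, 0),
}


def check_mapping(states: Dict[str, int], user_regs: Dict[int, Slice]) -> None:
    """Raise ValueError unless every slice names a declared state, lies within that
    state and within one 32-bit register, and every state bit is mapped exactly once."""
    coverage = {name: [0] * width for name, width in states.items()}
    for reg, s in user_regs.items():
        if s.state not in states:
            raise ValueError(f"user register {reg}: undeclared state {s.state}")
        if not 0 <= s.lo <= s.hi < states[s.state] or s.hi - s.lo + 1 > 32:
            raise ValueError(f"user register {reg}: bad slice {s}")
        for bit in range(s.lo, s.hi + 1):
            coverage[s.state][bit] += 1
    for name, counts in coverage.items():
        if any(c != 1 for c in counts):
            raise ValueError(f"state {name} is not mapped exactly once")


def rur(values: Dict[str, int], reg: int) -> int:
    """Like RUR: copy the mapped slice out of the state."""
    s = USER_REGISTERS[reg]
    return (values[s.state] >> s.lo) & ((1 << (s.hi - s.lo + 1)) - 1)


def wur(values: Dict[str, int], reg: int, data: int) -> None:
    """Like WUR: overwrite the mapped slice of the state with data."""
    s = USER_REGISTERS[reg]
    mask = ((1 << (s.hi - s.lo + 1)) - 1) << s.lo
    values[s.state] = (values[s.state] & ~mask) | ((data << s.lo) & mask)
```

Saving all user registers with `rur` and writing them back with `wur` restores every declared state bit, which is the property the consistency check guarantees.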

In another embodiment of the invention, this assignment of state bits to user register file entries is derived automatically using a bin-packing algorithm. In yet another embodiment, a combination of manual and automatic assignment can be used, for example, to ensure upward compatibility.
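The text does not say which bin-packing algorithm is used. The Python sketch below uses first-fit decreasing, a common heuristic, purely as an illustrative choice; the function name `pack_states` is hypothetical. States wider than a register are first split into register-sized chunks, and each chunk is then placed in the first register with enough free bits.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, int, int]   # (state name, hi bit, lo bit)


def pack_states(states: Dict[str, int], reg_bits: int = 32) -> List[List[Chunk]]:
    """First-fit-decreasing assignment of state bits to user registers."""
    chunks: List[Chunk] = []
    for name, width in states.items():
        for lo in range(0, width, reg_bits):       # split states wider than a register
            chunks.append((name, min(lo + reg_bits, width) - 1, lo))
    chunks.sort(key=lambda c: c[1] - c[2] + 1, reverse=True)   # widest first (stable)
    regs: List[List[Chunk]] = []
    free: List[int] = []                           # free bits left in each register
    for chunk in chunks:
        size = chunk[1] - chunk[2] + 1
        for i, room in enumerate(free):
            if size <= room:                       # first register that still fits
                regs[i].append(chunk)
                free[i] -= size
                break
        else:
            regs.append([chunk])                   # open a new user register
            free.append(reg_bits - size)
    return regs
```

For the DATA, KEYC, and KEYD declarations this produces four registers, matching the manual mapping above; for states of 20, 12, and 10 bits it places the 20-bit and 12-bit states together in one register and the 10-bit state in a second.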

The instruction field statement field is used to improve the readability of TIE code. Fields are subsets or concatenations of other fields that are grouped together and referenced by a single name. The complete set of bits in an instruction is the highest-level superset field, inst, which can be divided into smaller fields. For example,

defines two 4-bit fields, x and y, as subfields of the top-level field inst (bits 8 to 11 and bits 12 to 15, respectively), and defines an 8-bit field xy as the concatenation of the x and y fields.
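In software terms, such field definitions reduce to bit extraction and concatenation. The following Python sketch shows the decoding a tool might generate for x, y, and xy. The helper names are hypothetical, and the sketch assumes that xy is written {x, y}, so that, as in Verilog concatenation, x supplies the high-order nibble.

```python
def bits(inst: int, hi: int, lo: int) -> int:
    """Extract inst[hi:lo], an inclusive Verilog-style bit range."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def field_x(inst: int) -> int:
    return bits(inst, 11, 8)          # x = inst[11:8]


def field_y(inst: int) -> int:
    return bits(inst, 15, 12)         # y = inst[15:12]


def field_xy(inst: int) -> int:
    # {x, y}: x occupies the high nibble and y the low nibble (assumed order).
    return (field_x(inst) << 4) | field_y(inst)
```

For the instruction word 0xA5B3, x is 0x5, y is 0xA, and xy is 0x5A.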

The opcode statement defines an opcode that encodes specific fields. Any instruction field used to specify an operand of an opcode defined in this way, such as a register or an immediate constant, must first be defined in a field statement and then in an operand statement. For example,

defines two new opcodes, acs and adsel, based on the previously defined opcode CUST0 (4'b0000 denotes the 4-bit binary constant 0000). The TIE specification of the preferred core ISA contains the following statements as its base definitions:

Thus, the definitions of acs and adsel cause the TIE compiler to generate instruction decode logic expressed, respectively, as follows:

The instruction operand statement operand identifies registers and immediate constants. Before a field can be defined as an operand, however, it must already have been defined as a field as described above. If the operand is an immediate constant, its value can either be computed from the operand or taken from a previously defined constant table, defined as described below. For example, to encode an immediate operand, the TIE code

defines an 18-bit field named offset, which holds a signed number, and an operand offsets4, which is four times the number stored in the offset field. As will be apparent to those skilled in the art, the last part of the operand statement describes the circuitry that performs the computation, written in the subset of Verilog® HDL used for describing combinational circuits.

Here, the wire statement defines a 32-bit-wide set of logical wires named t. The first assign statement after the wire statement specifies that the signal driving t is the offsets4 constant shifted right by two bits (that is, divided by four), and the second assign statement specifies that the low 18 bits of t are placed in the offset field. The remaining assign statement, which computes the operand from the field, directly specifies the value of offsets4 as the concatenation of 14 copies of the sign bit of offset (bit 17) with offset itself, followed by a 2-bit left shift.
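The two directions described above can be mirrored in Python to check that they are inverses. This is a sketch only, with hypothetical function names: `decode_offsets4` follows the sign-extend-then-shift assignment (field to operand value), and `encode_offsets4` follows the t = offsets4 >> 2, offset = t[17:0] assignments (operand value to field). Values are kept to 32 bits, as the wire t is 32 bits wide.

```python
MASK32 = 0xFFFF_FFFF


def decode_offsets4(offset: int) -> int:
    """offsets4 = {{14{offset[17]}}, offset} << 2, truncated to 32 bits."""
    offset &= (1 << 18) - 1
    sign = (offset >> 17) & 1
    extended = ((((1 << 14) - 1) * sign) << 18) | offset   # 14 copies of bit 17, then offset
    return (extended << 2) & MASK32


def encode_offsets4(offsets4: int) -> int:
    """t = offsets4 >> 2 (logical shift); offset = t[17:0]."""
    t = (offsets4 & MASK32) >> 2
    return t & ((1 << 18) - 1)


def as_signed32(value: int) -> int:
    """View a 32-bit pattern as a two's-complement integer."""
    return value - (1 << 32) if value & (1 << 31) else value
```

For example, the field value 0x3FFFF (−1 in 18-bit two's complement) decodes to 0xFFFFFFFC, that is, −4, and encoding 0xFFFFFFFC gives back 0x3FFFF.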

For a constant-table operand, the TIE code

uses a table statement to define the constant array prime (the number following the table name is the number of elements in the table), and encodes the operand prime_s by using the operand as an index into the table prime (note the use of Verilog® statements to define the indexing).
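A table operand maps a small instruction field to an arbitrary constant. The Python sketch below shows both directions: decoding, as the hardware does, and the reverse lookup an assembler would need. The table contents and the 4-bit field width (16 entries) are hypothetical, because the actual values of prime are not reproduced in the text.

```python
# Hypothetical table contents: the real values of `prime` are not given in the text.
PRIME = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]


def decode_prime_s(field_value: int) -> int:
    """Operand value = prime[field]: the instruction field indexes the table."""
    return PRIME[field_value]


def encode_prime_s(constant: int) -> int:
    """Reverse mapping for an assembler: find the table index of a constant."""
    try:
        return PRIME.index(constant)
    except ValueError:
        raise ValueError(f"{constant} is not encodable as prime_s") from None
```

Constants that do not appear in the table cannot be encoded, so an assembler would reject them with an error.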

The instruction class statement iclass associates opcodes with operands in a common format. All instructions defined in an iclass statement have the same format and operand usage. Before an instruction class can be defined, its components must be defined, first as fields and then as opcodes and operands. For example, building on the code used in the previous example, which defined the opcodes acs and adsel, the additional statements

use operand statements to define three register operands, art, ars, and arr (again, note the use of Verilog® statements in this definition). Next, the iclass statement

specifies that the opcodes adsel and acs belong to a common instruction class, viterbi, which takes the two register operands art and ars as inputs and writes its output to the register operand arr.

In the present invention, the instruction class statement "iclass" is modified so that the state-access information of instructions can be specified. The statement begins with the keyword "iclass", followed by the instruction class name, a list of the opcodes belonging to the class, and a list of operand access information, and it ends with a newly defined list of state access information. For example,

defines several instruction classes and how the various new instructions access state. The keywords "in", "out", and "inout" indicate that the instructions in an iclass read, write, or modify (both read and write) the state, respectively. In this example, state "DATA" is read by instruction "LDDATA"; states "KEYC" and "KEYD" are written by instruction "STKEY"; and "KEYC", "KEYD", and "DATA" are all modified by instruction "DES".
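Because the access lists state exactly which state each instruction reads and writes, a generator can derive facts such as read-after-write dependences between instructions, which the pipeline logic must then handle by forwarding or stalling. The following Python sketch transcribes the example above and derives those sets. The data layout and the names `validate_state_access`, `reads`, `writes`, and `raw_hazard` are hypothetical.

```python
from typing import Dict, Set

DECLARED_STATES = {"DATA", "KEYC", "KEYD"}

# Per-instruction state access, transcribed from the example in the text.
STATE_ACCESS: Dict[str, Dict[str, str]] = {
    "LDDATA": {"DATA": "in"},
    "STKEY": {"KEYC": "out", "KEYD": "out"},
    "DES": {"KEYC": "inout", "KEYD": "inout", "DATA": "inout"},
}


def validate_state_access() -> None:
    """Reject undeclared states and unknown access keywords."""
    for instr, uses in STATE_ACCESS.items():
        for state, mode in uses.items():
            if state not in DECLARED_STATES:
                raise ValueError(f"{instr}: undeclared state {state}")
            if mode not in ("in", "out", "inout"):
                raise ValueError(f"{instr}: bad access keyword {mode}")


def reads(instr: str) -> Set[str]:
    return {s for s, m in STATE_ACCESS[instr].items() if m in ("in", "inout")}


def writes(instr: str) -> Set[str]:
    return {s for s, m in STATE_ACCESS[instr].items() if m in ("out", "inout")}


def raw_hazard(earlier: str, later: str) -> Set[str]:
    """States the later instruction reads that the earlier one writes."""
    return writes(earlier) & reads(later)
```

For instance, STKEY followed by DES yields the dependence set {KEYC, KEYD}, whereas LDDATA followed by DES yields none, because LDDATA only reads DATA.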

The instruction semantic statement semantic describes the behavior of one or more instructions, using the same subset of Verilog® that is used to encode operands. Defining multiple instructions in one semantic statement allows common expressions to be shared, which makes the hardware implementation more efficient. The variables allowed in a semantic statement are the operands of the opcodes in the statement's opcode list, plus a single-bit variable for each opcode in that list. Each such variable has the same name as its opcode and evaluates to 1 when that opcode is detected. It is used in the computation section (the Verilog® subset section) to indicate the presence of the corresponding instruction.

The first section of the code above defines the opcode for a new instruction called BYTESWAP.

Here, the new opcode BYTESWAP is defined as a sub-opcode of CUST0. According to the Xtensa® Instruction Set Architecture Reference Manual, described in more detail below, CUST0 is defined as follows:

where op0 and op2 are fields in the instruction. Opcodes are typically organized hierarchically. Here, ORST is the top-level opcode, CUST0 is a sub-opcode of ORST, and BYTESWAP in turn is a sub-opcode of CUST0. This hierarchical organization allows the opcode space to be logically grouped and managed.

The second declaration declares the additional processor state required by the BYTESWAP instruction.

Here, COUNT is declared as a 32-bit state and SWAP as a 1-bit state. The TIE language specifies that the bits of COUNT are indexed from 31 down to 0, with bit 0 being the least significant bit.

The Xtensa® ISA provides two instructions, RSR and WSR, for saving and restoring special system registers. Similarly, it provides two other instructions, RUR and WUR (described in detail below), for saving and restoring state declared in TIE. To save and restore such state, its mapping onto entries of the user register file, which the RUR and WUR instructions can access, must be specified. The following section of the code above specifies this mapping:

The following instructions then save the value of COUNT in a2 and the value of SWAP in a5:

This mechanism is in fact used in test programs to verify the contents of the state. In C, the two instructions above would appear as follows:

The next section of the TIE description defines a new instruction class containing the new instruction BYTESWAP:

where iclass is a keyword and bs is the name of the iclass. The next clause lists the instructions in this instruction class (BYTESWAP). The clause after that specifies the operands used by the instructions in the class (in this case, the input operand ars and the output operand arr). The last clause of the iclass definition specifies the state accessed by the instructions in the class (in this case, the instruction reads state SWAP and both reads and writes state COUNT).

The last block of the code above gives the formal semantic definition of the BYTESWAP instruction.

This description uses a subset of Verilog HDL to describe combinational logic. It is this block that defines precisely how the instruction set simulator simulates the BYTESWAP instruction, and how additional circuitry is synthesized and added to the Xtensa® processor hardware to support the new instruction.
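The actual semantic block is not reproduced here, so the following Python model rests on an assumed behavior, chosen only to be consistent with the iclass above (it reads SWAP, reads and writes COUNT, takes ars, and produces arr). The assumption is that when SWAP is set the instruction byte-reverses ars and increments COUNT, and otherwise it passes ars through unchanged. The names `State` and `byteswap_semantic` are hypothetical, and the parameter `BYTESWAP` plays the role of the one-bit opcode variable described earlier.

```python
from dataclasses import dataclass

MASK32 = 0xFFFF_FFFF


@dataclass
class State:
    COUNT: int = 0   # 32-bit state, bits 31..0
    SWAP: int = 0    # 1-bit state


def byteswap_semantic(BYTESWAP: int, ars: int, st: State) -> int:
    """Assumed semantics (not the patent's listing): when SWAP is set, reverse the
    bytes of ars and count the swap in COUNT. BYTESWAP is 1 when this opcode is decoded."""
    swapped = int.from_bytes((ars & MASK32).to_bytes(4, "little"), "big")
    arr = swapped if st.SWAP else ars & MASK32          # SWAP is only read
    if BYTESWAP and st.SWAP:
        st.COUNT = (st.COUNT + 1) & MASK32              # COUNT is read and written
    return arr
```

As in the TIE description, a state name on the right-hand side of an expression (`st.SWAP`, `st.COUNT`) is a read, and an assignment to it (`st.COUNT = ...`) is a write.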

In the present invention, once user-defined state is implemented, a declared state can be used like any other variable to access the information stored in it. A state identifier appearing on the right-hand side of an expression denotes a read from that state. Writing to a state is done by assigning a value or expression to the state identifier. For example, the following semantic code segment shows how an instruction reads and writes state.

To illustrate instructions that can be implemented in a configurable processor, both as core instructions and as instructions made available by selecting configuration options, the Xtensa® Instruction Set Architecture (ISA) Reference Manual, Revision 1.0, from Tensilica, Inc. is incorporated herein by reference. Further, to show examples of TIE language constructs that can be used to implement such user-defined instructions, Tensilica's Instruction Extension Language (TIE) Reference Manual, Revision 1.3, is also incorporated herein by reference.

From the TIE description, new hardware that executes the instructions can be generated using, for example, a program similar to the one shown in Appendix D. Appendix E shows the code for the header file needed to support the new instructions as intrinsic functions.

The configuration specification can be used to automatically generate:
• the instruction decode logic of the processor 60;
• the illegal-instruction detection logic for the processor 60;
• the ISA-specific portion of the assembler;
• the ISA-specific support routines for the compiler;
• the ISA-specific portion of the disassembler (used by the debugger); and
• the ISA-specific portion of the simulator.

FIG. 16 shows how the ISA-specific portions of these software tools are generated. From the user-written TIE description file 400, the TIE parser program 410 generates C code for several programs, each of which produces a file that one or more of the software development tools access for information about the user-defined instructions and state. For example, the program tie2gcc 420 generates a C header file 470 called xtensa-tie.h, which contains intrinsic function definitions for the new instructions. The program tie2isa 430 generates a dynamic link library (DLL) 480 containing information about the user-defined instruction formats (in the Wilson et al. application discussed below, this is effectively the combination of the encode and decode DLLs discussed here). The program tie2iss 440 generates performance-modeling routines and produces a DLL 490 containing the instruction semantics, which, as discussed in the Wilson et al. application, the host compiler uses to produce the simulator DLL used by the simulator. The program tie2ver 450 produces the description 500 of the user-defined instructions in an appropriate hardware description language. Finally, the program tie2xtos 460 produces the save/restore code 510 used by the RUR and WUR instructions.

A precise description of the instructions, and of how they access state, makes it possible to generate efficient logic that can be plugged into an existing high-performance microprocessor design. The method described in connection with this embodiment specifically handles new instructions that read from and write to one or more state registers. In particular, this embodiment shows how to derive the hardware logic for state registers in the context of a class of microprocessor implementation styles that all use pipelining to achieve high performance.

In pipelined implementations such as the one shown in FIG. 17, a state register is typically replicated several times, with each copy representing the value of the state at a particular pipeline stage. In this embodiment, a state is translated into multiple copies of the register, consistent with the underlying core processor implementation. Additional bypass and forwarding logic is also generated in a manner consistent with that implementation. For example, to target a core processor implementation with three execution stages, this embodiment translates a state into three registers connected as shown in FIG. 18. In this implementation, each of the registers 610 to 630 represents the state value in one of the three pipeline stages. The signals ctrl-1, ctrl-2, and ctrl-3 enable data latching in the corresponding flip-flops 610 to 630.

Additional logic and control signals are needed to operate the multiple copies of a state register consistently with the underlying processor implementation. "Consistently" means that, under conditions such as interrupts, exceptions, and pipeline stalls, the state must behave exactly like the rest of the processor. Typically, a given processor implementation defines certain signals that represent the various pipeline conditions, and these signals are needed for the pipelined state registers to operate correctly.

In a typical pipelined implementation, the execution unit consists of multiple pipeline stages, and the computation of an instruction is carried out across several of them. The instruction stream flows through the pipeline in the order directed by the control logic. At any given time, n instructions may be executing in the pipeline, where n is the number of stages. In a superscalar processor, which can also be implemented using the present invention, the number of instructions in the pipeline may be n·w, where w is the issue width of the processor.

The role of the control logic is to enforce the dependencies between instructions and to ensure that interference between instructions is resolved. If an instruction uses data computed by an earlier instruction, special hardware is needed to forward the data to the later instruction without stalling the pipeline. If an interrupt occurs, all instructions in the pipeline must be killed and later re-executed. If an instruction cannot execute because its input data or the computation hardware it needs is not available, it must be stalled. One cost-effective way to stall an instruction is to kill it in its first execution stage and re-execute it in the next cycle. This technique creates an invalid stage (a bubble) in the pipeline, which flows through the pipeline along with the other instructions and is discarded at the end of the pipeline, where instructions are committed.
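The kill-and-retry stall and the resulting bubble can be traced with a small cycle-by-cycle Python model of an in-order pipeline. This is an illustrative sketch, not the patent's control logic. The function `run_pipeline` and its `kill_cycles` parameter are hypothetical; `kill_cycles` simply lists the cycles in which the stage-1 instruction is assumed to lack its inputs.

```python
from typing import List, Optional, Set, Tuple

BUBBLE = None  # an invalid pipeline slot


def run_pipeline(program: List[str], kill_cycles: Set[int],
                 stages: int = 3) -> Tuple[List[List[Optional[str]]], List[str]]:
    """In-order pipeline in which a stalled instruction is killed in stage 1 and
    re-executed in the next cycle. Returns the per-cycle stage occupancy and the
    instructions committed at the end of the pipeline."""
    pipe: List[Optional[str]] = [BUBBLE] * stages
    trace: List[List[Optional[str]]] = []
    committed: List[str] = []
    pc = cycle = 0
    while pc < len(program) or any(slot is not BUBBLE for slot in pipe[1:]):
        pipe[0] = program[pc] if pc < len(program) else BUBBLE
        trace.append(list(pipe))
        s1 = pipe[0]
        killed = s1 is not BUBBLE and cycle in kill_cycles
        if pipe[-1] is not BUBBLE:
            committed.append(pipe[-1])     # bubbles reaching the end are dropped
        # Advance one stage; a killed instruction leaves a bubble behind it.
        pipe[1:] = [BUBBLE if killed else s1] + pipe[1:-1]
        if s1 is not BUBBLE and not killed:
            pc += 1                         # otherwise re-execute the same instruction
        cycle += 1
    return trace, committed
```

Running `["a", "b", "c"]` with `kill_cycles={1}` takes six cycles instead of five: in cycle 2, "b" is re-executed in stage 1 while a bubble occupies stage 2, and all three instructions still commit in order.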

Continuing the three-stage pipeline example above, a typical implementation of such processor state requires the additional logic and connections shown in FIG. 19.

Under normal conditions, a value computed in one stage is forwarded immediately to the next instruction, without waiting for it to reach the end of the pipeline, which reduces the number of pipeline stalls caused by data dependencies. This is achieved by sending the output of the first flip-flop 610 directly to the semantic block, so that the next instruction can use it at once. To handle abnormal conditions such as interrupts and exceptions, the implementation requires the following control signals: Kill_1, Kill_all, and Valid_3.

The signal "Kill_1" indicates that the instruction currently in the first pipeline stage 110 must be killed, for example because it does not yet have the data it needs to proceed. Once killed, the instruction is retried in the next cycle. The signal "Kill_all" indicates that all instructions currently in the first pipeline stage 110 must be killed, for example because an earlier instruction raised an exception or an interrupt occurred. The signal "Valid_3" indicates whether the instruction currently in the last stage 630 is valid. An invalid instruction there is often the result of killing an instruction in the first pipeline stage 610, which creates a bubble (an invalid instruction) in the pipeline. "Valid_3" thus simply indicates whether the instruction in the third pipeline stage is a valid instruction or a bubble. Clearly, only valid instructions should be latched.
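The interplay of forwarding, Kill_1, Kill_all, and Valid_3 can be illustrated with a behavioral Python model of one state replicated across three registers. This is a sketch under assumptions of ours, not the generated HDL. The class name `PipelinedState` is hypothetical; r1, r2, and r3 stand in for flip-flops 610, 620, and 630; r3 is assumed to hold the committed value; and Kill_all is assumed to flush every uncommitted update.

```python
class PipelinedState:
    """Hypothetical cycle model of one user state replicated across three stages.
    r1, r2, r3 stand in for flip-flops 610, 620, 630; r3 holds the committed value."""

    def __init__(self, reset: int = 0) -> None:
        self.r1 = self.r2 = self.r3 = reset

    def read(self) -> int:
        # Forwarding path: the semantic block in stage 1 sees the newest value in r1.
        return self.r1

    def clock(self, we: bool, next_value: int, kill_1: bool = False,
              kill_all: bool = False, valid_3: bool = False) -> None:
        # ctrl-3: commit only if the instruction leaving stage 3 is valid (not a bubble).
        if valid_3:
            self.r3 = self.r2
        if kill_all:
            # Flush all in-flight instructions: speculative copies revert to r3.
            self.r1 = self.r2 = self.r3
            return
        # ctrl-2: the stage-1 value moves down the pipe with its instruction.
        self.r2 = self.r1
        # ctrl-1: latch a new value only if the stage-1 instruction wrote it and survived.
        if we and not kill_1:
            self.r1 = next_value
```

If each instruction increments the state, a second increment issued right after the first reads 1 through `read()` before anything has committed; an interrupt signaled with `kill_all=True` then discards the updates of the flushed instructions and leaves only committed values.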

FIG. 20 shows the additional logic and connections needed to implement the state register. It also shows how the control logic that drives the signals "ctrl-1", "ctrl-2", and "ctrl-3" is constructed so that this state register implementation meets the requirements above. The following is automatically generated sample HDL code that implements the state register shown in FIG. 19.

With the pipelined state register model above, if a semantic block specifies a state as an input, the current value of that state is passed to the semantic block as an input variable. If the semantic block contains logic that generates a new value for a state, an output signal is created, and this output signal is used as the next-state input to the pipelined state register.

This embodiment allows multiple semantic description blocks, each of which describes the behavior of multiple instructions. Under this unrestricted description style, only a subset of the semantic blocks may produce a next-state output for a given state. Moreover, a given semantic block may produce a next-state output conditionally, depending on which instruction it is executing at a given time. Additional hardware logic is therefore required to combine the next-state outputs of all semantic blocks into the input of the pipelined state register. In this embodiment of the invention, a signal indicating whether a semantic block has produced a new value for the state is derived automatically for each semantic block. In another embodiment, these signals can be left for the designer to specify.

FIG. 20 shows how the next-state outputs for a state from several semantic blocks s1 to sn are combined, and how the appropriate one is selected as the input to the state register. In this figure, op1_1 and op1_2 are opcode signals for the first semantic block, and op2_1 and op2_2 are opcode signals for the second semantic block. The next-state output of semantic block i is si (if there are multiple state registers, the block has multiple next-state outputs). The signal si_we indicates that semantic block i has produced a new value for the state. The signal s_we indicates whether any semantic block has produced a new value for the state, and is used as the write-enable input of the pipelined state register.
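The combining logic of FIG. 20 can be sketched in Python as an AND-OR multiplexer. This is a minimal behavioral sketch that assumes at most one semantic block asserts its si_we for any one instruction (true for the single-issue machine described here); the decode-signal dictionary and function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def block_write_enable(decoded: Dict[str, bool], writers: Sequence[str]) -> bool:
    """si_we: OR of the decode signals of the block's instructions that write the state."""
    return any(decoded.get(op, False) for op in writers)


def combine_next_state(outputs: Sequence[Tuple[int, bool]]) -> Tuple[int, bool]:
    """Return (s, s_we) from per-block (si, si_we) pairs.

    s    = OR over i of (si AND si_we); correct when si_we is one-hot
    s_we = OR over i of si_we; when it is False the register keeps its value
    """
    s, s_we = 0, False
    for s_i, s_i_we in outputs:
        if s_i_we:
            s |= s_i
            s_we = True
    return s, s_we


# op1_2 is the instruction currently being decoded.
decoded = {"op1_1": False, "op1_2": True, "op2_1": False, "op2_2": False}
s1_we = block_write_enable(decoded, ["op1_1", "op1_2"])
s2_we = block_write_enable(decoded, ["op2_1"])  # assume op2_2 does not write this state
print(combine_next_state([(0x12, s1_we), (0x7F, s2_we)]))  # (18, True)
```

The AND-OR form needs no priority encoder; if two blocks could ever write the same state in the same cycle, a priority multiplexer would be needed instead.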

Although multiple semantic blocks are no more expressive than a single semantic block, they provide a way to give a more structured description, typically by grouping related instructions into one block. Multiple semantic blocks can also make the effects of instructions simpler to analyze, because each instruction is implemented within a more limited scope. On the other hand, there are often good reasons for one semantic block to describe the behavior of many instructions. Most often, it is because the hardware implementations of these instructions share common logic, so describing many instructions in one semantic block usually leads to a more efficient hardware design.

For interrupts and exceptions, software must be able to save the values of the state to data memory and reload them from data memory. Such save and restore instructions can be generated automatically from the formal description of the new state and the new instructions. In this embodiment of the invention, the logic for the save and restore instructions is generated automatically as two semantic blocks, which can then be translated into actual hardware in the same way as any other block. For example, from the following state declaration:

the following semantic block can be generated to read the values of "DATA", "KEYC", and "KEYD" into general-purpose registers.

FIG. 21 shows a block diagram of the logic corresponding to this kind of semantic block. The input signal "st" is compared with various constants to form selection signals, which select bits from the state registers in a manner consistent with the user_register specification. Under the preceding state declaration, bit 32 of DATA is placed in bit 0 of the second user register. Therefore, in this figure the second input of the MUX should be connected to bit 32 of the DATA state.
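Because the state declaration itself is not reproduced here, the following Python sketch assumes a hypothetical mapping consistent with the text: DATA is 64 bits, with DATA[63:32] in user register 1 so that bit 32 of DATA lands in bit 0 of the second user register, and KEYC and KEYD are assumed to be 28 bits wide. It models the RUR selection logic of FIG. 21, where the register number st selects which state bits are packed into the result.

```python
from typing import Dict, List, Tuple

# Hypothetical user_register mapping (assumed for illustration): each user register
# is packed from bit 0 upward with (state, msb, lsb) slices of the declared states.
USER_REGISTERS: Dict[int, List[Tuple[str, int, int]]] = {
    0: [("DATA", 31, 0)],
    1: [("DATA", 63, 32)],  # bit 32 of DATA -> bit 0 of the second user register
    2: [("KEYC", 27, 0)],
    3: [("KEYD", 27, 0)],
}


def bits(value: int, msb: int, lsb: int) -> int:
    """Extract value[msb:lsb]."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def rur(st: int, states: Dict[str, int]) -> int:
    """Model of RUR: compare st with each register number and mux in that register's bits."""
    result, pos = 0, 0
    for name, msb, lsb in USER_REGISTERS.get(st, []):  # unmapped numbers read as 0 here
        result |= bits(states[name], msb, lsb) << pos
        pos += msb - lsb + 1
    return result


states = {"DATA": 1 << 32, "KEYC": 0xABCDEF1, "KEYD": 0}
print(rur(1, states), hex(rur(2, states)))  # 1 0xabcdef1
```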

The following semantic block can be generated to write values from the general-purpose registers into the states "DATA", "KEYC", and "KEYD".

FIG. 22 shows the logic for bit j of state S when that bit is mapped to bit k of user register i. If the user_register number "st" in a WUR instruction is "i", bit k of "ars" is loaded into S[j]; otherwise, the original value of S[j] is recirculated. In addition, the signal S_we is asserted if any bit of state S is reloaded.
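The per-bit rule of FIG. 22 can be modeled in Python over whole bit slices. The sketch assumes a hypothetical user_register mapping (DATA split over user registers 0 and 1, 28-bit KEYC and KEYD in registers 2 and 3); it writes the selected bits of ars into each mapped state, recirculates every other bit, and reports S_we per state.

```python
from typing import Dict, List, Tuple

# Hypothetical user_register mapping (assumed for illustration), packed from bit 0 upward.
USER_REGISTERS: Dict[int, List[Tuple[str, int, int]]] = {
    0: [("DATA", 31, 0)],
    1: [("DATA", 63, 32)],
    2: [("KEYC", 27, 0)],
    3: [("KEYD", 27, 0)],
}


def wur(st: int, ars: int, states: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, bool]]:
    """Model of WUR: S[j] <= ars[k] when st == i, otherwise S[j] recirculates."""
    new_states = dict(states)
    s_we = {name: False for name in states}
    pos = 0  # bit k of ars that maps to the start of the current slice
    for name, msb, lsb in USER_REGISTERS.get(st, []):
        width = msb - lsb + 1
        mask = ((1 << width) - 1) << lsb
        incoming = ((ars >> pos) & ((1 << width) - 1)) << lsb
        new_states[name] = (new_states[name] & ~mask) | incoming  # untouched bits recirculate
        s_we[name] = True  # S_we: some bit of this state was reloaded
        pos += width
    return new_states, s_we


states = {"DATA": 0xFFFFFFFF_00000000, "KEYC": 0, "KEYD": 0}
new, we = wur(0, 0x1234, states)
print(hex(new["DATA"]), we["DATA"], we["KEYC"])  # 0xffffffff00001234 True False
```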

TIE's user_register declaration specifies a mapping from the additional processor state defined by state declarations to the identifiers used by the RUR and WUR instructions, which read and write this state independently of the TIE instructions.
Appendix F shows the code that generates the RUR and WUR instructions.

The main purpose of RUR and WUR is task switching. In a multitasking environment, multiple software tasks share a processor according to some scheduling algorithm. While a task is active, its state resides in the processor registers. When the scheduling algorithm decides to switch to another task, the state held in the processor registers is saved to memory, and the state of the other task is loaded from memory into the processor registers. The Xtensa® instruction set architecture (ISA) includes the RSR and WSR instructions for reading and writing the state defined by the ISA. For example, the following code is part of the task "save to memory":

The following code is part of the task "restore from memory":

Here SAR, LCOUNT, LBEG, and LEND are processor state registers of the core Xtensa® ISA, and ACCLO, ACCHI, MR_0, MR_1, MR_2, and MR_3 belong to the MAC16 Xtensa® ISA option. (The registers are saved and restored in pairs to avoid pipeline interlocks.)

When a designer defines new state in TIE, that state must be saved and restored on task switches in the same way as the state above. One possibility would be for the designer simply to edit the task-switch code (part of which is shown above) and add RUR/S32I and L32I/WUR instructions similar to the code above. However, a configurable processor is most effective when its software is generated automatically and is correct by construction. The present invention therefore includes a mechanism for automatically augmenting the task-switch code. The following lines are added to the save task above:

The following lines are added to the restore task above:

Finally, the task state area in memory must have additional space allocated for user register storage, and the offset of this space from the base of the task save pointer is defined as the assembler constant UEXCUREG. The save area was originally defined by the following code:

It is changed as follows:

This code depends on the tpp variable @user_registers, which holds the list of user register numbers. This list is simply made up of the first argument of every user_register statement.
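The augmentation can be pictured as a small generator driven by the list of user register numbers, the role played by @user_registers. The Python below is a sketch, not the tpp template itself. The operand order of the emitted rur/wur/s32i/l32i lines, the base and temporary registers (a3, a4, a5), and the layout of one 32-bit word per user register starting at UEXCUREG are all assumptions. Registers are handled in pairs, following the pairing rationale stated for the existing task-switch code.

```python
from typing import List, Sequence, Tuple


def task_switch_code(user_registers: Sequence[int], base: str = "a3",
                     temps: Tuple[str, str] = ("a4", "a5"),
                     area: str = "UEXCUREG") -> Tuple[List[str], List[str], int]:
    """Return (save lines, restore lines, save-area size in bytes).

    Hypothetical layout: the user register in list position p is stored at
    base + area + 4*p.
    """
    save: List[str] = []
    restore: List[str] = []
    for start in range(0, len(user_registers), 2):
        pair = user_registers[start:start + 2]
        slots = [(ur, temps[j], f"{area}+{4 * (start + j)}") for j, ur in enumerate(pair)]
        # Within each full pair, both reads precede both stores, so neither
        # store immediately follows the read it depends on.
        save += [f"rur {tmp}, {ur}" for ur, tmp, _ in slots]
        save += [f"s32i {tmp}, {base}, {off}" for _, tmp, off in slots]
        restore += [f"l32i {tmp}, {base}, {off}" for _, tmp, off in slots]
        restore += [f"wur {tmp}, {ur}" for ur, tmp, _ in slots]
    return save, restore, 4 * len(user_registers)


# @user_registers: the first argument of every user_register statement, e.g. 0, 1, 2, 3.
save, restore, size = task_switch_code([0, 1, 2, 3])
print(save[:2], size)  # ['rur a4, 0', 'rur a5, 1'] 16
```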

In some more complex microprocessor implementations, a state may be computed in different pipeline stages. Handling this requires several (albeit simple) extensions to the process described here. First, the specification language must be extended so that semantic blocks can be associated with pipeline stages. This can be done in one of several ways. In one embodiment, the associated pipeline stage can be specified explicitly for each semantic block. In another embodiment, a range of pipeline stages can be specified for each semantic block. In yet another embodiment, the pipeline stage for a given semantic block can be derived automatically from its required computational delay.

The second task in supporting state generation in different pipeline stages is handling interrupts, exceptions, and stalls. This usually involves adding appropriate bypass and forwarding logic under the control of the pipeline control signals. In one embodiment, a generation/usage diagram can be produced to show the relationship between when a state is generated and when it is used. Based on application analysis, appropriate forwarding logic can be implemented to handle the common cases, and interlock logic can be generated to stall the pipeline in the cases the forwarding logic does not handle.
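As a rough illustration of how a generation/usage table could be split between forwarding and interlocks, the Python sketch below applies ordinary pipeline-hazard arithmetic. The assumptions are ours rather than the source's: one instruction enters stage 1 per cycle, a result produced in stage p can be bypassed to the start of stage u of a later instruction in the following cycle, and the consumer issues `distance` cycles after the producer. The table contents are hypothetical.

```python
from typing import Dict, Tuple


def stall_cycles(produce_stage: int, use_stage: int, distance: int = 1) -> int:
    """Stalls still needed with full bypass (stages are 1-based)."""
    # The producer (issued at cycle 0) has the value at the end of cycle produce_stage - 1;
    # the consumer (issued at cycle `distance`) needs it at the start of cycle
    # distance + use_stage - 1.
    return max(0, produce_stage - use_stage - distance + 1)


def classify(usage: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Label each (produce, use) pair for back-to-back instructions."""
    result = {}
    for name, (p, u) in usage.items():
        stalls = stall_cycles(p, u)
        result[name] = "forward" if stalls == 0 else f"interlock ({stalls} cycles)"
    return result


# Hypothetical generation/usage table: state -> (stage that produces it, stage that reads it).
usage = {"ACC": (1, 1), "KEY": (3, 1)}
print(classify(usage))  # {'ACC': 'forward', 'KEY': 'interlock (2 cycles)'}
```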

How the instruction-issue logic of the base processor is modified depends on the algorithm the base processor uses. In general, however, the instruction-issue logic of most processors, whether single-issue or superscalar, and whether for single-cycle or multi-cycle instructions, depends only on the following signals for the instruction being tested for issue:
1. for each processor state component, a signal indicating whether the instruction uses that state as a source;
2. for each processor state component, a signal indicating whether the instruction uses that state as a destination; and
3. for each functional unit, a signal indicating whether the instruction uses that functional unit.

These signals are used to perform issue and cross-issue checks against the pipeline and to update the pipeline state in the pipeline-dependent issue logic. TIE contains all the information needed to augment these signals and their equations for the new instructions.

First, each TIE state declaration causes a new signal to be created for the instruction-issue logic. Each in or inout operand or state listed in the third or fourth argument of an iclass declaration adds the instruction-decode signals of the instructions listed in its second argument to the first set of equations, for the specified processor state component.

Second, each out or inout operand or state listed in the third or fourth argument of an iclass declaration adds the instruction-decode signals of the instructions listed in its second argument to the second set of equations, for the specified processor state component.

Third, the logic generated from each TIE semantic block represents a new functional unit: a new unit signal is created, and the decode signals of the TIE instructions specified for that semantic block are ORed together to form the third set of equations.
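The three steps above amount to building sum-of-products terms from the TIE description. The Python sketch below represents each equation as the set of instruction-decode signals that are ORed together. The tuple layout mirrors the argument order described in the text (name, instructions, operands, states), but the record format and the example iclass, operand, and semantic-block names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (iclass name, instructions, operands, states); each operand/state is (direction, name).
IClass = Tuple[str, List[str], List[Tuple[str, str]], List[Tuple[str, str]]]


def issue_equations(iclasses: List[IClass], semantics: Dict[str, List[str]]):
    src_use: Dict[str, Set[str]] = defaultdict(set)   # set 1: component is read
    dst_use: Dict[str, Set[str]] = defaultdict(set)   # set 2: component is written
    unit_use: Dict[str, Set[str]] = defaultdict(set)  # set 3: functional unit is used
    for _name, insts, operands, states in iclasses:
        for direction, comp in operands + states:
            if direction in ("in", "inout"):
                src_use[comp].update(insts)
            if direction in ("out", "inout"):
                dst_use[comp].update(insts)
    for block, insts in semantics.items():
        unit_use[block].update(insts)  # each semantic block is a new functional unit
    return src_use, dst_use, unit_use


def asserted(terms: Set[str], decoded: Dict[str, bool]) -> bool:
    """Evaluate one equation: the OR of the listed instruction-decode signals."""
    return any(decoded.get(inst, False) for inst in terms)


iclasses = [
    ("rrr", ["foo"], [("out", "arr"), ("in", "ars"), ("in", "art")], []),
    ("acc", ["mac"], [("in", "ars")], [("inout", "ACC")]),
]
src, dst, unit = issue_equations(iclasses, {"sem_foo": ["foo"], "sem_mac": ["mac"]})
print(sorted(src["ACC"]), sorted(dst["ACC"]), asserted(unit["sem_mac"], {"mac": True}))
# ['mac'] ['mac'] True
```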

When an instruction is issued, the pipeline status must be updated for future issue decisions. Again, how the base processor's instruction-issue logic is modified depends on the algorithm it uses, but some general observations are possible. The pipeline status must return the following status signals to the issue logic:
4. for each destination of an issued instruction, a signal indicating when its result becomes available for bypass; and
5. for each functional unit, a signal indicating that the unit is ready to accept another instruction.

The embodiment described here is a single-issue processor in which designer-defined instructions are limited to a single cycle of logic computation. In this case, the above is simplified considerably. There is no need to check functional units or to perform cross-issue checks, and no single-cycle instruction can leave a processor state component not pipe-ready for the next instruction. The issue equation therefore becomes:

In this case, the src[i]pipeready signals are unaffected by the additional instructions, and src[i]use is the first set of equations, augmented as described above. In this embodiment, the fourth and fifth sets of signals are not needed. For an alternative embodiment that is multi-cycle and multi-issue, the TIE specification would be augmented with a latency specification giving, for each instruction, the number of cycles over which its computation is pipelined.

The fourth set of signals would be generated at each semantic-block pipe stage by ORing together the instruction-decode signals of every instruction that, according to the specification, completes in that stage.

By default, the generated logic is fully pipelined, so a TIE-generated functional unit is always ready one cycle after accepting an instruction. In this case, the fifth set of signals for a TIE semantic block is always asserted. When the logic in a semantic block must be reused over multiple cycles, a further specification would state for how many cycles the functional unit is occupied by such an instruction. The fifth set of signals would then be generated at each semantic-block pipe stage by ORing together the instruction-decode signals of every instruction whose specified cycle count ends in that stage.
Alternatively, in yet another embodiment, it may be left as an extension to TIE for the designer to specify the result-ready and functional-unit-ready signals directly.

Examples of code processed according to this embodiment are shown in the appendices. For brevity they are not described in detail here, but they will be readily understood by those skilled in the art after reviewing the reference manual mentioned above. Appendix G is an example of instructions implemented in the TIE language, and Appendix H shows what the TIE compiler generates for the compiler that uses this code. Similarly, Appendix I shows what the TIE compiler generates for the simulator; Appendix J shows what it generates for the macros that expand TIE instructions in a user application; Appendix K shows what it generates to simulate TIE instructions in native mode; and Appendix L shows what it generates as a Verilog HDL description of the additional hardware. Finally, Appendix M shows what the TIE compiler generates as a design compiler script that optimizes the Verilog HDL description above and estimates the effect of the TIE instructions' area and speed on the size and performance of the whole CPU.

As described above, to begin the processor configuration procedure, the user selects a base processor configuration through the GUI described above. As part of this process, a software development system 30 is built and delivered to the user, as shown in FIG. 1. The software development system 30 contains four main components relevant to another aspect of the present invention, shown in detail in FIG. 6: the compiler 108, the assembler 110, the instruction set simulator 112, and the debugger 130.

As is known to those skilled in the art, a compiler converts user applications written in a high-level programming language such as C or C++ into processor-specific assembly language. High-level languages such as C and C++ are designed to let application writers describe their applications in a form that is easy to write precisely. They are not languages the processor understands, and the application writer need not be concerned with the particular features of the processor that will be used. Typically, the same C or C++ program can be used with little or no modification on many different types of processors.

The compiler translates C or C++ programs into assembly language. Assembly language is much closer to machine language and is directly supported by the processor; each type of processor has its own assembly language. Each assembly instruction often represents one machine instruction, but the two are not necessarily identical. Assembly instructions are designed to be human-readable strings: each instruction and operand has a meaningful name or mnemonic, so that a person can read the assembly instructions and easily understand which operations the machine will perform. The assembler translates assembly language into machine language, efficiently encoding each assembly instruction string into one or more machine instructions that the processor can execute directly and efficiently.

Machine code can be executed directly on the processor, but a physical processor is not always immediately available. Building a physical processor is a time-consuming and expensive process, and while exploring possible processor configurations the user cannot build a physical processor for every candidate. Instead, the user is given a software program called a simulator. The simulator, a program running on a general-purpose computer, simulates the effect of running the user's application on the user-configured processor. It mimics the semantics of the simulated processor and can tell the user how fast the real processor would run the user's application.

A debugger is a tool that lets the user find problems in software interactively. The debugger allows the user to run a program interactively: the user can stop execution at any time and view the C source code or the resulting assembly or machine code. At a breakpoint, the user can also examine or modify the values of any or all variables and hardware registers. The user can then continue execution, perhaps one statement at a time, perhaps one machine instruction at a time, or up to a new breakpoint selected by the user.

All four components 108, 110, 112, and 130 must know about the user-defined instructions 750 (see FIG. 3), and the simulator 112 and the debugger 130 must additionally know about the user-defined state 752. The system gives the user access to the user-defined instructions 750 through intrinsics added to the user's C and C++ applications. The compiler 108 must translate intrinsic calls into assembly language instructions 738 for the user-defined instructions 750. The assembler 110 must accept the new assembly language instructions 738, whether written directly by the user or produced by the compiler 108, and encode them into the machine instructions 740 corresponding to the user-defined instructions 750. The simulator 112 must decode the user-defined machine instructions 740, model their semantics, and model their performance on the configured processor. The simulator 112 must also model the values and performance implications of the user-defined state. The debugger 130 must let the user print assembly language instructions 738, including the user-defined instructions 750.
The debugger 130 must also let the user examine and modify the values of the user-defined state. In this aspect of the invention, the user invokes a tool, the TIE compiler 702, to process the current candidate user-defined enhancements 736. The TIE compiler 702 differs from the compiler 708, which translates user applications into assembly language 738. The TIE compiler 702 builds components that enable the already-built base software system 30 (compiler 708, assembler 710, simulator 712, and debugger 730) to use the new user-defined enhancements 736. Each element of the software system 30 uses a somewhat different set of these components.

FIG. 24 shows how the TIE-specific portions of these software tools are generated. From the user-defined extension file 736, the TIE compiler 702 generates C code for several programs, each of which produces a file that one or more software development tools access for information about the user-defined instructions and state. For example, the program tie2gcc 800 generates a C header file 842 called xtensa-tie.h (described in detail below), which contains intrinsic function definitions for the new instructions. The program tie2isa 810 generates a dynamic link library (DLL) 844/848 containing information about the user-defined instruction formats (a combination of the encode DLL 844 and the decode DLL 848, described in detail below). The program tie2iss 840 generates C code 870 for performance modeling and instruction semantics, which, as described below, is compiled by the host compiler 846 into the simulator DLL 849 used by the simulator 712. The program tie2ver 850 produces the description 850 of the user-defined instructions in an appropriate hardware description language. Finally, the program tie2xtos 860 produces save/restore code 810 that saves and restores the user-defined state for context switching.
Additional information on the implementation of user-defined state can be found in the Wang et al. application mentioned above.

Compiler 708
In this embodiment, the compiler 708 translates intrinsic calls in the user's application into assembly language instructions 738 for the user-defined enhancements 736. The compiler 708 implements this mechanism on top of the macro and inline assembly mechanisms found in standard compilers such as the GNU compiler. For more information on these mechanisms, see the GNU C and C++ Compiler User Guide, EGCS version 1.0.3.

Consider a user who wants to create a new instruction foo that operates on two registers and returns its result in a third register. The user places the instruction description in a user-defined instruction file 750 in a special directory and invokes the TIE compiler 702. The TIE compiler 702 creates a file with a standard name, such as xtensa-tie.h, which contains the following definition for foo:

When invoking the compiler 708 on the application, the user tells the compiler 708, through a command-line option or an environment variable, the name of the directory containing the user-defined enhancements 736. That directory also contains the xtensa-tie.h file 742. The compiler 708 automatically includes the file xtensa-tie.h in the user's C or C++ application program being compiled, as if the user had written the definition of foo directly. The user's application contains intrinsic calls to the instruction foo. Because of the included definition, the compiler 708 treats these intrinsic calls as calls to that definition. Using the standard macro mechanism it provides, the compiler 708 handles calls to the macro foo as if the user had written the assembly language statement 738 directly instead of a macro call. That is, using the standard inline assembly mechanism, the compiler 708 translates each call into the single assembly instruction foo. For example, a user might have a function that contains a call to the intrinsic foo:

Using the user-defined intrinsic foo, the compiler translates this function into the following assembly language subroutine:

If the user creates a new set of user-defined enhancements 736, there is no need to rebuild the compiler. The TIE compiler 702 simply creates the file xtensa-tie.h 742, which the prebuilt compiler 708 automatically includes in the user's application.

Assembler 710
In this embodiment, the assembler 710 uses the encoding library 744 to encode assembly instructions 750. The interface to this library 744 includes functions that:
・ translate an operation code mnemonic string into an internal operation code representation;
・ provide, for each operation code, the bit pattern to be generated in the operation code field of the machine instruction 740; and
・ encode the operand value for each instruction operand and insert the encoded operand bit pattern into the operand fields of the machine instruction 740.

As an example, consider again the earlier user function that calls the intrinsic foo. The assembler takes the instruction "foo a2, a2, a3" and converts it into the machine instruction represented by the hexadecimal number 0x62230. Here the upper 6 and the lower 0 together represent the operation code for foo, and 2, 2, and 3 represent the three registers a2, a2, and a3, respectively.
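The hexadecimal value can be reproduced by packing the fields. This sketch assumes a 24-bit layout built from the bit positions given in this section (operation code byte 0x06 in bits 16..23, result register in bits 12..15, 0 in bits 0..3) and infers from the hex digits that the two source registers sit in bits 8..11 and 4..7; the helper name `pack_foo` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: assembles "foo r, s, t" into a 24-bit word. */
static uint32_t pack_foo(uint32_t r, uint32_t s, uint32_t t)
{
    const uint32_t op_hi = 0x06; /* operation code byte, bits 16..23 */
    const uint32_t op_lo = 0x0;  /* operation code nibble, bits 0..3 */
    return (op_hi << 16)         /* 0x60000                          */
         | ((r & 0xFu) << 12)    /* result register                  */
         | ((s & 0xFu) << 8)     /* first source register            */
         | ((t & 0xFu) << 4)     /* second source register           */
         | op_lo;
}
```

For "foo a2, a2, a3" this gives 0x60000 + 0x2000 + 0x200 + 0x30 + 0x0 = 0x62230.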

The internal implementation of these functions is based on a combination of tables and internal functions. Tables are easy for the TIE compiler 702 to generate, but their expressive power is limited. When more flexibility is needed, for example to express an operand encoding function, the TIE compiler 702 can generate arbitrary C code to be included in the library 744.

Consider again the example "foo a2, a2, a3". Every register field is encoded simply as the register number. The TIE compiler 702 generates a function that checks for a legal register value and, if the value is legal, returns the register number.
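A minimal sketch of such a check, assuming 16 general registers a0..a15 so that each register number fits a 4-bit field. The name `encode_ar` and the convention of returning -1 for an illegal value are hypothetical.

```c
/* Hypothetical register-operand encoder: the field value is simply the
   register number, provided it names one of the 16 assumed registers. */
static int encode_ar(int regno)
{
    if (regno < 0 || regno > 15)
        return -1;      /* illegal register: reject */
    return regno;       /* legal: the encoding is the number itself */
}
```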

If all encodings were this simple, no encoding function would be needed; a single table would suffice. However, the user may choose more complex encodings. Consider an encoding, written in the TIE language, that encodes each operand as the operand value divided by 1024. Such an encoding is useful for compactly encoding values that must be multiples of 1024.

The TIE compiler converts this operand encoding description into a C function.
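The generated function might take roughly this shape. The sketch assumes an 8-bit operand field (the width is not stated here) and reports failure when the value is not a multiple of 1024 or does not fit after division; `encode_scaled` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCALED_FIELD_BITS 8   /* assumed field width */

/* Encodes value as value / 1024; fails if the division is inexact or
   the quotient does not fit in the field. */
static bool encode_scaled(uint32_t value, uint32_t *field)
{
    if (value % 1024 != 0)
        return false;
    uint32_t q = value / 1024;
    if (q >= (1u << SCALED_FIELD_BITS))
        return false;
    *field = q;
    return true;
}
```

With the assumed 8-bit field, one function covers 256 legal values spread across 0 to 261,120, which illustrates why a value-indexed table would be impractical.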

A single table cannot be used for such an encoding, because the range of possible values for the operand is very large; such a table would have to be enormous.

In one embodiment of the encoding library 744, one table maps operation code mnemonic strings to internal operation code representations. For efficiency, this table may be sorted, or it may be a hash table or another data structure that allows efficient searching. Another table maps each operation code to a machine instruction template whose operation code field is initialized to the appropriate bit pattern for that operation code. Operation codes with the same operand fields and operand encodings are grouped together. For the operation codes in each of these groups, the library contains a function that encodes an operand value into a bit pattern and another function that inserts those bits into the appropriate field of the machine instruction. A separate internal table maps each instruction operand to these functions. Consider an example in which the result register number is encoded in bits 12..15 of the instruction. The TIE compiler 702 generates a function that sets bits 12..15 of the instruction to the value (number) of the result register.
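A sketch of these table-driven pieces, under assumptions: a mnemonic table kept sorted so it can be searched with the standard C `bsearch`, template words with the operation code bits preset (using the foo1/foo2/foo3 encodings from the decoding example in this section), and a set function for the result field in bits 12..15. The structure and function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical opcode table, sorted by mnemonic for bsearch. Each
   entry holds the instruction template with its opcode bits preset. */
struct opcode_entry {
    const char *mnemonic;
    uint32_t    tmpl;
};

static const struct opcode_entry opcode_table[] = {
    { "foo1", 0x060000 },   /* 0x06 in bits 16..23, 0 in bits 0..3 */
    { "foo2", 0x160000 },
    { "foo3", 0x260000 },
};

static int cmp_mnemonic(const void *key, const void *elem)
{
    const struct opcode_entry *e = elem;
    return strcmp((const char *)key, e->mnemonic);
}

/* Returns the table entry for a mnemonic, or NULL if it is unknown. */
static const struct opcode_entry *lookup_opcode(const char *mnemonic)
{
    return bsearch(mnemonic, opcode_table,
                   sizeof opcode_table / sizeof opcode_table[0],
                   sizeof opcode_table[0], cmp_mnemonic);
}

/* Inserts the encoded result register into bits 12..15. */
static uint32_t set_r_field(uint32_t insn, uint32_t r)
{
    return (insn & ~(0xFu << 12)) | ((r & 0xFu) << 12);
}
```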

To allow user-defined instructions to be changed without rebuilding the assembler 710, the encoding library 744 is implemented as a dynamically linked library (DLL). DLLs are a standard way for a program to extend its functionality dynamically. Although the details of DLL handling differ between host operating systems, the basic concept is the same. A DLL is loaded dynamically into a running program as an extension of the program code. A run-time linker resolves symbolic references between the DLL and the main program, and between the DLL and other DLLs that are already loaded. In the case of the encoding library or DLL 744, only a small portion of the code is statically linked into the assembler 710. This code is responsible for loading the DLL, for combining the information in the DLL with the existing encoding information for the prebuilt instruction set 746 (which may itself be loaded from another DLL), and for making that information accessible through the interface functions described above.

When the user creates new enhancements 736, the user invokes the TIE compiler 702 on the description of the enhancements 736. The TIE compiler 702 generates C code that defines the internal tables and the functions implementing the encoding DLL 744. The TIE compiler 702 then invokes the host system compiler 746 (which compiles code to run on the host rather than on the processor being configured) to build the encoding DLL 744 for the user-defined instructions 750. The user invokes the prebuilt assembler 710 on the application with a flag or environment variable that points to the directory containing the user-defined enhancements 736. The prebuilt assembler 710 dynamically opens the DLL 744 in that directory. For each assembly instruction, the prebuilt assembler 710 uses the encoding DLL 744 to look up the operation code mnemonic, find the bit pattern for the operation code field of the machine instruction, and encode each instruction operand.
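On POSIX hosts, the dynamic-opening step could look roughly like this sketch, which uses the standard `dlopen`, `dlsym`, `dlclose`, and `dlerror` calls. The library file name `libtie-encode.so` and the exported symbol `tie_encode_table` are hypothetical, and older glibc versions need `-ldl` at link time.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical loader: opens the user's encoding DLL from the
   enhancement directory and resolves one exported table.
   Returns NULL on failure and leaves *handle_out untouched. */
static void *load_encode_table(const char *tie_dir, void **handle_out)
{
    char path[4096];
    if (snprintf(path, sizeof path, "%s/libtie-encode.so", tie_dir)
            >= (int)sizeof path)
        return NULL;                      /* path too long */

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "cannot open %s: %s\n", path, dlerror());
        return NULL;
    }
    void *table = dlsym(handle, "tie_encode_table");
    if (table == NULL) {
        dlclose(handle);                  /* symbol missing: unload */
        return NULL;
    }
    *handle_out = handle;                 /* caller dlclose()s later */
    return table;
}
```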

For example, when the assembler 710 sees the TIE instruction "foo a2, a2, a3", it finds in one table that the "foo" operation code translates to the number 6 in bit positions 16..23. From another table, the assembler 710 finds the encoding function for each register. These functions encode a2 as the number 2, the second a2 as 2, and a3 as 3. From a further table, the assembler 710 finds the appropriate set functions: Set_r_field places the resulting value 2 in bits 12..15 of the instruction, and similar set functions place the other 2 and the 3 in their fields.

Simulator 712
The simulator 712 interacts with the user-defined enhancements 736 in several ways. Given a machine instruction 740, the simulator 712 must decode the instruction, that is, break it into its component operation code and operands. User-defined enhancements 736 are decoded through functions in the decode DLL 748 (the encode DLL 744 and the decode DLL 748 may in fact be a single DLL). For example, suppose the user defines three operation codes, foo1, foo2, and foo3, with the encodings 0x6, 0x16, and 0x26, respectively, in instruction bits 16..23, and 0 in bits 0..3. The TIE compiler 702 generates a decode function that compares the operation code against the operation codes of all user-defined instructions 750.
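A sketch of that linear comparison, reading the operation code byte from bits 16..23 and the low nibble from bits 0..3 as stated above; the enumeration and function names are hypothetical.

```c
#include <stdint.h>

enum user_op { USER_OP_NONE = -1, USER_OP_FOO1, USER_OP_FOO2, USER_OP_FOO3 };

/* Compares the instruction against every user-defined encoding in turn. */
static enum user_op decode_user_op(uint32_t insn)
{
    uint32_t op_hi = (insn >> 16) & 0xFF;   /* bits 16..23 */
    uint32_t op_lo = insn & 0xF;            /* bits 0..3   */

    if (op_hi == 0x06 && op_lo == 0) return USER_OP_FOO1;
    if (op_hi == 0x16 && op_lo == 0) return USER_OP_FOO2;
    if (op_hi == 0x26 && op_lo == 0) return USER_OP_FOO3;
    return USER_OP_NONE;                    /* not a user instruction */
}
```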

When there are many user-defined instructions 750, comparing the operation code against every possible user-defined instruction 750 is costly, so the TIE compiler can instead use a hierarchical set of switch statements.
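The same decoder rewritten as nested switch statements, a sketch using the foo1/foo2/foo3 encodings above. Dispatching first on bits 0..3 and then on bits 16..23 means each instruction passes through at most two switches, which a C compiler can often lower to jump tables or binary searches; the names are hypothetical.

```c
#include <stdint.h>

enum user_op { USER_OP_NONE = -1, USER_OP_FOO1, USER_OP_FOO2, USER_OP_FOO3 };

/* Two-level dispatch: low nibble first, then the operation code byte. */
static enum user_op decode_user_op_switch(uint32_t insn)
{
    switch (insn & 0xF) {                   /* bits 0..3 */
    case 0x0:
        switch ((insn >> 16) & 0xFF) {      /* bits 16..23 */
        case 0x06: return USER_OP_FOO1;
        case 0x16: return USER_OP_FOO2;
        case 0x26: return USER_OP_FOO3;
        default:   return USER_OP_NONE;
        }
    default:
        return USER_OP_NONE;
    }
}
```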

In addition to decoding instruction operation codes, the decode DLL 748 contains functions for decoding instruction operands. This is done in the same way as operand encoding in the encode DLL 744. First, the decode DLL 748 provides functions that extract operand fields from machine instructions. Continuing the previous example, the TIE compiler 702 generates a function that extracts the value in bits 12..15 of an instruction.
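The extractor is the inverse of the set function used during encoding. A sketch with hypothetical names; the r field position (bits 12..15) is from the text, while the s and t positions (bits 8..11 and 4..7) are inferred from the 0x62230 example.

```c
#include <stdint.h>

/* Hypothetical operand-field extractors. */
static uint32_t get_r_field(uint32_t insn) { return (insn >> 12) & 0xF; } /* bits 12..15 */
static uint32_t get_s_field(uint32_t insn) { return (insn >> 8) & 0xF; }  /* bits 8..11  */
static uint32_t get_t_field(uint32_t insn) { return (insn >> 4) & 0xF; }  /* bits 4..7   */
```

Applied to 0x62230, these return 2, 2, and 3, recovering the registers a2, a2, and a3.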

Because the TIE description of an operand includes both an encoding specification and a decoding specification, the encode DLL 744 uses the operand encoding specification, while the decode DLL 748 uses the operand decoding specification. For example, the decoding part of a TIE operand specification is turned into a corresponding operand decode function.
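If the operand is the divide-by-1024 example from the encoding discussion (an assumption, since the specification itself is not shown here), its decode function simply reverses the encoding. A hypothetical sketch:

```c
#include <stdint.h>

/* Hypothetical decode for the scaled operand: field value times 1024. */
static uint32_t decode_scaled(uint32_t field)
{
    return field * 1024u;   /* equivalently field << 10 */
}
```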

When the user invokes the simulator 712, the user tells the simulator 712 which directory contains the decode DLL 748 for the user-defined enhancements 736. The simulator 712 opens the appropriate DLL. Whenever the simulator 712 decodes an instruction that the decode functions for the prebuilt instruction set fail to decode, the simulator 712 calls the decode function in the DLL 748.
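The fallback order can be sketched with function pointers. Both decoders here are stubs: the core stub recognizing operation code byte 0x0A is invented for illustration, and `ext_decode` stands for a pointer that would in practice be obtained from the DLL 748 (for example with `dlsym`). All names and returned ids are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*decode_fn)(uint32_t insn);   /* opcode id, or -1 */

/* Stub for the prebuilt core decoder (recognizes only op byte 0x0A). */
static int core_decode(uint32_t insn)
{
    return ((insn >> 16) & 0xFF) == 0x0A ? 100 : -1;
}

/* Stub for the user-defined decoder that would live in the DLL. */
static int user_decode(uint32_t insn)
{
    return (((insn >> 16) & 0xFF) == 0x06 && (insn & 0xF) == 0) ? 200 : -1;
}

static decode_fn ext_decode = user_decode; /* NULL if no DLL was given */

/* Tries the core decoder first and falls back to the extension. */
static int decode_insn(uint32_t insn)
{
    int op = core_decode(insn);
    if (op < 0 && ext_decode != NULL)
        op = ext_decode(insn);
    return op;
}
```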

Given a decoded instruction 750, the simulator 712 must interpret and model the semantics of that instruction 750. This is done functionally: every instruction 750 has a corresponding function that lets the simulator 712 model its semantics. The simulator 712 internally keeps track of all the state of the simulated processor and has a fixed interface for updating or querying that state. As described above, user-defined enhancements 736 are written in the TIE hardware description language, which is a subset of Verilog. To model a new enhancement 736, the TIE compiler 702 converts the hardware description into a C function used by the simulator 712. Operators in the hardware description language are translated directly into the corresponding C operators. Operations that read or write state are translated into calls to the simulator interface that query or update the processor state.

As an example of this embodiment, consider a user who creates an instruction 750 that adds two registers. This example was chosen for simplicity. In the hardware description language, the user might describe the semantics of the add instruction as a single assignment statement.

The output register, denoted by the built-in name arr, is assigned the sum of the two input registers, denoted by the built-in names ars and art. The TIE compiler 702 accepts this description and generates a semantic function used by the simulator 712.

The hardware operator "+" is translated directly into the C operator "+". Reads of the hardware registers ars and art are translated into calls to the simulator 712 function "ar". The write of the hardware register arr is translated into a call to the simulator 712 function "set_ar". Because every instruction increments the program counter pc by its own size, the TIE compiler 702 also generates a call to a simulator 712 function that increments the simulated pc by 3, the size of the add instruction.
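A sketch of the generated semantic function against a toy simulator state. The text names "ar" and "set_ar" but not their signatures, so the state layout, the signatures, and the helper `increment_pc` are all assumptions.

```c
#include <stdint.h>

/* Hypothetical simulator state and interface. */
struct sim_state {
    uint32_t ar[16];
    uint32_t pc;
};

static uint32_t ar(const struct sim_state *s, int r) { return s->ar[r]; }
static void set_ar(struct sim_state *s, int r, uint32_t v) { s->ar[r] = v; }
static void increment_pc(struct sim_state *s, uint32_t n) { s->pc += n; }

/* What the TIE compiler might emit for "arr = ars + art". */
static void semantic_add(struct sim_state *s, int arr, int ars, int art)
{
    set_ar(s, arr, ar(s, ars) + ar(s, art)); /* "+" maps to C "+"   */
    increment_pc(s, 3);                      /* 3-byte instruction  */
}
```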

When invoked, the TIE compiler 702 creates a semantic function for every user-defined instruction, as described above. It also creates a table that maps every operation code name to its associated semantic function. The table and functions are compiled into the simulator DLL 749 using the standard compiler 746. When the user invokes the simulator 712, the user tells the simulator 712 which directory contains the user-defined enhancements 736. The simulator 712 opens the appropriate DLL. Whenever the simulator 712 is invoked, it decodes all instructions in the program and creates a table that maps each instruction to its associated semantic function. When creating this mapping, the simulator 712 opens the DLL and searches for the appropriate semantic function. When simulating the semantics of a user-defined instruction 736, the simulator 712 calls the function in the DLL directly.
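A sketch of the two tables involved: a name-to-function table of the kind the simulator DLL 749 would export, and a per-program table the simulator builds once so that each later execution is a direct call. The instruction names, the accumulator-based stub semantics, and the linear name search are assumptions made for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*semantic_fn)(uint32_t *acc);

static void sem_inc(uint32_t *acc) { *acc += 1; }   /* stub semantics */
static void sem_dbl(uint32_t *acc) { *acc *= 2; }   /* stub semantics */

/* Table a DLL might export: operation code name -> semantic function. */
static const struct { const char *name; semantic_fn fn; } sem_table[] = {
    { "inc", sem_inc },
    { "dbl", sem_dbl },
};

static semantic_fn find_semantic(const char *name)
{
    for (size_t i = 0; i < sizeof sem_table / sizeof sem_table[0]; i++)
        if (strcmp(sem_table[i].name, name) == 0)
            return sem_table[i].fn;
    return NULL;
}

/* Decodes the program once into function pointers, then runs it by
   calling each function directly. Returns -1 on an unknown name. */
static int run_program(const char *const *prog, size_t n, uint32_t *acc)
{
    semantic_fn decoded[64];
    if (n > 64)
        return -1;
    for (size_t i = 0; i < n; i++)
        if ((decoded[i] = find_semantic(prog[i])) == NULL)
            return -1;
    for (size_t i = 0; i < n; i++)
        decoded[i](acc);
    return 0;
}
```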

To tell the user how long the application will take to run on the simulated hardware, the simulator 712 must simulate the performance effects of the instructions 750. The simulator 712 uses a pipeline model for this purpose. Every instruction executes over several cycles, and in each cycle it uses different machine resources. The simulator 712 starts by trying to execute all instructions in parallel. If multiple instructions try to use the same resource in the same cycle, the later instruction stalls until the resource is free. If a later instruction reads state that an earlier instruction writes in a later cycle, the later instruction stalls until the value has been written. The simulator 712 uses a functional interface to model the performance of each instruction: one function is created for each type of instruction, and that function contains calls to the simulator interface that models processor performance.

For example, consider a simple three-register instruction foo. The TIE compiler might create a simulator performance function for it consisting of the calls described below.

The call to pipe_use_ifetch tells the simulator 712 that the instruction needs to fetch 3 bytes. The two calls to pipe_use tell the simulator 712 that the two input registers are read in cycle 1. The call to pipe_def tells the simulator 712 that the output register is written in cycle 2. The call to pipe_def_ifetch tells the simulator 712 that this instruction is not a branch, so the next instruction can be fetched in the next cycle.
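A toy scoreboard that gives these calls one concrete meaning. The function names follow the text, but their signatures and timing rules are assumptions: a register written in stage d of an instruction issued in cycle i becomes readable in cycle i + d + 1, a read that comes too early delays the whole instruction's issue, and resource conflicts are not modeled.

```c
#define NREGS 16
static int reg_ready[NREGS];  /* first cycle each register may be read */
static int issue;             /* issue cycle of the current instruction */
static int next_issue;        /* earliest issue cycle for the next one  */
static int fetch_bytes;       /* bytes fetched so far (bookkeeping)     */

static void pipe_use_ifetch(int bytes) { fetch_bytes += bytes; }

/* Reads register r in pipeline stage 'stage'; stalls if not ready. */
static void pipe_use(int r, int stage)
{
    int want = issue + stage;
    if (reg_ready[r] > want)
        issue += reg_ready[r] - want;
}

/* Writes register r in stage 'stage'; readable from the next cycle. */
static void pipe_def(int r, int stage) { reg_ready[r] = issue + stage + 1; }

/* Not a branch: the next instruction can issue in the next cycle. */
static void pipe_def_ifetch(void) { next_issue = issue + 1; }

/* Performance function for "foo arr, ars, art". */
static void perf_foo(int arr, int ars, int art)
{
    issue = next_issue;
    pipe_use_ifetch(3);
    pipe_use(ars, 1);
    pipe_use(art, 1);
    pipe_def(arr, 2);
    pipe_def_ifetch();
}
```

Under these rules, a foo that reads the result of the immediately preceding foo issues one cycle late, while an independent foo issues without a stall.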

Pointers to these functions are placed in the same table as the semantic functions, and the functions themselves are compiled into the same DLL 749 as the semantic functions. When the simulator 712 is invoked, it creates a mapping between instructions and performance functions. When creating the mapping, the simulator 712 opens the DLL 749 and searches for the appropriate performance function. When simulating the performance of a user-defined instruction 736, the simulator 712 calls the function in the DLL 749 directly.

Debugger 730
The debugger 730 interacts with the user-defined enhancements 750 in two ways. First, the debugger 730 must decode machine instructions 740 into assembly instructions 738. This uses the same mechanism that the simulator 712 uses to decode instructions, and the debugger 730 preferably uses the same DLL as the simulator 712 to perform the decoding. In addition to decoding instructions, the debugger must convert each decoded instruction into a string. For this purpose, the decode DLL 748 includes a function that maps each internal operation code representation to the corresponding mnemonic string. This can be done with a simple table.
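A sketch of that final formatting step: a table indexed by internal operation code id supplies the mnemonic, and the register fields are read from the bit positions used earlier in this section (s and t positions inferred from the 0x62230 example). The table contents, the id numbering, and `format_insn` are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mnemonic table indexed by internal operation code id. */
static const char *const mnemonics[] = { "foo1", "foo2", "foo3" };

/* Formats a decoded three-register instruction as assembly text.
   Returns the snprintf result, or -1 for an unknown id. */
static int format_insn(char *buf, size_t len, int op, uint32_t insn)
{
    if (op < 0 || op >= (int)(sizeof mnemonics / sizeof mnemonics[0]))
        return -1;
    return snprintf(buf, len, "%s a%u, a%u, a%u", mnemonics[op],
                    (unsigned)((insn >> 12) & 0xF),   /* r: bits 12..15 */
                    (unsigned)((insn >> 8) & 0xF),    /* s: bits 8..11  */
                    (unsigned)((insn >> 4) & 0xF));   /* t: bits 4..7   */
}
```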

The user can invoke the prebuilt debugger with a flag or environment variable that points to the directory containing the user-defined enhancements 750. The prebuilt debugger dynamically opens the appropriate DLL 748.

In addition, the debugger 730 interacts with user-defined state 752. The debugger 730 must be able to read and modify that state 752. To do so, the debugger 730 communicates with the simulator 712. The debugger 730 asks the simulator 712 how large the state is and what the names of the state variables are. Whenever the debugger 730 is asked to print the value of some user state, it asks the simulator 712 for that value in the same way it does for predefined state. Similarly, to modify user state, the debugger 730 tells the simulator 712 to set the state to a given value.
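A sketch of the kind of query interface the debugger could rely on: it asks how many user state entries exist and what their names and sizes are, then reads or writes a value by name, as it would for predefined state. The state names `acc` and `mode`, their widths, and every function here are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical user-defined state held by the simulator. */
struct user_state {
    const char *name;
    unsigned    bits;      /* size of the state in bits */
    uint64_t    value;
};

static struct user_state states[] = {
    { "acc",  40, 0 },
    { "mode",  2, 0 },
};

static int state_count(void) { return (int)(sizeof states / sizeof states[0]); }

static struct user_state *find_state(const char *name)
{
    for (int i = 0; i < state_count(); i++)
        if (strcmp(states[i].name, name) == 0)
            return &states[i];
    return NULL;
}

/* Debugger "print": stores the value and returns 0, or returns -1. */
static int sim_get_state(const char *name, uint64_t *out)
{
    struct user_state *s = find_state(name);
    if (s == NULL)
        return -1;
    *out = s->value;
    return 0;
}

/* Debugger "set": masks the value to the state's declared width. */
static int sim_set_state(const char *name, uint64_t v)
{
    struct user_state *s = find_state(name);
    if (s == NULL)
        return -1;
    s->value = (s->bits >= 64) ? v : (v & ((UINT64_C(1) << s->bits) - 1));
    return 0;
}
```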

Thus, support for user-defined instruction sets and state according to the present invention can be implemented using modules that define the user functionality and are plugged into the core software development tools. In this way, a system can be built in which the plug-in modules for a particular set of user-defined enhancements are maintained as a group within the system, for ease of organization and operation.

In addition, the core software development tools may be specific to a particular core instruction set and processor state, and a single set of plug-in modules for user-defined enhancements may be evaluated in conjunction with the multiple sets of core software development tools present in the system.

Claims

1. A system for designing an expansion processor, comprising:
a basic processor design that has a basic instruction set architecture including a plurality of basic instructions that together define basic functions and the basic instruction set architecture, the basic processor design being modeled by a basic software development tool that supports a standard programming model;
means for generating a description of a hardware implementation of the expansion processor based on a configuration specification that specifies, in addition to the basic processor design, the basic instruction set architecture, the plurality of basic instructions, and the basic functions, a plurality of additional processor functions corresponding to one or more additional processor instructions that are not present among the plurality of basic instructions; and
means for changing the basic software development tool, based on the configuration specification, to obtain a modified software development tool;
wherein the modified software development tool uses both the one or more additional processor instructions and the plurality of basic instructions.

2. The system of claim 1, wherein the modified software development tool includes a compiler that generates application code based on at least one of the one or more additional processor instructions as well as at least some basic instructions of the plurality of basic instructions.

3. The system of claim 2, wherein the modified software development tool includes a disassembler that is adapted to the configuration specification and disassembles the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more additional processor instructions.

4. The system of claim 3, wherein the modified software development tool includes a debugger that examines the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more additional processor instructions.

5. The system of claim …, wherein the modified software development tool includes an instruction set simulator that simulates the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more additional processor instructions.

6. The system of claim 1, wherein at least one of the plurality of additional processor functions includes a user-defined state, one of the additional processor instructions, and a user-defined function associated with the user-defined state, and wherein the at least one of the plurality of additional processor functions includes at least one of a read from and a write to the user-defined processor state.

7. The system of claim 6, wherein the modified software development tool includes a compiler that generates application code based on at least one of the one or more additional processor instructions as well as at least some basic instructions of the plurality of basic instructions.

8. The system of claim 7, wherein the modified software development tool is adapted to the configuration specification and includes a disassembler that disassembles the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more processor instructions.

9. The system of claim 8, wherein the modified software development tool includes a debugger that examines the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more processor instructions.

10. The system of claim 8, wherein the modified software development tool includes an instruction set simulator that simulates the application code including the at least some basic instructions of the plurality of basic instructions and the at least one of the one or more processor instructions.