JP4152659B2

JP4152659B2 - Data processing system and design system

Info

Publication number: JP4152659B2
Application number: JP2002103219A
Authority: JP
Inventors: 慎太郎下郡; 正之大村; 太豪武田
Original assignee: GAIA SYSTEM SOLUTIONS Inc
Current assignee: GAIA SYSTEM SOLUTIONS Inc
Priority date: 2001-04-06
Filing date: 2002-04-05
Publication date: 2008-09-17
Anticipated expiration: 2022-04-05
Also published as: JP2003216678A

Description

【０００１】
【発明の属する技術分野】
本発明は、プロセッサにおける動作をシミュレーションするシステムに関するものである。
【０００２】
【従来の技術】
ハードウェアを設計する際の設計言語として、１９９０年代にレジスタベースの記述言語ＲＴＬが普及した。ＶｅｒｉｌｏｇやＶＨＤＬが代表的な言語である。これらのＲＴＬは、レジスタをハードウェアのベースとしたレジスタ間の信号伝達や信号処理を、プラスやマイナスや乗算や除算といった算術演算、ＡＮＤやＯＲといった論理演算、「ＩＦＴＨＥＮＥＬＳＥ」といった条件文や代入文により論理的に記述することにより設計することができる。したがって、ＲＴＬにより、その当時としては論理回路の設計の抽象度を向上することができ、プロセッサなどの設計効率を高めることができた。
【０００３】
その後、１９９０年代後半からは動作レベルと称される、一層抽象度を高めた設計言語が使用されるようになってきた。これらの言語はＶｅｒｉｌｏｇやＶＨＤＬといったレジスタベースの言語の発展系とも考えられ、実際、ＶｅｒｉｌｏｇやＶＨＤＬには動作レベルの記述形式が包含されている。
【０００４】
その一方、動作レベルにおいてはレジスタという概念は無く、算術演算、論理演算、条件文、代入文といった記述に集約される。したがって、一般のソフトウェア記述言語もその範疇であり、１９９０年代後半からはＣ言語によるハードウェア設計が模索され始めた。Ｃ言語であれば、より一般的でありソフトウェア資源が多いことに加え、レジスタベースの言語は、たとえそれを動作記述に絞ったとしても、シミュレーション速度が遅い。たとえば、Ｃ言語で記述された仕様と、ＲＴＬで記述された仕様では、シミュレーション速度は、Ｃ言語の方が数１０００倍から１０万倍程度速く、速度の差が非常に大きい。このシミュレーション速度の違いは、ＲＴＬはハードウェアを設計するための言語であり、Ｃ言語はソフトウェアを設計するための言語である所に起因する。
【０００５】
【発明が解決しようとする課題】
近年、ソフトウェアを記述するＣ言語でプロセッサにより実行するアプリケーションの仕様を初期設計し、最終的にはハードウェアを記述するＲＴＬに変換してプロセッサの設計を検証する方式が生まれてきている。さらに、Ｃ言語により与えられた仕様を実現するプロセッサを開発および設計する際に、その仕様を直にハードウェアに変換するのではなく、仕様の一部を専用プロセッサを介してハードウェア化する方式が提案されている。本願の出願人は、たとえば、特開２０００−２０７２０２号にカスタマイズ可能な専用命令を装着できるデータ処理装置を開示している。このようなプロセッサの設計方式では、命令セット（インストラクションセット）ベースの記述、あるいはアセンブラベースの記述とも言える記述がＣ言語とＲＴＬとの間に存在することになる。命令セットによる記述は、単に論理ベースの記述でしかないＣ言語による仕様に比べるとより正確にプロセッサにおける実行状況をシミュレートできる。したがって、命令形式での記述（命令レベル）のシミュレータが近年開発されており、インストラクションセットシミュレータ（ＩＳＳ）と称されている。
【０００６】
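However, a conventional ISS simulates the execution of an instruction sequence, for example assembler instructions, and does not simulate the hardware itself. Consequently, for real-time processing such as reads and writes of input/output (I/O) signals and interrupts, the simulation may correctly show whether those functions are executed, but it is not necessarily correct from a hardware standpoint. At the clock-cycle level, for example, the simulation deviates somewhat from the actual operation of the processor. Since an ISS is by nature a simulation of an instruction sequence, this is acceptable in itself, but it is insufficient for simulating the hardware as the application program under simulation runs on the processor. Thus, although a current ISS can itself be written in C and can simulate instruction-set-based execution at high speed, it is insufficient as a hardware simulation tool.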
【０００７】
An RTL-based simulator is therefore still required for hardware simulation. As noted above, however, RTL simulation is very slow, and hardware-level simulation has become a major bottleneck in shortening the processor design period.
【０００８】
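In particular, behavioral synthesis tools known as "C to RTL" have become available as tools for automating design from C, and with design automation moving from RTL to C, solving the problem of simulation-based verification has become important.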
【０００９】
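A C-to-RTL tool takes as input a specification written in C, together with a clock frequency given as a parameter, and outputs RTL composed of registers. As noted above, the input C description has no concept of a clock or a cycle. The tool therefore allocates registers according to the given clock frequency to obtain a solution that satisfies the specification. In doing so, it must allocate functional units that execute the arithmetic and logical operations in the specification. The decisive factors for such a synthesis tool are resource sharing, that is, how few functional units are needed to realize the behavior specified in C, and scheduling, that is, how few clock cycles are needed to realize the execution order described in the specification. Evaluating or verifying the performance of C-to-RTL tools is therefore also important. In addition, because RTL is synthesized automatically from C, the C specification is converted to RTL as it stands, so bits that are redundant in the C description are carried straight into the synthesized result.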
【００１０】
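Furthermore, in an application program for a processor provided with a dedicated processor, dedicated instructions that perform specific processing using the data path of the dedicated processor are mounted together with general-purpose instructions. A conventional ISS, however, can simulate such instructions functionally but not at the hardware level. That is, the relationship between the number of clocks consumed in executing a data-path instruction and the general-purpose instructions is unknown, and a conventional ISS can only confirm the function of the data-path (dedicated) instructions.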
【００１１】
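An object of the present invention is therefore to provide a new simulator that can perform hardware simulation at high speed. A further object is to provide a simulator that can simulate, at the hardware level and at high speed, the execution of data-path dedicated instructions and of general-purpose instructions that operate a general-purpose processor.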
【００１２】
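[Means for solving the problems]
A register-level RTL simulator has the concept of a clock cycle because it describes hardware, whereas the advantage of a C-language simulator is that behavior can be described without the concept of a clock. In the present invention, by contrast, the instruction set simulated by the ISS is divided on a cycle basis so that the ISS can be used as a tool for simulating hardware. Although this cycle-level ISS is slower than simulation of the C language itself, it can run several hundred to several thousand times faster than an RTL-based simulator while yielding results equivalent to RTL simulation. The efficiency of processor design can therefore be improved greatly.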
【００１３】
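One aspect of the present invention is a data processing system having a simulator that simulates the operation of an application program on a processor. The processor comprises a general-purpose processing unit that executes general-purpose processing in a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The simulator comprises: a general-purpose instruction library that, when an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, converts the general-purpose instruction into models of the respective pipeline stages of the general-purpose processing unit; a dedicated-instruction library that, when an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, converts the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and cycle simulation means for simulating the operation of the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models provided by the general-purpose instruction library and the dedicated-instruction library.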
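Another aspect of the present invention is a method by which a simulator simulates the operation of an application program on a processor. The processor comprises a general-purpose processing unit that executes general-purpose processing in a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The method comprises: a first conversion step of converting, when an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, the general-purpose instruction into models of the respective pipeline stages of the general-purpose processing unit; a second conversion step of converting, when an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and a cycle-level simulation step of simulating the operation of the processor under the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models obtained in the first and second conversion steps.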
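Yet another aspect of the present invention is a program for causing a computer to simulate the operation of an application program on a processor. The processor comprises a general-purpose processing unit that executes general-purpose processing in a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The computer includes a general-purpose instruction library that converts a general-purpose instruction defining processing in the general-purpose processing unit into models of the respective pipeline stages of the general-purpose processing unit, and a dedicated-instruction library that converts a dedicated instruction defining processing in the dedicated processing unit into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit. The program has instructions for causing the computer to execute: a first conversion process of converting, when an instruction included in the application program is a general-purpose instruction, the general-purpose instruction into models of the respective pipeline stages of the general-purpose processing unit by means of the general-purpose instruction library; a second conversion process of converting, when an instruction included in the application program is a dedicated instruction, the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit by means of the general-purpose instruction library; and a cycle-level simulation process of simulating the operation of the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models provided by the general-purpose instruction library and the dedicated-instruction library.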
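A data processing system having means for simulating the operation of an application program in the cycles of the pipeline stages (cycle-level simulation means, described herein as the cycle-level simulator body or the ISS body) is what the embodiments of this specification describe as a simulator or simulation system. Adopting this data processing system makes it possible to provide an end-to-end automated design system for LSIs. The simulator of the present invention can also be provided as a program that simulates the operation on a processor of an application program comprising a plurality of instruction sets whose instruction cycles are executed in a plurality of pipeline stages, the program having instructions for executing a cycle-level simulation process that simulates the operation of the application program in the cycles of the pipeline stages. Such a program can be recorded on a suitable recording medium such as a CD-ROM or supplied over a medium such as a computer network.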
【００１４】
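In a conventional ISS, simulation models are described and managed per instruction set. In the present invention, by contrast, each instruction set is decomposed into pipeline cycles and is modeled and managed per cycle. The per-cycle modeling may be performed in the cycle-level simulation step or means, or a simulation model in which the instruction sets have been decomposed into cycles may be prepared in advance, for example by a compiler, and then executed. In the present invention, the execution of an instruction set and other processing, whether or not caused by it, such as interrupt handling, can be modeled and managed correctly on a per-cycle basis. Hardware can therefore be simulated correctly on a cycle basis by a simulator written in C or another high-level language rather than in RTL, and hence at high speed.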
【００１５】
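In the present invention, the instruction cycle of an instruction set, namely fetch, decode, execute, and write-back, is simulated as separate processing for each corresponding pipeline stage (for each pipeline-stage cycle). The processing of each instruction in each pipeline stage may be modeled individually. However, modeling so that what is common to the instructions or to the pipeline-stage processing is managed in common saves hardware resources and allows the simulator program to be developed in a short time and at low cost.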
【００１６】
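A per-cycle model can be divided broadly into two parts. One is the description specific to the instruction set, for example the part describing the function (mechanism and/or timing) of the execution (EXECUTION) cycle, which is one pipeline stage of an AND instruction. The other is the description common to the execution of every pipeline stage of every instruction set, for example the handling of an interrupt signal. The present invention therefore provides a library that, before per-cycle simulation, converts each pipeline stage of an instruction set into first information that can be managed by the means for cycle-level simulation (the ISS body) or by its processing. Thus, given an instruction set or a pipeline stage of that instruction set, the processing in the ISS body, or the model managed by the ISS body, is supplied by the library. In addition, if the processing or model common to the pipeline stages is described in the ISS body or its processing, and simulation is performed using the first information on the basis of second information relating to that common processing, the first information can be limited to information corresponding to the processing specific to each pipeline stage.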
【００１７】
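In the present invention, therefore, the instruction sets are decomposed into cycles for simulation, and the ISS body, which is the cycle-level simulation means, is given a function for managing time in units of cycles. The instruction-specific part of each cycle is described in the library as a per-cycle processing description or model of the instruction set. The processing description or model common to each cycle for external signals such as interrupts is described in the ISS body. A system or environment for simulation can therefore be provided with simple hardware. When an application program is supplied, each instruction set is divided into its pipeline stages and simulated cycle by cycle, as if the program were executed on the processor under simulation. Moreover, a program having instructions capable of converting an application program comprising a plurality of instruction sets, whose instruction cycles are executed in a plurality of pipeline stages, into a simulation model in which the instruction sets are expressed as descriptions divided into pipeline stages may be used to convert the application program into a simulation model suited to the ISS body of the present invention, further shortening the simulation time.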
【００１８】
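In this specification, an application program means a program at any level, such as a source program, an object program (object code), or a program at an intermediate stage, that contains instruction sets directing operations or processing of a processor, and includes every program that is subject to simulation.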
【００１９】
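When the processor under simulation has a dedicated data path, cycle management differs between processing that uses the data path and the pipeline-stage processing of the general-purpose processor. Accordingly, when the processor has a general-purpose processing unit capable of general-purpose processing and a dedicated processing unit specialized for specific data processing, it is desirable to provide two libraries. One is a general-purpose instruction library: among the instruction sets in the application program, the general-purpose instruction sets that define processing in the general-purpose processing unit are instruction sets whose instruction cycles are executed in a plurality of pipeline stages, and this library converts each pipeline stage of a general-purpose instruction set into information manageable by the simulating means. The other is a dedicated-instruction library that converts the dedicated instruction sets, which define processing in the dedicated processing unit, into information manageable by the simulating means. By converting general-purpose and dedicated instruction sets through different conversion processes (a first and a second conversion process) using different libraries and supplying the results to the means that performs cycle-level simulation (the ISS body), the operation of a processor having a general-purpose processing unit and a dedicated processing unit can be modeled on the basis of pipeline processing. The library programs that perform these conversions may be provided as a single program, but providing them as separate library programs allows an arrangement in which the simulator vendor supplies the library for the general-purpose instruction sets and the user who designed the dedicated processing unit supplies the library for the dedicated instruction sets.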
【００２０】
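The dedicated-instruction library can also model the processing of the dedicated processing unit appropriately. By returning information on the cycles consumed by a dedicated instruction set from the dedicated-instruction library to the ISS body (the simulating means) or its processing, as some signal or information, the execution state of the dedicated processing unit can be reflected in the simulation in the ISS body. The state of the dedicated processing unit can likewise be reflected in the simulation by having the dedicated-instruction library return that state to the ISS body or its processing cycle by cycle.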
【００２１】
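In this way, the present invention provides a tool that can simulate hardware at high speed on a per-cycle basis. A system LSI of the distributed-processing type, which implements a specification written in C using a plurality of dedicated circuits capable of high-speed processing, can therefore be designed and supplied in a still shorter time and more economically. In processor design and development, simulation can be performed with the concept of a cycle or clock at an intermediate stage in the transition from C to RTL. Introducing this new design infrastructure layer not only speeds up simulation and makes design more efficient but also brings various other benefits.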
【００２２】
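One benefit, as described above, is that in a processor driven by general-purpose and dedicated instruction sets, the operation of the dedicated processing unit (special-purpose processing unit or dedicated circuit unit) driven by the dedicated instruction sets can be simulated in a form close to the C-language level rather than in RTL. In addition, because it suffices that the description be converted into information manageable by the cycle-based ISS body, it need not be in C itself; if a language better suited to describing the operation of the dedicated processing unit exists, the C specification can be converted into that language and then compiled to create the dedicated-instruction library.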
【００２３】
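For example, processor specifications are currently written in ANSI C, the most popular choice, but the data lengths of the variables that can be described are limited to 16, 32, and 64 bits. In specifications implemented as user instructions, by contrast, these data lengths are seldom exactly what is needed, and a requirement such as "24 bits is enough" is quite common. ANSI C cannot reproduce such a specification faithfully, so the result may differ from an RTL simulation that incorporates it, and RTL simulation ultimately has to be repeated. In another language such as C++, however, data lengths become variable when variable types are declared using a class library. Therefore, if the C specification is converted to C++ and compiled to create the dedicated-instruction library, redundant bits can be removed, and simulation at the same level as RTL can be run at high speed on the cycle-based ISS body and a simulation system equipped with it.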
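The effect of a redundant bit width can be illustrated with a small sketch. It is written in Python rather than C++ purely for brevity, and the `UInt` class is a hypothetical stand-in for the kind of variable-width type a C++ class library would provide; it is not taken from the specification.

```python
class UInt:
    """Unsigned integer that wraps at a caller-chosen bit width (hypothetical helper)."""

    def __init__(self, width, value=0):
        self.width = width
        self.value = value & ((1 << width) - 1)  # keep only the low `width` bits

    def __add__(self, other):
        rhs = other.value if isinstance(other, UInt) else other
        return UInt(self.width, self.value + rhs)

    def __repr__(self):
        return f"UInt({self.width}, {self.value:#x})"


acc24 = UInt(24, 0xFFFFFF) + 1  # wraps to zero, as a 24-bit register would
acc32 = UInt(32, 0xFFFFFF) + 1  # carries into bit 24, as a 32-bit C variable would
print(acc24, acc32)
```

The same addition gives different results at 24 and 32 bits, which is the kind of mismatch between a C model and the RTL that the paragraph above describes.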
【００２４】
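Another benefit is that pseudo instruction sets, which are not executed by the processor, can be added to the application program for evaluating the simulation or for other purposes, and can be made subject to the cycle-based ISS simulation. A pseudo instruction set that is not executed by the processor instructs, for example, the ISS body to perform processing without counting cycles. If a pseudo-instruction library that converts pseudo instruction sets into information manageable by the simulator is prepared at the same level as the dedicated-instruction library described above, it can be incorporated into the cycle-based ISS body or a simulation system using it. An example of a pseudo instruction set for evaluating operation (testing or debugging) is an instruction that causes processing for monitoring the input and output of data files to be executed.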
【００２５】
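In simulation using this cycle-based simulation means, the work of simulating a processor with a dedicated processing unit can be divided into several phases. In the first simulation phase, the specification given in C or the like is simulated on an ISS, preferably the cycle-based ISS body, without removing the portion to be executed by dedicated instructions. If pseudo instruction sets are introduced at this phase, input data and expected output values can be obtained that serve as the reference for later evaluation. In the second simulation phase, the portion to be executed by the dedicated processing unit is extracted from the specification and replaced with dedicated instruction sets, a dedicated-instruction library is created from the extracted portion, and the result is simulated on the cycle-based ISS body of the present invention. Because the cycle-based ISS body can simulate on a cycle basis including the dedicated processing unit, most of the processor development can be carried out at this phase. Moreover, by using the pseudo instruction sets to check the data obtained at this phase against the input data and expected output values obtained earlier, the processor can be developed smoothly and quickly without major design errors.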
【００２６】
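Furthermore, a third simulation phase may be provided in which the dedicated-instruction library is converted to RTL and then simulated with the cycle-based ISS body. The function after actual conversion to RTL can be confirmed, and because the general-purpose processing unit side also runs on the cycle-based ISS body, high-accuracy simulation matched against the RTL is possible.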
【００２７】
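[Embodiments of the invention]
The present invention is further described below with reference to the drawings. FIG. 1 shows an outline of a data processing apparatus 10 comprising a dedicated processing unit (dedicated-instruction execution unit or special-purpose processing unit, hereinafter VU) 1 having a data path unit 20 specialized for specific data processing, and a general-purpose processing unit (general-purpose instruction execution unit, general-purpose processing unit, or processor unit, hereinafter PU) 2 of general-purpose configuration. The data processing apparatus 10 is a programmable processor with a dedicated circuit, and has a fetch unit 5 that fetches instructions from a code RAM 4 holding an executable control program (object program, object code, microprogram code) 4a and supplies decoded control signals to the dedicated data processing unit 1 and the general-purpose data processing unit 2. The fetch unit 5 comprises a fetch section 7, which fetches an instruction from an address of the code RAM 4 determined by the preceding instruction, the state of a state register 6, an interrupt signal φi, and the like, and a decode section 8, which decodes the fetched dedicated or general-purpose (ordinary) instruction. The decode section 8 supplies a decoded control signal φv, obtained by decoding a dedicated instruction, to the dedicated data processing unit VU1, and a decoded control signal φp, obtained by decoding a general-purpose instruction, to the general-purpose data processing unit PU2. In addition, a status signal φs indicating the execution state is returned from PU2, and the states of PU2 and VU1 are reflected in the state register 6.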
【００２８】
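The PU2 of this example has a highly versatile execution unit 11 composed of general-purpose registers, flag registers, an arithmetic logic unit (ALU), and the like, and is configured to execute general-purpose processing in succession while outputting results to a data RAM 15 or the like. That is, in the PU2 of this example, one general-purpose instruction set is executed in a plurality of pipeline stages, such as a fetch-and-decode stage, an execute stage, and a stage that writes the result to memory. The configuration comprising the fetch unit FU5, the general-purpose data processing unit PU2, the code RAM 4, and the data RAM 15 resembles an ordinary processor unit, although the individual functions differ. This configuration can therefore be called a processor unit 3, and the data processing apparatus 10 of this example can be configured or designed on the concept that the processor unit (PUX) 3 controls VU1.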
【００２９】
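The dedicated data processing unit VU1, which executes the dedicated instruction φv from FU5, comprises a unit 22 that decodes, among other things, whether the instruction supplied by FU5 is a VU instruction φv; a sequencer (FSM, finite state machine) 21 that outputs control signals in hardware so that predetermined data processing is performed; and the data path unit 20, designed to perform specific data processing in accordance with the control signals from the sequencer 21. VU1 also has a register 23 accessible from PU2, so that PU2 can control the data required for processing in the data path unit 20 through the interface register 23 and can refer to the internal state of VU1 through the register 23. The result processed by the data path unit 20 is supplied to PU2, which performs processing using that result.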
【００３０】
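In the data processing apparatus 10, a program containing general-purpose instructions (PU instructions) and dedicated instructions (VU instructions) is stored in the code RAM 4; it is fetched by the fetch unit 5 and supplied to VU1 and PU2 as decoded control signals φp or φv. VU1 operates when, of the control signals φp and φv, the control signal φv of a dedicated instruction that activates it is supplied. PU2, on the other hand, is supplied only with the control signal φp obtained by decoding a general-purpose instruction; the control signal φv obtained by decoding a VU instruction is not issued to PU2. Instead, a control signal representing a nop instruction, which involves no execution, is issued, and processing in PU2 is skipped. VU1 changes with the application, and the dedicated instructions that direct VU1 also often change with the application. VU1 is a data path or dedicated circuit specialized for the application, and it is easy to design it to interpret the control signal decoded from a VU instruction. Because a nop instruction is output to it, PU2 need not handle instructions specialized for VU1; it only needs to be able to interpret and execute basic or general-purpose instructions. PU2 can therefore coexist with VU1s tailored to various applications without sacrificing generality, controlling them and using their results in its own processing.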
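The routing of decoded control signals described above can be sketched as follows. This is a minimal illustration only: the opcode names, the `ControlSignals` type, and the `"none"` marker for an inactive VU are assumptions made for the example, not part of the specification.

```python
from typing import NamedTuple


class ControlSignals(NamedTuple):
    to_pu: str  # signal issued to PU2: the decoded PU instruction, or "nop"
    to_vu: str  # signal that activates VU1, or "none" when VU1 stays idle


def decode(opcode: str, vu_opcodes: frozenset) -> ControlSignals:
    """Route one fetched instruction in the manner described for decode section 8."""
    if opcode in vu_opcodes:
        # VU instruction: VU1 is activated; PU2 only sees a nop and skips the slot.
        return ControlSignals(to_pu="nop", to_vu=opcode)
    # PU instruction: PU2 executes it; VU1 is not activated.
    return ControlSignals(to_pu=opcode, to_vu="none")


VU_OPCODES = frozenset({"vfind"})  # hypothetical dedicated instruction
signals = [decode(op, VU_OPCODES) for op in ("add", "vfind", "sub")]
print(signals[1])
```

Because PU2 only ever receives PU opcodes or `nop`, swapping in a different set of VU opcodes requires no change on the PU side, which is the point made about preserving generality.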
【００３１】
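The data processing apparatus 10 shown in FIG. 1 thus has VU1, with a dedicated circuit capable of processing that requires special operations such as real-time response, and the general-purpose PU2, and is hereinafter referred to as a VUPU. The VUPU 10 shortens the design and development period without sacrificing real-time responsiveness and can also cope flexibly with later changes and corrections. VU1 is not limited to one; a plurality of VU1s may be provided to handle the dedicated processing required by the application, and a plurality of dedicated instructions that operate the respective VU1s can be included in the program code. Moreover, the VU1 of this example can implement not only special arithmetic processing but also a particular program function within the program as a dedicated circuit, so that the program runs efficiently. A data processing system with a plurality of VUPUs 10 is therefore an architecture with a very wide range of application.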
【００３２】
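FIG. 2 shows the flow of developing a processor of this architecture. To execute a specification or program 31 written in C, a particular process or program function Cs in the program is implemented as a dedicated circuit so that the program runs efficiently. That is, the specification 31 is divided into an application program 32 in C and a function 33 to be implemented as a dedicated circuit, and the application program 32 contains a portion Cg 34 made up of instructions for general-purpose processing (PU instructions) and instructions that activate the dedicated circuit (VU instructions). The application program 32 is converted by a C compiler 35 into assembler instruction sets executable by the processor, producing executable program code 4a. Meanwhile, the operations required by the extracted program function 33 are analyzed (behavioral synthesis 36), and a dedicated data path is designed or developed. In this way, a VUPU 10 comprising the basic processor PUX3 and the dedicated circuit VU1 is produced, together with the program code 4a executed by the VUPU 10.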
【００３３】
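Accordingly, to simulate the operation of the executable program code 4a (which hereinafter is also included in the term application program) on the developed VUPU 10, RTL modeling the functions of PUX3 and RTL modeling the functions of VU1 can be generated, and the program 4a can be simulated using these as the platform. In this specification, an application program means a program at any stage that contains instruction sets for operating a processor, including programs before compilation, after compilation, and at intermediate points in that process. That is, an application program means any program that is subject to simulation.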
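However, as described above, RTL-based simulation takes a very long time and is not practical. On the other hand, simulation at the assembler level, that is, the instruction-set level, can confirm the function of the program 4a but cannot simulate how the state actually changes in the VUPU 10 at each cycle or clock. This is the first problem.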
【００３４】
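The second problem is that a data-path instruction (VU instruction) for executing processing implemented as a dedicated circuit in VU1 is, for example, an instruction to "detect a particular bit pattern" in a signal stream in image processing or network processing. For such an instruction, the number of cycles used or consumed by the dedicated circuit depends on the data and cannot be known in advance. Simulation is therefore difficult even with a simulator that uses RTL as its platform.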
【００３５】
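To explain the first problem further: as shown in FIG. 3, in the basic processor PUX3 the instruction cycle of one instruction set, such as ADD, is divided into a plurality of pipeline stages. FIG. 3(a) shows the instruction cycle of a three-stage pipelined RISC, in which one instruction set is divided into a fetch-and-decode cycle (F&D cycle or F&D stage) 51, an execute cycle (execution cycle or execution stage) 52, and a cycle that writes back to memory or the like (WB cycle or WB stage) 53. As shown in FIG. 3(b), in a four-stage pipelined RISC the fetch-and-decode cycle 51 is further divided into a fetch cycle 51a and a decode cycle 51b. For simplicity, the following describes an example in which a three-stage pipelined RISC is adopted as the basic processor PUX3. In this case, in PUX3 the F&D cycle 51 of the n-th instruction I(n) proceeds together with the execute cycle 52 of the (n-1)-th instruction I(n-1) and the WB cycle 53 of the (n-2)-th instruction I(n-2).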
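The overlap of I(n), I(n-1), and I(n-2) in one cycle can be written down directly. The sketch below assumes that instruction I(n) enters the F&D stage in cycle n and that the pipeline never stalls; the function name `occupancy` is chosen for this example only.

```python
STAGES = ("F&D", "EX", "WB")  # three-stage pipeline of FIG. 3(a)


def occupancy(cycle, n_instructions):
    """Map each stage to the 1-based instruction number it holds in `cycle`.

    Assumes instruction I(n) enters F&D in cycle n and never stalls.
    """
    held = {}
    for depth, stage in enumerate(STAGES):
        n = cycle - depth  # F&D holds I(cycle), EX holds I(cycle-1), WB holds I(cycle-2)
        held[stage] = n if 1 <= n <= n_instructions else None
    return held


for cycle in range(1, 6):
    print(cycle, occupancy(cycle, 3))
```

A stage that maps to `None` is empty in that cycle, which happens while the pipeline fills and drains.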
【００３６】
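FIG. 4(a) shows the instruction model of a conventional instruction set simulator (ISS), in which the instruction sets from the first, I(1), through I(6) are processed one after another. This poses no problem for simulating that the instruction sets execute according to their function. However, when one tries to evaluate the processing that occurs when an I/O 59 from outside or from VU1 arrives at PUX3, the state immediately becomes uncertain. As shown in FIG. 4(b), each instruction set is divided into three pipeline stages and processed cycle by cycle (clock by clock) with partial overlap. When an I/O 59 described in clock units arrives, the model of FIG. 4(a), which assumes that the I/O 59 occurs per instruction, can describe only six timings, whereas in the real processor, as the model of FIG. 4(b) shows, eight timings are possible. The subsequent processing may differ for each of those timings.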
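The six-versus-eight count follows from simple arithmetic on the pipeline depth, sketched here under the assumption of a stall-free pipeline that accepts one instruction per cycle. The function names are illustrative only.

```python
def instruction_level_timings(n_instructions):
    # One timing slot per instruction, as in the model of FIG. 4(a).
    return n_instructions


def cycle_level_timings(n_instructions, n_stages):
    # The last instruction enters the pipeline in cycle n_instructions and
    # needs n_stages - 1 further cycles to drain.
    return n_instructions + n_stages - 1


print(instruction_level_timings(6), cycle_level_timings(6, 3))  # prints "6 8"
```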
【００３７】
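For example, FIG. 5(a) shows the operation when an input signal 59 from I/O arrives in the second cycle after processing starts. In FIG. 5(a), the I/O signal 59 is input during the execute cycle 52 of the first instruction I(1), is processed by the second instruction I(2), and the result is output as an I/O signal 59 in the WB cycle 53 of the third instruction I(3). FIG. 5(b), by contrast, is a per-instruction-set model in which the start of each instruction is aligned with the F&D cycle 51. In this case (case 1), the second instruction I(2) cannot capture the I/O signal, so the processing described above is not realized. FIG. 5(c) is the case in which the start of each instruction is aligned with the execute cycle 52 (case 2): the I/O signal 59 is received while the first instruction is executing, and an I/O signal 59 can be output via the second and third instructions, but the timing of the signal output by the third instruction differs. FIG. 5(d) is the case in which the start of each instruction is aligned with the WB cycle 53 (case 3), in which the second instruction I(2) cannot detect the input of the I/O signal 59 at all.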
【００３８】
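Thus, because I/O signals 59 and interrupts are all raised synchronously with the clock, that is, on the basis of clock cycles and hence pipeline cycles, instructions must be decomposed and modeled on that cycle basis for these signals to be handled correctly. In the present invention, therefore, as shown in FIG. 6, a simulator 60 having an ISS body 61 capable of per-cycle management is provided, enabling simulation based on an instruction model decomposed on the basis of pipeline cycles. As shown in FIG. 7, a conventional ISS models the function of each instruction in C on a per-instruction basis. In the simulator 60 of the present invention, by contrast, an instruction set I is divided per cycle into an F&D part 51, an execution part 52, and a WB part 53, and each pipeline stage is modeled in C. The ISS body 61 then simulates while evaluating each cycle of each instruction set I. In a simulation of a three-stage pipelined RISC, three instruction sets are therefore always evaluated in one cycle. This makes the simulation slower than a conventional ISS, but what is simulated is not merely assembler code on a C-language simulator; it is how the application program 4a written in that assembler code actually behaves on the hardware.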
【００３９】
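When the ISS body 61 manages the instruction model per cycle, as shown in FIG. 8, a C-language model specific to each stage of each instruction set (first information) 57 and a C-language model common to the stages of the instruction sets (second information) 58 may be described for each pipeline stage of each instruction set. For example, describing a model of I/O signals as the common model for every cycle allows behavior caused by I/O signals to be simulated correctly with respect to cycle boundaries. With this method, however, the amount of description of the common second information increases. In the simulator 60 of this example, therefore, the ISS body 61 is provided with a mechanism 61a that manages time in units of cycles for per-cycle simulation, a part 61b describing the second information common to the cycles or pipeline stages, such as the handling of I/O signals, and a mechanism 61c that responds on the basis of the second information.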
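One way to picture the split between stage-specific models (first information) and the common model held in the ISS body (second information) is the following sketch. The class layout, the `io_schedule` dictionary, and the trace format are all assumptions made for illustration; the specification does not prescribe them.

```python
class IssBody:
    """Sketch of ISS body 61: cycle clock (61a), common model (61b), responder (61c)."""

    def __init__(self, io_schedule):
        self.cycle = 0                  # 61a: time is managed in cycles
        self.io_schedule = io_schedule  # {cycle: signal}, stands in for external I/O
        self.pending_io = None
        self.trace = []

    def common_model(self):
        # 61b/61c: sample I/O once per cycle, independent of which stages ran.
        signal = self.io_schedule.get(self.cycle)
        if signal is not None:
            self.pending_io = signal
            self.trace.append((self.cycle, "io", signal))

    def step(self, stage_models):
        self.cycle += 1
        for model in stage_models:      # first information, supplied by a library
            model(self)
        self.common_model()             # second information, written once here


def execute_and(iss):
    # Stage-specific model: only what the EX stage of an AND instruction does.
    iss.trace.append((iss.cycle, "EX", "and"))


iss = IssBody(io_schedule={2: "irq0"})
iss.step([])             # cycle 1: nothing in EX yet
iss.step([execute_and])  # cycle 2: EX of AND, and irq0 arrives in the same cycle
print(iss.trace)
```

Because the I/O handling lives in `common_model`, the stage model `execute_and` contains nothing about interrupts or I/O, which is the reduction in description that the paragraph above aims for.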
【００４０】
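Furthermore, once an instruction set I is determined, the specific model (first information) for each of its pipeline stages is also determined. The simulator 60 of this example therefore has a PU instruction library 62 that provides a model of each pipeline stage for each instruction set, converting each pipeline stage of each instruction into a model manageable by the ISS body 61, that is, into first information. The application program supplied to the simulator 60 can thus remain in instruction-set form while the simulation is performed divided into cycles. Even when the model under simulation is described divided into cycles, the PU instruction library 62 reduces the amount that the simulation model itself must describe.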
【００４１】
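With the simulator 60 of this example shown in FIG. 6, the ISS body 61 can build a per-cycle simulator model from the application program 4a written in assembler code, referring to the PU instruction library 62, and can simulate while managing it cycle by cycle. To reduce the load on the ISS body 61 and increase execution speed, it is desirable to use a compiler or compiler program 65 that can convert the assembler code, that is, the instruction sets, into a simulation model 66 expressed as descriptions divided into pipeline stages, and to prepare in advance the simulation model 66 in which the instruction sets are described per cycle.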
【００４２】
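The simulator 60 of this example further has a VU instruction library 63 that provides the processing in VU1 activated by a VU instruction as information that the ISS body 61 can manage per cycle. Defining the number of cycles consumed by a VU instruction in the VU instruction library 63 therefore readily solves the other problem mentioned above, namely that when a VU instruction such as "detect a particular bit pattern" in a signal stream in image processing or network processing is executed, the number of cycles it uses or consumes cannot be known in advance. In the simulator 60 of this example, PU instructions can thus be simulated per cycle by managing, cycle by cycle, the per-pipeline-stage models supplied from the PU instruction library 62, and for VU instructions the number of cycles consumed in VU1 can be simulated by the VU instruction library 63.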
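A VU library entry for a data-dependent instruction might therefore return both the functional result and the cycles consumed. The sketch below is hypothetical throughout: the function `vu_find_pattern`, the word-by-word scan, and the cost `CYCLES_PER_WORD` are assumptions chosen only to show a cycle count that depends on the input data.

```python
CYCLES_PER_WORD = 2  # hypothetical cost of examining one word in VU1


def vu_find_pattern(words, pattern):
    """Return (index of first match or -1, cycles consumed in VU1)."""
    for index, word in enumerate(words):
        if word == pattern:
            return index, (index + 1) * CYCLES_PER_WORD
    return -1, len(words) * CYCLES_PER_WORD


print(vu_find_pattern([0x00, 0x7E, 0x10], 0x7E))  # match at index 1
print(vu_find_pattern([0x00, 0x10, 0x7E], 0x7E))  # match at index 2
```

The same instruction consumes a different number of cycles for the two inputs, and the library reports that number to its caller instead of requiring it to be known in advance.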
【００４３】
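The simulator 60 of this example therefore makes it possible to simulate fully, with C-language models and in units of cycles or clocks, I/O and interrupt processing between VU1 and PU2 inside the VUPU 10 as well as I/O and interrupt processing with the outside. A simulator that can simulate hardware operation at high speed can thus be provided.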
【００４４】
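Furthermore, the simulator 60 of this example provides the PU instruction library 62 and the VU instruction library 63 separately. The PU instruction library 62 is a library for describing, per cycle, the processing in the basic processor PUX3, which is the embedded processor when the VUPU 10 is implemented, and is essentially fixed.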
【００４５】
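The VU instruction library 63, by contrast, reflects the processing in VU1, which varies with the user's specification to be realized by the VUPU 10, and is likely to differ each time. Separating the VU instruction library 63 so that it can be designed per user, or by the users themselves, therefore allows the user's specification to be incorporated with almost no effect on the simulator 60. With the configuration of this example, a simulator 60 capable of simulating the VUPU 10 can thus be developed quickly and economically.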
【００４６】
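The VU instruction library 63 of this example is obtained by converting the specification or program function Cs written in C into C++, or by compiling the result. Declaring variable types therefore makes the data length variable, and bit lengths that were redundant when written in C can be removed. The VU instruction library 63 thus becomes functionally interchangeable with the RTL that implements VU1. The VUPU 10 can therefore be simulated more accurately, under conditions close to the actual device, and faster than with an RTL simulator. As with the PU instruction library 62, the VU instruction library 63 may be a library written in C or C++ if the ISS body 61 incorporates a C or C++ compiler; if the ISS body 61 does not incorporate a compiler, the library must be compiled in advance.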
【００４７】
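The simulator 60 further has a library 64 for pseudo instructions, which are not executed on the actual processor. These pseudo instructions are used mainly for evaluation and are handled in the same way as VU instructions, so they are called pseudo VU instructions. A pseudo VU instruction defines processing such as inputting data to the ISS body 61 or having it output expected values in order to evaluate the course or result of the simulation, and during this processing the simulation of the VUPU is held. That is, the ISS body 61 stops counting cycles and executes the pseudo VU instruction in zero cycles. Because the pseudo-instruction library 64 can be regarded, from the standpoint of PU instruction simulation, as zero-cycle dedicated instructions, it is handled inside the simulator 60 in the same way as the VU instruction library 63. The simulator 60 can therefore be said to be equipped with a pseudo VU function for debugging in addition to the VU. Since a pseudo VU instruction is also treated as a pseudo VU instruction on the actual processor 10, when this pseudo instruction (E instruction) is fetched, only a NOP instruction is output to PU2. VU1 performs no processing because the instruction is not a VU instruction addressed to it. The instruction therefore need not be removed after the simulation is finished.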
【００４８】
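FIG. 9 outlines the design method when pseudo VU instructions are introduced. Here too, the design can be considered in three layers: a third layer (layer 3), the overall description 31 in C; a second layer (layer 2), the simulator 60 based on the cycle-based ISS; and a first layer (layer 1), the VUPU 10. The figure shows the effect of being able to introduce simulation by the cycle-based ISS 61 (layer 2) between the C language (layer 3) and the VUPU 10 in RTL (layer 1). In this example, the overall description 31 in C (ANSI C is assumed) also includes test descriptions, that is, descriptions such as reading data files and comparing output values with expected values. If visual debugging is useful, functions such as graphic output routines are also included in the test description that defines this test or debug function.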
【００４９】
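The test description Ce is a portion that is not mapped onto the VUPU 10, which is the form implemented in silicon, and is mounted as a pseudo VU executed only on the ISS. Like the range covered by VU instructions, the range covered by pseudo VU instructions is extracted from the C description 31, compiled by a C compiler 100, and incorporated into the simulator 60, on which the ISS body 61 runs, as the pseudo VU instruction library 64 or as an object. The pseudo VU instruction library 64 is therefore the output of the C compiler 100 of the computing environment on which the ISS 61 runs. The interface IFe with the range Cg covered by PU instructions is the pseudo VU instruction.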
【００５０】
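As described with reference to FIG. 2 and elsewhere, the range Cg described by PU instructions is the C-language description of the portion that runs on the basic processor PU2; it is compiled by a C compiler 101 for the PU into the simulation-level program 4a executed on the ISS 61. The binary form of this program 4a is what is used when the basic processor PU is provided in RTL form; that is, it becomes the object program of the VUPU 10.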
【００５１】
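After the description range Cs covered by VU instructions is extracted from the overall description 31, if it contains bit redundancy, the C description is converted into a C++ description and the redundancy is removed by variable type declarations (using a class library), yielding a library 102 with optimal bit widths. The library is then compiled by a C++ compiler 103 to generate the VU instruction library 63 incorporated into the simulator 60. The interface IFs with the range Cg covered by PU instructions is the VU instruction, and this is the same in the VUPU 10.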
【００５２】
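The C++ library 102 from which bit redundancy has been removed is also converted into C conforming to the input style of a C-to-RTL behavioral synthesis tool 104 and becomes the input to behavioral synthesis. The C-to-RTL tool 104 normally outputs the converted RTL and also the number of cycles of that RTL, so the VU instruction library 63 takes in the reported number of cycles as well and supplies the ISS body 61 with information that adds the number of cycles to the function given by the redundancy-free C++ description, enabling cycle-based simulation that includes VU1.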
【００５３】
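As the design proceeds in this way, the RTL of VU1 generated at the end can also be linked with the ISS 61 and simulated. This is effective as the verification phase at the final stage of design.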
【００５４】
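FIG. 10 shows the process of decomposing the C description 31, including pseudo VU instructions, and running a cycle-based ISS simulation. In this figure, parts of the overall description 31 are replaced with two pseudo VU instructions and one VU instruction. The test descriptions 37 replaced with pseudo VU instructions are, for example, first a description that opens an input file and reads test data, and second a description that compares the output results with expected values. The portion 33 replaced with a VU instruction is, for example, a C description of signal processing that is to be accelerated by dedicated hardware. The remaining portion 34, executed as PU instructions, is a description subject to ordinary software processing, and is generated as a program 32 containing PU instructions, VU instructions, and pseudo VU instructions. The program is compiled by the PU compiler 101; because both the VU instructions and the pseudo VU instructions are invoked as assembler calls, the compiled result is a mixed list 4a of PU instructions and VU instructions (including pseudo ones), which is read sequentially into the ISS body 61 and executed.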
【００５５】
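The test descriptions 37 are preferably provided as pseudo VUs executed only on the ISS 61, because they are needed for simulation and debugging but are normally not targets for hardware implementation. It is therefore desirable to set their cycle count to zero so that only the test function is executed, in zero cycles. The test descriptions 37 are compiled by a compiler 100 determined by the computing environment in which the ISS 61 is developed and run. For example, if the ISS 61 runs on a certain OS, the test descriptions are compiled by a C compiler that runs on that OS (for a Sun OS, for example, the freeware gcc) and mounted on the ISS 61 as the pseudo VU library 64. The mounted pseudo VU library 64 is called from, and returns to, the PU side through pseudo VU instructions, which are processed in the same way as VU instructions.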
【００５６】
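The C description 33 of a VU instruction to be implemented in hardware as a dedicated instruction is defined by the user when developing a VUPU processor and is mounted in the form of the library 63 by the method described above. Once the pseudo VU instruction library 64 and the VU instruction library 63 are prepared in this way, the ISS body 61 reads the instruction sets of the assembled code 4a in order; for a PU instruction it reads the PU library 62, which is prepared in advance as a cycle model, and for a VU instruction it reads the VU library 63, and executes. Likewise, for a pseudo VU instruction it reads the pseudo VU library 64 and executes.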
【００５７】
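FIG. 11 is a flowchart of the main processing in a program that implements the simulator 60. When the program implementing the ISS body 61 starts, step 71 obtains the n-th instruction set of the application program 4a under simulation. If step 79 finds that the instruction set I is a pseudo VU instruction, step 79a starts the pseudo VU. The pseudo VU is a VU that operates only in the ISS body 61, and, as described above, debug input/output or other processing is performed by referring to the pseudo VU instruction library 64.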
【００５８】
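Since cycles are not counted during pseudo VU processing, when the pseudo VU finishes, no processing to simulate the rest of the VUPU state is performed, and the flow returns to step 71 to fetch the next instruction. The states of VU1 and PU2 can therefore be evaluated without being affected. Because pseudo VU instructions can be inserted into the program code 4a at the same level as VU and PU instructions, the states to be evaluated in the simulation can be specified at the program level. The simulation can therefore be evaluated efficiently and easily.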
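The zero-cycle behavior of pseudo VU instructions can be sketched as below. As a deliberate simplification, every non-pseudo instruction is charged exactly one cycle and the pipeline is not modeled; the opcode `e_check` and the handler signature are hypothetical names for this example.

```python
def run(program, pseudo_library):
    """Execute `program`; return the number of simulated cycles.

    Simplification: every non-pseudo instruction is charged one cycle.
    """
    cycles = 0
    for opcode, operand in program:
        handler = pseudo_library.get(opcode)
        if handler is not None:
            handler(operand, cycles)  # runs on the host; the cycle count is held
            continue                  # go straight to the next instruction
        cycles += 1
    return cycles


observed = []
pseudo_library = {"e_check": lambda label, cycles: observed.append((label, cycles))}
program = [("li", 1), ("e_check", "after li"), ("add", 2), ("e_check", "end")]
total = run(program, pseudo_library)
print(total, observed)
```

The two `e_check` instructions observe the simulation at cycle 1 and cycle 2 but do not change the total, so the same program yields the same cycle count with or without them.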
【００５９】
If step 72 finds that the instruction set I is a VU instruction, step 73 starts VU1. As shown in FIG. 11, if the program providing the function of the VU instruction library 63 merely counts cycles in VU1, step 73 clears the count C and starts counting.
【００６０】
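The program of the VU instruction library 63 comprises step 91, which initializes the count C; step 92, which increments the count C at each cycle; and step 93, which determines whether the count C has reached C0, at which processing in VU1 completes. It further comprises step 94, which returns information to the ISS body 61 indicating that VU1 is running if counting is in progress, and step 95, which returns information indicating that VU1 is stopped if counting has finished. In the program of the ISS body 61, step 74, following step 72, can therefore manage the status of VU1 in that cycle by querying the VU instruction library 63.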
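Steps 91 to 95 amount to a small counter that is queried once per cycle. The following sketch assumes that the completion count C0 is supplied when the library object is created and that the state is reported as a string; both choices, and the class name, are illustrative.

```python
class VuCycleCounter:
    """Counter-only VU instruction library, following steps 91 to 95."""

    def __init__(self, c0):
        self.c0 = c0      # cycles VU1 needs to finish (C0)
        self.count = 0    # count C

    def start(self):
        self.count = 0    # step 91: initialize C

    def tick(self):
        self.count += 1   # step 92: count up once per cycle
        if self.count < self.c0:   # step 93: has C reached C0?
            return "running"       # step 94: VU1 is still running
        return "stopped"           # step 95: VU1 has stopped


vu = VuCycleCounter(c0=3)
vu.start()
states = [vu.tick() for _ in range(3)]
print(states)
```

The ISS body would call `tick` once in each simulated cycle (the query of step 74) and treat VU1 as busy until `"stopped"` is returned.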
【００６１】
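Furthermore, if the instruction set I obtained is found in step 75 to be a PU instruction, the ISS body 61 in step 76 obtains the model (first information) of the F&D cycle of that n-th PU instruction from the C models 81 prepared in the PU instruction library 62 and executes the stage-specific model. At the same time, it executes the model common to the stages (second information) for the related external input/output, input/output from the VU, and so on. Next, in step 77, it obtains the model of the execute cycle of the (n-1)-th PU instruction from the C models 81 of the PU instruction library 62 and executes the stage-specific model and the common model. Then, in step 78, it obtains the model of the WB cycle of the (n-2)-th PU instruction from the C models 81 of the PU instruction library 62 and executes the stage-specific model and the common model. Steps 76, 77, and 78 may be performed in any order and may also be performed in parallel.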
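A minimal sketch of the per-cycle loop of steps 76 to 78 follows. The two-instruction toy ISA (`li`, `add`), the dictionary-based instruction format, and the `PU_LIBRARY` table are assumptions for illustration, and the common model for I/O is omitted. The sketch evaluates the oldest stage first so that a value written back is visible to EX in the same cycle; this is a modeling shortcut for the example, not something the specification states.

```python
def fetch_decode(regs, ins):
    ins["decoded"] = True  # F&D: nothing to compute in this sketch


def execute(regs, ins):
    if ins["op"] == "li":
        ins["result"] = ins["imm"]
    else:  # "add"
        ins["result"] = regs[ins["a"]] + regs[ins["b"]]


def write_back(regs, ins):
    regs[ins["dst"]] = ins["result"]


# Hypothetical PU instruction library: one model per pipeline stage.
PU_LIBRARY = {"F&D": fetch_decode, "EX": execute, "WB": write_back}


def simulate(program):
    regs, cycle = {}, 0
    n_cycles = len(program) + 2  # three stages, no stalls
    for cycle in range(1, n_cycles + 1):
        # One simulated cycle: WB of I(n-2), EX of I(n-1), F&D of I(n).
        for depth, stage in ((2, "WB"), (1, "EX"), (0, "F&D")):
            index = cycle - 1 - depth  # 0-based instruction in this stage
            if 0 <= index < len(program):
                PU_LIBRARY[stage](regs, program[index])
    return regs, cycle


program = [
    {"op": "li", "dst": "r1", "imm": 2},
    {"op": "li", "dst": "r2", "imm": 3},
    {"op": "add", "dst": "r3", "a": "r1", "b": "r2"},
]
regs, cycles = simulate(program)
print(regs, cycles)
```

Three instructions complete in five cycles, and each cycle evaluates up to three instructions, one per stage, as the text describes for a three-stage pipeline.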
【００６２】
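Thus, in the simulator 60 of this example, the ISS body 61 simulates one cycle by executing the models of the respective pipeline stages of the n-th, (n-1)-th, and (n-2)-th instruction sets. The simulator 60 thereby simulates the cycle-decomposed simulation model on a cycle basis in the ISS body 61, enabling fast simulation in the C language while maintaining per-cycle accuracy equivalent to a conventional RTL-based simulator. A simulator that simulates hardware several hundred times faster can thus be provided.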
【００６３】
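In addition, for the simulation program functioning as the ISS body 61, a program of the PU instruction library 62 is provided that, for each pipeline stage of an instruction set, converts that pipeline stage into information manageable by the simulation program, that is, into a model. Using the library program, the model for managing a pipeline stage can be obtained before that stage is simulated. Providing the library program also greatly reduces the amount that must be described in the application program. The model of each pipeline stage may be obtained immediately before it is executed, as in this example, or the models may be obtained in a batch for each simulated instruction or each simulated cycle. Alternatively, the models of the pipeline stages may be obtained from the library for the whole application program to generate in advance a simulation model containing the information for all the pipelines.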
【００６４】
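Furthermore, in the simulator 60 of this example, the ISS body 61 holds the model for processing common to the pipeline stages and executes it per cycle together with the model of each pipeline stage. This reduces the per-stage model descriptions, but models that include the common processing may instead be provided as the PU instruction library 62.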
【００６５】
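The simulator 60 has the VU instruction library 63 in addition to the PU instruction library 62. Therefore, in a processor such as the VUPU 10 that has a dedicated processing unit VU1 specialized in a data path, even when the number of cycles a VU instruction consumes in VU1 depends on the data, the VU instruction library 63 can count the cycles consumed by the VU instruction and notify the ISS body 61. The state of VU1 can thus also be converted into information that the ISS body 61 can manage per cycle. There are several ways for the ISS body 61 to learn the state of VU1 simulated by the VU instruction library 63: the ISS body 61 may notify the VU instruction library 63 of cycle events and the library may return a cycle count, or the library may merely notify the ISS body 61 that processing in VU1 has finished; either way, the number of cycles in VU1 can be managed.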
【００６６】
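For example, in the simulator 60 shown in FIG. 12, the VU instruction library 63 has a routine 63a that returns the state of VU1 to the ISS body 61 at every cycle. The ISS body 61 can therefore manage VU1 per cycle, for example by monitoring the state of VU1 cycle by cycle and counting the cycles until completion. Consequently, only completion information for VU1 processing needs to be reported from the VU instruction library 63 to the ISS body 61, and the cycle count can be kept either in the ISS body 61 or in the VU instruction library 63 on the data-path instruction side. Of course, the state of VU1 supplied from the VU instruction library 63 to the ISS body 61 need not be limited to completion information; intermediate progress of VU1 processing that affects PU instruction processing can be reported to the ISS body 61 for still more accurate simulation. Although the simulator 60 shown in FIG. 12 has no pseudo VU instruction library, even without one the cycle-based ISS 61 can still perform a simulation equivalent to one using an RTL model, several orders of magnitude faster, without using the RTL model.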
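Counting on the ISS side can be sketched with a generator standing in for routine 63a. The state names `"busy"` and `"done"`, and the convention that the final processing cycle itself reports `"done"`, are assumptions for this example.

```python
def vu_state_routine(n_busy_cycles):
    """Stand-in for routine 63a: yields the VU1 state once per cycle."""
    for _ in range(n_busy_cycles - 1):
        yield "busy"
    yield "done"  # the last processing cycle reports completion


def count_vu_cycles(state_routine):
    """ISS-side counting: poll the state every cycle until VU1 reports done."""
    cycles = 0
    for state in state_routine:
        cycles += 1
        if state == "done":
            break
    return cycles


print(count_vu_cycles(vu_state_routine(4)))  # prints 4
```

Here the library only reports state and the ISS body keeps the count, which is one of the two divisions of labor the paragraph above allows.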
【００６７】
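FIG. 13 shows the detailed design procedure according to the present invention. A system or program having means or processes capable of executing this procedure can be provided as an automated design system or automated design program that runs in a suitable computing environment. The present invention thus makes it possible to build an automated design system for system LSIs starting from a high-level language such as C or JAVA (registered trademark).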
【００６８】
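In the first design stage 111, expected output values are created for the C description 31 containing the initial specification, to serve as the basis of the test function in later design work. The compiler at this stage is a general-purpose C compiler (gcc).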
【００６９】
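The next design stage 112 is porting to the PU. The C description 31 is compiled with pcc 101, the C compiler for PU2 (PUX3). At this point, in step 121, test descriptions such as comparison against the expected output values can be attached, as pseudo VU instructions, to the program compiled with pcc 101. The pseudo VU instructions are compiled with pcc 101 together with the other instructions, but the functions executed by the pseudo VU instructions (the pseudo VU), that is, the functions written as test descriptions, are compiled with the general-purpose compiler (gcc) 100 to generate the pseudo VU library.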
【００７０】
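The program (C description) thus equipped with a test circuit (pseudo VU) operated by pseudo VU instructions is verified in step 122 using the ISS 61. Because the test function has been ported by means of the pseudo VU instructions, it is immediately apparent whether the port to the program compiled with pcc 101 was performed correctly. In this design stage 112, the clock consumption reported by the ISS 61, acting as an ISS profiler, for each part of the C source code is used to consider and decide which portions to convert to a VU (that is, to hardware) for speed-up.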
【００７１】
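The next design stage 113 is extraction of the portions suited to VU1 and their conversion to a VU, that is, creation of the VU instruction library. To this end, step 123 extracts the portions to be converted to a VU, and step 124 replaces them with VU instructions, giving a program made up of VU instructions, PU instructions, and pseudo VU instructions that can be compiled with pcc 101. Meanwhile, step 125 turns the extracted portions into a library. Because a VU instruction library that runs on the ISS 61 can be generated from the C code as it is, a simulation step 126 is inserted at this stage, but if unnecessary it may be skipped and simulation performed only at the next stage. The compiler used to make the VU instruction library available in step 126 is a general-purpose C compiler, for example gcc. At this stage too, the pseudo VU instructions are used unchanged as the test description.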
【００７２】
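In design stage 114, step 127 removes redundant bits using C++, and in step 128 the resulting C++ VU instruction library 63 is mounted on the simulator 60 and the ISS 61 is run to check the number of cycles consumed. This requires knowing the number of cycles of VU1, so the next design stage 115 is executed to obtain the cycle count, which is fed back to the VU instruction library 63. At this stage too, the pseudo VU instructions are used unchanged as the test description.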
【００７３】
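As described above, in ordinary C the data length is taken to be, for example, 32 bits, and the algorithm that forms the specification is developed and verified within a 32-bit range. Very often, however, the hardware system that is actually run does not need 32 bits. In that case, unless the redundancy is removed, an execution circuit with wasted hardware is generated. It is therefore extremely important at this design stage 114 to remove the redundancy in a language other than C, here C++.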
【００７４】
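In addition, it is important to verify with the C++ library 63 from which the redundancy has been removed. Whether too much has been removed must be confirmed by simulation, and being able to confirm this with the cycle-based ISS 61 at stage 114 is highly significant. Without the cycle-based ISS 61, the confirmation could only be made by RTL-based simulation, which takes about 1,000 times longer than simulation with the cycle-based ISS 61.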
【００７５】
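One might argue that the initial specification 31 should be written in C++ from the outset, in a form free of redundancy. That is one approach, but it has the following drawback. Object sizes differ between C++, an object-oriented language, and C, an ordinary procedural language: the object size for an object-oriented language is typically about 30 to 50% larger. As a result, the object size of the program to be run on the PU, that is, the portion compiled by the PU compiler 101, would become very large.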
【００７６】
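For an embedded processor such as the VUPU 10, keeping the on-silicon memory size small is an extremely important factor for cost and power consumption, and in this respect a design method whose initial design is in C has a great advantage. The description converted to a VU, by contrast, becomes hardware, so even if it passes through C++ it does not lead to the increase in object size mentioned above. Rather, there is a great advantage in being able to simulate with redundant bit lengths removed through type conversion.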
【００７７】
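The next design stage 115 is behavioral synthesis with a C-to-RTL tool. The C++ library 63 examined by ISS simulation is converted in step 129 into a C description that can be synthesized, and synthesis is run in step 130. RTL and a cycle count are reported as the result, and the cycle count is fed back to the upstream ISS simulation (step 128 or 126) as described above. The VU library 63 with the cycle count reported at stage 115 embedded in it becomes interchangeable with the RTL generated by the C-to-RTL behavioral synthesis tool.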
【００７８】
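Design stage 116 is the final stage, in which the RTL of VU1 and the program containing PU instructions, VU instructions, and pseudo VU instructions are mixed-simulated or co-simulated in step 131 using the ISS 61 to perform final verification of the completed RTL of VU1. In this simulation, the parent is normally RTL. That is, the simulation can also be performed by creating a parent RTL and connecting, at the parent level, the ISS as one child and the RTL of the VU section as another child. Because RTL is included, the simulation at this stage is slower, but it is worth running once as a final check. In addition, because the pseudo VU instructions embedded in the program run unchanged on the ISS 61, the same test environment can be used.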
【００７９】
The RTL of VU1 whose operation has been finally confirmed is logic-synthesized in step 132, output as a netlist in step 133, and implemented in silicon (as a circuit).
【００８０】
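Thus, in the above series of stages, simulation with the ISS 61 is repeated in several steps (steps 122, 126, 128, and 131), and the test description by pseudo VU instructions introduced in step 121 of the first stage 111, in particular functions such as expected-value checking, remains attached to the simulated program and is used through to stage 116, the final RTL verification phase. In a design method with multiple simulation phases, the ability to use the same test description from the beginning of the design to the end is extremely effective for verifying an implementation that changes little by little at each stage, and has the major advantage that there is no risk of pressing ahead with a design in which the circuit is seriously wrong.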
【００８１】
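Although the simulation method of the present invention has been described above on the basis of the simulator 60, which can be realized as a simulation program, the method can also be carried out by other means such as hard-wired logic. What is important, however, is that the simulation method of the present invention can be provided as a fast simulation program written in C or another high-level language such as JAVA (registered trademark). By installing the simulation program of the present invention, recorded on a suitable recording medium such as a CD-ROM or provided over a computer network such as the Internet, on suitable hardware such as a personal computer or workstation, the functions of the VUPU 10 can be simulated at high speed.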
【００８２】
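The VUPU 10 has a PU, which is a general-purpose processor, and a VU implemented as a dedicated circuit. Compared with a conventional custom LSI in which the entire specification is implemented as dedicated circuits, it can provide a custom LSI of equal or better performance in a shorter time and at lower cost, and adopting the simulator of the present invention shortens the development period further. The simulator of the present invention is not limited to the VUPU; it is also applicable to the design and development of general-purpose processors and of any program executed on a processor (referred to in this specification as an application program), improving simulation accuracy and shortening the simulation period.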
【００８３】
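[Effects of the invention]
As described above, the present invention provides a simulator with a cycle-based ISS body that can simulate using models in which instruction sets are decomposed into pipeline cycles, so that hardware can be simulated on a cycle basis by a fast simulator written in a high-level language such as C. The simulation method of the present invention can therefore provide a simulator, or a data processing system functioning as a simulator, whose accuracy matches that of a conventional RTL-based simulator at the cycle level while running several hundred to several thousand times faster. The present invention thus greatly shortens the processor design period and improves design quality, and so contributes very substantially to the cost competitiveness of the designed product.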
【００８４】
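Furthermore, with the simulation method of the present invention, even in a processor such as the VUPU that has a dedicated processing unit specialized in a data path, the number of cycles consumed by a dedicated instruction in the dedicated processing unit can be counted and managed per cycle in the form of a library, enabling cycle-based hardware simulation.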
【００８５】
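Furthermore, by introducing pseudo VU instructions that are not executed on the processor, in particular zero-cycle pseudo VU instructions, a test circuit useful for evaluation and verification can be used as pseudo VU instructions from the initial design stage to the final RTL generation stage. Also, by making the processing of VU instructions into a library, the library can be converted into another language, C++ in the above description, so that redundant bits can be reduced very easily. And because the simulator of the present invention is a cycle-based ISS, it can also be connected to the RTL generated at the final stage for simulation, covering the flow through to final verification.
[Brief description of the drawings]
[FIG. 1] A diagram outlining a data processing apparatus (VUPU) having a PU and a VU.
[FIG. 2] A diagram showing how a VUPU is developed from a specification written in C.
[FIG. 3] A diagram showing how an instruction set is executed in separate pipeline stages.
[FIG. 4] A diagram showing evaluation of instruction sets per instruction and evaluation per cycle with the instruction sets divided into pipeline stages.
[FIG. 5] A diagram comparing several cases of evaluating instruction sets per instruction with evaluation per cycle.
[FIG. 6] A diagram outlining the simulator of the present invention.
[FIG. 7] A diagram comparing a per-instruction model of an instruction set with a per-pipeline-stage model.
[FIG. 8] A diagram showing a per-pipeline-stage model divided into stage-specific models and a common model.
[FIG. 9] A diagram outlining the design method in which pseudo VU instructions are introduced.
[FIG. 10] A diagram showing how a VUPU is developed from a specification written in C with pseudo VU instructions introduced.
[FIG. 11] A flowchart showing the functions of the ISS body and the libraries of the simulator shown in FIG. 6.
[FIG. 12] A diagram showing another example of the simulator of the present invention.
[FIG. 13] A diagram outlining the design procedure according to the present invention.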
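[Explanation of reference signs]
1 Dedicated processing unit (VU)
2 General-purpose processing unit (PU)
3 General-purpose processor (PUX)
4 Code RAM
4a Application program
5 Fetch unit (FU)
10 Data processing apparatus (VUPU)
11 Execution unit
60 Simulator
61 ISS body
62 PU instruction library
63 VU instruction library
[0001]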
[Technical field of the invention]
The present invention relates to a system for simulating the operation of a processor.
[0002]
[Prior art]
In the 1990s, register-based description languages (RTL) came into wide use as design languages for hardware. Verilog and VHDL are representative examples. With RTL, a design is expressed by logically describing signal transfer and signal processing between registers, which form the basis of the hardware, using arithmetic operations such as addition, subtraction, multiplication, and division, logical operations such as AND and OR, conditional statements such as "IF THEN ELSE", and assignment statements. RTL therefore raised the level of abstraction of logic-circuit design for its time and improved the efficiency of designing processors and similar devices.
[0003]
Thereafter, from the late 1990s, design languages with a still higher level of abstraction, referred to as the behavioral level, came into use. These languages can be regarded as an evolution of register-based languages such as Verilog and VHDL, and in fact Verilog and VHDL include behavioral-level description forms.
[0004]
At the behavioral level, on the other hand, there is no concept of a register, and a description reduces to arithmetic operations, logical operations, conditional statements, and assignment statements. General software description languages therefore also fall into this category, and from the late 1990s hardware design in C began to be explored. C is more widely used and has abundant software resources; in addition, register-based languages simulate slowly even when restricted to behavioral description. For example, comparing a specification written in C with one written in RTL, simulation in C is several thousand to about 100,000 times faster, a very large difference. This difference in simulation speed arises because RTL is a language for designing hardware, whereas C is a language for designing software.
[0005]
[Problems to be solved by the invention]
In recent years, a method has emerged in which the specification of an application to be executed by a processor is initially designed in C, a language for describing software, and is finally converted into RTL, which describes hardware, to verify the design of the processor. Furthermore, when developing and designing a processor that realizes a specification given in C, a method has been proposed in which part of the specification is implemented in hardware via a dedicated processor, instead of converting the specification directly into hardware. The applicant of the present application discloses, for example, a data processing device that can be fitted with customizable dedicated instructions in Japanese Patent Laid-Open No. 2000-207202. In such a processor design method, an instruction-set-based description, which may also be called an assembler-based description, exists between the C language and RTL. A description by instruction set can simulate execution in the processor more accurately than a C specification, which is merely a logic-based description. For this reason, simulators for descriptions in instruction form (instruction level) have been developed in recent years and are called instruction set simulators (ISS).
[0006]
However, a conventional ISS simulates the execution of an instruction sequence, for example assembler instructions, and does not simulate the hardware itself. Therefore, for real-time processing such as reads and writes of input/output (IO) signals and interrupts, the simulation may be correct as to whether those functions are executed, yet it is not necessarily correct in hardware terms. For example, at the clock cycle level the simulation deviates somewhat from the actual operation of the processor. Since an ISS is by nature a simulation of an instruction sequence, this is acceptable for that purpose, but it is insufficient from the viewpoint of simulating the hardware while the application program under simulation runs on the processor. Thus, although the current ISS can itself be written in C and can therefore simulate instruction-set-based execution at high speed, it is insufficient as a hardware simulation tool.
[0007]
Therefore, an RTL-based simulator is still necessary to perform hardware simulation. However, as described above, the RTL simulation is very slow, and the simulation at the hardware level is a big bottleneck when trying to shorten the period for designing the processor.
[0008]
In particular, now that a behavioral synthesis tool called "CtoRTL" is available as a tool for design automation starting from C, and design automation is moving from RTL to C, it is important to remove this bottleneck.
[0009]
The CtoRTL tool takes as input a C description of a specification, with a clock frequency given as a parameter, and outputs RTL composed of registers. As described above, the input C language has no concept of clock or cycle. The tool therefore allocates registers according to the given clock frequency and searches for a solution that satisfies the specification. In doing so, it must allocate arithmetic units that perform the arithmetic and logical operations described in the specification. Resource sharing, which realizes the operation specified in C with a small number of arithmetic units, and scheduling, which determines how the execution order described in the specification is realized in a small number of clocks, can be said to be the decisive factors of this synthesis tool. It is therefore important to evaluate or verify the performance of the CtoRTL tool. Further, since the RTL is synthesized automatically from C, the C specification is converted into RTL as it is, and there is the problem that redundant bits in the C description are carried directly into the synthesized RTL.
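The trade-off between resource sharing and scheduling described above can be illustrated with a small sketch. This is not the algorithm of any actual CtoRTL tool: the function name `schedule`, the one-clock cost per operation, and the greedy policy are all assumptions chosen for illustration, and the dependency graph is assumed to be acyclic.

```python
def schedule(ops, deps, n_units):
    """Greedy list scheduling (illustrative assumption, not a real CtoRTL algorithm).

    ops     : operation names in priority order
    deps    : dict mapping an operation to the operations it depends on
    n_units : number of arithmetic units that can work in the same clock
    Each operation is assumed to take exactly one clock.
    """
    done_at = {}            # operation -> clock in which it was executed
    remaining = list(ops)
    clock = 0
    while remaining:
        # An operation is ready when all its inputs were produced in an earlier clock.
        ready = [o for o in remaining
                 if all(d in done_at and done_at[d] < clock
                        for d in deps.get(o, []))]
        # Resource sharing: at most n_units operations are issued per clock.
        for o in ready[:n_units]:
            done_at[o] = clock
            remaining.remove(o)
        clock += 1
    return clock, done_at


# y = a*b + c*d expressed as two multiplications feeding one addition
OPS = ["m1", "m2", "add"]
DEPS = {"add": ["m1", "m2"]}
clocks_two_units, _ = schedule(OPS, DEPS, 2)
clocks_one_unit, _ = schedule(OPS, DEPS, 1)
```

In this toy case two arithmetic units finish in 2 clocks and a single shared unit needs 3, which is the kind of area-versus-clock-count decision the text attributes to the synthesis tool.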
[0010]
In an application program for a processor provided with a dedicated processor, dedicated instructions that perform specific processing using the data path of the dedicated processor are included together with general-purpose instructions. However, the conventional ISS can simulate such a program functionally but not in hardware terms. That is, the relationship between the numbers of clocks consumed when executing a data path instruction and a general-purpose instruction is unknown, and the conventional ISS can only confirm the function of a data path instruction (dedicated instruction).
[0011]
Accordingly, an object of the present invention is to provide a new simulator capable of performing hardware simulation at high speed. It is another object of the present invention to provide a simulator capable of simulating the execution status of a data path dedicated instruction and a general instruction for operating a general-purpose processor at a hardware level at high speed.
[0012]
[Means for Solving the Problems]
A register-level RTL simulator has the concept of a clock cycle because it describes hardware. A C language simulator, in contrast, has the advantage that operation can be described without the concept of a clock. In the present invention, the instruction set simulated by the ISS is decomposed on a cycle basis so that the ISS can be used as a tool for simulating hardware. Although a cycle-level ISS is slower than simulation of the C language itself, it can run several hundred to several thousand times faster than an RTL-based simulator while giving results of equivalent accuracy. The design efficiency of the processor can therefore be improved greatly.
[0013]
One aspect of the present invention is a system having a simulator for simulating the operation of an application program in a processor. The processor includes a general-purpose processing unit that executes general-purpose processing by a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The simulator includes: a general-purpose instruction library that, if an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, converts the general-purpose instruction into a model of each pipeline stage of the general-purpose processing unit; a dedicated instruction library that, if an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, converts the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and cycle-level simulation means for simulating the operation of the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models provided from the general-purpose instruction library and the dedicated instruction library.
Another aspect of the present invention is a method in which a simulator simulates the operation of an application program in a processor. The processor includes a general-purpose processing unit that executes general-purpose processing by a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The method includes: a first conversion step that, if an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, converts the general-purpose instruction into a model of each pipeline stage of the general-purpose processing unit; a second conversion step that, if an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, converts the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and a cycle-level simulation step of simulating the operation of the processor under the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models obtained by the first conversion step and the second conversion step.
Still another aspect of the present invention is a program for simulating, by a computer, the operation of an application program in a processor. The processor includes a general-purpose processing unit that executes general-purpose processing by a plurality of pipeline stages, and a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit. The computer has a general-purpose instruction library that converts a general-purpose instruction defining processing in the general-purpose processing unit into a model of each pipeline stage of the general-purpose processing unit, and a dedicated instruction library that converts a dedicated instruction defining processing in the dedicated processing unit into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit. The program has instructions that cause the computer to execute: a first conversion process that, if an instruction included in the application program is a general-purpose instruction, converts it by the general-purpose instruction library into a model of each pipeline stage of the general-purpose processing unit; a second conversion process that, if an instruction included in the application program is a dedicated instruction, converts it by the dedicated instruction library into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and a cycle-level simulation process that simulates the operation of the application program in the cycles of the pipeline stages of the general-purpose processing unit, using the models provided from the general-purpose instruction library and the dedicated instruction library.
A data processing system having means for simulating the operation of an application program in the cycles of the pipeline stages (cycle-level simulation means, described below as the cycle-level simulator main body or ISS main body) is described as a simulator or simulation system in the embodiments of this specification. By adopting this data processing system, a series of automatic design systems for automatically designing LSIs can be provided. The simulator of the present invention can also be provided as a program for simulating the operation, in a processor, of an application program having a plurality of instruction sets whose instruction cycles are executed by a plurality of pipeline stages, the program having instructions for executing a cycle-level simulation process that simulates the operation in the cycles of the pipeline stages. The program can be recorded on a suitable recording medium such as a CD-ROM, or supplied through a medium such as a computer network.
[0014]
In the conventional ISS, a simulation model is described and managed for each instruction set. In contrast, in the present invention, an instruction set is decomposed into pipeline cycle units and is modeled and managed in cycle units. The instruction set may be modeled in cycle units by the cycle-level simulation process or means, or a simulation model in which the instruction set has been decomposed into cycle units may be formed in advance by a compiler or the like and then executed. In the present invention, the execution of an instruction set and other processing, whether or not caused by that execution, such as interrupt operations, can be correctly modeled and managed in cycle units. Therefore, even without RTL, hardware can be simulated correctly on a cycle basis by a simulator written in C or another high-level language, and so hardware can be simulated at high speed.
[0015]
In the present invention, the series of instruction cycles of an instruction set, that is, fetch, decode, execution, and write-back, is simulated by dividing it into the processing corresponding to each pipeline stage (each pipeline stage cycle). The processing of each instruction at each pipeline stage may be modeled individually. However, by modeling so that processing common to these instructions or to the pipeline stages can be managed in common, hardware resources can be saved and a simulator program can be developed at low cost in a short period of time.
[0016]
The cycle-unit model can be divided into two main parts. One is the description unique to an instruction set, for example the part describing the function (mechanism and/or timing) of the execute cycle, which is one pipeline stage of an AND instruction. The other is the description common to the execution of each pipeline stage of each instruction set, for example the description of the action taken in response to an interrupt signal. Therefore, in the present invention, a library is provided that, before simulation in cycle units, converts each pipeline stage of an instruction set into first information that can be managed by the means for simulating at the cycle level (the ISS main body) or by its processing. Thus, when an instruction set or a pipeline stage of an instruction set is given, the library supplies the processing performed in the ISS main body or the model managed by the ISS main body. If the processing or model common to the pipeline stages is described in the ISS main body or its processing, and simulation is performed using the first information on the basis of second information relating to that common processing, the first information can be limited to information corresponding to the processing unique to each pipeline stage.
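The split between the library (stage-specific "first information") and the ISS main body (common processing such as interrupt sampling) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the names `LIBRARY` and `run`, the three-stage split, the `_latch` temporary, and the simplification that stages run one after another without pipeline overlap.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Stage = Callable[[State, tuple], None]


def fd(state: State, ops: tuple) -> None:
    # Fetch-and-decode stage: nothing instruction-specific in this toy model.
    pass


def and_ex(state: State, ops: tuple) -> None:
    # Execute stage unique to AND: compute the result into a temporary latch.
    _dst, a, b = ops
    state["_latch"] = state[a] & state[b]


def wb(state: State, ops: tuple) -> None:
    # Write-back stage: commit the latched result to the destination register.
    state[ops[0]] = state["_latch"]


# Library: per-instruction list of stage models (the stage-specific part).
LIBRARY: Dict[str, List[Stage]] = {"AND": [fd, and_ex, wb]}


def run(program, state: State, irq_at: int = -1):
    """ISS body sketch: keeps time in cycles and handles what is common to all stages."""
    cycle = 0
    irq_seen = None
    for name, ops in program:
        for stage in LIBRARY[name]:   # stage-specific behaviour comes from the library
            if cycle == irq_at:       # common behaviour: sample the interrupt every cycle
                irq_seen = cycle
            stage(state, ops)
            cycle += 1
    return cycle, irq_seen


regs = {"r0": 0, "r1": 0b1100, "r2": 0b1010}
total_cycles, irq_cycle = run([("AND", ("r0", "r1", "r2"))], regs, irq_at=1)
```

Because the interrupt check lives in `run` rather than in each stage function, adding a new instruction to `LIBRARY` requires only its stage-specific models, which is the economy the paragraph describes.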
[0017]
Therefore, in the present invention, the instruction set is decomposed into cycle units for simulation, and the ISS main body, which is the cycle-level simulation means, is given the function of managing time in cycle units. The part unique to an instruction in each cycle is described in the library as a processing description or model for each cycle of the instruction set. The processing description or model common to each cycle, for example for an external signal such as an interrupt, is described in the ISS main body. Therefore, a system or environment capable of simulation can be provided with hardware of a simple configuration. When an application program is supplied, each instruction set can be divided into its pipeline stages and simulated cycle by cycle as if it were executed by the processor being simulated. In addition, the simulation time can be reduced further by converting the application program in advance into a simulation model suited to the ISS main body of the present invention, using a program having instructions for executing a process that converts an application program, which has a plurality of instruction sets whose instruction cycles are executed by a plurality of pipeline stages, into a simulation model expressed as a description in which each instruction set is divided into pipeline stages.
[0018]
An application program in this specification means a program at any level that includes an instruction set instructing a processor to operate or perform processing, such as a source program, an object program (object code), or an intermediate-stage program, and includes all programs subject to simulation.
[0019]
When the processor to be simulated has a dedicated data path, cycle management differs between processing that uses the data path and pipeline stage processing in the general-purpose processor. Therefore, if the processor has a general-purpose processing unit capable of executing general-purpose processing and a dedicated processing unit specialized for specific data processing, it is desirable to provide two libraries. One is a general-purpose instruction library that takes, from among the instruction sets included in the application program, the general-purpose instruction sets, which define processing in the general-purpose processing unit and whose instruction cycles are executed by a plurality of pipeline stages, and converts each pipeline stage of a general-purpose instruction set into information that can be managed by the means for simulating. The other is a dedicated instruction library that converts the dedicated instruction sets, which define processing in the dedicated processing unit, into information that can be managed by the means for simulating. The general-purpose instruction sets and dedicated instruction sets are converted using different libraries in different conversion processes (a first conversion process and a second conversion process) and provided to the means that performs simulation at the cycle level (the ISS main body). The operation of a processor having a general-purpose processing unit and a dedicated processing unit can thereby be modeled on the basis of pipeline processing. The library programs that perform these conversions may be provided as a single program, but by providing them as separate library programs, an arrangement becomes possible in which the simulator provider supplies the library for the general-purpose instruction set and the user who designed the dedicated processing unit supplies the library for the dedicated instruction set.
[0020]
In addition, the processing of the dedicated processing unit can be modeled appropriately in the dedicated instruction library. By returning the cycle information consumed by a dedicated instruction set from the dedicated instruction library, as some signal or information, to the ISS main body (the means for simulating) or its processing, the execution state of the dedicated processing unit can be reflected in the simulation in the ISS main body. The dedicated instruction library can also reflect the state of the dedicated processing unit in the simulation by returning that state to the ISS main body or its processing in cycle units.
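One way to picture the interface described here is a dedicated instruction library that hands back both the cycles consumed and the resulting state of the dedicated processing unit. The sketch below is hypothetical: the instruction name `VSUM`, its cost of one cycle per input word plus one, the one-cycle cost of a general-purpose instruction, and the function names are all assumptions and are not taken from the specification.

```python
from typing import Dict, List, Tuple


def vu_library(name: str, data: List[int]) -> Tuple[int, Dict[str, int]]:
    """Hypothetical dedicated-instruction model: returns (cycles consumed, VU state)."""
    if name == "VSUM":
        # Assumed cost: one cycle per input word plus one cycle to latch the result.
        return len(data) + 1, {"result": sum(data)}
    raise KeyError(name)


def iss_body(program) -> Tuple[int, Dict[str, int]]:
    """Adds the cycle information returned by the library to the simulated clock."""
    cycle = 0
    vu_state: Dict[str, int] = {}
    for kind, name, data in program:
        if kind == "VU":
            used, vu_state = vu_library(name, data)
            cycle += used
        else:
            cycle += 1  # assumed: one issue cycle per general-purpose instruction
    return cycle, vu_state


PROGRAM = [("PU", "ADD", []), ("VU", "VSUM", [1, 2, 3]), ("PU", "ADD", [])]
total, state = iss_body(PROGRAM)
```

The ISS body never needs to know how `VSUM` works internally; it only consumes the returned cycle count and state, so a user-supplied library can be swapped in without changing the simulator core.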
[0021]
As described above, the present invention can provide a tool capable of simulating hardware at high speed in units of cycles. Accordingly, it is possible to economically design and supply a distributed processing type system LSI using a plurality of dedicated circuits that can realize specifications written in C language and can be processed at high speed. Then, in the design and development of the processor, simulation can be performed while giving the concept of cycle or clock at an intermediate stage of transition from C language to RTL. The introduction of this new design infrastructure layer not only increases simulation speed and improves design efficiency, but also has various other advantages.
[0022]
One is that, as described above, in a processor driven by a general-purpose instruction set and a dedicated instruction set, the operation of the dedicated processing unit (dedicated-purpose processing unit or dedicated circuit unit) driven by the dedicated instruction set can be simulated in a form close to the C language level rather than in RTL. Moreover, since the description only has to be converted into information that can be managed by the cycle-based ISS main body, it need not be C itself. If there is a language better suited to describing the operation of the dedicated processing unit, the C description can be converted into that language and compiled to create the dedicated instruction library.
[0023]
For example, ANSI-C, currently the most widely used language for processor specifications, restricts the data lengths of the variables that can be described to 16, 32, and 64 bits. Specifications implemented as user instructions, on the other hand, do not always adopt these data lengths, and lengths such as 24 bits are very common. ANSI-C cannot reproduce such specifications faithfully, so the results may differ from those of an RTL simulation that incorporates them, and ultimately the RTL simulation has to be repeated. In other languages, for example C++, the data length can be made variable by declaring variable types using a class library. Therefore, if the C specification is converted to C++ and compiled to create the dedicated instruction library, redundant bits can be removed, and simulation at the same accuracy level as RTL can be executed at high speed in the cycle-based ISS main body and a simulation system using it.
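The effect of a declared bit width can be seen in a small sketch. The text refers to a C++ class library; the Python class below is only a stand-in written for illustration, and the class name `UInt` and its wrap-around behaviour are assumptions rather than anything defined in the specification.

```python
class UInt:
    """Unsigned integer that wraps at a declared bit width (illustrative only)."""

    def __init__(self, value: int, bits: int):
        self.bits = bits
        # Keep only the low `bits` bits, as a hardware register of that width would.
        self.value = value & ((1 << bits) - 1)

    def __add__(self, other: "UInt") -> "UInt":
        # The result keeps the width of the left operand.
        return UInt(self.value + other.value, self.bits)


# The same addition in a 24-bit data path and in a 32-bit variable.
acc24 = UInt(0xFFFFFF, 24) + UInt(1, 24)
acc32 = UInt(0xFFFFFF, 32) + UInt(1, 32)
```

Here `acc24.value` wraps to 0 while `acc32.value` is 0x1000000, which is the kind of mismatch between a 32-bit C model and 24-bit hardware that the paragraph warns about.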
[0024]
Another advantage is that a pseudo instruction set that is not executed by the processor can be added to the application program, for evaluating the simulation or for other purposes, and made a target of the cycle-based ISS simulation. A pseudo instruction set that is not executed by the processor is, for example, an instruction that performs processing without the ISS main body counting any cycles. If a pseudo instruction library that converts the pseudo instruction set into information manageable by the simulator is prepared at the same level as the dedicated instruction library described above, it can be incorporated into the cycle-based ISS main body or a simulation system using it. Pseudo instruction sets for evaluating (testing or debugging) operation include instructions that execute processing for monitoring the input and output of a data file.
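A zero-cycle pseudo instruction can be sketched as an operation that the simulator handles without advancing its cycle counter. The opcode names `PROBE` and `INC`, the trace format, and the one-cycle cost of a real instruction are assumptions made for this illustration.

```python
def simulate(program, regs):
    """Toy cycle counter in which the hypothetical PROBE pseudo instruction costs zero cycles."""
    cycle = 0
    trace = []
    for op, arg in program:
        if op == "PROBE":
            # Pseudo instruction: record the register value for later comparison,
            # without counting a cycle, since the real processor never executes it.
            trace.append((cycle, arg, regs[arg]))
            continue
        if op == "INC":
            regs[arg] += 1
        cycle += 1  # assumed: every real instruction costs one cycle here
    return cycle, trace


registers = {"r0": 0}
cycles, probes = simulate([("INC", "r0"), ("PROBE", "r0"), ("INC", "r0")], registers)
```

The program with the probe reports the same cycle count as it would without it, so the monitoring code can stay in place across design stages without disturbing timing results.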
[0025]
In simulation using the cycle-based simulation means, the work of simulating a processor provided with a dedicated processing unit can be divided into several stages. In the first simulation stage, the specification given in C or the like is simulated with an ISS, preferably the cycle-based ISS main body, without extracting the part to be executed by dedicated instructions. If a pseudo instruction set is introduced at this stage, input data, expected output values, and the like can be obtained as the basis for later evaluation. In the second simulation stage, the part to be executed by the dedicated processing unit is extracted from the specification and replaced with a dedicated instruction set, a dedicated instruction library is created from the extracted part, and simulation is performed with the cycle-based ISS main body of the present invention. Because the cycle-based ISS main body of the present invention can simulate on a cycle basis including the dedicated processing unit, most of the processor development can be completed at this stage. In addition, by using the pseudo instruction set to compare the data obtained at this stage with the input data and expected output values obtained earlier, the processor can be developed smoothly and quickly without major design errors.
[0026]
Furthermore, as a third simulation stage, it is also possible to provide a stage in which the dedicated instruction library is converted to RTL and then simulated by the cycle-based ISS main body. The function when actually converted to RTL can be confirmed, and the general-purpose processing unit side is also operated in the cycle-based ISS main body, so that high-precision simulation can be performed by matching with RTL.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
The present invention will be further described below with reference to the drawings. FIG. 1 shows an outline of a data processing device 10 having a dedicated processing unit (dedicated instruction execution unit or dedicated-purpose processing unit, hereinafter referred to as VU) 1 with a data path unit 20 specialized for specific data processing, and a general-purpose processing unit (general-purpose instruction execution unit, general purpose processing unit or process unit, hereinafter referred to as PU) 2. This data processing device 10 is a programmable processor having a dedicated circuit, and includes a fetch unit (FU) 5 that fetches instructions from a code RAM 4 storing an executable control program (object program, object code, microprogram code) 4a and provides decoded control signals to the dedicated data processing unit 1 and the general-purpose data processing unit 2. The fetch unit 5 includes a fetch section 7 that fetches an instruction from an address in the code RAM 4 determined by the previous instruction, the state of the state register 6, an interrupt signal φi, and the like, and a decoding unit 8 that decodes the fetched dedicated instruction or general-purpose instruction. The decoding unit 8 supplies a control signal (decoded control signal) φv obtained by decoding a dedicated instruction and a control signal φp obtained by decoding a general-purpose instruction to the dedicated data processing unit VU 1 and the general-purpose data processing unit PU 2, respectively. A status signal φs indicating the execution state is returned from the PU 2, and the states of the PU 2 and the VU 1 are reflected in the state register 6.
[0028]
The PU 2 in this example includes a highly versatile execution unit 11 composed of general-purpose registers, a flag register, an arithmetic unit (ALU), and the like, and can execute general-purpose processing successively while outputting execution results to the data RAM 15 and the like. That is, in the PU 2 of this example, one general-purpose instruction is executed divided into a plurality of pipeline stages, such as a fetch and decode stage, an execute stage, and a stage that writes the result to memory. The configuration consisting of the fetch unit FU 5, the general-purpose data processing unit PU 2, the code RAM 4, and the data RAM 15 resembles a general processor unit, although the individual functions differ. This configuration can therefore be called the processor unit 3, and the data processing device 10 of this example can be configured or designed on the concept of controlling the VU 1 from the processor unit (PUX) 3.
[0029]
The dedicated data processing unit VU 1, which executes the dedicated instruction φv from the FU 5, includes a unit 22 that decodes whether the instruction supplied from the FU 5 is a VU instruction φv, a sequencer (FSM, Finite State Machine) 21 that outputs, by hardware, control signals for performing predetermined specific data processing, and a data path unit 20 designed to perform the specific data processing in accordance with the control signals from the sequencer 21. The VU 1 also includes a register 23 that can be accessed from the PU 2. Data necessary for the processing of the data path unit 20 is controlled by the PU 2 via this interface register 23, and the internal state of the VU 1 can be referenced by the PU 2 via the register 23. In addition, the result processed in the data path unit 20 is supplied to the PU 2, and processing using the result is performed in the PU 2.
[0030]
In this data processing device 10, a program including general-purpose instructions (PU instructions) and dedicated instructions (VU instructions) is stored in the code RAM 4, and the control signals φp or φv fetched and decoded by the fetch unit 5 are supplied to the VU 1 and the PU 2. The VU 1 operates when, of the control signals φp and φv, the control signal φv of a dedicated instruction that starts the VU 1 itself is supplied. The PU 2, on the other hand, is supplied only with the control signal φp obtained by decoding a general-purpose instruction. The control signal φv obtained by decoding a VU instruction is not issued to the PU 2; instead, a control signal indicating a nop instruction, which executes nothing, is issued, and processing in the PU 2 is skipped. The VU 1 is changed according to the application, and the dedicated instructions that direct the VU 1 often change with the application as well. The VU 1 is a data path or dedicated circuit specialized for an application, and can easily be designed to interpret the control signal obtained by decoding a VU instruction. Because a nop instruction is output to it, the PU 2 does not need to handle instructions specialized for the VU 1 and needs only the ability to interpret and execute basic or general-purpose instructions. The PU 2 can therefore coexist with VUs 1 for various applications without sacrificing versatility, and can control them or perform processing using their calculation results.
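The routing of decoded control signals can be summarized in a short sketch. The convention that VU instruction mnemonics begin with "v", the mnemonics themselves, and the function names are assumptions made purely for illustration; the specification does not define an encoding.

```python
NOP = "nop"


def decode(instr: str):
    """Return (phi_p, phi_v): the control signals sent to the PU and to the VU."""
    if instr.startswith("v"):   # assumed naming convention for VU instructions
        return NOP, instr       # the PU is told to skip; the VU receives the instruction
    return instr, None          # general-purpose instructions do not start the VU


def dispatch(program):
    """Collect what each unit sees for a whole instruction stream."""
    pu_trace, vu_trace = [], []
    for instr in program:
        phi_p, phi_v = decode(instr)
        pu_trace.append(phi_p)
        if phi_v is not None:
            vu_trace.append(phi_v)
    return pu_trace, vu_trace


pu_seen, vu_seen = dispatch(["add", "vfind", "sub"])
```

In this toy stream the PU sees `add`, `nop`, `sub`, so it never has to understand `vfind`, while the VU sees only `vfind`.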
[0031]
The data processing device 10 shown in FIG. 1, which has a VU 1 with a dedicated circuit capable of realizing processing that requires special operations such as real-time response and a versatile PU 2, is therefore referred to below as a VUPU. The VUPU 10 can shorten the design and development period without sacrificing real-time responsiveness, and can flexibly accommodate later changes and modifications. The number of VUs 1 is not limited to one: a plurality of VUs 1 can be prepared so that the dedicated processing required by the application can be handled, and a plurality of dedicated instructions for operating the respective VUs 1 can be included in the program code. Furthermore, the VU 1 of this example can implement as a dedicated circuit not only special arithmetic processing but also a specific program function within the program, so that the program runs efficiently. A data processing system including a plurality of VUPUs 10 therefore has an architecture applicable to a very wide range of uses.
[0032]
FIG. 2 shows the flow of developing a processor of this architecture. To execute the specification or program 31 written in C, a specific process or program function Cs in the program is implemented as a dedicated circuit so that the program runs efficiently. That is, the specification 31 is divided into an application program 32 in C and a function 33 to be converted to a dedicated circuit. The application program 32 includes a part Cg 34 composed of instructions (PU instructions) for general-purpose processing, and instructions (VU instructions) for starting the dedicated circuit. The application program 32 is converted by the C compiler 35 into an assembler instruction set executable by the processor, and the program code 4a for execution is generated. The operation of the extracted program function 33, on the other hand, is analyzed (behavioral synthesis 36), and a dedicated data path is designed or developed. In this way, the VUPU 10 including the basic processor PUX 3 and the dedicated circuit VU 1, and the program code 4a executed by the VUPU 10, are generated.
[0033]
Therefore, to simulate the operation in the developed VUPU 10 of the program code for execution 4a (hereinafter, code in this state is also included in the term application program), an RTL that models the function of PUX3 and an RTL that models the function of VU1 can be generated, and the program 4a can be simulated using these RTLs as a platform. In this specification, an application program means a program at any stage that includes an instruction set for operating a processor, and covers the program before compilation, after compilation, and in intermediate form. In other words, the application program means every program that is a target of simulation.
However, as described above, RTL-based simulation is not practical because it takes a very long time. On the other hand, if the simulation is performed at the assembler level, that is, the instruction set level, the function of the program 4a can be confirmed, but the simulation cannot capture how the state of the VUPU 10 actually changes in each cycle or clock. This is the first problem.
[0034]
The second problem is that a data path instruction (VU instruction) for executing processing dedicated to VU1 can be an instruction such as "detect a certain bit pattern from a signal sequence" in, for example, image processing or network processing. For such an instruction, the number of cycles used or consumed in the dedicated circuit is data-dependent and cannot be known in advance. For this reason, simulation is difficult even with a simulator that uses RTL as a platform.
[0035]
First, the first problem is explained in a little more detail. As shown in FIG. 3, in the basic processor PUX3, the instruction cycle of one instruction set, such as ADD, is divided into a plurality of pipeline stages for processing. FIG. 3A shows the instruction cycle of a three-stage pipeline RISC. One instruction set is processed in a fetch and decode cycle (F&D cycle or F&D stage) 51, an execute cycle (execution cycle or execution stage) 52, and a cycle for writing back to a memory or the like (WB cycle or WB stage) 53. As shown in FIG. 3B, in a four-stage pipeline RISC, the fetch and decode cycle 51 is further divided into a fetch cycle 51a and a decode cycle 51b. Hereinafter, for simplicity, an example in which a three-stage pipeline RISC is adopted as the basic processor PUX3 will be described. In this case, in the PUX3, the F&D cycle 51 of the nth instruction I(n), the execution cycle 52 of the (n-1)th instruction I(n-1), and the WB cycle 53 of the (n-2)th instruction I(n-2) proceed together.
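The stage overlap of the three-stage pipeline can be sketched in C as follows. This is only an illustration, not the implementation of the invention: the names `instr_in_stage`, `total_cycles`, and `print_schedule` are hypothetical, and the sketch assumes a stall-free pipeline in which I(n) enters the F&D stage in cycle n.

```c
#include <stdio.h>

/* Stage indices for the three-stage pipeline of FIG. 3A. */
enum { STAGE_FD = 0, STAGE_EX = 1, STAGE_WB = 2, NUM_STAGES = 3 };

/* Returns the 1-based number of the instruction that occupies `stage`
 * in 1-based cycle `cycle`, or 0 if the stage is empty (the pipeline is
 * still filling or already draining).
 * Assumption of this sketch: no stalls, so I(n) enters F&D in cycle n. */
int instr_in_stage(int cycle, int stage, int num_instr)
{
    int n = cycle - stage;
    return (n >= 1 && n <= num_instr) ? n : 0;
}

/* Number of cycles needed to run num_instr instructions without stalls. */
int total_cycles(int num_instr)
{
    return num_instr > 0 ? num_instr + NUM_STAGES - 1 : 0;
}

/* Prints one line per cycle; I(0) denotes an empty stage. */
void print_schedule(int num_instr)
{
    for (int c = 1; c <= total_cycles(num_instr); c++) {
        printf("cycle %d: F&D=I(%d) EX=I(%d) WB=I(%d)\n", c,
               instr_in_stage(c, STAGE_FD, num_instr),
               instr_in_stage(c, STAGE_EX, num_instr),
               instr_in_stage(c, STAGE_WB, num_instr));
    }
}
```

Under these assumptions, six instructions occupy `total_cycles(6)` = 8 cycles, and in cycle 3 the F&D, execution, and WB stages hold I(3), I(2), and I(1) respectively.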
[0036]
FIG. 4A shows the instruction model of a conventional instruction set simulator (ISS), in which the instruction sets from the first instruction I(1) to I(6) are processed in order. This poses no problem as long as the simulation only needs to confirm that each instruction set executes according to its function. However, once one tries to evaluate the processing when an I/O 59 from the outside or from VU1 occurs for PUX3, the state immediately becomes indeterminate. That is, as shown in FIG. 4B, each instruction set is divided into three pipeline stages and processed in cycle units (clock units) while partially overlapping. When an I/O 59 described in clock units is input, the model of FIG. 4A treats the I/O 59 as occurring instruction by instruction, whereas in the model of FIG. 4B there can be 8 timings. Subsequent processing may then differ for each timing.
[0037]
For example, FIG. 5A shows the operation when the input signal 59 from the I/O occurs in the second cycle after processing starts. In FIG. 5A, an I/O signal 59 is input in the execution cycle 52 of the first instruction I(1), is processed by the second instruction I(2), and the result is output as an I/O signal 59 in the WB cycle 53 of the third instruction I(3). In contrast, FIG. 5B is an instruction-unit model in which the head of each instruction is aligned with the F&D cycle 51. In this case (case 1), the second instruction I(2) cannot capture the I/O signal, so the above processing is not realized. FIG. 5C shows a case (case 2) in which the head of the instruction is aligned with the execution cycle 52. The I/O signal 59 is received during execution of the first instruction, and the I/O signal 59 can consequently be output by way of the second and third instructions, but the timing of the signal output by the third instruction differs. FIG. 5D shows a case (case 3) in which the head of the instruction is aligned with the WB cycle 53, but here the second instruction I(2) cannot sense the input of the I/O signal 59 at all.
[0038]
In this way, all I/O signals 59 and interrupts are triggered in synchronization with the clock, that is, on a clock-cycle or pipeline-cycle basis, so handling these signals correctly requires modeling the instructions decomposed on that cycle basis. Therefore, in the present invention, as shown in FIG. 6, providing a simulator 60 that includes an ISS main body 61 capable of management in cycle units enables simulation based on an instruction model decomposed along the pipeline cycles. That is, as shown in FIG. 7, the conventional ISS models function in C language in units of instructions. In the simulator 60 of the present invention, by contrast, the instruction set I is divided into cycle units, namely the F&D unit 51, the execution unit 52, and the WB unit 53, and each pipeline stage is modeled in C language. The ISS main body 61 then performs simulation while evaluating each cycle of each instruction set I. Consequently, in simulating the three-stage pipeline RISC, three instruction sets are always evaluated in one cycle. Although the simulation speed is therefore lower than that of the conventional ISS, what is simulated is not merely the assembler code on a C language simulator but the situation in which the application program 4a described in assembler code actually operates on the hardware.
[0039]
When the instruction model is managed in cycle units in the ISS main body 61, it can be divided, as shown in FIG. 8, into a C language model unique to each stage of each instruction set (first information) 57 and a C language model common to the stages of each instruction set (second information) 58. For example, by describing a model of the I/O signals for each cycle as the common model, operation based on an I/O signal can be simulated correctly with respect to cycle boundaries. However, this method increases the amount of description of the second information, which is the common part. Therefore, in the simulator 60 of this example, the ISS main body 61 is provided with a mechanism 61a that manages time in cycle units so that simulation proceeds in cycle units, a portion 61b in which the second information common to the cycles or pipeline stages, such as processing for I/O signals, is described, and a mechanism 61c that responds based on the second information.
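The split between stage-specific models (first information) and a common model (second information) might look as follows in C. All names here (`CpuState`, `InstrModel`, `common_io_model`, `run_stage`, and the increment instruction `INC_MODEL`) are hypothetical and chosen only for illustration; the source does not specify the data structures.

```c
#include <stddef.h>

/* Hypothetical simulator state shared by all stage models. */
typedef struct {
    int io_in_pending;  /* nonzero when an I/O signal arrives in this cycle */
    int io_seen_cycle;  /* cycle in which the common model latched the I/O  */
    int acc;            /* a single accumulator register                    */
} CpuState;

typedef void (*StageModel)(CpuState *);

/* First information: one model per pipeline stage of an instruction. */
typedef struct {
    StageModel stage[3];  /* [0] F&D, [1] execution, [2] WB */
} InstrModel;

/* Second information: processing common to every stage, here the
 * latching of an I/O signal at the cycle boundary. */
void common_io_model(CpuState *s, int cycle)
{
    if (s->io_in_pending) {
        s->io_seen_cycle = cycle;
        s->io_in_pending = 0;
    }
}

/* Evaluates one stage of one instruction in the given cycle:
 * the common model first, then the stage-specific model. */
void run_stage(const InstrModel *m, int stage, CpuState *s, int cycle)
{
    common_io_model(s, cycle);
    if (m->stage[stage] != NULL)
        m->stage[stage](s);
}

/* Stage models of a hypothetical increment instruction. */
void inc_fd(CpuState *s) { (void)s; }      /* nothing visible in F&D  */
void inc_ex(CpuState *s) { s->acc += 1; }  /* the work happens in EX  */
void inc_wb(CpuState *s) { (void)s; }      /* nothing visible in WB   */

const InstrModel INC_MODEL = { { inc_fd, inc_ex, inc_wb } };
```

Keeping the I/O handling in one common function means each instruction only has to describe what is unique to its stages, which is the reduction in description that the mechanisms 61b and 61c aim at.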
[0040]
Furthermore, once the instruction set I is determined, the unique model (first information) for each pipeline stage of that instruction set I is also determined. Therefore, the simulator 60 of this example is provided with a PU instruction library 62 that supplies a model of each pipeline stage for each instruction set, so that each pipeline stage of each instruction is converted into a model that the ISS main body 61 can manage, that is, into the first information. For this reason, although the application program supplied to the simulator 60 is in the form of an instruction set, the simulation can be performed in cycle units. Even though the model to be simulated is divided and described in cycle units, the PU instruction library 62 reduces the amount that the simulation model itself must describe.
[0041]
Therefore, in the simulator 60 of this example shown in FIG. 6, a cycle-unit simulation model is constructed in the ISS main body 61 by referring to the PU instruction library 62 on the basis of the application program 4a described in assembler code, and simulation can be performed while managing in cycle units. To reduce the load on the ISS main body 61 and increase execution speed, it is desirable to use a compiler or compiler program 65 that converts assembler code, which is an instruction set, into a simulation model 66 expressed as a description divided into pipeline stages, and thus to prepare in advance a simulation model 66 in which the instruction set is described in cycle units.
[0042]
The simulator 60 of this example further includes a VU instruction library 63 that supplies the processing in the VU1 activated by a VU instruction as information that the ISS main body 61 can manage in cycle units. Therefore, by defining the number of cycles consumed by a VU instruction in the VU instruction library 63, the other problem described above can easily be solved, namely that when a VU instruction such as "detect a specific bit pattern from a signal sequence" in image processing or network processing is executed, the number of cycles used or consumed by that instruction cannot be known in advance. For this reason, in the simulator 60 of this example, PU instructions can be simulated in cycle units by managing, cycle by cycle, the models of each pipeline stage supplied from the PU instruction library 62, and for VU instructions the number of cycles consumed in VU1 can be simulated by means of the VU instruction library 63.
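A data-dependent cycle count of the kind described here can be illustrated with a C model of a bit-pattern detector that reports the cycles it consumed along with its result. This is a sketch under stated assumptions: the function `vu_detect_pattern`, the `VuResult` type, and the cost of one cycle per input bit examined are all hypothetical and not taken from the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of the hypothetical VU model: where the pattern was found
 * (bit index of its first bit, or -1) and how many cycles were consumed. */
typedef struct {
    int found_at;
    int cycles;
} VuResult;

/* Scans `bits` (one bit per array element, value 0 or 1) for the
 * `pat_len`-bit pattern held in the low bits of `pattern`, most
 * significant bit first. Requires 1 <= pat_len <= 32.
 * Assumption of this sketch: the dedicated circuit examines one input
 * bit per cycle and stops as soon as the pattern is seen, so the cycle
 * count depends on the data. */
VuResult vu_detect_pattern(const uint8_t *bits, size_t n,
                           uint32_t pattern, int pat_len)
{
    VuResult r = { -1, 0 };
    uint32_t mask = (pat_len >= 32) ? UINT32_MAX
                                    : ((UINT32_C(1) << pat_len) - 1u);
    uint32_t window = 0;

    for (size_t i = 0; i < n; i++) {
        window = ((window << 1) | (bits[i] & 1u)) & mask;
        r.cycles++;  /* one cycle per bit shifted in */
        if (i + 1 >= (size_t)pat_len && window == (pattern & mask)) {
            r.found_at = (int)(i + 1 - (size_t)pat_len);
            break;
        }
    }
    return r;
}
```

For the input bits 0,1,0,1,1,0,1 and the 3-bit pattern 011, this model finds the pattern at bit index 2 after 5 cycles, while a pattern that never occurs consumes all 7 cycles.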
[0043]
Therefore, the simulator 60 of this example can fully simulate, with C language models and in cycle or clock units, both the I/O or interrupt processing between VU1 and PU2 inside the VUPU 10 and the I/O or interrupt processing with the outside of the VUPU 10. It is thus possible to provide a simulator capable of simulating hardware operation at high speed.
[0044]
Furthermore, the simulator 60 of this example is provided with the PU instruction library 62 and the VU instruction library 63 separately. The PU instruction library 62 describes, in cycle units, the processing in the basic processor PUX3, which becomes the embedded processor when the VUPU 10 is realized, and is almost fixed.
[0045]
On the other hand, the VU instruction library 63 reflects the processing in the VU1, which varies with the user specifications to be realized by the VUPU 10, and is likely to differ each time. Therefore, by separating out the VU instruction library 63 so that it can be designed for each user or by the user, the user's specifications can be incorporated with almost no effect on the rest of the simulator 60. For this reason, a simulator 60 capable of simulating the VUPU 10 can be developed economically and in a short time with the configuration of this example.
[0046]
The VU instruction library 63 of this example is obtained by converting the specification or program function Cs portion described in C language into C++ language, or by compiling the converted description. By using variable type declarations, the data length handled becomes variable, and redundant bit lengths in the C language description can be removed. The VU instruction library 63 is therefore functionally interchangeable with the RTL that realizes VU1. Consequently, the VUPU 10 can be simulated accurately under conditions close to those of the actual device, and at a higher speed than with an RTL simulator. As with the PU instruction library 62, if the ISS main body 61 includes a C compiler or C++ compiler, the VU instruction library 63 may be a library written in C or C++ language; if it does not, the library must be compiled in advance.
[0047]
Furthermore, the simulator 60 includes a library 64 for pseudo instructions that are not executed by an actual processor. Since such a pseudo instruction is used mainly for evaluation and is handled in the same way as a VU instruction, it will be referred to as a pseudo VU instruction. A pseudo VU instruction defines processing that inputs data to the ISS main body 61 or outputs expected values in order to evaluate the process or result of a simulation. During this processing, the simulation of the VUPU is held. That is, the ISS main body 61 stops counting cycles and executes the pseudo VU instruction in zero cycles. From the standpoint of simulating PU instructions, the pseudo instruction library 64 can be regarded as supplying zero-cycle dedicated instructions, and it is therefore handled in the simulator 60 in the same manner as the VU instruction library 63. It can thus be said that the simulator 60 is equipped with a pseudo VU for debugging in addition to the VU. In the actual processor 10, the pseudo VU instruction is also handled like a VU instruction, so when this pseudo instruction (E instruction) is fetched, only a NOP instruction is issued to PU2. VU1 performs no processing because the instruction is not a VU instruction addressed to it. The pseudo VU instruction is therefore an instruction that need not be removed after the simulation is completed.
[0048]
FIG. 9 shows an outline of the design method when the pseudo VU instruction is introduced. Here the design is considered in three layers: the third layer (layer 3) of the entire description 31 in C language, the second layer (layer 2) of the simulator 60 based on the cycle-based ISS, and the first layer (layer 1) of the VUPU 10. The figure shows the effect of being able to introduce simulation by the cycle-based ISS 61 (layer 2) between the C language (layer 3) and the RTL-based VUPU 10 (layer 1). In this example, the entire description 31 in C language (ANSI-C is assumed) includes a test description, that is, descriptions such as input of values from a data file and comparison of output values against expected values. If visual debugging is performed, functions such as graphic output routines are also included in the test description that defines this test or debug function.
[0049]
The test description Ce is a part that is not mapped onto the VUPU 10, which is the form implemented on silicon, and is implemented as a pseudo VU that is executed only in the ISS. Like the range covered by the VU instructions, the range covered by the pseudo VU instructions is extracted from the C description 31, compiled by the C compiler 100 into a pseudo VU instruction library 64, or object, that runs with the ISS main body 61, and incorporated into the simulator 60. The pseudo VU instruction library 64 is therefore an output of the C compiler 100 for the computer environment on which the ISS 61 runs. The interface IFe with the range Cg covered by the PU instructions is the pseudo VU instruction.
[0050]
The range Cg described by PU instructions is, as explained with FIG. 2 and elsewhere, the C language description of the portion that runs on the basic processor PU2; after being compiled by the C compiler 101 for the PU, it becomes the simulation-level program 4a executed by the ISS 61. The program 4a converted into binary is used when the basic processor PU is provided in RTL form. That is, it becomes the object program of the VUPU 10.
[0051]
If the description range Cs covered by the VU instruction has bit redundancy after being extracted from the entire description 31, the C description is converted into a C++ description, and the redundancy is removed by using variable type declarations (using a class library) to obtain a library 102 of optimal bit description. The library is then compiled by the C++ compiler 103 to generate the VU instruction library 63 incorporated in the simulator 60. The interface IFs with the range Cg covered by the PU instructions is the VU instruction, and this is the same in the VUPU 10.
[0052]
The C++ description library 102 from which bit redundancy has been removed is also converted into C language conforming to the input style of the CtoRTL behavioral synthesis tool 104 and used as the behavioral synthesis input. Since the CtoRTL tool 104 normally outputs the converted RTL and also reports the number of cycles of that RTL, the VU instruction library 63 takes in the reported number of cycles. Adding the number of cycles to the function given by the C++ description without bit redundancy and supplying it to the ISS main body 61 enables cycle-based simulation that includes VU1.
[0053]
It is also possible to proceed with the design in this way and to link the finally generated RTL of VU1 with the ISS 61 for simulation. This is effective as a verification phase in the final stage of design.
[0054]
FIG. 10 shows the process of decomposing the C description 31 including pseudo VU instructions and executing the cycle-based ISS simulation. In this figure, parts of the entire description 31 are replaced with two pseudo VU instructions and one VU instruction. The content of the test description 37 replaced with the pseudo VU instructions is, for example, first a description that opens an input file and reads test data, and next a description that compares the output result against the expected value. The portion 33 replaced with the VU instruction is, for example, a C description of signal processing that is to be accelerated by dedicated hardware. The remaining part 34, executed as PU instructions, is a description subject to general software processing, and is generated as a program 32 that includes PU instructions, a VU instruction, and pseudo VU instructions. This program is compiled by the PU compiler 101, and the VU instruction and pseudo VU instructions are invoked as assembler calls at that point, so the compiled result is a mixed list 4a of PU instructions and VU instructions (including pseudo VU instructions), which is read sequentially into the ISS main body 61 and executed.
[0055]
The test description 37 is preferably implemented as a pseudo VU that is executed only in the ISS 61. These test descriptions are necessary for simulation and debugging but are normally excluded from the hardware. It is therefore desirable to set their cycle count to zero so that only the test function is executed, in zero cycles. The test description 37 is compiled by the compiler 100 determined by the computer environment in which the ISS 61 is developed and runs. For example, if the ISS 61 runs on a certain OS, the test description is compiled by a C compiler that runs on that OS (for example, on a Sun OS the freeware gcc is available) and is mounted as the pseudo VU library 64. The mounted pseudo VU library 64 is called from the PU side by a pseudo VU instruction, which is processed in the same manner as a VU instruction, and then returns.
[0056]
The C description 33 of the VU instruction, which is to be implemented in hardware as a dedicated instruction, is an instruction defined by the user when the VUPU processor is developed, and is mounted in the form of the library 63 by the method described above. When the pseudo VU instruction library 64 and the VU instruction library 63 have been prepared in this way, the ISS main body 61 sequentially reads the instruction sets of the assembled code 4a. If the instruction is a PU instruction, the PU library 62 prepared as a model is read and executed; if it is a VU instruction, the VU library 63 is read and executed. Similarly, if it is a pseudo VU instruction, the pseudo VU library 64 is read and executed.
[0057]
FIG. 11 is a flowchart showing the main processes in the program that realizes the simulator 60. When the program that realizes the ISS main body 61 is started, the nth instruction set of the application program 4a to be simulated is acquired in step 71. If the instruction set I is determined in step 79 to be a pseudo VU instruction, processing to start the pseudo VU is performed in step 79a. The pseudo VU is a VU that operates only in the ISS main body 61 and, as described above, performs input/output for debugging or other processing by referring to the pseudo VU instruction library 64.
[0058]
In this pseudo VU processing, cycles are not counted. Therefore, when the pseudo VU ends, the other processes that simulate the state of the VUPU are not performed, and the process returns to step 71 to fetch the next instruction. The simulation state can thus be evaluated without affecting the state of VU1 or PU2. Since a pseudo VU instruction can be inserted into the program code 4a at the same level as VU instructions and PU instructions, the state to be evaluated in the simulation can be defined at the program level. For this reason, the simulation can be evaluated efficiently and easily.
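The zero-cycle treatment of pseudo VU instructions can be sketched as a dispatch loop in C. The types and the function `run_program` are hypothetical, and the cycle accounting is deliberately simplified: each PU instruction is charged one cycle (steady-state pipeline throughput) and each VU instruction is charged serially the cycle count supplied by its library entry, which is an assumption of this sketch rather than the behavior stated in the source.

```c
typedef enum { KIND_PU, KIND_VU, KIND_PSEUDO_VU } InstrKind;

typedef struct {
    InstrKind kind;
    int vu_cycles;  /* cycles from the VU library; used only for KIND_VU */
} Instr;

typedef struct {
    long cycles;       /* cycles charged to the simulated VUPU            */
    int pseudo_calls;  /* pseudo VU instructions executed in zero cycles  */
} RunStats;

/* Walks the mixed instruction list and accumulates the cycle count.
 * A pseudo VU instruction runs its test or debug hook (only counted
 * here) while the cycle counter is held, then the loop fetches the
 * next instruction. */
RunStats run_program(const Instr *prog, int n)
{
    RunStats st = { 0, 0 };
    for (int i = 0; i < n; i++) {
        switch (prog[i].kind) {
        case KIND_PSEUDO_VU:
            st.pseudo_calls++;  /* no cycles are counted */
            break;
        case KIND_VU:
            st.cycles += prog[i].vu_cycles;
            break;
        case KIND_PU:
            st.cycles += 1;
            break;
        }
    }
    return st;
}
```

Because pseudo VU instructions add nothing to `cycles`, a program yields the same cycle count with or without them, which is why they can stay in the program code after evaluation.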
[0059]
If the instruction set I is determined in step 72 to be a VU instruction, processing to start VU1 is performed in step 73. As shown in FIG. 11, if the program that provides the function of the VU instruction library 63 only counts the cycles in VU1, then in step 73 the count C is cleared and counting is started.
[0060]
The program of the VU instruction library 63 initializes the count C in step 91, counts up the count C at each cycle timing, and includes a step 93 that determines whether the count C has reached the value C0 at which processing in VU1 completes. While counting is in progress, step 94 returns information indicating that VU1 is operating to the ISS main body 61; when counting is complete, step 95 returns information indicating that VU1 has stopped to the ISS main body 61. Therefore, in the program of the ISS main body 61, the status of VU1 in each cycle can be managed by querying the VU instruction library 63 in step 74 following step 72.
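The counting behavior of the VU instruction library described for FIG. 11 can be sketched in C as follows. The type and function names (`VuCounter`, `vu_start`, `vu_tick`) are hypothetical; only the roles of the count C, the completion value C0, and steps 91 and 93 to 95 come from the text.

```c
typedef enum { VU_STOPPED = 0, VU_RUNNING = 1 } VuStatus;

typedef struct {
    int count;  /* the count C            */
    int limit;  /* the completion value C0 */
} VuCounter;

/* Step 91: initialize the count when the VU instruction starts VU1. */
void vu_start(VuCounter *vu, int c0)
{
    vu->count = 0;
    vu->limit = c0;
}

/* Called by the simulator once per cycle. Counts up, then compares the
 * count with C0 (step 93) and reports "operating" (step 94) or
 * "stopped" (step 95). */
VuStatus vu_tick(VuCounter *vu)
{
    if (vu->count < vu->limit)
        vu->count++;
    return (vu->count < vu->limit) ? VU_RUNNING : VU_STOPPED;
}
```

With C0 = 3, successive calls to `vu_tick` report operating, operating, and then stopped, so the ISS side sees VU1 busy for exactly three cycles.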
[0061]
Further, if the acquired instruction set I is determined in step 75 to be a PU instruction, the ISS main body 61 acquires, in step 76, the F&D cycle model (first information) of the nth PU instruction from the C model 81 prepared in the PU instruction library 62, and executes the model specific to that stage. At the same time, the model common to the stages (second information), which covers external input/output and input/output from the VU, is executed. Next, in step 77, the model of the execution cycle of the (n-1)th PU instruction is acquired from the C model 81 of the PU instruction library 62, and the model specific to the stage and the common model are executed. Further, in step 78, the model of the WB cycle of the (n-2)th PU instruction is acquired from the C model 81 of the PU instruction library 62, and the model specific to the stage and the common model are executed. The order in which steps 76, 77, and 78 are processed does not matter, and they may be performed in parallel.
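Steps 76 to 78 amount to evaluating three different instructions in each simulated cycle. A minimal C sketch of that loop follows; `run_cycles`, `StageFn`, and the `Trace` callbacks are hypothetical names, the stage models only count how often they run, and the common model and stalls are omitted.

```c
#include <stddef.h>

/* A stage model receives the 1-based instruction number and a context. */
typedef void (*StageFn)(int instr, void *ctx);

/* Example context that records how often each stage model was run. */
typedef struct {
    int fd, ex, wb;  /* evaluations per stage                 */
    int last_wb;     /* last instruction that was written back */
} Trace;

void trace_fd(int instr, void *ctx) { (void)instr; ((Trace *)ctx)->fd++; }
void trace_ex(int instr, void *ctx) { (void)instr; ((Trace *)ctx)->ex++; }
void trace_wb(int instr, void *ctx)
{
    Trace *t = (Trace *)ctx;
    t->wb++;
    t->last_wb = instr;
}

/* Simulates num_instr PU instructions on a three-stage pipeline and
 * returns the number of cycles. In the cycle in which I(n) is in F&D,
 * I(n-1) is in execution and I(n-2) is in WB. */
long run_cycles(int num_instr, const StageFn stage[3], void *ctx)
{
    long cycles = 0;
    if (num_instr <= 0)
        return 0;
    for (int n = 1; n <= num_instr + 2; n++) {
        if (n <= num_instr)
            stage[0](n, ctx);      /* step 76: F&D of I(n)   */
        if (n - 1 >= 1 && n - 1 <= num_instr)
            stage[1](n - 1, ctx);  /* step 77: EX of I(n-1)  */
        if (n - 2 >= 1 && n - 2 <= num_instr)
            stage[2](n - 2, ctx);  /* step 78: WB of I(n-2)  */
        cycles++;
    }
    return cycles;
}
```

Passing `{ trace_fd, trace_ex, trace_wb }` with six instructions runs each stage model six times over eight cycles; replacing the trace callbacks with real stage models would turn this skeleton into a cycle-unit evaluator.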
[0062]
Thus, in the simulator 60 of this example, the ISS main body 61 performs the simulation of one cycle by executing the models of the respective pipeline stages of the nth, (n-1)th, and (n-2)th instruction sets. In the simulator 60, therefore, the cycle-decomposed simulation model is simulated on a cycle basis in the ISS main body 61, which enables high-speed simulation using the fast-executing C language while maintaining cycle-level accuracy equivalent to that of a conventional RTL-based simulator. For this reason, it is possible to provide a simulator capable of simulating hardware several hundred times faster.
[0063]
Further, for the simulation program that functions as the ISS main body 61, a library program is provided as the PU instruction library 62; for each pipeline stage of an instruction set, it performs the process of converting that stage into information the simulation program can manage, that is, into a model. With the library program, a model for managing a pipeline stage can be obtained before that pipeline stage is simulated. Preparing the library program also greatly reduces the amount that must be described in the application program. The model of each pipeline stage may be acquired immediately before that model is executed, as in this example, or the models may be acquired in a batch for each instruction to be simulated or for each cycle to be simulated. Alternatively, the models of the pipeline stages can be acquired from the library for a whole application program, and a simulation model including information on all pipeline stages can be generated in advance.
[0064]
Further, in the simulator 60 of this example, a model related to processing common to each pipeline stage is prepared in the ISS main body 61 and is executed in units of cycles together with the models of each pipeline stage. Although it is possible to reduce the description of the model in units of stages by this method, it is also possible to provide a model including common processing as the PU instruction library 62.
[0065]
The simulator 60 includes a VU instruction library 63 in addition to the PU instruction library 62. Therefore, for a processor such as the VUPU 10 that has a dedicated processing unit VU1 specialized for a data path, even when the number of cycles a VU instruction consumes in VU1 is data-dependent, the VU instruction library 63 can count the number of cycles consumed by the VU instruction and notify the ISS main body 61. The state of VU1 can thus also be converted into information that the ISS main body 61 can manage in cycle units. There are several ways for the ISS main body 61 to learn the state of the VU1 simulated by the VU instruction library 63. The ISS main body 61 may notify the VU instruction library 63 of each cycle event so that the VU instruction library 63 counts the cycles. Alternatively, the VU instruction library 63 may simply notify the ISS main body 61 that processing in VU1 has completed, so that the number of cycles in VU1 is managed as a result.
[0066]
For example, in the simulator 60 shown in FIG. 12, the VU instruction library 63 includes a routine 63a that returns the state of VU1 in each cycle to the ISS main body 61. The ISS main body 61 can therefore manage VU1 in cycle units by monitoring the state of VU1 cycle by cycle and counting the cycles until processing ends. In other words, only the end-of-processing information of VU1 needs to be notified from the VU instruction library 63 to the ISS main body 61, and the cycles may be counted on either side, by the ISS main body 61 or by the VU instruction library 63 on the data path instruction side. Of course, the state of VU1 supplied from the VU instruction library 63 to the ISS main body 61 need not be limited to end information; the ISS main body 61 can also be notified of progress in VU1 processing that affects the processing of PU instructions, allowing a more accurate simulation. The simulator 60 shown in FIG. 12 does not include a pseudo VU instruction library, but even without one, the cycle-based ISS 61 can still perform a simulation equivalent to one using an RTL model many times faster.
[0067]
FIG. 13 shows the details of the design procedure according to the present invention. A system or program having means or processes capable of executing these procedures can be provided as an automatic design system or automatic design program that operates in a suitable computer environment. Therefore, according to the present invention, an automatic design system for system LSIs can be constructed that starts from a high-level language such as C language or JAVA (registered trademark) language.
[0068]
In the first design stage 111, an expected output value of the C description 31 describing the initial specifications is created and used as a source of the test function in the subsequent design work. The compiler at this stage is a general-purpose C compiler (gcc).
[0069]
The next design stage 112 is porting to the PU. The C description 31 is compiled by pcc 101, the C compiler for PU2 (PUX3). At this time, in step 121, a test description such as comparison with the expected output value can be attached as a pseudo VU instruction to the program compiled by pcc 101. The pseudo VU instruction is compiled by pcc 101 together with the other instructions, but the function executed by the pseudo VU instruction (the pseudo VU), that is, the function of the test description, is compiled by the general-purpose compiler (gcc) 100 to generate the pseudo VU library.
[0070]
The program (C description) incorporating the test circuit (pseudo VU) that operates according to the pseudo VU instruction is then verified in step 122 using the ISS 61. Because the test function has been ported by means of the pseudo VU instruction, it can be determined immediately whether the program has been ported correctly to the program compiled by pcc 101. In this design stage 112, the clock consumption information for each part of the C source code, which the ISS 61 reports as an ISS profiler, is used to judge which parts should be converted into a VU (that is, into hardware) to increase speed.
[0071]
The next design stage 113 is extraction of the parts suited to VU1 and their conversion into a VU, that is, creation of the VU instruction library. For this purpose, the part to be converted into a VU is extracted in step 123 and replaced with a VU instruction in step 124, yielding a program that can be compiled by pcc 101 and that includes VU instructions, PU instructions, and pseudo VU instructions. Meanwhile, the extracted part is made into a library in step 125. Since a VU instruction library that runs in the ISS 61 can be generated even in C language, a simulation step 126 is inserted at this stage; if it is not needed, it may be skipped and simulation performed only at the next stage. The compiler used to make the VU instruction library usable in step 126 is a general-purpose C compiler, for example gcc. At this stage too, the pseudo VU instruction is used unchanged as the test description.
[0072]
In the design stage 114, redundant bits are removed using the C++ language in step 127, and in step 128 the resulting C++ VU instruction library 63 is mounted on the simulator 60 and the ISS 61 is run to check the number of cycles consumed. This requires knowing the number of cycles of VU1, so the next design stage 115 is executed to obtain the number of cycles, which is fed back to the VU instruction library 63. At this stage too, the pseudo VU instruction is used unchanged as the test description.
[0073]
As described above, in ordinary C language the data length is, for example, 32 bits, and the algorithm that forms the specification is developed and verified within that 32-bit range. In the hardware system that is actually operated, however, 32 bits are often unnecessary. If the redundancy is not removed in such a case, an execution circuit containing useless hardware is generated. Therefore, in this design stage 114, it is extremely important to remove redundancy in a language other than C, here the C++ language.
[0074]
In addition, it is important to perform verification using the C++ language library 63 from which redundancy has been removed. Whether too much has been removed must be confirmed by simulation, and it is extremely significant that this can be confirmed in stage 114 using the cycle-based ISS 61. Without the cycle-based ISS 61, this confirmation could only be made by RTL-based simulation, which takes about 1,000 times longer than simulation with the cycle-based ISS 61.
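The check for excessive removal can be illustrated by comparing a reduced-width computation against the 32-bit reference. The source removes redundancy through C++ variable type declarations; the sketch below instead emulates a narrower data path by masking in plain C, which is a substituted technique used only for illustration. The function names and the summation algorithm are hypothetical.

```c
#include <stdint.h>

/* Truncates v to its low `bits` bits (1 <= bits <= 32), emulating a
 * data path that is narrower than the 32-bit C type. */
uint32_t wrap_to_bits(uint32_t v, int bits)
{
    return bits >= 32 ? v : (v & ((UINT32_C(1) << bits) - 1u));
}

/* Example algorithm: accumulate n samples on a `bits`-wide data path. */
uint32_t sum_with_width(const uint16_t *x, int n, int bits)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = wrap_to_bits(acc + x[i], bits);
    return acc;
}

/* Returns 1 if the reduced width reproduces the 32-bit reference for
 * this test data, 0 if too many bits were removed. */
int width_is_sufficient(const uint16_t *x, int n, int bits)
{
    return sum_with_width(x, n, bits) == sum_with_width(x, n, 32);
}
```

For three samples of value 200 the reference sum is 600, so a 10-bit accumulator is sufficient while a 9-bit accumulator wraps around to 88 and fails the check. Such a check only covers the test data supplied, which is why the text stresses running it as a simulation.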
[0075]
One could also argue that the original specification 31 should be written in C++ from the outset, in a form that introduces no redundancy. That is one approach, but it has the following disadvantage. The size of the generated object differs between C++, an object-oriented language, and C, an ordinary procedural language: the object generated from the object-oriented language is about 30 to 50% larger. This means that the object size of the program to be run on the PU, that is, the part compiled by the PU compiler 101, becomes very large.
[0076]
For an embedded processor such as the VUPU 10, reducing the memory size on silicon is extremely important for both cost and power consumption, and in this respect performing the initial design in C has a great advantage. The VU description, on the other hand, becomes hardware, so passing through C++ does not increase the object size as described above. Rather, it offers the great advantage that simulation can be performed with redundant bit lengths removed through type conversion.
[0077]
The next design stage 115 is behavioral synthesis using the CtoRTL tool. The C++ library 63 examined by ISS simulation is converted in step 129 into a C description that can be behaviorally synthesized, and is synthesized in step 130. The synthesis reports the RTL and the cycle count, and the cycle count is fed back to the upstream ISS simulation (step 128 or 126) as described above. The VU library 63, once it incorporates the cycle counts reported in stage 115, can be replaced by the RTL generated by the CtoRTL behavioral synthesis tool.
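The feedback of cycle counts into the VU library can be sketched roughly as follows. All names here (`VuEntry`, `VuLibrary`, `set_cycles`, `execute`) are hypothetical, and the two-operand interface is an assumption made only to keep the example small.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical VU instruction library entry: the behavior of one VU
// instruction plus the number of cycles it consumes.
struct VuEntry {
    std::function<std::uint32_t(std::uint32_t, std::uint32_t)> behavior;
    unsigned cycles;  // placeholder until behavioral synthesis reports a value
};

class VuLibrary {
public:
    void add(const std::string& name, VuEntry e) {
        entries_[name] = std::move(e);
    }

    // Feed back a cycle count reported by behavioral synthesis.
    // Returns false if the instruction is not in the library.
    bool set_cycles(const std::string& name, unsigned cycles) {
        auto it = entries_.find(name);
        if (it == entries_.end()) return false;
        it->second.cycles = cycles;
        return true;
    }

    // Execute one VU instruction and add its cost to the running cycle count.
    // Throws std::out_of_range if the instruction is not in the library.
    std::uint32_t execute(const std::string& name, std::uint32_t a,
                          std::uint32_t b, unsigned long& cycle_counter) const {
        const VuEntry& e = entries_.at(name);
        cycle_counter += e.cycles;
        return e.behavior(a, b);
    }

private:
    std::map<std::string, VuEntry> entries_;
};
```

In this sketch the functional behavior of an instruction stays fixed while only its cycle cost is updated, so an ISS run before and after the feedback gives the same results but a different cycle total.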
[0078]
Design stage 116 is the final stage. In step 131, the RTL of the VU 1 and the program containing PU instructions, VU instructions, and pseudo VU instructions are mixed-simulated or co-simulated using the ISS 61, to perform final verification of the completed RTL of the VU 1. In this simulation the parent is normally RTL: the parent RTL is started, and the child ISS and the child VU RTL are connected at the parent level. Because simulation at this stage includes RTL, the simulation speed drops, but it is worth running once as a final confirmation. In addition, the pseudo VU instructions embedded in the program operate unchanged on the ISS 61, so the same test environment can be used.
[0079]
The RTL of the VU 1 whose operation has finally been confirmed is logic-synthesized in step 132, output as a netlist in step 133, and implemented in silicon (as a circuit).
[0080]
As described above, simulation using the ISS 61 is repeated at several points in this series of stages (steps 122, 126, 128, and 131). The test description based on the pseudo VU instructions introduced in step 121 of the first stage 111, in particular the expected-value checking function, remains embedded in the simulated program and is used through stage 116, the final RTL verification phase. Being able to use the same test description from the beginning to the end of a design method with multiple simulation stages is extremely effective for verifying an implementation that changes gradually at each stage. A major advantage is that there is no risk of carrying the design forward with an erroneous circuit.
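The reuse of one test description across implementation stages can be sketched as follows. The instruction encoding, the names `Kind`, `Instr`, `VuImpl`, and `run`, and the single-cycle PU instruction are all assumptions for illustration; the specification does not define them.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// All names below are hypothetical; the specification gives no encoding.
enum class Kind { PuAddImm, VuOp, PseudoExpect };

struct Instr {
    Kind kind;
    std::uint32_t operand;  // immediate for PuAddImm, expected value for PseudoExpect
};

struct RunResult {
    unsigned long cycles = 0;    // cycles consumed by real instructions only
    unsigned failed_checks = 0;  // expected-value mismatches found by pseudo instructions
};

// One VU implementation: its behavior and its cycle cost.
struct VuImpl {
    std::function<std::uint32_t(std::uint32_t)> op;
    unsigned cycles;
};

// Runs the same program against whichever VU implementation is plugged in.
inline RunResult run(const std::vector<Instr>& program, const VuImpl& vu) {
    RunResult r;
    std::uint32_t acc = 0;
    for (const Instr& in : program) {
        switch (in.kind) {
        case Kind::PuAddImm:
            acc += in.operand;
            r.cycles += 1;  // assumed single-cycle PU instruction
            break;
        case Kind::VuOp:
            acc = vu.op(acc);
            r.cycles += vu.cycles;
            break;
        case Kind::PseudoExpect:
            // Zero-cycle pseudo instruction: checks the result but never
            // changes the cycle count.
            if (acc != in.operand) ++r.failed_checks;
            break;
        }
    }
    return r;
}
```

Because the expected-value check costs no cycles, the same program can be run against an early functional model, a refined model with reported cycle counts, or a faulty one, and only the cycle total and the check result differ.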
[0081]
In the above, the simulation method according to the present invention has been described in terms of the simulator 60, which can be realized by a simulation program. The simulation method can, however, also be executed by other means of realization such as hard-wired logic. Even so, it is important that the simulation method of the present invention can be provided as a simulation program that is written in a high-level language such as C or JAVA (registered trademark) and can be executed at high speed. By installing the simulation program according to the present invention, recorded on a suitable recording medium such as a CD-ROM or provided via a computer network such as the Internet, on suitable hardware resources such as a personal computer or a workstation, the functions of the VUPU 10 can be simulated at high speed.
[0082]
The VUPU 10 combines a PU, which is a general-purpose processor, with a VU implemented as a dedicated circuit. Compared with a conventional custom LSI in which the entire specification is implemented as dedicated circuitry, it can provide, in a short time and at low cost, a custom LSI with performance equal to or better than the conventional one, and adopting the simulator of the present invention shortens the development period further. The simulator of the present invention is not limited to the VUPU: it can be applied to the design and development of general-purpose processors and of any program executed by a processor (referred to in this specification as an application program), improving simulation accuracy and shortening the simulation period.
[0083]
【Effect of the invention】
As described above, the present invention provides a simulator having a cycle-based ISS body that simulates with a model in which the instruction set is decomposed into pipeline cycles. Because the simulator is written in a high-level language such as C and can be executed at high speed, it can perform hardware simulation on a cycle basis. The simulation method of the present invention can therefore provide a simulator, or a data processing system functioning as a simulator, whose accuracy is equivalent to that of a conventional RTL-based simulator and which, while remaining cycle-based, increases simulation speed by several hundred to several thousand times. The present invention thus greatly shortens the processor design period and improves design quality, which has a very large effect on the cost competitiveness of the design.
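A cycle-based model of the kind summarized above can be sketched as follows. This is a simplified illustration, not the simulator of the specification: the four stage names, the single multi-cycle execute stage, and the function `simulate` are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed 4-stage pipeline; the stage count and names are illustrative only.
constexpr std::size_t kStages = 4;
enum Stage : std::size_t { kFetch = 0, kDecode = 1, kExec = 2, kWriteBack = 3 };

struct Instr {
    unsigned exec_cycles;  // cycles the instruction occupies the execute stage
};

// Advances the pipeline one clock at a time and returns the number of cycles
// needed to retire every instruction in `program`.
inline unsigned long simulate(const std::vector<Instr>& program) {
    constexpr int kEmpty = -1;
    std::array<int, kStages> slot;  // index of the instruction held by each stage
    slot.fill(kEmpty);
    unsigned exec_left = 0;         // cycles the execute stage still needs
    std::size_t next = 0, retired = 0;
    unsigned long cycles = 0;

    while (retired < program.size()) {
        ++cycles;
        // Work done during this cycle.
        if (slot[kFetch] == kEmpty && next < program.size())
            slot[kFetch] = static_cast<int>(next++);
        if (slot[kExec] != kEmpty) --exec_left;

        // End of cycle: advance stages, last stage first.
        if (slot[kWriteBack] != kEmpty) { ++retired; slot[kWriteBack] = kEmpty; }
        if (slot[kExec] != kEmpty && exec_left == 0) {
            slot[kWriteBack] = slot[kExec];
            slot[kExec] = kEmpty;
        }
        if (slot[kDecode] != kEmpty && slot[kExec] == kEmpty) {
            slot[kExec] = slot[kDecode];
            slot[kDecode] = kEmpty;
            // Treat a cost of 0 as 1 so the stage always takes a cycle.
            exec_left = std::max(
                1u, program[static_cast<std::size_t>(slot[kExec])].exec_cycles);
        }
        if (slot[kFetch] != kEmpty && slot[kDecode] == kEmpty) {
            slot[kDecode] = slot[kFetch];
            slot[kFetch] = kEmpty;
        }
    }
    return cycles;
}
```

In this model two single-cycle instructions retire in 5 cycles, while making the first one occupy the execute stage for 3 cycles raises the total to 7. An instruction-level simulator that charges a fixed cost per instruction would not capture that stall.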
[0084]
Furthermore, with the simulation method of the present invention, even for a processor such as the VUPU that has a dedicated processing unit specialized for a data path, the number of cycles that a dedicated instruction consumes in the dedicated processing unit is counted cycle by cycle in the form of a library, so the hardware can be managed on a cycle basis.
[0085]
Furthermore, by introducing pseudo VU instructions that are not executed by the processor, in particular zero-cycle pseudo VU instructions, a test circuit effective for evaluation and verification can be used in the form of pseudo VU instructions from the initial design stage through the final RTL generation stage. In addition, because the processing performed by VU instructions is provided as a library, the library can be converted into another language, C++ in the above description, and redundant bits can be removed very easily. Since the simulator of the present invention is a cycle-based ISS, it can be connected to the RTL generated in the final stage for simulation, and can thus be used all the way through final verification.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an outline of a data processing apparatus (VUPU) including a PU and a VU.
FIG. 2 is a diagram illustrating a state in which a VUPU is developed based on a specification described in C language.
FIG. 3 is a diagram illustrating a state in which an instruction set is divided into pipeline stages and executed.
FIG. 4 is a diagram illustrating a case where an instruction set is evaluated in units of instructions and a case where evaluation is performed in units of cycles divided into pipeline stages.
FIG. 5 is a diagram illustrating some cases in which an instruction set is evaluated in units of instructions compared to a case in which evaluation is performed in units of cycles.
FIG. 6 is a diagram showing an outline of a simulator of the present invention.
FIG. 7 is a diagram showing a comparison between an instruction unit model and a pipeline stage unit model as an instruction set model.
FIG. 8 is a diagram showing a model for each pipeline stage divided into a model for each stage and a common model.
FIG. 9 is a diagram showing an outline of a design method in which a pseudo VU instruction is introduced.
FIG. 10 is a diagram showing a state in which a VUPU is developed based on a specification described in C language by introducing a pseudo VU instruction.
FIG. 11 is a flowchart showing functions of the simulator shown in FIG. 6 as an ISS body and a library.
FIG. 12 is a diagram showing an example of a different simulator of the present invention.
FIG. 13 is a diagram showing an outline of a design procedure according to the present invention.
[Explanation of symbols]
1 Dedicated processing unit (VU)
2 General-purpose processing unit (PU)
3 General-purpose processor (PUX)
4 Code RAM
4a Application program
5 Fetch unit (FU)
10 Data processing unit (VUPU)
11 Execution unit
60 Simulator
61 ISS body
62 PU instruction library
63 VU instruction library

Claims

1. A data processing system comprising a simulator that simulates the operation of an application program on a processor, wherein
the processor includes a general-purpose processing unit that executes general-purpose processing through a plurality of pipeline stages, and
a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit, and
the simulator comprises:
a general-purpose instruction library that, if an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, converts the general-purpose instruction into a model of each pipeline stage of the general-purpose processing unit;
a dedicated instruction library that, if an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, converts the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and
cycle simulation means for simulating the operation of the application program in units of cycles of the pipeline stages of the general-purpose processing unit, using the models provided by the general-purpose instruction library and the dedicated instruction library.

2. The data processing system according to claim 1, wherein the general-purpose instruction library converts the general-purpose instruction into a first model for each of the pipeline stages of the general-purpose processing unit and a second model common to the pipeline stages.

3. The data processing system according to claim 1, wherein the dedicated instruction library converts the dedicated instruction into the model that supplies the information including the number of cycles consumed in the dedicated processing unit by the dedicated instruction.

4. The data processing system according to claim 1, wherein the dedicated instruction library converts the dedicated instruction into the model that supplies, as the information, the state of the dedicated processing unit in units of the cycle.

5. The data processing system according to claim 1, wherein the simulator further comprises a pseudo-instruction library that, if an instruction included in the application program is a pseudo-instruction that is not executed by the processor, converts the pseudo-instruction into an instruction that causes the cycle simulation means to execute processing that does not count the cycles.

6. The data processing system according to claim 5, wherein the pseudo-instruction defines processing that evaluates, without counting the cycles, a result of simulating the processing of the general-purpose processing unit and/or the dedicated processing unit.

7. A method by which a simulator simulates the operation of an application program on a processor, wherein
the processor includes a general-purpose processing unit that executes general-purpose processing through a plurality of pipeline stages, and
a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit, and
the method comprises:
a first conversion step of converting, if an instruction included in the application program is a general-purpose instruction defining processing in the general-purpose processing unit, the general-purpose instruction into a model of each pipeline stage of the general-purpose processing unit;
a second conversion step of converting, if an instruction included in the application program is a dedicated instruction defining processing in the dedicated processing unit, the dedicated instruction into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and
a cycle-level simulation step of simulating the operation of the application program on the processor in units of cycles of the pipeline stages of the general-purpose processing unit, using the models obtained by the first conversion step and the second conversion step.

8. The method according to claim 7, wherein in the first conversion step, the general-purpose instruction is converted into a first model for each of the pipeline stages of the general-purpose processing unit and a second model common to the pipeline stages.

9. The method according to claim 7, wherein in the second conversion step, the dedicated instruction is converted into the model that supplies the information including the number of cycles consumed in the dedicated processing unit by the dedicated instruction.

10. The method according to claim 7, wherein in the second conversion step, the dedicated instruction is converted into the model that supplies, as the information, the state of the dedicated processing unit in units of the cycle.

11. The method according to claim 7, further comprising a third conversion step of converting, if the instruction included in the application program is a pseudo-instruction that is not executed by the processor, the pseudo-instruction into an instruction that, in the cycle-level simulation step, executes processing that does not count the cycles.

12. The method according to claim 11, wherein the pseudo-instruction defines processing that evaluates, without counting the cycles, a result of simulating the processing of the general-purpose processing unit and/or the dedicated processing unit.

13. The method according to claim 7, comprising a first simulation stage and a second simulation stage, wherein
in the first simulation stage, the application program includes the general-purpose instruction, based on a specification that defines the operation of the processor, and the pseudo-instruction, and
in the second simulation stage, the portion of the application program that is to be executed by the dedicated instruction is replaced with the dedicated instruction.

14. A program for causing a computer to simulate the operation of an application program on a processor, wherein
the processor includes a general-purpose processing unit that executes general-purpose processing through a plurality of pipeline stages, and
a dedicated processing unit that is specialized for specific data processing and includes a data path different from the pipeline stages of the general-purpose processing unit,
the computer comprises
a general-purpose instruction library that converts a general-purpose instruction defining processing in the general-purpose processing unit into a model of each pipeline stage of the general-purpose processing unit, and
a dedicated instruction library that converts a dedicated instruction defining processing in the dedicated processing unit into a model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit, and
the program has instructions for causing the computer to execute:
a first conversion process in which, if an instruction included in the application program is the general-purpose instruction, the general-purpose instruction library converts the general-purpose instruction into the model of each pipeline stage of the general-purpose processing unit;
a second conversion process in which, if an instruction included in the application program is the dedicated instruction, the dedicated instruction library converts the dedicated instruction into the model that supplies information corresponding to the cycles of the pipeline stages of the general-purpose processing unit; and
a cycle-level simulation process of simulating the operation of the application program in units of cycles of the pipeline stages of the general-purpose processing unit, using the models provided by the general-purpose instruction library and the dedicated instruction library.

15. The program according to claim 14, wherein the first conversion process converts the general-purpose instruction into a first model for each of the pipeline stages of the general-purpose processing unit and a second model common to the pipeline stages.

16. The program according to claim 14, wherein the computer further comprises a pseudo-instruction library that converts a pseudo-instruction that is not executed by the processor into an instruction for executing, in the cycle-level simulation process, processing that does not count the cycles, and
the program further has an instruction for executing a conversion process in which, if the instruction included in the application program is the pseudo-instruction, the pseudo-instruction library converts the pseudo-instruction into an instruction for executing processing that does not count the cycles.

17. The program according to claim 16, wherein the pseudo-instruction defines processing that evaluates, without counting the cycles, a result of simulating the processing of the general-purpose processing unit and/or the dedicated processing unit.

18. The program according to claim 14, wherein in the second conversion process, the dedicated instruction is converted into the model that supplies, as the information corresponding to the cycles of the pipeline stages, information including the number of cycles consumed in the dedicated processing unit by the dedicated instruction.

19. The program according to claim 14, wherein in the second conversion process, the dedicated instruction is converted into the model that supplies, as the information, the state of the dedicated processing unit in units of the cycle.