JP3740321B2 - Data processing device

Info

Publication number: JP3740321B2
Application number: JP16781299A
Authority: JP
Inventors: 直幹三ッ石
Original assignee: Renesas Technology Corp
Current assignee: Renesas Technology Corp
Priority date: 1999-06-15
Filing date: 1999-06-15
Publication date: 2006-02-01
Anticipated expiration: 2019-06-15
Also published as: JP2000357089A

Description

【０００１】
【発明の属する技術分野】
本発明は、マイクロコンピュータ、マイクロコントローラ、中央処理装置（ＣＰＵ）等のデータ処理装置に関し、例えば、半導体集積回路化された、機器組込み型のマイクロコンピュータに適用して有効な技術に関するものである。
【０００２】
【従来の技術】
半導体集積回路で成るマイクロコンピュータは、アドレス空間の拡張や、命令セットの拡大、高速化などが図られてきた。マイクロコンピュータのＣＰＵは、ソフトウェアによって、その機能が定義されているから、アドレス空間の拡張や、命令セット拡大、高速化などを図ったマイクロコンピュータにおいても、既存のマイクロコンピュータのソフトウェア資産を有効に利用できることが望ましい。
【０００３】
このため、オブジェクトレベルで互換性を保ちつつ、アドレス空間の拡張や、命令セット拡大を実現した例として、例えば、特開平６−５１９８１号公報、平成５年６月（株）日立製作所発行『Ｈ８／３００Ｈシリーズプログラミングマニュアル』などの記載がある。この中で、いわゆるロードストア型アーキテクチャを採用することが、命令セットの拡張を図る上で有効であることを示されている。かかるロードストア型アーキテクチャは、所謂ＲＩＳＣ（ＲｅｄｕｃｅｄＩｎｓｔｒｕｃｔｉｏｎＳｅｔＣｏｍｐｕｔｅｒ）型のマイクロコンピュータの普及と共に、採用される場合が多くなっている。
【０００４】
また、上記ＣＰＵのように２ステートで基本命令を実行していたものと互換性を保ちつつ１ステートで基本命令を実行するように高速化し、さらに、ＣＰＵとは独立した乗算器を内蔵して高速化を図った例として、特開平８−２６３２９０号公報、平成７年３月（株）日立製作所発行『Ｈ８Ｓ／２６００シリーズＨ８Ｓ／２０００シリーズプログラミングマニュアル』に記載のものがある。このＣＰＵにおいては、命令の基本単位は１６ビット（ワード）とされ、データバス幅も１６ビットである。即ち、１回のバスアクセスで１ワード分の命令リードが可能になる。特に、前記平成７年３月（株）日立製作所発行『Ｈ８Ｓ／２６００シリーズＨ８Ｓ／２０００シリーズプログラミングマニュアル』ｐｐ２６８〜２７７には命令実行時のバス状態について記載がある。
【０００５】
上記のＣＰＵにおいて、１ワードの命令コード長を持つレジスタ間演算命令では、１ワードの命令リード（プリフェッチを行なうため、次の命令リード）とＰＣのインクリメント（＋２）を行なうとともに、ＣＰＵ内部で汎用レジスタの内容を読み出して、演算を行い、結果を汎用レジスタに書込む。
【０００６】
一方、１６ビットイミディエイトデータと汎用レジスタとの演算命令では、１６ビットイミディエイトデータを命令中に含むため、命令コード長は２ワードになる。２回の命令リード（自命令の第２ワード及び次の命令リード）と２回のＰＣのインクリメントを行なう。この処理に２ステートを必要とする。一方、ＣＰＵ内部で１６ビットイミディエイトデータと汎用レジスタの内容を読み出して、演算を行い、結果を汎用レジスタに書込む。内部の演算動作については、上記レジスタ間演算命令と同様に１ステートで実行可能である。即ち、１ステートは、命令リードとＰＣインクリメントのみを行い、別の１ステートでは、命令リードとＰＣインクリメントと内部の演算処理を行なっている。
【０００７】
そのほかの、複数ワードの命令コード長を持つ命令の実行も、命令リードとＰＣインクリメントのみを行うステートと、命令リードとＰＣインクリメントと内部の演算処理やデータアクセスを行なうステートから構成される。
【０００８】
上記のような技術によって、単位時間のＣＰＵまたはマイクロコンピュータの処理能力が向上する。これにより、マイクロコンピュータの応用システムの制御処理を高速化でき、マイクロコンピュータによって制御される機器の高速化や高機能化、高精度化、或は、従来複数の半導体集積回路（マイクロコンピュータ）で構成したものを、統合したりすることによる小型化などを図ることができる。また、特に、割込みに対する応答時間を短縮することによって、種々の機器を制御する場合の時間的な精度、いわゆるリアルタイム性を向上することができる。
【０００９】
【発明が解決しようとする課題】
本発明者は、互換性を維持して、ソフトウェア資産を有効利用できるようにしつつ、また、論理的物理的規模の増大を最小限にし、かつ消費電力の増大も最小限にして、機器組み込み型マイクロコンピュータ等のデータ処理装置を更に高速処理可能にすることを検討した。
【００１０】
ＣＰＵの動作を高速化するためには、１命令実行のために必要なステート数を減少させること、半導体集積回路で成るようなシングルチップマイクロコンピュータの動作速度乃至動作周波数を向上させること、に大別することができる。後者の動作速度乃至動作周波数は、半導体集積回路の微細化、トランジスタの高速化などによって実現することができる。しかしながら、動作速度乃至動作周波数は、必然的に消費電流の増加を招く。従って、消費電流の増大を最小限にするには、各命令実行のために必要なステート数を減少させる方がよい。
【００１１】
なお、機器の高速化や高機能化、小型化は、アドレス空間が比較的小さく命令セットが比較的小さいＣＰＵやマイクロコンピュータにおいても要求されるから、前記アドレス空間の広いＣＰＵとアドレス空間の小さいＣＰＵが存在する場合には、その双方に対して高速化を図ることが望ましい。
【００１２】
また、互換性を維持することによって、アセンブラ、Ｃコンパイラ、シミュレータ／デバッガなどのクロスソフトウェアツールを共通に利用できるようになる。即ち、これらの開発環境は既存のものが使用できるから、開発環境を逸早く用意することができる。
【００１３】
通常のマイクロコンピュータは、命令を読み込んで動作するから、自ずから命令をリードすることが必要になる。前記の通り、データアクセスも行なうが、データをアクセスするためには、これを指示する命令を読み込む必要がある。また、データアクセスを行なわない命令もあるから、バスサイクルの内、命令のリードの方が、データアクセスよりも多い。例えば、命令リード８０％、データアクセス２０％の場合もある。
【００１４】
命令のリードを高速にするためには、データバス幅を命令の単位語長よりも大きくすればよい。例えば命令の単位語が１６ビットのとき、３２ビットに広げればよい。１回のバスアクセスで２ワード分の命令リードが可能になる。
【００１５】
しかしながら、単純に命令のリードを高速にしても、リードした分の命令を実行（消費）しなければ、意味がない。
【００１６】
１６ビット固定長の命令セットと３２ビットデータバスを採用したマイクロコンピュータとして、５段のパイプラインを構成する場合、命令フェッチは２命令を１バスサイクルで実行できるので、命令フェッチのステージを１回おきに空けることができる。空いたステージにメモリアクセスを行なうことができる。
【００１７】
However, the inventor found that the approach of gaining speed by increasing the number of bits fetched per instruction read has the following problems.
【００１８】
パイプラインを深くすると、分岐命令や割込み例外処理のように、プログラムの流れが変わると、それ以降のパイプラインを解消し、新たにパイプラインを埋め直さなければならない。機器制御用のマイクロコンピュータにおいては、分岐命令が比較的多いし、割込みも多く発生するため、パイプラインの段数が大きくなって、分岐命令や割込み例外処理の実行時間が向上できないのでは好ましくない。
【００１９】
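Furthermore, a 32-bit data bus can read or write, in one access, the four bytes that start at an address that is a multiple of 4. If compatibility is to be maintained with an existing CPU that reads instructions in 16-bit units, however, a 16-bit fixed-length instruction format cannot be adopted, and one-word instructions, three-word instructions, and so on end up mixed together. In other words, instructions cannot be aligned to the unit of the 32-bit data bus. For a two-word instruction, for example, the CPU would presumably have to determine whether the instruction sits at a multiple-of-4 address or not, and hence whether one instruction read or two are needed. If two instruction reads are performed, the execution time may end up no better than that of the existing CPU. Moreover, examining the instruction's own address and providing control for both the one-read and two-read cases tends to increase the logic scale.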
【００２０】
一方、いわゆるマイクロプロセッサとしては、平成７年１月（株）日経ＢＰ社発行『日経エレクトロニクス』ｐｐ６８〜８０「１９９８年に転機、ハードウェアを単純化してＶＬＩＷへ」などに記載されているように、スーパスカラやＶＬＩＷなどのように高速化が図られている。いずれも同時に実行できる処理の数を増やして（例えば４並列）、全体的な処理性能を向上するものである。しかしながら、上記の技術では、複数命令を並列して実行するため、制御回路・演算器などのＣＰＵの資源を複数セット持つことになり、物理的・論理的規模の増大を招く。
【００２１】
また、機器組み込み型のマイクロコンピュータ（またはマイクロコントローラ）においては、各種の機器の状態を参照しつつ、処理の内容を変えていく。機器の状態を参照するために、データアクセスを行なうし、処理の内容を変えるために、分岐命令を実行する。従って、局所的なプログラムを繰り返し実行するようなことは、マイクロプロセッサに比較して、少ない。また、制御対象に応じて、命令の実行順序に制約があったり、局所的な命令実行時間に制約がある場合がある。ソフトウェア的に矛盾がないからといって、必ずしも、命令の順序を変更できるとは限らない。更には、制御するシステムによっては、消費電力を小さく抑える必要がある場合もある。
【００２２】
上記のように、分岐の処理が多いから、スーパスカラなどの手法を用いて、複数命令並列処理可能としても、分岐命令が存在して、一旦処理した結果を放棄せざるを得ない場合も生じる。また、条件分岐命令は、条件分岐命令が参照する分岐条件が確定するまで、分岐の判定ができないから、条件分岐命令が参照する分岐条件を生成する命令と条件分岐命令とは同時には実行できず、その先の処理も実行できない。条件分岐命令が存在する度に、命令の並列処理ができなくなってしまう。分岐予測や投機的実行を行なえば、実際に処理すべき動作と異なる動作を行なってしまう可能性も無視できない。機器制御の場合には、条件分岐命令を組合せて（ツリー状に構成して）、多数の分岐先の中から分岐先を判定することが多いから、分岐予測に適さないし、予測した分岐先に、更に条件分岐が存在することも多いと考えられる。結果的に分岐予測は、ヒット率が非常に低くなってしまい易い。また、分岐予測や投機的実行によって、平均的な処理時間は向上できるかも知れないが、個別の局所的な処理を高速化することは期待できないと考えられる。
【００２３】
すなわち、複数の命令を並列実行可能としても、無駄になり易い。並列実行するための論理が無駄になり、論理的物理的な資源を有効に利用できない。また、無駄な論理や無駄な動作は消費電力を不所望に増大させてしまう。
【００２４】
結局、機器組み込み制御用のマイクロコンピュータについては、スーパスカラやＶＬＩＷなどのような並列処理を行なっても、ソフトウェア資産の有効利用を図り難いし、また、実際には、マイクロプロセッサほどの高速化も困難である。少なくとも、論理的物理的規模の割に高速化が図り難く、消費電力を増大させてしまう。
【００２５】
なお、同じ演算の繰り返し処理と、モータ制御などの機器制御とでは、ＣＰＵ乃至はマイクロコンピュータ毎に処理性能の傾向が異なることが、平成１０年２月（株）ＣＱ出版社発行『インターフェイス』ｐｐ１３４〜１４５「組み込み用ＣＰＵのパフォーマンスの徹底研究」などに記載されている。この観点からも、適当な費用の、機器組み込み制御に適した既存のＣＰＵ乃至マイクロコンピュータの互換性を維持し、既存のアーキテクチャを継承し、費用の増加を適正にし、高速化を図っていく必要のあることが本発明者によって明らかにされた。
【００２６】
本発明の目的は、機器制御に好適なマイクロコンピュータ等のデータ処理装置の高速化を図ることにある。
【００２７】
本発明の別の目的は、既存のＣＰＵとの互換性を維持して、既存のソフトウェアの利用を可能にしつつ、かつ、論理的な規模の増加を最小限にして、製造費用の増加を最小限にして、ＣＰＵの処理性能を向上させることにある。
【００２８】
本発明の前記並びにその他の目的と新規な特徴は本明細書の記述及び添付図面から明らかになるであろう。
【００２９】
【課題を解決するための手段】
本願において開示される発明のうち代表的なものの概要を簡単に説明すれば下記の通りである。
【００３０】
すなわち、既存のデータ処理装置例えばＣＰＵに対して、内部データバス幅を、少なくとも命令の基本単位（例えばワード）よりも大きくし、リードした命令を複数単位保持することができる命令レジスタを持ち、この命令レジスタに存在する命令の量を監視する手段を設ける。（既存の）命令を、実行の基本単位時間（ステートと称する）に従って、命令のリード制御（プログラムカウンタのインクリメントの制御を含む）のみを行なうステート（第1の動作）と、実効アドレスの計算やデータの演算処理の制御を含むステート（第2の動作）に分割し、既にリードした命令の状況に応じて、命令のリードのみの制御を行なうステートを省略可能にする。換言すれば、前記命令レジスタに存在する既に読み込んだ命令の量に応じた前記監視手段の指示に従い、前記命令のリードのみの制御を行なうステートを省略（スキップ）する。
【００３１】
各命令の実行時の命令リードの量を、自命令の命令長に対して、多くしたり、少なくしたりする。これを、リード済み乃至リード実行中の命令の量に従って制御する。
【００３２】
上記によれば、内部データバス幅を命令の基本単位（ワード）よりも大きくすることによって、一度にリードする命令の量を、既存のＣＰＵより、大きくできる。
【００３３】
基本の実行ステートの場合、既存のＣＰＵと同様に、自分の命令コード長に対応した回数の命令リードを行なうことにより、前記ステートの省略（スキップ）を行なわない場合には、実行した自命令の命令コードの量より、リードした命令の量を大きくして、リード済みの命令コードを蓄積できる。
【００３４】
一方、省略（スキップ）を行なって、命令リードを行なわないことにより、実行した自命令の命令コードの量と、リードした命令の量を同等にして、リード済みの命令コードの量を維持したり、実行した自命令の命令コードの量より、リードした命令の量を少なくして、リード済みの命令コードの量を減少できる。
【００３５】
これによって、リード済みの命令の量を所定の範囲内に収めつつ（命令のリードの量と、命令の実行の量のバランスを採りつつ）、命令のリードを高速化して、全体の命令実行時間を短縮することができる。
【００３６】
また、省略（スキップ）するステートを自動的に変えることによって、命令の配置の変更に対応できる。
【００３７】
命令レジスタには、命令コードとともに、その命令コードのアドレス（ＩＡＢ）上のビット１についての情報を格納し、命令デコーダで同時に判定するとよい。そのようなビット１の値は、ワードのような基本単位のアドレスが４の倍数であるか４の倍数でない偶数であるかを示している。これにより、制御を容易にし、命令デコーダの論理的規模の増大を最小限にすることができる。
【００３８】
分岐命令や割込み例外処理の先頭アドレスのリードを除き、命令のリードは３２ビット単位で行い、プログラムカウンタのインクリメントは＋４とする。分岐命令や割込み例外処理の先頭アドレスが４の倍数番地のときは、同様に、命令のリードは３２ビット単位で行い、プログラムカウンタのインクリメントは＋４とする。分岐命令や割込み例外処理の先頭アドレスが４の倍数番地でないときは、先頭の命令のリードは１６ビット単位で行い、プログラムカウンタのインクリメントは＋２とする。プログラムカウンタのインクリメントの＋２又は＋４（＋２／＋４）を自動的に制御するようにインクリメンタを構成するには、例えば、ビット０をバイトアドレスとするとき、インクリメンタのビット１を、その入力と１との論理和によって得られる値にし、ビット１へのキャリーを与えるようにすればよい。
【００３９】
このように、分岐命令や割込み例外処理の先頭アドレスのリードを除き、命令のリードは３２ビット単位で行い、また、プログラムカウンタのインクリメントは＋２／＋４を自動的に行なうことにより、制御を容易にし、論理規模の増加を抑止することができる。
【００４０】
ライトデータバッファにプログラムカウンタの内容を格納するようにし、更に、ライトデータバッファをＦＩＦＯ（First-In First-Out：先入れ先出し）構造とし、また、前記プログラムカウンタのインクリメンタと同様に前記ビット１をセットする回路を持つとよい。これにより、実際にリードした命令のアドレスとプログラムカウンタに保持した内容との食い違いが大きくなり、また、一意に決まらなくても、サブルーチン分岐命令時に、待避すべきプログラムカウンタの値をライトデータバッファから容易に得ることができる。かつ、制御を容易にし、論理規模の増加を抑止することができる。
【００４１】
実効アドレスの計算やデータの転送処理が複数のステートに亘って動作する場合も、制御自体は１度に行い、制御信号に遅延を設けるなどして、実際の動作を複数のステート（例えば、アドレス計算を最初のステート、リードデータの格納を次のステート）で行なうようにする。
【００４２】
分岐命令や割込み例外処理などの場合、最低限１ワード分のプリフェッチが完了した時点で、分岐先の先頭命令のデコードを開始し、実行することにより、分岐命令や割込み例外処理による処理時間の損失を最小限にし、いわゆる応答性、ひいては、リアルタイム性を向上できる。
【００４３】
独立した内部バスを持つなどして、制御信号を遅延させるべき動作（例えば、リードデータの格納）が、次の制御動作（プログラムカウンタ・インクリメント）と重なっても同時に動作可能なように、実行手段を構成する。
【００４４】
基本単位時間で処理可能な命令（省略可能なステートを持たない命令）は、算術論理演算器（ＡＬＵ）等の演算手段を複数設けて、重なった時間を持ちながら動作可能にする。独立した内部バスを持つなどして、一方の演算手段の動作と、命令リードのための動作とが重なっても同時に動作可能なように、実行手段を構成する。それぞれの演算手段を制御するための制御回路を持つ。一方の制御回路は一方の演算手段を制御して全ての命令の制御を行なうようにし、他方の制御回路は他方の演算手段の制御を専ら行なうようにする。
【００４５】
このように、演算手段と制御回路のみを複数設けるとともに、一方の制御回路は全ての命令の制御を行なうようにし、他方の制御回路は演算手段の制御を専ら行なうようにすることにより、論理規模の増加を抑止することができる。
【００４６】
前置コードのように、制御信号のみを発生する命令コードは、命令をリードし、命令レジスタに保持した状態で、前置コードであることを検出し、所望の制御信号を発生する検出回路及び制御信号発生回路を設けることにより、スキップ可能にし、スキップ時には制御信号のみを発生するようにする。これによって、命令の実行時間を短縮できる。また、前置コードであることのみを検出し、検出した結果に基づいて制御信号を発生すればよいから、前記検出回路と制御信号発生回路の論理的規模を最小限にすることができる。
【００４７】
オブジェクトレベルで互換性を保ちつつ、アドレス空間の広い（命令セットの大きい）ＣＰＵとアドレス空間の小さい（命令セットの小さい）ＣＰＵが存在する場合には、アドレス空間の広いＣＰＵで、上記高速化を実現して、下位互換性をもつ、アドレス空間の小さいＣＰＵにも存在する命令について、同様に上記高速化を可能にできる。換言すれば、同一の方法で、オブジェクトレベルで互換性を保ちつつ、アドレス空間の広いＣＰＵとアドレス空間の小さいＣＰＵでも高速化を可能にできる。オブジェクトレベルで互換性を保つことによる利点と高速化を可能にすることの利点の双方を享受することができる。
【００４８】
既存の命令を実行可能にし、内部動作の順序なども同等にしているから、既存のＣＰＵと比較して、将来拡張余裕を大きく損なうことがない。例えば、既存のＣＰＵに対して、新たな命令の追加が可能になった場合には、かかる技術を、本発明を適用したＣＰＵにも用いることができると考えられる。命令セットの互換性を維持していれば、機械語としては、既存のＣＰＵと同じ命令を追加することはできる。また、追加命令も、複数の実行ステート数を持つものであれば、固有の動作を行なう部分と省略可能なステートとに分け、後者を必要に応じて省略することは可能とすることはできる。少なくとも、必要に応じて命令のリードとプログラムカウンタ・インクリメントを禁止することができ、既存ＣＰＵと同等の処理時間では実現可能である。追加命令が１ステートで実行可能であれば、複数個設けた演算手段（例えばＡＬＵ，ＡＬＵＳ）の交互の動作などによって高速化を実現できる。
【００４９】
既存のＣＰＵと同じ命令セットとすることにより、アセンブラ、Ｃコンパイラ、シミュレータ／デバッガなどの開発ツール、いわゆるクロスソフトウェアを共通にすることができる。クロスソフトウェアを共通化することによって、早く開発環境を整えることができる。
【００５０】
【発明の実施の形態】
図２には本発明が適用されたシングルチップマイクロコンピュータの一例が示される。
【００５１】
シングルチップマイクロコンピュータ１は、全体の制御を司るＣＰＵ２、割込コントローラ（ＩＮＴ）３、ＣＰＵ２の処理プログラムなどを格納するメモリであるＲＯＭ４、ＣＰＵ２の作業領域並びにデータの一時記憶用のメモリであるＲＡＭ５、タイマ６、タイマ７、シリアルコミュニケーションインタフェース（ＳＣＩ）８、Ａ／Ｄ変換器９、システムコントローラ（ＳＹＳＣ）１０、第１入出力ポート（ＩＯＰ[１]）１１乃至第９入出力ポート（ＩＯＰ[９]）１９、クロック発振器（ＣＰＧ）２０の機能ブロック乃至はモジュールから構成され、公知の半導体製造技術により１つの半導体基板（半導体チップ）上に形成される。
【００５２】
かかるシングルチップマイクロコンピュータ１は、電源端子として、グランドレベル（Ｖｓｓ）、電源電圧レベル（Ｖｃｃ）、アナロググランドレベル（ＡＶｓｓ）、アナログ電源電圧レベル（ＡＶｃｃ）、アナログ基準電圧（Ｖｒｅｆ）を有し、更に、専用制御端子として、リセット（ＲＥＳ）、スタンバイ（ＳＴＢＹ）、モード制御（ＭＤ０、ＭＤ１）、クロック入力（ＥＸＴＡＬ、ＸＴＡＬ）の各端子を有する。
【００５３】
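The single-chip microcomputer 1 operates in synchronization with a reference clock (system clock) generated from a crystal resonator connected to the EXTAL and XTAL terminals of the CPG 20, or from an external clock input to the EXTAL terminal. One period of this reference clock is called a state.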
【００５４】
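The functional blocks of the single-chip microcomputer 1 are interconnected by an internal bus 21. The internal bus 21 comprises an internal address bus, an internal data bus, and an internal control bus. The internal control bus carries a read signal, a write signal, a bus size signal, the system clock, and the like. Of the internal data bus, the portion between the CPU 2 and the ROM 4, which stores the program of the CPU 2, is 32 bits wide. Although not a particular limitation, in the example of FIG. 2 the RAM 5 is likewise interfaced through a 32-bit bus. The external bus may also be a 32-bit bus.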
【００５５】
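The internal address bus exists in two forms that differ in phase, IAB and PAB, and the internal data bus likewise exists as IDB and PDB. In a read, for example, PAB lags IAB by 0.5 state. PAB and PDB are synchronized. IDB lags PDB by 0.5 state. IAB and PAB, and IDB and PDB, are buffered by a bus controller (not shown). The functional blocks and modules are read and written by the CPU 2 through the internal bus. The on-chip ROM 4 and RAM 5 are interfaced with the CPU 2 through IAB and IDB and can be read or written in one state. The control registers of the timer 6, timer 7, SCI 8, A/D converter 9, IOP[1] 11 to IOP[9] 19, and CPG 20 are collectively called internal I/O registers. They are connected to PAB and PDB. Although not a particular limitation, PDB is 16 bits wide. The reason is twofold: the internal I/O registers are distributed among the functional blocks, so connecting them with a 32-bit bus would lengthen the total bus wiring and tend to increase the physical scale; and the meaningful data in the internal I/O registers (in each functional block) is 8 to 16 bits wide, so there is little need for 32-bit access.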
【００５６】
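The input/output ports 11 to 19 share their terminals with the address bus, the data bus, bus control signals, and the input or input/output terminals of the timers 6 and 7, the SCI 8, and the A/D converter 9. That is, the timers 6 and 7, the SCI 8, and the A/D converter 9 each have input/output signals, which are exchanged with the outside through terminals shared with the input/output ports. For example, IOP[5] 15, IOP[6] 16, and IOP[7] 17 double as the input/output terminals of the timers 6 and 7, and IOP[8] 18 doubles as the input/output terminals of the SCI 8. The analog data input terminals are shared with IOP[9] 19.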
【００５７】
上記シングルチップマイクロコンピュータ１にリセット信号ＲＥＳが与えられると、ＣＰＵ２を始めとし、シングルチップマイクロコンピュータ１はリセット状態になる。このリセットが解除されると、ＣＰＵ２は所定のアドレスからスタートアドレスをリードして、このスタートアドレスから命令のリードを開始するリセット例外処理を行う。この後、ＣＰＵ２は逐次、ＲＯＭ４などから命令をリードし、解読して、その解読内容に基づいてデータの処理或はＲＡＭ５、タイマ６，７、ＳＣＩ８等とのデータ転送を行う。即ち、ＣＰＵ２は、入出力ポートＩＯＰ[１]〜ＩＯＰ[９]、Ａ／Ｄ変換器９などから入力されるデータ、或はＳＣＩ８などから入力される指示を参照しつつ、ＲＯＭ４などに記憶されている命令に基づいて処理を行い、その結果に基づいて、入出力ポートＩＯＰ[１]〜ＩＯＰ[９]、タイマ６，７などを使用して、外部に信号を出力し、各種機器の制御を行うものである。
【００５８】
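The states of the timers 6 and 7, the SCI 8, external signals, and the like can be conveyed to the CPU 2 as interrupt signals. Interrupt signals are output by designated ones of the A/D converter 9, timer 6, timer 7, SCI 8, and IOP[1] 11 to IOP[9] 19; the interrupt controller 3 receives them and, according to the settings of designated registers and the like, supplies an interrupt request signal 22 to the CPU 2. When an interrupt source occurs, a CPU interrupt request is generated; the CPU 2 suspends the processing in progress, passes through the exception-handling state, branches to a designated handling routine, performs the desired processing, and clears the interrupt source, among other things. A return instruction is normally placed at the end of the handling routine, and executing it resumes the suspended processing.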
【００５９】
図３にはＣＰＵ２に内蔵されている汎用レジスタ及び制御レジスタの構成例（プログラミングモデル）が示される。
【００６０】
ＣＰＵ２は、３２ビット長の汎用レジスタＥＲ０〜ＥＲ７を持っている。汎用レジスタＥＲ０〜ＥＲ７は、全て同機能を持っており、アドレスレジスタとしてもデータレジスタとしても使用することができる。
【００６１】
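As data registers, they can be used as 32-bit, 16-bit, and 8-bit registers. As address registers and as 32-bit registers, they are used whole as general-purpose registers ER (ER0 to ER7). As 16-bit registers, each general-purpose register ER is divided and used as general-purpose registers E (E0 to E7) and general-purpose registers R (R0 to R7). These are functionally equivalent, so up to sixteen 16-bit registers can be used. The general-purpose registers E (E0 to E7) are sometimes specifically called extended registers. As 8-bit registers, each general-purpose register R is divided and used as general-purpose registers RH (R0H to R7H) and general-purpose registers RL (R0L to R7L). These are functionally equivalent, so up to sixteen 8-bit registers can be used. The usage can be selected independently for each register.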
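The register division described here can be pictured with a small model. The following is only an illustrative sketch: the class name `GeneralRegister` and its methods are hypothetical, and the placement of E in the upper 16 bits and R in the lower 16 bits is an assumption taken from H8/300H-style programming models, since this paragraph itself does not state which half is which.

```python
class GeneralRegister:
    """Illustrative model of one 32-bit register ERn and its narrower views.

    Assumption (not stated in the text): En is the upper 16 bits, Rn the
    lower 16 bits, RnH the upper byte of Rn and RnL the lower byte of Rn.
    """

    def __init__(self, value: int = 0) -> None:
        self.er = value & 0xFFFFFFFF  # full 32-bit view (ERn)

    @property
    def e(self) -> int:
        """16-bit view En (assumed upper half)."""
        return (self.er >> 16) & 0xFFFF

    @property
    def r(self) -> int:
        """16-bit view Rn (assumed lower half)."""
        return self.er & 0xFFFF

    @property
    def rh(self) -> int:
        """8-bit view RnH (upper byte of Rn)."""
        return (self.er >> 8) & 0xFF

    @property
    def rl(self) -> int:
        """8-bit view RnL (lower byte of Rn)."""
        return self.er & 0xFF

    def write_r(self, value: int) -> None:
        """Write the 16-bit view Rn; the En half is left untouched."""
        self.er = (self.er & 0xFFFF0000) | (value & 0xFFFF)


reg = GeneralRegister(0x12345678)
reg.write_r(0xABCD)
```

Eight such registers give eight ER views, sixteen 16-bit views (E0 to E7 plus R0 to R7), and sixteen 8-bit views (R0H to R7H plus R0L to R7L), which matches the counts in the paragraph.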
【００６２】
汎用レジスタＥＲ７には、汎用レジスタとしての機能に加えて、スタックポインタ（ＳＰ）としての機能が割り当てられており、例外処理やサブルーチン分岐などで暗黙的に使用される。例外処理は前記割込み処理を含む。
【００６３】
ＰＣは２４ビットのカウンタで、ＣＰＵ２が次に実行する命令のアドレスを示す。特に制限されないものの、ＣＰＵ２の命令は、全て２バイト（ワード：１６ビット）を単位としているため、ビット０は無効であり、また、命令のリードは４バイト（ロングワード：３２ビット）としているために、ビット１も使用されない。また、スタックに待避される場合などは、上位８ビットを０としたロングワードサイズとして扱われる。
【００６４】
ＣＣＲは８ビットのコンディションコードレジスタで、ＣＰＵ２の内部状態を示している。割込みマスクビット（Ｉ）とハーフキャリ（Ｈ）、ネガティブ（Ｎ）、ゼロ（Ｚ）、オーバフロー（Ｖ）、キャリ（Ｃ）の各フラグを含む８ビットで構成されている。
【００６５】
ＥＸＲは８ビットのレジスタで、割込みなどの例外処理の制御を行なう制御情報が設定され、割込みマスクビット（Ｉ２〜Ｉ０）とトレース（Ｔ）の各ビットを含んでいる。
【００６６】
汎用レジスタ上のデータ構成例、メモリ空間上のデータ構成、アドレッシングモードと実効アドレスの計算方法などについては、例えば平成７年３月（株）日立製作所発行『Ｈ８Ｓ／２６００シリーズＨ８Ｓ／２０００シリーズプログラミングマニュアル』記載のＣＰＵと同様である。
【００６７】
図４には別のＣＰＵに内蔵されている汎用レジスタ及び制御レジスタの構成例を示す。これは、平成元年７月（株）日立製作所発行『Ｈ８／３００シリーズプログラミングマニュアル』記載のＣＰＵと同様の構成であり、１６ビットの汎用レジスタＲ０〜Ｒ７を有している。本発明を適用した、図３のプログラミングモデルを持つＣＰＵ２は、図４のＣＰＵの汎用レジスタ及び命令セットを包含している。換言すれば、前記ＣＰＵ２は図４のレジスタ及び命令セットを有するＣＰＵに対して上位互換の関係を有する。
【００６８】
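FIG. 5 shows an example of the machine-language instruction format of the CPU 2. Instructions of the CPU 2 are in units of 2 bytes (one word). Each instruction contains an operation field (op), a register field (r), an EA extension (EA), and a condition field (cc).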
【００６９】
特に制限はされないものの、前記ＣＰＵ２は、前記平成７年３月（株）日立製作所発行『Ｈ８Ｓ／２６００シリーズ、Ｈ８Ｓ／２０００シリーズプログラミングマニュアル』記載のＣＰＵと同じ命令フォーマットとしている。特に基本的な演算命令、転送命令は１６ビット長（１ワード）である。
【００７０】
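The operation field (op) indicates the function of the instruction and specifies the addressing mode and the processing to be applied to the operands. It always includes the first 4 bits of the instruction. Some instructions have two operation fields.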
【００７１】
レジスタフィールド（ｒ）は組合わせて汎用レジスタを指定する。前記レジスタフィールド（ｒ）はアドレスレジスタのとき３ビット、データレジスタのとき３ビット（３２ビットレジスタ）または４ビット（８または１６ビットレジスタ）である。２つのレジスタフィールドを持つ場合、或いは、レジスタフィールドを持たない場合もある。
【００７２】
前記ＥＡ拡張部（ＥＡ）は、イミディエイトデータ、絶対アドレスまたはディスプレースメントを指定する。８ビット、１６ビット、または３２ビットである。コンディションフィールド（ｃｃ）は条件分岐命令（Ｂｃｃ命令）の分岐条件を指定する。
【００７３】
図１には前記ＣＰＵ２のブロック図が例示される。ＣＰＵ２は、制御部ＣＯＮＴと、前記汎用レジスタＥＲ０〜ＥＲ７、プログラムカウンタＰＣ、コンディションコードレジスタＣＣＲを含む実行部ＥＸＥＣから構成される。
【００７４】
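The control unit CONT includes an instruction register 200 made up of, for example, a three-word FIFO, an instruction register detection circuit (MON) 201, an instruction register controller (FIFOCNT) 202, an instruction decoder (DEC) 203, a sub instruction decoder (DECS) 204, and a register selector (RSEL) 205. The instruction decoder (DEC) 203 and the sub instruction decoder (DECS) 204 are implemented with, for example, a PLA (Programmable Logic Array) or hard-wired logic. The instruction decoder (DEC) 203 handles all instructions and controls the CPU 2 as a whole. The sub instruction decoder (DECS) 204 only controls the execution of arithmetic operations, such as those of register-to-register operation instructions, in a period that overlaps the operation of the instruction decoder (DEC) 203. Part of the output of the instruction decoder 203 is fed back to the instruction decoder 203. This feedback includes a stage code (TMG), used for transitions within each instruction code, and a control code (MOD), used between instruction codes. The control code is generated by the instruction decoder (DEC) 203 and the instruction register detection circuit (MON) 201 and is input to the instruction decoder 203 through a multiplexer (MPX) 206. The instruction register detection circuit (MON) 201 also has detection circuits for register-to-register operation instructions and prefix codes; it reports the detection of a register-to-register operation instruction to the instruction decoder (DEC) 203 with the signal NXTMON1 and the detection of a prefix code with the signal NXTMON2. In addition, the instruction register detection circuit 201 indicates to the decoder 203, with the signal IFMON, that a read for the instruction register 200 is in progress. The instruction register controller (FIFOCNT) 202 detects the amount of valid instruction code held in the instruction register and reports the result to the instruction decoder (DEC) 203 with the detection signals FIFOCNT1 and FIFOCNT2. The control performed by the decoder 203 on the basis of the signals FIFOCNT1, FIFOCNT2, NXTMON1, NXTMON2, and IFMON is described in detail with reference to FIGS. 10 to 13, 22, and 23.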
【００７５】
レジスタセレクタ（ＲＳＥＬ）２０５は、デコーダ２０３，２０４の出力に基づいて汎用レジスタなどのレジスタ選択信号を形成すると共に、図示はされないものの、汎用レジスタのリードとライトが競合した場合の検出回路を備えている。
【００７６】
実行部ＥＸＥＣは３２ビット単位でデータ転送可能に構成され、図３の汎用レジスタとコントロールレジスタと、更に、テンポラリレジスタＴＲＡ、ＴＲＤ、算術論理演算器ＡＬＵ、サブ算術論理演算器ＡＬＵＳ、算術演算器ＡＵ、インクリメンタＩＮＣ、リードデータバッファＲＤＢ、ライトデータバッファＷＤＢ、アドレスバッファＡＢを含む。これらの機能ブロックはＧＢ（第１の内部バス）、ＰＣＧＢ（第２の内部バス）、ＤＢ（第５の内部バス）、ＷＢ（第４の内部バス）、ＰＣＷＢ（第３の内部バス）の内部バスによって相互に接続されている。
【００７７】
前記内部バスＧＢは、汎用レジスタＥＲ０〜ＥＲ７の所定のレジスタなどから算術論理演算器ＡＬＵ、ＡＬＵＳなどへのデータ転送、汎用レジスタＥＲ０〜ＥＲ７の中の所定レジスタやリードデータバッファＲＤＢ、算術論理演算器ＡＬＵからアドレスバッファＡＢへのアドレスの転送などに用いられる。
【００７８】
前記内部バスＰＣＧＢは、プログラムカウンタＰＣからアドレスバッファＡＢ、インクリメンタＩＮＣ、ライトデータバッファＷＤＢへの命令のアドレスへの転送などに用いられる。
【００７９】
前記内部バスＤＢは、汎用レジスタＥＲ０〜ＥＲ７の中の所定レジスタから算術論理演算器ＡＬＵやライトデータバッファＷＤＢへのデータ転送などに用いられる。
【００８０】
内部バスＷＢは、算術論理演算器ＡＬＵ、ＡＬＵＳやリードデータバッファＲＤＢから汎用レジスタへのデータ転送などに用いられる。内部バスＰＣＷＢは、インクリメンタＩＮＣからプログラムカウンタＰＣへの命令のアドレスへの転送に用いられる。
【００８１】
リードデータバッファＲＤＢは、ＲＯＭ４、ＲＡＭ５、内部Ｉ／Ｏレジスタ、或は図示はされない外部メモリから、リードした命令コードやデータを一時的に格納する。ライトデータバッファＷＤＢはＲＯＭ４、ＲＡＭ５、内部Ｉ／Ｏレジスタ、或は外部メモリへのライトデータを一時的に格納するとともに、命令リードのアドレスを一時的に格納する。リードデータバッファＲＤＢ、ライトデータバッファＷＤＢによってＣＰＵ２の内部動作と、ＣＰＵ２の外部のリード／ライト動作のタイミングを調整している。
【００８２】
アドレスバッファＡＢは、ＣＰＵ２がリード／ライトするアドレスを一時的に格納する他、格納した内容に対するインクリメント機能を有している。インクリメント機能を有するアドレスバッファは特開平４−３３３１５３号公報などに記載されている。
【００８３】
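The incrementer INC is used mainly for adding to the PC and performs +2/+4. The arithmetic unit AU is used to generate the branch address of program-counter-relative branch and subroutine branch instructions. The arithmetic logic unit ALU is used for the various operations specified by instructions, for effective address calculation, and so on. The sub arithmetic logic unit ALUS is used exclusively for the operations of register-to-register operation instructions. Whether the instruction to be executed is a register-to-register operation instruction is detected with the signal NXTMON1.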
【００８４】
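The operations of the arithmetic logic unit ALU and the sub arithmetic logic unit ALUS are offset from each other by 0.5 state. The ALU takes in data while the basic clock (φ) is high and outputs the result while the basic clock (φ) is low. The ALUS, by contrast, takes in data while the basic clock (φ) is low and outputs the result while the basic clock (φ) is high. The CPU 2 executes instructions in a three-stage pipeline of instruction fetch, decode, and execute. Suppose, for example, that the add instruction "ADD.L ER0,ER1" is followed by the shift instruction "SHLL ER1". In synchronization with the high level of the basic clock (φ), the contents of ER0 and ER1 are read out onto the buses DB and GB and input to the ALU. The ALU performs the addition, and the sum is output onto the bus WB in synchronization with the low level of the basic clock (φ). At this low level of the basic clock (φ), the read and the write of ER1 conflict. The contents of the bus WB are written to ER1. The contents of ER1 are not read; instead, the output of the ALU is read out onto the bus GB and input to the ALUS. In other words, the ALU and the ALUS operate in instruction order, and the result of one unit, the ALU, can be used as the input of the other, the ALUS, so register conflicts are inherently avoided.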
【００８５】
命令が重なって動作するのは、各々の命令の最初または最後の１ステート（１ステートで実行する命令は全部の期間）とされ、更に、この期間に動作するのは特定種類の動作（演算動作）とされているから、ＣＰＵ２の命令デコード動作の一部を双方の算術論理演算器ＡＬＵ，ＡＬＵＳのために重なった期間で行なえばよく、その他の順序的な動作を制御する、相対的に大きな命令デコーダ（ＤＥＣ）２０３は従来同等にでき、追加するサブ命令デコーダ（ＤＥＣＳ）２０４を相対的に小さいものとすることによって、論理的規模の増加を最小限にすることができる。
【００８６】
特に制限はされないものの、サブ命令デコーダ（ＤＥＣＳ）２０４は、サブ算術論理演算器ＡＬＵＳを用いた演算の種類の指定（演算制御）と、演算に用いる汎用レジスタの入出力制御、サブ算術論理演算器ＡＬＵＳの演算結果に基づくコンディションコードレジスタＣＣＲの設定制御を行なう。
【００８７】
一方、命令デコーダ（ＤＥＣ）２０３は、上記に加えて、命令の動作タイミングの生成、バス制御、ＰＣ制御、実効アドレスの計算、実効アドレスの計算に用いる汎用レジスタの入出力制御、メモリアクセスのデータの入出力制御、命令レジスタの制御、割込み制御などを行なう。
【００８８】
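Here, a supplementary explanation of the functions of the decoder 203 and the sub decoder 204 is given. When the signal NXTMON1 indicates that a register-to-register operation instruction has been detected, the sub decoder 204 decodes that instruction, and the operation control of the sub arithmetic logic unit ALUS based on the decode result starts 0.5 state later.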
【００８９】
図６には前記ＲＯＭ４の構成が例示される。ＲＯＭ４は、並列データ入出力ビット数が最大３２ビットとされ、そのデータ入出力端子は内部データバスＩＤＢに接続されている。４の倍数番地から始まる連続した４バイトの、下位アドレスが“０”のバイトが上位に、下位アドレスが“３”のバイトが下位になるように構成されている。
【００９０】
このＲＯＭ４は、４の倍数番地から始まる３２ビットデータ（ロングワードデータ）を一括して１ステートでリードされる。また、それ以外の偶数番地から始まる３２ビットデータは、１ステートずつ２回に分けてリードされる必要がある。同様に偶数番地から始まる１６ビットデータ（ワードデータ）を一括して、１ステートでリード可能にされる。奇数番地から始まる１６ビットデータのリードは認められていない。これは命令コードが１６ビット単位であることに対応している。また、ＲＯＭ４は、任意の番地の８ビットデータ（バイトデータ）を１ステートでリード可能にされる。
【００９１】
即ち、１６ビット長の命令が連続する場合、１回のＲＯＭ４のリードで２命令を読み出すことができる。ＲＡＭ５のリード／ライトについても同様の構成とされる。
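The read rules for the ROM 4 given above can be summarized as a small function. This is a minimal sketch, not the actual memory interface: the function names are hypothetical, and rejecting longword reads from odd addresses is an assumption (the text only rules out word reads from odd addresses and speaks of "other even addresses" for longwords).

```python
def rom_read_states(address: int, size: int) -> int:
    """Return the number of states needed to read `size` bytes at `address`.

    size is 1 (byte), 2 (word) or 4 (longword), following the rules in the text.
    """
    if size == 1:
        return 1  # any byte address is readable in one state
    if size not in (2, 4):
        raise ValueError("size must be 1, 2 or 4")
    if address % 2:
        # Word reads from odd addresses are not permitted; treating longword
        # reads the same way is an assumption of this sketch.
        raise ValueError("word/longword reads must start at an even address")
    if size == 2:
        return 1  # a word at any even address takes one state
    return 1 if address % 4 == 0 else 2  # a misaligned longword takes two reads


def longword_from_bytes(b0: int, b1: int, b2: int, b3: int) -> int:
    """Assemble a longword: the byte at low address bits 0 goes in the top byte."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3
```

With this layout, two consecutive 16-bit instructions that start at a multiple-of-4 address arrive in a single one-state longword read, which is the case the paragraph above highlights.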
【００９２】
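FIG. 7 illustrates the addressing modes of the CPU 2. Register indirect (@ERn) specifies an operand in memory by using the contents of the address register (ERn) designated by the register field (r1) of the instruction code as the address.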
【００９３】
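Post-increment register indirect (@ERn+) specifies an operand in memory by using the contents of the address register (ERn) designated by the register field (r1) of the instruction code as the address. Afterwards, 1, 2, or 4 is added to the contents of the address register and the sum is stored in the address register: 1 for byte size, 2 for word size, and 4 for longword size.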
【００９４】
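Pre-decrement register indirect (@-ERn) specifies an operand in memory by using, as the address, the contents of the address register (ERn) designated by the register field (r1) of the instruction code minus 1, 2, or 4. The result of the subtraction is then stored in the address register. The amount subtracted is 1 for byte size, 2 for word size, and 4 for longword size.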
【００９５】
ディスプレースメント付きレジスタ間接（＠（ｄ：１６，ＥＲｎ））は、命令コードのレジスタフィールド（ｒ１）で指定されるアドレスレジスタ（ＥＲｎ）の内容に命令コード中に含まれる１６ビットディスプレースメント（ｄ）を加算した内容をアドレスとしてメモリ上のオペランドを指定する。加算に際して、１６ビットディスプレースメントは符号拡張される。
【００９６】
絶対アドレス（＠ａａ：１６）は、命令コード中に含まれる絶対アドレス（ａａ）で、メモリ上のオペランドを指定する。特に制限はされないものの、１６ビット絶対アドレスの場合、上位１６ビットは符号拡張される。
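The effective-address rules of the addressing modes described for FIG. 7 can be written out as follows. This is a hedged sketch: all function names are hypothetical, and masking results to 32 bits is an assumption made for illustration (the text only says that 16-bit displacements and 16-bit absolute addresses are sign-extended).

```python
SIZE_STEP = {"B": 1, "W": 2, "L": 4}  # byte, word, longword
MASK32 = 0xFFFFFFFF                   # assumed register width for this sketch


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def ea_register_indirect(ern: int) -> int:
    """@ERn: the register contents are the address."""
    return ern & MASK32


def ea_post_increment(ern: int, size: str):
    """@ERn+: returns (effective address, updated ERn)."""
    return ern & MASK32, (ern + SIZE_STEP[size]) & MASK32


def ea_pre_decrement(ern: int, size: str):
    """@-ERn: returns (effective address, updated ERn); both are ERn - step."""
    new = (ern - SIZE_STEP[size]) & MASK32
    return new, new


def ea_disp16(ern: int, disp16: int) -> int:
    """@(d:16,ERn): ERn plus the sign-extended 16-bit displacement."""
    return (ern + sign_extend(disp16, 16)) & MASK32


def ea_abs16(aa16: int) -> int:
    """@aa:16: the 16-bit absolute address, sign-extended."""
    return sign_extend(aa16, 16) & MASK32
```

For instance, a displacement of 0xFFFE is treated as -2, so `ea_disp16(0x1000, 0xFFFE)` addresses the word two bytes below the base register.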
【００９７】
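FIG. 8 shows the operation timing of the transfer instruction "MOV.W @aa:16,Rd". Parts (1-1) and (1-2) of FIG. 8 show the operation of the control unit CONT (in particular the instruction decoder 203), and part (2) shows the operation of the execution unit EXEC. In practice the execution unit EXEC operates according to the control signals output by the control unit CONT, so there is a time difference between the two; for convenience FIG. 8 draws that difference as zero. Of the control unit operations, (1-1) is equivalent to the prior art, and (1-2) corresponds to an example of operation specific to the present invention.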
【００９８】
実行部ＥＸＥＣは、第１ステートＳＴ１で、次命令の命令リード（ｉｆ）とＰＣインクリメント（＋４：従来技術では＋２）を行なう。第２ステートＳＴ２で、リードデータバッファから本命令のＥＡ拡張部（ａａ）を内部バス（ＧＢ）経由でアドレスバッファに転送すると共に、データリードのためのバスコマンドを発行する。第３ステートＳＴ３では、次の次の命令の命令リード（ｉｆ）とプログラムカウンタＰＣインクリメント（＋４：従来技術では＋２）を行なうと共に、第２ステートＳＴ２でリードしたデータを、リードデータバッファから内部バス（ＷＢ）経由で汎用レジスタに転送するとともに、データを検査し、結果をコンディションコードレジスタＣＣＲにセットする。
【００９９】
制御部ＣＯＮＴの動作（１−１）は、上記実行部ＥＸＥＣの動作に即した制御内容になっている。即ち、第２ステートＳＴ２で、アドレスの出力とバスコマンドの生成、第３ステートＳＴ３でリードデータの格納の制御信号を発生している。更に、前記平成７年３月（株）日立製作所発行『Ｈ８Ｓ／２６００シリーズＨ８Ｓ／２０００シリーズプログラミングマニュアル』では、ＰＣのインクリメントは＋２であり、リードデータは、リードデータバッファから内部バス（ＧＢ）を経由して演算器（ＡＬＵ）に入力し、演算器（ＡＬＵ）はそのままデータを内部バス（ＷＢ）に出力して、汎用レジスタに格納していた。演算器（ＡＬＵ）を介することで、内部バスの増加を抑止し（ＧＢ、ＤＢ、ＷＢの３種類）、演算器（ＡＬＵ）のデータ検査回路とフラグセット回路を共有していた。これらの詳細な相違点は省略する。
【０１００】
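In operation (1-2), the control unit CONT performs the control for data access in the second state ST2. That is, in the second state ST2 it generates the control signals for address output and bus command generation and also for storing the read data. The execution unit EXEC is first given the control signals for address output and bus command generation, and is given the control signal for storing the read data (RDB-Rd) in the following state.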
【０１０１】
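In the first state ST1 and third state ST3, the control unit CONT performs only an instruction read (if) and a PC increment (+4). These first and third states ST1 and ST3 are omitted (skipped) according to the amount of instruction code already read into the instruction register (FIFO) 200. If little instruction code has been read, both ST1 and ST3 are executed, so more instruction code is read than the length of this instruction (two words). If the amount already read is adequate, either ST1 or ST3 is executed, so the same amount is read as the length of this instruction (two words). If much instruction code has been read, neither ST1 nor ST3 is executed and no instruction is read. The instruction decoder 203 decides which of these to do by using signals such as FIFOCNT1, FIFOCNT2, and IFMON.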
【０１０２】
例えば、本命令が複数命令連続して実行される場合には、第１ステートＳＴ１、第２ステートＳＴ２のみの実行（第３ステートＳＴ３は省略）が行われる。図８の（１−１）に示される従来と同様の制御を受ける前の命令の第３ステートＳＴ３のワードサイズ（１６ビット）命令リードと次の命令の第１ステートＳＴ１のワードサイズ命令リードが、図８の（１−２）に示される制御を受ける本発明に係る第１ステートＳＴ１のロングワードサイズ（３２ビット）命令リードに合体されたと理解することができる。
【０１０３】
なお、第１ステートＳＴ１又は第３ステートＳＴ３の一方を実行する場合、何れを実行して何れを省略（スキップ）するかは、それ以前の命令リードの状態によって決まる。分岐命令の分岐先の先頭におかれ、かつ４の倍数番地でない場合は、プリフェッチされたのは自命令の第1ワードのみであり、自命令の第２ワードを待つため、第１ステートＳＴ１を実行するし、分岐命令の分岐先の先頭におかれても、４の倍数番地の場合は、自命令の第２ワードも同時にリード（プリフェッチ）済みであるから、第１ステートＳＴ１は省略（スキップ）し、第３ステートＳＴ３を実行する。
【０１０４】
図９には分岐命令（ＪＭＰ＠ａａ：２４）の動作タイミングが示される。図９の（１）には制御部（特に命令デコーダ２０３）ＣＯＮＴの動作が示され、図９の（２）には実行部ＥＸＥＣの動作が示されている。図８と同様に、制御部ＣＯＮＴの動作と実行部ＥＸＥＣの動作との時間差を便宜的に０として表現している。
【０１０５】
図９において、第１ステートＳＴ１は、自命令の第２ワードのリードの完了を待つためであり、第２ワードがリード済みであれば省略（スキップ）可能である。分岐先の命令を２回リードする。１回目は、リードした命令をＣＰＵ２内部に取り込むが、２回目は命令リードの発行のみを行い、リードした命令のＣＰＵ２内部への取り込みは、次の命令の実行と重なる。
【０１０６】
１回目は、４の倍数番地の場合、２ワードをリードし、ＰＣインクリメントは＋４となり、４の倍数番地でない場合、１ワードをリードし、ＰＣインクリメントは＋２となる。
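The dependence of the first fetch on the alignment of the branch target can be sketched as a small helper. The function name and the returned keys are hypothetical; the third field restates the rule given for FIG. 8 and FIG. 9 that a two-word instruction at the target needs its first state ST1 only when its second word has not been prefetched.

```python
def branch_target_fetch(target: int) -> dict:
    """Describe the first instruction read at a branch target address."""
    if target % 2:
        raise ValueError("instructions start at even addresses")
    aligned = target % 4 == 0
    return {
        # Aligned target: the whole 32-bit bus is used; otherwise only one word.
        "words_read": 2 if aligned else 1,
        "pc_increment": 4 if aligned else 2,
        # A two-word instruction at an unaligned target must wait for word 2.
        "two_word_needs_st1": not aligned,
    }
```

With the label values quoted later for the timing charts, a target of 14 gives a one-word fetch and +2, while a target of 12 gives a two-word fetch and +4.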
【０１０７】
このため、分岐先に同じ分岐命令（ＪＭＰ＠ａａ：２４）が存在した場合、分岐先が４の倍数番地であれば、自命令の第２ワードのリード（プリフェッチ）が完了した状態で、実行が開始されるので、第１ステートＳＴ１を省略（スキップ）できる。分岐先が４の倍数番地でなければ、自命令の第２ワードのリードが完了していない状態で、実行が開始されるので、第１ステートＳＴ１を省略（スキップ）できない。
【０１０８】
分岐命令を実行した場合も、従来と同じタイミングで分岐先の命令の実行を開始できる。４の倍数番地に分岐している場合などは、分岐先の命令実行を短縮できる。分岐命令や割込み例外処理などの応答性を維持向上することができる。
【０１０９】
上記の通り、分岐命令は、その配置されるアドレスによらず、実行可能である。実行ステート数が異なるが、最低限、従来同等であり、むしろ、無操作命令を挿入するなどの必要はなく、ソフトウェアに負担をかけなくてよい。
【０１１０】
図１０〜図１３にはプログラムを実行したときのタイミングチャートが例示される。実行プログラムは、

である。
【０１１１】
尚、ＭＯＶ．ＬＥＲ１，＠ａａ２は、ＭＯＶ．ＷＲ１，＠ａａ２の命令コードに、前置コードを付加した命令コードを持っている。かかる前置コードは制御信号を発生し、続く命令コード（ＭＯＶ．ＷＲ１，＠ａａ２）の動作を変更するもので、特開平６−５１９８１号公報などに記載されている。
【０１１２】
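In FIGS. 10 to 13 the instructions are placed at different addresses. The labels L0 and L1 are, in FIG. 10, L0 = 2 and L1 = 14, that is,
L0 .EQU 2
L1 .EQU 14
in FIG. 11, L0 = 2 and L1 = 12, that is,
L0 .EQU 2
L1 .EQU 12
in FIG. 12, L0 = 0 and L1 = 14, that is,
L0 .EQU 0
L1 .EQU 14
and in FIG. 13, L0 = 0 and L1 = 12, that is,
L0 .EQU 0
L1 .EQU 12
The data addresses are common to all four figures: aa1 = 102 and aa2 = 104, that is,
aa1 .EQU 102
aa2 .EQU 104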
【０１１３】
分岐命令の第１ステートは、命令レジスタ（ＦＩＦＯ）２００に１ワード分の命令コードが存在する（ＦＩＦＯＣＮＴ１＝１）と、省略可能とされる。
【０１１４】
転送命令（ＭＯＶ．Ｗ）の第１ステートは、命令レジスタ（ＦＩＦＯ）２００に１ワード分の命令コードが存在すると、省略可能とされる。
【０１１５】
転送命令（ＭＯＶ．Ｗ）の第３ステートは、命令レジスタ（ＦＩＦＯ）２００に２ワード分の命令コードが存在する（ＦＩＦＯＣＮＴ１＝ＦＩＦＯＣＮＴ２＝１）場合、または、命令レジスタ（ＦＩＦＯ）２００に１ワード分の命令コードが存在し、かつ、命令リード実行中の場合（ＦＩＦＯＣＮＴ１＝ＩＦＭＯＮ＝１）に、省略可能とされる。
【０１１６】
レジスタ間演算命令は、命令レジスタ（ＦＩＦＯ）２００に２ワード分の命令コードが存在する（ＦＩＦＯＣＮＴ１＝ＦＩＦＯＣＮＴ２＝１）場合、または、命令レジスタ（ＦＩＦＯ）２００に１ワード分の命令コードが存在し、かつ、命令リード実行中の場合（ＦＩＦＯＣＮＴ１＝ＩＦＭＯＮ＝１）に、次の命令がレジスタ間演算命令などのとき（ＮＸＴＭＯＮ１＝１）、サブ命令デコーダ（ＤＥＣＳ）２０４とサブ算術論理演算器ＡＬＵＳの動作を指示する。
【０１１７】
転送命令（ＭＯＶ．Ｌ）の第１ステート（前置コード、ＮＸＴＭＯＮ２＝１）は、命令レジスタ（ＦＩＦＯ）２００に１ワード分の命令コードが存在すると、省略可能とされる。省略されない場合は、前置コードを命令デコーダで解読し、命令リードとＰＣインクリメントを行なうとともに、制御信号を発生する。省略した場合は、命令レジスタから、所望の信号を生成し、命令デコーダに入力する。
【０１１８】
転送命令（ＭＯＶ．Ｌ）の第２ステート、第４ステートは、転送命令（ＭＯＶ．Ｗ）の第１ステート、第３ステートと同様である。
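The skip conditions listed in the preceding paragraphs can be collected into a few predicates. This is a rough sketch under stated assumptions: the `FifoStatus` class and the function names are hypothetical, the signal names are those used in the text, and each predicate is meant to be evaluated with the signal values at the moment the state in question would begin (the FIFO contents change as reads complete).

```python
from dataclasses import dataclass


@dataclass
class FifoStatus:
    """Signals seen by the instruction decoder (names as used in the text)."""
    fifocnt1: bool         # one word of instruction code is held in the FIFO
    fifocnt2: bool         # a second word is held as well
    ifmon: bool            # an instruction read is currently in progress
    nxtmon1: bool = False  # next code is a register-to-register operation
    nxtmon2: bool = False  # next code is a prefix code


def enough_lookahead(s: FifoStatus) -> bool:
    """Two words held, or one word held with a read already under way."""
    return (s.fifocnt1 and s.fifocnt2) or (s.fifocnt1 and s.ifmon)


def can_skip_first_state(s: FifoStatus) -> bool:
    """Rule given for ST1 of the branch, MOV.W and MOV.L (prefix) cases."""
    return s.fifocnt1


def can_skip_movw_third_state(s: FifoStatus) -> bool:
    """Rule given for ST3 of MOV.W (and the fourth state of MOV.L)."""
    return enough_lookahead(s)


def can_start_sub_alu(s: FifoStatus) -> bool:
    """A register-to-register instruction may hand the next one to DECS/ALUS."""
    return enough_lookahead(s) and s.nxtmon1
```

The same `enough_lookahead` term appears in both the ST3 rule and the ALUS rule, which suggests why a single pair of FIFOCNT signals plus IFMON is sufficient for the decoder.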
【０１１９】
図１０の場合は以下の通りの動作になる。基準クロックφのサイクルＴ０におけるスロットＣ２（基準クロックφのローレベル期間）で、ＣＰＵ２は、図示されない分岐命令の実行時に、命令フェッチ（ｉｆ）を示す、バスコマンド（ＢＣＭＤ）を出力し、また、アドレスをアドレスバッファＡＢからアドレスバスＩＡＢに出力する。同様に、サイクルＴ１のスロットＣ２でバスコマンドと、次のアドレスを出力する。
【０１２０】
アドレスバスＩＡＢの内容と、バスコマンドに基づき、内蔵ＲＯＭ４の内容が、サイクルＴ１のスロットＣ２で内部データバスＩＤＢに得られ、これをサイクルＴ２のスロットＣ１（基準クロックφのハイレベル期間）で命令レジスタ（ＦＩＦＯ）２００及びリードデータバッファＲＤＢにラッチする。なお、この時の命令アドレスは４の倍数番地ではないので、内部データバスＩＤＢの下位側（ビット１５〜０）のみが使用される。同様にサイクルＴ２のスロットＣ１で、次のアドレスの内容を命令レジスタ（ＦＩＦＯ）２００及びリードデータバッファＲＤＢにラッチする。今回は、４の倍数番地であるので、内部データバスＩＤＢの上位（ビット３１〜１６）及び下位（ビット１５〜０）が使用される。
【０１２１】
サイクルＴ２のスロットＣ１で命令コード（ｊｍｐ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１２２】
自命令の第２ワード（ｊｍｐ−２）のリードが完了していないので、第１ステートＳＴ１を実行する。
【０１２３】
サイクルＴ２のスロットＣ２では、バスコマンドは無操作とされ、リード／ライトは開始されない。
【０１２４】
サイクルＴ３のスロットＣ２で、リードデータバッファＲＤＢの内容（絶対アドレス＝１４）を、内部バスＧＢを経由して、アドレスバッファＡＢに格納し、アドレスバスＩＡＢに出力するとともに、バスコマンドを発行して、命令のリードを行なう。同様に、サイクルＴ４のスロットＣ２でバスコマンドと、次のアドレスを出力して、命令のリードを行なう。
【０１２５】
リードした内容を、サイクルＴ５のスロットＣ１、サイクルＴ６のスロットＣ１で、命令レジスタ（ＦＩＦＯ）２００及びリードデータバッファＲＤＢにラッチする。
【０１２６】
また、内部バスＧＢの内容は、ライトデータバッファＷＤＢとインクリメンタＩＮＣにも入力され、インクリメンタＩＮＣではインクリメント（＋２／＋４）が行われる。
【０１２７】
サイクルＴ４のスロットＣ１で、インクリメンタＩＮＣでインクリメント（＋２）された結果（１６）が、内部バスＷＢを経由してプログラムカウンタＰＣにライトされる。同様に、サイクルＴ５のスロットＣ１で、インクリメント（＋４）された結果（２０）が、内部バスＷＢを経由してプログラムカウンタＰＣにライトされる。
【０１２８】
サイクルＴ５のスロットＣ１で命令コード（ｍｏｖ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１２９】
自命令の第２ワード（ｍｏｖ−２）のリードが完了していないので、第１ステートＳＴ１を実行する。
【０１３０】
サイクルＴ５のスロットＣ２でバスコマンドと、次のアドレスを出力して、命令のリードを行なう。また、プログラムカウンタＰＣのインクリメントなどを行なう。
【０１３１】
サイクルＴ６のスロットＣ２で、リードデータバッファＲＤＢの内容（絶対アドレス＝１０２）を、内部バスＧＢを経由して、アドレスバッファＡＢに格納し、アドレスバスＩＡＢに出力するとともに、バスコマンドを発行して、データのリードを行なう。第３ステートＳＴ３は省略（スキップ）される。
【０１３２】
リードしたデータは、サイクルＴ８のスロットＣ１でリードデータバッファＲＤＢに格納され、内部バスＷＢを経由して、汎用レジスタＥＲ０（実質的にはＲ０）にライトされる。また、リードデータバッファＲＤＢ上のデータを検査して、結果をコンディションコードレジスタＣＣＲの所定のビット（例えば、ネガティブＮ、ゼロＺ、オーバフローＶ）に反映する。この動作は、命令コード（ｍｏｖ−１）を解読した結果に基づき、行われるが、次の命令と重なった時間に実行される。
【０１３３】
サイクルＴ７のスロットＣ１で命令コード（ａｄｄ）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１３４】
サイクルＴ７のスロットＣ２でバスコマンドと、次のアドレスを出力して、命令のリードを行なう。また、プログラムカウンタＰＣのインクリメントなどを行なう。
【０１３５】
サイクルＴ８のスロットＣ１で、汎用レジスタＥＲ１（Ｒ１）の内容を内部バスＧＢに読み出し、算術論理演算器ＡＬＵに入力する。また、汎用レジスタＥＲ０（Ｒ０）の内容を内部バスＤＢに読み出そうとするが、前命令のライトと競合しているために、リードデータバッファＲＤＢから読み出し（遅延時間を最小限にできる）、算術論理演算器ＡＬＵに入力する。算術論理演算器ＡＬＵには加算が指示される。
【０１３６】
サイクルＴ８のスロットＣ２で、汎用レジスタＥＲ１（Ｒ１）に演算結果を格納する。また、演算結果を検査して、結果をコンディションコードレジスタＣＣＲの所定のビット（例えば、ネガティブＮ、ゼロＺ、オーバフローＶ、キャリＣ、ハーフキャリＨ）に反映する。
【０１３７】
次の命令コードが、レジスタ間演算命令（ＮＸＴＭＯＮ１＝１）であるために、サイクルＴ７のスロットＣ２で、サブ命令デコーダ（ＤＥＣＳ）２０４に、命令コード（ｅｘｔｓ）を入力する。
【０１３８】
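In slot C2 of cycle T8, an attempt is made to read the contents of general-purpose register ER0 onto the internal bus GB, but because this conflicts with the write by the preceding instruction, the value is read from the arithmetic logic unit ALU instead (which keeps the delay to a minimum) and input to the sub arithmetic logic unit ALUS. The ALUS is instructed to perform an extension.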
【０１３９】
サイクルＴ９のスロットＣ１で、汎用レジスタＥＲ０に演算結果を格納する。また、演算結果を検査して、結果をコンディションコードレジスタＣＣＲの所定のビット（例えば、ネガティブＮ、ゼロＺ、オーバフローＶ）に反映する。
【０１４０】
サイクルＴ８のスロットＣ１で命令コード（ｍｏｖｌ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１４１】
自命令の第２ワード（ｍｏｖｌ−２）のリードが完了していないので、第１ステート（前置コード）ＳＴ１を実行する。
【０１４２】
サイクルＴ８のスロットＣ２でバスコマンドと、次のアドレスを出力して、命令のリードを行なう。また、プログラムカウンタＰＣのインクリメントなどを行なう。また、制御信号を発生して、次の命令コードへ指示（この場合は、ロングワードサイズ指示）を伝達する。
【０１４３】
サイクルＴ９のスロットＣ１で命令コード（ｍｏｖｌ−２）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１４４】
自命令の第３ワード（ｍｏｖｌ−３）のリードが完了しており、第２ステートＳＴ２を省略（スキップ）する。次の命令のリードが発行されていないので、第４ステートを実行する。
【０１４５】
図１１の動作を、専ら図１０との相違点に関し説明すれば、以下の通りである。サイクルＴ５のスロットＣ１で命令コード（ｍｏｖ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。このとき、自命令の第２ワード（ｍｏｖ−２）のリードが完了しており、第１ステートＳＴ１を省略（スキップ）する。また、次の命令のリードが発行されているため、第３ステートＳＴ３が省略（スキップ）される。
【０１４６】
サイクルＴ６のスロットＣ１で命令コード（ａｄｄ）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１４７】
次の命令コードは、レジスタ間演算命令であるが、次の次の命令のリードが完了していないため、サブ命令デコーダ（ＤＥＣＳ）２０４への命令コード（ｅｘｔｓ）の入力は行なわない。
【０１４８】
サイクルＴ７のスロットＣ１で命令コード（ｅｘｔｓ）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。
【０１４９】
算術論理演算器ＡＬＵを使用して演算を行なう。レジスタの競合は発生しないので、汎用レジスタＥＲ０を読み出す。
【０１５０】
サイクルＴ８のスロットＣ１で、命令コード（ｍｏｖｌ−１）が前置コードであることを判定（ＮＸＴＭＯＮ２＝１）し、命令コード（ｍｏｖｌ−２）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。自命令の第３ワード（ｍｏｖｌ−３）のリードが完了していないので、第２ステートＳＴ２を実行する。次の命令のリードが完了しており、第４ステートを省略（スキップ）する。
【０１５１】
図１２の動作を、主に図１０との相違点に関し説明すれば、以下の通りである。サイクルＴ２のスロットＣ１で命令コード（ｊｍｐ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。自命令の第２ワード（ｊｍｐ−２）のリードが完了しており、第１ステートＳＴ１を省略（スキップ）する。それ以降の動作は、図１０と同様である。
【０１５２】
図１３の動作を、主に図１０との相違点に関し説明すれば、以下の通りである。図１２と同様に、サイクルＴ２のスロットＣ１で命令コード（ｊｍｐ−１）がデコーダ（ＤＥＣ）２０３に入力されて、命令の内容が解読される。自命令の第２ワード（ｊｍｐ−２）のリードが完了しており、第１ステートを省略（スキップ）する。それ以降の動作は、図１１と同様である。
【０１５３】
前記５命令を、従来技術では１２ステートで実行可能とされている。これに対して、図１０では９ステート、図１１、図１２では８ステート、図１３では７ステートで実行している。処理に必要なステート数を５８〜７５％に短縮している。分岐先が４の倍数番地かどうかによって、ばらつきがあるが、分岐先が４の倍数番地でない場合に、内部データバスＩＤＢの上位側を使用できず、内部バスのスループットが低下するためである。
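The percentages quoted above follow directly from the state counts. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
PRIOR_ART_STATES = 12  # states needed for the five instructions in the prior art

# States needed in each timing chart, as quoted in the text.
measured_states = {"FIG. 10": 9, "FIG. 11": 8, "FIG. 12": 8, "FIG. 13": 7}

# Ratio of the new state count to the prior-art state count.
ratios = {fig: n / PRIOR_ART_STATES for fig, n in measured_states.items()}

for fig, ratio in ratios.items():
    print(f"{fig}: {measured_states[fig]} states, {ratio:.0%} of the prior art")
```

The best case (7 of 12 states) is about 58% and the worst case (9 of 12 states) is 75%, which gives the 58 to 75% range stated in the paragraph.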
【０１５４】
所定のプログラムの処理速度を向上するために、４の倍数番地に所望のプログラムを配置したい場合には、アセンブラに４の倍数番地へアライメントする制御命令を設け、これを利用することができる。かかる制御命令は、例えば、平成４年５月（株）日立製作所発行『Ｈ８Ｓ，Ｈ８／３００シリーズクロスアセンブラ』ｐ６２などに記載されている。アライメントする制御命令は、マイクロコンピュータの命令に変換されなく、変更によるプログラム品質などを大きく損なうことはない。
【０１５５】
また、以上の動作タイミング例においては、短いプログラムで、しかも分岐命令直後で、かつ、更に、分岐命令を実行しているため、本発明の改善の効果は必ずしも、最大限に表現されているとは言えない。
【０１５６】
例えば、レジスタ間演算命令のみを実行し続けるような場合は、ＡＬＵ、ＡＬＵＳを交互に動作させていくことにより、実効的に１ステートに２命令実行し、従来技術に対して５０％に短縮できる。
【０１５７】
また、バス幅を拡張した分を、省略可能なステートを省略（スキップ）して高速化を実現しているから、典型的には、命令リードの時間を半分にできる。命令リードを８０％、データアクセスを２０％とすると、６０％に、命令リードを７０％、データアクセスを３０％とすると、６５％に短縮できることになる。
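The two figures in this paragraph can be reproduced with a simple model. As a sketch, assume that bus time splits into an instruction-read fraction f_if and a data-access fraction f_data, and that only the instruction-read part is halved (the symbols are introduced here for illustration and do not appear in the original):

```latex
\[
\frac{T_{\mathrm{new}}}{T_{\mathrm{old}}} \approx \frac{f_{\mathrm{if}}}{2} + f_{\mathrm{data}},
\qquad f_{\mathrm{if}} + f_{\mathrm{data}} = 1
\]
\[
f_{\mathrm{if}} = 0.8:\ \frac{0.8}{2} + 0.2 = 0.60,
\qquad
f_{\mathrm{if}} = 0.7:\ \frac{0.7}{2} + 0.3 = 0.65
\]
```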
【０１５８】
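Although it depends on the content of the program, for a program as a whole the execution time can be expected to shrink to roughly 50 to 75% of that of the prior art.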
【０１５９】
図１４にはインクリメンタＩＮＣのブロック図が例示されている。インクリメンタＩＮＣはプログラムカウンタＰＣのインクリメント（＋２／＋４）を行なう。
【０１６０】
前述の通り、２組の算術論理演算器ＡＬＵ、ＡＬＵＳがあるのに対して、インクリメンタＩＮＣは２組でなく、１組とされる。分岐命令などを除き、プログラムカウンタＰＣのインクリメントは＋４とされる。従って、通常入力されるプログラムカウンタＰＣの下位２ビットは、２’ｂ００となる。
【０１６１】
インクリメンタＩＮＣは、各ビットハーフアダー３００で構成され、内部バスＧＢを入力とし、内部バスＰＣＷＢへ出力を行なう。ビット１のみ、データ入力を論理和回路（ＯＲ）３０１によって論理値１（＋２）に固定し、更に、キャリ（論理値１）を入力（＋２）する。すなわち、＋２を二重に行なうことによって＋４を実現する。
【０１６２】
分岐命令の場合、分岐先のアドレスが４の倍数番地（ビット１が０）であれば、分岐命令以外と同様に、＋４が行われる。分岐先のアドレスが４の倍数番地でなければ（ビット１が１）、前記論理和回路３０１による＋２の意味がないから、キャリ入力の＋２のみが行われる。
【０１６３】
これによって、分岐先のアドレスによって、自動的に＋２／＋４の選択が行なわれる。換言すれば、入力された内容より大きく、その内最も小さい４の倍数が出力される。
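The behavior of the incrementer can be checked with a few lines of code. This is a behavioral sketch, not the half-adder circuit itself: forcing bit 1 of the input to 1 and then injecting a carry into bit 1 is modeled as `(pc | 2) + 2`, and the 24-bit mask reflects the PC width given earlier in the description. The function names are hypothetical.

```python
PC_MASK = 0xFFFFFF  # the PC is described as a 24-bit counter


def pc_increment(pc: int) -> int:
    """Model of INC: OR bit 1 with 1, then add a carry into bit 1 (+2)."""
    return ((pc | 0b10) + 0b10) & PC_MASK


def next_multiple_of_4(pc: int) -> int:
    """Smallest multiple of 4 that is strictly greater than pc."""
    return ((pc // 4 + 1) * 4) & PC_MASK


# The two formulations agree for every even address in a sample range.
agree = all(pc_increment(a) == next_multiple_of_4(a) for a in range(0, 1024, 2))
```

For an input of 14 the result is 16 (an effective +2), and for an input of 16 the result is 20 (an effective +4), so no separate selection logic for +2 versus +4 is needed.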
【０１６４】
図１５にはライトデータバッファＷＤＢのブロック図が例示される。ライトデータバッファＷＤＢは３つの部分ＷＤＢ−Ｍ、ＷＤＢ−Ｓ、ＷＤＢ−ＯＵＴから構成され、ＷＤＢ−Ｍには、内部バスＧＢからの入力が可能とされ、ＷＤＢ−ＭからＷＤＢ−Ｓへの転送が可能とされ、更に、ＷＤＢ−Ｍ及びＷＤＢ−ＳからＷＤＢ−ＯＵＴへの転送が可能とされると共に、ＷＤＢ−ＯＵＴには、内部バスＧＢ、ＤＢからの入力が可能とされる。一方、ＷＤＢ−Ｍ、ＷＤＢ−Ｓからは内部バスＧＢへ出力可能とされる。データバスＩＤＢへの出力は、ＷＤＢ−ＯＵＴから行なう。
【０１６５】
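The value of the program counter PC to be saved is stored in the write data buffer WDB in advance. Storing the PC value to be saved in the write data buffer WDB in advance is described in Japanese Unexamined Patent Publication No. S62-293665; in the present invention, however, the way the PC value to be saved is output differs according to the code length of the instruction itself and the state of the instruction read in progress at the start of instruction execution (IFMON).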
【０１６６】
具体的には、４の倍数番地に存在する１ワード命令で、命令リード中であれば（ＩＦＭＯＮ＝１）、待避すべきプログラムカウンタＰＣの値をＷＤＢ−Ｓから内部データバスＩＤＢに得る。その時、ＰＣ値のビット１を１に固定する。
【０１６７】
４の倍数番地に存在する１ワード命令で、命令リード中でなければ（ＩＦＭＯＮ＝０）、待避すべきプログラムカウンタＰＣの値をＷＤＢ−Ｍから内部データバスＩＤＢに得る。また、その時のＰＣ値のビット１を１に固定する（図２１のＴ８部分参照）。
【０１６８】
４の倍数番地に存在しない１ワード命令で、命令リード中であれば（ＩＦＭＯＮ＝１）、待避すべきＰＣの値をＷＤＢ−Ｓから内部データバスＩＤＢに得る。そのときのＰＣ値のビット１に対する１固定は行なわない。
【０１６９】
４の倍数番地に存在しない１ワード命令で、命令リード中でなければ（ＩＦＭＯＮ＝０）、待避すべきプログラムカウンタＰＣの値をＷＤＢ−Ｍから内部データバスＩＤＢに得る。そのときのＰＣ値のビット１に対する１固定は行なわない（図２０のＴ９部分参照）。
【０１７０】
４の倍数番地に存在する２ワード命令で、命令リード中であれば（ＩＦＭＯＮ＝１）、待避すべきプログラムカウンタＰＣ値をＷＤＢ−Ｍから内部データバスＩＤＢに得る。前記ビット１の固定は行なわない。
【０１７１】
４の倍数番地に存在する２ワード命令で、命令リード中でなければ（ＩＦＭＯＮ＝０）、ＰＣから待避すべきプログラムカウンタＰＣの値を内部データバスＩＤＢに得る。ビット１の固定は行なわない（図１９のＴ８部分参照）。
【０１７２】
４の倍数番地に存在しない２ワード命令で、命令リード中であれば（ＩＦＭＯＮ＝１）、待避すべきプログラムカウンタＰＣの値をＷＤＢ−Ｓから内部データバスＩＤＢに得る。その時、ＰＣ値のビット１を１に固定する。
【０１７３】
４の倍数番地に存在しない２ワード命令で、命令リード中でなければ（ＩＦＭＯＮ＝０）、待避すべきプログラムカウンタＰＣ値をＷＤＢ−Ｍから内部データバスＩＤＢに得る。その時、ＰＣ値のビット１を１に固定する（図１８のＴ１０部分参照）。
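The eight cases above amount to a lookup table keyed by alignment, instruction length, and IFMON. The sketch below encodes them as written; the table name, function name, and parameters are hypothetical, and the sample values used in the checks are the PC, WDB-M, and WDB-S contents quoted in the walkthroughs of FIG. 18 and FIG. 19.

```python
# (at multiple of 4, length in words, IFMON) -> (source of saved PC, fix bit 1 to 1)
SAVE_RULES = {
    (True, 1, True): ("WDB-S", True),
    (True, 1, False): ("WDB-M", True),
    (False, 1, True): ("WDB-S", False),
    (False, 1, False): ("WDB-M", False),
    (True, 2, True): ("WDB-M", False),
    (True, 2, False): ("PC", False),
    (False, 2, True): ("WDB-S", True),
    (False, 2, False): ("WDB-M", True),
}


def saved_pc(instr_addr: int, words: int, ifmon: bool,
             wdb_m: int, wdb_s: int, pc: int) -> int:
    """Return the PC value pushed on the stack by a subroutine branch."""
    source, fix_bit1 = SAVE_RULES[(instr_addr % 4 == 0, words, ifmon)]
    value = {"WDB-M": wdb_m, "WDB-S": wdb_s, "PC": pc}[source]
    return value | 0b10 if fix_bit1 else value


# FIG. 18: two-word JSR at address 18, no read in progress.
fig18 = saved_pc(18, 2, False, wdb_m=20, wdb_s=16, pc=24)
# FIG. 19: two-word JSR at address 16, no read in progress.
fig19 = saved_pc(16, 2, False, wdb_m=16, wdb_s=12, pc=20)
```

In both sample cases the result is the address of the instruction following the subroutine branch (22 and 20), even though the PC itself has already run ahead because of prefetching.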
【０１７４】
また、プログラム相対のアドレッシングモードは、次の命令のアドレスを基準にディスプレースメント（相対値）が加算されるが、これに用いる次の命令のアドレスは、サブルーチン分岐命令時に待避するプログラムカウンタＰＣの値と同じ値になるから、上記のライトデータバッファＷＤＢの内容を使用することができる。即ち、上記同様に、ＷＤＢ−Ｍ、ＷＤＢ−ＳまたはプログラムカウンタＰＣから、適宜、内部バスＧＢへ次の命令のアドレスを読み出し、算術論理演算器ＡＬＵなどでディスプレースメントと加算を行なえばよい。ビット１を１にセットすることは、算術論理演算器ＡＬＵで行なってもよいし、内部バスＧＢ上、或いは、ライトデータバッファＷＤＢ又はプログラムカウンタＰＣ上で行なってもよい。
【０１７５】
図１６にはプログラム相対分岐アドレス計算用の算術演算器ＡＵの一例が示される。算術演算器ＡＵは、１ワード長のプログラム相対分岐命令の実行開始に先立って、ライトデータバッファＷＤＢに格納されたプログラムカウンタＰＣの値（単にＰＣ値とも称する）をマルチプレクサＭＰＸ経由で入力する。１ワード長のため、プログラムカウンタＰＣは使用しない。更に算術演算器ＡＵは、リードデータバッファＲＤＢまたは命令レジスタ（ＦＩＦＯ）２００に保持された、命令コード中に含まれる８ビットディスプレースメントを、内部バスＤＢを経由して入力する。算術演算器ＡＵは双方の入力を加算する。かかる分岐命令が４の倍数番地に存在するとき、更に、制御信号ｐｌｓ２によって、ＰＣ値のビット１を１に固定して、実効的に＋２を同時に行なうことが可能にされる。
【０１７６】
プログラム相対分岐アドレス計算用の演算器ＡＵを持つことにより、算術論理演算器ＡＬＵの動作状態に拘らず、分岐アドレスの計算ができ、分岐の高速化が実現できる。
【０１７７】
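The program-relative addressing mode is not limited to branch instructions and can also be used by transfer instructions and the like; the arithmetic unit AU speeds up effective address calculation and thus improves instruction processing speed.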
【０１７８】
図１７にサブルーチン分岐命令（ＪＳＲ＠ａａ：２４）の動作タイミングが示される。同図の表現形式は図９の場合と同様である。図１７において、第３ステートＳＴ３に、プログラムカウンタＰＣのスタックのためのステートが挿入される他は、図９の分岐命令と同様である。第１ステートＳＴ１は、自命令の第２ワードのリードの完了を待つためであり、第２ワードがリード済みであれば省略（スキップ）する。第２ステートＳＴ２の分岐先アドレスの命令リード、第３ステートＳＴ３のスタック、第４ステートＳＴ４の次の命令のリードは、サブルーチン分岐命令に固有の動作であり、省略（スキップ）されない。
【０１７９】
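FIGS. 18 and 19 show examples of execution timing for a program containing a subroutine branch instruction. Each figure shows the timing when the following program is executed at the destination reached by a branch instruction:
L0 JMP L1
L1 MOV.W @aa1,R0
JSR L2
In FIGS. 18 and 19 the instructions are placed at different addresses. The labels are, in FIG. 18, L0 = 2, L1 = 14, and L2 = 40, that is,
L0 .EQU 2
L1 .EQU 14
L2 .EQU 40
and in FIG. 19, L0 = 2, L1 = 12, and L2 = 40, that is,
L0 .EQU 2
L1 .EQU 12
L2 .EQU 40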
【０１８０】
サブルーチン分岐命令は２ワード長の命令コードを持ち、図１８では、サブルーチン分岐命令が、アドレス１８（〜２１）に存在するから、待避すべき、ＰＣ値の内容は２２である。また、図１９では、サブルーチン分岐命令が、アドレス１６（〜１９）に存在するから、待避すべきＰＣ値の内容は２０である。尚、図１８、図１９において、サブルーチン分岐命令実行前までの動作タイミングは、図１０、図１１と同じである。
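[0181]
In FIG. 18, the instruction code (jsr-1) is input to the decoder (DEC) 203 in slot C1 of cycle T7 and decoded. At this point the PC holds the address of the next instruction read (24), the write data buffer WDB-M holds the address of the previous instruction read (20), and WDB-S holds the address of the instruction read before that (16).
[0182]
In slot C2 of cycle T7, the contents of WDB-M (20) are transferred to WDB-OUT.
[0183]
In slot C1 of cycle T9, the stack pointer SP is read onto the internal bus GB and input to the arithmetic logic unit ALU, which decrements it (-4).
[0184]
In slot C2 of cycle T9, the decremented result is read onto the internal bus WB and written to the stack pointer SP; it is also read onto the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. A longword data write bus command is issued at the same time.
[0185]
As write data, the contents of WDB-OUT are output in slot C1 of cycle T10 with bit 1 fixed to 1, and this value (22) is stored on the stack.
[0186]
In FIG. 19, the instruction code (jsr-1) is input to the decoder (DEC) 203 in slot C1 of cycle T6 and decoded.
[0187]
At this point the PC holds the address of the next instruction read (20), the write data buffer WDB-M holds the address of the previous instruction read (16), and WDB-S holds the address of the instruction read before that (12).
[0188]
In slot C2 of cycle T7, the PC value (20) is transferred to WDB-OUT via the internal bus DB.
[0189]
In slot C1 of cycle T7, the stack pointer SP is read onto the internal bus GB and input to the arithmetic logic unit ALU, which decrements it (-4). In slot C2 of cycle T7, the decremented result is read onto the internal bus WB and written to the stack pointer SP; it is also read onto the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. A longword data write bus command is issued at the same time.
[0190]
As write data, the contents of WDB-OUT are output in slot C1 of cycle T8 (bit 1 is not fixed), and this value (20) is stored on the stack.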
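The arithmetic of the FIG. 18 and FIG. 19 walkthroughs can be checked with a short Python sketch. The function names and the `fix_bit1` flag are illustrative assumptions, not names from the patent; the sketch only reproduces the described values, where the saved PC equals the address following the two-word JSR (22 and 20 according to paragraph 0180).

```python
def stacked_pc(buffered_value: int, fix_bit1: bool) -> int:
    """Value written to the stack: the buffered address, optionally with bit 1 forced to 1."""
    return (buffered_value | 0b10) if fix_bit1 else buffered_value


def expected_return_address(jsr_address: int, length_words: int = 2) -> int:
    """Address of the instruction following a JSR of the given length (16-bit words)."""
    return jsr_address + 2 * length_words


# FIG. 18: JSR at 18, WDB-M holds 20, bit 1 is fixed to 1 on output.
fig18 = stacked_pc(20, fix_bit1=True)
# FIG. 19: JSR at 16, the PC value 20 is used as is.
fig19 = stacked_pc(20, fix_bit1=False)
```

In both placements the stacked value matches the address after the JSR, which is the point of selecting the source buffer and the bit-1 fix by instruction address.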
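[0191]
FIGS. 20 and 21 show examples of execution timing for programs containing other subroutine branch instructions.
[0192]
FIG. 20 shows the timing when the subroutine branch instruction uses memory-indirect addressing instead of absolute addressing and the following program

is executed at the destination of a branch instruction. The labels in this case are L0 = 2, L1 = 14, L2 = 40, L3 = 160, that is,
L0 .EQU 2
L1 .EQU 14
L2 .EQU 40
L3 .EQU 160
In memory-indirect addressing, memory is read at the address contained in the instruction code (L3), and the value read (L2) becomes the branch address.
[0193]
The timing of FIG. 20 is the same as that of FIG. 10 up to execution of the subroutine branch instruction. In slot C1 of cycle T7 in FIG. 20, the instruction code (jsr) is input to the decoder (DEC) 203 and decoded.
[0194]
At this point the PC holds the address of the next instruction read (24), the write data buffer WDB-M holds the address of the previous instruction read (20), and WDB-S holds the address of the instruction read before that (16).
[0195]
In slot C2 of cycle T7, the contents of WDB-M (20) are transferred to WDB-OUT. In slot C1 of cycle T9, the stack pointer SP is read onto the internal bus GB and input to the arithmetic logic unit ALU, which decrements it (-4). In slot C2 of cycle T9, the decremented result is read onto the internal bus WB and written to the stack pointer SP; it is also read onto the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. A longword data write bus command is issued at the same time.
[0196]
As write data, the contents of WDB-OUT are output in slot C1 of cycle T10 with bit 1 fixed to 1, and this value (22) is stored on the stack.
[0197]
FIG. 21 shows the timing when the subroutine branch instruction uses program-counter-relative addressing and the following program

is executed at the destination of a branch instruction. The labels in this case are L0 = 2, L1 = 12, L2 = 40, that is,
L0 .EQU 2
L1 .EQU 12
L2 .EQU 40
In this case the displacement of the BSR is 22 (decimal).
[0198]
In FIG. 21, the timing up to execution of the subroutine branch instruction is the same as in FIG. 11. In slot C1 of cycle T6, the PC value (16) held in the write data buffer WDB and the 8-bit displacement (22) held in the read data buffer RDB are input to the arithmetic unit AU, bit 1 of the PC value is directed to be fixed to 1, and the addition is performed. The sum (40) is output to the internal bus GB in slot C2 of cycle T6, stored in the address buffer AB, and output to the internal address bus IAB.
[0199]
Also in slot C2 of cycle T6, the contents of the write data buffer WDB-M (16) are transferred to WDB-OUT, and in slot C1 of cycle T9 they are output to the internal data bus IDB with bit 1 fixed to 1. This value (18) is written to the stack as the return address. Bit 1 is fixed to 1 because bit 1 of the address at which the BSR resides is 0.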
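The FIG. 21 numbers can be reproduced with a small Python sketch of the arithmetic unit AU. This is only an illustration under stated assumptions: the function name, the `at_multiple_of_4` flag (standing in for the effect of the pls2 control signal), and the sign extension of the 8-bit displacement are not taken from the patent's figures.

```python
def bsr_target_and_return(wdb_pc: int, disp8: int, at_multiple_of_4: bool):
    """Return (branch target, return address) for a one-word PC-relative BSR.

    wdb_pc is the buffered PC value; when the BSR sits at a multiple-of-4
    address, bit 1 is forced to 1, which is an effective +2.
    """
    # Assumption: the 8-bit displacement is treated as a signed value.
    disp = disp8 - 0x100 if disp8 & 0x80 else disp8
    next_pc = (wdb_pc | 0b10) if at_multiple_of_4 else wdb_pc
    return next_pc + disp, next_pc


target, return_address = bsr_target_and_return(16, 22, at_multiple_of_4=True)
```

With the BSR at address 16 and a displacement of 22, the sketch yields the branch target 40 (label L2) and the return address 18 given in the text.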
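[0200]
FIGS. 22 and 23 show part of the logic description of the decode logic, contained in the decoder (DEC) 203, for the transfer instruction of FIG. 8. The logic description shown is called an RTL (Register Transfer Level) or HDL (Hardware Description Language) description and can be expanded into a logic circuit with a known logic synthesis tool. The HDL is standardized as IEEE 1364. The syntax follows the case statement: whenever a value or signal defined in the parentheses following always@ changes, the description lines below it are processed. "5'b00001" denotes the 5-bit binary value 00001. IR[8] denotes the logic value of the ninth bit from the least significant bit of the instruction register IR (the input to DEC). The symbol ~ denotes logical inversion.
[0201]
The logic description of FIGS. 22 and 23 decodes the code of the transfer instruction "MOV.W @aa:16,Rd". In it, 16'b0110_101?_??00_???? on the line following casex(IR) is the code of that transfer instruction. IR[8] = 0 means byte size and IR[8] = 1 word size; IR[7] = 0 means a transfer from memory to a general-purpose register (read type) and IR[7] = 1 a transfer from a general-purpose register to memory (write type). Whether the first state ST1 and the third state ST3 of the instruction are omitted is determined by the states of the signals FIFOCNT1, FIFOCNT2, and IFMON. That is, the logic description generates control signals according to the stage code TMG and determines the next stage code NEXTTMG from the current value of TMG and the values of FIFOCNT1, FIFOCNT2, IFMON, and so on at that time; this is how it decides whether to omit the first state ST1 and the third state ST3. Referring to FIG. 22, the stage code of the first state ST1 is 1, that of the second state ST2 is 17, and that of the third state ST3 is 3.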
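To make the casex pattern concrete, the following Python sketch matches a 16-bit instruction word against 16'b0110_101?_??00_???? and extracts the IR[8] and IR[7] fields. It is a software illustration only, not the HDL of FIGS. 22 and 23; the helper names `casex_match` and `decode_mov` are hypothetical.

```python
MOV_PATTERN = "0110_101?_??00_????"  # pattern quoted in the text; '?' is a don't-care bit


def casex_match(pattern: str, value: int) -> bool:
    """Compare value with a casex-style bit pattern, ignoring '?' positions."""
    bits = pattern.replace("_", "")
    mask = int("".join("0" if c == "?" else "1" for c in bits), 2)
    want = int(bits.replace("?", "0"), 2)
    return (value & mask) == want


def decode_mov(ir: int):
    """Return (size, direction) for a matching instruction word, else None."""
    if not casex_match(MOV_PATTERN, ir):
        return None
    size = "word" if (ir >> 8) & 1 else "byte"        # IR[8]
    direction = "write" if (ir >> 7) & 1 else "read"  # IR[7]
    return size, direction
```

For example, `decode_mov(0x6B00)` gives `("word", "read")`, while a word with bit 4 set, such as `0x6B10`, does not match the pattern.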
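[0202]
In detail, the first part (1-1) of the logic description in FIG. 22 generates the stage code TMG. The stage code TMG advances 1 → 17 → 3, but stage codes 17 and 3 are omitted depending on the states of FIFOCNT1, FIFOCNT2, and IFMON. At stage code 1, if the instruction's own second word has already been read (FIFOCNT1 = 1), data read/write control is performed. At stage code 1, if the second word has not yet been read (FIFOCNT1 = 0), control proceeds to stage code 17, where data read/write control is performed.
[0203]
The second part (1-2) of the logic description performs bus control. nop = 0 directs the start of a bus access and nop = 1 inhibits bus access. data = 0 directs an instruction read and data = 1 a data access. byte = 0 directs word size and byte = 1 byte size. write = 0 directs a read and write = 1 a write.
[0204]
For this transfer instruction, an instruction read is performed at stage code 1 when the instruction's own second word has not yet been read, and at stage code 3; a data access is performed at stage code 1 when the second word has already been read, or at stage code 17. Whether the data access is a read or a write is indicated by IR[7]. For an instruction read, the contents of the internal data bus IDB are stored in IR and the read data buffer RDB at a predetermined timing. For a data read, the contents of the internal data bus IDB are stored in the read data buffer RDB at a predetermined timing. For a data write, the contents of the write data buffer WDB are output to the internal data bus IDB at a predetermined timing.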
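The stage sequencing and bus control described above can be sketched behaviorally in Python. This is a simplified model under assumptions: the function names are hypothetical, and because the text does not spell out exactly which combination of FIFOCNT2 and IFMON omits stage code 3, that decision is passed in as a plain flag (`skip_last_fetch`).

```python
def mov_stage_sequence(fifocnt1: int, skip_last_fetch: bool) -> list:
    """Stage codes visited by the transfer instruction.

    fifocnt1 == 1 means the second word is already read, so stage code 17 is omitted.
    skip_last_fetch stands in for the FIFOCNT2/IFMON condition that omits stage code 3.
    """
    stages = [1]
    if fifocnt1 == 0:
        stages.append(17)
    if not skip_last_fetch:
        stages.append(3)
    return stages


def mov_bus_action(tmg: int, fifocnt1: int, ir: int) -> str:
    """Bus action for a given stage code, following the second part (1-2)."""
    if (tmg == 1 and fifocnt1 == 1) or tmg == 17:
        return "data_write" if (ir >> 7) & 1 else "data_read"  # IR[7] selects direction
    if tmg in (1, 3):
        return "instruction_read"
    raise ValueError(f"unexpected stage code {tmg}")
```

In the shortest case the instruction runs in a single stage (`[1]`) that performs the data access; in the longest case it visits `[1, 17, 3]`, with instruction reads in the first and last stages.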
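[0205]
The third part (1-3) of the logic description, in FIG. 23, calculates the effective address. For this transfer instruction, at stage code 1 when the instruction's own second word has already been read, or at stage code 17, the 16-bit EA extension of the instruction code held in the read data buffer RDB is sign-extended to 32 bits by the rdbext signal and output to the internal bus GB. At stage code 1 when the second word has not yet been read, and at stage code 3, the description directs reading the PC value onto the internal bus PCGB, inputting it to the address buffer AB and the incrementer INC, and storing the incremented result from the internal bus PCWB into the program counter PC. When input from the internal bus PCGB is not directed, the address buffer AB takes its input from the internal bus GB. rdbgb is the signal directing output of the contents of the read data buffer RDB to bus GB, and rdbext is the signal directing sign extension of the read data buffer RDB.
[0206]
The fourth part (1-4) of the logic description, in FIG. 23, controls the transfer data and registers. For the read type (IR[7] = 0), at stage code 1 when the instruction's own second word has already been read, or at stage code 17, the read data is output from the read data buffer RDB to bus WB and stored in the general-purpose register (Rd). Updating of the N, Z, and V flags of the condition code register CCR is also directed. As shown in FIG. 8 and elsewhere, control of these operations is delayed. The delay circuit itself is not shown.
[0207]
For the write type (IR[7] = 1), at stage code 1 when the instruction's own second word has already been read, or at stage code 17, data is output from the general-purpose register (Rd) to the internal bus DB and, in either case, stored in the write data buffer WDB. Updating of the N, Z, and V flags of the condition code register CCR is also directed.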
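[0208]
FIGS. 24 to 26 show part of the logic description of the decode logic, contained in the decoder (DEC) 203, for the branch instruction and subroutine branch instruction of FIGS. 9 and 17. The notation is the same as in FIGS. 22 and 23. In the decode logic of FIGS. 24 to 26, IR[10] = 0 denotes a branch (JMP) and IR[10] = 1 a subroutine branch (JSR).
[0209]
The first part (2-1) of the logic description, shown in FIG. 24, generates the stage code TMG. The stage code TMG advances 1 → 17 → 2 → 3, but stage code 17 is omitted depending on the states of FIFOCNT1, FIFOCNT2, and IFMON. At stage code 1, if the instruction's own second word has already been read (FIFOCNT1 = 1), effective address calculation and the instruction read at the branch destination are controlled. At stage code 1, if the second word has not yet been read (FIFOCNT1 = 0), control proceeds to stage code 17, where effective address calculation and the instruction read at the branch destination are controlled.
[0210]
The second part (2-2) of the logic description, shown in FIG. 24, performs bus control. For this branch instruction, bus access is inhibited at stage code 1 when the instruction's own second word has not yet been read. At stage code 1 when the second word has already been read, or at stage code 17, the instruction at the branch destination is read. For a subroutine branch, control proceeds to stage code 2, where a longword-size data write is performed to stack the PC value. At stage code 3, the instruction following the already-read branch destination instruction is read.
[0211]
The third part (2-3) of the logic description, shown in FIG. 25, calculates the effective address. For this branch instruction, at stage code 1 when the instruction's own second word has already been read, or at stage code 17, the EA extension of the instruction code held in the read data buffer RDB is output to the internal bus GB and stored in the address buffer AB. This value is automatically input to the incrementer INC and incremented (+2/+4). Storing of the incremented result into the program counter PC is also directed. At stage code 3, output of the PC value to the internal bus PCGB and storing of the incremented result from the internal bus PCWB into the program counter PC are directed.
[0212]
The fourth part (2-4) of the logic description, shown in FIG. 26, controls the transfer data (the PC to be stacked) and the registers. At stage code 1 when the instruction's own second word has already been read, or at stage code 17, reading of the stack pointer contents onto the internal bus GB is directed. The value is input to the arithmetic logic unit ALU and, although not shown, the ALU is directed to decrement it (-4).
[0213]
At stage code 2, a direction is given to output the decremented result from the arithmetic logic unit ALU to the internal bus GB, whereby the result is stored in the address buffer AB. Storing from the internal bus WB into the stack pointer SP is also directed.
[0214]
As described above, the address at which the subroutine branch instruction resided is held as A1 and decoded together with the instruction code. Using this address information and the signal IFMON, which indicates that an instruction read is in progress, the value transferred to the write data buffer WDB-OUT is selected from among PC, WDB-M, and WDB-S. The address information also determines whether bit 1 of the output data is set to 1 (+2).
[0215]
This control allows instructions to be executed regardless of their placement, with states omitted as appropriate.
[0216]
If there is an instruction that has no omissible state and that cannot be executed without incrementing the program counter PC by using the alternating operation of the arithmetic logic unit ALU and the sub arithmetic logic unit ALUS, issuance of the bus command and the PC increment can be inhibited by referring to the state of FIFOCNT2 and the like. For example, in FIGS. 22 to 26, nop = 1 can be set in the second part (1-2, 2-2) and incpc = 0 in the third part (1-3, 2-3). This prevents the amount of instructions read from exceeding the amount executed (consumed), which would otherwise overflow the instruction register (FIFO) 200 or lose the program counter PC contents to be saved by a subroutine branch instruction or the like.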
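[0217]
The above provides the following effects. [1] With respect to an existing CPU, the data bus width is expanded without impairing compatibility to speed up instruction reads, and instruction execution control is divided into states that include instruction-specific operations and states that only read instructions, the latter being omissible (skippable). The gain from faster instruction reads is taken by omitting (skipping) some states of instruction execution, which balances the amount of instructions read against the amount executed and shortens instruction execution time, achieving higher speed. Making part of an instruction omissible and omitting it as appropriate according to the amount of instructions already read allows instructions to be placed arbitrarily (relocatable). Arbitrary placement makes programs easier to write and removes development constraints on C compilers and the like.
[0218]
[2] For instructions such as register-to-register operation instructions, which have a one-word (basic unit length) instruction code, execute in one state (unit time), and have no omissible state, a plurality of arithmetic units are provided and operated with a time offset shorter than the execution time of the execution resources, so that multiple register-to-register operation instructions and the like can effectively be executed simultaneously. The instruction read of one of the instructions effectively operating simultaneously is omitted, which balances the amount of instructions read against the amount executed and shortens instruction execution time, achieving higher speed. Using one instruction decoder that handles all instructions (DEC 203) and one that exclusively controls one of the effectively simultaneous arithmetic units (SDEC 204) minimizes the increase in logical scale and hence in manufacturing cost. Operating the arithmetic units with a time offset avoids parallel processing, simplifies handling of conflicts over general-purpose registers and the like, and suppresses growth in logical scale. The instruction decoder that handles all instructions and the blocks of the execution unit EXEC can share most of their logic with the existing CPU, so design assets are used effectively, improving design quality and shortening development time.
[0219]
[3] Instruction codes that only generate control signals, such as prefix codes, are made skippable, and when skipped only the control signals are generated without using the instruction decoder, which shortens instruction execution time and achieves higher speed.
[0220]
[4] On branch instructions and interrupt exception handling, the first instruction at the branch destination is read and its execution starts immediately, so responsiveness is maintained or improved.
[0221]
[5] Storing in the instruction register, together with each instruction code, the content of bit 1 of the address bus IAB for that code, and evaluating both at once in the instruction decoder 203, simplifies control and the decode circuit and suppresses growth in logical scale.
[0222]
[6] The PC increment after reading the branch destination instruction is switched automatically between +2 and +4 by the incrementer INC according to the branch destination address, so processing is uniform whether or not the destination is at a multiple-of-4 address, and growth in logical scale is suppressed.
[0223]
[7] The write data buffer WDB stores, at each PC increment, the PC contents before the increment; the write data buffer WDB has a FIFO structure; and bit 1 of the write data buffer WDB can be fixed to logic 1, so +2 is easily realized. The PC value to be saved by a subroutine branch instruction is thus easily obtained. In addition, providing an arithmetic unit AU that adds the displacement to the saved PC value held in the write data buffer WDB, with a path for direct input from the write data buffer WDB, speeds up preparation for the program-relative addressing mode and hence improves instruction processing speed. Furthermore, there is no need for a separate prefetch counter and program counter with separate incrementers, which suppresses growth in logical scale.
[0224]
[8] Where compatible CPUs with a wide address space and a narrow address space exist, both can be sped up while maintaining compatibility. Each need only be given the instructions and addressing modes it requires.
[0225]
[9] Because existing instructions remain executable and the order of internal operations is kept equivalent, room for future expansion is not greatly reduced compared with the existing CPU. For example, if a technique makes it possible to add new instructions to the existing CPU, that technique can be expected to apply to a CPU embodying the present invention as well. As long as instruction set compatibility is maintained, the same instructions as in the existing CPU can be added at the machine-language level. An added instruction with multiple execution states can likewise be divided into a part performing its specific operation and omissible states, the latter being omitted as needed. At the least, instruction reads and PC increments can be inhibited as needed, so a processing time equal to that of the existing CPU is achievable. If an added instruction executes in one state, it can be sped up by, for example, the alternating operation of ALU and ALUS.
[0226]
[10] Using the same instruction set as the existing CPU allows development tools such as assemblers, C compilers, and simulator/debuggers, the so-called cross software, to be shared. Sharing cross software lets a development environment be prepared quickly. It also limits the resources needed to develop the environment, and lets users avoid unwanted cost by reusing their existing development environment.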
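[0227]
Although the invention made by the present inventor has been described concretely on the basis of embodiments, it goes without saying that the present invention is not limited to them and can be modified in various ways without departing from its gist.
[0228]
For example, apart from maintaining compatibility, the present invention can be applied to an entirely new microcomputer. The instruction set, that is, the kinds of instructions and addressing modes and their combinations, is arbitrary. The general-purpose registers need not be usable for both addresses and data; some or all may be dedicated to addresses or to data. The data size of the general-purpose registers is also arbitrary.
[0229]
The kinds of prefix codes are not particularly limited. A prefix code may carry other control information in addition to information specifying longword. The basic unit of the instruction code need not be 16 bits and may be any width, such as 8 or 32 bits. The data bus width is likewise not limited to 32 bits and may be, for example, 64 bits. It need not be twice the basic instruction unit; it may be four times, for example.
[0230]
The capacity of the instruction register (FIFO) is not limited to three words; a minimum of two words suffices. With a larger capacity, even when there are instructions with no omissible state, the accumulated instructions can be balanced by omitting more states in subsequent instructions. However, even with a large capacity, instructions already read are wasted when a branch instruction is executed, so in the normal or steady state the amount of instructions held in the instruction register should not be made too large.
[0231]
Although the present invention avoids parallel processing, it can also be combined with parallel processing. Some instructions may be processed in parallel. The numbers of arithmetic units and instruction decoders are arbitrary. No restriction is placed on the other functional blocks of the single-chip microcomputer.
[0232]
The above description has mainly dealt with application of the invention to a single-chip microcomputer, the field of use that forms its background, but the invention is not limited to this; it is applicable to other microcomputers and data processing devices, and at least to any data processing device that decodes and processes instructions and performs arithmetic processing.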
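[0233]
[Effects of the Invention]
The effects obtained by typical inventions disclosed in the present application are briefly as follows.
[0234]
That is, (relative to an existing CPU) the internal data bus width is made at least larger than the basic unit (word) of an instruction; an instruction register holding (multiple units of) read instructions is provided, together with means for monitoring the amount of instructions in that register; and each (existing) instruction is divided, by the basic unit time of execution (state), into states that control only an instruction read (and PC increment) and states that include control of effective address calculation or data operation processing. For example, even when effective address calculation or data transfer spans multiple states, the control itself is performed once and, by delaying control signals for example, the actual operations are carried out over multiple states (for example, address calculation in the first state and storage of the read data in the next); the execution means is configured so that an operation whose control signal is delayed (for example, storage of the read data) can proceed at the same time as the next control operation (PC increment) even when they overlap. The states that control only an instruction read are made omissible, and are omitted (skipped) according to the amount of instructions in the instruction register (as directed by the monitoring means).
[0235]
Thus, making the internal data bus wider than the basic unit (word) of an instruction increases the amount of instructions read at one time (compared with the existing CPU). In the basic execution states, performing (as in the existing CPU) the number of instruction reads corresponding to the instruction's own code length, without omission (skipping), reads more instruction code than the executed instruction consumes and so accumulates read instruction code. With omission (skipping), on the other hand, not performing instruction reads makes the amount read equal to the code of the executed instruction, which maintains the amount of read instruction code, or smaller, which reduces it. The amount of read instructions is thereby kept within a predetermined range (the amount read is balanced against the amount executed) while instruction reads are sped up, shortening overall execution time. In addition, automatically changing which states are omitted (skipped) accommodates changes in instruction placement.
[0236]
Where a CPU with a wide address space (large instruction set) and a CPU with a small address space (small instruction set) exist with object-level compatibility, realizing the above speedup in the wide-address-space CPU makes the same speedup possible for the instructions that also exist in the downward-compatible small-address-space CPU. In other words, the same method speeds up both the wide-address-space CPU and the small-address-space CPU while maintaining object-level compatibility. Both the advantages of object-level compatibility and the advantages of higher speed can be enjoyed.
[0237]
Using the same instruction set as the existing CPU allows development tools such as assemblers, C compilers, and simulator/debuggers, the so-called cross software, to be shared. Sharing cross software lets a development environment be prepared quickly.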
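[Brief Description of the Drawings]
[FIG. 1] A block diagram of a CPU to which the data processing device according to the present invention is applied.
[FIG. 2] A block diagram of a single-chip microcomputer to which the data processing device according to the present invention is applied.
[FIG. 3] An explanatory diagram of the programming model for the general-purpose registers and control registers built into the CPU of FIG. 1.
[FIG. 4] An explanatory diagram of the programming model for the general-purpose registers and control registers built into another CPU.
[FIG. 5] An explanatory diagram of an example of the machine-language instruction format of the CPU 2 of FIG. 1.
[FIG. 6] An explanatory diagram showing an example of the ROM.
[FIG. 7] An explanatory diagram of the addressing modes of the CPU.
[FIG. 8] An operation timing chart of the transfer instruction "MOV.W @aa:16,Rd".
[FIG. 9] An operation timing chart of the branch instruction "JMP @aa:24".
[FIG. 10] A timing chart showing an example of program execution timing by the CPU.
[FIG. 11] A timing chart illustrating the execution timing of a program whose instruction placement addresses differ from FIG. 10.
[FIG. 12] A timing chart illustrating the execution timing of a program whose instruction placement addresses differ from FIGS. 10 and 11.
[FIG. 13] A timing chart illustrating the execution timing of a program whose instruction placement addresses differ from FIGS. 10 to 12.
[FIG. 14] A block diagram showing an example of the incrementer.
[FIG. 15] A block diagram showing an example of the write data buffer.
[FIG. 16] A block diagram showing an example of the arithmetic unit for program-relative branch address calculation.
[FIG. 17] A timing chart illustrating the operation timing of the subroutine branch instruction "JSR @aa:24" by the CPU.
[FIG. 18] A timing chart illustrating the execution timing of a program containing a subroutine branch instruction.
[FIG. 19] A timing chart illustrating the execution timing of a program containing a subroutine branch instruction.
[FIG. 20] A timing chart illustrating the execution timing of a program containing another subroutine branch instruction.
[FIG. 21] A timing chart illustrating the execution timing of a program containing another subroutine branch instruction.
[FIG. 22] An explanatory diagram illustrating, together with FIG. 23, part of the logic description of the decoder's decode logic for the transfer instruction of FIG. 8.
[FIG. 23] An explanatory diagram illustrating, together with FIG. 22, part of the logic description of the decoder's decode logic for the transfer instruction of FIG. 8.
[FIG. 24] An explanatory diagram illustrating, together with FIGS. 25 and 26, part of the logic description of the decoder's decode logic for the branch instruction and subroutine branch instruction of FIGS. 9 and 17.
[FIG. 25] An explanatory diagram illustrating, together with FIGS. 24 and 26, part of the logic description of the decoder's decode logic for the branch instruction and subroutine branch instruction of FIGS. 9 and 17.
[FIG. 26] An explanatory diagram illustrating, together with FIGS. 24 and 25, part of the logic description of the decoder's decode logic for the branch instruction and subroutine branch instruction of FIGS. 9 and 17.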
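[Description of Reference Signs]
1 Single-chip microcomputer
2 CPU
4 ROM
200 Instruction register
202 Instruction register controller
203 Instruction decoder
204 Sub instruction decoder
205 Register selector
FIFOCNT1, FIFOCNT2 Instruction code amount detection signals
ER0 to ER7 General-purpose registers
PC Program counter
ALU Arithmetic logic unit
ALUS Sub arithmetic logic unit
AU Arithmetic unit
INC Incrementer
WDB Write data buffer
RDB Read data buffer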
AB Address buffer
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a data processing apparatus such as a microcomputer, a microcontroller, and a central processing unit (CPU). For example, the present invention relates to a technique that is effective when applied to a device-embedded microcomputer formed as a semiconductor integrated circuit.
[0002]
[Prior art]
Microcomputers built as semiconductor integrated circuits have seen expansion of the address space, enlargement of the instruction set, and increases in speed. Because the function of a microcomputer's CPU is defined by software, it is desirable that the software assets of existing microcomputers remain usable even on microcomputers with an expanded address space, an enlarged instruction set, or higher speed.
[0003]
Examples that expand the address space and the instruction set while maintaining object-level compatibility are described in, for example, Japanese Patent Laid-Open No. 6-51981 and the "H8/300H Series Programming Manual" published by Hitachi, Ltd. in June 1993. These show that adopting a so-called load-store architecture is effective for extending the instruction set. Load-store architectures have been adopted more and more often with the spread of so-called RISC (Reduced Instruction Set Computer) microcomputers.
[0004]
As an example that executes basic instructions in one state while remaining compatible with the above CPU, which executed them in two states, and that further increases speed by incorporating a multiplier independent of the CPU, there are Japanese Patent Laid-Open No. 8-263290 and the "H8S/2600 Series, H8S/2000 Series Programming Manual" published by Hitachi, Ltd. in March 1995. In this CPU the basic unit of an instruction is 16 bits (a word) and the data bus is also 16 bits wide, so one bus access reads one word of instruction. In particular, pp. 268-277 of that manual describe the bus states during instruction execution.
[0005]
In the above CPU, a register-to-register operation instruction with a one-word instruction code performs a one-word instruction read (the next instruction, since prefetching is used) and a PC increment (+2), while inside the CPU the contents of the general-purpose registers are read, the operation is performed, and the result is written to a general-purpose register.
[0006]
On the other hand, in an operation instruction for 16-bit immediate data and a general-purpose register, the instruction code length is 2 words because 16-bit immediate data is included in the instruction. Two instruction reads (the second word of the own instruction and the next instruction read) and two PC increments are performed. This process requires two states. On the other hand, the CPU reads 16-bit immediate data and the contents of the general-purpose register, performs an operation, and writes the result into the general-purpose register. The internal arithmetic operation can be executed in one state in the same manner as the inter-register arithmetic instruction. That is, in one state, only instruction read and PC increment are performed, and in another one state, instruction read, PC increment and internal arithmetic processing are performed.
[0007]
Other instructions with multi-word instruction codes are likewise executed as a combination of states that perform only an instruction read and a PC increment and states that perform an instruction read, a PC increment, and internal arithmetic processing or data access.
[0008]
These techniques improve the processing capability of the CPU or microcomputer per unit time. This speeds up control processing in microcomputer application systems and allows the controlled equipment to be made faster, more functional, and more precise, or to be made smaller by integrating what used to take several semiconductor integrated circuits (microcomputers). In particular, shortening the interrupt response time improves timing accuracy in controlling various devices, the so-called real-time performance.
[0009]
[Problems to be solved by the invention]
The present inventor studied how to enable high-speed processing in data processing devices such as microcomputers while maintaining compatibility so that software assets remain usable, and while minimizing increases in logical and physical scale and in power consumption.
[0010]
Approaches to speeding up a CPU can be divided into reducing the number of states needed to execute each instruction and raising the operating speed or operating frequency of the semiconductor integrated circuit, such as a single-chip microcomputer. The latter can be achieved by finer process geometries, faster transistors, and the like. However, raising the operating speed or frequency inevitably increases current consumption. To keep the increase in current consumption to a minimum, it is therefore preferable to reduce the number of states needed to execute each instruction.
[0011]
Higher speed, higher functionality, and miniaturization of equipment are also demanded of CPUs and microcomputers with a relatively small address space and a relatively small instruction set. It is therefore desirable to speed up both CPUs with a wide address space and CPUs with a small one.
[0012]
Also, by maintaining compatibility, cross software tools such as an assembler, C compiler, simulator / debugger can be used in common. That is, since these development environments can use existing ones, the development environments can be prepared quickly.
[0013]
An ordinary microcomputer operates by reading instructions, so instruction reads are indispensable. Data accesses are also performed, as noted above, but a data access requires first reading the instruction that directs it. Since some instructions perform no data access, instruction reads account for more bus cycles than data accesses; in some cases, for example, instruction reads make up 80% and data accesses 20%.
[0014]
In order to increase the speed of instruction reading, the data bus width may be made larger than the unit word length of the instruction. For example, when the instruction unit word is 16 bits, it may be expanded to 32 bits. Instruction read for two words can be performed by one bus access.
[0015]
However, merely reading instructions faster is pointless unless the instructions read are also executed (consumed) at a matching rate.
[0016]
When a five-stage pipeline is built for a microcomputer with a 16-bit fixed-length instruction set and a 32-bit data bus, one bus cycle fetches two instructions, so every other instruction fetch stage can be left idle. The idle stage can be used for memory access.
[0017]
However, the present inventor has revealed that the method of increasing the number of bits for one instruction read to increase the speed has the following problems.
[0018]
When the pipeline is deepened, if the flow of the program changes as in the case of branch instructions and interrupt exception handling, the subsequent pipeline must be canceled and the pipeline newly refilled. In a device control microcomputer, since there are a relatively large number of branch instructions and many interrupts are generated, it is not preferable that the number of pipeline stages is large and the execution time of branch instructions and interrupt exception processing cannot be improved.
[0019]
A 32-bit data bus can read or write, in one access, 4 bytes starting at a multiple-of-4 address. However, if compatibility is to be maintained with an existing CPU that reads instructions in 16-bit units, 32-bit fixed-length instructions cannot be used, and instructions of one-word length, three-word length, and so on are mixed. In other words, instructions cannot be aligned to the unit of the 32-bit data bus. For a two-word instruction, for example, whether one or two instruction reads are needed must be determined by whether the instruction is located at a multiple-of-4 address, which is considered undesirable. If two instruction reads are performed, the execution time may be no better than that of the existing CPU. Furthermore, judging the address of the instruction itself and controlling both the one-read and the two-read cases tends to increase the logical scale.
[0020]
Meanwhile, so-called microprocessors have been sped up by techniques such as superscalar and VLIW, as described in "Nikkei Electronics," pp. 68-80, "Turn in 1998, simplify hardware to VLIW," published by Nikkei BP in January 1995. In either case, the number of processes that can be executed simultaneously is increased (for example, four in parallel) to improve overall processing performance. However, because these techniques execute a plurality of instructions in parallel, a plurality of CPU resources such as control circuits and arithmetic units must be provided, which increases the physical and logical scale.
[0021]
Further, in a device built-in type microcomputer (or microcontroller), the contents of processing are changed while referring to the states of various devices. Data access is performed to refer to the state of the device, and a branch instruction is executed to change the contents of processing. Therefore, it is less likely that the local program is repeatedly executed as compared with the microprocessor. In addition, depending on the control target, there are cases where there are restrictions on the order of execution of instructions and there are restrictions on the local instruction execution time. Even if there is no software contradiction, the order of instructions cannot always be changed. Furthermore, depending on the system to be controlled, it may be necessary to reduce power consumption.
[0022]
As described above, since there are many branch processes, there is a case where a branch instruction exists and the result of processing once must be abandoned even if parallel processing is possible using a technique such as superscalar. Also, conditional branch instructions cannot be executed until the branch condition referenced by the conditional branch instruction is determined, so instructions that generate branch conditions referenced by the conditional branch instruction and conditional branch instructions cannot be executed simultaneously. Further processing beyond that cannot be executed. Each time there is a conditional branch instruction, parallel processing of the instruction becomes impossible. If branch prediction or speculative execution is performed, the possibility of performing an operation different from the operation to be actually processed cannot be ignored. In the case of equipment control, conditional branch instructions are combined (configured in a tree) and branch destinations are often determined from a large number of branch destinations. In addition, it is considered that there are many more conditional branches. As a result, branch prediction tends to have a very low hit rate. In addition, the average processing time may be improved by branch prediction or speculative execution, but it is not expected to speed up individual local processing.
[0023]
That is, even if a plurality of instructions can be executed in parallel, they are likely to be wasted. The logic for parallel execution is wasted and logical and physical resources cannot be used effectively. In addition, useless logic and useless operation undesirably increase power consumption.
[0024]
In the end, for microcomputers for embedded control, parallel processing such as superscalar or VLIW makes it difficult to use software assets effectively and, in practice, to achieve the kind of speedup seen in microprocessors. At the least, it is difficult to gain speed commensurate with the logical and physical scale, and power consumption increases.
[0025]
That the trend in processing performance of CPUs and microcomputers differs between repeated execution of the same calculation and device control such as motor control is described in "Interface," pp. 134-145, "Thorough study of embedded CPU performance," published by CQ Publishing Co., Ltd. in February 1998. From these viewpoints, the present inventor found it necessary to increase speed while maintaining compatibility with existing CPUs and microcomputers suited to device-embedded control, inheriting the existing architecture, and keeping the cost increase appropriate.
[0026]
An object of the present invention is to increase the speed of a data processing apparatus such as a microcomputer suitable for device control.
[0027]
Another object of the present invention is to improve the processing performance of a CPU while maintaining compatibility with existing CPUs so that existing software remains usable, and while minimizing the increase in logical scale and hence in manufacturing cost.
[0028]
The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.
[0029]
[Means for Solving the Problems]
The following is a brief description of an outline of typical inventions disclosed in the present application.
[0030]
That is, for an existing data processing device such as a CPU, the internal data bus width is at least larger than the basic unit of an instruction (for example, a word), and an instruction register that can hold a plurality of read instructions is provided. Means are provided for monitoring the amount of instructions present in the instruction register. A state (first operation) in which an (existing) instruction is subjected only to instruction read control (including control of increment of the program counter) according to a basic unit time of execution (referred to as a state), an effective address calculation, The state is divided into a state (second operation) including control of data operation processing, and a state in which only the read of the instruction is controlled can be omitted according to the state of the already read instruction. In other words, in accordance with the instruction of the monitoring means corresponding to the amount of already read instructions existing in the instruction register, a state in which only the reading of the instructions is controlled is skipped.
[0031]
The amount of instruction read during execution of each instruction is increased or decreased with respect to the instruction length of the own instruction. This is controlled according to the amount of instructions that have been read or are being executed.
[0032]
According to the above, by making the internal data bus width larger than the basic unit (word) of instructions, the amount of instructions read at a time can be made larger than that of an existing CPU.
[0033]
In the basic execution states, performing, as in the existing CPU, the number of instruction reads corresponding to the instruction's own code length without omitting (skipping) any state reads more instruction code than the executed instruction consumes, so read instruction code accumulates.
[0034]
Conversely, omitting (skipping) states so that instruction reads are not performed makes the amount of instruction code read equal to that of the executed instruction, which maintains the amount of read instruction code, or smaller than it, which reduces the amount of read instruction code.
[0035]
This keeps the amount of read instructions within a predetermined range (balancing the amount of instructions read against the amount executed) while speeding up instruction reads, so the overall instruction execution time can be shortened.
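The balance between instruction code read and instruction code consumed can be illustrated with simple bookkeeping in Python. The sketch assumes a 32-bit bus delivering two 16-bit words per aligned instruction read; `WORDS_PER_READ` and `fifo_change` are hypothetical names introduced only for this illustration.

```python
WORDS_PER_READ = 2  # assumption: 32-bit bus, 16-bit instruction words, aligned reads


def fifo_change(code_length_words: int, reads_performed: int) -> int:
    """Net change in buffered instruction words after executing one instruction."""
    return WORDS_PER_READ * reads_performed - code_length_words


accumulate = fifo_change(code_length_words=2, reads_performed=2)  # no state skipped
maintain = fifo_change(code_length_words=2, reads_performed=1)    # one read skipped
reduce = fifo_change(code_length_words=2, reads_performed=0)      # both reads skipped
```

For a two-word instruction, performing both reads adds two words to the buffer, skipping one read leaves the amount unchanged, and skipping both removes two words, which corresponds to the accumulate, maintain, and reduce cases described above.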
[0036]
In addition, by automatically changing the state to be omitted (skip), it is possible to cope with a change in instruction arrangement.
[0037]
In addition to the instruction code, information on bit 1 on the instruction code address (IAB) is stored in the instruction register, and the instruction decoder may determine the information simultaneously. Such a value of bit 1 indicates whether the address of a basic unit such as a word is a multiple of 4 or an even number that is not a multiple of 4. This facilitates control and minimizes the increase in the logical scale of the instruction decoder.
[0038]
Except for the read of the first address of a branch instruction or interrupt exception processing, instructions are read in 32-bit units and the program counter is incremented by +4. Likewise, when the first address of a branch instruction or interrupt exception processing is a multiple of 4, the instruction is read in 32-bit units and the program counter is incremented by +4. When the first address of a branch instruction or interrupt exception processing is not a multiple of 4, the first instruction is read in 16-bit units and the program counter is incremented by +2. To have the incrementer select +2 or +4 automatically, it suffices, for example, where bit 0 addresses bytes, to use as the incrementer input the value whose bit 1 has been logically ORed with 1, and to apply a carry to bit 1.
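The automatic +2/+4 selection can be expressed in one line of arithmetic, shown here as a Python sketch. The function name is an assumption; the sketch models only the described operation (OR bit 1 of the input with 1, then add a carry into bit 1) for even byte addresses.

```python
def next_fetch_address(pc: int) -> int:
    """Force bit 1 of the input to 1, then add a carry into bit 1 (that is, add 2).

    For a multiple-of-4 address this is +4; for an even address that is not
    a multiple of 4 it is +2, so the result is always the next multiple of 4.
    """
    return (pc | 0b10) + 0b10
```

For example, addresses 12 and 14 both advance to 16, so after the first read at a branch destination every later read is aligned to the 32-bit bus.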
[0039]
In this way, except for reading the start address of branch instructions and interrupt exception processing, instructions are read in 32-bit units, and the program counter is incremented automatically by + 2 / + 4 to facilitate control. An increase in logical scale can be suppressed.
[0040]
It is advisable to store the contents of the program counter in the write data buffer, to give the write data buffer a FIFO (First-In First-Out) structure, and to provide a circuit that sets bit 1 in the same way as the program counter incrementer. Then, even though the gap between the address of the instruction actually read and the contents of the program counter grows and is no longer uniquely determined, the program counter value to be saved by a subroutine branch instruction is easily obtained from the write data buffer. Control is also simplified and growth in logical scale suppressed.
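A minimal Python sketch of this arrangement follows: each instruction read pushes the pre-increment program counter value into a two-entry history. The class and attribute names are illustrative assumptions (the attributes mirror the WDB-M and WDB-S stages mentioned elsewhere in this description), and the model ignores timing.

```python
class FetchTracker:
    """Tracks the PC and the two most recent instruction read addresses."""

    def __init__(self, start_address: int):
        self.pc = start_address
        self.wdb_m = None  # address of the previous instruction read
        self.wdb_s = None  # address of the instruction read before that

    def instruction_read(self) -> None:
        # Shift the history, store the pre-increment PC, then apply the +2/+4 increment.
        self.wdb_s = self.wdb_m
        self.wdb_m = self.pc
        self.pc = (self.pc | 0b10) + 0b10


tracker = FetchTracker(14)  # branch destination at address 14
for _ in range(3):
    tracker.instruction_read()
state = (tracker.pc, tracker.wdb_m, tracker.wdb_s)
```

Starting from address 14, three reads leave the PC at 24 with 20 and 16 in the history, which agrees with the values this description gives for FIG. 18 at the time the subroutine branch instruction is decoded.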
[0041]
Even when effective address calculation or data transfer extends over a plurality of states, the control itself is issued at once, and the actual operation is carried out over a plurality of states by, for example, delaying the control signal (for example, the address calculation is performed in the first state and the read data is stored in the next state).
[0042]
In the case of branch instructions and interrupt exception handling, decoding and execution of the first instruction at the branch destination are started as soon as the prefetch of at least one word is completed. This avoids the loss of processing time caused by branch instructions and interrupt exception handling and improves so-called responsiveness, and hence real-time performance.
[0043]
The execution means is configured, for example by providing independent internal buses, so that an operation whose control signal is to be delayed (for example, storage of read data) can proceed simultaneously even when it overlaps with the next control operation (program counter increment).
[0044]
For instructions that can be processed in the basic unit time (instructions that have no optional state), a plurality of arithmetic means such as arithmetic logic units (ALUs) are provided so that they can operate with overlapping time. The execution means is configured, for example by providing independent internal buses, so that the operation of one of the arithmetic means and the operation for reading an instruction can proceed at the same time even when they overlap. A control circuit is provided for each arithmetic means. One control circuit controls one arithmetic means and handles all instructions, and the other control circuit exclusively controls the other arithmetic means.
[0045]
In this way, merely by providing a plurality of arithmetic means and control circuits, with one control circuit handling all instructions and the other exclusively controlling its arithmetic means, the increase in logical scale can be suppressed.
[0046]
For an instruction code that generates only a control signal, such as a prefix code, a detection circuit that detects the prefix code while it is read and held in the instruction register and a control signal generation circuit that generates the desired control signal are provided. This allows the code to be skipped, with only the control signal being generated when it is skipped. The execution time of the instruction can thereby be shortened. Further, since only the prefix code is detected and the control signal is generated from the detection result, the logical scale of the detection circuit and the control signal generation circuit can be minimized.
[0047]
Suppose that, with compatibility maintained at the object level, there are a CPU with a wide address space (large instruction set) and a CPU with a small address space (small instruction set). When the speed of the CPU with the wide address space is increased, the same speed-up can be achieved for the instructions that also exist in the backward-compatible CPU with the small address space. In other words, the same method can be used to speed up both the CPU with the wide address space and the CPU with the small address space while maintaining compatibility at the object level, so that both the advantage of object-level compatibility and the advantage of higher speed are obtained.
[0048]
Since existing instructions can be executed and the order of internal operations is the same, future expansion margins are not significantly impaired compared to existing CPUs. For example, when a new instruction can be added to an existing CPU, it is considered that such a technique can be used for a CPU to which the present invention is applied. As long as the instruction set compatibility is maintained, the same instruction as that of the existing CPU can be added as the machine language. Further, if the additional instruction also has a plurality of execution state numbers, it can be divided into a part for performing a specific operation and an optional state, and the latter can be omitted as necessary. At least instruction read and program counter increment can be prohibited as necessary, and can be realized with a processing time equivalent to that of an existing CPU. If the additional instruction can be executed in one state, high speed can be realized by an alternate operation of a plurality of arithmetic means (for example, ALU, ALUS).
[0049]
By using the same instruction set as that of an existing CPU, development tools such as an assembler, C compiler, and simulator/debugger, so-called cross software, can be shared. Sharing the cross software allows the development environment to be established quickly.
[0050]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 2 shows an example of a single chip microcomputer to which the present invention is applied.
[0051]
The single-chip microcomputer 1 is composed of the following functional blocks or modules: a CPU 2 that controls the whole device, an interrupt controller (INT) 3, a ROM 4 that stores the processing program of the CPU 2, a RAM 5 that serves as the work area of the CPU 2 and as memory for temporary storage of data, a timer 6, a timer 7, a serial communication interface (SCI) 8, an A/D converter 9, a system controller (SYSC) 10, first to ninth input/output ports (IOP[1] 11 to IOP[9] 19), and a clock oscillator (CPG) 20. It is formed on one semiconductor substrate (semiconductor chip) by a known semiconductor manufacturing technique.
[0052]
The single chip microcomputer 1 has a ground level (Vss), a power supply voltage level (Vcc), an analog ground level (AVss), an analog power supply voltage level (AVcc), and an analog reference voltage (Vref) as power supply terminals. Further, as dedicated control terminals, reset (RES), standby (STBY), mode control (MD0, MD1), and clock input (EXTAL, XTAL) terminals are provided.
[0053]
The single-chip microcomputer 1 operates in synchronization with a reference clock (system clock) generated based on a crystal oscillator connected to the terminals EXTAL and XTAL of the CPG 20 or an external clock input to the EXTAL terminal. One cycle of this reference clock is called a state.
[0054]
The functional blocks of the single-chip microcomputer 1 are connected to each other by an internal bus 21. The internal bus 21 includes an internal address bus, an internal data bus, and an internal control bus. The internal control bus carries a read signal, a write signal, a bus size signal, a system clock, and the like. The portion of the internal data bus between the CPU 2 and the ROM 4, which stores the program of the CPU 2, is 32 bits wide. Although not particularly limited, in the example of FIG. 2 the RAM 5 is likewise interfaced by a 32-bit bus. The external bus may also be a 32-bit bus.
[0055]
The internal address bus exists as two buses of different phase, IAB and PAB, and the internal data bus likewise exists as IDB and PDB. For example, in the case of a read, the PAB is delayed by 0.5 state relative to the IAB. PAB and PDB are synchronized. The IDB is delayed by 0.5 state relative to the PDB. IAB and PAB, and IDB and PDB, are buffered by a bus controller (not shown). The functional blocks and modules are read and written by the CPU 2 via the internal bus. The built-in ROM 4 and RAM 5 are interfaced with the CPU 2 via IAB and IDB and can be read or written in one state. The control registers of the timer 6, the timer 7, the SCI 8, the A/D converter 9, IOP[1] 11 to IOP[9] 19, and the CPG 20 are collectively referred to as internal I/O registers. These are connected to the PAB and PDB. The bus width of the PDB is not particularly limited, but is 16 bits. This is because the internal I/O registers are distributed among the functional blocks, so connecting them by a 32-bit bus would increase the total wiring length of the bus and tend to increase the physical scale, and because meaningful data in the internal I/O registers (each functional block) is 8 to 16 bits wide, so there is little need for 32-bit access.
[0056]
Each of the input/output ports 11 to 19 is shared with the address bus, the data bus, bus control signals, or the input terminals and input/output terminals of the timers 6 and 7, the SCI 8, and the A/D converter 9. That is, the timers 6 and 7, the SCI 8, and the A/D converter 9 each have input/output signals, which are exchanged with the outside via terminals shared with the input/output ports. For example, IOP[5] 15, IOP[6] 16, and IOP[7] 17 also serve as input/output terminals for the timers 6 and 7, and IOP[8] 18 also serves as an input/output terminal for the SCI 8. The analog data input terminals are shared with IOP[9] 19.
[0057]
When a reset signal RES is given to the single-chip microcomputer 1, the single-chip microcomputer 1 including the CPU 2 is reset. When the reset is released, the CPU 2 performs reset exception processing, in which it reads a start address from a predetermined address and begins reading instructions from that start address. Thereafter, the CPU 2 sequentially reads and decodes instructions from the ROM 4 and the like and, based on the decoded contents, processes data or transfers data to and from the RAM 5, the timers 6 and 7, the SCI 8, and so on. That is, the CPU 2 performs processing based on the instructions stored in the ROM 4 or the like while referring to data input from the input/output ports IOP[1] to IOP[9], the A/D converter 9, and so on, or to commands input from the SCI 8 and the like. Based on the results, it outputs signals to the outside through the input/output ports IOP[1] to IOP[9], the timers 6 and 7, and so on, and thereby controls various devices.
[0058]
The states of the timers 6 and 7, the SCI 8, and external signals can be transmitted to the CPU 2 as interrupt signals. An interrupt signal is output by a predetermined one of the A/D converter 9, the timer 6, the timer 7, the SCI 8, and IOP[1] 11 to IOP[9] 19. The interrupt controller 3 receives it and gives an interrupt request signal 22 to the CPU 2 based on the settings of predetermined registers and the like. When an interrupt factor occurs, a CPU interrupt request is generated, and the CPU 2 suspends the process being executed, passes through an exception processing state, branches to a predetermined processing routine, performs the desired processing, and sets or clears the interrupt factor. A return instruction is normally placed at the end of the predetermined processing routine, and executing this instruction resumes the interrupted process.
[0059]
FIG. 3 shows a configuration example (programming model) of general-purpose registers and control registers built in the CPU 2.
[0060]
The CPU 2 has general-purpose registers ER0 to ER7 having a 32-bit length. The general-purpose registers ER0 to ER7 all have the same function and can be used as an address register and a data register.
[0061]
As a data register, it can be used as a 32-bit, 16-bit and 8-bit register. The address register and 32-bit register are used as general registers ER (ER0 to ER7) at once. As 16-bit registers, general-purpose registers ER are divided and used as general-purpose registers E (E0 to E7) and general-purpose registers R (R0 to R7). These have equivalent functions, and up to 16 16-bit registers can be used. Note that the general-purpose registers E (E0 to E7) may be particularly referred to as extension registers. As an 8-bit register, the general-purpose register R is divided and used as general-purpose registers RH (R0H to R7H) and general-purpose registers RL (R0L to R7L). These have equivalent functions, and up to 16 8-bit registers can be used. The usage method can be selected independently for each register.
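The register views described above can be illustrated with a small model. This is a sketch under stated assumptions: the class and method names are hypothetical, and placing E in the upper 16 bits of ER and RH in the upper 8 bits of R follows the H8S convention rather than anything stated explicitly in this paragraph.

```python
class GeneralRegisters:
    """Hypothetical model of ER0 to ER7 and their 16-bit and 8-bit views."""

    def __init__(self) -> None:
        self.er = [0] * 8  # eight 32-bit general-purpose registers

    def e(self, n: int) -> int:
        """Extension register En (assumed to be the upper 16 bits of ERn)."""
        return (self.er[n] >> 16) & 0xFFFF

    def r(self, n: int) -> int:
        """Register Rn (assumed to be the lower 16 bits of ERn)."""
        return self.er[n] & 0xFFFF

    def rh(self, n: int) -> int:
        """Register RnH (assumed to be the upper byte of Rn)."""
        return (self.er[n] >> 8) & 0xFF

    def rl(self, n: int) -> int:
        """Register RnL (assumed to be the lower byte of Rn)."""
        return self.er[n] & 0xFF

    def set_rl(self, n: int, value: int) -> None:
        """Write RnL without disturbing the other 24 bits of ERn."""
        self.er[n] = (self.er[n] & 0xFFFFFF00) | (value & 0xFF)


regs = GeneralRegisters()
regs.er[0] = 0x12345678
print(hex(regs.e(0)), hex(regs.r(0)), hex(regs.rh(0)), hex(regs.rl(0)))
```

Because each view is only a slice of the same 32-bit storage, one register can be used as a 32-bit value while another is used as two 16-bit or 8-bit values, which matches the statement that the usage can be selected independently for each register.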
[0062]
The general-purpose register ER7 is assigned a function as a stack pointer (SP) in addition to a function as a general-purpose register, and is used implicitly in exception processing, subroutine branching, and the like. Exception handling includes the interrupt handling.
[0063]
PC is a 24-bit counter indicating the address of the instruction to be executed next by the CPU 2. Although not particularly limited, since all instructions of the CPU 2 are in units of 2 bytes (word: 16 bits), bit 0 is invalid. When instructions are read in units of 4 bytes (long word: 32 bits), bit 1 is not used either. When the PC is saved on the stack or elsewhere, it is handled as long word size with the upper 8 bits set to 0.
[0064]
CCR is an 8-bit condition code register indicating the internal state of the CPU 2. It consists of 8 bits including interrupt mask bit (I) and half carry (H), negative (N), zero (Z), overflow (V) and carry (C) flags.
[0065]
EXR is an 8-bit register in which control information for controlling exception processing such as interrupts is set, and includes interrupt mask bits (I2 to I0) and trace (T) bits.
[0066]
The data structures in the general-purpose registers, the data structures in the memory space, the addressing modes, and the effective address calculation methods are the same as those of the CPU described in, for example, the "H8S/2600 Series H8S/2000 Series Programming Manual" published by Hitachi, Ltd.
[0067]
FIG. 4 shows a configuration example of general-purpose registers and control registers built into another CPU. This is the same configuration as the CPU described in the "H8/300 Series Programming Manual" issued by Hitachi, Ltd. in July 1989, and has 16-bit general-purpose registers R0 to R7. The CPU 2 having the programming model of FIG. 3, to which the present invention is applied, includes the general-purpose registers and instruction set of the CPU of FIG. 4. In other words, the CPU 2 is upward compatible with the CPU having the registers and instruction set of FIG. 4.
[0068]
FIG. 5 shows an example of the machine language instruction format of the CPU 2. The instructions of the CPU 2 are in units of 2 bytes (word). Each instruction includes an operation field (op), a register field (r), an EA extension (EA), and a condition field (cc).
[0069]
Although not particularly limited, the CPU 2 has the same instruction format as the CPU described in “H8S / 2600 Series, H8S / 2000 Series Programming Manual” published by Hitachi, Ltd. in March 1995. In particular, basic operation instructions and transfer instructions are 16 bits long (one word).
[0070]
The operation field (op) represents the function of the instruction and specifies the addressing mode and the processing to be applied to the designated operand. It always includes the first 4 bits of the instruction. Some instructions have two operation fields.
[0071]
The register field (r) specifies a general-purpose register in combination. The register field (r) is 3 bits for the address register and 3 bits (32 bit register) or 4 bits (8 or 16 bit register) for the data register. There may be two register fields or no register fields.
[0072]
The EA extension unit (EA) designates immediate data, an absolute address, or a displacement. It is 8 bits, 16 bits, or 32 bits. The condition field (cc) specifies the branch condition of the conditional branch instruction (Bcc instruction).
[0073]
FIG. 1 illustrates a block diagram of the CPU 2. The CPU 2 includes a control unit CONT and an execution unit EXEC including the general-purpose registers ER0 to ER7, a program counter PC, and a condition code register CCR.
[0074]
The control unit CONT includes, for example, an instruction register 200 composed of a FIFO for three words, an instruction register detection circuit (MON) 201, an instruction register controller (FIFOCNT) 202, an instruction decoder (DEC) 203, a sub instruction decoder (DECS) 204, and a register selector (RSEL) 205. The instruction decoder (DEC) 203 and the sub instruction decoder (DECS) 204 are configured by, for example, a PLA (Programmable Logic Array) or wired logic. The instruction decoder (DEC) 203 handles all instructions and controls the entire CPU 2. The sub instruction decoder (DECS) 204 performs only the control needed to execute an arithmetic operation, such as an inter-register operation instruction, at a time overlapping with the instruction decoder (DEC) 203. Part of the output of the instruction decoder 203 is fed back to the instruction decoder 203. This includes a stage code (TMG) used for transitions within each instruction code and a control code (MOD) used between instruction codes. The control code is generated by the instruction decoder (DEC) 203 and the instruction register detection circuit (MON) 201 and is input to the instruction decoder 203 via the multiplexer (MPX) 206. The instruction register detection circuit (MON) 201 has detection circuits for detecting an inter-register operation instruction and a prefix code. The detection result for the inter-register operation instruction is reported to the instruction decoder (DEC) 203 by the signal NXTMON1, and the detection result for the prefix code by the signal NXTMON2. Further, the instruction register detection circuit 201 notifies the decoder 203 by the signal IFMON that the instruction register 200 is being read. The instruction register controller (FIFOCNT) 202 detects the amount of valid instruction code held in the instruction register and reports the detection result to the instruction decoder (DEC) 203 by the detection signals FIFOCNT1 and FIFOCNT2.
The control performed by the decoder 203 based on the signals FIFOCNT1, FIFOCNT2, NXTMON1, NXTMON2, and IFMON will be described in detail with reference to FIGS. 10 to 13, 22, and 23.
[0075]
The register selector (RSEL) 205 forms selection signals for registers such as the general-purpose registers based on the outputs of the decoders 203 and 204 and, although not shown, includes a detection circuit for the case where a read and a write of a general-purpose register conflict.
[0076]
The execution unit EXEC is configured to be able to transfer data in units of 32 bits, and includes the general-purpose registers and control registers shown in FIG. 3, temporary registers TRA and TRD, an arithmetic logic unit ALU, a sub arithmetic logic unit ALUS, an arithmetic unit AU, an incrementer INC, a read data buffer RDB, a write data buffer WDB, and an address buffer AB. These functional blocks are connected to each other by the internal buses GB (first internal bus), PCGB (second internal bus), PCWB (third internal bus), WB (fourth internal bus), and DB (fifth internal bus).
[0077]
The internal bus GB is used for data transfer from predetermined registers among the general-purpose registers ER0 to ER7 to the arithmetic logic units ALU, ALUS, and so on, and for address transfer from predetermined registers among the general-purpose registers ER0 to ER7, the read data buffer RDB, and the arithmetic logic unit ALU to the address buffer AB.
[0078]
The internal bus PCGB is used to transfer the address of an instruction from the program counter PC to the address buffer AB, the incrementer INC, the write data buffer WDB, and the like.
[0079]
The internal bus DB is used for data transfer from predetermined registers in the general-purpose registers ER0 to ER7 to the arithmetic logic unit ALU and the write data buffer WDB.
[0080]
The internal bus WB is used for data transfer from the arithmetic logic units ALU and ALUS and the read data buffer RDB to the general-purpose registers. The internal bus PCWB is used to transfer the address of an instruction from the incrementer INC to the program counter PC.
[0081]
The read data buffer RDB temporarily stores instruction codes and data read from the ROM 4, RAM 5, internal I / O registers, or external memory (not shown). The write data buffer WDB temporarily stores write data to the ROM 4, RAM 5, internal I / O register, or external memory, and temporarily stores an instruction read address. The read data buffer RDB and the write data buffer WDB adjust the timing of the internal operation of the CPU 2 and the read / write operation outside the CPU 2.
[0082]
The address buffer AB temporarily stores an address to be read / written by the CPU 2 and has an increment function for the stored contents. An address buffer having an increment function is described in JP-A-4-333153.
[0083]
The incrementer INC is mainly used for PC addition and performs +2/+4. The arithmetic unit AU is used to generate the branch address of branch instructions and subroutine branch instructions that are relative to the program counter. The arithmetic logic unit ALU is used for the various operations designated by instructions and for effective address calculation. The sub arithmetic logic unit ALUS is used exclusively for arithmetic operations between registers. Whether the instruction to be executed is an inter-register operation instruction is detected by the signal NXTMON1.
[0084]
The operations of the arithmetic logic unit ALU and the sub arithmetic logic unit ALUS are offset by 0.5 state. The arithmetic logic unit ALU takes in data while the basic clock (φ) is at the high level and outputs the result while the basic clock (φ) is at the low level. The sub arithmetic logic unit ALUS, on the other hand, takes in data while the basic clock (φ) is at the low level and outputs the result while the basic clock (φ) is at the high level. The CPU 2 executes instructions in a three-stage pipeline of instruction fetch, decode, and execution. For example, when the addition instruction "ADD.L ER0, ER1" and the shift instruction "SHLL ER1" are consecutive, the contents of ER0 and ER1 are read out to the buses DB and GB in synchronization with the high level of the basic clock (φ) and input to the arithmetic logic unit ALU. The addition is performed by the arithmetic logic unit ALU, and the result is output to the bus WB in synchronization with the low level of the basic clock (φ). At the low level of the basic clock (φ), the read and the write of ER1 conflict. The contents of the bus WB are written to ER1, and the contents of ER1 are not read out. Instead, the output of the arithmetic logic unit ALU is read out to the bus GB and input to the sub arithmetic logic unit ALUS. That is, since the arithmetic logic unit ALU and the sub arithmetic logic unit ALUS operate in instruction order, the result of the arithmetic logic unit ALU can be used as an input to the sub arithmetic logic unit ALUS, and register conflicts can essentially be avoided.
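The half-state offset and the bypass from the ALU to the ALUS can be traced with a short behavioural model. This is only a sketch: the function name is hypothetical, the phases are modelled as sequential steps rather than clock edges, and SHLL is assumed to be a logical left shift by one bit.

```python
MASK32 = 0xFFFFFFFF


def add_then_shll(er0: int, er1: int) -> int:
    """Model 'ADD.L ER0, ER1' followed by 'SHLL ER1' (hypothetical helper)."""
    regs = {"ER0": er0 & MASK32, "ER1": er1 & MASK32}

    # State 1, clock high: the ALU takes ER0 and ER1 from buses DB and GB.
    alu_out = (regs["ER0"] + regs["ER1"]) & MASK32

    # State 1, clock low: the ALU result is driven onto bus WB and written
    # to ER1. The ALUS needs ER1 in the same half state, so instead of
    # reading the register it takes the ALU output from bus GB.
    regs["ER1"] = alu_out
    alus_in = alu_out

    # State 2, clock high: the ALUS outputs the shifted value (assumed to be
    # a one-bit logical left shift), which is written back to ER1.
    regs["ER1"] = (alus_in << 1) & MASK32
    return regs["ER1"]


print(hex(add_then_shll(1, 2)))
```

The point of the model is that the second instruction never reads the stale register value: its operand is the forwarded ALU result, so the two dependent instructions can overlap by half a state.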
[0085]
Instructions overlap in the first or last state of each instruction (for an instruction executed in one state, the entire period), and in this period only operations of a specific type (arithmetic operations) are performed. Therefore, only part of the instruction decoding of the CPU 2 needs to be performed for both arithmetic logic units ALU and ALUS during the overlapping period, and the other operations can be controlled sequentially. The instruction decoder (DEC) 203 can be made equivalent to the conventional one, and the increase in logical scale can be minimized by making the added sub instruction decoder (DECS) 204 relatively small.
[0086]
Although not particularly limited, the sub instruction decoder (DECS) 204 performs designation of the type of operation carried out by the sub arithmetic logic unit ALUS (operation control), input/output control of the general-purpose registers used for the operation, and setting control of the condition code register CCR based on the operation result of the sub arithmetic logic unit ALUS.
[0087]
The instruction decoder (DEC) 203, on the other hand, performs in addition to the above the generation of instruction operation timing, bus control, PC control, effective address calculation, input/output control of the general-purpose registers used for effective address calculation, input/output control of memory access data, instruction register control, interrupt control, and so on.
[0088]
Here, the functions of the decoder 203 and the sub-decoder 204 are described in more detail. When the signal NXTMON1 indicates that an inter-register operation instruction has been detected, the sub-decoder 204 decodes that instruction and starts operation control of the sub arithmetic logic unit ALUS, based on the decoding result, with a delay of 0.5 state.
[0089]
FIG. 6 illustrates the configuration of the ROM 4. The ROM 4 has a maximum of 32 parallel data input/output bits, and its data input/output terminals are connected to the internal data bus IDB. Four consecutive bytes starting from an address that is a multiple of 4 are arranged so that the byte whose lower address bits are "0" is the most significant and the byte whose lower address bits are "3" is the least significant.
[0090]
The ROM 4 reads 32-bit data (long word data) starting from an address that is a multiple of 4 in a single state. 32-bit data starting from any other even address must be read in two accesses of one state each. 16-bit data (word data) starting from an even address can likewise be read in one state. Reading of 16-bit data starting from an odd address is not permitted. This corresponds to the instruction code being in 16-bit units. The ROM 4 can read 8-bit data (byte data) at an arbitrary address in one state.
[0091]
That is, when 16-bit instructions are consecutive, two instructions can be read by one read of the ROM 4. The read / write of the RAM 5 has the same configuration.
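The access rules for the ROM 4 can be summarized in a small lookup function. This is a sketch, not the memory's actual control logic: the function name is hypothetical, and rejecting 32-bit reads from odd addresses is an assumption, since the text only prohibits 16-bit reads from odd addresses explicitly.

```python
def rom_read_states(address: int, size: int) -> int:
    """Return the number of states needed to read `size` bytes (hypothetical helper)."""
    if size == 1:
        return 1  # byte data: any address, one state
    if size not in (2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    if address % 2:
        # Stated for 16-bit reads; assumed here for 32-bit reads as well.
        raise ValueError("word and long word reads must start at an even address")
    if size == 2:
        return 1  # word data at an even address: one state
    # Long word data: one state when aligned to 4, otherwise two reads.
    return 1 if address % 4 == 0 else 2


for addr, size in ((0x100, 4), (0x102, 4), (0x102, 2), (0x101, 1)):
    print(hex(addr), size, rom_read_states(addr, size))
```

With 16-bit instructions stored back to back, an aligned long word read returns two instructions in one state, which is the case the paragraph above relies on.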
[0092]
FIG. 7 illustrates the addressing mode of the CPU 2. Register indirect (@ERn) designates an operand on the memory with the contents of the address register (ERn) designated in the register field (r1) of the instruction code as an address.
[0093]
Post-increment register indirect (@ERn+) designates an operand in memory with the contents of the address register (ERn) designated in the register field (r1) of the instruction code as the address. Thereafter, 1, 2, or 4 is added to the contents of the address register, and the result is stored in the address register. 1 is added for byte size, 2 for word size, and 4 for long word size.
[0094]
Pre-decrement register indirect (@-ERn) designates an operand in memory with, as the address, the value obtained by subtracting 1, 2, or 4 from the contents of the address register (ERn) designated in the register field (r1) of the instruction code. Thereafter, the subtraction result is stored in the address register. 1 is subtracted for byte size, 2 for word size, and 4 for long word size.
[0095]
Register indirect with displacement (@(d:16, ERn)) designates an operand in memory with, as the address, the value obtained by adding the 16-bit displacement (d) included in the instruction code to the contents of the address register (ERn) designated in the register field (r1) of the instruction code. The 16-bit displacement is sign-extended for the addition.
[0096]
Absolute address (@aa:16) designates an operand in memory by the absolute address (aa) included in the instruction code. Although not particularly limited, a 16-bit absolute address is sign-extended into the upper 16 bits.
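The effective address calculations for the addressing modes above can be sketched as follows. The function names are hypothetical, the register file is modelled as a plain list of eight integers, and treating addresses as 32-bit values that wrap on overflow is an assumption made for the example.

```python
MASK32 = 0xFFFFFFFF  # assumption: addresses handled as 32-bit values


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def ea_register_indirect(regs: list, n: int) -> int:
    return regs[n] & MASK32                    # @ERn


def ea_post_increment(regs: list, n: int, size: int) -> int:
    ea = regs[n] & MASK32                      # @ERn+: use the address first,
    regs[n] = (ea + size) & MASK32             # then add 1, 2 or 4
    return ea


def ea_pre_decrement(regs: list, n: int, size: int) -> int:
    regs[n] = (regs[n] - size) & MASK32        # @-ERn: subtract 1, 2 or 4 first,
    return regs[n]                             # then use the result as the address


def ea_displacement16(regs: list, n: int, d: int) -> int:
    return (regs[n] + sign_extend16(d)) & MASK32   # @(d:16, ERn)


def ea_absolute16(aa: int) -> int:
    return sign_extend16(aa) & MASK32          # @aa:16, sign-extended upward


regs = [0] * 8
regs[1] = 0x1000
print(hex(ea_post_increment(regs, 1, 2)), hex(regs[1]))
print(hex(ea_absolute16(0x8000)))
```

`size` is 1 for byte, 2 for word, and 4 for long word operands, matching the increments and decrements listed for the post-increment and pre-decrement modes.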
[0097]
FIG. 8 shows the operation timing of the transfer instruction "MOV.W @aa:16, Rd". FIG. 8 (1-1) and (1-2) show the operation of the control unit CONT (particularly the instruction decoder 203), and FIG. 8 (2) shows the operation of the execution unit EXEC. In practice, since the execution unit EXEC operates based on the control signals output from the control unit CONT, there is a time difference between the operation of the control unit CONT and the operation of the execution unit EXEC. In FIG. 8, this time difference is expressed as 0 for convenience. Further, of the operations of the control unit CONT, (1-1) is an operation equivalent to the prior art, and (1-2) is an example of an operation unique to the present invention.
[0098]
The execution unit EXEC performs the instruction read (if) of the next instruction and the PC increment (+4; +2 in the prior art) in the first state ST1. In the second state ST2, the EA extension part (aa) of this instruction is transferred from the read data buffer to the address buffer via the internal bus (GB), and a bus command for reading data is issued. In the third state ST3, it performs the instruction read (if) of the instruction after next and the increment of the program counter PC (+4; +2 in the prior art). In addition, the data read in response to the second state ST2 is transferred from the read data buffer to the general-purpose register via the internal bus (WB), the data is checked, and the result is set in the condition code register CCR.
[0099]
The operation (1-1) of the control unit CONT has control contents that follow the operation of the execution unit EXEC. That is, in the second state ST2, control signals for address output and bus command generation are generated, and in the third state ST3, a control signal for storing the read data is generated. In the CPU of the "H8S/2600 Series H8S/2000 Series Programming Manual" published by Hitachi, Ltd. in March 1995, moreover, the PC increment is +2, and the read data is transferred from the read data buffer to the internal bus (GB) and input to the arithmetic unit (ALU), which outputs the data unchanged to the internal bus (WB) for storage in the general-purpose register. By using the arithmetic unit (ALU), the number of internal buses is kept down (GB, DB, and WB), and the data check circuit and flag set circuit of the arithmetic unit (ALU) are shared. These detailed differences are omitted here.
[0100]
The operation (1-2) of the control unit CONT performs the control for data access in the second state ST2. That is, in the second state ST2, the control signals for address output, bus command generation, and read data storage are all generated. First, the control signal for address output and bus command generation is given to the execution unit EXEC, and the control signal for storing the read data (RDB-Rd) is given in the next state.
[0101]
The first state ST1 and the third state ST3 of the control unit CONT perform only the instruction read (if) and the PC increment (+4). These first and third states ST1 and ST3 are omitted (skipped) according to the amount of instructions already read into the instruction register (FIFO) 200. If few instructions have been read, both the first state ST1 and the third state ST3 are executed, and more instruction code than the instruction length (2 words) of this instruction is read. If the amount of instructions already read is appropriate, one of the first state ST1 and the third state ST3 is executed, and the same amount as the instruction length (2 words) of this instruction is read. If many instructions have already been read, neither the first state ST1 nor the third state ST3 is executed, and no instruction is read. Which operation is performed is determined by the instruction decoder 203 using signals such as FIFOCNT1, FIFOCNT2, and IFMON.
[0102]
For example, when this instruction is executed several times in succession, only the first state ST1 and the second state ST2 are executed (the third state ST3 is omitted). This can be understood as follows: under control equivalent to the prior art as shown in (1-1) of FIG. 8, the word-size (16-bit) instruction read in the third state ST3 of the preceding instruction and the word-size instruction read in the first state ST1 of the next instruction are combined into the long-word-size (32-bit) instruction read of the first state ST1 under the control of the present invention shown in (1-2) of FIG. 8.
[0103]
When only one of the first state ST1 and the third state ST3 is executed, which is executed and which is omitted (skipped) is determined by the preceding instruction read state. If this instruction is placed at the head of the branch destination of a branch instruction and its address is not a multiple of 4, only the first word of the instruction has been prefetched, so the first state ST1 is executed to wait for the second word of the instruction. Even if the instruction is placed at the head of the branch destination of a branch instruction, if its address is a multiple of 4, the second word of the instruction has been read (prefetched) at the same time, so the first state ST1 is omitted (skipped) and the third state ST3 is executed.
[0104]
FIG. 9 shows the operation timing of the branch instruction (JMP @aa:24). FIG. 9 (1) shows the operation of the control unit CONT (in particular the instruction decoder 203), and FIG. 9 (2) shows the operation of the execution unit EXEC. As in FIG. 8, the time difference between the operation of the control unit CONT and that of the execution unit EXEC is represented as 0 for convenience.
[0105]
In FIG. 9, the first state ST1 waits for completion of the read of the second word of the own instruction, and can be omitted (skipped) if the second word has already been read. The branch destination instruction is read twice. For the first read, the instruction that was read is taken into the CPU 2; for the second, only the instruction read is issued, and taking the read instruction into the CPU 2 overlaps with execution of the next instruction.
[0106]
In the first read, when the branch destination address is a multiple of 4, two words are read and the PC increment is +4. When the address is not a multiple of 4, one word is read and the PC increment is +2.
[0107]
Consequently, suppose the same branch instruction (JMP @aa:24) is located at the branch destination. If the branch destination is a multiple of 4, execution starts with the read (prefetch) of the second word of the own instruction already complete, so the first state ST1 can be omitted (skipped). If the branch destination is not a multiple of 4, execution starts before the read of the second word of the own instruction is complete, so the first state ST1 cannot be omitted.
[0108]
Even when a branch instruction is executed, execution of the branch destination instruction can be started at the same timing as in the prior art. For example, when branching to a multiple of 4, the instruction execution at the branch destination can be shortened. Responsiveness such as branch instructions and interrupt exception handling can be maintained and improved.
[0109]
As described above, a branch instruction can be executed regardless of the address at which it is placed. The number of execution states varies with the address, but it is never worse than in the prior art. Moreover, no no-operation instruction needs to be inserted, so no burden is placed on the software.
[0110]
FIGS. 10 to 13 illustrate timing charts for the execution of a program. The program executed is the following.
(program listing not reproduced in this text)
[0111]
MOV.L ER1, @aa2 has an instruction code in which a prefix code is added to the instruction code of MOV.W R1, @aa2. Such a prefix code generates a control signal that changes the operation of the following instruction code (MOV.W R1, @aa2), as described in Japanese Patent Application Laid-Open No. 6-51981.
[0112]
In FIGS. 10 to 13, the addresses at which the instructions are placed differ. In FIG. 10, the labels are L0 = 2 and L1 = 14, that is,
L0 .EQU 2
L1 .EQU 14
In FIG. 11, L0 = 2 and L1 = 12, that is,
L0 .EQU 2
L1 .EQU 12
In FIG. 12, L0 = 0 and L1 = 14, that is,
L0 .EQU 0
L1 .EQU 14
In FIG. 13, L0 = 0 and L1 = 12, that is,
L0 .EQU 0
L1 .EQU 12
The data addresses are common to all four figures: aa1 = 102 and aa2 = 104, that is,
aa1 .EQU 102
aa2 .EQU 104
[0113]
The first state of the branch instruction can be omitted when an instruction code for one word exists in the instruction register (FIFO) 200 (FIFOCNT1 = 1).
[0114]
The first state of the transfer instruction (MOV.W) can be omitted if an instruction code for one word exists in the instruction register (FIFO) 200.
[0115]
The third state of the transfer instruction (MOV.W) can be omitted when instruction codes for two words exist in the instruction register (FIFO) 200 (FIFOCNT1 = FIFOCNT2 = 1), or when an instruction code for one word exists in the instruction register (FIFO) 200 and an instruction read is in progress (FIFOCNT1 = IFMON = 1).
[0116]
An inter-register operation instruction instructs the sub instruction decoder (DECS) 204 and the sub arithmetic logic unit ALUS to operate when instruction codes for two words exist in the instruction register (FIFO) 200 (FIFOCNT1 = FIFOCNT2 = 1), or an instruction code for one word exists in the instruction register (FIFO) 200 and an instruction read is in progress (FIFOCNT1 = IFMON = 1), and in addition the next instruction is an inter-register operation instruction (NXTMON1 = 1).
[0117]
The first state (prefix code, NXTMON2 = 1) of the transfer instruction (MOV.L) can be omitted when an instruction code for one word exists in the instruction register (FIFO) 200. If not omitted, the prefix code is decoded by the instruction decoder, instruction read and PC increment are performed, and a control signal is generated. If omitted, a desired signal is generated from the instruction register and input to the instruction decoder.
[0118]
The second state and the fourth state of the transfer instruction (MOV.L) are the same as the first state and the third state of the transfer instruction (MOV.W).
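The omission conditions stated in the preceding paragraphs can be summarized as a minimal sketch. The function names are hypothetical and introduced only for illustration; the signal names FIFOCNT1, FIFOCNT2, IFMON, and NXTMON1 are those of the description, and the sketch makes no claim about when the hardware samples them.

```python
def can_skip_st1(fifocnt1: int) -> bool:
    # ST1 of the branch instruction and of MOV.W is omissible when one
    # word of instruction code is already in the FIFO (FIFOCNT1 = 1).
    return fifocnt1 == 1


def can_skip_st3(fifocnt1: int, fifocnt2: int, ifmon: int) -> bool:
    # ST3 of MOV.W is omissible when two words are in the FIFO, or when
    # one word is in the FIFO and an instruction read is in progress.
    return fifocnt1 == 1 and (fifocnt2 == 1 or ifmon == 1)


def use_sub_decoder(fifocnt1: int, fifocnt2: int, ifmon: int, nxtmon1: int) -> bool:
    # An inter-register instruction hands the next instruction to DECS/ALUS
    # under the same prefetch condition, provided that the next instruction
    # is also an inter-register operation (NXTMON1 = 1).
    return can_skip_st3(fifocnt1, fifocnt2, ifmon) and nxtmon1 == 1
```

For example, `can_skip_st3(1, 0, 1)` is true because one word is buffered and a read is already under way, whereas `can_skip_st3(1, 0, 0)` is false.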
[0119]
In the case of FIG. 10, the operation is as follows. In slot C2 (the low-level period of the reference clock φ) of cycle T0 of the reference clock φ, the CPU 2, which is executing a branch instruction (not shown), outputs a bus command (BCMD) indicating an instruction fetch (if) and outputs the address from the address buffer AB to the address bus IAB. Similarly, a bus command and the next address are output in slot C2 of cycle T1.
[0120]
Based on the contents of the address bus IAB and the bus command, the contents of the built-in ROM 4 are obtained on the internal data bus IDB in slot C2 of cycle T1, and are latched into the instruction register (FIFO) 200 and the read data buffer RDB in slot C1 of cycle T2 (the high-level period of the reference clock φ). Since the instruction address at this time is not a multiple of 4, only the lower half (bits 15 to 0) of the internal data bus IDB is used. Similarly, the contents of the next address are latched into the instruction register (FIFO) 200 and the read data buffer RDB in slot C1 of cycle T3. Since this address is a multiple of 4, both the upper half (bits 31 to 16) and the lower half (bits 15 to 0) of the internal data bus IDB are used.
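The use of the 32-bit internal data bus described above can be illustrated with a minimal sketch. The function name is hypothetical, and the ordering of the returned words (upper half first) is an assumption made for illustration; the description states only which halves of the bus are used.

```python
from typing import List


def words_taken_from_bus(address: int, idb: int) -> List[int]:
    """Return the 16-bit words of a 32-bit bus value used for an instruction read."""
    upper = (idb >> 16) & 0xFFFF  # bits 31 to 16
    lower = idb & 0xFFFF          # bits 15 to 0
    if address % 4 == 0:
        # Address is a multiple of 4: both halves are used (2 words).
        return [upper, lower]
    # Address is not a multiple of 4: only the lower half is used (1 word).
    return [lower]
```

With L0 = 2 as in FIG. 10, the first read yields one word and the following read (address 4) yields two.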
[0121]
In slot C1 of cycle T2, the instruction code (jmp-1) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0122]
Since the reading of the second word (jmp-2) of the self instruction is not completed, the first state ST1 is executed.
[0123]
In slot C2 of cycle T2, the bus command is not operated, and read / write is not started.
[0124]
In slot C2 of cycle T3, the contents of the read data buffer RDB (absolute address = 14) are stored in the address buffer AB via the internal bus GB and output to the address bus IAB, and a bus command is issued to read the instruction. Similarly, a bus command and the next address are output in slot C2 of cycle T4 to read an instruction.
[0125]
The read contents are latched in the instruction register (FIFO) 200 and the read data buffer RDB in slot C1 of cycle T5 and slot C1 of cycle T6.
[0126]
The contents of the internal bus GB are also input to the write data buffer WDB and the incrementer INC, and the incrementer INC performs an increment (+2/+4).
[0127]
In slot C1 of cycle T4, the result (16) incremented (+2) by the incrementer INC is written to the program counter PC via the internal bus WB. Similarly, the result (20) incremented (+4) in the slot C1 of the cycle T5 is written to the program counter PC via the internal bus WB.
[0128]
In slot C1 of cycle T5, the instruction code (mov-1) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0129]
Since the reading of the second word (mov-2) of the self instruction is not completed, the first state ST1 is executed.
[0130]
The bus command and the next address are output in slot C2 of cycle T5, and the instruction is read. In addition, the program counter PC is incremented.
[0131]
In slot C2 of cycle T6, the contents of the read data buffer RDB (absolute address = 102) are stored in the address buffer AB via the internal bus GB and output to the address bus IAB, and a bus command is issued to read the data. The third state ST3 is omitted (skipped).
[0132]
The read data is stored in the read data buffer RDB in the slot C1 of cycle T8, and is written to the general-purpose register ER0 (substantially R0) via the internal bus WB. Further, the data on the read data buffer RDB is inspected, and the result is reflected in predetermined bits (eg, negative N, zero Z, overflow V) of the condition code register CCR. This operation is performed based on the result of decoding the instruction code (mov-1), but is executed at a time overlapping with the next instruction.
[0133]
In slot C1 of cycle T7, the instruction code (add) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0134]
The bus command and the next address are output in slot C2 of cycle T7 to read the instruction. In addition, the program counter PC is incremented.
[0135]
In slot C1 of cycle T8, the contents of general-purpose register ER1 (R1) are read out to the internal bus GB and input to the arithmetic logic unit ALU. The contents of general-purpose register ER0 (R0) would be read out to the internal bus DB, but because this conflicts with the write by the previous instruction, they are instead read from the read data buffer RDB (which minimizes the delay) and input to the arithmetic logic unit ALU. The arithmetic logic unit ALU is instructed to perform addition.
[0136]
In slot C2 of cycle T8, the operation result is stored in general-purpose register ER1 (R1). Further, the operation result is checked, and the result is reflected in predetermined bits (for example, negative N, zero Z, overflow V, carry C, half carry H) of the condition code register CCR.
[0137]
Since the next instruction code is an inter-register operation instruction (NXTMON1 = 1), the instruction code (exts) is input to the subinstruction decoder (DECS) 204 in the slot C2 of cycle T7.
[0138]
In slot C2 of cycle T8, the contents of general-purpose register ER0 would be read out to the internal bus GB, but because this conflicts with the write by the previous instruction, they are instead taken from the arithmetic logic unit ALU (which minimizes the delay) and input to the arithmetic logic unit ALU. The arithmetic logic unit ALU is instructed to perform extension.
[0139]
In slot C1 of cycle T9, the operation result is stored in general-purpose register ER0. Also, the operation result is checked, and the result is reflected in predetermined bits (for example, negative N, zero Z, overflow V) of the condition code register CCR.
[0140]
In slot C1 of cycle T8, the instruction code (movl-1) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0141]
Since the reading of the second word (movl-2) of the self instruction is not completed, the first state (prefix code) ST1 is executed.
[0142]
The bus command and the next address are output in slot C2 of cycle T8 to read the instruction. In addition, the program counter PC is incremented. A control signal is also generated to pass an indication (in this case, longword size) to the next instruction code.
[0143]
In slot C1 of cycle T9, the instruction code (movl-2) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0144]
Since the read of the third word (movl-3) of the own instruction is complete, the second state ST2 is omitted (skipped). Since the next instruction read has not been issued, the fourth state is executed.
[0145]
The operation of FIG. 11 is described below, focusing on the differences from FIG. 10. In slot C1 of cycle T5, the instruction code (mov-1) is input to the decoder (DEC) 203, and the contents of the instruction are decoded. At this time, the read of the second word (mov-2) of the own instruction is complete, so the first state ST1 is omitted (skipped). Since the next instruction read has also been issued, the third state ST3 is omitted (skipped) as well.
[0146]
In slot C1 of cycle T6, the instruction code (add) is input to the decoder (DEC) 203, and the contents of the instruction are decoded.
[0147]
Although the next instruction code is an inter-register operation instruction, the read of the instruction after next is not complete, so the instruction code (exts) is not input to the sub instruction decoder (DECS) 204.
[0148]
In slot C1 of cycle T7, the instruction code (exts) is input to the decoder (DEC) 203, and the content of the instruction is decoded.
[0149]
An arithmetic logic unit ALU is used to perform an operation. Since no register conflict occurs, the general-purpose register ER0 is read.
[0150]
In slot C1 of cycle T8, the instruction code (movl-1) is determined to be a prefix code (NXTMON2 = 1), so the instruction code (movl-2) is input to the decoder (DEC) 203 and the contents of the instruction are decoded. Since the read of the third word (movl-3) of the own instruction is not complete, the second state ST2 is executed. Since the read of the next instruction is complete, the fourth state is omitted (skipped).
[0151]
The operation of FIG. 12 is described below, focusing on the differences from FIG. 10. In slot C1 of cycle T2, the instruction code (jmp-1) is input to the decoder (DEC) 203, and the contents of the instruction are decoded. Since the read of the second word (jmp-2) of the own instruction is complete, the first state ST1 is omitted (skipped). The subsequent operation is the same as in FIG. 10.
[0152]
The operation of FIG. 13 is described below, focusing on the differences from FIG. 11. As in FIG. 12, the instruction code (jmp-1) is input to the decoder (DEC) 203 in slot C1 of cycle T2, and the contents of the instruction are decoded. Since the read of the second word (jmp-2) of the own instruction is complete, the first state is omitted (skipped). The subsequent operation is the same as in FIG. 11.
[0153]
In the prior art, these five instructions are executed in 12 states. In contrast, they are executed in 9 states in FIG. 10, 8 states in FIGS. 11 and 12, and 7 states in FIG. 13. The number of states required for processing is thus reduced to 58 to 75% of the prior art. The variation depends on whether the branch destination is at a multiple of 4: when it is not, the upper half of the internal data bus IDB cannot be used, and the throughput of the internal bus falls.
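The 58 to 75% range follows directly from the state counts given above, as the short calculation below shows. Attributing the 7-state case to FIG. 13 is an assumption based on the order in which the figures are listed; the variable names are illustrative.

```python
PRIOR_ART_STATES = 12

# States needed for the same five instructions in each timing chart.
states = {"FIG. 10": 9, "FIG. 11": 8, "FIG. 12": 8, "FIG. 13": 7}

# Ratio of states relative to the prior art for each figure.
ratios = {fig: n / PRIOR_ART_STATES for fig, n in states.items()}

best = min(ratios.values())   # 7 / 12
worst = max(ratios.values())  # 9 / 12
```

Here `best` rounds to 58% and `worst` is exactly 75%, matching the range quoted in the text.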
[0154]
To improve the processing speed of a particular program, where it is desired to place a given part of the program at a multiple of 4, the assembler can be provided with a control directive that aligns to a multiple of 4. Such control directives are described in, for example, "H8S, H8/300 Series Cross Assembler", p. 62, issued by Hitachi, Ltd. in May 1992. An alignment directive is not converted into a microcomputer instruction, so such a change does not greatly impair program quality and the like.
[0155]
In the above operation timing examples, a branch instruction is executed again after only a few short instructions following a branch, so the improvement provided by the present invention is not necessarily exhibited to its fullest.
[0156]
For example, when only inter-register operation instructions are executed in succession, the ALU and ALUS are used alternately, so that two instructions are effectively executed in one state, and the number of states can be reduced to 50% of the prior art.
[0157]
In addition, because the bus width is increased and omissible states are omitted (skipped), the instruction read time can typically be halved. If instruction reads account for 80% and data accesses for 20%, the total can be reduced to 60%; if instruction reads account for 70% and data accesses for 30%, it can be reduced to 65%.
[0158]
Although depending on the contents of the program, an improvement effect of about 50 to 75% can be obtained as a whole program.
[0159]
FIG. 14 illustrates a block diagram of the incrementer INC. The incrementer INC increments the program counter PC (+2/+4).
[0160]
As described above, there are two arithmetic logic units, ALU and ALUS, whereas only one incrementer INC is provided. Except for branch instructions, the increment of the program counter PC is +4. Therefore, the lower 2 bits of the program counter PC value that is normally input are 2'b00.
[0161]
The incrementer INC consists of a half adder 300 for each bit; it takes its input from the internal bus GB and outputs to the internal bus PCWB. For bit 1 only, the data input is fixed to logical 1 by an OR circuit (OR) 301 (+2), and a carry of logical 1 is also input (+2). That is, +4 is realized by performing +2 twice.
[0162]
In the case of a branch instruction, if the branch destination address is a multiple of 4 (bit 1 is 0), +4 is performed just as for instructions other than branch instructions. If the branch destination address is not a multiple of 4 (bit 1 is 1), the +2 by the OR circuit 301 has no effect, so only the +2 of the carry input is performed.
[0163]
As a result, +2 or +4 is selected automatically according to the branch destination address. In other words, the smallest multiple of 4 that is larger than the input is output.
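The behavior of the incrementer can be modeled arithmetically as a minimal sketch: forcing bit 1 to 1 corresponds to the OR circuit 301, and adding 2 corresponds to the carry input into bit 1. The function name is hypothetical, and the model assumes an even (word-aligned) input, as instruction addresses are in this description.

```python
def incrementer(value: int) -> int:
    """Model of INC: OR bit 1 with 1, then add the carry input at bit 1 (+2)."""
    forced = value | 0b10  # OR circuit 301: adds 2 only if bit 1 was 0
    return forced + 0b10   # carry input into bit 1: always adds 2


# Multiple of 4 -> +4; not a multiple of 4 -> +2.
examples = {14: incrementer(14), 16: incrementer(16)}
```

The two example inputs reproduce the PC values 16 and 20 that appear in the description of FIG. 10.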
[0164]
FIG. 15 illustrates a block diagram of the write data buffer WDB. The write data buffer WDB consists of three parts: WDB-M, WDB-S, and WDB-OUT. WDB-M can take input from the internal bus GB, and its contents can be transferred to WDB-S. The contents of WDB-M and WDB-S can be transferred to WDB-OUT, and WDB-OUT can also take input from the internal buses GB and DB. WDB-M and WDB-S can output to the internal bus GB. Output to the data bus IDB is performed from WDB-OUT.
[0165]
The value of the program counter PC to be saved is stored in the write data buffer WDB in advance. Storing the PC value to be saved in the write data buffer WDB in advance is described in Japanese Patent No. 293665. In the present invention, however, the method of outputting the PC value to be saved is varied according to the instruction code length of the own instruction and whether an instruction read is in progress at the start of instruction execution (IFMON).
[0166]
Specifically, for a 1-word instruction located at a multiple of 4, when an instruction read is in progress (IFMON = 1), the PC value to be saved is output from WDB-S to the internal data bus IDB. At that time, bit 1 of the PC value is fixed to 1.
[0167]
For a 1-word instruction located at a multiple of 4, when no instruction read is in progress (IFMON = 0), the PC value to be saved is output from WDB-M to the internal data bus IDB. At that time, bit 1 of the PC value is fixed to 1 (see the T8 portion of FIG. 21).
[0168]
For a 1-word instruction not located at a multiple of 4, when an instruction read is in progress (IFMON = 1), the PC value to be saved is output from WDB-S to the internal data bus IDB. At that time, bit 1 of the PC value is not fixed to 1.
[0169]
For a 1-word instruction not located at a multiple of 4, when no instruction read is in progress (IFMON = 0), the PC value to be saved is output from WDB-M to the internal data bus IDB. At that time, bit 1 of the PC value is not fixed to 1 (see the T9 portion of FIG. 20).
[0170]
For a 2-word instruction located at a multiple of 4, when an instruction read is in progress (IFMON = 1), the PC value to be saved is output from WDB-M to the internal data bus IDB. Bit 1 is not fixed.
[0171]
For a 2-word instruction located at a multiple of 4, when no instruction read is in progress (IFMON = 0), the PC value to be saved is output from the program counter PC to the internal data bus IDB. Bit 1 is not fixed (see the T8 portion of FIG. 19).
[0172]
For a 2-word instruction not located at a multiple of 4, when an instruction read is in progress (IFMON = 1), the PC value to be saved is output from WDB-S to the internal data bus IDB. At that time, bit 1 of the PC value is fixed to 1.
[0173]
For a 2-word instruction not located at a multiple of 4, when no instruction read is in progress (IFMON = 0), the PC value to be saved is output from WDB-M to the internal data bus IDB. At that time, bit 1 of the PC value is fixed to 1 (see the T10 portion of FIG. 18).
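The eight cases above amount to a lookup keyed on instruction length, address alignment, and IFMON. The following is a minimal sketch of that lookup; the table and function names are hypothetical, and the register contents are passed in as plain integers rather than modeled as hardware.

```python
# (instruction words, address is a multiple of 4, IFMON) -> (source, fix bit 1 to 1)
SAVE_RULES = {
    (1, True, 1): ("WDB-S", True),
    (1, True, 0): ("WDB-M", True),
    (1, False, 1): ("WDB-S", False),
    (1, False, 0): ("WDB-M", False),
    (2, True, 1): ("WDB-M", False),
    (2, True, 0): ("PC", False),
    (2, False, 1): ("WDB-S", True),
    (2, False, 0): ("WDB-M", True),
}


def saved_pc(instr_addr: int, words: int, ifmon: int,
             pc: int, wdb_m: int, wdb_s: int) -> int:
    """Return the PC value placed on IDB for saving, per the eight cases."""
    source, fix_bit1 = SAVE_RULES[(words, instr_addr % 4 == 0, ifmon)]
    value = {"PC": pc, "WDB-M": wdb_m, "WDB-S": wdb_s}[source]
    return (value | 0b10) if fix_bit1 else value
```

Using the register contents that the description gives for FIG. 18 (2-word instruction at address 18, PC = 24, WDB-M = 20, WDB-S = 16, IFMON = 0), `saved_pc(18, 2, 0, 24, 20, 16)` yields 22, the address of the instruction following the subroutine branch.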
[0174]
In the program-relative addressing mode, a displacement (relative value) is added to the address of the next instruction. The address of the next instruction used for this is the same as the PC value saved by a subroutine branch instruction, so the contents of the write data buffer WDB can be used. That is, as described above, the address of the next instruction may be read out as appropriate from WDB-M, WDB-S, or the program counter PC to the internal bus GB, and the displacement may be added by the arithmetic logic unit ALU or the like. Setting bit 1 to 1 may be performed by the arithmetic logic unit ALU, on the internal bus GB, or in the write data buffer WDB or the program counter PC.
[0175]
FIG. 16 shows an example of an arithmetic unit AU for calculating a program-relative branch address. Before execution of a 1-word program-relative branch instruction begins, the arithmetic unit AU receives, via the multiplexer MPX, the value of the program counter PC (also simply called the PC value) stored in the write data buffer WDB. Because the instruction is one word long, the program counter PC itself is not used. The arithmetic unit AU also receives, via the internal bus DB, the 8-bit displacement contained in the instruction code, which is held in the read data buffer RDB or the instruction register (FIFO) 200. The arithmetic unit AU adds the two inputs. When such a branch instruction is located at a multiple of 4, the control signal pls2 fixes bit 1 of the PC value to 1, which effectively performs +2 at the same time.
[0176]
By having the arithmetic unit AU for calculating the program relative branch address, the branch address can be calculated regardless of the operation state of the arithmetic logic unit ALU, and the speed of the branch can be increased.
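The address calculation performed by the arithmetic unit AU can be sketched as follows. The function name is hypothetical, and treating the 8-bit displacement as a signed (two's complement) value is an assumption made for illustration; the description says only that the two inputs are added.

```python
def relative_branch_target(stored_pc: int, disp8: int, pls2: bool) -> int:
    """Add an 8-bit displacement to the PC value held in WDB."""
    # Assumed: the displacement is sign-extended before the addition.
    disp = disp8 - 0x100 if disp8 & 0x80 else disp8
    # pls2 fixes bit 1 of the PC value to 1, which effectively adds 2
    # when the branch instruction is located at a multiple of 4.
    base = (stored_pc | 0b10) if pls2 else stored_pc
    return base + disp
```

With the values the description gives for FIG. 21 (stored PC value 16, displacement 22, bit 1 fixed to 1), the result is 40, which is the address of label L2.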
[0177]
The program-relative addressing mode is not limited to branch instructions; it can also be used for transfer instructions and the like, and the arithmetic unit AU speeds up calculation of the effective address and thus improves instruction processing speed.
[0178]
FIG. 17 shows the operation timing of the subroutine branch instruction (JSR @aa:24). The representation format is the same as in FIG. 9. FIG. 17 is the same as the branch instruction of FIG. 9 except that a state for stacking the program counter PC is inserted as the third state ST3. The first state ST1 waits for completion of the read of the second word of the own instruction, and is omitted (skipped) if the second word has already been read. The instruction read from the branch destination address in the second state ST2, the stacking in the third state ST3, and the read of the next instruction in the fourth state ST4 are operations inherent to the subroutine branch instruction and are not omitted.
[0179]
FIGS. 18 and 19 show examples of execution timing for a program including a subroutine branch instruction. Each figure shows the timing when the following program is executed after a branch by a branch instruction.
L0 JMP L1
L1 MOV.W @aa1, R0
JSR L2
In FIGS. 18 and 19, the addresses at which the instructions are placed differ. In FIG. 18, the labels are L0 = 2, L1 = 14, and L2 = 40, that is,
L0 .EQU 2
L1 .EQU 14
L2 .EQU 40
In FIG. 19, L0 = 2, L1 = 12, and L2 = 40, that is,
L0 .EQU 2
L1 .EQU 12
L2 .EQU 40
[0180]
The subroutine branch instruction has an instruction code that is 2 words long. In FIG. 18, the subroutine branch instruction is located at addresses 18 to 21, so the PC value to be saved is 22. In FIG. 19, the subroutine branch instruction is located at addresses 16 to 19, so the PC value to be saved is 20. In FIGS. 18 and 19, the operation timing up to execution of the subroutine branch instruction is the same as in FIGS. 10 and 11, respectively.
[0181]
In FIG. 18, the instruction code (jsr-1) is input to the decoder (DEC) 203 in slot C1 of cycle T7, and the contents of the instruction are decoded. At this time, the PC value is the address (24) of the next instruction read, WDB-M of the write data buffer holds the address (20) of the previous instruction read, and WDB-S holds the address (16) of the instruction read before that.
[0182]
In slot C2 of cycle T7, the contents of WDB-M are transferred to WDB-OUT (20).
[0183]
In slot C1 of cycle T9, the stack pointer SP is read to the internal bus GB, and is input to the arithmetic logic unit ALU to decrement (−4).
[0184]
In slot C2 of cycle T9, the decremented result is read out to the internal bus WB and written to the stack pointer SP; it is also read out to the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. At the same time, a longword data write bus command is issued.
[0185]
The write data is output in slot C1 of cycle T10 as the contents of WDB-OUT with bit 1 fixed to 1, and this value (22) is stored on the stack.
[0186]
In FIG. 19, the instruction code (jsr-1) is input to the decoder (DEC) 203 in the slot C1 of cycle T6, and the contents of the instruction are decoded.
[0187]
At this time, the PC value is the address (20) of the next instruction read, WDB-M of the write data buffer holds the address (16) of the previous instruction read, and WDB-S holds the address (12) of the instruction read before that.
[0188]
In slot C2 of cycle T7, the content (20) of the PC value is transferred to WDB-OUT via the internal bus DB.
[0189]
In slot C1 of cycle T7, the stack pointer SP is read out to the internal bus GB and input to the arithmetic logic unit ALU, which decrements it (−4). In slot C2 of cycle T7, the decremented result is read out to the internal bus WB and written to the stack pointer SP; it is also read out to the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. At the same time, a longword data write bus command is issued.
[0190]
The write data is output in slot C1 of cycle T8 as the contents of WDB-OUT (bit 1 is not fixed), and this value (20) is stored on the stack.
[0191]
FIGS. 20 and 21 show examples of the execution timing of programs including other subroutine branch instructions.
[0192]
In FIG. 20, the subroutine branch instruction uses memory indirect addressing rather than absolute addressing. The figure shows the timing when the following program is executed after a branch by a branch instruction.
(program listing not reproduced in this text)
In this case, the labels are L0 = 2, L1 = 14, L2 = 40, and L3 = 160, that is,
L0 .EQU 2
L1 .EQU 14
L2 .EQU 40
L3 .EQU 160
In memory indirect addressing, memory is read at the address (L3) contained in the instruction code, and the contents read (L2) become the branch address.
[0193]
The timing in FIG. 20 is the same as that in FIG. 10 until the subroutine branch instruction is executed. The instruction code (jsr) is input to the decoder (DEC) 203 in the slot C1 of the cycle T7 in FIG. 20, and the contents of the instruction are decoded.
[0194]
At this time, the PC value is the address (24) of the next instruction read, WDB-M of the write data buffer holds the address (20) of the previous instruction read, and WDB-S holds the address (16) of the instruction read before that.
[0195]
In slot C2 of cycle T7, the contents of WDB-M are transferred to WDB-OUT (20). In slot C1 of cycle T9, the stack pointer SP is read out to the internal bus GB and input to the arithmetic logic unit ALU, which decrements it (−4). In slot C2 of cycle T9, the decremented result is read out to the internal bus WB and written to the stack pointer SP; it is also read out to the internal bus GB, stored in the address buffer AB, and output to the internal address bus IAB. At the same time, a longword data write bus command is issued.
[0196]
Write data is output in slot C1 of cycle T10 with the content of WDB-OUT fixed at bit 1 of the write address to 1, and this content (22) is stored in the stack.
[0197]
In FIG. 21, the subroutine branch instruction uses the program-counter-relative addressing mode. The figure shows the timing when the following program is executed after a branch by a branch instruction.
(program listing not reproduced in this text)
The labels in this case are L0 = 2, L1 = 12, and L2 = 40, that is,
L0 .EQU 2
L1 .EQU 12
L2 .EQU 40
In this case, the displacement of the BSR is 22 (decimal).
[0198]
In FIG. 21, the timing is the same as in FIG. 11 until the subroutine branch instruction is executed. In slot C1 of cycle T6, the PC value (16) stored in the write data buffer WDB and the 8-bit displacement (22) held in the read data buffer RDB are input to the arithmetic unit AU, and the addition is performed with bit 1 of the PC value fixed to 1. The addition result (40) is output to the internal bus GB in slot C2 of cycle T6, stored in the address buffer AB, and output to the internal address bus IAB.
[0199]
Further, the contents (16) of WDB-M of the write data buffer are transferred to WDB-OUT in slot C2 of cycle T6, and are output to the internal data bus IDB in slot C1 of cycle T9 with bit 1 fixed to 1. This value (18) is written to the stack as the return address. Bit 1 is fixed to 1 because bit 1 of the address at which the BSR is located is 0.
[0200]
FIGS. 22 and 23 show a logical description of part of the decode logic, contained in the decoder (DEC) 203, for the transfer instruction of FIG. 8. The logical description shown in the figures is called an RTL (Register Transfer Level) or HDL (Hardware Description Language) description, and can be expanded into a logic circuit using a known logic synthesis tool. HDL is standardized as IEEE 1364. The description shown here is built around a case statement: when a value or signal listed within the parentheses following always @ changes, the statements that follow are processed. "5'b00001" means the 5-bit binary value 00001. IR[8] means the logical value of the 9th bit from the least significant bit of the instruction register IR (the input to DEC). The symbol ~ means logical inversion.
[0201]
The logical descriptions in FIGS. 22 and 23 decode the code of the transfer instruction "MOV.W @aa:16, Rd". In these descriptions, the pattern 16'b0110_101?_??00_???? on the line following casex (IR) denotes the code of the transfer instruction. IR[8] = 0 means byte size and IR[8] = 1 means word size; IR[7] = 0 means a transfer from memory to a general-purpose register (read type), and IR[7] = 1 means a transfer from a general-purpose register to memory (write type). For this instruction, whether to omit the first state ST1 and the third state ST3 is determined according to the states of the signals FIFOCNT1, FIFOCNT2, and IFMON. That is, in the logical description, control signals are generated according to the stage code TMG, and the value of the next stage code NEXTTMG is determined from the value of the current stage code TMG and the values of FIFOCNT1, FIFOCNT2, and IFMON at that time; this determines whether the first state ST1 and the third state ST3 are omitted. Referring to FIG. 22, the stage code of the first state ST1 is 1, the stage code of the second state ST2 is 17, and the stage code of the third state ST3 is 3.
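The casex match and the meaning of IR[8] and IR[7] can be illustrated with a minimal software sketch. It reads the pattern as 16'b0110_101?_??00_????, with ? as a don't-care bit; the function names are hypothetical, and the sketch models only the pattern match, not the stage-code logic of the figures.

```python
PATTERN = "0110_101?_??00_????"  # don't-care bits written as '?'


def matches(ir: int, pattern: str = PATTERN) -> bool:
    """Compare a 16-bit instruction word with a casex-style pattern."""
    bits = pattern.replace("_", "")
    for i, ch in enumerate(bits):
        bit = (ir >> (15 - i)) & 1  # pattern is written MSB first
        if ch != "?" and int(ch) != bit:
            return False
    return True


def decode_transfer(ir: int):
    """Return (size, direction) for a matching code, else None."""
    if not matches(ir):
        return None
    size = "word" if (ir >> 8) & 1 else "byte"        # IR[8]
    direction = "write" if (ir >> 7) & 1 else "read"  # IR[7]
    return size, direction
```

For instance, `decode_transfer(0x6B00)` returns `("word", "read")`, while a code whose bits 5 and 4 are not both 0, such as 0x6B20, does not match.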
[0202]
Specifically, the stage code TMG is generated in the first part (1-1) of the logical description in FIG. The stage code TMG proceeds from 1 → 17 → 3, but the stage code 17 and the stage code 3 are omitted depending on the state of FIFOCNT1, FIFOCNT2, and IFMON. If the second word of the self-instruction is already read (FIFOCNT1 = 1) at stage code 1, data read / write control is performed. If the second word of the self-instruction has not been read in the stage code 1 (FIFOCNT1 = 0), the process proceeds to the stage code 17 to perform data read / write control.
[0203]
Bus control is performed in the second part (1-2) of the logical description. When nop = 0, bus access is started, and when nop = 1, bus access is prohibited. data = 0 indicates an instruction read, and data = 1 indicates a data access. byte = 0 indicates word size, and byte = 1 indicates byte size. write = 0 indicates a read, and write = 1 indicates a write.
[0204]
In the case of this transfer instruction, an instruction read is performed in stage code 1 when the second word of the own instruction has not been read, and in stage code 3. Data access is performed in stage code 1 when the second word of the own instruction has been read, or in stage code 17. Whether the data access is a read or a write is indicated by IR [7]. In the case of an instruction read, the contents of the internal data bus IDB are stored in the IR and the read data buffer RDB at a predetermined timing. In the case of a data read, the contents of the internal data bus IDB are stored in the read data buffer RDB at a predetermined timing. In the case of a data write, the contents of the write data buffer WDB are output to the internal data bus IDB at a predetermined timing.
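A minimal sketch of this bus control, written in Python rather than HDL, is given below. The type BusControl and the function bus_control are hypothetical names. The values of write and byte during an instruction read are not stated in the description and are set to 0 here as an assumption.

```python
# Sketch of the bus-control part (1-2) for the transfer instruction.
from typing import NamedTuple


class BusControl(NamedTuple):
    nop: int    # 0: start bus access, 1: bus access prohibited
    data: int   # 0: instruction read, 1: data access
    write: int  # 0: read, 1: write
    byte: int   # 0: word size, 1: byte size


def bus_control(tmg: int, ir: int, fifocnt1: int) -> BusControl:
    """Return the bus control signals for stage code tmg."""
    if tmg == 17 or (tmg == 1 and fifocnt1 == 1):
        # Data access: direction from IR[7], size from the inverse of IR[8].
        return BusControl(nop=0, data=1,
                          write=(ir >> 7) & 1,
                          byte=1 - ((ir >> 8) & 1))
    if tmg == 3 or (tmg == 1 and fifocnt1 == 0):
        # Instruction read (write/byte values are assumed, see lead-in).
        return BusControl(nop=0, data=0, write=0, byte=0)
    raise ValueError("stage code not used by this instruction")
```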
[0205]
The effective address is calculated in the third part (1-3) of the logical description of FIGS. 22 and 23. In the case of this transfer instruction, in stage code 1 when the second word of the own instruction has been read, or in stage code 17, the 16 bits of the EA extension part of the instruction code held in the read data buffer RDB are sign-extended to 32 bits by the rdbext signal and output to the internal bus GB. In stage code 1 when the second word of the own instruction has not been read, and in stage code 3, output of the PC value to the internal bus PCGB, its input to the address buffer AB and the incrementer INC, and storage of the increment result from the internal bus PCWB in the program counter PC are instructed. The address buffer AB takes its input from the internal bus GB when input from the internal bus PCGB is not instructed. Here, rdbgb is the signal that instructs output of the read data buffer RDB to the bus GB, and rdbext is the signal that instructs sign extension of the read data buffer RDB.
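The sign extension instructed by rdbext can be illustrated with a short Python sketch. The function name is hypothetical; only the 16-bit to 32-bit sign extension itself is taken from the description.

```python
# Sketch: sign-extend the 16-bit EA extension part to a 32-bit address.

def sign_extend_16_to_32(value: int) -> int:
    """Return the 32-bit sign extension of the low 16 bits of value."""
    value &= 0xFFFF
    if value & 0x8000:          # bit 15 set: fill the upper 16 bits with 1s
        return value | 0xFFFF0000
    return value                # bit 15 clear: upper 16 bits stay 0
```

Under this scheme an absolute address such as 0x1234 is unchanged, while 0x8000 becomes 0xFFFF8000.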
[0206]
In the fourth part (1-4) of the logical description in FIG. 23, transfer data and registers are controlled. In the case of the read type (IR [7] = 0), in stage code 1 when the second word of the own instruction has been read, or in stage code 17, output of the read data from the read data buffer RDB to the bus WB and its storage in the general-purpose register (Rd) are instructed. Update of the N, Z, and V flags of the condition code register CCR is also instructed. As shown in FIG. 8 and elsewhere, the control signals for this operation are delayed before they take effect. The delay circuit itself is not shown.
[0207]
In the case of the write type (IR [7] = 1), in stage code 1 when the second word of the own instruction has been read, or in stage code 17, output of the data from the general-purpose register (Rd) to the internal bus DB and its storage in the write data buffer WDB are instructed. Update of the N, Z, and V flags of the condition code register CCR is also instructed.
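The flag update can be sketched as follows. The description states only that N, Z, and V are updated; the specific rule used here (N follows the most significant bit, Z indicates a zero value, V is cleared) is an assumption based on the usual behavior of MOV in the H8S instruction set, and the function name is hypothetical.

```python
# Sketch of the CCR flag update for a transfer instruction.
# Assumption: N = most significant bit, Z = (data == 0), V = 0.

def mov_flags(data: int, size_bits: int) -> dict:
    """Return the N, Z, V flag values for transferred data of the given size."""
    data &= (1 << size_bits) - 1
    return {
        "N": (data >> (size_bits - 1)) & 1,
        "Z": int(data == 0),
        "V": 0,
    }
```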
[0208]
FIGS. 24 to 26 show a logical description of part of the decoding logic, included in the decoder (DEC) 203, for the branch instruction and subroutine branch instruction shown in FIGS. 9 and 17. The notation is the same as in FIGS. 22 and 23. In the decoding logic of FIGS. 24 to 26, IR [10] = 0 denotes a branch (JMP) and IR [10] = 1 denotes a subroutine branch (JSR).
[0209]
The stage code TMG is generated in the first part (2-1) of the logical description of FIGS. 24 to 26. The stage code proceeds 1 → 17 → 2 → 3, but stage code 17 is omitted depending on the states of FIFOCNT1, FIFOCNT2, and IFMON. If the second word of the own instruction has been read at stage code 1 (FIFOCNT1 = 1), effective address calculation and branch destination instruction read control are performed in stage code 1. If it has not been read at stage code 1 (FIFOCNT1 = 0), the process proceeds to stage code 17, where effective address calculation and branch destination instruction read control are performed.
[0210]
Bus control is performed in the second part (2-2) of the logical description of FIGS. 24 to 26. In the case of this branch instruction, bus access is prohibited in stage code 1 when the second word of the own instruction has not been read. In stage code 1 when the second word of the own instruction has been read, or in stage code 17, the branch destination instruction is read. In the case of a subroutine branch, the process proceeds to stage code 2, where a long word size data write is performed to stack the PC value. In stage code 2, an instruction read following the branch destination instruction that has been read is performed.
[0211]
The effective address is calculated in the third part (2-3) of the logical description of FIGS. 24 to 26. In the case of this branch instruction, in stage code 1 when the second word of the own instruction has been read, or in stage code 17, the EA extension part of the instruction code held in the read data buffer RDB is output to the internal bus GB and stored in the address buffer AB. This content is automatically input to the incrementer INC and incremented (+2 / +4), and storage of the increment result in the program counter PC is instructed. In stage code 3, output of the PC value to the internal bus PCGB and storage of the increment result from the internal bus PCWB in the program counter PC are instructed.
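The +2 / +4 selection by the incrementer can be illustrated as below. This sketch assumes that instruction addresses are even and that a 32-bit instruction read always fetches an aligned long word, so that a branch destination that is a multiple of 4 is followed by +4 and any other destination by +2. The function names are hypothetical.

```python
# Sketch of the incrementer's +2 / +4 selection after a branch.
# Assumption: a multiple-of-4 destination gives +4, otherwise +2, so that
# the next instruction read starts on a 4-byte boundary.

def branch_increment(address: int) -> int:
    """Return the increment applied after reading the branch destination."""
    if address % 2:
        raise ValueError("instruction addresses are assumed to be even")
    return 4 if address % 4 == 0 else 2


def next_read_address(address: int) -> int:
    """Return the address of the instruction read following the branch."""
    return address + branch_increment(address)
```

Either way the following read lands on a 4-byte boundary: a branch to 0x102 is followed by a read at 0x104, and a branch to 0x100 by a read at 0x104 as well.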
[0212]
In the fourth part (2-4) of the logical description shown in FIG. 26, transfer data (the PC value to be stacked) and registers are controlled. In stage code 1 when the second word of the own instruction has been read, or in stage code 17, output of the contents of the stack pointer to the internal bus GB is instructed. This value is input to the arithmetic logic unit ALU and, although not shown, a decrement (−4) is instructed to the arithmetic logic unit ALU.
[0213]
In stage code 2, output of the decrement result from the arithmetic logic unit ALU to the internal bus GB is instructed. As a result, the decrement result is stored in the address buffer AB. In addition, storage from the internal bus WB to the stack pointer SP is instructed.
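The net effect of these stack operations (decrement SP by 4, use the result as the write address, store the PC value as a long word, and write the result back to SP) can be sketched in Python. The memory is modeled as a dictionary keyed by long word address, which is a simplification, and the function name is hypothetical.

```python
# Sketch of the subroutine-branch stacking: SP <- SP - 4, then the PC value
# to be saved is written as a long word at the new SP.

def push_return_address(sp: int, return_pc: int, memory: dict) -> int:
    """Store return_pc on the stack and return the updated stack pointer."""
    sp = (sp - 4) & 0xFFFFFFFF               # decrement (-4) by the ALU
    memory[sp] = return_pc & 0xFFFFFFFF      # long word data write
    return sp                                # stored back to SP
```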
[0214]
Further, as described above, the address where the subroutine branch instruction exists is held as A1, and is decoded simultaneously with the instruction code. A value to be transferred to the write data buffer WDB-OUT is selected from PC, WDB-M, and WDB-S using the address information and the signal IFMON indicating that the instruction read is being executed. Further, based on the address information, it is selected whether to set bit 1 of the data to be output to 1 (+2).
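Setting bit 1 of the output data to obtain +2 can be illustrated as follows. The sketch assumes the value being adjusted is a multiple of 4, so that bit 1 is 0 and forcing it to 1 is equivalent to adding 2; the function name is hypothetical and the selection among PC, WDB-M, and WDB-S is not modeled.

```python
# Sketch: realize +2 by forcing bit 1 to 1 instead of using an adder.
# Valid only when the input is a multiple of 4 (bits 1 and 0 are both 0).

def plus_two_by_bit1(value: int) -> int:
    """Return value + 2, computed by setting bit 1."""
    if value & 0b11:
        raise ValueError("expects a multiple of 4")
    return value | 0b10
```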
[0215]
Such control allows execution that does not depend on instruction placement, with instruction states omitted as appropriate.
[0216]
Note that, if there is an instruction that has no omissible state and that cannot be executed without incrementing the program counter PC through the alternate operation of the arithmetic logic unit ALU and the sub-arithmetic logic unit ALUS, bus command issuance and the PC increment may be prohibited by referring to the state of FIFOCNT2. For example, in FIGS. 22 to 26, nop = 1 is set in the second part (1-2, 2-2) and inpcc = 0 is set in the third part (1-3, 2-3). This prevents the instruction read amount from exceeding the instruction execution (consumption) amount, the instruction register (FIFO) 200 from overflowing, and the contents of the program counter PC to be saved by a subroutine branch instruction or the like from being lost.
[0217]
From the above, the following effects can be obtained. [1] Relative to an existing CPU, the data bus width is expanded without losing compatibility, which raises the instruction read speed. Control of instruction execution is divided into states that include instruction-specific operations and states in which only an instruction read is performed, and the latter can be omitted (skipped). Omitting part of the instruction execution states, on top of the faster instruction read, shortens the instruction execution time and realizes higher speed. Because the omissible states are omitted (skipped) as appropriate according to the amount of instructions already read, instructions can be placed at arbitrary addresses (relocatable). Arbitrary placement of instructions makes programs easier to write and removes restrictions on the development of tools such as a C compiler.
[0218]
[2] For instructions such as inter-register operation instructions, which have a 1-word (basic unit length) instruction code, execute in 1 state (unit time), and have no omissible (skippable) state, a plurality of arithmetic units are provided and operated with a time difference shorter than the execution time of the resources used for execution, so that a plurality of such instructions are in effect executed simultaneously. The instruction read for one of the instructions executed in effect simultaneously can be omitted, which balances the instruction read amount against the execution amount, shortens instruction execution time, and realizes higher speed. The instruction decoders are limited to one that handles all instructions (DEC 203) and one that exclusively controls one of the arithmetic units operated in effect simultaneously (SDEC 204), which minimizes the increase in logical scale and, consequently, in manufacturing cost. Because the arithmetic units operate with a time difference, parallel processing is unnecessary, contention for general-purpose registers is easily handled, and the increase in logical scale is suppressed. The instruction decoder that handles all instructions and each block of the execution unit EXEC can largely share logic with the existing CPU, so design assets can be reused effectively to improve design quality and shorten the development period.
[0219]
[3] An instruction code that only generates a control signal, such as a prefix code, can be skipped, and at the time of skipping only the control signal is generated without using the instruction decoder; this shortens instruction execution time and achieves higher speed.
[0220]
[4] At the time of a branch instruction or of interrupt / exception processing, the first instruction at the branch destination is read and its execution is started immediately, which maintains or improves responsiveness.
[0221]
[5] The instruction register stores, together with the instruction code, the contents of bit 1 of the address bus IAB for that instruction code, and the instruction decoder 203 evaluates them together; this facilitates control, simplifies the decoding circuit, and prevents an increase in logical scale.
[0222]
[6] The PC increment after the branch destination instruction is read is switched automatically between +2 and +4 by the incrementer INC according to the contents of the branch destination address, so that the same processing can be performed whether or not the branch destination is a multiple of 4, which prevents an increase in logical scale.
[0223]
[7] When the PC is incremented, the contents of the PC before the increment are stored in the write data buffer WDB. Giving the write data buffer WDB a FIFO structure, and allowing bit 1 of the write data buffer WDB to be fixed at logical value 1 so that +2 is easily realized, makes it easy to obtain the PC value to be saved by a subroutine branch instruction. In addition, an arithmetic unit AU that adds a displacement to the PC value to be saved, held in the write data buffer WDB, is provided with a path for direct input from the write data buffer WDB; this speeds up the program-relative addressing mode and improves processing speed. Furthermore, there is no need for a separate prefetch counter and program counter, nor for a further incrementer, so the increase in logical scale is suppressed.
[0224]
[8] When there are a CPU with a wide address space and a compatible CPU with a narrow address space, higher speed can be realized while maintaining compatibility with both. The necessary instructions and addressing modes may be provided as appropriate.
[0225]
[9] Since existing instructions can be executed and the order of internal operations is unchanged, the margin for future expansion is not greatly impaired compared with existing CPUs. For example, a technique that allows a new instruction to be added to an existing CPU can be expected to apply equally to a CPU to which the present invention is applied. As long as instruction set compatibility is maintained, an added instruction can use the same machine language as in the existing CPU. If the added instruction has a plurality of execution states, these can be divided into states that perform instruction-specific operations and omissible states, and the latter can be omitted as necessary. At the least, the instruction read and PC increment can be prohibited as necessary, so the instruction can be realized with a processing time equivalent to that of the existing CPU. If the added instruction can be executed in one state, higher speed can be realized by the alternate operation of ALU and ALUS.
[0226]
[10] By using the same instruction set as that of an existing CPU, development tools such as an assembler, C compiler, simulator / debugger, etc., so-called cross software can be shared. By making cross software common, the development environment can be quickly prepared. Further, resources necessary for development of the development environment can be suppressed, and undesired costs can be avoided for the user by using the existing development environment.
[0227]
Although the invention made by the present inventor has been specifically described based on the embodiments, it is needless to say that the present invention is not limited thereto and can be variously modified without departing from the gist thereof.
[0228]
For example, the present invention can be applied to a completely new microcomputer apart from maintaining compatibility. An instruction set, that is, an instruction type, an addressing mode type, and a combination thereof can be arbitrarily set. The general-purpose registers need not be commonly used for addresses and data, and some or all of them may be dedicated for addresses or data. The data size of the general-purpose register can also be set arbitrarily.
[0229]
The type of the prefix code is not particularly limited. In addition, the prefix code may include other control information in addition to the information indicating the long word. Also, it is not necessary to limit the basic unit of the instruction code to 16 bits, and any bit width such as 8 bits or 32 bits can be used. The width of the data bus is not limited to 32 bits, and may be 64 bits. Instead of twice the basic unit of the instruction, it may be four times.
[0230]
The capacity of the instruction register (FIFO) is not limited to 3 words; at least 2 words are sufficient. With a large capacity, even when there are instructions without an omissible state, the instruction amounts can be balanced by letting the accumulated instructions grow and omitting states in subsequent instruction execution. However, even if the capacity is increased, the instructions already read are wasted when a branch instruction is executed. It is therefore better not to increase the amount of instructions held in the instruction register in the normal or steady state.
[0231]
In the present invention, parallel processing is not performed, but it may be configured in combination with parallel processing. Some instructions may be processed in parallel. The number of arithmetic units and instruction decoders can be arbitrarily set. There are no restrictions on the other functional blocks of the single-chip microcomputer.
[0232]
The above description has dealt mainly with the case where the invention made by the present inventor is applied to a single-chip microcomputer, the field of use that forms its background. However, the present invention is not limited thereto and can also be applied to other microcomputers or data processing devices; it can be applied at least to a data processing device that decodes instructions and performs arithmetic processing.
[0233]
[Effects of the invention]
The effects obtained by the representative ones of the inventions disclosed in the present application will be briefly described as follows.
[0234]
That is, the internal data bus width is made larger (than in an existing CPU), at least larger than the basic unit (word) of an instruction, and an instruction register that holds (a plurality of units of) read instructions and a means for monitoring the amount of instructions held in it are provided. Control of (existing) instructions is divided, per basic unit time (state) of execution, into states that control only an instruction read (and PC increment) and states that include control of effective address calculation and data processing. For example, even when effective address calculation or data transfer extends over a plurality of states, the control itself is performed once and the actual operations are carried out over a plurality of states (for example, by delaying the control signal): the address calculation is performed in the first state and the read data is stored in the next state. The operation whose control signal is delayed (for example, storage of the read data) and the next control operation (PC increment) are configured so that they can overlap and operate simultaneously. The states that control only an instruction read are omissible, and are omitted (skipped) according to the amount held in the instruction register (in accordance with the indication of the monitoring means).
[0235]
Making the internal data bus width larger than the basic unit (word) of an instruction increases the amount of instruction code read at one time (compared with the existing CPU). If, as in the existing CPU, instruction reads are performed as many times as the length of the instruction's own code and none is omitted (skipped), the amount of instruction code read exceeds the amount of code of the instruction executed, and read instruction code accumulates. By omitting (skipping) states and not performing the instruction read, either the amount read is made equal to the amount of code of the instruction executed, so that the amount of read instruction code is maintained, or the amount read is made smaller, so that the amount of read instruction code is reduced. In this way the amount of read instruction code is kept within a predetermined range (the instruction read amount is balanced against the instruction execution amount), while instruction reads are sped up and the execution time of the whole program is shortened. In addition, because the states to be omitted (skipped) change automatically, changes in instruction placement can be accommodated.
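The balance between the amount read and the amount consumed can be checked with a small arithmetic sketch. It assumes a 32-bit data bus and a 16-bit instruction word, so that each instruction read brings in 2 words; the constant and function names are hypothetical.

```python
# Sketch: net change in the amount of read-ahead instruction code caused by
# executing one instruction. Assumes 2 words (32 bits) per instruction read.

WORDS_PER_READ = 2


def net_fifo_change(code_words: int, read_states: int) -> int:
    """Words read during the instruction minus words the instruction consumes."""
    return read_states * WORDS_PER_READ - code_words
```

For a 2-word instruction, performing both instruction reads gives net_fifo_change(2, 2) = +2 (code accumulates), whereas omitting one read-only state gives net_fifo_change(2, 1) = 0 (the amount is maintained). For a 3-word instruction executed with a single read, net_fifo_change(3, 1) = -1 (the amount is reduced).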
[0236]
Where a CPU with a wide address space (large instruction set) and a CPU with a small address space (small instruction set) maintain compatibility at the object level, implementing the above speed-up in the CPU with the wide address space achieves the same speed-up for the instructions that also exist in the backward-compatible CPU with the small address space. In other words, the same method can be used to increase the speed of both the CPU with the wide address space and the CPU with the small address space while maintaining compatibility at the object level. Both the advantages of maintaining compatibility at the object level and the advantages of higher speed can be enjoyed.
[0237]
By using the same instruction set as that of an existing CPU, development tools such as an assembler, C compiler, simulator / debugger, etc., so-called cross software can be used in common. By making cross software common, the development environment can be quickly prepared.
[Brief description of the drawings]
FIG. 1 is a block diagram of a CPU to which a data processing apparatus according to the present invention is applied.
FIG. 2 is a block diagram of a single chip microcomputer to which a data processing apparatus according to the present invention is applied.
FIG. 3 is an explanatory diagram of a programming model relating to general-purpose registers and control registers built in the CPU of FIG. 1.
FIG. 4 is an explanatory diagram of a programming model related to general-purpose registers and control registers built in another CPU.
FIG. 5 is an explanatory diagram showing an example of a machine language instruction format in the CPU 2 of FIG. 1.
FIG. 6 is an explanatory diagram illustrating an example of a ROM.
FIG. 7 is an explanatory diagram of a CPU addressing mode.
FIG. 8 is an operation timing chart of a transfer instruction “MOV.W @aa: 16, Rd”.
FIG. 9 is an operation timing chart of a branch instruction “JMP @aa: 24”.
FIG. 10 is a timing chart illustrating an example of execution timing of a program by a CPU.
FIG. 11 is a timing chart illustrating the execution timing of a program having a different instruction arrangement address from FIG. 10.
FIG. 12 is a timing chart illustrating the execution timing of a program having a different instruction arrangement address from those in FIGS. 10 and 11.
FIG. 13 is a timing chart illustrating the execution timing of a program having a different instruction arrangement address from those in FIGS. 10 to 12.
FIG. 14 is a block diagram illustrating an example of an incrementer.
FIG. 15 is a block diagram illustrating an example of a write data buffer.
FIG. 16 is a block diagram showing an example of an arithmetic operator for calculating a program relative branch address.
FIG. 17 is a timing chart illustrating the operation timing of a subroutine branch instruction “JSR @aa: 24” by the CPU.
FIG. 18 is a timing chart illustrating the execution timing of a program including a subroutine branch instruction.
FIG. 19 is a timing chart illustrating the execution timing of a program including a subroutine branch instruction.
FIG. 20 is a timing chart illustrating the execution timing of a program including another subroutine branch instruction.
FIG. 21 is a timing chart illustrating the execution timing of a program including another subroutine branch instruction.
FIG. 22 is an explanatory diagram illustrating, together with FIG. 23, part of the logical description of the decoding logic by the decoder for the transfer instruction of FIG. 8.
FIG. 23 is an explanatory diagram illustrating, together with FIG. 22, part of the logical description of the decoding logic by the decoder for the transfer instruction of FIG. 8.
FIG. 24 is an explanatory diagram illustrating, together with FIGS. 25 and 26, part of the logical description of the decoding logic by the decoder for the branch instruction / subroutine branch instruction of FIGS. 9 and 17.
FIG. 25 is an explanatory diagram illustrating, together with FIGS. 24 and 26, part of the logical description of the decoding logic by the decoder for the branch instruction / subroutine branch instruction of FIGS. 9 and 17.
FIG. 26 is an explanatory diagram illustrating, together with FIGS. 24 and 25, part of the logical description of the decoding logic by the decoder for the branch instruction / subroutine branch instruction of FIGS. 9 and 17.
[Explanation of symbols]
1 Single-chip microcomputer
2 CPU
4 ROM
200 Instruction register
202 Instruction register controller
203 Instruction decoder
204 Sub-instruction decoder
205 Register selector
FIFOCNT1, FIFOCNT2 instruction code amount detection signal
ER0 to ER7 General-purpose registers
PC program counter
ALU arithmetic logic unit
ALUS Sub arithmetic logic unit
AU arithmetic unit
INC Incrementer
WDB write data buffer
RDB read data buffer
AB address buffer

Claims

A data processing device that reads an instruction code composed of a basic unit bit number and operates according to a basic unit time,
A data bus having a bit number larger than the basic unit bit number;
Instruction code holding means connected to the data bus, for inputting an instruction code from the data bus, and capable of holding a plurality of instruction codes;
Monitoring means for monitoring the state of the instruction code held in the instruction code holding means;
Control means for decoding an instruction code acquired from the instruction code holding means and controlling an instruction execution operation;
A data processing apparatus, wherein the control means divides a part or all of the instruction execution operation based on the instruction code into control operations for each basic unit time and, when the divided control operations include a first operation for reading an instruction code and a second operation for performing other operations, execution of the first operation is suppressed according to the monitoring state of the monitoring means.

The state of the instruction code is a holding amount of the instruction code. When the holding amount is large, execution of the first operation is suppressed, and when the holding amount is small, the first operation is executed. The data processing apparatus according to claim 1.

A data transfer instruction is provided as an executable instruction, and the second operation controlled by the control means for the instruction code of the data transfer instruction includes a transfer source address generation operation for generating a transfer source address and a transfer data storage operation for storing the transferred data,
The control means generates control information for the transfer source address generation operation and the transfer data storage operation in the same basic unit time, and the transfer source address generation operation and the transfer data storage operation are different in basic unit time. The data processing apparatus according to claim 1, wherein the data processing apparatus is executed.

Register means, flag means, read data buffer means, and an internal bus connected to the register means and the read data buffer means,
The read data buffer means temporarily stores the transferred data, outputs the stored data to the internal bus, and the register means can store the contents of the internal bus. The data processing apparatus according to claim 3.

The data processing apparatus according to claim 4, wherein the read data buffer means temporarily stores the transferred data, the stored data is inspected, and the inspection result is reflected in the flag means.

The control means includes a program count means, and the control means performs in parallel, as the first operation, a read or write of the program count means and the operation of storing the transferred data in the register means or reflecting it in the flag means. The data processing apparatus according to claim 4.

Program counting means, and arithmetic operation means capable of incrementing the contents of the program counting means,
The value incremented by the arithmetic operation means is a value selected from a first value corresponding to the size of the data bus and a second value smaller than the first value according to the input value. The data processing apparatus according to claim 1, wherein:

A branch instruction is provided as an executable instruction, and when the branch instruction is executed, the arithmetic operation means, after the instruction code of the branch destination is read, selects the value to be incremented from the first value and the second value according to the input value. The data processing apparatus according to claim 7.

An instruction that does not perform a branch as an executable instruction, and when the instruction that does not perform a branch is executed, the arithmetic operation means sets the incremented value as the first value. The data processing apparatus according to claim 7 or 8.

A data processing device that reads an instruction code composed of a basic unit bit number and operates according to a basic unit time,
A data bus having a bit number larger than the basic unit bit number;
Instruction code holding means connected to the data bus, for inputting an instruction code from the data bus, and capable of holding a plurality of instruction codes;
First control means and second control means for decoding an instruction code and controlling an instruction execution operation;
A first arithmetic logic operation means and a second arithmetic logic operation means operating in accordance with the basic unit time, the operation phases of which are shifted from each other;
The first control means controls the first arithmetic logic unit, the second control means controls the second arithmetic logic unit,
The first control means includes the function of the second control means,
The first control means and the second control means are operable at the overlapping time;
A data processing apparatus, wherein the first control means controls the second control means.

Monitoring means for monitoring the state of the instruction code held in the instruction code holding means, and generating a control signal according to the monitoring result;
The instruction code holding means inspects the contents of the held instruction code and generates a detection signal when a predetermined instruction is detected,
The first control means supplies all or part of the predetermined instruction code to the second control means when the detection signal is in a predetermined state, and control by the second control means is permitted when the control signal generated by the monitoring means is in a predetermined state. The data processing apparatus according to claim 10.

Program counting means, and arithmetic operation means capable of incrementing the contents of the program counting means,
The data processing apparatus according to claim 10, wherein the arithmetic operation means operates in synchronization with the basic unit time at a phase different from that of the first arithmetic logic operation means.

A data processing device that operates by reading an instruction code composed of a basic unit number of bits,
A data bus having a bit number larger than the basic unit bit number;
Instruction code holding means connected to the data bus, for inputting an instruction code from the data bus, and capable of holding a plurality of instruction codes;
Monitoring means for monitoring the state of the instruction code held in the instruction code holding means;
Control means for decoding the instruction code and controlling the instruction execution operation,
A data processing device, wherein the control means controls the read operation of instruction codes to be executed later and, when a predetermined instruction code is decoded, the instruction code read amount can be changed according to the monitoring result of the monitoring means.

A branch instruction as an executable instruction,
When the instruction code of the branch instruction is decoded and executed, the control means reads the branch destination instruction and, when the content of the branch destination instruction read is input, decodes that content. The data processing apparatus according to claim 13.

15. A data processing apparatus comprising:
a first arithmetic logic unit, a second arithmetic logic unit, an arithmetic unit, register means, program count means, incrementer means, read data buffer means, and address buffer means;
a first internal bus coupling at least the address buffer means, the program count means, the register means, the first arithmetic logic unit, the second arithmetic logic unit, and the arithmetic unit;
a second internal bus coupling at least the address buffer means, the program count means, and the incrementer means;
a third internal bus coupling the program count means and the incrementer means; and
a fourth internal bus coupling at least the read data buffer means and the register means,
wherein a transfer from the register means to the second arithmetic logic unit via the first internal bus and a transfer from the program count means to the address buffer means via the second internal bus can be performed in parallel, and
furthermore, a transfer from the read data buffer means to the register means via the fourth internal bus and a transfer from the incrementer means to the program count means via the third internal bus can be performed in parallel.
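
The bus couplings recited in this claim can be checked with a small connectivity model. This is a minimal sketch under stated assumptions: the abbreviated unit names (`AB`, `PC`, `REG`, and so on) and the function `can_run_in_parallel` are hypothetical, and the model treats each bus as carrying one transfer per cycle, which the claim does not state explicitly.

```python
# Bus couplings as listed in the claim; unit names are hypothetical abbreviations.
BUSES = {
    1: {"AB", "PC", "REG", "ALU1", "ALU2", "AU"},  # first internal bus
    2: {"AB", "PC", "INC"},                        # second internal bus
    3: {"PC", "INC"},                              # third internal bus
    4: {"RDB", "REG"},                             # fourth internal bus
}


def can_run_in_parallel(transfers):
    """Check a list of (bus, source, destination) transfers for one cycle.

    Returns True when every transfer uses a bus coupled to both of its ends
    and no bus is used by more than one transfer (assumed one transfer per bus).
    """
    used = set()
    for bus, src, dst in transfers:
        if src not in BUSES[bus] or dst not in BUSES[bus]:
            return False  # the bus is not coupled to one of the ends
        if bus in used:
            return False  # the bus is already occupied in this cycle
        used.add(bus)
    return True
```

For example, `can_run_in_parallel([(1, "REG", "ALU2"), (2, "PC", "AB")])` models the first pair of parallel transfers in the claim, and `[(4, "RDB", "REG"), (3, "INC", "PC")]` models the second pair.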

16. The data processing apparatus according to claim 15, further comprising a fifth internal bus coupling the register means, the first arithmetic logic unit, and the second arithmetic logic unit,
wherein a transfer from the register means to the first arithmetic logic unit via the first internal bus and another transfer from the register means to the first arithmetic logic unit via the fifth internal bus can be performed in parallel, and
furthermore, a transfer from the register means to the second arithmetic logic unit via the first internal bus and another transfer from the register means to the second arithmetic logic unit via the fifth internal bus can be performed in parallel.

17. The data processing apparatus according to claim 15 or 16, wherein the first internal bus is further coupled to the read data buffer means,
the apparatus further comprising control means which, when both the transfer from the register means to the first arithmetic logic unit via the first internal bus and the transfer from the read data buffer means to the register means via the fourth internal bus are instructed, instructs output from the read data buffer means to the first internal bus and prohibits output from the register means to the first internal bus.
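
The source selection described in this claim resembles operand forwarding. The following is a hedged sketch only: the function name `select_first_bus_source` and its parameters are hypothetical, and it assumes that a register-to-ALU transfer on the first internal bus has already been requested.

```python
def select_first_bus_source(register_value, read_buffer_value, load_also_requested):
    """Choose what drives the first internal bus toward the first ALU.

    load_also_requested: True when a transfer from the read data buffer to the
        register means (fourth internal bus) is instructed at the same time.
    Returns (value_on_first_bus, register_output_enabled).
    """
    if load_also_requested:
        # Both transfers instructed: the read data buffer drives the bus and
        # the register output is prohibited, so the ALU sees the newly read data.
        return read_buffer_value, False
    # Only the register-to-ALU transfer: the register drives the bus as usual.
    return register_value, True
```

In this model the first arithmetic logic unit receives the freshly read data in the same cycle in which it is written to the register means, rather than the older register contents.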

18. The data processing apparatus according to any one of claims 15 to 17, wherein the second internal bus is further coupled to write data buffer means, and when a transfer from the program count means to the address buffer means via the second internal bus is performed, a transfer to the write data buffer means is also performed.

19. The data processing apparatus according to claim 18, wherein the write data buffer means has low-order bit correction means, and when the contents transferred from the program count means are output, the low-order bits can be corrected by the correction means.
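
One possible reading of the low-order bit correction is a masked replacement of the lowest bits of the stored program counter value. The sketch below is an assumption throughout: the claim does not say how many bits are corrected or where the corrected value comes from, so the function name `correct_low_bits`, the default width of two bits, and the replacement policy are all illustrative.

```python
def correct_low_bits(pc_value, low_bits, width=2):
    """Replace the lowest `width` bits of pc_value with `low_bits`.

    Assumed behavior: the upper bits come from the value transferred from the
    program count means, and only the low-order bits are substituted on output.
    """
    mask = (1 << width) - 1
    return (pc_value & ~mask) | (low_bits & mask)
```

For instance, `correct_low_bits(0x1004, 0b10)` returns `0x1006`, leaving the upper bits of the transferred value untouched.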

20. The data processing apparatus according to claim 18, further comprising another arithmetic unit, wherein the other arithmetic unit is coupled to the first internal bus and the write data buffer means.

21. The data processing apparatus according to claim 1, 10, 13, or 15, comprising a plurality of register means,
wherein each register means can be used for holding data either as a whole or as two divided areas, and is also used for holding an address having a number of bits larger than the bit number of one divided area, and
wherein the data processing apparatus, using the same instruction codes as another data processing apparatus that has registers corresponding to the bit number of one divided area, includes the instruction execution functions of that other data processing apparatus and, in addition, can execute instructions that use the entire register.
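
The register usage in this claim can be illustrated with a register that is accessible both as a whole and as two halves. This is a minimal sketch assuming a 32-bit register divided into two 16-bit areas; the class name `SplitRegister` and its attribute names are hypothetical and are not taken from the specification.

```python
class SplitRegister:
    """A register usable as a whole or as two divided areas (assumed 32 = 16 + 16 bits)."""

    HALF_BITS = 16
    HALF_MASK = (1 << HALF_BITS) - 1

    def __init__(self):
        self.value = 0  # whole register; may hold data or a wide address

    @property
    def low(self):
        return self.value & self.HALF_MASK

    @low.setter
    def low(self, v):
        # Writing one half leaves the other half unchanged.
        self.value = (self.value & ~self.HALF_MASK) | (v & self.HALF_MASK)

    @property
    def high(self):
        return (self.value >> self.HALF_BITS) & self.HALF_MASK

    @high.setter
    def high(self, v):
        self.value = (self.value & self.HALF_MASK) | ((v & self.HALF_MASK) << self.HALF_BITS)
```

Code written for a device whose registers are only as wide as one half would touch `low` (or `high`) alone, while instructions that use the entire register, or addresses wider than one half, use `value`.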