JP2004362454A - Microprocessor

Info

Publication number: JP2004362454A
Application number: JP2003162744A
Authority: JP
Inventors: Tamotsu Hasegawa; 保長谷川
Original assignee: Hitachi ULSI Systems Co Ltd
Current assignee: Hitachi Solutions Technology Ltd
Priority date: 2003-06-06
Filing date: 2003-06-06
Publication date: 2004-12-24

Abstract

PROBLEM TO BE SOLVED: To provide a microprocessor that combines high processing capability with low power consumption.

SOLUTION: At least as many program counters (PCs) as there are pipeline stages are provided. Instructions are fetched in turn from address 0 of PC1 through address 0 of PC6; the instruction at address 1 of PC1 is then fetched, and so on. With this configuration and processing, pipeline stalls do not occur in principle.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to a microprocessor, and more particularly, to a technique that is effective when applied to a microprocessor that requires low power consumption.
[0002]
[Prior art]
According to the inventor's study, the following techniques are conceivable for microprocessors.
[0003]
At present, many microprocessors with a pipeline function exist. These microprocessors usually have a single program counter and execute pipeline processing. Some also include one additional program counter, together with its associated arithmetic circuits, to fill the gap when a pipeline stall occurs.
[0004]
[Problems to be solved by the invention]
The inventor's study of the microprocessor technology described above revealed the following.
[0005]
At present, microprocessors are used in various devices such as personal computers and mobile phones. For example, the microprocessor of a mobile phone must handle not only the basic telephone function but also additional functions such as a browser and the recording and playback of moving and still images.
[0006]
Therefore, microprocessors are required to have high processing power not only in personal computers and the like but also in mobile phones and the like. In particular, portable devices must also have low power consumption.
[0007]
However, in a conventional microprocessor, if the pipeline is deepened in order to operate at a high clock frequency, performance is reduced by the penalty incurred at branches. In addition, if there is a dependency between instructions, such as the next instruction using the result of the previous instruction, execution of the next instruction must wait, which also lowers performance.
[0008]
Such disturbances in pipeline processing are referred to as pipeline stalls or pipeline hazards. At present, pipeline stalls are reduced by measures such as branch prediction and speculative execution using a branch prediction circuit, and removal of instruction dependencies using a circuit that reorders instructions. In recent years, however, these circuits have tended to grow considerably in scale, and the resulting increases in circuit area and power consumption can no longer be ignored.
[0009]
Therefore, an object of the present invention is to provide a microprocessor with low power consumption.
[0010]
The above and other objects and novel features of the present invention will become apparent from the description of the present specification and the accompanying drawings.
[0011]
[Means for Solving the Problems]
The following briefly outlines a representative one of the inventions disclosed in the present application.
[0012]
A microprocessor according to the present invention has at least as many program counters as the number of pipeline stages, and executes the instruction indicated by each program counter by pipeline processing while switching the program counters in order every clock. With this configuration, pipeline stalls can be prevented in principle.
[0013]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same members are denoted by the same reference numerals, and a repeated description thereof will be omitted.
[0014]
To make the features of the microprocessor of the embodiment easy to understand, the embodiment is described in comparison with a conventional microprocessor on which the present invention is premised.
[0015]
FIG. 1 is a schematic diagram showing an example of a pipeline processing method in a conventional microprocessor which is a premise of the present invention.
[0016]
A conventional microprocessor has one program counter (PC) and forms a pipeline within one process. The example of FIG. 1 has a six-stage pipeline, and the five instructions at addresses 0 to 4 indicated by the PC are processed in the pipeline. Usually, these five instructions belong to one process; that is, they are instructions included in a single task.
[0017]
The stages of the pipeline consist of, for example, an IF (instruction fetch) stage, an ID (instruction decode) stage, an EX (instruction execution) stage, an MA (memory access) stage, and two mm (arithmetic processing) stages.
[0018]
FIG. 1 shows an ideal pipeline process. However, in actuality, pipeline stall occurs depending on the type of instruction, dependency, and the like. Although a specific example thereof will be described later with reference to FIG. 4 and the like, essentially, pipeline stall occurs because the processing of the next instruction is started before the processing of the previous instruction is completed.
[0019]
FIG. 2 is a schematic diagram illustrating an example of a pipeline processing method in the microprocessor according to the embodiment of the present invention.
[0020]
A microprocessor according to an embodiment of the present invention has at least as many program counters as the number of pipeline stages, and executes the instruction indicated by each program counter by pipeline processing while switching the program counters in order every clock.
[0021]
That is, in FIG. 2, the number of pipeline stages and the number of PCs are the same: as in FIG. 1, there is a six-stage pipeline, together with six PCs (PC1 to PC6), and a different process is assigned to each PC. In such a configuration, the processing procedure for each clock is, for example, as follows. In the following description, the value of the time t is assumed to increase by 1 every clock.
[0022]
At t = 1, the instruction at address 0 indicated by PC1 is fetched.
[0023]
At t = 2, the instruction at address 0 indicated by PC1 is decoded, and at the same time, the instruction at address 0 indicated by PC2 is fetched.
[0024]
At t = 3, execution of the instruction at address 0 indicated by PC1 and decoding of the instruction at address 0 indicated by PC2 are performed, and at the same time, the instruction at address 0 indicated by PC3 is fetched.
[0025]
Similarly, at t = 6, arithmetic processing of PC1 and PC2, memory access of PC3, instruction execution of PC4, instruction decoding of PC5, and instruction fetch of address 0 indicated by PC6 are performed.
[0026]
Then, at t = 7, the next instruction indicated by PC1, that is, the instruction at address 1 is fetched together with the processing for address 0 indicated by PC2 to PC6. Thereafter, at t = 8 to 12, the instruction fetch of the address 1 indicated by PC2 to PC6 is sequentially performed.
[0027]
The addresses 0 and 1 are not physical addresses but indicate logical addresses for each PC.
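The clock-by-clock procedure above can be sketched as a small scheduling function. This is a minimal illustration, assuming the six stages listed for FIG. 1 and strict round-robin fetching; the names `STAGES`, `NUM_PCS`, and `stage_occupancy` are hypothetical and do not come from the specification.

```python
# Sketch of the FIG. 2 schedule: which PC (and which logical address)
# occupies each pipeline stage at clock t. Names are illustrative only.
STAGES = ["IF", "ID", "EX", "MA", "mm", "mm"]  # six stages, as listed for FIG. 1
NUM_PCS = 6                                     # PC1 to PC6


def stage_occupancy(t, num_pcs=NUM_PCS, stages=STAGES):
    """Return {stage_index: (pc_number, address)} for clock t (t starts at 1)."""
    occupancy = {}
    for s in range(len(stages)):
        fetch_time = t - s  # clock at which the instruction now in stage s was fetched
        if fetch_time < 1:
            continue        # the pipeline is still filling
        pc_number = (fetch_time - 1) % num_pcs + 1  # round-robin over PC1..PCn
        address = (fetch_time - 1) // num_pcs       # logical address within that PC
        occupancy[s] = (pc_number, address)
    return occupancy


if __name__ == "__main__":
    for t in (1, 2, 3, 6, 7):
        row = ", ".join(
            f"{STAGES[s]}:PC{pc}@{addr}" for s, (pc, addr) in stage_occupancy(t).items()
        )
        print(f"t={t}: {row}")
```

At t = 7 the function places PC1 at address 1 in the IF stage while PC2 to PC6 are still working on address 0, which matches the description.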
[0028]
Here, focusing on PC1, as shown in the lower part of FIG. 2, the instruction at address 1 is processed after processing of the instruction at address 0 has completed. The same applies to the other PCs: within the same PC, the next instruction is always executed only after processing of the previous instruction has completed.
[0029]
That is, within the same PC, processing of the next instruction is never started before processing of the previous instruction has finished, as happens in the conventional case, so pipeline stalls do not occur in principle. If each PC corresponds to one process and a different process is assigned to each PC, a plurality of processes can be handled (multitask processing) without causing pipeline stalls.
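The condition behind this argument (at least as many PCs as pipeline stages) can be checked with a short simulation. The sketch below assumes strict round-robin fetching, one fetch per clock; `max_in_flight_per_pc` is a hypothetical helper name, not something defined in the specification.

```python
def max_in_flight_per_pc(num_pcs, num_stages, clocks):
    """Largest number of instructions from any single PC that are in the
    pipeline at the same clock, under round-robin fetching (one per clock)."""
    worst = 0
    for t in range(1, clocks + 1):
        counts = {}
        for s in range(num_stages):
            fetch_time = t - s  # when the instruction now in stage s was fetched
            if fetch_time >= 1:
                pc = (fetch_time - 1) % num_pcs
                counts[pc] = counts.get(pc, 0) + 1
        worst = max(worst, max(counts.values()))
    return worst


if __name__ == "__main__":
    # Six PCs, six stages: never more than one instruction per PC in flight.
    print(max_in_flight_per_pc(6, 6, 24))
    # Four PCs, six stages: two instructions of the same PC overlap,
    # so dependencies between them could again cause stalls.
    print(max_in_flight_per_pc(4, 6, 24))
```

A result of 1 means that a PC's next instruction enters the pipeline only after its previous one has left, which is the property the paragraph relies on.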
[0030]
As described above, when the pipeline stall does not occur in principle, the following effects (1) to (4) are further obtained.
[0031]
(1) Functions such as branch prediction become unnecessary, and the number of transistors and power consumption can be reduced.
[0032]
(2) The pipeline can be made very deep, so a higher clock frequency can be achieved.
[0033]
(3) In a system in which processing blocks operate in parallel, such as MPEG (Moving Picture Experts Group) processing, the utilization of the arithmetic units in particular can be increased.
[0034]
That is, parallel processing is possible by assigning different processing blocks to each PC. Further, in image processing such as MPEG, generally, a branch instruction is frequently used in processing such as variable-length encoding / decoding, but a branch prediction error does not occur unlike the related art.
[0035]
(4) There is a case where a compiler or the like performs optimization processing of instructions related to pipeline processing, but such processing can be simplified.
[0036]
FIG. 3 is a schematic diagram showing an example of the configuration of the microprocessor according to the embodiment of the present invention.
[0037]
The microprocessor of FIG. 3 has, for example, six program counters (PC1 to 6) and six pipeline stages.
[0038]
The six pipeline stages consist of, for example, an instruction fetch (IF) stage, an instruction decode (ID) stage, an instruction execution (EX) stage, a memory access (MA) stage, an arithmetic processing (mm) stage, and a register write-back (WB) stage.
[0039]
Processes with no dependencies on one another are assigned to the respective program counters, and an instruction code memory is assigned to each process. Pipeline processing is then performed while the program counters are switched in order every clock.
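The arrangement of one program counter and one instruction code memory per process can be sketched as follows. This is an illustrative model only: the class name `RoundRobinFetcher`, the list-based memories, and the purely sequential PC update (no branches) are assumptions made for the sketch, not details from the specification.

```python
class RoundRobinFetcher:
    """Illustrative model: one PC and one instruction memory per process,
    with the active PC switched in order on every fetch (one fetch per clock)."""

    def __init__(self, instruction_memories):
        self.memories = instruction_memories         # one list of instructions per process
        self.pcs = [0] * len(instruction_memories)   # one program counter per process
        self.current = 0                             # index of the PC used on the next clock

    def fetch(self):
        """Fetch for the current PC, then switch to the next PC.
        Returns (pc_number, address, instruction); instruction is None
        once that process has run out of instructions."""
        index = self.current
        address = self.pcs[index]
        memory = self.memories[index]
        instruction = memory[address] if address < len(memory) else None
        if instruction is not None:
            self.pcs[index] += 1                     # sequential flow only (assumption)
        self.current = (index + 1) % len(self.pcs)
        return index + 1, address, instruction


if __name__ == "__main__":
    fetcher = RoundRobinFetcher([["a0", "a1"], ["b0", "b1"], ["c0", "c1"]])
    for _ in range(6):
        print(fetcher.fetch())
```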
[0040]
Here, the operation of the microprocessor shown in FIG. 3 will be described using a simple program example, in comparison with conventional operation.
[0041]
First, FIG. 4 is a diagram showing an example of a conventional pipeline processing method for a simple program example in a microprocessor as a premise of the present invention. In FIG. 4, a program example constituted by instructions (1) to (8) is processed by a conventional pipeline. The contents of the instructions (1) to (8) will be briefly described as follows. Note that R0 to R3 are general-purpose registers.
[0042]
The data at the memory address indicated by R1 is transferred to R0 (instruction (1)). 1 is added to the data in R0, and the result is stored in R0 (instruction (2)). The data in R0 is written to the memory address indicated by R1 (instruction (3)). The data in R0 and the data in R2 are compared in magnitude (instruction (4)). If the data in R0 > the data in R2, the flow branches to instruction (8); otherwise, it proceeds to instruction (6) (instruction (5)). The data in R0 is multiplied by the data in R2, and the result is stored in the multiply-accumulate register MACL (instruction (6)). The value of MACL is stored in R0 (instruction (7)). The data in R0 is added to the data in R3, and the result is stored in R3 (instruction (8)).
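For reference, the effect of instructions (1) to (8) can be written out as ordinary code. This is only a behavioural sketch: the function name `run_example`, the dictionary used as memory, and the unbounded Python integers (register width is ignored) are assumptions for illustration.

```python
def run_example(r1, r2, r3, memory):
    """Behavioural sketch of instructions (1) to (8).
    `memory` maps addresses to values; register width is ignored."""
    memory = dict(memory)      # work on a copy
    r0 = memory[r1]            # (1) load the data at the address in R1 into R0
    r0 = r0 + 1                # (2) add 1 to R0
    memory[r1] = r0            # (3) write R0 back to the address in R1
    greater = r0 > r2          # (4) compare R0 with R2
    if not greater:            # (5) if R0 > R2, branch to (8); otherwise continue at (6)
        macl = r0 * r2         # (6) multiply R0 by R2 into MACL
        r0 = macl              # (7) copy MACL into R0
    r3 = r3 + r0               # (8) add R0 to R3
    return {"R0": r0, "R3": r3, "memory": memory, "branch_taken": greater}


if __name__ == "__main__":
    # Branch taken: 5 + 1 = 6 is greater than R2 = 3, so (6) and (7) are skipped.
    print(run_example(r1=100, r2=3, r3=10, memory={100: 5}))
    # Branch not taken: 6 is not greater than R2 = 10, so (6) and (7) execute.
    print(run_example(r1=100, r2=10, r3=10, memory={100: 5}))
```

Instruction (2) depends on (1), instruction (3) depends on (2), and whether (6) and (7) run at all depends on (5); these are the dependencies that matter for the pipeline discussion.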
[0043]
For such an instruction sequence, the following pipeline stalls occur in conventional pipeline processing.
[0044]
(1) In the instruction (2), a delay of one clock (stage) occurs because the result of the instruction (1) is used.
[0045]
(2) In the instruction (3), a delay of one clock (stage) occurs because the result of the instruction (2) is used.
[0046]
(3) Instructions (6) and (7) are fetched into the pipeline while the outcome of the branch in instruction (5) is still being determined; if the branch is taken, these instructions are discarded.
[0047]
In other words, in the pipeline processing, a delay occurs at instructions (2) and (3) because each uses the result of the preceding instruction, and wasted processing is performed for instructions (6) and (7) because a branch is taken depending on the result of a preceding instruction. As a result, six instructions are executed in 11 stages.
[0048]
On the other hand, FIGS. 5 and 6 are diagrams showing an example of a pipeline processing method for the simple program example in the microprocessor according to the embodiment of the present invention. The same instructions as in FIG. 4 are processed; the processing of instructions (1) to (4) is shown in FIG. 5, and the processing of instructions (5) to (8) is shown in FIG. 6.
[0049]
In FIG. 5, instructions of PC2 and PC3 can be executed in the slots where instructions (2) and (3) were delayed in FIG. 4. In addition, an instruction of PC4 can be executed in the slot where the fetch of instruction (4) was delayed because of the delay of instruction (3). In other words, stages that would otherwise be spent waiting can be used as stages of other processes.
[0050]
In FIG. 6, there is no need to prefetch the useless instructions (6) and (7), which in FIG. 4 were fetched but then discarded because of the branch.
[0051]
As described above, the microprocessor of FIG. 3 is configured to process instructions of other PCs until processing of the instruction of a given PC has completed. Therefore, when the next instruction of that PC is processed, no waiting time is needed, and there is no risk that the instruction will be discarded and become wasted work.
[0052]
As a result, for example, in FIG. 5, 13 instructions are executed in 16 stages, and the execution efficiency can be raised to substantially one instruction per clock. That is, it is possible to realize a microprocessor that has high processing capability and, as described above, in which no power is consumed by a branch prediction circuit or the like.
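The two figures quoted in the text (6 instructions in 11 stages for FIG. 4, 13 instructions in 16 stages for FIG. 5) can be compared directly. The sketch below also includes the usual textbook fill model for a stall-free pipeline to show why throughput approaches one instruction per clock; that model and the name `stall_free_throughput` are additions for illustration and are not derived from the figures.

```python
from fractions import Fraction

# Figures quoted in the text: instructions completed / stages (clocks) used.
conventional = Fraction(6, 11)   # FIG. 4, conventional pipeline
interleaved = Fraction(13, 16)   # FIG. 5, PC-interleaved pipeline


def stall_free_throughput(instructions, stages):
    """Instructions per clock when `instructions` flow through a stall-free
    pipeline: the first needs `stages` clocks, each later one adds one clock.
    (Textbook fill model, assumed here for illustration.)"""
    return Fraction(instructions, instructions + stages - 1)


if __name__ == "__main__":
    print(f"conventional: {float(conventional):.3f} instructions/clock")
    print(f"interleaved:  {float(interleaved):.3f} instructions/clock")
    # With no stalls, throughput tends toward 1 as more instructions are issued.
    print(f"stall-free, 1000 instructions, 6 stages: "
          f"{float(stall_free_throughput(1000, 6)):.3f}")
```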
[0053]
Also, by using these ideas, it is possible to construct an architecture using an arithmetic circuit in units of 1 bit, instead of the conventional arithmetic circuit in units of 32 bits. FIGS. 7 and 8 show an example of a configuration for realizing this architecture. Here, a 32-bit architecture is assumed.
[0054]
FIG. 7 is a circuit diagram showing an example of a configuration of a 1-bit adding circuit in the microprocessor according to the embodiment of the present invention.
[0055]
The adder circuit of FIG. 7 includes, for example, 32-bit shift registers 1 and 2, to which the 32-bit inputs A and B are respectively input; a 1-bit adder 3, which adds inputs A and B one bit at a time; and a 32-bit shift register 4, which stores the output C of the 1-bit adder 3. Although not shown in FIG. 7, 36 such circuits are provided, and a different PC is assigned to each circuit. That is, 36 PCs (PC1 to PC36) are provided.
[0056]
FIG. 8 is a circuit diagram showing an example of a configuration of a 1-bit unit comparison circuit in the microprocessor according to the embodiment of the present invention.
[0057]
The comparison circuit of FIG. 8 includes, for example, 32-bit shift registers 1 and 2, to which the 32-bit inputs A and B are respectively input, and a 1-bit comparator (subtractor) 5, which compares inputs A and B one bit at a time. Although not shown in FIG. 8, 36 such circuits are provided, and a different PC is assigned to each circuit. That is, 36 PCs (PC1 to PC36) are provided.
[0058]
In such circuits, pipeline processing is performed while the PCs are switched in order, as in the processing of FIG. 2. Then, for example, after the instruction at address 0 of PC1 is fetched, the instruction at address 1 is fetched 36 clocks later. During this interval, the adder circuit of FIG. 7 or the comparison circuit of FIG. 8 performs the 32-bit addition or comparison at a rate of 1 bit per clock.
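A bit-serial addition of this kind can be sketched in software as follows. The sketch assumes that the operands are shifted out least significant bit first and that a one-bit carry is held between clocks; neither detail is stated in the text, and the name `bit_serial_add` is hypothetical.

```python
WIDTH = 32  # data width assumed in the text


def bit_serial_add(a, b, width=WIDTH):
    """Add two unsigned `width`-bit values one bit per clock.
    Returns (sum modulo 2**width, carry_out)."""
    mask = (1 << width) - 1
    reg_a = a & mask   # models shift register 1 (input A)
    reg_b = b & mask   # models shift register 2 (input B)
    reg_c = 0          # models shift register 4 (output C)
    carry = 0          # carry held between clocks (assumed flip-flop)
    for _clock in range(width):
        total = (reg_a & 1) + (reg_b & 1) + carry   # 1-bit adder 3
        sum_bit = total & 1
        carry = total >> 1
        reg_a >>= 1                                  # shift the next bits into position
        reg_b >>= 1
        # Shift the new sum bit in at the top; after `width` clocks the first
        # bit produced has moved down to bit 0.
        reg_c = (reg_c >> 1) | (sum_bit << (width - 1))
    return reg_c, carry


if __name__ == "__main__":
    print(bit_serial_add(5, 7))
    print(bit_serial_add(0xFFFFFFFF, 1))
```

Only a single full adder is exercised per clock, which is the property the text relies on when it argues for a shorter gate delay per clock.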
[0059]
The pipeline has a total of 35 stages (IF, ID, IE, and mm × 32). Since the number of PCs (36) is larger than the number of pipeline stages, pipeline stalls do not occur in principle.
[0060]
As described above, a beneficial effect can be obtained by making the number of PCs equal to or greater than the number of pipeline stages and equal to the number of data bits, and by changing the arithmetic unit from the conventional 32-bit unit to a 1-bit unit. That is, an operation that was conventionally performed on 32 bits simultaneously is spread over a plurality of clocks, 1 bit at a time, so the gate delay within the arithmetic unit per clock can be reduced, which makes a higher pipeline clock frequency possible. Furthermore, as long as the gate delay within the arithmetic unit per clock remains within the allowable range, the element drive voltage can be lowered, so power consumption can also be reduced.
[0061]
In general programs, branch instructions based on a comparison result are often used, and in many such cases the result is determined by comparing only the upper bits. In such cases, if a comparison of 1 bit per clock is used, the operation can be completed (the arithmetic unit stopped) as soon as the result is known, so current consumption can be reduced and the operation made more efficient.
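The early-termination idea can be sketched as a comparison that starts from the most significant bit and stops at the first differing bit. The sketch treats the operands as unsigned values and scans from the top bit; both choices, and the name `bit_serial_compare`, are assumptions for illustration rather than details given in the text.

```python
def bit_serial_compare(a, b, width=32):
    """Compare two unsigned `width`-bit values one bit per clock, starting
    from the most significant bit. Returns (result, clocks_used), where
    result is 1 if a > b, -1 if a < b, and 0 if they are equal."""
    for clock in range(1, width + 1):
        shift = width - clock               # bit examined on this clock
        bit_a = (a >> shift) & 1
        bit_b = (b >> shift) & 1
        if bit_a != bit_b:
            # The first differing bit decides the result; stop here.
            return (1 if bit_a > bit_b else -1), clock
    return 0, width                         # all bits equal


if __name__ == "__main__":
    print(bit_serial_compare(0x80000000, 0))   # decided on the first clock
    print(bit_serial_compare(5, 7))            # decided only near the low bits
    print(bit_serial_compare(9, 9))            # equal: every bit is examined
```

The second element of the result shows how many clocks the comparator would have to stay active, which is where the saving in current consumption would come from.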
[0062]
FIGS. 7 and 8 show an adder and a comparator as one mode of an arithmetic circuit such as an ALU (arithmetic and logical unit), and are not limited to the adder and the comparator.
[0063]
The invention made by the inventor has been specifically described above based on the embodiment. Needless to say, however, the present invention is not limited to the embodiment and can be modified in various ways without departing from its gist.
[0064]
For example, in FIGS. 7 and 8, the number of PCs is equal to the number of data bits, but the number of PCs can be equal to or greater than the number of data bits as long as the number is equal to or greater than the number of stages in the pipeline.
[0065]
The effects obtained by typical aspects of the invention disclosed in the present application will be briefly described as follows.
[0066]
(1) By providing the same number of PCs or more as the number of pipeline stages, pipeline stall does not occur in principle, and the execution efficiency of the pipeline can be substantially increased to one instruction / clock.
[0067]
(2) According to (1), functions such as branch prediction become unnecessary, and the number of transistors and power consumption can be reduced.
[0068]
(3) According to (1), the pipeline can be made very deep, so a higher clock frequency can be achieved.
[0069]
(4) By making the number of PCs equal to the number of data bits and changing the arithmetic unit to a 1-bit unit, current consumption can be reduced, the clock frequency can be raised, and operations can be made more efficient.
[0070]
(5) By the above (1) to (4), a microprocessor with low power consumption and high processing capability can be realized.
[0071]
[Effect of the invention]
By applying the present invention, a low-power microprocessor can be realized.
[Brief description of the drawings]
FIG. 1 is a schematic diagram showing an example of a pipeline processing method in a conventional microprocessor which is a premise of the present invention.
FIG. 2 is a schematic diagram illustrating an example of a pipeline processing method in the microprocessor according to the embodiment of the present invention.
FIG. 3 is a schematic diagram illustrating an example of a configuration of a microprocessor according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of a conventional pipeline processing method for a simple program example in a microprocessor as a premise of the present invention.
FIG. 5 is a diagram illustrating an example of a pipeline processing method for a simple program example in the microprocessor according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating an example of a pipeline processing method for the simple program example, continuing from FIG. 5, in the microprocessor according to the embodiment of the present invention.
FIG. 7 is a circuit diagram illustrating an example of a configuration of a 1-bit adder circuit in the microprocessor according to the embodiment of the present invention.
FIG. 8 is a circuit diagram illustrating an example of a configuration of a 1-bit comparison circuit in the microprocessor according to the embodiment of the present invention.
[Explanation of symbols]
1, 2, 4: 32-bit shift register
3: 1-bit adder
5: 1-bit comparator

Claims

A microprocessor comprising at least as many program counters as the number of pipeline stages,
wherein the instruction indicated by each program counter is executed by pipeline processing while the program counters are switched in order every clock.