JP2004021890A - Data processor

Info

Publication number: JP2004021890A
Application number: JP2002179525A
Authority: JP
Inventors: Yoshihide Sugiura; 杉浦　義英; Takeshi Sato; 佐藤　武; Shintaro Shimogoori; 下郡　慎太郎; Kanenaga Seto; 瀬戸　謙修; Toshiaki Kitajima; 北島　利明
Original assignee: Pacific Design Inc
Current assignee: Pacific Design Inc
Priority date: 2002-06-20
Filing date: 2002-06-20
Publication date: 2004-01-22

Abstract

PROBLEM TO BE SOLVED: To provide a processor architecture that achieves parallelism equal to that of a VLIW processor while being far smaller and consuming far less power than a VLIW processor.

SOLUTION: A data processor 1 combines a basic processor 10, which has fetch and decode functions, with a parallel data processing unit 20 controlled by horizontal microcode 21. Only the sections suited to parallel processing are executed by the horizontal microcode 21, so the parallel data processing unit 20 performs VLIW-style parallel or pipeline-parallel data processing, while the basic processor 10 sequentially processes the sections not suited to parallel processing. Adopting the horizontal microcode 21 removes the complex instruction fetch and decode circuitry otherwise needed for parallel data processing, so the data processor performs parallel processing efficiently while keeping power consumption low.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、１命令で複数の処理を指定することができるＶＬＩＷ型のマイクロコード（水平型マイクロコード）により制御される、並列データ処理が可能な処理ユニットを有するデータ処理装置およびその制御方法に関するものである。
【０００２】
【従来の技術】
並列性を高めるプロセッサ・アーキテクチャとしてＶＬＩＷ（Ｖｅｒｙ　Ｌｏｎｇ　Ｉｎｓｔｒｕｃｔｉｏｎ　Ｗｏｒｄ）が知られている。この方式は、理想的には、並列処理を実行可能なリソースを備えた演算実行ユニットを構成するレジスタファイルと演算器群の間を、命令により自由に完全接続可能とし、並列処理もしくはパイプライン並列処理を高い自由度で実行できるアーキテクチャである。したがって、名称の示す通り非常に長い命令コードとなる特徴を有する。
【０００３】
【発明が解決しようとする課題】
しかしながら、非常に長い命令コードをフェッチし、デコードする必要があるので、プロセッサの命令コードのフェッチ部およびデコード部が大きくなり、また、設計が複雑になるという問題がある。そして、複雑になることによりフェッチおよびデコード部の消費電力も非常に大きくなる。
【０００４】
また、プログラムを構成する命令がすべて並列実行可能なものではないという問題がある。たとえば、４つの処理を同時実行可能なリソースを備えたＶＬＩＷ型のプロセッサであっても、常に４つの処理を同時実行するようなプログラムは生成できず、ＶＬＩＷの１つの命令あるいは命令セットに含まれる４つの命令あるいはサブ命令（本明細書においては、複数の処理を同時実行可能なＶＬＩＷあるいは水平型マイクロコードを構成し、１つの処理に対応する命令を明確に区別するときはサブ命令と称することにする。）としてＮＯＰの占める割合は高い。
【０００５】
したがって、ＶＬＩＷ型のプロセッサはハードウェアが大きくなり膨大な消費電力を必要とする割りに、プログラムの実行効率は向上しない。このため、並列度を上げることによるメリットと、ハードウェアを大きくして消費電力が増大するディメリットを勘案すると、２つの処理を同時実行可能な程度のプロセッサが実際的であり、それ以上の並列処理を実行可能なプロセッサを実用化することは経済的でない。したがって、３つあるいはそれ以上の処理を並列実行することによりループ処理などの実行速度を大幅に向上できるアーキテクチャはあっても、それを活かしたプロセッサを実用レベルで開発および製造することはほとんどない。
【０００６】
そこで、本発明においては、ＶＬＩＷ型のプロセッサに対して、同等の並列度を確保でき、その一方で、ＶＬＩＷ型のプロセッサよりも、遥かに小型で消費電力の小さなプロセッサのアーキテクチャを提供することを目的としている。そして、ループ処理などの並列実行に適した処理を、経済的なハードウェアにより並列実行可能とすることにより、実行速度を飛躍的に向上できる経済的なプロセッサを提供することを目的としている。
【０００７】
さらに、パイプライン並列処理において並列化のネックとなる分岐処理も並列実行可能なアーキテクチャおよび制御方法を提供し、並列データ処理装置の利用効率をさらに向上することも本発明の目的としている。
【０００８】
【課題を解決するための手段】
本発明においては、フェッチおよびデコード機能を備えた基本プロセッサと、水平型マイクロコードにより制御される並列データ処理ユニットとの組み合わせにより、並列処理に適した部分だけを水平型マイクロコードで実行可能とする。この構成であれば、ＬＳＩなどのデータ処理装置において実行するために与えられた仕様のうち、並列処理が可能な部分は水平型マイクロコードにより、ＶＬＩＷが持つ並列もしくはパイプライン並列のデータ処理が可能となり、並列処理に適していない部分は基本プロセッサによりシーケンシャルなデータ処理が可能となる。さらに、水平型マイクロコードを採用することにより、並列データ処理用の命令フェッチの機能とデコードの機能の回路の複雑性を排除し、消費電力を押えつつ効率よく並列処理を行うデータ処理装置を提供できる。
【０００９】
すなわち、本発明のデータ処理装置は、実行プログラムに含まれる命令をフェッチおよびデコードして実行する第１の処理ユニット（基本プロセッサ）と、複数の処理を同時実行可能な水平型マイクロコードにより制御される第２の処理ユニット（並列データ処理ユニット）とを有し、第１の処理ユニットは、実行プログラムに含まれる起動命令により第２の処理ユニットを起動する処理を実行可能であることを特徴とする。第１の処理ユニットの実行ユニットが、フェッチおよびデコードされた起動命令により第２の処理ユニットを制御しても良い。また、フェッチおよびデコードユニットが、第２の処理ユニットに対する命令を識別して、起動命令を第２の処理ユニットに提供しても良い。あるいは、第２の処理ユニットが、第１の処理ユニットがフェッチした命令を監視し、自己の、すなわち、第２の処理ユニットに対する起動命令により起動しても良い。
【００１０】
いずれの場合も、第１の処理ユニットを制御する実行プログラムにより、第２の処理ユニットの処理を第１の処理ユニットと共に協調制御できるので、実行プログラムは実質的に水平型マイクロコードで実行されるループ処理などの並列処理も含めてクロック単位で制御できる。特に、第１の処理ユニットのフェッチおよびデコードユニットから第２の処理ユニットに対して起動命令が供給される構成は、第１の実行ユニットにおける遅れを回避できるので、ネットワーク処理や画像処理などのリアルタイム性が要求される処理に適した構成である。
【００１１】
したがって、本発明のデータ処理装置で実行する処理がＣ言語などで与えられた場合は、オリジナルのプログラムから並列実行に適した部分を抽出して、複数の処理を同時実行可能な水平型マイクロコードに変換する工程と、オリジナルのプログラムの並列実行に適した部分を水平型マイクロコードの起動命令に置換し、フェッチおよびデコード機能を備えた基本プロセッサで実行可能な実行プログラムに変換する工程とを有するプログラム開発方法を採用することが有効である。すなわち、並列実行に適した部分を水平型マイクロコードに変換する機能と、並列実行に適した部分を起動命令に置換して実行プログラムに変換する機能の２つの機能を備えたコンパイラが必要となる。このコンパイラの機能は、適当なリソースを備えたコンピュータで実行可能なソフトウェア、すなわち、プログラムあるいはプログラム製品として適当な記録媒体に記録して提供される。もちろん、コンピュータネットワークを介して提供することも可能である。
【００１２】
並列実行装置あるいは並列データ処理ユニットである第２の処理ユニットは、複数の水平型マイクロコードを格納するコードメモリと、複数のレジスタおよび演算ユニットを備え、それらの接続および／または機能が水平型マイクロコードにより変更されるデータパス部と、コードメモリの実行アドレスを管理するアドレス制御部とを備えている。第１の処理ユニットがデータＲＡＭを備えている場合は、第１または第２の処理ユニットにデータＲＡＭと第２の処理ユニットのデータパス部とのインターフェイス用の複数のインターフェイスレジスタを設け、第１の処理ユニットと第２の処理ユニットの双方によりデータＲＡＭ内のデータに基づく、あるいはデータに対する処理を行えるようにすることが望ましい。
【００１３】
また、コードメモリに格納される水平型マイクロコードを書き換えることにより、並列実行ユニットの実行内容を変更することが可能となり、並列実行ユニットのリソースをさらに効率的に使用することができる。コードメモリは基本プロセッサである第１の処理ユニットにより書き換えることができる。すなわち、第１の処理ユニットは、実行プログラムに含まれる命令により第２の処理ユニットのコードメモリを書き換える処理を実行可能とすることができる。また、第２の処理ユニットのコードメモリを書き換える第３の処理ユニットを設けることが可能であり、基本プロセッサを第２の処理ユニットのコードメモリの管理から開放することによりデータ処理装置全体の処理速度を向上することができる。
【００１４】
本発明における基本プロセッサと並列実行ユニットとの組み合わせは、並列実行ユニットを小型にでき、フェッチおよびデコードユニットのハードウェア的なオーバーヘッドを削減できるので消費電力を大幅に削減できる。さらに、データ処理装置に搭載される並列実行ユニットは１つに限定されるものではないが、少なくとも基本プロセッサと並列実行ユニットの２つの処理ユニットがデータ処理装置に搭載されるので、これらの処理ユニットに供給されるクロックを制御することにより実質的な処理能力を落とさずに消費電力をさらに低減できる。特に、第１の処理ユニットは第２の処理ユニットを起動することができるので、第１の処理ユニットは第２の処理ユニットの起動のタイミングを把握できる。したがって、第１の処理ユニットが第２の処理ユニットへ供給されるクロック信号を制御することにより、データ処理装置全体の消費電力をさらに低減できる。
【００１５】
プログラムのうち、信号処理アルゴリズムを記述する部分の大半はループ処理であり、Ｃ言語記述のステップ数そのものはそれほど大きくはない。高速フーリエ変換、フィルター処理、自己相関演算処理、アダマール変換、ビタビ符号化などの信号処理は、特殊な演算を繰り返し実行する特徴を有し、Ｃ言語のソースコードにて数十から数百ステップの範囲にて収まるのが通例である。さらに、これらのステップには同時実行可能な演算が数多く含まれている。したがって、ループ処理は本発明の並列実行ユニットで実行するのに適した部分であり、並列度を上げて実行することにより大幅に処理速度も向上する。逆に、信号処理以外の部分、データインターフェイスや割り込み処理管理などの汎用的な部分は、ステップ数を必要とするが、このような部分については並列性またはパイプライン並列性を要しないのが一般的である。さらに、信号処理の部分は、ある程度安定しているのに対し、信号処理以外の部分は仕様変更などの影響を受けやすい。このため、汎用的な並列性をさほど必要としないＣ言語記述の部分については、フレキシブルな対応ができる基本プロセッサで実行することが望ましい。
【００１６】
このように、本発明のデータ処理装置においては、低消費電力で並列度の高い処理を実行できるので、通信やネットワークにて必要となる信号処理を行うのに適している。
【００１７】
したがって、並列データ処理装置である第２の処理ユニットは、ループ処理を効率よく実行できるアーキテクチャを備えていることが望ましい。このため、本発明の第２の処理ユニットには、水平型マイクロコードの実行回数をカウントする複数のループカウンタと、このループカウンタにより水平型マイクロコードの実行回数を制御するループ制御部とを設ける。水平型マイクロコードに、データパス部の接続を制御する情報を含むオペランドフィールドと、ループカウンタを選択する情報を含むループカウンタ選択フィールドとを設けることにより、水平型マイクロコード命令毎に選択されたループカウンタでループ処理の実行回数を極めて簡単に制御できる。
【００１８】
ループ制御部に、ループカウンタでカウントする回数をセットするループ制御レジスタを設け、第１の処理ユニットが実行プログラムに含まれる命令により、ループ制御レジスタの値をセットする処理を実行することにより、ループ回数もプログラムで制御できる。ループ制御レジスタに設定されるループ回数は、第２の処理ユニットのデータパスの処理結果であっても良い。
【００１９】
ループ処理中に条件分岐を伴う命令、すなわち「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」を伴う命令が含まれていると、条件の正否により実行される演算が変わるので並列性は一気に低下する。そこで本発明においては、並列処理ユニットである第２のデータ処理ユニットに、データパス部の各演算ユニットのコンディション・コードを格納可能なコンディションコードレジスタと、このコンディションコードレジスタと水平型マイクロコードに含まれる各演算ユニットの選択情報とを比較し、各演算ユニットからレジスタへの出力および／またはコンディションコードレジスタへのコンディション・コードの出力を制御する出力制御部を設ける。この構成であると、コンディションコードレジスタに格納された前のサイクルのコンディション・コード、または、コンディションコードレジスタに格納される同一サイクル内の各演算ユニットのコンディション・コードのいずれかと、水平型マイクロコードに含まれる各演算ユニットの選択情報とを比較して出力を制御できるので、条件の正否に関わらず演算は並列に実行できる。そして、実行した結果を出力するときに、出力制御部が出力の可否を判断することにより、その演算の前提となる条件の正否を反映できる。したがって、並列性を維持したまま条件分岐を伴う命令を実行できる。
【００２０】
さらに、コンディションコードレジスタに格納された前のサイクル（実行サイクル）のコンディション・コードだけでなく、コンディションコードレジスタに格納される同一サイクル内の各演算ユニットのコンディション・コードのいずれかを選択して水平型マイクロコードに含まれる選択情報と比較することにより、同一サイクル内、すなわち、並列演算された条件結果を反映して出力を制御できるのでさらに並列性の高い処理が可能な並列データ処理装置を提供できる。
【００２１】
水平型マイクロコードに含まれる各演算ユニットの選択情報は、たとえば、コンディション・コードとは無関係に選択されることを指示する第１の情報と、コンディション・コードを比較する各演算ユニットを指示する第２の情報と、比較するコンディション・コードの真偽を示す第３の情報と、比較するコンディション・コードのサイクルを示す第４の情報とである。第４の情報は、水平型マイクロコード命令の単位で設けても良く、水平型マイクロコードを構成するサブ命令毎に設けて良い。これら４つの情報により、分岐条件に支配される演算か否か、どの演算ユニットにより演算された分岐条件に支配されるか、その分岐条件の正否あるいは真偽のいずれで選択されるのか、さらに、前のサイクルの演算結果から同一サイクル内の演算結果が決まるので、条件分岐を伴う演算を含めたすべての演算を並列実行できる。
【００２２】
また、本発明の出力制御部は、個々の演算ユニットの出力の要否を、すべての演算ユニットのコンディション・コードのいずれかを参照して判断できる構成になっている。したがって、コンディションコードレジスタに格納された前のサイクルの演算結果のみならず、同じサイクル内の演算結果も参照して出力を選択することが可能であり、条件的にはシーケンシャルに実行される演算も広い範囲で並列実行することができる。
【００２３】
また、出力制御部は、コンディション・コードを、少なくとも真偽および選択または非選択、すなわち分岐条件により選択されたことを示す情報、および選択されなかったことを示す情報の少なくともいずれかを含めて示すことができる複数ビットのデータに変換してコンディションコードレジスタに出力する変換部を備えている。したがって、水平型マイクロコードに含まれる演算ユニットの選択情報をコンディションコードレジスタの内容に対応したビットマップとすることにより、選択情報をデコードせずにコンディションコードレジスタと比較できる。たとえば、１ビット分のコンディション・コードに着目した場合、この１ビットの真偽の情報と、分岐条件による選択・非選択を示す情報とは、２ビットのデータであらわすことができ、それらのデータを直に比較できる体系のビットマップ、すなわち、ビットマップにデコードされた情報を、水平型マイクロコードをコンパイルするときにセットする。これにより、出力制御部では、コンディション・コードを比較する処理においてオーバヘッドとなるデコード回路などのハードウェアは不用となり、簡易なハードウェアで高速に処理できる。
【００２４】
このように、本発明のデータ処理装置およびその制御方法では、分岐命令によるジャンプ実行を前提としないので、ＶＬＩＷの特徴であるパイプライン並列処理の流れを乱さずに分岐命令を含むループ処理を並列実行することができる。この結果、ＶＬＩＷが持つ並列もしくはパイプライン並列のデータ処理をループ演算で活用でき、さらに水平型マイクロコードを採用することにより命令フェッチ部とデコード部の回路複雑性を排除し、消費電力を押えつつ効率よくループ処理を行うことが可能となる。したがって、信号処理の分野では頻発するループ処理をＶＬＩＷ型の並列演算によって効率よく処理することが可能となる。
【００２５】
【発明の実施の形態】
図１に本発明のデータ処理装置の一例を示してある。このデータ処理装置１は、基本プロセッサ１０と、並列データ処理ユニット２０とを備えている。基本プロセッサ１０は、実行プログラム１１を格納したコードＲＡＭ１２と、コードＲＡＭ１２から実行プログラム１１の命令をフェッチおよびデコードするフェッチ・デコード部１３と、デコードされた命令にしたがって演算処理を実行する実行ユニット１４と、演算結果や処理対象のデータが一時的に格納されるデータＲＡＭ１５を備えている。この基本プロセッサ１０は、たとえば、ＲＩＳＣ型の汎用プロセッサと同等の構成であり、フェッチ、デコード、演算、ライトバックなどの処理をパイプライン方式で行う機能を備えている。
【００２６】
並列データ処理ユニット２０は、複数の処理を同時実行可能な水平型マイクロコードにより制御される並列実行部であり、複数の水平型マイクロコード２１を格納するコードメモリ２２と、複数のレジスタおよび演算ユニットを備え、それらの接続および／または機能が水平型マイクロコード２１により変更されるデータパス部２３と、コードメモリ２２の実行アドレスやコードメモリ２２への入出力を管理する機能を備えた制御部２４を備えている。データパス部２３は、並列処理を実行可能なハードウェアとして複数の汎用レジスタおよび複数の演算器（ＡＬＵ）を備えており、これらの接続を切り替える複数のセレクタおよび配線群により、レジスタとＡＬＵを中心とするネットワークが形成されている。そして、データパス部２３のネットワークの接続と、ＡＬＵの演算機能は水平型マイクロコード２１により切り替えて設定される。
【００２７】
並列データ処理装置２０は、データインターフェイスレジスタ群２８を介してレジスタデータバス３８で基本プロセッサ１０のデータＲＡＭ１５とデータ交換できるようになっており、データパス部２３はデータＲＡＭ１５から供給されたデータを処理してデータＲＡＭ１５に出力することができる。また、本例の並列データ処理装置２０の制御部２４は、コードメモリ２２の実行アドレスを管理するアドレス制御部２５に加え、データパス部２３におけるループ処理を制御するループ制御部２６と、データパス部２３において並列演算された出力を制御する出力制御部２７とを備えている。これらの構成は以下で詳しく説明する。
【００２８】
データ処理装置１の基本プロセッサ１０と並列データ処理ユニット２０とは幾つかの信号線あるいはバスで接続されている。まず、基本プロセッサ１０のフェッチ・デコード部１３は、実行プログラム１１から並列データ処理ユニット２０を起動する命令、たとえば、「Ｖ＿ＬＯＯＰ＿ＩＮＳＴ１」をフェッチすると、命令バス３１を介して並列データ処理ユニット２０の制御部２４に対して起動信号あるいは起動命令φｓを出力する。この起動命令φｓを並列データ処理ユニット２０の制御部２４がデコードして、コードメモリ２２に格納された水平型マイクロコード２１で規定された処理を開始する。制御部２４は、処理が終了すると、割り込み信号線３２を介して割り込み信号φｉを基本プロセッサ１０に送り、処理の終了を基本プロセッサ１０に伝達する。それにより、基本プロセッサ１０は、並列データ処理ユニット２０の処理結果を基に後続の処理を行う。
【００２９】
基本プロセッサ１０のデータＲＡＭ１５には、並列データ処理ユニット２０のコードメモリ２２にダウンロードするための水平型マイクロコード２１あるいは複数の水平型マイクロコード２１からなるマイクロプログラム２９が用意されており、データバス３８を通してコードメモリ２２のマイクロコード２１を変更することができる。したがって、並列データ処理ユニット２０の処理内容は、コードメモリ２２を書き換えることにより変更することが可能であり、基本プロセッサ１０は実行プログラム１１にしたがって書き換える処理を行う。
【００３０】
このため、本例のデータ処理装置１では、並列データ処理ユニット２０により複数の異なる処理を実行することができる。そして、並列データ処理ユニット２０においては複数の演算が並列に同時実行されるために、ループ処理などのステップ数は少なくても処理時間が費やされる複数種類の信号処理が短時間で効率よく行われる。また、信号処理以外のデータインターフェイスや割り込み処理管理などの汎用的な処理は、ステップ数を必要とするが並列性がないか、あるいは並列処理をしても処理速度の向上は望めないので、消費電力の少ない基本プロセッサ１０で行うことが可能である。
【００３１】
このように基本プロセッサ１０との組み合わせで動作する並列データ処理ユニット２０の制御回路２４として必要な機能は、ファンクションの起動、終了、コードメモリ２２へのデータ出し入れ、水平型マイクロコード２１のアドレス管理、アドレステーブル参照によるループ処理と分岐処理、演算器の演算結果コンディション・コードの結果に基づく演算器への書き込み及び汎用レジスタへの書き込み制御となる。この結果、制御に関して極めて軽いステートマシンが必要とされるだけであり、並列処理を行う高速データ処理ユニットが小面積で具現できる。この極めて軽いステートマシンで制御できる点が、水平型マイクロコードで制御される並列データ処理ユニット２０において、複数処理が記述されたＶＬＩＷ命令をフェッチおよびデコードして処理を進める汎用のＶＬＩＷプロセッサと大きく異なる点である。その一方で、水平型マイクロコードによりデータパス部２３のレジスタ群と演算器群の接続が随時変更できる点が従来の固定的な配列演算であるベクトル・プロセッサ型のデータパスを備えた処理ユニットとも異なり、汎用性をある程度確保しながら並列処理により実行速度を高めることができる処理装置となっている。
【００３２】
まとめると、本例のデータ処理装置１は、並列度の高いＶＬＩＷ型である必要のない基本プロセッサ１０に、水平型マイクロコード２１による制御機構を持つ並列度の高いＶＬＩＷデータパス部（並列データ処理ユニット）２０を装着させ、水平型マイクロコード２１の中味は基本プロセッサ１０のデータＲＡＭ１５からロードさせることによりＶＬＩＷによる並列演算またはパイプライン並列演算を最も効果的に実行するアーキテクチャであると言える。したがって、本例のデータ処理装置１は、低消費電力でコンパクトでありながら高速処理が可能な集積回路装置となっている。
【００３３】
図２に示したデータ処理装置２は、本発明の異なる例である。このデータ処理装置２も、基本プロセッサ１０と並列データ処理ユニット２０を備えており、上述したデータ処理装置１と同様に低消費電力で高速処理が可能な構成である。さらに、並列データ処理ユニット２０のコードメモリ２２に対して水平型マイクロコード２１を書き換える機能を備えた管理ユニット３５を備えている。この管理ユニット３５は、マイクロプログラム２９が格納されたプログラムメモリ３６と、入出力などを制御する制御部３７を備えている。このため、管理ユニット３５により、基本プロセッサ１０に変わって並列データ処理ユニット２０の処理内容の変更をサポートできる。この構成は、制御ユニット３５が増えた分、ハードウェアは増加する。しかしながら、基本プロセッサ１０が並列データ処理ユニット２０の処理内容の変更を行う処理から開放され、それによるオーバーヘッドを除くことができる。このため、基本プロセッサ１０の処理効率が向上し、データ処理装置全体では、処理速度が向上するというメリットがある。
【００３４】
さらに、データ処理装置２は、並列データ処理ユニット２０に対するクロック信号φｃの供給を制御できるクロック制御ユニット３４を備えている。このクロック制御ユニット３４は、基本プロセッサ１０から制御線３３により供給されるオンオフ信号あるいはオンオフ命令φｏにより、基本プロセッサ１０から命令が並列データ処理ユニット２０に出されて並列データ処理ユニット２０が稼動するときだけ並列データ処理ユニット２０に対してクロック信号φｃを供給する。したがって、データ処理装置２を基本プロセッサ１０とクロック信号がオンオフできる並列データ処理ユニット２０との組み合わせを基本とするアーキテクチャとすることにより、クロック信号の制御によりさらに省電力を望みうる構成となる。
【００３５】
これらのデータ処理装置１および２（以降では、データ処理装置１を代表して説明する）では、基本プロセッサ１０と並列データ処理ユニット２０とが協働して１つのアプリケーションプログラムを実行する。このため、１つのアプリケーションプログラムを２つのモードを備えたコンパイラでコンパイルして基本プロセッサ用の実行プログラム１１と、並列データ処理ユニット２０の水平型マイクロコード２１とを生成する。
【００３６】
図３に、実行プログラム１１および水平型マイクロコード２１を生成する過程を示してある。ステップＡにおいてデータ処理装置１で実行するオリジナルのプログラムＣｏが与えられる。オリジナルのプログラムＣｏは、ループ処理Ｃ_ＬＯＯＰなどの並列実行に適した部分と、その他のデータインターフェイス処理や割り込み処理などの汎用的な処理Ｃ１およびＣ２の部分とを備えており、ステップＢでオリジナルのプログラムＣｏから並列実行に適した部分が抽出される。本例では、並列実行に適した部分としてループ処理の部分Ｃ_ＬＯＯＰが分離される。ループ処理Ｃ_ＬＯＯＰが抽出されたプログラムは、処理Ｃ_ＬＯＯＰを開始する起動命令と処理Ｃ_ＬＯＯＰが終了したことを検出する命令とに置き換えられた実行用のソースプログラムＣｅとなる。すなわち、処理Ｃ_ＬＯＯＰとソースプログラムＣｅとの間は、ファンクション・コールとリターンという形で命令インターフェイスが取られる。もしくは、並列データ処理ユニット２０のインターフェイスレジスタにある値を書き込み、これをもって並列データ処理ユニット２０の処理を開始させる形態も可能である。
【００３７】
次に、ステップＣでこれらのプログラムがコンパイルされる。コンパイラ４０は、基本プロセッサ用の実行プログラムを生成する第１のステップあるいは機能Ｃｃ１と、並列データ処理ユニット用に水平型マイクロコードを生成する第２のステップあるいは機能Ｃｃ２とを備えている。第１の機能Ｃｃ１は汎用のＣコンパイラであり、この第１のコンパイラＣｃ１は、汎用的な処理Ｃ１およびＣ２を含み、Ｃ言語で記述したプログラムＣｅを基本プロセッサ１０の実行プログラム（オブジェクトプログラム）１１に変換する。第２の機能Ｃｃ２も、ある種のＣコンパイラであり、この第２のコンパイラＣｃ２は、ステップＢで抽出された繰り返し処理が主となるＣ_ＬＯＯＰの部分を記述するＣ言語のプログラムを並列実行可能なように水平型マイクロコード２１あるいは水平型マイクロプログラム２９に変換する。
【００３８】
これら第１のコンパイラＣｃ１および第２のコンパイラＣｃ２の機能は、適当なリソースを備えたコンピュータで実行可能なプログラムあるいはプログラム製品として提供される。すなわち、本例のデータ処理装置用のコンパイラは２つのモードを持つことになり、出力された基本プロセッサのオブジェクトコード１１と、繰り返し信号処理などのアルゴリズムが記述された並列処理用の水平型マイクロコードとの間はファンクション・コールとリターンという形でインターフェイスが取られる。また、これらの機能Ｃｃ１およびＣｃ２は、異なるプログラム製品として提供されても良いし、１つのプログラム製品として提供されても良い。
【００３９】
なお、本例の並列データ処理ユニット２０においては後述するようにループカウンタでループを処理するので、第２のコンパイラＣｃ２によるループ処理Ｃ_ＬＯＯＰのコンパイル結果は、水平型マイクロコード２１に加え、ループ処理用のカウンタの設定値と、カウントアップしたときの分岐用アドレステーブルとなる。すなわち、水平型マイクロコード２１は、ループ処理の間、データパス部２３のレジスタファイルと演算器群の接続を、セレクタ群を経由して決定し、並列もしくはパイプライン並列処理の実行を指示する。分岐用アドレステーブルはループ処理が終了した際の戻り番地が格納されたテーブルであり、通常、レジスタファイルに格納され、制御回路２４の内部に設置される。
【００４０】
このようにしてアプリケーションプログラムＣｏを水平型マイクロコードを用いて並列実行可能にすると、並列データ処理ユニット２０で並列実行する内容は、サブルーチンの形態とし、あらかじめ実行可能状態で水平型マイクロコードＳＲＡＭ（コードメモリ）２２にロードしておき、基本プログラム（実行プログラム）１１のほうから起動すればよい。したがって、基本プログラム１１そのものを命令長の長いＶＬＩＷとする必要が無い。この結果、基本プロセッサ１０において常時、フェッチおよびデコードの対象となる基本命令に関しては、ＶＬＩＷ形式を取る必要はない。一方、水平型マイクロコード化された処理は、データパス部２３でＶＬＩＷ化された状態で、並列もしくは並列パイプライン処理の形で実行される。したがって、ＶＬＩＷ方式のプロセッサのように、大型で消費電力の大きなフェッチおよびデコード部分を無くし、ＶＬＩＷ方式のプロセッサと同程度の並列度で並列パイプライン処理を実行することができるコンパクトなデータ処理装置１０を提供することが可能となる。
【００４１】
図４に並列データ処理ユニット２０の一例を示してある。また、図５に、並列データ処理ユニット２０を部分的にさらに詳しく示したブロック図を示してある。この並列データ処理ユニット２０のデータパス部２３は、並列に動作可能な４つの演算ユニット５１を備えており、コードメモリ２２に格納された１つの水平型マイクロコード２１により、４つの処理あるいはサブ命令を同時実行できる。個々の演算ユニット５１は、図５に示すように、１つのＡＬＵ５１ｏと、入力レジスタ５１ａおよび５１ｂと、入力レジスタ５１ａおよび５１ｂへの入力を選択する２段のセレクタ５１ｃ、５１ｄ、５１ｅおよび５１ｆを備えている。
【００４２】
演算器（ＡＬＵ）５１ｏは、乗算器、加算器などとして機能し、各演算器５１ａには、それに対応するコンディションコード格納レジスタも存在する。ＡＬＵ５１ｏと入力レジスタ５１ａおよび５１ｂは、固定的に接続されており接続の自由はない。これに対し、１段目のセレクタ５１ｃおよび５１ｄは、汎用レジスタ群（汎用レジスタファイル）５２から入力レジスタ５１ａおよび５１ｂを選択でき、２段目のセレクタ５１ｅおよび５１ｆは、さらに、自己および他のＡＬＵ５１ｏの出力も含めて入力レジスタ５１ａおよび５１ｂへの入力を選択することができる。
【００４３】
個々の演算ユニット５１の出力は、自己も含めた演算ユニット５１の入力と、汎用レジスタ群５２と、データ出力部５３と、アドレス出力部５４へ切り替えて接続できる。また、個々の演算ユニット５１の出力は、制御回路２４のループ制御部２６へもループ回数の初期セットのために出力することができる。さらに、個々の演算ユニット５１の出力は、条件判断のために出力制御部２７にも出力することができる。
【００４４】
汎用レジスタ群５２は、複数、たとえば８つの汎用レジスタ５２ａと、それに対する入力を選択するセレクタ５２ｂと、汎用レジスタ５２ａに対するライトイネーブル信号ＷＥを選択するセレクタ５２ｃとを備えている。それぞれの汎用レジスタ５２ａには、演算ユニット５１の出力だけではなく、インターフェイスレジスタ群２８の入力データレジスタ２８ａからの出力も選択できる。各汎用レジスタ５２ａへの書き込みは、ライトイネーブル信号ＷＥにより決定されるが、本例の並列データ処理ユニット２０では、各々の演算ユニット５１で演算した結果のいずれかで汎用レジスタ５２ａへの書き込みを制御することができる。このため、セレクタ５２ｃを設けている。
【００４５】
データ出力部５３は、汎用レジスタ群５２の出力または演算ユニット５１の出力のいずれかを選択してインターフェイスレジスタ群２８のデータ出力レジスタ２８ｂに出力する。また、アドレス出力部５４は、汎用レジスタ群５２の出力または演算ユニット５１の出力のいずれかを選択してインターフェイスレジスタ群２８のアドレス出力レジスタ２８ｃに出力する。さらに、水平型マイクロコード２１は、ライトまたはリードするタイミングを制御する信号をデータＲＡＭ制御レジスタ２８ｄに出力する。
【００４６】
したがって、本例の並列データ処理ユニット２０は、インターフェイスレジスタ群２８とデータバス３８を介して基本プロセッサ１０の汎用レジスタまたはデータＲＡＭ１５のデータを並列データ処理ユニット２０のデータパス部２３に入力し、演算処理して基本プロセッサ１０にフィードバックすることができる。すなわち、基本プロセッサ１０の汎用レジスタまたはデータＲＡＭ１５から読み込むアドレスをアドレス出力部５４から出力し、入力データレジスタ２８ａを介してデータを読み込む。一方、書き込むときは、アドレス出力部５４からアドレスを出力してデータ出力部５３からデータを出力し、データＲＡＭ制御レジスタ２８ｄを介して出力される基本プロセッサ宛のＷＥ信号によりデータＲＡＭ１５または汎用レジスタに書き込まれる。
【００４７】
これらの演算ユニット、レジスタおよびその他の回路要素を含むデータパス部２３の接続およびＡＬＵ５１ｏの演算内容は、水平型マイクロコード２１により制御される。したがって、水平型マイクロコード２１により設定されたデータフローグラムを基本プロセッサ１０のデータＲＡＭ１５から供給されたデータが流れて処理され、基本プロセッサのデータＲＡＭ１５にフィードバックする処理を行うことができる。
【００４８】
データパス部２３を制御する制御ユニット２４は、ループを制御するループ制御部２６と、アドレスを制御するアドレス制御部２５と、条件判定による出力を制御する出力制御部２７とを備えている。基本プロセッサ１０からは、レジスタなどに値がセットされ、起動条件が揃えられると、並列データ処理ユニット２０でデコードできるＶＵ命令で基本プロセッサ１０からコールされることにより起動がかけられる。アドレス制御部２５は、起動ＶＵ命令をデコードして、所定のアドレスの水平型マイクロコード２１をアクティブにする。そして、所定のステップの処理が終了すると、終了ＶＵ命令あるいは割り込み命令を基本プロセッサ１０に供給して基本プロセッサ１０の側で処理が継続される。たとえば、アドレス制御部２５は、シーケンサーなどのＦＳＭである。水平型マイクロコード２１を実行中は、これらの制御部は水平型マイクロコード２１に含まれている命令（サブ命令）およびパラメータに従ってデータパス部２３を制御する。
【００４９】
図６に水平型マイクロコード２１のフォーマットの一例を示してある。１つの水平型マイクロコード２１は、４つのサブ命令を記述する命令フィールド５５と、条件判断のときに参照するサイクルを規定するサイクルフラグフィールド５６と、ループカウンタを選択するループカウンタ選択フィールド５７を備えている。各々の命令フィールド５５は、さらに、操作を指定するオペランドフィールド５５ａと、使用するレジスタを指定するレジスタフィールド５５ｂと、データパス部２２のどの演算ユニット５１で処理を実行するかを指定するスロット番号フィールド５５ｃと、条件判断のときにＡＬＵ５１ａのコンディション・コードと比較するビットマップφｂｍが記述される選択条件フィールド５８とを備えている。ビットマップφｂｍは、１０ビットのデータで、＃０〜＃４までの５種類の真偽いずれかで選択される処理であるかを示す情報を備えている。
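As an illustration of the word layout described in paragraph [0049], the following minimal Python sketch models one horizontal microcode word. It assumes a 2-bit loop counter select field (the width used as an example in paragraph [0050]) and allows fewer than four sub-instructions per word. The class names, field names, and the mnemonic `ccmp` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_SLOTS = 4      # arithmetic units 51, addressed as slots #1..#4
BITMAP_BITS = 10   # 2 bits for each of slots #0..#4


@dataclass(frozen=True)
class SubInstruction:
    opcode: str                 # field 55a (mnemonics here are assumed)
    registers: Tuple[str, ...]  # register field 55b
    slot: int                   # slot number field 55c
    bitmap: int                 # selection condition field 58

    def __post_init__(self):
        if not 1 <= self.slot <= NUM_SLOTS:
            raise ValueError("slot must be in 1..4")
        if not 0 <= self.bitmap < (1 << BITMAP_BITS):
            raise ValueError("bitmap must fit in 10 bits")


@dataclass(frozen=True)
class MicrocodeWord:
    subs: Tuple[SubInstruction, ...]  # instruction fields 55
    cycle_flag: bool                  # cycle flag field 56
    loop_select: int                  # loop counter select field 57

    def __post_init__(self):
        slots = [s.slot for s in self.subs]
        if len(slots) > NUM_SLOTS or len(set(slots)) != len(slots):
            raise ValueError("at most one sub-instruction per slot")
        if not 0 <= self.loop_select <= 0b11:
            raise ValueError("loop_select must fit in 2 bits (assumed width)")
```

For example, `MicrocodeWord((SubInstruction("ccmp", ("r0", "r1"), 1, 0b1000000000),), True, 0)` would describe a word whose only sub-instruction runs in slot #1 and is selected by the "then" bit of slot #0.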
【００５０】
図７に、水平型マイクロコード２１のループカウンタ選択フィールド５７により、ループ回数が制御される構成を示してある。水平型マイクロコード２１のループカウンタ選択フィールド５７には数ビットが格納でき、ループレジスタを指定する。たとえば、２ビットなら、００：ループなし、０１：ループレジスタ＃１指定、１０：ループレジスタ＃２指定、１１：ループレジスタ＃３指定、といった具合である。
【００５１】
ループ制御回路２６は、３つのループカウンタ２６ａを備えており、水平型マイクロコード２１を複数回実行するときは、それを制御するループカウンタ２６ａが選択される。各々のループカウンタ２６ａを備えた回路２６Ｘはそれぞれ独立しており、カウントダウンされるループレジスタ２６ａ、ループ初期値レジスタ２６ｂ、デクリメンタ２６ｃと、カウンタ２６ａがカウントアップ（本例では０）になると初期値レジスタ２６ｂの値でループカウンタ２６ａをリセットする初期化回路２６ｄを備えている。インクリメンタを採用して、ループカウンタ２６ａと初期値レジスタ２６ｂとの値が一致したときにカウントアップする回路ももちろん可能である。
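The loop counter circuits 26X of paragraphs [0050] and [0051] can be sketched as follows. This is a simplified model, assuming the 2-bit select encoding given as an example (00 means no loop, 01 to 11 select loop registers #1 to #3) and initial values of at least 1. The class and method names are hypothetical.

```python
class LoopCounter:
    """One circuit 26X: loop register 26a plus initial-value register 26b."""

    def __init__(self, initial):
        self.initial = initial  # initial-value register 26b
        self.count = initial    # loop register 26a

    def tick(self):
        """Decrement once (26c); on reaching 0, reload (26d) and return True."""
        self.count -= 1
        if self.count == 0:
            self.count = self.initial
            return True
        return False


class LoopControl:
    """Loop control unit 26 with three independent counters (#1..#3)."""

    def __init__(self, initials):
        self.counters = {i + 1: LoopCounter(v) for i, v in enumerate(initials)}

    def step(self, loop_select):
        """Apply the loop counter select field of one executed microcode word."""
        if loop_select == 0b00:  # 00: no loop counter is associated
            return False
        return self.counters[loop_select].tick()
```

Because each microcode word only names a counter, nesting needs no extra bookkeeping in this model: an inner loop word selects one counter and the enclosing loop word selects another.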
【００５２】
また、アドレス制御部２５はＦＳＭで構成され、ループレジスタ（ループカウンタ）２６ａの番号に対応したループ処理開始アドレス（戻り値アドレス）を格納する分岐アドレステーブル２５ａを備えている。初期設定では、この分岐アドレステーブル２５ａに、先に説明した並列データパス専用のコンパイラＣｃ２により作成され、水平型マイクロコード２１と共に初期値レジスタ２６ｂにロードされる。したがって、ループカウンタ２６ａの初期値は、並列データ処理ユニット２０が実行プログラム１１に基づきファンクション・コールされる段階か、それ以前にループ回数として初期値レジスタ２６ｂに設定される。
【００５３】
データパス部２３がループ処理を開始すると、まず、水平型マイクロコード２１のループカウンタ選択フィールド５７の値によりカウンタ２６ａが選択される。カウンタ２６ａは初期設定されているので、水平型マイクロコード２１が実行されると、ループカウンタ選択フィールド５７により関連付けられているループレジスタあるいはループカウンタ２６ａがデクリメントされる。そして、カウンタ２６ａが　“０”になったら、アドレス制御部２５のＦＳＭ制御部２５ｂが０検出の信号を捉えて、アドレス分岐テーブル２５ａに格納された、ループレジスタ２６ａに対応したループ開始アドレスあるいは戻り値アドレスを、次の水平型マイクロコード２１のアドレスとしてコードアドレス出力部２５ｃから出力する。同時に、ループレジスタ２６ａをループ初期値レジスタ２６ｂの値に再設定して次のループ処理に備える。
【００５４】
この結果、本例の並列データ処理ユニット２０においては、ループ処理を実行する際に、ループ終了アドレスと実行アドレスの比較によるループ処理判定を行わずに済む。したがって、アドレス同士を比較するような大きな比較器を必要としないメリットがある。アドレス・コンパレータはビット数が多いので、遅延制御のクリティカル・パスとなる可能性があり、それによって処理速度が低下する可能性がある。これに対し、本例のループカウンタを選択する方法であると、ビット数は非常に少ないので、処理速度に影響を与える可能性はない。
【００５５】
また、本方式によれば、複数のループ処理がネスト構造になっても、そのループ回数の制御は非常に簡単である。ループ内に存在するネストループ数といった情報を必要とせず、異なるループカウンタ２６ａを指定してループ回数を制御するだけでネスト構造に対応できる。このため、本例であると、３重のネスト構造まで、１重のループ処理と同様の汎用的な制御で実行することができる。
【００５６】
なお、パイプライン動作には初期設定が必要であり、ネストされたループ記述において各ループで必要とされるメモリアドレス等のリソースが異なる場合においては先頭の段階において各ループ処理のイニシャライズを行うのが有効である。
【００５７】
図８に、ループ処理を水平型マイクロコード化して並列データ処理ユニット２０で実行する様子を示してある。基本プロセッサ１０を制御する実行プログラム１１では、ステップ６１で並列データ処理ユニット２０を初期設定する。この例では、命令６１ａで、データＲＡＭ１５に格納されたマイクロコードプログラム２９の中から、並列データ処理ユニット２０のコードメモリ２２へロードする水平型マイクロコード２１のアドレスを設定する。また、命令６１ｂで、並列データ処理ユニット２０のコードＲＡＭ２２に、水平型マイクロコード２１を格納するためのアドレスを並列データ処理ユニット２０のアドレスレジスタへ転送する。命令６１ｃで、コードメモリ２２へロードするデータ、すなわち、水平型マイクロコード２１をレジスタに設定する。さらに、命令６１ｄで、並列データ処理ユニット２０の、水平型マイクロコード２１を格納するためのデータレジスタへデータを転送する。これを必要なビット幅だけ繰り返し、命令６１ｅで、コードメモリ２２の指定されたアドレスに水平型マイクロコード２１を書き込む。これを並列データ処理ユニット２０で処理を行うために必要なステップ数だけ繰り返す。これにより、並列データ処理ユニット２０のコードメモリ２２は初期設定される。
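The transfer sequence of step 61 might look like the sketch below. The 64-bit microcode word and the 16-bit register width are assumptions chosen only for illustration (the specification does not state these widths), and the function and variable names are hypothetical.

```python
def load_microprogram(data_ram, src_addr, code_memory, dst_addr, n_words,
                      word_bits=64, reg_bits=16):
    """Copy n_words microcode words from data RAM 15 into code memory 22."""
    chunks = word_bits // reg_bits  # register transfers needed per word
    mask = (1 << reg_bits) - 1
    for w in range(n_words):
        address_reg = dst_addr + w  # 61b: destination address register
        word = 0
        for c in range(chunks):
            # 61a/61c: fetch one register-wide piece of the microcode
            data = data_ram[src_addr + w * chunks + c] & mask
            # 61d: pass it to the unit's data register, most significant first
            word = (word << reg_bits) | data
        code_memory[address_reg] = word  # 61e: write the assembled word
    return code_memory
```

The inner loop corresponds to repeating the register transfer "for the required bit width", and the outer loop to repeating the write for every microcode step.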
【００５８】
したがって、このステップ６１において、第１の処理ユニットである基本プロセッサ１０は、実行プログラム１１に含まれる命令により第２の処理ユニットである並列データ処理ユニット２０のコードメモリ２２を書き換えることができ、並列データ処理ユニット２０により様々な処理を実行することができる。図２に示したデータ処理システム２においては、並列データ処理ユニット２０のマイクロコードを更新する作業は、基本プロセッサ１０からの命令により管理ユニット３５が実行する。したがって、ステップ６１のレジスタを用いた転送処理のほとんどは、管理ユニット３５と並列データ処理ユニット２０との間で行われる処理となり、この間、基本プロセッサ１０に異なる処理を実行することができる。このため、基本プロセッサ１０を、並列データ処理ユニット２０の処理内容の変更を行う処理、すなわち、ステップ６２のほとんどの処理から開放でき、データ処理装置２の処理速度を向上できる。
【００５９】
次に、ステップ６２で、ループ処理を実行するために、初期値設定レジスタ（ＶＲ＿ＬＯＯＰ１〜３）２６ｂに初期値を設定する。命令６２ａは、＃１の初期値レジスタＶＲ＿ＬＯＯＰ１に値を転送し、命令６２ｂは＃２の初期値レジスタＶＲ＿ＬＯＯＰ２、命令６２ｃは＃３の初期値レジスタＶＲ＿ＬＯＯＰ３に値を転送する。このステップ６２において、第１の処理ユニットである基本プロセッサ１０は、実行プログラム１１に含まれる命令により、ループ制御レジスタである初期値設定レジスタ２６ｂに、予めデータＲＡＭ１５に格納されたループ回数または実行プログラム１１に含まれる所望のループ回数をセットすることができる。
【００６０】
さらに、命令６２ｄでは、その他のパラメータが必要であれば、それもレジスタ転送する。そして、命令６２ｅで、ループレジスタを使用する水平型マイクロコード２１を起動する命令が実行される。この命令「Ｖ＿ＬＯＯＰ＿ＩＮＳＴ１」は起動命令φｓとして、基本プロセッサ１０のフェッチ・デコードユニット１３から並列データ処理ユニット２０に供給され、並列データ処理ユニット２０が処理を開始する。
【００６１】
図８に示すように、並列データ処理ユニット２０で実行される処理は３重のネストループになっている。基本プロセッサ１０が実行プログラム１１により並列データ処理ユニット２０の水平型マイクロコード２１と初期値レジスタ２６ｂを設定し、スタートをかけることにより、それぞれのループ処理を行う水平型マイクロコード２１は異なるループカウンタ２６ａで制御されるので、簡単にネスト構造のループ処理を制御できる。また、それぞれのループ処理では、水平型マイクロコード２１により、４つの処理が同時あるいは並列に実行されるので、ループ処理を高速で実行できる。
【００６２】
基本プロセッサ１０は、ステップ６３で、並列データ処理ユニット２０の処理が終了するのを待って次の処理を開始する。並列データ処理ユニット２０がループ処理を行っている間、基本プロセッサ１０は異なる処理を並列して行うことも可能である。ステップ６３では、命令６３ａで並列データ処理ユニット２０からループ処理完了を示す信号（ＶＵＷＡＩＴ信号）が来るのを待つ。次に、命令６３ｂ〜６３ｄで、並列データ処理ユニット２０の処理結果を基本プロセッサ１０に転送する。
【００６３】
並列データ処理ユニット２０で実行する「ループ文」の中には当然「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」が記述され得る。本例の並列データ処理ユニット２０においては、「ｔｈｅｎ節」の処理と「ｅｌｓｅ節」の処理とを選択して行うのではなく、どちらも実行する。すなわち、ループ処理を水平型マイクロコード２１に変換するコンパイラＣｃ２では、条件分岐の処置に際し、ふたつの選択可能な処理が用意され、まず、「ｔｈｅｎ節」と「ｅｌｓｅ節」とどちらも実行される。そして、「ｉｆ条件」の「真」・「偽」判定の結果、選択されなかった「節」の実行ステートメントは、演算ユニット５１の段階では実行しているが、その出力を最終的にはコンディションコードレジスタや汎用レジスタなどに書き込みを行わないことで条件分岐を実行する。もう１つの方法、すなわち、「ジャンプ文」を生成して分岐させることも可能である。選択肢を同時に実行する方法は「ｔｈｅｎ節」や「ｅｌｓｅ節」が浅い場合に有効であり、ジャンプ文で分岐する方法は、「ｔｈｅｎ節」や「ｅｌｓｅ節」が長い場合やアンバランスな場合に有効である。
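The idea of executing both clauses and suppressing the write-back of the unselected clause can be sketched in software as follows. This is only an analogy of the hardware behavior, assuming that all operations in a cycle read the same register state; the function and its arguments are hypothetical.

```python
def run_if_then_else(regs, cond, then_ops, else_ops):
    """Execute both clauses; write back only the clause chosen by cond.

    then_ops and else_ops are lists of (dest, fn) pairs, where fn(regs)
    computes the value that the operation would write to register dest.
    """
    snapshot = dict(regs)  # every unit reads the same pre-cycle state
    results = [(dest, fn(snapshot), True) for dest, fn in then_ops]
    results += [(dest, fn(snapshot), False) for dest, fn in else_ops]
    taken = bool(cond(snapshot))
    for dest, value, on_then_side in results:
        if on_then_side == taken:  # write enable asserted only when selected
            regs[dest] = value
    return regs
```

Every operation is computed regardless of the condition, so no jump is needed; the condition only gates which results reach the registers.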
【００６４】
図９に、並列データ処理ユニット２０に用意された出力制御部２７の概略構成を示してある。この並列データ処理ユニット２０においては、出力制御部２７で「ｉｆ条件」の演算を行う演算ユニット５１の結果により他の演算ユニット５１の出力を制御することにより、「真」・「偽」判定の結果をデータパス部２３の出力として反映できる。このため、選択肢、すなわち、「ｔｈｅｎ節」および「ｅｌｓｅ節」を同時に実行することが可能となり、条件分岐により並列パイプライン処理の効率が低下するのを防止できる。
【００６５】
本例の出力制御部２７は、４つの演算ユニット５１のそれぞれのＡＬＵ５１ａにおいて演算された「真」・「偽」の判定結果１ビットをコンディション・コード（ＣＣ）φｃｃとしてコンディションコードレジスタセット（ＣＣＲＳ）７１に出力する４つのコンディション・コード（ＣＣ）出力部７２を備えている。ＡＬＵ５１ａから出力されるコンディション・コードは１ビットに限定されるものではないが、本例では、真偽を示す１ビット分のコンディション・コードに着目して説明する。したがって、本発明は１ビットのコンディション・コードを出力するＡＬＵなどの演算ユニットを備えたデータ処理装置に限定されるものではなく、複数のビットのコンディション・コードが出力されるデータ処理装置においても、本発明は適用可能である。
【００６６】
各々のＣＣ出力部７２は、１ビットのコンディション・コードφｃｃを極性反転させて２ビットのコンディション・コード（リバイスド・コンディション・コードＲＣＣ）φｒｃにする変換回路７３と、その極性反転させた２ビットのＲＣＣφｒｃをＣＣＲＳ７１に出力するか否かを選択する出力選択部７４とを備えている。したがって、ＣＣＲＳ７１には、先の「ｉｆ条件」で演算ユニット５１が選択されていなければ「００」のＲＣＣφｒｃが、その演算ユニット５１に対応するＣＣＲＳ７１のアドレスに格納される。したがって、演算ユニット５１の番号と、ＣＣＲＳ７１のアドレス（番地）は１対１に対応している。たとえば、＃１の演算ユニット５１の出力はＣＣＲＳ７１のアドレス＃１（１番地）に格納される。一方、選択されており、演算ユニット５１の演算結果が「真（１）」であれば、「１０」のＲＣＣφｒｃが格納される。また、演算結果が「偽（０）」であれば「０１」のＲＣＣφｒｃが格納される。
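A minimal sketch of the conversion described in paragraph [0066], assuming the RCC is held as a 2-bit integer whose left bit means "true" and whose right bit means "false". The function and constant names are hypothetical.

```python
RCC_NOT_SELECTED = 0b00  # unit was not selected by the preceding if-condition
RCC_FALSE = 0b01         # selected, comparison result false (0)
RCC_TRUE = 0b10          # selected, comparison result true (1)


def to_rcc(cc, selected):
    """Convert a 1-bit condition code into the 2-bit revised condition code."""
    if not selected:
        return RCC_NOT_SELECTED
    return RCC_TRUE if cc else RCC_FALSE
```

The third state "00" is what lets a skipped comparison disable every operation that depends on it, whichever branch that operation belongs to.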
【００６７】
２ビットのＲＣＣφｒｃを格納するＣＣＲＳ７１は所謂プレディケートレジスタであるが、ＡＬＵの番号とＣＣＲＳアドレスを１対１に対応させるなど、本来のプレディケートレジスタのような汎用性はないので、ＣＣＲＳ７１と称して説明する。また、ＣＣＲＳ７１は、ＣＣＲＳ＃０（０番地）に架空ＡＬＵのコンディション・コードを格納している。この０番地のＲＣＣは、プログラム全体の制御を行うものでＡＮＤ論理の場合には真に相当する「１０」があらかじめ格納されている。したがって、先行する「ｉｆ条件」に左右されずに実行される命令は、このＣＣＲＳ７１の０番地のＲＣＣと比較することにより必ず実行させるようにすることができる。このＣＣＲＳ＃０については２ビットの「１０」を１ビットのデータ「１」で置き換えることも可能である。
【００６８】
ＣＣＲＳ７１に格納された各演算ユニット５１のコンディション・コードφｒｃは、次の演算ユニット５１の演算結果を出力するか否かを判断するために、出力選択回路７４にフィードバックされる。本例の出力制御部２７は、さらに、ＣＣＲＳ７１に格納されたコンディション・コードφｒｃの代わりに、格納される前のコンディション・コードφｒｃを出力選択回路７４にフィードバックできるコンディション・コード選択回路（ＣＣ選択回路）７５を備えている。このＣＣ選択回路７５は水平型マイクロコード２１のサイクルフラグフィールド５６のデータにより制御され、サイクルフラグが立っていると、ＣＣＲＳ７１に出力されるＲＣＣφｃｒを出力選択回路７４に供給する。この結果、出力選択回路７４では、同一サイクル内の他の演算ユニット５１の演算結果を参照することが可能となり、条件分岐の演算と、その演算結果で出力が左右される演算を同じサイクルで並列に実行することが可能となる。したがって、条件分岐を含む演算をさらに効率よく並列実行することが可能となり、並列データ処理ユニット２０における処理速度をさらに向上できる。なお、ＣＣＲＳ７１の０番のコンディション・コードは、演算ユニット５１から出力されないので、いずれの場合もＣＣＲＳ７１から出力選択部７４に供給される。
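The role of the CC selection circuit 75 can be sketched as below, assuming the condition codes are kept as five 2-bit values indexed by slot number #0 to #4. The function name and the list representation are hypothetical.

```python
def reference_rccs(cycle_flag, stored, pending):
    """Return the RCCs that the output selection circuits 74 compare against.

    stored:  CCRS 71 contents from the previous cycle, indexed by slot 0..4
    pending: RCCs about to be written in the current cycle, same indexing
    """
    if not cycle_flag:
        return list(stored)  # refer to the previous cycle's results
    # Slot #0 has no arithmetic unit, so it always comes from the CCRS.
    return [stored[0]] + list(pending[1:])
```

With the cycle flag set, a comparison and the operations it controls can be placed in the same microcode word.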
【００６９】
図１０に出力選択回路７４の概略構成を示してある。この出力選択回路７４は、水平型マイクロコード２１の各命令フィールド５１に記述された１０ビットのビットマップφｂｍと、０番から４番までの２ビットのＲＣＣφｒｃとをビットバイビットで論理積を演算し、それらの論理和を演算する第１の判定回路７６を備えている。この第１の判定回路７６では、ビットマップφｂｍのビット列のうち「１」にセットされているアドレスのいずれかと、全ＲＣＣφｒｃ（１０ビット）のビット列の「１」にセットされたアドレスのいずれかが一致すると、その出力選択回路７４が選択されたことを判定する。すなわち、前のサイクルまたは同一サイクル内の「ｉｆ条件」の演算で、その出力選択回路７４に対応する演算ユニット５１が選択されたので、その演算ユニット５１の出力は汎用レジスタ５２ａまたはＣＣＲＳ７１に書き込むことが許可される。
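The first decision circuit 76 reduces to a bitwise AND followed by an OR over all ten bits. The sketch below assumes that slot #0 occupies the two most significant bits of the 10-bit word and that, within each slot, the left bit is the "then" side; the helper names are hypothetical.

```python
def pack_rccs(rccs):
    """Pack five 2-bit RCCs (slots #0..#4) into one 10-bit word."""
    word = 0
    for rcc in rccs:
        word = (word << 2) | (rcc & 0b11)
    return word


def slot_bitmap(slot, on_then):
    """Bitmap selecting the 'then' (10) or 'else' (01) side of one slot."""
    return (0b10 if on_then else 0b01) << (2 * (4 - slot))


def is_selected(bitmap, rccs):
    """AND the bitmap with all RCCs bit by bit, then OR the results."""
    return (bitmap & pack_rccs(rccs)) != 0
```

Setting more than one bit in the bitmap, for example `slot_bitmap(1, True) | slot_bitmap(2, False)`, makes the sub-instruction selected when either condition holds.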
【００７０】
出力選択回路７４は、さらに、第１の判定回路７６により選択されたときに、水平型マイクロコード２１の各命令フィールド５１に記述されたオペランドにより、汎用レジスタ５２ａへの書き込みを許可するＷＥ信号φｗｅを出力するか、ＣＣＲＳ７１へコンディション・コードを書き込むことを許可する信号φｃｅを出力するかを判断する第２の判定回路７７を備えている。「ｉｆ条件」の演算で選択された処理が条件比較、すなわち、「ｉｆ条件」の演算であれば、出力選択回路７４からはＲＣＣの書き込みを許可する信号φｃｅがＣＣ変換回路７３に供給され、２ビットに変換されたＲＣＣφｒｃが出力される。一方、「ｉｆ条件」の演算で選択された処理が汎用レジスタ５２ａへの出力を伴う処理、たとえば、算術演算命令や転送命令であれば、出力選択回路７４から汎用レジスタ群５２へライトイネーブル信号φｗｅが供給される。この結果、演算ユニット５１で並列的に処理された結果が、その演算ユニット５１を選択する「ｉｆ条件」の演算結果により出力あるいは出力されないことにより、「ｉｆ条件」で分岐（ジャンプ）して処理を実行したのと同じ結果を並列処理で得ることができる。
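The second decision circuit 77 can be sketched as a small function that routes the selection result to one of two enable signals. Classifying the operation with a boolean `is_compare` is an assumption made for brevity, and the function name is hypothetical.

```python
def output_enables(selected, is_compare):
    """Return (we, ce) for one arithmetic unit.

    we: write enable for the general-purpose registers 52a
    ce: enable for writing the converted condition code to the CCRS 71
    """
    if not selected:
        return (False, False)  # result is computed but not output
    if is_compare:
        return (False, True)   # conditional compare updates the CCRS
    return (True, False)       # arithmetic or move writes a register
```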
【００７１】
図１１に、「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」のツリー構造を有しつつ、すべての命令をサブ命令として水平型マイクロコード２１に含めて並列実行し、条件分岐により選択された正しい経路のみの実行結果が出力される例を示す。太線が実行時に選択される経路であるとする。
【００７２】
図１２は、この記述例を実行するために生成された水平型マイクロコード２１である。この水平型マイクロコード２１はサブ命令として「ｃｏｎｄｉｔｉｏｎａｌ　ｃｏｍｐａｒｅ文」および「ｃｏｎｄｉｔｉｏｎａｌ　ｍｏｖｅ文」を含んでいる。第１サイクルでは、３つの「ｃｏｎｄｉｔｉｏｎａｌ　ｃｏｍｐａｒｅ文」（ＯＰ−１、ＯＰ−２およびＯＰ−３）と１つの「ｃｏｎｄｉｔｉｏｎａｌ　ｍｏｖｅ文」（ＯＰ−４）が並列実行される。処理ＯＰ−２、ＯＰ−３およびＯＰ−４は、いずれも処理ＯＰ−１の結果により選択されたり、選択されなかったりする。このため、サイクルフラグ領域５６にはフラグが立っており、同一サイクル内の他の演算ユニット５１の演算結果が出力選択回路７４で参照される。本図からわかるように、本例のデータ処理装置１においては、図１１に示されたＯＰ−１からＯＰ−９の９つの処理を、分岐を含めて３サイクルで処理することができ、さらに、分岐が発生したことによるサイクルの無駄、すなわち、サイクルのペナルティは生じない。水平型マイクロコード２１で実行可能な命令は条件判断に限らず、乗算命令、加算命令、その他の命令であっても良い。したがって、ループ処理が要求される積和演算などを水平に展開して実行することにより、同じ処理を基本プロセッサにより実行した場合と比較すると大幅に性能を改善できる。
【００７３】
図１３に、水平型マイクロコード２１に含まれるＯＰ−１〜ＯＰ−９がサブ命令フィールド５５に展開された内容を示してある。また、図１４に、各々のサブ命令ＯＰ−１〜ＯＰ−９の記述内容を示してある。オペランドフィールド５５ａには、「ｃｏｎｄｉｔｉｏｎａｌ　ｃｏｍｐａｒｅ文」、または「ｃｏｎｄｉｔｉｏｎａｌ　ｍｏｖｅ文」が記述され、レジスタフィールド５５ｂには比較するレジスタあるいはデータを転送するレジスタが定義されている。また、スロット番号フィールド５５ｃには、演算ユニット５１を指定する情報が記載されている。本明細書においては、演算ユニット５１を規定する番号をスロット番号と称する。さらに、選択条件フィールド５８に１０ビットのビットマップφｂｍが格納されている。
【００７４】
ビットマップφｂｍは、そのサブ命令ＯＰ−１〜ＯＰ−９が架空のＡＬＵ（スロット番号０）を含む５つのスロット＃０〜＃４の真偽いずれかで選択されるかを示す情報である。すなわち、ビットマップφｂｍは、サブ命令ＯＰ−１〜ＯＰ−９が先行する「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」により「ｔｈｅｎ」側に記述されたのか「ｅｌｓｅ」側に記述されたのかを示す。各スロットの情報は２ビットで記述されており、各スロットの最初のビット（１ビット目、左側のビット）が「ｔｈｅｎ」側で選択されることを示し、２ビット目（右側のビット）が「ｅｌｓｅ」側で選択されることを示す。したがって、同じ配列になっているＣＣＲＳ７１またはＣＣＲＳ７１に書き込まれるデータＲＣＣφｒｃと出力選択回路７４で比較することにより、該当するスロット番号の演算ユニット５１の出力が選択されたか否かが判断できる。
【００７５】
たとえば、サブ命令ＯＰ−１では、グローバルコンディションコードであるスロット番号＃０の「ｔｈｅｎ」側（＃０１）で選択されるようになっており、条件文に関係なく実行されることが分かる。サブ命令ＯＰ−２は、スロット番号＃１の「ｔｈｅｎ」側（＃１１）で選択されるようになっており、また、図１２に示すように、最初のサイクルではサイクルフラグ５６が立っているので、同じサイクルの一番目のスロット、すなわち、サブ命令ＯＰ−１の「ｔｈｅｎ」側で選択されることが分かる。他のサブ命令についても同様であり、サブ命令ＯＰ−５からＯＰ−９については、サイクルフラグ５６がたっていないので、ＣＣＲＳ７１に格納された前のサイクルのＲＣＣφｒｃとそれぞれのスロット番号に割り当てられたサブ命令のビットマップφｂｍが比較される。また、サブ命令ＯＰ−１からＯＰ−３は「ｃｏｎｄｉｔｉｏｎａｌ　ｃｏｍｐａｒｅ文」であるので、出力が制御されるのはＣＣＲＳ７１に対するＡＬＵ５１ａの出力である。一方、サブ命令ＯＰ−４からＯＰ−９は「ｃｏｎｄｉｔｉｏｎａｌ　ｍｏｖｅ文」であるので、出力が制御されるのは、汎用レジスタ５２ａへの書き込みであり、ＷＥ信号φｗｅが出力されるか否かとなる。
【００７６】
したがって、ビットマップφｂｍを備えた本発明の水平マイクロコード２１は、自分の経路を決定している原因となるＣＣＲＳをビットマップで命令コードに保有する形式であると言える。各演算ユニット５１の出力を選択する情報はビットマップφｂｍに限定されないが、各演算ユニット５１のコンディション・コードとは無関係に選択されることを指示する第１の情報、すなわち本例ではスロット番号０の情報（グローバルコンディションコード）が必要である。また、コンディション・コードを比較する各演算ユニット５１を指示する第２の情報が必要であり、本例ではビットマップφｂｍのスロット番号に対応して配置されたビット（２ビット毎）の順番がその情報を示す。さらに、比較するコンディション・コードの真偽を示す第３の情報が必要であり、本例では各スロットで２ビットのデータが割り当てられている。したがって、本例のビットマップφｂｍにおいては、この第３の情報に相当する２ビットにより、その第３の情報の順番で指示されるスロット番号の演算ユニット５１の論理演算結果を参照し、第３の情報の一方のデータが「１」、すなわち２ビットのデータが「１０」のときには論理演算が真のときに選択されることが示される。また、他方のビットが「１」、すなわち２ビットのデータが「０１」のときには論理演算が偽のときに選択されることが示され、さらに、双方のビットが「００」のときは選択されないことが示される。また、これらに加えて、比較するコンディション・コードのサイクルを示す第４の情報、すなわち本例ではサイクルフラグ５６を備えていることが望ましい。
【００７７】
これらの選択情報、特に、第１〜第３の情報をビットマップ化することにより、デコードせずにＣＣＲＳ７１と比較して出力を制御できるので、簡易なハードウェアにより高速で処理できる。上述したように、本例は１ビット分のコンディション・コードに着目して説明しているが、ＣＣＲＳにストアされるコンディション・コードと、水平型マイクロコード２１に含まれるビットマップφｂｍとが直接比較できる体系で生成されていれば本発明のメリットを得ることができる。
【００７８】
また、サイクルフラグ５６といった第４の情報を設け、ＣＣ選択回路７５を制御することにより、参照する条件判断を行う演算ユニットと、それを利用した演算を行う演算ユニットとを並列実行できるので、並列度が向上する。本例では、水平型マイクロコード毎にサイクルフラグ５６を設けて一括管理しているが、サブ命令フィールド５５にサイクルフラグフィールドを設けることにより、演算ユニット毎に、同一サイクルの分岐条件で制御するか、前のサイクルの分岐条件で制御するかを選択することが可能となり、さらにフレキシブルになり並列度を向上できる。ただし、ＣＣＲＳ７１とそれに書き込まれるＲＣＣφｒｃとを選択するＣＣ選択回路７５を各演算ユニット５１に設ける必要があるので、ハードウェアは複雑になる。
【００７９】
また、本例の並列データ処理ユニット２０においては、各演算ユニット（ＡＬＵ）５１の演算結果（コンディション・コード）があらかじめ水平型マイクロコードにより選択されて、演算ユニット５１の制御論理としてフィードバックあるいは入力されるのではない。その代わりに、すべての演算ユニット５１の演算結果（コンディション・コード）が各演算ユニット５１の制御論理として入力される。すなわち、すべての演算ユニット５１の演算結果が格納されているＣＣＲＳ７１からの出力結果もしくはＣＣＲＳ７１への書き込み情報が各演算ユニット５１にフィードバックされる。したがって、各演算ユニット５１を制御する水平型マイクロコード２１により、演算ユニット５１の単位で所望の演算結果が選択された後、この選択された演算結果と各演算ユニット５１の演算結果との条件が取られてＣＣＲＳ７１に書き込まれ、次の演算ユニット５１における処理で引用される。あるいは、汎用レジスタへの出力の可否が判断される。
【００８０】
したがって、本発明によれば、ｃｏｎｄｉｔｉｏｎａｌ　ｃｏｍｐａｒｅ命令を実行した後には、ＣＣＲＳ７１に書き込むＲＣＣφｒｃはｔｈｅｎ／ｅｌｓｅどちらか１ビットのみが“１”であり他は“０”となる。そして、前の条件文で選択されていれば、それがＣＣＲＳ７１に書き込まれ、選択されていなければ（０，０）がＣＣＲＳ７１に書き込まれる。このため、真もしくは偽の選択があるのであれば、その結果は、複数の“１”としてＣＣＲＳ７１を介して下流の命令に伝播される。
【００８１】
このように、並列データ処理用のコンパイラＣｃ２により、水平型マイクロコード２１を構成するサブ命令５５と、そのサブ命令５５が実行される演算ユニット５１と、ＣＣＲＳ７１のスロット番号と、そのサブ命令５５に含まれるビットマップφｂｍのスロット番号とを対応させ、演算ユニット５１の出力を制御することにより、条件分岐が伴うすべての命令を止めることなく並列に進めることが可能となる。特に、必要に応じて複数命令を含むサイクル単位でのＣＣＲＳ７１への書き込みデータを参照するか、格納されたデータを参照するかの選択肢を設けることにより、条件分岐の結果を水平方向にも伝播させることが可能となり、条件分岐を含む並列処理効率が飛躍的に向上する。
【００８２】
そして、本発明の並列データ処理ユニット２０では水平型マイクロコード２１を使用するのでＶＬＩＷの特徴を最も高速で、かつ高い自由度で引き出せる構成となる。即ち、レジスタファイルと演算器の接続自由度を高めようとするとセレクタ段数が深くなるが、ＶＬＩＷコードの方を水平型マイクロコードとしてあらかじめ展開させておく形式により水平型マイクロコードのフェッチ並びにデコードの機能を排除し高速性を維持している。もともとＶＬＩＷデータパスが処理するのは繰り返しループ処理が多く、コードの深さはさほどでもない。したがって、本例の並列データ処理ユニット２０のように、処理内容を水平型マイクロコード２１として常駐させたとしても、コードメモリ２２のメモリ容量が不足する心配はない。さらに、水平型マイクロコード２１に関しては、基本プロセッサ１０のデータＲＡＭ１５からロードする機能をサポートしているので、異なる水平型マイクロコードあるいはプログラムをロードすることにより、複数のループ処理機能に対応した水平型マイクロコード２１の置き換えが可能となる。したがって、その点でもコードメモリ２２のメモリ容量が不足する心配はない。
【００８３】
また、複数の並列データ処理ユニット２０を備えた集積回路装置１または２を提供することも可能であり、複数の並列データ処理ユニット２０を並列実行させることも可能である。そして、図２に示したように、複数の並列データ処理ユニット２０のクロックを制御することにより、さらにきめ細かく消費電力を低減できる。消費電力を低減するという点では、水平型マイクロコード２１を格納するコードメモリ２２をＲＯＭ化することも有効である。水平型マイクロコード２１は書き換えは不可となるが特定の信号処理に関しては面積・消費電力においてより効果がある。
【００８４】
図１５に、異なる出力制御回路２７の例を示してある。この例では、演算ユニット５１がふたつある場合におけるＣＣＲＳ７１への書き込み前情報（ＲＣＣφｒｃ）を参照する回路構成を詳細に示している。この出力制御回路２７は、自己のＣＣ出力部７２から自己のＣＣ出力部７２へ書き込み前情報φｒｃを伝達するパスはないが、自己のＣＣ出力部７２から他のすべてのＣＣ出力部７２へ書き込み前情報φｒｃを伝達するパスがある完全接続になっており、発振状態に陥る可能性のあるループが存在する。発振状態は取らない様に並列データ処理用のコンパイラＣｃ２で対応することも可能である。しかしながら、同一サイクル内のコンディション・コードを参照するために、瞬間的に発振状態となる可能性があり、消費電力をセーブすることを考え、また、経年変化および信頼性の上で問題となる可能性がある。
【００８５】
図１６に、さらに異なる出力制御回路２７の例を示してある。この出力制御回路２７においては、ＣＣＲＳ７１への書き込み前情報（ＲＣＣφｒｃ）を参照する回路は完全接続とならず、部分接続になっている。部分接続とすることにより、同一サイクル内のコンディション・コードを参照して、すべての演算ユニット（ＡＬＵ）５１が他の演算ユニット５１の書き込み前情報φｒｃを参照することができないという制約が生ずる。すなわち、第１スロットのＣＣ出力部７２から出力される第１スロットの演算ユニット５１の結果（演算結果）は、他のすべてのＣＣ出力部７２で利用できるが、第２スロットの演算結果は、第３および第４スロットのＣＣ出力部７２でしか利用できず、第３スロットの演算結果は、第４スロットのＣＣ出力部７２でしか利用できない。また、第４スロットの演算結果は、他のスロットでは利用できない。しかしながら、発振状態となることは完全に防げる。このような制約のもとで動作するように水平型マイクロコード２１を生成するようにコンパイラＣｃ２を作成することは可能であり、同一サイクル内の演算結果が参照できないケースに比較すれば処理の並列度を上げる点で十分にメリットがある。
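The restriction imposed by the partial connection can be stated as a simple ordering rule that a compiler could check. The sketch below assumes same-cycle dependencies are given as (producer slot, consumer slot) pairs; the function names are hypothetical.

```python
def can_use_same_cycle(producer_slot, consumer_slot):
    """A slot's pre-write RCC is visible only to higher-numbered slots."""
    return 1 <= producer_slot < consumer_slot <= 4


def violations(dependencies):
    """Return the same-cycle dependencies that the partial connection forbids."""
    return [(p, c) for p, c in dependencies if not can_use_same_cycle(p, c)]
```

Because information only flows from lower to higher slot numbers, the same-cycle reference paths cannot form a cycle, which matches the statement that oscillation is completely prevented.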
【００８６】
なお、ＣＣＲＳ７１に格納済みのＲＣＣφｒｃを用いて演算ユニット５１の結果の書き込み条件を生成する場合には完全接続として良い。
【００８７】
図１７ないし図２０に、異なる処理を並列データ処理ユニット２０で実行する様子を示してある。図１７は、本例の処理を「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」のツリー構造で示してある。図１１に示した例では、どれかひとつの条件が成立した時にのみ実行される条件（ＡＮＤ条件）の処理であったのに対し、本例の処理は、複数の条件のうちどれが成立しても実行される条件（ＯＲ条件）の処理である。
【００８８】
図１８は、図１７に示した処理を実行するために生成された水平型マイクロコード２１である。この場合も、本例のデータ処理装置１であると、ＯＰ−１からＯＰ−６の６つの処理が２サイクルで実行でき、大幅に性能を改善できる。また、図１９に、水平型マイクロコード２１に含まれるＯＰ−１〜ＯＰ−６がサブ命令フィールド５５に展開された内容を示し、図２０に、各々のサブ命令ＯＰ−１〜ＯＰ−６の記述内容を示してある。サブ命令ＯＰ−５がＯＲ条件で実行されるマイクロコードの例であり、図９に示した出力制御回路２７によりＯＲ条件も分岐のペナルティーなしに実行できることがわかる。
【００８９】
以上、述べたように、本発明によれば、基本プロセッサ１０はＶＬＩＷである必要がないか、もしくは並列度の低いプロセッサでよく、それに対し基本プロセッサ１０の２倍以上の演算器並列度を備えた並列処理ユニット２０、すなわち、水平型マイクロコード型のＶＬＩＷデータパス部２３を備えた処理ユニットを付加することにより、様々な処理で頻繁に使用される記述である「ループ文」の効率的な処理や「ｉｆ　ｔｈｅｎ　ｅｌｓｅ文」とそれに従属する実行文の並列動作を高める事が可能となる。両者、すなわち、基本プロセッサ１０と並列処理ユニット２０との演算並列度の比率は２以上であることが望ましく、基本プロセッサ１０が１並列であれば並列処理ユニット２０は２並列以上であることが望ましい。また、基本プロセッサ１０が２並列であれば、並列処理ユニット２０は４並列以上であることが望ましい。これはデータ処理を行う上でこの程度以上の比率を設ける事により汎用データ処理と専用データ処理の区別が明確になるからである。
【００９０】
そして、基本プロセッサ１０と並列処理ユニット２０とを有するデータ処理システム１により、基本プロセッサ１０に余分なフェッチ・デコード構造を設置することなく、消費電力を押さえ、その一方で、ＶＬＩＷ型のプロセッサと同様に専用のデータパス部で効率のよい並列データ処理動作を行うプロセッサシステムの提供が可能となる。
【００９１】
【発明の効果】
本発明においては、フェッチおよびデコード機能を備えた基本プロセッサと、水平型マイクロコードにより制御される並列データ処理ユニットとの組み合わせにより、並列処理に適した部分だけを水平型マイクロコードで実行可能としている。したがって、並列処理が可能な部分は、ＶＬＩＷ型のプロセッサと同様に並列もしくはパイプライン並列のデータ処理が可能となり、並列処理に適していない部分は基本プロセッサによるシーケンシャルなデータ処理が可能となる。さらに、並列データ処理ユニットの制御に水平型マイクロコードを採用することにより、並列データ処理のための命令フェッチ部とデコード部の回路複雑性を排除でき、消費電力を押えつつ効率よく並列処理を行うことができる。
【００９２】
したがって、本発明により、ＶＬＩＷ型のプロセッサよりも、遥かに小型で消費電力の小さなプロセッサでありながら、ＶＬＩＷ型のプロセッサと同等の並列度を確保できるデータ処理システムを提供できる。このため、本発明では、経済的なハードウェアにより広範囲な並列実行が可能となり、通信やネットワークにて必要となる信号処理、特に、ループ記述による繰り返し演算が頻繁に出現する処理に対して、実行速度が飛躍的に向上した経済的なプロセッサを提供することが可能となる。
【図面の簡単な説明】
【図１】本発明のデータ処理装置（システムＬＳＩ）の一例の構成を示す図である。
【図２】図１と異なる本発明のデータ処理装置の例の構成を示す図である。
【図３】データ処理装置の設計過程を示す図である。
【図４】並列データ処理ユニットの概略構成を示す図である。
【図５】並列データ処理ユニットの回路例を示す図である。
【図６】水平型マイクロコードの構成を示す図である。
【図７】ループ制御部およびアドレス制御部の構成を示す図である。
【図８】ループ制御の様子を示す図である。
【図９】出力制御部の構成を示す図である。
【図１０】出力選択回路の構成を示す図である。
【図１１】並列実行する処理の一例を示す図である。
【図１２】図１１に示した処理を実行する水平型マイクロコードを示す図である。
【図１３】図１２に示す水平型マイクロコードに含まれる情報をさらに詳しく示す図である。
【図１４】水平型マイクロコードに含まれる情報の内容を記述的に示したものである。
【図１５】出力制御部の異なる例を示す図である。
【図１６】出力制御部のさらに異なる例を示す図である。
【図１７】並列実行する処理の異なる例を示す図である。
【図１８】図１７に示した処理を実行する水平型マイクロコードを示す図である。
【図１９】図１８に示す水平型マイクロコードに含まれる情報をさらに詳しく示す図である。
【図２０】水平型マイクロコードに含まれる情報を内容を記述的に示したものである。
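[Explanation of reference numerals]
1, 2  Data processing device (system LSI, processor)
10  Basic processor
20  Parallel data processing unit
21  Horizontal microcode
22  Code memory
23  Data path unit
24  Control unit
25  Address control unit
26  Loop control unit
27  Output control unit
28  Data interface register
38  Register data bus
[0001]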
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a data processing apparatus having a processing unit that is capable of parallel data processing and is controlled by VLIW-type microcode (horizontal microcode), in which one instruction can specify a plurality of processes, and to a control method for such an apparatus.
[0002]
[Prior art]
VLIW (Very Long Instruction Word) is known as a processor architecture for increasing parallelism. Ideally, this scheme allows instructions to connect freely and completely the register file and the group of arithmetic units that make up an execution unit with resources for parallel processing, so that parallel processing or pipeline-parallel processing can be executed with a high degree of freedom. As the name implies, it is therefore characterized by very long instruction codes.
[0003]
[Problems to be solved by the invention]
However, since it is necessary to fetch and decode an extremely long instruction code, there is a problem that the fetch section and the decode section of the instruction code of the processor become large and the design becomes complicated. Further, due to the complexity, the power consumption of the fetch and decode unit also becomes very large.
[0004]
A further problem is that not all of the instructions making up a program can be executed in parallel. For example, even for a VLIW-type processor with resources capable of executing four processes simultaneously, a program that always executes four processes simultaneously cannot be generated, and NOPs account for a high proportion of the four instructions or sub-instructions contained in one VLIW instruction or instruction set. (In this specification, an instruction that corresponds to one process within a VLIW instruction or horizontal microcode capable of executing a plurality of processes simultaneously is called a sub-instruction when it needs to be clearly distinguished.)
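The NOP problem can be made concrete with a small Python sketch. The schedule below is a hypothetical 4-wide example invented for illustration, not data from this specification; the function simply counts what fraction of sub-instruction slots hold a NOP.

```python
NOP = "nop"


def nop_ratio(bundles):
    """Fraction of sub-instruction slots occupied by NOPs."""
    slots = [op for bundle in bundles for op in bundle]
    return sum(op == NOP for op in slots) / len(slots)


# Hypothetical schedule for a 4-wide VLIW: each tuple is one long instruction.
schedule = [
    ("load", "load", NOP, NOP),
    ("mul", NOP, NOP, NOP),
    ("add", "store", NOP, NOP),
]
```

In this made-up schedule 7 of the 12 slots are NOPs, so the fetch and decode hardware handles more padding than useful work.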
[0005]
Consequently, although the hardware of a VLIW-type processor becomes large and consumes a great deal of power, program execution efficiency does not improve accordingly. Weighing the benefit of higher parallelism against the drawbacks of larger hardware and higher power consumption, a processor capable of executing two processes simultaneously is practical, and it is not economical to put into practical use a processor capable of executing more processes than that. Therefore, although an architecture exists that could greatly improve the execution speed of loop processing and the like by executing three or more processes in parallel, almost no processor exploiting it has been developed and manufactured at a practical level.
[0006]
Accordingly, an object of the present invention is to provide a processor architecture that secures a degree of parallelism equal to that of a VLIW-type processor while being far smaller and consuming far less power. Another object is to provide an economical processor whose execution speed is dramatically improved by allowing processing such as loop processing to be executed in parallel on economical hardware.
[0007]
It is another object of the present invention to provide an architecture and a control method capable of parallel execution of branch processing which is a bottleneck of parallelization in pipeline parallel processing, and to further improve the utilization efficiency of a parallel data processing device.
[0008]
[Means for Solving the Problems]
In the present invention, a basic processor having fetch and decode functions is combined with a parallel data processing unit controlled by horizontal microcode, so that only the portions suited to parallel processing are executed by horizontal microcode. With this configuration, of the specifications given for execution in a data processing device such as an LSI, the portions that can be processed in parallel can use horizontal microcode to perform VLIW-style parallel or pipeline-parallel data processing, while the portions not suited to parallel processing can be processed sequentially by the basic processor. Furthermore, adopting horizontal microcode eliminates the circuit complexity of the instruction fetch and decode functions for parallel data processing, so a data processing device that performs parallel processing efficiently while suppressing power consumption can be provided.
[0009]
That is, the data processing apparatus of the present invention has a first processing unit (basic processor) that fetches, decodes, and executes the instructions contained in an execution program, and a second processing unit (parallel data processing unit) controlled by horizontal microcode that can execute a plurality of processes simultaneously, and the first processing unit can execute a process of activating the second processing unit in response to a start instruction contained in the execution program. The execution unit of the first processing unit may control the second processing unit according to the fetched and decoded start instruction. Alternatively, the fetch and decode unit may identify an instruction intended for the second processing unit and supply a start instruction to the second processing unit. As a further alternative, the second processing unit may monitor the instructions fetched by the first processing unit and be activated when it finds its own instruction, that is, a start instruction for the second processing unit.
[0010]
In any of these cases, the execution program that controls the first processing unit can control the processing of the second processing unit in coordination with that of the first processing unit, so the processing executed by horizontal microcode, including parallel processing such as loop processing, can in effect be controlled from the execution program in clock units. In particular, the configuration in which the start instruction is supplied to the second processing unit from the fetch and decode unit of the first processing unit avoids the delay in the execution unit of the first processing unit, and is therefore suited to processing that demands real-time performance, such as network processing and image processing.
[0011]
Therefore, when the processing to be executed by the data processing apparatus of the present invention is given in C or a similar language, it is effective to adopt a program development method having a step of extracting a portion suitable for parallel execution from the original program and converting it into horizontal microcode capable of executing a plurality of processes simultaneously, and a step of replacing that portion of the original program with a start instruction for the horizontal microcode and converting the result into an execution program executable by a basic processor having fetch and decode functions. In other words, a compiler is needed that has two functions: converting a portion suitable for parallel execution into horizontal microcode, and replacing that portion with a start instruction and converting the result into an execution program. These compiler functions are provided as software executable by a computer having appropriate resources, that is, as a program or program product recorded on a suitable recording medium. They can of course also be provided via a computer network.
[0012]
The second processing unit, which is a parallel execution device or parallel data processing unit, has a code memory that stores a plurality of horizontal microcodes, a data path unit that includes a plurality of registers and arithmetic units whose connections and/or functions are changed by the horizontal microcode, and an address control unit that manages the execution address of the code memory. When the first processing unit includes a data RAM, it is desirable to provide, in the first or second processing unit, a plurality of interface registers that interface between the data RAM and the data path unit of the second processing unit, so that both the first and second processing units can perform processing based on the data in the data RAM.
[0013]
Furthermore, rewriting the horizontal microcode stored in the code memory changes what the parallel execution unit executes, so the resources of the parallel execution unit can be used more efficiently. The code memory can be rewritten by the first processing unit, the basic processor. That is, the first processing unit can execute a process of rewriting the code memory of the second processing unit in response to an instruction contained in the execution program. It is also possible to provide a third processing unit that rewrites the code memory of the second processing unit; relieving the basic processor of managing that code memory improves the processing speed of the data processing device as a whole.
[0014]
In the combination of a basic processor and a parallel execution unit according to the present invention, the parallel execution unit can be made small and the hardware overhead of the fetch and decode unit is reduced, so power consumption can be cut substantially. In addition, although the number of parallel execution units is not limited to one, the data processing device carries at least two processing units, the basic processor and the parallel execution unit, so controlling the clocks supplied to them can reduce power consumption further with almost no loss of processing capability. In particular, because the first processing unit can start the second processing unit, the first processing unit knows when the second processing unit starts. The first processing unit can therefore control the clock signal supplied to the second processing unit, further reducing the power consumption of the data processing device as a whole.
[0015]
Most of the part of a program that describes a signal processing algorithm consists of loop processing, and the number of steps in the C language description itself is not large. Signal processing such as the fast Fourier transform, filtering, autocorrelation calculation, the Hadamard transform, and Viterbi coding is characterized by repeated execution of specialized operations, and usually fits within tens to hundreds of steps of C source code. These steps also include many operations that can be performed simultaneously. Loop processing is therefore well suited to execution by the parallel execution unit of the present invention, and executing it with a higher degree of parallelism greatly improves processing speed. Conversely, the parts other than signal processing, that is, general-purpose parts such as data interfacing and interrupt management, require many steps but generally do not call for parallelism or pipeline parallelism. Furthermore, the signal processing part is relatively stable, whereas the other parts are susceptible to changes in specifications and the like. For this reason, it is desirable that the general-purpose portions of the C language description that need little parallelism be executed by a basic processor that can handle them flexibly.
[0016]
As described above, the data processing device of the present invention can execute processing at high speed with low power consumption, and is therefore suited to the signal processing required for communications and networks.
[0017]
It is therefore desirable that the second processing unit, the parallel data processing device, has an architecture that can execute loop processing efficiently. For this reason, the second processing unit of the present invention is provided with a loop control unit that has a plurality of loop counters for counting the number of executions of the horizontal microcode and that controls the number of executions of the horizontal microcode by means of the loop counters. By providing the horizontal microcode with an operand field containing information that controls the connections of the data path unit and a loop counter selection field containing information that selects a loop counter, the number of iterations of loop processing can be controlled very easily by the loop counter selected for each horizontal microcode instruction.
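A loop counter selection field of this kind could behave roughly as in the following Python sketch. The `MicroWord` layout, the `loop_target` branch-back address, and the reload-on-exit behaviour are assumptions made for illustration and are not taken from the specification, which describes an address table rather than a per-word target.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MicroWord:
    ops: Tuple[str, ...]             # stand-in for the operand field
    loop_sel: Optional[int] = None   # loop counter selection field (None = no loop control)
    loop_target: int = 0             # hypothetical address to branch back to


def run(code, loop_regs):
    """Return the sequence of code memory addresses that get executed."""
    counters = list(loop_regs)       # counters start from the loop control register values
    trace, pc = [], 0
    while pc < len(code):
        word = code[pc]
        trace.append(pc)
        if word.loop_sel is not None:
            counters[word.loop_sel] -= 1
            if counters[word.loop_sel] > 0:
                pc = word.loop_target    # iterations remain: repeat the loop body
                continue
            # Loop finished: reload the counter so the loop can be entered again.
            counters[word.loop_sel] = loop_regs[word.loop_sel]
        pc += 1
    return trace


# One word of loop body repeated three times between setup and store.
code = [
    MicroWord(("init",)),
    MicroWord(("load", "mac"), loop_sel=0, loop_target=1),
    MicroWord(("store",)),
]
```

Because each word names its own counter, nested loops need no extra instructions: an inner word can select counter 0 while the word that closes the outer loop selects counter 1.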
[0018]
By providing the loop control unit with loop control registers that set the count for each loop counter, and by having the first processing unit execute a process that sets the value of a loop control register in response to an instruction contained in the execution program, the number of loop iterations can also be controlled by the program. The number of iterations set in a loop control register may also be a processing result of the data path unit of the second processing unit.
[0019]
If loop processing contains an instruction involving a conditional branch, that is, an "if then else" statement, the operation to be executed changes with the truth of the condition, and parallelism drops sharply. In the present invention, therefore, the second processing unit, the parallel processing unit, is provided with a condition code register that can store the condition code of each operation unit of the data path unit, and an output control unit that compares the condition code register with the selection information for each operation unit contained in the horizontal microcode and controls the output of each operation unit to a register and/or the output of a condition code to the condition code register. With this configuration, the output can be controlled by comparing either the condition code of the previous cycle held in the condition code register or the condition code produced by each operation unit in the same cycle (the one about to be stored in the condition code register) with the selection information for each operation unit contained in the horizontal microcode, so operations can be executed in parallel regardless of whether the condition holds. When the results are output, the output control unit decides whether each output is permitted, so the truth of the condition on which each operation is premised is reflected. An instruction involving a conditional branch can therefore be executed while parallelism is maintained.
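The idea of computing both branches and gating only the write-back can be sketched in Python as below. The register names, the tuple layout of a sub-instruction, and the absolute-difference example are assumptions chosen for illustration; the sketch is a software model, not the circuit of the output control unit.

```python
def commit_cycle(regs, cond_codes, sub_instructions):
    """Run all sub-instructions of one cycle in parallel and gate their outputs.

    Each sub-instruction is (dest, func, predicate), where predicate is None
    (always write) or (unit, expected) referring to a stored condition code.
    """
    # Every operation unit computes from the same pre-cycle register values.
    values = [func(regs) for _, func, _ in sub_instructions]
    updated = dict(regs)
    for (dest, _, predicate), value in zip(sub_instructions, values):
        if predicate is None or cond_codes[predicate[0]] == predicate[1]:
            updated[dest] = value    # output permitted by the output control
    return updated


def abs_diff(a, b):
    """if (a > b) y = a - b; else y = b - a; executed without a jump."""
    regs = {"a": a, "b": b, "y": 0}
    # Cycle 1: operation unit 0 evaluates the condition and stores its code.
    cond_codes = {0: regs["a"] > regs["b"]}
    # Cycle 2: both branches execute; only the matching result is written.
    cycle2 = [
        ("y", lambda r: r["a"] - r["b"], (0, True)),
        ("y", lambda r: r["b"] - r["a"], (0, False)),
    ]
    return commit_cycle(regs, cond_codes, cycle2)["y"]
```

Both subtractions always run, so the instruction stream is the same whichever way the condition falls, which is what keeps the pipeline undisturbed.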
[0020]
Furthermore, because the output is controlled by selecting not only the condition code of the previous cycle (execution cycle) held in the condition code register but also any of the condition codes produced by the operation units in the same cycle, and comparing it with the selection information contained in the horizontal microcode, the output can reflect the results of operations executed in parallel in the same cycle. A parallel data processing device capable of processing with even higher parallelism can thus be provided.
[0021]
The selection information for each operation unit contained in the horizontal microcode consists of, for example, first information indicating that the output is selected regardless of the condition codes, second information indicating the operation unit whose condition code is to be compared, third information indicating whether the condition code to be compared is true or false, and fourth information indicating the cycle of the condition code to be compared. The fourth information may be provided per horizontal microcode instruction or per sub-instruction making up the horizontal microcode. These four pieces of information determine whether the operation is governed by a branch condition, which operation unit computes the governing branch condition, whether the branch condition must be true or false, and whether the branch condition is taken from the result of the previous cycle or of the same cycle, so all operations, including those involving conditional branches, can be executed in parallel.
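The four pieces of selection information map naturally onto a small record and a decision function, as in this Python sketch. The field names and the dictionary representation of condition codes are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selection:
    unconditional: bool = False   # first information: ignore condition codes
    cond_unit: int = 0            # second information: operation unit to compare
    expect_true: bool = True      # third information: required truth value
    same_cycle: bool = False      # fourth information: same cycle or previous cycle


def output_enabled(sel, prev_codes, same_codes):
    """Decide whether an operation unit's result may be written."""
    if sel.unconditional:
        return True
    codes = same_codes if sel.same_cycle else prev_codes
    return codes[sel.cond_unit] == sel.expect_true


# Hypothetical condition codes: unit 1 was false last cycle and is true now.
prev_codes = {0: True, 1: False}
same_codes = {0: True, 1: True}
```

Choosing `same_cycle=True` lets a result depend on a comparison performed by another operation unit in the very same horizontal microcode instruction.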
[0022]
Further, the output control unit of the present invention is configured so that whether the output of each operation unit is needed can be decided by referring to any one of the condition codes of all the operation units. The output can therefore be selected by referring not only to the operation results of the previous cycle held in the condition code register but also to the operation results of the same cycle, which allows a wide range of processing to be executed in parallel.
[0023]
The output control unit includes a conversion unit that converts a condition code, which contains at least true/false information and information indicating selection or non-selection (that is, whether or not the condition code has been selected by a branch condition), into multi-bit data that can be output to the condition code register. By making the selection information for the operation units contained in the horizontal microcode a bitmap corresponding to the contents of the condition code register, the selection information can be compared with the condition code register without decoding. For example, for a one-bit condition code, the one-bit true/false information and the information indicating selection or non-selection by the branch condition can be expressed as two-bit data. If selection information in a bitmap form that can be compared directly with this data, that is, information already decoded into the bitmap, is set when the horizontal microcode is compiled, the output control unit needs no hardware such as a decoding circuit, which would be overhead in the comparison of condition codes, and can process at high speed with simple hardware.
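One way such a decode-free comparison could work is sketched below in Python. The bit layout (two bits per operation unit, `0b10` for true and `0b01` for false) and the rule that a zero mask means unconditional are assumptions chosen for illustration; the specification states only that two bits suffice and that the microcode carries a directly comparable bitmap.

```python
def encode(unit, value):
    """Two bits per operation unit: 0b10 = true, 0b01 = false (assumed layout)."""
    return (0b10 if value else 0b01) << (2 * unit)


def pack_ccr(cond_codes):
    """Build condition code register contents from a {unit: bool} mapping."""
    ccr = 0
    for unit, value in cond_codes.items():
        ccr |= encode(unit, value)
    return ccr


def bitmap_enabled(ccr, mask):
    """Compare the register with a compile-time mask using AND and equality only.

    A mask of 0 means the output is unconditional.
    """
    return (ccr & mask) == mask


# Hypothetical state: unit 0 produced true, unit 1 produced false.
ccr = pack_ccr({0: True, 1: False})
```

The compiler would emit `encode(unit, expected)` as the mask, so at run time the hardware equivalent is a bitwise AND followed by an equality check, with no field decoding.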
[0024]
As described above, the data processing apparatus and control method of the present invention do not presuppose jumps by branch instructions, so loop processing that includes branch instructions can be executed in parallel without disturbing the flow of pipeline-parallel processing that characterizes VLIW. As a result, VLIW-style parallel or pipeline-parallel data processing can be exploited in loop operations, and adopting horizontal microcode eliminates the circuit complexity of the instruction fetch and decode units, so loop processing can be performed efficiently while power consumption is held down. In the field of signal processing, therefore, the frequently occurring loop processing can be handled efficiently by VLIW-type parallel operation.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
FIG. 1 shows an example of the data processing device of the present invention. The data processing device 1 includes a basic processor 10 and a parallel data processing unit 20. The basic processor 10 includes a code RAM 12 that stores an execution program 11, a fetch/decode unit 13 that fetches and decodes instructions of the execution program 11 from the code RAM 12, an execution unit 14 that performs arithmetic processing according to the decoded instructions, and a data RAM 15 that temporarily stores calculation results and data to be processed. The basic processor 10 has a configuration equivalent to, for example, a general-purpose RISC processor, and performs fetch, decode, operation, write-back, and similar stages in a pipeline.
[0026]
The parallel data processing unit 20 is a parallel execution unit controlled by horizontal microcode capable of executing a plurality of processes simultaneously. It includes a code memory 22 that stores a plurality of horizontal microcodes 21, a data path unit 23 that includes a plurality of registers and arithmetic units whose connections and/or functions are changed by the horizontal microcode 21, and a control unit 24 that manages the execution address of the code memory 22 and input/output to and from the code memory 22. As hardware capable of parallel processing, the data path unit 23 includes a plurality of general-purpose registers and a plurality of arithmetic units (ALUs), and a network connecting the registers and ALUs is formed by a plurality of selectors and wiring groups that switch these connections. The network connections of the data path unit 23 and the arithmetic functions of the ALUs are switched and set by the horizontal microcode 21.
[0027]
The parallel data processing unit 20 can exchange data with the data RAM 15 of the basic processor 10 through the data interface register group 28 and the register data bus 38, so the data path unit 23 can process data supplied from the data RAM 15 and output the results to the data RAM 15. In addition to an address control unit 25 that manages the execution address of the code memory 22, the control unit 24 of the parallel data processing unit 20 in this example includes a loop control unit 26 that controls loop processing in the data path unit 23 and an output control unit 27 that controls the outputs computed in parallel in the data path unit 23. These components are described in detail below.
[0028]
The basic processor 10 and the parallel data processing unit 20 of the data processing device 1 are connected by several signal lines or buses. First, when the fetch/decode unit 13 of the basic processor 10 fetches from the execution program 11 an instruction that activates the parallel data processing unit 20, for example "V_LOOP_INST1", it outputs a start signal or start instruction φs to the control unit 24 of the parallel data processing unit 20 via the instruction bus 31. The control unit 24 decodes the start instruction φs and starts the processing specified by the horizontal microcode 21 stored in the code memory 22. When the processing is complete, the control unit 24 sends an interrupt signal φi to the basic processor 10 via the interrupt signal line 32 to report the end of processing. The basic processor 10 then performs subsequent processing based on the processing result of the parallel data processing unit 20.
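The start and completion handshake can be modelled in a few lines of Python. This is a simplified, synchronous sketch: the class and function names, the instruction mnemonics other than "V_LOOP_INST1", and the use of a Python callable in place of horizontal microcode are all assumptions for illustration.

```python
class ParallelUnit:
    def __init__(self, microprograms):
        # Maps a start instruction to a callable standing in for its microcode.
        self.microprograms = microprograms
        self.interrupt = False            # models the interrupt signal φi

    def start(self, instruction, data):
        self.microprograms[instruction](data)
        self.interrupt = True             # report completion to the basic processor


def run_basic_processor(program, unit, data):
    """Execute a program, handing start instructions to the parallel unit."""
    log = []
    for instruction in program:
        if instruction in unit.microprograms:
            unit.start(instruction, data)     # models the start instruction φs
            log.append("start " + instruction)
            if unit.interrupt:
                unit.interrupt = False        # acknowledge the interrupt
                log.append("interrupt")
        else:
            log.append("exec " + instruction)
    return log


def sum_loop(data):
    """Hypothetical loop kernel run by the parallel unit."""
    data["sum"] = sum(data["x"])


unit = ParallelUnit({"V_LOOP_INST1": sum_loop})
data = {"x": [1, 2, 3], "sum": 0}
log = run_basic_processor(["LOAD", "V_LOOP_INST1", "STORE"], unit, data)
```

In hardware the two units run concurrently and the interrupt arrives later; the sketch collapses that into one call to keep the control flow visible.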
[0029]
The data RAM 15 of the basic processor 10 holds a horizontal microcode 21, or a microprogram 29 consisting of a plurality of horizontal microcodes 21, for downloading to the code memory 22 of the parallel data processing unit 20, and the microcode 21 in the code memory 22 can thereby be changed. The processing content of the parallel data processing unit 20 can therefore be changed by rewriting the code memory 22, and the basic processor 10 performs the rewriting under the execution program 11.
[0030]
In the data processing device 1 of this example, therefore, the parallel data processing unit 20 can execute a plurality of different processes. Because the parallel data processing unit 20 executes a plurality of operations simultaneously in parallel, several kinds of signal processing that have few steps, such as loop processing, but consume processing time can be performed efficiently in a short time. General-purpose processing other than signal processing, such as data interfacing and interrupt management, requires many steps but either has no parallelism or gains no speed from parallel processing, and can be performed by the basic processor 10 with low power.
[0031]
The functions required of the control circuit 24 of the parallel data processing unit 20, which operates in combination with the basic processor 10, are starting and ending its operation, transferring data to and from the code memory 22, managing the addresses of the horizontal microcode 21, loop processing and branch processing by reference to an address table, and controlling writes to the arithmetic units and to the general-purpose registers based on the condition codes of the arithmetic units' results. Only a very light state machine is therefore needed for control, and a high-speed data processing unit that performs parallel processing can be implemented in a small area. This ability to be controlled by a very light state machine is where the parallel data processing unit 20 controlled by horizontal microcode differs greatly from a general-purpose VLIW processor, which fetches and decodes VLIW instructions each describing a plurality of processes. At the same time, because the connections between the register group and the arithmetic unit group of the data path unit 23 can be changed at any time by the horizontal microcode, the unit differs from a conventional processing unit with a vector-processor-type data path that performs fixed array operations, and can raise execution speed through parallel processing while retaining a degree of versatility.
[0032]
In summary, the data processing device 1 of this example can be said to be an architecture that executes VLIW parallel or pipeline-parallel operations most efficiently by adding to the basic processor 10, which need not be a highly parallel VLIW type, a highly parallel VLIW data path unit (parallel data processing unit 20) that has a control mechanism based on the horizontal microcode 21, and by loading the contents of the horizontal microcode 21 from the data RAM 15 of the basic processor 10. The data processing device 1 of this example is therefore a compact integrated circuit device with low power consumption that is capable of high-speed processing.
[0033]
The data processing device 2 shown in FIG. 2 is a different example of the present invention. It also includes a basic processor 10 and a parallel data processing unit 20 and, like the data processing device 1 described above, can perform high-speed processing with low power consumption. It further includes a management unit 35 that rewrites the horizontal microcode 21 in the code memory 22 of the parallel data processing unit 20. The management unit 35 includes a program memory 36 that stores the microprogram 29 and a control unit 37 that controls input and output. The management unit 35 can therefore handle changes to the processing content of the parallel data processing unit 20 in place of the basic processor 10. In this configuration the amount of hardware increases by the management unit 35, but the basic processor 10 is relieved of the work of changing the processing content of the parallel data processing unit 20 and the associated overhead is eliminated. The processing efficiency of the basic processor 10 thus improves, and with it the processing speed of the data processing device as a whole.
[0034]
The data processing device 2 further includes a clock control unit 34 that controls the supply of the clock signal φc to the parallel data processing unit 20. In accordance with an on/off signal or on/off instruction φo supplied from the basic processor 10 via the control line 33, the clock control unit 34 supplies the clock signal φc to the parallel data processing unit 20 only when the basic processor 10 has issued an instruction to the parallel data processing unit 20 and the unit is operating. Thus, by giving the data processing device 2 an architecture that combines the basic processor 10 with a parallel data processing unit 20 whose clock signal can be switched on and off, further power savings can be expected from controlling the clock signal.
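The effect of gating the clock can be illustrated with a minimal Python model. The class, its counters, and the cycle counts used below are assumptions for illustration and say nothing about actual power figures.

```python
class ClockControl:
    """Passes clock cycles to the parallel unit only while enabled."""

    def __init__(self):
        self.enabled = False          # state set by the on/off instruction φo
        self.total_cycles = 0         # cycles of the system clock
        self.delivered_cycles = 0     # cycles of φc that reach the parallel unit

    def set_enable(self, on):
        self.enabled = on

    def tick(self, cycles=1):
        self.total_cycles += cycles
        if self.enabled:
            self.delivered_cycles += cycles


# Hypothetical timeline: general-purpose work, a loop kernel, more general work.
clock = ClockControl()
clock.tick(5)                 # basic processor only; parallel unit clock is off
clock.set_enable(True)        # basic processor starts the parallel unit
clock.tick(3)
clock.set_enable(False)       # kernel finished; clock is gated again
clock.tick(2)
```

In this timeline the parallel unit is clocked for 3 of 10 cycles, and because the basic processor issues the start instruction itself it knows exactly when to switch the clock on.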
[0035]
In these data processing apparatuses 1 and 2 (hereinafter, the data processing apparatus 1 will be described as a representative), the basic processor 10 and the parallel data processing unit 20 cooperate to execute one application program. Therefore, one application program is compiled by a compiler having two modes to generate the execution program 11 for the basic processor and the horizontal microcode 21 of the parallel data processing unit 20.
[0036]
FIG. 3 shows the process of generating the execution program 11 and the horizontal microcode 21. In step A, an original program Co to be executed by the data processing device 1 is provided. The original program Co includes loop processing C_LOOP and the like, together with other general-purpose processes C1 and C2 such as data interface processing and interrupt processing. In step B, the part suitable for parallel execution is extracted from the original program Co. In this example, the loop processing C_LOOP is separated as the part suitable for parallel execution. The program from which the loop processing C_LOOP has been extracted becomes the execution source program Ce, in which that part is replaced with an instruction for starting the processing C_LOOP and an instruction for detecting its end. That is, an instruction interface in the form of a function call and return is provided between the processing C_LOOP and the source program Ce. Alternatively, it is also possible to write a value to an interface register of the parallel data processing unit 20 and start the processing of the parallel data processing unit 20 with this value.
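Step B can be sketched in Python as a simple partitioning pass. The tagged-tuple representation of the program, the `wait_end` marker, and the automatic numbering of start instructions are assumptions for illustration; only the name "V_LOOP_INST1" follows the example start instruction used in this description.

```python
def partition(original):
    """Split a program into an execution source and loop kernels.

    original: list of (kind, body) pairs, where kind is "general" or "loop".
    Returns (exec_source, microprograms).
    """
    exec_source, microprograms = [], {}
    for kind, body in original:
        if kind == "loop":
            name = "V_LOOP_INST%d" % (len(microprograms) + 1)
            microprograms[name] = body                # goes to the microcode compiler
            exec_source.append(("start", name))       # replaces the loop in Ce
            exec_source.append(("wait_end", name))    # detects the end of processing
        else:
            exec_source.append(("general", body))     # stays with the basic processor
    return exec_source, microprograms


# Hypothetical original program Co: general code around one loop.
Co = [("general", "C1"), ("loop", "C_LOOP"), ("general", "C2")]
Ce, kernels = partition(Co)
```

The start and wait pair plays the role of the function call and return interface, so the source program Ce keeps its original order.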
[0037]
Next, in step C, these programs are compiled. The compiler 40 includes a first step or function Cc1 that generates the execution program for the basic processor and a second step or function Cc2 that generates the horizontal microcode for the parallel data processing unit. The first function Cc1 is a general-purpose C compiler: the first compiler Cc1 converts the source program Ce, which contains the general-purpose processes C1 and C2 and is written in C, into the execution program (object program) 11 of the basic processor 10. The second function Cc2 is also a kind of C compiler: the second compiler Cc2 converts the processing C_LOOP extracted in step B, which consists mainly of repetitive processing, into horizontal microcode 21 or a horizontal microprogram 29 so that it can be executed in parallel.
[0038]
The functions of the first compiler Cc1 and the second compiler Cc2 are provided as a program or program product executable by a computer having appropriate resources. That is, the compiler for the data processing device of this example has two modes, whose outputs are the object code 11 of the basic processor and the horizontal microcode for parallel processing in which algorithms such as repetitive signal processing are described, and the two are interfaced by function call and return. The functions Cc1 and Cc2 may be provided as separate program products or as a single program product.
[0039]
In the parallel data processing unit 20 of this example, loops are handled by loop counters as described later. The second compiler Cc2 therefore converts the loop processing C_LOOP into, in addition to the horizontal microcode 21, set values for the loop counters and an address table for branching when a counter counts out. That is, during the loop processing, the horizontal microcode 21 determines the connection between the register file of the data path unit 23 and the operation unit group via the selector group, and instructs execution of parallel or pipeline parallel processing. The branch address table is a table that stores the return address used when the loop processing completes; it is usually held in a register file and installed in the control circuit 24.
[0040]
When the application program Co can be parallelized using the horizontal microcode in this manner, the contents to be executed in parallel by the parallel data processing unit 20 take the form of a subroutine, which need only be loaded into the horizontal microcode SRAM (code memory) 22 and started from the basic program (execution program) 11. There is therefore no need to make the basic program 11 itself a VLIW program with a long instruction length, and the basic processor 10 need not adopt the VLIW format for the basic instructions it fetches and decodes. On the other hand, the processing converted to horizontal microcode is executed in the data path unit 23 as parallel or pipeline parallel processing in VLIW form. Therefore, unlike a VLIW processor, it is possible to provide a compact data processing device 1 that dispenses with a large, power-hungry fetch and decode section and yet can execute parallel pipeline processing with a degree of parallelism similar to that of a VLIW processor.
[0041]
FIG. 4 shows an example of the parallel data processing unit 20. FIG. 5 is a block diagram partially showing the parallel data processing unit 20 in more detail. The data path unit 23 of the parallel data processing unit 20 includes four operation units 51 that can operate in parallel, so one horizontal microcode 21 stored in the code memory 22 can execute four processes or sub-instructions simultaneously. As shown in FIG. 5, each arithmetic unit 51 includes one ALU 51o, input registers 51a and 51b, and two-stage selectors 51c, 51d, 51e and 51f for selecting the inputs to the input registers 51a and 51b.
[0042]
The arithmetic unit (ALU) 51o functions as a multiplier, an adder, and the like, and each ALU 51o also has a corresponding condition code storage register. The ALU 51o and the input registers 51a and 51b are connected in a fixed manner, with no switchable connection between them. On the other hand, the first-stage selectors 51c and 51d can select the inputs to the input registers 51a and 51b from the general-purpose register group (general-purpose register file) 52, and the second-stage selectors 51e and 51f can further select the inputs to the input registers 51a and 51b from sources that include the outputs of their own and the other ALUs 51o.
[0043]
The output of each arithmetic unit 51 can be switched and connected to the input of any arithmetic unit 51 including itself, the general-purpose register group 52, the data output unit 53, and the address output unit 54. The output of each arithmetic unit 51 can also be output to the loop control unit 26 of the control circuit 24 for the initial setting of the number of loops. Further, the outputs of the individual arithmetic units 51 can also be output to the output control unit 27 for condition determination.
[0044]
The general-purpose register group 52 includes a plurality of, for example, eight general-purpose registers 52a, a selector 52b for selecting an input thereto, and a selector 52c for selecting a write enable signal WE for the general-purpose register 52a. Each general-purpose register 52a can select not only the output of the arithmetic unit 51 but also the output from the input data register 28a of the interface register group 28. Writing to each general-purpose register 52a is determined by the write enable signal WE. In the parallel data processing unit 20 of this example, writing to the general-purpose register 52a can be controlled by any one of the results calculated by the arithmetic units 51. The selector 52c is provided for this purpose.
[0045]
The data output unit 53 selects either the output of the general-purpose register group 52 or the output of the arithmetic unit 51, and outputs the selected output to the data output register 28b of the interface register group 28. The address output unit 54 selects either the output of the general-purpose register group 52 or the output of the arithmetic unit 51 and outputs the selected output to the address output register 28c of the interface register group 28. Further, the horizontal microcode 21 outputs a signal for controlling the timing of writing or reading to the data RAM control register 28d.
[0046]
Therefore, the parallel data processing unit 20 of this example can take data from the general-purpose register of the basic processor 10 or the data RAM 15 into the data path unit 23 via the interface register group 28 and the data bus 38, process it, and feed the result back to the basic processor 10. That is, when reading, the address in the general-purpose register of the basic processor 10 or the data RAM 15 is output from the address output unit 54, and the data is read via the input data register 28a. When writing, the address is output from the address output unit 54, the data is output from the data output unit 53, and the data is written to the data RAM 15 or the general-purpose register by the WE signal that is output to the basic processor via the data RAM control register 28d.
[0047]
The connections of the data path unit 23, including these operation units, registers and other circuit elements, and the operation performed by each ALU 51o are controlled by the horizontal microcode 21. Therefore, data supplied from the data RAM 15 of the basic processor 10 can be processed according to the data flow graph set by the horizontal microcode 21 and fed back to the data RAM 15 of the basic processor 10.
[0048]
The control unit 24 that controls the data path unit 23 includes a loop control unit 26 that controls loops, an address control unit 25 that controls addresses, and an output control unit 27 that controls output based on condition determination. When values have been set in registers and the like from the basic processor 10 and the activation conditions are satisfied, the parallel data processing unit 20 is activated by a call from the basic processor 10 using a VU instruction that the parallel data processing unit 20 can decode. The address control unit 25 decodes the activation VU instruction and activates the horizontal microcode 21 at a predetermined address. Then, when the processing of the predetermined steps is completed, an end VU instruction or an interrupt instruction is supplied to the basic processor 10, and the processing continues on the basic processor 10 side. The address control unit 25 is, for example, an FSM such as a sequencer. During the execution of the horizontal microcode 21, these control units control the data path unit 23 according to the instructions (sub-instructions) and parameters included in the horizontal microcode 21.
[0049]
FIG. 6 shows an example of the format of the horizontal microcode 21. One horizontal microcode 21 includes instruction fields 55 for describing four sub-instructions, a cycle flag field 56 for specifying the cycle to be referred to when determining a condition, and a loop counter selection field 57 for selecting a loop counter. Each of the instruction fields 55 further includes an operand field 55a for designating an operation, a register field 55b for designating the registers to be used, a slot number field 55c for designating which operation unit 51 of the data path unit 23 executes the processing, and a selection condition field 58 in which a bitmap φbm to be compared with the condition codes of the ALUs 51o at the time of condition determination is described. The bitmap φbm is 10-bit data and indicates, for each of the five slots #0 to #4, whether the process is selected on the true side or the false side.
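The field layout described for FIG. 6 can be pictured as a record type. The following is a minimal sketch in Python, not the format defined by the specification: the attribute names (operation, registers, slot, select_bitmap, cycle_flag, loop_select) are hypothetical, and because the text gives no bit widths other than the 10-bit bitmap φbm, no other widths are modeled. Numbering the operation units 1 to 4 follows the later statement that slot number 0 belongs to a fictitious ALU.

```python
from dataclasses import dataclass
from typing import List

NUM_SLOTS = 4      # four operation units 51 -> up to four sub-instructions
BITMAP_BITS = 10   # 2 bits for each of slots #0 to #4


@dataclass
class SubInstruction:
    """One instruction field 55 (attribute names are hypothetical)."""
    operation: str          # operand field 55a, e.g. "cmp" or "mov"
    registers: List[str]    # register field 55b
    slot: int               # slot number field 55c (1..4)
    select_bitmap: int      # selection condition field 58 (bitmap φbm)

    def __post_init__(self):
        if not 1 <= self.slot <= NUM_SLOTS:
            raise ValueError("slot must name one of the four operation units")
        if not 0 <= self.select_bitmap < (1 << BITMAP_BITS):
            raise ValueError("bitmap must fit in 10 bits")


@dataclass
class HorizontalMicrocode:
    """One horizontal microcode word 21."""
    sub_instructions: List[SubInstruction]   # instruction fields 55
    cycle_flag: bool = False                 # cycle flag field 56
    loop_select: int = 0                     # loop counter selection field 57

    def __post_init__(self):
        if len(self.sub_instructions) > NUM_SLOTS:
            raise ValueError("at most four sub-instructions per word")
        slots = [s.slot for s in self.sub_instructions]
        if len(slots) != len(set(slots)):
            raise ValueError("each operation unit can be used once per word")
```

The validation in `__post_init__` only enforces what the text states: four operation units, one sub-instruction per unit, and a 10-bit selection bitmap.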
[0050]
FIG. 7 shows a configuration in which the number of loops is controlled by the loop counter selection field 57 of the horizontal microcode 21. Several bits can be stored in the loop counter selection field 57 of the horizontal microcode 21, and a loop register is designated by them. For example, with 2 bits, 00: no loop, 01: loop register #1 designated, 10: loop register #2 designated, 11: loop register #3 designated, and so on.
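The 2-bit encoding given as an example can be decoded as follows. This is a small illustrative sketch; the function name select_loop_register is hypothetical and the 2-bit width is only the example width mentioned in the text.

```python
from typing import Optional


def select_loop_register(field: int) -> Optional[int]:
    """Decode the 2-bit loop counter selection field 57.

    Returns the loop register number (1..3), or None for "no loop" (00).
    """
    if not 0 <= field <= 0b11:
        raise ValueError("selection field is 2 bits in this example")
    return None if field == 0b00 else field
```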
[0051]
The loop control circuit 26 includes three loop counters 26a. When the horizontal microcode 21 is executed a plurality of times, the loop counter 26a that controls that horizontal microcode 21 is selected. The circuits 26X having the respective loop counters 26a are independent of each other, and each includes the loop register 26a, a loop initial value register 26b, a decrementer 26c for counting down, and an initialization circuit 26d that resets the loop counter 26a to the value of the initial value register 26b when the counter 26a counts out (reaches 0 in this example). Of course, a circuit that uses an incrementer and counts out when the value of the loop counter 26a matches the value of the initial value register 26b is also possible.
[0052]
Further, the address control unit 25 is composed of an FSM and has a branch address table 25a that stores the loop processing start address (return address) corresponding to the number of each loop register (loop counter) 26a. The initial contents of the branch address table 25a are created by the compiler Cc2 dedicated to the parallel data path described above and are loaded together with the horizontal microcode 21. The initial value of each loop counter 26a is set in the initial value register 26b as the number of loops before, or immediately before, the function of the parallel data processing unit 20 is called from the execution program 11.
[0053]
When the data path unit 23 starts the loop processing, first, the counter 26a is selected based on the value of the loop counter selection field 57 of the horizontal microcode 21. Since the counter 26a has been initialized, each time the horizontal microcode 21 is executed, the loop register or loop counter 26a associated with the loop counter selection field 57 is decremented. Then, when the counter 26a becomes "0", the FSM control unit 25b of the address control unit 25 catches the zero-detection signal, and the loop start address or return address stored in the branch address table 25a for that loop register 26a is output from the code address output unit 25c as the address of the next horizontal microcode 21. At the same time, the loop register 26a is reset to the value of the loop initial value register 26b to prepare for the next loop processing.
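The counter behavior described here (decrement on each execution, zero detection, branch table lookup, reload from the initial value register) can be modeled in a few lines. The following Python sketch is an illustration under assumptions, not the circuit itself: the class and method names are hypothetical, and what the sequencer does while the counter is still non-zero is deliberately left unmodeled because the text does not spell it out.

```python
class LoopCircuit:
    """One circuit 26X: loop register 26a, initial value register 26b,
    decrementer 26c and initialization circuit 26d."""

    def __init__(self, initial: int):
        if initial < 1:
            raise ValueError("loop count must be at least 1")
        self.initial = initial   # loop initial value register 26b
        self.count = initial     # loop register / counter 26a

    def tick(self) -> bool:
        """Decrement once; on reaching 0, reload and report zero detection."""
        self.count -= 1                  # decrementer 26c
        if self.count == 0:
            self.count = self.initial    # initialization circuit 26d
            return True
        return False


class AddressControl:
    """Hypothetical model of the address control unit 25."""

    def __init__(self, circuits: dict, branch_table: dict):
        self.circuits = circuits          # loop register number -> LoopCircuit
        self.branch_table = branch_table  # branch address table 25a

    def after_execute(self, loop_select: int):
        """Called once per executed microcode word.

        Returns the branch table address when the selected counter hits
        zero, otherwise None (next-address choice is not modeled here).
        """
        if loop_select == 0:              # 00: no loop
            return None
        if self.circuits[loop_select].tick():
            return self.branch_table[loop_select]
        return None
```

Note that no address comparison appears anywhere in the model: the only test is whether a small counter has reached zero.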
[0054]
As a result, in the parallel data processing unit 20 of the present embodiment, when executing the loop processing, it is not necessary to perform the loop processing determination by comparing the loop end address and the execution address. Therefore, there is an advantage that a large comparator for comparing addresses is not required. The large number of bits in the address comparator can be a critical path for delay control, which can slow down processing. On the other hand, according to the method of selecting the loop counter of the present example, the number of bits is very small, so there is no possibility that the processing speed is affected.
[0055]
Further, according to this method, even when a plurality of loop processes have a nested structure, control of the number of loops is very simple. It is possible to cope with a nested structure only by specifying a different loop counter 26a and controlling the number of loops without requiring information such as the number of nested loops existing in the loop. For this reason, in the present example, up to a triple nest structure can be executed with the same general-purpose control as the single loop processing.
[0056]
Note that initialization is required for pipeline operation. If resources such as the memory addresses required by each loop in a nested loop description differ, it is effective to initialize each loop process at its first stage.
[0057]
FIG. 8 shows how the loop processing is converted to horizontal microcode and executed by the parallel data processing unit 20. In the execution program 11 for controlling the basic processor 10, the parallel data processing unit 20 is initialized in step 61. In this example, the address of the horizontal microcode 21 to be loaded into the code memory 22 of the parallel data processing unit 20 is set from the microcode program 29 stored in the data RAM 15 by the instruction 61a. In addition, an address for storing the horizontal microcode 21 in the code RAM 22 of the parallel data processing unit 20 is transferred to the address register of the parallel data processing unit 20 by the instruction 61b. The instruction 61c sets data to be loaded into the code memory 22, that is, the horizontal microcode 21 in a register. Further, the instruction 61d transfers data to the data register of the parallel data processing unit 20 for storing the horizontal microcode 21. This is repeated by a necessary bit width, and the horizontal microcode 21 is written to the specified address of the code memory 22 by the instruction 61e. This is repeated by the number of steps necessary for performing processing in the parallel data processing unit 20. Thereby, the code memory 22 of the parallel data processing unit 20 is initialized.
[0058]
Therefore, in this step 61, the basic processor 10, which is the first processing unit, can rewrite the code memory 22 of the parallel data processing unit 20, which is the second processing unit, with instructions included in the execution program 11, so that various processes can be performed by the parallel data processing unit 20. In the data processing system 2 shown in FIG. 2, the operation of updating the microcode of the parallel data processing unit 20 is executed by the management unit 35 according to an instruction from the basic processor 10. Therefore, most of the transfer processing using the registers in step 61 is performed between the management unit 35 and the parallel data processing unit 20. During this time, different processing can be executed by the basic processor 10. For this reason, the basic processor 10 can be released from the processing for changing the processing content of the parallel data processing unit 20, that is, from most of the processing in step 61, and the processing speed of the data processing system 2 can be improved.
[0059]
Next, in step 62, initial values are set in the initial value setting registers (VR_LOOP1-3) 26b in order to execute the loop processing. The instruction 62a transfers a value to the initial value register VR_LOOP1 of #1, the instruction 62b transfers a value to the initial value register VR_LOOP2 of #2, and the instruction 62c transfers a value to the initial value register VR_LOOP3 of #3. In this step 62, the basic processor 10, which is the first processing unit, can set the number of loops stored in the data RAM 15 or in the execution program 11 into the initial value setting registers 26b, which are loop control registers, so that a desired number of loops can be specified.
[0060]
Further, in the instruction 62d, if other parameters are required, they are also transferred to the register. Then, an instruction to activate the horizontal microcode 21 using the loop register is executed by the instruction 62e. The instruction “V_LOOP_INST1” is supplied as an activation instruction φs from the fetch / decode unit 13 of the basic processor 10 to the parallel data processing unit 20, and the parallel data processing unit 20 starts processing.
[0061]
As shown in FIG. 8, the processing executed by the parallel data processing unit 20 is a triple nested loop. The basic processor 10 sets the horizontal microcode 21 of the parallel data processing unit 20 and the initial value registers 26b by means of the execution program 11 and starts the processing so that the horizontal microcode 21 performing each loop uses a different loop counter 26a. The nested loop processing can therefore be controlled easily. In each loop, the horizontal microcode 21 executes four processes simultaneously or in parallel, so the loop processing can be executed at high speed.
[0062]
In step 63, the basic processor 10 waits for the processing of the parallel data processing unit 20 to end, and starts the next processing. While the parallel data processing unit 20 is performing the loop processing, the basic processor 10 can also perform different processing in parallel. In step 63, the process waits for a signal (VUWAIT signal) indicating completion of loop processing from the parallel data processing unit 20 in response to the instruction 63a. Next, the processing results of the parallel data processing unit 20 are transferred to the basic processor 10 by instructions 63b to 63d.
[0063]
Of course, an “if-then-else statement” can be described within the “loop statement” executed by the parallel data processing unit 20. In the parallel data processing unit 20 of this example, the processing of the “then clause” and that of the “else clause” are not executed selectively; both are executed. That is, the compiler Cc2 that converts the loop processing into the horizontal microcode 21 prepares two selectable processes for a conditional branch, and both the “then clause” and the “else clause” are executed first. The statement of the clause not selected by the “true” or “false” result of the “if condition” is executed as far as the arithmetic unit 51, but the conditional branch is realized by ultimately not writing its output to the condition code register or a general-purpose register. As another method, a “jump statement” can be generated to branch. Executing the alternatives simultaneously is effective when the “then clause” or “else clause” is shallow, and branching with a jump statement is effective when the “then clause” or “else clause” is long or the two are unbalanced.
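The idea of executing both clauses and suppressing the write of the unselected one can be illustrated in software. The following Python sketch is only an analogy under stated assumptions: the function name, the dictionary standing in for the general-purpose registers, and the sample values are all hypothetical.

```python
def if_then_else_predicated(cond: bool, then_fn, else_fn, regs: dict, dest: str) -> None:
    """Evaluate both clauses, then commit only the selected result."""
    then_result = then_fn(regs)   # executed regardless of cond
    else_result = else_fn(regs)   # executed regardless of cond
    # The write enable is asserted only for the selected clause.
    candidates = [(cond, then_result), (not cond, else_result)]
    for write_enable, value in candidates:
        if write_enable:
            regs[dest] = value


# Example: r = a - b if a > b else b - a
regs = {"a": 5, "b": 3, "r": 0}
if_then_else_predicated(regs["a"] > regs["b"],
                        lambda r: r["a"] - r["b"],
                        lambda r: r["b"] - r["a"],
                        regs, "r")
```

Both lambdas run on every call; only the destination write depends on the condition, which is why this approach suits short clauses and becomes wasteful for long or unbalanced ones.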
[0064]
FIG. 9 shows a schematic configuration of the output control unit 27 prepared in the parallel data processing unit 20. In the parallel data processing unit 20, the output control unit 27 controls the output of another arithmetic unit 51 based on the result of the arithmetic unit 51 that performs the operation of the “if condition”, so that the “true” / “false” determination can be made. The result can be reflected as an output of the data path unit 23. For this reason, the options, that is, the “then clause” and the “else clause” can be executed simultaneously, and it is possible to prevent the efficiency of the parallel pipeline processing from being reduced due to the conditional branch.
[0065]
The output control unit 27 of the present example includes four condition code (CC) output units 72, each of which outputs the one-bit “true” or “false” determination result calculated in the ALU 51o of one of the four operation units 51 as a condition code (CC) φcc to a condition code register set (CCRS) 71. The condition code output from the ALU 51o is not limited to one bit, but this example focuses on a one-bit condition code indicating true or false. The present invention is therefore not limited to a data processing device provided with an arithmetic unit such as an ALU that outputs a 1-bit condition code, and is also applicable to a data processing device that outputs a condition code of a plurality of bits.
[0066]
Each CC output unit 72 includes a conversion circuit 73 that converts the 1-bit condition code φcc into a 2-bit condition code with complementary polarity (revised condition code RCC) φrc, and an output selection unit 74 that selects whether or not to output the 2-bit RCC φrc to the CCRS 71. Accordingly, if the arithmetic unit 51 is not selected by the preceding “if condition”, an RCC φrc of “00” is stored at the address of the CCRS 71 corresponding to that arithmetic unit 51. The number of the arithmetic unit 51 and the address of the CCRS 71 thus correspond one-to-one. For example, the output of the arithmetic unit 51 of #1 is stored at address #1 (address 1) of the CCRS 71. On the other hand, if the unit is selected and the operation result of the operation unit 51 is “true (1)”, an RCC φrc of “10” is stored. If the operation result is “false (0)”, an RCC φrc of “01” is stored.
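The three RCC values named in this paragraph amount to a small truth table. As a minimal sketch (the function name to_rcc and the use of strings for the 2-bit codes are illustrative choices, not part of the specification):

```python
def to_rcc(selected: bool, cc: int) -> str:
    """Convert a 1-bit condition code into the 2-bit RCC.

    "00" = unit not selected, "10" = true, "01" = false.
    """
    if cc not in (0, 1):
        raise ValueError("cc must be a single bit")
    if not selected:
        return "00"
    return "10" if cc == 1 else "01"
```

With this encoding a "true" and a "false" result each set a different bit, so a single bit test is enough to ask "was this unit selected and true?" or "selected and false?".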
[0067]
Although the CCRS 71 storing the 2-bit RCC φrc is a so-called predicate register, it does not have the generality of a true predicate register; for example, the ALU number and the CCRS address correspond one-to-one. Further, the CCRS 71 stores the condition code of a fictitious ALU in CCRS #0 (address 0). The RCC at address 0 controls the entire program, and in the case of AND logic, “10”, which means true, is stored there in advance. Therefore, an instruction that is to be executed without being influenced by a preceding “if condition” can always be executed by comparing it with the RCC at address 0 of the CCRS 71. For this CCRS #0, it is also possible to replace the 2-bit "10" with 1-bit data "1".
[0068]
The condition code φrc of each operation unit 51 stored in the CCRS 71 is fed back to the output selection circuit 74 to determine whether to output the operation result of the next operation unit 51. The output control unit 27 of the present example further includes a condition code selection circuit (CC selection circuit) 75 that can feed back to the output selection circuit 74 the condition code φrc before it is stored, instead of the condition code φrc stored in the CCRS 71. The CC selection circuit 75 is controlled by the data of the cycle flag field 56 of the horizontal microcode 21, and when the cycle flag is set it supplies the RCC φrc being output to the CCRS 71 to the output selection circuit 74. As a result, the output selection circuit 74 can refer to the operation results of the other operation units 51 in the same cycle, so that the operation of a conditional branch and an operation whose output depends on that result can be executed in parallel in the same cycle. Operations including conditional branches can therefore be executed in parallel more efficiently, and the processing speed of the parallel data processing unit 20 can be further improved. Note that the 0th condition code of the CCRS 71 is not output from any arithmetic unit 51 and is therefore supplied from the CCRS 71 to the output selection circuit 74 in every case.
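The choice made by the CC selection circuit 75 can be sketched as a simple multiplexer. In the Python model below, the function name and the list-of-strings representation of the five RCC entries are assumptions made for illustration; index 0 of the incoming list is a placeholder because no arithmetic unit drives slot #0.

```python
def rcc_seen_by_output_selection(cycle_flag: bool, stored: list, incoming: list) -> list:
    """Model of the CC selection circuit 75.

    stored   : RCCs held in CCRS 71 for addresses 0..4 (previous cycle)
    incoming : RCCs being written this cycle for addresses 1..4
               (index 0 is ignored; slot #0 has no ALU behind it)
    """
    if len(stored) != 5 or len(incoming) != 5:
        raise ValueError("expected five 2-bit entries (slots #0 to #4)")
    source = incoming if cycle_flag else stored
    # Address 0 always comes from the CCRS itself.
    return [stored[0]] + list(source[1:])
```

When the cycle flag is clear the function returns the stored values unchanged, which corresponds to referring to the branch result of the previous cycle.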
[0069]
FIG. 10 shows a schematic configuration of the output selection circuit 74. The output selection circuit 74 includes a first determination circuit 76 that calculates, bit by bit, the logical product of the 10-bit bitmap φbm described in each instruction field 55 of the horizontal microcode 21 and the 2-bit RCCs φrc of the 0th to 4th addresses, and then calculates the logical sum of the results. In the first determination circuit 76, if any position set to “1” in the bit string of the bitmap φbm coincides with any position set to “1” in the bit string of all the RCCs φrc (10 bits), it is determined that the output selection circuit 74 has been selected. That is, since the operation unit 51 corresponding to the output selection circuit 74 has been selected by the operation of the “if condition” in the previous cycle or the same cycle, writing the output of that operation unit 51 to the general-purpose register 52a or the CCRS 71 is permitted.
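The bitwise AND followed by an OR reduction in the first determination circuit 76 can be written in one expression. The sketch below is illustrative: the function name is_selected is hypothetical, and placing slot #0 in the two most significant bits is an assumption about bit ordering that the text does not fix.

```python
def is_selected(bitmap: int, rccs: list) -> bool:
    """First determination circuit 76: bitwise AND, then OR-reduce.

    bitmap : 10-bit φbm, slot #0 assumed in the two most significant bits
    rccs   : five 2-bit strings for CCRS addresses 0..4, e.g. "10"
    """
    if len(rccs) != 5 or any(len(r) != 2 or set(r) - {"0", "1"} for r in rccs):
        raise ValueError("expected five 2-bit RCC strings")
    rcc_word = int("".join(rccs), 2)     # concatenate to 10 bits
    return (bitmap & rcc_word) != 0      # any matching '1' position selects
```

For instance, a bitmap that marks only the "then" bit of slot #0 is always selected as long as address 0 of the CCRS holds "10", which is how unconditional sub-instructions are expressed.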
[0070]
The output selection circuit 74 further has a function of outputting, when selected by the first determination circuit 76, either a WE signal φwe permitting writing to the general-purpose register 52a or a signal φce permitting writing of a condition code to the CCRS 71, depending on the operand described in each instruction field 55 of the horizontal microcode 21. If the process selected by the operation of the “if condition” is itself a condition comparison, that is, the operation of an “if condition”, the signal φce permitting RCC writing is supplied from the output selection circuit 74 to the CC conversion circuit 73, and the RCC φrc converted to 2 bits is output. On the other hand, if the process selected by the operation of the “if condition” involves output to the general-purpose register 52a, for example an arithmetic instruction or a transfer instruction, the write enable signal φwe is supplied from the output selection circuit 74 to the general-purpose register group 52. As a result, the result processed in parallel by the operation unit 51 is output or not output according to the result of the “if condition” that selects that operation unit 51, so that the same result as branching (jumping) on the “if condition” is obtained by parallel processing.
[0071]
FIG. 11 shows a description example that has a tree structure of “if-then-else statements”, in which all instructions are included in the horizontal microcode 21 as sub-instructions and executed in parallel, and only the execution result of the correct path selected by the conditional branches is output. The bold line indicates the path selected at the time of execution.
[0072]
FIG. 12 shows the horizontal microcode 21 generated to execute this description example. The horizontal microcode 21 includes “conditional compare statements” and “conditional move statements” as sub-instructions. In the first cycle, three "conditional compare statements" (OP-1, OP-2 and OP-3) and one "conditional move statement" (OP-4) are executed in parallel. Each of the processes OP-2, OP-3, and OP-4 is selected or not selected depending on the result of the process OP-1. Therefore, a flag is set in the cycle flag field 56, and the operation results of the other operation units 51 in the same cycle are referred to by the output selection circuit 74. As can be seen from the figure, in the data processing device 1 of the present example, the nine processes OP-1 to OP-9 shown in FIG. 11 can be processed in three cycles including the branches, and no cycles are wasted because of a branch; that is, no cycle penalty occurs. The instructions that can be executed by the horizontal microcode 21 are not limited to condition determination and may be multiplication instructions, addition instructions, or other instructions. Therefore, by horizontally expanding and executing a product-sum operation or the like that requires loop processing, the performance can be significantly improved as compared with a case where the same processing is executed by the basic processor.
[0073]
FIG. 13 shows the contents of OP-1 to OP-9 included in the horizontal microcode 21 expanded in the sub-instruction fields 55. FIG. 14 shows the description contents of each of the sub-instructions OP-1 to OP-9. The operand field 55a describes a "conditional compare statement" or a "conditional move statement", and the register field 55b defines the registers to be compared or the registers between which data is transferred. In the slot number field 55c, information for specifying the operation unit 51 is described. In this specification, a number that identifies an arithmetic unit 51 is called a slot number. Further, a 10-bit bitmap φbm is stored in the selection condition field 58.
[0074]
The bitmap φbm is information indicating whether each of the sub-instructions OP-1 to OP-9 is selected on the true side or the false side of each of the five slots #0 to #4, including the fictitious ALU (slot number 0). That is, the bitmap φbm indicates whether each of the sub-instructions OP-1 to OP-9 is described on the “then” side or the “else” side of a preceding “if-then-else statement”. The information for each slot is described by two bits: the first bit (left bit) of each slot indicates selection on the “then” side, and the second bit (right bit) indicates selection on the "else" side. Therefore, by comparing the bitmap in the output selection circuit 74 with the CCRS 71, which has the same arrangement, or with the data RCC φrc being written to the CCRS 71, it can be determined whether or not the output of the arithmetic unit 51 of the corresponding slot number has been selected.
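The two-bits-per-slot layout can be made concrete with a small encoder. The following Python sketch is illustrative only: the function name encode_bitmap and the dictionary input are hypothetical, and arranging the slots left to right from #0 to #4 is an assumption, since the text says only that the bits are arranged corresponding to the slot numbers.

```python
def encode_bitmap(conditions: dict) -> str:
    """Build the 10-bit bitmap φbm as a string of five 2-bit groups.

    conditions maps a slot number (0..4) to "then" or "else".
    Per slot: first (left) bit = "then" side, second bit = "else" side.
    """
    unknown = set(conditions) - set(range(5))
    if unknown:
        raise ValueError("slot numbers must be 0..4")
    pairs = []
    for slot in range(5):
        side = conditions.get(slot)
        if side is None:
            pairs.append("00")       # this slot's condition is not referred to
        elif side == "then":
            pairs.append("10")
        elif side == "else":
            pairs.append("01")
        else:
            raise ValueError("side must be 'then' or 'else'")
    return " ".join(pairs)
```

For example, `encode_bitmap({0: "then"})` gives `10 00 00 00 00`, the pattern for a sub-instruction that depends only on the global condition code in slot #0.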
[0075]
For example, the sub-instruction OP-1 is selected on the “then” side (#01) of slot number #0, which is the global condition code, so it is executed regardless of any conditional statement. The sub-instruction OP-2 is selected on the “then” side (#11) of slot number #1, and the cycle flag 56 is set in the first cycle as shown in FIG. 12. It is therefore selected on the “then” side of the first slot of the same cycle, that is, of the sub-instruction OP-1. The same applies to the other sub-instructions. For the sub-instructions OP-5 to OP-9, the cycle flag 56 is not set, so the RCC φrc of the previous cycle stored in the CCRS 71 is compared with the bitmap φbm of the sub-instruction assigned to each slot number. Since the sub-instructions OP-1 to OP-3 are "conditional compare statements", their output is controlled by whether or not the output of the ALU 51o is written to the CCRS 71. On the other hand, since the sub-instructions OP-4 to OP-9 are “conditional move statements”, their output is controlled by whether or not the WE signal φwe for writing to the general-purpose register 52a is output.
[0076]
Therefore, it can be said that the horizontal microcode 21 having the bitmap φbm according to the present invention has a format in which the CCRS state that determines its own path is stored in the instruction code as a bitmap. The information for selecting the output of each arithmetic unit 51 is not limited to the bitmap φbm, but the following information is required. First information indicates that the output is selected independently of the condition code of any arithmetic unit 51; in this example, this is slot number 0 (the global condition code). Second information indicates the operation unit 51 whose condition code is to be compared; in this example, the order of the bits (in groups of two) arranged corresponding to the slot numbers of the bitmap φbm conveys this information. Third information indicates whether the condition code to be compared is true or false; in this example, this is the 2-bit data assigned to each slot. Therefore, in the bitmap φbm of the present example, the logical operation result of the operation unit 51 whose slot number is indicated by the bit order is referred to by means of the two bits corresponding to the third information. When one of the two bits is "1", that is, when the 2-bit data is "10", the output is selected when the logical operation is true. When the other bit is "1", that is, when the 2-bit data is "01", the output is selected when the logical operation is false, and when both bits are "0" ("00"), that slot is not referred to. In addition to the above, it is preferable to include fourth information indicating the cycle of the condition code to be compared, that is, the cycle flag 56 in this example.
[0077]
By bitmapping these pieces of selection information, especially the first to third information, the output can be controlled by comparison with the CCRS 71 without decoding, so that high-speed processing can be performed with simple hardware. This example focuses on a one-bit condition code, but the advantages of the present invention are obtained as long as the condition code stored in the CCRS and the bitmap φbm included in the horizontal microcode 21 are generated in a form that allows them to be compared directly.
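The decode-free comparison described above can be pictured with a minimal Python sketch. It assumes four slots, packs two bits per slot with slot 0 in the lowest bits, and treats a sub-instruction as selected when any bit set in its bitmap φbm is also set in the CCRS image. The names `pack` and `is_selected`, the slot count, and the bit layout are illustrative assumptions and are not taken from the specification.

```python
# Illustrative model only: the slot count, bit layout, and function names are assumptions.
THEN = 0b10  # "10": selected when the referenced condition is true
ELSE = 0b01  # "01": selected when the referenced condition is false
NONE = 0b00  # "00": the slot is not referenced


def pack(pairs):
    """Pack one 2-bit value per slot into an integer, slot 0 in the lowest bits."""
    word = 0
    for slot, bits in enumerate(pairs):
        word |= (bits & 0b11) << (2 * slot)
    return word


def is_selected(bitmap, ccrs):
    """Direct comparison: a bitwise AND followed by a non-zero test, with no decoding."""
    return (bitmap & ccrs) != 0


# Slot 0 models the global condition code, whose "then" bit is always set.
ccrs = pack([THEN, THEN, NONE, ELSE])

always = pack([THEN, NONE, NONE, NONE])         # refers to the global condition code
on_slot1_true = pack([NONE, THEN, NONE, NONE])  # selected if slot 1 evaluated true
on_slot1_false = pack([NONE, ELSE, NONE, NONE]) # selected if slot 1 evaluated false

print(is_selected(always, ccrs))          # True
print(is_selected(on_slot1_true, ccrs))   # True
print(is_selected(on_slot1_false, ccrs))  # False
```

In this model the whole selection reduces to one AND and one zero test per arithmetic unit, which is the kind of simple hardware the paragraph refers to.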
[0078]
Further, by providing fourth information such as the cycle flag 56 and controlling the CC selection circuit 75, an operation unit that determines a condition to be referred to and an operation unit that performs an operation using that condition can be executed in parallel, which improves the degree of parallelism. In this example, the cycle flag 56 is provided once per horizontal microcode and managed collectively. However, by providing a cycle flag field in the sub-instruction field 55, it becomes possible to select, for each operation unit, whether control is based on the branch condition of the same cycle or on the branch condition of the previous cycle, which is more flexible and can further improve the degree of parallelism. In that case, however, the hardware becomes more complicated, because a CC selection circuit 75 for selecting between the CCRS 71 and the RCCφrc to be written into the CCRS 71 must be provided.
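The role of the cycle flag can be sketched as a simple selector. The following Python model is a hedged illustration: the function names, the integer representation of the condition codes, and the example values are assumptions. The first function models one cycle flag per horizontal microcode, and the second models the variant with one flag per sub-instruction.

```python
# Illustrative model only: names and the 4-slot, 2-bit-per-slot width are assumptions.

def select_condition_codes(cycle_flag, ccrs_stored, rcc_pending):
    """Model of the CC selection circuit 75 for one horizontal microcode.

    cycle_flag set   -> use the same-cycle codes about to be written (RCC).
    cycle_flag clear -> use the previous-cycle codes already stored in the CCRS.
    """
    return rcc_pending if cycle_flag else ccrs_stored


def select_per_unit(cycle_flags, ccrs_stored, rcc_pending):
    """Variant with one cycle flag per sub-instruction: each unit picks its own source."""
    return [rcc_pending if flag else ccrs_stored for flag in cycle_flags]


ccrs_stored = 0b01_00_10_10   # condition codes from the previous cycle
rcc_pending = 0b00_10_01_10   # condition codes produced in the current cycle

print(bin(select_condition_codes(True, ccrs_stored, rcc_pending)))
print(bin(select_condition_codes(False, ccrs_stored, rcc_pending)))
print([bin(v) for v in select_per_unit([True, False], ccrs_stored, rcc_pending)])
```

The per-unit variant needs one selector per arithmetic unit instead of one shared selector, which corresponds to the added hardware cost noted in the text.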
[0079]
In the parallel data processing unit 20 of this example, the operation result (condition code) of each arithmetic unit (ALU) 51 is not selected in advance by the horizontal microcode and then fed back or input as control logic of the arithmetic unit 51. Instead, the operation results (condition codes) of all the operation units 51 are input as the control logic of each operation unit 51. That is, the output of the CCRS 71, in which the operation results of all the operation units 51 are stored, or the information about to be written to the CCRS 71, is fed back to each operation unit 51. Then the horizontal microcode 21 that controls each operation unit 51 selects the desired operation result for each operation unit 51, and the condition formed from the selected operation result and the operation result of that operation unit 51 is written into the CCRS 71 and referred to in the next processing in the arithmetic unit 51, or is used to determine whether output to a general-purpose register is permitted.
[0080]
Therefore, according to the present invention, after a conditional compare instruction is executed, the RCCφrc to be written to the CCRS 71 has exactly one of its then/else bits set to "1" and the other set to "0", provided that the instruction was selected by the preceding conditional statement. If it was not selected, (0, 0) is written into the CCRS 71. Thus, whenever a true or false selection is made, the result is propagated as "1" bits to the downstream instructions via the CCRS 71.
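The conversion of a compare result into the two-bit (then, else) value can be sketched as follows. This is a minimal Python model under stated assumptions: the function names and the tuple layout are hypothetical, and the nested case models an inner compare that sits on the "then" side of an outer compare.

```python
# Illustrative model only: the function names and tuple layout (then, else) are assumptions.

def to_rcc(selected, compare_true):
    """Convert one compare result into the 2-bit (then, else) value written to the CCRS."""
    if not selected:
        return (0, 0)  # the compare was not on the taken path: neither side is set
    return (1, 0) if compare_true else (0, 1)


def nested(a_true, b_true):
    """Outer compare A runs unconditionally; inner compare B sits on A's "then" side."""
    rcc_a = to_rcc(True, a_true)
    rcc_b = to_rcc(rcc_a[0] == 1, b_true)
    return rcc_a, rcc_b


print(nested(True, True))    # ((1, 0), (1, 0))
print(nested(True, False))   # ((1, 0), (0, 1))
print(nested(False, True))   # ((0, 1), (0, 0))
```

Because an unselected compare writes (0, 0), an instruction that refers only to B's "then" bit is implicitly conditioned on A as well, so nested conditions propagate without any explicit AND in the downstream bitmap.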
[0081]
As described above, the sub-instruction 55 constituting the horizontal microcode 21, the operation unit 51 in which that sub-instruction 55 is executed, the slot number of the CCRS 71, and the slot number of the bitmap φbm included in the sub-instruction 55 are associated with one another, and the output of the arithmetic unit 51 is controlled accordingly. This makes it possible to proceed in parallel without stalling any of the instructions that involve conditional branches. In particular, by providing the option of referring, as necessary and in units of a cycle containing a plurality of instructions, either to the data about to be written to the CCRS 71 or to the data already stored in it, the result of a conditional branch is propagated in the horizontal direction, and the efficiency of parallel processing that includes conditional branching is dramatically improved.
[0082]
Since the horizontal microcode 21 is used in the parallel data processing unit 20 of the present invention, the capability of the VLIW configuration can be brought out at the highest speed and with a high degree of freedom. That is, the number of selector stages becomes deeper in order to increase the degree of freedom of connection between the register file and the arithmetic units. However, by expanding the VLIW code in advance as horizontal microcode, the functions of fetching and decoding are dispensed with and high speed is maintained. A VLIW data path typically processes repetitive loops, and the code depth is not very large. Therefore, even if the processing content is kept resident as horizontal microcode 21, as in the parallel data processing unit 20 of the present embodiment, there is no concern that the memory capacity of the code memory 22 will be insufficient. Further, the horizontal microcode 21 can be loaded from the data RAM 15 of the basic processor 10, so by loading different horizontal microcode or programs, the horizontal microcode 21 can be replaced to support a plurality of loop processing functions. In this respect as well, there is no fear that the memory capacity of the code memory 22 will become insufficient.
[0083]
Further, it is also possible to provide an integrated circuit device 1 or 2 having a plurality of parallel data processing units 20 and to operate those units in parallel. In that case, as shown in FIG. 2, controlling the clocks of the individual parallel data processing units 20 allows power consumption to be reduced at a finer granularity. Implementing the code memory 22 that stores the horizontal microcode 21 as a ROM is also effective for reducing power consumption. The horizontal microcode 21 then cannot be rewritten, but for specific signal processing this is more advantageous in terms of area and power consumption.
[0084]
FIG. 15 shows a different example of the output control circuit 27. This example shows in detail a circuit configuration that refers to the pre-write information (RCCφrc) for the CCRS 71 when there are two arithmetic units 51. In this output control circuit 27, there is no path that transmits the pre-write information φrc from a CC output unit 72 back to itself, but each CC output unit 72 has a path that transmits its pre-write information φrc to all other CC output units 72. The circuit is thus completely connected and contains a loop that may fall into an oscillation state. The compiler Cc2 for parallel data processing can be made to generate code that avoids the oscillation state. However, because the condition codes of the same cycle are referred to, the circuit may still oscillate momentarily, which is undesirable from the standpoint of saving power and may cause aging and reliability problems.
[0085]
FIG. 16 shows another example of the output control circuit 27. In this output control circuit 27, the circuit that refers to the pre-write information (RCCφrc) for the CCRS 71 is not completely connected but only partially connected. The partial connection imposes the restriction that, when referring to condition codes of the same cycle, not every arithmetic unit (ALU) 51 can refer to the pre-write information φrc of every other arithmetic unit 51. That is, the operation result of the first-slot operation unit 51, output from the first-slot CC output unit 72, can be used by all other CC output units 72, but the operation result of the second slot can be used only by the CC output units 72 of the third and fourth slots, and the operation result of the third slot can be used only by the CC output unit 72 of the fourth slot. The operation result of the fourth slot cannot be used by any other slot. In exchange, the oscillation state is completely prevented. The compiler Cc2 can be written to generate horizontal microcode 21 that operates under these restrictions, and compared with the case where operation results of the same cycle cannot be referred to at all, there is ample merit in terms of raising the degree of parallelism.
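The difference between the complete connection and the partial connection can be checked with a small graph model. The Python sketch below is an illustration under stated assumptions: each slot is a node, an edge means "reads the same-cycle result of", and the function names are hypothetical. A loop in this graph corresponds to a combinational feedback path that could oscillate.

```python
# Illustrative model only: the graph representation and function names are assumptions.

def full_connection(n):
    """Each slot may read the same-cycle result of every other slot."""
    return {i: {j for j in range(n) if j != i} for i in range(n)}


def partial_connection(n):
    """Each slot may read same-cycle results only from lower-numbered slots."""
    return {i: set(range(i)) for i in range(n)}


def has_loop(reads_from):
    """Depth-first search for a cycle in the 'reads from' graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in reads_from}

    def visit(node):
        colour[node] = GREY
        for src in reads_from[node]:
            if colour[src] == GREY:
                return True  # back edge: a feedback loop exists
            if colour[src] == WHITE and visit(src):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in reads_from)


print(has_loop(full_connection(2)))     # True
print(has_loop(partial_connection(4)))  # False
```

With the triangular ordering, the dependency graph is acyclic for any number of slots, which is why the partial connection rules out oscillation by construction rather than relying on the compiler.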
[0086]
In the case where the condition for writing the result of the arithmetic unit 51 is generated using the RCCφrc stored in the CCRS 71, a complete connection may be used.
[0087]
FIGS. 17 to 20 show how a different process is executed by the parallel data processing unit 20. FIG. 17 shows the processing of this example as a tree structure of "if-then-else statements". In the example shown in FIG. 11, each process is executed under an AND condition, that is, only when one particular set of conditions is satisfied. The processing of this example, by contrast, also includes a process executed under an OR condition, that is, a process executed whichever of a plurality of conditions is satisfied.
[0088]
FIG. 18 shows the horizontal microcode 21 generated to execute the processing shown in FIG. 17. Also in this case, the data processing device 1 of this example can execute the six processes OP-1 to OP-6 in two cycles, which greatly improves performance. FIG. 19 shows the contents of OP-1 to OP-6 included in the horizontal microcode 21, expanded into the sub-instruction field 55, and FIG. 20 shows the contents of each of the sub-instructions OP-1 to OP-6 in descriptive form. The sub-instruction OP-5 is an example of microcode executed under the OR condition, and it can be seen that the output control circuit 27 shown in FIG. 9 can execute the OR condition without a branch penalty.
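An OR condition of this kind can be pictured as a bitmap with more than one bit set. The following Python sketch is a hypothetical illustration: the slot assignment, the two conditions chosen, and the function names are assumptions and do not reproduce the actual contents of FIG. 17 to FIG. 20.

```python
# Illustrative model only: slot assignment and the conditions chosen are hypothetical.
THEN, ELSE = 0b10, 0b01


def slot_bits(slot, bits):
    """Place a 2-bit value in the field for the given slot (slot 0 in the lowest bits)."""
    return bits << (2 * slot)


def is_selected(bitmap, condition_codes):
    """Selected when any referenced bit is set; no decoding and no branch is involved."""
    return (bitmap & condition_codes) != 0


# A sub-instruction that should run if slot 1 took "then" OR slot 2 took "else".
or_bitmap = slot_bits(1, THEN) | slot_bits(2, ELSE)

for codes in (slot_bits(1, THEN),
              slot_bits(2, ELSE),
              slot_bits(1, ELSE) | slot_bits(2, THEN)):
    print(is_selected(or_bitmap, codes))
# prints True, True, False
```

The same AND-and-test comparison serves both single-condition and OR-condition sub-instructions in this model, so the OR case adds no extra cycle.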
[0089]
As described above, according to the present invention, the basic processor 10 need not be a VLIW processor and may be a processor with a low degree of parallelism. By adding the parallel processing unit 20, that is, a processing unit having the horizontal-microcode VLIW data path unit 23, “loop statements”, which appear frequently in many kinds of processing, can be processed efficiently, and the parallelism of "if-then-else statements" and the statements subordinate to them can be increased. The ratio of arithmetic parallelism between the parallel processing unit 20 and the basic processor 10 is desirably two or more. If the basic processor 10 is 1-parallel, the parallel processing unit 20 is preferably 2-parallel or more. If the basic processor 10 is 2-parallel, the parallel processing unit 20 is desirably 4-parallel or more. This is because a ratio at or above this level makes the distinction between general-purpose data processing and dedicated data processing clear.
[0090]
With the data processing system 1 having the basic processor 10 and the parallel processing unit 20, it is possible to provide a processor system that reduces power consumption, because no extra fetch/decode structure is installed in the basic processor 10, while performing efficient parallel data processing with a dedicated data path unit in the same manner as a VLIW type processor.
[0091]
[Effect of the invention]
In the present invention, a basic processor having fetch and decode functions is combined with a parallel data processing unit controlled by horizontal microcode, so that only the portions suitable for parallel processing are executed by the horizontal microcode. Therefore, portions that can be processed in parallel are handled by parallel or pipeline-parallel data processing, as in a VLIW type processor, and portions not suitable for parallel processing are handled sequentially by the basic processor. Furthermore, adopting horizontal microcode to control the parallel data processing unit eliminates the circuit complexity of an instruction fetch unit and a decode unit for parallel data processing, so parallel processing can be performed efficiently while power consumption is suppressed.
[0092]
Therefore, according to the present invention, it is possible to provide a data processing system that is much smaller and consumes less power than a VLIW type processor while securing the same degree of parallelism as a VLIW type processor. The present invention thus enables a wide range of parallel execution on economical hardware, and makes it possible to provide an economical processor that dramatically speeds up the signal processing required in communications and networks, particularly processing in which repetitive calculation described by loops appears frequently.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of an example of a data processing device (system LSI) of the present invention.
FIG. 2 is a diagram illustrating a configuration of an example of a data processing device according to the present invention, which is different from FIG.
FIG. 3 is a diagram showing a design process of the data processing device.
FIG. 4 is a diagram showing a schematic configuration of a parallel data processing unit.
FIG. 5 is a diagram illustrating a circuit example of a parallel data processing unit.
FIG. 6 is a diagram showing a configuration of a horizontal microcode.
FIG. 7 is a diagram illustrating a configuration of a loop control unit and an address control unit.
FIG. 8 is a diagram illustrating a state of loop control.
FIG. 9 is a diagram illustrating a configuration of an output control unit.
FIG. 10 is a diagram showing a configuration of an output selection circuit.
FIG. 11 is a diagram illustrating an example of processing to be executed in parallel.
FIG. 12 is a diagram showing horizontal microcode for executing the processing shown in FIG. 11;
FIG. 13 is a diagram showing information included in the horizontal microcode shown in FIG. 12 in more detail.
FIG. 14 shows information contained in the horizontal microcode in a descriptive manner.
FIG. 15 is a diagram showing another example of the output control unit.
FIG. 16 is a diagram showing still another example of the output control unit.
FIG. 17 is a diagram showing another example of the processing to be executed in parallel.
FIG. 18 is a diagram showing horizontal microcode for executing the processing shown in FIG. 17;
FIG. 19 is a diagram showing information included in the horizontal microcode shown in FIG. 18 in further detail.
FIG. 20 shows information included in the horizontal microcode in a descriptive manner.
[Explanation of symbols]
1,2 Data processing device (system LSI, processor)
10 Basic processor
20 parallel data processing unit
21 Horizontal microcode
22 code memory
23 Data path section
24 control unit
25 address control unit
26 loop control unit
27 Output control unit
28 data interface register
38 register data bus

Claims

1. A data processing device comprising:
a first processing unit that fetches, decodes, and executes instructions included in an execution program; and
a second processing unit controlled by horizontal microcode capable of simultaneously executing a plurality of processes,
wherein the first processing unit is capable of executing a process of activating the second processing unit according to a start instruction included in the execution program.

2. The data processing device according to claim 1, wherein the first processing unit includes a fetch unit capable of supplying, to the second processing unit, a start instruction for the second processing unit.

3. The data processing device according to claim 1, wherein the second processing unit comprises:
a code memory for storing a plurality of the horizontal microcodes;
a data path unit comprising a plurality of registers and arithmetic units, the connection and/or function of which are changed by the horizontal microcode; and
an address control unit that manages an execution address of the code memory.

4. The data processing device according to claim 3, wherein the first or second processing unit has a plurality of interface registers for interfacing between a data RAM of the first processing unit and the data path unit of the second processing unit.

5. The data processing device according to claim 3, wherein the first processing unit is capable of executing a process of rewriting the code memory of the second processing unit by an instruction included in the execution program.

6. The data processing device according to claim 3, further comprising a third processing unit that rewrites the code memory of the second processing unit.

7. The data processing device according to claim 1, wherein the first processing unit controls supply of a clock signal to the second processing unit.

8. The data processing device according to claim 3, wherein the second processing unit further includes a plurality of loop counters for counting the number of times the horizontal microcode is executed, and a loop control unit that controls the number of times the horizontal microcode is executed using the loop counters, and
wherein the horizontal microcode includes an operand field including information for controlling connection of the data path unit, and a loop counter selection field including information for selecting the loop counter.

9. The data processing device according to claim 8, wherein the loop control unit includes a loop control register for setting the number of times counted by the loop counter, and
wherein the first processing unit is capable of executing a process of setting a value of the loop control register in accordance with an instruction included in the execution program.

10. The data processing device according to claim 3, wherein the second processing unit comprises:
a condition code register capable of storing a condition code of each operation unit of the data path unit; and
an output control unit that compares the condition code register with the selection information of each operation unit included in the horizontal microcode, and controls the output from each operation unit to a register and/or the output of a condition code to the condition code register.

11. The data processing device according to claim 3, wherein the second processing unit comprises:
a condition code register capable of storing a condition code of each operation unit of the data path unit; and
an output control unit that compares the condition code of the previous cycle stored in the condition code register, or the condition code of each operation unit in the same cycle to be stored in the condition code register, with the selection information of each operation unit included in the horizontal microcode, and controls the output from each operation unit to a register and/or the output of a condition code to the condition code register.

12. The data processing device according to claim 11, wherein the selection information of each operation unit included in the horizontal microcode includes:
first information indicating that the selection is made independently of the condition code;
second information indicating each of the operation units whose condition code is to be compared;
third information indicating whether the condition code to be compared is true or false; and
fourth information indicating the cycle of the condition code to be compared.

13. The data processing device according to claim 11, wherein the output control unit includes a conversion unit that converts a condition code into multi-bit data indicating at least true/false and selection or non-selection, and outputs the data to the condition code register, and wherein the selection information of each operation unit included in the horizontal microcode is a bitmap that can be compared with the condition code register without decoding.

14. The data processing device according to claim 12, wherein the output control unit includes a condition code selection circuit that selects, based on the fourth information of the selection information of each operation unit included in the horizontal microcode, either the condition code stored in the condition code register or the condition code to be stored in the condition code register.

15. The data processing device according to claim 14, wherein the fourth information of the selection information of each operation unit included in the horizontal microcode is information common to the operation units in the same cycle, and the condition code selection circuit is common to the operation units.

16. A parallel data processing device controlled by horizontal microcode capable of simultaneously executing a plurality of processes, comprising:
a code memory for storing a plurality of the horizontal microcodes;
a data path unit comprising a plurality of registers and arithmetic units, the connection and/or function of which are changed by the horizontal microcode;
a condition code register capable of storing a condition code of each operation unit of the data path unit; and
an output control unit that compares the condition code register with the selection information of each operation unit included in the horizontal microcode, and controls the output from each operation unit to a register and/or the output of a condition code to the condition code register.

17. A parallel data processing device controlled by horizontal microcode capable of simultaneously executing a plurality of processes, comprising:
a code memory for storing a plurality of the horizontal microcodes;
a data path unit comprising a plurality of registers and arithmetic units, the connection and/or function of which are changed by the horizontal microcode;
a condition code register capable of storing a condition code of each operation unit of the data path unit; and
an output control unit that compares the condition code of the previous cycle stored in the condition code register, or the condition code of each operation unit in the same cycle to be stored in the condition code register, with the selection information of each operation unit included in the horizontal microcode, and controls the output from each operation unit to a register and/or the output of a condition code to the condition code register.

18. The parallel data processing device according to claim 17, wherein the selection information of each operation unit included in the horizontal microcode includes:
first information indicating that the selection is made independently of the condition code;
second information indicating each of the operation units whose condition code is to be compared;
third information indicating whether the condition code to be compared is true or false; and
fourth information indicating the cycle of the condition code to be compared.

19. The parallel data processing device according to claim 17, wherein the output control unit includes a conversion unit that converts a condition code into multi-bit data indicating at least true/false and selection or non-selection, and outputs the data to the condition code register, and wherein the selection information of each operation unit included in the horizontal microcode is a bitmap that can be compared with the condition code register without decoding.

20. The parallel data processing device according to claim 17, wherein the output control unit includes a condition code selection circuit that selects, based on the fourth information of the selection information of each operation unit included in the horizontal microcode, either the condition code stored in the condition code register or the condition code to be stored in the condition code register.

21. A method for controlling a parallel data processing device that performs a plurality of processes simultaneously by changing connections and/or functions of a plurality of registers and arithmetic units using horizontal microcode, the method comprising:
a step of storing a condition code of each operation unit in a condition code register; and
a step of comparing the condition code register with the selection information of each operation unit included in the horizontal microcode, and controlling an output from each operation unit to a register and/or an output of a condition code to the condition code register.

22. The method for controlling a parallel data processing device according to claim 21, wherein, in the step of controlling the output, the condition code of the previous cycle stored in the condition code register, or the condition code of each operation unit in the same cycle to be stored in the condition code register, is compared with the selection information of each operation unit included in the horizontal microcode.

23. The method for controlling a parallel data processing device according to claim 22, wherein the selection information of each operation unit included in the horizontal microcode includes:
first information indicating that the selection is made independently of the condition code;
second information indicating each of the operation units whose condition code is to be compared;
third information indicating whether the condition code to be compared is true or false; and
fourth information indicating the cycle of the condition code to be compared.

24. The method for controlling a parallel data processing device according to claim 22, wherein, in the step of storing the condition code, the condition code is converted into multi-bit data indicating at least true/false and selection or non-selection, and is stored in the condition code register, and
wherein, in the step of controlling the output, the bitmapped selection information of each operation unit included in the horizontal microcode is compared with the condition code register without decoding.

25. A method comprising:
a step of extracting a portion suitable for parallel execution from an original program and converting it into horizontal microcode capable of simultaneously executing a plurality of processes; and
a step of replacing the portion of the original program suitable for the parallel execution with a start instruction of the horizontal microcode, and converting the original program into an executable program executable by a basic processor having fetch and decode functions.

26. A compiler having:
a function of extracting a portion suitable for parallel execution from an original program and converting it into horizontal microcode capable of simultaneously executing a plurality of processes; and
a function of replacing the portion of the original program suitable for the parallel execution with a start instruction of the horizontal microcode, and converting the original program into an executable program executable by a basic processor having fetch and decode functions.

27. A program having instructions for executing:
a process of extracting a portion suitable for parallel execution from an original program and converting it into horizontal microcode capable of simultaneously executing a plurality of processes; and
a process of replacing the portion of the original program suitable for the parallel execution with a start instruction of the horizontal microcode, and converting the original program into an executable program executable by a basic processor having fetch and decode functions.