JP3579087B2

JP3579087B2 - Arithmetic unit and microprocessor

Info

Publication number: JP3579087B2
Application number: JP15756694A
Authority: JP
Inventors: 秀仁武和; 松尾　　茂
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1994-07-08
Filing date: 1994-07-08
Publication date: 2004-10-20
Anticipated expiration: 2019-10-20
Also published as: JPH0822451A

Description

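[0001]
[Industrial Field of Application]
The present invention relates to means for performing, at high speed, arithmetic processing such as the multiply-accumulate (product-sum) operations used in image processing and similar applications.
[0002]
[Prior Art]
In the field of image processing, arithmetic operations demand high performance, including high speed and high precision. Conventionally, therefore, a dedicated arithmetic unit was built for each type of processing and applied to image processing.
[0003]
Building such dedicated arithmetic units for each kind of image processing, and designing and manufacturing an image processing system around them, raises the cost of the system.
[0004]
Meanwhile, the performance of general-purpose processors, which allow systems to be built at relatively low cost and can be applied to image processing, has improved. However, the processing speed and functionality of the arithmetic units built into general-purpose processors are not yet good enough to carry out all image processing on them.
[0005]
The multiply-accumulate operation, which is performed frequently in image processing, can be executed by an arithmetic unit that suitably combines a multiplier and an adder.
[0006]
In such a conventional arithmetic unit, the multiplier that multiplies two given numbers has a partial product generation function and a partial product addition function.
[0007]
Partial product generation and partial product addition are described below using a concrete example, with reference to FIG. 19.
[0008]
The data used in the multiplication are assumed to be 5 bits wide.
[0009]
A "partial product" is generated by examining multiplier 1901 one bit at a time. If the bit is "1", the partial product is the value of multiplicand 1900 itself; if the bit is "0", the partial product is "0".
[0010]
The partial product generated from the sign bit of multiplier 1901 is an exception. If the sign bit is "1", the partial product is the bit-inverted value of multiplicand 1900 plus 1; if the sign bit is "0", the partial product is "0".
[0011]
In FIG. 19, each partial product is drawn as a rectangle, with its contents shown inside the rectangle.
[0012]
In a 5-bit multiplication the multiplier has 5 bits, so five partial products are generated, as shown in FIG. 19. In the example shown, partial product 1 (1902) and partial product 2 (1903) are each generated from a bit whose value is "1", so each equals multiplicand 1900 itself.
[0013]
Partial product 3 (1904) and partial product 4 (1905) are each generated from a bit whose value is "0", so they are "0". Partial product 5 (1906) is generated from the sign bit of the multiplier, and that sign bit is "1", so it is the negative of the multiplicand.
[0014]
The partial products are added in order of generation, starting from the low-order bit of the multiplier, with each successive partial product shifted one bit further toward the high-order (left) side. When the multiplier and multiplicand use two's complement representation, in which negative numbers can also be expressed, the sign must be extended (hereinafter "sign extension") before adding for the partial product addition to be correct.
[0015]
In the example of FIG. 19, partial product 1 is extended by 4 bits (1907), partial product 2 by 3 bits (1908), partial product 3 by 2 bits (1909), and partial product 4 by 1 bit (1910). This sign extension makes the partial product addition exact.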
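The 5-bit procedure described above can be sketched in Python as follows. This is a minimal illustration, not the circuit of FIG. 19. The function names `to_signed`, `partial_products`, and `multiply` are hypothetical, and the operands -10 and -13 are assumed values, chosen only because the multiplier bits then follow the pattern described for FIG. 19 (1, 1, 0, 0, and a sign bit of 1). The actual operands of the figure are not given in the text.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def partial_products(multiplicand: int, multiplier: int, bits: int = 5) -> list:
    """Return the shifted, sign-extended partial products as 2*bits-wide bit patterns."""
    width = 2 * bits
    mask = (1 << width) - 1
    m = to_signed(multiplicand, bits)  # a Python int is sign-extended implicitly
    products = []
    for i in range(bits):
        bit = (multiplier >> i) & 1
        if i == bits - 1:
            term = -m if bit else 0    # sign bit: invert the multiplicand and add 1
        else:
            term = m if bit else 0     # ordinary bit: multiplicand itself or 0
        products.append((term << i) & mask)
    return products


def multiply(multiplicand: int, multiplier: int, bits: int = 5) -> int:
    """Add all partial products and read the total as a signed 2*bits-wide value."""
    width = 2 * bits
    total = sum(partial_products(multiplicand, multiplier, bits))
    return to_signed(total, width)


# Assumed example operands: -10 is 0b10110 and -13 is 0b10011 in 5-bit two's complement.
print(multiply(0b10110, 0b10011))  # 130
```

Masking each partial product to 10 bits plays the role of the sign extension in the figure: without it, the negative terms would not line up with the width of the final sum.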
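[0016]
This partial product addition is normally performed using a carry-save adder and a carry-propagate adder, as shown in FIG. 20. The partial product adder shown in FIG. 20 is a carry-save adder built from an array of 3-input full adders.
[0017]
FIG. 21 shows the operation of the 3-input full adder, the basic building block of the partial product adder.
[0018]
The 3-input full adder adds its three input bits (2100, 2101, 2102) and outputs a carry 2104 and a sum 2103.
[0019]
As shown in FIG. 21, it takes three input values, produces the sum (2103), and asserts the carry output (2104) when the inputs require it.
[0020]
The full adders (2000, 2001, 2002, 2003, 2004) in FIG. 20, which form an adder with a carry-save function, receive partial products 1 to 3 of FIG. 19 and add them. The "carry" from each full adder goes to the full adder one digit higher in the next stage, and the "sum" goes to the full adder of the same digit in the next stage (2010, 2011, and so on), where partial product 4 of FIG. 19 is added. That result is in turn fed to the full adders (2012 and so on) that add partial product 5.
[0021]
Partial product 5 requires inverting the multiplicand and adding "1", so input 2013 of full adder 2012 is used as the input that adds the 1.
[0022]
As an example, first-stage full adder 2000 receives the extended sign 2005 of partial product 1 (value "1"), the extended sign 2006 of partial product 2 (value "1"), and the sign 2007 of partial product 3 (value "0"), and adds them.
[0023]
Carry 2008 of this addition is fed to full adder 2010, one digit higher in the next stage, and sum 2009 is fed to full adder 2011 of the same digit in the next stage.
[0024]
Full adder 2010 receives the carry and the sum produced by adding the extended signs of partial products 1 to 3, and adds them. Full adder 2014, which would generate its input signal, performs the same computation as full adder 2000. Sum 2009 of full adder 2000 can therefore be used as the input to full adder 2010, and full adder 2014, which handles the digit above full adder 2000 (that is, the adder to its left), can be omitted.
[0025]
Because an adder with a carry-save function passes each "carry" to the next stage and repeats the addition, two outputs, 2024 to 2038, remain even after all partial products have been added. To obtain the final result, a carry-propagate adder such as the one shown in FIG. 20 is needed to add those two outputs.
[0026]
In the configuration of FIG. 20, the carry-propagate adder consists of full adders 2015 to 2022. They are connected so that, for example, carry 2023 of full adder 2016 is an input to full adder 2015, which literally forms a carry-propagate adder.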
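The combination of carry-save addition and a final carry-propagate addition can be sketched as follows. This is a word-level illustration under stated assumptions, not a gate-level model of FIG. 20: the names `full_adder`, `carry_save_add`, and `add_partial_products` are hypothetical, and each call applies the full adder to every bit position of a word at once instead of instantiating one adder per digit.

```python
def full_adder(a: int, b: int, c: int) -> tuple:
    """3-input full adder: returns (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def carry_save_add(x: int, y: int, z: int, width: int) -> tuple:
    """Apply full_adder to every bit position of three words at once.

    The carry word is shifted left by one so that each carry lands on the
    next higher digit of the following stage.
    """
    mask = (1 << width) - 1
    sum_word, carry_word = full_adder(x, y, z)  # bitwise operators act per bit
    return sum_word & mask, (carry_word << 1) & mask


def add_partial_products(terms: list, width: int) -> int:
    """Reduce all terms to two words, then perform one carry-propagate addition."""
    mask = (1 << width) - 1
    sum_word, carry_word = 0, 0
    for term in terms:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, term & mask, width)
    return (sum_word + carry_word) & mask       # the final carry-propagate adder


print(add_partial_products([3, 5, 7, 9], 8))  # 24
```

The loop never propagates a carry along a word; only the last line does, which is why the two-output form must be followed by a carry-propagate adder.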
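[0027]
[Problems to Be Solved by the Invention]
As described above, a multiply-accumulate unit such as the one in FIG. 20 had to include not only an adder with a carry-save function but also a carry-propagate adder.
[0028]
Thus, computing with conventional dedicated arithmetic units meets the performance requirement but almost always raises the system cost. Using a general-purpose arithmetic unit limits the cost increase, but the problem remains that its processing performance is unsatisfactory.
[0029]
Nevertheless, using a general-purpose arithmetic unit is essential for reducing cost, so its processing performance needs to be improved.
[0030]
An object of the present invention is therefore to provide arithmetic means that is low in cost, fast, and usable for high-performance image processing and the like. This is achieved by providing means for operating parts of a multiply-accumulate unit independently, so that multiple multiply-accumulate operations are performed simultaneously. The multiply-accumulate unit in question is one of the general-purpose arithmetic units of a general-purpose processor and is used frequently in image processing and similar applications.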
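[0031]
[Means for Solving the Problems]
The following means can be used to solve the above problems and achieve the object of the invention.
[0032]
The invention is an arithmetic unit that obtains the multiplication result for each pair of multiplicand and multiplier by computing the product of an N-bit number containing multiple multiplicands and an M-bit number containing the multiplier corresponding to each multiplicand. The arithmetic unit comprises the following means.
[0033]
Specifically, the arithmetic unit comprises: (1) a first register that holds an N-bit number in which multiple multiplicands are placed, subject to the total of their bit lengths not exceeding N, with 0s embedded between the multiplicands; (2) a second register that holds an M-bit number in which the multipliers corresponding to the multiplicands are placed, subject to the total of their bit lengths not exceeding M, with 0s embedded between the multipliers; (3) a partial product processing unit that generates the partial products of the value held in the first register and the value held in the second register; (4) a sign extension unit that, in those partial products, embeds bits representing the multiplication result in two's complement form, in order to compensate the sign of the multiplication result of each multiplicand-multiplier pair; (5) summing means that accumulates the partial products in order to obtain their total and that, when the sum of all partial products for one multiplicand-multiplier pair exceeds a predetermined value, discards the excess when accumulating the partial products for the next pair; and (6) processing means that extracts, from the (N + M)-bit total obtained by the summing means, the value that is the product of each multiplicand in the first register and the corresponding multiplier in the second register, thereby obtaining the multiplication result for each pair.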
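[0034]
[Operation]
The present invention is a general-purpose multiply-accumulate unit. It comprises a partial product adder able to perform the multiplications on the individual data stored in the input registers separately and simultaneously, and an adder able to perform the additions on the individual multiplication results separately and simultaneously.
[0035]
First, the first register holds an N-bit number in which multiple multiplicands are placed, subject to the total of their bit lengths not exceeding N, with 0s embedded between the multiplicands. The second register holds an M-bit number in which the multipliers corresponding to those multiplicands are placed, subject to the total of their bit lengths not exceeding M, with 0s embedded between the multipliers.
[0036]
Next, the partial product processing unit generates the partial products of the value held in the first register and the value held in the second register. In those partial products, the sign extension unit embeds bits representing the multiplication result in two's complement form, in order to compensate the sign of the multiplication result of each multiplicand-multiplier pair.
[0037]
The summing means then accumulates the partial products in order. When the sum of all partial products for one multiplicand-multiplier pair exceeds a predetermined value, it discards the excess when accumulating the partial products for the next pair, and in this way obtains the total of the partial products.
[0038]
Finally, the processing means extracts, from the (N + M)-bit total obtained by the summing means, the value that is the product of each multiplicand in the first register and the corresponding multiplier in the second register. It thereby obtains the multiplication result for each pair, which realizes parallel operation.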
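[0039]
[Embodiments]
Embodiments of the present invention are described below with reference to the drawings.
[0040]
FIG. 1 shows the configuration of an embodiment of the invention.
[0041]
This embodiment is an arithmetic unit that executes the operations "a1×b1+c1" and "a2×b2+c2" simultaneously.
[0042]
The arithmetic unit has a partial product adder 104. Its inputs 102 and 103 are the two multiplicands a1 and a2 packed in register 100 ("packed" is used below to mean that data are stored tightly together in one register) and the two multipliers b1 and b2 packed in register 101. It generates partial products from these inputs and adds them.
[0043]
The unit further comprises: a "shift-and-selector" 107, which takes as input 106 the two addends c1 and c2 packed in register 105 and shifts them into alignment with the two outputs of partial product adder 104, so that they are added to "a1×b1" and "a2×b2" at the correct digit positions; a 3-input full adder array 108, which adds the two outputs of partial product adder 104 and the output of shift-and-selector 107; a 64-bit adder 109, which adds the two outputs of the 3-input full adder array; an aligner 110, which converts the output of 64-bit adder 109 into a specified format; an overflow/underflow determination unit 111, which determines whether the addition result has overflowed or underflowed; and a maximum/minimum replacement unit 112, which replaces the output of aligner 110 with the maximum value on overflow and with the minimum value on underflow. The final result is output packed in register 150.
[0044]
The operation of each component in FIG. 1 when performing "a1×b1+c1" and "a2×b2+c2" is described below for the case where two data items are packed into the upper 8 bits and the lower 8 bits of a 32-bit register.
[0045]
As in data format 200 of FIG. 2, the two data items a1 and a2 are packed into the upper 8 bits and the lower 8 bits of a 32-bit word, with "0" values embedded between them, to form 32-bit data.
[0046]
Data b1, b2, c1, and c2 (not shown) are packed in the same way: two data items in the upper 8 bits and the lower 8 bits of a 32-bit word, with "0" values embedded between them.
[0047]
Multiplying the 32-bit data that packs a1 and a2 by the 32-bit data that packs b1 and b2, as a 32-bit by 32-bit multiplication, executes the multiplications of the packed 8-bit data, "a1×b1" and "a2×b2", simultaneously.
[0048]
FIG. 2 shows which parts of the 32-bit multiplication correspond to "a1×b1" and "a2×b2". In FIG. 2 each partial product is drawn as a horizontally long rectangle.
[0049]
Multiplication "a1×b1" is computed using the upper 15 digits of each of partial products 25 (203) to 32 (204), and multiplication "a2×b2" using the lower 15 digits of each of partial products 1 (201) to 8 (202). The parts shaded black in the figure are therefore the sign-extended parts of "a1×b1" and "a2×b2". The concept of sign extension is as shown in FIG. 19.
[0050]
Because "0" values are embedded between multiplier data b1 and b2, partial products 9 to 24, which are generated from those zero bits, are always "0". As a result, when all partial products are added, carries from the partial product addition for a2×b2 cannot propagate into and affect "a1×b1".
[0051]
In addition, the sign-extension contribution for 32-bit multiplication that enters the addition of partial product 25 is set to "0". This prevents the result of adding the sign extensions for the 32-bit multiplication (the additions corresponding to partial products 1 to 8) from affecting the result of "a1×b1".
[0052]
With these two measures, "a1×b1" and "a2×b2" can be computed separately.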
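The effect of the zero padding can be checked with a short sketch. It is deliberately restricted to unsigned 8-bit operands, an assumption made so that an ordinary integer multiplication suffices; for signed operands the text relies on the sign extension and separation circuits, which are not modeled here. The names `pack2` and `split_multiply_unsigned` are hypothetical.

```python
def pack2(high: int, low: int) -> int:
    """Pack two 8-bit values into the upper and lower 8 bits of a 32-bit word."""
    return ((high & 0xFF) << 24) | (low & 0xFF)


def split_multiply_unsigned(a1: int, a2: int, b1: int, b2: int) -> tuple:
    """One 32x32-bit multiplication yields both 16-bit products (unsigned case)."""
    product = pack2(a1, a2) * pack2(b1, b2)   # ordinary 64-bit product
    upper16 = (product >> 48) & 0xFFFF        # field that holds a1 x b1
    lower16 = product & 0xFFFF                # field that holds a2 x b2
    return upper16, lower16


print(split_multiply_unsigned(200, 17, 3, 250))  # (600, 4250)
```

The full product is a1×b1 shifted left by 48 bits, plus the cross terms a1×b2 + a2×b1 shifted left by 24 bits, plus a2×b2. With 8-bit unsigned operands the cross terms occupy at most bits 24 to 40, so they overlap neither the lower 16-bit field nor the upper one.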
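[0053]
The operation of partial product adder 104, which has the functions described above, is now described in detail.
[0054]
FIG. 3 shows an example configuration in which the sign extension for multiplication a2×b2 is implemented in a general-purpose partial product adder.
[0055]
As FIG. 2 shows, sign extension for this multiplication is needed over the following ranges: digits 9 to 15 for partial product 1, digits 10 to 15 for partial product 2, digits 11 to 15 for partial product 3, digits 12 to 15 for partial product 4, digits 13 to 15 for partial product 5, digits 14 to 15 for partial product 6, and digit 15 for partial product 7.
[0056]
FIG. 3 shows, as an example, the portion corresponding to partial products 3 to 5.
[0057]
Each full adder operates as shown in FIG. 21.
[0058]
Full adders 300 to 305 are used to add digits 15 down to 10 of partial product 3, full adders 306 to 311 to add digits 15 down to 10 of partial product 4, and full adders 312 to 317 to add digits 15 down to 10 of partial product 5.
[0059]
Selector 318 selects input 332 from its two inputs 331 and 332 when signal 330 is "1". The other selectors 319 to 321 operate in the same way. Because 332 is the sign of partial product 3 of multiplication "a2×b2", setting signal 330 to "1" sign-extends partial product 3 with signal 332, which represents the sign, so that the full adders add the sign-extended partial product 3.
[0060]
Likewise, when signal 330 is "1", selectors 323 to 326 and selectors 327 to 329 select signs 333 and 334 of partial products 4 and 5, respectively. Partial products 4 and 5 are thus also added with sign extension.
[0061]
Although not shown in FIG. 3, partial products 1, 2, 6, and 7 are handled in the same way: selectors that choose the sign-extension signal under control of signal 330 are placed on the inputs of the full adders, so that the partial product addition for "a2×b2" is performed with sign extension.
[0062]
FIG. 22 shows an example implementation, in a general-purpose partial product adder, for generating partial product 8 from the sign bit of multiplier b2 (the eighth bit from the low-order end). Full adders 2200 to 2207 add the lower 8 bits of partial product 8. Because this partial product is generated from the sign bit, the configuration implements the processing of inverting the data and adding "1", as shown in FIG. 19.
[0063]
Logic gate 2208 inverts input 2227 when signal 2225 is "1", and outputs "1" regardless of the value of input 2227 when signal 2225 is "0". Selector 2216 selects the output of logic gate 2208 when signal 2226 is "1", and selects data 2227, the eighth bit from the low-order end of partial product 8, when signal 2226 is "0".
[0064]
Logic gates 2209 to 2215 and selectors 2217 to 2223 operate in the same way as logic gate 2208 and selector 2216. Selector 2224 outputs "1" when signal 2226 is "1".
[0065]
Using logic gates 2208 to 2215 and selectors 2216 to 2224, and setting signal 2226 to "1", partial product 8 of multiplication "a2×b2" can be generated and added. The other partial products can be generated and added in the same way. The configurations of FIG. 3 and FIG. 22 thus allow the partial product addition for "a2×b2" to be executed.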
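[0066]
Next, the means shown in FIG. 4 provides the function of preventing the result of adding the sign extensions for 32-bit multiplication (the addition of partial products 1 to 8) from affecting multiplication "a1×b1". Its operation is as follows.
[0067]
Full adders 400 to 407 add the values of digits 32 down to 25 of partial product 25 (partial product 1 of multiplication "a1×b1", which is partial product 25 in FIG. 2). Full adders 408 to 414 output the result of the additions up to partial product 24.
[0068]
Logic gate 415 forces sum 433 of full adder 408 to "0" when signal 431 is "0". Logic gates 416 to 422 likewise force the sums of full adders 409 to 414 to "0" when signal 431 is "0". Logic gate 423 forces carry 432 of full adder 408 to "0" when signal 431 is "0".
[0069]
Logic gates 424 to 430 likewise force the carries of full adders 409 to 414 to "0" when signal 431 is "0". When signal 431 is "1", logic elements 415 to 430 pass the carry and sum values from full adders 408 to 414 through unchanged.
[0070]
In other words, setting signal 431 to "0" controls whether the result of the additions up to partial product 24 is added to partial product 25. This prevents the result of adding the sign extensions for 32-bit multiplication (the addition of partial products 1 to 8) from affecting multiplication "a1×b1".
[0071]
With the means shown in FIG. 3, FIG. 22, and FIG. 4, partial product adder 104 can execute the partial product additions of the two multiplications "a1×b1" and "a2×b2" in parallel, using the same processing as an ordinary 32-bit multiplication.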
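[0072]
Next, shift-and-selector 107 receives control signal 113, which indicates parallel execution of "a1×b1+c1" and "a2×b2+c2". So that c1 and c2 line up correctly with the upper 16 digits and the lower 16 digits of the result from partial product adder 104, it shifts c1 and c2, packed as shown at 500 in FIG. 5, into the arrangement shown at 501 and embeds "0" between c1 and c2.
[0073]
When "a1×b1+c1" and "a2×b2+c2" are not executed in parallel, it selects the unshifted data 500.
[0074]
3-input full adder array 108 performs carry-save addition of two inputs and produces result 506. The first input is output 503 of the carry-save adder, which holds the partial product sums of "a1×b1" and "a2×b2" in the upper 16 digits and the lower 16 digits as shown in FIG. 5. The second is input 501 selected by shift-and-selector 107. In parallel execution, the digits between c1 and c2 in 501 are filled with "0", so no carry occurs in the digits between a1×b1+c1 and a2×b2+c2. As a result, the additions of c1 and c2 do not affect each other.
[0075]
FIG. 6 shows an example configuration of overflow/underflow determination unit 111, which determines in parallel whether the results of the parallel operations have overflowed or underflowed.
[0076]
Upper determination unit 600 and lower determination unit 601 compare the upper 16 bits and the lower 16 bits of 506 in FIG. 5, respectively, with the predetermined 8-bit upper limit 603 and 8-bit lower limit 604.
[0077]
Upper determination unit 600 judges that an overflow has occurred when the upper 16 bits of 506 in FIG. 5 are greater than 8-bit upper limit 603, and that an underflow has occurred when they are less than 8-bit lower limit 604. Lower determination unit 601 makes the same judgment for the lower 16 bits of 506.
[0078]
32-bit determination unit 602 compares the lower 32 bits of 506 with the predetermined 32-bit upper limit 605 and 32-bit lower limit 606 to determine whether an overflow or underflow has occurred in a 32-bit multiplication. Because upper determination unit 600 and lower determination unit 601 operate independently, the overflow and underflow determinations for the upper and lower 16 bits are made in parallel.
[0079]
Determination signals 607 and 608 from upper determination unit 600 become "1" when the upper 16 bits of 506 overflow and underflow, respectively. Determination signals 609 and 610 from lower determination unit 601 become "1" when the lower 16 bits of 506 overflow and underflow, respectively. Determination signals 611 and 612 from 32-bit determination unit 602 become "1" when the lower 32 bits of 506 overflow and underflow, respectively.
[0080]
Determination signals 607 to 612 are sent to maximum/minimum replacement unit 112.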
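[0081]
FIG. 7 shows the configuration of aligner 110. The aligner controls how data are taken from output 701 of 64-bit adder 109, so that it can handle both the ordinary multiply-accumulate operation and the parallel computation of "a1×b1+c1" and "a2×b2+c2" described so far.
[0082]
Selector 702 selects between data 719 (upper bit 9 through upper bit 16 of 701) and data 720 (upper bit 33 through upper bit 40 of 701) under control signal 723, which is "0" during parallel execution of "a1×b1+c1" and "a2×b2+c2" and "1" during an ordinary multiply-accumulate operation.
[0083]
Logic gates 703 to 718 (16 in all; only two are drawn and the rest are omitted to keep the figure simple) force data 721 (lower bit 24 through lower bit 9 of 701) to "0" under control signal 723 (the part labeled as 0 values in 724 in the figure).
[0084]
That is, during parallel execution of "a1×b1+c1" and "a2×b2+c2", control signal 723 is "0". First, selector 702 selects data 719, so part d1 in the figure is filled with data 719. Second, because logic gates 713 to 718 are AND gates, data 721 becomes "0". Data 722 is unchanged, so the aligner output is as shown at 724.
[0085]
During an ordinary multiply-accumulate operation, control signal 723 is "1". Selector 702 selects data 720, and data 721 passes through logic gates 713 to 718 unchanged. Data 722 is again unchanged, so the aligner output is as shown at 725.
[0086]
As a result, during parallel execution of "a1×b1+c1" and "a2×b2+c2", results d1 and d2 are packed into a 32-bit register as shown at 724.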
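[0087]
FIG. 8 shows the relationship between the operation conditions, the determination results of overflow/underflow determination unit 111, and the output of maximum/minimum replacement unit 112.
[0088]
The output cases are as follows.
[0089]
In case (1) of FIG. 8, "a1×b1+c1" and "a2×b2+c2" are executed simultaneously and the upper result is judged to have overflowed. The predetermined maximum value (max) is output in the upper 8 bits of the output, and the lower 8 bits carry the operation result unchanged.
[0090]
In case (2) of FIG. 8, the upper result is judged to have underflowed. The predetermined minimum value (min) is output in the upper 8 bits of the output, and the lower 8 bits carry the operation result unchanged.
[0091]
In case (3) of FIG. 8, the lower result is judged to have overflowed. The predetermined maximum value (max) is output in the lower 8 bits of the output, and the upper 8 bits carry the operation result unchanged.
[0092]
In case (4) of FIG. 8, the lower result is judged to have underflowed. The predetermined minimum value (min) is output in the lower 8 bits of the output, and the upper 8 bits carry the operation result unchanged.
[0093]
In an ordinary multiply-accumulate operation, the predetermined maximum value may be output across all 32 bits on overflow (case (5) of FIG. 8), and the predetermined minimum value across all 32 bits on underflow (case (6) of FIG. 8).
[0094]
Otherwise, whether in parallel execution of "a1×b1+c1" and "a2×b2+c2" or in ordinary multiply-accumulate execution, when neither overflow nor underflow is detected, the input value may be output unchanged (case (7) of FIG. 8).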
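Cases (1) to (4) and (7) for the parallel operation can be sketched as follows. The text says only that the limits and the maximum and minimum values are predetermined, so the values 127 and -128 (the range of a signed 8-bit number) are assumptions, as is reading each 16-bit field as two's complement. The names `saturate_lane` and `replace_max_min` are hypothetical.

```python
def saturate_lane(field16: int, upper: int = 127, lower: int = -128) -> tuple:
    """Judge one 16-bit result against assumed 8-bit limits.

    Returns (8-bit output, judgment), where judgment is "overflow",
    "underflow", or "none".
    """
    value = field16 & 0xFFFF
    if value & 0x8000:               # read the field as a two's complement number
        value -= 0x10000
    if value > upper:
        return upper & 0xFF, "overflow"    # replaced by the maximum value
    if value < lower:
        return lower & 0xFF, "underflow"   # replaced by the minimum value
    return value & 0xFF, "none"            # passed through unchanged


def replace_max_min(upper16: int, lower16: int) -> int:
    """Saturate both results independently and pack them into 32 bits."""
    high, _ = saturate_lane(upper16)
    low, _ = saturate_lane(lower16)
    return (high << 24) | low


print(hex(replace_max_min(300, 0xFFFB)))  # 0x7f0000fb
```

In this example the upper result 300 overflows and is replaced by the assumed maximum, while the lower result -5 is output unchanged, which corresponds to case (1).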
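[0095]
Next, an embodiment of a parallel multiply-accumulate unit in which data are packed into all 32 bits, for example four 8-bit pixel values, is described with reference to FIG. 9.
[0096]
This embodiment comprises: partial product adder 900, which generates partial products from four packed multiplicands a1, a2, a3, a4 and four packed multipliers b1, b2, b3, b4 and adds them; shift-and-selector 901, which aligns the four packed addends c1, c2, c3, c4 with the output of partial product adder 900; 3-input full adder array 902, which adds the two outputs of partial product adder 900 and the output of shift-and-selector 901; 64-bit adder 903, which adds the two outputs of the 3-input full adder array; aligner 904, which converts the output of that adder into a specified format; overflow/underflow determination unit 905, which determines whether the addition result has overflowed or underflowed; and maximum/minimum replacement unit 906, which, based on that determination and according to predetermined rules, replaces the output of aligner 904 with a predetermined value.
[0097]
The operation of the embodiment of FIG. 9 when performing the multiply-accumulate operations "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" is described below for the case where four 8-bit data items are packed into a 32-bit word.
[0098]
As in data format 1000 of FIG. 10, the four data items a1, a2, a3, a4 make up one 32-bit word.
[0099]
b1, b2, b3, b4 and c1, c2, c3, c4 are likewise each formed into 32-bit data.
[0100]
Multiplying the 32-bit data that packs a1, a2, a3, a4 by the 32-bit data that packs b1, b2, b3, b4 executes the 8-bit multiplications "a1×b1", "a2×b2", "a3×b3", and "a4×b4" simultaneously.
[0101]
Which parts of the 32-bit multiplication result correspond to "a1×b1", "a2×b2", "a3×b3", and "a4×b4" is described with reference to FIG. 10.
[0102]
In FIG. 10 each partial product is drawn as a rectangle.
[0103]
The values of "a1×b1", "a2×b2", "a3×b3", and "a4×b4" are obtained by adding partial products 25 to 32, 17 to 24, 9 to 16, and 1 to 8, respectively. The parts shaded black in the figure are therefore the extended signs of these multiplications.
[0104]
For multiplication "a4×b4", for example, the extended sign occupies digits 9 to 15 in partial product 1, digits 10 to 15 in partial product 2, digits 11 to 15 in partial product 3, digits 12 to 15 in partial product 4, digits 13 to 15 in partial product 5, digits 14 to 15 in partial product 6, and digit 15 in partial product 7.
[0105]
As in the two-multiplication parallel operation described earlier, partial products 8, 16, 24, and 32 correspond to sign bits, so negative values must be generated for them.
[0106]
Negative partial products can be generated and added by attaching, to 8 of the 32 full adders used to add one partial product, the same logic gates 2208 to 2215 and selectors 2216 to 2224 shown in FIG. 22, connected in the same way as to full adders 2200 to 2207. The full adders that receive the added logic gates and selectors are those for bits 1 to 8 (counted from the low-order end) for partial product 8, bits 9 to 16 for partial product 16, bits 17 to 24 for partial product 24, and bits 25 to 32 for partial product 32.
[0107]
In the partial product addition of the 32-bit multiplication, addition results must not propagate between partial products 8 and 9, between partial products 16 and 17, or between partial products 24 and 25. In addition, the lower 8 bits of partial products 9 to 16, the lower 16 bits of partial products 17 to 24, and the lower 24 bits of partial products 25 to 32 (the hatched parts in the figure) are set to "0". With these functions, the partial product sums of the four multiplications do not affect one another, so the partial product additions of the four multiplications can be executed in parallel.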
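[0108]
FIG. 11 shows part of a partial product adder in which the partial product sums of the four multiplications do not affect one another.
[0109]
FIG. 11 shows an example circuit that implements two functions: preventing the addition result of partial product 8 from propagating into the addition of partial product 9, and setting the lower 8 bits of partial product 9 to "0" so that the sum of partial products 1 to 8 is not corrupted.
[0110]
Full adders 1100 to 1107 add bits 13 down to 6 of partial product 8.
[0111]
Full adders 1108 to 1115 add bits 13 down to 6 of partial product 9. When signal 1133 is set to "0", logic gates 1116 and 1120 block addition results 1128 and 1129 of partial product 8 from entering full adder 1108. Logic gates 1117 to 1119 and 1120 to 1123 operate in the same way and block the addition results from entering the corresponding full adders.
[0112]
Signal 1133 is "0" when four multiplications are executed in parallel.
[0113]
Full adder 1108 therefore receives "0", which is added to partial product 9. Consequently the addition result of partial product 8 cannot be used in the addition of partial product 9 from bit 10 upward.
[0114]
Logic gate 1124 blocks, under signal 1142, the input to full adder 1112, which adds bit 8 of partial product 9. Logic gates 1125 to 1127 perform the same blocking operation under signal 1133.
[0115]
Signal 1142 is "0" during four-multiplication parallel operation. The bits of partial product 9 from bit 8 downward are therefore "0".
[0116]
With this configuration, signals 1133 and 1142 and logic gates 1116 to 1127 separate the addition up to partial product 8 from the addition of partial product 9. Partial products 16 and 17, and partial products 24 and 25, can be separated in the same way. This partial product adder can therefore execute the partial product additions of the four multiplications in parallel.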
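The overall behavior of the separated partial product addition can be emulated in software as follows. This is a behavioral sketch under stated assumptions, not a model of the gates in FIG. 11: masking off the other bytes stands in for zeroing the lower bits of partial products 9 to 32, and truncating each lane to 16 bits stands in for blocking propagation between groups. The names `pack4`, `signed_byte`, `split_multiply4`, and `fields16` are hypothetical.

```python
def pack4(v1: int, v2: int, v3: int, v4: int) -> int:
    """Pack four 8-bit values; v1 occupies the top byte, v4 the bottom byte."""
    return ((v1 & 0xFF) << 24) | ((v2 & 0xFF) << 16) | ((v3 & 0xFF) << 8) | (v4 & 0xFF)


def signed_byte(word: int, lane: int) -> int:
    """Extract byte `lane` (0 = lowest) as a two's complement value."""
    value = (word >> (8 * lane)) & 0xFF
    return value - 0x100 if value & 0x80 else value


def split_multiply4(a_word: int, b_word: int) -> int:
    """Return a 64-bit word holding four independent 16-bit products."""
    result = 0
    for lane in range(4):                       # lane 0 holds a4 and b4
        a = signed_byte(a_word, lane)           # bits of other lanes are masked off
        field = 0
        for j in range(8):                      # the 8 partial products of this lane
            if (b_word >> (8 * lane + j)) & 1:
                term = -a if j == 7 else a      # sign-bit partial product is negated
                field += term << j
        field &= 0xFFFF                         # nothing propagates past the 16-bit field
        result |= field << (16 * lane)
    return result


def fields16(word64: int) -> list:
    """Read the four 16-bit fields, highest first, as signed values."""
    out = []
    for lane in (3, 2, 1, 0):
        value = (word64 >> (16 * lane)) & 0xFFFF
        out.append(value - 0x10000 if value & 0x8000 else value)
    return out


a = pack4(-3, 100, -128, 7)
b = pack4(-5, -2, -128, 9)
print(fields16(split_multiply4(a, b)))  # [15, -200, 16384, 63]
```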
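[0117]
Shift-and-selector 901 receives a control signal indicating parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4". It aligns the addends c1, c2, c3, c4 with "a1×b1", "a2×b2", "a3×b3", and "a4×b4", which appear in the result of the partial product adder in 16-digit fields starting from the high-order end.
[0118]
To do so, it shifts c1, c2, and c3 of the data packed as shown at 1200 in FIG. 12 so that, as shown at 1201, bit 1 of c1 falls on digit 49, bit 1 of c2 on digit 33, bit 1 of c3 on digit 17, and bit 1 of c4 on digit 1. The positions in 1201 where no c1, c2, c3, or c4 data exist are filled with "0". When the parallel operation is not executed, the data packed as shown at 1200 are not shifted.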
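The alignment described above amounts to moving byte k of the packed word to the bottom of 16-bit field k. A minimal sketch, with the hypothetical name `align_addends` and a `parallel` flag standing in for the control signal, follows. As in the text, the vacated positions are filled with 0.

```python
def align_addends(c_word: int, parallel: bool = True) -> int:
    """Move each packed 8-bit addend to the bottom of its 16-bit field.

    With digits counted from 1 at the low-order end, c4 stays at digit 1,
    c3 moves to digit 17, c2 to digit 33, and c1 to digit 49.
    """
    if not parallel:
        return c_word                    # ordinary operation: data are not shifted
    aligned = 0
    for lane in range(4):                # lane 0 holds c4, lane 3 holds c1
        byte = (c_word >> (8 * lane)) & 0xFF
        aligned |= byte << (16 * lane)   # positions without data remain 0
    return aligned


print(hex(align_addends(0x11223344)))  # 0x11002200330044
```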
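[0119]
FIG. 13 shows part of 3-input full adder array 902. The array adds three inputs and produces two outputs, a sum and a carry. The three inputs are the results (sum and carry) of the four parallel multiplications "a1×b1", "a2×b2", "a3×b3", and "a4×b4", arranged in 16-bit fields from the high-order end, and the input selected by the shift-and-selector.
[0120]
Full adders 1300 and 1301 are used for the lower 2 bits when computing the result of "a3×b3+c3". Full adders 1302 and 1303 are used for the upper 2 bits when computing the result of "a4×b4+c4".
[0121]
Logic element 1304 blocks carry 1306 of full adder 1302 under signal 1305, which is "0" during parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4".
[0122]
The figure shows the boundary between the means that compute "a3×b3+c3" and "a4×b4+c4". Logic circuits of the same configuration are provided at the boundary between the means for "a1×b1+c1" and "a2×b2+c2" and at the boundary between the means for "a2×b2+c2" and "a3×b3+c3", and block the carries there.
[0123]
Thus, even when "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" are executed in parallel, carries have no effect across the four operations. The carries generated at the boundary of each multiplication are sent to the overflow/underflow determination unit for use in the overflow/underflow determination.
[0124]
FIG. 14 shows part of the 64-bit adder, which blocks carries at the boundaries between the values of the multiply-accumulate operations "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4". Full adders 1400 and 1401 are used for the lower 2 bits when computing the result of "a3×b3+c3".
[0125]
Full adders 1402 and 1403 are used for the upper 2 bits when computing the result of "a4×b4+c4". Logic gate 1404 blocks carry 1406 of full adder 1402 under signal 1405, which is "0" during parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4".
[0126]
FIG. 14 shows the boundary between the means that compute "a3×b3+c3" and "a4×b4+c4". Logic circuits of the same configuration are provided at the boundary between the means for "a1×b1+c1" and "a2×b2+c2" and at the boundary between the means for "a2×b2+c2" and "a3×b3+c3", and block the carries there.
[0127]
Thus, even when "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" are executed in parallel, carries do not affect the results of the other operations. The carries generated at the boundary of each operation are sent to the overflow/underflow determination unit for use in the overflow/underflow determination.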
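The carry blocking in the final adder can be sketched at word level as follows. The name `lane_add` and the `parallel` flag (standing in for the control signal that is "0" during parallel execution) are hypothetical, and the sketch adds whole 16-bit fields instead of modeling individual full adders. It returns the blocked boundary carries separately, in the way the text forwards them to the determination unit.

```python
def lane_add(x: int, y: int, parallel: bool = True) -> tuple:
    """Add two 64-bit words, optionally blocking carries at 16-bit boundaries.

    Returns (sum, carries), where carries lists the carry out of each
    16-bit field from the lowest field upward (empty in ordinary mode).
    """
    if not parallel:
        return (x + y) & 0xFFFFFFFFFFFFFFFF, []   # ordinary 64-bit addition
    total, carries = 0, []
    for lane in range(4):
        a = (x >> (16 * lane)) & 0xFFFF
        b = (y >> (16 * lane)) & 0xFFFF
        s = a + b
        total |= (s & 0xFFFF) << (16 * lane)      # carry does not enter the next field
        carries.append(s >> 16)                   # kept for overflow determination
    return total, carries


x = 0x0000FFFF00018000
y = 0x0000000100018000
print(lane_add(x, y))         # (131072, [1, 0, 1, 0])
print(lane_add(x, y, False))  # (281474976907264, [])
```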
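[0128]
FIG. 15 shows the configuration of the overflow/underflow determination unit.
[0129]
1500, 1501, 1502, and 1503 are the determination units for "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4", respectively.
[0130]
1307 to 1310 are the carry data between the operation results, produced by 3-input full adder array 902 when "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" are executed in parallel.
[0131]
1407 to 1410 are the carry data between the operation results, produced by 64-bit adder 903 when "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" are executed in parallel.
[0132]
Each determination unit forms, from the two carry data of its operation and the corresponding operation result in the output of 64-bit adder 903, an operation result whose precision is not limited to 8 bits. It compares that result with the predetermined 8-bit upper limit and 8-bit lower limit, determines whether the result overflows or underflows, and outputs the determination.
[0133]
For example, determination unit 1500 for "a1×b1+c1" adds carries 1307 and 1407.
[0134]
It concatenates the resulting sum 1504 with the lower 16 bits 1505 of the 64-bit adder output, with 1504 in the high-order position, to form an 18-bit operation result.
[0135]
It compares this new operation result 1506 with the predetermined 8-bit upper limit and 8-bit lower limit to determine overflow and underflow, and outputs determination result 1507. Determination units 1501, 1502, and 1503 likewise determine overflow and underflow and output their results as 1508, 1509, and 1510, respectively.
[0136]
The 32-bit determination unit compares the lower 32 bits of the 64-bit adder output with the predetermined 32-bit upper limit and 32-bit lower limit, determines whether the operation result overflows or underflows, and outputs determination result 1511.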
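[0137]
FIG. 16 shows the configuration of aligner 904. The aligner controls how data are taken from output 1601 of 64-bit adder 903, so that it can handle both the ordinary multiply-accumulate operation and the parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" described so far. Signal 1600 is "0" during that parallel execution and "1" during an ordinary multiply-accumulate operation.
[0138]
Selector 1608 chooses between data 1602 (lower bit 56 through lower bit 49 of 1601) and data 1603 (lower bit 32 through lower bit 25 of 1601). It selects data 1603 when signal 1600 is "1" and data 1602 when signal 1600 is "0". Selector 1609 chooses between data 1604 (lower bit 40 through lower bit 33 of 1601) and data 1605 (lower bit 24 through lower bit 17 of 1601). It selects data 1605 when signal 1600 is "1" and data 1604 when signal 1600 is "0".
[0139]
Selector 1610 chooses between data 1605 (lower bit 24 through lower bit 17 of 1601) and data 1606 (lower bit 16 through lower bit 9 of 1601). It selects data 1606 when signal 1600 is "1" and data 1605 when signal 1600 is "0".
[0140]
Data 1607 (the lower 8 bits of 1601) is not subject to selection by signal 1600. As a result, during parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4", signal 1600 is "0", so data 1602, 1604, 1605, and 1607 are stored in register areas d1, d2, d3, and d4, respectively, and the 32-bit data are packed as shown at 1611. In an ordinary multiply-accumulate operation, data 1603, 1605, 1606, and 1607 are stored, giving the 32-bit data shown at 1612.
[0141]
The aligner of FIG. 16 thus converts the output of the 64-bit adder into 32-bit data: as shown at 1611 when signal 1600 indicates a parallel multiply-accumulate operation (signal 1600 is "0"), and as shown at 1612 for an ordinary 32-bit multiply-accumulate operation (signal 1600 is "1").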
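The byte selection performed by the aligner can be sketched as follows. The function name `aligner` and the `parallel` flag are hypothetical; the flag is True when signal 1600 would be "0". Bit positions in the text are counted from 1 at the low-order end, so "lower bit 56 through lower bit 49" corresponds to a right shift by 48 in the code.

```python
def aligner(sum64: int, parallel: bool) -> int:
    """Select four bytes of the 64-bit adder output and pack them into 32 bits."""
    def byte_at(low_bit: int) -> int:
        return (sum64 >> low_bit) & 0xFF

    if parallel:
        # data 1602, 1604, 1605, 1607: the low byte of each 16-bit field
        d1, d2, d3, d4 = byte_at(48), byte_at(32), byte_at(16), byte_at(0)
    else:
        # data 1603, 1605, 1606, 1607: simply the low 32 bits
        d1, d2, d3, d4 = byte_at(24), byte_at(16), byte_at(8), byte_at(0)
    return (d1 << 24) | (d2 << 16) | (d3 << 8) | d4


print(hex(aligner(0x00AA00BB00CC00DD, parallel=True)))   # 0xaabbccdd
print(hex(aligner(0x123456789ABCDEF0, parallel=False)))  # 0x9abcdef0
```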
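[0142]
Next, maximum/minimum replacement unit 906 receives the determination signals from overflow/underflow determination unit 905 and, during parallel execution of "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4", applies predetermined processing to each operation result.
[0143]
For example, if the determination is overflow, the operation result is replaced with the predetermined maximum value expressed in 8 bits. If the determination is underflow, the result is replaced with the predetermined minimum value expressed in 8 bits. If it is neither, the operation result is output unchanged.
[0144]
In an ordinary multiply-accumulate operation, if the determination is overflow, the operation result is replaced with the predetermined maximum value expressed in 32 bits. If the determination is underflow, the result is replaced with the predetermined minimum value expressed in 32 bits. If it is neither, the operation result is output unchanged.
[0145]
With the configuration described above, the operations "a1×b1+c1", "a2×b2+c2", "a3×b3+c3", and "a4×b4+c4" can be performed separately, and a general-purpose 32-bit multiply-accumulate unit can execute them in parallel.
[0146]
By applying the technical idea of the invention with the operand bit width taken into account, partial multiply-accumulate operations can of course be executed not only for the "32 bits × 32 bits" case shown in this embodiment but also for other bit widths.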
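[0147]
FIG. 17 shows forms of instruction codes for a microprocessor that includes the multiply-accumulate unit or multiplier of the invention. Opcodes 1700 to 1703 are defined according to the type of operation. Operands 1704 to 1707 specify the source registers and target register of the data used in the operation.
[0148]
Opcodes 1700 and 1702 are used for parallel operation, to perform multiple multiplications or multiple multiply-accumulate operations simultaneously. Opcodes 1701 and 1703 are used for an ordinary multiplication or multiply-accumulate operation, respectively.
[0149]
Taking the two-data parallel operation as an example, opcode 1700 "SplitMPY" is an instruction that uses data a1, a2, b1, b2, stored in source register r1 (which packs a1 and a2) and source register r2 (which packs b1 and b2), performs the multiplications "a1×b1" and "a2×b2" in parallel, and stores the result in target register r3.
[0150]
Opcode 1702 "SplitMPYADD" is an instruction that uses data a1, a2, b1, b2, c1, c2, stored in source register r1 (which packs a1 and a2), source register r2 (which packs b1 and b2), and source register r3 (which packs c1 and c2 and doubles as the target register), performs the multiply-accumulate operations "a1×b1+c1" and "a2×b2+c2" in parallel, and stores the result in register r3.
[0151]
The four-data parallel operation differs only in the number of data items and the number of parallel operations, and instruction codes can be defined for it in the same way.
[0152]
Opcode 1701 "ConnectMPY", on the other hand, is an instruction that uses the values of registers r1 and r2 shown in 1705 (call the values R1 and R2), performs the multiplication "R1×R2", and stores the result in register r2. Opcode 1703 "ConnectMPYADD" is an instruction that uses the values of registers r1, r2, and r3 shown in 1707 (call the values R1, R2, and R3), performs the multiply-accumulate operation "R1×R2+R3", and stores the result in register r3, which doubles as the target register.
[0153]
In an architecture that has only opcode 1703, executing "a1×b1+c1" and "a2×b2+c2" requires two instructions. Defining opcode 1702 for parallel operation, as described above, allows the parallel multiply-accumulate operations "a1×b1+c1" and "a2×b2+c2" to be executed with one instruction.
[0154]
Similarly for multiplication, defining opcode 1700 reduces the number of instructions needed compared with using opcode 1701.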
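The register-level effect of opcode 1702 can be sketched as follows. The text does not define the instruction semantics at bit level, so several points are assumptions: signed 8-bit data in the upper and lower 8 bits of each 32-bit register, saturation to the range -128 to 127, and a register file modeled as a Python dictionary. The names `split_mpy_add`, `_signed8`, and `_saturate8` are hypothetical.

```python
def _signed8(value: int) -> int:
    """Read the low 8 bits as a two's complement number."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def _saturate8(value: int) -> int:
    """Clamp to the assumed 8-bit limits and return the 8-bit pattern."""
    return max(-128, min(127, value)) & 0xFF


def split_mpy_add(regs: dict, r1: str, r2: str, r3: str) -> None:
    """Model of SplitMPYADD r1, r2, r3: r3 doubles as addend source and target."""
    a, b, c = regs[r1], regs[r2], regs[r3]
    high = _saturate8(_signed8(a >> 24) * _signed8(b >> 24) + _signed8(c >> 24))
    low = _saturate8(_signed8(a) * _signed8(b) + _signed8(c))
    regs[r3] = (high << 24) | low


regs = {
    "r1": (3 << 24) | (-4 & 0xFF),   # a1 = 3, a2 = -4
    "r2": (5 << 24) | 6,             # b1 = 5, b2 = 6
    "r3": (1 << 24) | 2,             # c1 = 1, c2 = 2
}
split_mpy_add(regs, "r1", "r2", "r3")
print(hex(regs["r3"]))  # 0x100000ea
```

One call produces both 3×5+1 = 16 and -4×6+2 = -22, where an architecture with only the ordinary multiply-accumulate instruction would need two instructions plus the packing and unpacking around them.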
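[0155]
This shortens programs. Because programs are stored in memory, using these instructions reduces the memory capacity required, which is useful when the multiply-accumulate unit of the invention is a component of a microprocessor.
[0156]
Next, FIG. 18 shows another embodiment of the invention.
[0157]
FIG. 18 is a configuration diagram of a system that has a microprocessor 1800 including at least one of a multiplier according to the invention (for example, means that performs the operation "a1×b1") and a multiply-accumulate unit according to the invention (for example, means that performs the operation "a1×b1+c1").
[0158]
The storage device stores the program that defines the processing executed by microprocessor 1800, the necessary data, and so on. In this system, microprocessor 1800 is assumed to be performing some image processing according to that program. The processed image is displayed on display device 1802, which is implemented by a CRT or the like. This display processing is performed by microprocessor 1800 according to a predetermined program.
[0159]
Image processing requires multiply-accumulate operations to be executed frequently, and the data needed for them are assumed to be stored in storage device 1801.
[0160]
Multiply-accumulate unit 1803 is the multiply-accumulate unit according to the invention and executes multiple multiply-accumulate operations in parallel.
[0161]
Assume that a multiply-accumulate operation is performed on data stored in storage device 1801, by way of register 1804 in the microprocessor.
[0162]
When the program directs a multiply-accumulate operation, microprocessor 1800 accesses storage device 1801 and, via bus 1805, loads the data needed for the operation into its own register 1804. It could access only the data needed for a single multiply-accumulate operation, but multiple operations are normally performed at a time, so all the relevant data are held in register 1804. Examples of the multiplicand data (a1, a2, a3, a4), multiplier data (b1, b2, b3, b4), and addend data (c1, c2, c3, c4) held in register 1804 are shown on the left side of FIG. 18. The figure shows one set of data for one multiply-accumulate operation, but multiple sets are normally held.
[0163]
The multiply-accumulate unit then starts. It first takes in all the data needed for the operation from register 1804 via the source bus.
[0164]
The multiply-accumulate unit performs the multiply-accumulate operation described above on the data it has taken in, and sends the results in sequence via the target bus to free areas of register 1804, where they are held. If later image processing uses the results, they may also be stored in storage device 1801.
[0165]
Because the multiply-accumulate unit of the invention can perform multiple kinds of multiply-accumulate operations simultaneously, the speed of image processing improves markedly.
[0166]
Because multiple multiply-accumulate operations can be executed repeatedly in parallel, processing that consists of repeated multiply-accumulate operations and is performed frequently in image processing, such as the discrete cosine transform, can also be performed at high speed.
[0167]
As described above, implementing a microprocessor that incorporates the multiply-accumulate unit (or multiplier) of the invention, and using that microprocessor, makes it possible to build a system capable of, for example, high-speed image processing. The processing handled by the system is of course not limited to image processing and may be any processing that performs a large number of multiply-accumulate operations.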
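[0168]
[Effects of the Invention]
As described above, the present invention allows multiple multiply-accumulate operations to be executed in parallel, so multiple multiply-accumulate operations can be performed at very high speed.
[Brief Description of the Drawings]
[FIG. 1] Configuration diagram of an embodiment of the present invention.
[FIG. 2] Explanatory diagram of sign extension and the separation function in multiplication.
[FIG. 3] Configuration diagram of means for implementing the sign extension function of the partial product adder.
[FIG. 4] Configuration diagram of means for implementing the separation function of the partial product adder.
[FIG. 5] Explanatory diagram of the operation of the "shift-and-selector".
[FIG. 6] Configuration diagram of the overflow/underflow determination unit.
[FIG. 7] Configuration diagram of the aligner.
[FIG. 8] Explanatory diagram of the maximum/minimum replacement processing.
[FIG. 9] Configuration diagram of another embodiment of the present invention.
[FIG. 10] Explanatory diagram of sign extension and the separation function in multiplication.
[FIG. 11] Configuration diagram of the partial product adder.
[FIG. 12] Explanatory diagram of the operation of the "shift-and-selector".
[FIG. 13] Configuration diagram of the 3-input full adder array.
[FIG. 14] Configuration diagram of the 64-bit adder.
[FIG. 15] Configuration diagram of the overflow/underflow determination unit.
[FIG. 16] Configuration diagram of the aligner.
[FIG. 17] Explanatory diagram of instructions for multiply-accumulate operations.
[FIG. 18] Configuration diagram of another embodiment of the present invention.
[FIG. 19] Explanatory diagram of conventional multiplication processing.
[FIG. 20] Explanatory diagram of a conventional partial product adder.
[FIG. 21] Explanatory diagram of the input/output relationship of a full adder.
[FIG. 22] Configuration diagram of means for implementing the negative partial product addition function.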
[Industrial applications]
The present invention relates to means for performing arithmetic processing such as product-sum operation used in image processing or the like at high speed.
[0002]
[Prior art]
2. Description of the Related Art Conventionally, in the field of image processing (image processing), high-speed, high-precision, etc., high-level arithmetic performance is required when performing arithmetic processing. The company has produced a dedicated arithmetic unit for arithmetic processing and applied it to image processing.
[0003]
If such a dedicated arithmetic unit is manufactured according to the content of the image processing, and a system for performing the image processing is designed and manufactured, the cost of the system will increase.
[0004]
On the other hand, the performance of general-purpose processors that can be constructed at relatively low cost and can be applied to image processing has improved, but the more general processing is performed by the built-in arithmetic unit in the general-purpose processor, the more general-purpose processors The processing speed and processing content of the arithmetic unit built therein are not excellent.
[0005]
By the way, a so-called product-sum operation, which is an arithmetic process frequently performed in image processing, can be executed by an arithmetic unit configured by appropriately combining a multiplier and an adder.
[0006]
In such a conventional arithmetic unit, a multiplier for multiplying a given two numbers has a function of generating a partial product and a function of adding a partial product.
[0007]
Here, generation of a partial product and partial product addition will be described with reference to FIG.
[0008]
Here, the number of bits of data used for multiplication is 5 bits.
[0009]
The “partial product” is obtained by examining the bits of the multiplier 1901 bit by bit. If the bit content is “1”, it is the value of the multiplicand 1900 itself, and if the bit content is “0”, the partial product is calculated. It is set to “0”.
[0010]
However, the partial product generated by the sign bit of the multiplier 1901 is, if the sign bit content is “1”, the sum of the bit inversion of the multiplicand 1901 and the addition 1, and if the sign bit is “0”, it is “0”. I do.
[0011]
In FIG. 19, the partial product is represented by being surrounded by a rectangle, and its contents are shown in the rectangle.
[0012]
In the 5-bit multiplication, since the number of bits of the multiplier is 5 (bits), five partial products are generated as shown in FIG. In the calculation example shown in the figure, the partial product 1 (1902) and the partial product 2 (1903) are generated when their contents of the bit to be examined are “1”. Become.
[0013]
In addition, the partial product 3 (1904) and the partial product 4 (1905) become “0” because the content of the bit to be examined is “0” at the time of generation, and the partial product 5 (1906) becomes , And the sign bit to be examined at the time of generation is “1”, so that the multiplicand is a negative number.
[0014]
The addition of each partial product is performed by shifting the partial product one bit at a time to the upper (left) side in the order of generation from the lower bits of the multiplier, and then adding the products. In addition, when the multiplier and the multiplicand are expressed in negative numbers, that is, in a so-called two's complement expression, in order to correctly execute the partial product addition, the sign must be extended (hereinafter, referred to as “sign extension”) and added. Must.
[0015]
In the example shown in FIG. 19, the partial product 1 is extended by 4 bits (1907), the partial product 2 is extended by 3 bits (1908), the partial product 3 is extended by 2 bits (1909), and the partial product 4 is extended , One bit extended (1910). By such sign extension, accurate partial product addition can be performed.
[0016]
Usually, this partial product addition is performed using a carry save adder and a carry propagation adder as shown in FIG. The partial product adder shown in FIG. 20 is a carry save adder configured by arranging three-input full adders.
[0017]
FIG. 21 shows the operation of the three-input full adder, which is a basic component of the partial product adder.
[0018]
The 3-input full adder adds the input 3 bits (2100, 2101, 2102) and outputs a carry 2104 and a sum 2103.
[0019]
As shown in FIG. 21, three values are input, and in a predetermined case, carry output (2104) is performed and addition (2103) is performed.
[0020]
In a full adder (2000, 2001, 2002, 2003, 2004) which is an adder having a carry-save function shown in FIG. 20, partial products 1 to 3 shown in FIG. Perform the addition of The "carry" of the result of the addition performed by each full adder is sent to the next-stage full adder of the next digit, and the "sum" is sent to the next-stage full adder (2010, 2011, etc.). And performs addition with the partial product 4 shown in FIG. Further, the result is input to a full adder (2012 or the like) used for adding the partial product 5, and is added.
[0021]
Since the partial product 5 needs to invert the value of the multiplicand and add “1”, the input 2013 of the full adder 2012 is used as an input for adding 1.
[0022]
As an example, the first-stage full adder 2000 includes an extension code 2005 (value is “1”) of the partial product 1, an extension code 2006 (value is “1”) of the partial product 2, and a code of the partial product 3. 2007 (the value is “0”) is input and addition is performed.
[0023]
Then, the carry 2008 of the addition result is inputted to the next-stage full-adder 2010 one digit higher, and the sum 2009 is inputted to the next-stage full adder 2011 of the same digit.
[0024]
Full adder 2010 receives the carry and sum of the addition results of the extension codes of partial products 1 to 3 and performs addition. Since the full adder 2014 that generates the input signal performs the same calculation as the full adder 2000, the addition result 2009 of the full adder 2000 is used as an input of the full adder 2010, and as a result, the full addition The adder 2000 performs the addition of the upper digit, that is, the full adder 2014 existing on the left side can be omitted.
[0025]
As described above, the adder having the carry save function sends “carry” to the next stage and repeats the addition. Therefore, even if the addition of all partial products is completed, two outputs 2024 to 2038 remain. Therefore, in order to obtain the final result, a so-called carry propagation adder as shown in FIG. 20 is required to further add the two outputs.
[0026]
In the configuration shown in FIG. 20, the carry propagation adder includes full adders 2015 to 2022. As an example, the connection between these full adders has a connection relationship such that the carry 2023 of the full adder 2016 becomes an input of the full adder 2015, and literally constitutes a carry propagation adder. ing.
[0027]
[Problems to be solved by the invention]
As described above, for example, in the sum-of-products arithmetic unit shown in FIG. 20, not only an adder having a carry-save function, but also a carry propagation adder must be provided to realize the sum-of-products arithmetic unit. Did not.
[0028]
As described above, when the operation is performed using the conventional dedicated operation unit, the processing performance is satisfied, but the cost of the system is increased in most cases. On the other hand, although the use of the general-purpose arithmetic unit can suppress an increase in cost, there still remains a problem that its processing performance is not satisfactory.
[0029]
However, since the use of a general-purpose arithmetic unit is indispensable for cost reduction, it is necessary to improve the processing performance of the general-purpose arithmetic unit.
[0030]
Therefore, an object of the present invention is to provide a means for independently operating a part of a multiply-accumulate unit frequently used in image processing or the like among general-purpose arithmetic units included in a general-purpose processor, thereby enabling a plurality of multiply-accumulate operations to be performed. It is an object of the present invention to provide a computing means which can be used for high-speed image processing or the like, in which the cost is suppressed, the computation speed is high, and the cost is reduced.
[0031]
[Means for Solving the Problems]
To solve the above problems and achieve the object of the present invention, the following means are conceivable.
[0032]
An arithmetic unit for obtaining a product of a number of N bits having a plurality of multiplicands and a number of M bits having a plurality of multipliers corresponding to each of the multiplicands to obtain a multiplication result for a set of a multiplicand and a multiplier, An arithmetic unit comprising the means of (1).
[0033]
That is, a first register that holds a plurality of multiplicands and is arranged on condition that the sum of bit lengths of each multiplicand does not exceed N, and holds the number of N bits with 0 embedded between each multiplicand And the multipliers corresponding to the respective multiplicands are arranged, provided that the sum of the bit lengths of the respective multipliers does not exceed M, and the number of M bits is held with 0 embedded between the multipliers. A second register, a partial product processing unit for performing a process of obtaining a partial product of the value held in the first register and the value held in the second register, and a set of multiplicands in the partial product. A sign extension unit for embedding bits obtained by expressing the multiplication result in two's complement to compensate for the sign of the multiplication result of the multiplier, and a means for sequentially calculating the sum of each partial product; The sum of all partial products for If obtained, the excess value is discarded when calculating the sum of partial products for the next other set of multiplicands and multipliers, summing means for calculating the sum value of each partial product, From the data of the sum value (“N + M” (bits)), a value that is the result of multiplication of each multiplicand in the first register and the corresponding multiplier in the second register is cut out and corresponds to a set of the multiplicand and the multiplier. Processing means for obtaining a multiplication result for each set.
[0034]
[Action]
The present invention provides a general-purpose product-sum arithmetic unit with a multiplier that has a function of performing multiplications separately and simultaneously on the individual data items stored in an input register or the like, and with an adder that has a function of performing the additions to the respective multiplication results separately and simultaneously.
[0035]
First, the first register holds a plurality of multiplicands, arranged on the condition that the sum of their bit lengths does not exceed N, as an N-bit number with 0 embedded between the multiplicands. The second register holds a plurality of multipliers corresponding to the multiplicands, arranged on the condition that the sum of their bit lengths does not exceed M, as an M-bit number with 0 embedded between the multipliers.
[0036]
Next, the partial product processing unit obtains the partial products of the value held in the first register and the value held in the second register, and the sign extension unit embeds in the partial products the bits needed to compensate for the sign of the multiplication result of each pair of a multiplicand and a multiplier when that result is expressed in two's complement.
[0037]
Then, the summing means sequentially calculates the sum of the partial products. When the sum of all the partial products for one pair of a multiplicand and a multiplier exceeds a predetermined value, the excess is discarded when the sum of the partial products for the next pair is calculated, and the sum of the partial products is obtained in this way.
[0038]
Finally, from the sum data (“N + M” bits) obtained by the summing means, the processing means cuts out the values that are the results of multiplying each multiplicand in the first register by the corresponding multiplier in the second register. A multiplication result is thus obtained for each pair of a multiplicand and a multiplier, and parallel operation is realized.
[0039]
【Example】
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0040]
FIG. 1 shows a configuration diagram of an embodiment according to the present invention.
[0041]
The present embodiment is an arithmetic unit for simultaneously executing the operations “a1 × b1 + c1” and “a2 × b2 + c2”.
[0042]
The arithmetic unit as a whole includes a partial product adder 104 that takes as inputs 102 and 103, respectively, the two multiplicands a1 and a2 packed in the register 100 (hereinafter, data stored together in one register in this way is described as “packed”) and the two multipliers b1 and b2 packed in the register 101, and that has a function of adding the partial products generated from these inputs.
[0043]
The arithmetic unit further includes: a “shift and selector” 107 that takes as an input 106 the two addends c1 and c2 packed in the register 105 and, so that they can be added to the two outputs “a1 × b1” and “a2 × b2” of the partial product adder 104, has a function of shifting the contents of the input 106 to align their digits with those outputs; a three-input full adder sequence 108 having a function of adding the two outputs of the partial product adder 104 and the output of the “shift and selector” 107; a 64-bit adder 109 having a function of adding the two outputs of the three-input full adder sequence; an aligner 110 for converting the output of the 64-bit adder 109 into a specified format; an overflow/underflow determination unit 111 for determining whether the addition result is an overflow or an underflow; and a maximum/minimum value replacement unit 112 that replaces the output of the aligner 110 with the maximum value if the determination result is an overflow and with the minimum value if the determination result is an underflow. The final operation result is output packed in the register 150.
[0044]
For the embodiment of FIG. 1, the operation of each component when performing the product-sum operations “a1 × b1 + c1” and “a2 × b2 + c2” will be described, taking as an example the case where two data items are packed in the upper 8 bits and the lower 8 bits of a register having a data length of 32 bits.
[0045]
As for the data on which the product-sum operation is performed, the two data items a1 and a2 are packed in the upper 8 bits and the lower 8 bits of 32 bits, as in the data format 200 shown in FIG. 2, and the value “0” is embedded between them to form 32-bit data.
[0046]
Similarly, for the data b1 and b2, and for c1 and c2 (not shown), two data items are packed in the upper 8 bits and the lower 8 bits of 32 bits, and the value “0” is embedded between them to form 32-bit data.
[0047]
By performing a 32-bit multiplication of the 32-bit data in which a1 and a2 are packed and the 32-bit data in which b1 and b2 are packed, the multiplications of the packed 8-bit data, that is, “a1 × b1” and “a2 × b2”, are executed simultaneously.
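The layout described above can be illustrated with a minimal sketch. It assumes unsigned 8-bit values, so that an ordinary multiplication suffices and none of the sign-extension hardware described later is needed; the function names `pack2` and `packed_mul2` are hypothetical and chosen only for this illustration.

```python
MASK8 = 0xFF

def pack2(hi, lo):
    """Pack two unsigned 8-bit values into bits 31..24 and 7..0 of a 32-bit word.
    The 16 bits in between are left at 0, as in data format 200."""
    return ((hi & MASK8) << 24) | (lo & MASK8)

def packed_mul2(a_packed, b_packed):
    """Multiply two packed words and cut both 8-bit products out of the result.

    ASSUMPTION: unsigned data only. For two's-complement data the plain
    product is not enough; the sign handling of FIGS. 3, 22 and 4 is required.
    """
    product = a_packed * b_packed        # ordinary 32 x 32 -> 64-bit multiply
    upper = (product >> 48) & 0xFFFF     # digits 64..49 hold a1 * b1
    lower = product & 0xFFFF             # digits 16..1 hold a2 * b2
    return upper, lower

# Example: a1 = 200, a2 = 17, b1 = 3, b2 = 250
print(packed_mul2(pack2(200, 17), pack2(3, 250)))
```

The cross terms a1 × b2 and a2 × b1 land in the middle digits of the 64-bit product and never reach the two 16-digit fields that are cut out, which is why the embedded zeros keep the two results apart in this unsigned case.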
[0048]
FIG. 2 shows which part of the result of the 32-bit multiplication corresponds to the multiplication “a1 × b1” and “a2 × b2”. In FIG. 2, each partial product is represented by a horizontally long rectangle.
[0049]
The multiplication “a1 × b1” is calculated using the upper 15 digits of each of the partial products 25 (203) to 32 (204), and the multiplication “a2 × b2” is calculated using the lower 15 digits of each of the partial products 1 (201) to 8 (202). Therefore, the portions painted black in the drawing are the portions in which the signs of the multiplications “a1 × b1” and “a2 × b2” are extended. The concept of sign extension is as shown in FIG.
[0050]
Further, since the value “0” is embedded between the multiplier data b1 and b2, the partial products 9 to 24 generated from those bits (the value is “0”) are always “0”. Therefore, when all the partial products are added, it is possible to eliminate the influence on the multiplication “a1 × b1” caused by the propagation of the carry resulting from the partial product addition by a2 × b2.
[0051]
Also, by setting to “0” the result of the sign-extension addition for 32-bit multiplication (the addition result corresponding to partial products 1 to 8) that is used in the addition of the partial product 25, the influence of that addition result on the calculation result of “a1 × b1” can be eliminated.
[0052]
By devising these two points, “a1 × b1” and “a2 × b2” can be calculated in a separated state.
[0053]
Next, the operation of the partial product adder 104 having the functions described above will be described in detail.
[0054]
FIG. 3 shows a configuration example in which the sign extension function for the multiplication “a2 × b2” is realized in a general-purpose partial product adder.
[0055]
As shown in FIG. 2, the range in which sign extension is required for this multiplication is the 9th to 15th digits in “partial product 1”, the 10th to 15th digits in “partial product 2”, the 11th to 15th digits in “partial product 3”, the 12th to 15th digits in “partial product 4”, the 13th to 15th digits in “partial product 5”, the 14th to 15th digits in “partial product 6”, and the 15th digit in “partial product 7”.
[0056]
FIG. 3 illustrates a portion corresponding to the partial products 3 to 5 as an example.
[0057]
The operation of each full adder is as shown in FIG.
[0058]
The full adders 300 to 305 are used for the addition of the 15th to 10th digits of the partial product 3, the full adders 306 to 311 for the addition of the 15th to 10th digits of the partial product 4, and the full adders 312 to 317 for the addition of the 15th to 10th digits of the partial product 5.
[0059]
If the signal 330 is “1”, the selector 318 selects 332 out of its two inputs 331 and 332. The other selectors 319 to 321 operate similarly. Since 332 is the sign of the partial product 3 of the multiplication “a2 × b2”, setting the signal 330 to “1” extends the sign using the signal 332, so that the full adders can add the sign-extended partial product 3.
[0060]
The selectors 323 to 326 and the selectors 327 to 329 likewise select the signs 333 and 334 of the partial products 4 and 5, respectively, when the signal 330 is set to “1”. As a result, sign-extended addition can be realized for the partial products 4 and 5 as well.
[0061]
In addition, although not shown in FIG. 3, for the partial products 1, 2, 6, and 7, a selector for selecting a signal for sign extension by the signal 330 is provided on the input side of the full adder. With this configuration, sign-extended addition can be realized by partial product addition of multiplication “a2 × b2”.
[0062]
FIG. 22 is an implementation example, in a general-purpose partial product adder, of generating the partial product 8 from the sign bit (the eighth bit from the lower order) of the multiplier b2. Full adders 2200 to 2207 add the lower 8 bits of partial product 8. This configuration realizes the process of inverting the data and adding “1”, which is needed to generate a partial product from a sign bit as shown in FIG. 19.
[0063]
The logic gate 2208 has a function of inverting the input 2227 when the signal 2225 is “1” and outputting “1” regardless of the value of the input 2227 when the signal 2225 is “0”. The selector 2216 has a function of selecting the output of the logic gate 2208 when the signal 2226 is “1”, and selecting the data 2227 of the eighth bit from the lower order of the partial product 8 when the signal 2226 is “0”.
[0064]
The “logic gates 2209 to 2215 and selectors 2217 to 2223” operate in the same manner as the “logic gate 2208 and selector 2216”. The selector 2224 outputs “1” when the signal 2226 is “1”.
[0065]
By setting the signal 2226 to “1” using the logic gates 2208 to 2215 and the selectors 2216 to 2224, generation and addition of the partial product 8 of the multiplication “a2 × b2” can be realized. Other partial products can be similarly generated and added. As described above, the configurations shown in FIGS. 3 and 22 can execute the operation of the partial product addition of the multiplication “a2 × b2”.
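The combined effect of the sign extension of FIG. 3 and the sign-bit partial product of FIG. 22 can be sketched in software as follows. This is only an illustrative model under stated assumptions: the logic gate is modelled as a NAND (it matches the described behaviour of inverting when the control is “1” and outputting “1” otherwise), the result is kept in a 15-digit field as stated for FIG. 2, and the names `signed_mul8` and `to_signed` are hypothetical.

```python
FIELD = 15                       # digits used per product, as stated for FIG. 2
FIELD_MASK = (1 << FIELD) - 1

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def signed_mul8(a, b):
    """8-bit x 8-bit two's-complement multiply built from eight partial products."""
    a &= 0xFF
    b &= 0xFF
    # Partial products 1..7: the multiplicand with its sign copied up to digit 15
    # (the role played by the selectors driven by signal 330).
    a_ext = a | ((FIELD_MASK & ~0xFF) if a & 0x80 else 0)
    total = 0
    for k in range(7):
        if (b >> k) & 1:
            total += (a_ext << k) & FIELD_MASK
    # Partial product 8 comes from the sign bit of the multiplier:
    # each bit is NAND(multiplicand bit, sign bit), then "1" is added.
    # With sign bit 0 this gives 0xFF + 1, which falls out of the field.
    sign_b = (b >> 7) & 1
    nand_bits = ~(a & (0xFF if sign_b else 0)) & 0xFF
    total += ((nand_bits + 1) << 7) & FIELD_MASK
    return to_signed(total, FIELD)

print(signed_mul8(-3, 5), signed_mul8(7, -9), signed_mul8(-100, -100))
```

Because the field is 15 digits wide, the single case (-128) × (-128) = 16384 does not fit in this sketch; every other pair of 8-bit two's-complement operands is reproduced exactly.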
[0066]
Next, the function of eliminating the influence on the multiplication “a1 × b1” of the sign-extension addition results for 32-bit multiplication (the addition of partial products 1 to 8), which is realized by the means having the configuration shown in FIG. 4, will be described below.
[0067]
First, the full adders 400 to 407 add the values of the 32nd to 25th digits of the partial product 25 in FIG. 2 (that is, partial product 1 of the multiplication “a1 × b1”). The full adders 408 to 414 output the addition results up to the partial product 24.
[0068]
The logic gate 415 has a function of setting the sum 433 of the full adder 408 to “0” when the signal 431 is “0”. Similarly, the logic gates 416 to 422 also have a function of setting the sum of the full adders 409 to 414 to “0” when the signal 431 is “0”. The logic gate 423 has a function of setting the carry 432 of the full adder 408 to “0” when the signal 431 is “0”.
[0069]
Similarly, the logic gates 424 to 430 also have a function of setting the carry of the full adders 409 to 414 to “0” when the signal 431 is “0”. When the signal 431 is “1”, the logic elements 415 to 430 have a function of passing the carry and sum values from the full adders 408 to 414 as they are.
[0070]
That is, by setting the signal 431 to “0”, it is possible to control the addition processing of the addition result up to the partial product 24 and the partial product 25. This makes it possible to eliminate the influence on the multiplication “a1 × b1” due to the addition of the sign extension for 32-bit multiplication (addition of partial products 1 to 8).
[0071]
By providing the means having the configurations shown in FIGS. 3, 22, and 4 described above, the partial product adder 104 can execute “a1 × b1” and “a2 × b2” in parallel using the same processing as an ordinary 32-bit multiplication.
[0072]
Next, the “shift and selector” 107 receives the control signal 113 instructing parallel execution of “a1 × b1 + c1” and “a2 × b2 + c2”. So that the digit positions of c1 and c2 are correctly aligned with the upper 16 digits and the lower 16 digits, respectively, of the operation result of the partial product adder 104, it shifts c1 and c2, packed as shown by 500 in FIG. 5, to the positions shown by 501, and embeds “0” between c1 and c2.
[0073]
When the operations of “a1 × b1 + c1” and “a2 × b2 + c2” are not executed in parallel, 500 that is not shifted is selected.
[0074]
As shown in FIG. 5, the three-input full adder sequence 108 performs carry-save addition of the carry-save adder output 503 of the partial product addition, which holds “a1 × b1” and “a2 × b2” in the upper 16 digits and the lower 16 digits, and the input 501 selected by the “shift and selector” 107, and obtains an addition result 506. In the case of parallel execution of “a1 × b1 + c1” and “a2 × b2 + c2”, the value “0” is embedded between c1 and c2 in 501, so no carry occurs across the digits between “a1 × b1 + c1” and “a2 × b2 + c2”. As a result, the additions of c1 and c2 do not affect each other.
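The alignment and carry-save addition described above can be sketched as follows. This is a word-level illustration under assumptions: addends are treated as unsigned 8-bit values, and the names `align_addends2` and `carry_save_add` are hypothetical. A row of three-input full adders applied to whole words yields a sum word and a carry word whose ordinary sum equals the sum of the three inputs.

```python
MASK64 = (1 << 64) - 1

def align_addends2(c_packed):
    """Move c1 (bits 31..24) up to digit 49 and leave c2 (bits 7..0) at digit 1,
    with 0 everywhere else, in the manner of 500 -> 501 in FIG. 5."""
    c1 = (c_packed >> 24) & 0xFF
    c2 = c_packed & 0xFF
    return (c1 << 48) | c2

def carry_save_add(x, y, z):
    """One row of three-input full adders over 64-bit words.
    Returns (sum, carry) such that sum + carry == x + y + z (mod 2**64)."""
    s = x ^ y ^ z                                  # per-digit sum output
    c = ((x & y) | (y & z) | (x & z)) << 1         # per-digit carry, moved up one digit
    return s & MASK64, c & MASK64

# Products 600 and 4250 in the upper and lower 16 digits, addends c1 = 10, c2 = 20
products = (600 << 48) | 4250
s, c = carry_save_add(products, 0, align_addends2((10 << 24) | 20))
total = (s + c) & MASK64
print(total >> 48, total & 0xFFFF)
```

In the unit itself the final addition of the sum and carry words is the job of the 64-bit adder 109; the ordinary `+` above merely stands in for it.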
[0075]
FIG. 6 illustrates a configuration example of the overflow / underflow determination unit 111 that determines in parallel whether an operation result executed in parallel is an overflow or an underflow.
[0076]
The upper-order determination unit 600 and the lower-order determination unit 601 compare the upper 16 bits and the lower 16 bits, respectively, of 506 shown in FIG. 5 with predetermined values, namely the 8-bit upper limit value 603 and the 8-bit lower limit value 604.
[0077]
The upper-order determination unit 600 determines that an overflow has occurred when the contents of the upper 16 bits of 506 shown in FIG. 5 are larger than the 8-bit upper limit value 603, and determines that an underflow has occurred when they are smaller than the 8-bit lower limit value 604. The lower-order determination unit 601 makes the same determination as the upper-order determination unit 600 on the contents of the lower 16 bits of 506 shown in FIG. 5.
[0078]
The 32-bit determination unit 602 compares the lower 32 bits of 506 with predetermined values, namely the 32-bit upper limit value 605 and the 32-bit lower limit value 606, and determines whether an overflow or an underflow has occurred in the case of 32-bit multiplication. Since the upper-order determination unit 600 and the lower-order determination unit 601 operate separately, the overflow and underflow determinations for the upper and lower 16 bits can be performed in parallel.
[0079]
The determination signals 607 and 608 output by the upper-order determination unit 600 become “1” when the upper 16 bits of 506 overflow and underflow, respectively. The determination signals 609 and 610 output by the lower-order determination unit 601 become “1” when the lower 16 bits of 506 overflow and underflow, respectively. The determination signals 611 and 612 output by the 32-bit determination unit 602 become “1” when the lower 32 bits of 506 overflow and underflow, respectively.
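The parallel determination can be sketched as below. The text only says that the limits are predetermined, so the limit values 127 and -128 and the reading of each 16-bit field as a two's-complement number are assumptions made for this illustration, as are the function names.

```python
UPPER_LIMIT_8 = 127     # ASSUMPTION: stands in for the 8-bit upper limit value 603
LOWER_LIMIT_8 = -128    # ASSUMPTION: stands in for the 8-bit lower limit value 604

def judge16(field16):
    """Return (overflow, underflow) flags for one 16-bit result field.
    ASSUMPTION: the field is read as a 16-bit two's-complement number."""
    value = field16 - 0x10000 if field16 & 0x8000 else field16
    return int(value > UPPER_LIMIT_8), int(value < LOWER_LIMIT_8)

def judge_parallel(result64):
    """Judge the upper and lower 16-bit fields independently,
    giving the counterparts of signals (607, 608) and (609, 610)."""
    upper = (result64 >> 48) & 0xFFFF
    lower = result64 & 0xFFFF
    return judge16(upper), judge16(lower)

# Upper field holds 300 (too large), lower field holds -200 (too small)
print(judge_parallel((300 << 48) | 0xFF38))
```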
[0080]
These determination signals 607 to 612 are sent to the maximum / minimum value replacement unit 112 based on the determination results.
[0081]
FIG. 7 shows a configuration diagram of the aligner 110, which has a function of controlling the output form of the 64-bit adder 109 (how data is extracted from 701) so as to handle two cases: a normal product-sum operation, and the parallel operation for obtaining “a1 × b1 + c1” and “a2 × b2 + c2” described above.
[0082]
The selector 702 selects between the data 719 (from the upper 9th bit to the upper 16th bit of 701) and the data 720 (from the upper 33rd bit to the upper 40th bit of 701). The control signal 723 becomes “0” during execution of the parallel operation of “a1 × b1 + c1” and “a2 × b2 + c2”, and becomes “1” during a normal product-sum operation.
[0083]
Logic gates 703 to 718 (16 logic gates in total, of which only two are shown in the figure to avoid complexity and the rest are omitted) set the data 721 (from the lower 24th bit to the lower 9th bit of 701) to “0” in response to the control signal 723 (the portion shown as the value 0 in 724 in the figure).
[0084]
That is, when the parallel operation of “a1 × b1 + c1” and “a2 × b2 + c2” is executed, the control signal 723 becomes “0”. First, the selector 702 selects the data 719, which becomes the d1 part in the figure. Second, since the logic gates 713 to 718 are AND gates, the data 721 becomes “0”. In addition, since the data 722 does not change, the output of the aligner is as shown in 724.
[0085]
On the other hand, when the control signal 723 becomes “1” during a normal product-sum operation, the selector 702 selects the data 720, and the data 721 passes through the logic gates 713 to 718 as it is. Since the data 722 does not change, the output of the aligner is as shown in 725.
[0086]
As a result, at the time of executing the parallel operation of “a1 × b1 + c1” and “a2 × b2 + c2”, the operation results d1 and d2 are packed into a 32-bit register like 724.
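The data selection performed by the aligner can be sketched as follows, assuming that the “upper n-th bit” and “lower n-th bit” of the 64-bit value 701 are counted from 1. The function name `align` and the variable names are hypothetical.

```python
def align(sum64, control):
    """Model of the aligner: control = 0 for the parallel operation,
    control = 1 for a normal product-sum operation (signal 723)."""
    data_719 = (sum64 >> 48) & 0xFF      # upper 9th..16th bit  (bits 55..48)
    data_720 = (sum64 >> 24) & 0xFF      # upper 33rd..40th bit (bits 31..24)
    data_721 = (sum64 >> 8) & 0xFFFF     # lower 24th..9th bit  (bits 23..8)
    data_722 = sum64 & 0xFF              # lower 8 bits, passed through unchanged
    top = data_720 if control else data_719      # selector 702
    mid = data_721 if control else 0             # AND gates force 0 when control = 0
    return (top << 24) | (mid << 8) | data_722

# Parallel mode packs d1 and d2 with zeros between them (form 724)
print(hex(align((0x0012 << 48) | 0x34, 0)))
# Normal mode returns the lower 32 bits unchanged (form 725)
print(hex(align(0x89ABCDEF, 1)))
```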
[0087]
Next, FIG. 8 shows the relationship between the calculation condition and the determination result of the overflow / underflow determination unit 111 and the output of the maximum / minimum value replacement unit 112.
[0088]
The following describes the output mode.
[0089]
First, in the example shown in (1) in FIG. 8, when the operations “a1 × b1 + c1” and “a2 × b2 + c2” are executed simultaneously and the upper result is determined to be an overflow, the predetermined maximum value (max) is output to the upper 8 bits of the output, and the operation result is output as it is to the lower 8 bits.
[0090]
In the example shown in (2) in FIG. 8, when the upper result is determined to be an underflow, the predetermined minimum value (min) is output to the upper 8 bits of the output, and the operation result is output as it is to the lower 8 bits.
[0091]
Further, in the example shown in (3) in FIG. 8, when the lower result is determined to be an overflow, the predetermined maximum value (max) is output to the lower 8 bits of the output, and the operation result is output as it is to the upper 8 bits.
[0092]
In the example shown in (4) in FIG. 8, when the lower result is determined to be an underflow, the predetermined minimum value (min) is output to the lower 8 bits of the output, and the operation result is output as it is to the upper 8 bits.
[0093]
Further, when it is determined that an overflow has occurred during normal product-sum execution, a maximum value that is a predetermined value is output to all 32 bits (example shown in FIG. 8 (5)), and it is determined that an underflow has occurred. In such a case, it is conceivable to output a minimum value that is a predetermined value to the entire 32 bits (FIG. 8 (6)).
[0094]
In addition, when the parallel operation of “a1 × b1 + c1” and “a2 × b2 + c2” is executed, or during normal product-sum execution, if neither overflow nor underflow is determined, the input value is output as it is. (Example of FIG. 8 (7)).
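The replacement rules for the parallel case, examples (1) to (4) and (7), can be sketched as below. The text calls the maximum and minimum values predetermined without giving them, so 0x7F and 0x80 are assumptions for illustration, and `replace_parallel` is a hypothetical name.

```python
MAX8 = 0x7F   # ASSUMPTION: predetermined maximum value (max)
MIN8 = 0x80   # ASSUMPTION: predetermined minimum value (min), -128 in two's complement

def replace_parallel(aligned32, upper_flags, lower_flags):
    """Replace each 8-bit result independently.
    Each flags argument is an (overflow, underflow) pair for that result."""
    def pick(value, flags):
        overflow, underflow = flags
        if overflow:
            return MAX8
        if underflow:
            return MIN8
        return value                      # no overflow or underflow: pass through
    upper = pick((aligned32 >> 24) & 0xFF, upper_flags)
    lower = pick(aligned32 & 0xFF, lower_flags)
    return (upper << 24) | lower

print(hex(replace_parallel(0x2C000034, (1, 0), (0, 0))))   # example (1)
print(hex(replace_parallel(0x2C000034, (0, 0), (0, 1))))   # example (4)
```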
[0095]
Next, an example in which data is packed into all 32 bits, for example, an embodiment of a parallel multiply-accumulate unit that packs four 8-bit pixel data will be described with reference to FIG.
[0096]
The present embodiment comprises: a partial product adder 900 that generates and adds partial products from four packed multiplicands a1, a2, a3, and a4 and four packed multipliers b1, b2, b3, and b4; a “shift and selector” 901 that digit-aligns the four packed addends c1, c2, c3, and c4 with the output of the partial product adder 900; a three-input full adder sequence 902 having a function of adding the output of the partial product adder 900 and the output of the “shift and selector” 901; a 64-bit adder 903 having a function of adding the two outputs of the three-input full adder sequence; an aligner 904 for converting the output of the adder into a specified format; an overflow/underflow determination unit 905 for determining whether the addition result is an overflow or an underflow; and a maximum/minimum value replacement unit 906 having a function of replacing the output of the aligner 904 with a predetermined value in accordance with predetermined rules.
[0097]
For the embodiment shown in FIG. 9, the operation of performing the product-sum operations “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” will be described, taking as an example the case where four 8-bit data items are packed into a data length of 32 bits.
[0098]
As shown in a data format 1000 shown in FIG. 10, the data on which the product-sum operation is performed constitutes 32-bit data by four data “a1, a2, a3, a4”.
[0099]
Similarly, “b1, b2, b3, b4” and “c1, c2, c3, c4” are each packed into 32-bit data.
[0100]
By performing a 32-bit multiplication of the 32-bit data in which “a1, a2, a3, a4” are packed and the 32-bit data in which “b1, b2, b3, b4” are packed, the 8-bit multiplications “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4” are executed simultaneously.
[0101]
Which part of the multiplication result of the 32-bit data corresponds to each of the multiplications “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4” will now be explained with reference to FIG. 10.
[0102]
In FIG. 10, each partial product is represented by a rectangle.
[0103]
The values of the multiplications “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4” are obtained by adding the partial products 25 to 32, the partial products 17 to 24, the partial products 9 to 16, and the partial products 1 to 8, respectively. Therefore, the portions painted black in the figure are the sign-extension portions of the multiplications “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4”.
[0104]
For example, in the multiplication “a4 × b4”, the sign-extension portion is the 9th to 15th digits for the partial product 1, the 10th to 15th digits for the partial product 2, the 11th to 15th digits for the partial product 3, the 12th to 15th digits for the partial product 4, the 13th to 15th digits for the partial product 5, the 14th to 15th digits for the partial product 6, and the 15th digit for the partial product 7.
[0105]
As in the parallel operation of two multiplications described above, the partial products 8, 16, 24, and 32 correspond to sign bits, so a negative number must be generated for them.
[0106]
By adding, to eight of the 32 full adders used for the addition of one partial product, the same logic gates 2208 to 2215 and selectors 2216 to 2224 as described above, connected in the same way as for the full adders 2200 to 2207, the generation and addition of a negative partial product can be realized. The full adders to which the logic gates and selectors are added are those corresponding to the lower 1st to 8th bits for the partial product 8, the 9th to 16th bits for the partial product 16, the 17th to 24th bits for the partial product 24, and the 25th to 32nd bits for the partial product 32.
[0107]
In the partial product addition of the 32-bit multiplication, it is necessary to prevent the propagation of the addition result between the partial products 8 and 9, the partial products 16 and 17, and the partial products 24 and 25. In the partial products 9 to 16, the lower 8 bits (shaded portion in the drawing) are set to "0", and in the partial products 17 to 24, the lower 16 bits (shaded portion in the drawing) are set to "0". In the partial products 25 to 32, the lower 24 bits (shaded portions in the drawing) are set to “0”. With these functions, the result of the partial product addition of the four multiplications does not affect the others, so that the partial product addition of the four multiplications can be performed in parallel.
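The separation of the four multiplications can be illustrated with a simplified software model. It assumes unsigned 8-bit data, and keeping only the 8 multiplicand bits that belong to each group of partial products stands in for both the zeroed lower bits and the blocking of propagation between groups; the sign handling described above is deliberately left out. The names `pack4`, `packed_mul4`, and `unpack_products` are hypothetical.

```python
def pack4(v1, v2, v3, v4):
    """Pack four unsigned 8-bit values, v1 in the top byte, v4 in the bottom byte."""
    return ((v1 & 0xFF) << 24) | ((v2 & 0xFF) << 16) | ((v3 & 0xFF) << 8) | (v4 & 0xFF)

def packed_mul4(a_packed, b_packed):
    """Accumulate 32 partial products, keeping the four groups of eight separate."""
    total = 0
    for i in range(32):                       # i + 1 is the partial product number
        if not (b_packed >> i) & 1:
            continue                          # multiplier bit 0 -> partial product 0
        lane = i // 8                         # 0: a4 x b4, 1: a3 x b3, 2: a2 x b2, 3: a1 x b1
        own_bits = a_packed & (0xFF << (8 * lane))   # only this lane's multiplicand
        total += own_bits << i
    return total

def unpack_products(result64):
    """Cut the four 16-digit products out of the 64-bit result, a1 x b1 first."""
    return [(result64 >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]

print(unpack_products(packed_mul4(pack4(1, 2, 3, 4), pack4(10, 20, 30, 40))))
```

Each product occupies its own 16-digit field, so no product can carry into its neighbour in this unsigned model.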
[0108]
Next, FIG. 11 illustrates a part of the partial product adder in which the result of the partial product addition of the four multiplications does not affect the others.
[0109]
FIG. 11 shows an example of a circuit configuration that realizes two functions: not propagating the addition result of the partial product 8 to the addition of the partial product 9, and setting the lower 8 bits of the partial product 9 to “0” so that the addition results of the partial products 1 to 8 are not destroyed.
[0110]
The full adders 1100 to 1107 perform addition processing on the 13th to 6th bits of the partial product 8.
[0111]
In addition, full adders 1108 to 1115 perform addition processing on the 13th to 6th bits of partial product 9. The logic gates 1116 and 1120 have a function of preventing the addition results 1128 and 1129 of the partial product 8 from being input to the full adder 1108 by setting the signal 1133 to “0”. Logic gates 1117 to 1119 and 1120 to 1123 perform the same operation, and have a function of preventing the input of the addition result to the corresponding full adder.
[0112]
Note that the signal 1133 becomes “0” when the four multiplications are performed in parallel.
[0113]
Therefore, “0” is input to the full adder 1108 and added to the partial product 9. As a result, the result of addition of the partial product 8 cannot be used for addition from the 10th bit of the partial product 9.
[0114]
Logic gate 1124 has a function of blocking input to full adder 1112, which performs addition processing on the eighth bit of partial product 9 by signal 1142. Similarly, the logic gates 1125 to 1127 perform the same blocking operation in response to the input of the signal 1133.
[0115]
The signal 1142 becomes “0” during the parallel operation of four multiplications. Therefore, the lower 8 bits of the partial product 9 are “0”.
[0116]
With these configurations, the addition up to the partial product 8 and the addition of the partial product 9 can be separated using the signals 1133 and 1142 and the logic gates 1116 to 1127. The separation of the partial products 16 and 17 and of the partial products 24 and 25 can be realized in the same way. Therefore, this partial product adder can execute the partial product additions of four multiplications in parallel.
[0117]
The “shift and selector” 901 receives a control signal indicating that “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” are to be executed in parallel, and aligns the digit positions of the addends c1, c2, c3, and c4 with “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4”, which are output every 16 digits from the upper end of the result of the partial product adder.
[0118]
Therefore, c1, c2, c3, and c4, packed as shown by 1200 in FIG. 12, are shifted as shown by 1201 so that the first bit of c1 is at the 49th digit, the first bit of c2 is at the 33rd digit, the first bit of c3 is at the 17th digit, and the first bit of c4 is at the first digit. In 1201, the value “0” is embedded in the portions where no data of c1, c2, c3, or c4 exists. When the parallel operation of “a1 × b1 + c1” and “a2 × b2 + c2” is not performed, the packed data is not shifted and remains as indicated by 1200.
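The shift from 1200 to 1201 can be sketched as follows, with digits counted from 1 so that the 49th digit corresponds to bit 48. The function name `align_addends4` is hypothetical.

```python
def align_addends4(c_packed):
    """Spread four packed 8-bit addends so that c1, c2, c3, c4 start at
    digits 49, 33, 17 and 1 of a 64-bit word, with 0 in all other digits."""
    aligned = 0
    for lane in range(4):                     # lane 0 is c4, lane 3 is c1
        byte = (c_packed >> (8 * lane)) & 0xFF
        aligned |= byte << (16 * lane)
    return aligned

# c1 = 0x11, c2 = 0x22, c3 = 0x33, c4 = 0x44
print(hex(align_addends4(0x11223344)))
```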
[0119]
FIG. 13 shows part of the configuration of the three-input full adder sequence 902, which adds three inputs, namely the results (sum and carry) of the four parallel multiplications “a1 × b1”, “a2 × b2”, “a3 × b3”, and “a4 × b4”, arranged in 16-bit units from the most significant bit, and the value selected by the “shift and selector”, and produces two outputs, a sum and a carry.
[0120]
The full adder 1300 and the full adder 1301 are used for the operation of the lower two bits when obtaining the operation result of “a3 × b3 + c3”. The full adder 1302 and the full adder 1303 are used for the operation of the upper two bits when obtaining the operation result of “a4 × b4 + c4”.
[0121]
The logic element 1304 blocks the carry 1306 of the full adder 1302 by means of the signal 1305, which becomes “0” when the parallel operation of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” is executed.
[0122]
The figure shows the configuration of the intermediate portion between the means for calculating “a3 × b3 + c3” and “a4 × b4 + c4”, but logic circuits of similar configuration are also provided at the intermediate portion between the means for calculating “a1 × b1 + c1” and “a2 × b2 + c2” and at the intermediate portion between the means for calculating “a2 × b2 + c2” and “a3 × b3 + c3”, to prevent carries.
[0123]
As a result, even if the parallel operations of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” are executed, the effect of the carry among the four operations can be eliminated. The carry generated at the boundary of each multiplication is sent to an overflow / underflow determination unit for use in determining overflow / underflow.
[0124]
FIG. 14 shows part of the configuration of a 64-bit adder having a function of preventing carries at the boundaries between the individual results in the product-sum operations “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4”. The full adders 1400 and 1401 are used for the operation of the lower two bits when obtaining the operation result of “a3 × b3 + c3”.
[0125]
The full adder 1402 and the full adder 1403 are used for the operation of the upper two bits when obtaining the operation result of “a4 × b4 + c4”. The logic gate 1404 blocks the carry 1406 of the full adder 1402 by means of the signal 1405, which becomes “0” when the parallel operation of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” is executed.
[0126]
FIG. 14 shows the configuration of the intermediate portion between the means for calculating “a3 × b3 + c3” and “a4 × b4 + c4”. However, logic circuits of similar configuration are also provided at the intermediate portion between the means for calculating “a1 × b1 + c1” and “a2 × b2 + c2” and at the intermediate portion between the means for calculating “a2 × b2 + c2” and “a3 × b3 + c3”, to prevent carries.
[0127]
As a result, even if the parallel operations of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” are executed, the effect of the carry on other operation results among the four operations Can be eliminated. Also, the carry generated at the boundary of each operation is sent to an overflow / underflow determination unit for use in determining overflow / underflow.
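The effect of blocking carries at the 16-digit boundaries can be sketched as a lane-wise addition. This is a behavioural illustration only; the name `add_blocked` and the ordering of the returned carry list (lowest field first) are assumptions made here.

```python
MASK64 = (1 << 64) - 1

def add_blocked(x, y, parallel=True):
    """Add two 64-bit words. In parallel mode the carry out of each 16-digit
    field is blocked from the next field and collected instead, so that it can
    be passed on for overflow/underflow determination."""
    if not parallel:
        return (x + y) & MASK64, []           # normal operation: carries propagate
    result, carries = 0, []
    for lane in range(4):                     # lane 0 is the lowest 16-digit field
        shift = 16 * lane
        s = ((x >> shift) & 0xFFFF) + ((y >> shift) & 0xFFFF)
        result |= (s & 0xFFFF) << shift
        carries.append(s >> 16)               # blocked carry of this field
    return result, carries

print(add_blocked(0xFFFF, 0x0001))                    # carry blocked and reported
print(add_blocked(0xFFFF, 0x0001, parallel=False))    # carry propagates
```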
[0128]
Next, FIG. 15 shows a configuration of the overflow / underflow determination unit.
[0129]
1500, 1501, 1502, and 1503 are “a1 × b1 + c1 determination section”, “a2 × b2 + c2 determination section”, “a3 × b3 + c3 determination section”, and “a4 × b4 + c4 determination section”, respectively.
[0130]
1307 to 1310 are the carry data between the operation results, obtained by the three-input full adder sequence 902 that executes the operations “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” in parallel.
[0131]
Reference numerals 1407 to 1410 denote the carry data between the operation results, obtained by the 64-bit adder 903 that executes the operations “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” in parallel.
[0132]
Each determination unit generates, from the two carry data of the corresponding operation and the corresponding portion of the output of the 64-bit adder 903, an operation result whose precision is not limited to 8 bits, compares it with the predetermined upper limit value for 8 bits and lower limit value for 8 bits to determine whether the operation result overflows or underflows, and outputs the determination result.
[0133]
For example, “a1 × b1 + c1 determination unit” 1500 adds carry 1307 and 1407.
[0134]
The addition result 1504 and the lower 16 bits 1505 of the output of the 64-bit adder are concatenated so that 1504 is higher, and an 18-bit operation result is generated.
[0135]
The newly calculated operation result 1506 is compared with a predetermined upper limit value for 8 bits and a lower limit value for 8 bits to determine overflow and underflow, and a determination result 1507 is output. Similarly, the determination units 1501, 1502, and 1503 also determine overflow and underflow, and output the determination results as 1508, 1509, and 1510, respectively.
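The determination step for one operation can be sketched as follows. This is an illustration only: it assumes that the 18-bit value is interpreted as a two's complement number and that the 8-bit limits are +127 and -128, neither of which is stated explicitly in the text. The function name `judge_lane` and its return strings are hypothetical.

```python
def judge_lane(carry_csa, carry_adder, low16, upper=127, lower=-128):
    """Classify one operation result as "overflow", "underflow", or "ok".

    carry_csa   -- carry bit from the full adder array (like 1307)
    carry_adder -- carry bit from the 64-bit adder (like 1407)
    low16       -- the 16 adder output bits of this operation (like 1505)
    """
    carry_sum = (carry_csa + carry_adder) & 0b11    # 2-bit sum, like 1504
    wide = (carry_sum << 16) | (low16 & 0xFFFF)     # 18-bit value, like 1506
    if wide & (1 << 17):                            # assumed: two's complement
        wide -= 1 << 18
    if wide > upper:
        return "overflow"
    if wide < lower:
        return "underflow"
    return "ok"
```

The limits are parameters so that the same comparison can be reused with other predetermined bounds.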
[0136]
Further, the 32-bit determination unit compares the lower 32 bits of the output of the 64-bit adder with a predetermined upper limit value for 32 bits and a predetermined lower limit value for 32 bits, determines overflow and underflow of the operation result, and outputs the determination result 1511.
[0137]
Next, FIG. 16 shows the configuration of the aligner 904, which controls how data is extracted from the output 1601 of the 64-bit adder 903 so that both cases can be handled: the normal product-sum operation, and the parallel execution of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” described above. The signal 1600 is “0” when the parallel operation of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” is executed, and “1” during the normal product-sum operation.
[0138]
The selector 1608 chooses between data 1602 (the 56th to 49th bits of 1601, counted from the least significant bit) and data 1603 (the 32nd to 25th bits of 1601): it selects data 1603 when the signal 1600 is “1” and data 1602 when the signal 1600 is “0”. Likewise, the selector 1609 chooses between data 1604 (the 40th to 33rd bits of 1601) and data 1605 (the 24th to 17th bits of 1601): it selects data 1605 when the signal 1600 is “1” and data 1604 when the signal 1600 is “0”.
[0139]
Further, the selector 1610 chooses between data 1605 (the 24th to 17th bits of 1601) and data 1606 (the 16th to 9th bits of 1601): it selects data 1606 when the signal 1600 is “1” and data 1605 when the signal 1600 is “0”.
[0140]
Note that data 1607 (the lower 8 bits of 1601) is not subject to selection control by the signal 1600. As a result, during the parallel operation of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4”, the signal 1600 is “0”, so data 1602, 1604, 1605, and 1607 are stored in the register areas d1, d2, d3, and d4, respectively, and packed into 32-bit data as indicated by 1611. In a normal product-sum operation, data 1603, 1605, 1606, and 1607 are stored, giving the 32-bit data indicated by 1612.
[0141]
In this way, the aligner shown in FIG. 16 converts the output of the 64-bit adder into 32-bit data: into the form 1611 when the signal 1600 is “0”, indicating a parallel product-sum operation, and into the form 1612 when the signal 1600 is “1”, indicating a normal 32-bit product-sum operation.
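The selection performed by the aligner can be written out as a short sketch. The bit positions follow the text (1-based, counted from the least significant bit). The assumption that d1 is the most significant byte of the 32-bit result, and the function names `byte_at` and `align`, are illustrative and not taken from the source.

```python
def byte_at(word, low_bit):
    """Return the 8-bit field whose lowest bit is `low_bit` (1-based)."""
    return (word >> (low_bit - 1)) & 0xFF


def align(adder_out, signal_1600):
    """Model of the aligner: pick four bytes of the 64-bit adder output."""
    d1602 = byte_at(adder_out, 49)   # 56th to 49th bits
    d1603 = byte_at(adder_out, 25)   # 32nd to 25th bits
    d1604 = byte_at(adder_out, 33)   # 40th to 33rd bits
    d1605 = byte_at(adder_out, 17)   # 24th to 17th bits
    d1606 = byte_at(adder_out, 9)    # 16th to 9th bits
    d1607 = byte_at(adder_out, 1)    # lower 8 bits, never switched
    if signal_1600 == 1:             # normal product-sum operation (like 1612)
        fields = (d1603, d1605, d1606, d1607)
    else:                            # parallel operation (like 1611)
        fields = (d1602, d1604, d1605, d1607)
    out = 0
    for field in fields:             # assumed: first field is the top byte
        out = (out << 8) | field
    return out
```

With the signal at “1” the function simply returns the lower 32 bits of the adder output, while with the signal at “0” it gathers the low byte of each 16-bit operation result.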
[0142]
Next, the maximum value / minimum value replacement unit 906 receives the determination signal from the overflow / underflow determination unit 905 and, when the parallel operation of “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” is executed, performs predetermined processing on each operation result.
[0143]
As the predetermined processing, for example, if the determination result is an overflow, the operation result is replaced with a predetermined maximum value represented by 8 bits; if the determination result is an underflow, the operation result is replaced with a predetermined minimum value represented by 8 bits; and if it is neither, the operation result is output without replacement.
[0144]
During a normal product-sum operation, the processing may be as follows: if the determination result is an overflow, the operation result is replaced with a predetermined maximum value represented by 32 bits; if the determination result is an underflow, the operation result is replaced with a predetermined minimum value represented by 32 bits; and if it is neither, the operation result is output without replacement.
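The replacement rule amounts to saturation. A minimal sketch follows, assuming that the “predetermined” maximum and minimum are the limits of a signed two's complement number of the given width; the text leaves the actual values open. The function name `replace_max_min` and the judgment strings are hypothetical.

```python
def replace_max_min(result, judgment, bits=8):
    """Replace an out-of-range result by the maximum or minimum value.

    judgment -- "overflow", "underflow", or anything else for "in range"
    bits     -- 8 for the parallel operation, 32 for the normal operation
    """
    max_value = (1 << (bits - 1)) - 1    # assumed: signed maximum
    min_value = -(1 << (bits - 1))       # assumed: signed minimum
    if judgment == "overflow":
        return max_value
    if judgment == "underflow":
        return min_value
    return result                        # in range: output unchanged
```

The same function covers both cases in the text by passing `bits=8` or `bits=32`.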
[0145]
With the configuration described above, the operations “a1 × b1 + c1”, “a2 × b2 + c2”, “a3 × b3 + c3”, and “a4 × b4 + c4” can be kept separate from one another, so that a 32-bit general-purpose product-sum operation unit can be used to execute these operations in parallel.
[0146]
Of course, by applying the technical idea of the present invention with the number of operation bits taken into account, divided product-sum operations of other bit widths can be performed in addition to the division of the “32 bits × 32 bits” product-sum operation shown in the present embodiment.
[0147]
FIG. 17 shows an example of an instruction code for a microprocessor having a product-sum operation unit or a multiplier according to the present invention. The operation codes 1700 to 1703 are defined by the type of operation. The operands 1704 to 1707 specify a source register and a target register of data used for the operation.
[0148]
The operation codes 1700 and 1702 designate parallel operations and are used when a plurality of multiplications or a plurality of product-sum operations, respectively, are performed at the same time. The operation codes 1701 and 1703 are used when a normal multiplication or a normal product-sum operation, respectively, is performed.
[0149]
Taking the parallel operation on two data as an example, the operation code 1700 “SplitMPY” is an instruction that uses the data a1, a2, b1, and b2 stored in the source register r1, in which a1 and a2 are packed, and the source register r2, in which b1 and b2 are packed, to perform the multiplications “a1 × b1” and “a2 × b2” in parallel, and stores the result in the target register r3.
[0150]
Further, the operation code 1702 “SplitMPYADD” is an instruction that uses the data a1, a2, b1, b2, c1, and c2 stored in the source register r1, in which a1 and a2 are packed, the source register r2, in which b1 and b2 are packed, and the source register r3, in which c1 and c2 are packed (r3 also serves as the target register), to perform the product-sum operations “a1 × b1 + c1” and “a2 × b2 + c2” in parallel, and stores the result in the register r3.
[0151]
In the parallel operation of four data, only the number of data and the number of executions of the parallel operation are different, and an instruction code can be set similarly.
[0152]
On the other hand, the operation code 1701 “ConnectMPY” is an instruction that performs the multiplication “R1 × R2” using the values of the registers r1 and r2 shown in 1705 (the values being R1 and R2, respectively) and stores the result in the register r2. Also, the operation code 1703 “ConnectMPYADD” is an instruction that performs the product-sum operation “R1 × R2 + R3” using the values of the registers r1, r2, and r3 shown in 1707 (the values being R1, R2, and R3, respectively) and stores the result in the register r3, which also serves as the target register.
[0153]
In an architecture having only the operation code 1703, two instructions must be issued to execute the operations “a1 × b1 + c1” and “a2 × b2 + c2”. By defining the operation code 1702 for parallel operation as described above, the product-sum operations “a1 × b1 + c1” and “a2 × b2 + c2” can be executed in parallel with a single instruction.
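The difference in instruction count can be illustrated with a simplified software model. The sketch below assumes 32-bit registers, two unsigned 8-bit data items per register placed in 16-bit fields with zeros in between, and a 16-bit result per field; it leaves out sign handling, the aligner, and saturation. All function names (`pack2`, `unpack2`, `split_mpyadd`, `connect_mpyadd`) and the field layout are hypothetical.

```python
FIELD_BITS = 16    # assumed: one 16-bit field per data item
DATA_MASK = 0xFF   # assumed: each data item is an unsigned 8-bit value


def pack2(values):
    """Pack two values into one 32-bit register, one per 16-bit field."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFFFF) << (i * FIELD_BITS)
    return reg


def unpack2(reg):
    """Extract the two 8-bit data items from a packed register."""
    return [(reg >> (i * FIELD_BITS)) & DATA_MASK for i in range(2)]


def split_mpyadd(r1, r2, r3):
    """Model of "SplitMPYADD": two product-sum operations in one instruction."""
    a, b, c = unpack2(r1), unpack2(r2), unpack2(r3)
    return pack2([x * y + z for x, y, z in zip(a, b, c)])


def connect_mpyadd(r1, r2, r3):
    """Model of "ConnectMPYADD": one normal 32-bit product-sum operation."""
    return (r1 * r2 + r3) & 0xFFFFFFFF
```

For a1 = 3, a2 = 5, b1 = 4, b2 = 6, c1 = 1, c2 = 2, one call to `split_mpyadd` on the packed registers yields both 13 and 32, whereas `connect_mpyadd` has to be called once per operation.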
[0154]
Similarly, for the multiplication, the number of necessary instructions is reduced by defining the operation code 1700 as compared with the case where the operation code 1701 is used.
[0155]
As a result, the program becomes shorter, so that when the program is stored in a memory, the use of these instructions makes it possible to reduce the required memory capacity.
[0156]
Next, FIG. 18 shows another embodiment according to the present invention.
[0157]
FIG. 18 is a configuration diagram of a system having a microprocessor 1800 that includes at least one of a multiplier (for example, a unit for performing the operation “a1 × b1”) and a product-sum operation unit (for example, a unit for performing the operation “a1 × b1 + c1”) according to the present invention.
[0158]
The storage device 1801 stores a program that determines the processing executed by the microprocessor 1800, the necessary data, and the like. In this system, it is assumed that the microprocessor 1800 performs certain image processing according to the program. The processed image is displayed on a display device 1802 realized by a CRT or the like. Such display processing is also performed by the microprocessor 1800 according to a predetermined program.
[0159]
Now, in image processing, it is necessary to frequently perform a product-sum operation, and it is assumed that data necessary for the product-sum operation is stored in the storage device 1801.
[0160]
The product-sum operation unit 1803 is a product-sum operation unit according to the present invention, and executes a plurality of product-sum operations in parallel.
[0161]
It is assumed that the register 1804 in the microprocessor holds the data, read from the storage device 1801, that is used for the product-sum operation.
[0162]
When the execution of the product-sum operation is instructed by the program, the microprocessor 1800 accesses the storage device 1801 and, via the bus 1805, holds the data necessary for the product-sum operation in the register 1804 included in the microprocessor 1800. Only the data necessary for one product-sum operation may be accessed. However, since a plurality of product-sum operations are usually performed at once, all the relevant data is held in the register 1804. Note that examples of the multiplicand data (a1, a2, a3, a4), multiplier data (b1, b2, b3, b4), and addition data (c1, c2, c3, c4) held in the register 1804 are shown on the left side of the figure. The figure shows one set of data for performing one product-sum operation, but usually a plurality of sets of data are held.
[0163]
Then, the product-sum operation unit is activated. First, all the data required for the product-sum operation in the register 1804 is fetched via the source bus.
[0164]
The product-sum operation unit performs the above-described product-sum operation based on the fetched data, and sequentially sends the operation results via the target bus to an empty area of the register 1804, where they are held. Of course, when image processing that uses the operation results later is performed, the results may be stored in the storage device 1801.
[0165]
Since the product-sum operation unit according to the present invention can simultaneously perform a plurality of types of product-sum operations, the processing speed of image processing is significantly improved.
[0166]
Since a plurality of product-sum operations can be executed in parallel and repeatedly, high-speed processing is possible even for processing that repeats the product-sum operation, such as the discrete cosine transform, which is frequently performed in image processing.
[0167]
As described above, a microprocessor incorporating the product-sum operation unit (multiplier) according to the present invention is realized, and by using the microprocessor, for example, a system capable of performing high-speed image processing can be constructed. Of course, the processing contents targeted by the system are not limited to image processing, and may be any processing that performs a large amount of product-sum operation.
[0168]
[Effect of the invention]
As described above, according to the present invention, since a plurality of sum-of-products operations can be executed in parallel, a plurality of sum-of-products operations can be performed at an extremely high speed.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an embodiment according to the present invention.
FIG. 2 is an explanatory diagram of sign extension and separation functions of multiplication.
FIG. 3 is a configuration diagram of means for realizing a sign extension function of a partial product adder.
FIG. 4 is a configuration diagram of means for realizing a separating function of a partial product adder.
FIG. 5 is an explanatory diagram of an operation of a “shift and selector”.
FIG. 6 is a configuration diagram of an overflow / underflow determination unit.
FIG. 7 is a configuration diagram of an aligner.
FIG. 8 is an explanatory diagram of a maximum value / minimum value replacement process.
FIG. 9 is a configuration diagram of another embodiment according to the present invention.
FIG. 10 is an explanatory diagram of sign extension and separation functions of multiplication.
FIG. 11 is a configuration diagram of a partial product adder.
FIG. 12 is an explanatory diagram of an operation of a “shift and selector”.
FIG. 13 is a configuration diagram of a three-input full adder array.
FIG. 14 is a configuration diagram of a 64-bit adder.
FIG. 15 is a configuration diagram of an overflow / underflow determination unit.
FIG. 16 is a configuration diagram of an aligner.
FIG. 17 is an explanatory diagram of a product-sum operation instruction.
FIG. 18 is a configuration diagram of another embodiment according to the present invention.
FIG. 19 is an explanatory diagram of a conventional multiplication process.
FIG. 20 is an explanatory diagram of a conventional partial product adder.
FIG. 21 is an explanatory diagram of an input / output relationship of a full adder.
FIG. 22 is a configuration diagram of means for realizing a negative partial product addition function.

Claims

An arithmetic unit for calculating a product of an N-bit number having a plurality of multiplicands and an M-bit number having a multiplier corresponding to each of the multiplicands to obtain a multiplication result for each set of a multiplicand and a multiplier, the arithmetic unit comprising:
A first register that holds an N-bit number (N is 32 bits), in which a plurality of 8-bit multiplicands are held, arranged under the condition that the sum of the bit lengths of the multiplicands does not exceed N, with 0 embedded between the multiplicands;
A second register that holds an M-bit number (M is 32 bits), in which an 8-bit multiplier corresponding to each of the multiplicands is held, arranged under the condition that the sum of the bit lengths of the multipliers does not exceed M, with 0 embedded between the multipliers;
A partial product processing unit for performing a process of calculating partial products of the value held in the first register and the value held in the second register;
A sign extension unit that performs, on the partial products, sign extension that embeds bits for compensating the sign of the multiplication result of each set of a multiplicand and a multiplier when that result is expressed in 2's complement;
Summing means for sequentially calculating the sum of the partial products after the sign extension is performed by the sign extension unit, wherein, when the sum of all partial products for a certain set of a multiplicand and a multiplier exceeds a predetermined value, the excess value is discarded when calculating the sum of partial products for the next set of a multiplicand and a multiplier; and
A processing unit that cuts out, from the data of the sum value (“N + M” (bits)) obtained by the summing means, the values that are the multiplication results of each multiplicand in the first register and the corresponding multiplier in the second register, and obtains the multiplication result for each set of a multiplicand and a multiplier.

The arithmetic unit according to claim 1, further comprising:
A third register that holds a plurality of addends, each to be added to the multiplication result of a multiplicand in the first register and the corresponding multiplier in the second register, the addends being arranged under the condition that the sum of their bit lengths does not exceed a predetermined length, with 0 embedded between the addends; and
An adder that adds, to the multiplication result of each multiplicand in the first register and the corresponding multiplier in the second register, the corresponding addend in the third register.

The arithmetic unit according to claim 1 or 2, further comprising:
A multiplication value replacement unit that sets the multiplication result to a predetermined value if the multiplication result of a multiplicand in the first register and the corresponding multiplier in the second register is an overflow or an underflow.

An arithmetic unit for calculating a product of an N-bit number having a plurality of multiplicands and an M-bit number having a multiplier corresponding to each of the multiplicands to obtain a multiplication result for each set of a multiplicand and a multiplier, the arithmetic unit comprising:
A first register that holds an N-bit number, in which a plurality of multiplicands are held, arranged under the condition that the sum of the bit lengths of the multiplicands does not exceed N;
A second register that holds an M-bit number, in which a multiplier corresponding to each of the multiplicands is held, arranged under the condition that the sum of the bit lengths of the multipliers does not exceed M;
A partial product processing unit for performing a process of calculating partial products of the value held in the first register and the value held in the second register;
A sign extension unit that performs, on the partial products, sign extension that embeds bits for compensating the sign of the multiplication result of each set of a multiplicand and a multiplier when that result is expressed in 2's complement;
Partial product organizing means that, after all partial products for a certain set of a multiplicand (a bits) and a multiplier (b bits) have been calculated, embeds 0, in the partial products obtained for the next set of a multiplicand and a multiplier, in the bits corresponding to the preceding set;
Summing means for sequentially calculating the sum of the partial products, wherein, when the sum of all partial products for a certain set of a multiplicand and a multiplier exceeds a predetermined value, the excess value is discarded when calculating the sum of partial products for the next set of a multiplicand and a multiplier; and
A processing unit that cuts out, from the data of the sum value (“N + M” (bits)) obtained by the summing means, the values that are the multiplication results of each multiplicand in the first register and the corresponding multiplier in the second register, and obtains the multiplication result for each set of a multiplicand and a multiplier.

The arithmetic unit according to claim 4, further comprising:
A third register that holds a plurality of addends, each to be added to the multiplication result of a multiplicand in the first register and the corresponding multiplier in the second register, the addends being arranged under the condition that the sum of their bit lengths does not exceed a predetermined length; and
An adder that adds, to the multiplication result of each multiplicand in the first register and the corresponding multiplier in the second register, the corresponding addend in the third register.

An arithmetic unit according to claim 1, 2, 4, or 5,
A data input unit for supplying data to the arithmetic unit when a predetermined instruction code is supplied;
Activating means for activating the computing unit;
A data output unit for obtaining and outputting the operation result of the operation unit.