JP3547309B2

JP3547309B2 - Arithmetic unit

Info

Publication number: JP3547309B2
Application number: JP06292098A
Authority: JP
Inventors: 敏行古澤; 弘一森; 大資薗田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1998-03-13
Filing date: 1998-03-13
Publication date: 2004-07-28
Anticipated expiration: 2018-03-13
Also published as: JPH11259273A

Description

【０００１】
【発明の属する技術分野】
本発明は、並列乗算器を備え、例えばデジタル信号処理を行うＤＳＰなどの演算装置に関する。
【０００２】
【従来の技術】
近年、ＬＳＩ技術や信号処理技術などのめざましい進歩によって、携帯電話機などの情報通信端末装置のデジタル化が進んでいる。特に、携帯電話機の分野においては、デジタル化することにより回線容量不足及び雑音の解消や秘匿性の向上、通話及び待ち受け時間の長期化などの多くの利点があることから、現在、非常に大きな伸びを見せている。
【０００３】
このデジタル化を進める場合には、キーデバイスたるデジタル信号処理用ＬＳＩ（ＤＳＰ：ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）が必要不可欠である。デジタル信号処理においては、積和演算が処理の大半を占めるため、処理時間の向上のために積和演算を如何に高速化するかがＤＳＰの主要な開発課題となっている。
【０００４】
一般に、処理の高速性を重視する場合には、固定小数点方式のＤＳＰが使用されるが、演算精度を重視する用途においては、プログラマは演算誤差を常に考慮する必要がある。また、高精度の演算を行う必要がある処理の場合は、語長（ビット幅）の長いＤＳＰを使用することも考えられるが、語長の長いＤＳＰでは、メモリサイズ，データバス，端子数，レジスタ語長など、ＤＳＰの全構成要素の回路規模が大きくなるため、チップサイズやコスト，動作速度の面でのデメリットが増加してしまう。そこで、一般には、通常の処理については単精度で演算を行い、高精度が要求される処理についてのみ倍精度で演算する方式が多く用いられている。
【０００５】
ここで、斯様な倍精度演算を行うＤＳＰの一構成例を図５に示す。例えば、各１６ビットの乗算オペランドレジスタ１（Ｘ）及び２（Ｙ）に格納されるオペランドデータは、乗算器（ＭＰＹ）３に与えられて乗算（Ｘ×Ｙ）が行われるようになっている。乗算器３における乗算結果データは、例えば３２ビットの乗算結果格納レジスタ（Ｍ）４に出力されるようになっている。
【０００６】
乗算結果格納レジスタ（以下、レジスタと称す）４に格納されたデータは、３入力のマルチプレクサ５を介して算術論理演算回路（以下、ＡＬＵと称す）６にオペランドとして与えられるようになっている。ＡＬＵ６の演算結果データは、例えば各６４ビットのアキュムレータ７（Ｚ０），８（Ｚ１）に格納されるようになっている。
【０００７】
アキュムレータ７，８の出力データは、マルチプレクサ５及びもう１つのマルチプレクサ９を介して夫々ＡＬＵ６に与えられるようになっている。尚、マルチプレクサ５及び９に何れのデータを選択出力させるかは、ＡＬＵ６により制御されるようになっている。以上がＤＳＰ１０を構成している。また、ＤＳＰ１０は、図示しないシステムクロック信号に同期した演算周期（サイクル）で演算処理を行うようになっている。
【０００８】
図６は、ＤＳＰ１０が、倍精度乗算（３２ビット×３２ビット）を行う場合の演算処理を概念的に示すものである。図６（ａ）は、各３２ビットのオペランドデータＸＨ／ＨＬ，ＹＨ／ＹＬ（上位／下位：各１６ビット）を乗算した結果を、アキュムレータＺに加算することを表したものであり、図６（ｂ）は、最終的に乗算結果▲１▼〜▲４▼がアキュムレータＺに格納されるビット配置のイメージを表したものである。
【０００９】
即ち、図６（ｂ）において、各オペランドデータを１６ビットの上位データ，下位データに分けて、▲１▼〜▲４▼の４回の乗算を行う。
▲１▼：ＸＬ×ＹＬ
▲２▼：ＸＨ×ＹＬ
▲３▼：ＸＬ×ＹＨ
▲４▼：ＸＨ×ＹＨ
そして、これら▲１▼〜▲４▼の乗算結果は、順次レジスタ４を介してＡＬＵ６に与えられ、桁合わせが行われながらアキュムレータＺに加算されていく。
【００１０】
図７は、ＤＳＰ１０が乗算処理を行う場合のタイミングチャートである。この図７において、サイクル１でレジスタ１にはデータＸＬが、レジスタ２にはデータＹＬが格納されると（図７（ａ），（ｂ）参照）、乗算器３により乗算（▲１▼：ＸＬ×ＹＬ）が１クロックで実行され、次のサイクル２においては、レジスタ４（Ｍ）に乗算結果が格納される（図７（ｃ）参照）。それと同時に、次の乗算▲２▼を行うため、レジスタ１にはデータＸＨが格納される（レジスタ２はデータＹＬのまま）。
【００１１】
また、サイクル２においては、ＡＬＵ６により、命令（Ｚ０＝Ｍ＞＞１６：レジスタＭを１６ビット右シフトして、アキュムレータＺ０に転送）が実行される（図７（ｄ）参照）。ここで、レジスタ４（Ｍ）のビット３１〜０は、レジスタＭ→アキュムレータＺの転送をシフトせずに行った場合はアキュムレータ７（Ｚ０）のビット６３〜３２に転送されるようになっており、従って、上記命令を実行した結果は、（次のサイクル３で）アキュムレータ７のビット４７〜１６に転送されることになる。
【００１２】
次のサイクル３においては、レジスタ４に乗算結果（▲２▼：ＸＨ×ＹＬ）が格納される（図７（ｃ）参照）。それと同時に、次の乗算▲３▼を行うため、レジスタ１にはデータＸＬ，レジスタ２はデータＹＨが格納される。また、ＡＬＵ６においては、乗算結果（▲２▼：ＸＨ×ＹＬ）をアキュムレータ７のビット６３〜３２に転送する命令（Ｚ０＝Ｚ０＋Ｍ：アキュムレータＺ０に、レジスタＭの内容を加算）が実行される（図７（ｄ）参照）。この時、アキュムレータ７には、１つ前のサイクルでＡＬＵ６により実行された命令（Ｚ０＝Ｍ＞＞１６）の結果が格納される（図７（ｅ）参照）。
【００１３】
次のサイクル４においては、サイクル３と同様にして、レジスタ４には乗算結果（▲３▼：ＸＬ×ＹＨ）が格納される（図７（ｃ）参照）と同時に、レジスタ１にはデータＸＨが格納される（レジスタ２はデータＹＨのまま）。また、ＡＬＵ６においては、乗算結果（▲３▼：ＸＬ×ＹＨ）をアキュムレータ７のビット６３〜３２に加算する命令（Ｚ０＝Ｚ０＋Ｍ）が実行され（図７（ｄ）参照）、アキュムレータ７には、１つ前のサイクルでＡＬＵ６により実行された命令（Ｚ０＝Ｚ０＋Ｍ）の結果が格納される（図７（ｅ）参照）。
【００１４】
尚、サイクル３及び４において、アキュムレータ７（Ｚ０）のビット６３〜３２に乗算結果▲２▼及び▲３▼が転送されることにより上位桁にオーバーフローを生じる場合には、図示しないオーバーフロー処理部により適宜処理されるようになっている。
【００１５】
そして、次のサイクル５においては、レジスタ４に乗算結果（▲４▼：ＸＨ×ＹＨ）が格納される（図７（ｃ）参照）と同時に、ＡＬＵ６においては、命令（Ｚ０＝Ｚ０＞＞１６＋Ｍ：アキュムレータＺ０を１６ビット右シフトしたものに、レジスタＭの内容を加算）が実行される（図７（ｄ）参照）。
【００１６】
即ち、サイクル４までの過程において、アキュムレータ７のビット６３〜１６には各乗算結果▲１▼〜▲３▼の累加算値（▲１▼＋▲２▼＋▲３▼）が格納されており、そのアキュムレータ７を１６ビット右シフトすることにより、前記累加算値は、ビット４７〜０に配置される。その状態で、アキュムレータ７のビット６３〜３２にレジスタ４の内容（乗算結果▲４▼）を加算することによって、最終的に３２ビット×３２ビットの乗算結果が得られることになる。また、この時、アキュムレータ７（Ｚ０）には、１つ前のサイクルでＡＬＵ６により実行された命令（Ｚ０＝Ｚ０＋Ｍ）の結果が格納される（図７（ｅ）参照）。
【００１７】
而して、次のサイクル６では、ＡＬＵ６において、命令（Ｚ１＝Ｚ１＋Ｚ０：アキュムレータＺ１の内容に、アキュムレータＺ０の内容を加算）が実行される（図７（ｄ）参照）と共に、アキュムレータ７には、１つ前のサイクルでＡＬＵ６により実行された命令（Ｚ０＝Ｚ０＞＞１６＋Ｍ）の結果が格納される（図７（ｅ）参照）。
【００１８】
次のサイクル７では、サイクル６でＡＬＵ６により命令（Ｚ１＝Ｚ１＋Ｚ０）が実行された結果、アキュムレータ８（Ｚ１）の内容に（ＯＬＤＺ）にアキュムレータ７（Ｚ０）の内容が累加算され、Ｚ１＝（ＮＥＷＺ）、となった時点で乗算処理を終了する。以上において、サイクル２〜６が積和サイクルであり、５サイクルを要している。
【００１９】
また、図８は、乗算器を２個有するＤＳＰの一構成例である。各１６ビットの乗算オペランドレジスタ１１ａ（ＸＨ）及び１１ｂ（ＨＬ）に格納されるオペランドデータは、乗算器１２（ＭＰＹ１）及び１３（ＭＰＹ２）に夫々与えられるようになっている。各１６ビットの乗算オペランドレジスタ１４ａ（ＹＨ）及び１４ｂ（ＹＬ）に格納されるオペランドデータは、マルチプレクサ１５及び１６に共に与えられ、マルチプレクサ１５及び１６の出力データは、乗算器１２及び１３に夫々与えられるようになっている。
【００２０】
乗算器１２及び１３の出力データは、累積演算器１７及び１８に夫々与えられるようになっており、累積演算器１７及び１８の出力データは、各４８ビットの乗算結果格納レジスタ１９（Ｍ１）及び２０（Ｍ２）に夫々与えられるようになっている。乗算結果格納レジスタ１９及び２０に格納されたデータは、３入力のマルチプレクサ２１及び２２に夫々与えられると共に、累積演算器１７及び１８にも夫々与えられるようになっている。
【００２１】
マルチプレクサ２１及び２２は、図５の構成におけるマルチプレクサ９及び５に対応するものであり、それ以降のＡＬＵ２３，アキュムレータ２４（Ｚ０）及び２５（Ｚ１）は、図５の構成におけるＡＬＵ６，アキュムレータ７及び８に対応するものである。以上がＤＳＰ２６を構成している。
【００２２】
図９は、ＤＳＰ２６が、倍精度の乗算を行う場合の演算処理を概念的に示すものであり、基本的には図６と同様であるが、図９（ｂ）では、乗算▲１▼及び▲２▼を乗算器１２（ＭＰＹ１）で行い、乗算▲３▼及び▲４▼を乗算器１３（ＭＰＹ２）で行うことを示している。
【００２３】
図１０は、ＤＳＰ２６が乗算処理を行う場合のタイミングチャートである。尚、図１０では、乗算オペランドレジスタ１１ａ，１１ｂ，１４ａ及び１４ｂのタイミングチャートは図示を省略しているが、これらのレジスタには、図示しないサイクル１でオペランドデータＸＨ，ＸＬ，ＹＨ及びＹＬが同時に格納され、以降は次の演算処理が実行されるまで変化しない。
【００２４】
この図１０において、サイクル２では、乗算器１２及び１３によって行われた乗算結果▲３▼（ＸＨ×ＹＬ）及び▲１▼（ＸＬ×ＹＬ）のデータが、乗算結果格納レジスタ１９（Ｍ１）及び２０（Ｍ２）のビット４７〜１６に格納される（図１０（ａ）及び（ｂ）参照）。同時に、マルチプレクサ１５及び１６によりオペランドデータの切り換えが行われ（ＹＬ→ＹＨ）、乗算器１２及び１３においては、乗算▲４▼（ＸＨ×ＹＨ）及び▲２▼（ＸＬ×ＹＨ）が夫々実行される。
【００２５】
次のサイクル３では、レジスタ１９には、累積加算器１７によって、サイクル２でレジスタ１９に格納されている乗算結果▲３▼（ＸＨ×ＹＬ）のデータが１６ビット右シフトされた内容（ビット３１〜０に配置）に、サイクル２で行われた乗算結果▲４▼（ＸＨ×ＹＨ）のデータがビット４７〜１６に加算されたもの（▲３▼＋▲４▼）が格納される。
【００２６】
同様に、レジスタ２０には、累積加算器１８によって、サイクル２でレジスタ２０に格納されている乗算結果▲１▼（ＸＬ×ＹＬ）のデータが１６ビット右シフトされた内容に、サイクル２で行われた乗算結果▲２▼（ＸＬ×ＹＨ）のデータが加算されたもの（▲１▼＋▲２▼）が格納される。また、ＡＬＵ２３においては、命令（Ｚ０＝Ｚ０＋Ｍ２＞＞１６：レジスタＭ２を１６ビット右シフトしたものを、アキュムレータＺ０に加算）が実行される（図１０（ｃ）参照）。
【００２７】
次のサイクル４においては、レジスタ１９及び２０の内容はそのまま保持されており、ＡＬＵ２３では、命令（Ｚ０＝Ｚ０＋Ｍ１：レジスタＭ１の内容をアキュムレータＺ０に加算）が実行される。また、アキュムレータ２４の内容（ＯＬＤＺ）に対しては、サイクル３におけるＡＬＵ２３の命令（Ｚ０＝Ｚ０＋Ｍ２＞＞１６）実行結果として、加算値（▲１▼＋▲２▼）がビット４７〜０に累加算される。
【００２８】
そして、次のサイクル５においては、アキュムレータ２４の内容には、サイクル４におけるＡＬＵ２３の命令（Ｚ０＝Ｚ０＋Ｍ１）実行結果として、ビット６３〜１６に加算値（▲３▼＋▲４▼）が累加算される。以上で乗算処理が終了するが、サイクル２〜４が積和サイクルであり、３サイクルを要している。
【００２９】
【発明が解決しようとする課題】
斯様な演算を行うＤＳＰ１０においては、取扱うデータのビット幅に一定の制約がある場合でも、演算処理をできる限り高速に行うことが要求される。
本発明は、上記事情に鑑みてなされたものであり、より少ないサイクル数によって並列乗算処理を行うことができる演算装置を提供することを目的とするものである。
【００３０】
【課題を解決するための手段】
上記目的を達成するため、請求項１記載の演算装置は、並列乗算器と、この並列乗算器の乗算結果データを格納する乗算結果記憶回路と、この乗算結果記憶回路から出力されるデータを算術論理演算する算術論理演算回路と、この算術論理演算回路の演算結果データを格納する演算結果記憶回路とを具備してなるものにおいて、
前記算術論理演算回路は、３つ以上の複数の入力部を備え、これら複数の入力部に与えられるデータをオペランドとして同時に演算可能に構成され、
算術論理演算回路の各入力部に各出力端子が接続されると共に、複数の入力端子には乗算結果記憶回路の出力データ或いは演算結果記憶回路の出力データが与えられ、何れかの入力端子に与えられているデータを選択的に出力可能に構成された複数の選択回路を具備
したことを特徴とする。
【００３１】
斯様に構成すれば、算術論理演算回路が、３つ以上の複数の入力部に与えられるデータをオペランドとして同時に算術論理演算することにより、従来のように入力部を２つのみ備えているものに比べて、例えば、処理データの桁数を多く必要とする倍精度乗算処理などを、より少ない演算周期で行うことができる。そして、算術論理演算回路は、複数の選択回路を介して各入力部に与えられるオペランドデータを適宜選択することによって、より多様なオペランドの組み合わせによる演算処理を行うことができる。
【００３２】
この場合、請求項２に記載したように、選択回路の複数の入力端子の内の何れか一つにゼロデータを与える構成とするのが好ましい。斯様に構成すれば、算術論理演算回路は、各入力部に与えられるオペランドデータの内何れかをゼロデータとすることによって、入力部の数よりも少ない項数での加減算処理などを容易に行うことができる。
【００３４】
また、請求項３に記載したように、並列乗算器及び乗算結果記憶回路を複数組備える構成としても良く、斯様に構成すれば、複数の並列乗算器によって乗算処理が並列に実行される場合でも、算術論理演算回路は、それら並列乗算器の乗算結果データを格納する乗算結果記憶回路の出力データと演算結果記憶回路の出力データとをオペランドとして同時に算術論理演算できるので、演算処理を高速に実行することができる。
【００３８】
加えて、請求項４に記載したように、並列乗算器と乗算結果記憶回路との間に、並列乗算器の乗算結果データを累積加減算する算術演算回路を具備すると良い。斯様に構成すれば、例えば、演算周期をより多く必要とする倍精度乗算処理などを実行する場合に、算術論理演算回路で処理を行うと同時に、算術演算回路によって中間処理を行うことができるので、算術論理演算回路の処理負担を軽減して、総じて演算処理を高速に実行することができる。
【００３９】
【発明の実施の形態】
（第１実施例）
以下、本発明の第１実施例について図１及び図２を参照して説明する。尚、図５と同一部分には同一符号を付して説明を省略し、以下異なる部分についてのみ説明する。本実施例のＤＳＰ（演算装置）３１においては、図５に示すＤＳＰ１０のＡＬＵ６に代えて、３つの入力部３２ａ，３２ｂ及び３２ｃを備え、これらに夫々与えられる３つのオペランドデータを同時に演算可能に構成されたＡＬＵ（算術論理演算回路）３２が配置されている。
【００４０】
また、マルチプレクサ９及び５に代えて、ＡＬＵ３２の３つの入力部３２ａ，３２ｂ及び３２ｃには、２つの入力ポート（入力端子）に与えられるデータの何れか一方を選択して出力ポート（出力端子）に出力するマルチプレクサ（選択回路）３３，３４及び３５が配置されている。マルチプレクサ３３，３４及び３５の一方の入力ポートには、アキュムレータ（演算結果記憶回路）７，８及び乗算結果格納レジスタ（乗算結果記憶回路）４の格納データが与えられており、他方の入力ポートには、ゼロデータが与えられている。その他の構成は、図５と同様である。
【００４１】
斯様に構成することによって、ＡＬＵ３２は、３つの入力部３２ａ，３２ｂ及び３２ｃに夫々与えられるオペランドデータをａ，ｂ及びｃとすると、以下のような３項間での加減算が実行可能となる。
ａ＋ｂ＋ｃ，ａ−ｂ＋ｃ，ａ＋ｂ−ｃ，ａ−ｂ−ｃ
また、何れか一つのオペランドデータを加減算に使用しない場合は、対応するマルチプレクサにゼロデータを選択させることにより、従来と同様に、ａ±ｂ，ｂ±ｃ，ａ±ｃの２項演算を行うこともできる。
【００４２】
次に、本実施例の作用について図２をも参照して説明する。図２は、図７に示した場合と同様に、ＤＳＰ３１によって倍精度乗算を行う場合のタイミングチャートである。この図２においては、サイクル４までは図７に示すタイミングチャートと同様に演算処理が行われる。
【００４３】
即ち、ＡＬＵ３２は、サイクル２ではマルチプレクサ３２ｃを介して、サイクル３及び４ではマルチプレクサ３２ａ，３２ｃを介して図７に示す場合と同様の命令を実行する。この場合、加算について使用しないマルチプレクサ３２ｂについては、上述のようにゼロデータを選択させておく。
【００４４】
そして、サイクル５において、ＡＬＵ３２は、マルチプレクサ３３，３４及び３５を用いて、命令（Ｚ１＝Ｚ１＋Ｚ０＞＞１６＋Ｍ：アキュムレータＺ１の内容に、アキュムレータＺ０を１６ビット右シフトしたものとレジスタＭの内容とを加算）を実行する。
【００４５】
即ち、ここでのサイクル５においてＡＬＵ３２により実行される命令は、図７に示すタイミングチャートにおいてサイクル５及び６で実行された命令を、１サイクルで行うものである。
【００４６】
そして、次のサイクル６においては、サイクル５でのＡＬＵ３２の命令（Ｚ１＝Ｚ１＋Ｚ０＞＞１６＋Ｍ）実行結果がアキュムレータ８（Ｚ１）に格納されて、アキュムレータ８の内容が（ＮＥＷＺ）となり、乗算処理は終了する。従って、本実施例における積和サイクルは、サイクル２〜５の４サイクルであり、従来よりも１サイクル少ない時間で実行されることになる。
【００４７】
以上のように本実施例によれば、乗算オペランドレジスタ１及び２に夫々与えられる１６ビットのオペランドデータを乗算器３により乗算処理して３２ビット×３２ビットの倍精度乗算処理を行う場合に、ＡＬＵ３２を、３つの入力部３２ａ，３２ｂ及び３２ｃに与えられるオペランドデータについて同時に演算可能な構成として、その入力部３２ａ，３２ｂ及び３２ｃには、一方の入力ポートにアキュムレータ７，８及びレジスタ４の出力データが与えられると共に、他方の入力ポートにゼロデータが与えられるマルチプレクサ３３，３４及び３５を配置した。
【００４８】
従って、最後の乗算▲４▼（ＸＨ×ＹＨ）の結果がレジスタ４に格納されると同時にアキュムレータ８に対する累加算を行うことができるので、従来よりも１サイクル少ない時間で倍精度乗算処理を実行することができる。具体的には、従来は同じ乗算処理を行うのに５サイクル要したものを４サイクルで行うことが可能となり、処理速度を２０％向上させることができた。
【００４９】
また、マルチプレクサ３３，３４及び３５を備えたことにより、演算処理を多様なオペランドの組み合わせで行うことができると共に、何れかのマルチプレクサにゼロデータを選択して出力させることによって、従来と同様に、ａ±ｂ，ｂ±ｃ，ａ±ｃの２項演算を容易に行うこともできる。
【００５０】
（第２実施例）
図３及び図４は本発明の第２実施例を示すものであり、図８と同一部分には同一符号を付して説明を省略し、以下異なる部分についてのみ説明する。第２実施例のＤＳＰ（演算装置）３６においては、図８に示すＤＳＰ２６のＡＬＵ２３に代えて、第１実施例と同様に、３つのオペランドデータを同時に演算可能に構成されたＡＬＵ（算術論理演算回路）３７が配置されている。
【００５１】
また、マルチプレクサ２１及び２２に代えて、ＡＬＵ３７の３つの入力部３７ａ，３７ｂ及び３７ｃには、３つの入力ポートに与えられるデータの何れか一方を選択して出力するマルチプレクサ（選択回路）３８，３９及び４０が配置されている。マルチプレクサ３８の２つの入力ポートには、アキュムレータ（演算結果記憶回路）２４（Ｚ０）及び２５（Ｚ１）の格納データが与えられており、マルチプレクサ３９の２つの入力ポートには、アキュムレータ２４及び乗算結果格納レジスタ（乗算結果記憶回路）１９（Ｍ１）の格納データが与えられている。
【００５２】
また、マルチプレクサ４０の２つの入力ポートには、アキュムレータ２５及び乗算結果格納レジスタ（乗算結果記憶回路）２０（Ｍ２）の格納データが与えられている。そして、各マルチプレクサ３８，３９及び４０の残りの１つの入力ポートには、何れもゼロデータが与えられている。その他の構成は、図８と同様である。
【００５３】
斯様に構成することによって、ＡＬＵ３７は、ＡＬＵ３２と同様に、３つの入力部３７ａ，３７ｂ及び３７ｃに夫々与えられるオペランドデータａ，ｂ，ｃについて、３項間での加減算が実行可能となり、また、何れか１つのマルチプレクサにゼロデータを選択させることによって、従来と同様に２項演算を行うこともできる。
【００５４】
次に、第２実施例の作用について説明する。図４は、図１０に示した場合と同様に、ＤＳＰ３６によって倍精度乗算を行う場合のタイミングチャートである。図４において、ＡＬＵ３７は、サイクル３において、マルチプレクサ３８，３９及び４０を用いて、命令（Ｚ０＝Ｚ０＋Ｍ１＋Ｍ２：アキュムレータＺ０の内容に、レジスタＭ１及びＭ２の内容を加算）を実行する。
【００５５】
即ち、ここでのサイクル３においてＡＬＵ３７により実行される命令は、図１０に示すタイミングチャートにおけるサイクル３及び４で実行された命令を１サイクルで行うものである。
【００５６】
そして、次のサイクル４においては、サイクル３でのＡＬＵ３７の命令（Ｚ０＝Ｚ０＋Ｍ１＋Ｍ２）実行結果がアキュムレータ２４（Ｚ０）に格納されて、アキュムレータ２４の内容が（ＮＥＷＺ）となり、乗算処理は終了する。従って、第２実施例における積和サイクルは、サイクル２〜３の２サイクルであり、従来よりも１サイクル少ない時間で実行されることになる。
【００５７】
以上のように第２実施例によれば、２つの乗算器（並列乗算器）１２及び１３並びに２つの累積演算器（算術演算回路）１７及び１８を備えると共に、ＡＬＵ３７を、３つの入力部３７ａ，３７ｂ及び３７ｃに与えられるオペランドデータについて同時に演算可能な構成として、その入力部３７ａ，３７ｂ及び３７ｃには、１つの入力ポートにゼロデータが与えられるマルチプレクサ３８，３９及び４０を備える構成とした。
【００５８】
従って、ＡＬＵ３７は、１サイクルで命令（Ｚ０＝Ｚ０＋Ｍ１＋Ｍ２）を実行することが可能となり、従来よりも１サイクル少ない時間で倍精度乗算処理を実行することができる。具体的には、従来は同じ乗算処理を行うのに３サイクル要したものを２サイクルで行うことが可能となり、処理速度を約３３％向上させることができた。
【００５９】
本発明は上記し且つ図面に記載した実施例にのみ限定されるものではなく、次のような変形または拡張が可能である。
第２実施例において、マルチプレクサ３８，３９及び４０に代えて、マルチプレクサ３８，３９及び４０のゼロデータが与えられている入力ポートを削除した２入力のマルチプレクサを配置して、上記と同様に、使用しない入力部に与えられているデータをＡＬＵ３７の内部で無視するようにしても良い。
【００６０】
処理データやレジスタ，アキュムレータやＡＬＵなどのビット幅は一例であり、個々の処理系に応じて適宜変更して適用すれば良い。
第１実施例の乗算器３とレジスタ４との間に、第２実施例のように算術演算回路を配置しても良い。
第１実施例において、乗算結果を累積加算することなく１回毎の乗算結果のみを求める場合には、アキュムレータ８を設けずとも良い。
【００６１】
第２実施例において、このような乗算処理のみを行う場合には、アキュムレータ２５を設けずとも良い。
並列乗算器及び乗算結果記憶回路を、３組以上設けても良い。
算術論理演算回路は、入力部を４つ以上有するものでも良い。
演算装置としてはＤＳＰに限ることなく、例えばＣＰＵやマイクロコンピュータの内部回路であっても良い。
【００６２】
【発明の効果】
本発明は以上説明した通りであるので、以下の効果を奏する。
請求項１記載の演算装置によれば、算術論理演算回路は、３つ以上の複数の入力部に与えられるデータをオペランドとして同時に算術論理演算することにより、従来のように入力部を２つのみ備えているものに比べて、例えば、処理データの桁数を多く必要とする倍精度乗算処理などをより少ない演算周期で行うことができ、演算処理を高速に実行することができる。そして、算術論理演算回路は、複数の選択回路を介して各入力部に与えられるオペランドデータを適宜選択することによって、より多様なオペランドの組み合わせによる演算処理を行うことができる。
【００６３】
請求項２記載の演算装置によれば、算術論理演算回路は、各入力部に与えられるオペランドデータの内何れかをゼロデータとすることによって、入力部の数よりも少ない項数での加減算処理などを容易に行うことができる。
【００６４】
請求項３記載の演算装置によれば、複数の並列乗算器によって乗算処理が並列に実行される場合でも、算術論理演算回路は、それら並列乗算器の乗算結果データを格納する乗算結果記憶回路の出力データと演算結果記憶回路の出力データとをオペランドとして同時に算術論理演算できるので、演算処理を高速に実行することができる。
【００６７】
請求項４記載の演算装置によれば、例えば、演算周期をより多く必要とする倍精度乗算処理などを実行する場合に、算術論理演算回路で処理を行うと同時に、算術演算回路によって中間処理を行うことができるので、算術論理演算回路の処理負担を軽減して、総じて演算処理を高速に実行することができる。

【図面の簡単な説明】
【図１】本発明の第１実施例における演算装置の構成を示す機能ブロック図
【図２】倍精度乗算処理を行う場合のタイミングチャート
【図３】本発明の第２実施例を示す図１相当図
【図４】図２相当図
【図５】従来技術（その１）を示す図１相当図
【図６】倍精度乗算処理を行う場合の処理過程を概念的に示す図であり、（ｂ）は（ａ）をより具体的なイメージで示す図
【図７】図２相当図
【図８】従来技術（その２）を示す図３相当図
【図９】図６相当図
【図１０】図４相当図
【符号の説明】
３は乗算器（並列乗算器）、４は乗算結果格納レジスタ（乗算結果記憶回路）、７及び８はアキュムレータ（演算結果記憶回路）、１２及び１３は乗算器（並列乗算器）、１７及び１８は累積演算器（算術演算回路）、１９及び２０は乗算結果格納レジスタ（乗算結果記憶回路）、２４及び２５はアキュムレータ（演算結果記憶回路）、３１はＤＳＰ（演算装置）、３２はＡＬＵ（算術論理演算回路）、３２ａ，３２ｂ及び３２ｃは入力部、３３，３４及び３５はマルチプレクサ（選択回路）、３６はＤＳＰ（演算装置）、３７はＡＬＵ（算術論理演算回路）、３７ａ，３７ｂ及び３７ｃは入力部、３８，３９及び４０はマルチプレクサ（選択回路）を示す。
[0001]
[Technical Field of the Invention]
The present invention relates to an arithmetic device, such as a DSP that performs digital signal processing, that includes a parallel multiplier.
[0002]
[Prior art]
In recent years, remarkable progress in LSI technology, signal processing technology, and the like has driven the digitization of information communication terminals such as mobile phones. In the mobile phone field in particular, digitization offers many advantages, such as relieving line-capacity shortages, eliminating noise, improving confidentiality, and extending talk and standby times, and the field is currently growing very rapidly.
[0003]
To advance this digitization, a digital signal processing LSI (DSP: Digital Signal Processor), which is a key device, is indispensable. Because product-sum operations account for most of digital signal processing, speeding up the product-sum operation to shorten processing time is a major development challenge for DSPs.
[0004]
Generally, fixed-point DSPs are used when processing speed is the priority, but in applications where arithmetic accuracy matters the programmer must always take arithmetic errors into account. For processing that requires high-precision arithmetic, a DSP with a long word length (bit width) could be used. However, a long word length enlarges the circuit scale of every component of the DSP, including memory size, data buses, pin count, and register word length, which increases the disadvantages in chip size, cost, and operating speed. Therefore, a common approach is to perform ordinary processing in single precision and to use double precision only for processing that requires high precision.
[0005]
Here, FIG. 5 shows a configuration example of a DSP that performs such double-precision operations. For example, operand data stored in the 16-bit multiplication operand registers 1 (X) and 2 (Y) is supplied to a multiplier (MPY) 3, which performs the multiplication (X × Y). The multiplication result of the multiplier 3 is output to, for example, a 32-bit multiplication result storage register (M) 4.
[0006]
The data stored in the multiplication result storage register (hereinafter, referred to as a register) 4 is provided as an operand to an arithmetic and logic operation circuit (hereinafter, referred to as an ALU) 6 via a three-input multiplexer 5. The operation result data of the ALU 6 is stored in, for example, accumulators 7 (Z0) and 8 (Z1) of 64 bits each.
[0007]
Output data of the accumulators 7 and 8 are supplied to the ALU 6 via a multiplexer 5 and another multiplexer 9, respectively. The ALU 6 controls which data is to be selected and output by the multiplexers 5 and 9. The above constitutes the DSP 10. The DSP 10 performs arithmetic processing in an arithmetic cycle (cycle) synchronized with a system clock signal (not shown).
[0008]
FIG. 6 conceptually shows the arithmetic processing when the DSP 10 performs double-precision multiplication (32 bits × 32 bits). FIG. 6A shows that the result of multiplying the 32-bit operands XH/XL and YH/YL (upper/lower halves of 16 bits each) is added to the accumulator Z, and FIG. 6B shows the bit arrangement in which the multiplication results (1) to (4) are finally stored in the accumulator Z.
[0009]
That is, in FIG. 6B, each operand data is divided into 16-bit upper data and lower data, and four multiplications (1) to (4) are performed.
(1): XL × YL
(2): XH × YL
(3): XL × YH
(4): XH × YH
Then, the multiplication results of (1) to (4) are sequentially applied to the ALU 6 via the register 4 and are added to the accumulator Z while performing digit alignment.
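As an illustration of the decomposition in (1) to (4), the following Python sketch forms the four 16-bit × 16-bit partial products and aligns them into the full result. It assumes unsigned operands, which the text does not specify, and the names `split16` and `double_precision_multiply` are illustrative and do not come from the patent.

```python
MASK16 = 0xFFFF


def split16(value):
    """Split a 32-bit unsigned value into its (upper, lower) 16-bit halves."""
    return (value >> 16) & MASK16, value & MASK16


def double_precision_multiply(x, y):
    """Multiply two 32-bit unsigned values using four 16x16 partial products."""
    xh, xl = split16(x)
    yh, yl = split16(y)
    p1 = xl * yl  # (1): XL x YL, weight 2**0
    p2 = xh * yl  # (2): XH x YL, weight 2**16
    p3 = xl * yh  # (3): XL x YH, weight 2**16
    p4 = xh * yh  # (4): XH x YH, weight 2**32
    # Digit alignment: shift each partial product to its weight, then sum.
    return p1 + ((p2 + p3) << 16) + (p4 << 32)
```

The middle products (2) and (3) share the same weight, which is why they can be accumulated at the same bit position before the final alignment.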
[0010]
FIG. 7 is a timing chart of the multiplication process performed by the DSP 10. In FIG. 7, when data XL is stored in the register 1 and data YL is stored in the register 2 in cycle 1 (see FIGS. 7A and 7B), the multiplier 3 executes the multiplication ((1): XL × YL) in one clock, and in the next cycle 2 the multiplication result is stored in the register 4 (M) (see FIG. 7C). At the same time, data XH is stored in the register 1 for the next multiplication (2) (the register 2 still holds data YL).
[0011]
In cycle 2, the ALU 6 executes an instruction (Z0 = M >> 16: right-shifts the register M by 16 bits and transfers it to the accumulator Z0) (see FIG. 7D). Here, bits 31 to 0 of the register 4 (M) are transferred to bits 63 to 32 of the accumulator 7 (Z0) when the transfer from the register M to the accumulator Z is performed without shifting. Therefore, the result of executing the above instruction is transferred to bits 47 to 16 of accumulator 7 (in the next cycle 3).
[0012]
In the next cycle 3, the multiplication result ((2): XH × YL) is stored in the register 4 (see FIG. 7C). At the same time, data XL is stored in the register 1 and data YH is stored in the register 2 for the next multiplication (3). The ALU 6 executes the instruction (Z0 = Z0 + M: the contents of the register M are added to the accumulator Z0), which transfers the multiplication result ((2): XH × YL) to bits 63 to 32 of the accumulator 7 (see FIG. 7D). At this time, the result of the instruction (Z0 = M >> 16) executed by the ALU 6 in the previous cycle is stored in the accumulator 7 (see FIG. 7E).
[0013]
In the next cycle 4, as in cycle 3, the multiplication result ((3): XL × YH) is stored in the register 4 (see FIG. 7C), and at the same time data XH is stored in the register 1 (the register 2 still holds data YH). The ALU 6 executes the instruction (Z0 = Z0 + M), which adds the multiplication result ((3): XL × YH) to bits 63 to 32 of the accumulator 7 (see FIG. 7D), and the result of the instruction (Z0 = Z0 + M) executed by the ALU 6 in the previous cycle is stored in the accumulator 7 (see FIG. 7E).
[0014]
Note that if transferring the multiplication results (2) and (3) to bits 63 to 32 of the accumulator 7 (Z0) in cycles 3 and 4 causes an overflow into the upper digits, the overflow is handled appropriately by an overflow processing unit (not shown).
[0015]
Then, in the next cycle 5, the multiplication result ((4): XH × YH) is stored in the register 4 (see FIG. 7C), and at the same time the ALU 6 executes the instruction (Z0 = Z0 >> 16 + M: the contents of the register M are added to the accumulator Z0 shifted right by 16 bits) (see FIG. 7D).
[0016]
That is, through cycle 4, the cumulative sum ((1) + (2) + (3)) of the multiplication results (1) to (3) has been stored in bits 63 to 16 of the accumulator 7, and shifting the accumulator 7 right by 16 bits places that sum in bits 47 to 0. In this state, adding the contents of the register 4 (the multiplication result (4)) to bits 63 to 32 of the accumulator 7 finally yields the 32-bit × 32-bit multiplication result. At this time, the result of the instruction (Z0 = Z0 + M) executed by the ALU 6 in the preceding cycle is stored in the accumulator 7 (Z0) (see FIG. 7E).
[0017]
Then, in the next cycle 6, the ALU 6 executes the instruction (Z1 = Z1 + Z0: the contents of the accumulator Z0 are added to the contents of the accumulator Z1) (see FIG. 7D), and the result of the instruction (Z0 = Z0 >> 16 + M) executed by the ALU 6 in the preceding cycle is stored in the accumulator 7 (see FIG. 7E).
[0018]
In the next cycle 7, as a result of the ALU 6 executing the instruction (Z1 = Z1 + Z0) in cycle 6, the contents of the accumulator 7 (Z0) have been cumulatively added to the previous contents (OLD Z) of the accumulator 8 (Z1), and the multiplication process ends when Z1 = (NEW Z). In the above, cycles 2 to 6 are the product-sum cycles, requiring five cycles.
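The instruction sequence of FIG. 7 can be traced with a short Python sketch. This is a simplified model under several assumptions that are not stated in the text: the operands are unsigned, Python's unbounded integers stand in for the 64-bit accumulator together with the overflow processing unit, and the one-cycle delay between executing an instruction and the result appearing in the accumulator is not modeled. The helper names are hypothetical.

```python
def to_acc(m):
    """Unshifted transfer: register M bits 31-0 land in accumulator bits 63-32."""
    return m << 32


def conventional_multiply_accumulate(x, y, z1=0):
    """Trace the two-input ALU sequence; return (new Z1, ALU instruction count)."""
    xh, xl = (x >> 16) & 0xFFFF, x & 0xFFFF
    yh, yl = (y >> 16) & 0xFFFF, y & 0xFFFF
    executed = []

    m = xl * yl                      # (1) is in register M during cycle 2
    z0 = to_acc(m) >> 16             # places (1) in bits 47-16
    executed.append("Z0 = M >> 16")

    m = xh * yl                      # (2) is in register M during cycle 3
    z0 = z0 + to_acc(m)
    executed.append("Z0 = Z0 + M")

    m = xl * yh                      # (3) is in register M during cycle 4
    z0 = z0 + to_acc(m)
    executed.append("Z0 = Z0 + M")

    m = xh * yh                      # (4) is in register M during cycle 5
    z0 = (z0 >> 16) + to_acc(m)      # realign (1)+(2)+(3), then add (4)
    executed.append("Z0 = Z0 >> 16 + M")

    z1 = z1 + z0                     # cycle 6: accumulate the finished product
    executed.append("Z1 = Z1 + Z0")
    return z1, len(executed)
```

Under these assumptions the model reproduces the five product-sum cycles (cycles 2 to 6) counted in the text, with one ALU instruction per cycle.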
[0019]
FIG. 8 shows a configuration example of a DSP having two multipliers. Operand data stored in the 16-bit multiplication operand registers 11a (XH) and 11b (XL) is supplied to the multipliers 12 (MPY1) and 13 (MPY2), respectively. Operand data stored in the 16-bit multiplication operand registers 14a (YH) and 14b (YL) is supplied to both of the multiplexers 15 and 16, and the output data of the multiplexers 15 and 16 is supplied to the multipliers 12 and 13, respectively.
[0020]
The output data of the multipliers 12 and 13 is supplied to the cumulative arithmetic units 17 and 18, respectively, and the output data of the cumulative arithmetic units 17 and 18 is supplied to the 48-bit multiplication result storage registers 19 (M1) and 20 (M2), respectively. The data stored in the multiplication result storage registers 19 and 20 is supplied to the three-input multiplexers 21 and 22, respectively, and is also fed back to the cumulative arithmetic units 17 and 18, respectively.
[0021]
The multiplexers 21 and 22 correspond to the multiplexers 9 and 5 in the configuration of FIG. 5, and the subsequent ALU 23 and accumulators 24 (Z0) and 25 (Z1) correspond to the ALU 6 and the accumulators 7 and 8 in the configuration of FIG. 5. The above constitutes the DSP 26.
[0022]
FIG. 9 conceptually shows the arithmetic processing when the DSP 26 performs double-precision multiplication. It is basically the same as FIG. 6, except that FIG. 9B shows that multiplications (1) and (2) are performed by the multiplier 12 (MPY1) and multiplications (3) and (4) are performed by the multiplier 13 (MPY2).
[0023]
FIG. 10 is a timing chart of the multiplication process performed by the DSP 26. The timing charts of the multiplication operand registers 11a, 11b, 14a, and 14b are omitted from FIG. 10; operand data XH, XL, YH, and YL is stored in these registers simultaneously in cycle 1 (not shown) and does not change thereafter until the next arithmetic processing is executed.
[0024]
In FIG. 10, in cycle 2, the multiplication results (3) (XH × YL) and (1) (XL × YL) produced by the multipliers 12 and 13 are stored in bits 47 to 16 of the multiplication result storage registers 19 (M1) and 20 (M2) (see FIGS. 10A and 10B). At the same time, the multiplexers 15 and 16 switch the operand data (YL → YH), and the multipliers 12 and 13 execute the multiplications (4) (XH × YH) and (2) (XL × YH), respectively.
[0025]
In the next cycle 3, the cumulative arithmetic unit 17 stores in the register 19 the sum ((3) + (4)) obtained by shifting the multiplication result (3) (XH × YL) held in the register 19 in cycle 2 right by 16 bits (placing it in bits 31 to 0) and adding the multiplication result (4) (XH × YH) computed in cycle 2 to bits 47 to 16.
[0026]
Similarly, the cumulative arithmetic unit 18 stores in the register 20 the sum ((1) + (2)) obtained by shifting the multiplication result (1) (XL × YL) held in the register 20 in cycle 2 right by 16 bits and adding the multiplication result (2) (XL × YH) computed in cycle 2. The ALU 23 executes the instruction (Z0 = Z0 + M2 >> 16: the register M2 shifted right by 16 bits is added to the accumulator Z0) (see FIG. 10C).
[0027]
In the next cycle 4, the contents of the registers 19 and 20 are held as they are, and the ALU 23 executes the instruction (Z0 = Z0 + M1: the contents of the register M1 are added to the accumulator Z0). Meanwhile, the sum ((1) + (2)) is cumulatively added to bits 47 to 0 of the contents (OLD Z) of the accumulator 24 as the result of the instruction (Z0 = Z0 + M2 >> 16) executed by the ALU 23 in cycle 3.
[0028]
Then, in the next cycle 5, the sum ((3) + (4)) is cumulatively added to bits 63 to 16 of the contents of the accumulator 24 as the result of the instruction (Z0 = Z0 + M1) executed by the ALU 23 in cycle 4. This completes the multiplication process. Cycles 2 to 4 are the product-sum cycles, requiring three cycles.
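The two-multiplier flow of FIG. 10 can be sketched in the same way. The following Python model is an illustration under assumptions not stated in the text: operands are unsigned, unbounded integers replace the fixed register widths, and pipeline timing is ignored. It also assumes that an unshifted 48-bit register lands in accumulator bits 63 to 16, which is inferred from the bit ranges quoted above. All function names are hypothetical.

```python
def to_acc48(m):
    """Unshifted transfer: 48-bit register M1/M2 lands in accumulator bits 63-16."""
    return m << 16


def dual_multiplier_multiply_accumulate(x, y, z0=0):
    """Trace the two-multiplier sequence; return (new Z0, ALU instruction count)."""
    xh, xl = (x >> 16) & 0xFFFF, x & 0xFFFF
    yh, yl = (y >> 16) & 0xFFFF, y & 0xFFFF
    alu_instructions = 0

    # Cycle 2: both products with YL are stored in bits 47-16 of M1 and M2.
    m1 = (xh * yl) << 16
    m2 = (xl * yl) << 16

    # Cycle 3: units 17 and 18 shift the stored value right by 16 bits and
    # add the products with YH into bits 47-16.
    m1 = (m1 >> 16) + ((xh * yh) << 16)
    m2 = (m2 >> 16) + ((xl * yh) << 16)

    z0 = z0 + (to_acc48(m2) >> 16)   # ALU, cycle 3: Z0 = Z0 + M2 >> 16
    alu_instructions += 1
    z0 = z0 + to_acc48(m1)           # ALU, cycle 4: Z0 = Z0 + M1
    alu_instructions += 1
    return z0, alu_instructions
```

The sketch counts only ALU instructions (two); the text counts three product-sum cycles because cycle 2, in which the first products are latched, is included.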
[0029]
[Problems to be solved by the invention]
In the DSP 10 that performs such an operation, it is required that the arithmetic processing be performed as fast as possible even when there is a certain restriction on the bit width of the data to be handled.
The present invention has been made in view of the above circumstances, and has as its object to provide an arithmetic device capable of performing parallel multiplication processing with a smaller number of cycles.
[0030]
[Means for Solving the Problems]
In order to achieve the above object, the arithmetic device according to claim 1 comprises a parallel multiplier, a multiplication result storage circuit that stores multiplication result data of the parallel multiplier, an arithmetic and logic operation circuit that performs arithmetic and logical operations on data output from the multiplication result storage circuit, and an operation result storage circuit that stores operation result data of the arithmetic and logic operation circuit, wherein
the arithmetic and logic operation circuit has three or more input units and is configured to operate simultaneously on the data supplied to these input units as operands, and
the device further comprises a plurality of selection circuits, each having an output terminal connected to a respective input unit of the arithmetic and logic operation circuit and a plurality of input terminals supplied with output data of the multiplication result storage circuit or output data of the operation result storage circuit, each selection circuit being configured to selectively output the data supplied to any one of its input terminals.
[0031]
With such a configuration, the arithmetic and logic operation circuit performs arithmetic and logical operations simultaneously on the data supplied to three or more input units as operands. Compared with a conventional circuit having only two input units, it can therefore perform, for example, double-precision multiplication, which requires many digits of processing data, in fewer operation cycles. In addition, by appropriately selecting the operand data supplied to each input unit via the plurality of selection circuits, the arithmetic and logic operation circuit can perform operations on a wider variety of operand combinations.
[0032]
In this case, as described in claim 2, zero data is preferably supplied to one of the plurality of input terminals of each selection circuit. With such a configuration, the arithmetic and logic operation circuit can easily perform addition and subtraction with fewer terms than the number of input units by setting any of the operand data supplied to the input units to zero data.
[0034]
Further, as described in claim 3, a plurality of sets of a parallel multiplier and a multiplication result storage circuit may be provided. With such a configuration, even when multiplications are executed in parallel by the plurality of parallel multipliers, the arithmetic and logic operation circuit can operate simultaneously on the output data of the multiplication result storage circuits, which store the multiplication result data of those parallel multipliers, and the output data of the operation result storage circuit as operands, so the arithmetic processing can be executed at high speed.
[0038]
In addition, as described in claim 4, an arithmetic operation circuit that cumulatively adds or subtracts the multiplication result data of the parallel multiplier is preferably provided between the parallel multiplier and the multiplication result storage circuit. With this configuration, when executing, for example, double-precision multiplication, which requires more operation cycles, the arithmetic operation circuit can perform intermediate processing while the arithmetic and logic operation circuit performs its own processing. This reduces the processing load on the arithmetic and logic operation circuit and speeds up the arithmetic processing as a whole.
[0039]
[Embodiments of the Invention]
(First embodiment)
Hereinafter, a first embodiment of the present invention will be described with reference to FIGS. 1 and 2. The same parts as those in FIG. 5 are denoted by the same reference numerals and their description is omitted; only the differences are described below. In the DSP (arithmetic unit) 31 of the present embodiment, the ALU 6 of the DSP 10 shown in FIG. 5 is replaced by an ALU (arithmetic and logic operation circuit) 32 that has three input units 32a, 32b, and 32c and is configured to operate simultaneously on the three operands supplied to them.
[0040]
In place of the multiplexers 9 and 5, multiplexers (selection circuits) 33, 34, and 35, each of which selects one of the data supplied to its two input ports (input terminals) and outputs it to its output port (output terminal), are arranged at the three input units 32a, 32b, and 32c of the ALU 32. One input port of each of the multiplexers 33, 34, and 35 is supplied with the data stored in the accumulators (operation result storage circuits) 7 and 8 and the multiplication result storage register (multiplication result storage circuit) 4, respectively, and the other input port is supplied with zero data. The rest of the configuration is the same as in FIG. 5.
[0041]
With such a configuration, letting a, b, and c denote the operand data supplied to the three input units 32a, 32b, and 32c, respectively, the ALU 32 can execute the following three-term additions and subtractions.
a + b + c, a − b + c, a + b − c, a − b − c
When one of the operands is not used in the addition or subtraction, the corresponding multiplexer is made to select zero data, so that the two-term operations a ± b, b ± c, and a ± c can also be performed as in the conventional case.
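The three-term operations and the zero-data selection can be illustrated with a minimal Python sketch. The functions `mux2` and `alu3` and their parameters are hypothetical names chosen for this illustration; they model only the arithmetic behavior described above, not the circuit.

```python
def mux2(select_data, data):
    """Two-input multiplexer: output the stored data, or zero data if deselected."""
    return data if select_data else 0


def alu3(a, b, c, subtract_b=False, subtract_c=False):
    """Three-input add/subtract: a +/- b +/- c in a single operation."""
    term_b = -b if subtract_b else b
    term_c = -c if subtract_c else c
    return a + term_b + term_c


# Three-term forms listed in the text.
three_term = [
    alu3(5, 3, 1),                                    # a + b + c
    alu3(5, 3, 1, subtract_b=True),                   # a - b + c
    alu3(5, 3, 1, subtract_c=True),                   # a + b - c
    alu3(5, 3, 1, subtract_b=True, subtract_c=True),  # a - b - c
]

# Two-term form a + c: the multiplexer on input b selects zero data.
two_term = alu3(mux2(True, 5), mux2(False, 3), mux2(True, 1))
```

Selecting zero data on one input lets the same three-input datapath serve the conventional two-operand instructions without a separate mode.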
[0042]
Next, the operation of the present embodiment will be described with reference also to FIG. 2. FIG. 2 is a timing chart of double-precision multiplication performed by the DSP 31, corresponding to the case shown in FIG. 7. In FIG. 2, the arithmetic processing up to cycle 4 is performed in the same manner as in the timing chart of FIG. 7.
[0043]
That is, the ALU 32 executes the same instructions as in FIG. 7, receiving its operand via the multiplexer feeding the input unit 32c in cycle 2 and via the multiplexers feeding the input units 32a and 32c in cycles 3 and 4. The multiplexer feeding the input unit 32b, which is not used for these additions, is made to select zero data as described above.
[0044]
Then, in cycle 5, the ALU 32 uses the multiplexers 33, 34, and 35 to execute the instruction (Z1 = Z1 + Z0 >> 16 + M: the accumulator Z0 shifted right by 16 bits and the contents of the register M are added to the contents of the accumulator Z1).
[0045]
That is, the instruction executed by the ALU 32 in cycle 5 here performs, in a single cycle, the instructions executed in cycles 5 and 6 in the timing chart shown in FIG. 7.
[0046]
Then, in the next cycle 6, the result of the instruction (Z1 = Z1 + Z0 >> 16 + M) executed by the ALU 32 in cycle 5 is stored in the accumulator 8 (Z1), the contents of the accumulator 8 become (NEW Z), and the multiplication process ends. Therefore, the product-sum cycles in this embodiment are the four cycles 2 to 5, and the process is executed in one cycle less than in the conventional case.
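The shortened sequence of the first embodiment can be traced with the following Python sketch. As a simplified model it assumes unsigned operands, uses unbounded integers in place of the 64-bit accumulators and the overflow processing unit, ignores the one-cycle register delay, and applies the 16-bit shift to the operand before it enters the adder; where the shift is actually performed is not stated in the text. The names `alu3`, `to_acc`, and `three_input_multiply_accumulate` are hypothetical.

```python
def to_acc(m):
    """Unshifted transfer: register M bits 31-0 land in accumulator bits 63-32."""
    return m << 32


def alu3(a, b, c):
    """Three-input adder; an unused input is fed zero data by its multiplexer."""
    return a + b + c


def three_input_multiply_accumulate(x, y, z1=0):
    """Trace the three-input ALU sequence; return (new Z1, ALU instruction count)."""
    xh, xl = (x >> 16) & 0xFFFF, x & 0xFFFF
    yh, yl = (y >> 16) & 0xFFFF, y & 0xFFFF
    count = 0

    z0 = alu3(0, 0, to_acc(xl * yl) >> 16)        # cycle 2: Z0 = M >> 16
    count += 1
    z0 = alu3(z0, 0, to_acc(xh * yl))             # cycle 3: Z0 = Z0 + M
    count += 1
    z0 = alu3(z0, 0, to_acc(xl * yh))             # cycle 4: Z0 = Z0 + M
    count += 1
    z1 = alu3(z0 >> 16, z1, to_acc(xh * yh))      # cycle 5: Z1 = Z1 + Z0 >> 16 + M
    count += 1
    return z1, count
```

Compared with the conventional flow, the final realignment of Z0, the addition of (4), and the accumulation into Z1 are folded into one three-term instruction, which is where the saved cycle comes from.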
[0047]
As described above, according to the present embodiment, in performing 32-bit × 32-bit double-precision multiplication by having the multiplier 3 multiply the 16-bit operand data supplied to the multiplication operand registers 1 and 2, the ALU 32 is configured to operate simultaneously on the operand data supplied to its three input units 32a, 32b, and 32c, and these input units are provided with the multiplexers 33, 34, and 35, each of which receives the output data of the accumulator 7, the accumulator 8, or the register 4 at one input port and zero data at the other input port.
[0048]
Therefore, since the addition of the result of the last multiplication (4) (XH × YH) stored in the register 4 and the addition to the accumulator 8 can be performed at the same time, the double-precision multiplication process can be executed in one cycle less than in the conventional case. Specifically, multiplication processing that conventionally required 5 cycles can be performed in 4 cycles, and the processing speed can be improved by 20%.
[0049]
Further, by providing the multiplexers 33, 34, and 35, arithmetic processing can be performed with various combinations of operands, and by causing any one of the multiplexers to select and output zero data, the two-term operations a ± b, b ± c, and a ± c can easily be performed as in the conventional case.
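As a rough illustration of how selecting zero data reduces a three-term operation to a two-term one, consider the following Python sketch. The port assignments, the `sub_b` and `sub_c` flags, and the names `mux` and `alu3` are hypothetical and are not taken from the drawings; only the idea of a zero-data port on each multiplexer comes from the text.

```python
ZERO = 0  # constant zero data supplied to one port of each multiplexer


def mux(select, *ports):
    """Selection circuit: output the data on the selected input port."""
    return ports[select]


def alu3(a, b, c, sub_b=False, sub_c=False):
    """Three-input ALU computing a ± b ± c in a single operation."""
    return a + (-b if sub_b else b) + (-c if sub_c else c)


z0, z1, m = 100, 20, 3  # example accumulator and register contents

# Three-term operation: every multiplexer selects its data port.
three_term = alu3(mux(0, z1, ZERO), mux(0, z0, ZERO), mux(0, m, ZERO))

# Two-term operation a + c: the middle multiplexer selects zero data.
two_term = alu3(mux(0, z1, ZERO), mux(1, z0, ZERO), mux(0, m, ZERO))

# Two-term operation a - b: the last multiplexer selects zero data.
difference = alu3(mux(0, z1, ZERO), mux(0, z0, ZERO), mux(1, m, ZERO), sub_b=True)
```

With the values above, `three_term` is 123, `two_term` is 23, and `difference` is -80, so the same three-input datapath covers both the three-term and the conventional two-term cases.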
[0050]
(Second embodiment)
FIGS. 3 and 4 show a second embodiment of the present invention. The same parts as those in FIG. 8 are denoted by the same reference numerals, and description thereof will be omitted. Hereinafter, only different parts will be described. In the DSP (arithmetic unit) 36 of the second embodiment, instead of the ALU 23 of the DSP 26 shown in FIG. 8, an ALU (arithmetic and logic operation circuit) 37 configured to be able to operate on three operand data simultaneously is disposed, as in the first embodiment.
[0051]
Also, instead of the multiplexers 21 and 22, multiplexers (selection circuits) 38, 39, and 40, each of which selects and outputs any one of the data supplied to its three input ports, are arranged at the three input units 37a, 37b, and 37c of the ALU 37. Two input ports of the multiplexer 38 are supplied with the data stored in the accumulators (operation result storage circuits) 24 (Z0) and 25 (Z1), and two input ports of the multiplexer 39 are supplied with the data stored in the accumulator 24 and in the multiplication result storage register (multiplication result storage circuit) 19 (M1).
[0052]
Two input ports of the multiplexer 40 are supplied with the data stored in the accumulator 25 and in the multiplication result storage register (multiplication result storage circuit) 20 (M2). The remaining input port of each of the multiplexers 38, 39, and 40 is supplied with zero data. Other configurations are the same as those in FIG. 8.
[0053]
With such a configuration, the ALU 37, like the ALU 32, can perform three-term addition and subtraction on the operand data a, b, and c provided to the three input units 37a, 37b, and 37c, respectively. By causing any one of the multiplexers to select zero data, a two-term operation can be performed as in the conventional case.
[0054]
Next, the operation of the second embodiment will be described. FIG. 4 is a timing chart in the case where double precision multiplication is performed by the DSP 36, as in the case shown in FIG. 10. In FIG. 4, the ALU 37 uses the multiplexers 38, 39, and 40 in cycle 3 to execute the instruction (Z0 = Z0 + M1 + M2: add the contents of the registers M1 and M2 to the contents of the accumulator Z0).
[0055]
That is, the instruction executed by the ALU 37 in cycle 3 here performs in one cycle the instructions executed in cycles 3 and 4 in the timing chart shown in FIG. 10.
[0056]
Then, in the next cycle 4, the execution result of the instruction (Z0 = Z0 + M1 + M2) of the ALU 37 in cycle 3 is stored in the accumulator 24 (Z0), the content of the accumulator 24 becomes (NEW Z), and the multiplication process ends. Therefore, the product-sum cycle in the second embodiment consists of the two cycles 2 and 3, and is executed in a time shorter by one cycle than in the related art.
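The two-lane arrangement can be modeled with a short Python sketch. The text states only that the ALU executes Z0 = Z0 + M1 + M2 in a single cycle; how the partial products are distributed between the two multipliers and aligned by the arithmetic operation circuits 17 and 18 is assumed here purely for illustration, and the names `lane` and `mac_two_lanes` are hypothetical. Operands are treated as unsigned 32-bit values.

```python
MASK16 = 0xFFFF


def lane(partials):
    """One multiplier lane: sum partial products after aligning each by its shift.

    Stands in for a multiplier followed by an arithmetic operation circuit;
    the alignment performed inside the lane is an assumption.
    """
    total = 0
    for product, shift in partials:
        total += product << shift
    return total


def mac_two_lanes(x, y, z0=0):
    """Accumulate x * y into z0 with two lanes and one three-input addition."""
    xh, xl = (x >> 16) & MASK16, x & MASK16
    yh, yl = (y >> 16) & MASK16, y & MASK16

    # Lane 1 (assumed split): products involving YL, result held in M1.
    m1 = lane([(xl * yl, 0), (xh * yl, 16)])
    # Lane 2 (assumed split): products involving YH, result held in M2.
    m2 = lane([(xl * yh, 16), (xh * yh, 32)])

    # Cycle 3 (from the text): Z0 = Z0 + M1 + M2 in a single ALU operation.
    return z0 + m1 + m2
```

Under this assumed split, `mac_two_lanes(x, y, z0)` equals `z0 + x * y`; a two-input ALU would need one cycle for `Z0 + M1` and another for adding `M2`.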
[0057]
As described above, according to the second embodiment, two multipliers (parallel multipliers) 12 and 13 and two accumulators (arithmetic operation circuits) 17 and 18 are provided, the ALU 37 is configured to be able to operate simultaneously on the operand data supplied to the three input units 37a, 37b, and 37c, and the input units 37a, 37b, and 37c are provided with the multiplexers 38, 39, and 40, each of which has one input port supplied with zero data.
[0058]
Therefore, the ALU 37 can execute the instruction (Z0 = Z0 + M1 + M2) in one cycle, and can execute the double-precision multiplication processing in one cycle less than in the related art. Specifically, multiplication processing that conventionally required three cycles can be performed in two cycles, improving the processing speed by about 33%.
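The percentages quoted for the two embodiments appear to correspond to the reduction in cycle count, which a short calculation makes explicit. The helper names `cycle_reduction` and `speedup` below are hypothetical; the cycle counts (5 to 4 for the first embodiment, 3 to 2 for the second) are those given in the description.

```python
def cycle_reduction(before, after):
    """Fraction of cycles saved relative to the conventional cycle count."""
    return (before - after) / before


def speedup(before, after):
    """Throughput ratio, assuming the clock period is unchanged."""
    return before / after


first_embodiment = cycle_reduction(5, 4)   # 0.2, the 20% figure
second_embodiment = cycle_reduction(3, 2)  # about 0.333, the 33% figure
```

Expressed as a throughput ratio instead, the same cycle counts give `speedup(5, 4) == 1.25` and `speedup(3, 2) == 1.5`, assuming the clock period is not lengthened by the three-input ALU.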
[0059]
The present invention is not limited to the embodiment described above and shown in the drawings, and the following modifications or extensions are possible.
In the second embodiment, instead of the multiplexers 38, 39, and 40, two-input multiplexers from which the input port supplied with zero data has been deleted may be used, and the data supplied to an input unit that is not used may be ignored inside the ALU 37.
[0060]
The bit widths of processing data, registers, accumulators, ALUs, and the like are merely examples, and may be appropriately changed and applied according to each processing system.
An arithmetic operation circuit may be arranged between the multiplier 3 and the register 4 in the first embodiment as in the second embodiment.
In the first embodiment, the accumulator 8 need not be provided when only the multiplication result for each time is obtained without cumulatively adding the multiplication results.
[0061]
In the second embodiment, when only such a multiplication process is performed, the accumulator 25 need not be provided.
Three or more sets of parallel multipliers and multiplication result storage circuits may be provided.
The arithmetic and logic operation circuit may have four or more input units.
The arithmetic device is not limited to the DSP, but may be, for example, a CPU or an internal circuit of a microcomputer.
[0062]
[Effects of the invention]
Since the present invention is as described above, the following effects are obtained.
According to the arithmetic device of the first aspect, the arithmetic and logic operation circuit performs arithmetic and logic operations simultaneously on the data supplied to the three or more input units as operands. Therefore, for example, double-precision multiplication processing, which handles processing data with a large number of digits, can be performed in fewer operation cycles than with a conventional circuit having only two input units, and the operation processing can be executed at a higher speed. In addition, the arithmetic and logic operation circuit can perform operation processing with a wider variety of operand combinations by appropriately selecting the operand data provided to each input unit via the plurality of selection circuits.
[0063]
According to the arithmetic device of the second aspect, by making any one of the operand data given to the input units zero data, the arithmetic and logic operation circuit can easily perform addition/subtraction processing with fewer terms than the number of input units.
[0064]
According to the arithmetic device of the third aspect, even when multiplication processing is performed in parallel by a plurality of parallel multipliers, the arithmetic and logic operation circuit can perform the arithmetic and logic operation simultaneously on the output data of the multiplication result storage circuits, which store the multiplication result data of the parallel multipliers, and the output data of the operation result storage circuit as operands, so that the operation processing can be executed at high speed.
[0067]
According to the arithmetic device of the fourth aspect, when performing, for example, a double-precision multiplication process that requires a longer operation cycle, the arithmetic operation circuit can perform intermediate processing in parallel with the processing performed by the arithmetic and logic operation circuit. Therefore, the processing load on the arithmetic and logic operation circuit can be reduced, and the operation processing can be executed at high speed as a whole.

[Brief description of the drawings]
FIG. 1 is a functional block diagram illustrating a configuration of an arithmetic unit according to a first embodiment of the present invention.
FIG. 2 is a timing chart when a double precision multiplication process is performed.
FIG. 3 is a view corresponding to FIG. 1, showing a second embodiment of the present invention;
FIG. 4 is a diagram corresponding to FIG. 2;
FIG. 5 is a diagram corresponding to FIG. 1 showing a prior art (No. 1);
FIGS. 6A and 6B are diagrams conceptually showing the processing in a case where a double precision multiplication process is performed.
FIG. 7 is a diagram corresponding to FIG. 2;
FIG. 8 is a diagram corresponding to FIG. 3, showing a related art (No. 2).
FIG. 9 is a diagram corresponding to FIG. 6;
FIG. 10 is a diagram corresponding to FIG. 4;
[Explanation of symbols]
3 is a multiplier (parallel multiplier), 4 is a multiplication result storage register (multiplication result storage circuit), 7 and 8 are accumulators (operation result storage circuits), 12 and 13 are multipliers (parallel multipliers), 17 and 18 are accumulators (arithmetic operation circuits), 19 and 20 are multiplication result storage registers (multiplication result storage circuits), 24 and 25 are accumulators (operation result storage circuits), 31 is a DSP (arithmetic device), 32 is an ALU (arithmetic and logic operation circuit), 32a, 32b, and 32c are input units, 33, 34, and 35 are multiplexers (selection circuits), 36 is a DSP (arithmetic device), 37 is an ALU (arithmetic and logic operation circuit), 37a, 37b, and 37c are input units, and 38, 39, and 40 are multiplexers (selection circuits).

Claims

1. An arithmetic device comprising: a parallel multiplier; a multiplication result storage circuit for storing multiplication result data of the parallel multiplier; an arithmetic and logic operation circuit for performing an arithmetic and logic operation on data output from the multiplication result storage circuit; and an operation result storage circuit for storing operation result data of the arithmetic and logic operation circuit,
wherein the arithmetic and logic operation circuit includes three or more input units and is configured to be able to simultaneously operate on data supplied to the plurality of input units as operands, and
the arithmetic device further comprises a plurality of selection circuits, each having an output terminal connected to a respective input unit of the arithmetic and logic operation circuit and a plurality of input terminals to which the output data of the multiplication result storage circuit or the output data of the operation result storage circuit is given, and each configured to selectively output the data given to the input terminals.

2. The arithmetic device according to claim 1, wherein zero data is given to any one of the plurality of input terminals of the selection circuit.

3. The arithmetic device according to claim 1, comprising a plurality of sets of a parallel multiplier and a multiplication result storage circuit.

4. The arithmetic device according to claim 1, further comprising an arithmetic operation circuit for cumulatively adding/subtracting the multiplication result data of the parallel multiplier, between the parallel multiplier and the multiplication result storage circuit.