JP3875183B2

JP3875183B2 - Arithmetic unit

Info

Publication number: JP3875183B2
Application number: JP2002336196A
Authority: JP
Inventors: 隆生長谷川; 一雅鬼追
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-11-20
Filing date: 2002-11-20
Publication date: 2007-01-31
Anticipated expiration: 2022-11-20
Also published as: JP2004171263A

Description

【０００１】
【発明の属する技術分野】
本発明は、画像信号の符号化、フィルタ処理、離散フーリエ変換などのデジタル信号処理に使用される分散算術演算法（Distributed Arithmetic、以下ＤＡ法と表記する。）を利用した内積演算装置、複素数乗算装置等の演算装置に関するものである。
【０００２】
【従来の技術】
ＤＡ法は行列の内積演算をルックアップ表と加算器で実現する周知の方法である。例えば、特許文献１や非特許文献１に、ＤＡ法を利用することにより、内積演算を、乗算器を使って行うことに比べてコンパクトに実現することが記載されている。
【０００３】
ＤＡ法について説明するため、下記の式１で表される要素数Ｋの定数ベクトルＡと入力ベクトルＸの内積演算を考える。
【０００４】

【０００５】
Ｘ_ｋを２の補数表現された２進数のＷビット幅固定小数点であるとすると、以下の式２のように表すことができる。
【０００６】

【０００７】
この式（２）において、b_k[w]はビット（１又は０）であり、b_k[0]は符号ビット、b_k[W−1]は最下位ビットである。
【０００８】
また、Ｘ_ｋ＝｛Ｘ_ｋ−（−Ｘ_ｋ）｝／２と表せるので、b_kのビット反転をb'_kとおくと、Ｘ_ｋは以下のようになる。
【０００９】

(式３)
【００１０】
ただし、この式３でｃ_ｋ[w］は以下のように表される。
【００１１】

【００１２】
（式３）を（式１）に代入して、以下の式５が得られる。
【００１３】

【００１４】
この式５において、以下の関係がある。
【００１５】

【００１６】
（式６）において、Ｑ(ｂ_ｗ)は次式のように変形できる。
【００１７】

【００１８】
従って、ｃ_１＝−１の場合のＱ(ｂ_ｗ)は、ｃ_１＝１の場合のＱ(ｂ_ｗ)で表現できるので、式６の値は、２^Ｋ−１通りで全てを表現でき、この値をルックアップ表として記憶しておく。そして、ｗ＝Ｗ−１から０まではそれぞれの値に２^−ｗが掛かるので、ｗ＝Ｗ−１から順番に実行し、前記ルックアップ表から参照した値を１桁ずつ右へずらせて加える。以上の手順により、乗算を行うことなくＡＸを得ることができる。
【００１９】
入力ベクトルＸの複数の値について、上記演算を同時に行う場合は、次式の演算を行うことと等価である。
【００２０】

【００２１】
ここで、Ｘ（ｉ）は、式（１）におけるＸのｉ番目の縦ベクトル、Ａ（ｉ）はそれに対応する定数の横ベクトル（ｉは０以上の整数）、０は要素値が０の横ベクトルを示す。また、Ａ' はＡ（ｉ）を含む行列を表し、Ｘ' はＸ（ｉ）を含むベクトルを表し、Ｂは演算結果を表す。
【００２２】
図６に、例えば、特許文献３、特許文献４等で示された方法を利用した（式８）の演算を行う方法を示す。端子群１０１から入力ベクトルＸ（０）、Ｘ（１）、…の各要素がビットシリアルで同時並列入力される。符号１１０，１１１…は、式１のＡＸを演算するための分散算術演算回路を示し、入力ベクトル数分だけ並列接続される。符号１０３はルックアップ表を記憶している記憶手段を示す。端子群１０１への入力データ（定数ベクトル）に対応したデータが記憶手段１０３から加減算器１０４へ入力される。加減算器１０４は記憶手段１０３の出力データと１ビットシフタ１０６の出力を加減算し、例えばシフトレジスタである一時記憶手段１０７へ入力する。一時記憶手段１０７に一時記憶された加減算器１０４の出力はビットシフタ１０６へ出力され、適切なタイミングでスイッチ１０８が閉じられ、出力端子１０９から出力Ｂ（０）として出力される。同様に、他のＤＡ回路１１１…の出力Ｂ（１），Ｂ（２）…が出力端子１０９から出力される。
【００２３】
【特許文献１】
米国特許第３，７７７，１３０号
【特許文献２】
米国特許第５，２２６，００２号
【特許文献３】
特開２０００−１４８７３０号公報
【非特許文献】
“Applications of distributed arithmetic to digital signal processing: a tutorial review,” S.A.White, IEEE ASSP magazine, vol.6, issue 3, pp.4-9, July 1989
【００２４】
【発明が解決しようとする課題】
しかしながら、図６に示す従来の装置では、各分散算術演算回路１１０，１１１…が、それぞれ記憶手段１０３を内蔵しているので、そのコンパクト性は、記憶手段１０３のコンパクト性に大きな影響を受ける。すなわち、同時処理する入力ベクトルの数が増えれば増えるほど、分散算術演算回路１１０，１１１…の数が増加し、それに従って記憶手段１０３の数も増加する。
【００２５】
また、図６において、入力データは端子群１０１からビットシリアルで入力されるため、１ビットを１クロックで処理すると、出力結果が得られるまでに少なくとも入力データのビット幅分のクロック数が必要となる。一方、クロック周波数を大きくせずに演算時間を短くするために多ビットを同時処理することは、式８におけるベクトルＸ'の要素数を増やすことに相当するので、分散算術演算回路１１０，１１１…の並列処理数が並列処理ビット数の数だけ増加し、従って記憶手段１０３も同数だけ増加する。
【００２６】
このように、分散算術演算により複数データについて内積演算を同時処理すると、同時処理数分だけの記憶手段が必要となり、同時処理数が増すにつれてコンパクト性が悪化するという問題を有している。
【００２７】
本発明は、かかる問題に鑑みてなされたもので、その目的は、記憶手段のサイズを小さく保ったままで、分散算術演算を高速化することにある。
【００２８】
【課題を解決するための手段】
本発明はこうした課題を解決するための手段を提供するもので、各請求項の発明は、以下の技術手段を構成する。
【００２９】
本発明は、第１の入力データ組と、第２の入力データ組とに対して、分散算術演算法を利用して演算処理を行う演算装置において、前記第１の入力データ組が入力される少なくとも１個の分散算術演算手段と、前記第２の入力データ組に対応したデータの表（ルックアップ表）を記憶し、前記第２の入力データに応じたデータ組を出力する記憶手段とを備え、前記分散算術手段は、前記第１の入力データ組の１番目からＵ番目（Ｕは自然数）のデータから、Ｔビット幅（Ｔは自然数）のデータをそれぞれ生成する第１から第Ｕの変換手段と、前記第１から第Ｕの変換手段の最下位から最上位ビットまでの各ビットの内容に従って、前記記憶手段から出力されるデータ組から、それぞれ最適なデータを選ぶ第１から第Ｔの選択手段と、前記第１から第Ｔの選択手段より出力されるデータを演算する第１の演算手段と、前記第１の演算手段から出力されるデータと、ビットシフトを実行する第３の演算手段の出力との加減算を実行する第２の演算手段と、前記第２の演算手段から出力される演算結果を一時的に格納する一時記憶手段と、前記一時記憶手段から出力されるデータのビットシフトを実行する第３の演算手段とを備えることを特徴とする、演算装置を提供する。
【００３０】
上記構成とした本発明の演算装置では、各分散算術演算手段がデータの表を記憶した記憶手段を内蔵する必要がないので、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により処理の高速化を図ることができる。
【００３２】
また、本発明の演算装置は、複素数乗算装置として構成することができる。この場合、前記第１の入力データ組は、Ｈ個（Ｈは自然数）の複素数であり、ｈ個目（ｈは１からＨまでの整数）の複素数の実部及び虚部はそれぞれ前記第１の入力データ組の２ｈ−１番目及び２ｈ番目に割り当てられ、前記第１の分散算術演算手段の数は２Ｈ個であり、前記第１の入力データ組の２ｈ−１番目及び２ｈ番目の入力は、２ｈ−１番目及び２ｈ番目の前記分散算術演算手段にのみ出力され、前記第１の入力データとして入力される複素数と、前記第２の入力データとして入力される複素数の乗算を行う。かかる構成とすれば、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により複素数乗算演算処理の高速化を図ることができる。
【００３３】
前記記憶手段が、０からＮ／８の整数であるｎに対する{cos(2πn/N)+sin(2πn/N)}/2の値を示す第１の表と、前記０からＮ／８の整数であるｎに対する{cos(2πn/N)−sin (2πn/N)}/2の値を示す第２の表とを格納していれば、記憶手段の容量を増大させることなく、Ｎ点離散フーリエ変換に必要なバタフライ演算を高速処理することができる。
【００３４】
あるいは、前記第１の記憶手段が、０からＮ／８の整数であるｎについて{cos(2πn/N)+sin(2πn/N)}/2の値及び{cos(2πn/N)−sin (2πn/N)}/2の値の対を示す第３の表を格納していもよい。この場合、記憶手段の容量と消費電力をより節約しつつ、Ｎ点離散フーリエ変換に必要なバタフライ演算を高速処理することができる。
【００３５】
【発明の実施の形態】
次に、図面に示す本発明の実施形態について説明する。
【００３６】
（基本構成）
まず、図１に示す本発明の演算装置の基本構成について説明する。この基本構成は後述する第１から第３実施形態の内積演算装置や複素乗算装置の基本となる。図１において、２０１は第１の入力データ組が入力される入力端子郡、２０３は第２の入力データ組が入力される設定データ入力端子、２０４はルックアップ表を記憶する記憶手段、２３０は出力端子郡を示す。２５１，２５２，２５３・・・は分散算術演算処理回路を示し、これらはいずれも同一構成である。
【００３７】
入力端子郡３０１は入力端子郡２０１から分散算術演算処理回路２５１に入力されるデータの入力端子郡、端子３０２は記憶手段２０４から分散算術演算処理回路２５１に入力されるデータ組の入力端子である。
【００３８】
入力端子郡２０１から入力される１番目からU番目までのＵ個のデータは、それぞれ変換回路２０２の第１から第Uの変換手段２０２ａ，２０２ｂ…２０２ｕでＴビットデータに変換される。設定データ入力端子２０３から入力される設定データに対応したデータ組が、記憶手段２０４から出力されて選択回路２０５に入力される。
【００３９】
選択回路２０５は、変換回路２０２から出力されるＴビット幅のデータに対応して、第１から第Ｔの選択手段２０５ａ，２０５ｂ…２０５ｔを備えている。選択回路２０５の第１の選択手段２０５ａでは、記憶手段２０４から出力されたデータ組から、変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最下位ビット組に応じたデータが生成され出力される。また、選択回路２０５の第２の選択手段２０５ｂでは、記憶手段２０４から出力されたデータ組から、変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最下位＋１ビット組に応じたデータが生成され出力される。以下、選択回路２０２の各選択手段２０５ｃ，２０５ｄ…で同様の処理が実行される。選択回路２０５の第Ｔの選択手段２０５ｔでは変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最上位ビット組に応じたデータが出力される。
【００４０】
選択回路２０５から出力された複数のデータは、演算手段（第１の演算手段）２０６で演算され、加減算器２０７に入力される。加減算器（第２の演算手段）２０７は演算手段２０６からのデータとビットシフタ２０８からのデータを加減算する。加減算器２０７からのデータは例えばシフトレジスタである一時記憶手段２０９で一時記憶される。一時記憶手段２０９からのデータはビットシフタ（第３の演算手段）２０８へ出力される。適切なタイミングでスイッチ２１０が閉じられ、スイッチ２１０が閉じると一時記憶木手段２０９から出力端子郡２３０の第１の端子から出力される。
【００４１】
従来例の図６における分散算術演算回路１１０，１１１…は、図１におけるスイッチ２１０、ビットシフタ２０８、一時記憶手段２０９、加減算器２０７、演算手段２０６、選択回路２０５のうちのどれか１つ（例えば、選択手段２０５ａ）及び記憶手段２０４を１つの単位とした回路に相当する。そのため、複数データを同時処理する場合、例えば変換回路２０２で多ビットを同時処理できるように割り振り、分散演算処理回路２５１で１つめの設定データを処理し、以降の設定データをそれぞれ分散演算処理回路２５２，２５３…で処理する、というような場合、同時処理数が増えれば増えるほど、分散算術演算回路の数が増えるが、従来例では、記憶手段の数も同時に増え、その総容量が増加していたが、本発明では、このような構成とすることにより、同時処理数にかかわらず、記憶手段２０４は１つでよいので、必要となる記憶手段の総容量は一定となる。
【００４２】
（第１実施形態）
図２に本発明の第１実施形態である内積演算装置を示す。図１と同一機能のものは同一番号を付して説明を省略する。
【００４３】
変換回路２０２は次のような構成とする。即ち、変換回路２０２内の変換手段２０２ｕにおいて、入力端子郡３０１から入力されるデータの［Ｔｋ］ビット、［Ｔｋ＋１］ビット…［Ｔｋ＋（Ｔ−１）］ビット（ｋは０以上の整数）を、出力の最下位ビットから最上位ビットに向けて割り振り、Ｔビットづつ出力する。変換回路２０２内の他の変換手段２０２ａ，２０２ｂ…も同様の処理を実行する。
【００４４】
演算手段２０６はビットシフタ群３０６と加減算器３０７とからなる。選択回路２０５の出力データはビットシフタ郡３０６に入力され、ビットシフタ郡３０６の出力データは加減算器３０７によって加減算される。ビットシフタ郡３０６は次のような構成とする。即ち、選択回路２０５の第１の選択手段２０５ａからビットシフタ郡３０６に入力されたデータはＴ−１ビットシフトされる。選択回路２０５の第２の選択手段２０５ｂからビットシフタ郡３０６に入力されたデータはＴ−２ビットシフトされる。以下、選択回路２０５の各選択手段からビットシフタ郡３０６に入力されたデータを同様に処理する。選択回路２０５の第Ｔの選択手段２０５ｔからビットシフタ郡３０６に入力されたデータはＴ−Ｔビットシフト、即ちそのまま出力される。
【００４５】
ビットシフタ２０８は、Ｔビットシフタ３０８によって構成する。
【００４６】
基本構成の項で説明したように、従来は、選択回路２０５を構成する選択手段の数だけ分散算術演算の処理を、高速で行うことができるように、分散算術演算を同時処理しようとすれば、選択回路２０５を構成する選択手段の数だけのルックアップ表の記憶手段の数が必要となる。また、ルックアップ表の記憶手段の数を増やさないように、選択回路２０５を構成する選択手段の数だけの分散算術演算の処理を１つづつ行うと、その数だけ演算時間がかかる。しかし、本発明では、このような構成とすることにより、ルックアップ表の記憶手段を増加することなく、Ｔビット同時処理を実施することができ、処理の高速化を図ることができる。
【００４７】
（第２実施形態）
図３に本発明による複素数乗算装置の第２の実施例を示す。図１と同一機能のものは同一番号を付して説明を省略する。入力端子群２０１の入力１に入力複素数１の実部を割り当て、入力端子群２０１の入力２に入力複素数１の虚部を割り当て、…入力端子群２０１の入力２ｈ−１に入力複素数ｈの実部を割り当て、…入力端子群２０１の入力２Ｈに入力複素数Ｈの虚部を割り当てる。入力されたデータは、各々分散算術演算処理回路群４０で処理される。分散算術演算処理回路群４０を構成する分散算術演算処理回路２５１、２５２…は２Ｈ個ある。なお、以降の本実施形態の説明ではすべての分散算術演算処理回路を符号２５１で示す。
【００４８】
入力１及び２は、１番目及び２番目の分散算術演算処理回路２５１のみに入力する。入力１及び２は、前記以外の分散算術演算処理回路２５１には、影響を与えない場合を考え、入力線の記載を省略する。出力複素数１の実部及び虚部は、１番目及び２番目の分散算術演算処理回路２５１から各々出力され、出力端子郡２３０の出力１及び出力２に割り当てる。それ以外の入力も、同様に処理される。
【００４９】
ここで、複素定数Ｙ及び複数の任意の複素数｛Ｚ｝の乗算Ｙ・｛Ｚ｝を考える。Ｙ＝Ｙ_ｒ−ｊＹ_ｉ、｛Ｚ｝の要素の１つをＺ＝Ｚ_ｒ＋ｊＺ_ｉとおくと（添え字ｒは実部を、添え字ｉは虚部を示す。また、ｊ^２＝−１である。）、Ｙ・｛Ｚ｝の要素ＹＺは、下記の式９で表される。
【００５０】

【００５１】
従って、式１においてＡ及びＸを以下の式１０ように設定し、要素数２のＤＡ法を適用すればＹＺの実部が求まる。
【００５２】

【００５３】
同様に、式１において、Ａ及びＸを以下の式１１のように設定し、要素数２のＤＡ法を適用すれば、ＹＺの実部が求まる。
【００５４】

【００５５】
実部、虚部いずれの場合でも、ルックアップ表は｛(Ｙ_ｒ＋Ｙ_ｉ)/２, (Ｙ_ｒ―Ｙ_ｉ)/２｝であり、同一となる。これは、図３において、入力端子郡２０１の２ｈ−１番目の入力端子にＺ_ｒ、２ｈ番目の入力端子にＺ_ｉをそれぞれ入力し、記憶手段２０４のルックアップ表として｛(Ｙ_ｒ＋Ｙ_ｉ)/２, (Ｙ_ｒ―Ｙ_ｉ)/２｝を格納すると、出力端子郡２３０の２ｈ−１番目の出力端子及び２h番目の出力端子の出力としてＹＺの実部及び虚部が得られることになる。Ｙ・｛Ｚ｝は、ＹＺを求めるＤＡ法を複数実施することにより求めることができる。これは、図３の入力端子郡（２０１）に｛Ｚ｝を入力すればよい。
【００５６】
このような構成とすることにより、複素数の乗算を分散術演算で実現でき、｛Ｚ｝の要素数、即ち入力端子郡２０１の入力数によらず、また、演算の高速化のため分散算術演算処理回路２５１内で、第１の実施形態に示したように多ビット同時処理を行っても、第１の実施形態の項で説明したように、記憶手段は増加せず一定である。
【００５７】
（第３実施形態）
複素関数f(n)のＮ点離散フーリエ変換（以下、ＤＦＴ）は次式である。
【００５８】

【００５９】
Ｗ_Ｎは回転因子と呼ばれるファクターである。式１２を変形していくと、
【００６０】

【００６１】
ｋが偶数の場合と奇数の場合で分けると、
【００６２】

【００６３】
ただし、

【００６４】
となる。f(n)に関するＮ点ＤＦＴが、y(n)及びz(n)に関するＮ／２点ＤＦＴになった。その際式１５の演算をＮ／２点行う必要がある。式１５の演算はバタフライ演算と呼ばれる。これを、log₂N-1段（Ｎが２のべき乗の場合）、再起的に繰り返していくことで、ＤＦＴ演算結果を得ることができる。これは、基数２の周波数間引きＤＦＴと呼ばれる演算手法である。
【００６５】
この場合、式１５に示す複素乗算を１段あたりＮ／２回を合計log₂N-1段行う必要があるため、ＤＦＴを高速で行うためには、式１５の演算を高速で行う必要がある。そのため、１段あたりＮ／２個の式１５の演算をいくつかづつ同時並列処理で行いたい。これは、本発明による第２の実施例において、
【００６６】

【００６７】
とおくことによって達成することができる。この場合（式６）におけるＱ（ｂ_ｗ）は、
【００６８】

【００６９】
である。ところで、三角関数の全ての値は、０〜４５度の三角関数値で表現できるので、式１７のためのルックアップ表は、ｎ＝０〜Ｎ／８での結果のみを格納すればよい。
【００７０】
この場合の記憶手段の構成を図４に示す。図１と同一機能のものは同一番号を付す。記憶手段２０４は、第１の記憶部５０１及び第２の記憶部５０２で構成される。第１の記憶部５０１はＱｐ(ｎ)のルックアップ表（ただし、nは０〜Ｎ／８の整数）を格納する。第２の記憶部５０２はＱｍ(ｎ)のルックアップ表（ただし、nは０〜Ｎ／８の整数）を格納する。設定２０３から指示されたアドレスに従って、記憶手段１５０１及び記憶手段２５０２のデータ組が端子３０２から出力される。
【００７１】
このような構成とすることにより、ルックアップ表の記憶手段の増大なしに、ＤＦＴに必要となるバタフライ演算を高速で実施することができる。
【００７２】
（第４実施形態）
第３実施形態における記憶手段の別の構成を図５に示す。図４と同一機能のものは同一番号を付す。１つの複素乗算ではｎは固定値なので、単一の記憶部５０４はＱｐ(ｎ)とＱｍ(ｎ)の対を格納した表であれば、アドレスデコーダは１つで済み、低消費電力化できる。
【００７３】
（第５実施形態）
図４又は図５のルックアップ表を使い、第２実施形態に従って複素数乗算をＤＡ法で構成すれば、式１５の演算が実施できる。式１５は前述したように、ｎ＝０…N/2-1について行う必要があるが、このうちのいくつかを同時処理することで、演算の高速化を図ることができる。その際、第２実施形態で示したように、ルックアップ表を格納するための記憶手段の増大はない。さらに、第１実施形態で示したように、多ビット同時処理による演算高速化を図っても、ルックアップ表を格納するための記憶手段の増大はない。以上の演算をlog₂N-1段行うことで、ＤＦＴ演算が達せられる。
【００７４】
このような構成とすることにより、ルックアップ表の記憶手段の増大なしに、ＤＦＴの高速化を図ることができる。
【００７５】
【発明の効果】
以上の説明から明らかなように、本発明の演算装置は、第１の入力データ組が入力され、変換手段、選択手段、及び第１から第３の演算手段を備える少なくとも１個の分散算術演算手段と、前記第２の入力データ組に対応したデータの表を記憶し、前記第２の入力データに応じたデータ組を出力する記憶手段とを備え、分散算術演算手段の選択手段が記憶手段から出力されるデータ組から最適なデータを選ぶので、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により、内積演算処理、複素乗算処理、Ｎ点離散フーリエ変換に必要なバタフライ演算処理等の演算処理の高速化を図ることができる。
【図面の簡単な説明】
【図１】本発明の基本構成を示すブロック図である。
【図２】本発明の第１実施形態を示すブロック図である。
【図３】本発明の第２実施形態を示すブロック図である。
【図４】発明の第３実施形態における記憶手段の構成を示す概略図である。
【図５】本発明の第３実施形態における記憶手段の他の構成を示す概略図である。
【図６】従来の内積演算装置の一例を示すブロック図である。
【符号の説明】
２０１データ入力端子郡
２３０データ出力端子郡
２０３設定データ入力端子
２０４, ５０１, ５０２, ５０３ルックアップ表を格納する記憶手段
２５１, ２５２, ２５３, ４００分散算術演算処理回路
２０２入力データ変換手段
２０５データの選択手段
２０６選択されたデータの演算手段
２０７加減算器
２０８ビットシフタ
２０９一時記憶手段
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to arithmetic devices, such as inner product operation devices and complex multiplication devices, that use the distributed arithmetic method (Distributed Arithmetic, hereinafter referred to as the DA method) employed in digital signal processing such as image signal encoding, filtering, and the discrete Fourier transform.
[0002]
[Prior art]
The DA method is a well-known method for realizing an inner product operation of a matrix with a lookup table and an adder. For example, Patent Document 1 and Non-Patent Document 1 describe that by using the DA method, the inner product operation can be realized more compactly than when a multiplier is used.
[0003]
To explain the DA method, consider the inner product of a constant vector A with K elements and an input vector X, given by Equation 1 below.
[0004]
$$AX = \sum_{k=1}^{K} A_k X_k \qquad (1)$$
[0005]
Assuming that X_k is a W-bit fixed-point binary number in two's complement representation, it can be expressed as in Equation 2 below.
[0006]
$$X_k = -b_k[0] + \sum_{w=1}^{W-1} b_k[w]\,2^{-w} \qquad (2)$$
[0007]
In Equation 2, b_k[w] is a bit (1 or 0), b_k[0] is the sign bit, and b_k[W−1] is the least significant bit.
[0008]
Further, since X_k = {X_k − (−X_k)}/2, letting b'_k denote the bit inversion of b_k, X_k can be written as follows.
[0009]
$$X_k = \frac{1}{2}\left[\sum_{w=0}^{W-1} c_k[w]\,2^{-w} - 2^{-(W-1)}\right] \qquad (3)$$
[0010]
Here, c_k[w] in Equation 3 is given as follows.
[0011]
$$c_k[w] = \begin{cases} b_k[w] - b'_k[w], & w \neq 0 \\ -\left(b_k[0] - b'_k[0]\right), & w = 0 \end{cases} \qquad (4)$$
[0012]
Substituting Equation 3 into Equation 1 yields Equation 5 below.
[0013]
$$AX = \sum_{w=0}^{W-1} Q(b_w)\,2^{-w} + 2^{-(W-1)}\,Q(0) \qquad (5)$$
[0014]
The terms of Equation 5 satisfy the following relations.
[0015]
$$Q(b_w) = \sum_{k=1}^{K} \frac{A_k}{2}\,c_k[w], \qquad Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \qquad (6)$$
[0016]
In Equation 6, Q(b_w) can be rewritten as follows.
[0017]
$$Q(b_w) = c_1[w]\left(\frac{A_1}{2} + \sum_{k=2}^{K} \frac{A_k}{2}\,c_1[w]\,c_k[w]\right) \qquad (7)$$
[0018]
Each c_k[w] is either +1 or −1, so Q(b_w) for c_1 = −1 can be expressed in terms of Q(b_w) for c_1 = 1. All values of Equation 6 can therefore be represented by 2^(K−1) patterns, and these values are stored as a lookup table. Because each value is multiplied by 2^(−w) for w = W−1 down to 0, the procedure runs in order from w = W−1, and the values read from the lookup table are accumulated with a one-digit right shift at each step. With this procedure, AX is obtained without performing any multiplication.
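The procedure of paragraphs [0005] to [0018] can be sketched in Python as follows. This is a minimal illustration, not the circuit of any cited document: the names `build_lut`, `q_lookup`, and `da_inner_product` are hypothetical, the inputs are assumed to be W-bit two's complement integers x_k representing X_k = x_k / 2^(W−1), and exact `Fraction` arithmetic stands in for the fixed-point adder.

```python
from fractions import Fraction
from itertools import product


def build_lut(A):
    """Table of Q values for sign patterns with c_1 = +1: 2**(K-1) entries."""
    lut = {}
    for tail in product((1, -1), repeat=len(A) - 1):
        signs = (1,) + tail
        lut[tail] = sum(Fraction(a) * s for a, s in zip(A, signs)) / 2
    return lut


def q_lookup(lut, c):
    """Q for any sign pattern c, using Q(-c) = -Q(c) when c_1 = -1."""
    if c[0] == 1:
        return lut[tuple(c[1:])]
    return -lut[tuple(-s for s in c[1:])]


def da_inner_product(A, x, W):
    """Inner product of constants A with W-bit two's complement integers x.

    Each x_k represents X_k = x_k / 2**(W-1); bit w = 0 is the sign bit.
    """
    lut = build_lut(A)
    q0 = -sum(Fraction(a) for a in A) / 2          # Q(0) of Equation 6
    acc = 2 * q0                                   # halved again on the first step
    for w in range(W - 1, -1, -1):                 # least significant bit first
        bits = [(xk >> (W - 1 - w)) & 1 for xk in x]
        c = [1 - 2 * b if w == 0 else 2 * b - 1 for b in bits]
        acc = acc / 2 + q_lookup(lut, c)           # shift right one digit, then add
    return acc
```

For K constants the table holds 2^(K−1) entries, and the loop takes W steps, one per input bit; the only operations on the running sum are a halving and an addition or subtraction.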
[0019]
When the above operation is performed simultaneously for a plurality of input vectors X, it is equivalent to performing the following operation.
[0020]
$$B = A'X' = \begin{pmatrix} A(0) & 0 & \cdots \\ 0 & A(1) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \begin{pmatrix} X(0) \\ X(1) \\ \vdots \end{pmatrix} \qquad (8)$$
[0021]
Here, X(i) is the i-th column vector X of Equation 1, A(i) is the corresponding constant row vector (i is an integer of 0 or more), and 0 is a row vector whose elements are all 0. A' denotes the matrix containing the A(i), X' denotes the vector containing the X(i), and B denotes the operation result.
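As a small illustration of this equivalence, the sketch below builds the block-diagonal matrix A' and the stacked vector X' explicitly and multiplies them, so that each output B(i) equals the inner product A(i)·X(i). The function name `stacked_inner_products` is hypothetical, and all A(i) are assumed to have the same length K.

```python
def stacked_inner_products(A_list, X_list):
    """Compute B = A'X' where A' carries the A(i) on its block diagonal."""
    K = len(A_list[0])
    n = len(A_list)
    # A' has one row per input vector; everything off the diagonal block is 0.
    A_prime = [[0] * (K * n) for _ in range(n)]
    for i, row in enumerate(A_list):
        A_prime[i][i * K:(i + 1) * K] = row
    # X' is the concatenation of the column vectors X(0), X(1), ...
    X_prime = [v for X in X_list for v in X]
    return [sum(a * x for a, x in zip(row, X_prime)) for row in A_prime]
```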
[0022]
FIG. 6 shows a method for performing the operation of Equation 8 using, for example, the methods disclosed in Patent Document 3, Patent Document 4, and the like. The elements of the input vectors X(0), X(1), ... are input bit-serially and simultaneously in parallel from the terminal group 101. Reference numerals 110, 111, ... denote distributed arithmetic operation circuits for computing AX of Equation 1, and they are connected in parallel in a number equal to the number of input vectors. Reference numeral 103 denotes storage means that stores a lookup table. Data corresponding to the input data (constant vector) to the terminal group 101 is input from the storage means 103 to the adder/subtractor 104. The adder/subtractor 104 adds or subtracts the output data of the storage means 103 and the output of the 1-bit shifter 106 and inputs the result to the temporary storage means 107, which is, for example, a shift register. The output of the adder/subtractor 104 held in the temporary storage means 107 is output to the bit shifter 106; the switch 108 is closed at an appropriate timing, and the result is output from the output terminal 109 as output B(0). Similarly, the outputs B(1), B(2), ... of the other DA circuits 111, ... are output from the output terminals 109.
[0023]
[Patent Document 1]
US Pat. No. 3,777,130
[Patent Document 2]
US Pat. No. 5,226,002
[Patent Document 3]
JP 2000-148730 A
[Non-patent Document]
“Applications of distributed arithmetic to digital signal processing: a tutorial review,” S. A. White, IEEE ASSP Magazine, vol. 6, issue 3, pp. 4-9, July 1989
[0024]
[Problems to be solved by the invention]
However, in the conventional apparatus shown in FIG. 6, each of the distributed arithmetic operation circuits 110, 111, ... contains its own storage means 103, so the compactness of the apparatus depends heavily on the compactness of the storage means 103. That is, as the number of input vectors processed simultaneously increases, the number of distributed arithmetic operation circuits 110, 111, ... increases, and the number of storage means 103 increases accordingly.
[0025]
In addition, in FIG. 6 the input data is input bit-serially from the terminal group 101, so if one bit is processed per clock, at least as many clocks as the bit width of the input data are needed before an output result is obtained. On the other hand, processing multiple bits simultaneously in order to shorten the computation time without raising the clock frequency is equivalent to increasing the number of elements of the vector X' in Equation 8. The number of parallel distributed arithmetic operation circuits 110, 111, ... therefore increases by the number of bits processed in parallel, and the number of storage means 103 increases by the same amount.
[0026]
As described above, when the inner product operation is simultaneously performed on a plurality of data by the distributed arithmetic operation, there is a problem that the storage means corresponding to the number of simultaneous processes is required, and the compactness deteriorates as the number of simultaneous processes increases.
[0027]
The present invention has been made in view of such a problem, and an object of the present invention is to speed up distributed arithmetic operations while keeping the size of the storage means small.
[0028]
[Means for Solving the Problems]
The present invention provides means for solving these problems, and the invention of each claim constitutes the following technical means.
[0029]
The present invention provides an arithmetic device that performs arithmetic processing on a first input data set and a second input data set using the distributed arithmetic method, the device comprising: at least one distributed arithmetic means to which the first input data set is input; and storage means that stores a table of data (lookup table) corresponding to the second input data set and outputs a data set according to the second input data. The distributed arithmetic means comprises: first to U-th conversion means that generate T-bit-wide data (T is a natural number) from the first to U-th data (U is a natural number) of the first input data set, respectively; first to T-th selection means that each select the appropriate data from the data set output from the storage means according to the contents of each bit, from the least significant bit to the most significant bit, of the first to U-th conversion means; first arithmetic means that operates on the data output from the first to T-th selection means; second arithmetic means that adds or subtracts the data output from the first arithmetic means and the output of third arithmetic means that performs a bit shift; temporary storage means that temporarily stores the result output from the second arithmetic means; and the third arithmetic means, which performs a bit shift of the data output from the temporary storage means.
[0030]
In the arithmetic device of the present invention configured as described above, the individual distributed arithmetic means do not need to contain storage means holding the table of data. The processing can therefore be accelerated by T-bit simultaneous processing using distributed arithmetic without increasing the storage means for the table.
[0032]
The arithmetic device of the present invention can also be configured as a complex multiplication device. In this case, the first input data set consists of H complex numbers (H is a natural number); the real part and the imaginary part of the h-th complex number (h is an integer from 1 to H) are assigned to the (2h−1)-th and 2h-th entries of the first input data set, respectively; the number of distributed arithmetic means is 2H; the (2h−1)-th and 2h-th inputs of the first input data set are supplied only to the (2h−1)-th and 2h-th distributed arithmetic means; and the complex numbers input as the first input data are multiplied by the complex number input as the second input data. With this configuration, complex multiplication can be accelerated by T-bit simultaneous processing using distributed arithmetic without increasing the storage means for the table.
[0033]
If the storage means stores a first table giving the values of {cos(2πn/N) + sin(2πn/N)}/2 for integers n from 0 to N/8 and a second table giving the values of {cos(2πn/N) − sin(2πn/N)}/2 for integers n from 0 to N/8, the butterfly operations required for an N-point discrete Fourier transform can be processed at high speed without increasing the capacity of the storage means.
[0034]
Alternatively, the first storage means may store a third table giving, for integers n from 0 to N/8, the pair of values {cos(2πn/N) + sin(2πn/N)}/2 and {cos(2πn/N) − sin(2πn/N)}/2. In this case, the butterfly operations required for an N-point discrete Fourier transform can be processed at high speed while further reducing the capacity and power consumption of the storage means.
[0035]
DETAILED DESCRIPTION OF THE INVENTION
Next, an embodiment of the present invention shown in the drawings will be described.
[0036]
(Basic configuration)
First, the basic configuration of the arithmetic device of the present invention shown in FIG. 1 will be described. This basic configuration underlies the inner product operation device and the complex multiplication devices of the first to third embodiments described later. In FIG. 1, 201 denotes an input terminal group to which the first input data set is input, 203 denotes a setting data input terminal to which the second input data set is input, 204 denotes storage means that stores a lookup table, and 230 denotes an output terminal group. Reference numerals 251, 252, 253, ... denote distributed arithmetic processing circuits, all of which have the same configuration.
[0037]
The input terminal group 301 receives the data input from the input terminal group 201 to the distributed arithmetic processing circuit 251, and the terminal 302 receives the data set input from the storage means 204 to the distributed arithmetic processing circuit 251.
[0038]
The U data items, first to U-th, input from the input terminal group 201 are converted into T-bit data by the first to U-th conversion means 202a, 202b, ... 202u of the conversion circuit 202, respectively. The data set corresponding to the setting data input from the setting data input terminal 203 is output from the storage means 204 and input to the selection circuit 205.
[0039]
The selection circuit 205 includes first to T-th selection means 205a, 205b, ... 205t corresponding to the T-bit-wide data output from the conversion circuit 202. The first selection means 205a of the selection circuit 205 generates and outputs, from the data set output from the storage means 204, the data corresponding to the set of least significant bits of the outputs of the conversion means 202a to 202u in the conversion circuit 202. The second selection means 205b generates and outputs, from the data set output from the storage means 204, the data corresponding to the set of bits one position above the least significant bit of the outputs of the conversion means 202a to 202u. The remaining selection means 205c, 205d, ... of the selection circuit 205 perform the same processing. The T-th selection means 205t outputs the data corresponding to the set of most significant bits of the outputs of the conversion means 202a to 202u.
[0040]
The plurality of data items output from the selection circuit 205 are operated on by the arithmetic means (first arithmetic means) 206, and the result is input to the adder/subtractor 207. The adder/subtractor (second arithmetic means) 207 adds or subtracts the data from the arithmetic means 206 and the data from the bit shifter 208. The data from the adder/subtractor 207 is temporarily stored in the temporary storage means 209, which is, for example, a shift register. The data from the temporary storage means 209 is output to the bit shifter (third arithmetic means) 208. The switch 210 is closed at an appropriate timing, and when it closes, the data from the temporary storage means 209 is output through the first terminal of the output terminal group 230.
[0041]
Each of the distributed arithmetic operation circuits 110, 111, ... of the conventional example in FIG. 6 corresponds to a circuit that combines, as one unit, the switch 210, the bit shifter 208, the temporary storage means 209, the adder/subtractor 207, the arithmetic means 206, any one selection means of the selection circuit 205 (for example, the selection means 205a), and the storage means 204 of FIG. 1. Consider, then, processing a plurality of data items simultaneously, for example by allocating bits in the conversion circuit 202 so that multiple bits are processed at once, processing the first setting data in the distributed arithmetic processing circuit 251, and processing the subsequent setting data in the distributed arithmetic processing circuits 252, 253, ..., respectively. The more items are processed simultaneously, the more distributed arithmetic operation circuits are needed. In the conventional example the number of storage means grows with them and the total storage capacity increases. In the present invention, by contrast, a single storage means 204 suffices regardless of the degree of simultaneous processing, so the total storage capacity required remains constant.
[0042]
(First embodiment)
FIG. 2 shows an inner product operation device according to the first embodiment of the present invention. Components having the same functions as those in FIG. 1 are given the same reference numerals, and their description is omitted.
[0043]
The conversion circuit 202 is configured as follows. In the conversion means 202u of the conversion circuit 202, the [Tk] bit, [Tk+1] bit, ... [Tk+(T−1)] bit (k is an integer of 0 or more) of the data input from the input terminal group 301 are assigned from the least significant bit toward the most significant bit of the output, and the data is output T bits at a time. The other conversion means 202a, 202b, ... in the conversion circuit 202 perform the same processing.
[0044]
The arithmetic means 206 consists of a bit shifter group 306 and an adder/subtractor 307. The output data of the selection circuit 205 is input to the bit shifter group 306, and the output data of the bit shifter group 306 is added or subtracted by the adder/subtractor 307. The bit shifter group 306 is configured as follows. The data input to the bit shifter group 306 from the first selection means 205a of the selection circuit 205 is shifted by T−1 bits. The data input from the second selection means 205b is shifted by T−2 bits. The data input from the remaining selection means of the selection circuit 205 is processed in the same way. The data input from the T-th selection means 205t is shifted by T−T bits, that is, it is output as is.
[0045]
The bit shifter 208 is constituted by a T-bit shifter 308.
[0046]
As explained in the section on the basic configuration, conventionally, if distributed arithmetic operations are to be processed simultaneously so that as many of them as there are selection means in the selection circuit 205 can be carried out at high speed, as many lookup table storage means as there are selection means in the selection circuit 205 are required. Conversely, if those operations are carried out one at a time so as not to increase the number of lookup table storage means, the computation time increases by that factor. With the configuration of the present invention, however, T-bit simultaneous processing can be performed without increasing the lookup table storage means, and the processing can be accelerated.
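The T-bit simultaneous processing of this embodiment can be sketched as follows. This is an illustrative model under stated assumptions, not the circuit itself: the function name `da_inner_product_tbit` is hypothetical, W is assumed to be a multiple of T, the inputs are W-bit two's complement integers representing X_k = x_k / 2^(W−1), and the table holds the Q values of Equation 6 for sign patterns with c_1 = +1. The inner function `select` plays the role of one selection means, the weighted sum plays the role of the bit shifter group 306 and adder/subtractor 307, and the division by 2^T models the T-bit shifter 308.

```python
from fractions import Fraction
from itertools import product


def da_inner_product_tbit(A, x, W, T):
    """DA inner product that consumes T bits of every input per step."""
    if W % T:
        raise ValueError("W must be a multiple of T")
    # One lookup table, shared by all T selections in every step.
    lut = {
        tail: (Fraction(A[0]) + sum(Fraction(a) * s for a, s in zip(A[1:], tail))) / 2
        for tail in product((1, -1), repeat=len(A) - 1)
    }

    def select(w):
        """Pick the signed table entry for bit position w (w = 0 is the sign bit)."""
        bits = [(xk >> (W - 1 - w)) & 1 for xk in x]
        c = [1 - 2 * b if w == 0 else 2 * b - 1 for b in bits]
        if c[0] == 1:
            return lut[tuple(c[1:])]
        return -lut[tuple(-s for s in c[1:])]

    acc = -sum(Fraction(a) for a in A) / 2 * 2 ** T    # Q(0), pre-scaled by 2**T
    for g in range(W // T - 1, -1, -1):                # least significant group first
        # Shift the T selected values by T-1, T-2, ..., 0 bits and add them.
        group = sum(select(T * g + t) * 2 ** (T - 1 - t) for t in range(T))
        # T-bit right shift of the running sum, then accumulate.
        acc = acc / 2 ** T + group
    return acc / 2 ** (T - 1)
```

The loop runs W/T times instead of W times, while the table keeps its 2^(K−1) entries whatever the value of T.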
[0047]
(Second Embodiment)
FIG. 3 shows a complex multiplication device according to the second embodiment of the present invention. Components having the same functions as those in FIG. 1 are given the same reference numerals, and their description is omitted. The real part of input complex number 1 is assigned to input 1 of the input terminal group 201, the imaginary part of input complex number 1 to input 2, ..., the real part of input complex number h to input 2h−1, ..., and the imaginary part of input complex number H to input 2H. The input data are each processed by the distributed arithmetic processing circuit group 40, which consists of 2H distributed arithmetic processing circuits 251, 252, .... In the remainder of this embodiment, every distributed arithmetic processing circuit is denoted by reference numeral 251.
[0048]
Inputs 1 and 2 are supplied only to the first and second distributed arithmetic processing circuits 251. Because inputs 1 and 2 are assumed not to affect the other distributed arithmetic processing circuits 251, the corresponding input lines are omitted from the figure. The real part and the imaginary part of output complex number 1 are output from the first and second distributed arithmetic processing circuits 251, respectively, and are assigned to output 1 and output 2 of the output terminal group 230. The other inputs are processed in the same way.
[0049]
Here, consider the multiplication Y·{Z} of a complex constant Y and a set {Z} of arbitrary complex numbers. Let Y = Y_r − jY_i and let one element of {Z} be Z = Z_r + jZ_i (the subscript r denotes the real part, the subscript i denotes the imaginary part, and j² = −1). The element YZ of Y·{Z} is then given by Equation 9 below.
[0050]
$$YZ = (Y_r Z_r + Y_i Z_i) + j\,(Y_r Z_i - Y_i Z_r) \qquad (9)$$
[0051]
Therefore, if A and X in Equation 1 are set as in Equation 10 below and the DA method with two elements is applied, the real part of YZ is obtained.
[0052]
$$A = (Y_r,\; Y_i), \qquad X = \begin{pmatrix} Z_r \\ Z_i \end{pmatrix} \qquad (10)$$
[0053]
Similarly, if A and X in Equation 1 are set as in Equation 11 below and the DA method with two elements is applied, the imaginary part of YZ is obtained.
[0054]
$$A = (Y_r,\; -Y_i), \qquad X = \begin{pmatrix} Z_i \\ Z_r \end{pmatrix} \qquad (11)$$
[0055]
In both the real-part and imaginary-part cases, the lookup table is {(Y_r + Y_i)/2, (Y_r − Y_i)/2}, so it is the same table. This means that, in FIG. 3, if Z_r is input to the (2h−1)-th input terminal of the input terminal group 201, Z_i is input to the 2h-th input terminal, and {(Y_r + Y_i)/2, (Y_r − Y_i)/2} is stored as the lookup table in the storage means 204, the real part and the imaginary part of YZ are obtained at the (2h−1)-th and 2h-th output terminals of the output terminal group 230. Y·{Z} can be obtained by carrying out the DA method for YZ multiple times, which is done by inputting {Z} to the input terminal group 201 of FIG. 3.
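The shared two-entry table can be sketched as follows. This is an illustration under assumptions rather than the device of FIG. 3: the names `sign_digit` and `da_complex_multiply` are hypothetical, Z_r and Z_i are assumed to be W-bit two's complement integers representing fractions z / 2^(W−1), processing is bit-serial (T = 1), and the imaginary part is assumed to use A = (Y_r, −Y_i) with X = (Z_i, Z_r). Both accumulators read only the entries P = (Y_r + Y_i)/2 and M = (Y_r − Y_i)/2.

```python
from fractions import Fraction


def sign_digit(x, w, W):
    """c[w] of Equation 4 for a W-bit two's complement integer x: +1 or -1."""
    b = (x >> (W - 1 - w)) & 1
    return 1 - 2 * b if w == 0 else 2 * b - 1


def da_complex_multiply(Yr, Yi, zr, zi, W):
    """Return (Re, Im) of (Yr - j*Yi) * (zr + j*zi) / 2**(W-1) by DA."""
    P = Fraction(Yr + Yi) / 2            # first table entry
    M = Fraction(Yr - Yi) / 2            # second table entry
    re, im = -2 * P, -2 * M              # 2*Q(0) for each part, halved on step 1
    for w in range(W - 1, -1, -1):       # least significant bit first
        cr, ci = sign_digit(zr, w, W), sign_digit(zi, w, W)
        # Real part: A = (Yr, Yi), X = (Zr, Zi).
        q_re = cr * (P if cr == ci else M)
        # Imaginary part: A = (Yr, -Yi), X = (Zi, Zr).
        q_im = ci * (M if cr == ci else P)
        re = re / 2 + q_re
        im = im / 2 + q_im
    return re, im
```

The selection logic differs between the two accumulators, but the stored values do not, which is why adding more complex inputs adds selectors and adders without adding table storage.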
[0056]
With this configuration, complex multiplication can be realized by distributed arithmetic. The storage means does not grow and remains constant regardless of the number of elements of {Z}, that is, the number of inputs of the input terminal group 201, and, as explained for the first embodiment, it remains constant even if multi-bit simultaneous processing is performed inside the distributed arithmetic processing circuits 251 to accelerate the operation.
[0057]
(Third embodiment)
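The N-point discrete Fourier transform (hereinafter DFT) of a complex function f(n) is given by the following equation.
[0058]
$$F(k) = \sum_{n=0}^{N-1} f(n)\,W_N^{nk}, \qquad W_N = e^{-j2\pi/N}, \qquad k = 0, 1, \ldots, N-1 \qquad (12)$$
[0059]
W_N is called the twiddle factor. Transforming Equation 12 gives
[0060]
$$F(k) = \sum_{n=0}^{N/2-1} \left\{ f(n) + (-1)^k f\!\left(n + \tfrac{N}{2}\right) \right\} W_N^{nk} \qquad (13)$$
[0061]
Separating the cases of even k and odd k gives
[0062]
$$F(2m) = \sum_{n=0}^{N/2-1} y(n)\,W_{N/2}^{nm}, \qquad F(2m+1) = \sum_{n=0}^{N/2-1} z(n)\,W_{N/2}^{nm} \qquad (14)$$
[0063]
where
$$y(n) = f(n) + f\!\left(n + \tfrac{N}{2}\right), \qquad z(n) = \left\{ f(n) - f\!\left(n + \tfrac{N}{2}\right) \right\} W_N^{n} \qquad (15)$$
[0064]
The N-point DFT of f(n) has thus been reduced to N/2-point DFTs of y(n) and z(n). To do so, the operation of Equation 15 must be carried out for N/2 points. The operation of Equation 15 is called a butterfly operation. Repeating it recursively for log₂N − 1 stages (when N is a power of 2) yields the DFT result. This technique is known as the radix-2 decimation-in-frequency DFT.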
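The recursion described above can be checked numerically with a short sketch. It assumes the standard definitions y(n) = f(n) + f(n + N/2) and z(n) = {f(n) − f(n + N/2)}·W_N^n with W_N = exp(−j2π/N); the function names `dif_fft` and `dft` are hypothetical, and the complex multiplications use ordinary floating-point arithmetic rather than distributed arithmetic.

```python
import cmath


def dft(f):
    """Direct N-point DFT, used here only as a reference."""
    N = len(f)
    return [
        sum(f[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def dif_fft(f):
    """Radix-2 decimation-in-frequency DFT; len(f) must be a power of 2."""
    N = len(f)
    if N == 1:
        return list(f)
    half = N // 2
    # Butterfly: N/2 sums and N/2 twiddled differences.
    y = [f[n] + f[n + half] for n in range(half)]
    z = [(f[n] - f[n + half]) * cmath.exp(-2j * cmath.pi * n / N) for n in range(half)]
    out = [0] * N
    out[0::2] = dif_fft(y)     # even-indexed outputs F(2m)
    out[1::2] = dif_fft(z)     # odd-indexed outputs F(2m+1)
    return out
```

Each level of the recursion performs N/2 butterflies in total, and it is the multiplication by W_N^n inside the butterfly that the later paragraphs map onto the DA complex multiplier.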
[0065]
In this case, the complex multiplication of Equation 15 must be performed N/2 times per stage over a total of log₂N − 1 stages, so performing the DFT at high speed requires performing the operation of Equation 15 at high speed. It is therefore desirable to process several of the N/2 operations of Equation 15 in each stage simultaneously in parallel. This can be achieved in the second embodiment of the present invention by setting
[0066]
$$Y = W_N^{n} = \cos\frac{2\pi n}{N} - j\sin\frac{2\pi n}{N}, \qquad Z = f(n) - f\!\left(n + \tfrac{N}{2}\right) \qquad (16)$$
[0067]
In this case, Q(b_w) in Equation 6 is expressed in terms of
[0068]
$$Q_p(n) = \frac{1}{2}\left\{\cos\frac{2\pi n}{N} + \sin\frac{2\pi n}{N}\right\}, \qquad Q_m(n) = \frac{1}{2}\left\{\cos\frac{2\pi n}{N} - \sin\frac{2\pi n}{N}\right\} \qquad (17)$$
[0069]
Since every value of the trigonometric functions can be expressed using their values between 0 and 45 degrees, the lookup table for Equation 17 needs to store only the results for n = 0 to N/8.
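A sketch of why the range n = 0 to N/8 suffices is given below. It assumes N is a multiple of 8, and the names `build_tables` and `cos_sin_from_tables` are hypothetical. The second function recovers cos(2πn/N) and sin(2πn/N) for any n from the stored octant by swapping and negating, which is one possible way to use the symmetry; the source does not specify how the folding is implemented in hardware.

```python
import math


def build_tables(N):
    """Qp(n) and Qm(n) for n = 0 .. N/8 (N assumed to be a multiple of 8)."""
    qp, qm = [], []
    for n in range(N // 8 + 1):
        t = 2 * math.pi * n / N
        qp.append((math.cos(t) + math.sin(t)) / 2)
        qm.append((math.cos(t) - math.sin(t)) / 2)
    return qp, qm


def cos_sin_from_tables(n, N, qp, qm):
    """Recover (cos, sin) of 2*pi*n/N using only the first-octant tables."""
    quadrant, r = divmod(n % N, N // 4)
    m = r if r <= N // 8 else N // 4 - r       # fold into 0 .. 45 degrees
    c, s = qp[m] + qm[m], qp[m] - qm[m]        # cos and sin of the folded angle
    if r > N // 8:
        c, s = s, c                            # reflection about 45 degrees
    for _ in range(quadrant):
        c, s = -s, c                           # rotate by 90 degrees per quadrant
    return c, s
```

For N = 64, for example, each table holds 9 values instead of 64.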
[0070]
The configuration of the storage means in this case is shown in FIG. 4. Components having the same functions as those in FIG. 1 are given the same reference numerals. The storage means 204 consists of a first storage unit 501 and a second storage unit 502. The first storage unit 501 stores a lookup table of Qp(n) (where n is an integer from 0 to N/8). The second storage unit 502 stores a lookup table of Qm(n) (where n is an integer from 0 to N/8). In accordance with the address specified through the setting data input terminal 203, the data set from the first storage unit 501 and the second storage unit 502 is output from the terminal 302.
[0071]
With such a configuration, the butterfly operations required for the DFT can be performed at high speed without enlarging the storage means that holds the lookup table.
[0072]
(Fourth embodiment)
Another configuration of the storage means of the third embodiment is shown in FIG. 5. Components having the same functions as those in the earlier figures are denoted by the same reference numerals. Since n is fixed during a single complex multiplication, if a single storage unit 504 holds a table storing pairs of Qp(n) and Qm(n), only one address decoder is required and power consumption can be reduced.
[0073]
(Fifth embodiment)
The complex multiplication of the second embodiment can be configured by the DA method using the lookup table of FIG. 4 or FIG. 5. As described above, Equation 15 must be evaluated for n = 0 ... N/2−1, and the calculation speed can be increased by processing several of these evaluations simultaneously. As shown in the second embodiment, this does not increase the storage means that stores the lookup table. Furthermore, as shown in the first embodiment, increasing the calculation speed through multi-bit simultaneous processing does not increase the storage means for the lookup table either. The DFT is obtained by performing the above operation for log₂N−1 stages.
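The DA-based complex multiplication referred to above can be sketched in Python as follows. This is a bit-serial illustration (one bit per step, not the T-bit simultaneous processing of the first embodiment) under stated assumptions: inputs are W-bit two's-complement fractions in [−1, 1), the constant is Y = Y_r − jY_i, the table holds only (Y_r + Y_i)/2 and (Y_r − Y_i)/2, and the offset-binary digits c[w] = ±1 follow Equation 3. The function names and the default W = 12 are hypothetical.

```python
def to_bits(x, W):
    """Two's-complement bits of a fraction x in [-1, 1), sign bit first."""
    v = int(round(x * (1 << (W - 1)))) & ((1 << W) - 1)
    return [(v >> (W - 1 - w)) & 1 for w in range(W)]

def signed_digit(bits, w):
    """Offset-binary digit c[w] in {+1, -1}; the sign-bit position is inverted."""
    return 1 - 2 * bits[0] if w == 0 else 2 * bits[w] - 1

def da_complex_multiply(zr, zi, yr, yi, W=12):
    """Return (re, im) of (zr + j*zi) * (yr - j*yi) using lookups, adds and halvings."""
    qp = (yr + yi) / 2  # first table entry
    qm = (yr - yi) / 2  # second table entry
    br, bi = to_bits(zr, W), to_bits(zi, W)
    re, im = -qp, -qm   # initial terms from the offset-binary correction
    for w in range(W - 1, -1, -1):      # least significant bit first
        cr, ci = signed_digit(br, w), signed_digit(bi, w)
        # Select a table entry and its sign from the two current digits.
        q_re = cr * qp if cr == ci else cr * qm
        q_im = ci * qm if cr == ci else ci * qp
        if w == W - 1:
            re, im = re + q_re, im + q_im          # first step: no shift
        else:
            re, im = re / 2 + q_re, im / 2 + q_im  # shift right by one bit, then add
    return re, im
```

For example, da_complex_multiply(0.5, -0.25, 0.75, 0.5) multiplies 0.5 − 0.25j by 0.75 − 0.5j. No multiplier is used on the data path; the multiplications by ±1 only select the sign of a table entry.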
[0074]
With such a configuration, it is possible to increase the speed of the DFT without increasing the number of lookup table storage means.
[0075]
[Effects of the Invention]
As is apparent from the above description, the arithmetic device of the present invention comprises at least one distributed arithmetic operation means, which receives the first input data set and includes conversion means, selection means, and first to third calculation means, and storage means, which stores a table of data corresponding to the second input data set and outputs the data set corresponding to the second input data. Because the selection means of the distributed arithmetic operation means selects the optimum data from the data set output from the storage means, T-bit simultaneous processing using distributed arithmetic can speed up processing such as inner product operations, complex multiplication, and the butterfly operations required for an N-point discrete Fourier transform, without increasing the storage means that stores the table.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of the present invention.
FIG. 2 is a block diagram showing a first embodiment of the present invention.
FIG. 3 is a block diagram showing a second embodiment of the present invention.
FIG. 4 is a schematic diagram showing a configuration of storage means in a third embodiment of the invention.
FIG. 5 is a schematic view showing another configuration of the storage means in the third embodiment of the present invention.
FIG. 6 is a block diagram showing an example of a conventional inner product calculation device.
[Explanation of symbols]
201 Data input terminal group
230 Data output terminal group
203 Setting data input terminals
204, 501, 502, 503 Storage means for storing a lookup table
251, 252, 253, 400 Distributed arithmetic operation processing circuit
202 Input data conversion means
205 Selection means
206 Calculation means for selected data
207 Adder/subtracter
208 Bit shifter
209 Temporary storage means

Claims

An arithmetic device that performs arithmetic processing using a distributed arithmetic operation method on a first input data set and a second input data set, the device comprising:
at least one distributed arithmetic operation means to which the first input data set is input; and
storage means that stores a table of data corresponding to the second input data set and outputs the data set corresponding to the second input data to each of the distributed arithmetic operation means,
wherein the distributed arithmetic operation means includes:
first to U-th conversion means for generating T-bit-wide data from the first to U-th data of the first input data set, respectively;
first to T-th selection means for selecting optimum data from the data set output from the storage means according to the content of each bit, from the least significant bit to the most significant bit, of the outputs of the first to U-th conversion means;
first calculation means for performing a calculation on the data output from the first to T-th selection means;
second calculation means for performing addition or subtraction between the data output from the first calculation means and the output of third calculation means that performs a bit shift;
temporary storage means for temporarily storing the calculation result output from the second calculation means; and
the third calculation means, which performs a bit shift of the data output from the temporary storage means,
the first input data set is H complex numbers expressed as Z = Z_r + jZ_i (subscript r indicates the real part, subscript i indicates the imaginary part, and j² = −1), the real part and the imaginary part of the h-th complex number being assigned to the (2h−1)-th and 2h-th data of the first input data set, respectively,
the second input data set is a complex constant expressed as Y = Y_r − jY_i,
the number of distributed arithmetic operation means is 2H, and the (2h−1)-th and 2h-th inputs of the first input data set are output only to the (2h−1)-th and 2h-th distributed arithmetic operation means,
the storage means stores the values {(Y_r + Y_i)/2, (Y_r − Y_i)/2} as the lookup table, and
the arithmetic device performs multiplication of the H complex numbers input as the first input data set by the complex constant input as the second input data set.

The arithmetic device according to claim 1, wherein, where n is an integer from 0 to N/8, the first input data set is a complex function Z = z(n) and the second input data set is a complex function Y = exp(−j2πn/N), and
the storage means stores, as the lookup table, a first table holding the values of {cos(2πn/N) + sin(2πn/N)}/2 and a second table holding the values of {cos(2πn/N) − sin(2πn/N)}/2 for integers n from 0 to N/8.

The arithmetic device according to claim 1, wherein the storage means stores, as the lookup table, a third table holding pairs of the values {cos(2πn/N) + sin(2πn/N)}/2 and {cos(2πn/N) − sin(2πn/N)}/2.