JP2004171263A

JP2004171263A - Arithmetic unit

Info

Publication number: JP2004171263A
Application number: JP2002336196A
Authority: JP
Inventors: Takao Hasegawa; 隆生長谷川; Kazumasa Kioi; 一雅鬼追
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2002-11-20
Filing date: 2002-11-20
Publication date: 2004-06-17
Anticipated expiration: 2022-11-20
Also published as: JP3875183B2

Abstract

PROBLEM TO BE SOLVED: To speed up distributed arithmetic operations while keeping the storage means small.
SOLUTION: An arithmetic unit comprises a distributed arithmetic operating means 251, to which a first input data set is input, and a storage means 204 that stores a table of data corresponding to a second input data set. The distributed arithmetic operating means 251 comprises: conversion means 202a etc. that generate T-bit-wide data from the first input data set; selection means 205a etc. that select the optimal data from the data set output by the storage means 204 according to the content of each bit, from the least significant bit to the most significant bit, of the outputs of the conversion means 202a etc.; a first arithmetic means 206 that operates on the data output by the selection means 205a etc.; a second arithmetic means 207 that adds or subtracts the output of the first arithmetic means 206 and the output of a third arithmetic means 208, which performs a bit shift; and a temporary storage means 209 that temporarily stores the results output by the second arithmetic means 207.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、画像信号の符号化、フィルタ処理、離散フーリエ変換などのデジタル信号処理に使用される分散算術演算法（ＤｉｓｔｒｉｂｕｔｅｄＡｒｉｔｈｍｅｔｉｃ、以下ＤＡ法と表記する。）を利用した内積演算装置、複素数乗算装置等の演算装置に関するものである。
【０００２】
【従来の技術】
ＤＡ法は行列の内積演算をルックアップ表と加算器で実現する周知の方法である。例えば、特許文献１や非特許文献１に、ＤＡ法を利用することにより、内積演算を、乗算器を使って行うことに比べてコンパクトに実現することが記載されている。
【０００３】
ＤＡ法について説明するため、下記の式１で表される要素数Ｋの定数ベクトルＡと入力ベクトルＸの内積演算を考える。
【０００４】

【０００５】
Ｘ_ｋを２の補数表現された２進数のＷビット幅固定小数点であるとすると、以下の式２のように表すことができる。
【０００６】

【０００７】
この式（２）において、ｂ_ｋ［ｗ］はビット（１又は０）であり、ｂ_ｋ［０］は符号ビット、ｂ_ｋ［Ｗ−１］は最下位ビットである。
【０００８】
また、Ｘ_ｋ＝｛Ｘ_ｋ−（−Ｘ_ｋ）｝／２と表せるので、ｂ_ｋのビット反転をｂ’_ｋとおくと、Ｘ_ｋは以下のようになる。
【０００９】

【００１０】
ただし、この式３でｃ_ｋ［ｗ］は以下のように表される。
【００１１】

【００１２】
（式３）を（式１）に代入して、以下の式５が得られる。
【００１３】

【００１４】
この式５において、以下の関係がある。
【００１５】

【００１６】
（式６）において、Ｑ（ｂ_ｗ）は次式のように変形できる。
【００１７】

【００１８】
従って、ｃ_１＝−１の場合のＱ（ｂ_ｗ）は、ｃ_１＝１の場合のＱ（ｂ_ｗ）で表現できるので、式６の値は、２^Ｋ−１通りで全てを表現でき、この値をルックアップ表として記憶しておく。そして、ｗ＝Ｗ−１から０まではそれぞれの値に２^−ｗが掛かるので、ｗ＝Ｗ−１から順番に実行し、前記ルックアップ表から参照した値を１桁ずつ右へずらせて加える。以上の手順により、乗算を行うことなくＡＸを得ることができる。
【００１９】
入力ベクトルＸの複数の値について、上記演算を同時に行う場合は、次式の演算を行うことと等価である。
【００２０】

【００２１】
ここで、Ｘ（ｉ）は、式（１）におけるＸのｉ番目の縦ベクトル、Ａ（ｉ）はそれに対応する定数の横ベクトル（ｉは０以上の整数）、０は要素値が０の横ベクトルを示す。また、Ａ’ はＡ（ｉ）を含む行列を表し、Ｘ’ はＸ（ｉ）を含むベクトルを表し、Ｂは演算結果を表す。
【００２２】
図６に、例えば、特許文献３、特許文献４等で示された方法を利用した（式８）の演算を行う方法を示す。端子群１０１から入力ベクトルＸ（０）、Ｘ（１）、…の各要素がビットシリアルで同時並列入力される。符号１１０，１１１…は、式１のＡＸを演算するための分散算術演算回路を示し、入力ベクトル数分だけ並列接続される。符号１０３はルックアップ表を記憶している記憶手段を示す。端子群１０１への入力データ（定数ベクトル）に対応したデータが記憶手段１０３から加減算器１０４へ入力される。加減算器１０４は記憶手段１０３の出力データと１ビットシフタ１０６の出力を加減算し、例えばシフトレジスタである一時記憶手段１０７へ入力する。一時記憶手段１０７に一時記憶された加減算器１の出力はビットシフタ１０６へ出力され、適切なタイミングでスイッチ１０８が閉じられ、出力端子１０９から出力Ｂ（０）として出力される。同様に、他のＤＡ回路１１１…の出力Ｂ（１），Ｂ（２）…が出力端子１０９から出力される。
【００２３】
【特許文献１】
米国特許第３，７７７，１３０号
【特許文献２】
米国特許第５，２２６，００２号
【特許文献３】
特開２０００−１４８７３０号公報
【非特許文献】
“Applications of distributed arithmetic to digital signal processing: a tutorial review,” S. A. White, IEEE ASSP Magazine, vol. 6, issue 3, pp. 4-9, July 1989
【００２４】
【発明が解決しようとする課題】
しかしながら、図６に示す従来の装置では、各分散算術演算回路１１０，１１１…が、それぞれ記憶手段１０３を内蔵しているので、そのコンパクト性は、記憶手段１０３のコンパクト性に大きな影響を受ける。すなわち、同時処理する入力ベクトルの数が増えれば増えるほど、分散算術演算回路１１０，１１１…の数が増加し、それに従って記憶手段１０３の数も増加する。
【００２５】
また、図６において、入力データは端子群１０１からビットシリアルで入力されるため、１ビットを１クロックで処理すると、出力結果が得られるまでに少なくとも入力データのビット幅分のクロック数が必要となる。一方、クロック周波数を大きくせずに演算時間を短くするために多ビットを同時処理することは、式８におけるベクトルＸ’の要素数を増やすことに相当するので、分散算術演算回路１１０，１１１…の並列処理数が並列処理ビット数の数だけ増加し、従って記憶手段１０３も同数だけ増加する。
【００２６】
このように、分散算術演算により複数データについて内積演算を同時処理すると、同時処理数分だけの記憶手段が必要となり、同時処理数が増すにつれてコンパクト性が悪化するという問題を有している。
【００２７】
本発明は、かかる問題に鑑みてなされたもので、その目的は、記憶手段のサイズを小さく保ったままで、分散算術演算を高速化することにある。
【００２８】
【課題を解決するための手段】
本発明はこうした課題を解決するための手段を提供するもので、各請求項の発明は、以下の技術手段を構成する。
【００２９】
本発明は、第１の入力データ組と、第２の入力データ組とに対して、分散算術演算法を利用して演算処理を行う演算装置において、前記第１の入力データ組が入力される少なくとも１個の分散算術演算手段と、前記第２の入力データ組に対応したデータの表（ルックアップ表）を記憶し、前記第２の入力データに応じたデータ組を出力する記憶手段とを備え、前記分散算術手段は、前記第１の入力データ組の１番目からＵ番目（Ｕは自然数）のデータから、Ｔビット幅（Ｔは自然数）のデータをそれぞれ生成する第１から第Ｕの変換手段と、前記第１から第Ｕの変換手段の最下位から最上位ビットまでの各ビットの内容に従って、前記記憶手段から出力されるデータ組から、それぞれ最適なデータを選ぶ第１から第Ｔの選択手段と、前記第１から第Ｔの選択手段より出力されるデータを演算する第１の演算手段と、前記第１の演算手段から出力されるデータと、ビットシフトを実行する第３の演算手段の出力との加減算を実行する第２の演算手段と、前記第２の演算手段から出力される演算結果を一時的に格納する一時記憶手段と、前記一時記憶手段から出力されるデータのビットシフトを実行する第３の演算手段とを備えることを特徴とする、演算装置を提供する。
【００３０】
上記構成とした本発明の演算装置では、各分散算術演算手段がデータの表を記憶した記憶手段を内蔵する必要がないので、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により処理の高速化を図ることができる。
【００３１】
本発明の演算装置は、内積演算装置として構成することができる。この場合、前記第１から第Ｕの変換手段は、前記第１の入力データ組の１番目からＵ番目のデータにおける各ビット位置（０，１，…，Ｔ−１）の値を最下位ビットから順に、０ビットからＴ−１ビットの各ビット位置（０，１，…，Ｔ−１）に割り振り、第１から第Ｔの選択手段に対してＴビットづつ順次出力し、前記第１の演算手段は、前記第１から第Ｔ−１の選択手段の出力を各々Ｔ−１ビットから１ビットまで１ビットずつ下位側（Ｔ−１，Ｔ−２・・・１ビットずつ下位側）にビットシフトした各結果と、第Ｔの選択手段の出力とを加減算し、前記第３の演算手段は、前記一時記憶手段の出力をＴビット下位側にビットシフトし、前記第１の入力データ組として入力されるベクトルと第２の入力データ組として入力されるベクトルの内積を計算する。かかる構成とすれば、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により内積演算処理の高速化を図ることができる。
【００３２】
また、本発明の演算装置は、複素数乗算装置として構成することができる。この場合、前記第１の入力データ組は、Ｈ個（Ｈは自然数）の複素数であり、ｈ個目（ｈは１からＨまでの整数）の複素数の実部及び虚部はそれぞれ前記第１の入力データ組の２ｈ−１番目及び２ｈ番目に割り当てられ、前記第１の分散算術演算手段の数は２Ｈ個であり、前記第１の入力データ組の２ｈ−１番目及び２ｈ番目の入力は、２ｈ−１番目及び２ｈ番目の前記分散算術演算手段にのみ出力され、前記第１の入力データとして入力される複素数と、前記第２の入力データとして入力される複素数の乗算を行う。かかる構成とすれば、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により複素数乗算演算処理の高速化を図ることができる。
【００３３】
前記記憶手段が、０からＮ／８の整数であるｎに対する｛ｃｏｓ（２πｎ／Ｎ）＋ｓｉｎ（２πｎ／Ｎ）｝／２の値を示す第１の表と、前記０からＮ／８の整数であるｎに対する｛ｃｏｓ（２πｎ／Ｎ）−ｓｉｎ（２πｎ／Ｎ）｝／２の値を示す第２の表とを格納していれば、記憶手段の容量を増大させることなく、Ｎ点離散フーリエ変換に必要なバタフライ演算を高速処理することができる。
【００３４】
あるいは、前記第１の記憶手段が、０からＮ／８の整数であるｎについて｛ｃｏｓ（２πｎ／Ｎ）＋ｓｉｎ（２πｎ／Ｎ）｝／２の値及び｛ｃｏｓ（２πｎ／Ｎ）−ｓｉｎ（２πｎ／Ｎ）｝／２の値の対を示す第３の表を格納していもよい。この場合、記憶手段の容量と消費電力をより節約しつつ、Ｎ点離散フーリエ変換に必要なバタフライ演算を高速処理することができる。
【００３５】
【発明の実施の形態】
次に、図面に示す本発明の実施形態について説明する。
【００３６】
（基本構成）
まず、図１に示す本発明の演算装置の基本構成について説明する。この基本構成は後述する第１から第３実施形態の内積演算装置や複素乗算装置の基本となる。図１において、２０１は第１の入力データ組が入力される入力端子郡、２０３第２の入力データ組が入力される設定データ入力端子、２０４はルックアップ表を記憶する記憶手段、２３０は出力端子郡を示す。２５１，２５２，２５３・・・は分散算術演算処理回路を示し、これらはいずれも同一構成である。
【００３７】
入力端子郡３０１は入力端子郡２０１から分散算術演算処理回路２５１に入力されるデータの入力端子郡、端子３０２は記憶手段２０４から分散算術演算処理回路２５１に入力されるデータ組の入力端子である。
【００３８】
入力端子郡２０１から入力される１番目からＵ番目までのＵ個のデータは、それぞれ変換回路２０２の第１から第Ｕの変換手段２０２ａ，２０２ｂ…２０２ｕでＴビットデータに変換される。設定データ入力端子２０３から入力される設定データに対応したデータ組が、記憶手段２０４から出力されて選択回路２０５に入力される。
【００３９】
選択回路２０５は、変換回路２０２から出力されるＴビット幅のデータに対応して、第１から第Ｔの選択手段２０５ａ，２０５ｂ…２０５ｔを備えている。選択回路２０５の第１の選択手段２０５ａでは、記憶手段２０４から出力されたデータ組から、変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最下位ビット組に応じたデータが生成され出力される。また、選択回路２０５の第２の選択手段２０５ｂでは、記憶手段２０４から出力されたデータ組から、変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最下位＋１ビット組に応じたデータが生成され出力される。以下、選択回路２０２の各選択手段２０５ｃ，２０５ｄ…で同様の処理が実行される。選択回路２０５の第Ｔの選択手段２０５ｔでは変換回路２０２内の各変換手段２０２ａ〜２０２ｕの出力の最上位ビット組に応じたデータが出力される。
【００４０】
選択回路２０５から出力された複数のデータは、演算手段（第１の演算手段）２０６で演算され、加減算器２０７に入力される。加減算器（第２の演算手段）２０７は演算手段２０６からのデータとビットシフタ２０８からのデータを加減算する。加減算器２０７からのデータは例えばシフトレジスタである一時記憶手段２０９で一時記憶される。一時記憶手段２０９からのデータはビットシフタ（第３の演算手段）２０８へ出力される。適切なタイミングでスイッチ２１０が閉じられ、スイッチ２１０が閉じると一時記憶木手段２０９から出力端子郡２３０の第１の端子から出力される。
【００４１】
従来例の図６における分散算術演算回路１１０，１１１…は、図１におけるスイッチ２１０、ビットシフタ２０８、一時記憶手段２０９、加減算器２０７、演算手段２０６、選択回路２０５のうちのどれか１つ（例えば、選択手段２０５ａ）及び記憶手段２０４を１つの単位とした回路に相当する。そのため、複数データを同時処理する場合、例えば変換回路２０２で多ビットを同時処理できるように割り振り、分散演算処理回路２５１で１つめの設定データを処理し、以降の設定データをそれぞれ分散演算処理回路２５２，２５３…で処理する、というような場合、同時処理数が増えれば増えるほど、分散算術演算回路の数が増えるが、従来例では、記憶手段の数も同時に増え、その総容量が増加していたが、本発明では、このような構成とすることにより、同時処理数にかかわらず、記憶手段２０４は１つでよいので、必要となる記憶手段の総容量は一定となる。
【００４２】
（第１実施形態）
図２に本発明の第１実施形態である内積演算装置を示す。図１と同一機能のものは同一番号を付して説明を省略する。
【００４３】
変換回路２０２は次のような構成とする。即ち、変換回路２０２内の変換手段３０３において、入力端子郡３０１から入力されるデータの［Ｔｋ］ビット、［Ｔｋ＋１］ビット…［Ｔｋ＋（Ｔ−１）］ビット（ｋは０以上の整数）を、出力の最下位ビットから最上位ビットに向けて割り振り、Ｔビットづつ出力する。変換回路２０２内の他の変換手段２０２ａ，２０２ｂ…も同様の処理を実行する。
【００４４】
演算手段２０６はビットシフタ群３０６と加減算器３０７とからなる。選択回路２０５の出力データはビットシフタ郡３０６に入力され、ビットシフタ郡３０６の出力データは加減算器３０７によって加減算される。ビットシフタ郡３０６は次のような構成とする。即ち、選択回路２０５の第１の選択手段２０５ａからビットシフタ郡３０６に入力されたデータはＴ−１ビットシフトされる。選択回路２０５の第２の選択手段２０５ｂからビットシフタ郡３０６に入力されたデータはＴ−２ビットシフトされる。以下、選択回路２０５の各選択手段からビットシフタ郡３０６に入力されたデータを同様に処理する。選択回路２０５の第Ｔの選択手段２０５ｔからビットシフタ郡３０６に入力されたデータはＴ−Ｔビットシフト、即ちそのまま出力される。
【００４５】
ビットシフタ（２０８）は、Ｔビットシフタ（３０８）によって構成する。
【００４６】
基本構成の項で説明したように、従来は、選択回路２０５を構成する選択手段の数だけ分散算術演算の処理を、高速で行うことができるように、分散算術演算を同時処理しようとすれば、選択回路２０５を構成する選択手段の数だけのルックアップ表の記憶手段の数が必要となる。また、ルックアップ表の記憶手段の数を増やさないように、選択回路２０５を構成する選択手段の数だけの分散算術演算の処理を１つづつ行うと、その数だけ演算時間がかかる。しかし、本発明では、このような構成とすることにより、ルックアップ表の記憶手段を増加することなく、Ｔビット同時処理を実施することができ、処理の高速化を図ることができる。
【００４７】
（第２実施形態）
図３に本発明による複素数乗算装置の第２の実施例を示す。図１と同一機能のものは同一番号を付して説明を省略する。入力端子群２０１の入力１に入力複素数１の実部を割り当て、入力端子群２０１の入力２に入力複素数１の虚部を割り当て、…入力端子群２０１の入力２ｈ−１に入力複素数ｈの実部を割り当て、…入力端子群２０１の入力２Ｈに入力複素数Ｈの虚部を割り当てる。入力されたデータは、各々分散算術演算処理回路群４００で処理される。分散算術演算処理回路群４００を構成する分散算術演算処理回路２５１、２５２…は２Ｈ個ある。なか、以降の本実施形態の説明ではすべての分散算術演算処理回路を符号２５１で示す。
【００４８】
入力１及び２は、１番目及び２番目の分散算術演算処理回路２５１のみに入力する。入力１及び２は、前記以外の分散算術演算処理回路２５１には、影響を与えない場合を考え、入力線の記載を省略する。出力複素数１の実部及び虚部は、１番目及び２番目の分散算術演算処理回路２５１から各々出力され、出力端子郡２３０の出力１及び出力２に割り当てる。それ以外の入力も、同様に処理される。
【００４９】
ここで、複素定数Ｙ及び複数の任意の複素数｛Ｚ｝の乗算Ｙ・｛Ｚ｝を考える。Ｙ＝Ｙ_ｒ−ｊＹ_ｉ、｛Ｚ｝の要素の１つをＺ＝Ｚ_ｒ＋ｊＺ_ｉとおくと（添え字ｒは実部を、添え字ｉは虚部を示す。また、ｊ^２＝−１である。）、Ｙ・｛Ｚ｝の要素ＹＺは、下記の式９で表される。
【００５０】

【００５１】
従って、式１においてＡ及びＸを以下の式１０ように設定し、要素数２のＤＡ法を適用すればＹＺの実部が求まる。
【００５２】

【００５３】
同様に、式１において、Ａ及びＸを以下の式１１のように設定し、要素数２のＤＡ法を適用すれば、ＹＺの実部が求まる。
【００５４】

【００５５】
実部、虚部いずれの場合でも、ルックアップ表は｛（Ｙ_ｒ＋Ｙ_ｉ）／２，（Ｙ_ｒ―Ｙ_ｉ）／２｝であり、同一となる。これは、図３において、入力端子郡２０１の２ｈ−１番目の入力端子にＺ_ｒ、２ｈ番目の入力端子にＺ_ｉをそれぞれ入力し、記憶手段２０４のルックアップ表として｛（Ｙ_ｒ＋Ｙ_ｉ）／２，（Ｙ_ｒ―Ｙ_ｉ）／２｝を格納すると、出力端子郡２３０の２ｈ−１番目の出力端子及び２ｈ番目の出力端子の出力としてＹＺの実部及び虚部が得られることになる。Ｙ・｛Ｚ｝は、ＹＺを求めるＤＡ法を複数実施することにより求めることができる。これは、図３の入力端子郡（２０１）に｛Ｚ｝を入力すればよい。
【００５６】
このような構成とすることにより、複素数の乗算を分散術演算で実現でき、｛Ｚ｝の要素数、即ち入力端子郡２０１の入力数によらず、また、演算の高速化のため分散算術演算処理回路２５１内で、第１の実施形態に示したように多ビット同時処理を行っても、第１の実施形態の項で説明したように、記憶手段は増加せず一定である。
【００５７】
（第３実施形態）
複素関数ｆ（ｎ）のＮ点離散フーリエ変換（以下、ＤＦＴ）は次式である。
【００５８】

【００５９】
Ｗ_Ｎは回転因子と呼ばれるファクターである。式１２を変形していくと、
【００６０】

【００６１】
ｋが偶数の場合と奇数の場合で分けると、
【００６２】

【００６３】
ただし、

【００６４】
となる。ｆ（ｎ）に関するＮ点ＤＦＴが、ｙ（ｎ）及びｚ（ｎ）に関するＮ／２点ＤＦＴになった。その際式１５の演算をＮ／２点行う必要がある。式１５の演算はバタフライ演算と呼ばれる。これを、ｌｏｇ_２Ｎ−１段（Ｎが２のべき乗の場合）、再起的に繰り返していくことで、ＤＦＴ演算結果を得ることができる。これは、基数２の周波数間引きＤＦＴと呼ばれる演算手法である。
【００６５】
この場合、式１５に示す複素乗算を１段あたりＮ／２回を合計ｌｏｇ_２Ｎ−１段行う必要があるため、ＤＦＴを高速で行うためには、式１５の演算を高速で行う必要がある。そのため、１段あたりＮ／２個の式１５の演算をいくつかづつ同時並列処理で行いたい。これは、本発明による第２の実施例において、
【００６６】

【００６７】
とおくことによって達成することができる。この場合（式６）におけるＱ（ｂ_ｗ）は、
【００６８】

【００６９】
である。ところで、三角関数の全ての値は、０〜４５度の三角関数値で表現できるので、式１７のためのルックアップ表は、ｎ＝０〜Ｎ／８での結果のみを格納すればよい。
【００７０】
この場合の記憶手段の構成を図４に示す。図１と同一機能のものは同一番号を付す。記憶手段２０４は、第１の記憶部５０１及び第２の記憶部５０２で構成される。第１の記憶部５０１はＱｐ（ｎ）のルックアップ表（ただし、ｎは０〜Ｎ／８の整数）を格納する。第２の記憶部５０２はＱｍ（ｎ）のルックアップ表（ただし、ｎは０〜Ｎ／８の整数）を格納する。設定２０３から指示されたアドレスに従って、記憶手段１５０１及び記憶手段２５０２のデータ組が端子３０２から出力される。
【００７１】
このような構成とすることにより、ルックアップ表の記憶手段の増大なしに、ＤＦＴに必要となるバタフライ演算を高速で実施することができる。
【００７２】
（第４実施形態）
第３実施形態における記憶手段の別の構成を図５に示す。図４と同一機能のものは同一番号を付す。１つの複素乗算ではｎは固定値なので、単一の記憶部５０４はＱｐ（ｎ）とＱｍ（ｎ）の対を格納した表であれば、アドレスデコーダは１つで済み、低消費電力化できる。
【００７３】
（第５実施形態）
図４又は図５のルックアップ表を使い、第２実施形態に従って複素数乗算をＤＡ法で構成すれば、式１５の演算が実施できる。式１５は前述したように、ｎ＝０…Ｎ／２−１について行う必要があるが、このうちのいくつかを同時処理することで、演算の高速化を図ることができる。その際、第２実施形態で示したように、ルックアップ表を格納するための記憶手段の増大はない。さらに、第１実施形態で示したように、多ビット同時処理による演算高速化を図っても、ルックアップ表を格納するための記憶手段の増大はない。以上の演算をｌｏｇ_２Ｎ−１段行うことで、ＤＦＴ演算が達せられる。
【００７４】
このような構成とすることにより、ルックアップ表の記憶手段の増大なしに、ＤＦＴの高速化を図ることができる。
【００７５】
【発明の効果】
以上の説明から明らかなように、本発明の演算装置は、第１の入力データ組が入力され、変換手段、選択手段、及び第１から第３の演算手段を備える少なくとも１個の分散算術演算手段と、前記第２の入力データ組に対応したデータの表を記憶し、前記第２の入力データに応じたデータ組を出力する記憶手段とを備え、分散算術演算手段の選択手段が記憶手段から出力されるデータ組から最適なデータを選ぶので、表を記憶するための記憶手段を増加することなく、分散算術演算を利用したＴビット同時処理により、内積演算処理、複素乗算処理、Ｎ点離散フーリエ変換に必要なバタフライ演算処理等の演算処理の高速化を図ることができる。
【図面の簡単な説明】
【図１】本発明の基本構成を示すブロック図である。
【図２】本発明の第１実施形態を示すブロック図である。
【図３】本発明の第２実施形態を示すブロック図である。
【図４】発明の第３実施形態における記憶手段の構成を示す概略図である。
【図５】本発明の第３実施形態における記憶手段の他の構成を示す概略図である。
【図６】従来の内積演算装置の一例を示すブロック図である。
【符号の説明】
２０１データ入力端子郡
２３０データ出力端子郡
２０３設定データ入力端子
２０４，５０１，５０２，５０３ルックアップ表を格納する記憶手段
２５１，２５２，２５３，４００分散算術演算処理回路
２０２入力データ変換手段
２０５データの選択手段
２０６選択されたデータの演算手段
２０７加減算器
２０８ビットシフタ
２０９一時記憶手段

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to arithmetic devices, such as inner product operation devices and complex number multiplication devices, that use distributed arithmetic (hereinafter referred to as the DA method), a technique used in digital signal processing such as image signal encoding, filtering, and the discrete Fourier transform.
[0002]
[Prior art]
The DA method is a well-known way of implementing matrix inner product operations with a look-up table and an adder. For example, Patent Document 1 and Non-Patent Document 1 describe how the DA method implements the inner product more compactly than an implementation using multipliers.
[0003]
To explain the DA method, consider the inner product of a constant vector A with K elements and an input vector X, expressed by Equation 1 below.
[0004]
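Equation 1 did not survive extraction into this text. From the description of an inner product of a K-element constant vector and an input vector, it presumably has the following form (the index range 1 to K is an assumption):

```latex
AX = \sum_{k=1}^{K} A_k X_k \tag{1}
```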

[0005]
Assuming that X_k is a W-bit fixed-point binary number in two's complement representation, it can be expressed as in Equation 2 below.
[0006]
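Equation 2 is also absent from this text. A form consistent with the next paragraph, where b_k[0] is the sign bit and b_k[W−1] the least significant bit, would be the following, assuming X_k is scaled to the range −1 ≤ X_k < 1:

```latex
X_k = -b_k[0] + \sum_{w=1}^{W-1} b_k[w]\, 2^{-w} \tag{2}
```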

[0007]
In Equation 2, b_k[w] is a bit (1 or 0), b_k[0] is the sign bit, and b_k[W−1] is the least significant bit.
[0008]
Further, since X_k can be written as X_k = {X_k − (−X_k)}/2, letting b'_k denote the bit inversion of b_k, X_k becomes as follows.
[0009]
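Equation 3 is not reproduced here. Writing −X_k in terms of the inverted bits b'_k and taking half the difference suggests the following reconstruction, in which each c_k[w] takes the value +1 or −1; the exact layout of the published equation is not known:

```latex
X_k = \frac{1}{2}\left[\, \sum_{w=0}^{W-1} c_k[w]\, 2^{-w} - 2^{-(W-1)} \right] \tag{3}
```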

[0010]
Here, c_k[w] in Equation 3 is given as follows.
[0011]
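Equation 4 is not reproduced here. With b'_k[w] = 1 − b_k[w], a definition consistent with the negative weight of the sign bit in two's complement would be (a reconstruction):

```latex
c_k[w] =
\begin{cases}
b_k[w] - b'_k[w] & (w \neq 0) \\
-\left( b_k[0] - b'_k[0] \right) & (w = 0)
\end{cases} \tag{4}
```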

[0012]
By substituting (Equation 3) into (Equation 1), the following Equation 5 is obtained.
[0013]
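Equation 5 is not reproduced here. Substituting the ±1 coefficients c_k[w] into the inner product and collecting terms by bit position w gives, as a reconstruction, a sum of per-bit terms Q(b_w) plus a constant offset Q(0):

```latex
AX = \sum_{w=0}^{W-1} Q(b_w)\, 2^{-w} + 2^{-(W-1)}\, Q(0) \tag{5}
```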

[0014]
In Expression 5, the following relationship is satisfied.
[0015]
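Equation 6 is not reproduced here. For the bit-by-bit expansion of the inner product to hold term by term, the per-bit value and the offset would be defined as follows (a reconstruction, not the published figure):

```latex
Q(b_w) = \sum_{k=1}^{K} \frac{A_k}{2}\, c_k[w], \qquad Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \tag{6}
```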

[0016]
In Equation 6, Q(b_w) can be rewritten as in the following equation.
[0017]
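Equation 7 is not reproduced here. Since c_1[w]^2 = 1, one rewriting that supports the conclusion drawn in the next paragraph factors c_1[w] out of the sum (a reconstruction; the published form may differ):

```latex
Q(b_w) = c_1[w] \left\{ \frac{A_1}{2} + \sum_{k=2}^{K} \frac{A_k}{2}\, c_1[w]\, c_k[w] \right\} \tag{7}
```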

[0018]
Therefore, Q(b_w) for the case c_1 = −1 can be expressed using Q(b_w) for the case c_1 = 1, so all values of Equation 6 can be covered by 2^(K−1) entries, which are stored as a look-up table. Since the value for each w from W−1 down to 0 is multiplied by 2^(−w), the computation proceeds in order starting from w = W−1, and the values read from the look-up table are added while being shifted to the right by one digit per step. With this procedure, AX is obtained without performing any multiplication.
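The procedure above can be sketched in software. The following Python is a minimal illustration, not the circuit of any cited document. It assumes the form Q(b_w) = Σ_k (A_k/2)·c_k[w] with c_k[w] = ±1, stores only the 2^(K−1) entries with c_1 = +1, and accumulates from w = W−1 toward the sign bit, halving the running sum at each step. The function names (`to_bits`, `build_table`, `lookup`, `da_inner_product`) are hypothetical.

```python
from itertools import product

def to_bits(x, W):
    """Two's-complement bits b[0..W-1] of x in [-1, 1); b[0] is the sign bit."""
    n = round(x * 2 ** (W - 1)) & ((1 << W) - 1)
    return [(n >> (W - 1 - w)) & 1 for w in range(W)]

def build_table(A):
    """Half-size table: Q = sum(A_k * c_k) / 2 for every sign pattern with c_1 = +1."""
    return {rest: sum(a * s for a, s in zip(A, (1,) + rest)) / 2
            for rest in product((1, -1), repeat=len(A) - 1)}

def lookup(table, c):
    """Q for any sign pattern c, using Q(-c) = -Q(c) when c_1 = -1."""
    if c[0] == 1:
        return table[tuple(c[1:])]
    return -table[tuple(-s for s in c[1:])]

def da_inner_product(A, X, W):
    """Inner product of A and X (each X_k a W-bit fraction) via table look-ups."""
    table = build_table(A)
    bits = [to_bits(x, W) for x in X]
    acc = -sum(A)                       # twice the offset term Q(0) = -sum(A_k) / 2
    for w in range(W - 1, -1, -1):      # least significant bit first
        c = [2 * b[w] - 1 for b in bits]
        if w == 0:                      # the sign bit enters with the opposite sign
            c = [-s for s in c]
        acc = lookup(table, c) + acc / 2    # shift right by one digit, then add
    return acc

result = da_inner_product([0.5, -0.25, 0.75], [0.25, -0.5, 0.375], W=8)
```

Starting the accumulator at twice the offset lets every step use the same shift-and-add, so the offset ends up weighted by 2^−(W−1) after W halvings.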
[0019]
Performing the above operation simultaneously for a plurality of values of the input vector X is equivalent to performing the operation of the following equation.
[0020]
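Equation 8 is not reproduced here. From the definitions in the next paragraph, it is presumably a block-diagonal arrangement of the row vectors A(i) acting on the stacked column vectors X(i), reconstructed as:

```latex
B = A'X' =
\begin{pmatrix}
A(0) & 0 & \cdots \\
0 & A(1) & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
\begin{pmatrix} X(0) \\ X(1) \\ \vdots \end{pmatrix} \tag{8}
```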

[0021]
Here, X(i) is the i-th column vector X of Equation 1, A(i) is the corresponding constant row vector (i is an integer of 0 or more), and 0 denotes a row vector whose elements are all 0. A′ represents the matrix containing the A(i), X′ represents the vector containing the X(i), and B represents the calculation result.
[0022]
FIG. 6 shows a method of performing the calculation of Equation 8 using the methods shown in, for example, Patent Documents 3 and 4. The elements of the input vectors X(0), X(1), ... are input simultaneously in parallel, in bit-serial form, from the terminal group 101. Reference numerals 110, 111, ... denote distributed arithmetic operation circuits for calculating AX of Equation 1, connected in parallel in a number equal to the number of input vectors. Reference numeral 103 denotes storage means storing a look-up table. Data corresponding to the input data (constant vector) at the terminal group 101 is input from the storage means 103 to the adder/subtractor 104. The adder/subtractor 104 adds or subtracts the output data of the storage means 103 and the output of the 1-bit shifter 106, and inputs the result to temporary storage means 107, for example a shift register. The output of the adder/subtractor 104 held in the temporary storage means 107 is output to the bit shifter 106; the switch 108 is closed at an appropriate timing, and the result is output from the output terminal 109 as output B(0). Similarly, the outputs B(1), B(2), ... of the other DA circuits 111, ... are output from the output terminal 109.
[0023]
[Patent Document 1]
US Patent No. 3,777,130
[Patent Document 2]
US Patent No. 5,226,002
[Patent Document 3]
Japanese Patent Application Laid-Open No. 2000-148730
[Non-Patent Document]
"Applications of distributed arithmetic to digital signal processing: a tutorial review," S. A. White, IEEE ASSP Magazine, vol. 6, issue 3, pp. 4-9, July 1989
[0024]
[Problems to be solved by the invention]
However, in the conventional device shown in FIG. 6, each of the distributed arithmetic operation circuits 110, 111, ... has its own built-in storage means 103, so the compactness of the device depends heavily on the compactness of the storage means 103. That is, as the number of input vectors processed simultaneously increases, the number of distributed arithmetic operation circuits 110, 111, ... increases, and the number of storage means 103 increases accordingly.
[0025]
Furthermore, in FIG. 6 the input data is supplied bit-serially from the terminal group 101, so if one bit is processed per clock, at least as many clocks as the bit width of the input data are needed before an output result is obtained. On the other hand, processing multiple bits simultaneously in order to shorten the operation time without raising the clock frequency is equivalent to increasing the number of elements of the vector X′ in Equation 8. The number of distributed arithmetic operation circuits 110, 111, ... operating in parallel therefore increases by the number of bits processed in parallel, and the number of storage means 103 increases by the same amount.
[0026]
As described above, when the inner product operation is simultaneously performed on a plurality of data by the distributed arithmetic operation, storage means for the number of simultaneous processes is required, and there is a problem that the compactness deteriorates as the number of simultaneous processes increases.
[0027]
The present invention has been made in view of such a problem, and has as its object to speed up distributed arithmetic operations while keeping the size of a storage unit small.
[0028]
[Means for Solving the Problems]
The present invention provides means for solving such problems, and the invention of each claim constitutes the following technical means.
[0029]
The present invention provides an arithmetic device that performs arithmetic processing on a first input data set and a second input data set using distributed arithmetic, the device comprising: at least one distributed arithmetic operation means to which the first input data set is input; and storage means that stores a table of data (look-up table) corresponding to the second input data set and outputs a data set according to the second input data. The distributed arithmetic operation means comprises: first to U-th conversion means that generate data of T-bit width (T is a natural number) from the first to U-th (U is a natural number) data of the first input data set, respectively; first to T-th selection means that each select the optimal data from the data set output by the storage means according to the content of the respective bit, from the least significant bit to the most significant bit, of the outputs of the first to U-th conversion means; first arithmetic means that operates on the data output by the first to T-th selection means; second arithmetic means that adds or subtracts the data output by the first arithmetic means and the output of third arithmetic means that performs a bit shift; temporary storage means that temporarily stores the result output by the second arithmetic means; and the third arithmetic means, which bit-shifts the data output by the temporary storage means.
[0030]
In the arithmetic device of the present invention configured as above, the individual distributed arithmetic operation means need not contain storage means holding the data table, so processing can be accelerated by T-bit simultaneous processing using distributed arithmetic without increasing the storage means for the table.
[0031]
The arithmetic device of the present invention can be configured as an inner product operation device. In this case, the first to U-th conversion means allocate the bit values of the first to U-th data of the first input data set, in order from the least significant bit, to the bit positions (0, 1, ..., T−1) of bit 0 to bit T−1, and output them sequentially, T bits at a time, to the first to T-th selection means. The first arithmetic means adds or subtracts the outputs of the first to (T−1)-th selection means, each bit-shifted toward the lower side (by T−1, T−2, ..., 1 bits, respectively), and the output of the T-th selection means. The third arithmetic means bit-shifts the output of the temporary storage means toward the lower side by T bits. In this way the inner product of the vector input as the first input data set and the vector input as the second input data set is calculated. With this configuration, inner product processing can be accelerated by T-bit simultaneous processing using distributed arithmetic without increasing the storage means for the table.
[0032]
Further, the arithmetic device of the present invention can be configured as a complex number multiplication device. In this case, the first input data set consists of H complex numbers (H is a natural number); the real part and the imaginary part of the h-th complex number (h is an integer from 1 to H) are assigned to the (2h−1)-th and 2h-th data of the first input data set, respectively; the number of distributed arithmetic operation means is 2H; the (2h−1)-th and 2h-th inputs of the first input data set are supplied only to the (2h−1)-th and 2h-th distributed arithmetic operation means; and the complex numbers input as the first input data are multiplied by the complex number input as the second input data. With this configuration, complex multiplication can be accelerated by T-bit simultaneous processing using distributed arithmetic without increasing the storage means for the table.
[0033]
If the storage means stores a first table giving the values of {cos(2πn/N) + sin(2πn/N)}/2 for integers n from 0 to N/8 and a second table giving the values of {cos(2πn/N) − sin(2πn/N)}/2 for the same integers n, the butterfly operations required for an N-point discrete Fourier transform can be processed at high speed without increasing the capacity of the storage means.
[0034]
Alternatively, the storage means may store a third table giving, for integers n from 0 to N/8, pairs of the values {cos(2πn/N) + sin(2πn/N)}/2 and {cos(2πn/N) − sin(2πn/N)}/2. In this case, the butterfly operations required for an N-point discrete Fourier transform can be processed at high speed while further saving storage capacity and power consumption.
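The two tables can be generated as in the sketch below, which writes the two quantities as `qp` and `qm`. It assumes N is a multiple of 8; the helper `first_quadrant`, a hypothetical name, illustrates one way entries beyond N/8 could be recovered from the stored range by mirroring about 45 degrees, where cosine and sine swap roles.

```python
import math

def build_tables(N):
    """Table values (cos + sin)/2 and (cos - sin)/2 of 2*pi*n/N for n = 0 .. N/8."""
    qp, qm = [], []
    for n in range(N // 8 + 1):
        angle = 2 * math.pi * n / N
        qp.append((math.cos(angle) + math.sin(angle)) / 2)
        qm.append((math.cos(angle) - math.sin(angle)) / 2)
    return qp, qm

def first_quadrant(n, N, qp, qm):
    """Both values for 0 <= n <= N/4, reading only table entries 0 .. N/8."""
    if n <= N // 8:
        return qp[n], qm[n]
    m = N // 4 - n          # mirror about 45 degrees: cos and sin swap
    return qp[m], -qm[m]

qp, qm = build_tables(16)   # three entries each: n = 0, 1, 2
```

Storing the two values for the same n side by side, as in the paired table, would let one address fetch both.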
[0035]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, an embodiment of the present invention shown in the drawings will be described.
[0036]
(Basic configuration)
First, the basic configuration of the arithmetic device of the present invention, shown in FIG. 1, will be described. This basic configuration underlies the inner product operation devices and complex multiplication devices of the first to third embodiments described later. In FIG. 1, reference numeral 201 denotes an input terminal group to which the first input data set is input; 203, a setting data input terminal to which the second input data set is input; 204, storage means storing a look-up table; and 230, an output terminal group. Reference numerals 251, 252, 253, ... denote distributed arithmetic operation processing circuits, all of which have the same configuration.
[0037]
The input terminal group 301 is the group of terminals through which data from the input terminal group 201 enters the distributed arithmetic operation processing circuit 251, and the terminal 302 is the input terminal for the data set supplied from the storage means 204 to the distributed arithmetic operation processing circuit 251.
[0038]
The U data items, first to U-th, input from the input terminal group 201 are converted into T-bit data by the first to U-th conversion means 202a, 202b, ..., 202u of the conversion circuit 202, respectively. A data set corresponding to the setting data input from the setting data input terminal 203 is output from the storage means 204 and input to the selection circuit 205.
[0039]
The selection circuit 205 includes first to T-th selection means 205a, 205b, ..., 205t, corresponding to the T-bit-wide data output from the conversion circuit 202. The first selection means 205a generates and outputs, from the data set output by the storage means 204, the data corresponding to the set of least significant bits of the outputs of the conversion means 202a to 202u in the conversion circuit 202. Likewise, the second selection means 205b generates and outputs, from the data set output by the storage means 204, the data corresponding to the set of second least significant bits of the outputs of the conversion means 202a to 202u. The remaining selection means 205c, 205d, ... of the selection circuit 205 perform the same processing. The T-th selection means 205t outputs the data corresponding to the set of most significant bits of the outputs of the conversion means 202a to 202u.
[0040]
The data items output from the selection circuit 205 are combined by the arithmetic means (first arithmetic means) 206, and the result is input to the adder/subtractor 207. The adder/subtractor (second arithmetic means) 207 adds or subtracts the data from the arithmetic means 206 and the data from the bit shifter 208. The data from the adder/subtractor 207 is held temporarily in the temporary storage means 209, for example a shift register. The data from the temporary storage means 209 is output to the bit shifter (third arithmetic means) 208. The switch 210 is closed at an appropriate timing, and when it closes, the data held in the temporary storage means 209 is output from the first terminal of the output terminal group 230.
[0041]
Each of the distributed arithmetic operation circuits 110, 111, ... of the conventional example in FIG. 6 corresponds to a circuit that takes as one unit the switch 210, the bit shifter 208, the temporary storage means 209, the adder/subtractor 207, the arithmetic means 206, any one of the selection means of the selection circuit 205 (for example, the selection means 205a), and the storage means 204 of FIG. 1. Consider processing multiple data items simultaneously, for example with the conversion circuit 202 allocating bits so that multiple bits are processed at once, the distributed arithmetic operation processing circuit 251 processing the first setting data, and the distributed arithmetic operation processing circuits 252, 253, ... processing the subsequent setting data. In such cases the number of distributed arithmetic operation circuits grows with the number of simultaneous processes. In the conventional example the number of storage means grew with it, increasing their total capacity. In the present invention, by contrast, a single storage means 204 suffices regardless of the number of simultaneous processes, so the total storage capacity required remains constant.
[0042]
(First embodiment)
FIG. 2 shows an inner product operation device according to the first embodiment of the present invention. Components having the same functions as in FIG. 1 are given the same reference numerals, and their description is omitted.
[0043]
The conversion circuit 202 is configured as follows. The conversion means 303 in the conversion circuit 202 allocates bits [Tk], [Tk+1], ..., [Tk+(T−1)] (k is an integer of 0 or more) of the data input from the input terminal group 301 to its output, from the least significant bit toward the most significant bit, and outputs them T bits at a time. The other conversion means 202a, 202b, ... in the conversion circuit 202 perform the same processing.
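As a small illustration of this regrouping, the sketch below splits a W-bit word into T-bit groups, least significant group first. It assumes that W is a multiple of T and that bit [0] denotes the least significant bit; the function name `convert` is hypothetical.

```python
def convert(word, W, T):
    """Split a W-bit word into T-bit groups, least significant group first."""
    assert W % T == 0
    mask = (1 << T) - 1
    return [(word >> (T * k)) & mask for k in range(W // T)]

groups = convert(0b10110100, W=8, T=4)   # low nibble 0b0100, then high nibble 0b1011
```

Within each group, input bit [Tk] lands on output bit 0 and input bit [Tk+(T−1)] on output bit T−1.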
[0044]
The arithmetic means 206 consists of a bit shifter group 306 and an adder/subtractor 307. The output data of the selection circuit 205 is input to the bit shifter group 306, and the output data of the bit shifter group 306 is added or subtracted by the adder/subtractor 307. The bit shifter group 306 is configured as follows. The data input to the bit shifter group 306 from the first selection means 205a of the selection circuit 205 is shifted by T−1 bits. The data input from the second selection means 205b is shifted by T−2 bits. The data input from the remaining selection means of the selection circuit 205 is processed in the same manner. The data input from the T-th selection means 205t is shifted by T−T bits, that is, output unchanged.
[0045]
The bit shifter (208) is constituted by a T bit shifter (308).
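The data path of FIG. 2 can be modeled in software as follows. This is a behavioral sketch under stated assumptions, not the circuit itself: the table holds the 2^(K−1) values Σ_k (A_k/2)·c_k with c_1 = +1, inputs are W-bit two's-complement integers scaled by 2^−(W−1), W is a multiple of T, and the selection that handles the sign bit subtracts instead of adds. All function names are hypothetical.

```python
from itertools import product

def make_table(A):
    """Single shared table: Q = sum(A_k * c_k) / 2 for sign patterns with c_1 = +1."""
    return {rest: sum(a * s for a, s in zip(A, (1,) + rest)) / 2
            for rest in product((1, -1), repeat=len(A) - 1)}

def select(table, bit_column, negate):
    """One selection means: pick +Q or -Q for a column holding one bit per input."""
    c = [2 * b - 1 for b in bit_column]
    q = table[tuple(c[1:])] if c[0] == 1 else -table[tuple(-s for s in c[1:])]
    return -q if negate else q

def da_t_bits(A, X_int, W, T):
    """Inner product of A with the inputs X_int * 2^-(W-1), T bits per step."""
    assert W % T == 0
    table = make_table(A)                 # stored once, shared by all T selectors
    mask = (1 << W) - 1
    words = [x & mask for x in X_int]
    acc = -sum(A)                         # twice the offset term
    for k in range(W // T):
        s = 0.0
        for t in range(T):                # T selections, concurrent in hardware
            i = T * k + t                 # conversion: bit Tk+t, counted from the LSB
            column = [(x >> i) & 1 for x in words]
            q = select(table, column, negate=(i == W - 1))
            s += q / 2 ** (T - 1 - t)     # first arithmetic means: shift, then add
        acc = s + acc / 2 ** T            # T-bit shift of the stored sum, then add
    return acc

result = da_t_bits([0.5, -0.25, 0.75], [32, -64, 48], W=8, T=4)
```

With T = 4 the loop body runs twice for 8-bit inputs instead of eight times, while `make_table` is still called only once.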
[0046]
As explained in the section on the basic configuration, in the conventional approach, performing as many distributed arithmetic operations simultaneously as there are selection means in the selection circuit 205, in order to process them at high speed, requires as many look-up table storage means as there are selection means. Conversely, if those distributed arithmetic operations are carried out one at a time so as not to increase the number of look-up table storage means, the operation time grows in proportion to their number. With the configuration of the present invention, however, T-bit simultaneous processing can be performed without increasing the look-up table storage means, and processing can be accelerated.
[0047]
(Second embodiment)
FIG. 3 shows a complex number multiplication device according to the second embodiment of the present invention. Components having the same functions as in FIG. 1 are given the same reference numerals, and their description is omitted. The real part of input complex number 1 is assigned to input 1 of the input terminal group 201, the imaginary part of input complex number 1 to input 2, ..., the real part of input complex number h to input 2h−1, ..., and the imaginary part of input complex number H to input 2H. The input data is processed by the distributed arithmetic operation processing circuit group 400, which consists of 2H distributed arithmetic operation processing circuits 251, 252, .... In the remainder of the description of this embodiment, all of the distributed arithmetic operation processing circuits are denoted by reference numeral 251.
[0048]

Inputs 1 and 2 are supplied only to the first and second distributed arithmetic operation processing circuits 251. Inputs 1 and 2 are assumed not to affect the other distributed arithmetic operation processing circuits 251, so the corresponding input lines are not drawn. The real part and the imaginary part of output complex number 1 are output from the first and second distributed arithmetic operation processing circuits 251, respectively, and are assigned to output 1 and output 2 of the output terminal group 230. The other inputs are processed in the same way.
[0049]
Here, consider the multiplication Y·{Z} of a complex constant Y and a set of arbitrary complex numbers {Z}. Letting Y = Y_r − jY_i and letting one element of {Z} be Z = Z_r + jZ_i (the subscript r denotes the real part and the subscript i the imaginary part, and j^2 = −1), the element YZ of Y·{Z} is given by Equation 9 below.
[0050]
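Equation 9 is not reproduced in this text. Expanding the product of Y = Y_r − jY_i and Z = Z_r + jZ_i directly gives:

```latex
YZ = (Y_r - jY_i)(Z_r + jZ_i) = (Y_r Z_r + Y_i Z_i) + j\,(Y_r Z_i - Y_i Z_r) \tag{9}
```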

[0051]
Therefore, if A and X are set as in the following Expression 10 in Expression 1, and the DA method with two elements is applied, the real part of YZ can be obtained.
[0052]
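Equation 10 is not reproduced here. Matching the real part Y_r Z_r + Y_i Z_i of the product to an inner product of two-element vectors suggests the following assignment (a reconstruction):

```latex
A = \begin{pmatrix} Y_r & Y_i \end{pmatrix}, \qquad X = \begin{pmatrix} Z_r \\ Z_i \end{pmatrix} \tag{10}
```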

[0053]
Similarly, if A and X in Equation 1 are set as in Equation 11 below and the DA method with two elements is applied, the imaginary part of YZ is obtained.
[0054]

[0055]
For both the real part and the imaginary part, the look-up table {(Y_r + Y_i)/2, (Y_r − Y_i)/2} is the same. Accordingly, in FIG. 3, when Z_r and Z_i are input to the (2h−1)-th and 2h-th input terminals of the input terminal group 201, respectively, and {(Y_r + Y_i)/2, (Y_r − Y_i)/2} is stored as the look-up table of the storage unit 204, the real part and the imaginary part of YZ are obtained as the outputs of the (2h−1)-th and 2h-th output terminals of the output terminal group 230. Y·{Z} can be obtained by performing the DA method for YZ multiple times, which is achieved by inputting {Z} to the input terminal group 201.
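As a minimal sketch of the scheme described above, the following Python code computes both parts of YZ bit-serially (one bit per step) from the single table {(Y_r + Y_i)/2, (Y_r − Y_i)/2}. Several points are assumptions made for illustration only: integer two's-complement inputs of a given width, the standard offset-binary form of the DA method, the choice A = (Y_r, Y_i) for the real part and A = (−Y_i, Y_r) for the imaginary part with X = (Z_r, Z_i) in both cases, and the hypothetical function names `da_pair` and `complex_multiply_da`.

```python
from fractions import Fraction


def bits_twos_complement(x, width):
    """Return the bits of x, least significant first, in width-bit two's complement."""
    assert -(1 << (width - 1)) <= x < (1 << (width - 1))
    return [(x >> i) & 1 for i in range(width)]


def da_pair(x1, x2, p, m, width):
    """Bit-serial offset-binary DA for a1*x1 + a2*x2.

    Only p = (a1 + a2)/2 and m = (a1 - a2)/2 are needed; the four possible
    table outputs are +p, +m, -m and -p.
    """
    b1 = bits_twos_complement(x1, width)
    b2 = bits_twos_complement(x2, width)
    acc = -p  # constant offset term of the offset-binary coding
    for i in range(width):
        entry = p if b1[i] == b2[i] else m      # select one of the two stored values
        term = entry if b1[i] == 1 else -entry  # sign decided by the bit of x1
        if i == width - 1:
            acc -= term * (1 << i)  # the sign bit carries negative weight
        else:
            acc += term * (1 << i)
    return acc


def complex_multiply_da(y_r, y_i, z_r, z_i, width):
    """Return (real, imag) of (y_r - j*y_i) * (z_r + j*z_i) using one shared table."""
    tp = Fraction(y_r + y_i) / 2  # stored value (Y_r + Y_i)/2
    tm = Fraction(y_r - y_i) / 2  # stored value (Y_r - Y_i)/2
    real = da_pair(z_r, z_i, tp, tm, width)   # assumed A = (Y_r, Y_i)
    imag = da_pair(z_r, z_i, tm, -tp, width)  # assumed A = (-Y_i, Y_r)
    return real, imag
```

In this sketch the imaginary part reuses the same two stored values as the real part; only the selection and the sign differ, which is why a second table is not needed.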
[0056]
With this configuration, complex multiplication can be realized by distributed arithmetic. The number of storage units does not increase and remains constant regardless of the number of elements of {Z}, that is, the number of inputs of the input terminal group 201, and even when multi-bit simultaneous processing is performed in the distributed arithmetic operation processing circuits 251 for high-speed operation, as described in the section on the first embodiment.
[0057]
(Third embodiment)
The N-point discrete Fourier transform (hereinafter, DFT) of a complex function f(n) is expressed by Equation 12 below.
[0058]

[0059]
W_N is called the twiddle factor. Transforming Equation 12 gives
[0060]

[0061]
Separating the cases where k is even and where k is odd gives
[0062]

[0063]
where

[0064]
Thus the N-point DFT of f(n) is reduced to N/2-point DFTs of y(n) and z(n). At this time, the calculation of Equation 15 must be performed at N/2 points. The operation of Equation 15 is called a butterfly operation. By repeating this recursively over log₂ N − 1 stages (when N is a power of 2), the DFT result is obtained. This calculation method is called the radix-2 decimation-in-frequency DFT.
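The decomposition described above can be sketched in Python as follows. The sketch assumes the usual definitions: F(k) = Σ f(n) W_N^(nk) with W_N = exp(−j2π/N), y(n) = f(n) + f(n + N/2), and z(n) = {f(n) − f(n + N/2)} W_N^n. The function names are hypothetical, and ordinary floating-point complex multiplication stands in for the DA-based multiplier.

```python
import cmath


def dif_fft(f):
    """Radix-2 decimation-in-frequency DFT; len(f) must be a power of 2."""
    n_points = len(f)
    assert n_points >= 1 and n_points & (n_points - 1) == 0
    if n_points == 1:
        return list(f)
    half = n_points // 2
    # Butterfly: y(n) needs only an addition, z(n) needs one complex multiplication.
    y = [f[n] + f[n + half] for n in range(half)]
    z = [(f[n] - f[n + half]) * cmath.exp(-2j * cmath.pi * n / n_points)
         for n in range(half)]
    out = [0j] * n_points
    out[0::2] = dif_fft(y)  # even-indexed outputs F(2k)
    out[1::2] = dif_fft(z)  # odd-indexed outputs F(2k+1)
    return out


def naive_dft(f):
    """Direct O(N^2) evaluation of the DFT definition, used only as a reference."""
    n_points = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
                for n in range(n_points))
            for k in range(n_points)]
```

In the last level of this recursion (2-point transforms) the only twiddle factor is W_2^0 = 1, so no actual complex multiplication is needed there.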
[0065]
In this case, the complex multiplication shown in Equation 15 must be performed N/2 times per stage, over a total of log₂ N − 1 stages. To perform the DFT at high speed, the operation of Equation 15 must therefore be performed at high speed, and it is desirable to process several of the N/2 operations of Equation 15 in each stage simultaneously in parallel. In the second embodiment according to the present invention, setting
[0066]

[0067]
achieves this. In this case, Q(b_w) in Equation 6 is
[0068]

[0069]
Since every value of the trigonometric functions can be expressed using the values for angles from 0 to 45 degrees, the look-up table for Equation 17 needs to store only the results for n = 0 to N/8.
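One way the reduced table could be used is sketched below in Python. It assumes Qp(n) = {cos(2πn/N) + sin(2πn/N)}/2 and Qm(n) = {cos(2πn/N) − sin(2πn/N)}/2, and that N is a multiple of 8. The folding rules and the names `build_octant_tables` and `lookup` are illustrative assumptions; the source does not specify how values outside 0 to 45 degrees are derived.

```python
import math


def build_octant_tables(n_points):
    """Store Qp(n) and Qm(n) only for n = 0 .. N/8 (angles of 0 to 45 degrees)."""
    assert n_points % 8 == 0
    qp, qm = [], []
    for n in range(n_points // 8 + 1):
        c = math.cos(2 * math.pi * n / n_points)
        s = math.sin(2 * math.pi * n / n_points)
        qp.append((c + s) / 2)
        qm.append((c - s) / 2)
    return qp, qm


def lookup(n, n_points, qp, qm):
    """Return (Qp(n), Qm(n)) for 0 <= n < N/2 using only the stored octant."""
    eighth, quarter = n_points // 8, n_points // 4
    if n <= eighth:
        # 0 to 45 degrees: stored directly.
        return qp[n], qm[n]
    if n <= quarter:
        # 45 to 90 degrees: cos and sin swap roles, so Qm changes sign.
        r = quarter - n
        return qp[r], -qm[r]
    # 90 to 180 degrees: rotate back by 90 degrees; Qp becomes Qm, Qm becomes -Qp.
    p, m = lookup(n - quarter, n_points, qp, qm)
    return m, -p
```

With N = 32, for example, each table holds only 5 entries, yet `lookup` covers all 16 twiddle indices needed by the first butterfly stage.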
[0070]
FIG. 4 shows the configuration of the storage means in this case. Components having the same functions as those in FIG. 1 are given the same reference numerals. The storage unit 204 includes a first storage unit 501 and a second storage unit 502. The first storage unit 501 stores a look-up table of Qp(n) (where n is an integer from 0 to N/8). The second storage unit 502 stores a look-up table of Qm(n) (where n is an integer from 0 to N/8). In accordance with the address specified via the setting data input terminal 203, the data set from the first storage unit 501 and the second storage unit 502 is output from the terminal 302.
[0071]
With such a configuration, the butterfly operation required for DFT can be performed at high speed without increasing the number of storage means for the lookup table.
[0072]
(Fourth embodiment)
FIG. 5 shows another configuration of the storage unit in the third embodiment. Components having the same functions as those in FIG. 4 are given the same reference numerals. Since n is fixed within one complex multiplication, if the single storage unit 504 holds a table storing pairs of Qp(n) and Qm(n), only one address decoder is required and power consumption can be reduced.
[0073]
(Fifth embodiment)
If complex multiplication is configured by the DA method according to the second embodiment using the look-up table of FIG. 4 or FIG. 5, the operation of Equation 15 can be performed. As described above, Equation 15 must be evaluated for n = 0, ..., N/2 − 1, and processing some of these evaluations simultaneously speeds up the operation. As shown in the second embodiment, this does not increase the storage means for the look-up table. Furthermore, as described in the first embodiment, even if the calculation is accelerated by multi-bit simultaneous processing, the number of storage means for the look-up table does not increase. By performing the above operations over log₂ N − 1 stages, the DFT is obtained.
[0074]
With such a configuration, it is possible to increase the speed of the DFT without increasing the number of storage means for the lookup table.
[0075]
[Effect of the Invention]
As is apparent from the above description, the arithmetic device of the present invention comprises at least one distributed arithmetic operation means, to which a first input data set is input and which includes conversion means, selection means, and first to third arithmetic means, and storage means that stores a table of data corresponding to a second input data set and outputs a data set corresponding to the second input data. The selection means of the distributed arithmetic operation means select the optimum data from the data set output from the storage means. Consequently, without increasing the number of storage means for storing the table, T-bit simultaneous processing by distributed arithmetic can speed up arithmetic processing such as inner product operations, complex multiplication, and the butterfly operations required for the N-point discrete Fourier transform.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of the present invention.
FIG. 2 is a block diagram showing a first embodiment of the present invention.
FIG. 3 is a block diagram showing a second embodiment of the present invention.
FIG. 4 is a schematic diagram illustrating a configuration of a storage unit according to a third embodiment of the present invention.
FIG. 5 is a schematic diagram illustrating another configuration of the storage unit according to the third embodiment of the present invention.
FIG. 6 is a block diagram illustrating an example of a conventional inner product calculation device.
[Explanation of symbols]
201 data input terminal group
230 data output terminal group
203 setting data input terminal
204, 501, 502, 503 storage means for storing a look-up table
251, 252, 253, 400 distributed arithmetic operation processing circuit
202 input data conversion means
205 selection means
206 operation means for selected data
207 adder/subtracter
208 bit shifter
209 temporary storage means

Claims

An arithmetic device that performs arithmetic processing on a first input data set and a second input data set by using a distributed arithmetic method, comprising:
at least one distributed arithmetic operation means to which the first input data set is input; and
storage means for storing a table of data corresponding to the second input data set and outputting a data set corresponding to the second input data,
wherein the distributed arithmetic operation means comprises:
first to U-th conversion means for respectively generating T-bit-width data from the first to U-th data of the first input data set;
first to T-th selection means for selecting optimum data from the data set output from the storage means in accordance with the contents of the respective bits, from the least significant bit to the most significant bit, of the first to U-th conversion means;
first arithmetic means for performing an operation on the data output from the first to T-th selection means;
second arithmetic means for performing addition and subtraction between the data output from the first arithmetic means and the output of third arithmetic means that performs a bit shift;
temporary storage means for temporarily storing the operation result output from the second arithmetic means; and
the third arithmetic means, which performs a bit shift of the data output from the temporary storage means.

The arithmetic device according to claim 1, wherein:
the first to U-th conversion means take the values of the bit positions of the first to U-th data of the first input data set, in order from the least significant bit, place them at bit positions 0 to T−1, and output them T bits at a time to the first to T-th selection means;
the first arithmetic means performs addition and subtraction between the outputs of the first to (T−1)-th selection means, shifted by T−1 bits down to 1 bit respectively, and the output of the T-th selection means;
the third arithmetic means shifts the output of the temporary storage means toward the lower side by T bits; and an inner product of a vector input as the first input data set and a vector input as the second input data set is calculated.

The arithmetic device according to claim 1, wherein:
the first input data set consists of H complex numbers, and the real part and the imaginary part of the h-th complex number are assigned to the (2h−1)-th and 2h-th data of the first input data set, respectively;
the number of distributed arithmetic operation means is 2H, and the (2h−1)-th and 2h-th inputs of the first input data set are supplied only to the (2h−1)-th and 2h-th distributed arithmetic operation means; and
multiplication of a complex number input as the first input data and a complex number input as the second input data is performed.

The arithmetic device according to claim 3, wherein the storage means stores a first table holding the values of {cos(2πn/N) + sin(2πn/N)}/2 for integers n from 0 to N/8, and a second table holding the values of {cos(2πn/N) − sin(2πn/N)}/2 for integers n from 0 to N/8.

The arithmetic device according to claim 3, wherein the storage means stores a third table holding pairs of the values {cos(2πn/N) + sin(2πn/N)}/2 and {cos(2πn/N) − sin(2πn/N)}/2 for integers n from 0 to N/8.