JP3709291B2

JP3709291B2 - Fast complex Fourier transform method and apparatus

Info

Publication number: JP3709291B2
Application number: JP29173498A
Authority: JP
Inventors: 恒平撫原
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-10-14
Filing date: 1998-10-14
Publication date: 2005-10-26
Anticipated expiration: 2018-10-14
Also published as: JP2000122999A

Description

【０００１】
【発明の属する技術分野】
本発明は、高速フーリエ変換に関し、特にマイクロプロセッサを用いた高速フーリエ変換の高速演算方法及び装置に関する。
【０００２】
【従来の技術】
フーリエ変換 (Fourier Transform) は、時間領域の信号を周波数領域へ変換する手法である。ディジタル信号処理では、離散時間信号に対するフーリエ変換である離散フーリエ変換 (Discrete Fourier Transform、DFT) が、信号の周波数解析、音声認識、ブロック・フィルタの高速演算などに広く応用されている。しかし、DFT は原理的に演算量が多いため、実際のシステムでは、高速演算手法である高速フーリエ変換 (Fast Fourier Transform、FFT) が用いられる。
【０００３】
FFT では、DFT を小さい DFT へ繰り返し分割することで演算量を軽減する。分割の仕方により Radix-2 FFT、Radix-4 FFT など種々の手法がある。 Radix-2 FFT では、入力データ点数を N とすると、N 点 DFT を 2 個の N/2 点 DFT に分割する。次に各 N/2 点 DFT を 2 個の N/4 点 DFT へ分割する。これを繰り返し、最終的に N/2 個の 2 点 DFT まで分割し、この 2 点 DFT を直接計算する。Radix-4 では、N 点 DFT を順次 1/4 に分割し、最終的に 4 点 DFT に分割する。 2 点 DFT、 4 点 DFT は、そのシグナル・フローの形状から「バタフライ演算」とも呼ばれる。
【０００４】
FFT では乗算を多用するため、高速な FFT が必要なシステムでは、乗算器を内蔵したディジタル信号処理プロセッサ (Digital Signal Processor、 DSP) が採用されてきた。最近では、パーソナル・コンピュータや情報家電機器などのマイクロプロセッサ応用製品で、ソフトウェアによるマルチメディア信号処理を行いたいという要求があり、汎用マイクロプロセッサでも乗算器を内蔵するものがある。
【０００５】
さらに、SIMD (Single Instruction stream-Multiple Data streams) 並列演算命令を導入し、さらに高いマルチメディア信号処理性能を実現したマイクロプロセッサもある。このようなマイクロプロセッサの拡張例としては、日経エレクトロニクス、No. 661、 1996 年 5 月 6 日、 105 から 119 ページに「86 系マイクロプロセサのマルチメディア命令 MMX を解剖」と題して掲載された Intel の MMX 命令セット、 Kazumasa Suzuki 他により IEEE MICRO Magazine、 Vol。 18、 No. 2、 1998 年 4 月、 36 から 47 ページに "V830R/AV: Embedded multimedia RISC processor" と題して掲載された NEC の V830R 命令セットがあげられる。
【０００６】
これらのマルチメディア信号処理向け命令セットでは、例えば 64 ビット・レジスタ上に 16 ビットの値 4 個を格納し、 1 命令でこれら 4 個の値に対し 4 並列演算を行える。このような並列演算命令を以下 SIMD 命令と呼ぶ。典型的な SIMD 命令の例を図 13 に示す。この例では、64 ビット長のレジスタ r1 (1301)、 r2 (1302) 上にハーフワード長のデータ 4 個 (a0、 b0、 a1、 b1、 a2、 b2、 a3、 b3) が詰め込まれており、r1 (1301)、 r2 (1302) 上の同じハーフワード位置間で独立した 4 個の加算 (a0+b0、 a1+b1、 a2+b2、 a3+b3) を行い、レジスタ r3 (1303) 上に 4 個のハーフワード長の結果を得ている。ここでハーフワードは 16 ビット・データ長を意味するとする。
【０００７】
SIMD 命令は、異なるレジスタ上の同じハーフワード位置間で、同一の演算を複数組のデータに対し並列に行うというその特徴から、まったく同じシグナル・フローに従って処理する複数組の独立したデータを並列に処理する場合に最も適している。逆に、同じレジスタ上の異なるハーフワード位置間で演算を行う場合、 SIMD 演算命令のほかにフォーマット変換命令を併用する必要が生じ、オーバーヘッドが発生する。
【０００８】
これら SIMD 命令を用い FFT を高速化する試みがすでに発表されている。Intel はアプリケーション・ノート AP-555、発注番号 243040-001、 1996 年 3 月、 "Using MMX Instructions to Perform Complex 16-bit FFT" において、 SIMD 命令を用いて Radix-2 FFT を高速化した例を示している。しかし、このアプリケーション・ノートに示された例では、次に詳しく示すように、演算量に比べてデータ変換のオーバヘッドが大きく、高速な FFT 演算が行えない。Intel のアプリケーション・ノートに示された SIMD 命令による FFT 演算手法 (以下、従来法と呼ぶ) を詳しく説明する。
【０００９】
図 11 に、Radix-2 FFT の構成要素である 2 点バタフライ演算のシグナル・フローを示す。 2 点 FFT の入力を X0、 X1、出力を Y0、 Y1 とする。このように図11 では入出力を区別するために、入力と出力に別の変数名をつけているが、実際には Y0 を X0 と、Y1 を X1 と同じ場所に格納し、次段のバタフライ演算を行うことが多い。 X0、 X1、 Y0、 Y1 は複素数であり、実部と虚部を区別する必要があるときは、例えば X0 の実部を xr0、虚部を xi0、虚数単位を j とおき X0 = xr0 + j xi0 と書くことにする。シグナル・フロー中のひとつの枝は回転因子と呼ばれる係数 W を持つ。 c は係数 W の実部、s は係数 W の虚部とする。
【００１０】
一般的に実際に計算するためには、実部と虚部を別に取り扱う必要がある。そこで図 11 の 2 点 FFT のシグナル・フローを実部と虚部に分けて書き下した式を数 1 に示す。
【００１１】
【数１】

【００１２】
数式 1 の演算を SIMD 命令を用い並列化した例を図 12 に示す。前提として、2 点バタフライ演算の入力データ xr0、 xi0、 xr1、 xi0 はメイン・メモリ 1201 上のデータ配列 X (1202) に、回転因子 c、 s は係数テーブル W (1203) に格納されているものとする。 2 点バタフライ演算の結果はふたたびデータ配列 X (1202) に格納する。
【００１３】
まず、データ配列 X (1202) から xr1、 xi1 を 64 ビット・レジスタ 2 個 (1204、 1205) の下位 2 ハーフワードにロードし、フォーマット変換命令 PUNPACKLdq により、1 個の 64 ビット・レジスタ 1206 に 4 個のハーフワード・データを詰め込む。
【００１４】
つぎに 4 個のハーフワード・データを格納したレジスタ 1206 と、 4 個のハーフワード長の係数を係数テーブル 1203 からロードしたレジスタ 1207 の間で積和演算 PMADDwd を行う。積和演算の結果 1208 は 32 ビット長で得られるが、これを PSRAd 命令で右シフトし、ハーフワードに切り詰める (1209)。
【００１５】
ハーフワード長に切り詰めた積和演算結果 1215 は、零 (1214) から減算することで -1 倍する (1216)。これともとのハーフワード長の積和演算結果 1209 の間で PACKSSdw 命令を用いフォーマット変換を行い、64 ビット・レジスタ 1211 上に 4 個の積和演算結果を格納する。
【００１６】
さらに、別の 64 ビット・レジスタ (1217、 1218) の下位 2 ハーフワードに xr0、 xi0 をロードし、フォーマット変換命令 PUNPACKLdq により、1 個の 64 ビット・レジスタ 1219 に 4 個のハーフワード・データを詰め込む。これをレジスタ 1211 と SIMD 加算命令で4 並列加算すると、最終結果を得る (1213)。最終結果は、次段の計算のため、あるいは最終結果を残すため、メイン・メモリ上のデータ配列に格納する (1220)。 yr0 は xr0、 yi0 は xi0、 yr1 は xr1、 yi1 は xi1 が収められていた位置へ格納する。
【００１７】
このように従来法では、1 個の 2 点バタフライ演算から並列性を抽出し、 4 並列 SIMD 命令を適用し、高速化している。
【００１８】
【発明が解決しようとする課題】
図 12 に示した従来法は、以下に述べる問題があり、SIMD 命令に適していない。
【００１９】
1 個の 2 点バタフライ演算に入力するデータはわずか 2 組の複素数、実部、虚部を別に扱ってもデータ点数として 4 点にすぎない。この少ないデータ点数の中から、典型的な SIMD 演算命令を無駄なく適用するために 4 並列のデータ並列性を抽出するのはかなり困難である。
【００２０】
図 12 に示した例では、あるデータの実部と虚部という異質のデータを同一レジスタ上で混在させるデータ・フォーマットを採用することにより (1206)、 4 並列のデータ並列性を確保している。このデータ・フォーマットを実現するには、メイン・メモリ上からロードしたデータに対しデータ・フォーマット変換命令 PUNPKLdq を適用する必要があるという欠点がある。このデータ・フォーマット変換命令は、xr0、 xi0 の組に対してと xr1、 xi1 の組に対しての計 2 回適用されている。
【００２１】
また、最下位ハーフワードの乗算結果と最下位から 2 番目のハーフワードの乗算結果を加算し 32 ビットの結果を得ているため、並列度が 2 に落ちているところがある (1208)。さらに 32 ビットの結果を 16 ビット長にフォーマット変換するための右シフト命令が余計に必要になるという欠点がある。
【００２２】
このように、従来法が採用している並列性の抽出法は、 1 個の 2 点バタフライ演算という狭い範囲から並列性を抽出し、実部と虚部を同一レジスタ上に混在するデータ・フォーマットを採用し並列演算をしているため、データ・フォーマット変換のオーバーヘッドが大きい、また並列度が途中で 2 に制限されるという欠点があり、 SIMD 命令に適していない。
【００２３】
本発明の目的は、SIMD 命令に適した高速な FFT 演算手法を提案することにある。具体的には、Radix-4 FFT アルゴリズムを、 4 並列の SIMD 演算命令セット上で常に 4 並列で行え、データのフォーマット変換のオーバーヘッドを削減した高速な演算手法を提案する。
【００２４】
【課題を解決するための手段】
本発明は、Radix-4 FFT の全段において、4 個の 4 点バタフライ演算を SIMD命令を用い並列に処理する。より具体的には、独立した入力データの実部を保持する配列 (図７の 701) と入力データの虚部を保持する配列 (図 7 の 702)、それぞれの配列から連続した 4 要素を取り出し保持する手段 (図 2 )、必要な係数を生成し保持する手段 (図 3 )、および図 13 に示すような SIMD 並列演算命令を実行する機構から構成される。最終段においては、係数の生成および保持手段は不要であるが、 4 行 4 列の行列の転置を高速に行う機構 ( 図 4 、 405、 406、 408、 409) が必要である。 Radix-4 FFT では、隣接する 4 要素は独立したバタフライ演算に属するため、隣接する 4 要素を同一レジスタに入れ、通常のスカラー演算と同様の 4 点バタフライ演算のシグナル・フローに従い、 SIMD 演算命令を用いて処理を行うことで、4 個の 4 点バタフライ演算を並列処理できる。最終段においては、隣接する 4 要素間でバタフライ演算を行うため、前段までとは異なり、隣接する 4 個のバタフライ演算への入力データを 1 個のレジスタにロードする。すなわち 4 行 4 列の行列の転置と同等の操作により 3 要素おきに 4 個のデータをレジスタにロードし、通常のスカラー演算と同様の 4 点バタフライ演算のシグナル・フローに従い、 SIMD 演算命令を用いて処理を行うことで、4 個の 4 点バタフライ演算を並列処理する。
【００２５】
【発明の実施の形態】
本発明の実施例を図１から図９、数2、 3、 4 を用いて説明する。
【００２６】
【数２】

【００２７】
【数３】

【００２８】
【数４】

【００２９】
まず Radix-4 FFT について説明する。 Radix-4 FFT は、従来例で用いていた Radix-2 FFT より演算量が少なく高速である。また、 R. Meyer とK. Schwarz が ICASSP (International Conference on Acoustics、 Speech and Signal Processing) 1990 年、予稿集 1503 から 1506 ページで ``FFT Implementation on DSP Chips - Theory and Practice'' と題して述べているように、 Radix-4 FFT は SR (Split Radix) 型 FFT に比べ、乗算の回数は多少多いもののデータのアドレス計算が単純で加減算の回数が少ないため、乗算も加減算も同じ 1クロックで実行できる最近のマイクロプロセッサ上で高速に実行できる。
【００３０】
N 点の DFT を Radix-4 FFT アルゴリズムで求める時の手順を説明する。ここで N は 4 のべき乗とする。 N 点の DFT をまず 4 個の N/4 点 DFT に分割する。つぎにそれぞれの N/4 点 DFT を 4 個の N/16 点 DFT に分割する。この 4 分割を繰り返し、最終的に N/4 個の 4 点 DFT まで分割し、この 4 点 DFT を直接計算する。
【００３１】
N 点 FFT は log₄N 段の 4 点バタフライ演算から構成される。また各段は N/4 個の 4 点バタフライ演算から構成される。従って、N 点 FFT 全体は N/4 log₄N 個の 4 点バタフライ演算を含む。
【００３２】
4 点 DFT (バタフライ演算) のシグナル・フローを図 5 に示す。これを実部、虚部に分け書き下した数式を数 2 に示す。図 5 中、係数 W は数 3 に示すとおりである。また数 2 では簡単化のため、数 4 に示す中間項 r1、 s1、 r2、 s2、 r3、 s3 を使っている。以下、4 点 DFT の内部演算ではなく、4 点 DFT の入出力に注目するため、図 5 に示した 4 点 DFT を図 6 のように簡略に表記する。
【００３３】
各段のバタフライ演算への入力データ、および係数を説明する。このとき図７に示すように、 N 点 FFT への入力データである N 個の複素数は、実部が配列 xr[0 〜 (N-1)] (701)、虚部が配列 xi[0 〜 (N-1)] (702) に別々に収められているとする。また、係数 W は、虚部と実部が交互に収められているとする (703)。
【００３４】
各段のバタフライ演算への入力データ、および係数の関係を図 8 に示す。第 i 段 (1 ≦ i ≦ log₄N) を構成するバタフライ演算へ入力するデータ ( 図 5中 X0、 X1、 X2、 X3) の、配列 xr (701)、 xi (702) 上での添字の間隔 Sd は N/(4ⁱ) である。また、係数の間隔 ( 図 7 中 m) の間隔 Sc は 4^i-1 である。
【００３５】
64 点 Radix-4 FFT の例で説明する。 64 点 Radix-4 FFT のシグナル・フローを図 9 に示す。 64 点 Radix-4 FFT は、4 点バタフライ演算 3 段から構成され、各段は 16 個のバタフライ演算を含む。第 1 段の 16 個のバタフライ演算のデータ間隔 Sd は 16、係数間隔 Sc は 1 である。
【００３６】
すなわち、例えば第 1 のバタフライ演算 901 における X0 の実部は xr[0]、虚部は xi[0]、 X1 の実部は xr[16]、虚部は xi[16]、 X2 の実部は xr[32]、虚部は xi[32]、 X3 の実部は xr[48]、虚部は xi[48] である。このバタフライ演算の係数 W は数 3 で n=0 とおいた場合に相当し、 Wⁿ = W²ⁿ = W³ⁿ = W⁰ であり、実部が cos(0)、虚部が sin(0) の値を図 7 に示した係数テーブル W (703) から取得する。
【００３７】
次に、第 2 のバタフライ演算 902 における X0 の実部は xr[1]、虚部は xi[1]、 X1 の実部は xr[17]、虚部は xi[17]、 X2 の実部は xr[33]、虚部は xi[33]、 X3 の実部は xr[49]、虚部は xi[49] である。このバタフライ演算の係数 W は数 3 で n を第 1 段における係数間隔 1 とおいた場合に相当し、 W¹ の実部は W[2]、虚部は W[3]、 W² の実部は W[4]、虚部は W[5]、 W³ の実部は W[6]、虚部は W[7] の値を図 7 に示した係数テーブル W (703) から取得する。
【００３８】
64 点 FFT の第 2 段の 16 個のバタフライ演算演算のデータ間隔 Sd は 4、係数間隔 Sc も 4 である。すなわち、例えば第 3 のバタフライ演算 903 における X0 の実部は xr[0]、虚部は xi[0]、 X1 の実部は xr[4]、虚部は xi[4]、 X2 の実部は xr[8]、虚部は xi[8]、 X3 の実部は xr[12]、虚部は xi[13]である。このバタフライ演算の係数 W は数式 3 で n=0 とおいた場合に相当し、 Wⁿ = W²ⁿ = W³ⁿ = W⁰ であり、実部が cos(0)、虚部が sin(0) の値を図 7 に示した係数テーブル W (703) から取得する。
【００３９】
次に第 4 のバタフライ演算 904 における X0 の実部は xr[1]、虚部は xi[1]、 X1 の実部は xr[5]、虚部は xi[5]、 X2 の実部は xr[9]、虚部は xi[9]、 X3 の実部は xr[13]、虚部は xi[13] である。このバタフライ演算の係数 W は数 3 で n を第 2 段における係数間隔 4 とおいた場合に相当し、 W⁴ の実部は W[8]、虚部は W[9]、 W⁸ の実部は W[16]、虚部は W[17]、 W¹² の実部は W[24]、虚部は W[25] の値を図７に示した係数テーブル W (703) から取得する。
【００４０】
第 2 段では、ブロック 0 (907)、ブロック 1 (908)、ブロック 2 (909)、ブロック 3 (910)で係数を共通に使える。数 3 に示した W の定義と、sin、 cos の周期性から、第 2 段では 4 バタフライ演算ごとに同じ係数を用いたバタフライ演算が出現するからである。
【００４１】
64 点 Radix-4 FFT の第 3 段の 16 個のバタフライ演算のデータ間隔 Sd は 1、係数間隔 Sc は 16 である。すなわち、64 点 Radix-4 FFT の最終段である第 3 段では、隣接する 4 要素間で 4 点バタフライ演算を行う。例えば第 5のバタフライ演算 905 における X0 の実部は xr[0]、虚部は xi[0]、 X1 の実部は xr[1]、虚部は xi[1]、 X2 の実部は xr[2]、虚部は xi[2]、 X3 の実部は xr[3]、虚部は xi[3] である。また、第 6 のバタフライ演算 906 における X0 の実部は xr[4]、虚部は xi[4]、 X1 の実部は xr[5]、虚部は xi[5]、 X2 の実部は xr[6]、虚部は xi[6]、 X3 の実部は xr[7]、虚部は xi[7] である。係数は sin、 cos の周期性から単純になる。
【００４２】
以上、Radix-4 FFT のシグナルフローと 64 点 Radix-4 FFT における例を示した。次にこの Radix-4 FFT を SIMD 命令を用いて 4 並列化する本発明の手法を示す。
【００４３】
図 1 に、本発明の Radix-4 FFT の制御の流れをフローチャートで示す。 N を FFT 点数とする。実行開始 (101) 後、データ距離 Sd の初期値を N に、係数距離 Sc の初期値を 1 に設定する (102)。
【００４４】
Sd と Sc の初期設定後、変数 k についてのループ (120) を実行する。このループは最終段以外の各段のバタフライ演算に対応しており、N 点 Radix-4 FFT では (log₄N) - 1 回繰り返される。これは k を N に初期化し (103)、ループ末尾で条件 114 が成立している限り、 k を更新しながらループ内を実行することで行う。
【００４５】
変数 k についてのループ (120) 内部では、図 8 に示したとおりデータ距離 Sd をバタフライ演算 1 段ごとに 1/4 にする。さらに係数の配列 W をアクセスする時の添字となる変数 a を 0 に初期化する (104)。その後、変数 j についてのループ (121) を実行する。
【００４６】
変数 j についてのループ (121) は係数 W のアクセスに対応する。 N 点 FFTの、例えば第 1 段では、N/4 個のバタフライ演算を実行するため、後述するように 4 個のバタフライ演算を並列に処理する本発明では、変数 j のループはN/16 回繰り返される。第 2 段では、第 1 段と同様 N/4 個のバタフライ演算を実行するが、 4 個のバタフライ演算間で係数の組が共有できるため、 4 個のバタフライ演算を並列に処理する本発明では、変数 j のループは第 1 段の 1/4 の N/64 回繰り返される。これは j を 0 に初期化し (105)、ループ末尾で条件 113 が成立している限り j を更新しながらループ内の係数アクセスを実行することで行う。係数アクセスの詳細については、図 3 を参照しながら後述する。係数アクセス後には係数アクセス時の添字 a を係数距離 Sc だけ増加させる。
【００４７】
変数 i0 についてのループ (122) は、第 1 の 4 並列バタフライ演算 (110) および演算に必要なデータのアクセス (109)、演算結果の書き込み (111) に対応する。 N 点 FFT の、例えば第 1 段では、1 組の係数は 1 組のバタフライ演算でのみ用いられるため、j についてのループ 1 回につき、i0 についてのループも 1 回実行する。第 2 段では、1 組の係数を 4 組のバタフライ演算で再利用できるため、j についてのループ 1 回につき、i0 についてのループを 4 回実行する。これは、i0 を j に初期化し (108)、ループ末尾で条件 112 が成立している限り i0 を更新しながらデータ・アクセス (109)、第 1 の 4 並列バタフライ演算 (110)、結果書き込み (111) を繰り返すことで行う。
【００４８】
データ・アクセス (109) の詳細について図 2 を参照しながら説明する。本発明の並列 Radix-4 FFT では、配列 xr (201)中、添字 i0 から始まる連続した 4 要素 (202) を収めたレジスタ xr0 (206)、配列 xi (218)中、添字 i0 から始まる連続した 4 要素 (211)を収めたレジスタ xi0 (215)、配列 xr (201)中、添字 i0+Sd から始まる連続した 4 要素 (203) を収めたレジスタ xr1 (207)、配列 xi (208)中、添字 i0+Sd から始まる連続した 4 要素 (212) を収めたレジスタ xi1 (216)、配列 xr (201)中、添字 i0+2*Sd から始まる連続した 4 要素 (204) を収めたレジスタ xr2 (208)、配列 xi (208)中、添字 i0+2*Sd から始まる連続した 4 要素 (213)を収めたレジスタ xi2 (217)、配列 xr (201)中、添字 i0+3*Sd から始まる連続した 4 要素 (205) を収めたレジスタ xr3 (209)、配列 xi (208)中、添字 i0+3*Sd から始まる連続した 4 要素 (214) を収めたレジスタ xi3 (218) を入力とする、独立した 4 組のバタフライ演算を SIMD 命令を用いて並列に行う。配列 xr、 xi 中の連続した 4 要素は、64 ビット長のロード命令 1 命令でレジスタに格納できる。
【００４９】
xr0、 xi0、 xr1、 xi1、 xr2、 xi2、 xr3、 xi3 に対し、次に述べる係数アクセス (106) において取得した s1、 c1、 s2、 c2、 s3、 c3 とともに、数 2 に従って 4 点バタフライ演算を行う。 4 並列 SIMD 演算に用いるデータを格納する 64 ビット・レジスタの各ハーフワード位置は互いに独立であり、同一のハーフワード位置間でのみ乗算、加減算などの演算が行われるため、一般のスカラー型のデータに対する 4 点バタフライ演算の定義式である数 2 をそのまま SIMD データ型に適用すると、4 組のバタフライ演算が同時に求められる。
【００５０】
64 点 Radix-4 FFT を例にとり、データ・アクセスを具体的に説明する。図 10は、64 点 Radix-4 FFT 第 1 段の、第 1、第 2 回めの i0 についてのループ (122) 内部の実行において、並列処理されるバタフライ演算を示している。第 1 回めの実行では、入力データとして xr[0〜3]、 xi[0〜3]、 xr[16〜19]、 xi[16〜19]、 xr[32〜35]、 xi[32〜35]、 xr[48〜51]、 xi[48〜51] をレジスタ xr0 (1001)、 xi0 (1002)、 xr1 (1003)、 xi1 (1004)、 xr2 (1005)、 xi2 (1006)、 xr3 (1007) xi3 (1008) に格納し、数 2 のシグナル・フローに従って 4 点バタフライ演算を実行する。これにより、図 9 に示した 64 点 FFT の第 1 段のバタフライ演算 16 個のうち、図 10 中実線で示したバタフライ演算 4 個 (1009、 1010、 1011、 1012) が並列に計算できる。次の繰り返しでは、入力データとして xr[4〜7]、 xi[4〜7]、 xr[20〜23]、 xi[20〜23]、 xr[36〜39]、 xi[36〜39]、 xr[52〜55]、 xi[52〜55] をレジスタに格納し、 64点 FFT の第 1 段のバタフライ演算のうち、図 10中破線で示したバタフライ演算 4 個が並列に計算できる。以下同様に 64 点 Radix-4 FFT では、第 1 段のバタフライ演算、第 2 段のバタフライ演算各 16 個を 4 個ずつ計算する。
【００５１】
次に係数アクセスの詳細について、図 3 を参照しながら説明する。 1 組のバタフライ演算では、図 5 に示したように、3 個の複素数の係数 W が必要である。本発明では、4 組のバタフライ演算を並列に行うので、係数アクセス (106) では、4 組のバタフライ演算に必要な 12 個の複素数の係数を、係数テーブル W (703) からレジスタに読み出す。
【００５２】
あるバタフライ演算が数 3 に示した Wⁿ として係数テーブル中の a 番目の複素数の要素を必要としているとする。このときこの複素数の虚部は係数テーブル (703) 中 W[a*2]、実部は W[a*2+1] に格納されている。このバタフライ演算は同時に W²ⁿ として、虚部が W[a*4]、実部が W[a*4+1] である複素数を、 W³ⁿ として、虚部が W[a*6]、実部が W[a*6+1] である複素数を必要とする。
【００５３】
このバタフライ演算と同時に演算される 2 番目のバタフライ演算は、係数として係数テーブル W (703) 上で係数距離 Sc だけ離れた位置に格納されている W[(a+Sc)*2]、 W[(a+Sc)*2+1]、 W[(a+Sc)*4]、 W[(a+Sc)*4+1]、 W[(a+Sc)*6]、 W[(a+Sc)*6+1] を必要とする。同様に、3 番目のバタフライ演算は、係数として係数テーブル W (703) 上でさらに係数距離 Sc だけ離れた位置に格納されている W[(a+2*Sc)*2]、 W[(a+2*Sc)*2+1]、 W[(a+2*Sc)*4]、 W[(a+2*Sc)*4+1]、 W[(a+2*Sc)*6]、 W[(a+2*Sc)*6+1] を必要とする。同様に、4 番目のバタフライ演算は、係数として係数テーブル W (703) 上でさらに係数距離 Sc だけ離れた位置に格納されている W[(a+3*Sc)*2]、 W[(a+3*Sc)*2+1]、 W[(a+3*Sc)*4]、 W[(a+3*Sc)*4+1]、 W[(a+3*Sc)*6]、 W[(a+3*Sc)*6+1] を必要とする。
【００５４】
図 3 に示すように、係数テーブル上の必要な値は、いったん 2 個ずつレジスタに格納され、 6 個のレジスタに詰め込まれる。各レジスタ上には、各係数の実部または虚部に対応する値が、 4 バタフライ演算分格納される。このフォーマット変換は、例えば NEC V830R プロセッサの場合 2 段階のフォーマット変換命令 (314、 315、 316、 317) で行える。 4 バタフライ演算分のW¹に対応する実部の値を格納したレジスタを c1 (309)、虚部の値を格納したレジスタを s1 (308)、 W²に対応する実部の値を格納したレジスタを c2 (311)、虚部の値を格納したレジスタを s2 (310)、 W³に対応する実部の値を格納したレジスタを c3 (313)、虚部の値を格納したレジスタを s3 (312)とおく。
【００５５】
最終段の演算について、図 4 を参照しながら説明する。図 9 で示したように、最終段より前では、配列 xr (701)、 xi (712) の連続した 4 要素は独立したバタフライ演算に属しているため、連続 4 要素をレジスタに格納し、これら 4 個の要素に対し同じシグナル・フローに従って演算することにより、 4 個のバタフライ演算を並列に演算できた。最終段でも、前段までと同様、4 個のバタフライ演算を同一のシグナル・フローに従い並列に処理する。しかし、最終段では図 9に示した 64 点 FFT の例でもわかるように、データ配列 xr (701)、 xi (712) の連続した 4 要素は同じバタフライ演算に属しており、依存関係があるため、異なった並列化手法をとる必要がある。
【００５６】
そこで最終段では、データのフォーマット変換を行い、隣接する 4 個のバタフライ演算を SIMD 命令を用いて並列に行う。最終段の処理の流れを図 4 に示す。最初に j を 0 に初期化し (401)、データ配列 xr (701)、 xi (702) の、添字 j から始まる各 16 個の要素の実部をレジスタ xr0、 xr1、 xr2、 xr3 にロードし (403)、虚部をレジスタ xi0、 xi1、 xi2、 xi3 にロードすると (404)、連続する 4 要素が同一レジスタにロードされる。次に実部、虚部各 4 本のレジスタ上に置いた 16 個の要素を、4 行 4 列の行列とみなし、行列の転置を行う (405、 406)。このデータ・フォーマット変換によりひとつのバタフライ演算に必要な、連続 4 要素は別のレジスタに格納され、独立した 4個のバタフライ演算に必要な 4 要素は同一レジスタの異なるハーフワード位置に格納できる。このフォーマット変換のオーバーヘッドは小さい。例えば NEC V830R プロセッサでは、フォーマット変換命令 8 命令で 4 個のレジスタ上の 16 要素の転置が行える。
【００５７】
転置後のレジスタ xr0、 xi0、 xr1、 xi1、 xr2、 xi2、 xr3、 xi3 に対し、図 5 に示したシグナル・フローに従って 4 点バタフライ演算を行うと、隣接する 4 個のバタフライ演算に対し並列演算が行える (407)。バタフライ演算の結果は再び転置し (408、 409)、配列 xr (701)、 xi (702) に最終結果として書き込む (410)。
【００５８】
本発明の実施例は、 64 ビット・レジスタ上に 16 ビット・データを 4 個格納し、 4 並列演算を行うという、現在一般的な SIMD 命令セットを用い説明したが、本発明はレジスタのビット長、各データの長さ、並列度が異なるプロセッサに対しても容易に適用できる。
【００５９】
【発明の効果】
本発明の Radix-4 FFT 演算手法によれば、4 並列 SIMD 命令を持つプロセッサ上で常に 4 個のバタフライ演算を並列に行うことができ、高い並列性を達成できる。また、同一レジスタに実部のみもしくは虚部のみを収めるという規則的なデータ構造を採用しているため、データのフォーマット変換のオーバーヘッドを最小化できる。結果として、高速な Radix-4 FFT の演算が可能である。例えば、 NEC V830R プロセッサ上で本発明の手法により 256 点 FFT を実行した場合、所要クロック数は4,093 クロックである。これは V830R と類似した SIMD 命令セットを持つ Intel PentiumII プロセッサ上で従来の手法により 256 点 FFT を実行した場合の所要クロック数5,522 クロックに比べ、３５％も高速である。
【図面の簡単な説明】
【図１】本発明の FFT 方法を示すフローチャートである。
【図２】データアクセス方法を示す図である。
【図３】係数アクセス方法を示す図である。
【図４】本発明の FFT の最終段の処理を示すフローチャートである。
【図５】4 点バタフライ演算のシグナル・フロー図である。
【図６】の簡略化した 4 点バタフライ演算のシグナル・フロー図である。
【図７】本発明の FFT におけるデータ構造を示す図である。
【図８】バタフライ演算で使用するデータの間隔と係数の間隔を示す表である。
【図９】64 点 Radix-4 FFT のシグナル・フロー図である。
【図１０】本発明の方法で 64 点 FFT を 4 並列化した場合のシグナル・フロー図である。
【図１１】2 点バタフライ演算のシグナル・フロー図である。
【図１２】従来の並列 FFT 演算手法を示す図である。
【図１３】SIMD 演算命令の動作例を説明する図である。
【符号の説明】
201 入力データ X の実部を収めた配列 xr
206、207、208、209 xr の 4 要素を収めた 64 ビット・レジスタ
214、215、216、217 64 ビット・レジスタ
218 入力データ X の虚部を収めた配列 xi
301 係数テーブル W
302、303、304、305 係数テーブルのデータを 2 個ロードした 64 ビット・レジスタ
306、307 フォーマット変換の中間結果を保持する 64 ビット・レジスタ
308、309 4 組の係数を保持する 64 ビット・レジスタ
310、311 4 組の係数を保持する 64 ビット・レジスタ
312、313 4 組の係数を保持する 64 ビット・レジスタ
701 入力データ X の実部を収めた配列 xr
702 入力データ X の実部を収めた配列 xi
703 係数テーブル W
901 第 1 のバタフライ演算
902 第 2 のバタフライ演算
903 第 3 のバタフライ演算
904 第 4 のバタフライ演算
905 第 5 のバタフライ演算
906 第 6 のバタフライ演算
907 第 0 ブロック
908 第 1 ブロック
909 第 2 ブロック
910 第 3 ブロック
1001 最初の繰り返しで使用する xr0 の値を保持する 64 ビット・レジスタ
1002 最初の繰り返しで使用する xi0 の値を保持する 64 ビット・レジスタ
1003 最初の繰り返しで使用する xr1 の値を保持する 64 ビット・レジスタ
1004 最初の繰り返しで使用する xi1 の値を保持する 64 ビット・レジスタ
1005 最初の繰り返しで使用する xr2 の値を保持する 64 ビット・レジスタ
1006 最初の繰り返しで使用する xi2 の値を保持する 64 ビット・レジスタ
1007 最初の繰り返しで使用する xr3 の値を保持する 64 ビット・レジスタ
1008 最初の繰り返しで使用する xi3 の値を保持する 64 ビット・レジスタ
1009 第 7 の 4 点バタフライ演算
1010 第 8 の 4 点バタフライ演算
1011 第 9 の 4 点バタフライ演算
1012 第 10 の 4 点バタフライ演算
1201 メイン・メモリ
1202 データ配列 X
1203 係数テーブル W
1204、1205、1206、1207、1208、1209、1210、1211、1212、1213、1214、1215、1216、 1217、1218、1219 64 ビット・レジスタ
1301 64 ビット・レジスタ
1302 64 ビット・レジスタ
1303 64 ビット・レジスタ[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a fast Fourier transform, and more particularly to a fast computation method and apparatus for fast Fourier transform using a microprocessor.
[0002]
[Prior art]
The Fourier Transform is a technique for transforming a time domain signal into the frequency domain. In digital signal processing, Discrete Fourier Transform (DFT), which is a Fourier transform of discrete-time signals, is widely applied to signal frequency analysis, speech recognition, and high-speed block filter computation. However, because DFT has a large amount of calculation in principle, a fast Fourier transform (FFT), which is a high-speed calculation method, is used in an actual system.
[0003]
FFT reduces the amount of computation by repeatedly dividing DFT into smaller DFTs. There are various methods such as Radix-2 FFT, Radix-4 FFT, etc. depending on the division method. In Radix-2 FFT, if the number of input data points is N, the N-point DFT is divided into two N / 2-point DFTs. Next, each N / 2 point DFT is divided into two N / 4 point DFTs. This is repeated, and finally it is divided into N / 2 2-point DFT, and this 2-point DFT is calculated directly. In Radix-4, the N-point DFT is sequentially divided into 1/4 and finally divided into 4-point DFT. The 2-point DFT and 4-point DFT are also called “butterfly computation” because of their signal flow shape.
[0004]
Since FFT uses a lot of multiplication, a digital signal processor (DSP) with a built-in multiplier has been adopted in systems that require high-speed FFT. Recently, there has been a demand for software-based multimedia signal processing in products using microprocessors such as personal computers and information home appliances, and some general-purpose microprocessors have built-in multipliers.
[0005]
In addition, some microprocessors have introduced SIMD (Single Instruction stream-Multiple Data streams) parallel operation instructions to achieve even higher multimedia signal processing performance. An example of an expansion of such a microprocessor is Intel, published in Nikkei Electronics, No. 661, May 6, 1996, pages 105-119, titled "Dissecting MMX multimedia instructions for 86 microprocessors". IEEE MMIC Magazine, Vol. By Kazumasa Suzuki et al. 18, No. 2, April 1998, pages 36-47 include NEC's V830R instruction set titled “V830R / AV: Embedded multimedia RISC processor”.
[0006]
In these instruction sets for multimedia signal processing, for example, four 16-bit values are stored in a 64-bit register, and four parallel operations can be performed on these four values with one instruction. Such a parallel operation instruction is hereinafter referred to as a SIMD instruction. Figure 13 shows an example of a typical SIMD instruction. In this example, four halfword data (a0, b0, a1, b1, a2, b2, a3, b3) are packed on the 64-bit registers r1 (1301) and r2 (1302). Performs four independent additions (a0 + b0, a1 + b1, a2 + b2, a3 + b3) between the same halfword positions on r1 (1301) and r2 (1302) and puts them on register r3 (1303) The result is 4 halfwords long. Here, halfword means 16-bit data length.
[0007]
The SIMD instruction performs the same operation on multiple sets of data in parallel between the same halfword positions on different registers, so multiple sets of independent data processed according to exactly the same signal flow in parallel Most suitable for processing. Conversely, when performing operations between different halfword positions on the same register, it is necessary to use a format conversion instruction in addition to the SIMD operation instruction, resulting in overhead.
[0008]
Attempts to speed up FFT using these SIMD instructions have already been announced. Intel shows an example of speeding up a Radix-2 FFT using SIMD instructions in application note AP-555, order number 243040-001, March 1996, "Using MMX Instructions to Perform Complex 16-bit FFT". ing. However, in the example shown in this application note, as shown in detail below, the data conversion overhead is large compared to the amount of calculation, and high-speed FFT calculation cannot be performed. The FFT calculation method using SIMD instructions (hereinafter referred to as the conventional method) described in the Intel application note is explained in detail.
[0009]
Figure 11 shows the signal flow of the 2-point butterfly computation, which is a component of the Radix-2 FFT. The 2-point FFT input is X0, X1, and the output is Y0, Y1. In this way, in Fig. 11, different variable names are assigned to the input and output in order to distinguish the input and output. However, in practice, Y0 is stored in X0 and Y1 is stored in the same location as X1, and the next stage butterfly There are many operations. X0, X1, Y0, Y1 are complex numbers.When it is necessary to distinguish the real part from the imaginary part, for example, the real part of X0 is xr0, the imaginary part is xi0, the imaginary unit is j, and X0 = xr0 + j Let's write xi0. One branch in the signal flow has a coefficient W called a twiddle factor. c is the real part of the coefficient W and s is the imaginary part of the coefficient W.
[0010]
In general, in order to actually calculate, it is necessary to handle the real part and the imaginary part separately. Therefore, Equation 1 shows the formula that writes the signal flow of the 2-point FFT in Fig. 11 separately for the real part and the imaginary part.
[0011]
[Expression 1]

[0012]
Figure 12 shows an example of parallelizing the operation of Equation 1 using SIMD instructions. As a premise, the input data xr0, xi0, xr1, and xi0 of the two-point butterfly operation are stored in the data array X (1202) on the main memory 1201, and the twiddle factors c and s are stored in the coefficient table W (1203). And The result of the 2-point butterfly operation is stored again in the data array X (1202).
[0013]
First, xr1 and xi1 from data array X (1202) are loaded into the lower two halfwords of two 64-bit registers (1204, 1205), and four in one 64-bit register 1206 using the format conversion instruction PUNPACKLdq. Of halfword data.
[0014]
Next, a product-sum operation PMADDwd is performed between the register 1206 storing four halfword data and the register 1207 loaded with the coefficients of four halfword lengths from the coefficient table 1203. The result 1208 of the multiply-accumulate operation is obtained in 32-bit length, but it is right-shifted with the PSRAd instruction and truncated to a halfword (1209).
[0015]
Multiply-and-accumulate result 1215 truncated to halfword length is multiplied by -1 by subtracting from zero (1214) (1216). Format conversion is performed between the original halfword length product-sum operation result 1209 using the PACKSSdw instruction, and four product-sum operation results are stored in 64-bit register 1211.
[0016]
In addition, xr0 and xi0 are loaded into the lower two halfwords of another 64-bit register (1217, 1218), and four halfword data are packed into one 64-bit register 1219 using the format conversion instruction PUNPACKLdq. . When this is added in parallel by register 1211 and SIMD add instruction, the final result is obtained (1213). The final result is stored in the data array on the main memory for the next calculation or to leave the final result (1220). yr0 is stored at xr0, yi0 is stored at xi0, yr1 is stored at xr1, and yi1 is stored at the position where xi1 was stored.
[0017]
Thus, in the conventional method, parallelism is extracted from one two-point butterfly operation, and four parallel SIMD instructions are applied to increase the speed.
[0018]
[Problems to be solved by the invention]
The conventional method shown in Fig. 12 has the following problems and is not suitable for SIMD instructions.
[0019]
The data input to one 2-point butterfly operation is only 4 data points even if only 2 sets of complex numbers, real part and imaginary part are handled separately. From this small number of data points, it is quite difficult to extract four parallel data parallelism in order to apply typical SIMD operation instructions without waste.
[0020]
In the example shown in Fig. 12, 4 parallel data parallelism is secured by adopting a data format that mixes different data of real data and imaginary data on the same register (1206). . To realize this data format, the data format conversion instruction PUNPKLdq must be applied to the data loaded from the main memory. This data format conversion command is applied twice for the xr0, xi0 pair and for the xr1, xi1 pair.
[0021]
In addition, the parallelism has dropped to 2 because the multiplication result of the least significant halfword and the multiplication result of the second halfword from the least significant are added to obtain a 32-bit result (1208). Furthermore, there is a disadvantage that an extra right shift instruction is required to convert the 32-bit result to 16-bit format.
[0022]
In this way, the parallelism extraction method used in the conventional method is a data format in which parallelism is extracted from a narrow range of one two-point butterfly operation and the real part and imaginary part are mixed on the same register. Because of the parallel operation, the data format conversion overhead is large and the parallelism is limited to 2 in the middle, which is not suitable for SIMD instructions.
[0023]
An object of the present invention is to propose a high-speed FFT calculation method suitable for SIMD instructions. Specifically, we propose a high-speed calculation method that can always perform the Radix-4 FFT algorithm in 4 parallel on the 4 parallel SIMD calculation instruction set and reduce the data format conversion overhead.
[0024]
[Means for Solving the Problems]
The present invention processes four 4-point butterfly operations in parallel using SIMD instructions in all stages of the Radix-4 FFT. More specifically, an array that holds the real part of the independent input data (701 in Fig. 7) and an array that holds the imaginary part of the input data (702 in Fig. 7), extract four consecutive elements from each array. It comprises a means for holding (FIG. 2), a means for generating and holding necessary coefficients (FIG. 3), and a mechanism for executing SIMD parallel operation instructions as shown in FIG. In the final stage, coefficient generation and holding means are not required, but a mechanism (Fig. 4, 405, 406, 408, and 409) that transposes a 4-by-4 matrix at high speed is required. In the Radix-4 FFT, the four adjacent elements belong to independent butterfly operations, so the four adjacent elements are placed in the same register, and the SIMD operation instructions are executed according to the signal flow of the same four-point butterfly operation as the normal scalar operation. By using this method, four 4-point butterfly operations can be processed in parallel. In the final stage, butterfly operations are performed between four adjacent elements, and unlike the previous stage, input data to four adjacent butterfly operations is loaded into one register. In other words, four data is loaded into the register every three elements by the same operation as transposing a 4-by-4 matrix, and SIMD operation instructions are used according to the signal flow of the 4-point butterfly operation similar to normal scalar operation. By performing this process, four four-point butterfly operations are processed in parallel.
[0025]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described with reference to FIGS. 1 to 9 and

equations

2, 3, and 4. FIG.
[0026]
[Expression 2]

[0027]
[Equation 3]

[0028]
[Expression 4]

[0029]
First, Radix-4 FFT is explained. The Radix-4 FFT requires less computation and is faster than the Radix-2 FFT used in the previous example. Also, R. Meyer and K. Schwarz stated in the ICASSP (International Conference on Acoustics, Speech and Signal Processing) 1990, Proceedings 1503 to 1506, titled `` FFT Implementation on DSP Chips-Theory and Practice ''. As shown in the figure, the Radix-4 FFT has a slightly larger number of multiplications than the SR (Split Radix) type FFT, but the data address calculation is simple and the number of additions / subtractions is small. It can be executed at high speed on a recent microprocessor.
[0030]
The procedure for obtaining the N-point DFT using the Radix-4 FFT algorithm is described below. Here, N is a power of 4. First, the N-point DFT is divided into four N / 4-point DFTs. Next, each N / 4 point DFT is divided into four N / 16 point DFTs. This four division is repeated, and finally it is divided into N / 4 4-point DFT, and this 4-point DFT is calculated directly.
[0031]
N-point FFT is log _Four It consists of N stages of 4-point butterfly computation. Each stage consists of N / 4 four-point butterfly operations. Therefore, the entire N-point FFT is N / 4 log. _Four Includes N 4-point butterfly operations.
[0032]
Figure 5 shows the signal flow of a 4-point DFT (butterfly computation). Equation 2 shows the mathematical expression in which this is divided into a real part and an imaginary part. In Figure 5, the coefficient W is as shown in Equation 3. For simplicity, Equation 2 uses the intermediate terms r1, s1, r2, s2, r3, and s3 shown in Equation 4. In the following, the 4-point DFT shown in Fig. 5 is simply expressed as shown in Fig. 6 in order to focus on the input / output of the 4-point DFT instead of the internal calculation of the 4-point DFT.
[0033]
The input data and coefficients for the butterfly computation at each stage will be described. At this time, as shown in FIG. 7, the N complex numbers that are input data to the N-point FFT have the real part array xr [0 to (N-1)] (701) and the imaginary part array xi [0 to (N-1)] (702) separately. The coefficient W is assumed to contain imaginary parts and real parts alternately (703).
[0034]
Figure 8 shows the relationship between the input data and coefficients for the butterfly computation at each stage. I-th stage (1 ≤ i ≤ log _Four N) of the data (X0, X1, X2, X3 in Fig. 5) to be input to the butterfly computation, the subscript spacing Sd on the arrays xr (701) and xi (702) is N / (4 ⁱ ). Also, the coefficient Sc (m in Fig. 7) has an interval Sc of 4 ^i-1 It is.
[0035]
A 64-point Radix-4 FFT is used as an example. Figure 9 shows the signal flow of the 64-point Radix-4 FFT. The 64-point Radix-4 FFT consists of 3 stages of 4-point butterfly computation, each stage containing 16 butterfly computations. The data interval Sd of 16 butterfly operations in the first stage is 16, and the coefficient interval Sc is 1.
[0036]
For example, in the first butterfly operation 901, the real part of X0 is xr [0], the imaginary part is xi [0], the real part of X1 is xr [16], the imaginary part is xi [16], and the real part of X2 Is xr [32], the imaginary part is xi [32], the real part of X3 is xr [48], and the imaginary part is xi [48]. The coefficient W of this butterfly operation corresponds to the case where n = 0 in Equation 3 and W ⁿ = W ²ⁿ = W ³ⁿ = W ⁰ The real part is cos (0) and the imaginary part is sin (0). The value is obtained from the coefficient table W (703) shown in Fig.7.
[0037]
Next, in the second butterfly operation 902, the real part of X0 is xr [1], the imaginary part is xi [1], the real part of X1 is xr [17], the imaginary part is xi [17], and the real part of X2 Is xr [33], the imaginary part is xi [33], the real part of X3 is xr [49], and the imaginary part is xi [49]. The coefficient W of this butterfly calculation is equivalent to the case where n is 3 and the coefficient interval is 1 in the first stage. ¹ The real part is W [2], the imaginary part is W [3], W ² The real part is W [4], the imaginary part is W [5], W ^Three The real part of W [6] and the imaginary part of W [7] are obtained from the coefficient table W (703) shown in Fig. 7.
[0038]
The data interval Sd for the 16 butterfly computations in the second stage of the 64-point FFT is 4, and the coefficient interval Sc is 4. For example, in the third butterfly operation 903, the real part of X0 is xr [0], the imaginary part is xi [0], the real part of X1 is xr [4], the imaginary part is xi [4], and the real part of X2 Is xr [8], the imaginary part is xi [8], the real part of X3 is xr [12], and the imaginary part is xi [13]. The coefficient W of this butterfly operation corresponds to the case where n = 0 in Equation 3, W ⁿ = W ²ⁿ = W ³ⁿ = W ⁰ The real part is cos (0) and the imaginary part is sin (0). The value is obtained from the coefficient table W (703) shown in Fig.7.
[0039]
Next, in the fourth butterfly operation 904, the real part of X0 is xr [1], the imaginary part is xi [1], the real part of X1 is xr [5], the imaginary part is xi [5], and the real part of X2 is xr [9], the imaginary part is xi [9], the real part of X3 is xr [13], and the imaginary part is xi [13]. The coefficient W of this butterfly operation is equivalent to the case where n is 3 and the coefficient interval is 4 in the second stage. ^Four The real part is W [8], the imaginary part is W [9], W ⁸ The real part is W [16], the imaginary part is W [17], W ¹² The real part of W [24] and the imaginary part of W [25] are obtained from the coefficient table W (703) shown in Fig. 7.
[0040]
In the second stage, the coefficients can be used in common for block 0 (907), block 1 (908), block 2 (909), and block 3 (910). This is because, from the definition of W shown in Eq. 3 and the periodicity of sin and cos, butterfly operations using the same coefficient appear every four butterfly operations in the second stage.
[0041]
The data interval Sd for the 16 butterfly operations in the third stage of the 64-point Radix-4 FFT is 1, and the coefficient interval Sc is 16. In other words, the third stage, which is the final stage of the 64-point Radix-4 FFT, performs a four-point butterfly operation between four adjacent elements. For example, in the fifth butterfly operation 905, the real part of X0 is xr [0], the imaginary part is xi [0], the real part of X1 is xr [1], the imaginary part is xi [1], and the real part of X2 is xr [2], the imaginary part is xi [2], the real part of X3 is xr [3], and the imaginary part is xi [3]. In the sixth butterfly operation 906, the real part of X0 is xr [4], the imaginary part is xi [4], the real part of X1 is xr [5], the imaginary part is xi [5], and the real part of X2 is xr [6], the imaginary part is xi [6], the real part of X3 is xr [7], and the imaginary part is xi [7]. The coefficient is simple due to the periodicity of sin and cos.
[0042]
The above is an example of Radix-4 FFT signal flow and 64-point Radix-4 FFT. Next, the method of the present invention that parallelizes this Radix-4 FFT using SIMD instructions is shown.
[0043]
Fig. 1 is a flowchart showing the control flow of the Radix-4 FFT of the present invention. Let N be the number of FFT points. After the start of execution (101), the initial value of the data distance Sd is set to N, and the initial value of the coefficient distance Sc is set to 1 (102).
[0044]
After initial setting of Sd and Sc, execute loop (120) for variable k. This loop supports butterfly computation at each stage other than the final stage, and for N-point Radix-4 FFT (log _Four N)-repeated once. This is done by initializing k to N (103) and executing in the loop while updating k as long as condition 114 is met at the end of the loop.
[0045]
Inside the loop (120) for variable k, as shown in Fig. 8, the data distance Sd is set to 1/4 for each stage of butterfly computation. Furthermore, the variable a, which is a subscript for accessing the coefficient array W, is initialized to 0 (104). Then, the loop (121) for variable j is executed.
[0046]
The loop (121) for variable j corresponds to accessing the coefficient W. For example, in the first stage of the N-point FFT, N / 4 butterfly operations are executed. Therefore, in the present invention in which four butterfly operations are processed in parallel as described later, the loop of variable j is N / 16 times. Repeated. In the second stage, as in the first stage, N / 4 butterfly operations are executed. However, since a set of coefficients can be shared among the four butterfly operations, in the present invention, which processes the four butterfly operations in parallel. The loop of variable j is repeated N / 64 times 1/4 of the first stage. This is done by initializing j to 0 (105) and executing coefficient access in the loop while updating j as long as condition 113 is satisfied at the end of the loop. Details of coefficient access will be described later with reference to FIG. After the coefficient access, the subscript a at the coefficient access is increased by the coefficient distance Sc.
[0047]
The loop (122) for variable i0 corresponds to the first 4-parallel butterfly operation (110), access of data necessary for the operation (109), and writing of the operation result (111). In the first stage of an N-point FFT, for example, one set of coefficients is used only for one set of butterfly operations, so one loop for j and one loop for i0 are executed. In the second stage, one set of coefficients can be reused in four sets of butterfly operations, so one loop for j is executed four times for i0. This initializes i0 to j (108), updates data i0 as long as condition 112 is met at the end of the loop (109), data access (109), first 4-parallel butterfly operation (110), write result ( Repeat step 111).
[0048]
Details of data access (109) will be described with reference to FIG. In the parallel Radix-4 FFT of the present invention, a register xr0 (206) containing four consecutive elements (202) starting from the subscript i0 in the array xr (201), and a continuous sequence starting from the subscript i0 in the array xi (218). In register xi0 (215) containing 4 elements (211), array xr (201), register xr1 (207) containing 4 consecutive elements (203) starting from subscript i0 + Sd, in array xi (208), Register xi1 (216) containing 4 consecutive elements (212) starting from subscript i0 + Sd, register xr2 (4) containing 4 consecutive elements (204) starting from subscript i0 + 2 * Sd in array xr (201) 208), register xi (208), starting from subscript i0 + 2 * Sd Register xi2 (217) containing 4 consecutive elements (213), array xr (201) starting from subscript i0 + 3 * Sd The register xr3 (209) containing the four elements (205) and the register xi3 (218) containing the four consecutive elements (214) starting from the subscript i0 + 3 * Sd in the array xi (208) are input. Independent Four sets of butterfly operations are performed in parallel using SIMD instructions. Four consecutive elements in arrays xr and xi can be stored in a register with one 64-bit load instruction.
[0049]
For xr0, xi0, xr1, xi1, xr2, xi2, xr3, xi3, together with s1, c1, s2, c2, s3, c3 obtained in the coefficient access (106) described below, a 4-point butterfly operation is performed according to Equation 2. Do. 4 Each halfword position of the 64-bit register that stores data used for parallel SIMD operations is independent of each other, and operations such as multiplication and addition / subtraction are performed only between the same halfword positions. If we apply the formula 2, which is the definition formula for the four-point butterfly operation, to the SIMD data type as it is, four sets of butterfly operations can be obtained simultaneously.
[0050]
Taking a 64-point Radix-4 FFT as an example, data access will be explained in detail. Figure 10 shows the butterfly operation that is processed in parallel in the loop (122) execution for the first and second i0 of the first stage of the 64-point Radix-4 FFT. In the first execution, xr [0-3], xi [0-3], xr [16-19], xi [16-19], xr [32-35], xi [32 ~ 35], xr [48 ~ 51], xi [48 ~ 51] are registered in registers xr0 (1001), xi0 (1002), xr1 (1003), xi1 (1004), xr2 (1005), xi2 (1006), xr3 ( 1007) Store in xi3 (1008) and execute the 4-point butterfly operation according to the signal flow of Equation 2. As a result, out of the 16 butterfly operations in the first stage of the 64-point FFT shown in Fig. 9, 4 butterfly operations (1009, 1010, 1011, and 1012) shown by the solid line in Fig. 10 can be calculated in parallel. In the next iteration, the input data is xr [4-7], xi [4-7], xr [20-23], xi [20-23], xr [36-39], xi [36-39], xr [52 ~ 55] and xi [52 ~ 55] are stored in the registers, and the four butterfly operations shown by the broken lines in Fig. 10 can be calculated in parallel among the first stage butterfly operations of 64-point FFT. Similarly, the 64-point Radix-4 FFT calculates four 16 each for the first stage butterfly computation and the second stage butterfly computation.
[0051]
Next, the details of coefficient access will be described with reference to FIG. In one set of butterfly operations, as shown in Fig. 5, three complex coefficients W are required. In the present invention, four sets of butterfly operations are performed in parallel. In coefficient access (106), 12 complex coefficients necessary for the four sets of butterfly operations are read out from the coefficient table W (703) to a register.
[0052]
A certain butterfly operation shown in Equation 3 ⁿ Suppose that we need the element of the ath complex number in the coefficient table. At this time, the imaginary part of this complex number is stored in W [a * 2] and the real part is stored in W [a * 2 + 1] in the coefficient table (703). This butterfly operation ²ⁿ A complex number whose imaginary part is W [a * 4] and whose real part is W [a * 4 + 1] ³ⁿ As a result, we need a complex number whose imaginary part is W [a * 6] and whose real part is W [a * 6 + 1].
[0053]
The second butterfly operation, which is performed at the same time as this butterfly operation, is stored as W [(a + Sc) * 2], W [ (a + Sc) * 2 + 1], W [(a + Sc) * 4], W [(a + Sc) * 4 + 1], W [(a + Sc) * 6], W [(a + Sc) * 6 + 1] is required. Similarly, the third butterfly operation is stored as W [(a + 2 * Sc) * 2], W [(a + 2 * Sc) * 2 + 1], W [(a + 2 * Sc) * 4], W [(a + 2 * Sc) * 4 + 1], W [(a + 2 * Sc) * 6 ], W [(a + 2 * Sc) * 6 + 1]. Similarly, the fourth butterfly operation is stored as W [(a + 3 * Sc) * 2], W [(a + 3 * Sc) * 2 + 1], W [(a + 3 * Sc) * 4], W [(a + 3 * Sc) * 4 + 1], W [(a + 3 * Sc) * 6 ], W [(a + 3 * Sc) * 6 + 1].
[0054]
As shown in Figure 3, the necessary values on the coefficient table are stored in registers twice and packed in six registers. On each register, the value corresponding to the real or imaginary part of each coefficient is stored for 4 butterfly operations. This format conversion can be performed with two-stage format conversion instructions (314, 315, 316, 317) for the NEC V830R processor, for example. 4 W for butterfly computation ¹ C1 (309) is the register that stores the real part value corresponding to s1, 308 is the register that stores the imaginary part value, W ² C2 (311) is the register that stores the real part value corresponding to s2 (310), W is the register that stores the imaginary part value. ^Three Let c3 (313) be the register that stores the real part value corresponding to, and s3 (312) the register that stores the imaginary part value.
[0055]
The final calculation will be described with reference to Fig. 4. As shown in Fig. 9, before the last stage, the four consecutive elements of the arrays xr (701) and xi (712) belong to the independent butterfly operation. By computing according to the same signal flow for the four elements, four butterfly operations could be performed in parallel. In the final stage, as in the previous stage, four butterfly operations are processed in parallel according to the same signal flow. However, as can be seen from the 64-point FFT example shown in Fig. 9, in the final stage, the four consecutive elements of the data array xr (701) and xi (712) belong to the same butterfly operation and have a dependency relationship. It is necessary to take a different parallelization method.
[0056]
Therefore, in the final stage, data format conversion is performed, and four adjacent butterfly operations are performed in parallel using SIMD instructions. Figure 4 shows the final processing flow. First, j is initialized to 0 (401), and the real part of each of the 16 elements starting from the subscript j of the data array xr (701), xi (702) is loaded into the registers xr0, xr1, xr2, xr3 ( 403), loading the imaginary part into registers xi0, xi1, xi2, and xi3 (404) loads four consecutive elements into the same register. Next, the 16 elements placed on each of the four registers for the real part and the imaginary part are regarded as a 4-by-4 matrix, and the matrix is transposed (405, 406). With this data format conversion, four consecutive elements required for one butterfly operation are stored in another register, and the four elements required for four independent butterfly operations can be stored in different halfword positions of the same register. This format conversion overhead is small. For example, NEC V830R processor can transpose 16 elements on 4 registers with 8 format conversion instructions.
[0057]
When the 4-point butterfly operation is performed on the transposed registers xr0, xi0, xr1, xi1, xr2, xi2, xr3, and xi3 according to the signal flow shown in Figure 5, it is performed in parallel on the four adjacent butterfly operations. (407). The result of the butterfly operation is transposed again (408, 409) and written as the final result in the arrays xr (701) and xi (702) (410).
[0058]
Although the embodiment of the present invention has been described using the currently common SIMD instruction set in which four 16-bit data are stored in a 64-bit register and four parallel operations are performed, the present invention is not limited to the bit length of the register. It can be easily applied to processors with different lengths and parallelism of each data.
[0059]
【The invention's effect】
According to the Radix-4 FFT operation method of the present invention, four butterfly operations can always be performed in parallel on a processor having four parallel SIMD instructions, and high parallelism can be achieved. In addition, since a regular data structure is adopted in which only the real part or only the imaginary part is contained in the same register, the overhead of data format conversion can be minimized. As a result, high-speed Radix-4 FFT calculation is possible. For example, when a 256-point FFT is executed on the NEC V830R processor using the method of the present invention, the required number of clocks is 4,093. This is 35% faster than the required number of 5,522 clocks when executing 256-point FFT on an Intel Pentium II processor with SIMD instruction set similar to V830R.
[Brief description of the drawings]
FIG. 1 is a flowchart showing an FFT method of the present invention.
FIG. 2 is a diagram illustrating a data access method.
FIG. 3 is a diagram illustrating a coefficient access method.
FIG. 4 is a flowchart showing the final stage processing of the FFT of the present invention.
FIG. 5 is a signal flow diagram of a four-point butterfly operation.
FIG. 6 is a simplified signal flow diagram of a four-point butterfly operation.
FIG. 7 is a diagram showing a data structure in the FFT of the present invention.
FIG. 8 is a table showing data intervals and coefficient intervals used in butterfly computation;
FIG. 9 is a signal flow diagram of a 64-point Radix-4 FFT.
FIG. 10 is a signal flow diagram when 64-point FFT is parallelized by the method of the present invention.
FIG. 11 is a signal flow diagram of a two-point butterfly operation.
FIG. 12 is a diagram showing a conventional parallel FFT calculation method.
FIG. 13 is a diagram for explaining an operation example of a SIMD operation instruction;
[Explanation of symbols]
201 Array xr containing real part of input data X
64-bit register containing 4 elements of 206, 207, 208, and 209 xr
214, 215, 216, 217 64-bit registers
218 Array containing the imaginary part of input data X xi
301 Coefficient table W
302, 303, 304, 305 64-bit register loaded with two coefficient table data
306, 307 64-bit register holding intermediate results of format conversion
308, 309 64-bit registers holding 4 sets of coefficients
310, 311 64-bit register holding 4 sets of coefficients
312, 313 64-bit register holding 4 sets of coefficients
701 Array xr containing the real part of input data X
702 Array containing the real part of input data X xi
703 Coefficient table W
901 First butterfly operation
902 Second butterfly operation
903 Third butterfly operation
904 Fourth butterfly operation
905 Fifth butterfly operation
906 Sixth butterfly operation
907 0th block
908 1st block
909 2nd block
910 3rd block
1001 64-bit register holding xr0 value for use in first iteration
1002 64-bit register holding the value of xi0 for use in the first iteration
1003 64-bit register holding xr1 value used in first iteration
1004 64-bit register holding the value of xi1 for use in the first iteration
1005 64-bit register holding xr2 value for first iteration
1006 64-bit register holding the value of xi2 for use in the first iteration
1007 64-bit register holding xr3 value for use in first iteration
1008 64-bit register holding the value of xi3 for use in the first iteration
1009 7th 4-point butterfly operation
1010 8th 4-point butterfly operation
1011 9th 4-point butterfly operation
1012 10th 4-point butterfly operation
1201 main memory
1202 Data array X
1203 Coefficient table W
1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219 64-bit registers
1301 64-bit registers
1302 64-bit registers
1303 64-bit registers

Claims

In a fast complex Fourier transform method using a microprocessor having a so-called SIMD type operation instruction that stores a plurality of data in the same register and performs the same operation on each data in parallel .
Storing the real part and imaginary part of complex numbers as input data in independent data storage locations;
In butterfly operations other than the final stage, 16 complex real and imaginary part data, which are the input data to four adjacent butterfly operations, are the same, and the data input location for the butterfly operation is the same. Is divided into four sets of four real part data adjacent to each other and four sets of four imaginary data that have the same input data position for butterfly computation and are adjacent at the data storage location. Loading each data into different registers, performing a 4-point butterfly operation, and storing it again in the storage location;
In the final stage butterfly operation, as the final stage input data, for each of the 4 input data positions from the data storage location to the butterfly operation, 4 elements are added to the real and imaginary parts every 3 elements from the data storage location. Loading one register as a set is repeated, loading these 8 sets of data into different registers, performing a 4-point butterfly operation, and storing it again in the storage location.
A fast complex Fourier transform method characterized by processing four independent four-point butterfly operations in parallel at all stages.

Multiple data is stored in the same register and the same operation is performed on each data in parallel.
In a high-speed complex Fourier transform device using a microprocessor with SIMD type operation instructions,
Means for storing the real part and the imaginary part of the complex number as input data in independent data storage locations;
In butterfly operations other than the final stage, 16 complex real and imaginary part data, which are the input data to four adjacent butterfly operations, are the same, and the data input location for the butterfly operation is the same. Are divided into 4 sets of 4 real part data adjacent to each other and 4 sets of 4 imaginary part data that have the same input data position for butterfly computation and are adjacent at the data storage location. Means to load the data of each section into different registers, perform a 4-point butterfly operation, and store it again in the storage location;
In the final stage butterfly operation, as the final stage input data, for each of the 4 input data positions from the data storage location to the butterfly operation, 4 elements are added to the real and imaginary parts every 3 elements from the data storage location. There is a means to repeatedly load one register as one set, load these 8 sets of data into different registers, perform a 4-point butterfly operation, and store it again in the storage location.
A high-speed complex Fourier transform device that processes four independent four-point butterfly operations in parallel at all stages.