CN102495721A - Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration - Google Patents

Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration

Info

Publication number
CN102495721A
CN102495721A CN2011103937120A CN201110393712A
Authority
CN
China
Prior art keywords
vector
fft
address
storage
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103937120A
Other languages
Chinese (zh)
Inventor
李丽
孙敏敏
王佳文
潘红兵
郑维山
沙金
李伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN2011103937120A priority Critical patent/CN102495721A/en
Publication of CN102495721A publication Critical patent/CN102495721A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration, which comprises a control unit, a computation unit, a memory subsystem, a memory interleaving unit and an address generation unit. The computation unit supports fast processing of various vector operations. The memory subsystem comprises three memory groups, each containing four memory banks; the bit width of a single memory bank is one complex word, so the memory groups support 4-way parallel complex vector operations and 8-way parallel real vector operations. The computation unit, the address generation unit and the memory interleaving unit are all connected to the control unit. The address generation unit generates the required operand address sequence, coefficient address sequence and result address sequence. The memory interleaving unit is connected to the address generation unit and the computation unit and implements the address mapping of the memory banks. The acceleration efficiency of the SIMD vector processor for FFT/inverse fast Fourier transform (IFFT) operations is comparable to that of a dedicated hardware accelerator, while the large additional hardware overhead brought by a dedicated accelerator is avoided. The processor is suitable for use in real-time signal processing systems containing a large number of long vector operations.

Description

SIMD vector processor supporting FFT acceleration
Technical field
The present invention relates to a SIMD vector processor supporting FFT acceleration and to its design method, and specifically to a SIMD vector processor, and its design method, that supports a variable number of FFT points, achieves high acceleration efficiency for FFT/IFFT operations and keeps the overall hardware overhead low.
Background art
Fast Fourier transform (FFT) operations are generally performed either by a dedicated hardware accelerator (an FFT processor) or by a DSP processor. A dedicated hardware accelerator achieves high acceleration efficiency, but it consumes considerable extra resources, including both on-chip storage and on-chip computational logic; in particular, when the transform length is large, the extra resources occupied by the dedicated accelerator become unacceptable. Performing the FFT by software programming on a DSP processor occupies no extra hardware resources and offers great flexibility, but its processing speed is relatively slow and cannot meet the real-time requirements of some applications.
Some digital signal processing algorithms, such as the range-Doppler algorithm, involve a large amount of vector processing of various lengths, with lengths reaching 16K or even longer. This vector processing includes both regular vector operations (vector addition/subtraction, vector multiplication, etc.) and FFT/IFFT operations. A SIMD vector processor can be used to accelerate the regular vector operations, but no SIMD vector processor has yet appeared that can also directly accelerate FFT operations with an acceleration efficiency comparable to that of a dedicated accelerator. In that case, an FFT hardware accelerator must additionally be used to accelerate FFT/IFFT operations of various point counts, occupying extra on-chip resources.
Summary of the invention
In order to accelerate FFT operations with large point counts while avoiding the extra hardware overhead brought by a dedicated hardware accelerator, the purpose of the present invention is to provide a SIMD vector processor supporting FFT acceleration. This SIMD vector processor can directly accelerate FFT operations, providing an acceleration efficiency comparable to that of a dedicated hardware accelerator, and thus guarantees performance while avoiding the additional hardware overhead.
The objective of the invention is achieved through the following technical scheme:
A SIMD vector processor supporting FFT acceleration, characterized in that: the processor comprises a control unit, a computation unit, a memory subsystem, a memory interleaving unit and an address generation unit; said computation unit supports fast processing of various vector operations; said memory subsystem comprises a memory group A storing operands, a memory group B storing coefficients and a memory group C storing operation results, and the bit width of a single memory bank in memory groups A, B and C is one complex word, supporting 4-way parallel complex vector operations and 8-way parallel real vector operations; the computation unit, the address generation unit and the memory interleaving unit are all connected to the control unit; the address generation unit generates the required operand address sequence, coefficient address sequence and result address sequence according to the operation type, the data parallelism of the operation and the vector length; the memory interleaving unit is connected to the address generation unit and the computation unit, and implements the address mapping of the memory banks.
In the present invention, memory groups A, B and C each consist of 4 memory banks. The memory interleaving unit implements the address mapping of the 4 memory banks inside memory groups A, B and C, so that 4 operands read simultaneously are located in 4 different memory banks and 4 operation results written simultaneously are located in 4 different memory banks; through a programmable address mapping method, regular vector operations and FFT/IFFT operations on vectors of various lengths are supported.
In said programmable address mapping method, the vector length is set by software programming; for different vector lengths the address mapping changes accordingly, and under each vector length it guarantees conflict-free reads and writes for both regular vector operations and FFT/IFFT operations.
The computation unit comprises 2 complex multipliers and 4 complex adders; it supports 2-way parallel complex multiplication and convolution, 4-way parallel complex addition/subtraction and accumulation, 4-way parallel complex modulus-square operations, 4-way parallel FFT/IFFT operations, and 8-way parallel real multiplication, convolution, addition/subtraction and accumulation. For an n-way parallel vector operation, n vector elements are processed per clock cycle on average (not counting the pipeline fill time before processing each vector). The acceleration efficiency is comparable to that of a dedicated hardware accelerator, and a variable number of points is supported; therefore, while guaranteeing system computing efficiency, the large on-chip storage and logic overhead that a dedicated FFT hardware accelerator module would add to the design is saved.
The memory subsystem in the present invention comprises three memory groups, storing operands, coefficients and operation results respectively; each memory group is divided into 4 memory banks, and the bit width of a memory bank is one complex word, so as to support 4-way parallel complex vector operations and 8-way parallel real vector operations. The address generation unit can generate the required operand address sequence, coefficient address sequence (not needed for some operations, such as accumulation and complex modulus-square operations) and result address sequence according to the operation type (regular operation or FFT/IFFT operation), the data parallelism of the operation (2, 4 or 8) and the vector length.
 
The SIMD vector processor of the present invention can directly accelerate FFT operations: besides accelerating regular vector operations, it provides FFT acceleration with an efficiency comparable to that of a dedicated hardware accelerator, guaranteeing performance while avoiding the additional hardware overhead.
The beneficial effects of the invention are as follows: by adding FFT acceleration instructions to the SIMD vector processor, an acceleration efficiency comparable to that of a dedicated hardware accelerator is obtained, while the extra hardware overhead brought by a dedicated hardware accelerator is avoided. The present invention can be effectively applied to real-time signal processing systems with a large number of very long vector operations (including regular vector operations and FFT/IFFT).
Description of drawings
Fig. 1 is a schematic diagram of the overall architecture of the present invention;
Fig. 2 is the data flow graph of a traditional radix-2 DIT FFT;
Fig. 3 is the data flow graph of the radix-2 DIT FFT according to the present invention.
Embodiment
The SIMD vector processor supporting FFT acceleration according to the present invention is described in detail below with reference to the accompanying drawings.
A SIMD vector processor supporting FFT acceleration is shown in Fig. 1. The processor comprises a control unit, a computation unit, a memory subsystem, a memory interleaving unit and an address generation unit.
The computation unit supports fast processing of various vector operations; it comprises 2 complex multipliers and 4 complex adders, and supports 2-way parallel complex multiplication and convolution, 4-way parallel complex addition/subtraction and accumulation, 4-way parallel complex modulus-square operations, 4-way parallel FFT/IFFT operations, and 8-way parallel real multiplication, convolution, addition/subtraction and accumulation. For an n-way parallel vector operation, n vector elements are processed per clock cycle on average (not counting the pipeline fill time before processing each vector). The acceleration efficiency is comparable to that of a dedicated hardware accelerator, and a variable number of points is supported; therefore, while guaranteeing system computing efficiency, the large on-chip storage and logic overhead that a dedicated FFT hardware accelerator module would add to the design is saved.
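For concreteness, the arithmetic of a single radix-2 DIT butterfly is sketched below in Python. Each butterfly requires one complex multiplication and two complex additions/subtractions, which is why the 2 complex multipliers and 4 complex adders described above are sufficient to sustain two butterflies, i.e. 4 data paths, per clock cycle. The function name and the purely sequential formulation are illustrative only and are not part of the claimed hardware.

```python
# A minimal sketch of the radix-2 DIT butterfly arithmetic (illustrative only).
# One butterfly consumes one complex multiplication and two complex additions,
# so 2 complex multipliers + 4 complex adders sustain 2 butterflies per cycle.

def radix2_butterfly(a, b, w):
    """Return the two outputs of a radix-2 DIT butterfly.

    a, b : complex operands read from memory group A
    w    : complex twiddle factor read from memory group B
    """
    t = w * b                # the single complex multiplication
    return a + t, a - t      # the two complex additions/subtractions

# Two butterflies computed side by side model one clock cycle of the
# 4-way-parallel FFT mode described in the text.
if __name__ == "__main__":
    x0, x1 = radix2_butterfly(1 + 0j, 0 + 1j, 1 + 0j)
    y0, y1 = radix2_butterfly(2 + 0j, 0 - 1j, 0 - 1j)
    print(x0, x1, y0, y1)
```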
The memory subsystem comprises three memory groups: a memory group A storing operands, a memory group B storing coefficients and a memory group C storing operation results, with 4 memory banks in each memory group. The bit width of a single memory bank is one complex word, supporting 4-way parallel complex vector operations and 8-way parallel real vector operations; the 4 operands read simultaneously are located in 4 different memory banks, and the 4 operation results written simultaneously are located in 4 different memory banks; through a programmable address mapping method, regular vector operations and FFT/IFFT operations on vectors of various lengths are supported. The computation unit, the address generation unit and the memory interleaving unit are all connected to the control unit.
The address generation unit generates the required operand address sequence, coefficient address sequence and result address sequence according to the operation type, the data parallelism of the operation and the vector length; the memory interleaving unit is connected to the address generation unit and the computation unit, and implements the address mapping of the memory banks. Corresponding to the three memory groups, the memory interleaving unit comprises three parts: memory interleaving unit A, memory interleaving unit B and memory interleaving unit C.
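As an illustration of the role of the address generation unit, the hedged Python sketch below models only the operand addressing of a regular n-way-parallel vector operation: each cycle it emits n consecutive element addresses, which the interleaving unit then maps onto distinct banks. This is a behavioural model inferred from the description, not the actual hardware sequencer, and the coefficient and result sequencing details are not reproduced here.

```python
# Sketch of address generation for a regular n-way-parallel vector operation:
# each cycle consumes n consecutive element addresses.
# (Behavioural model of the description, not the actual hardware.)

def regular_op_address_groups(vector_length, parallelism):
    """Yield one tuple of operand addresses per clock cycle."""
    for base in range(0, vector_length, parallelism):
        yield tuple(range(base, min(base + parallelism, vector_length)))

# Example: a 4-way-parallel operation on a length-8 vector takes 2 cycles.
print(list(regular_op_address_groups(8, 4)))
# [(0, 1, 2, 3), (4, 5, 6, 7)]
```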
In the programmable address mapping method, the vector length is set by software programming; for different vector lengths the address mapping changes accordingly, and under each vector length it guarantees conflict-free reads and writes for both regular vector operations and FFT/IFFT operations.
As mentioned above, the biggest obstacle to making a SIMD processor that supports regular vector operations also directly accelerate the FFT is address conflict. The same problem is faced in the design of dedicated FFT hardware accelerators, where mature solutions exist; it is generally avoided by designing a flexible storage system and address mapping. Here, however, the problem is more complicated, because after the FFT acceleration instructions are added, acceleration of the other regular vector operations must still be supported.
The present invention uses a new radix-2 DIT FFT data flow graph and proposes an address mapping method that supports conflict-free memory access for both regular vector operations and FFT/IFFT; its programmability supports operations on vectors of various lengths.
Fig. 2 shows the data flow graph of a traditional radix-2 DIT FFT (the input data have already been bit-reverse reordered). When computing on the basis of this data flow graph, the operand address sequence is identical to the result address sequence, but the address sequences differ from stage to stage, as shown in Table 1.
Table 1: address sequence of each operand/result data channel (for a length-8 FFT)
The address mapping of the original SIMD vector processor is shown in Table 2.
Table 2: address mapping of the original SIMD vector processor
It can be seen that an address conflict occurs at stage 2: the addresses of the two operands of butterfly 2_0 are 0 and 4, both located in memory bank 0; the addresses of the two operands of butterfly 2_1 are 1 and 5, both in memory bank 1; the addresses of the two operands of butterfly 2_2 are 2 and 6, both in memory bank 2; and the addresses of the two operands of butterfly 2_3 are 3 and 7, both in memory bank 3.
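The conflict can be reproduced with a short Python check. It assumes that the original mapping of Table 2 is the straightforward bank = address mod 4 (an assumption, since Table 2 itself is only available as an image, although it is consistent with the bank assignments listed above): every stage-2 butterfly of the length-8 traditional data flow graph then reads both of its operands from the same bank.

```python
# Reproduce the stage-2 conflict of the traditional radix-2 DIT flow graph
# for N = 8, assuming the original mapping is simply bank = address % 4
# (an assumption: Table 2 is only available as an image in the publication).

def bank(addr):
    return addr % 4

stage2_butterflies = [(0, 4), (1, 5), (2, 6), (3, 7)]  # operand address pairs

for i, (a, b) in enumerate(stage2_butterflies):
    status = "CONFLICT" if bank(a) == bank(b) else "ok"
    print(f"butterfly 2_{i}: operands {a},{b} -> banks {bank(a)},{bank(b)}", status)
# All four butterflies read both operands from the same bank, so the four
# parallel reads of one cycle cannot be served by four single-port banks.
```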
The address conflict could be avoided simply by changing the address mapping. However, for FFTs of length greater than 8, every stage from stage 3 onward has address conflicts, and more critically, since the address sequences of the stages differ, the conflicting addresses also differ from stage to stage; moreover, changing the address mapping may cause address conflicts in the regular vector operations. Therefore the problem cannot be solved simply by changing the address mapping.
There is a new radix-2 DIT FFT data flow graph whose address sequence is identical for every stage of butterfly operations, as shown in Fig. 3. The new data flow graph is obtained by transforming the traditional radix-2 DIT FFT data flow graph. In the traditional radix-2 DIT FFT data flow graph, stage 0 has N/2 groups with 1 butterfly per group; stage 1 has N/4 groups with 2 butterflies per group; stage 2 has N/8 groups with 4 butterflies per group; and so on.
In each stage, the traditional computation order of the butterflies is: the groups are processed one after another from top to bottom, and within each group the butterflies are also computed from top to bottom. The computation order is adjusted as follows: first compute the first butterfly of each group from top to bottom, then compute the second butterfly of each group from top to bottom, and so on until all butterflies of the stage are completed. Taking the FFT with N=8 as an example, according to the traditional radix-2 DIT FFT data flow graph the butterfly order of stage 1 is 1_0-1_1-1_2-1_3, whereas the adjusted butterfly order is 1_0-1_2-1_1-1_3. By applying the adjusted butterfly order and correspondingly adjusting the positions at which the data are stored in memory according to the new computation order, the new radix-2 DIT FFT data flow graph shown in Fig. 3 is obtained.
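The reordering described above can be written down directly. The sketch below lists, for one stage of a length-N traditional radix-2 DIT flow graph, the butterflies in the group-by-group order and in the adjusted first-of-each-group order; for N = 8 and stage 1 it reproduces the sequences 1_0-1_1-1_2-1_3 and 1_0-1_2-1_1-1_3 mentioned in the text. The labels follow the figures; everything else is an illustrative model.

```python
# Sketch of the butterfly-order adjustment used to derive the new data flow
# graph (Fig. 3) from the traditional one (Fig. 2).  Butterfly s_k is the
# k-th butterfly of stage s in the traditional top-to-bottom numbering.

def stage_orders(n, stage):
    groups = n >> (stage + 1)        # stage 0: N/2 groups, stage 1: N/4, ...
    per_group = 1 << stage           # 1, 2, 4, ... butterflies per group
    traditional = [f"{stage}_{g * per_group + j}"
                   for g in range(groups) for j in range(per_group)]
    adjusted = [f"{stage}_{g * per_group + j}"
                for j in range(per_group) for g in range(groups)]
    return traditional, adjusted

trad, adj = stage_orders(8, 1)
print("traditional:", "-".join(trad))   # 1_0-1_1-1_2-1_3
print("adjusted:   ", "-".join(adj))    # 1_0-1_2-1_1-1_3
```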
When computing on the basis of the new radix-2 DIT FFT data flow graph, the operand address sequence differs from the result address sequence, but each address sequence is identical for every stage, as shown in Table 3 and Table 4.
 
Table 3: address sequence of each operand data channel (based on the new radix-2 DIT FFT data flow graph)
Table 4: address sequence of each result data channel (based on the new radix-2 DIT FFT data flow graph)
From Table 3 it can be seen that the operand address sequence is identical to the address sequence of a regular vector operation, so there is no address conflict. From Table 4 it can be seen that the result address sequence always has address conflicts; for example, the addresses of the two results of butterfly 0_0 are 0 and 4, both located in memory bank 0. This problem can be solved by changing the address mapping, as long as the changed address mapping does not cause address conflicts in the address sequences of regular vector operations.
For a vector of N=8, the address mapping can be changed to that shown in Table 5.
Table 5: new address mapping (for a vector of N=8)
With the address mapping of Table 5, the address sequences of Table 3 and Table 4 are both conflict-free, so the parallel memory accesses of both regular vector operations and the N=8 FFT are conflict-free, and the SIMD vector processor can support the acceleration of both kinds of operations.
Generalizing to an arbitrary vector length N, the address mapping is shown in Table 6.
Table 6: address mapping for an arbitrary vector length N
In this way, for a vector of arbitrary length N, the SIMD vector processor can support direct acceleration of both regular vector operations and FFT/IFFT operations. As can be seen from Table 6, the address mapping depends on the vector length N. In the SIMD vector processor designed here, the address mapping is implemented by the memory interleaving unit; therefore, before a vector is loaded from off-chip memory into on-chip memory, the vector length must first be set in the memory interleaving unit by software programming, after which the vector can be loaded into on-chip memory and a series of acceleration operations, including regular vector operations and FFT/IFFT operations, can be performed on it. For this reason, the address mapping method is called programmable address mapping.
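Because Tables 3 to 6 are only available as images in the published document, the exact address sequences and the exact programmable mapping are not reproduced here. The hedged Python sketch below shows only the kind of check such a mapping must pass: every set of addresses accessed in the same cycle must fall into pairwise-distinct banks. The access groups used are just the ones stated explicitly in the text (the consecutive addresses of a regular 4-way operation and the result pair 0/4 of butterfly 0_0); the function names are illustrative.

```python
# Checker for a candidate address mapping (illustrative only: the actual
# programmable mapping of Tables 5/6 is published as an image and is not
# reproduced here).  A mapping is admissible if every set of addresses
# accessed in the same cycle maps to pairwise-distinct banks.

def conflict_free(bank_of, access_groups):
    """True if no access group maps two of its addresses to the same bank."""
    return all(len({bank_of(a) for a in group}) == len(group)
               for group in access_groups)

def simple_mod4(addr):
    return addr % 4   # the mapping that leaves Table 4's result sequence conflicting

# Access patterns stated explicitly in the text for N = 8:
regular_groups = [(0, 1, 2, 3), (4, 5, 6, 7)]  # 4-way regular vector operation
known_fft_result_pair = [(0, 4)]               # both results of butterfly 0_0

print(conflict_free(simple_mod4, regular_groups))         # True
print(conflict_free(simple_mod4, known_fft_result_pair))  # False -> remapping needed
```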
The FFT acceleration in the present embodiment averages two butterfly operations per clock cycle, and the acceleration efficiency (number of butterflies per cycle divided by the number of complex multipliers) reaches its maximum value of 1, which is comparable to the maximum acceleration efficiency of a dedicated hardware accelerator.
In addition, it should be noted that the design method of the present invention is highly scalable: the degree of parallelism can be selected according to performance requirements, and the number of butterflies computed in parallel can be chosen as 1, 2, 4, 8, and so on. A typical radix-2 FFT hardware accelerator has a degree of parallelism of 1 or log2(N) and offers no such flexibility of choice, so this scalability is significant.
Under the premise of guaranteeing system computing efficiency, the present invention enhances system flexibility while reducing the large hardware overhead brought by a dedicated FFT hardware unit; it therefore has excellent application value in signal processing systems.

Claims (5)

1. A SIMD vector processor supporting FFT acceleration, characterized in that: the processor comprises a control unit, a computation unit, a memory subsystem, a memory interleaving unit and an address generation unit; said computation unit supports fast processing of various vector operations; said memory subsystem comprises a memory group A storing operands, a memory group B storing coefficients and a memory group C storing operation results, and the bit width of a single memory bank in memory groups A, B and C is one complex word, supporting 4-way parallel complex vector operations and 8-way parallel real vector operations; the computation unit, the address generation unit and the memory interleaving unit are all connected to the control unit; the address generation unit generates the required operand address sequence, coefficient address sequence and result address sequence according to the operation type, the data parallelism of the operation and the vector length; the memory interleaving unit is connected to the address generation unit and the computation unit, and implements the address mapping of the memory banks.
2. The SIMD vector processor supporting FFT acceleration according to claim 1, characterized in that memory groups A, B and C each consist of 4 memory banks.
3. The SIMD vector processor supporting FFT acceleration according to claim 2, characterized in that the memory interleaving unit implements the address mapping of the 4 memory banks inside memory groups A, B and C, so that 4 operands read simultaneously are located in 4 different memory banks and 4 operation results written simultaneously are located in 4 different memory banks; through a programmable address mapping method, regular vector operations and FFT/IFFT operations on vectors of various lengths are supported.
4. The SIMD vector processor supporting FFT acceleration according to claim 3, characterized in that in said programmable address mapping method the vector length is set by software programming; for different vector lengths the address mapping changes accordingly, and under each vector length it guarantees conflict-free reads and writes for both regular vector operations and FFT/IFFT operations.
5. The SIMD vector processor supporting FFT acceleration according to claim 1, characterized in that the computation unit comprises 2 complex multipliers and 4 complex adders, and supports 2-way parallel complex multiplication and convolution, 4-way parallel complex addition/subtraction and accumulation, 4-way parallel complex modulus-square operations, 4-way parallel FFT/IFFT operations, and 8-way parallel real multiplication, convolution, addition/subtraction and accumulation.
CN2011103937120A 2011-12-02 2011-12-02 Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration Pending CN102495721A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103937120A CN102495721A (en) 2011-12-02 2011-12-02 Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103937120A CN102495721A (en) 2011-12-02 2011-12-02 Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration

Publications (1)

Publication Number Publication Date
CN102495721A true CN102495721A (en) 2012-06-13

Family

ID=46187550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103937120A Pending CN102495721A (en) 2011-12-02 2011-12-02 Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration

Country Status (1)

Country Link
CN (1) CN102495721A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023519A (en) * 2012-10-26 2013-04-03 中国兵器科学研究院 Method and device for transforming Fermat number
CN103838704A (en) * 2014-03-20 2014-06-04 南京大学 FFT accelerator with high throughput rate
US9355061B2 (en) 2014-01-28 2016-05-31 Arm Limited Data processing apparatus and method for performing scan operations
WO2017124648A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Vector computing device
CN108710943A (en) * 2018-05-21 2018-10-26 南京大学 A kind of multilayer feedforward neural network Parallel Accelerator
CN109900491A (en) * 2017-12-11 2019-06-18 通用汽车环球科技运作有限责任公司 System, the method and apparatus of troubleshooting detection are carried out by supplemental characteristic using redundant processor framework
CN111213125A (en) * 2017-09-08 2020-05-29 甲骨文国际公司 Efficient direct convolution using SIMD instructions
CN115718724A (en) * 2023-01-09 2023-02-28 阿里巴巴(中国)有限公司 GPU (graphics processing Unit), data selection method and chip
US11734383B2 (en) 2016-01-20 2023-08-22 Cambricon Technologies Corporation Limited Vector and matrix computing device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219212B1 (en) * 2002-05-13 2007-05-15 Tensilica, Inc. Load/store operation of memory misaligned vector data using alignment register storing realigned data portion for combining with remaining portion
CN101630308A (en) * 2008-07-16 2010-01-20 财团法人交大思源基金会 Design and addressing method for any point number quick Fourier transformer based on memory

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219212B1 (en) * 2002-05-13 2007-05-15 Tensilica, Inc. Load/store operation of memory misaligned vector data using alignment register storing realigned data portion for combining with remaining portion
CN101630308A (en) * 2008-07-16 2010-01-20 财团法人交大思源基金会 Design and addressing method for any point number quick Fourier transformer based on memory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴云峰 et al.: "Three-dimensional vector-radix fast Fourier transform algorithm", 《计算机应用》 (Journal of Computer Applications), vol. 29, no. 2, 28 February 2009 (2009-02-28) *
徐妮妮 et al.: "Two-dimensional vector-radix decimation-in-frequency fast Fourier transform", 《天津工业大学学报》 (Journal of Tianjin Polytechnic University), vol. 27, no. 6, 31 December 2008 (2008-12-31) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023519B (en) * 2012-10-26 2016-12-21 中国兵器科学研究院 A kind of method and apparatus of Fermat number transform
CN103023519A (en) * 2012-10-26 2013-04-03 中国兵器科学研究院 Method and device for transforming Fermat number
US9355061B2 (en) 2014-01-28 2016-05-31 Arm Limited Data processing apparatus and method for performing scan operations
CN103838704A (en) * 2014-03-20 2014-06-04 南京大学 FFT accelerator with high throughput rate
US11734383B2 (en) 2016-01-20 2023-08-22 Cambricon Technologies Corporation Limited Vector and matrix computing device
WO2017124648A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Vector computing device
CN111213125B (en) * 2017-09-08 2023-11-07 甲骨文国际公司 Efficient direct convolution using SIMD instructions
CN111213125A (en) * 2017-09-08 2020-05-29 甲骨文国际公司 Efficient direct convolution using SIMD instructions
CN109900491A (en) * 2017-12-11 2019-06-18 通用汽车环球科技运作有限责任公司 System, the method and apparatus of troubleshooting detection are carried out by supplemental characteristic using redundant processor framework
CN109900491B (en) * 2017-12-11 2021-05-11 通用汽车环球科技运作有限责任公司 System, method and apparatus for diagnostic fault detection using redundant processor architecture with parametric data
CN108710943B (en) * 2018-05-21 2021-11-16 南京大学 Multilayer feedforward neural network parallel accelerator
CN108710943A (en) * 2018-05-21 2018-10-26 南京大学 A kind of multilayer feedforward neural network Parallel Accelerator
CN115718724A (en) * 2023-01-09 2023-02-28 阿里巴巴(中国)有限公司 GPU (graphics processing Unit), data selection method and chip

Similar Documents

Publication Publication Date Title
CN102495721A (en) Single instruction multiple data (SIMD) vector processor supporting fast Fourier transform (FFT) acceleration
CN109992743B (en) Matrix multiplier
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
CN103970720B (en) Based on extensive coarseness imbedded reconfigurable system and its processing method
CN103984560B (en) Based on extensive coarseness imbedded reconfigurable system and its processing method
CN103955447B (en) FFT accelerator based on DSP chip
CN108805266A (en) A kind of restructural CNN high concurrents convolution accelerator
CN101847137B (en) FFT processor for realizing 2FFT-based calculation
US20230385233A1 (en) Multiple accumulate busses in a systolic array
CN111723336B (en) Cholesky decomposition-based arbitrary-order matrix inversion hardware acceleration system adopting loop iteration mode
CN102945224A (en) High-speed variable point FFT (Fast Fourier Transform) processor based on FPGA (Field-Programmable Gate Array) and processing method of high-speed variable point FFT processor
CN103543984A (en) Modification type balance throughput data path architecture for special corresponding applications
CN110705702A (en) Dynamic extensible convolutional neural network accelerator
Yang et al. Molecular dynamics range-limited force evaluation optimized for FPGAs
CN101894096A (en) FFT computing circuit structure applied to CMMB and DVB-H/T
US10949493B2 (en) Multi-functional computing apparatus and fast fourier transform computing apparatus
CN116710912A (en) Matrix multiplier and control method thereof
CN102567282B (en) In general dsp processor, FFT calculates implement device and method
CN115310037A (en) Matrix multiplication computing unit, acceleration unit, computing system and related method
CN112559954B (en) FFT algorithm processing method and device based on software-defined reconfigurable processor
CN103034621A (en) Address mapping method and system of radix-2*K parallel FFT (fast Fourier transform) architecture
CN104679670A (en) Shared data caching structure and management method for FFT (fast Fourier transform) and FIR (finite impulse response) algorithms
CN111610963B (en) Chip structure and multiply-add calculation engine thereof
Shafiq et al. Exploiting memory customization in FPGA for 3D stencil computations
CN102541813B (en) Method and corresponding device for multi-granularity parallel FFT (Fast Fourier Transform) butterfly computation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120613