WO2013097235A1 - Parallel bit reversal device and method - Google Patents

Parallel bit reversal device and method

Info

Publication number
WO2013097235A1
WO2013097235A1 PCT/CN2011/085177 CN2011085177W
Authority
WO
WIPO (PCT)
Prior art keywords
data
address
memory
parallel
parallel bit
Prior art date
Application number
PCT/CN2011/085177
Other languages
English (en)
French (fr)
Inventor
谢少林
蒿杰
汪涛
尹磊祖
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences, filed critical Institute of Automation, Chinese Academy of Sciences
Priority to US14/118,452 priority Critical patent/US9268744B2/en
Priority to PCT/CN2011/085177 priority patent/WO2013097235A1/zh
Publication of WO2013097235A1 publication Critical patent/WO2013097235A1/zh

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/142 Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/762 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data, having at least two separately controlled rearrangement levels, e.g. multistage interconnection networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/768 Data position reversal, e.g. bit reversal, byte swapping

Definitions

  • The present invention relates to bit-reversed data storage and arrangement in integrated circuit design, is closely related to the fast Fourier transform (FFT) algorithm and integrated circuit structures, and particularly relates to a parallel bit reversal device and a parallel bit reversal method.
  • The FFT algorithm takes N data as input and outputs N data; the time-domain-to-frequency-domain transform is generally called the forward transform, and the frequency-domain-to-time-domain transform the inverse transform.
  • The order of time-domain data by sampling time is called "natural order", i.e., the sampling time of each data point is increasing; likewise, the order of frequency-domain data by frequency is called "natural order", i.e., the frequency corresponding to each data point is increasing.
  • Corresponding to "natural order" is "bit-reversed order": write the index of each data point in the natural order in binary form; after mirror-reversing that binary index, the resulting value is the index of the data point in the bit-reversed order.
  • For example, suppose an index value is represented with 3 bits. The natural-order index 3 is written in binary as (011)2; after mirror reversal its binary value is (110)2, which is 6 in decimal. Therefore 6 is the index corresponding to that data point after the bit-reversed arrangement.
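  • As an illustration of this index mapping, a minimal sketch (the function name is ours, not the patent's):

```python
def bit_reverse(index: int, bits: int) -> int:
    """Mirror-reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current LSB into the result
        index >>= 1
    return result

# Natural-order index 3 over 3 bits: (011)2 mirror-reverses to (110)2 = 6.
assert bit_reverse(3, 3) == 6
```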
  • FIG. 1 shows the decimation-in-frequency algorithm.
  • The original data 100 is in natural order.
  • Before the computation, the data must be permuted by the bit reversal operation 103 so that it is arranged in bit-reversed order, yielding the reversed data 101.
  • After computation, the output data 102 is in natural order.
  • In the decimation-in-time algorithm, the input data is in natural order but the output data is in bit-reversed order; therefore, before the output data is processed further, it must be permuted into natural order.
  • Some patents have implemented bit-reversed data arrangement. For example, US Patent US 2003/0028571 A1 (Real-time method for bit-reversal of large size arrays) uses a two-step reversal method:
  • the first step uses DMA to perform the large-size reversal between
  • the external memory and the on-chip memory, and
  • the second step uses the processor to implement the small-size internal reversal of the data.
  • The problems of this method arise when the processor performs the small-size internal reversal: first, the internal sorting requires multiple iterations; for data of length N, log2(N)-1 iterations are needed. Second, each iteration can only read and write data in scalar units, so multiple data cannot be sorted in parallel. The bit-reversal method described in that patent is therefore very inefficient.
  • U.S. Patent No. 7,640,284 B1 (Bit Reversal Methods for a Parallel Processor) describes a method by which nVidia uses the multiple processing cores or SIMD execution units in a graphics processor (GPU) to implement parallel bit reversal.
  • Although that invention implements a parallel bit-reversed arrangement, the following problems remain: first, each processing core or execution unit must access a lookup table several times and perform several shift operations when computing a bit-reversed address, so merely computing the address takes multiple clock cycles and execution efficiency is low; second,
  • as stated at page 21, line 10 of that patent, the method can only reduce, not eliminate, the memory access conflicts of the multiple processors or SIMD functional units, and these access conflicts further reduce the efficiency of the bit reversal operation.
  • The technical problem solved by the present invention is to improve the execution efficiency of the parallel bit reversal operation, so as to better support a multi-granularity parallel FFT computing device.
  • The invention provides a parallel bit reversal device, comprising a parallel bit reversal unit, a butterfly computation and control unit, and a memory, wherein the butterfly computation and control unit is connected to the memory through a data bus;
  • the parallel bit reversal unit is configured to bit-reverse the butterfly group data computed by the butterfly computation and control unit;
  • the parallel bit reversal unit includes address reversal logic,
  • which is connected to the butterfly computation and control unit and performs mirror reversal and right-shift operations on the read address from the butterfly computation and control unit;
  • the parallel bit reversal unit further includes a multi-granularity parallel memory and an address selector;
  • the multi-granularity parallel memory is coupled to the address selector to receive the address output by the address selector;
  • the multi-granularity parallel memory is further coupled to the butterfly computation and control unit to receive the write data and the read/write granularity value output by the butterfly computation and control unit.
  • The parallel bit reversal unit further includes a data reversal network coupled to the multi-granularity parallel memory to receive read data from the multi-granularity parallel memory.
  • The data reversal network is used for the bit-reversed arrangement of the data within a butterfly group.
  • The address selector is coupled to the butterfly computation and control unit and to the address reversal logic, and selects the read/write address output to the memory.
  • The present invention also provides a parallel bit reversal method for a parallel bit reversal device, the device comprising a parallel bit reversal unit and a memory, the parallel bit reversal unit comprising a multi-granularity parallel memory.
  • The method includes the following steps: step 200, transfer the data from the memory to the multi-granularity parallel memory; after the transfer, the natural-order data is divided equally into 2^n groups, stored sequentially in the 2^n memory blocks of the multi-granularity parallel memory,
  • where n is a positive integer;
  • step 201, keep the intra-group index unchanged and reverse the data group index;
  • step 202, keep the intra-group index unchanged and the data group index unchanged, but move the intra-group index to the high bits;
  • step 203, keep the data group index unchanged and reverse the intra-group index.
  • The parallel bit reversal device also includes a butterfly computation and control unit, and the steps of the method are performed under the control of the butterfly computation and control unit.
  • The parallel bit reversal unit further includes address reversal logic and an address selector; the butterfly computation and control unit is coupled to the address reversal logic by a shift indication line and a read address line.
  • In step 201, the value of the shift indication line is set to A-n, where A is the bit width of the read address line; the read address on the read/write address line is set to the data group index; and the address selector selects the output of the address reversal logic as the read/write address of the memory.
  • In step 203, the data read from the multi-granularity parallel memory is bit-reversed by the data reversal network and then output to the butterfly computation and control unit.
  • After step 203, the output of the data reversal network yields 2^n bit-reversed data, on which butterfly computation can be performed directly.
  • The invention stores natural-order data in a memory that supports multi-granularity parallel reads and writes, and reads multiple bit-reversed data in parallel in order of increasing address.
  • The method first groups the data to be sorted and then sorts it using three principles: 1. the multi-granularity read/write capability of the memory realizes the reversal between the intra-group data and the data groups; 2. hardware logic realizes the reversal of the data within a group; 3. different read/write addresses realize the reversal between data groups. (A sketch composing these three index transforms follows below.)
  • The entire bit reversal process of the invention reads the memory only once and involves no iteration; the processor can directly perform the next operation, such as the butterfly computation in the FFT, on the bit-reversed data that is read. Meanwhile,
  • the method has no access conflicts when reading the memory in parallel, and the read-and-sort process can be deeply pipelined, giving extremely high execution efficiency; in addition, the parallel granularity can be flexibly specified according to the specific implementation.
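  • The sketch below checks that the three principles compose into a full T-bit index reversal; the values of T and n and the helper br() are assumptions chosen for the demonstration, not identifiers from the patent:

```python
# Reverse the group index (step 201), move the intra-group index to the high
# bits (step 202), then reverse the intra-group index (step 203): the result
# equals the full T-bit mirror reversal of the natural-order index.
T, n = 5, 2                       # N = 32 data, parallel granularity 2**n = 4
G = T - n                         # group-index width in bits

def br(x: int, bits: int) -> int:
    return int(format(x, f"0{bits}b")[::-1], 2)

for idx in range(1 << T):
    g, b = idx >> n, idx & ((1 << n) - 1)         # natural order: g high, b low
    g = br(g, G)                                  # step 201
    addr = (b << G) | g                           # step 202
    addr = (br(addr >> G, n) << G) | (addr & ((1 << G) - 1))  # step 203
    assert addr == br(idx, T)
```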
  • FIG. 1 is a flow chart of the radix-2 decimation-in-frequency FFT algorithm for a data length of 8;
  • FIG. 2 is a flowchart of the parallel bit reversal method of the present invention;
  • FIG. 3 is a schematic structural diagram of the parallel bit reversal device of the present invention;
  • FIG. 4 is a schematic diagram of the distribution of the data in the memory before the bit reversal operation, when the parallel granularity is 4 and the data length is 32;
  • FIG. 5 is a schematic diagram of the logical structure of the multi-granularity parallel memory of the present invention;
  • FIG. 6 is a schematic diagram of the addressing mode and the logical bank partitioning of the multi-granularity parallel memory of the present invention at different read/write granularities.
  • To implement the parallel bit-reversed arrangement, first define the parallel granularity W = 2^n (n a positive integer): the number of data read from the memory and sorted in parallel. Before the bit reversal operation the data is grouped, every W data forming one group.
  • For data of length N = 2^T,
  • each data index requires T bits; after grouping, the data index splits into two parts: the data group index g and the intra-group data index b.
  • When the parallel granularity is 2^n, the 2^T data can be divided into 2^(T-n) data groups in total.
  • Let T - n = G; then G bits are needed to represent the data group index g, i.e., g = (g_{G-1}...g_1g_0), and n bits to represent the intra-group index b, i.e., b = (b_{n-1}...b_1b_0). Each data index
  • in binary is the concatenation of g and b; in the natural order the data group index occupies the high bits and the intra-group index b the low bits, namely: natural-order data address = (g_{G-1}...g_1g_0 b_{n-1}...b_1b_0) (Expression 1), and the corresponding bit-reversed data address = (b_0b_1...b_{n-1} g_0g_1...g_{G-1}) (Expression 2).
  • FIG. 3 shows the structure of a specific embodiment of the parallel bit reversal apparatus according to the present invention.
  • The parallel bit reversal device includes a parallel bit reversal unit 314, a butterfly computation and control unit 309, and a mass storage 311.
  • The butterfly computation and control unit 309 is connected to the mass storage 311 through the data bus 310.
  • The parallel bit reversal unit 314 includes a multi-granularity parallel memory 300, address reversal logic 306, an address selector 308, and a data reversal network 301.
  • When the parallel granularity is set to 2^n,
  • the multi-granularity parallel memory 300 is composed of 2^n memory blocks.
  • The multi-granularity parallel memory 300 has an address line 313, coupled to the output of the address selector 308 to receive the address output by the address selector.
  • The multi-granularity parallel memory 300 also has a write enable signal line 307, a write data line 316, and a read/write granularity indication line 304, all coupled to the butterfly computation and control unit 309 to receive, respectively,
  • the write enable signal, the write data, and the read/write granularity value output by the butterfly computation and control unit 309.
  • The multi-granularity parallel memory 300 also has a read data line 312, coupled to the input of the data reversal network 301 to output the read data to the data reversal network.
  • The output 302 of the data reversal network 301 is coupled to the butterfly computation and control unit 309; the network performs the reversed arrangement of the data within a group.
  • When the parallel granularity of the parallel bit reversal device 314 is 2^n,
  • the number of input data on the read data line 312 of the data reversal network 301 is 2^n,
  • and the number of output data at the output terminal 302 is also 2^n.
  • If the input data vector is X, with left-to-right index i (0 ≤ i < 2^n),
  • X[i] denoting the i-th input datum;
  • the output data vector is Y,
  • with left-to-right index j (0 ≤ j < 2^n), Y[j] denoting the j-th output datum;
  • and br(i) denotes the mirror bit reversal of i over n bits,
  • then the correspondence between X and Y in the data reversal network 301 is Y[i] = X[br(i)].
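  • A behavioural sketch of this fixed wiring (illustrative names; the mapping Y[i] = X[br(i)] is reconstructed from the step-203 description, since the original formula was lost in extraction):

```python
# Output lane i of the data reversal network receives input lane br(i),
# where br is the n-bit mirror reversal; the wiring is fixed, so nothing
# needs to be computed at run time.
n = 2

def br(x: int, bits: int) -> int:
    return int(format(x, f"0{bits}b")[::-1], 2)

def reversal_network(x):           # x: the 2**n data on read data line 312
    return [x[br(i, n)] for i in range(len(x))]

print(reversal_network(["d0", "d1", "d2", "d3"]))  # ['d0', 'd2', 'd1', 'd3']
```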
  • One input of the address selector 308 is coupled to the read/write address line 303 of the butterfly computation and control unit 309, and the other input is coupled to the output of the address reversal logic 306; the selector chooses the read/write address supplied to the memory 300.
  • One input of the address reversal logic 306 is coupled to the shift indication line 305 of the butterfly computation and control unit 309, and the other input is coupled to the read/write address line 303 of the butterfly computation and control unit 309;
  • the logic performs bitwise mirror reversal and right-shift operations on the binary representation of the read address input on the read/write address line 303.
  • The bit reversal operation can be abstracted as a change of the data address: the original data address is as shown in (Expression 1), (g_{G-1}...g_1g_0 b_{n-1}...b_1b_0), and the reversed data address is as shown in (Expression 2), (b_0b_1...b_{n-1} g_0g_1...g_{G-1}).
  • The method of performing parallel bit reversal on the natural-order data address can be decomposed into four steps, as shown in FIG. 2. Step 200: initialization.
  • The butterfly computation and control unit 309 moves the data from the mass storage 311 to the multi-granularity parallel memory 300.
  • The data bus 310 and the write data line 315 are used to transfer the data during the move.
  • At this time, the address selector 308 selects the read address on the read/write address line 303 as its output, the granularity on the read/write granularity indication line 304 is 2^n, and the level on the write enable signal line 307 is high.
  • After the move, the natural-order data is divided equally into 2^n groups, stored sequentially in the 2^n memory blocks 400 of the multi-granularity parallel memory 300, as shown in FIG. 4 (which assumes a data length N = 32 and a parallel granularity of 4).
  • Step 201: reversal between data groups. This step reverses the data group index, so that the group index g = (g_{G-1}...g_1g_0) becomes (g_0g_1...g_{G-1}). The butterfly computation and control unit 309 sets the value of the shift indication line 305 to A-n, where A is the bit width of the read address line 303; at the same time the read address on the read/write address line 303 is set to the data group index g, and the address selector 308 selects the output of the address reversal logic 306 as the read/write address of the memory 300.
  • Step 202: reversal between the data group and the intra-group data. This step keeps the intra-group index unchanged and the data group index unchanged, but moves the intra-group index to the high bits, so that the data address becomes (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}).
  • In this process, the granularity on the read/write granularity indication line 304 in FIG. 3 is set to 1,
  • the level on the write enable signal line 307 is set to low,
  • and a group of data is read from the memory at the address on the address line 313.
  • The effect of this operation is to move the intra-group index to the high bits.
  • The principle behind this effect can be explained as follows:
  • the natural-order data is divided into 2^n parts stored in the 2^n memory blocks 500, so the natural-order indexes of the data at the same position in different memory blocks 500 differ by N/2^n = 2^G. When the read granularity 304 is set to 1,
  • the read behavior takes one datum from the storage unit at the same position in each memory block 500 and assembles them into one data group; this behavior is equivalent to extracting one datum from the natural-order data every N/2^n = 2^G data.
  • If the intra-group number of the read data group is b = (b_{n-1}...b_1b_0) and the extraction start address is g = (g_0g_1...g_{G-1}), then extracting one datum every 2^G data means the extraction address of each datum is b×2^G + g = (b«G) | g = (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}), which is exactly the address expected after step 201.
  • Step 203: reversal of the data within a group. This step keeps the data group index unchanged and reverses the intra-group index, so that the data address changes from (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}) to (b_0b_1...b_{n-1} g_0g_1...g_{G-1}).
  • This step uses the data reversal network 301: after the data read from the multi-granularity parallel memory 300 through the read data line 312 passes through the data reversal network 301, the data address becomes (b_0b_1...b_{n-1} g_0g_1...g_{G-1}); the data is output directly to the butterfly computation and control unit 309 via the output terminal 302, completing the parallel bit reversal of 2^n data.
  • After step 203, the 2^n data obtained at the output terminal 302 are the bit-reversed data, on which butterfly computation can be performed directly.
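  • An end-to-end behavioural sketch of steps 200 to 203 for N = 32 and granularity 4; the memory blocks are modelled as plain lists and all names are illustrative assumptions, since the real device realizes these steps with the address reversal logic, the granularity-1 read, and the data reversal network:

```python
N, n = 32, 2                        # data length, parallel granularity 2**n
W = 1 << n
T = N.bit_length() - 1              # index width in bits
G = T - n                           # group-index width

def br(x: int, bits: int) -> int:
    return int(format(x, f"0{bits}b")[::-1], 2)

data = list(range(N))
# Step 200: split the natural-order data into 2**n equal groups, one per block.
blocks = [data[i * (N // W):(i + 1) * (N // W)] for i in range(W)]

out = []
for g in range(N // W):                           # group indexes, ascending
    addr = br(g, G)                               # step 201: address reversal
    row = [blocks[i][addr] for i in range(W)]     # step 202: granularity-1 read
    out.extend(row[br(j, n)] for j in range(W))   # step 203: reversal network

assert out == [br(i, T) for i in range(N)]        # full bit-reversed order
```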
  • For ease of description, data bit widths are measured in units of memory cells; a memory cell is the addressing unit of the memory and also the minimum data width the memory can read or write. Any statement containing "bit width W" means the bits of W memory cells; for example, if the memory cell is an 8-bit byte, a memory with a read/write port width of 4 has an actual width of 4×8 = 32 bits. All objects are numbered from 0, left to right, and "granularity" means the number of memory cells with consecutive addresses. The following symbols are used:
  • W: the memory read/write port width, which must be a power of 2 (W = 2^n, n a natural number);
  • K: K = log2(W); K+1 is the number of read/write granularities the memory supports;
  • k: the read/write granularity parameter, a natural number with 0 ≤ k ≤ K;
  • g: g = 2^k, the memory read/write granularity, 1 ≤ g ≤ W;
  • N: the size of one memory block.
  • W = 4 is assumed in the schematic diagrams of the present invention, but the invention applies to any other W that is a power of 2.
  • The multi-granularity parallel memory is composed of W memory blocks 505 and a data strobe network 502.
  • Each memory block 505 is a two-dimensional array of memory cells 503; each memory row 504 in the array must contain W memory cells 503, and each memory block can read or write one memory row 504 at a time.
  • The data strobe network 502 logically selects W memory cells 503 from the W memory blocks 505 as the read/write objects, based on the read/write address and the read/write granularity.
  • The memory of the present invention supports multiple read/write granularities, and the starting address of each memory block 505 differs under different read/write granularities.
  • The parameter k characterizes the different read/write granularities; the actual read/write granularity is g = 2^k.
  • For read/write granularity g, every g adjacent memory blocks 605 are spliced into one logical bank 606, and all logical banks 606 share the same starting address; within a logical bank 606 the memory blocks 605 are addressed consecutively end to end, so each logical bank 606 has an addressing range of 0 to gN-1, as does the entire memory.
  • During a read operation, the memory sends the read/write address and the read/write granularity to every logical bank 606; each logical bank 606 reads g memory cells and passes them through the data strobe network 502 to the memory read/write port 501,
  • and the data read by the W/g logical banks 606 is spliced, in order from left to right, into output data of bit width W.
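  • A behavioural model of the read path just described, under the assumption (not stated explicitly above) that a granularity-g read returns, from each logical bank, the g consecutive cells starting at the broadcast address; all names are illustrative:

```python
W, NBLK = 4, 8                       # port width, cells per memory block
blocks = [[b * NBLK + a for a in range(NBLK)] for b in range(W)]  # demo contents

def read(addr: int, k: int):
    g = 1 << k                       # read granularity
    out = []
    for bank in range(W // g):       # logical banks, left to right
        cells = sum(blocks[bank * g:(bank + 1) * g], [])  # blocks end to end
        out.extend(cells[addr:addr + g])
    return out

print(read(3, 0))   # granularity 1: cell 3 of every block -> [3, 11, 19, 27]
print(read(4, 2))   # granularity 4: cells 4..7 of the one bank -> [4, 5, 6, 7]
```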
  • During a write operation, the data arriving from the memory read/write port 501 is split into W/g shares, each of bit width g, and the i-th share is sent through the data strobe network 502
  • to the i-th logical bank 606 (0 ≤ i < W/g), while the read/write address and the read/write granularity are sent to every logical bank 606; each logical bank 606 writes g memory cells.
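  • Continuing the same assumed model (reusing W, NBLK, and blocks from the read sketch above), the write path splits the port data into W/g shares of g cells:

```python
def write(addr: int, k: int, data):   # data: W cells arriving at port 501
    g = 1 << k
    for bank in range(W // g):        # one share of g cells per logical bank
        share = data[bank * g:(bank + 1) * g]
        for j, value in enumerate(share):
            blk, off = divmod(addr + j, NBLK)     # position inside the bank
            blocks[bank * g + blk][off] = value

write(0, 1, ["a", "b", "c", "d"])     # g = 2: two banks, two cells each
print(blocks[0][:2], blocks[2][:2])   # ['a', 'b'] ['c', 'd']
```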

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Discrete Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Memory System (AREA)

Abstract

Disclosed are a parallel bit reversal device and method. The parallel bit reversal device includes a parallel bit reversal unit (314), a butterfly computation and control unit (309), and a memory (311); the butterfly computation and control unit (309) is connected to the memory (311) through a data bus (310), and the parallel bit reversal unit (314) is used to bit-reverse the butterfly group data computed by the butterfly computation and control unit (309). The parallel bit reversal unit (314) includes address reversal logic (306), which is connected to the butterfly computation and control unit (309) and used to perform mirror reversal and right-shift operations on the read address from the butterfly computation and control unit (309).

Description

Parallel bit reversal device and method. TECHNICAL FIELD. The present invention belongs to bit-reversed data storage and arrangement in integrated circuit design, is closely related to the fast Fourier transform (FFT) algorithm and integrated circuit structures, and specifically relates to a parallel bit reversal device and a parallel bit reversal method.
BACKGROUND. Signal processing systems frequently need to convert signal content between the time domain and the frequency domain, and the fast Fourier transform (FFT) algorithm performs such conversions between the two domains. Compared with other transform algorithms, the FFT has the advantages of a uniform structure and a low computational cost, and it is therefore widely used in signal processing systems.
The FFT algorithm takes N data as input and outputs N data; the time-domain-to-frequency-domain transform is generally called the forward transform, and the frequency-domain-to-time-domain transform the inverse transform. We call the order of time-domain data by sampling time the "natural order", i.e., the sampling time of each data point is increasing; likewise we call the order of frequency-domain data by frequency the "natural order", i.e., the frequency corresponding to each data point is increasing. Corresponding to the "natural order" is the "bit-reversed order": write the index of each data point in the natural order in binary form; after mirror-reversing that binary index, the resulting value is the index of the data point in the "bit-reversed order". For example, suppose an index value is represented with 3 bits: the natural-order index 3 is written in binary as (011)2; after mirror reversal its binary value is (110)2, i.e., 6 in decimal, so 6 is the index corresponding to that data point after the bit-reversed arrangement.
There are many implementations of the FFT algorithm, generally divided into decimation in time and decimation in frequency. FIG. 1 shows the decimation-in-frequency algorithm: the original data 100 is in natural order and must be permuted by the bit reversal operation 103 before the computation so that it is arranged in bit-reversed order; after computation on the reversed data 101, the output data 102 is in natural order. In the decimation-in-time algorithm, the input data is in natural order but the output data is in bit-reversed order; therefore, before the output data is processed further, it must be permuted into natural order.
Some patents have implemented bit-reversed data arrangement. For example, US Patent US 2003/0028571 A1 (Real-time method for bit-reversal of large size arrays) uses a two-step reversal method: the first step uses DMA to perform the large-size reversal between the external memory and the on-chip memory, and the second step uses the processor to implement the small-size internal reversal. The problems of this method arise when the processor performs the small-size internal reversal: first, the internal sorting requires multiple iterations, log2(N)-1 iterations for data of length N; second, each iteration can only read and write data in scalar units, so multiple data cannot be sorted in parallel. The bit-reversal method described in that patent is therefore very inefficient.
US Patent US 7,640,284 B1 (Bit Reversal Methods for a Parallel Processor) describes a method by which nVidia uses the multiple processing cores or SIMD execution units in a graphics processor (GPU) to implement parallel bit reversal. Although that invention achieves a parallel bit-reversed arrangement, the following problems remain: first, each processing core or execution unit must access a lookup table several times and perform several shift operations when computing a bit-reversed address, so merely computing the address takes multiple clock cycles and execution efficiency is low; second, as stated at page 21, line 10 of that patent, the method can only reduce, not eliminate, the memory access conflicts of the multiple processors or SIMD functional units, and these conflicts further reduce the execution efficiency of the bit reversal operation.
SUMMARY OF THE INVENTION
(1) Technical problem to be solved
The technical problem solved by the present invention is to improve the execution efficiency of the parallel bit reversal operation, so as to better support a multi-granularity parallel FFT computing device.
(2) Technical solution
The invention provides a parallel bit reversal device, comprising a parallel bit reversal unit, a butterfly computation and control unit, and a memory. The butterfly computation and control unit is connected to the memory through a data bus, and the parallel bit reversal unit is used to bit-reverse the butterfly group data computed by the butterfly computation and control unit. The parallel bit reversal unit includes address reversal logic, which is connected to the butterfly computation and control unit and used to perform mirror reversal and right-shift operations on the read address from the butterfly computation and control unit. The parallel bit reversal unit further includes a multi-granularity parallel memory and an address selector; the multi-granularity parallel memory is connected to the address selector to receive the address output by the address selector, and is also connected to the butterfly computation and control unit to receive the write data and the read/write granularity value output by the butterfly computation and control unit.
The parallel bit reversal unit further includes a data reversal network, connected to the multi-granularity parallel memory to receive the read data from the multi-granularity parallel memory.
The data reversal network is used for the bit-reversed arrangement of the data within a butterfly group.
The address selector is connected to the butterfly computation and control unit and to the address reversal logic, and selects the read/write address output to the memory.
The invention also provides a parallel bit reversal method for use in a parallel bit reversal device, the parallel bit reversal device comprising a parallel bit reversal unit and a memory, the parallel bit reversal unit comprising a multi-granularity parallel memory. The method includes the following steps: step 200, move the data from the memory to the multi-granularity parallel memory; after the move, the natural-order data is divided equally into 2^n groups, stored sequentially in the 2^n memory blocks of the multi-granularity parallel memory, where n is a positive integer; step 201, keep the intra-group index unchanged and reverse the data group index; step 202, keep the intra-group index unchanged and the data group index unchanged, and move the intra-group index to the high bits; step 203, keep the data group index unchanged and reverse the intra-group index.
The parallel bit reversal device also includes a butterfly computation and control unit, and the steps of the method are performed under the control of the butterfly computation and control unit.
The parallel bit reversal unit further includes address reversal logic and an address selector; the butterfly computation and control unit is connected to the address reversal logic through a shift indication line and a read address line.
In step 201, the value of the shift indication line is set to A-n, where A is the bit width of the read address line; at the same time the read address on the read/write address line is set to the data group index, and the address selector selects the output of the address reversal logic as the read/write address of the memory. In step 203, the data read from the multi-granularity parallel memory is bit-reversed by the data reversal network and then output to the butterfly computation and control unit.
After step 203, the output of the data reversal network yields 2^n bit-reversed data, on which butterfly computation can be performed directly.
(3) Beneficial effects
The invention stores natural-order data in a memory that supports multi-granularity parallel reads and writes, and reads multiple bit-reversed data in parallel in order of increasing address. The method first groups the data to be sorted and then sorts it using three principles: 1. the multi-granularity read/write capability of the memory realizes the reversal between the intra-group data and the data groups; 2. hardware logic realizes the reversed arrangement of the data within a group; 3. different read/write addresses realize the reversal between data groups. The entire bit reversal process of the invention reads the memory only once and involves no iteration; the processor can directly perform the next operation, such as the butterfly computation in the FFT, on the bit-reversed data that is read. Meanwhile, the method has no access conflicts when reading the memory in parallel, and the read-and-sort process can be deeply pipelined, giving extremely high execution efficiency; in addition, the parallel granularity can be flexibly specified according to the specific implementation.
BRIEF DESCRIPTION OF THE DRAWINGS. FIG. 1 is a flow chart of the radix-2 decimation-in-frequency FFT algorithm for a data length of 8; FIG. 2 is a flowchart of the parallel bit reversal method of the present invention;
FIG. 3 is a schematic structural diagram of the parallel bit reversal device of the present invention;
FIG. 4 is a schematic diagram of the distribution of the data in the memory before the bit reversal operation, when the parallel granularity is 4 and the data length is 32;
FIG. 5 is a schematic diagram of the logical structure of the multi-granularity parallel memory of the present invention;
FIG. 6 is a schematic diagram of the addressing mode and the logical bank partitioning of the multi-granularity parallel memory of the present invention at different read/write granularities. DETAILED DESCRIPTION. To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
To implement the parallel bit-reversed arrangement, first define the parallel granularity W = 2^n (n a positive integer); the parallel granularity is the number of data read from the memory and sorted in parallel. Before the bit reversal operation, the data is grouped, every W data forming one group. For data of length N = 2^T, each data index requires T bits; after grouping, the index splits into two parts: the data group index g and the intra-group data index b. When the parallel granularity is 2^n, the 2^T data can be divided into 2^(T-n) data groups in total. Let T - n = G; then G bits represent the data group index g = (g_{G-1}...g_1g_0) and n bits represent the intra-group index b = (b_{n-1}...b_1b_0). The binary representation of each data index is the concatenation of g and b; in the natural order the data group index occupies the high bits and the intra-group index b the low bits, namely:

natural-order data address = (g_{G-1}...g_1g_0 b_{n-1}...b_1b_0)   (Expression 1)

and the corresponding bit-reversed data address is:

bit-reversed data address = (b_0b_1...b_{n-1} g_0g_1...g_{G-1})   (Expression 2)

FIG. 3 shows the structure of a specific embodiment of the parallel bit reversal device according to the present invention. The parallel bit reversal device includes a parallel bit reversal unit 314, a butterfly computation and control unit 309, and a mass storage 311.
The butterfly computation and control unit 309 is connected to the mass storage 311 through the data bus 310. The parallel bit reversal unit 314 includes a multi-granularity parallel memory 300, address reversal logic 306, an address selector 308, and a data reversal network 301. When the parallel granularity is set to 2^n, the multi-granularity parallel memory 300 is composed of 2^n memory blocks.
The multi-granularity parallel memory 300 has an address line 313, which is connected to the output of the address selector 308 and receives the address output by the address selector.
The multi-granularity parallel memory 300 also has a write enable signal line 307, a write data line 316, and a read/write granularity indication line 304, all connected to the butterfly computation and control unit 309 to receive, respectively, the write enable signal, the write data, and the read/write granularity value output by the butterfly computation and control unit 309.
The multi-granularity parallel memory 300 also has a read data line 312, which is connected to the input of the data reversal network 301 and outputs the read data to the data reversal network.
The output 302 of the data reversal network 301 is connected to the butterfly computation and control unit 309 and is used for the reversed arrangement of the data within a group. When the parallel granularity of the parallel bit reversal device 314 is 2^n, the number of input data on the read data line 312 of the data reversal network 301 is 2^n, and the number of output data at the output terminal 302 is also 2^n. If the input data vector is X with left-to-right index i (0 ≤ i < 2^n), X[i] denoting the i-th input datum; the output data vector is Y with left-to-right index j (0 ≤ j < 2^n), Y[j] denoting the j-th output datum; and br(i) denotes the mirror bit reversal of i over n bits, then the correspondence between X and Y in the data reversal network 301 is Y[i] = X[br(i)].
One input of the address selector 308 is connected to the read/write address line 303 of the butterfly computation and control unit 309, and the other input is connected to the output of the address reversal logic 306, for selecting the read/write address output to the memory 300.
One input of the address reversal logic 306 is connected to the shift indication line 305 of the butterfly computation and control unit 309, and the other input is connected to the read/write address line 303 of the butterfly computation and control unit 309, for performing bitwise mirror reversal and right-shift operations on the binary representation of the read address input on the read/write address line 303. The read address has a fixed bit width, determined by the maximum data length M supported by the parallel bit reversal unit 314 and by the parallel granularity 2^n: the read-address bit width is A = log2(M) - n. When the data group width g is smaller than A, the high bits are padded with zeros to form the read address. The address reversal logic 306 then right-shifts the mirror-reversed result by a shift distance of A - g. For example, if the bit width of the read address 303 is A = 16 and the data group width is g = 3, i.e., the group index is represented as (g2g1g0), then the input on the read address line 303 is (0000 0000 0000 0g2g1g0); after mirror reversal it becomes (g0g1g2 0 0000 0000 0000); with a shift distance of 13 the shifted result is (0000 0000 0000 0g0g1g2), which is the output of the address reversal logic 306.
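A minimal sketch of this mirror-and-shift computation, using the A = 16, 3-bit group-index example above (function names are illustrative, not from the patent):

```python
A = 16                                  # read-address bit width

def br(x: int, bits: int) -> int:
    return int(format(x, f"0{bits}b")[::-1], 2)

def address_reverse(read_addr: int, shift: int) -> int:
    # Mirror the A-bit read address, then shift right by the value on the
    # shift indication line (here A minus the group-index width).
    return br(read_addr, A) >> shift

g = 0b011                               # group index (g2 g1 g0) = (0, 1, 1)
print(bin(address_reverse(g, A - 3)))   # 0b110, i.e. (g0 g1 g2)
```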
The bit reversal operation can be abstracted as a change of the data address (g_{G-1}...g_1g_0 b_{n-1}...b_1b_0): the original data address is as shown in (Expression 1), and the reversed data address is as shown in (Expression 2). The method of performing parallel bit reversal on the natural-order data address (g_{G-1}...g_1g_0 b_{n-1}...b_1b_0) can be decomposed into four steps, as shown in FIG. 2. Step 200: initialization. The butterfly computation and control unit 309 moves the data from the mass storage 311 to the multi-granularity parallel memory 300, using the data bus 310 and the write data line 315 to transfer the data. At this time, the address selector 308 selects the read address on the read/write address line 303 as its output, the granularity on the read/write granularity indication line 304 is 2^n, and the level on the write enable signal line 307 is high. After the move, the natural-order data is divided equally into 2^n groups, stored sequentially in the 2^n memory blocks 400 of the multi-granularity parallel memory 300, as shown in FIG. 4; FIG. 4 assumes a data length N = 32 and a parallel granularity of 4. Step 201: reversal between data groups. This step reverses the data group index, so that the group index g = (g_{G-1}...g_1g_0) becomes (g_0g_1...g_{G-1}).
In this process, the butterfly computation and control unit 309 sets the value of the shift indication line 305 to A-n, where A is the bit width of the read address line 303. At the same time the read address on the read/write address line 303 is set to the data group index g = (g_{G-1}...g_1g_0), and the address selector 308 selects the output of the address reversal logic 306 as the read/write address of the memory 300. Step 202: reversal between the data group and the intra-group data. This step keeps the intra-group index unchanged and the data group index unchanged, but moves the intra-group index to the high bits, so that the data address changes from (g_0g_1...g_{G-1} b_{n-1}...b_1b_0) to (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}).
In this process, the granularity on the read/write granularity indication line 304 in FIG. 3 is set to 1, the level on the write enable signal line 307 is set to low, and a group of data is read from the memory at the address on the address line 313.
The effect of this operation is to move the intra-group index to the high bits. The principle can be explained as follows: the natural-order data is divided into 2^n parts stored in the 2^n memory blocks 500; therefore, the natural-order index values of the data at the same position in each memory block 500 differ by N/2^n = 2^G. When the read granularity 304 is set to 1, the read behavior takes one datum from the storage unit at the same position in each memory block 500 and assembles them into one data group; this is equivalent to extracting one datum from the natural-order data every N/2^n = 2^G data and assembling the extracted data into one group. If the intra-group number of the read data is b = (b_{n-1}...b_1b_0) and the extraction start address is g = (g_0g_1...g_{G-1}), then extracting one datum every 2^G data means the extraction address of each datum is:

b×2^G + g = (b«G) | g = (b_{n-1}...b_1b_0 g_0g_1...g_{G-1})

This expression is exactly the data address (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}) expected after step 201. Step 203: reversal of the data within a group. This step keeps the data group index unchanged and reverses the intra-group index, so that the data address changes from (b_{n-1}...b_1b_0 g_0g_1...g_{G-1}) to (b_0b_1...b_{n-1} g_0g_1...g_{G-1}).
This step uses the data reversal network 301: after the data read from the multi-granularity parallel memory 300 through the read data line 312 passes through the data reversal network 301, the data address becomes (b_0b_1...b_{n-1} g_0g_1...g_{G-1}); the data is output directly to the butterfly computation and control unit 309 via the output terminal 302, completing the parallel bit reversal of 2^n data.
After step 203, the 2^n data obtained at the output terminal 302 are the bit-reversed data, on which butterfly computation can be performed directly. The multi-granularity parallel memory
The specific structure of the multi-granularity parallel memory 300 included in the parallel bit reversal device of the present invention is described below with reference to FIG. 5 and FIG. 6.
For ease of description, data bit widths are measured in units of memory cells; a memory cell is defined as the addressing unit of the memory and is also the minimum data width the memory can read or write. Any statement containing "bit width W" should be understood as the bits of W memory cells; for example, if the memory cell is an 8-bit byte, a memory whose read/write port width is 4 actually has a width of 4×8 = 32 bits. All objects are numbered from 0, from left to right. In addition, as noted above, "granularity" refers to the number of memory cells with consecutive addresses. The following symbols are used below:
■ W: the memory read/write port width, which must be a power of 2 (W = 2^n, n a natural number);
■ K: K = log2(W); K+1 is the number of read/write granularities the memory supports;
■ k: the read/write granularity parameter, a natural number with 0 ≤ k ≤ K; the actual read/write granularity is g = 2^k;
■ g: g = 2^k, the memory read/write granularity, 1 ≤ g ≤ W;
■ N: the size of one memory block.
W = 4 is assumed in the schematic diagrams of the present invention, but the invention applies to any other W that is a power of 2.
As shown in FIG. 5, the multi-granularity parallel memory consists of W memory blocks 505 and one data strobe network 502. Each memory block 505 is a two-dimensional array of memory cells 503; each memory row 504 in the array must contain W memory cells 503, and each memory block can read or write one memory row 504 at a time.
The data strobe network 502 logically selects W memory cells 503 from the W memory blocks 505 as the read/write objects, based on the read/write address and the read/write granularity.
The memory of the present invention supports multiple read/write granularities, and the starting address of each memory block 505 differs under different read/write granularities. The parameter k characterizes the different granularities; the actual read/write granularity is g = 2^k.
FIG. 6 shows the addressing of each memory block 605 when W = 4, under different read/write granularities. For read/write granularity g, every g adjacent memory blocks 605 are spliced into one logical bank 606, and all logical banks 606 share the same starting address; within a logical bank 606 the memory blocks 605 are addressed consecutively end to end, so each logical bank 606 has an addressing range of 0 to gN-1, as does the entire memory.
During a read operation, the memory sends the read/write address and the read/write granularity to every logical bank 606; each logical bank 606 reads g memory cells and passes them through the data strobe network 502 to the memory read/write port 501, and the data read by the W/g logical banks 606 is spliced, from left to right, into output data of bit width W.
During a write operation, the data passed in from the memory read/write port 501 is split into W/g shares, each of bit width g; the data strobe network 502 sends the i-th share to the i-th logical bank 606 (0 ≤ i < W/g), while the read/write address and the read/write granularity are sent to every logical bank 606, and each logical bank 606 writes g memory cells. The specific embodiments described above further explain in detail the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

CLAIMS
1. A parallel bit reversal device, comprising a parallel bit reversal unit (314), a butterfly computation and control unit (309), and a memory (311), the butterfly computation and control unit (309) being connected to the memory (311) through a data bus (310), characterized in that:
the parallel bit reversal unit (314) is used to bit-reverse the butterfly group data computed by the butterfly computation and control unit (309);
the parallel bit reversal unit (314) includes address reversal logic (306), which is connected to the butterfly computation and control unit (309) and used to perform mirror reversal and right-shift operations on the read address from the butterfly computation and control unit (309);
the parallel bit reversal unit (314) further includes a multi-granularity parallel memory (300) and an address selector (308);
the multi-granularity parallel memory (300) is connected to the address selector (308) to receive the address output by the address selector (308);
the multi-granularity parallel memory (300) is also connected to the butterfly computation and control unit (309) to receive the write data and the read/write granularity value output by the butterfly computation and control unit (309).
2. The parallel bit reversal device of claim 1, characterized in that
the parallel bit reversal unit (314) further includes a data reversal network (301), which is connected to the multi-granularity parallel memory (300) to receive the read data from the multi-granularity parallel memory (300).
3. The parallel bit reversal device of claim 2, characterized in that the data reversal network (301) is used for the bit-reversed arrangement of the data within a butterfly group.
4. The parallel bit reversal device of claim 2, characterized in that
the address selector (308) is connected to the butterfly computation and control unit (309) and to the address reversal logic (306), and is used to select the read/write address output to the memory (300).
5. A parallel bit reversal method for use in a parallel bit reversal device, the parallel bit reversal device comprising a parallel bit reversal unit (314) and a memory (311), the parallel bit reversal unit (314) comprising a multi-granularity parallel memory (300), characterized in that the method includes the following steps:
step (200), transferring the data from the memory (311) to the multi-granularity parallel memory (300), the natural-order data after the transfer being divided equally into 2^n groups stored sequentially in the 2^n memory blocks of the multi-granularity parallel memory (300), where n is a positive integer;
step (201), keeping the intra-group index unchanged and reversing the data group index; step (202), keeping the intra-group index unchanged and the data group index unchanged, and moving the intra-group index to the high bits;
step (203), keeping the data group index unchanged and reversing the intra-group index.
6. The parallel bit reversal method of claim 5, characterized in that
the parallel bit reversal device further includes a butterfly computation and control unit (309), and the steps of the method are performed under the control of the butterfly computation and control unit (309).
7. The parallel bit reversal method of claim 6, characterized in that
the parallel bit reversal unit (314) further includes address reversal logic (306) and an address selector (308);
the butterfly computation and control unit (309) is connected to the address reversal logic (306) through a shift indication line (305) and a read address line (303).
8. The parallel bit reversal method of claim 7, characterized in that
in step (201), the value of the shift indication line (305) is set to A-n, where A is the bit width of the read address line (303); at the same time the read address on the read/write address line (303) is set to the data group index, and the address selector (308) selects the output of the address reversal logic (306) as the read/write address of the memory (300).
9. The parallel bit reversal method of claim 8, characterized in that
in step (203), the data read from the multi-granularity parallel memory (300) is bit-reversed by the data reversal network (301) and then output to the butterfly computation and control unit (309).
10. The parallel bit reversal method of claim 9, characterized in that
after step (203), the output of the data reversal network (301) yields 2^n bit-reversed data, on which butterfly computation can be performed directly.
PCT/CN2011/085177 2011-12-31 2011-12-31 Parallel bit reversal device and method WO2013097235A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/118,452 US9268744B2 (en) 2011-12-31 2011-12-31 Parallel bit reversal devices and methods
PCT/CN2011/085177 WO2013097235A1 (zh) 2011-12-31 2011-12-31 Parallel bit reversal device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2011/085177 WO2013097235A1 (zh) 2011-12-31 2011-12-31 Parallel bit reversal device and method

Publications (1)

Publication Number Publication Date
WO2013097235A1 true WO2013097235A1 (zh) 2013-07-04

Family

ID=48696288

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/085177 WO2013097235A1 (zh) 2011-12-31 2011-12-31 并行位反序装置和方法

Country Status (2)

Country Link
US (1) US9268744B2 (zh)
WO (1) WO2013097235A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9262378B2 (en) * 2011-12-31 2016-02-16 Institute Of Automation, Chinese Academy Of Sciences Methods and devices for multi-granularity parallel FFT butterfly computation
WO2013097219A1 (zh) * 2011-12-31 2013-07-04 Institute Of Automation, Chinese Academy Of Sciences Data access method and device for parallel FFT computation
WO2017111881A1 (en) * 2015-12-21 2017-06-29 Intel Corporation Fast fourier transform architecture
US10476525B2 (en) * 2017-01-09 2019-11-12 Qualcomm Incorporated Low latency bit-reversed polar codes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028571A1 (en) * 2001-07-09 2003-02-06 Dongxing Jin Real-time method for bit-reversal of large size arrays
CN101072218A (zh) * 2007-03-01 2007-11-14 Huawei Technologies Co., Ltd. FFT/IFFT paired processing system and method, and device and method thereof
CN101520769A (zh) * 2009-04-10 2009-09-02 炬才微电子(深圳)有限公司 Data processing method and system
US7640284B1 (en) * 2006-06-15 2009-12-29 Nvidia Corporation Bit reversal methods for a parallel processor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051239B2 (en) * 2007-06-04 2011-11-01 Nokia Corporation Multiple access for parallel turbo decoder
DE602007002558D1 (de) * 2007-06-28 2009-11-05 Ericsson Telefon Ab L M Verfahren und Vorrichtung zur Transformationsberechnung
US8612505B1 (en) * 2008-07-14 2013-12-17 The Mathworks, Inc. Minimum resource fast fourier transform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028571A1 (en) * 2001-07-09 2003-02-06 Dongxing Jin Real-time method for bit-reversal of large size arrays
US7640284B1 (en) * 2006-06-15 2009-12-29 Nvidia Corporation Bit reversal methods for a parallel processor
CN101072218A (zh) * 2007-03-01 2007-11-14 Huawei Technologies Co., Ltd. FFT/IFFT paired processing system and method, and device and method thereof
CN101520769A (zh) * 2009-04-10 2009-09-02 炬才微电子(深圳)有限公司 Data processing method and system

Also Published As

Publication number Publication date
US9268744B2 (en) 2016-02-23
US20140089370A1 (en) 2014-03-27

Similar Documents

Publication Publication Date Title
US7640284B1 (en) Bit reversal methods for a parallel processor
US7836116B1 (en) Fast fourier transforms and related transforms using cooperative thread arrays
WO2017000756A1 (zh) Data processing method, processor and storage medium based on 3072-point fast Fourier transform
CN103955447B (zh) FFT accelerator based on a DSP chip
CN105912501B (zh) Method and system for implementing the SM4-128 encryption algorithm based on a large-scale coarse-grained reconfigurable processor
JP2010521728A (ja) Circuit for data compression and processor using the same
US9176929B2 (en) Multi-granularity parallel FFT computation device
US9317481B2 (en) Data access method and device for parallel FFT computation
WO2013187862A1 (en) A FAST MECHANISM FOR ACCESSING 2n±1 INTERLEAVED MEMORY SYSTEM
US9262378B2 (en) Methods and devices for multi-granularity parallel FFT butterfly computation
WO2013097235A1 (zh) Parallel bit reversal device and method
US9098449B2 (en) FFT accelerator
WO2013097436A1 (zh) FFT/DFT reverse-order arrangement system and method and computing system thereof
CN106021171A (zh) Method and system for implementing SM4-128 key expansion based on a large-scale coarse-grained reconfigurable processor
CN111221501B (zh) Number theoretic transform circuit for large-number multiplication
CN102411557B (zh) Multi-granularity parallel FFT computation device
CN109669666B (zh) Multiply-accumulate processor
Ma et al. Accelerating SVD computation on FPGAs for DSP systems
CN111368250B (zh) Data processing system, method and device based on Fourier transform/inverse transform
CN109558567B (zh) Storage device and parallel reading method for the upper-triangular part of a self-conjugate matrix
CN109614582B (zh) Storage device and parallel reading method for the lower-triangular part of a self-conjugate matrix
CN102622318B (zh) Memory control circuit and vector data addressing method controlled by the same
TWI402695B (zh) Split-radix 2/8 fast Fourier transform device and method
Ye et al. Design and implementation of a conflict-free memory accessing technique for FFT on multicluster VLIW DSP
Zhang et al. Small area high speed configurable FFT processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11878361

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14118452

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11878361

Country of ref document: EP

Kind code of ref document: A1