CN115080503A

CN115080503A - Systolic array reconfigurable processor for FFT (fast Fourier transform) radix-module mapping

Info

Publication number: CN115080503A
Application number: CN202210894357.3A
Authority: CN
Inventors: 徐安林; 张强; 刘念; 梁小虎; 郝万宏; 陈昊; 杨欢
Original assignee: 63921 Troops of PLA
Current assignee: 63921 Troops of PLA
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2022-09-20

Abstract

The invention relates to a systolic array reconfigurable processor for FFT (fast Fourier transform) radix-module mapping, comprising a reconfigurable processing unit array, a shared memory, a main controller and an on-chip memory. The reconfigurable processing unit array performs the FFT operation and comprises m × n reconfigurable processing units. The main controller parses the configuration packet and writes configuration information into the configuration memory in each reconfigurable processing unit; the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow. Each reconfigurable processing unit, and the interconnection among the units, can be configured independently, so the array can be dynamically divided into sub-arrays for algorithm-level parallel processing to achieve acceleration. The shared memory consists of multiple groups of memories and has two main functions: exchanging data with the on-chip memory, and storing the intermediate data generated by each stage of the FFT operation. The on-chip memory stores programs, configuration information and data.

Description

Systolic array reconfigurable processor for FFT (fast Fourier transform) radix-module mapping

Technical Field

The invention relates to the field of computer systems, and in particular to a systolic array reconfigurable processor with a multi-level storage structure for FFT (fast Fourier transform) radix-module mapping.

Background

With the rapid development of information technology, the demand for signal processing capability keeps growing in computation-intensive fields such as computing, communications and consumer electronics. The Fast Fourier Transform (FFT) is widely used as an important tool for analyzing and processing digital signals. However, the FFT is computationally expensive and time-consuming to execute. In fields such as scientific computing and image processing in particular, fixed-point data cannot meet the precision requirement and a floating-point format is needed, so the large number of floating-point complex multiplications imposes a heavy computational burden. In the era of the Internet of Everything, computational efficiency is one of the key measures of system performance, and insufficient efficiency forces system designs to compromise on precision, real-time performance and similar aspects. New application scenarios and requirements continue to emerge, each with a different FFT point count, which places higher demands on system flexibility. An FFT accelerator that combines high computational efficiency with strong flexibility is therefore of great significance.

The existing FFT acceleration methods are mainly divided into two categories:

(1) Methods based on software optimization

Methods based on software optimization are generally implemented on general-purpose platforms such as CPUs and GPUs, and rely on a deep understanding of the pipeline mechanism and memory architecture of the target platform. Although such methods have been highly optimized for their target platforms, they are limited by the platform's inherent memory access pattern and their computational efficiency remains low.

(2) Methods based on dedicated hardware design

Hardware-based methods are typically implemented on FPGAs or as ASICs. Because the storage architecture can be designed specifically for the task, hardware-based methods can achieve higher performance. Owing to its parallelism, the FPGA was once considered the most promising solution, but its high energy consumption cannot meet the needs of power-sensitive applications. ASIC-based schemes offer high area efficiency and energy efficiency, but because the circuit function is fixed they support only a single application, lack flexibility, are costly to design, and cannot keep pace with the iteration speed of emerging applications.

In summary, none of the above solutions can simultaneously satisfy the requirements of computational efficiency, area efficiency, energy efficiency, real-time performance and flexibility.

Disclosure of Invention

To solve this problem, the invention provides a systolic array reconfigurable processor with a multi-level storage structure for FFT radix-module mapping, which uses a dynamically reconfigurable processor architecture, the CGRA (coarse-grained reconfigurable architecture), to accelerate the FFT. The CGRA tool chain uses a high-level language (such as C or C++), which shortens the development cycle. The reconfigurable units give the CGRA several levels of flexibility and parallelism. In addition, the CGRA is superior to the fine-grained FPGA in both energy efficiency and area efficiency.

The technical scheme of the invention is as follows: a systolic array reconfigurable processor for FFT radix-module mapping, comprising:

the reconfigurable processing unit array, the main controller, the shared memory and the on-chip memory;

the reconfigurable processing unit array performs the FFT operation and comprises m × n reconfigurable processing units, where m is the number of rows and n is the number of columns;

the main controller parses the configuration packet and writes configuration information into the configuration memory in each reconfigurable processing unit; the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow; each reconfigurable processing unit, and the interconnection among the units, can be configured independently, so the reconfigurable processing unit array can be dynamically divided into sub-arrays for algorithm-level parallel processing to achieve acceleration;

the shared memory comprises a plurality of groups of memories and is used for carrying out data interaction with the on-chip memory and storing intermediate data generated by each stage of FFT operation;

the on-chip memory includes global and local registers for storing programs, configuration information and data.

In another aspect, the method by which the systolic array reconfigurable processor for FFT radix-module mapping performs the operation comprises the following steps:

firstly, the main controller moves the original data from the on-chip memory to the shared memory; once the data are ready, the main controller parses the configuration words and writes the configuration information of each reconfigurable unit into the corresponding local register; once all data and configuration information are ready, a timer is initialized and the reconfigurable processing unit array is started;

secondly, the reconfigurable processing unit array reads the configuration information and determines the number of iterations; some of the reconfigurable processing units read the original data from the shared memory, and each reconfigurable processing unit reads its configuration information and executes the specified operation; one iteration of the array is finished when all reconfigurable processing units have completed their operations; execution continues until all iterations are completed; the timer is then stopped and the number of clock cycles is recorded; during the FFT operation, the intermediate data generated by each stage are stored in the shared memory;

thirdly, some of the reconfigurable processing units write the FFT results into the shared memory, and the results are then written into the on-chip memory.
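The three steps above can be sketched as a small host-side model. This is only an illustration under stated assumptions: `run_job` and `array_step` are hypothetical names, plain Python lists stand in for the on-chip and shared memories, and the counter counts iterations rather than real clock cycles.

```python
from typing import Callable, List, Tuple


def run_job(on_chip: List[complex],
            iterations: int,
            array_step: Callable[[int, List[complex]], List[complex]]
            ) -> Tuple[List[complex], int]:
    """Model the controller-driven flow: load, iterate, write back."""
    # Step 1: the controller copies the raw data from on-chip memory
    # into shared memory (modelled here as a list copy).
    shared = list(on_chip)
    # The timer is initialised just before the array is started.
    elapsed = 0
    # Step 2: the array runs the configured number of iterations;
    # intermediate data stay in shared memory between iterations.
    for i in range(iterations):
        shared = array_step(i, shared)
        elapsed += 1  # stand-in for the clock cycles a real timer records
    # Step 3: the results sit in shared memory and are copied back on chip.
    return list(shared), elapsed
```

A caller would pass the per-iteration behaviour of the array as `array_step`, for example one FFT stage per iteration.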

Beneficial effects:

the invention provides a systolic array reconfigurable processor with a multi-level storage structure for FFT (fast Fourier transform) radix-module mapping, which effectively improves the computational efficiency of FFT operations in floating-point format and is particularly suited to applications that require high precision and strong real-time performance; simply by increasing the capacity of the shared memory, FFT operations with larger point counts can be processed without changing any other hardware module, so the design is highly scalable.

Drawings

FIG. 1 is a block diagram of a reconfigurable processor architecture according to the present invention;

FIG. 2 shows the mapping module of the radix-4 computation kernel;

FIG. 3 is a diagram of sub-array-based multi-point FFT mapping.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

According to an embodiment of the invention, a systolic array reconfigurable processor with a multi-level storage structure for FFT radix-module mapping is provided. The invention optimizes for the characteristics of the FFT algorithm and for memory access bandwidth, and overall provides an FFT hardware acceleration scheme with high computational efficiency and strong scalability. The reconfigurable processor comprises:

main modules including the reconfigurable processing unit array, the shared memory, the main controller and the on-chip memory, as shown in FIG. 1.

The reconfigurable processing unit array performs the FFT operation and comprises m × n reconfigurable processing units.

The main controller parses the configuration packet and writes the configuration information into the configuration memory in each reconfigurable processing unit, and the reconfigurable processing units execute the corresponding operations driven jointly by the data flow and the configuration flow. Each reconfigurable processing unit, and the interconnection among the units, can be configured independently, so the reconfigurable processing unit array can be dynamically divided into sub-arrays for algorithm-level parallel processing, thereby achieving acceleration.
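As a minimal sketch of dividing the m × n array into sub-arrays, the following groups PE coordinates into equal rectangular tiles. The function name `partition_array` and the restriction to equal-sized rectangular tiles are assumptions made for illustration; the text does not specify how sub-array boundaries are chosen.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (row, column) of one processing unit


def partition_array(m: int, n: int,
                    sub_rows: int, sub_cols: int) -> List[List[Coord]]:
    """Split an m x n PE grid into sub-arrays of sub_rows x sub_cols PEs."""
    if m % sub_rows or n % sub_cols:
        raise ValueError("sub-array shape must tile the PE array exactly")
    subarrays: List[List[Coord]] = []
    for r0 in range(0, m, sub_rows):          # top row of each tile
        for c0 in range(0, n, sub_cols):      # left column of each tile
            tile = [(r, c)
                    for r in range(r0, r0 + sub_rows)
                    for c in range(c0, c0 + sub_cols)]
            subarrays.append(tile)
    return subarrays
```

Each returned tile could then be given its own configuration and run an independent part of the algorithm in parallel.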

According to an embodiment of the invention, when the system architecture is designed, the processing architecture of the reconfigurable processor is dynamically reorganized in real time to suit large-point FFT operations. Independent processing units are configured into a systolic array by means of configuration information, and the ways in which the reconfigurable processing unit array is cut and spliced are defined in software, forming a variety of systolic array architectures suited to different algorithms. Here, a large point count means, for example, an FFT of 128K or 256K points:

1. For example, when a 256K-point FFT is folded into two dimensions, a matrix transposition arises during the computation; to maximize FFT efficiency, ping-pong buffering is required so that memory access and the 1-dimensional FFT computation can overlap.

2. Because the butterfly computations of the FFT access memory at non-contiguous addresses, the shared memory serving the PE array is optimized specifically for this pattern, including the number of banks and the bank bit width.
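The two-dimensional folding in item 1 can be illustrated with the standard decomposition of an N-point transform, N = n1 × n2, into row transforms, a twiddle multiplication, a transposition and column transforms. The sketch below is a software illustration only: `dft` is a naive reference transform standing in for the 1-dimensional FFT kernel, `folded_fft` is a hypothetical name, and ping-pong buffering is not modelled.

```python
import cmath
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(n^2) DFT, used here as the 1-D kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]


def folded_fft(x: List[complex], n1: int, n2: int) -> List[complex]:
    """Compute an (n1*n2)-point DFT by folding the data into n1 rows."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Fold: rows[r][c] = x[r + n1*c].
    rows = [[x[r + n1 * c] for c in range(n2)] for r in range(n1)]
    # Step 1: n2-point transform along every row.
    rows = [dft(row) for row in rows]
    # Step 2: twiddle factor W_N^(r*k2) on element (r, k2).
    rows = [[rows[r][k2] * cmath.exp(-2j * cmath.pi * r * k2 / n)
             for k2 in range(n2)] for r in range(n1)]
    # Step 3: matrix transposition, so each column becomes contiguous.
    cols = [list(col) for col in zip(*rows)]
    # Step 4: n1-point transform along every former column.
    cols = [dft(col) for col in cols]
    # Unfold: output index is n2*k1 + k2.
    out = [0j] * n
    for k2 in range(n2):
        for k1 in range(n1):
            out[n2 * k1 + k2] = cols[k2][k1]
    return out
```

For a 256K-point transform one possible folding is n1 = n2 = 512; the choice of factors is a design decision the text leaves open.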

The shared memory consists of multiple groups of memories and has two main functions: exchanging data with the on-chip memory, and storing the intermediate data generated by each stage of the FFT operation. By increasing the capacity of the shared memory, the accelerator can process FFT operations with larger point counts, which facilitates later expansion.
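To show why the bank organization matters for butterfly access, the sketch below counts radix-4 butterflies whose four operands fall into the same bank under two address-to-bank mappings. The four-bank setup and both mapping functions are assumptions chosen for illustration. The digit-sum mapping is a well-known conflict-free scheme for radix-4 access and is not stated to be the one used in the invention.

```python
from typing import Callable, Iterator, List

NUM_BANKS = 4  # assumed: one bank per radix-4 butterfly operand


def butterfly_groups(n: int) -> Iterator[List[int]]:
    """Yield the four operand addresses of every radix-4 butterfly.

    n must be a power of 4; strides run n/4, n/16, ..., 1.
    """
    stride = n // 4
    while stride >= 1:
        for base in range(0, n, 4 * stride):
            for j in range(stride):
                yield [base + j + q * stride for q in range(4)]
        stride //= 4


def bank_interleaved(addr: int) -> int:
    """Plain low-order interleaving: bank = address mod 4."""
    return addr % NUM_BANKS


def bank_digit_sum(addr: int) -> int:
    """Bank = sum of the base-4 digits of the address, mod 4."""
    total = 0
    while addr:
        total += addr % 4
        addr //= 4
    return total % NUM_BANKS


def count_conflicts(n: int, bank_of: Callable[[int], int]) -> int:
    """Count butterflies whose operands do not hit four distinct banks."""
    return sum(1 for group in butterfly_groups(n)
               if len({bank_of(a) for a in group}) < 4)
```

For a 64-point transform, plain interleaving leaves 32 of the 48 butterflies with bank conflicts, while the digit-sum mapping leaves none, because the four operands of a butterfly differ in exactly one base-4 digit.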

According to an embodiment of the invention, a hierarchical data storage system is designed to improve data access efficiency. The architecture spans three levels: the system, the reconfigurable processing unit array, and the reconfigurable processing unit. The physical units that provide data access at these levels are, respectively, the shared memory, the global registers and the local registers. The global registers mainly store data and parameters directed at multiple reconfigurable processing units. The local registers mainly store intermediate data inside a reconfigurable processing unit and can be accessed only by that unit.

The invention thus derives a three-level storage structure from an analysis of the characteristics of the FFT. Input and output data, intermediate data and the like are stored at different levels, and together with the hardware architecture this allows the FFT algorithm to be executed quickly.

The main controller controls the operation of the whole system, including the configuration and data of the reconfigurable processing unit array and the movement of data between the shared memory and the on-chip memory. The on-chip memory stores programs, configuration information and data.
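The three storage levels can be pictured as nested data structures. The class and field names below are hypothetical and capacities are not modelled; the sketch only captures the visibility rules described above, in which local registers are private to one unit, global registers are shared across the array, and shared memory sits at the system level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessingElement:
    # Level 3: intermediate data private to this processing unit.
    local_regs: Dict[str, complex] = field(default_factory=dict)


@dataclass
class PEArray:
    rows: int
    cols: int
    # Level 2: data and parameters visible to every unit in the array.
    global_regs: Dict[str, complex] = field(default_factory=dict)
    pes: List[List[ProcessingElement]] = field(init=False)

    def __post_init__(self) -> None:
        # One independent ProcessingElement per grid position.
        self.pes = [[ProcessingElement() for _ in range(self.cols)]
                    for _ in range(self.rows)]


@dataclass
class Processor:
    array: PEArray
    # Level 1: shared memory as several groups (banks) of words.
    shared_memory: List[List[complex]] = field(default_factory=list)
```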

The invention provides an FFT mapping mechanism. Given the characteristics of the FFT algorithm, the butterfly is usually radix-2 or radix-4. The invention designs a modular FFT mapping scheme based on the characteristics of the architecture, in which the processing unit array is divided into several sub-arrays to achieve algorithm-level parallel processing. The FFT is divided into several radix-2 or radix-4 sub-modules according to its number of points; the implementation of the radix-4 sub-module is shown in FIG. 2. FFT mappings for different point counts are then obtained by splicing these radix modules together. FIG. 3 illustrates the FFT mapping for multiple point counts. The mapping yields the following results:
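One simple way to divide an FFT into radix-4 and radix-2 modules by point count is to use as many radix-4 stages as possible and add one radix-2 stage when the exponent is odd. This policy, and the names `plan_stages` and `butterflies_per_stage`, are illustrative assumptions. The text does not state which splitting rule is used.

```python
from typing import List, Tuple


def plan_stages(n: int) -> List[int]:
    """Return the radix of each stage for an n-point FFT (n a power of 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    exponent = n.bit_length() - 1          # n == 2 ** exponent
    radix4_stages, radix2_stages = divmod(exponent, 2)
    return [4] * radix4_stages + [2] * radix2_stages


def butterflies_per_stage(n: int) -> List[Tuple[int, int]]:
    """Pair each stage radix with the number of butterflies in that stage."""
    return [(radix, n // radix) for radix in plan_stages(n)]
```

Under this policy a 256K-point FFT (2^18) uses nine radix-4 stages, and a 128K-point FFT (2^17) uses eight radix-4 stages plus one radix-2 stage.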

TABLE 1 Performance of the proposed FFT architecture for different number of points

The simulation data in Table 1 show that the proposed architecture flexibly supports FFT operations with larger point counts and is highly scalable; FFTs of 1K to 256K points can be implemented. According to one embodiment of the invention, the typical application requirements of the target field are analyzed first, the range of FFT point counts is determined, and the capacity of the shared memory is then fixed by weighing factors such as area and power consumption. With the hardware architecture fixed, the processing steps are illustrated with an N-point FFT, where N is an integer power of 4 and a radix-4 FFT algorithm is used, requiring log₄ N stages of FFT operation in total.
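A radix-4 FFT for N an integer power of 4 can be sketched in software as follows. This is a plain recursive decimation-in-time formulation written for clarity, not the hardware mapping described here; `fft_radix4` and `num_stages` are hypothetical names.

```python
import cmath
from typing import List, Sequence


def fft_radix4(x: Sequence[complex]) -> List[complex]:
    """Radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 4:
        raise ValueError("length must be a power of 4")
    quarter = n // 4
    # Four interleaved sub-sequences x[4m + r], each transformed recursively.
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(quarter):
        # Twiddle factors W_N^(r*k) applied to the r-th sub-transform.
        t0 = subs[0][k]
        t1 = subs[1][k] * cmath.exp(-2j * cmath.pi * k / n)
        t2 = subs[2][k] * cmath.exp(-2j * cmath.pi * 2 * k / n)
        t3 = subs[3][k] * cmath.exp(-2j * cmath.pi * 3 * k / n)
        # Radix-4 butterfly: a 4-point DFT of (t0, t1, t2, t3).
        out[k] = t0 + t1 + t2 + t3
        out[k + quarter] = t0 - 1j * t1 - t2 + 1j * t3
        out[k + 2 * quarter] = t0 - t1 + t2 - t3
        out[k + 3 * quarter] = t0 + 1j * t1 - t2 - 1j * t3
    return out


def num_stages(n: int) -> int:
    """Number of radix-4 stages, i.e. log4(n), for n a power of 4."""
    stages = 0
    while n > 1:
        if n % 4:
            raise ValueError("n must be a power of 4")
        n //= 4
        stages += 1
    return stages
```

The recursion depth equals the stage count: five stages for 1K points and nine for 256K points.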

Firstly, the main controller moves the original data from the on-chip memory to the shared memory. Once the data are ready, the main controller parses the configuration words and writes the configuration information of each reconfigurable unit into the corresponding configuration memory. Once all data and configuration information are ready, a timer is initialized and the reconfigurable processing unit array is started.

Secondly, the reconfigurable processing unit array reads the configuration information and determines the number of iterations. Some of the reconfigurable processing units read the original data from the shared memory, and each reconfigurable processing unit reads its configuration information and executes the specified operation. One iteration of the array is finished when all reconfigurable processing units have completed their operations. Execution continues until all iterations are completed, after which the timer is stopped and the number of clock cycles is recorded. During the FFT operation, the intermediate data generated by each stage are stored in the shared memory.

The second step in this embodiment provides the following advantages:

1. The reconfigurable processing units are mapped onto, and fused with, the basic butterfly units of the FFT. The configuration information of the reconfigurable processing units is organized as nested iterations, which compresses the large amount of similar configuration information and reduces the storage it requires. Configuration information is executed iteratively from the top down: first the configuration of the whole architecture, then the configuration of the individual arrays, and finally the configuration of each processing unit.

2. During execution, the configuration information for each iteration is preloaded according to how frequently each operator of the FFT algorithm occurs, which speeds up the operation of the whole hardware structure.
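The idea in item 1, that many similar configuration words can be folded into a single entry plus an iteration count, can be illustrated with simple run-length packing. The actual configuration format is not given in the text. The string-valued configuration words and the function names here are assumptions for illustration only.

```python
from itertools import groupby
from typing import List, Tuple


def pack_config(words: List[str]) -> List[Tuple[str, int]]:
    """Fold consecutive identical configuration words into (word, count)."""
    return [(word, len(list(run))) for word, run in groupby(words)]


def unpack_config(packed: List[Tuple[str, int]]) -> List[str]:
    """Expand (word, count) pairs back into the full word sequence."""
    return [word for word, count in packed for _ in range(count)]
```

A stream such as four identical butterfly configurations followed by one store configuration then needs two entries instead of five.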

The dynamically reconfigurable processing platform was simulated and compared with conventional processor structures. The results show that the platform designed in the invention requires significantly fewer cycles than a DSP or an FPGA.

TABLE 2 Performance comparison

Although illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, the invention is not limited to the scope of these specific embodiments. Various changes will be apparent to those skilled in the art, provided they remain within the spirit and scope of the invention as defined in the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A systolic array reconfigurable processor for FFT radix-module mapping, comprising:

2. The systolic array reconfigurable processor for FFT radix-module mapping according to claim 1, further comprising:

a hierarchical data storage scheme is adopted, in which the architecture spans three levels, namely the reconfigurable processor, the reconfigurable processing unit array and the reconfigurable processing unit, and the physical units that provide data access at these levels are, respectively, the shared memory, the global registers and the local registers;

the global registers mainly store data and parameters directed at multiple reconfigurable processing units, and can be accessed by all reconfigurable processing units in the reconfigurable processing unit array;

the local registers mainly store intermediate data inside a reconfigurable processing unit and can be accessed only by that reconfigurable processing unit;

the main controller controls the configuration and data of the reconfigurable processing unit array and the movement of data between the shared memory and the on-chip memory.

3. The systolic array reconfigurable processor for FFT radix-module mapping according to claim 1, further comprising:

to suit the requirements of the FFT operation, the processing architecture of the reconfigurable processor is dynamically reorganized in real time; independent reconfigurable processing units are configured into a systolic array by means of configuration information, and the ways in which the reconfigurable processing unit array is cut and spliced are defined in software, forming a variety of systolic array architectures suited to different algorithms.

4. The systolic array reconfigurable processor for FFT radix-module mapping according to claim 1, further comprising:

a modular FFT mapping scheme is adopted, in which the reconfigurable processing unit array is divided into several sub-arrays to achieve algorithm-level parallel processing; the FFT is divided into several radix-2 or radix-4 sub-modules according to its number of points; and FFT mappings for different point counts are obtained by splicing these radix modules together.