CN114968911B - FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling - Google Patents


Info

Publication number: CN114968911B
Application number: CN202210913142.1A (priority application)
Authority: CN (China)
Prior art keywords: PEA, configuration, FIR, configuration information, data
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other versions: CN114968911A (application publication), in Chinese (zh)
Inventors: 徐安林, 张强, 刘念, 梁小虎, 郝万宏, 陈昊, 杨欢
Original and current assignee: 63921 Troops of PLA
Application filed by and granted to 63921 Troops of PLA

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F15/7871Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H17/02Frequency selective networks
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03HIMPEDANCE NETWORKS, e.g. RESONANT CIRCUITS; RESONATORS
    • H03H17/00Networks using digital techniques
    • H03H2017/0072Theoretical filter design
    • H03H2017/0081Theoretical filter design of FIR filters
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention relates to an FIR reconfigurable processor with operator frequency compression and context configuration scheduling, comprising: a plurality of processing element arrays (PEAs) with their corresponding memories, a PEA configuration controller, a bus, an ESRAM, and a main controller. The main controller is responsible for task control and function partitioning of the whole device and dispatches data-intensive operations to the PEAs for execution. Each PEA is provided with a matched memory that stores initial computation data, output data, and the data that the PEs within the PEA need to exchange. The function of the whole PEA is defined in real time by the PEA configuration controller in the form of configuration information, thereby realizing dynamic reconfiguration of the processor.

Description

FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling
Technical Field
The invention relates to the field of computer processors, in particular to an FIR reconfigurable processor for operator frequency compression and context configuration scheduling.
Background
Digital filters are widely used in wireless communication, image processing, pattern recognition, and other fields. Filtering converts a set of input signals into a set of output signals so as to modify the time-domain or frequency-domain attributes of the signal. Digital filters are generally classified as finite impulse response (FIR) or infinite impulse response (IIR). As a representative class of digital filter, FIR filters have the following characteristics: (1) linear phase, with arbitrary amplitude characteristics; (2) a finite unit impulse response, which guarantees stability. FIR filters are typically implemented on DSPs or FPGAs. A DSP provides dedicated FIR instructions, so the FIR function can be realized within its instruction-set architecture. Because coefficient calculation and quantization are complex in FIR design, MATLAB is commonly used as a design aid to compute the FIR coefficients. However, because a DSP executes its program sequentially, its speed is limited, and the differing instruction sets of different DSP chips lengthen development time. An FPGA, with its regular internal array of logic blocks and rich routing resources, is well suited to fine-grained, highly parallel FIR implementations and offers better parallelism and scalability than a general-purpose DSP dominated by serial operation. As key applications evolve, the demand for low-power real-time processing keeps growing, and the processing speed, power consumption, and area overhead of the corresponding FIR implementations become ever more prominent concerns. There is therefore a need for an FIR acceleration architecture that satisfies high performance requirements while retaining a degree of flexibility.
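As a point of reference for the discussion that follows, the finite convolution that an FIR filter computes can be sketched in a few lines of software (an illustrative model only; the invention implements this computation in reconfigurable hardware):

```python
def fir_filter(x, h):
    """Direct-form FIR filter: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:          # samples before n = 0 are treated as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y
```

Feeding the filter a unit impulse returns the coefficient sequence itself, which is the defining property of a finite impulse response.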
Consider the DSP first. A DSP converts the application algorithm into a stream of imperative software instructions (or long instructions, single-instruction multiple-data instructions, and the like), and then implements each instruction and its pipelining, scheduling, and control mechanisms in hardware, thereby completing the physical realization of the application function. While this preserves the DSP's flexibility to the greatest extent, it also hurts system processing efficiency: because the characteristics of the underlying hardware are barely exposed, the hardware's advantages cannot be fully exploited. To guarantee flexibility, a DSP design contains a large amount of non-arithmetic-logic-unit logic such as instruction fetch and decode, and over 80% of the power is consumed by non-computational functions such as instruction fetch, decode, and register-file access, so the extra energy overhead of the computation process is very large and the computational energy efficiency is low. As a result, DSP efficiency is limited, and it is difficult to meet the demand for fast, low-power processing.
The FPGA is widely used for its ability to implement large-scale custom digital logic and to bring products to completion quickly, and it holds a firmly established position in communications, networking, aerospace, national defense, and other fields. FPGAs provide fine-grained reconfiguration and can therefore achieve high flexibility. However, as application demands grow, three major problems have gradually become bottlenecks restricting FPGA development: (1) low energy efficiency: performance is low, power consumption is high, and static power consumption is enormous; (2) capacity limitations: the loaded circuit usually cannot exceed 5% of the FPGA's capacity; (3) a high barrier to use: programmability is poor, development is difficult, and software engineers unfamiliar with circuit design cannot program it efficiently. These problems stem from intrinsic properties of the FPGA, such as single-bit programming granularity and static configuration. The long configuration time and high power consumption of FPGAs are therefore unacceptable for applications with real-time demands.
With the development of programmable devices, dynamically reconfigurable architectures have been proposed in recent years for signal processing across a variety of applications, including neural network accelerators, cryptographic algorithm implementations, and baseband processing built on dynamically reconfigurable architectures. These different types of dynamic reconfiguration architectures show significant advantages over FPGAs in reconfiguration time and cost.
Existing dynamic reconfiguration architectures have the following problem:
In terms of the computing array, it is necessary to generate, from all possible FIR configuration parameter conditions, the set of all computing features for the PE (processing element) basic structure, and to custom-design the PE microarchitecture according to that set. Existing dynamic reconfiguration architectures cannot balance flexibility against resource overhead: when a dynamically reconfigurable processor must support multiple FIR algorithms, hardware resource consumption becomes excessive and the efficiency of the whole processor drops.
Disclosure of Invention
To solve this technical problem, the invention discloses an FIR reconfigurable processor with operator frequency compression and context configuration scheduling. Unlike a traditional application-specific circuit, DSP, or FPGA, the processor achieves high energy efficiency, high flexibility, and scalability through a dynamically reconfigurable processor architecture.
The main computation module of the FIR hardware acceleration device based on the dynamically reconfigurable processor architecture is a set of dynamically reconfigurable processing arrays with flexible, variable functions, each composed of a number of homogeneous processing elements. The computational logic of each processing element can be altered by changing its function, interconnections, input-output data flow, and so on. Applying this transformation across the whole array achieves dynamic partial reconfiguration.
The technical scheme of the invention is as follows: an FIR reconfigurable processor with operator frequency compression and context configuration scheduling, comprising:
a dynamically reconfigurable processing element array comprising a plurality of processing element arrays (PEAs) with their corresponding memories, a PEA configuration controller, a bus, an ESRAM, and a main controller;
the main controller is responsible for task control and function partitioning of the whole device and dispatches data-intensive operations to the PEAs for execution; the FIR algorithm function is completed by stitching together a plurality of heterogeneous processing element arrays;
each PEA is provided with a matched memory that stores initial computation data, output data, and the data that the PEs within the PEA need to exchange; the function of the whole PEA is defined in real time by the PEA configuration controller in the form of configuration information, thereby realizing dynamic reconfiguration of the processor.
Beneficial effects:
The invention achieves high energy efficiency, high flexibility, and scalability through an FIR reconfigurable processor with operator frequency compression and context configuration scheduling.
(1) In terms of the computing array, the set of all computing features for the PE basic structure is generated from all possible FIR configuration parameter conditions, and the PE microarchitecture is custom-designed according to that set. Compared with existing dynamic reconfiguration architectures that adopt PEs with general-purpose computing functions, a PE customized to the FIR algorithm improves area efficiency. PE design here means PE microarchitecture design, i.e., the internal design of each PE and the interconnection scheme between PEs. Inside each PE is an input register set that holds the input data for the arithmetic logic unit, comprising two 32-bit inputs. The arithmetic logic unit also has a 1-bit input, so the PE can be configured for single-bit arithmetic functions. The function of the arithmetic logic unit is customized to the FIR algorithm, for example complex multiply-accumulate operations. To retain flexibility, the arithmetic logic unit has two outputs: one 32-bit output and one 1-bit output. Each PE also contains a configuration loader that loads configuration information from the configuration memory according to the processor's control instructions and rapidly reconfigures the function of the whole PE. By analyzing the algorithm and custom-designing the PE microarchitecture to its requirements, unnecessary interconnect area overhead, arithmetic-logic-unit functions, and redundant configuration memory space can be eliminated, improving area efficiency.
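The PE behavior described in this paragraph can be sketched as a small behavioral model (the names, the operator set, the 32-bit truncation, and the configuration format below are illustrative assumptions for demonstration, not the patent's actual microarchitecture):

```python
MASK32 = 0xFFFFFFFF  # results are truncated to the 32-bit datapath width

class PE:
    """Behavioral sketch of a processing element with a run-time
    reconfigurable ALU function and an accumulator for MAC operations."""

    def __init__(self):
        self.acc = 0        # accumulator used by the multiply-accumulate op
        self.op = "add"     # current function, rewritten by the config loader

    def load_config(self, config):
        # the configuration loader reconfigures the PE function at run time
        self.op = config["op"]
        if config.get("clear_acc"):
            self.acc = 0

    def execute(self, a, b):
        if self.op == "add":
            return (a + b) & MASK32
        if self.op == "mul":
            return (a * b) & MASK32
        if self.op == "mac":
            self.acc = (self.acc + a * b) & MASK32
            return self.acc
        raise ValueError("unknown operation: " + self.op)
```

Reloading the configuration switches the same PE from addition to multiply-accumulate without touching any other state, which is the essence of the dynamic reconfiguration described above.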
The FIR algorithm function is completed by stitching together a plurality of heterogeneous processing element arrays whose functions can be switched dynamically in real time. Compared with traditional FIR processors and FPGA architectures, the architecture proposed here can change the functions of its processing elements in real time, with switching times typically 5 to 6 orders of magnitude faster than an FPGA, achieving dynamic real-time reconfiguration. Through hierarchical processing, system-level, PEA-level, and PE-level processing are separated, and hardware resources are exploited in full parallel.
(2) In terms of circuit design, a voltage adjustment mechanism controlled by configuration information is designed. Exploiting the fact that the delay of the interconnect and arithmetic units fluctuates markedly with load, the voltage of each processing element is made dynamically configurable, and the voltage distribution of the interconnect and arithmetic units is controlled in real time by the dynamic configuration information, reducing the power consumption of some processing elements and thereby that of the whole processor.
(3) The invention provides a configuration information organization based on operator use frequency, together with its cache structure, which effectively reduces the amount of configuration information to store, shrinks the configuration storage area, and eases efficient configuration scheduling. The invention analyzes the characteristics of the FIR algorithm and summarizes its computation into a number of basic operators of suitable granularity; configuration information is organized with the operator as the basic unit, operators are ranked by how frequently they are used in the algorithm, and high-frequency operators receive more storage resources. A hierarchical storage structure further reduces the configuration storage area, and a multi-level index recovers the computing array's configuration at computation time. With this scheme, the amount of configuration information to store is reduced by more than 82.25%; and as the number of supported algorithms grows, the total configuration storage grows only slowly.
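The operator-frequency organization described above can be illustrated with a minimal software sketch (the table-plus-index scheme and all names are hypothetical simplifications of the hardware structure):

```python
from collections import Counter

def compress_config(config_stream):
    """Operator-frequency compression sketch: operators are ranked by use
    frequency and each configuration word is replaced by its rank index.
    In hardware, the highest-frequency operators would receive the most
    dedicated storage; here the idea is shown with a simple lookup table."""
    ranked = [op for op, _ in Counter(config_stream).most_common()]
    index = {op: i for i, op in enumerate(ranked)}
    return ranked, [index[op] for op in config_stream]

def restore_config(ranked, compressed):
    # the index lookup that recovers the full array configuration at run time
    return [ranked[i] for i in compressed]
```

Because high-frequency operators receive the smallest indices, a subsequent variable-length or hierarchical encoding of the index stream spends the fewest bits on the most common operators.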
(4) The invention provides a configuration scheduling mode based on configuration information context. The reconfiguration granularity of the reconfigurable processor's computing array is selected dynamically from the spatial-context relationships of the configuration information; on that basis, the temporal-context relationships are used to skip unchanged regions of the configuration information during configuration, reducing the amount of configuration information transmitted, shortening the share of configuration transmission time in the total execution time, and improving the performance of the reconfigurable processor.
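A minimal sketch of the temporal-context idea, skipping unchanged configuration regions between consecutive contexts (the word-level delta representation below is an illustrative assumption, not the patent's encoding):

```python
def config_delta(prev_ctx, curr_ctx):
    """Only configuration words that differ from the previous context are
    transmitted; unchanged regions are skipped entirely."""
    return {i: v for i, (p, v) in enumerate(zip(prev_ctx, curr_ctx)) if p != v}

def apply_delta(prev_ctx, delta):
    """Rebuild the current context from the previous one plus the delta."""
    nxt = list(prev_ctx)
    for i, v in delta.items():
        nxt[i] = v
    return nxt
```

When consecutive contexts share most of their contents, the delta is far smaller than a full context, which is exactly the transmission saving claimed above.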
In conclusion, compared with existing dynamic reconfiguration architectures that adopt PEs with general-purpose computing functions, the invention improves area efficiency through PE design customized to the FIR algorithm. And compared with integrating many fixed-function FIR modules into the PE, flexible switching improves the dynamic reconfiguration framework's support for different FIR algorithms and parameters, so that many kinds of computation can be realized with one set of hardware.
Drawings
FIG. 1 is a block diagram of an FIR reconfigurable processor for operator frequency compression and context configuration scheduling;
FIG. 2 is a workflow of the overall processor arrangement;
FIG. 3 is a schematic diagram of a dynamic reconfiguration processing unit array structure;
FIG. 4 is a schematic diagram of the per-PE implementation of a 15th-order FIR filter.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are obviously only a part of the embodiments of the present invention rather than all of them; based on these embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The invention discloses an FIR reconfigurable processor with operator frequency compression and context configuration scheduling. Unlike a traditional application-specific circuit, DSP, or FPGA, it achieves high energy efficiency, high flexibility, and scalability through a dynamically reconfigurable processor architecture.
The main computation module of the FIR hardware acceleration device based on the dynamically reconfigurable processor architecture is a set of dynamically reconfigurable processing arrays with flexible, variable functions, each composed of a number of homogeneous processing elements. The computational logic of each processing element can be altered by changing its function, interconnections, input-output data flow, and so on. Applying this transformation across the whole array achieves dynamic partial reconfiguration.
The overall architecture of the device of the present invention is shown in FIG. 1. It comprises a plurality of processing element arrays (PEAs) and their corresponding memories, a PEA configuration controller, a bus, an ESRAM, and a main controller.
First, in terms of the computing array, the FIR algorithm function is completed by stitching together a plurality of heterogeneous processing element arrays whose functions can be switched dynamically in real time. Compared with traditional FIR processors and FPGA architectures, the architecture proposed by the invention can change the functions of its processing elements in real time, with switching times typically 5 to 6 orders of magnitude faster than an FPGA, achieving dynamic real-time reconfiguration.
Second, in terms of circuit design, a voltage adjustment mechanism controlled by configuration information is designed. Exploiting the fact that the delay of the interconnect and arithmetic units fluctuates markedly with load, the voltage of each processing element is made dynamically configurable, and the voltage distribution of the interconnect and arithmetic units is controlled in real time by the dynamic configuration information, reducing the power consumption of some processing elements and thereby that of the whole processor.
The main controller is responsible for task control and function partitioning of the whole device and dispatches data-intensive operations to the PEAs for execution. Each PEA has an associated memory that stores initial computation data, output data, the data that the PEs within the PEA need to exchange, and so on. The function of the whole PEA is defined in real time by the PEA configuration controller in the form of configuration information, realizing dynamic reconfiguration of the processor.
The work flow of the whole processor device is shown in FIG. 2:
First, the memories of the PEAs, including the shared memory and the configuration memory, are initialized through the ESRAM, and each PEA and a timer are enabled.
Second, the configuration information in the configuration memory is read, and the number of computation cycles of each PEA is determined from it.
Third, the number of computation cycles of each PE within a PEA, together with the PEA's data and opcodes, is determined from the configuration information.
Fourth, each PE performs its computation.
Fifth, after each PE completes its loop computation, the processor checks whether the whole PEA has finished. When all PEA computations are complete, the time overhead is printed and the result data are moved to the respective memories and the ESRAM.
Sixth, the computation results and time overhead are verified, and the PEAs are turned off.
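The six workflow steps can be condensed into a control-flow sketch (the PEA model below, a dictionary with shared and configuration memory and a single summing operation, is purely illustrative):

```python
def run_device(esram, num_peas):
    """Control-flow sketch of the six workflow steps of the processor."""
    peas = [{} for _ in range(num_peas)]
    for pea in peas:                          # 1) initialize memories, enable
        pea["shared"] = list(esram["data"])
        pea["config"] = dict(esram["config"])
        pea["enabled"] = True
    results = []
    for pea in peas:
        cycles = pea["config"]["cycles"]      # 2)-3) cycle count from config
        acc = 0
        for c in range(cycles):               # 4) per-PE loop computation
            acc += pea["shared"][c]
        results.append(acc)                   # 5) move results out
        pea["enabled"] = False                # 6) shut the PEA down
    return results, peas
```

Each PEA runs independently given its own configuration, which mirrors the separation of system-level control from PEA-level computation described above.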
Thus, through hierarchical processing, the invention separates system-level, PEA-level, and PE-level processing and exploits hardware resources in full parallel. Moreover, since configuration information targets the system level, the PEA level, and the PE level, the configuration information structures designed by the invention differ at each level. By using the scheduling frequency information of the different FIR operator types and combining it with context configuration information to design the configuration organization and scheduling of each processing module, the configuration speed of FIR algorithm implementations can be raised in a targeted way compared with traditional designs, and the configuration storage overhead of the FIR algorithm is greatly reduced.
The PEA, as the computing core module (shown in FIG. 3), plays a central role in the whole processor. The improvements to this part mainly comprise the following points:
1. In terms of the computing array, it is necessary to generate, from all possible FIR configuration parameter conditions, the set of all computing features for the PE basic structure, such as the basic operators commonly used in FIR algorithms and the data sources and destinations of each operator, and to custom-design the PE microarchitecture according to that set. Compared with conventional dynamic reconfiguration frameworks, the invention improves area efficiency through PE design customized to the FIR algorithm; and compared with integrating many fixed-function FIR modules into the PE, flexible switching improves the framework's support for different FIR algorithms and parameters, so that many kinds of computation can be realized with one set of hardware.
2. In terms of the configuration system design, a configuration information organization based on operator use frequency and its cache structure are provided, which effectively reduce the amount of configuration information stored and shrink the configuration storage area. The invention analyzes the characteristics of the FIR algorithm and summarizes its computation into a number of basic operators of suitable granularity; configuration information is organized with the operator as the basic unit, operators are ranked by how frequently they are used in the algorithm, and high-frequency operators receive more storage resources. A hierarchical storage structure further reduces the configuration storage area, and a multi-level index recovers the computing array's configuration at computation time. With this scheme, the amount of configuration information to store is reduced by more than 82.25%; and as the number of supported algorithms grows, the total configuration storage grows only slowly.
At the same time, the invention adopts a configuration scheduling mode based on configuration information context. The spatial-context relationships of the configuration information generally reflect the composition of the configuration information obtained from analyzing the FIR algorithm, so the configuration information bears definite relationships determined by that algorithm. Each PEA is driven in three ways: by a control flow, a configuration flow, and a data flow. The control flow governs the functioning of the PEs as a whole, including enabling and ending the PEs, enabling and ending tasks, and enabling and ending data migration; the configuration flow, enabled by the control flow, carries the configuration information to each PE and reconfigures each PE's function; the data flow, on the basis of the control and configuration flows, performs logical computation on the input data in a streaming fashion and writes the results to memory. The three flows cooperate to produce fast computation and real-time functional reconfiguration of the PEA.
Within each PEA, multiple PEs are arranged in an array, and each PE can be updated in real time through configuration information, maintaining high flexibility. To keep data exchange among the PEs fluent, an interconnect structure is provided between them, and all PEs share registers and memory units. The main function of each PE is computation, performed by the ALU inside the PE. The ALU provides various logic and arithmetic functions, such as basic operations (addition, multiplication, shifts, etc.) and complex operations (complex multiplication, complex multiply-accumulate, etc.). The ALU has two 32-bit inputs, one 1-bit input, one 32-bit output, and one 1-bit output. The ALU's inputs may come from many sources, including other PEs, registers, and memory, as selected by the configuration information.
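One of the complex ALU operations named above, complex multiply-accumulate, can be sketched on (real, imaginary) integer pairs (fixed-point handling of the 32-bit datapath is omitted for clarity):

```python
def complex_mac(acc, a, b):
    """Complex multiply-accumulate: returns acc + a*b, where acc, a, and b
    are (real, imag) pairs and a*b expands by the usual complex product."""
    ar, ai = a
    br, bi = b
    return (acc[0] + ar * br - ai * bi,
            acc[1] + ar * bi + ai * br)
```

This single operator covers one tap of a complex FIR filter per invocation, which is why it is a natural candidate for a PE's customized ALU function.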
A practical reference embodiment of the present invention is described below.
For ease of understanding, this embodiment briefly describes the implementation of an FIR hardware acceleration device based on the dynamically reconfigurable processor architecture.
For an FIR filter of length N, the output corresponding to the input time series x(n) takes the form of a finite convolution:

y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k),

where h(n) is the impulse response of the FIR filter. The impulse response h(n) is divided into Q groups:

h_q(k) = h(Qk+q), \qquad q = 0, 1, \ldots, Q-1,

where x(Qn) is called the decimation result of the input data at phase offset 0, also called the 0th phase of the Q-fold decimation of the input signal. By analogy, x(Qn+q) is called the q-th phase of the Q-fold decimation. The phase-0 result of the Q-fold decimation of the output signal y(n) can then be expressed as:

y(Qn) = \sum_{q=0}^{Q-1} \sum_{k} h(Qk+q)\, x(Qn-Qk-q).
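The polyphase relations can be checked numerically against direct convolution (an illustrative sketch, not part of the patent):

```python
def fir_direct(x, h):
    """Reference direct convolution: y(n) = sum_k h(k) x(n-k)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def fir_phase0(x, h, Q):
    """Phase-0 polyphase output: y(Qn) = sum_q sum_k h(Qk+q) x(Qn-Qk-q)."""
    out = []
    for n in range(len(x) // Q):
        acc = 0
        for q in range(Q):
            # k runs while Qk+q stays inside the filter length
            for k in range((len(h) - q + Q - 1) // Q):
                idx = Q * n - Q * k - q
                if idx >= 0:
                    acc += h[Q * k + q] * x[idx]
        out.append(acc)
    return out
```

The phase-0 polyphase outputs coincide with every Q-th sample of the directly convolved output, confirming the decomposition.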
as shown in fig. 4, a schematic diagram is implemented for each PE of a 15 th order FIR filter.
And the realization of the FIR reconfigurable processor for operator frequency compression and context configuration scheduling is explained by taking a 15-order FIR filter as an example. Each PE stores a factor of 1. Each PE inputs 4 numbers acquired by the clock in 4-phase channels, wherein the 1 st beat is x 0-x 3, the 2 nd beat is x 4-x 7, and the rest is done in the same way. Each beat outputs 4 filtered 4 numbers, all output from PE 0.
Y0 = PE0_x0 + sum_0
Y1 = PE1_x0 + PE0_x1 + sum_1
Y2 = PE2_x0 + PE1_x1 + PE0_x2 + sum_2
Y3 = PE3_x0 + PE2_x1 + PE1_x2 + PE0_x3 + sum_3
where sum_n is the partial sum carried over from the previous beat, accumulated in groups of 4 PEs (with 16 PEs producing 4 products each, there are 64 products per beat).
sum_0 = sumD_4+ PE1_x3 + PE2_x2 + PE3_x1 + PE4_x0
sum_1 = sumD_5 + PE2_x3 + PE3_x2 + PE4_x1 + PE5_x0
sum_2 = sumD_6 + PE3_x3 + PE4_x2 + PE5_x1 + PE6_x0
sum_3 = sumD_7 + PE4_x3 + PE5_x2 + PE6_x1 + PE7_x0
That is: sum_n = sumD_{n+4} + PE{n+1}_x3 + PE{n+2}_x2 + PE{n+3}_x1 + PE{n+4}_x0
sumD_n is the sum result from the previous beat; that is, the sums are accumulated beat by beat.
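The beat-by-beat accumulation scheme above can be simulated and checked against direct convolution (the indexing conventions below are inferred from the formulas and are illustrative):

```python
def fir_direct(x, h):
    """Reference direct convolution with zero initial state."""
    return [sum(h[m] * x[n - m] for m in range(len(h)) if n - m >= 0)
            for n in range(len(x))]

def pe_array_fir(x, h):
    """Beat-by-beat sketch of the 16-PE scheme: PE n holds coefficient h[n],
    each beat consumes four input samples and emits Y0..Y3 from PE0, and the
    sum_n / sumD_n partial sums ripple four taps deeper on every beat."""
    Q, N = 4, len(h)
    hh = h + [0] * Q              # pad so h[m + i] past the end reads as 0
    s = [0] * (N + Q)             # partial-sum chain (sum_n and sumD_n)
    y = []
    for t in range(len(x) // Q):
        beat = x[Q * t : Q * t + Q]
        for j in range(Q):        # Y0..Y3, combining fresh taps + carried sum
            acc = sum(hh[i] * beat[j - i] for i in range(j + 1))
            y.append(acc + s[j])
        nxt = [0] * (N + Q)       # sum_n = sumD_{n+4} + four new products
        for m in range(N):
            nxt[m] = s[m + Q] + sum(hh[m + i] * beat[Q - i]
                                    for i in range(1, Q + 1))
        s = nxt
    return y
```

Despite computing only four products per PE per beat, the rippling partial sums reproduce the full 16-tap convolution exactly.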
In the execution schedule of the PEs, the horizontal axis represents the clock period and the vertical axis represents the different PEs and their stored coefficients.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of those embodiments. To a person of ordinary skill in the art, various changes are possible; all inventive ideas that use the concepts set forth herein fall under protection, provided they do not depart from the spirit and scope of the invention as defined by the appended claims.

Claims (4)

1. An FIR reconfigurable processor for operator frequency compression and context configuration scheduling, comprising:
a dynamically reconfigurable processing element array system composed of a plurality of processing element arrays (PEAs) built from homogeneous processing elements (PEs), their corresponding memories, the PEAs' configuration controllers, a bus, an ESRAM, and a main controller; for the computing array, a complete set of computing features is generated for the basic PE structure according to all possible FIR configuration parameters, and the PE microarchitecture is custom-designed from this computing feature set;
when the system is configured, a configuration-information organization and cache structure based on operator usage frequency is adopted; specifically, the computational characteristics of the FIR algorithm are analyzed and summarized into several basic operators of suitable granularity, configuration information is organized with the operator as its basic unit, the operators are classified by how frequently the algorithm uses them, and frequently used operators are allocated more storage resources; meanwhile, a hierarchical storage structure further reduces the configuration storage area, and the computing-array configuration is recovered through multi-level indexing during computation;
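A software analogue of this frequency-based organization might look as follows. This is a minimal sketch under loose assumptions — the patent does not specify an encoding, and the rank-based index, helper names, and toy operator stream are all illustrative:

```python
from collections import Counter

def build_operator_table(config_stream):
    """Rank operators by usage frequency: the most frequent operator
    gets index 0, so frequent operators occupy the cheapest slots."""
    ranked = [op for op, _ in Counter(config_stream).most_common()]
    return ranked, {op: i for i, op in enumerate(ranked)}

def compress(config_stream, op_to_index):
    """Store the configuration as a stream of small table indices."""
    return [op_to_index[op] for op in config_stream]

def recover(indices, ranked):
    """Recover the computing-array configuration via the index table."""
    return [ranked[i] for i in indices]

# toy configuration stream: "mac" dominates an FIR kernel
stream = ["mac", "mac", "shift", "mac", "add", "mac", "shift"]
ranked, table = build_operator_table(stream)
packed = compress(stream, table)
```

Because the most frequent operator receives the smallest index, a hardware version could give it the shortest code word and the fastest level of the configuration-memory hierarchy.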
the main controller is responsible for task control and function partitioning of the whole device, and dispatches data-intensive operations to the PEAs for execution; the FIR algorithm function is completed by splicing together a plurality of heterogeneous PEAs;
each PEA is provided with a corresponding memory for storing initial computation data, output data, and the data that the PEs within each PEA need to exchange; the function of a PEA is defined in real time by its configuration controller in the form of configuration information, thereby realizing dynamic reconfiguration of the processor;
the work flow of the whole reconfigurable processor is as follows:
first, the ESRAM initializes the memories of each PEA, including the shared memories and configuration memories, and enables each PEA and the timer; within each PEA, a plurality of PEs are arranged as an array, and each PE is updated in real time through configuration information; meanwhile, to keep data exchange between PEs fluent, an interconnect structure is provided among the PEs; all PEs have shared registers and memory units;
secondly, the configuration information in the configuration memory is read, and the number of computation cycles of each PEA is determined from it;
thirdly, the number of computation cycles of each PE in the PEA, together with the PEA's data and opcodes, is determined from the configuration information;
fourthly, each PE performs its computation;
fifthly, after each PE completes its loop computation, it is judged whether the whole PEA has completed its loop computation; when the whole PEA has finished, the time overhead is printed and the computation results are moved to each memory and to the ESRAM;
sixth, the computation results and time overhead are verified, and the PEA is shut down.
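The six steps above can be sketched as a software control loop. This is a hypothetical model — the patent describes hardware, and the class, field names, and toy "mac" operation below are illustrative assumptions:

```python
class PEA:
    """Toy model of one processing element array and its memory."""
    def __init__(self, config):
        self.cycles = config["cycles"]        # step 2: per-PEA cycle count
        self.pe_ops = config["pe_ops"]        # step 3: per-PE opcodes + data
        self.memory = {}                      # the PEA's local memory
        self.enabled = False

    def run(self, data):
        # step 4: every PE iterates for its configured cycle count
        acc = 0
        for _ in range(self.cycles):
            for op, coeff in self.pe_ops:
                if op == "mac":               # multiply-accumulate
                    acc += coeff * data
        self.memory["result"] = acc           # step 5: result into memory
        return acc

esram = {"config": {"cycles": 2, "pe_ops": [("mac", 3)]}, "data": 5}

pea = PEA(esram["config"])                    # step 1: init from ESRAM
pea.enabled = True
result = pea.run(esram["data"])               # steps 2-5: configure and run
esram["result"] = pea.memory["result"]        # step 5: move back to ESRAM
pea.enabled = False                           # step 6: shut down the PEA
```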
2. The FIR reconfigurable processor according to claim 1, wherein:
each PEA is driven by three flows: a control flow, a configuration flow, and a data flow;
the control flow controls the functions of all processing elements (PEs), including PE enable and termination, task enable and termination, and data-migration enable and termination;
the configuration flow, enabled by the control flow, carries the configuration information to each PE and thereby reconfigures each PE's function; through analysis of the FIR algorithm, a configuration scheduling scheme based on configuration-information context is adopted: the context relationships of the algorithm's configuration information are obtained, a configuration-information storage space for the FIR algorithm is formed from those relationships, and the reconfiguration granularity of the computing array is selected dynamically using the context relationships of that storage space, shortening configuration execution time;
on the basis of the control flow and the configuration flow, the data flow performs logical computation on the input data in a streaming fashion and writes the results to memory.
3. The FIR reconfigurable processor for operator frequency compression and context configuration scheduling according to claim 1, wherein:
the main function of each PE is to perform computation, which is carried out by an ALU inside the PE; the ALU provides a variety of logical computation functions and has two 32-bit inputs, one 1-bit input, one 32-bit output, and one 1-bit output.
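That port layout — two 32-bit operands, a 1-bit input (e.g. carry-in), a 32-bit result, and a 1-bit output (e.g. carry or flag) — can be sketched as a behavioral model. The opcode set below is an assumption, since the claim does not enumerate the operations:

```python
MASK32 = 0xFFFFFFFF

def pe_alu(op, a, b, cin=0):
    """Behavioral ALU model: two 32-bit inputs plus a 1-bit input,
    returning a (32-bit output, 1-bit output) pair."""
    a &= MASK32
    b &= MASK32
    if op == "add":                    # carry-in / carry-out addition
        full = a + b + cin
        return full & MASK32, full >> 32
    if op == "sub":                    # borrow-in / borrow-out subtraction
        full = a - b - cin
        return full & MASK32, 1 if full < 0 else 0
    if op == "and":
        return a & b, 0
    if op == "or":
        return a | b, 0
    if op == "xor":
        return a ^ b, 0
    if op == "lt":                     # comparison flag on the 1-bit output
        return 0, 1 if a < b else 0
    raise ValueError(f"unknown opcode: {op}")
```

The 1-bit ports are what let adjacent PEs chain wide arithmetic (e.g. a 64-bit add built from two 32-bit ALUs).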
4. The FIR reconfigurable processor according to claim 1, wherein a voltage regulation mechanism controlled by configuration information is provided: exploiting the fact that the delay of the interconnect and arithmetic units fluctuates with load, the voltage of each PE is made dynamically configurable, and the voltage distribution of the interconnect and arithmetic units is controlled in real time by dynamic configuration information.
CN202210913142.1A 2022-08-01 2022-08-01 FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling Active CN114968911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210913142.1A CN114968911B (en) 2022-08-01 2022-08-01 FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling


Publications (2)

Publication Number Publication Date
CN114968911A CN114968911A (en) 2022-08-30
CN114968911B true CN114968911B (en) 2022-11-22

Family

ID=82969548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210913142.1A Active CN114968911B (en) 2022-08-01 2022-08-01 FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling

Country Status (1)

Country Link
CN (1) CN114968911B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11579894B2 (en) * 2020-10-27 2023-02-14 Nokia Solutions And Networks Oy Deterministic dynamic reconfiguration of interconnects within programmable network-based devices
CN112559442A (en) * 2020-12-11 2021-03-26 清华大学无锡应用技术研究院 Array digital signal processing system based on software defined hardware
CN112486908A (en) * 2020-12-18 2021-03-12 清华大学 Hierarchical multi-RPU multi-PEA reconfigurable processor


Similar Documents

Publication Publication Date Title
Shen et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA
Abdelfattah et al. DLA: Compiler and FPGA overlay for neural network inference acceleration
DeHon The density advantage of configurable computing
Goldstein et al. PipeRench: a co/processor for streaming multimedia acceleration
JP6373991B2 (en) Vector processing engine utilizing tapped delay line for filter vector processing operations and associated vector processing system and method
US6023742A (en) Reconfigurable computing architecture for providing pipelined data paths
JP6339197B2 (en) Vector processing engine with merging circuit between execution unit and vector data memory and associated method
JP2016537724A (en) A vector processing engine (VPE) utilizing a format conversion circuit in a data flow path between a vector data memory and an execution unit to provide in-flight format conversion of input vector data to the execution unit for vector processing operations and related Vector processing system and method
Rasoulinezhad et al. PIR-DSP: An FPGA DSP block architecture for multi-precision deep neural networks
JP2016541057A (en) A vector processing engine (VPE) and associated vector processor that utilizes a reordering circuit in the data flow path between the execution unit and the vector data memory to provide in-flight reordering of output vector data stored in the vector data memory System and method
CN110851779B (en) Systolic array architecture for sparse matrix operations
Wu et al. Compute-efficient neural-network acceleration
JP2016537725A (en) Vector processing engine utilizing despreading circuit in data flow path between execution unit and vector data memory, and associated method
WO2006115635A2 (en) Automatic configuration of streaming processor architectures
JP2016537723A (en) Vector processing engine utilizing tapped delay line for filter vector processing operations and associated vector processing system and method
Li et al. Time-multiplexed FPGA overlay architectures: A survey
Geng et al. CQNN: a CGRA-based QNN framework
Yuan et al. CORAL: coarse-grained reconfigurable architecture for convolutional neural networks
Iwamoto et al. Daisy-chained systolic array and reconfigurable memory space for narrow memory bandwidth
Waidyasooriya et al. FPGA implementation of heterogeneous multicore platform with SIMD/MIMD custom accelerators
CN114968911B (en) FIR (finite Impulse response) reconfigurable processor for operator frequency compression and context configuration scheduling
Wang et al. High-performance mixed-low-precision cnn inference accelerator on fpga
CN109948787B (en) Arithmetic device, chip and method for neural network convolution layer
Heysters et al. A reconfigurable function array architecture for 3G and 4G wireless terminals
Chidambaram et al. Accelerating the inference phase in ternary convolutional neural networks using configurable processors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant