CN109598338B - Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization - Google Patents

Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Info

Publication number
CN109598338B
CN109598338B
Authority
CN
China
Prior art keywords
weight
data
calculation
area
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811493592.XA
Other languages
Chinese (zh)
Other versions
CN109598338A (en)
Inventor
陆生礼
庞伟
舒程昊
范雪梅
吴成路
邹涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SAMPLE TECHNOLOGY CO LTD
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Southeast University
Original Assignee
NANJING SAMPLE TECHNOLOGY CO LTD
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING SAMPLE TECHNOLOGY CO LTD, Southeast University-Wuxi Institute Of Integrated Circuit Technology, Southeast University filed Critical NANJING SAMPLE TECHNOLOGY CO LTD
Priority to CN201811493592.XA
Publication of CN109598338A
Application granted
Publication of CN109598338B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation of neural networks using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an FPGA-based, computation-optimized convolutional neural network accelerator comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result buffer area, a state controller and a PE array. The data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface. The prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel. The result buffer area buffers the calculation results of each row of PEs. The state controller controls the working state of the accelerator and realizes the transitions between working states. The PE array reads the data in the prefetch data area and the convolution kernel buffers to carry out the convolution operation. The accelerator exploits parameter sparsity, repeated weight values and the properties of the Relu activation function to finish redundant calculations early, reducing the amount of computation and cutting energy consumption by reducing the number of memory accesses.

Description

Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization
Technical Field
The invention belongs to the fields of electronic information and deep learning, and particularly relates to the hardware structure of a computation-optimized convolutional neural network accelerator based on an FPGA (Field Programmable Gate Array).
Background
In recent years, the use of deep neural networks has grown rapidly, with significant impact on economic and social activities worldwide. Deep convolutional neural network technology has received a great deal of attention in many machine learning fields, including speech recognition, natural language processing, and intelligent image processing; in the field of image recognition in particular, deep convolutional neural networks have achieved remarkable results, in some of these areas reaching accuracy beyond human levels. The superiority of the deep convolutional neural network arises from its ability to extract high-level features from raw data after statistical learning over large amounts of data.
Deep convolutional neural networks are notoriously computation-intensive, with convolution operations accounting for over 90% of the total number of operations. Reducing these extensive calculations by exploiting the runtime information and algorithmic structure of the convolution, i.e., reducing the work required for inference, has become a new research hotspot.
The high accuracy of deep convolutional neural networks comes at the cost of high computational complexity. In addition to being computationally intensive, convolutional neural networks require the storage of millions or even billions of parameters. The large size of such networks presents throughput and energy-efficiency challenges to the underlying acceleration hardware.
Currently, various accelerators based on FPGAs, GPUs (Graphics Processing Units) and ASICs (Application-Specific Integrated Circuits) have been proposed to improve the performance of deep convolutional neural networks. FPGA-based accelerators offer good performance, high energy efficiency, short development cycles and strong reconfigurability, and have therefore been widely studied. Unlike general-purpose architectures, FPGAs allow users to customize the designed hardware to accommodate various resource and data usage patterns.
Based on the above analysis, the present scheme arises from the problem in the prior art that the amount of redundant computation during convolution is excessive.
Disclosure of Invention
The invention aims to provide a computation-optimized convolutional neural network accelerator based on FPGA that uses parameter sparsity, repeated weight data, and the properties of the Relu activation function to finish redundant calculations early, reducing the amount of computation and cutting energy consumption by reducing the number of memory accesses.
In order to achieve the above object, the solution of the present invention is:
a convolution neural network accelerator based on calculation optimization of FPGA comprises an AXI4 bus interface, a data buffer area, a prefetched data area, a result buffer area, a state controller and a PE array;
the AXI4 bus interface is a general-purpose bus interface through which the accelerator can be attached to any bus device using the AXI4 protocol;
the data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface; it comprises M feature map sub-buffers and C convolution kernel buffers, one convolution kernel buffer corresponding to each column of PEs, and the number of feature map sub-buffers actually used is determined by the parameters of the layer actually being calculated; the number M of feature map sub-buffers is determined by the convolution kernel size of the current layer of the convolutional neural network, the size of the output feature map, and the offset of the convolution window;
the prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel;
the result buffer area comprises R result sub-buffers, one per row of PEs, each buffering the calculation results of its row;
the state controller controls the working state of the accelerator and realizes the transitions between working states;
the PE array is realized on the FPGA and reads the data in the prefetch data area and the convolution kernel buffers to carry out the convolution operation; PEs in different columns compute different output feature maps, and PEs in different rows compute different rows of the same output feature map. The PE array comprises R×C PE units, and each PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode.
The PE unit comprises an input buffer, a weight buffer, an input retrieval area, a weight retrieval area, a PE control unit, a pre-activation unit and a configurable multiply-accumulate unit. The input buffer and the weight buffer respectively store the feature map data and weight data required by the convolution calculation; the input retrieval area and the weight retrieval area respectively store the index values used to look up the feature map data and the weight data; the PE control unit controls the working state of the PE unit: it reads the index values from the retrieval areas, reads data from the buffers according to the index values, sends the data into the multiply-accumulate unit for calculation, configures the mode of the multiply-accumulate unit, and decides whether to start the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution calculation and, if the partial sum is smaller than 0, stops the calculation and outputs 0; the multiply-accumulate unit carries out the convolution calculation and can be configured into a normal multiply-accumulate mode or the weight-repetition optimization mode.
The PE control unit determines whether the convolution optimization mode of the multiply-accumulate unit is the pre-activation mode or the weight-repetition mode, and a different optimization mode can be selected for each layer. The mode is determined by a two-bit mode flag: a high bit of 0 selects normal multiply-accumulate calculation, while a high bit of 1 selects the weight-repetition optimization mode; a low bit of 0 disables pre-activation, while a low bit of 1 enables the pre-activation mode.
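As a behavioral illustration only (not the patent's hardware), the flag decoding can be sketched in Python; the function name is hypothetical:

```python
def decode_mode(s1s0: int) -> dict:
    """Decode the two-bit mode flag S1S0 into the two optimization switches."""
    return {
        "weight_repetition": bool(s1s0 & 0b10),  # high bit 1 -> weight-repetition mode
        "pre_activation":    bool(s1s0 & 0b01),  # low bit 1 -> pre-activation enabled
    }

# Example: S1S0 = 01 -> normal multiply-accumulate with pre-activation enabled.
assert decode_mode(0b01) == {"weight_repetition": False, "pre_activation": True}
```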
The weight retrieval area comprises several weight sub-retrieval areas. Weights are written into the weight sub-buffers ordered from positive to negative, with zero weights last, and the corresponding input index values and weight index values are written into the retrieval areas in the same order; the sorting of the weights and index values is done offline. During convolution calculation, the weights are read sequentially from the weight buffers according to the weight index values.
Each weight index value uses a one-bit weight conversion flag to indicate whether the weight used for calculation changes: if the flag is 0, the weight is unchanged and the previous clock cycle's weight is reused; if the flag is 1, the weight changes and the next clock cycle reads the next weight from the weight sub-buffer in sequence.
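A host-side sketch of this offline preparation step, under the stated ordering (function and variable names are hypothetical); it groups equal weights, orders them from positive to negative with zeros last, and emits the one-bit conversion flags together with the input index stream:

```python
def sort_weights_offline(weights):
    """weights: flat list of kernel weights, one per input position."""
    # Zeros sort last; among non-zeros, order from most positive to most negative.
    order = sorted(range(len(weights)),
                   key=lambda i: (weights[i] == 0, -weights[i]))
    weight_buffer, conv_flags, input_index = [], [], []
    for n, pos in enumerate(order):
        w = weights[pos]
        if not weight_buffer or w != weight_buffer[-1]:
            weight_buffer.append(w)    # a new weight enters the weight sub-buffer
        input_index.append(pos)        # which feature-map datum pairs with this step
        # Flag 1 marks the last use of a weight: the next clock reads the next one.
        nxt = order[n + 1] if n + 1 < len(order) else None
        conv_flags.append(1 if nxt is None or weights[nxt] != w else 0)
    return weight_buffer, conv_flags, input_index
```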
The PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode. The pre-activation mode monitors the sign of the convolution partial sum in real time: if the partial sum is negative, the PE control unit is notified to terminate the calculation and the Relu result zero is output directly; if it is positive, the convolution calculation continues. The weight-repetition mode first adds together the feature map data that share the same weight in the convolution operation and then multiplies the sum by that weight, reducing both the number of multiplications and the number of accesses to the weight data.
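A minimal software model of the pre-activation early exit, assuming non-negative (post-Relu) inputs and the positive-to-negative weight ordering above, under which a negative partial sum can no longer recover:

```python
def conv_window_preactivated(inputs, weights):
    """inputs/weights: aligned sequences for one convolution window,
    weights ordered from positive to negative (zeros last)."""
    partial = 0
    for x, w in zip(inputs, weights):
        partial += x * w
        if partial < 0:        # pre-activation unit fires
            return 0           # early Relu result; rest of the window is skipped
    return max(partial, 0)     # normal Relu at the end of the window
```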
In the weight-repetition mode, the input feature map data are accumulated while the weight conversion flag is 0, and the running sum is stored in a register. When the weight conversion flag is 1, the accumulation finishes; the accumulated partial sum and the weight value are sent to the multiplication unit to be multiplied, and the result is stored in the register.
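A behavioral sketch of this accumulate-then-multiply dataflow, consuming the flag and index streams produced by the offline sketch above (names are illustrative):

```python
def conv_window_weight_repeat(inputs, weight_buffer, conv_flags, input_index):
    result, acc, w_ptr = 0, 0, 0
    for flag, idx in zip(conv_flags, input_index):
        acc += inputs[idx]                 # flag 0 path: keep accumulating inputs
        if flag == 1:                      # last use of the current weight
            w = weight_buffer[w_ptr]
            if w != 0:                     # zero weight: the compute unit is skipped
                result += acc * w          # one multiply for the whole group
            acc, w_ptr = 0, w_ptr + 1      # next clock reads the next weight
    return result
```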
The state controller has 7 states: waiting, writing the feature map, writing the input index, writing the convolution kernel, writing the weight index, convolution calculation, and sending the calculation result. In each state, corresponding control signals are sent to the corresponding sub-modules to complete the corresponding functions.
Because the data bit width of the AXI4 bus interface is larger than that of a single weight or feature map datum, several data are spliced into one multi-bit word before sending, improving the data transmission speed.
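A minimal packing sketch; the little-endian lane order is an assumption, as the text does not specify the lane layout:

```python
def pack_lanes(values, lane_bits=16):
    """Splice several narrow values (e.g. 16-bit weights or results)
    into one wide bus word, lowest lane first (assumed order)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (lane * lane_bits)
    return word

# Four 16-bit values -> one 64-bit beat, as used when sending results to DDR.
assert pack_lanes([0x1111, 0x2222, 0x3333, 0x4444]) == 0x4444333322221111
```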
By adopting this scheme, the invention uses the runtime information and algorithmic structure of the convolution calculation to cut redundant, useless computation and parameter reads, and accelerates the convolutional neural network on an FPGA hardware platform; it can improve the real-time performance of the DCNN, achieve higher computational performance, and reduce energy consumption.
Drawings
FIG. 1 is a schematic diagram of the structure of the present invention;
FIG. 2 is a schematic diagram of the PE structure of the present invention;
FIG. 3 is a schematic diagram of the input index and weight index operations;
FIG. 4 is a schematic diagram of the operation of the pre-activation unit.
Detailed Description
The technical scheme and beneficial effects of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the hardware structure of the convolutional neural network accelerator designed in the present invention works as follows, taking a 16×16 PE array, a 3×3 convolution kernel, and a convolution stride of 1 as an example:
the PC caches the data partition in the external memory DDR through the PCI-E interface, the data cache area reads the characteristic diagram data through the AXI4 bus interface and caches the characteristic diagram data in 3 characteristic diagram sub-cache areas according to lines, and the input index value is cached in the characteristic diagram sub-cache areas in the same mode. Weight data read through an AXI4 bus interface are sequentially cached in 16 convolution kernel cache areas, and weight index values are cached in the convolution kernel cache areas in the same mode. The pre-fetching buffer zone sequentially reads 3 feature map sub-buffer zone data according to a row sequence, reads 3 x 18 16-bit feature map data in total, outputs 16-bit feature map data in parallel in each clock cycle, and inputs 3 feature map data in parallel. The output data of the prefetching buffer is sent to the first PE of each line of the PE array, and is sequentially transmitted to each adjacent PE of each line. The input index value is fed into the PE array in the same manner. The input feature map data is cached in the input sub-cache area of each PE, and the input index value is cached in the input retrieval area. The weight data and the weight index value are input into the first PE of each column of the PE array in parallel through 16 convolution kernel buffers, and each column of adjacent PEs are sequentially transmitted. And finally, the weight sub-buffer memory area and the weight retrieval area in the PE are cached. And the PE unit reads data from the input sub-buffer and the weight sub-buffer according to the configured calculation optimization mode and index values, performs convolution calculation, and sends accumulated results into 16 result sub-buffers in parallel, wherein the calculation results of each line of PE are stored in the same result sub-buffer.
Referring to fig. 2, the PE unit can be configured into the two calculation optimization modes, pre-activation and weight repetition, through the two-bit mode flag S1S0. When S1S0 is 01, the PE is configured in pre-activation mode: the pre-activation unit is started to monitor the partial-sum result of the multiply-accumulate operation, and if the partial sum is negative, the Relu result 0 is output in advance and the calculation of the current convolution window is stopped. When S1S0 is 10, the PE is configured in weight-repetition mode: the input accumulation unit is started, and for multiplications that share the same weight the additions are performed first, the input data being accumulated into a register until the weight changes, at which point the accumulated result is sent to the multiply-accumulate unit for the multiply-accumulate operation. When the weight is 0, the PE unit turns off the computing unit and outputs the partial sum directly.
Referring to fig. 3, the PE control unit uses the input index values to fetch the feature map data from the input sub-buffer in order and send them to the calculation unit. The weight index value is a one-bit weight conversion flag: if it is 0 the weight is unchanged, and if it is 1 the next weight is read in sequence. The weights and index values are arranged with the weights ordered from positive to negative and zero weights placed at the end, and this sorting is done offline. As shown in fig. 3, the first four input data correspond to the same weight x, the middle two to the same weight y, and the last three to the same weight z.
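The grouping of fig. 3 can be checked numerically with the sketches defined above (illustrative values; x, y and z stand for three distinct weights):

```python
x, y, z = 3, 2, -1
weights = [x, x, x, x, y, y, z, z, z]   # 4 inputs share x, 2 share y, 3 share z
inputs  = [1, 2, 3, 4, 5, 6, 7, 8, 9]

wb, flags, idx = sort_weights_offline(weights)   # flags = [0,0,0,1, 0,1, 0,0,1]
assert conv_window_weight_repeat(inputs, wb, flags, idx) \
       == sum(i * w for i, w in zip(inputs, weights))    # 3 multiplies instead of 9
```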
Referring to fig. 4, once the pre-activation unit is started it compares the partial sum with zero. If the partial sum is greater than zero, the calculation continues and the final result is output; if the partial sum is smaller than zero, a termination signal is sent to the PE control unit, which shuts the calculation down, and the post-Relu result zero is output directly.
Unfolding the convolution into vector multiply-accumulate operations makes the network structure and the hardware architecture match more closely, and simplifying the calculation according to the runtime information and algorithmic structure improves calculation efficiency and reduces energy consumption. The specific state transition process in this embodiment is as follows:
after initialization, the accelerator enters a waiting state, a state controller waits for a state signal state sent by an AXI4 bus interface, and when the state is 00001, the state controller enters a convolution writing core state; when the state is 00010, entering a writing weight index state; when the state is 00100, entering a state of writing a feature map; when the state is 01000, entering a write input index state; after the data is received, the waiting state is 10000, and the state of convolution calculation is entered. And after the calculation is finished, automatically jumping into a state of sending the calculation result, and jumping back to a waiting state after the completion of the sending.
Writing a feature map: on entering this state, the accelerator waits for the AXI4 bus interface data-valid signal to be pulled high and enables the 3 feature map sub-buffers in sequence: the first sub-buffer stores the first line of the feature map, the second sub-buffer the second line, and the third sub-buffer the third line; the fourth line of the feature map wraps back to the first sub-buffer, and so on (see the sketch after this paragraph). After the feature map data have been stored in this order, in the first clock cycle the first data of lines one, two and three of the feature map, held in the three sub-buffers, are taken and sent to the prefetch buffer; in the second clock cycle the first data of lines four, five and six are taken and sent to the prefetch buffer, and after the lines have been traversed, the second, third, … data are taken and sent in the same order. Once 3×18 feature map data have been stored in the prefetch buffer, 16 feature map data are output in parallel each clock cycle, sent to the first PE of each row of the PE array, passed in turn to the adjacent PEs of each row, and finally stored in the input sub-buffers of the PEs.
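A sketch of the round-robin line distribution just described (a behavioral model, not the hardware):

```python
def distribute_rows(feature_map, num_subbuffers=3):
    """Write feature-map lines round-robin into the sub-buffers:
    line 1 -> buffer 0, line 2 -> buffer 1, line 3 -> buffer 2, line 4 wraps."""
    subbuffers = [[] for _ in range(num_subbuffers)]
    for row_idx, row in enumerate(feature_map):
        subbuffers[row_idx % num_subbuffers].append(row)
    return subbuffers
```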
Writing an input index: on entering this state, the data are stored in the same manner as the feature map data, finally reaching the input retrieval areas of the PEs.
Writing a convolution kernel: on entering this state, the accelerator waits for the AXI4 bus interface data-valid signal to be pulled high and enables the 16 convolution kernel buffers in sequence: the first convolution kernel buffer stores the convolution kernel values for the first output channel, the second those for the second output channel, and so on. Each sub-buffer then outputs one datum per clock; the 16 weight data are input in parallel into the first PE of each column of the PE array, passed in turn between adjacent PEs in the same column, and finally buffered in the weight buffer inside each PE unit.
Writing a weight index: on entering this state, the data are stored in the same manner as the convolution kernel data, finally reaching the weight retrieval areas of the PEs.
Convolution calculation: on entering this state, the PE control unit configures the calculation optimization mode of the PE unit according to the mode flag S1S0, reads data from the weight sub-buffer and the input sub-buffer according to the weight index values and input index values, and sends them into the multiply-accumulate unit for calculation. After 3×3 × (number of input channels) multiply-accumulate operations, all data have been calculated, and on the next clock the controller jumps to the send-calculation-result state.
Sending the calculation result: on entering this state, the calculation results are read out in turn from the 16 result sub-buffers. The first output-channel data in each result sub-buffer are taken out, every four output data are spliced into one 64-bit word, and the words are sent to the external memory DDR through the AXI4 bus interface. Once the data of all 16 output channels have been sent to the external DDR in turn, the accelerator jumps back to the waiting state.
Parameters can be modified through the state controller: the image size, convolution kernel size, stride, output feature map size, and number of output channels can all be changed at run time. By exploiting the runtime information and the algorithmic structure, redundant calculations are skipped, reducing unnecessary computation and memory accesses, improving the efficiency of the convolutional neural network accelerator, and lowering energy consumption.
The above embodiments only illustrate the technical idea of the present invention and do not limit its protection scope; any modification made to the technical scheme on the basis of this technical idea falls within the protection scope of the present invention.

Claims (6)

1. A computation-optimized convolutional neural network accelerator based on FPGA, characterized in that it comprises an AXI4 bus interface, a data buffer area, a prefetch data area, a result buffer area, a state controller and a PE array;
the data buffer area buffers the feature map data, convolution kernel data and index values read from the external memory DDR through the AXI4 bus interface, and comprises M feature map sub-buffers and C convolution kernel sub-buffers;
the prefetch data area prefetches, from the feature map sub-buffers, the feature map data that must be input into the PE array in parallel;
the PE array is realized on the FPGA and comprises R×C PE units; each column of PE units is provided with a corresponding convolution kernel sub-buffer, and the number of feature map sub-buffers actually used is determined by the parameters of the layer actually being calculated; the PE array reads the data in the prefetch data area and the convolution kernel sub-buffers to carry out the convolution operation, PE units in different columns calculating different output feature maps and PE units in different rows calculating different rows of the same output feature map;
the PE unit comprises an input buffer, a weight buffer, an input retrieval area, a weight retrieval area, a PE control unit, a pre-activation unit and a multiply-accumulate unit; the input buffer and the weight buffer respectively store the feature map data and weight data required by the convolution calculation, and the input retrieval area and the weight retrieval area respectively store the index values used to look up the feature map data and the weight data; the PE control unit controls the working state of the PE unit, reads the index values from the retrieval areas, reads data from the buffers according to the index values, sends the data into the multiply-accumulate unit for calculation, configures the mode of the multiply-accumulate unit, and decides whether to start the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution calculation and, if the partial sum is smaller than 0, stops the calculation and outputs 0; the multiply-accumulate unit carries out the convolution calculation and can be configured into a normal multiply-accumulate mode or the weight-repetition optimization mode;
the PE control unit determines whether the convolution optimization mode of the multiply-accumulate unit is the pre-activation mode or the weight-repetition mode, and a different optimization mode can be selected for each layer; the mode is determined by a two-bit mode flag: a high bit of 0 selects normal multiply-accumulate calculation, while a high bit of 1 selects the weight-repetition optimization mode; a low bit of 0 disables pre-activation, while a low bit of 1 enables the pre-activation mode;
the PE unit supports two calculation optimization modes, a pre-activation mode and a weight-repetition mode; the pre-activation mode monitors the sign of the convolution partial sum in real time, terminating the calculation and directly outputting the Relu result zero if the partial sum is negative, and continuing the convolution calculation if it is positive; the weight-repetition mode first adds together the feature map data that share the same weight in the convolution operation and then multiplies the sum by that weight, reducing both the number of multiplications and the number of accesses to the weight data;
the result buffer area comprises R result sub-buffers, one per row of PE units, each buffering the calculation results of its row;
the state controller controls the working state of the accelerator and realizes the transitions between working states.
2. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the weight retrieval area comprises several weight sub-retrieval areas; weights are written into the weight sub-buffers ordered from positive to negative, with zero weights last, and the corresponding input index values and weight index values are written into the retrieval areas in the same order; the sorting of the weights and index values is done offline; during convolution calculation, the weights are read sequentially from the weight buffers according to the weight index values.
3. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 2, wherein: each weight index value uses a one-bit weight conversion flag to indicate whether the weight used for calculation changes; if the flag is 0, the weight is unchanged and the previous clock cycle's weight is reused; if the flag is 1, the weight changes and the next clock cycle reads the next weight from the weight sub-buffer in sequence.
4. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 3, wherein: in the weight-repetition mode, the input feature map data are accumulated while the weight conversion flag is 0, and the accumulated result is stored in a register; when the weight conversion flag is 1, the accumulation finishes, the accumulated partial sum and the weight value are sent to the multiplication unit to be multiplied, and the result is stored in the register.
5. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the state controller has 7 states: waiting, writing the feature map, writing the input index, writing the convolution kernel, writing the weight index, convolution calculation, and sending the calculation result; in each state, corresponding control signals are sent to the corresponding sub-modules to complete the corresponding functions.
6. An FPGA-based computation-optimized convolutional neural network accelerator as defined in claim 1, wherein: the AXI4 bus interface splices multiple data into one multi-bit word for transmission.
CN201811493592.XA 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization Active CN109598338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493592.XA CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Publications (2)

Publication Number Publication Date
CN109598338A (en) 2019-04-09
CN109598338B (en) 2023-05-19

Family

ID=65961420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493592.XA Active CN109598338B (en) 2018-12-07 2018-12-07 Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization

Country Status (1)

Country Link
CN (1) CN109598338B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
CN110163295A (en) * 2019-05-29 2019-08-23 四川智盈科技有限公司 It is a kind of based on the image recognition reasoning accelerated method terminated in advance
CN110059808B (en) * 2019-06-24 2019-10-18 深兰人工智能芯片研究院(江苏)有限公司 A kind of method for reading data and reading data device of convolutional neural networks
CN110390383B (en) * 2019-06-25 2021-04-06 东南大学 Deep neural network hardware accelerator based on power exponent quantization
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110378468B (en) * 2019-07-08 2020-11-20 浙江大学 Neural network accelerator based on structured pruning and low bit quantization
CN110414677B (en) * 2019-07-11 2021-09-03 东南大学 Memory computing circuit suitable for full-connection binarization neural network
WO2021031154A1 (en) * 2019-08-21 2021-02-25 深圳市大疆创新科技有限公司 Method and device for loading feature map of neural network
CN110673786B (en) * 2019-09-03 2020-11-10 浪潮电子信息产业股份有限公司 Data caching method and device
CN110705687B (en) * 2019-09-05 2020-11-03 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
CN110738312A (en) * 2019-10-15 2020-01-31 百度在线网络技术(北京)有限公司 Method, system, device and computer readable storage medium for data processing
US11249651B2 (en) * 2019-10-29 2022-02-15 Samsung Electronics Co., Ltd. System and method for hierarchical sort acceleration near storage
CN110910434B (en) * 2019-11-05 2023-05-12 东南大学 Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency
CN111062472B (en) * 2019-12-11 2023-05-12 浙江大学 Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111178519B (en) * 2019-12-27 2022-08-02 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN113095471B (en) * 2020-01-09 2024-05-07 北京君正集成电路股份有限公司 Method for improving efficiency of detection model
CN111414994B (en) * 2020-03-03 2022-07-12 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111416743B (en) * 2020-03-19 2021-09-03 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 Neural network accelerator for data high multiplexing based on FPGA
CN111898733B (en) * 2020-07-02 2022-10-25 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN111984548B (en) * 2020-07-22 2024-04-02 深圳云天励飞技术股份有限公司 Neural network computing device
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112187954A (en) * 2020-10-15 2021-01-05 中国电子科技集团公司第五十四研究所 Flow control method of offline file in measurement and control data link transmission
CN112580793B (en) * 2020-12-24 2022-08-12 清华大学 Neural network accelerator based on time domain memory computing and acceleration method
CN114692847B (en) * 2020-12-25 2024-01-09 中科寒武纪科技股份有限公司 Data processing circuit, data processing method and related products
CN112668708B (en) * 2020-12-28 2022-10-14 中国电子科技集团公司第五十二研究所 Convolution operation device for improving data utilization rate
CN113094118B (en) * 2021-04-26 2023-05-30 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device, and storage medium
CN113780529B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 FPGA-oriented sparse convolutional neural network multi-stage storage computing system
CN113869494A (en) * 2021-09-28 2021-12-31 天津大学 Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis
CN114780910B (en) * 2022-06-16 2022-09-06 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation
CN115311536B (en) * 2022-10-11 2023-01-24 绍兴埃瓦科技有限公司 Sparse convolution processing method and device in image processing
CN116187408B (en) * 2023-04-23 2023-07-21 成都甄识科技有限公司 Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100076915A1 (en) * 2008-09-25 2010-03-25 Microsoft Corporation Field-Programmable Gate Array Based Accelerator System
US20180032859A1 (en) * 2016-07-27 2018-02-01 Samsung Electronics Co., Ltd. Accelerator in convolutional neural network and method for operating the same
CN108241890A (en) * 2018-01-29 2018-07-03 清华大学 A kind of restructural neural network accelerated method and framework
CN108537334A (en) * 2018-04-26 2018-09-14 济南浪潮高新科技投资发展有限公司 A kind of acceleration array design methodology for CNN convolutional layer operations
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN108665059A (en) * 2018-05-22 2018-10-16 中国科学技术大学苏州研究院 Convolutional neural networks acceleration system based on field programmable gate array

Also Published As

Publication number Publication date
CN109598338A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
CN109598338B (en) Convolutional neural network accelerator based on FPGA (Field Programmable Gate Array) for calculation optimization
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
US11574659B2 (en) Parallel access to volatile memory by a processing device for machine learning
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
WO2020258528A1 (en) Configurable universal convolutional neural network accelerator
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
Ma et al. Performance modeling for CNN inference accelerators on FPGA
CN108537331A (en) A kind of restructural convolutional neural networks accelerating circuit based on asynchronous logic
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN108427990A (en) Neural computing system and method
CN113743599B (en) Computing device and server of convolutional neural network
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
CN103927270A (en) Shared data caching device for a plurality of coarse-grained dynamic reconfigurable arrays and control method
Liu et al. FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power
CN111488051A (en) Cloud deep neural network optimization method based on CPU and FPGA cooperative computing
Song et al. BRAHMS: Beyond conventional RRAM-based neural network accelerators using hybrid analog memory system
Kang et al. A framework for accelerating transformer-based language model on ReRAM-based architecture
US11436486B2 (en) Neural network internal data fast access memory buffer
CN117234720A (en) Dynamically configurable memory computing fusion data caching structure, processor and electronic equipment
CN112183744A (en) Neural network pruning method and device
CN103577160A (en) Characteristic extraction parallel-processing method for big data
EP3859535A1 (en) Streaming access memory device, system and method
US11526305B2 (en) Memory for an artificial neural network accelerator
US20220164127A1 (en) Memory for an Artificial Neural Network Accelerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant