CN112801285B - FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Info

Publication number
CN112801285B
CN112801285B (application CN202110157101.XA; published as CN112801285A)
Authority
CN
China
Prior art keywords
data
layer
calculation
fpga
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110157101.XA
Other languages
Chinese (zh)
Other versions
CN112801285A (en)
Inventor
陈雪 (Chen Xue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Weihao Technology Co ltd
Original Assignee
Nanjing Weihao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Weihao Technology Co., Ltd.
Priority to CN202110157101.XA
Publication of CN112801285A
Application granted
Publication of CN112801285B
Legal status: Active


Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. ICT aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an FPGA-based high-resource-utilization CNN accelerator and an acceleration method thereof. The accelerator comprises a plurality of layer processors connected end to end, which complete the calculation of continuous batch tasks in a pipelined manner; each layer processor comprises a convolution calculation unit, an intra-layer data multiplexing unit, an inter-layer data multiplexing unit and an optimizing unit based on resource utilization, all electrically connected to one another. The accelerator uses a convolution calculation unit based on Winograd fast convolution to reduce the number of multiplications and the amount of multiplier resources required by continuous convolution operations, which effectively improves its energy efficiency; it is further optimized against a resource-utilization objective, which reduces the waste of computing resources and raises the performance ceiling of CNN accelerators on FPGAs.

Description

FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
Technical Field
The invention relates to the field of hardware acceleration of deep learning algorithms, in particular to an FPGA-based high-resource-utilization CNN accelerator and an acceleration method thereof.
Background
CNN is one of the most important deep learning algorithms and, owing to its very high accuracy, is widely used in fields such as object classification and autonomous driving. The excellent recognition accuracy of CNNs comes at the cost of enormous computation and parameter counts: the convolution layers account for more than 90% of a network's computation, and convolution is complex, so on a CPU designed for serial computation it causes frequent data accesses and runs slowly. As CNNs develop, recognition accuracy keeps rising and the computational load grows sharply, and traditional CPU platforms can no longer meet the real-time requirements of new applications, so many researchers have devoted themselves to designing efficient hardware structures to accelerate CNNs. FPGAs offer abundant resources, strong parallel computing capability and low power consumption, and their programmability gives FPGA-based accelerators high flexibility: the hardware structure can be updated in step with the CNN algorithm at low cost. FPGA-based high-resource-utilization CNN accelerators have therefore attracted wide attention. At present, some FPGA-based high-resource-utilization CNN accelerators accelerate networks with a single-layer processor structure; this can accelerate a variety of CNN networks and is highly flexible, but intermediate data are frequently moved on and off chip, so the operation time and power cost are high and continuous batch recognition tasks are handled poorly. Other accelerators target one specific CNN network: all layers are mapped onto the FPGA, different layers are processed by a series of layer processors, and batch tasks are executed between the layers in a pipelined manner, shortening the output interval. In addition, no intermediate data are returned off chip during the whole process, which reduces the power consumed by on-chip/off-chip data accesses. The disadvantage is that whenever the network changes, the code must be rewritten, synthesized and burned again.
At present, fully mapped FPGA-based CNN accelerators generally take the Roofline model as a reference and allocate on-chip resources with throughput as the optimization target so as to improve pipeline efficiency. However, this method usually considers only the number of multipliers used; it lacks an analysis of the multipliers' actual effective operation time and therefore cannot exploit the FPGA's computing resources to the maximum extent.
Disclosure of Invention
The invention aims to: provide an FPGA-based high-resource-utilization CNN accelerator, and further an acceleration method based on this accelerator, so as to solve the problems in the prior art.
The technical scheme is as follows: the FPGA-based high-resource-utilization CNN accelerator comprises a plurality of layer processors connected end to end, which complete the calculation of continuous batch tasks in a pipelined manner. Each layer processor is composed of a convolution calculation unit, an intra-layer data multiplexing unit and an inter-layer data multiplexing unit; the intra-layer and inter-layer data multiplexing units are mapped to specific circuit connections, while the optimizing unit is mapped to the resource usage within the layer processor.
In a further embodiment, there are at least three layer processors. The first layer processor receives the input pixels and the corresponding weight parameters, is responsible for data storage and calculation, and passes its results to the second layer processor. The second layer processor receives the data from the first layer processor and the weight parameters transmitted from off chip, completes data storage and calculation, and passes its results to the third layer processor. The third layer processor receives the data from the second layer processor and the weight parameters transmitted from off chip, completes data storage and calculation, and transfers the results off chip.
In a further embodiment, the convolution input data x1, x2, x3 are updated into the convolution calculation unit sequentially in a pipelined manner, while the weight parameters w1, w2, w3 remain unchanged until the calculation of the current batch is completed; a multiplexer controlled by the sel signal multiplies the input data by the corresponding parameters in different periods. The multiplication factor 1/2 of the Winograd fast convolution algorithm is embedded in the circuit and realized by shifting the signal lines, which reduces the number of multipliers used.
In a further embodiment, a single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data. Each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units, which cooperates with the convolution calculation unit to realize efficient convolution.
Based on this CNN accelerator, the embodiment provides a high-resource-utilization CNN acceleration method comprising the following steps:
step 1, cutting the feature map into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data;
step 2, storing data continuously from the initial address a of the storage unit until data block A is stored;
step 3, storing data block B at address b; once blocks A and B are stored, stopping the reception of data from the previous layer and starting the operation of the current batch;
step 4, after the calculation of the current batch is completed, storing a new data block C from address a so that it overwrites data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses;
step 5, computing the subsequent batches, updating only the necessary non-repeated data.
In a further embodiment, the high-resource-utilization CNN acceleration method further comprises the following steps. First, the total storage space required to hold all intermediate results of the accelerated CNN network simultaneously is calculated as Σ_{j=1..n} H_j·W_j·C_j, where H_j, W_j and C_j are the height, width and channel count of the layer-j feature map of the network and n is the number of network layers. When the required storage exceeds the on-chip storage capacity of the FPGA, the input feature map is cut into small blocks that are calculated in batches, as follows: a) first, the height of the last-layer feature map is cut to H_Tn = H_n / 2; if the last-layer height is odd, rows of zeros are appended, and the width and channel count remain unchanged; b) if this does not yet satisfy the storage requirement, the feature-map heights are cut from the penultimate layer back to the input layer (the first layer) with the update formula H_Tj = (H_T(j+1) - 1) · S_j + K_j, where S_j and K_j are the convolution stride and kernel size of the j-th layer of the network, and rows of zero elements are appended to compensate for odd heights; c) if the storage requirement is still not met, the last-layer height is cut to H_Tn = H_n / 4 and step b is repeated, and so on, until the storage requirement is met or the last-layer feature-map height reaches 2; d) if the clipping height of the last-layer feature map has reached 2 and the storage requirement still cannot be met, the same clipping procedure is applied to the feature-map width until the requirement is satisfied.
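As an illustrative aid only (no such code appears in the disclosure), the storage check and the height-clipping loop of steps a) to c) can be sketched in Python as follows; the function name, the power-of-two halving of the last-layer height and the omission of width stage d) are our simplifying assumptions:

```python
import math

def clip_heights(shapes, strides, kernels, on_chip_words):
    """Sketch of the clipping flow, height direction only.
    shapes:  per-layer (H_j, W_j, C_j) of the n feature maps;
    strides: S_j, kernels: K_j of the convolution consuming layer j;
    on_chip_words: available on-chip storage, in data words."""
    n = len(shapes)
    if sum(h * w * c for h, w, c in shapes) <= on_chip_words:
        return [s[0] for s in shapes]          # everything already fits
    divisor = 2
    while True:
        h_t = [0] * n
        # steps a)/c): halve the last layer's height, padding odd values
        h_t[-1] = math.ceil(shapes[-1][0] / divisor)
        # step b): propagate toward the input layer:
        # H_Tj = (H_T(j+1) - 1) * S_j + K_j
        for j in range(n - 2, -1, -1):
            h_t[j] = (h_t[j + 1] - 1) * strides[j] + kernels[j]
        need = sum(h * s[1] * s[2] for h, s in zip(h_t, shapes))
        if need <= on_chip_words or h_t[-1] <= 2:
            return h_t       # step d) (width clipping) is omitted here
        divisor *= 2

# Three feature maps of a toy 2-layer conv net (stride 1, kernel 3).
print(clip_heights([(32, 32, 16), (30, 30, 32), (28, 28, 64)],
                   strides=[1, 1], kernels=[3, 3], on_chip_words=40000))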
The feature map is cut into small blocks that are operated on in batches, intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data. During storage, data are first written continuously from the initial address a of the storage unit until data block A is complete. Data block B is then stored at address b; once blocks A and B are stored, reception of data from the previous layer stops and the operation of the current batch begins. After the calculation of the current batch is completed, a new data block C is stored from address a, overwriting data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses. In this embodiment, a complete convolution block is provided by the previous layer only in the first calculation batch; each subsequent batch updates only the necessary non-repeated data, which reduces the data repetition of blocks B and C caused by the clipping of the feature map and improves the efficiency of layer-to-layer data transfer and calculation.
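A minimal Python model of this ping-pong storage scheme (illustrative only; the class and method names are ours) shows how block C overwrites block A in place while the logical row order is recovered purely by changing fetch addresses:

```python
class PingPongFeatureBuffer:
    """Two half-buffers at addresses a and b; a new block always
    overwrites the stale half, and the logical row order is restored
    by changing fetch addresses rather than by moving any data."""
    def __init__(self):
        self.half = [None, None]  # contents at address a and address b
        self.stale = 0            # index of the half to overwrite next

    def write_block(self, block):
        self.half[self.stale] = block  # e.g. block C overwrites block A
        self.stale ^= 1

    def logical_map(self):
        # Older half first, newer half second.
        return self.half[self.stale] + self.half[self.stale ^ 1]

buf = PingPongFeatureBuffer()
buf.write_block(["A0", "A1"])          # step 2: block A at address a
buf.write_block(["B0", "B1"])          # step 3: block B at address b
assert buf.logical_map() == ["A0", "A1", "B0", "B1"]
buf.write_block(["C0", "C1"])          # step 4: C overwrites A in place
assert buf.logical_map() == ["B0", "B1", "C0", "C1"]
```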
In a further embodiment, once the FPGA and the CNN network are given, the network is first partitioned according to the accelerator architecture so that the intermediate data fit in storage; the calculation batch of the fully connected layers is then chosen according to the on-chip/off-chip data transmission capacity so as to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization, achieving optimal utilization of the computing resources and completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
Compared with the prior art, the invention has the remarkable advantages that:
1) the convolution calculation unit based on fast convolution reduces the number of multiplications and the amount of multiplier resources required by continuous convolution operations, which effectively improves the accelerator's energy efficiency;
2) the accelerator is optimized against a resource-utilization objective, which reduces the waste of computing resources and raises the performance ceiling of the CNN accelerator on the FPGA.
Drawings
Fig. 1 is a system frame diagram of a CNN accelerator of the present invention.
Fig. 2 is a circuit configuration diagram of the convolution calculating unit of the present invention.
Fig. 3 is a schematic diagram of an intra-layer data multiplexing scheme according to the present invention.
Fig. 4 is a schematic diagram of an interlayer data multiplexing scheme according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without one or more of these details. In other instances, well-known features have not been described in detail in order to avoid obscuring the invention.
The applicant notes that, at present, some FPGA-based high-resource-utilization CNN accelerators accelerate networks with a single-layer processor structure; in this process, intermediate data are frequently moved on and off chip, so the operation time and power cost are high and continuous batch recognition tasks are handled poorly. Other accelerators target one specific CNN network: all layers are mapped onto the FPGA, different layers are processed by a series of layer processors, and batch tasks are executed between the layers in a pipelined manner, shortening the output interval. In addition, no intermediate data are returned off chip during the whole process, which reduces the power consumed by on-chip/off-chip data accesses. The disadvantage is that whenever the network changes, the code must be rewritten, synthesized and burned again. Fully mapped FPGA-based CNN accelerators generally take the Roofline model as a reference and allocate on-chip resources with throughput as the optimization target to improve pipeline efficiency; however, this method usually considers only the number of multipliers used, lacks an analysis of their actual effective operation time, and cannot exploit the FPGA's computing resources to the maximum extent.
Therefore, the applicant provides an FPGA-based high-resource-utilization CNN accelerator, and further an acceleration method based on it. A convolution calculation unit based on Winograd fast convolution reduces the number of multiplications and the multiplier resources required by continuous convolution operations, effectively improving the accelerator's energy efficiency, and the accelerator is optimized against a resource-utilization objective, reducing the waste of computing resources and raising the performance ceiling of the CNN accelerator on the FPGA.
Embodiment one:
referring to fig. 1, in the FPGA-based high-resource-utilization CNN accelerator of this embodiment, multiple layer processors are connected end to end and complete the calculation of continuous batch tasks in a pipelined manner. Each layer processor is composed of a convolution calculation unit and intra-layer and inter-layer data multiplexing schemes; the intra-layer and inter-layer data multiplexing units are mapped to specific circuit connections, while the optimizing unit is mapped to the resource usage within the layer processor. The first layer processor receives the input pixels and corresponding weight parameters, stores and processes the data, and passes the results to the next layer processor. Each middle layer processor receives the previous layer's data and the weight parameters transmitted from off chip, completes data storage and calculation, and passes the results to the next layer processor. The last layer processor receives the previous layer's data and the weight parameters transmitted from off chip, completes data storage and calculation, and transfers the results off chip.
Embodiment two:
in the embodiment shown in fig. 2, the convolution input data x1, x2, x3, ... are updated into the convolution calculation unit sequentially in a pipelined manner (x2 replaces x1, x3 replaces x2, x4 replaces x3), and the multiply-add operations complete periodically. In the first calculation period, an adder and a multiplexer compute (x1 - x3) and (x3 + x2), while the result of (x3 - x2) is written into a register for temporary storage. The sel signal alternately selects the results of (x3 + x2) and (x3 - x2). The weight parameters w1, w2, w3 remain unchanged until the calculation of the current batch is completed; they can be processed independently ahead of the input data, the sums (w1 + w3 + w2) and (w1 + w3 - w2) being computed on chip with an adder. The MSB and LSB are the most and least significant bits of the binary data; the multiplication factor 1/2 of the Winograd fast convolution algorithm is embedded in the circuit and realized by shifting the signal lines. The sel signal synchronously selects the corresponding multiplication factors and the two multipliers output their products simultaneously, so a one-dimensional convolution with input vector length 4, kernel length 3 and stride 1 is completed every 2 periods; stacking and combining several such structures completes the Winograd fast convolution algorithm with few computing resources.
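The arithmetic realized by this unit is the Winograd F(2,3) minimal filtering algorithm, which produces 2 outputs of a length-3, stride-1 convolution with 4 multiplications instead of 6. The following Python sketch (illustrative only; the function name and the NumPy check are ours, not part of the disclosure) reproduces the transforms described above and verifies them against direct convolution:

```python
import numpy as np

def winograd_f23(x, w):
    """One Winograd F(2,3) tile: two outputs of a 1-D convolution
    (length-4 input, length-3 kernel, stride 1) using 4 multiplications
    instead of the 6 needed by the direct method."""
    x1, x2, x3, x4 = x
    w1, w2, w3 = w
    # Weight-side transform; the 1/2 factor is the one the patent embeds
    # in the circuit by shifting the signal lines (no multiplier needed).
    u_plus = (w1 + w3 + w2) * 0.5
    u_minus = (w1 + w3 - w2) * 0.5
    m1 = (x1 - x3) * w1       # adder result of the first period
    m2 = (x3 + x2) * u_plus   # chosen by the sel-controlled multiplexer
    m3 = (x3 - x2) * u_minus  # held in the register, chosen next period
    m4 = (x2 - x4) * w3
    return m1 + m2 + m3, m2 - m3 - m4

# Verify against the direct sliding-window convolution.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
y0, y1 = winograd_f23(x, w)
assert np.isclose(y0, x[0] * w[0] + x[1] * w[1] + x[2] * w[2])
assert np.isclose(y1, x[1] * w[0] + x[2] * w[1] + x[3] * w[2])
```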
Embodiment III:
in the embodiment shown in fig. 3, the PEs are built from the unit of the embodiment of fig. 2. A single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data. Each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units. In cooperation with the convolution calculation unit of fig. 2, this realizes efficient convolution.
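A small Python sketch of this storage scheme (illustrative only; the names and the stride-1, 3-row window are our assumptions) shows how the parity-split banks feed a line buffer that emits one convolution window per step once filled:

```python
def split_parity_banks(rows):
    """Even-indexed rows to one memory, odd-indexed rows to the other,
    so two rows can be fetched in parallel each cycle."""
    return rows[0::2], rows[1::2]

def sliding_windows(rows, k=3):
    """Line-buffer sketch: refill from the two banks in parallel and
    emit every k consecutive rows (stride 1) to the PE group."""
    even, odd = split_parity_banks(rows)
    window = []
    for i in range(len(rows)):
        window.append(even[i // 2] if i % 2 == 0 else odd[i // 2])
        if len(window) >= k:
            yield window[-k:]

# A 5-row feature map yields three 3-row windows for a 3x3 convolution.
rows = [f"row{i}" for i in range(5)]
assert len(list(sliding_windows(rows))) == 3
```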
Embodiment four:
in the embodiment shown in fig. 4, the feature map is cut into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data. During storage, data are first written continuously from the initial address a of the storage unit until data block A is complete. Data block B is then stored at address b; once blocks A and B are stored, reception of data from the previous layer stops and the operation of the current batch begins. After the calculation of the current batch is completed, a new data block C is stored from address a, overwriting data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses. In this embodiment, a complete convolution block is provided by the previous layer only in the first calculation batch; each subsequent batch updates only the necessary non-repeated data, which reduces the data repetition of blocks B and C caused by the clipping of the feature map and improves the efficiency of layer-to-layer data transfer and calculation.
After the FPGA and the CNN network are given, the network is first partitioned according to the accelerator architecture so that the intermediate data fit in storage; the calculation batch of the fully connected layers is then chosen according to the on-chip/off-chip data transmission capacity to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization, achieving optimal utilization of the computing resources and completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
Fifth embodiment:
in order to verify the effectiveness of the invention, the following experiment was carried out on a Xilinx VC709 FPGA platform with the AlexNet network as the example.
The off-chip memory is a 4 GB DDR3 memory; the convolution layers of the AlexNet network need little clipping and are kept at their original size, and the batch size of the fully connected layers is 32. All data are in 16-bit fixed-point format.
After the layer processor architecture of each layer is determined, the theoretical computation time T_j for a single task on each layer processor and the output interval ΔT of continuous batch tasks are obtained from the size and computational characteristics of the AlexNet convolution layers. The effective utilization rate of the computing resources is then U = Σ_{j=1..n}(T_j · d_j) / (ΔT · Σ_{j=1..n} d_j), where n is the number of CNN layers and d_j is the number of multipliers in the layer-j processor.
Substituting the resource constraints and the network information into this formula for optimization, when the numbers of multipliers of the accelerator's layer processors are {386, 1296, 368, 488, 368, 160, 64, 32}, the theoretical computing-resource utilization of the accelerator reaches 98.8%; a CNN accelerator realized with these parameters processes AlexNet with a throughput of 973 GOPs and a resource efficiency (throughput per multiplier) of 0.31 GOPs/DSP.
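A short Python sketch of this metric follows (hypothetical: the closed form is our reconstruction above, since the original formula survives only as an image, and the example numbers are invented):

```python
def resource_utilization(T, d, dT):
    """Effective computing-resource utilization: the d_j multipliers of
    layer j do useful work for T_j seconds out of every output interval
    dT, so utilization = sum(T_j * d_j) / (dT * sum(d_j))."""
    assert len(T) == len(d)
    return sum(Tj * dj for Tj, dj in zip(T, d)) / (dT * sum(d))

# Hypothetical 3-layer pipeline with a 100 microsecond output interval.
print(resource_utilization(T=[95e-6, 98e-6, 90e-6],
                           d=[256, 512, 128], dT=100e-6))  # ~0.96
```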
In conclusion, the FPGA-based high-resource-utilization CNN accelerator achieves both high resource efficiency and high throughput while reducing the waste of resources.
As described above, although the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limiting the invention itself. Various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. An acceleration method of an FPGA-based high-resource-utilization CNN accelerator, characterized in that the CNN accelerator comprises: a plurality of layer processors connected end to end, the layer processors completing the calculation of continuous batch tasks in a pipelined manner;
each layer processor comprises a convolution calculation unit, an intra-layer data multiplexing unit, an inter-layer data multiplexing unit and an optimizing unit based on resource utilization, which are electrically connected to one another;
the first layer processor is used for receiving input pixels and corresponding weight parameters, is responsible for data storage and calculation, and transfers calculation results to the second layer processor;
the second layer processor receives the data from the first layer processor and the weight parameters transmitted from the outside of the chip, completes data storage and calculation, and transfers the calculation result to the third layer processor;
the third layer processor receives the data from the second layer processor and the weight parameters transmitted from the outside of the chip, completes data storage and calculation, and transfers the calculation result to the outside of the chip;
the convolved input data x1, x2 and x3 are sequentially updated into the convolution calculation unit in a pipeline mode, weight parameters w1, w2 and w3 are kept unchanged before the calculation of the current batch is completed, and the multiplexer is controlled by the sel signal to realize multiplication of the input data and corresponding parameters in different periods;
the acceleration method comprises the following steps:
step 1, cutting the feature map into small blocks that are operated on in batches; intra-layer and inter-layer data blocks are transmitted continuously, and adjacent convolution blocks contain repeated data;
step 2, storing data continuously from the initial address a of the storage unit until data block A is stored;
step 3, storing data block B at address b; once blocks A and B are stored, stopping the reception of data from the previous layer and starting the operation of the current batch;
step 4, after the calculation of the current batch is completed, storing a new data block C from address a so that it overwrites data block A; blocks C and B then form a new feature map, and the data order is adjusted for the operation by changing the fetch addresses;
step 5, computing the subsequent batches and updating only the necessary non-repeated data;
first, the total storage space required to hold all intermediate results of the accelerated CNN network simultaneously is calculated as Σ_{j=1..n} H_j·W_j·C_j, where H_j, W_j and C_j are the height, width and channel count of the layer-j feature map of the network, and n is the number of network layers;
when the required storage space is higher than the storage capacity on the FPGA chip, cutting the input feature map into small blocks for batch calculation;
the process of clipping the input feature map into small blocks for calculation in batches further comprises:
a) clipping the height of the last-layer feature map to H_Tn = H_n / 2; if the last-layer height is odd, appending rows of zeros, with the width and channel count unchanged;
b) if the storage requirement is still not met, clipping the feature-map heights from the penultimate layer back to the input layer with the update formula H_Tj = (H_T(j+1) - 1) · S_j + K_j,
where S_j and K_j are the convolution stride and kernel size of the j-th layer of the network, and rows of zero elements are appended to compensate for odd heights;
c) if the storage requirement is still not met, clipping the last-layer height to H_Tn = H_n / 4 and repeating step b until the storage requirement is met or the last-layer feature-map height reaches 2;
d) when the clipping height of the last-layer feature map is 2 and the storage requirement still cannot be met, applying the same clipping procedure to the feature-map width until the requirement is satisfied.
2. The acceleration method of the FPGA-based high-resource-utilization CNN accelerator of claim 1, wherein: a single input feature map stores the data of its even and odd rows in separate memories, providing two-way parallel storage and supply of data; each memory is followed by a line-buffer structure that periodically provides input data blocks to the group of computing units.
3. The acceleration method of the FPGA-based high-resource-utilization CNN accelerator of claim 2, wherein: after the FPGA and the CNN network are given, the network is first partitioned to meet the storage requirement of the intermediate data; the calculation batch of the fully connected layers is then determined according to the on-chip/off-chip data transmission capacity to reduce memory accesses; finally, the amount of resources allocated to each layer is determined from the ratio of effective computation between layers and the actual resource utilization efficiency, completing the configuration of the accelerator parameters, after which the code is synthesized and burned onto the chip.
CN202110157101.XA 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof Active CN112801285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110157101.XA CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110157101.XA CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Publications (2)

Publication Number Publication Date
CN112801285A CN112801285A (en) 2021-05-14
CN112801285B (en) 2024-01-26

Family

ID=75814210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110157101.XA Active CN112801285B (en) 2021-02-04 2021-02-04 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof

Country Status (1)

Country Link
CN (1) CN112801285B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
CN110555516A (en) * 2019-08-27 2019-12-10 上海交通大学 FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN111831254A (en) * 2019-04-15 2020-10-27 阿里巴巴集团控股有限公司 Image processing acceleration method, image processing model storage method and corresponding device
CN112306951A (en) * 2020-11-11 2021-02-02 哈尔滨工业大学 CNN-SVM resource efficient acceleration architecture based on FPGA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2016203619A1 (en) * 2016-05-31 2017-12-14 Canon Kabushiki Kaisha Layer-based operations scheduling to optimise memory for CNN applications
US10621486B2 (en) * 2016-08-12 2020-04-14 Beijing Deephi Intelligent Technology Co., Ltd. Method for optimizing an artificial neural network (ANN)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Chinese Character Recognition System Based on FPGA and CNN; Pan Siyuan (潘思园) et al.; Information Technology and Network Security (《信息技术与网络安全》); Vol. 38, No. 9; pp. 44-49 *

Also Published As

Publication number Publication date
CN112801285A (en) 2021-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant