CN116720549A - FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache - Google Patents
- Publication number: CN116720549A
- Application number: CN202310797346.8A
- Authority: CN (China)
- Prior art keywords: convolution, chip, core, calculation, FPGA
- Prior art date: 2023-07-03
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/0464 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Convolutional networks [CNN, ConvNet]
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache, which comprises the following steps. Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer; allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core. Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, so that the convolution operations within a block are fully parallelized on the FPGA chip. Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication (memory access) ratio corresponding to each blocking factor configuration combination introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
Description
Technical Field
The invention belongs to the technical field of convolutional neural network hardware acceleration, and particularly relates to an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache.
Background
In recent years, with the continuous development of deep learning theory, convolutional neural networks (Convolutional Neural Networks, CNNs) have achieved ever higher accuracy and performance across artificial intelligence tasks. CNNs typically carry high computational complexity and large parameter storage requirements, and research on CNN algorithms still tends toward ever larger models. New network layers lead to more complex structures and larger model sizes, requiring billions of operations, millions of parameters, and substantial computing resources to train and evaluate the final network.
Accordingly, hardware acceleration platforms, mainly Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), are widely used to increase the throughput and processing speed of CNNs. GPU accelerators have very high power consumption; they are generally suited to offline model training and are difficult to deploy on battery-powered mobile platforms. Furthermore, GPU performance comes from processing batches of images in parallel, whereas some practical tasks, such as video streaming, require the input images to be processed frame by frame, which reduces the GPU's advantage to some extent. In contrast, although the memory and computing resources of an FPGA platform are limited, it can reach high performance at low power consumption, and the hardware can be tailored to the computation of a specific neural network model to achieve highly parallel inference. The programmable logic array of an FPGA makes it flexible and well suited to developing and verifying a variety of algorithms, with a short design cycle and low power consumption.
The on-chip storage of a typical FPGA chip consists mainly of block RAM cache resources (BRAM) and a relatively small number of register resources. This capacity is still far too small compared with the memory footprint of a neural network model: common CNN models are generally 100-1000 MB in size, while the largest on-chip SRAM of current FPGAs still does not exceed 10 MB. Off-chip storage resources (DDR) are therefore needed to assist in deploying a CNN model, and the access bandwidth and power consumption of this off-chip memory limit the model's actual performance. Under these resource constraints, an FPGA-based CNN accelerator must be designed with careful attention to the compute processing units, the memory access pattern, the parallelization strategy, and other optimizations.
Disclosure of Invention
The invention aims to provide an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache. Targeting deep convolutional neural networks that contain two-dimensional convolution kernels of various sizes and computational characteristics, it provides a sound optimization method for the on-chip compute core design, the convolution loop blocking and unrolling strategy, and the data reuse scheme when a network model is implemented and deployed on an FPGA hardware platform, so that the CNN model achieves higher parallelism, streaming inference, and shorter on-chip inference latency in the FPGA hardware accelerator.
To this end, the invention provides an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache. First, model statistics are collected for the target network, including the computation amount, the data type, the two-dimensional convolution kernel sizes it contains, and the configuration parameters of each convolutional layer; on-chip compute cores and DSP computing resources are then allocated according to the kernel sizes. Next, for these compute cores, a weight-reuse convolution loop optimization strategy is designed based on fully caching the input features on chip: blocking factors are introduced along the input-channel and output-channel dimensions of the convolutions assigned to each compute core, and the convolution operations within a block can be fully parallelized on the FPGA chip. Finally, the hardware throughput and computation-to-communication ratio corresponding to the different blocking factors are computed under the on-chip DSP computing resource constraint and the on-chip BRAM storage resource constraint, and the achievable optimum is found by searching under the Roofline model. The optimization method is fully realizable, and its final output is the optimal configuration of the input-channel and output-channel loop unrolling factors of each convolutional layer.
The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache provided by the invention comprises the following implementation steps:
Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer. Allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core.
Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelized on the FPGA chip;
Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different blocking factor configuration combinations introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
The "model characteristic statistics on the network used" in the first step specifically includes the calculated amount, the data bit width, the number of input channels, the number of output channels, the convolution kernel, and the width and height dimensions of the output feature map of each layer. An FPGA on-chip computing core is allocated for each two-dimensional convolution kernel contained in the model, and DSP computing resource allocation is carried out for each core, and the method comprises the following steps:
S11, collect the parameters of the FPGA hardware platform and of the model needed by the subsequent steps, specifically: the on-chip storage capacity $BRAM_{max}$ (MB) of the FPGA platform, the number of on-chip DSP computing units $DSP_{max}$, the data bit width $M$ (bit) of the model, and the number $N$ of distinct two-dimensional convolution kernel sizes contained in the model;
S12, collect the parameters of each convolutional layer needed by the subsequent steps, specifically: the computation amount $C_i$ of each convolutional layer $conv_i$, its number of input channels $CH_{in}^i$, number of output channels $CH_{out}^i$, convolution kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$;
S13, according to the number $N$ of kernel sizes counted in step S11, allocate an on-chip compute core $Kernel_j$, $j \in [1, N]$, to each kernel size, all convolutional layers that share a kernel size belonging to one compute core;
S14, according to the classification of the convolutional layers in step S13, compute for each compute core the sum $C_{sum}^j$ of the computation amounts of all its convolutional layers;
S15, from the computation amount of each core obtained in step S14, compute each core's share of the on-chip DSP computing units, $DSP_j = \lfloor DSP_{max} \cdot C_{sum}^j / \sum_{n=1}^{N} C_{sum}^n \rfloor$.
In step two, "for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelizable on the FPGA chip" comprises the following steps:
S21, each compute core $Kernel_j$ allocated in step S13 corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks $\lceil CH_{out}^i / T_{out}^i \rceil$, the input-channel blocks $\lceil CH_{in}^i / T_{in}^i \rceil$, the output feature map rows $H_{out}^i$ and columns $W_{out}^i$, and the kernel rows and columns $k \times k$, where $T_{in}^i$ and $T_{out}^i$ are the blocking factors of the input and output channels;
S22, allocate in the on-chip BRAM a storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block;
S23, the programmable logic (PL) side of the chip reads from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
S24, unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$ of step S21, and compute intermediate convolution results by traversing the input feature map with the block of weights read in step S23, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$ allocated in step S22;
S25, return to step S23 and read, in address order, the next consecutive block of weights ($T_{in}^i$ channels for each of $T_{out}^i$ convolution kernels), then compute via step S24 and accumulate the output into the buffer $buf_{out}$. This loop repeats until step S23 fetches the last block of weights and step S24 produces its result, at which point the computation of compute core $Kernel_j$ for convolutional layer $conv_i$ is complete;
In step three, "under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different configuration combinations of the blocking factors of each convolutional layer introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer" comprises the following steps:
S31, for each convolutional layer $conv_i$, construct four constraints on its blocking factors, namely: 1) the number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core: $T_{in}^i \cdot T_{out}^i \le DSP_j$; 2) $T_{in}^i$ must not exceed the layer's number of input channels: $T_{in}^i \le CH_{in}^i$; 3) $T_{out}^i$ must not exceed the layer's number of output channels: $T_{out}^i \le CH_{out}^i$; 4) the weight data volume of each read may not exceed the maximum on-chip cache capacity: $T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$;
S32, by loop traversal, compute for each convolutional layer $conv_i$ the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme;
S33, search all configuration schemes obtained in step S32 under the Roofline model against the maximum throughput and the memory access performance ceiling attainable by the FPGA development board, and obtain, under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings, the numerical solution of the optimal blocking factors of convolutional layer $conv_i$;
Through the above steps, the FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache allocates distinct compute cores and computing resources from an analysis of the CNN network structure, introduces two blocking factors into the on-chip convolution loop computation, and solves for the optimum by combining the Roofline model with the performance limits of the hardware development board.
The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache designed by the invention is easy to integrate as an algorithm and can be applied directly to FPGA accelerators for existing mainstream CNN network models.
Drawings
FIG. 1 Overall framework of the method of the invention
FIG. 2 Convolution loop calculation method based on weight reuse
FIG. 3 Example of all theoretical solutions under the Roofline model
FIG. 4 All feasible solutions under the Roofline model
Detailed Description
So that the features, objects, and functions of the invention may be understood in detail, a more particular description of the invention, briefly summarized above, is given below with reference to embodiments, some of which are illustrated in the appended drawings.
First, the invention applies to two-dimensional convolutional neural networks; it is not applicable to existing one-dimensional or three-dimensional convolutional networks. The overall input of the invention is a convolutional neural network model with a definite numerical precision (e.g. 32-bit single precision or 8-bit fixed point), the acceleration platform is an FPGA, and the overall output is the loop computation structure used for on-chip inference of each convolutional layer when the network is deployed to the FPGA.
FIG. 1 shows the overall framework of the method of the invention. For a given network structure, the convolution kernel sizes contained in the model are counted first, and each kernel size is allocated a compute core and corresponding DSP multiplication resources. When each core accelerates a convolutional layer, on-chip hardware inference uses the weight-reuse convolution loop calculation method, and the optimal blocking factors for the loop blocking and unrolling are found with the Roofline model under the performance constraints of the hardware platform. The concrete implementation steps are as follows:
The first step: collect the parameters of the FPGA hardware platform, specifically the on-chip storage capacity $BRAM_{max}$ (MB) and the number of on-chip DSP computing units $DSP_{max}$. Collect the parameters of the network model, specifically the data bit width $M$ (bit) of the model and the number $N$ of distinct two-dimensional convolution kernel sizes it contains. Collect the parameters of each convolutional layer of the model, specifically the computation amount $C_i$ of each convolutional layer $conv_i$, its input channels $CH_{in}^i$, output channels $CH_{out}^i$, kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$. Once collected, these parameters are used by the subsequent steps.
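As a concrete illustration of these statistics, the following C++ sketch collects the platform and per-layer parameters used by the later steps; the record names and field layout are assumptions for illustration, not taken from the patent.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical records for the parameters gathered in the first step.
struct PlatformParams {
    double bram_max_mb;  // on-chip storage capacity BRAM_max (MB)
    int    dsp_max;      // number of on-chip DSP computing units DSP_max
    int    m_bits;       // data bit width M of the model (bits)
};

struct LayerParams {
    int k;               // convolution kernel width/height k
    int ch_in, ch_out;   // input/output channel counts CH_in, CH_out
    int w_in, h_in;      // input feature map width/height
    int w_out, h_out;    // output feature map width/height
    // computation amount C_i, counting each multiply-accumulate as 2 ops
    std::int64_t ops() const {
        return 2LL * ch_in * ch_out * k * k * w_out * h_out;
    }
};
```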
Each kernel size is then allocated an on-chip compute core $Kernel_j$, $j \in [1, N]$, all convolutional layers sharing a kernel size belonging to one compute core. The sum $C_{sum}^j$ of the computation amounts of all convolutional layers of each core is computed, and the number of DSP multiplication units available to each core is assigned in proportion to it:
$$DSP_j = \left\lfloor DSP_{max} \cdot \frac{C_{sum}^j}{\sum_{n=1}^{N} C_{sum}^n} \right\rfloor$$
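A minimal sketch of this proportional allocation, reusing the hypothetical LayerParams record above (flooring the fractional shares via integer division is an implementation choice the patent does not spell out):

```cpp
#include <map>
#include <vector>

// One compute core per distinct kernel size k; each core receives DSP units
// in proportion to the summed computation C_sum of its convolutional layers.
std::map<int, int> allocate_dsps(const std::vector<LayerParams>& layers,
                                 const PlatformParams& fpga) {
    std::map<int, std::int64_t> core_ops;  // key: kernel size k -> C_sum^j
    std::int64_t total = 0;
    for (const LayerParams& l : layers) {
        core_ops[l.k] += l.ops();
        total += l.ops();
    }
    std::map<int, int> core_dsps;          // key: kernel size k -> DSP_j
    for (const auto& [k, ops] : core_ops)
        core_dsps[k] = static_cast<int>(fpga.dsp_max * ops / total);
    return core_dsps;
}
```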
The second step: for each compute core, construct the weight-reuse convolution loop calculation method using full on-chip caching of the input features; the loop structure is sketched in FIG. 2. The convolution loop of each convolutional layer is built with blocking factors along the input-channel and output-channel dimensions, so that the convolution operations within a block are fully parallelized on the FPGA chip.
Specifically, as shown in FIG. 1, each compute core $Kernel_j$ corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks, the input-channel blocks, the output feature map rows and columns, and the kernel rows and columns. A storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$ are then allocated in the on-chip BRAM, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block.
The on-chip blocked convolution calculation process is shown in FIG. 2 and can be divided into the following 3 steps executed in a loop:
Step one: the programmable logic (PL) side of the chip begins reading from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
Step two: unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$, and compute intermediate convolution results by traversing the input feature map with the block of weights just read, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$;
Step three: returning to step one, reading the next consecutive in address orderA plurality of convolution kernels, each convolution kernel->The channel data are calculated by the second step and the obtained output result is accumulated to the buffer areaThe process is circularly carried out until the weight data of the last block is taken out and a calculation result is obtained, and a calculation core Kernel is calculated at the moment j Conv for convolution layer i Is completed.
At this point it should be noted that the loop order and dimensions used by the different compute cores are identical; only the extent of each dimension varies with the parameters of the different convolutional layers. Unrolling the output and input channels means that the $T_{in}^i \cdot T_{out}^i$ multiplications contained within each block are all performed on the hardware in the same clock cycle.
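To make the loop structure concrete, here is a software-level C++ sketch of the blocked, weight-reusing loop nest for one layer. It is a sketch under assumptions: stride 1, no padding, illustrative layer sizes and DDR weight layout, blocking factors that divide the channel counts, and out_buf zeroed by the caller; in an actual HLS design the two innermost loops would carry unroll pragmas.

```cpp
// Blocked convolution for one layer, following the three looped steps above.
constexpr int CH_IN = 64, CH_OUT = 128, K = 3;           // illustrative sizes
constexpr int W_OUT = 56, H_OUT = 56;
constexpr int W_IN = W_OUT + K - 1, H_IN = H_OUT + K - 1;
constexpr int T_IN = 4, T_OUT = 8;                       // blocking factors

void conv_layer(const float* ddr_weights,                // off-chip weights
                const float in_buf[CH_IN][H_IN][W_IN],   // full input cache
                float out_buf[CH_OUT][H_OUT][W_OUT]) {   // zeroed by caller
    static float w_buf[T_OUT][T_IN][K][K];               // one weight block
    for (int oc = 0; oc < CH_OUT; oc += T_OUT)           // output-channel blocks
        for (int ic = 0; ic < CH_IN; ic += T_IN) {       // input-channel blocks
            // Step 1: burst-read the next T_OUT*T_IN*K*K weights from DDR
            // (the linear weight layout below is assumed, not from the patent)
            for (int to = 0; to < T_OUT; ++to)
                for (int ti = 0; ti < T_IN; ++ti)
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            w_buf[to][ti][ky][kx] = ddr_weights[
                                (((oc + to) * CH_IN + ic + ti) * K + ky) * K + kx];
            // Step 2: reuse the weight block over the whole feature map
            for (int y = 0; y < H_OUT; ++y)              // output rows
                for (int x = 0; x < W_OUT; ++x)          // output columns
                    for (int ky = 0; ky < K; ++ky)       // kernel rows
                        for (int kx = 0; kx < K; ++kx)   // kernel columns
                            for (int to = 0; to < T_OUT; ++to)     // unrolled in HLS
                                for (int ti = 0; ti < T_IN; ++ti)  // unrolled in HLS
                                    // Step 3: accumulate intermediate results
                                    out_buf[oc + to][y][x] +=
                                        w_buf[to][ti][ky][kx] *
                                        in_buf[ic + ti][y + ky][x + kx];
        }
}
```

The design choice this illustrates is the weight reuse: each weight block is read from DDR exactly once and then swept across the entire input feature map held in $buf_{in}$, so external traffic for weights does not grow with the feature map size.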
The third step: when computing the optimal blocking factor configuration for each convolutional layer $conv_i$, the hardware throughput and computation-to-communication ratio of the different configuration combinations are computed under the constraints of the number of on-chip DSP computing units and of the on-chip BRAM storage resources, and the result is found by searching under the Roofline model.
Specifically, for each convolutional layer $conv_i$, four constraints are constructed first, namely:
1) The number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core, expressed as:
$$T_{in}^i \cdot T_{out}^i \le DSP_j$$
2) $T_{in}^i$ must not exceed the layer's number of input channels, expressed as:
$$T_{in}^i \le CH_{in}^i$$
3) $T_{out}^i$ must not exceed the layer's number of output channels, expressed as:
$$T_{out}^i \le CH_{out}^i$$
4) The weight data volume of each read may not exceed the maximum on-chip cache capacity $BRAM_{max}$, expressed as:
$$T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$$
Here $(M/8)$ converts the left side of the formula into bytes. Then, by loop traversal, the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme are computed for each convolutional layer $conv_i$.
The throughput represents the number of operations performed per second; for the convolution, i.e. multiply-accumulate, operations it is computed as:
$$CR_i = \frac{2 \cdot CH_{in}^i \cdot CH_{out}^i \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i}{\left\lceil \frac{CH_{in}^i}{T_{in}^i} \right\rceil \cdot \left\lceil \frac{CH_{out}^i}{T_{out}^i} \right\rceil \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i} \cdot f$$
where the numerator is the total number of operations of the layer, the denominator the number of execution cycles, and $f$ the clock frequency.
The computation-to-communication ratio represents the number of operations performed per unit of external memory traffic and is computed as:
$$CTC_i = \frac{2 \cdot CH_{in}^i \cdot CH_{out}^i \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i}{(M/8) \cdot \left( CH_{in}^i W_{in}^i H_{in}^i + CH_{out}^i W_{out}^i H_{out}^i \right) + \alpha_w \cdot B_w}$$
where, thanks to the full input cache, the input feature map is read from DDR once and the output feature map written once.
In the above, $\alpha_w$ and $B_w$ denote respectively the number of weight reads and the length in bytes of each read, computed as:
$$\alpha_w = \left\lceil \frac{CH_{in}^i}{T_{in}^i} \right\rceil \cdot \left\lceil \frac{CH_{out}^i}{T_{out}^i} \right\rceil, \qquad B_w = T_{in}^i \cdot T_{out}^i \cdot k^2 \cdot (M/8)$$
From the above formulas, the theoretical on-chip throughput and computation-to-communication ratio achievable by each parameter configuration in hardware can be computed. However, since the resources of an FPGA development board are limited, the number of DSP multipliers, the BRAM and memory access bandwidth, and the clock frequency together determine the throughput and memory access performance ceilings achievable on a given FPGA. All configuration schemes must therefore be searched under the Roofline model, and the numerical solution of the optimal blocking factors of convolutional layer $conv_i$ is found under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings. FIG. 3 and FIG. 4 show Roofline model solving examples, representing respectively all theoretical solutions and the feasible solutions obtained by the calculation. The horizontal axis is the computation-to-communication ratio (CTC) and the vertical axis the computational performance (CR); the slope of the line from any point to the origin is the minimum bandwidth required to realize the parameter scheme of that point. For example, the minimum bandwidth requirement of scheme P in FIG. 3 equals that of scheme P'. The memory access performance ceiling and the computational performance ceiling of the FPGA are marked in the example of FIG. 4. Any point to the left of the memory access performance ceiling requires more bandwidth than the platform can provide when implemented. The best-performing blocking factor combination is therefore selected along the performance-limiting boundary, such as point N in FIG. 4. Moreover, if the solution set contains only one configuration point, that point is taken as the optimal parameter; if several feasible solutions share the same computational performance, the point with the higher CTC is chosen as the optimum, since it requires less memory access bandwidth.
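The following C++ sketch illustrates this third step end to end: it enumerates blocking factor pairs, applies the four constraints, evaluates CR and CTC as reconstructed above, and keeps the best Roofline-feasible point. The platform numbers and function names are illustrative assumptions; the tie-breaking rule follows the text.

```cpp
#include <cmath>
#include <cstdint>

// Best blocking factors found for one layer under the Roofline model.
struct Best { int t_in = 0, t_out = 0; double cr = 0.0, ctc = 0.0; };

Best search_layer(int ch_in, int ch_out, int k,
                  int w_in, int h_in, int w_out, int h_out,
                  int dsp_j,            // DSP units of this layer's core
                  double bram_bytes,    // on-chip cache capacity in bytes
                  int m_bits,           // data bit width M
                  double peak_gops,     // computational performance ceiling
                  double bw_gbs,        // memory bandwidth ceiling (GB/s)
                  double clk_ghz) {     // clock frequency f (GHz)
    const double ops = 2.0 * ch_in * ch_out * k * k * w_out * h_out;
    const double bytes_io = (m_bits / 8.0) *
        (static_cast<double>(ch_in) * w_in * h_in +     // input, read once
         static_cast<double>(ch_out) * w_out * h_out);  // output, written once
    Best best;
    // Constraints 2 and 3 are enforced by the loop bounds.
    for (int t_in = 1; t_in <= ch_in; ++t_in)
        for (int t_out = 1; t_out <= ch_out; ++t_out) {
            if (t_in * t_out > dsp_j) continue;                 // constraint 1
            const double b_w = t_in * t_out * k * k * (m_bits / 8.0);
            if (b_w > bram_bytes) continue;                     // constraint 4
            const double alpha_w = std::ceil(double(ch_in) / t_in) *
                                   std::ceil(double(ch_out) / t_out);
            const double cycles = alpha_w * k * k * w_out * h_out;
            const double cr  = ops / cycles * clk_ghz;          // GOPS
            const double ctc = ops / (bytes_io + alpha_w * b_w);// ops/byte
            // Roofline ceilings: attainable CR is min(peak, CTC * bandwidth).
            if (cr > peak_gops || cr > ctc * bw_gbs) continue;
            if (cr > best.cr || (cr == best.cr && ctc > best.ctc))
                best = {t_in, t_out, cr, ctc};
        }
    return best;
}
```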
Although the present invention has been described with reference to the above embodiments, it is not limited thereto. Those skilled in the art may make various equivalent changes and substitutions without departing from the spirit and scope of the invention, and the scope of the invention is defined by the appended claims.
Claims (4)
1. An FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache, characterized by comprising the following steps:
Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer; allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core.
Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelized on the FPGA chip;
Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different blocking factor configuration combinations introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
2. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step one is as follows:
S11, collect the parameters of the FPGA hardware platform and of the model needed by the subsequent steps, specifically: the on-chip storage capacity $BRAM_{max}$ (MB) of the FPGA platform, the number of on-chip DSP computing units $DSP_{max}$, the data bit width $M$ (bit) of the model, and the number $N$ of distinct two-dimensional convolution kernel sizes contained in the model;
S12, collect the parameters of each convolutional layer needed by the subsequent steps, specifically: the computation amount $C_i$ of each convolutional layer $conv_i$, its number of input channels $CH_{in}^i$, number of output channels $CH_{out}^i$, convolution kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$;
S13, according to the number $N$ of kernel sizes counted in step S11, allocate an on-chip compute core $Kernel_j$, $j \in [1, N]$, to each kernel size, all convolutional layers that share a kernel size belonging to one compute core;
S14, according to the classification of the convolutional layers in step S13, compute for each compute core the sum $C_{sum}^j$ of the computation amounts of all its convolutional layers;
S15, from the computation amount of each core obtained in step S14, compute each core's share of the on-chip DSP computing units, $DSP_j = \lfloor DSP_{max} \cdot C_{sum}^j / \sum_{n=1}^{N} C_{sum}^n \rfloor$.
3. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step two is as follows:
S21, each compute core $Kernel_j$ allocated in step S13 corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks $\lceil CH_{out}^i / T_{out}^i \rceil$, the input-channel blocks $\lceil CH_{in}^i / T_{in}^i \rceil$, the output feature map rows $H_{out}^i$ and columns $W_{out}^i$, and the kernel rows and columns $k \times k$, where $T_{in}^i$ and $T_{out}^i$ are the blocking factors of the input and output channels;
S22, allocate in the on-chip BRAM a storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block;
S23, the programmable logic (PL) side of the chip reads from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
S24, unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$ of step S21, and compute intermediate convolution results by traversing the input feature map with the block of weights read in step S23, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$ allocated in step S22;
S25, return to step S23 and read, in address order, the next consecutive block of weights ($T_{in}^i$ channels for each of $T_{out}^i$ convolution kernels), then compute via step S24 and accumulate the output into the buffer $buf_{out}$. This loop repeats until step S23 fetches the last block of weights and step S24 produces its result, at which point the computation of compute core $Kernel_j$ for convolutional layer $conv_i$ is complete.
4. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step three is as follows:
S31, for each convolutional layer $conv_i$, construct four constraints on its blocking factors, namely: 1) the number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core: $T_{in}^i \cdot T_{out}^i \le DSP_j$; 2) $T_{in}^i$ must not exceed the layer's number of input channels: $T_{in}^i \le CH_{in}^i$; 3) $T_{out}^i$ must not exceed the layer's number of output channels: $T_{out}^i \le CH_{out}^i$; 4) the weight data volume of each read may not exceed the maximum on-chip cache capacity: $T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$;
S32, by loop traversal, compute for each convolutional layer $conv_i$ the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme;
S33, search all configuration schemes obtained in step S32 under the Roofline model against the maximum throughput and the memory access performance ceiling attainable by the FPGA development board, and obtain, under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings, the numerical solution of the optimal blocking factors of convolutional layer $conv_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310797346.8A CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310797346.8A CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116720549A true CN116720549A (en) | 2023-09-08 |
Family
ID=87867871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310797346.8A Pending CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116720549A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117114055A (en) * | 2023-10-24 | 2023-11-24 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN117114055B (en) * | 2023-10-24 | 2024-04-09 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN117786537A (en) * | 2024-02-27 | 2024-03-29 | 南京信息工程大学 | Distributed fault diagnosis method of Boltzmann machine voting network based on FPGA |
CN117786537B (en) * | 2024-02-27 | 2024-04-30 | 南京信息工程大学 | Distributed fault diagnosis method of Boltzmann machine voting network based on FPGA |
CN118332239A (en) * | 2024-04-16 | 2024-07-12 | 大连理工大学 | Design and implementation method of general convolution operation accelerator architecture based on loop optimization technology |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | CB03 | Change of inventor or designer information | Inventors after: Wang Jianrong; Zhao Hongbo; He Zhijun. Inventors before: Wang Jianrong; Zhao Hongbo; He Zhijun.