CN116720549A - FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache - Google Patents
- Publication number: CN116720549A
- Application number: CN202310797346.8A
- Authority: CN (China)
- Prior art keywords: convolution, chip, core, calculation, FPGA
- Prior art date: 2023-07-03
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/0464 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Convolutional networks [CNN, ConvNet]
- G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache, which comprises the following steps. Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer; allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core. Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, so that the convolution operations within a block are fully parallelized on the FPGA chip. Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication (memory access) ratio corresponding to each blocking factor configuration combination introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
Description
Technical Field
The invention belongs to the technical field of convolutional neural network hardware acceleration, and particularly relates to an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache.
Background
In recent years, with the continuous development of deep learning theory, convolutional neural networks (Convolutional Neural Networks, CNNs) have achieved ever higher accuracy and performance across artificial intelligence tasks. CNNs typically carry high computational complexity and large parameter storage requirements, and research on CNN algorithms still tends toward ever larger models. New network layers lead to more complex structures and larger model sizes, requiring billions of operations, millions of parameters, and substantial computing resources to train and evaluate the final network.
Accordingly, hardware acceleration platforms, mainly Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), are widely used to increase the throughput and processing speed of CNNs. GPU accelerators have very high power consumption; they are generally suited to offline model training and are difficult to deploy on battery-powered mobile platforms. Furthermore, GPU performance comes from processing batches of images in parallel, whereas some practical tasks, such as video streaming, require the input images to be processed frame by frame, which reduces the GPU's advantage to some extent. In contrast, although the memory and computing resources of an FPGA platform are limited, it can reach high performance at low power consumption, and the hardware can be tailored to the computation of a specific neural network model to achieve highly parallel inference. The programmable logic array of an FPGA makes it flexible and well suited to developing and verifying a variety of algorithms, with a short design cycle and low power consumption.
The on-chip storage of a typical FPGA chip consists mainly of block RAM cache resources (BRAM) and a relatively small number of register resources. This capacity is still far too small compared with the memory footprint of a neural network model: common CNN models are generally 100-1000 MB in size, while the largest on-chip SRAM of current FPGAs still does not exceed 10 MB. Off-chip storage resources (DDR) are therefore needed to assist in deploying a CNN model, and the access bandwidth and power consumption of this off-chip memory limit the model's actual performance. Under these resource constraints, an FPGA-based CNN accelerator must be designed with careful attention to the compute processing units, the memory access pattern, the parallelization strategy, and other optimizations.
Disclosure of Invention
The invention aims to provide an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache. Targeting deep convolutional neural networks that contain two-dimensional convolution kernels of various sizes and computational characteristics, it provides a sound optimization method for the on-chip compute core design, the convolution loop blocking and unrolling strategy, and the data reuse scheme when a network model is implemented and deployed on an FPGA hardware platform, so that the CNN model achieves higher parallelism, streaming inference, and shorter on-chip inference latency in the FPGA hardware accelerator.
To this end, the invention provides an FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache. First, model statistics are collected for the target network, including the computation amount, the data type, the two-dimensional convolution kernel sizes it contains, and the configuration parameters of each convolutional layer; on-chip compute cores and DSP computing resources are then allocated according to the kernel sizes. Next, for these compute cores, a weight-reuse convolution loop optimization strategy is designed based on fully caching the input features on chip: blocking factors are introduced along the input-channel and output-channel dimensions of the convolutions assigned to each compute core, and the convolution operations within a block can be fully parallelized on the FPGA chip. Finally, the hardware throughput and computation-to-communication ratio corresponding to the different blocking factors are computed under the on-chip DSP computing resource constraint and the on-chip BRAM storage resource constraint, and the achievable optimum is found by searching under the Roofline model. The optimization method is fully realizable, and its final output is the optimal configuration of the input-channel and output-channel loop unrolling factors of each convolutional layer.
The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache provided by the invention comprises the following implementation steps:
Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer. Allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core.
Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelized on the FPGA chip;
Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different blocking factor configuration combinations introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
The "model characteristic statistics on the network used" in the first step specifically includes the calculated amount, the data bit width, the number of input channels, the number of output channels, the convolution kernel, and the width and height dimensions of the output feature map of each layer. An FPGA on-chip computing core is allocated for each two-dimensional convolution kernel contained in the model, and DSP computing resource allocation is carried out for each core, and the method comprises the following steps:
S11, collect the parameters of the FPGA hardware platform and of the model needed by the subsequent steps, specifically: the on-chip storage capacity $BRAM_{max}$ (MB) of the FPGA platform, the number of on-chip DSP computing units $DSP_{max}$, the data bit width $M$ (bit) of the model, and the number $N$ of distinct two-dimensional convolution kernel sizes contained in the model;
S12, collect the parameters of each convolutional layer needed by the subsequent steps, specifically: the computation amount $C_i$ of each convolutional layer $conv_i$, its number of input channels $CH_{in}^i$, number of output channels $CH_{out}^i$, convolution kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$;
S13, according to the number $N$ of kernel sizes counted in step S11, allocate an on-chip compute core $Kernel_j$, $j \in [1, N]$, to each kernel size, all convolutional layers that share a kernel size belonging to one compute core;
S14, according to the classification of the convolutional layers in step S13, compute for each compute core the sum $C_{sum}^j$ of the computation amounts of all its convolutional layers;
S15, from the computation amount of each core obtained in step S14, compute each core's share of the on-chip DSP computing units, $DSP_j = \lfloor DSP_{max} \cdot C_{sum}^j / \sum_{n=1}^{N} C_{sum}^n \rfloor$.
In step two, "for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelizable on the FPGA chip" comprises the following steps:
S21, each compute core $Kernel_j$ allocated in step S13 corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks $\lceil CH_{out}^i / T_{out}^i \rceil$, the input-channel blocks $\lceil CH_{in}^i / T_{in}^i \rceil$, the output feature map rows $H_{out}^i$ and columns $W_{out}^i$, and the kernel rows and columns $k \times k$, where $T_{in}^i$ and $T_{out}^i$ are the blocking factors of the input and output channels;
S22, allocate in the on-chip BRAM a storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block;
S23, the programmable logic (PL) side of the chip reads from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
S24, unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$ of step S21, and compute intermediate convolution results by traversing the input feature map with the block of weights read in step S23, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$ allocated in step S22;
S25, return to step S23 and read, in address order, the next consecutive block of weights ($T_{in}^i$ channels for each of $T_{out}^i$ convolution kernels), then compute via step S24 and accumulate the output into the buffer $buf_{out}$. This loop repeats until step S23 fetches the last block of weights and step S24 produces its result, at which point the computation of compute core $Kernel_j$ for convolutional layer $conv_i$ is complete;
In step three, "under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different configuration combinations of the blocking factors of each convolutional layer introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer" comprises the following steps:
S31, for each convolutional layer $conv_i$, construct four constraints on its blocking factors, namely: 1) the number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core: $T_{in}^i \cdot T_{out}^i \le DSP_j$; 2) $T_{in}^i$ must not exceed the layer's number of input channels: $T_{in}^i \le CH_{in}^i$; 3) $T_{out}^i$ must not exceed the layer's number of output channels: $T_{out}^i \le CH_{out}^i$; 4) the weight data volume of each read may not exceed the maximum on-chip cache capacity: $T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$;
S32, by loop traversal, compute for each convolutional layer $conv_i$ the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme;
S33, search all configuration schemes obtained in step S32 under the Roofline model against the maximum throughput and the memory access performance ceiling attainable by the FPGA development board, and obtain, under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings, the numerical solution of the optimal blocking factors of convolutional layer $conv_i$;
Through the above steps, the FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache allocates distinct compute cores and computing resources from an analysis of the CNN network structure, introduces two blocking factors into the on-chip convolution loop computation, and solves for the optimum by combining the Roofline model with the performance limits of the hardware development board.
The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache designed by the invention is easy to integrate as an algorithm and can be applied directly to FPGA accelerators for existing mainstream CNN network models.
Drawings
FIG. 1 Overall framework of the method of the invention
FIG. 2 Convolution loop calculation method based on weight reuse
FIG. 3 Example of all theoretical solutions under the Roofline model
FIG. 4 All feasible solutions under the Roofline model
Detailed Description
So that the features, objects, and functions of the invention may be understood in detail, a more particular description of the invention, briefly summarized above, is given below with reference to embodiments, some of which are illustrated in the appended drawings.
First, the invention applies to two-dimensional convolutional neural networks; it is not applicable to existing one-dimensional or three-dimensional convolutional networks. The overall input of the invention is a convolutional neural network model with a definite numerical precision (e.g. 32-bit single precision or 8-bit fixed point), the acceleration platform is an FPGA, and the overall output is the loop computation structure used for on-chip inference of each convolutional layer when the network is deployed to the FPGA.
FIG. 1 shows the overall framework of the method of the invention. For a given network structure, the convolution kernel sizes contained in the model are counted first, and each kernel size is allocated a compute core and corresponding DSP multiplication resources. When each core accelerates a convolutional layer, on-chip hardware inference uses the weight-reuse convolution loop calculation method, and the optimal blocking factors for the loop blocking and unrolling are found with the Roofline model under the performance constraints of the hardware platform. The concrete implementation steps are as follows:
The first step: collect the parameters of the FPGA hardware platform, specifically the on-chip storage capacity $BRAM_{max}$ (MB) and the number of on-chip DSP computing units $DSP_{max}$. Collect the parameters of the network model, specifically the data bit width $M$ (bit) of the model and the number $N$ of distinct two-dimensional convolution kernel sizes it contains. Collect the parameters of each convolutional layer of the model, specifically the computation amount $C_i$ of each convolutional layer $conv_i$, its input channels $CH_{in}^i$, output channels $CH_{out}^i$, kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$. Once collected, these parameters are used by the subsequent steps.
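As a concrete illustration of these statistics, the following C++ sketch collects the platform and per-layer parameters used by the later steps; the record names and field layout are assumptions for illustration, not taken from the patent.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical records for the parameters gathered in the first step.
struct PlatformParams {
    double bram_max_mb;  // on-chip storage capacity BRAM_max (MB)
    int    dsp_max;      // number of on-chip DSP computing units DSP_max
    int    m_bits;       // data bit width M of the model (bits)
};

struct LayerParams {
    int k;               // convolution kernel width/height k
    int ch_in, ch_out;   // input/output channel counts CH_in, CH_out
    int w_in, h_in;      // input feature map width/height
    int w_out, h_out;    // output feature map width/height
    // computation amount C_i, counting each multiply-accumulate as 2 ops
    std::int64_t ops() const {
        return 2LL * ch_in * ch_out * k * k * w_out * h_out;
    }
};
```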
Each kernel size is then allocated an on-chip compute core $Kernel_j$, $j \in [1, N]$, all convolutional layers sharing a kernel size belonging to one compute core. The sum $C_{sum}^j$ of the computation amounts of all convolutional layers of each core is computed, and the number of DSP multiplication units available to each core is assigned in proportion to it:
$$DSP_j = \left\lfloor DSP_{max} \cdot \frac{C_{sum}^j}{\sum_{n=1}^{N} C_{sum}^n} \right\rfloor$$
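A minimal sketch of this proportional allocation, reusing the hypothetical LayerParams record above (flooring the fractional shares via integer division is an implementation choice the patent does not spell out):

```cpp
#include <map>
#include <vector>

// One compute core per distinct kernel size k; each core receives DSP units
// in proportion to the summed computation C_sum of its convolutional layers.
std::map<int, int> allocate_dsps(const std::vector<LayerParams>& layers,
                                 const PlatformParams& fpga) {
    std::map<int, std::int64_t> core_ops;  // key: kernel size k -> C_sum^j
    std::int64_t total = 0;
    for (const LayerParams& l : layers) {
        core_ops[l.k] += l.ops();
        total += l.ops();
    }
    std::map<int, int> core_dsps;          // key: kernel size k -> DSP_j
    for (const auto& [k, ops] : core_ops)
        core_dsps[k] = static_cast<int>(fpga.dsp_max * ops / total);
    return core_dsps;
}
```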
The second step: for each compute core, construct the weight-reuse convolution loop calculation method using full on-chip caching of the input features; the loop structure is sketched in FIG. 2. The convolution loop of each convolutional layer is built with blocking factors along the input-channel and output-channel dimensions, so that the convolution operations within a block are fully parallelized on the FPGA chip.
Specifically, as shown in FIG. 1, each compute core $Kernel_j$ corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks, the input-channel blocks, the output feature map rows and columns, and the kernel rows and columns. A storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$ are then allocated in the on-chip BRAM, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block.
The on-chip blocked convolution calculation process is shown in FIG. 2 and can be divided into the following 3 steps executed in a loop:
Step one: the programmable logic (PL) side of the chip begins reading from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
Step two: unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$, and compute intermediate convolution results by traversing the input feature map with the block of weights just read, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$;
Step three: returning to step one, reading the next consecutive in address orderA plurality of convolution kernels, each convolution kernel->The channel data are calculated by the second step and the obtained output result is accumulated to the buffer areaThe process is circularly carried out until the weight data of the last block is taken out and a calculation result is obtained, and a calculation core Kernel is calculated at the moment j Conv for convolution layer i Is completed.
At this point it should be noted that the loop order and dimensions used by the different compute cores are identical; only the extent of each dimension varies with the parameters of the different convolutional layers. Unrolling the output and input channels means that the $T_{in}^i \cdot T_{out}^i$ multiplications contained within each block are all performed on the hardware in the same clock cycle.
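To make the loop structure concrete, here is a software-level C++ sketch of the blocked, weight-reusing loop nest for one layer. It is a sketch under assumptions: stride 1, no padding, illustrative layer sizes and DDR weight layout, blocking factors that divide the channel counts, and out_buf zeroed by the caller; in an actual HLS design the two innermost loops would carry unroll pragmas.

```cpp
// Blocked convolution for one layer, following the three looped steps above.
constexpr int CH_IN = 64, CH_OUT = 128, K = 3;           // illustrative sizes
constexpr int W_OUT = 56, H_OUT = 56;
constexpr int W_IN = W_OUT + K - 1, H_IN = H_OUT + K - 1;
constexpr int T_IN = 4, T_OUT = 8;                       // blocking factors

void conv_layer(const float* ddr_weights,                // off-chip weights
                const float in_buf[CH_IN][H_IN][W_IN],   // full input cache
                float out_buf[CH_OUT][H_OUT][W_OUT]) {   // zeroed by caller
    static float w_buf[T_OUT][T_IN][K][K];               // one weight block
    for (int oc = 0; oc < CH_OUT; oc += T_OUT)           // output-channel blocks
        for (int ic = 0; ic < CH_IN; ic += T_IN) {       // input-channel blocks
            // Step 1: burst-read the next T_OUT*T_IN*K*K weights from DDR
            // (the linear weight layout below is assumed, not from the patent)
            for (int to = 0; to < T_OUT; ++to)
                for (int ti = 0; ti < T_IN; ++ti)
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            w_buf[to][ti][ky][kx] = ddr_weights[
                                (((oc + to) * CH_IN + ic + ti) * K + ky) * K + kx];
            // Step 2: reuse the weight block over the whole feature map
            for (int y = 0; y < H_OUT; ++y)              // output rows
                for (int x = 0; x < W_OUT; ++x)          // output columns
                    for (int ky = 0; ky < K; ++ky)       // kernel rows
                        for (int kx = 0; kx < K; ++kx)   // kernel columns
                            for (int to = 0; to < T_OUT; ++to)     // unrolled in HLS
                                for (int ti = 0; ti < T_IN; ++ti)  // unrolled in HLS
                                    // Step 3: accumulate intermediate results
                                    out_buf[oc + to][y][x] +=
                                        w_buf[to][ti][ky][kx] *
                                        in_buf[ic + ti][y + ky][x + kx];
        }
}
```

The design choice this illustrates is the weight reuse: each weight block is read from DDR exactly once and then swept across the entire input feature map held in $buf_{in}$, so external traffic for weights does not grow with the feature map size.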
The third step: when computing the optimal blocking factor configuration for each convolutional layer $conv_i$, the hardware throughput and computation-to-communication ratio of the different configuration combinations are computed under the constraints of the number of on-chip DSP computing units and of the on-chip BRAM storage resources, and the result is found by searching under the Roofline model.
Specifically, for each convolutional layer $conv_i$, four constraints are constructed first, namely:
1) The number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core, expressed as:
$$T_{in}^i \cdot T_{out}^i \le DSP_j$$
2) $T_{in}^i$ must not exceed the layer's number of input channels, expressed as:
$$T_{in}^i \le CH_{in}^i$$
3) $T_{out}^i$ must not exceed the layer's number of output channels, expressed as:
$$T_{out}^i \le CH_{out}^i$$
4) The weight data volume of each read may not exceed the maximum on-chip cache capacity $BRAM_{max}$, expressed as:
$$T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$$
Here $(M/8)$ converts the left side of the formula into bytes. Then, by loop traversal, the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme are computed for each convolutional layer $conv_i$.
The throughput represents the number of operations performed per second; for the convolution, i.e. multiply-accumulate, operations it is computed as:
$$CR_i = \frac{2 \cdot CH_{in}^i \cdot CH_{out}^i \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i}{\left\lceil \frac{CH_{in}^i}{T_{in}^i} \right\rceil \cdot \left\lceil \frac{CH_{out}^i}{T_{out}^i} \right\rceil \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i} \cdot f$$
where the numerator is the total number of operations of the layer, the denominator the number of execution cycles, and $f$ the clock frequency.
The computation-to-communication ratio represents the number of operations performed per unit of external memory traffic and is computed as:
$$CTC_i = \frac{2 \cdot CH_{in}^i \cdot CH_{out}^i \cdot k^2 \cdot W_{out}^i \cdot H_{out}^i}{(M/8) \cdot \left( CH_{in}^i W_{in}^i H_{in}^i + CH_{out}^i W_{out}^i H_{out}^i \right) + \alpha_w \cdot B_w}$$
where, thanks to the full input cache, the input feature map is read from DDR once and the output feature map written once.
In the above, $\alpha_w$ and $B_w$ denote respectively the number of weight reads and the length in bytes of each read, computed as:
$$\alpha_w = \left\lceil \frac{CH_{in}^i}{T_{in}^i} \right\rceil \cdot \left\lceil \frac{CH_{out}^i}{T_{out}^i} \right\rceil, \qquad B_w = T_{in}^i \cdot T_{out}^i \cdot k^2 \cdot (M/8)$$
From the above formulas, the theoretical on-chip throughput and computation-to-communication ratio achievable by each parameter configuration in hardware can be computed. However, since the resources of an FPGA development board are limited, the number of DSP multipliers, the BRAM and memory access bandwidth, and the clock frequency together determine the throughput and memory access performance ceilings achievable on a given FPGA. All configuration schemes must therefore be searched under the Roofline model, and the numerical solution of the optimal blocking factors of convolutional layer $conv_i$ is found under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings. FIG. 3 and FIG. 4 show Roofline model solving examples, representing respectively all theoretical solutions and the feasible solutions obtained by the calculation. The horizontal axis is the computation-to-communication ratio (CTC) and the vertical axis the computational performance (CR); the slope of the line from any point to the origin is the minimum bandwidth required to realize the parameter scheme of that point. For example, the minimum bandwidth requirement of scheme P in FIG. 3 equals that of scheme P'. The memory access performance ceiling and the computational performance ceiling of the FPGA are marked in the example of FIG. 4. Any point to the left of the memory access performance ceiling requires more bandwidth than the platform can provide when implemented. The best-performing blocking factor combination is therefore selected along the performance-limiting boundary, such as point N in FIG. 4. Moreover, if the solution set contains only one configuration point, that point is taken as the optimal parameter; if several feasible solutions share the same computational performance, the point with the higher CTC is chosen as the optimum, since it requires less memory access bandwidth.
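The following C++ sketch illustrates this third step end to end: it enumerates blocking factor pairs, applies the four constraints, evaluates CR and CTC as reconstructed above, and keeps the best Roofline-feasible point. The platform numbers and function names are illustrative assumptions; the tie-breaking rule follows the text.

```cpp
#include <cmath>
#include <cstdint>

// Best blocking factors found for one layer under the Roofline model.
struct Best { int t_in = 0, t_out = 0; double cr = 0.0, ctc = 0.0; };

Best search_layer(int ch_in, int ch_out, int k,
                  int w_in, int h_in, int w_out, int h_out,
                  int dsp_j,            // DSP units of this layer's core
                  double bram_bytes,    // on-chip cache capacity in bytes
                  int m_bits,           // data bit width M
                  double peak_gops,     // computational performance ceiling
                  double bw_gbs,        // memory bandwidth ceiling (GB/s)
                  double clk_ghz) {     // clock frequency f (GHz)
    const double ops = 2.0 * ch_in * ch_out * k * k * w_out * h_out;
    const double bytes_io = (m_bits / 8.0) *
        (static_cast<double>(ch_in) * w_in * h_in +     // input, read once
         static_cast<double>(ch_out) * w_out * h_out);  // output, written once
    Best best;
    // Constraints 2 and 3 are enforced by the loop bounds.
    for (int t_in = 1; t_in <= ch_in; ++t_in)
        for (int t_out = 1; t_out <= ch_out; ++t_out) {
            if (t_in * t_out > dsp_j) continue;                 // constraint 1
            const double b_w = t_in * t_out * k * k * (m_bits / 8.0);
            if (b_w > bram_bytes) continue;                     // constraint 4
            const double alpha_w = std::ceil(double(ch_in) / t_in) *
                                   std::ceil(double(ch_out) / t_out);
            const double cycles = alpha_w * k * k * w_out * h_out;
            const double cr  = ops / cycles * clk_ghz;          // GOPS
            const double ctc = ops / (bytes_io + alpha_w * b_w);// ops/byte
            // Roofline ceilings: attainable CR is min(peak, CTC * bandwidth).
            if (cr > peak_gops || cr > ctc * bw_gbs) continue;
            if (cr > best.cr || (cr == best.cr && ctc > best.ctc))
                best = {t_in, t_out, cr, ctc};
        }
    return best;
}
```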
Although the present invention has been described with reference to the above embodiments, it is not limited thereto. Those skilled in the art may make various equivalent changes and substitutions without departing from the spirit and scope of the invention, and the scope of the invention is defined by the appended claims.
Claims (4)
1. An FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache, characterized by comprising the following steps:
Step one: collect model statistics for the target network, specifically the computation amount, data bit width, number of input channels, number of output channels, convolution kernel size, and output feature map width and height of each convolutional layer; allocate an FPGA on-chip compute core to each two-dimensional convolution kernel size contained in the model, and allocate DSP computing resources to each core.
Step two: for each compute core allocated in step one, construct a weight-reuse convolution calculation method that fully caches the input features on chip; for each convolutional layer, build the convolution loop with blocking factors along the input-channel and output-channel dimensions, the convolution operations within a block being fully parallelized on the FPGA chip;
Step three: under the constraints of the number of on-chip DSP computing units and the on-chip BRAM storage resources, compute the hardware throughput and computation-to-communication ratio corresponding to the different blocking factor configuration combinations introduced in step two, and search under the Roofline model for the achievable optimal blocking factor parameters of each convolutional layer.
2. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step one is as follows:
S11, collect the parameters of the FPGA hardware platform and of the model needed by the subsequent steps, specifically: the on-chip storage capacity $BRAM_{max}$ (MB) of the FPGA platform, the number of on-chip DSP computing units $DSP_{max}$, the data bit width $M$ (bit) of the model, and the number $N$ of distinct two-dimensional convolution kernel sizes contained in the model;
S12, collect the parameters of each convolutional layer needed by the subsequent steps, specifically: the computation amount $C_i$ of each convolutional layer $conv_i$, its number of input channels $CH_{in}^i$, number of output channels $CH_{out}^i$, convolution kernel width and height $k \times k$, output feature map width and height $W_{out}^i \times H_{out}^i$, and input feature map width and height $W_{in}^i \times H_{in}^i$;
S13, according to the number $N$ of kernel sizes counted in step S11, allocate an on-chip compute core $Kernel_j$, $j \in [1, N]$, to each kernel size, all convolutional layers that share a kernel size belonging to one compute core;
S14, according to the classification of the convolutional layers in step S13, compute for each compute core the sum $C_{sum}^j$ of the computation amounts of all its convolutional layers;
S15, from the computation amount of each core obtained in step S14, compute each core's share of the on-chip DSP computing units, $DSP_j = \lfloor DSP_{max} \cdot C_{sum}^j / \sum_{n=1}^{N} C_{sum}^n \rfloor$.
3. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step two is as follows:
S21, each compute core $Kernel_j$ allocated in step S13 corresponds to one convolution kernel width and height $k \times k$. For each convolutional layer $conv_i$ of the core $Kernel_j$, the same convolution loop dimension order is set; the loop nest contains 6 levels in total, whose computation dimensions from outermost to innermost are the output-channel blocks $\lceil CH_{out}^i / T_{out}^i \rceil$, the input-channel blocks $\lceil CH_{in}^i / T_{in}^i \rceil$, the output feature map rows $H_{out}^i$ and columns $W_{out}^i$, and the kernel rows and columns $k \times k$, where $T_{in}^i$ and $T_{out}^i$ are the blocking factors of the input and output channels;
S22, allocate in the on-chip BRAM a storage space $buf_{in}$ of length $CH_{in}^i \cdot W_{in}^i \cdot H_{in}^i$ and a storage space $buf_{out}$ of length $T_{out}^i \cdot W_{out}^i \cdot H_{out}^i$, used respectively to hold all input feature values of the convolutional layer and the intermediate output feature values computed for each block;
S23, the programmable logic (PL) side of the chip reads from DDR the consecutive $T_{out}^i$ convolution kernels of the layer, reading $T_{in}^i$ channels of data for each kernel, i.e. $T_{out}^i \cdot T_{in}^i \cdot k \cdot k$ values of $M$-bit data in total, where $T_{in}^i$ and $T_{out}^i$ are the unrolling factors of the input and output channels respectively;
S24, unroll the two loop dimensions $T_{in}^i$ and $T_{out}^i$ of step S21, and compute intermediate convolution results by traversing the input feature map with the block of weights read in step S23, i.e. compute, for all output elements of the feature maps of the $T_{out}^i$ output channels, the intermediate values contributed by the $T_{in}^i$ kernel channels, accumulating the results into the buffer $buf_{out}$ allocated in step S22;
S25, return to step S23 and read, in address order, the next consecutive block of weights ($T_{in}^i$ channels for each of $T_{out}^i$ convolution kernels), then compute via step S24 and accumulate the output into the buffer $buf_{out}$. This loop repeats until step S23 fetches the last block of weights and step S24 produces its result, at which point the computation of compute core $Kernel_j$ for convolutional layer $conv_i$ is complete.
4. The FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache of claim 1, characterized in that the specific process of step three is as follows:
S31, for each convolutional layer $conv_i$, construct four constraints on its blocking factors, namely: 1) the number of convolution multiplications within each block may not exceed the number of DSP units $DSP_j$ available to the layer's compute core: $T_{in}^i \cdot T_{out}^i \le DSP_j$; 2) $T_{in}^i$ must not exceed the layer's number of input channels: $T_{in}^i \le CH_{in}^i$; 3) $T_{out}^i$ must not exceed the layer's number of output channels: $T_{out}^i \le CH_{out}^i$; 4) the weight data volume of each read may not exceed the maximum on-chip cache capacity: $T_{in}^i \cdot T_{out}^i \cdot k \cdot k \cdot (M/8) \le BRAM_{max}$;
S32, by loop traversal, compute for each convolutional layer $conv_i$ the on-chip throughput $CR_i$ and computation-to-communication ratio $CTC_i$ of every blocking factor configuration scheme;
S33, search all configuration schemes obtained in step S32 under the Roofline model against the maximum throughput and the memory access performance ceiling attainable by the FPGA development board, and obtain, under the constraint that throughput and computation-to-communication ratio do not exceed the board's performance ceilings, the numerical solution of the optimal blocking factors of convolutional layer $conv_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310797346.8A CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310797346.8A CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116720549A true CN116720549A (en) | 2023-09-08 |
Family
ID=87867871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310797346.8A Pending CN116720549A (en) | 2023-07-03 | 2023-07-03 | FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116720549A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117114055A (en) * | 2023-10-24 | 2023-11-24 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN117114055B (en) * | 2023-10-24 | 2024-04-09 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN117786537A (en) * | 2024-02-27 | 2024-03-29 | 南京信息工程大学 | Distributed fault diagnosis method of Boltzmann machine voting network based on FPGA |
CN117786537B (en) * | 2024-02-27 | 2024-04-30 | 南京信息工程大学 | Distributed fault diagnosis method of Boltzmann machine voting network based on FPGA |
CN118332239A (en) * | 2024-04-16 | 2024-07-12 | 大连理工大学 | Design and implementation method of general convolution operation accelerator architecture based on loop optimization technology |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | CB03 | Change of inventor or designer information | Inventors after: Wang Jianrong; Zhao Hongbo; He Zhijun. Inventors before: Wang Jianrong; Zhao Hongbo; He Zhijun.