CN109918281B - Multi-bandwidth target accelerator efficiency testing method - Google Patents
Multi-bandwidth target accelerator efficiency testing method
- Publication number: CN109918281B (application CN201910185133.3A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
- Classification: Y02D10/00, energy-efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an efficiency testing method for accelerators with multiple bandwidth targets, aimed at optimizing the structural design space of such accelerators and evaluating their throughput and MAC utilization. The method first determines the attribute parameters of a test case CNN and the numbers of columns and rows of the accelerator's MAC array. It then tests the accelerator's efficiency on each convolutional layer of the CNN and on all convolutional layers together, on each pooling layer and on all pooling layers together, and on each fully connected layer and on all fully connected layers together; finally it computes the accelerator's throughput and MAC utilization over all convolutional, pooling, and fully connected layers of the CNN. The method adapts the numbers of columns and rows of the MAC array to conditions such as the accelerator's bandwidth constraint, maximizing accelerator efficiency, and can quickly evaluate the accelerator's throughput and MAC utilization under multiple target bandwidth constraints, optimizing the accelerator's structural design space.
Description
Technical Field
The invention relates to a performance testing method for compute accelerators used by compute-intensive applications, and in particular to a method for evaluating accelerator efficiency from the scale of the accelerator array and the bit width of the data.
Background
Applications with large-scale deep neural networks at their core, such as image recognition, speech processing, and text mining, are typically compute- and memory-intensive, and their demand for high-performance computing hardware grows daily. From terminal-side inference carried mainly by high-performance embedded platforms to intelligent applications on miniaturized, integrated "micro" devices, applications of different scales and scenarios place different requirements on an intelligent processing chip's latency, throughput, and so on. Intelligent chip architectures and testing methods aimed at terminal inference must therefore be able to adjust the system structure across multiple scales according to application performance and cost requirements. As neural networks deepen and applications grow more complex, artificial neural networks require ever more model parameters and process ever larger data; these parameters place enormous pressure on the memory access bandwidth required for computation, and under the traditional von Neumann architecture most neural network applications become memory-bound, compute-intensive problems.
Customized intelligent accelerators are currently among the most effective ways to optimize applications such as image recognition, speech processing, and text mining, but the development cycle of a customized architecture is very long, typically involving lengthy design stages such as application analysis, architectural design, logic design, circuit optimization, synthesis optimization, and tape-out.
Studies of the computational workload of convolutional neural networks (CNNs) show that convolution accounts for roughly 90% of a CNN's operations. Liu et al., in a paper published in the journal Electronics (Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J. A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs. Electronics 2019, 8, 65) (referred to as background art 1), propose an accelerator structure. This accelerator is a general matrix multiplication (GEMM) hardware accelerator oriented toward deep neural networks. As shown in FIG. 1, in the GEMM formulation, the convolutional layer weight W is m × c × k × k, the input feature map X is c × h × w, and convolving W with X yields the output feature map Y of size m × h × w, where m is the number of convolution kernels in the layer, c the number of input feature map channels, k the convolution kernel size, h the input and output feature map height, and w the input and output feature map width. The GEMM method expands the weights W into a weight matrix, compresses and recombines the feature map X into a feature map matrix, and multiplies the two matrices to obtain the expanded form of the output feature map Y, i.e. the output feature map matrix.
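To make the mapping concrete, the following is a minimal NumPy sketch of this im2col-style GEMM formulation. It is a sketch only: the function and variable names are ours rather than background art 1's, and padding is omitted for brevity (so the output is (m, oh, ow) for a valid convolution rather than m × h × w).

```python
import numpy as np

def conv_as_gemm(W, X, k, stride=1):
    """Map a convolution to matrix multiplication as in FIG. 1.
    W: weights of shape (m, c, k, k); X: input feature map (c, h, w)."""
    m, c, _, _ = W.shape
    _, h, w = X.shape
    oh = (h - k) // stride + 1
    ow = (w - k) // stride + 1
    # Expand the weights into an (m, c*k*k) weight matrix.
    Wm = W.reshape(m, c * k * k)
    # Compress/recombine the feature map into a (c*k*k, oh*ow) matrix (im2col).
    cols = np.empty((c * k * k, oh * ow))
    for y in range(oh):
        for x in range(ow):
            patch = X[:, y*stride:y*stride+k, x*stride:x*stride+k]
            cols[:, y * ow + x] = patch.ravel()
    # One matrix product yields the expanded output feature map matrix (m, oh*ow).
    return (Wm @ cols).reshape(m, oh, ow)
```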
FIG. 2 is a hardware block diagram of an accelerator that performs the convolution GEMM operations described above. The accelerator's computational logic comprises a Multiply-ACcumulate (MAC) unit array, a weight matrix buffer, a feature map matrix buffer, and an output feature map matrix buffer. Each MAC in the array operates as a pipelined unit, completing one multiplication and one addition per clock cycle. During convolutional neural network computation, the MAC array fetches the data W and X required for the operation from memory; the data are expanded and recombined as in FIG. 1 and stored in the accelerator's weight matrix buffer and feature map matrix buffer respectively, and after computation the result Y is written to the output feature map matrix buffer. This accelerator structure computes every layer of the convolutional neural network on a scalable MAC array; it handles a variety of convolution operations efficiently, is little constrained by hardware resources, and allows the MAC array's dimensions to be configured flexibly. It is a structure commonly used in convolutional neural network acceleration today. However, limited by application scenarios and cost, the accelerator's memory access bandwidth becomes an important factor limiting its performance. Different access bandwidths lead to different optimal MAC array configurations and acceleration efficiencies: although the total number of MACs in the array is fixed, the number of columns mc and the number of rows mr must be determined according to the access bandwidth. Such accelerators are therefore also called multi-bandwidth target accelerators.
At present, most multi-bandwidth target accelerators are designed for specific application fields; the scale of the field's applications determines the accelerator's computational scale, storage bandwidth, communication capacity, execution mode, and so on, and to some extent its performance level. Establishing a structural-model-based evaluation method for multi-bandwidth target accelerators under memory access bandwidth constraints would allow the accelerator's structural design space to be optimized. How to rapidly evaluate the performance of a multi-bandwidth target accelerator structure, including throughput and MAC utilization, is therefore a technical problem that those skilled in the art need to overcome. No method for evaluating the performance of such accelerators has been disclosed or reported.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-bandwidth target accelerator efficiency testing method that optimizes the structural design space of a multi-bandwidth target accelerator and rapidly evaluates the structural efficiency of such an accelerator, including its throughput and MAC utilization, where throughput denotes the number of multiply-accumulate (MAC) operations the accelerator performs per second. In particular, for the accelerator structure proposed in background art 1, the invention provides a method for calculating the accelerator's performance by appropriately adapting the multiplication array to different convolution scales. The invention can directly evaluate the performance (throughput and MAC utilization) of the multi-bandwidth target accelerator's scalable computing system under multiple target bandwidth constraints.
The specific technical scheme is as follows:
In the first step, a convolutional neural network is selected from the widely used convolutional neural network models (such as AlexNet, VGG16, and C3D) as the intelligent accelerator efficiency test case CNN. The test case CNN is preprocessed and the attribute parameters of each of its layers are determined. The specific method is:
Let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc.
1.1 Define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1.
1.2 If the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7.
1.3 Extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 Record the input feature map height Ix_i1 of this layer;
1.3.2 Record the input feature map width Iy_i1;
1.3.3 Record the number of input feature map channels (i.e. the number of convolution kernel channels) Ic_i1;
1.3.4 Record the convolution kernel size k_i1;
1.3.5 Record the number of convolution kernels (i.e. the number of convolution output channels) Oc_i1;
1.3.6 Record the row and column direction padding padc_i1 of the input feature map;
1.3.7 Record the convolution kernel stride strdc_i1;
1.3.8 Record the output feature map height Ox_i1;
1.3.9 Record the output feature map width Oy_i1.
1.4 Let i1 = i1 + 1; go to step 1.9.
1.5 Extract the following attributes of the i2-th pooling layer:
1.5.1 Record the input feature map height Ipx_i2 of this layer;
1.5.2 Record the input feature map width Ipy_i2;
1.5.3 Record the number of input feature map channels Ipc_i2;
1.5.4 Record the pooling window size p_i2;
1.5.5 Record the row and column direction padding padp_i2 of the input feature map;
1.5.6 Record the pooling operation stride strdp_i2;
1.5.7 Record the output feature map height Opx_i2;
1.5.8 Record the output feature map width Opy_i2.
1.6 Let i2 = i2 + 1; go to step 1.9.
1.7 Extract the following attributes of the i3-th fully connected layer:
1.7.1 Record the number of fully connected input nodes Fin_i3;
1.7.2 Record the number of fully connected output nodes Fout_i3.
1.8 Let i3 = i3 + 1; go to step 1.9.
1.9 Let i = i + 1. If i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained; go to the second step. Otherwise go to step 1.2.
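A compact sketch of this first-step preprocessing loop is shown below. The layer-descriptor dictionaries and their field names are our assumption; the patent does not prescribe a data structure.

```python
def preprocess(cnn_layers):
    """Collect per-layer attributes of the test case CNN (first step).
    cnn_layers: list of dicts, each with a 'type' key of 'conv', 'pool' or 'fc'."""
    conv, pool, fc = [], [], []
    for layer in cnn_layers:                 # i = 1 .. N
        if layer['type'] == 'conv':          # step 1.3: Ix, Iy, Ic, k, Oc, padc, strdc, Ox, Oy
            conv.append(layer)
        elif layer['type'] == 'pool':        # step 1.5: Ipx, Ipy, Ipc, p, padp, strdp, Opx, Opy
            pool.append(layer)
        else:                                # step 1.7: Fin, Fout
            fc.append(layer)
    # The sets needed by the second step are read off the conv list:
    # {Ic}, {k}, {strdc}, {Iy}, {padc}.
    return conv, pool, fc
```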
In the second step, the numbers of columns and rows of the accelerator MAC array are determined as follows:
2.1 Determine the number of columns mc of the accelerator MAC array; mc is a positive integer.
Because the accelerator is most efficient when the ratio of memory access time to computation time for its convolution operations is 1:1, performance is optimal when the ratio satisfies formula (1):
In formula (1), Ic ∈ {Ic_1, Ic_2, …, Ic_Nc}, K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}, and Iy ∈ {Iy_1, Iy_2, …, Iy_Nc}. BW is the accelerator bandwidth (in units of μ bit/s, where μ is the bit width of the MAC unit's input operands, determined by the accelerator designer), and F is the operating frequency of the accelerator MAC array (in Hz, determined by the accelerator designer); BW and F are fixed parameters once the accelerator design is complete.
The mc obtained from formula (1) satisfies formula (2):
Recent trends in convolutional neural networks show that 3 × 3 and 1 × 1 convolutions are used most, while the convolution kernel stride strdx is generally 1. Taking K = 3 and strdx = 1, the accelerator maximizes throughput for most convolution operations, so mc satisfies formula (3):
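The equations themselves appear only as images in the source. The LaTeX below is our reconstruction from the balance argument above (the transfer of strdx × Iy × Ic input pixels against the mc-column computation of one output row) and should be read as an assumption, not as the patent's verbatim formulas:

```latex
% Formula (1): memory access time balanced against compute time (our reading).
\frac{strdx \cdot Iy \cdot Ic}{BW} = \frac{Iy \cdot Ic \cdot K^2}{strdx \cdot F \cdot mc} \tag{1}
% Solving (1) for mc gives (2); K = 3 and strdx = 1 give (3).
mc = \left\lceil \frac{K^2 \cdot BW}{strdx^2 \cdot F} \right\rceil \tag{2}
\qquad
mc = \left\lceil \frac{9\,BW}{F} \right\rceil \tag{3}
```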
2.2 Determine the number of rows mr of the accelerator MAC array; mr is a positive integer. The method is:
MACmax is the number of MAC units available to the accelerator (determined by the accelerator designer from the hardware logic resources).
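Under that reconstruction, the whole second step reduces to a few lines. The sketch below inherits the same caveats; in particular the ceiling and floor choices are our assumptions.

```python
import math

def mac_array_shape(BW, F, MACmax, K=3, strdx=1):
    """Second step: fit the MAC array shape to the bandwidth target.
    BW is in MAC-input words per second; F is the array clock in Hz."""
    mc = max(1, math.ceil(K * K * BW / (strdx * strdx * F)))  # formulas (2)-(3)
    mr = max(1, MACmax // mc)                                 # rows allowed by the MAC budget
    return mc, mr

# e.g. mac_array_shape(BW=4e8, F=2e8, MACmax=1024) -> (18, 56)
```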
In the third step, the accelerator's efficiency on each convolutional layer of the test case CNN is tested as follows:
3.1 Let the convolutional layer count loop variable i1 = 1.
3.2 Determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer (i.e. the number of MAC array rows actually used for that layer):
M_i1 = min(Oc_i1, mr).
3.3 Test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 Test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 Test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
The throughput is the number of multiply-accumulate operations the accelerator completes per second, in operations/s.
3.6 Test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 Let i1 = i1 + 1. If i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained; go to the fourth step. Otherwise go to step 3.2.
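The per-layer formulas of steps 3.3-3.6 are likewise images in the source. As a stand-in, a roofline-style model consistent with the surrounding text is sketched below; the t_mem/t_cmp split and the tiling over Ox rows and ceil(Oc/M) channel groups are our assumptions, not the patent's exact expressions.

```python
import math

def conv_layer_perf(layer, mc, mr, BW, F):
    """Steps 3.2-3.6 for one convolutional layer (hedged stand-in model)."""
    Oc, Ox, Oy = layer['Oc'], layer['Ox'], layer['Oy']
    Ic, k, strdc, Iy = layer['Ic'], layer['k'], layer['strdc'], layer['Iy']
    M = min(Oc, mr)                                   # step 3.2: output parallelism
    # step 3.3: stream strdc*Iy*Ic input pixels and M*Oy result pixels,
    # overlapped with the mc-column computation of M*Oy output pixels.
    t_mem = (strdc * Iy * Ic + M * Oy) / BW
    t_cmp = Oy * Ic * k * k / (F * mc)
    tconv = max(t_mem, t_cmp)
    Tconv = tconv * Ox * math.ceil(Oc / M)            # step 3.4: whole layer
    macs = Oc * Ox * Oy * Ic * k * k                  # multiply-accumulates in the layer
    thconv = macs / Tconv                             # step 3.5: throughput (MAC ops/s)
    U = thconv / (F * mc * mr)                        # step 3.6: MAC utilization
    return Tconv, thconv, U
```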
In the fourth step, the accelerator's total efficiency on the Nc convolutional layers of the test case CNN is tested as follows:
4.1 Determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 Determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
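One consistent reading of 4.1-4.2, offered as our reconstruction rather than the patent's formulas, is an operation-weighted aggregate:

```latex
Thconv = \frac{\sum_{i1=1}^{Nc} thconv_{i1} \cdot Tconv_{i1}}{\sum_{i1=1}^{Nc} Tconv_{i1}}
       = \frac{\text{total convolution MACs}}{\sum_{i1=1}^{Nc} Tconv_{i1}},
\qquad
UCmac = \frac{Thconv}{F \cdot mc \cdot mr}
```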
In the fifth step, the accelerator's efficiency on each pooling layer of the test case CNN is tested as follows:
5.1 Let the pooling layer count loop variable i2 = 1.
5.2 Determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 Determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 Determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations.
5.2.5 Determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 Determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 Determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 Determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 Determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 Determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 Let i2 = i2 + 1. If i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained; go to the sixth step. Otherwise go to step 5.2.
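The pooling formulas of steps 5.2.3-5.6 are also images. A deliberately coarse stand-in, under the assumption that pooling is dominated by streaming p × p input windows while Mpr rows reduce channels in parallel, is sketched below; every expression here is our assumption.

```python
import math

def pool_layer_perf(layer, Mpr, mc, mr, BW, F):
    """Steps 5.3-5.6 for one pooling layer (coarse stand-in model)."""
    Ipc, p = layer['Ipc'], layer['p']
    Opx, Opy = layer['Opx'], layer['Opy']
    # step 5.3: produce Mpr*Opy output pixels, bounded by the slower of
    # streaming their p*p input windows and reducing them on Mpr rows.
    t_mem = Mpr * Opy * p * p / BW
    t_cmp = Opy * p * p / F                    # Mpr channels reduced in parallel
    tpool = max(t_mem, t_cmp)
    Tp = tpool * Opx * math.ceil(Ipc / Mpr)    # step 5.4: whole layer
    ops = Ipc * Opx * Opy * p * p              # window operations in the layer
    thpool = ops / Tp                          # step 5.5: throughput
    Up = thpool / (F * mc * mr)                # step 5.6: utilization of the full array
    return Tp, thpool, Up
```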
In the sixth step, the accelerator's total efficiency on the Np pooling layers of the test case CNN is tested as follows:
6.1 Determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 Determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
In the seventh step, the accelerator's efficiency on each fully connected layer of the test case CNN is tested as follows:
7.1 Let the fully connected layer count loop variable i3 = 1.
7.2 Determine the parallelism (i.e. the number of MAC units used) when the accelerator computes the i3-th fully connected layer:
7.2.1 Determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 Determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 Determine the time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 Determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers; Mfc_max is a positive integer with 0 < Mfc_max ≤ mr × mc, determined by the accelerator designer.
7.3 Determine the throughput of the accelerator on the i3-th fully connected layer. If Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
If Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3. Go to step 7.4.
7.4 Determine the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 Determine the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 Let i3 = i3 + 1. If i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained; go to the eighth step. Otherwise go to step 7.2.
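Steps 7.2-7.5 read as a classic memory-versus-compute balance on the Fin × Fout weight matrix. A hedged sketch under that reading follows; the ratio definition and the Ufc normalization are our assumptions.

```python
import math

def fc_layer_perf(Fin, Fout, Mfc_max, BW, F):
    """Steps 7.2-7.5 for one fully connected layer (hedged stand-in model)."""
    TthMfc = Fin * Fout / BW                         # 7.2.1: each weight fetched once
    TthCfc = Fin * Fout / F                          # 7.2.2: one MAC, one op per cycle
    ratio = TthCfc / TthMfc                          # 7.2.3: compute time over access time
    Mfcr = min(max(1, math.floor(ratio)), Mfc_max)   # 7.2.5: usable parallelism
    if Mfcr == 1:
        thfc = BW            # 7.3: access-limited, one MAC per weight fetched
    else:
        thfc = F * Mfcr      # 7.3: thfc ≈ F × Mfcr
    Tfc = Fin * Fout / thfc                          # 7.4: layer time
    Ufc = thfc / (F * Mfc_max)                       # 7.5: utilization of FC-available MACs
    return Tfc, thfc, Ufc
```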
In the eighth step, the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN is tested as follows:
8.1 Determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 Determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
In the ninth step, the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN is tested as follows:
9.1 Determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 Determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
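Read this way, the ninth step combines the three totals by total operations over total time; again this is our reconstruction of formulas given only as images:

```latex
ThA = \frac{Thconv \sum_{i1} Tconv_{i1} + Thpool \sum_{i2} Tp_{i2} + Thfc \sum_{i3} Tfc_{i3}}
           {\sum_{i1} Tconv_{i1} + \sum_{i2} Tp_{i2} + \sum_{i3} Tfc_{i3}},
\qquad
UA = \frac{ThA}{F \cdot mc \cdot mr}
```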
In the tenth step, the method ends.
The accelerator's computation efficiency on the convolutional layers (third and fourth steps), on the pooling layers (fifth and sixth steps), and on the fully connected layers (seventh and eighth steps) are all inputs to the full-CNN evaluation of the ninth step; the third-fourth, fifth-sixth, and seventh-eighth step pairs can be executed either in parallel or serially.
The invention can achieve the following technical effects:
1. The second step of the invention adapts the number of columns mc and the number of rows mr of the MAC array to the convolution attributes of different CNNs, the accelerator bandwidth constraint, and the accelerator operating frequency, obtaining the optimal MAC array configuration and thereby maximizing accelerator efficiency.
2. With the method, once the hardware constraints (operating frequency and number of MAC resources) and the target CNN are given, the accelerator's throughput and MAC utilization under multiple target bandwidth constraints can be evaluated quickly, optimizing the accelerator's structural design space.
Drawings
FIG. 1 illustrates the GEMM computation method for convolution described in background art 1;
FIG. 2 shows the structural model of the multi-bandwidth target accelerator described in background art 1;
FIG. 3 is the overall flowchart of the multi-bandwidth target accelerator efficiency testing method of the present invention.
Detailed Description
As shown in FIG. 3, the present invention comprises the following steps:
In the first step, a convolutional neural network is selected from the convolutional neural network models as the intelligent accelerator efficiency test case CNN; the test case CNN is preprocessed and the attribute parameters of each of its layers are determined. The specific method is:
Let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc;
1.1 Define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1;
1.2 If the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7;
1.3 Extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 Record the input feature map height Ix_i1 of this layer;
1.3.2 Record the input feature map width Iy_i1;
1.3.3 Record the number of input feature map channels, i.e. the number of convolution kernel channels, Ic_i1;
1.3.4 Record the convolution kernel size k_i1;
1.3.5 Record the number of convolution kernels, i.e. the number of convolution output channels, Oc_i1;
1.3.6 Record the row and column direction padding padc_i1 of the input feature map;
1.3.7 Record the convolution kernel stride strdc_i1;
1.3.8 Record the output feature map height Ox_i1;
1.3.9 Record the output feature map width Oy_i1;
1.4 Let i1 = i1 + 1; go to step 1.9;
1.5 Extract the following attributes of the i2-th pooling layer:
1.5.1 Record the input feature map height Ipx_i2 of this layer;
1.5.2 Record the input feature map width Ipy_i2;
1.5.3 Record the number of input feature map channels Ipc_i2;
1.5.4 Record the pooling window size p_i2;
1.5.5 Record the row and column direction padding padp_i2 of the input feature map;
1.5.6 Record the pooling operation stride strdp_i2;
1.5.7 Record the output feature map height Opx_i2;
1.5.8 Record the output feature map width Opy_i2;
1.6 Let i2 = i2 + 1; go to step 1.9;
1.7 Extract the following attributes of the i3-th fully connected layer:
1.7.1 Record the number of fully connected input nodes Fin_i3;
1.7.2 Record the number of fully connected output nodes Fout_i3;
1.8 Let i3 = i3 + 1; go to step 1.9;
1.9 Let i = i + 1. If i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained; go to the second step. Otherwise go to step 1.2;
In the second step, the size of the accelerator MAC array is determined as follows:
2.1 Determine the number of columns mc of the accelerator MAC array; mc is a positive integer;
K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}; BW is the accelerator bandwidth and F is the operating frequency of the accelerator MAC array;
2.2 Determine the number of rows mr of the accelerator MAC array; mr is a positive integer. The method is:
MACmax is the number of MAC units available to the accelerator;
In the third step, the accelerator's efficiency on each convolutional layer of the test case CNN is tested as follows:
3.1 Let the convolutional layer count loop variable i1 = 1;
3.2 Determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer:
M_i1 = min(Oc_i1, mr);
3.3 Test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 Test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 Test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
3.6 Test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 Let i1 = i1 + 1. If i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained; go to the fourth step. Otherwise go to step 3.2;
In the fourth step, the accelerator's total efficiency on the Nc convolutional layers of the test case CNN is tested as follows:
4.1 Determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 Determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
In the fifth step, the accelerator's efficiency on each pooling layer of the test case CNN is tested as follows:
5.1 Let the pooling layer count loop variable i2 = 1;
5.2 Determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 Determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 Determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations.
5.2.5 Determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 Determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 Determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 Determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 Determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 Determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 Let i2 = i2 + 1. If i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained; go to the sixth step. Otherwise go to step 5.2;
In the sixth step, the accelerator's total efficiency on the Np pooling layers of the test case CNN is tested as follows:
6.1 Determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 Determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
In the seventh step, the accelerator's efficiency on each fully connected layer of the test case CNN is tested as follows:
7.1 Let the fully connected layer count loop variable i3 = 1;
7.2 Determine the parallelism when the accelerator computes the i3-th fully connected layer:
7.2.1 Determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 Determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 Determine the computation time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 Determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers; Mfc_max is a positive integer with 0 < Mfc_max ≤ mr × mc;
7.3 Determine the throughput of the accelerator on the i3-th fully connected layer. If Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
If Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3. Go to step 7.4;
7.4 Determine the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 Determine the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 Let i3 = i3 + 1. If i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained; go to the eighth step. Otherwise go to step 7.2;
In the eighth step, the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN is tested as follows:
8.1 Determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 Determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
In the ninth step, the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN is tested as follows:
9.1 Determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 Determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
In the tenth step, the method ends.
The third-fourth, fifth-sixth, and seventh-eighth step pairs may be executed in parallel; the flowchart in FIG. 3 shows parallel execution. The third through eighth steps may also be executed serially, but execution is faster in parallel.
Claims (4)
1. A multi-bandwidth target accelerator efficiency testing method, characterized by comprising the following steps:
in the first step, selecting a convolutional neural network from a convolutional neural network model as the intelligent accelerator efficiency test case CNN, preprocessing the test case CNN, and determining the attribute parameters of each layer of the test case CNN, the specific method being:
let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc;
1.1 define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1;
1.2 if the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7;
1.3 extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 record the input feature map height Ix_i1 of this layer;
1.3.2 record the input feature map width Iy_i1;
1.3.3 record the number of input feature map channels, i.e. the number of convolution kernel channels, Ic_i1;
1.3.4 record the convolution kernel size k_i1;
1.3.5 record the number of convolution kernels, i.e. the number of convolution output channels, Oc_i1;
1.3.6 record the row and column direction padding padc_i1 of the input feature map;
1.3.7 record the convolution kernel stride strdc_i1;
1.3.8 record the output feature map height Ox_i1;
1.3.9 record the output feature map width Oy_i1;
1.4 let i1 = i1 + 1; go to step 1.9;
1.5 extract the following attributes of the i2-th pooling layer:
1.5.1 record the input feature map height Ipx_i2 of this layer;
1.5.2 record the input feature map width Ipy_i2;
1.5.3 record the number of input feature map channels Ipc_i2;
1.5.4 record the pooling window size p_i2;
1.5.5 record the row and column direction padding padp_i2 of the input feature map;
1.5.6 record the pooling operation stride strdp_i2;
1.5.7 record the output feature map height Opx_i2;
1.5.8 record the output feature map width Opy_i2;
1.6 let i2 = i2 + 1; go to step 1.9;
1.7 extract the following attributes of the i3-th fully connected layer:
1.7.1 record the number of fully connected input nodes Fin_i3;
1.7.2 record the number of fully connected output nodes Fout_i3;
1.8 let i3 = i3 + 1; go to step 1.9;
1.9 let i = i + 1; if i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained, and go to the second step; otherwise go to step 1.2;
in the second step, determining the numbers of columns and rows of the accelerator MAC array as follows:
2.1 determine the number of columns mc of the accelerator MAC array, mc being a positive integer;
K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}; BW is the accelerator bandwidth and F is the operating frequency of the accelerator MAC array;
2.2 determine the number of rows mr of the accelerator MAC array, mr being a positive integer, the method being:
MACmax is the number of MAC units available to the accelerator;
in the third step, testing the accelerator's efficiency on each convolutional layer of the test case CNN as follows:
3.1 let the convolutional layer count loop variable i1 = 1;
3.2 determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer:
M_i1 = min(Oc_i1, mr);
3.3 test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
the throughput being the number of multiply-accumulate operations the accelerator completes per second, in operations/s;
3.6 test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 let i1 = i1 + 1; if i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained, and go to the fourth step; otherwise go to step 3.2;
in the fourth step, testing the accelerator's total efficiency on the Nc convolutional layers of the test case CNN as follows:
4.1 determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
in the fifth step, testing the accelerator's efficiency on each pooling layer of the test case CNN as follows:
5.1 let the pooling layer count loop variable i2 = 1;
5.2 determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations;
5.2.5 determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 let i2 = i2 + 1; if i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained, and go to the sixth step; otherwise go to step 5.2;
in the sixth step, testing the accelerator's total efficiency on the Np pooling layers of the test case CNN as follows:
6.1 determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
in the seventh step, testing the accelerator's efficiency on each fully connected layer of the test case CNN as follows:
7.1 let the fully connected layer count loop variable i3 = 1;
7.2 determine the parallelism when the accelerator computes the i3-th fully connected layer:
7.2.1 determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 determine the time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers, Mfc_max being a positive integer with 0 < Mfc_max ≤ mr × mc;
7.3 test the throughput of the accelerator on the i3-th fully connected layer: if Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
if Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3; go to step 7.4;
7.4 test the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 test the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 let i3 = i3 + 1; if i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained, and go to the eighth step; otherwise go to step 7.2;
in the eighth step, testing the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN as follows:
8.1 determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
in the ninth step, testing the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN as follows:
9.1 determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
in the tenth step, ending the method.
2. The multi-bandwidth target accelerator efficiency testing method of claim 1, wherein the third-fourth, fifth-sixth, and seventh-eighth steps are executed in parallel.
3. The multi-bandwidth target accelerator efficiency testing method of claim 1, wherein the convolutional neural network models of the first step include AlexNet, VGG16, and C3D.
Priority application: CN201910185133.3A, filed 2019-03-12.
Publications: CN109918281A, published 2019-06-21; CN109918281B, granted 2022-07-12.