CN109918281B - Multi-bandwidth target accelerator efficiency testing method - Google Patents
Multi-bandwidth target accelerator efficiency testing method
- Publication number: CN109918281B (application CN201910185133.3A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
- Classification: Y02D10/00, energy-efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an efficiency testing method for accelerators with multiple bandwidth targets, aimed at optimizing the structural design space of such accelerators and evaluating their throughput and MAC utilization. The method first determines the attribute parameters of a test case CNN and the numbers of columns and rows of the accelerator's MAC array. It then tests the accelerator's efficiency on each convolutional layer of the CNN and on all convolutional layers together, on each pooling layer and on all pooling layers together, and on each fully connected layer and on all fully connected layers together; finally it computes the accelerator's throughput and MAC utilization over all convolutional, pooling, and fully connected layers of the CNN. The method adapts the numbers of columns and rows of the MAC array to conditions such as the accelerator's bandwidth constraint, maximizing accelerator efficiency, and can quickly evaluate the accelerator's throughput and MAC utilization under multiple target bandwidth constraints, optimizing the accelerator's structural design space.
Description
Technical Field
The invention relates to a performance testing method for compute accelerators used by compute-intensive applications, and in particular to a method for evaluating accelerator efficiency from the scale of the accelerator array and the bit width of the data.
Background
Applications with large-scale deep neural networks at their core, such as image recognition, speech processing, and text mining, are typically compute- and memory-intensive, and their demand for high-performance computing hardware grows daily. From terminal-side inference carried mainly by high-performance embedded platforms to intelligent applications on miniaturized, integrated "micro" devices, applications of different scales and scenarios place different requirements on an intelligent processing chip's latency, throughput, and so on. Intelligent chip architectures and testing methods aimed at terminal inference must therefore be able to adjust the system structure across multiple scales according to application performance and cost requirements. As neural networks deepen and applications grow more complex, artificial neural networks require ever more model parameters and process ever larger data; these parameters place enormous pressure on the memory access bandwidth required for computation, and under the traditional von Neumann architecture most neural network applications become memory-bound, compute-intensive problems.
Customized intelligent accelerators are currently among the most effective ways to optimize applications such as image recognition, speech processing, and text mining, but the development cycle of a customized architecture is very long, typically involving lengthy design stages such as application analysis, architectural design, logic design, circuit optimization, synthesis optimization, and tape-out.
Studies of the computational workload of convolutional neural networks (CNNs) show that convolution accounts for roughly 90% of a CNN's operations. Liu et al., in a paper published in the journal Electronics (Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J. A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs. Electronics 2019, 8, 65) (referred to as background art 1), propose an accelerator structure. This accelerator is a general matrix multiplication (GEMM) hardware accelerator oriented toward deep neural networks. As shown in FIG. 1, in the GEMM formulation, the convolutional layer weight W is m × c × k × k, the input feature map X is c × h × w, and convolving W with X yields the output feature map Y of size m × h × w, where m is the number of convolution kernels in the layer, c the number of input feature map channels, k the convolution kernel size, h the input and output feature map height, and w the input and output feature map width. The GEMM method expands the weights W into a weight matrix, compresses and recombines the feature map X into a feature map matrix, and multiplies the two matrices to obtain the expanded form of the output feature map Y, i.e. the output feature map matrix.
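To make the mapping concrete, the following is a minimal NumPy sketch of this im2col-style GEMM formulation. It is a sketch only: the function and variable names are ours rather than background art 1's, and padding is omitted for brevity (so the output is (m, oh, ow) for a valid convolution rather than m × h × w).

```python
import numpy as np

def conv_as_gemm(W, X, k, stride=1):
    """Map a convolution to matrix multiplication as in FIG. 1.
    W: weights of shape (m, c, k, k); X: input feature map (c, h, w)."""
    m, c, _, _ = W.shape
    _, h, w = X.shape
    oh = (h - k) // stride + 1
    ow = (w - k) // stride + 1
    # Expand the weights into an (m, c*k*k) weight matrix.
    Wm = W.reshape(m, c * k * k)
    # Compress/recombine the feature map into a (c*k*k, oh*ow) matrix (im2col).
    cols = np.empty((c * k * k, oh * ow))
    for y in range(oh):
        for x in range(ow):
            patch = X[:, y*stride:y*stride+k, x*stride:x*stride+k]
            cols[:, y * ow + x] = patch.ravel()
    # One matrix product yields the expanded output feature map matrix (m, oh*ow).
    return (Wm @ cols).reshape(m, oh, ow)
```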
FIG. 2 is a hardware block diagram of an accelerator that performs the convolution GEMM operations described above. The accelerator's computational logic comprises a Multiply-ACcumulate (MAC) unit array, a weight matrix buffer, a feature map matrix buffer, and an output feature map matrix buffer. Each MAC in the array operates as a pipelined unit, completing one multiplication and one addition per clock cycle. During convolutional neural network computation, the MAC array fetches the data W and X required for the operation from memory; the data are expanded and recombined as in FIG. 1 and stored in the accelerator's weight matrix buffer and feature map matrix buffer respectively, and after computation the result Y is written to the output feature map matrix buffer. This accelerator structure computes every layer of the convolutional neural network on a scalable MAC array; it handles a variety of convolution operations efficiently, is little constrained by hardware resources, and allows the MAC array's dimensions to be configured flexibly. It is a structure commonly used in convolutional neural network acceleration today. However, limited by application scenarios and cost, the accelerator's memory access bandwidth becomes an important factor limiting its performance. Different access bandwidths lead to different optimal MAC array configurations and acceleration efficiencies: although the total number of MACs in the array is fixed, the number of columns mc and the number of rows mr must be determined according to the access bandwidth. Such accelerators are therefore also called multi-bandwidth target accelerators.
At present, most multi-bandwidth target accelerators are designed for specific application fields; the scale of the field's applications determines the accelerator's computational scale, storage bandwidth, communication capacity, execution mode, and so on, and to some extent its performance level. Establishing a structural-model-based evaluation method for multi-bandwidth target accelerators under memory access bandwidth constraints would allow the accelerator's structural design space to be optimized. How to rapidly evaluate the performance of a multi-bandwidth target accelerator structure, including throughput and MAC utilization, is therefore a technical problem that those skilled in the art need to overcome. No method for evaluating the performance of such accelerators has been disclosed or reported.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-bandwidth target accelerator efficiency testing method that optimizes the structural design space of a multi-bandwidth target accelerator and rapidly evaluates the structural efficiency of such an accelerator, including its throughput and MAC utilization, where throughput denotes the number of multiply-accumulate (MAC) operations the accelerator performs per second. In particular, for the accelerator structure proposed in background art 1, the invention provides a method for calculating the accelerator's performance by appropriately adapting the multiplication array to different convolution scales. The invention can directly evaluate the performance (throughput and MAC utilization) of the multi-bandwidth target accelerator's scalable computing system under multiple target bandwidth constraints.
The specific technical scheme is as follows:
In the first step, a convolutional neural network is selected from the widely used convolutional neural network models (such as AlexNet, VGG16, and C3D) as the intelligent accelerator efficiency test case CNN. The test case CNN is preprocessed and the attribute parameters of each of its layers are determined. The specific method is:
Let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc.
1.1 Define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1.
1.2 If the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7.
1.3 Extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 Record the input feature map height Ix_i1 of this layer;
1.3.2 Record the input feature map width Iy_i1;
1.3.3 Record the number of input feature map channels (i.e. the number of convolution kernel channels) Ic_i1;
1.3.4 Record the convolution kernel size k_i1;
1.3.5 Record the number of convolution kernels (i.e. the number of convolution output channels) Oc_i1;
1.3.6 Record the row and column direction padding padc_i1 of the input feature map;
1.3.7 Record the convolution kernel stride strdc_i1;
1.3.8 Record the output feature map height Ox_i1;
1.3.9 Record the output feature map width Oy_i1.
1.4 Let i1 = i1 + 1; go to step 1.9.
1.5 Extract the following attributes of the i2-th pooling layer:
1.5.1 Record the input feature map height Ipx_i2 of this layer;
1.5.2 Record the input feature map width Ipy_i2;
1.5.3 Record the number of input feature map channels Ipc_i2;
1.5.4 Record the pooling window size p_i2;
1.5.5 Record the row and column direction padding padp_i2 of the input feature map;
1.5.6 Record the pooling operation stride strdp_i2;
1.5.7 Record the output feature map height Opx_i2;
1.5.8 Record the output feature map width Opy_i2.
1.6 Let i2 = i2 + 1; go to step 1.9.
1.7 Extract the following attributes of the i3-th fully connected layer:
1.7.1 Record the number of fully connected input nodes Fin_i3;
1.7.2 Record the number of fully connected output nodes Fout_i3.
1.8 Let i3 = i3 + 1; go to step 1.9.
1.9 Let i = i + 1. If i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained; go to the second step. Otherwise go to step 1.2.
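A compact sketch of this first-step preprocessing loop is shown below. The layer-descriptor dictionaries and their field names are our assumption; the patent does not prescribe a data structure.

```python
def preprocess(cnn_layers):
    """Collect per-layer attributes of the test case CNN (first step).
    cnn_layers: list of dicts, each with a 'type' key of 'conv', 'pool' or 'fc'."""
    conv, pool, fc = [], [], []
    for layer in cnn_layers:                 # i = 1 .. N
        if layer['type'] == 'conv':          # step 1.3: Ix, Iy, Ic, k, Oc, padc, strdc, Ox, Oy
            conv.append(layer)
        elif layer['type'] == 'pool':        # step 1.5: Ipx, Ipy, Ipc, p, padp, strdp, Opx, Opy
            pool.append(layer)
        else:                                # step 1.7: Fin, Fout
            fc.append(layer)
    # The sets needed by the second step are read off the conv list:
    # {Ic}, {k}, {strdc}, {Iy}, {padc}.
    return conv, pool, fc
```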
In the second step, the numbers of columns and rows of the accelerator MAC array are determined as follows:
2.1 Determine the number of columns mc of the accelerator MAC array; mc is a positive integer.
Because the accelerator is most efficient when the ratio of memory access time to computation time for its convolution operations is 1:1, performance is optimal when the ratio satisfies formula (1):
In formula (1), Ic ∈ {Ic_1, Ic_2, …, Ic_Nc}, K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}, and Iy ∈ {Iy_1, Iy_2, …, Iy_Nc}. BW is the accelerator bandwidth (in units of μ bit/s, where μ is the bit width of the MAC unit's input operands, determined by the accelerator designer), and F is the operating frequency of the accelerator MAC array (in Hz, determined by the accelerator designer); BW and F are fixed parameters once the accelerator design is complete.
The mc obtained from formula (1) satisfies formula (2):
Recent trends in convolutional neural networks show that 3 × 3 and 1 × 1 convolutions are used most, while the convolution kernel stride strdx is generally 1. Taking K = 3 and strdx = 1, the accelerator maximizes throughput for most convolution operations, so mc satisfies formula (3):
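The equations themselves appear only as images in the source. The LaTeX below is our reconstruction from the balance argument above (the transfer of strdx × Iy × Ic input pixels against the mc-column computation of one output row) and should be read as an assumption, not as the patent's verbatim formulas:

```latex
% Formula (1): memory access time balanced against compute time (our reading).
\frac{strdx \cdot Iy \cdot Ic}{BW} = \frac{Iy \cdot Ic \cdot K^2}{strdx \cdot F \cdot mc} \tag{1}
% Solving (1) for mc gives (2); K = 3 and strdx = 1 give (3).
mc = \left\lceil \frac{K^2 \cdot BW}{strdx^2 \cdot F} \right\rceil \tag{2}
\qquad
mc = \left\lceil \frac{9\,BW}{F} \right\rceil \tag{3}
```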
2.2 Determine the number of rows mr of the accelerator MAC array; mr is a positive integer. The method is:
MACmax is the number of MAC units available to the accelerator (determined by the accelerator designer from the hardware logic resources).
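Under that reconstruction, the whole second step reduces to a few lines. The sketch below inherits the same caveats; in particular the ceiling and floor choices are our assumptions.

```python
import math

def mac_array_shape(BW, F, MACmax, K=3, strdx=1):
    """Second step: fit the MAC array shape to the bandwidth target.
    BW is in MAC-input words per second; F is the array clock in Hz."""
    mc = max(1, math.ceil(K * K * BW / (strdx * strdx * F)))  # formulas (2)-(3)
    mr = max(1, MACmax // mc)                                 # rows allowed by the MAC budget
    return mc, mr

# e.g. mac_array_shape(BW=4e8, F=2e8, MACmax=1024) -> (18, 56)
```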
In the third step, the accelerator's efficiency on each convolutional layer of the test case CNN is tested as follows:
3.1 Let the convolutional layer count loop variable i1 = 1.
3.2 Determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer (i.e. the number of MAC array rows actually used for that layer):
M_i1 = min(Oc_i1, mr).
3.3 Test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 Test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 Test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
The throughput is the number of multiply-accumulate operations the accelerator completes per second, in operations/s.
3.6 Test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 Let i1 = i1 + 1. If i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained; go to the fourth step. Otherwise go to step 3.2.
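The per-layer formulas of steps 3.3-3.6 are likewise images in the source. As a stand-in, a roofline-style model consistent with the surrounding text is sketched below; the t_mem/t_cmp split and the tiling over Ox rows and ceil(Oc/M) channel groups are our assumptions, not the patent's exact expressions.

```python
import math

def conv_layer_perf(layer, mc, mr, BW, F):
    """Steps 3.2-3.6 for one convolutional layer (hedged stand-in model)."""
    Oc, Ox, Oy = layer['Oc'], layer['Ox'], layer['Oy']
    Ic, k, strdc, Iy = layer['Ic'], layer['k'], layer['strdc'], layer['Iy']
    M = min(Oc, mr)                                   # step 3.2: output parallelism
    # step 3.3: stream strdc*Iy*Ic input pixels and M*Oy result pixels,
    # overlapped with the mc-column computation of M*Oy output pixels.
    t_mem = (strdc * Iy * Ic + M * Oy) / BW
    t_cmp = Oy * Ic * k * k / (F * mc)
    tconv = max(t_mem, t_cmp)
    Tconv = tconv * Ox * math.ceil(Oc / M)            # step 3.4: whole layer
    macs = Oc * Ox * Oy * Ic * k * k                  # multiply-accumulates in the layer
    thconv = macs / Tconv                             # step 3.5: throughput (MAC ops/s)
    U = thconv / (F * mc * mr)                        # step 3.6: MAC utilization
    return Tconv, thconv, U
```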
In the fourth step, the accelerator's total efficiency on the Nc convolutional layers of the test case CNN is tested as follows:
4.1 Determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 Determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
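One consistent reading of 4.1-4.2, offered as our reconstruction rather than the patent's formulas, is an operation-weighted aggregate:

```latex
Thconv = \frac{\sum_{i1=1}^{Nc} thconv_{i1} \cdot Tconv_{i1}}{\sum_{i1=1}^{Nc} Tconv_{i1}}
       = \frac{\text{total convolution MACs}}{\sum_{i1=1}^{Nc} Tconv_{i1}},
\qquad
UCmac = \frac{Thconv}{F \cdot mc \cdot mr}
```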
In the fifth step, the accelerator's efficiency on each pooling layer of the test case CNN is tested as follows:
5.1 Let the pooling layer count loop variable i2 = 1.
5.2 Determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 Determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 Determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations.
5.2.5 Determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 Determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 Determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 Determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 Determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 Determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 Let i2 = i2 + 1. If i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained; go to the sixth step. Otherwise go to step 5.2.
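The pooling formulas of steps 5.2.3-5.6 are also images. A deliberately coarse stand-in, under the assumption that pooling is dominated by streaming p × p input windows while Mpr rows reduce channels in parallel, is sketched below; every expression here is our assumption.

```python
import math

def pool_layer_perf(layer, Mpr, mc, mr, BW, F):
    """Steps 5.3-5.6 for one pooling layer (coarse stand-in model)."""
    Ipc, p = layer['Ipc'], layer['p']
    Opx, Opy = layer['Opx'], layer['Opy']
    # step 5.3: produce Mpr*Opy output pixels, bounded by the slower of
    # streaming their p*p input windows and reducing them on Mpr rows.
    t_mem = Mpr * Opy * p * p / BW
    t_cmp = Opy * p * p / F                    # Mpr channels reduced in parallel
    tpool = max(t_mem, t_cmp)
    Tp = tpool * Opx * math.ceil(Ipc / Mpr)    # step 5.4: whole layer
    ops = Ipc * Opx * Opy * p * p              # window operations in the layer
    thpool = ops / Tp                          # step 5.5: throughput
    Up = thpool / (F * mc * mr)                # step 5.6: utilization of the full array
    return Tp, thpool, Up
```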
In the sixth step, the accelerator's total efficiency on the Np pooling layers of the test case CNN is tested as follows:
6.1 Determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 Determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
In the seventh step, the accelerator's efficiency on each fully connected layer of the test case CNN is tested as follows:
7.1 Let the fully connected layer count loop variable i3 = 1.
7.2 Determine the parallelism (i.e. the number of MAC units used) when the accelerator computes the i3-th fully connected layer:
7.2.1 Determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 Determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 Determine the time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 Determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers; Mfc_max is a positive integer with 0 < Mfc_max ≤ mr × mc, determined by the accelerator designer.
7.3 Determine the throughput of the accelerator on the i3-th fully connected layer. If Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
If Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3. Go to step 7.4.
7.4 Determine the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 Determine the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 Let i3 = i3 + 1. If i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained; go to the eighth step. Otherwise go to step 7.2.
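Steps 7.2-7.5 read as a classic memory-versus-compute balance on the Fin × Fout weight matrix. A hedged sketch under that reading follows; the ratio definition and the Ufc normalization are our assumptions.

```python
import math

def fc_layer_perf(Fin, Fout, Mfc_max, BW, F):
    """Steps 7.2-7.5 for one fully connected layer (hedged stand-in model)."""
    TthMfc = Fin * Fout / BW                         # 7.2.1: each weight fetched once
    TthCfc = Fin * Fout / F                          # 7.2.2: one MAC, one op per cycle
    ratio = TthCfc / TthMfc                          # 7.2.3: compute time over access time
    Mfcr = min(max(1, math.floor(ratio)), Mfc_max)   # 7.2.5: usable parallelism
    if Mfcr == 1:
        thfc = BW            # 7.3: access-limited, one MAC per weight fetched
    else:
        thfc = F * Mfcr      # 7.3: thfc ≈ F × Mfcr
    Tfc = Fin * Fout / thfc                          # 7.4: layer time
    Ufc = thfc / (F * Mfc_max)                       # 7.5: utilization of FC-available MACs
    return Tfc, thfc, Ufc
```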
In the eighth step, the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN is tested as follows:
8.1 Determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 Determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
In the ninth step, the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN is tested as follows:
9.1 Determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 Determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
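Read this way, the ninth step combines the three totals by total operations over total time; again this is our reconstruction of formulas given only as images:

```latex
ThA = \frac{Thconv \sum_{i1} Tconv_{i1} + Thpool \sum_{i2} Tp_{i2} + Thfc \sum_{i3} Tfc_{i3}}
           {\sum_{i1} Tconv_{i1} + \sum_{i2} Tp_{i2} + \sum_{i3} Tfc_{i3}},
\qquad
UA = \frac{ThA}{F \cdot mc \cdot mr}
```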
In the tenth step, the method ends.
The accelerator's computation efficiency on the convolutional layers (third and fourth steps), on the pooling layers (fifth and sixth steps), and on the fully connected layers (seventh and eighth steps) are all inputs to the full-CNN evaluation of the ninth step; the third-fourth, fifth-sixth, and seventh-eighth step pairs can be executed either in parallel or serially.
The invention can achieve the following technical effects:
1. The second step of the invention adapts the number of columns mc and the number of rows mr of the MAC array to the convolution attributes of different CNNs, the accelerator bandwidth constraint, and the accelerator operating frequency, obtaining the optimal MAC array configuration and thereby maximizing accelerator efficiency.
2. With the method, once the hardware constraints (operating frequency and number of MAC resources) and the target CNN are given, the accelerator's throughput and MAC utilization under multiple target bandwidth constraints can be evaluated quickly, optimizing the accelerator's structural design space.
Drawings
FIG. 1 illustrates the GEMM computation method for convolution described in background art 1;
FIG. 2 shows the structural model of the multi-bandwidth target accelerator described in background art 1;
FIG. 3 is the overall flowchart of the multi-bandwidth target accelerator efficiency testing method of the present invention.
Detailed Description
As shown in FIG. 3, the present invention comprises the following steps:
In the first step, a convolutional neural network is selected from the convolutional neural network models as the intelligent accelerator efficiency test case CNN; the test case CNN is preprocessed and the attribute parameters of each of its layers are determined. The specific method is:
Let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc;
1.1 Define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1;
1.2 If the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7;
1.3 Extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 Record the input feature map height Ix_i1 of this layer;
1.3.2 Record the input feature map width Iy_i1;
1.3.3 Record the number of input feature map channels, i.e. the number of convolution kernel channels, Ic_i1;
1.3.4 Record the convolution kernel size k_i1;
1.3.5 Record the number of convolution kernels, i.e. the number of convolution output channels, Oc_i1;
1.3.6 Record the row and column direction padding padc_i1 of the input feature map;
1.3.7 Record the convolution kernel stride strdc_i1;
1.3.8 Record the output feature map height Ox_i1;
1.3.9 Record the output feature map width Oy_i1;
1.4 Let i1 = i1 + 1; go to step 1.9;
1.5 Extract the following attributes of the i2-th pooling layer:
1.5.1 Record the input feature map height Ipx_i2 of this layer;
1.5.2 Record the input feature map width Ipy_i2;
1.5.3 Record the number of input feature map channels Ipc_i2;
1.5.4 Record the pooling window size p_i2;
1.5.5 Record the row and column direction padding padp_i2 of the input feature map;
1.5.6 Record the pooling operation stride strdp_i2;
1.5.7 Record the output feature map height Opx_i2;
1.5.8 Record the output feature map width Opy_i2;
1.6 Let i2 = i2 + 1; go to step 1.9;
1.7 Extract the following attributes of the i3-th fully connected layer:
1.7.1 Record the number of fully connected input nodes Fin_i3;
1.7.2 Record the number of fully connected output nodes Fout_i3;
1.8 Let i3 = i3 + 1; go to step 1.9;
1.9 Let i = i + 1. If i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained; go to the second step. Otherwise go to step 1.2;
In the second step, the size of the accelerator MAC array is determined as follows:
2.1 Determine the number of columns mc of the accelerator MAC array; mc is a positive integer;
K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}; BW is the accelerator bandwidth and F is the operating frequency of the accelerator MAC array;
2.2 Determine the number of rows mr of the accelerator MAC array; mr is a positive integer. The method is:
MACmax is the number of MAC units available to the accelerator;
In the third step, the accelerator's efficiency on each convolutional layer of the test case CNN is tested as follows:
3.1 Let the convolutional layer count loop variable i1 = 1;
3.2 Determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer:
M_i1 = min(Oc_i1, mr);
3.3 Test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 Test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 Test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
3.6 Test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 Let i1 = i1 + 1. If i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained; go to the fourth step. Otherwise go to step 3.2;
In the fourth step, the accelerator's total efficiency on the Nc convolutional layers of the test case CNN is tested as follows:
4.1 Determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 Determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
In the fifth step, the accelerator's efficiency on each pooling layer of the test case CNN is tested as follows:
5.1 Let the pooling layer count loop variable i2 = 1;
5.2 Determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 Determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 Determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations.
5.2.5 Determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 Determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 Determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 Determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 Determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 Determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 Let i2 = i2 + 1. If i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained; go to the sixth step. Otherwise go to step 5.2;
In the sixth step, the accelerator's total efficiency on the Np pooling layers of the test case CNN is tested as follows:
6.1 Determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 Determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
In the seventh step, the accelerator's efficiency on each fully connected layer of the test case CNN is tested as follows:
7.1 Let the fully connected layer count loop variable i3 = 1;
7.2 Determine the parallelism when the accelerator computes the i3-th fully connected layer:
7.2.1 Determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 Determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 Determine the computation time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 Determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers; Mfc_max is a positive integer with 0 < Mfc_max ≤ mr × mc;
7.3 Determine the throughput of the accelerator on the i3-th fully connected layer. If Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
If Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3. Go to step 7.4;
7.4 Determine the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 Determine the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 Let i3 = i3 + 1. If i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained; go to the eighth step. Otherwise go to step 7.2;
In the eighth step, the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN is tested as follows:
8.1 Determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 Determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
In the ninth step, the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN is tested as follows:
9.1 Determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 Determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
In the tenth step, the method ends.
The third-fourth, fifth-sixth, and seventh-eighth step pairs may be executed in parallel; the flowchart in FIG. 3 shows parallel execution. The third through eighth steps may also be executed serially, but execution is faster in parallel.
Claims (4)
1. A multi-bandwidth target accelerator efficiency testing method, characterized by comprising the following steps:
in the first step, selecting a convolutional neural network from a convolutional neural network model as the intelligent accelerator efficiency test case CNN, preprocessing the test case CNN, and determining the attribute parameters of each layer of the test case CNN, the specific method being:
let the number of layers of the test case CNN be N, comprising Nc convolutional layers, Np pooling layers, and Nfc fully connected layers, where N, Nc, Np, and Nfc are positive integers and N = Nc + Np + Nfc;
1.1 define the convolutional neural network layer ordinal variable i = 1, the convolutional layer count loop variable i1 = 1, the pooling layer count loop variable i2 = 1, and the fully connected layer count loop variable i3 = 1;
1.2 if the i-th layer of the convolutional neural network is a convolutional layer, go to step 1.3; if the i-th layer is a pooling layer, go to step 1.5; if the i-th layer is a fully connected layer, go to step 1.7;
1.3 extract the following attributes of the i1-th convolutional layer from the description of the CNN:
1.3.1 record the input feature map height Ix_i1 of this layer;
1.3.2 record the input feature map width Iy_i1;
1.3.3 record the number of input feature map channels, i.e. the number of convolution kernel channels, Ic_i1;
1.3.4 record the convolution kernel size k_i1;
1.3.5 record the number of convolution kernels, i.e. the number of convolution output channels, Oc_i1;
1.3.6 record the row and column direction padding padc_i1 of the input feature map;
1.3.7 record the convolution kernel stride strdc_i1;
1.3.8 record the output feature map height Ox_i1;
1.3.9 record the output feature map width Oy_i1;
1.4 let i1 = i1 + 1; go to step 1.9;
1.5 extract the following attributes of the i2-th pooling layer:
1.5.1 record the input feature map height Ipx_i2 of this layer;
1.5.2 record the input feature map width Ipy_i2;
1.5.3 record the number of input feature map channels Ipc_i2;
1.5.4 record the pooling window size p_i2;
1.5.5 record the row and column direction padding padp_i2 of the input feature map;
1.5.6 record the pooling operation stride strdp_i2;
1.5.7 record the output feature map height Opx_i2;
1.5.8 record the output feature map width Opy_i2;
1.6 let i2 = i2 + 1; go to step 1.9;
1.7 extract the following attributes of the i3-th fully connected layer:
1.7.1 record the number of fully connected input nodes Fin_i3;
1.7.2 record the number of fully connected output nodes Fout_i3;
1.8 let i3 = i3 + 1; go to step 1.9;
1.9 let i = i + 1; if i = N, the sets {Ic_1, Ic_2, …, Ic_Nc}, {k_1, k_2, …, k_Nc}, {strdc_1, strdc_2, …, strdc_Nc}, {Iy_1, Iy_2, …, Iy_Nc}, and {padc_1, padc_2, …, padc_Nc} required by the subsequent tests have been obtained, and go to the second step; otherwise go to step 1.2;
in the second step, determining the numbers of columns and rows of the accelerator MAC array as follows:
2.1 determine the number of columns mc of the accelerator MAC array, mc being a positive integer;
K ∈ {k_1, k_2, …, k_Nc}, strdx ∈ {strdc_1, strdc_2, …, strdc_Nc}; BW is the accelerator bandwidth and F is the operating frequency of the accelerator MAC array;
2.2 determine the number of rows mr of the accelerator MAC array, mr being a positive integer, the method being:
MACmax is the number of MAC units available to the accelerator;
in the third step, testing the accelerator's efficiency on each convolutional layer of the test case CNN as follows:
3.1 let the convolutional layer count loop variable i1 = 1;
3.2 determine the output parallelism M_i1 when the accelerator computes the i1-th convolutional layer:
M_i1 = min(Oc_i1, mr);
3.3 test the time tconv_i1 for the accelerator to transfer strdc_i1 × Iy_i1 × Ic_i1 input pixels from memory into the input feature map buffer, complete M_i1 × Oy_i1 convolution operations, and burst-transfer the M_i1 × Oy_i1 result pixels from the output feature map buffer back to memory:
3.4 test the time Tconv_i1 for the accelerator to compute the i1-th convolutional layer:
3.5 test the throughput thconv_i1 of the accelerator on the i1-th convolutional layer:
the throughput being the number of multiply-accumulate operations the accelerator completes per second, in operations/s;
3.6 test the MAC utilization U_i1 of the accelerator on the i1-th convolutional layer:
3.7 let i1 = i1 + 1; if i1 = Nc, the sets {Tconv_1, Tconv_2, …, Tconv_Nc}, {thconv_1, thconv_2, …, thconv_Nc}, and {U_1, U_2, …, U_Nc} have been obtained, and go to the fourth step; otherwise go to step 3.2;
in the fourth step, testing the accelerator's total efficiency on the Nc convolutional layers of the test case CNN as follows:
4.1 determine the throughput Thconv of the accelerator over the Nc convolutional layers of the CNN:
4.2 determine the MAC utilization UCmac of the accelerator over the Nc convolutional layers of the CNN:
in the fifth step, testing the accelerator's efficiency on each pooling layer of the test case CNN as follows:
5.1 let the pooling layer count loop variable i2 = 1;
5.2 determine the parallelism (i.e. the number of MAC array rows actually used) when the accelerator computes the i2-th pooling layer:
5.2.3 determine the theoretical ratio ratiopool_i2 of memory access time to computation time for the i2-th pooling layer:
5.2.4 determine the number Poolpara_i2 of pooling operations the accelerator can run in parallel on the i2-th pooling layer when using one row of the multiplier array:
where PAD = max(padc_1, padc_2, …, padc_Nc) is the number of single-side input feature map buffer blocks designed into the accelerator to accommodate convolution operations;
5.2.5 determine the maximum number of multiplier array rows Mp_max the accelerator can use for maximum parallelism on the i2-th pooling layer:
5.2.6 determine the number of multiplier array rows Mpr_i2 the accelerator can actually use on the i2-th pooling layer:
Mpr_i2 = min(mr, Mp_max);
5.3 determine the time tpool_i2 for the accelerator to store Mpr_i2 × Opy_i2 output pixels when computing the i2-th pooling layer:
5.4 determine the computation time Tp_i2 of the i2-th pooling layer:
5.5 determine the throughput thpool_i2 of the accelerator on the i2-th pooling layer:
5.6 determine the MAC utilization Up_i2 of the accelerator on the i2-th pooling layer:
5.7 let i2 = i2 + 1; if i2 = Np, the sets {Tp_1, Tp_2, …, Tp_Np}, {thpool_1, thpool_2, …, thpool_Np}, and {Up_1, Up_2, …, Up_Np} have been obtained, and go to the sixth step; otherwise go to step 5.2;
in the sixth step, testing the accelerator's total efficiency on the Np pooling layers of the test case CNN as follows:
6.1 determine the throughput Thpool of the accelerator over the Np pooling layers of the CNN:
6.2 determine the MAC utilization UPmac of the accelerator over the Np pooling layers of the CNN:
in the seventh step, testing the accelerator's efficiency on each fully connected layer of the test case CNN as follows:
7.1 let the fully connected layer count loop variable i3 = 1;
7.2 determine the parallelism when the accelerator computes the i3-th fully connected layer:
7.2.1 determine the theoretical memory access time TthMfc_i3 of the i3-th fully connected layer:
7.2.2 determine the theoretical computation time TthCfc_i3 for one MAC unit of the accelerator to compute the i3-th fully connected layer:
7.2.3 determine the time ratio ratio_i3 for one MAC unit of the accelerator computing the i3-th fully connected layer:
7.2.5 determine the parallelism Mfcr_i3 of the accelerator on the i3-th fully connected layer, which satisfies the following formula:
where Mfc_max is the maximum number of MAC units the accelerator design makes available in parallel for fully connected layers, Mfc_max being a positive integer with 0 < Mfc_max ≤ mr × mc;
7.3 test the throughput of the accelerator on the i3-th fully connected layer: if Mfcr_i3 = 1, the computation of the i3-th fully connected layer is memory-access limited, and the throughput thfc_i3 of the i3-th fully connected layer is:
if Mfcr_i3 > 1, the throughput of the i3-th fully connected layer is:
thfc_i3 ≈ F × Mfcr_i3; go to step 7.4;
7.4 test the time Tfc_i3 for the accelerator to compute the i3-th fully connected layer:
7.5 test the MAC utilization Ufc_i3 of the accelerator on the i3-th fully connected layer:
7.6 let i3 = i3 + 1; if i3 = Nfc, the sets {Tfc_1, Tfc_2, …, Tfc_Nfc}, {thfc_1, thfc_2, …, thfc_Nfc}, and {Ufc_1, Ufc_2, …, Ufc_Nfc} have been obtained, and go to the eighth step; otherwise go to step 7.2;
in the eighth step, testing the accelerator's total efficiency on the Nfc fully connected layers of the test case CNN as follows:
8.1 determine the throughput Thfc of the accelerator over all fully connected layers of the CNN:
8.2 determine the MAC utilization UFCmac of the accelerator over all fully connected layers of the CNN:
in the ninth step, testing the accelerator's efficiency on all convolutional, pooling, and fully connected layers of the test case CNN as follows:
9.1 determine the throughput ThA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
9.2 determine the MAC utilization UA of the accelerator over all convolutional, pooling, and fully connected layers of the CNN:
in the tenth step, ending the method.
2. The multi-bandwidth target accelerator efficiency testing method of claim 1, wherein the third-fourth, fifth-sixth, and seventh-eighth steps are executed in parallel.
3. The multi-bandwidth target accelerator efficiency testing method of claim 1, wherein the convolutional neural network models of the first step include AlexNet, VGG16, and C3D.
Priority application: CN201910185133.3A, filed 2019-03-12.
Publications: CN109918281A, published 2019-06-21; CN109918281B, granted 2022-07-12.