CN109918281B - Multi-bandwidth target accelerator efficiency testing method - Google Patents

Multi-bandwidth target accelerator efficiency testing method Download PDF

Info

Publication number
CN109918281B
CN109918281B CN201910185133.3A CN201910185133A CN109918281B CN 109918281 B CN109918281 B CN 109918281B CN 201910185133 A CN201910185133 A CN 201910185133A CN 109918281 B CN109918281 B CN 109918281B
Authority
CN
China
Prior art keywords
accelerator
layer
following
steps
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910185133.3A
Other languages
Chinese (zh)
Other versions
CN109918281A (en
Inventor
姜晶菲
付强
窦勇
刘志强
韩哲
赵小强
秦步月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910185133.3A priority Critical patent/CN109918281B/en
Publication of CN109918281A publication Critical patent/CN109918281A/en
Application granted granted Critical
Publication of CN109918281B publication Critical patent/CN109918281B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Tests Of Electronic Circuits (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an accelerator efficiency testing method for multiple bandwidth targets, and aims to optimize the structural design space of an accelerator with multiple bandwidth targets and evaluate the throughput rate and the MAC utilization rate. The method comprises the following steps of firstly determining attribute parameters of a test case CNN, and determining the column number and the row number of an accelerator MAC array; then testing the efficiency of the accelerator to each convolution layer of the CNN and the total efficiency of all convolution layers, testing the efficiency of the accelerator to each pooling layer of the CNN and the total efficiency of all pooling layers, testing the efficiency of the accelerator to each full-connection layer of the CNN and the total efficiency of all full-connection layers, and finally calculating the throughput rate and MAC utilization rate of all convolution layers, pooling layers and full-connection layers of the CNN by the testing accelerator. The method can reasonably adapt the column number and the line number of the MAC array according to conditions such as accelerator bandwidth constraint and the like, maximize the efficiency of the accelerator, quickly evaluate the throughput rate and the MAC utilization rate of the accelerator under the multi-target bandwidth constraint, and optimize the structural design space of the accelerator.

Description

Multi-bandwidth target accelerator efficiency testing method
Technical Field
The invention relates to a performance test method for a compute-intensive application compute accelerator, in particular to a method for evaluating the efficiency of an accelerator according to the scale of an accelerator array and the bit width of data.
Background
Applications such as image recognition, speech processing, text mining and the like with a large-scale deep neural network as a core are generally computing and memory-intensive applications, and the demand for high-performance computing hardware is increasing day by day. From intelligent reasoning of a terminal mainly bearing a high-performance embedded platform to intelligent application of a miniature and integrated 'micro' end, the intelligent application of different scales and different scenes has different requirements on intelligent processing chip delay, throughput rate and the like. The intelligent chip framework and the test method facing the terminal reasoning must be capable of adjusting the system structure in multiple scales according to the application performance and cost requirements. With the increasing depth of the neural network and the increasing complexity of the application, the artificial neural network needs more and more model parameters, the processed data is larger and larger, the parameters bring huge pressure on the memory access bandwidth required by calculation, and most neural network applications become memory access limited intensive calculation problems under the traditional von Neumann structure.
Currently, a customized intelligent accelerator becomes one of the most effective ways for optimizing applications such as image recognition, voice processing, text mining and the like, but the development cycle of a customized structure is very long, and the customized structure often relates to a plurality of long design flows such as application analysis, structural design, logic design, circuit optimization, comprehensive optimization, sheet casting production and the like.
According to the study on the operation amount in the convolutional Neural network CNN (convolutional Neural network), the convolutional operation occupies 90% of the operation amount of the CNN. Liu Shi et al, published in Electronics journal as "FPGA Accelerator Architecture Design for two-dimensional and three-dimensional convolutional neural networks" paper (Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J.A Uniform Architecture Design for Acceleration 2D and 3D CNNs on FPGAs. Electronics 2019,8,65) (referred to as background art 1), propose an accelerator structure. Such an accelerator is a deep neural network-oriented generic Matrix multiplication gemm (general Matrix multiplication) hardware accelerator. As shown in FIG. 1, the GEMM operation method of the deep neural network is that the convolution layer weight W is m × c × k × k, the input feature map X is c × h × W, W and X are subjected to convolution operation to obtain an output feature map Y, and Y is m × h × W. Wherein m is the number of convolution layer convolution kernels, c is the number of input feature map channels, k is the size of the convolution kernels, h is the input and output feature map height, and W is the input and output feature map width, the GEMM method of convolution operation expands the weight W into a weight matrix, compresses and recombines the feature map X into a feature map matrix, and performs matrix multiplication on the weight matrix and the feature map matrix to obtain the expansion form of the output feature map Y, namely the output feature map matrix.
FIG. 2 is a hardware block diagram of an accelerator that performs the above-described convolution GEMM operations. The operation logic of the accelerator includes a multiply-accumulate unit MAC (multiple And accumulate) array, a weight matrix buffer, a feature map matrix buffer, And an output feature map matrix buffer. Each MAC in the MAC array carries out unit pipelining operation, and the operation of 1 multiplication and 1 addition is completed in each clock cycle. When the convolutional neural network operation is performed, the MAC array acquires data W and X required by the operation from a memory, the data W and X are respectively expanded and recombined according to the mode in fig. 1 and are respectively stored in a weight matrix buffer and a characteristic diagram matrix buffer of an accelerator, and a result Y is output to an output characteristic diagram matrix buffer after the operation. The accelerator structure realizes the operation of the MAC array which can be extended in a telescopic way on each layer of the convolutional neural network, and has the characteristics of capability of efficiently processing various convolutional operations, small limitation of hardware resources, capability of flexibly configuring the length and the width of the MAC array and the like. The accelerator is an accelerator structure commonly used in the field of the acceleration of the convolutional neural network at present. But limited by application scenarios and cost penalty, the memory access bandwidth of the accelerator becomes an important factor limiting the performance of the accelerator. The difference of the access bandwidth can cause the optimal MAC array configuration and acceleration efficiency of the accelerator to be different, namely, although the total number of the MAC in the MAC array is fixed, the length mc and the width mr of the array need to be determined according to the access bandwidth. Such accelerators are therefore also called multi-bandwidth target accelerators.
At present, most accelerators with multiple bandwidth targets are designed for specific application fields, and the scale of the field application determines the calculation scale, the storage bandwidth, the communication capacity, the execution mode and the like of the accelerator and also determines the performance level of the accelerator to a certain extent. Aiming at memory access bandwidth constraint, an evaluation method based on a structural model is established for the accelerator with multiple bandwidth targets, and the structural design space of the accelerator can be optimized. Therefore, how to realize rapid evaluation of the performance of the multi-bandwidth target accelerator structure including throughput rate and MAC utilization rate is a technical problem that needs to be overcome by those skilled in the art. There is no disclosure or report of a method of performance evaluation involving such accelerators.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-bandwidth target accelerator efficiency testing method, solve the problem of optimizing the structural design space of a multi-bandwidth target accelerator, and quickly evaluate the multi-bandwidth target accelerator structural efficiency including the throughput rate and the MAC utilization rate, wherein the throughput rate represents the number of multiply-accumulate (MAC) operations of the accelerator operated per second. In particular, the present invention provides a method for calculating the performance of such an accelerator by appropriately adapting the multiplication array according to different convolution scales for the accelerator structure proposed in background art 1. The invention can directly evaluate the performance (throughput rate and MAC utilization rate) of the extensible computing system of the accelerator with multiple bandwidth targets under the constraint of multiple target bandwidths.
The specific technical scheme is as follows:
in the first step, a convolutional neural network is selected from currently widely used convolutional neural network models (such as AlexNet, VGG16, and C3D) as an intelligent accelerator performance test case CNN. And preprocessing the test case CNN, and determining the attribute parameters of each layer of the test case CNN. The specific method comprises the following steps:
let the number of test cases CNN be N, which includes Nc convolutional layers, Np pooling layers, and Nfc fully-connected layers. N, Nc, Np, and Nfc are positive integers, and N is Nc + Np + Nfc.
1.1 defines the convolutional neural network layer ordinal variable i to be 1, defines the convolutional layer count loop variable i1 to be 1,
a pooling layer count loop variable i2 is defined as 1 and a fully-connected layer count loop variable i3 is defined as 1.
1.2 if the ith layer of the convolutional neural network is a convolutional layer, turning to the step 1.3; if the ith layer is a pooling layer, go to 1.5
Step (2); if the ith layer is a full connection layer, the step 1.7 is carried out.
1.3 extract the following properties for the i1 th convolutional layer according to the description of CNN:
1.3.1 record input feature map height Ix of i-th layeri1
1.3.2 recording input feature map width Iy of i-th layeri1
1.3.3 recording the number of channels of the input characteristic diagram (i.e. the number of channels of the convolution kernel) Ic of the i-th layeri1
1.3.4 record the i-th layer convolution kernel size ki1
1.3.5 recording the number of convolution kernels (i.e. the number of convolution output channels) of the i-th layer Oci1
1.3.6 recording the number padc of column-column direction fills of the i-th layer input feature mapi1
1.3.7 recording the number of jumps strdc of the i-th layer convolution kerneli1
1.3.8 recording the i-th layer output characteristic diagram high Oxi1
1.3.9 recording the i-th layer output characteristic diagram width Oyi1
1.4. Let i1 be i1+ 1; go to step 1.9.
1.5 extract the following properties for the i2 th pooling layer:
1.5.1 record input Profile height Ipx for layer ii2
1.5.2 recording input feature map width Ipy for layer ii2
1.5.3 record the number of input profile channels Ipc for the i-th layeri2
1.5.4 record the size p of the i-th layer pooling scalei2
1.5.5 recording the row and column direction filling number padp of the input feature map of the ith layeri2
1.5.6 records the number of jumps strdp of the i-th layer pooling operationi2
1.5.7 record the i-th layer output signature height Opxi2
1.5.8 record the i-th layer output signature width Opyi2
1.6 let i2 ═ i2+ 1; turning to the step 1.9;
1.7 extract the following properties of the i3 th fully-connected layer:
1.7.1 recording the number of i-th layer full-connection operation input nodes as Fini3
1.7.2 recording the number of i-th layer full-connection operation output nodes as Fouti3
1.8 let i3 ═ i3+ 1; turning to the step 1.9;
1.9 let i equal i +1, if i equal N, indicate that the set { Ic } has been obtained at this time1,Ic2,…IcNc}、{k1,k2,..kNc}、{strdc1,strdc2,..strdcNc}、{Iy1,Iy2,…IyNc}、{padc1,padc2,…padcNcAnd e, turning to the second step if the set required by the subsequent test is the same, or turning to the 1.2 step if the set required by the subsequent test is not the same.
Secondly, determining the column number and the row number of the MAC array of the accelerator, wherein the method comprises the following steps:
2.1 determine the number of columns mc of the accelerator MAC array, mc being a positive integer.
Because the efficiency of the accelerator is optimal when the computation time ratio of the convolution operation of the accelerator is 1:1, the accelerator has optimal efficiency
Figure BDA0001992591650000046
Figure BDA0001992591650000041
Therefore, the ratio satisfies the formula (1) and the optimal performance is achieved:
Figure BDA0001992591650000042
(1) in the formula, Ic ∈ { Ic }1,Ic2,…IcNc},K∈{k1,k2,..kNc},strdx∈{strdc1,strdc2,..strdcNc},Iy∈{Iy1,Iy2,…IyNc}. BW is accelerator bandwidth (unit is μ bit/s, μ is bit width of operation input data of MAC unit, determined by accelerator designer), F is accelerator MAC array working frequency (its unit isBit Hz, determined by the accelerator designer), BW and F are fixed parameters of the accelerator as long as the accelerator design is complete.
Mc obtained from the formula (1) satisfies the formula (2):
Figure BDA0001992591650000043
recent trends in convolutional neural networks show that convolutions of 3 × 3 and 1 × 1 are used most, while the jumps strdx of the convolution kernel generally take 1. Therefore, when K is preferably 3 and strdx is preferably 1, the accelerator maximizes the throughput for most convolution operations, so mc satisfies equation (3):
Figure BDA0001992591650000044
2.2, determining the number of rows mr of the accelerator MAC array, wherein mr is a positive integer, and the method comprises the following steps:
Figure BDA0001992591650000045
MACmax is the number of MAC units available to the accelerator (determined by the accelerator designer from hardware logic resources).
Thirdly, testing the efficiency of the accelerator on each convolution layer of the test case CNN, wherein the method comprises the following steps:
3.1 make convolution count loop variable i1 equal to 1.
3.2 determine the output parallelism M when the accelerator computes the i1 th convolutional layer (i.e., the number of rows of the MAC actually used when the MAC array computes the i1 th convolutional layer)i1The method comprises the following steps:
Mi1=min(Oci1,mr)。
3.3 test Accelerator transfers strdc from memoryi1×Iyi1×Ici1Buffering the input pixel into input feature map to complete Mi1×Oyi1A sub-convolution operation and dividing Mi1×Oyi1The convolution operation result pixel is buffered by the output characteristic diagramTime of burst transfer back to memory tconvi1The method comprises the following steps:
Figure BDA0001992591650000051
3.4 test accelerator operating i1 convolution calculation time Tconvi1The method comprises the following steps:
Figure BDA0001992591650000052
3.5 test throughput of accelerator operation i1 convolutional layer thconvi1The method comprises the following steps:
Figure BDA0001992591650000053
the throughput rate is the number of multiply-accumulate operations completed by the accelerator per second, and the unit is number/s.
3.6 test MAC utilization U of Accelerator operation ith 1 convolutional layeri1The method comprises the following steps:
Figure BDA0001992591650000054
3.7 when i1 equals i1+1 and i1 equals Nc, it means that the set { Tconv } has been obtained1、Tconv2、…TconvNc}、{thconv1、thconv2、…thconvNcA, and { U }1、U2、…UNcAnd (6) jumping to the fourth step, otherwise, jumping to the 3.2 rd step.
Fourthly, testing the total efficiency of the accelerator on Nc convolution layers of the test case CNN, wherein the method comprises the following steps:
4.1 determining the accelerator to calculate the throughput Thconv of Nc convolutional layers of CNN:
Figure BDA0001992591650000055
4.2 determine that the accelerator calculates the MAC utilization UCmac of Nc convolutional layers of CNN as:
Figure BDA0001992591650000061
fifthly, testing the efficiency of the accelerator on each pooling layer of the test case CNN, wherein the method comprises the following steps:
5.1 let pooling layer count cycle variable i2 equal to 1.
5.2 determine the parallelism (i.e. the number of rows actually used by the MAC array) of the accelerator operating the ith 2 pooling layer by:
5.2.1
Figure BDA00019925916500000611
Figure BDA0001992591650000062
5.2.2
Figure BDA00019925916500000612
Figure BDA0001992591650000063
5.2.3 determining theoretical ratio ratiopool of memory access time and calculation time of i2 th pooling layer calculated by acceleratori2The method comprises the following steps:
Figure BDA0001992591650000064
5.2.4 determining the number Poolpara of pooling operations that the accelerator computes the ith 2 pooling layer can operate on in parallel when using a 1-row multiplication arrayi2The method comprises the following steps:
Figure BDA0001992591650000065
wherein PAD is max (padc)1,padc2,…padcNc) The number of buffer blocks of the single-side input characteristic diagram designed for the accelerator to adapt to the convolution operation,
5.2.5 determining the number of rows of multiplier arrays that can be used by the Accelerator to calculate the maximum parallelism at the ith 2 pooling level
Figure BDA0001992591650000066
Comprises the following steps:
Figure BDA0001992591650000067
5.2.6 determining the number of rows Mpr that the accelerator calculates the i2 pooling layer can actually use the multiplier arrayi2Comprises the following steps:
Mpri2=min(mr,Mpmax);
5.3 Accelerator save Mpr when determining the ith 2 pooling layer of accelerator operationsi2×Opyi2Time tpool of one output pixeli2
Figure BDA0001992591650000068
5.4 determining the calculated time Tp of the ith 2 pooling layer of an accelerator operationi2Comprises the following steps:
Figure BDA0001992591650000069
5.5 determining throughput thool of Accelerator operation i2 pooling layeri2Comprises the following steps:
Figure BDA00019925916500000610
5.6 determining MAC utilization Up of the ith 2 pooling layer of Accelerator operationi2
Figure BDA0001992591650000071
5.7 let i2 be i2+1 and if i2 be Np, indicate that the set { Tp is obtained1、Tp2、…TpNp}、{thpool1、thpool2、…thpoolNpA, and { Up }1、Up2、…UpNpAnd (6) jumping to the sixth step, otherwise, jumping to the 5.2 th step.
Sixthly, testing the total efficiency of the accelerator on Np pooling layers of the test case CNN, wherein the method comprises the following steps:
6.1 determining accelerator calculates the throughput Thpool of Np pooling layers of CNN:
Figure BDA0001992591650000072
6.2 determining that the accelerator calculates the MAC utilization UPmac for Np pooling layers of CNN:
Figure BDA0001992591650000073
seventhly, testing the efficiency of the accelerator on each full connection layer of the test case CNN, wherein the method comprises the following steps:
7.1 let the full connection layer count cycle variable i3 be 1;
7.2 determine the parallelism (i.e. the number of MAC units used) when the accelerator operates the i3 th full link layer by:
7.2.1 determining the theoretical memory access time TthMfc of the i3 th fully-connected layer of the accelerator operationi3Comprises the following steps:
Figure BDA0001992591650000074
7.2.2 determining the theoretical calculated time TthCfc for 1 MAC unit of the accelerator to calculate the i3 th fully connected layeri3Comprises the following steps:
Figure BDA0001992591650000075
7.2.3 determining Accelerator 1 MAC Unit to calculate i3 th Total connection layer latency ratioi3Comprises the following steps:
Figure BDA0001992591650000076
7.2.5 determining the parallelism Mfcr of the i3 full link layer calculated by the acceleratori3Satisfies the following formula:
Figure BDA0001992591650000078
mfc thereinmaxMaximum number of MAC available in parallel for accelerator design for full connectivity layer, MfcmaxIs a positive integer of 0<MfcmaxMr × mc, determined by the accelerator designer.
7.3 determine the throughput rate of the accelerator to calculate the i3 th fully connected layer. If Mfcr i31, the accelerator is explained to calculate the access limit of the i3 th fully-connected layer and the throughput rate thfc of the i3 th fully-connected layeri3Comprises the following steps:
Figure BDA0001992591650000077
go to 7.4.
If Mfcri3>1, the throughput of the i3 th full link layer is:
thfci3≈F×Mfcri3go to 7.4.
7.4 determining time Tfc for accelerator to calculate i3 full connection layeri3Comprises the following steps:
Figure BDA0001992591650000081
7.5 determining Accelerator calculate MAC utilization Ufc for the ith 3 full connectivity layeri3The method comprises the following steps:
Figure BDA0001992591650000082
7.6 let i3 ═ i3+1, if i3 ═ Nfc, indicate that the set { Tfc has been obtained1、Tfc2、…TfcNfc}、{thfc1、thfc2、…thfcNfcA, and Ufc1、Ufc2、…UfcNfcAnd jumping to the eighth step, otherwise, jumping to the 7.2 th step.
Eighthly, testing the total efficiency of the accelerator on the Nfc full-connection layers of the test case CNN, wherein the method comprises the following steps:
8.1 determining the throughput rate Thfc of all the fully-connected layers of the CNN calculated by the accelerator as:
Figure BDA0001992591650000083
8.2 determining that the accelerator calculates the MAC utilization UFCmac of all full connection layers of the CNN as:
Figure BDA0001992591650000084
ninth step, testing the efficiency of the accelerator to all convolution layers, pooling layers and full connection layers of the test case CNN, the method is as follows:
9.1 determine that the accelerator calculates the throughput ThA of all convolutional, pooling, and fully-connected layers of the CNN as:
Figure BDA0001992591650000085
9.2 determining that the MAC utilization UA of all the convolution layers, the pooling layer and the full-connection layer of the CNN calculated by the accelerator is as follows:
Figure BDA0001992591650000086
and step ten, finishing.
The calculation efficiency of the third-fourth step test accelerator on the convolutional layer, the calculation efficiency of the fifth-sixth step test accelerator on the pooling layer, and the calculation efficiency of the seventh-eighth step test accelerator on the full connection layer are all input for the ninth step of full CNN evaluation, and the third-fourth step, the fifth-sixth step and the seventh-eighth step can be executed in parallel or in series without being divided into a plurality of steps.
The invention can achieve the following technical effects:
1. the second step of the invention can reasonably adapt the column number mc and the row number mr of the MAC array according to the characteristics of convolution attributes in different CNNs, the accelerator bandwidth constraint and the accelerator operating frequency to obtain the optimal configuration of the MAC array, thereby maximizing the efficiency of the accelerator.
2. By adopting the method, the throughput rate and the MAC utilization rate of the accelerator under the multi-target bandwidth constraint can be quickly evaluated after the hardware constraint (the number of the MAC resources of the working frequency core) and the target CNN are given, and the structural design space of the accelerator is optimized.
Drawings
FIG. 1 is a diagram illustrating an example of a GEMM calculation method for convolution as described in background art 1;
FIG. 2 is an accelerator structure model of a multi-bandwidth target as described in background 1;
FIG. 3 is a flowchart illustrating the overall method for testing the performance of an accelerator with multiple bandwidth targets according to the present invention.
The specific implementation mode is as follows:
as shown in fig. 3, the present invention comprises the steps of:
the method comprises the following steps of firstly, selecting a convolutional neural network from a convolutional neural network model as an intelligent accelerator efficiency test case CNN, preprocessing the test case CNN, and determining attribute parameters of each layer of the test case CNN, wherein the specific method comprises the following steps:
let the number of layers of the test case CNN be N, which includes Nc convolutional layers, Np pooling layers, and Nfc fully-connected layers, where N, Nc, Np, and Nfc are positive integers and satisfy N ═ Nc + Np + Nfc;
1.1 defines the convolutional neural network layer ordinal variable i to be 1, defines the convolutional layer count loop variable i1 to be 1,
defining a pooling layer counting loop variable i2 to be 1, and defining a full-connection layer counting loop variable i3 to be 1;
1.2 if the ith layer of the convolutional neural network is a convolutional layer, turning to the step 1.3; if the ith layer is a pooling layer, go to 1.5
Step (2); if the ith layer is a full connection layer, turning to the step 1.7;
1.3 the following properties of the i1 th convolutional layer were extracted according to the description of CNN:
1.3.1 record input feature map height Ix of i-th layeri1
1.3.2 recording input feature map width Iy of i-th layeri1
1.3.3 recording the number of channels of the input characteristic diagram of the i-th layer, namely the number of channels Ic of the convolution kerneli1
1.3.4 record the i-th layer convolution kernel size ki1
1.3.5 recording the number of convolution kernels of the i-th layer, i.e. the number of convolution output channels Oci1
1.3.6 recording the number padc of the i-th layer input characteristic diagram column-column direction fillingi1
1.3.7 recording the number of jumps strdc of the i-th layer convolution kerneli1
1.3.8 recording the i-th layer output characteristic diagram high Oxi1
1.3.9 recording the i-th layer output characteristic diagram width Oyi1
1.4. Let i1 be i1+ 1; turning to the step 1.9;
1.5 extract the following properties for the i2 th pooling layer:
1.5.1 record the input profile height Ipx of the i-th layeri2
1.5.2 record input feature map width Ipy for layer ii2
1.5.3 record the number of input profile channels Ipc for the i-th layeri2
1.5.4 record the size p of the i-th layer pooling scalei2
1.5.5 recording the row and column direction filling number padp of the input feature map of the ith layeri2
1.5.6 records the number of jumps strdp of the i-th pooling operationi2
1.5.7 recording the output characteristic of the i-th layerHeight of sign Opxi2
1.5.8 record the i-th layer output signature width Opyi2
1.6 let i2 ═ i2+ 1; turning to the step 1.9;
1.7 extract the following properties of the i3 th fully-connected layer:
1.7.1 recording the number of i-th layer full-connection operation input nodes as Fini3
1.7.2 recording the number of i-th layer full-connection operation output nodes as Fouti3
1.8 let i3 ═ i3+ 1; turning to the step 1.9;
1.9 let i equal i +1, if i equal N, indicate that the set { Ic } has been obtained at this time1,Ic2,…IcNc}、{k1,k2,..kNc}、{strdc1,strdc2,..strdcNc}、{Iy1,Iy2,…IyNc}、{padc1,padc2,…padcNcThe collection required by subsequent tests is transferred to the second step, otherwise, the collection is transferred to the 1.2 step;
secondly, determining the size of an accelerator MAC array, wherein the method comprises the following steps:
2.1 determining the column number mc of the MAC array of the accelerator, wherein mc is a positive integer;
Figure BDA0001992591650000101
K∈{k1,k2,..kNc},strdx∈{strdc1,strdc2,..strdcNcBW is accelerator bandwidth, F is accelerator MAC array working frequency;
2.2 determine the number of rows mr of the accelerator MAC array, mr being a positive integer, the method is:
Figure BDA0001992591650000102
MACmax is the number of MAC units available to the accelerator;
thirdly, testing the efficiency of the accelerator on each convolution layer of the test case CNN, wherein the method comprises the following steps:
3.1 making the convolution layer counting circulation variable i1 equal to 1;
3.2 determining output parallelism M when accelerator calculates i1 th convolution layeri1The method comprises the following steps:
Mi1=min(Oci1,mr);
3.3 test Accelerator transfers strdc from memoryi1×Iyi1×Ici1Buffering the input pixel into input feature map to complete Mi1×Oyi1A sub-convolution operation and dividing Mi1×Oyi1The time tconv for the convolution operation result pixel to be buffered by the output characteristic diagram and transmitted back to the memoryi1The method comprises the following steps:
Figure BDA0001992591650000111
3.4 time Tconv for test accelerator to operate on i1 th convolution layeri1The method comprises the following steps:
Figure BDA0001992591650000112
3.5 test Accelerator operation i1 th layer convolution throughput thconvi1The method comprises the following steps:
Figure BDA0001992591650000113
3.6 test MAC utilization U of Accelerator operation i1 th layer convolutioni1The method comprises the following steps:
Figure BDA0001992591650000114
3.7 let i1 ═ i1+1, and if i1 ═ Nc, indicate that the set { Tconv) has been obtained1、Tconv2、…TconvNc}、{thconv1、thconv2、…thconvNcA, and { U }1、U2、…UNcJumping to the fourth step, otherwise jumping to the 3.2 rd step;
the fourth step of testing the total efficiency of the accelerator to Nc convolutional layers of the test case CNN, the method is as follows:
4.1 determining the accelerator to calculate the throughput Thconv of Nc convolutional layers of CNN:
Figure BDA0001992591650000115
4.2 determine the MAC utilization UCmac of Nc convolutional layers of the accelerator calculation CNN as:
Figure BDA0001992591650000116
and a fifth step of testing the efficiency of the accelerator on each pooling layer of the test case CNN, wherein the method comprises the following steps:
5.1 let pooling layer count cycle variable i2 equal to 1;
5.2 determine the parallelism (i.e. the number of rows actually used by the MAC array) of the accelerator operating the ith 2 pooling layer by:
5.2.1
Figure BDA0001992591650000118
Figure BDA0001992591650000117
5.2.2
Figure BDA0001992591650000129
Figure BDA0001992591650000121
5.2.3 determining theoretical ratio ratiopool of memory access time and calculation time of i2 pooling layer calculated by acceleratori2The method comprises the following steps:
Figure BDA0001992591650000122
5.2.4 determining the number Poolpara of pooling operations that the accelerator computes the ith 2 pooling layer can operate on in parallel when using a 1-row multiplication arrayi2The method comprises the following steps:
Figure BDA0001992591650000123
wherein PAD is max (padc)1,padc2,…padcNc) The number of buffer blocks of the single-side input characteristic diagram designed for the accelerator to adapt to the convolution operation,
5.2.5 determining the number of rows Mp of multiplier arrays that can be used by the accelerator to compute maximum parallelism at the ith 2 pooling levelmaxComprises the following steps:
Figure BDA0001992591650000124
5.2.6 determining the number of rows Mpr that the accelerator calculates the i2 pooling layer can actually use the multiplier arrayi2Comprises the following steps:
Mpri2=min(mr,Mpmax);
5.3 Accelerator save Mpr when determining the ith 2 pooling layer of accelerator operationsi2×Opyi2Time tpool of one output pixeli2
Figure BDA0001992591650000125
5.4 determining the calculated time Tp of the ith 2 pooling layer of an accelerator operationi2Comprises the following steps:
Figure BDA0001992591650000126
5.5 determining throughput thool of Accelerator operation i2 pooling layeri2Comprises the following steps:
Figure BDA0001992591650000127
5.6 determining MAC utilization Up of the ith 2 pooling layer of Accelerator operationi2
Figure BDA0001992591650000128
5.7 let i2 be i2+1 and if i2 be Np, indicate that the set { Tp is obtained1、Tp2、…TpNp}、{thpool1、thpool2、…thpoolNpA, and { Up }1、Up2、…UpNpJumping to the sixth step, otherwise jumping to the 5.2 th step;
the sixth step of testing the total efficiency of the accelerator on Np pooling layers of the test case CNN, the method is:
6.1 determining accelerator calculates the throughput Thpool of Np pooling layers of CNN:
Figure BDA0001992591650000131
6.2 determining that the accelerator calculates the MAC utilization UPmac for Np pooling layers of CNN:
Figure BDA0001992591650000132
seventhly, testing the efficiency of the accelerator on each full connection layer of the test case CNN, wherein the method comprises the following steps:
7.1 let the full connection layer count cycle variable i3 be 1;
7.2 determine the parallelism of the accelerator in operating the i3 th full link layer, the method is:
7.2.1 determining theoretical memory time TthMfc of accelerator calculating i3 th full connection layeri3Comprises the following steps:
Figure BDA0001992591650000133
7.2.2 determining the theoretical calculated time TthCfc for 1 MAC unit of the accelerator to calculate the i3 th fully connected layeri3Comprises the following steps:
Figure BDA0001992591650000134
7.2.3 determining Accelerator 1 MAC Unit to calculate i3 th fully connected layer calculated time ratioi3Comprises the following steps:
Figure BDA0001992591650000135
7.2.5 determining the parallelism Mfcr of the i3 full link layer calculated by the acceleratori3Satisfies the following formula:
Figure BDA0001992591650000139
mfc thereinmaxMaximum number of MAC available in parallel for accelerator design for full connectivity layer, MfcmaxIs a positive integer and 0<Mfcmax≤mr×mc;
7.3 determining the Accelerator to calculate the throughput of the i3 th fully-connected layer if Mfcr i31, the accelerator is explained to calculate the access limit of the i3 th fully-connected layer and the throughput rate thfc of the i3 th fully-connected layeri3Comprises the following steps:
Figure BDA0001992591650000136
go to 7.4;
if Mfcri3>1, the throughput of the i3 th full link layer is:
thfci3≈F×Mfcri3go to 7.4;
7.4 determining time Tfc for accelerator to calculate i3 full connection layeri3Comprises the following steps:
Figure BDA0001992591650000137
7.5 determining Accelerator calculate MAC utilization Ufc for the ith 3 full connectivity layeri3The method comprises the following steps:
Figure BDA0001992591650000138
7.6 let i3 ═ i3+1, if i3 ═ Nfc, indicate that the set { Tfc has been obtained1、Tfc2、…TfcNfc}、{thfc1、thfc2、…thfcNfcA, (b) and (Ufc)1、Ufc2、…UfcNfcJumping to the eighth step, otherwise jumping to the 7.2 th step;
the eighth step of testing the total efficiency of the accelerator on the Nfc full-connection layers of the test case CNN, the method comprises the following steps:
8.1 determining the throughput rate Thfc of all the fully-connected layers of the accelerator operation CNN as:
Figure BDA0001992591650000141
8.2 determining the MAC utilization UFCmac of all full connection layers of the accelerator operation CNN as:
Figure BDA0001992591650000142
the ninth step tests the efficiency of the accelerator to all the convolution layers, pooling layers and full connection layers of the test case CNN, the method is:
9.1 determine the throughput ThA of all convolutional, pooling, and full-link layers of the accelerator operation CNN as:
Figure BDA0001992591650000143
9.2 determining MAC utilization UA of all convolution layers, pooling layers and full connection layers of the accelerator operation CNN as:
Figure BDA0001992591650000144
and finishing the tenth step.
The third to fourth steps, the fifth to sixth steps, and the seventh to eighth steps may be executed in parallel, and the flowchart shown in fig. 3 is a flowchart executed in parallel, or may be serially drawn in the third to eighth steps, but the speed is faster when the parallel execution is executed.

Claims (4)

1. A method for testing the efficiency of an accelerator with multiple bandwidth targets is characterized by comprising the following steps:
the method comprises the following steps of firstly, selecting a convolutional neural network from a convolutional neural network model as an intelligent accelerator efficiency test case CNN, preprocessing the test case CNN, and determining attribute parameters of each layer of the test case CNN, wherein the specific method comprises the following steps:
let the number of layers of the test case CNN be N, which includes Nc convolutional layers, Np pooling layers, and Nfc fully-connected layers, where N, Nc, Np, and Nfc are positive integers and satisfy N ═ Nc + Np + Nfc;
1.1 defines the convolutional neural network layer ordinal variable i to be 1, defines the convolutional layer count loop variable i1 to be 1,
defining a pooling layer counting loop variable i2 to be 1, and defining a full-connection layer counting loop variable i3 to be 1;
1.2 if the ith layer of the convolutional neural network is a convolutional layer, turning to the step 1.3; if the ith layer is a pooling layer, turning to the step 1.5; if the ith layer is a full connection layer, turning to the step 1.7;
1.3 extract the following properties for the i1 th convolutional layer according to the description of CNN:
1.3.1 record input feature map height Ix of i-th layeri1
1.3.2 recording input feature map width Iy of i-th layeri1
1.3.3 recording the number of channels of the input characteristic diagram of the i-th layer, namely the number of channels Ic of the convolution kerneli1
1.3.4 record the i-th layer convolution kernel size ki1
1.3.5 recording the number of convolution kernels of the i-th layer, namely the number of convolution output channels Oci1
1.3.6 recording the number padc of the i-th layer input characteristic diagram column-column direction fillingi1
1.3.7 recording the number of jumps strdc of the i-th layer convolution kerneli1
1.3.8 recording the i-th layer output profile high Oxi1
1.3.9 recording the i-th layer output characteristic diagram width Oyi1
1.4. Let i1 be i1+ 1; turning to the step 1.9;
1.5 extract the following properties for the i2 th pooling layer:
1.5.1 record the input profile height Ipx of the i-th layeri2
1.5.2 recording input feature map width Ipy for layer ii2
1.5.3 record the number of input profile channels Ipc for the i-th layeri2
1.5.4 record the size P of the pooling scale of the i-th layeri2
1.5.5 recording the row and column direction filling number padp of the input feature map of the ith layeri2
1.5.6 records the number of jumps strdp of the i-th pooling operationi2
1.5.7 record the i-th layer output signature height Opxi2
1.5.8 record the i-th layer output signature width Opyi2
1.6 let i2 ═ i2+ 1; turning to the step 1.9;
1.7 extract the following properties of the i3 th fully-connected layer:
1.7.1 recording the number of i-th layer full-connection operation input nodes as Fini3
1.7.2 recording the number of i-th layer full-connection operation output nodes as Fouti3
1.8 let i3 ═ i3+ 1; turning to the step 1.9;
1.9 let i equal i +1, if i equal N, indicate that the set { Ic } has been obtained at this time1,Ic2,...IcNc}、{k1,k2,..kNc}、{strdc1,strdc2,..strdcNc}、{Iy1,Iy2,...IyNc}、{padc1,padc2,...padcNcThe collection needed by the subsequent test is transferred to the second step, otherwise, the step 1.2 is transferred;
secondly, determining the column number and the row number of the MAC array of the accelerator, wherein the method comprises the following steps:
2.1 determining the column number mc of the MAC array of the accelerator, wherein mc is a positive integer;
Figure FDA0003651303940000021
K∈{k1,k2,..kNc},strdx∈{strdc1,strdc2,..strdcNcBW is accelerator bandwidth, F is accelerator MAC array working frequency;
2.2, determining the number of rows mr of the accelerator MAC array, wherein mr is a positive integer, and the method comprises the following steps:
Figure FDA0003651303940000022
MACmax is the number of MAC units available to the accelerator;
thirdly, testing the efficiency of the accelerator on each convolution layer of the test case CNN, wherein the method comprises the following steps:
3.1 making the convolution layer counting circulation variable i1 equal to 1;
3.2 determining output parallelism M when accelerator calculates i1 th convolution layeri1The method comprises the following steps:
Mi1=min(Oci1,mr);
3.3 test Accelerator transfers strdc from memoryi1×Iyi1×Ici1Buffering the input pixel into input feature map to complete Mi1×Oyi1A sub-convolution operation and dividing Mi1×Oyi1The time tconv for the convolution operation result pixel to be buffered by the output characteristic diagram and transmitted back to the memoryi1The method comprises the following steps:
Figure FDA0003651303940000023
3.4 time Tconv for test accelerator to operate on i1 th convolution layeri1The method comprises the following steps:
Figure FDA0003651303940000024
3.5 test throughput of accelerator operation i1 convolutional layer thconvi1The method comprises the following steps:
Figure FDA0003651303940000031
the throughput rate is the number of multiply-accumulate operations completed by the accelerator per second, and the unit is number/s;
3.6 test MAC utilization U of Accelerator operation ith 1 convolutional layeri1The method comprises the following steps:
Figure FDA0003651303940000032
3.7 let i1 ═ i1+1, and if i1 ═ Nc, indicate that the set { Tconv) has been obtained1、Tconv2、...TconvNc}、{thconv1、thconv2、...thconvNcA, and { U }1、U2、...UNcJumping to the fourth step, otherwise jumping to the 3.2 rd step;
fourthly, testing the total efficiency of the accelerator on Nc convolution layers of the test case CNN, wherein the method comprises the following steps:
4.1 determine the throughput Thconv of the accelerator computing Nc convolutional layers of CNN:
Figure FDA0003651303940000033
4.2 determine that the accelerator calculates the MAC utilization UCmac of Nc convolutional layers of CNN as:
Figure FDA0003651303940000034
fifthly, testing the efficiency of the accelerator on each pooling layer of the test case CNN, wherein the method comprises the following steps:
5.1 let pooling layer count cycle variable i2 equal to 1;
5.2 determine the parallelism (i.e. the number of rows actually used by the MAC array) of the accelerator operating the ith 2 pooling layer by:
5.2.1 Accelerator calculation
Figure FDA0003651303940000035
Figure FDA0003651303940000036
5.2.2 Accelerator calculation
Figure FDA0003651303940000037
Figure FDA0003651303940000038
5.2.3 determining theoretical ratio ratiopool of memory access time and calculation time of i2 th pooling layer calculated by acceleratori2The method comprises the following steps:
Figure FDA0003651303940000039
5.2.4 determining the number Poolpara of pooling operations that the accelerator computes the ith 2 pooling layer can operate on in parallel when using a 1-row multiplication arrayi2The method comprises the following steps:
Figure FDA00036513039400000310
wherein PAD is max (padc)1,padc2,...padcNc) The number of buffer blocks of the single-side input characteristic diagram designed for the accelerator to adapt to the convolution operation,
5.2.5 determining the number of rows Mp of multiplier arrays that can be used by the accelerator to compute maximum parallelism at the ith 2 pooling levelmaxComprises the following steps:
Figure FDA0003651303940000041
5.2.6 determining the number of rows Mpr that the accelerator calculates the i2 pooling layer can actually use the multiplier arrayi2Comprises the following steps:
Mpri2=min(mr,Mpmax);
5.3 Accelerator save Mpr when determining the ith 2 pooling layer of accelerator operationsi2×Opyi2Time tpool of one output pixeli2
Figure FDA0003651303940000042
5.4 determining the time Tp of the ith 2 pooling layer of an accelerator operationi2Comprises the following steps:
Figure FDA0003651303940000043
5.5 determining throughput threshold of the ith 2 pooling layer of Accelerator operationi2Comprises the following steps:
Figure FDA0003651303940000044
5.6 determining MAC utilization Up of the ith 2 pooling layer of Accelerator operationi2
Figure FDA0003651303940000045
5.7 let i2 be i2+1 and if i2 be Np, indicate that the set { Tp is obtained1、Tp2、...TpNp}、{thpool1、thpool2、...thpoolNpA, and { Up }1、Up2、...UpNpSkipping to the sixth step, otherwise, skipping to the 5.2 step;
sixthly, testing the total efficiency of the accelerator on Np pooling layers of the test case CNN, wherein the method comprises the following steps:
6.1 determining the throughput Thpool of Np pooling layers of the accelerator operation CNN:
Figure FDA0003651303940000046
6.2 determining the MAC utilization UPmac of Np pooling layers of accelerator operations CNN:
Figure FDA0003651303940000047
seventhly, testing the efficiency of the accelerator on each full connection layer of the test case CNN, wherein the method comprises the following steps:
7.1 let full connection layer count cycle variable i3 equal to 1;
7.2 determine the parallelism of the accelerator in operating the i3 th full link layer, the method is:
7.2.1 determining the theoretical memory access time TthMfc of the i3 th fully-connected layer of the accelerator operationi3Comprises the following steps:
Figure FDA0003651303940000051
7.2.2 determining the theoretical calculation time TthCfc for 1 MAC unit of the accelerator to calculate the i3 th fully connected layeri3Comprises the following steps:
Figure FDA0003651303940000052
7.2.3 determining Accelerator 1 MAC Unit to calculate i3 th Total connection layer latency ratioi3Comprises the following steps:
Figure FDA0003651303940000053
7.2.5 determining the parallelism Mfcr of the i3 full link layer calculated by the acceleratori3Satisfies the following formula:
Figure FDA0003651303940000054
mfc thereinmaxMaximum number of MAC available in parallel for accelerator design for full connectivity layer, MfcmaxIs a positive integer and 0 < Mfcmax≤mr×mc;
7.3 test the throughput of accelerator operating the i3 full link layer if Mfcri31, the accelerator is explained to calculate the access limit of the i3 th fully-connected layer and the throughput rate thfc of the i3 th fully-connected layeri3Comprises the following steps:
Figure FDA0003651303940000055
go to 7.4;
if Mfcri3If the throughput rate of the i3 th full connection layer is more than 1:
thfci3≈F×Mfcri3go to 7.4;
7.4 test time Tfc for accelerator to operate full connection layer i3i3Comprises the following steps:
Figure FDA0003651303940000056
7.5 test Accelerator operation ith 3 MAC utilization rate Ufc of full connection layeri3The method comprises the following steps:
Figure FDA0003651303940000057
7.6 let i3 ═ i3+1, if i3 ═ Nfc, indicate that the set { Tfc has been obtained1、Tfc2、...TfcNfc}、
Figure FDA0003651303940000058
Figure FDA0003651303940000059
Ke { Ufc1、Ufc2、...UfcNfcJumping to the eighth step, otherwise jumping to the 7.2 step;
eighthly, testing the total efficiency of the accelerator on the Nfc full-connection layers of the test case CNN, wherein the method comprises the following steps:
8.1 determining the throughput rate Thfc of all the fully-connected layers of the accelerator operation CNN as:
Figure FDA00036513039400000510
8.2 determining MAC utilization rates UFCmac of all full connection layers of accelerator operation CNN as:
Figure FDA0003651303940000061
ninth step, testing the efficiency of the accelerator to all the convolution layers, pooling layers and full connection layers of the test case CNN, the method is:
9.1 determine the throughput ThA of all convolutional, pooling, and full-link layers of the accelerator operation CNN as:
Figure FDA0003651303940000062
9.2 determining MAC utilization UA of all convolution layers, pooling layers and full connection layers of the accelerator operation CNN as:
Figure FDA0003651303940000063
and finishing the tenth step.
2. The method for testing the efficiency of the accelerator with multiple bandwidth targets of claim 1, wherein the third to fourth, fifth to sixth, and seventh to eighth steps are executed in parallel.
3. The multi-bandwidth target accelerator performance testing method of claim 1, wherein the convolutional neural network model of the first step comprises AlexNet, VGG16, C3D.
4. The method for testing the efficiency of the accelerator with multiple bandwidth targets of claim 1, wherein in step 2.1, K is 3, strdx is 1, and mc satisfies the following conditions:
Figure FDA0003651303940000064
CN201910185133.3A 2019-03-12 2019-03-12 Multi-bandwidth target accelerator efficiency testing method Active CN109918281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910185133.3A CN109918281B (en) 2019-03-12 2019-03-12 Multi-bandwidth target accelerator efficiency testing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910185133.3A CN109918281B (en) 2019-03-12 2019-03-12 Multi-bandwidth target accelerator efficiency testing method

Publications (2)

Publication Number Publication Date
CN109918281A CN109918281A (en) 2019-06-21
CN109918281B true CN109918281B (en) 2022-07-12

Family

ID=66964319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910185133.3A Active CN109918281B (en) 2019-03-12 2019-03-12 Multi-bandwidth target accelerator efficiency testing method

Country Status (1)

Country Link
CN (1) CN109918281B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111242314B (en) * 2020-01-08 2023-03-21 中国信息通信研究院 Deep learning accelerator benchmark test method and device
CN114169514B (en) * 2022-02-14 2022-05-17 浙江芯昇电子技术有限公司 Convolution hardware acceleration method and convolution hardware acceleration circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530224A (en) * 2013-06-26 2014-01-22 郑州大学 Harris corner detecting software system based on GPU
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN109102065A (en) * 2018-06-28 2018-12-28 广东工业大学 A kind of convolutional neural networks accelerator based on PSoC
CN109245773A (en) * 2018-10-30 2019-01-18 南京大学 A kind of decoding method based on block circulation sparse matrix neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614354B2 (en) * 2015-10-07 2020-04-07 Altera Corporation Method and apparatus for implementing layers on a convolutional neural network accelerator

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530224A (en) * 2013-06-26 2014-01-22 郑州大学 Harris corner detecting software system based on GPU
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
CN109102065A (en) * 2018-06-28 2018-12-28 广东工业大学 A kind of convolutional neural networks accelerator based on PSoC
CN109245773A (en) * 2018-10-30 2019-01-18 南京大学 A kind of decoding method based on block circulation sparse matrix neural network

Also Published As

Publication number Publication date
CN109918281A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109918281B (en) Multi-bandwidth target accelerator efficiency testing method
US11003985B2 (en) Convolutional neural network system and operation method thereof
CN108564168B (en) Design method for neural network processor supporting multi-precision convolution
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20180204110A1 (en) Compressed neural network system using sparse parameters and design method thereof
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
KR102335955B1 (en) Convolution neural network system and operation method thereof
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN111445012A (en) FPGA-based packet convolution hardware accelerator and method thereof
US11120101B2 (en) Matrix multiplication system and method
CN106570559A (en) Data processing method and device based on neural network
KR20220015813A (en) Method and apparatus for performing deep learning operations.
CN111767994B (en) Neuron computing device
CN110543936B (en) Multi-parallel acceleration method for CNN full-connection layer operation
CN111079923A (en) Spark convolution neural network system suitable for edge computing platform and circuit thereof
US11928176B2 (en) Time domain unrolling sparse matrix multiplication system and method
WO2022110386A1 (en) Data processing method and artificial intelligence processor
CN113869507B (en) Neural network accelerator convolution calculation device and method based on pulse array
CN111160534A (en) Binary neural network forward propagation frame suitable for mobile terminal
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN109086879B (en) Method for realizing dense connection neural network based on FPGA
CN110580519A (en) Convolution operation structure and method thereof
CN113344179A (en) IP core of binary convolution neural network algorithm based on FPGA
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
US20230229917A1 (en) Hybrid multipy-accumulation operation with compressed weights

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant