CN112306951B - CNN-SVM resource efficient acceleration architecture based on FPGA - Google Patents


Info

Publication number
CN112306951B
CN112306951B (application CN202011252879.0A)
Authority
CN
China
Prior art keywords
svm, cnn, bram, architecture, input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011252879.0A
Other languages
Chinese (zh)
Other versions
CN112306951A (en)
Inventor
付平 (Fu Ping)
吴瑞东 (Wu Ruidong)
刘冰 (Liu Bing)
周彦臻 (Zhou Yanzhen)
高丽娜 (Gao Lina)
王宾涛 (Wang Bintao)
陈浩林 (Chen Haolin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202011252879.0A
Publication of CN112306951A
Application granted
Publication of CN112306951B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to an FPGA-based CNN-SVM resource-efficient acceleration architecture in the technical field of embedded target classification and detection. The architecture comprises a processor system and a programmable logic system. The CNN-SVM streaming-architecture accelerator is built from a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes. The acceleration architecture fully exploits data reuse and suits the different layer types of the CNN-SVM hybrid algorithm, including CNN layers, FC fully connected layers, and the SVM. In addition, the pipeline interval of the general operator designed into the acceleration architecture can be kept at a single clock cycle, which improves the computing efficiency of the accelerator.

Description

CNN-SVM resource efficient acceleration architecture based on FPGA
Technical Field
The invention relates to the technical field of embedded target classification and detection, and in particular to an FPGA-based CNN-SVM resource-efficient acceleration architecture.
Background
Convolutional neural networks (CNNs) are currently widely used in classification, detection, and recognition applications, and hybrid network structures composed of a CNN and a conventional machine learning algorithm (such as the support vector machine, SVM) attract great interest in practice owing to their robustness, high classification accuracy, and suitability for small-sample training. With the demand for low power consumption, high performance, and light weight, acceleration of the hybrid network (CNN-SVM) on embedded platforms has become a current research focus.
In early studies on CNN acceleration, the roofline model was proposed to balance resources against performance, providing theoretical guidance on matching throughput to bandwidth. To overcome the bandwidth limitation, weight parameters are typically buffered in on-chip memory. As network depth grows, however, the limited on-chip memory runs short. One effective approach is to exploit the data-reuse characteristics of the convolution process. Based on this idea, hierarchical storage structures and ping-pong buffers are used to mitigate the external-memory bandwidth limit, for example by exploring peak bandwidth occupancy and reordering data to buffer parameters, which effectively improves throughput once the bandwidth limit is broken. These acceleration strategies can be summarized as: (1) loop tiling to reduce memory conflicts; (2) on-chip buffering to realize data reuse; (3) storing all parameters on chip so that no bandwidth is occupied. All of them rest on the discovery of bandwidth and data reuse. However, the throughput improvement depends on consuming a large number of DSP units, so the effective utilization of the DSPs is insufficient.
Fast convolution algorithms are another effective route to higher throughput: they reduce resource usage by replacing the original convolution structure. Winograd convolution greatly reduces arithmetic complexity and improves efficiency, and newer fast convolution algorithms include frequency-domain convolution via overlap-and-add and fast finite impulse response (FIR) algorithms that let limited resources support more convolutions. Although fast convolution algorithms can unlock acceleration potential, most target a special convolution structure and do not suit hybrid networks. In addition, some convolution acceleration methods alter the original pipeline structure, tightening the timing requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme:
an FPGA-based CNN-SVM resource efficient acceleration architecture, the architecture comprising a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM streaming-architecture accelerator is based on a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes in which the vertical direction carries the output-channel parallelism Tn and the horizontal direction carries the output-feature-map parallelism Tc; under data multiplexing, the input feature map and the weights are reused Tn and Tc times, respectively; the two-dimensional array can be expanded into a three-dimensional array to further raise data reusability; a single node comprises a multiply-add (MA) tree and a dedicated accumulator (ACC), the inputs of the MA tree realize the input-channel parallelism Tm and the depth of the MA tree is Tm, the dedicated accumulator ACC automatically adjusts the accumulation terms according to the convolution kernel size K or the layer type and produces the accumulation result, and the acceleration operator structure performs Tc×Tn×Tm multiply-accumulate operations per cycle.
Preferably, when the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
Preferably, the number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image BDA0002772178060000021; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
Preferably, the FPGA-based CNN-SVM resource-efficient acceleration architecture determines resource consumption through an operator resource evaluation model. The parameters of a single node in the model are the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image BDA0002772178060000022; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images BDA0002772178060000023 and BDA0002772178060000031; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number.
The DSP estimate depends on the node parameters and the multiplexing counts and affects the peak throughput; since the number of DSP units in the hardware resources is usually limited, the achievable peak throughput is a fixed value.
Preferably, when the FPGA-based CNN-SVM resource-efficient acceleration architecture accelerates, BRAM is used for temporary buffering and parallel expansion, with the smallest BRAM structure, BRAM_18K, serving as the basic unit of the evaluation model; the direct memory access (DMA) transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
Preferably, based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism. The BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images BDA0002772178060000033 and BDA0002772178060000034; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
The invention has the following beneficial effects:
the invention provides a resource efficient acceleration architecture which can fully utilize the data reuse characteristic and is suitable for different types of layers including CNN, FC full-connection layers and SVM in a CNN-SVM hybrid algorithm. Furthermore, the pipeline interval of the general operators designed in the acceleration architecture can be kept in a single clock cycle, so that the computing efficiency of the accelerator can be improved.
Drawings
FIG. 1 is a diagram of a generic operator structure;
FIG. 2 is a schematic diagram of data transmission of a generic operator;
FIG. 3 is a resource-efficient accelerator architecture diagram for a CNN-SVM hybrid algorithm using a generic operator design.
Detailed Description
The present invention will be described in detail with reference to specific examples.
The first embodiment is as follows:
As shown in FIG. 1 to FIG. 3, the invention provides an FPGA-based CNN-SVM resource-efficient acceleration architecture that comprises a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
the key point of the invention is to provide a general two-dimensional multiplication and addition array operator, and construct a resource efficient acceleration framework suitable for a CNN-SVM algorithm based on the operator. The operator can be applied to layers of different types based on the high data reuse characteristic, and meanwhile, a high-efficiency complete flow production line is ensured. In addition, the resource evaluation model is constructed according to the operator, the resource consumption and the estimation of the required time of the Block Random Access Memory (BRAM) and the Digital Signal Processing (DSP) can be accurately obtained, the resource utilization rate and the calculation efficiency are greatly improved under the guidance of the model, and the hardware acceleration of the CNN-SVM algorithm is completed.
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM flow architecture accelerator is based on a general acceleration operator structure, the acceleration operator structure is a two-dimensional array formed by a plurality of multiply-accumulate MAC nodes, wherein the vertical direction is the parallelism Tn of an output channel, the horizontal direction is the parallelism Tc of an output characteristic diagram, and under the condition of data multiplexing, the multiplexing times of an input characteristic diagram and weight are Tn and Tc respectively; the two-dimensional array is expanded into a three-dimensional array, so that the data reusability is improved; the single node comprises a multiplication and addition MA tree and a special accumulator ACC, the input of the MA tree realizes the parallelism Tm of an input channel, the depth of the MA tree is Tm, the special accumulator ACC automatically adjusts accumulation items according to the size K of a convolution kernel or the layer type and generates accumulation results, and the acceleration operator structure carries out TcxTnxTm multiplication and accumulation operations for times.
When the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
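The patent does not fix the multiclass scheme behind the voting decision module; the sketch below assumes a one-vs-one SVM in which each ACC output is the decision value of one binary classifier:

    import numpy as np
    from itertools import combinations

    def svm_vote(acc_scores, num_classes=10):
        """Majority vote over one-vs-one decision values (assumed scheme).

        acc_scores[i] is the ACC output of the i-th binary SVM; pairs are
        enumerated as (0,1), (0,2), ..., so len(acc_scores) == C(num_classes, 2).
        """
        votes = np.zeros(num_classes, dtype=int)
        for score, (a, b) in zip(acc_scores, combinations(range(num_classes), 2)):
            votes[a if score > 0 else b] += 1
        return int(np.argmax(votes))

    print(svm_vote(np.random.randn(45)))  # 45 = C(10, 2) binary SVMs for 10 classes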
Therefore, the general operator structure provided by the invention applies to different types of network layers and fully multiplexes the input data. Since the operator structure relates only to the dimensions of the output feature map, it also applies to convolution kernels of other sizes, independent of the input feature map size or the convolution stride. For some special convolution forms, only the order of the MA tree and the ACC needs to be changed or adjusted. The number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image BDA0002772178060000051; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
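The formula itself survives only as an image, but dimensional analysis suggests its shape: one layer needs N·Rout·Cout·M·K² multiply-accumulates (N, the number of output channels, is a symbol added here for the sketch), and a fully pipelined array retires Tm·Tn·Tr·Tc of them per cycle, which would give, as an assumption:

    T \;\approx\; K^{2}
        \left\lceil \frac{M}{T_m} \right\rceil
        \left\lceil \frac{N}{T_n} \right\rceil
        \left\lceil \frac{R_{out}}{T_r} \right\rceil
        \left\lceil \frac{C_{out}}{T_c} \right\rceil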
To make full use of the limited hardware resources, the FPGA-based CNN-SVM resource-efficient acceleration architecture determines resource consumption through an operator resource evaluation model. The parameters of a single node in the model are the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image BDA0002772178060000052; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images BDA0002772178060000053 and BDA0002772178060000054; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number.
it can be seen that the DSP estimate depends on the node parameters and the number of multiplexes and directly affects the peak throughput. Usually the number of DSP units in the hardware resources is limited and the peak throughput that can be achieved is a fixed value. When different network deployments are used, the core difference is the utilization rate of the BRAM.
BRAM is typically used for temporary buffering and parallel expansion during the deployment phase of accelerator design. Mining the potential parallelism of the BRAM is an effective way to save memory resources, so a BRAM resource estimation model needs to be analyzed for deployment. The invention takes the smallest BRAM structure (BRAM_18K) as the basic unit of the evaluation model. FIG. 2 is a schematic diagram of the data transmission of the operator structure under feature-map reuse. First, direct memory access (DMA) transfers the input feature map to the WriteBRAM module, which writes it into BRAM_18K in a multi-bit parallel mode. After writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module. The ReadBRAM module reads the feature map from BRAM_18K and sends it to the operator structure. In FIG. 2, the read and write modules work independently in a cyclically alternating manner, making maximal use of the BRAM bit width and the efficiency of the operator structure.
When the FPGA-based CNN-SVM resource-efficient acceleration architecture accelerates, the BRAM is accordingly used for temporary buffering and parallel expansion with BRAM_18K as the basic unit of the evaluation model: the DMA transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
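A toy behavioural model of this WriteBRAM/ReadBRAM handshake (a sketch of the described ping-pong alternation, not the RTL):

    DEPTH = 512                # typical BRAM_18K depth in 32-bit mode
    HALF = DEPTH // 2
    bram = [0] * DEPTH

    def write_half(words, half):
        """WriteBRAM fills one half of the buffer, then raises the half-full flag."""
        bram[half * HALF : half * HALF + len(words)] = words
        return True            # the flag sent to ReadBRAM

    def read_half(half):
        """ReadBRAM drains one half and feeds it to the operator structure."""
        return bram[half * HALF : (half + 1) * HALF]

    flag = write_half(list(range(HALF)), half=0)
    if flag:                   # the halves alternate: read 0 while writing 1
        to_operator = read_half(0)
        write_half(list(range(HALF, DEPTH)), half=1)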
Based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism. The BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images BDA0002772178060000061 and BDA0002772178060000062; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
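The sketch below mirrors the limiting behaviour just stated for the feature-map-reuse case: once the depth term collapses to one, the count depends only on the bit width and the parallelism. The exact composition is an assumption, since the patent's formula survives only as an image:

    import math

    DEPTH = 512   # typical BRAM_18K depth in 32-bit independent read/write mode

    def bram_fmap_reuse(Rin, Cin, Tr, Tc, Width):
        banks = Tr * Tc                       # one parallel storage unit per lane
        width_factor = math.ceil(Width / 32)  # BRAM_18Ks needed to span one word
        depth_factor = max(1, math.ceil(Rin * Cin / (banks * DEPTH)))
        return banks * width_factor * depth_factor

    # Small feature map: depth_factor == 1, so only Width and Tr*Tc matter.
    print(bram_fmap_reuse(Rin=28, Cin=28, Tr=1, Tc=7, Width=32))  # -> 7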
The design of the resource-efficient acceleration architecture for the CNN-SVM algorithm comprises a processor system (PS) and programmable logic (PL). As shown in FIG. 3, the processor system implements the control logic and external interfaces, including the application processor unit (APU), responsible for data scheduling and program control; the SD input/output controller (SDIO) and double-data-rate memory (DDR4), storing the external input data set and the network parameter files; and the universal asynchronous receiver/transmitter (UART), responsible for monitoring the output results and computation time of the algorithm. The PS also controls the state of the PL through the Advanced eXtensible Interface (AXI) HPM0 port. The programmable logic part implements the CNN-SVM hardware accelerator, the core of the system architecture, constructed with the proposed general operator; the accelerator accesses the external memory through the AXI HP ports to perform data transfers with it.
The programmable logic part in FIG. 3 implements the CNN-SVM hardware accelerator, which performs the forward-inference computation of the CNN-SVM algorithm and is the core of the system architecture. The accelerator is designed as a streaming architecture in which the whole network is instantiated on chip: the different network layers of the algorithm (such as the convolutional layers, pooling layers, and SVM) are mapped onto mutually independent computing units built from the proposed general operator, each responsible for the inference computation of its network layer. Within each computing unit several levels of parallelism can be realized, including the feature-map level, the input-channel level, and the output-channel level, and pipeline optimization between the different computing units raises the inter-layer computational parallelism.
In addition, a control interface is designed between the accelerator and the processor, through which the processor system can start the accelerator (Start), view its state (Status), view the iteration count of the computation (Iteration), view the offset of the write-back data address (Offset), and so on. DMA0 and DMA1 are the accelerator's data-transfer modules: during computation the accelerator has DMA0 read the input image from the external memory, while DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit. Data transmission inside the accelerator is based on the AXI-Stream interface, a protocol of the AXI general bus capable of high-speed continuous data transfer. In its final layer, the accelerator can select a fully connected (FC) layer or the SVM as the classifier as desired. After the accelerator finishes computing, the classification result output by the algorithm is written to the external memory.
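A hypothetical host-side sequence over this control interface, written against a PYNQ-style MMIO handle; the register offsets below are invented for illustration, since the patent names the registers but not their addresses:

    REG = {"Start": 0x00, "Status": 0x04, "Iteration": 0x08, "Offset": 0x0C}  # assumed map

    def run_accelerator(mmio, writeback_offset):
        """Configure, start, and wait out one run of the accelerator."""
        mmio.write(REG["Offset"], writeback_offset)  # write-back data address offset
        mmio.write(REG["Start"], 1)                  # start the accelerator
        while mmio.read(REG["Status"]) & 0x1 == 0:   # poll Status until done
            pass
        return mmio.read(REG["Iteration"])           # completed iteration count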
The second embodiment is as follows:
In this embodiment, the CNN-SVM algorithm is accelerated with the proposed resource-efficient acceleration architecture. As shown in FIG. 3, the input pictures and the parameter configuration file of the CNN-SVM algorithm are stored on the off-chip SD card; the input pictures come from the MNIST data set, which contains 10000 test pictures. The processor system controls DMA0 to feed input pictures to the accelerator in the programmable logic part and controls DMA1 to configure the parameters of each computing unit of the accelerator.
The input of the CNN-SVM streaming-architecture accelerator in the programmable logic part is a 28×28×1 picture. The convolution calculation units in the accelerator (convolution 1, convolution 2, convolution 3) compute the convolution operations of the algorithm, and the pooling calculation units (pooling 1, pooling 2, pooling 3) compute the pooling layers merged with the activation functions. Results are passed between the computing units as shown in FIG. 2, with the specific dimensions of the transferred data shown in FIG. 3. After the CNN part of the algorithm completes, the feature vector extracted by the CNN is flattened and fed into the classifier (SVM), which finally outputs the classification result; parameters such as the convolution kernels of the convolution calculation units and the weights of the classifier are configured by the processor system.
The computing units of each layer of the accelerator are implemented with the proposed general operator, constructed according to the specific parallelism of each layer. As shown in Table 1, where R and C are the height and width of the feature map and M and N are the channel counts of the output and input feature maps, the operator structure used by each computing unit is determined by its parallelism (Tc, Tm, Tn); the general operator structure therefore differs between computing units, which illustrates the operator's generality and extensibility: computing units can be built from it according to the specific parameters of the algorithm's different layers. In addition, in this embodiment the usage of BRAM and DSP is evaluated with the resource evaluation model of the designed generic operator; as shown in Table 1, 13.5 BRAM_18Ks and 316 DSPs are used, the CNN-SVM algorithm consumes 297 clock cycles, and the clock frequency is set to 100 MHz.
TABLE 1 Parallelism parameters for different layers of CNN-SVM
[Table 1 appears only as an image in the source.]
The corresponding resource consumption from board-level verification in this embodiment is shown in Table 2. Because this embodiment is implemented on the XAZU3EG platform, the Available column in the table lists that platform's resource limits. Since DSP utilization is the key to the accelerated architecture's throughput, this embodiment focuses on it; from the implementation results, DSP utilization exceeds 80%. Ideally DSP utilization could reach 100%, but as the network parallelism parameters keep expanding, the DSP demand grows sharply to extend parallelism across the whole deployed network. In other words, the maximum DSP utilization depends on the amount of limited resources and the network structure, which is why not all DSPs can be used in a deployment. On the other hand, since the operator in the architecture uses the data transmission method of FIG. 2 to fully exploit the BRAM bit width, the BRAM utilization in the implementation results is low. In addition, the computation time for the 10000 test pictures accelerated by the CNN-SVM algorithm in this embodiment is 30.1 ms, with a power consumption of only 3.42 W. According to the experimental results, this embodiment fully utilizes the on-chip DSP resources and completes the resource-efficient acceleration of the CNN-SVM algorithm while consuming only a very small amount of BRAM resources.
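These figures are mutually consistent: reading the 297 clock cycles reported in the first embodiment as the per-picture pipeline interval gives

    \frac{297~\text{cycles}}{100~\text{MHz}} = 2.97~\mu\text{s per picture},
    \qquad 10000 \times 2.97~\mu\text{s} = 29.7~\text{ms} \approx 30.1~\text{ms measured.}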
TABLE 2 Resource usage of CNN-SVM
[Table 2 appears only as an image in the source.]
The above description is only a preferred embodiment of the FPGA-based CNN-SVM resource-efficient acceleration architecture, and its protection scope is not limited to the above embodiments; all technical solutions under this idea belong to the protection scope of the invention. It should be noted that modifications and variations that do not depart from the gist of the invention will occur to those skilled in the art and are intended to be within the scope of the invention.

Claims (6)

1. An FPGA-based CNN-SVM resource-efficient acceleration architecture, characterized in that the architecture comprises a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM streaming-architecture accelerator is based on a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes in which the vertical direction carries the output-channel parallelism Tn and the horizontal direction carries the output-feature-map parallelism Tc; under data multiplexing, the input feature map and the weights are reused Tn and Tc times, respectively; the two-dimensional array can be expanded into a three-dimensional array to further raise data reusability; a single node comprises a multiply-add (MA) tree and a dedicated accumulator (ACC), the inputs of the MA tree realize the input-channel parallelism Tm and the depth of the MA tree is Tm, the dedicated accumulator ACC automatically adjusts the accumulation terms according to the convolution kernel size K or the layer type and produces the accumulation result, and the acceleration operator structure performs Tc×Tn×Tm multiply-accumulate operations per cycle.
2. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: when the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
3. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 2, characterized in that the number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image FDA0002772178050000011; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
4. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: the architecture determines resource consumption through an operator resource evaluation model, the parameters of a single node in the model being the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image FDA0002772178050000021; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images FDA0002772178050000022 and FDA0002772178050000023; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number;
the DSP estimate depends on the node parameters and the multiplexing counts and affects the peak throughput; since the number of DSP units in the hardware resources is usually limited, the achievable peak throughput is a fixed value.
5. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: when the architecture accelerates, BRAM is used for temporary buffering and parallel expansion, with the smallest BRAM structure, BRAM_18K, serving as the basic unit of the evaluation model; the direct memory access DMA transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
6. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism; the BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images FDA0002772178050000025 and FDA0002772178050000026; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
CN202011252879.0A 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA Active CN112306951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011252879.0A CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011252879.0A CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Publications (2)

Publication Number Publication Date
CN112306951A CN112306951A (en) 2021-02-02
CN112306951B true CN112306951B (en) 2022-03-22

Family

ID=74325704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011252879.0A Active CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Country Status (1)

Country Link
CN (1) CN112306951B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801285B (en) * 2021-02-04 2024-01-26 南京微毫科技有限公司 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
CN112989731B (en) * 2021-03-22 2023-10-13 湖南大学 Integrated circuit modeling acquisition method and system based on abstract syntax tree

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934335A (en) * 2019-03-05 2019-06-25 清华大学 High-speed railway track switch method for diagnosing faults based on interacting depth study
CN111832276A (en) * 2019-04-23 2020-10-27 国际商业机器公司 Rich message embedding for conversation deinterlacing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8131659B2 (en) * 2008-09-25 2012-03-06 Microsoft Corporation Field-programmable gate array based accelerator system
US9971953B2 (en) * 2015-12-10 2018-05-15 Intel Corporation Visual recognition using deep learning attributes
US10970080B2 (en) * 2018-02-08 2021-04-06 Marvell Asia Pte, Ltd. Systems and methods for programmable hardware architecture for machine learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934335A (en) * 2019-03-05 2019-06-25 清华大学 High-speed railway track switch method for diagnosing faults based on interacting depth study
CN111832276A (en) * 2019-04-23 2020-10-27 国际商业机器公司 Rich message embedding for conversation deinterlacing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An Accelerator Architecture of Changeable-Dimension Matrix Computing Method for SVM; Ruidong Wu; MDPI; 2019-01-30; pp. 1-12 *
Optimizing CNN-based Hyperspectral Image Classification on FPGAs; Shuanglong Liu; Cornell University; 2019-06-27; pp. 1-8 *
基于FPGA的CNN算法移植概述 (Overview of Porting CNN Algorithms to FPGA); 清霜一梦; 博客园 (Cnblogs); 2018-03-15; pp. 1-2 *
基于改进的CNN和SVM手势识别算法研究 (Research on Gesture Recognition Algorithms Based on Improved CNN and SVM); 吴晴 (Wu Qing); 中国优秀硕士学位论文全文数据库 (China Master's Theses Full-text Database); 2019-02-28; I138-1369 *
面向FPGA部署的CNN-SVM算法研究与实现 (Research and Implementation of the CNN-SVM Algorithm for FPGA Deployment); 周彦臻 (Zhou Yanzhen); 电子测量与仪器学报 (Journal of Electronic Measurement and Instrumentation); 2021-04-15; pp. 90-98 *

Also Published As

Publication number Publication date
CN112306951A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
WO2020150728A1 (en) Systems and methods for virtually partitioning a machine perception and dense algorithm integrated circuit
CN112306951B (en) CNN-SVM resource efficient acceleration architecture based on FPGA
CN113313243B (en) Neural network accelerator determining method, device, equipment and storage medium
WO2008131308A1 (en) Field-programmable gate array based accelerator system
US11315344B2 (en) Reconfigurable 3D convolution engine
KR101950786B1 (en) Acceleration Method for Artificial Neural Network System
CN110991630A (en) Convolutional neural network processor for edge calculation
JP2021518591A (en) Systems and methods for implementing machine perception and high density algorithm integrated circuits
US11593628B2 (en) Dynamic variable bit width neural processor
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN113051216A (en) MobileNet-SSD target detection device and method based on FPGA acceleration
Min et al. NeuralHMC: An efficient HMC-based accelerator for deep neural networks
Bhowmik et al. ESCA: Event-based split-CNN architecture with data-level parallelism on ultrascale+ FPGA
TW200617668A (en) Cache memory management system and method
Hu et al. High-performance reconfigurable DNN accelerator on a bandwidth-limited embedded system
US20220004854A1 (en) Artificial neural network computation acceleration apparatus for distributed processing, artificial neural network acceleration system using same, and artificial neural network acceleration method therefor
CN109949202B (en) Parallel graph computation accelerator structure
CN111445019B (en) Device and method for realizing channel shuffling operation in packet convolution
Cain et al. Convolution processing unit featuring adaptive precision using dynamic reconfiguration
US11868873B2 (en) Convolution operator system to perform concurrent convolution operations
CN111861860B (en) Image acceleration processing system for AI intelligent SOC chip
EP4148627A1 (en) Neural network scheduling method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant