CN112306951B - CNN-SVM resource efficient acceleration architecture based on FPGA - Google Patents


Info

Publication number
CN112306951B
CN112306951B (application CN202011252879.0A)
Authority
CN
China
Prior art keywords
svm, cnn, bram, architecture, input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011252879.0A
Other languages
Chinese (zh)
Other versions
CN112306951A (en)
Inventor
付平 (Fu Ping)
吴瑞东 (Wu Ruidong)
刘冰 (Liu Bing)
周彦臻 (Zhou Yanzhen)
高丽娜 (Gao Lina)
王宾涛 (Wang Bintao)
陈浩林 (Chen Haolin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202011252879.0A
Publication of CN112306951A
Application granted
Publication of CN112306951B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to an FPGA-based CNN-SVM resource-efficient acceleration architecture in the technical field of embedded target classification and detection. The architecture comprises a processor system and a programmable logic system. The CNN-SVM streaming-architecture accelerator is built from a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes. The acceleration architecture fully exploits data reuse and suits the different layer types of the CNN-SVM hybrid algorithm, including CNN layers, FC fully connected layers, and the SVM. In addition, the pipeline interval of the general operator designed into the acceleration architecture can be kept at a single clock cycle, which improves the computing efficiency of the accelerator.

Description

CNN-SVM resource efficient acceleration architecture based on FPGA
Technical Field
The invention relates to the technical field of embedded target classification and detection, and in particular to an FPGA-based CNN-SVM resource-efficient acceleration architecture.
Background
Convolutional neural networks (CNNs) are currently widely used in classification, detection, and recognition applications, and hybrid network structures composed of a CNN and a conventional machine learning algorithm (such as the support vector machine, SVM) attract great interest in practice owing to their robustness, high classification accuracy, and suitability for small-sample training. With the demand for low power consumption, high performance, and light weight, acceleration of the hybrid network (CNN-SVM) on embedded platforms has become a current research focus.
In early studies on CNN acceleration, the roofline model was proposed to balance resources against performance, providing theoretical guidance on matching throughput to bandwidth. To overcome the bandwidth limitation, weight parameters are typically buffered in on-chip memory. As network depth grows, however, the limited on-chip memory runs short. One effective approach is to exploit the data-reuse characteristics of the convolution process. Based on this idea, hierarchical storage structures and ping-pong buffers are used to mitigate the external-memory bandwidth limit, for example by exploring peak bandwidth occupancy and reordering data to buffer parameters, which effectively improves throughput once the bandwidth limit is broken. These acceleration strategies can be summarized as: (1) loop tiling to reduce memory conflicts; (2) on-chip buffering to realize data reuse; (3) storing all parameters on chip so that no bandwidth is occupied. All of them rest on the discovery of bandwidth and data reuse. However, the throughput improvement depends on consuming a large number of DSP units, so the effective utilization of the DSPs is insufficient.
Fast convolution algorithms are another effective route to higher throughput: they reduce resource usage by replacing the original convolution structure. Winograd convolution greatly reduces arithmetic complexity and improves efficiency, and newer fast convolution algorithms include frequency-domain convolution via overlap-and-add and fast finite impulse response (FIR) algorithms that let limited resources support more convolutions. Although fast convolution algorithms can unlock acceleration potential, most target a special convolution structure and do not suit hybrid networks. In addition, some convolution acceleration methods alter the original pipeline structure, tightening the timing requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme:
an FPGA-based CNN-SVM resource efficient acceleration architecture, the architecture comprising a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM streaming-architecture accelerator is based on a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes in which the vertical direction carries the output-channel parallelism Tn and the horizontal direction carries the output-feature-map parallelism Tc; under data multiplexing, the input feature map and the weights are reused Tn and Tc times, respectively; the two-dimensional array can be expanded into a three-dimensional array to further raise data reusability; a single node comprises a multiply-add (MA) tree and a dedicated accumulator (ACC), the inputs of the MA tree realize the input-channel parallelism Tm and the depth of the MA tree is Tm, the dedicated accumulator ACC automatically adjusts the accumulation terms according to the convolution kernel size K or the layer type and produces the accumulation result, and the acceleration operator structure performs Tc×Tn×Tm multiply-accumulate operations per cycle.
Preferably, when the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
Preferably, the number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image BDA0002772178060000021; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
Preferably, the FPGA-based CNN-SVM resource-efficient acceleration architecture determines resource consumption through an operator resource evaluation model. The parameters of a single node in the model are the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image BDA0002772178060000022; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images BDA0002772178060000023 and BDA0002772178060000031; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number.
The DSP estimate depends on the node parameters and the multiplexing counts and affects the peak throughput; since the number of DSP units in the hardware resources is usually limited, the achievable peak throughput is a fixed value.
Preferably, when the FPGA-based CNN-SVM resource-efficient acceleration architecture accelerates, BRAM is used for temporary buffering and parallel expansion, with the smallest BRAM structure, BRAM_18K, serving as the basic unit of the evaluation model; the direct memory access (DMA) transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
Preferably, based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism. The BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images BDA0002772178060000033 and BDA0002772178060000034; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
The invention has the following beneficial effects:
the invention provides a resource efficient acceleration architecture which can fully utilize the data reuse characteristic and is suitable for different types of layers including CNN, FC full-connection layers and SVM in a CNN-SVM hybrid algorithm. Furthermore, the pipeline interval of the general operators designed in the acceleration architecture can be kept in a single clock cycle, so that the computing efficiency of the accelerator can be improved.
Drawings
FIG. 1 is a diagram of a generic operator structure;
FIG. 2 is a schematic diagram of data transmission of a generic operator;
FIG. 3 is a resource-efficient accelerator architecture diagram for a CNN-SVM hybrid algorithm using a generic operator design.
Detailed Description
The present invention will be described in detail with reference to specific examples.
The first embodiment is as follows:
As shown in FIG. 1 to FIG. 3, the invention provides an FPGA-based CNN-SVM resource-efficient acceleration architecture that comprises a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
the key point of the invention is to provide a general two-dimensional multiplication and addition array operator, and construct a resource efficient acceleration framework suitable for a CNN-SVM algorithm based on the operator. The operator can be applied to layers of different types based on the high data reuse characteristic, and meanwhile, a high-efficiency complete flow production line is ensured. In addition, the resource evaluation model is constructed according to the operator, the resource consumption and the estimation of the required time of the Block Random Access Memory (BRAM) and the Digital Signal Processing (DSP) can be accurately obtained, the resource utilization rate and the calculation efficiency are greatly improved under the guidance of the model, and the hardware acceleration of the CNN-SVM algorithm is completed.
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM flow architecture accelerator is based on a general acceleration operator structure, the acceleration operator structure is a two-dimensional array formed by a plurality of multiply-accumulate MAC nodes, wherein the vertical direction is the parallelism Tn of an output channel, the horizontal direction is the parallelism Tc of an output characteristic diagram, and under the condition of data multiplexing, the multiplexing times of an input characteristic diagram and weight are Tn and Tc respectively; the two-dimensional array is expanded into a three-dimensional array, so that the data reusability is improved; the single node comprises a multiplication and addition MA tree and a special accumulator ACC, the input of the MA tree realizes the parallelism Tm of an input channel, the depth of the MA tree is Tm, the special accumulator ACC automatically adjusts accumulation items according to the size K of a convolution kernel or the layer type and generates accumulation results, and the acceleration operator structure carries out TcxTnxTm multiplication and accumulation operations for times.
When the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
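The patent does not fix the multiclass scheme behind the voting decision module; the sketch below assumes a one-vs-one SVM in which each ACC output is the decision value of one binary classifier:

    import numpy as np
    from itertools import combinations

    def svm_vote(acc_scores, num_classes=10):
        """Majority vote over one-vs-one decision values (assumed scheme).

        acc_scores[i] is the ACC output of the i-th binary SVM; pairs are
        enumerated as (0,1), (0,2), ..., so len(acc_scores) == C(num_classes, 2).
        """
        votes = np.zeros(num_classes, dtype=int)
        for score, (a, b) in zip(acc_scores, combinations(range(num_classes), 2)):
            votes[a if score > 0 else b] += 1
        return int(np.argmax(votes))

    print(svm_vote(np.random.randn(45)))  # 45 = C(10, 2) binary SVMs for 10 classes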
Therefore, the general operator structure provided by the invention applies to different types of network layers and fully multiplexes the input data. Since the operator structure relates only to the dimensions of the output feature map, it also applies to convolution kernels of other sizes, independent of the input feature map size or the convolution stride. For some special convolution forms, only the order of the MA tree and the ACC needs to be changed or adjusted. The number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image BDA0002772178060000051; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
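The formula itself survives only as an image, but dimensional analysis suggests its shape: one layer needs N·Rout·Cout·M·K² multiply-accumulates (N, the number of output channels, is a symbol added here for the sketch), and a fully pipelined array retires Tm·Tn·Tr·Tc of them per cycle, which would give, as an assumption:

    T \;\approx\; K^{2}
        \left\lceil \frac{M}{T_m} \right\rceil
        \left\lceil \frac{N}{T_n} \right\rceil
        \left\lceil \frac{R_{out}}{T_r} \right\rceil
        \left\lceil \frac{C_{out}}{T_c} \right\rceil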
To make full use of the limited hardware resources, the FPGA-based CNN-SVM resource-efficient acceleration architecture determines resource consumption through an operator resource evaluation model. The parameters of a single node in the model are the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image BDA0002772178060000052; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images BDA0002772178060000053 and BDA0002772178060000054; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number.
it can be seen that the DSP estimate depends on the node parameters and the number of multiplexes and directly affects the peak throughput. Usually the number of DSP units in the hardware resources is limited and the peak throughput that can be achieved is a fixed value. When different network deployments are used, the core difference is the utilization rate of the BRAM.
BRAM is typically used for temporary buffering and parallel expansion during the deployment phase of accelerator design. Mining the potential parallelism of the BRAM is an effective way to save memory resources, so a BRAM resource estimation model needs to be analyzed for deployment. The invention takes the smallest BRAM structure (BRAM_18K) as the basic unit of the evaluation model. FIG. 2 is a schematic diagram of the data transmission of the operator structure under feature-map reuse. First, direct memory access (DMA) transfers the input feature map to the WriteBRAM module, which writes it into BRAM_18K in a multi-bit parallel mode. After writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module. The ReadBRAM module reads the feature map from BRAM_18K and sends it to the operator structure. In FIG. 2, the read and write modules work independently in a cyclically alternating manner, making maximal use of the BRAM bit width and the efficiency of the operator structure.
When the FPGA-based CNN-SVM resource-efficient acceleration architecture accelerates, the BRAM is accordingly used for temporary buffering and parallel expansion with BRAM_18K as the basic unit of the evaluation model: the DMA transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
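A toy behavioural model of this WriteBRAM/ReadBRAM handshake (a sketch of the described ping-pong alternation, not the RTL):

    DEPTH = 512                # typical BRAM_18K depth in 32-bit mode
    HALF = DEPTH // 2
    bram = [0] * DEPTH

    def write_half(words, half):
        """WriteBRAM fills one half of the buffer, then raises the half-full flag."""
        bram[half * HALF : half * HALF + len(words)] = words
        return True            # the flag sent to ReadBRAM

    def read_half(half):
        """ReadBRAM drains one half and feeds it to the operator structure."""
        return bram[half * HALF : (half + 1) * HALF]

    flag = write_half(list(range(HALF)), half=0)
    if flag:                   # the halves alternate: read 0 while writing 1
        to_operator = read_half(0)
        write_half(list(range(HALF, DEPTH)), half=1)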
Based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism. The BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images BDA0002772178060000061 and BDA0002772178060000062; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
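The sketch below mirrors the limiting behaviour just stated for the feature-map-reuse case: once the depth term collapses to one, the count depends only on the bit width and the parallelism. The exact composition is an assumption, since the patent's formula survives only as an image:

    import math

    DEPTH = 512   # typical BRAM_18K depth in 32-bit independent read/write mode

    def bram_fmap_reuse(Rin, Cin, Tr, Tc, Width):
        banks = Tr * Tc                       # one parallel storage unit per lane
        width_factor = math.ceil(Width / 32)  # BRAM_18Ks needed to span one word
        depth_factor = max(1, math.ceil(Rin * Cin / (banks * DEPTH)))
        return banks * width_factor * depth_factor

    # Small feature map: depth_factor == 1, so only Width and Tr*Tc matter.
    print(bram_fmap_reuse(Rin=28, Cin=28, Tr=1, Tc=7, Width=32))  # -> 7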
The design of the resource-efficient acceleration architecture for the CNN-SVM algorithm comprises a processor system (PS) and programmable logic (PL). As shown in FIG. 3, the processor system implements the control logic and external interfaces, including the application processor unit (APU), responsible for data scheduling and program control; the SD input/output controller (SDIO) and double-data-rate memory (DDR4), storing the external input data set and the network parameter files; and the universal asynchronous receiver/transmitter (UART), responsible for monitoring the output results and computation time of the algorithm. The PS also controls the state of the PL through the Advanced eXtensible Interface (AXI) HPM0 port. The programmable logic part implements the CNN-SVM hardware accelerator, the core of the system architecture, constructed with the proposed general operator; the accelerator accesses the external memory through the AXI HP ports to perform data transfers with it.
The programmable logic part in FIG. 3 implements the CNN-SVM hardware accelerator, which performs the forward-inference computation of the CNN-SVM algorithm and is the core of the system architecture. The accelerator is designed as a streaming architecture in which the whole network is instantiated on chip: the different network layers of the algorithm (such as the convolutional layers, pooling layers, and SVM) are mapped onto mutually independent computing units built from the proposed general operator, each responsible for the inference computation of its network layer. Within each computing unit several levels of parallelism can be realized, including the feature-map level, the input-channel level, and the output-channel level, and pipeline optimization between the different computing units raises the inter-layer computational parallelism.
In addition, a control interface is designed between the accelerator and the processor, through which the processor system can start the accelerator (Start), view its state (Status), view the iteration count of the computation (Iteration), view the offset of the write-back data address (Offset), and so on. DMA0 and DMA1 are the accelerator's data-transfer modules: during computation the accelerator has DMA0 read the input image from the external memory, while DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit. Data transmission inside the accelerator is based on the AXI-Stream interface, a protocol of the AXI general bus capable of high-speed continuous data transfer. In its final layer, the accelerator can select a fully connected (FC) layer or the SVM as the classifier as desired. After the accelerator finishes computing, the classification result output by the algorithm is written to the external memory.
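A hypothetical host-side sequence over this control interface, written against a PYNQ-style MMIO handle; the register offsets below are invented for illustration, since the patent names the registers but not their addresses:

    REG = {"Start": 0x00, "Status": 0x04, "Iteration": 0x08, "Offset": 0x0C}  # assumed map

    def run_accelerator(mmio, writeback_offset):
        """Configure, start, and wait out one run of the accelerator."""
        mmio.write(REG["Offset"], writeback_offset)  # write-back data address offset
        mmio.write(REG["Start"], 1)                  # start the accelerator
        while mmio.read(REG["Status"]) & 0x1 == 0:   # poll Status until done
            pass
        return mmio.read(REG["Iteration"])           # completed iteration count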
The second embodiment is as follows:
In this embodiment, the CNN-SVM algorithm is accelerated with the proposed resource-efficient acceleration architecture. As shown in FIG. 3, the input pictures and the parameter configuration file of the CNN-SVM algorithm are stored on the off-chip SD card; the input pictures come from the MNIST data set, which contains 10000 test pictures. The processor system controls DMA0 to feed input pictures to the accelerator in the programmable logic part and controls DMA1 to configure the parameters of each computing unit of the accelerator.
The input of the CNN-SVM streaming-architecture accelerator in the programmable logic part is a 28×28×1 picture. The convolution calculation units in the accelerator (convolution 1, convolution 2, convolution 3) compute the convolution operations of the algorithm, and the pooling calculation units (pooling 1, pooling 2, pooling 3) compute the pooling layers merged with the activation functions. Results are passed between the computing units as shown in FIG. 2, with the specific dimensions of the transferred data shown in FIG. 3. After the CNN part of the algorithm completes, the feature vector extracted by the CNN is flattened and fed into the classifier (SVM), which finally outputs the classification result; parameters such as the convolution kernels of the convolution calculation units and the weights of the classifier are configured by the processor system.
The computing units of each layer of the accelerator are implemented with the proposed general operator, constructed according to the specific parallelism of each layer. As shown in Table 1, where R and C are the height and width of the feature map and M and N are the channel counts of the output and input feature maps, the operator structure used by each computing unit is determined by its parallelism (Tc, Tm, Tn); the general operator structure therefore differs between computing units, which illustrates the operator's generality and extensibility: computing units can be built from it according to the specific parameters of the algorithm's different layers. In addition, in this embodiment the usage of BRAM and DSP is evaluated with the resource evaluation model of the designed generic operator; as shown in Table 1, 13.5 BRAM_18Ks and 316 DSPs are used, the CNN-SVM algorithm consumes 297 clock cycles, and the clock frequency is set to 100 MHz.
TABLE 1 Parallelism parameters for different layers of CNN-SVM
[Table 1 appears only as an image in the source.]
The corresponding resource consumption from board-level verification in this embodiment is shown in Table 2. Because this embodiment is implemented on the XAZU3EG platform, the Available column in the table lists that platform's resource limits. Since DSP utilization is the key to the accelerated architecture's throughput, this embodiment focuses on it; from the implementation results, DSP utilization exceeds 80%. Ideally DSP utilization could reach 100%, but as the network parallelism parameters keep expanding, the DSP demand grows sharply to extend parallelism across the whole deployed network. In other words, the maximum DSP utilization depends on the amount of limited resources and the network structure, which is why not all DSPs can be used in a deployment. On the other hand, since the operator in the architecture uses the data transmission method of FIG. 2 to fully exploit the BRAM bit width, the BRAM utilization in the implementation results is low. In addition, the computation time for the 10000 test pictures accelerated by the CNN-SVM algorithm in this embodiment is 30.1 ms, with a power consumption of only 3.42 W. According to the experimental results, this embodiment fully utilizes the on-chip DSP resources and completes the resource-efficient acceleration of the CNN-SVM algorithm while consuming only a very small amount of BRAM resources.
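These figures are mutually consistent: reading the 297 clock cycles reported in the first embodiment as the per-picture pipeline interval gives

    \frac{297~\text{cycles}}{100~\text{MHz}} = 2.97~\mu\text{s per picture},
    \qquad 10000 \times 2.97~\mu\text{s} = 29.7~\text{ms} \approx 30.1~\text{ms measured.}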
TABLE 2 Resource usage of CNN-SVM
[Table 2 appears only as an image in the source.]
The above description is only a preferred embodiment of the FPGA-based CNN-SVM resource-efficient acceleration architecture, and its protection scope is not limited to the above embodiments; all technical solutions under this idea belong to the protection scope of the invention. It should be noted that modifications and variations that do not depart from the gist of the invention will occur to those skilled in the art and are intended to be within the scope of the invention.

Claims (6)

1. An FPGA-based CNN-SVM resource-efficient acceleration architecture, characterized in that the architecture comprises a processor system and a programmable logic system;
the processor system comprises a DDR memory controller, an SD/SDIO controller, a serial port controller, a main switch, and an application processor; the application processor performs data scheduling and program control, the SD/SDIO controller and the DDR memory controller store the external input data set and the network parameter file, and the serial port controller monitors the output results and the computation time of the architecture;
the programmable logic system comprises a CNN-SVM streaming-architecture accelerator, the AXI interconnect, AXI peripherals, DMA0, and DMA1; the main switch connects to the AXI peripherals; convolution calculation units in the CNN-SVM streaming-architecture accelerator perform the convolution operations, and pooling calculation units compute the pooling layers merged with the activation functions; after the CNN computation completes, the feature vector extracted by the CNN is flattened and fed into the SVM classifier for classification, and the classification result is finally output; the convolution kernels of the convolution calculation units and the weight parameters of the classifier are configured by the application processor;
DMA0 reads the input image from the external memory into the accelerator, DMA1 initializes the convolution kernels or weight parameters of each layer's computing unit, and data transmission inside the accelerator is based on the AXI-Stream interface;
the CNN-SVM streaming-architecture accelerator is based on a general acceleration operator structure: a two-dimensional array of multiply-accumulate (MAC) nodes in which the vertical direction carries the output-channel parallelism Tn and the horizontal direction carries the output-feature-map parallelism Tc; under data multiplexing, the input feature map and the weights are reused Tn and Tc times, respectively; the two-dimensional array can be expanded into a three-dimensional array to further raise data reusability; a single node comprises a multiply-add (MA) tree and a dedicated accumulator (ACC), the inputs of the MA tree realize the input-channel parallelism Tm and the depth of the MA tree is Tm, the dedicated accumulator ACC automatically adjusts the accumulation terms according to the convolution kernel size K or the layer type and produces the accumulation result, and the acceleration operator structure performs Tc×Tn×Tm multiply-accumulate operations per cycle.
2. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: when the MA tree is a complete binary tree, DSP utilization is maximal; for the SVM classifier, only an additional voting decision module needs to be added after the ACC, and the remaining structure is unchanged.
3. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 2, characterized in that the number of clock cycles T required for the operator to complete a single-layer inference is determined by:
[Formula image FDA0002772178050000011; not transcribed.]
where M is the number of input channels, K is the kernel size, Rout and Cout are the rows and columns of the output feature map, respectively, and Tr and Tc are the row and column parallelism of the output feature map.
4. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: the architecture determines resource consumption through an operator resource evaluation model, the parameters of a single node in the model being the dimension and the type of the input data; when the dimension of the input data is Dim, the DSP resources DType consumed by a single node differ by type and are given by:
[Formula image FDA0002772178050000021; not transcribed.]
The DSP estimate of the operator structure is generated by:
[Formula images FDA0002772178050000022 and FDA0002772178050000023; not transcribed.]
where Dim is the dimension of the input data, DType is the DSP resource consumed by a single node, and type is the data precision: float32 is a single-precision floating-point number, int32 a 32-bit fixed-point number, int16 a 16-bit fixed-point number, and int8 an 8-bit fixed-point number;
the DSP estimate depends on the node parameters and the multiplexing counts and affects the peak throughput; since the number of DSP units in the hardware resources is usually limited, the achievable peak throughput is a fixed value.
5. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: when the architecture accelerates, BRAM is used for temporary buffering and parallel expansion, with the smallest BRAM structure, BRAM_18K, serving as the basic unit of the evaluation model; the direct memory access DMA transfers the input feature map to the WriteBRAM module, which then writes it into BRAM_18K in a multi-bit parallel mode; after writing the feature map, the WriteBRAM module sends a half-full flag to the ReadBRAM module, which reads the feature map from BRAM_18K and sends it to the operator structure.
6. The FPGA-based CNN-SVM resource-efficient acceleration architecture of claim 1, characterized in that: based on the analysis of the BRAM, a BRAM estimation model is established for feature-map reuse and for weight reuse, where Rin and Cin are the rows and columns of the input feature map, respectively, and a typical value of Depth is 512 for BRAM_18K in the 32-bit independent read-write mode; when the data size is very small or the parallelism parameters are very large, the number of BRAM_18Ks occupied by a single parallel storage unit can be neglected, and the BRAM resources relate only to the bit width and the parallelism; the BRAM estimate for feature-map reuse and the BRAM estimate for weight reuse are given by:
[Formula images FDA0002772178050000025 and FDA0002772178050000026; not transcribed.]
where Rin and Cin are the rows and columns of the input feature map, Tr and Tc are the row and column parallelism of the output feature map, Depth is the depth of the BRAM, and Width is the bit width of the data.
CN202011252879.0A 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA Active CN112306951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011252879.0A CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011252879.0A CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Publications (2)

Publication Number Publication Date
CN112306951A CN112306951A (en) 2021-02-02
CN112306951B true CN112306951B (en) 2022-03-22

Family

ID=74325704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011252879.0A Active CN112306951B (en) 2020-11-11 2020-11-11 CNN-SVM resource efficient acceleration architecture based on FPGA

Country Status (1)

Country Link
CN (1) CN112306951B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801285B (en) * 2021-02-04 2024-01-26 南京微毫科技有限公司 FPGA-based high-resource-utilization CNN accelerator and acceleration method thereof
CN112989731B (en) * 2021-03-22 2023-10-13 湖南大学 Integrated circuit modeling acquisition method and system based on abstract syntax tree

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934335A (en) * 2019-03-05 2019-06-25 清华大学 High-speed railway track switch method for diagnosing faults based on interacting depth study
CN111832276A (en) * 2019-04-23 2020-10-27 国际商业机器公司 Rich message embedding for conversation deinterlacing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8131659B2 (en) * 2008-09-25 2012-03-06 Microsoft Corporation Field-programmable gate array based accelerator system
US9971953B2 (en) * 2015-12-10 2018-05-15 Intel Corporation Visual recognition using deep learning attributes
US10970080B2 (en) * 2018-02-08 2021-04-06 Marvell Asia Pte, Ltd. Systems and methods for programmable hardware architecture for machine learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934335A (en) * 2019-03-05 2019-06-25 清华大学 High-speed railway track switch method for diagnosing faults based on interacting depth study
CN111832276A (en) * 2019-04-23 2020-10-27 国际商业机器公司 Rich message embedding for conversation deinterlacing

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An Accelerator Architecture of Changeable-Dimension Matrix Computing Method for SVM; Ruidong Wu; MDPI; 2019-01-30; pp. 1-12 *
Optimizing CNN-based Hyperspectral Image Classification on FPGAs; Shuanglong Liu; Cornell University; 2019-06-27; pp. 1-8 *
基于FPGA的CNN算法移植概述 (Overview of Porting CNN Algorithms to FPGA); 清霜一梦; 博客园 (Cnblogs); 2018-03-15; pp. 1-2 *
基于改进的CNN和SVM手势识别算法研究 (Research on Gesture Recognition Algorithms Based on Improved CNN and SVM); 吴晴 (Wu Qing); 中国优秀硕士学位论文全文数据库 (China Master's Theses Full-text Database); 2019-02-28; I138-1369 *
面向FPGA部署的CNN-SVM算法研究与实现 (Research and Implementation of the CNN-SVM Algorithm for FPGA Deployment); 周彦臻 (Zhou Yanzhen); 电子测量与仪器学报 (Journal of Electronic Measurement and Instrumentation); 2021-04-15; pp. 90-98 *

Also Published As

Publication number Publication date
CN112306951A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN107301455B (en) Hybrid cube storage system for convolutional neural network and accelerated computing method
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
WO2020150728A1 (en) Systems and methods for virtually partitioning a machine perception and dense algorithm integrated circuit
CN112306951B (en) CNN-SVM resource efficient acceleration architecture based on FPGA
CN113313243B (en) Neural network accelerator determining method, device, equipment and storage medium
WO2008131308A1 (en) Field-programmable gate array based accelerator system
US11315344B2 (en) Reconfigurable 3D convolution engine
KR101950786B1 (en) Acceleration Method for Artificial Neural Network System
CN110991630A (en) Convolutional neural network processor for edge calculation
JP2021518591A (en) Systems and methods for implementing machine perception and high density algorithm integrated circuits
US11593628B2 (en) Dynamic variable bit width neural processor
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN113051216A (en) MobileNet-SSD target detection device and method based on FPGA acceleration
Min et al. NeuralHMC: An efficient HMC-based accelerator for deep neural networks
Bhowmik et al. ESCA: Event-based split-CNN architecture with data-level parallelism on ultrascale+ FPGA
TW200617668A (en) Cache memory management system and method
Hu et al. High-performance reconfigurable DNN accelerator on a bandwidth-limited embedded system
US20220004854A1 (en) Artificial neural network computation acceleration apparatus for distributed processing, artificial neural network acceleration system using same, and artificial neural network acceleration method therefor
CN109949202B (en) Parallel graph computation accelerator structure
CN111445019B (en) Device and method for realizing channel shuffling operation in packet convolution
Cain et al. Convolution processing unit featuring adaptive precision using dynamic reconfiguration
US11868873B2 (en) Convolution operator system to perform concurrent convolution operations
CN111861860B (en) Image acceleration processing system for AI intelligent SOC chip
EP4148627A1 (en) Neural network scheduling method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant