CN110458279B - FPGA-based binary neural network acceleration method and system

FPGA-based binary neural network acceleration method and system

Info

Publication number
CN110458279B
CN110458279B
Authority
CN
China
Prior art keywords
convolution
calculation
fpga
neural network
binary
Prior art date
Legal status
Active
Application number
CN201910636517.2A
Other languages
Chinese (zh)
Other versions
CN110458279A (en
Inventor
李开
邹复好
祁迪
Current Assignee
Wuhan Meitong Technology Co ltd
Original Assignee
Wuhan Meitong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Meitong Technology Co ltd filed Critical Wuhan Meitong Technology Co ltd
Priority to CN201910636517.2A priority Critical patent/CN110458279B/en
Publication of CN110458279A publication Critical patent/CN110458279A/en
Application granted granted Critical
Publication of CN110458279B publication Critical patent/CN110458279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an FPGA-based binary neural network acceleration system, which uses a convolution kernel parameter acquisition module, a binary convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA. Each module obtains a convolution calculation logic rule and correspondingly performs binary convolution calculation on the acquired input feature map of the picture to be processed, and the FPGA traverses the convolution calculations of a plurality of threads according to the convolution calculation logic rule to obtain the output feature map data of the picture to be processed. The whole architecture offloads the computation of every layer of the binary neural network entirely to the on-chip memory, without depending on interaction between off-chip and on-chip memory, so the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed for the picture to be detected is increased.

Description

FPGA-based binary neural network acceleration method and system
Technical Field
The invention belongs to the field of image processing, and particularly relates to a binary neural network acceleration method and system based on an FPGA.
Background
Significant advances in artificial intelligence technology have begun to benefit many aspects of human life. From household vacuum robots to fully intelligent production lines in factories, many tasks around the world have become highly automated. Deep learning plays a central role in this technological revolution and is widely applied in face recognition, object detection, image processing and other fields. The main algorithm used is the convolutional neural network, and well-performing deep learning algorithms have already been deployed on a large number of PCs, mobile phones and dedicated embedded accelerators to carry out various intelligent computing tasks with good acceleration results.
The convolutional neural network (CNN) is one of the most important branches of deep learning; it is the most mature and is widely applied to all kinds of image and video processing tasks. Its rapid development benefits not only from the growth of training data and computing power but also from the various convolutional neural network frameworks. Most existing CNN applications are deployed on servers or desktop platforms, while mobile terminals are the application platform with the broadest reach and the largest number of users, so porting CNN applications to mobile devices would do the most to advance the development of deep learning applications.
However, such mobile terminals and embedded computing devices provide only limited computing power and small on-chip storage. As the model structures of convolutional neural networks become more complex, the networks deeper and the parameter counts larger, deploying them on mobile and embedded devices becomes increasingly difficult. Performing this huge amount of computation with 32-bit floating-point operands on a lightweight chip undoubtedly consumes enormous computing resources and makes a good real-time effect hard to achieve.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides an FPGA (field programmable gate array) based binary neural network acceleration system. The system uses a convolution kernel parameter acquisition module, a binary convolutional neural network structure and a cache module formed on the FPGA, the cache module being the on-chip memory of the FPGA. Each module performs the corresponding binary convolution calculation according to the acquired convolution calculation logic rule, and the whole architecture offloads the computation of every layer of the binary neural network entirely to the on-chip memory without depending on interaction between off-chip and on-chip memory, so the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed of the image to be detected is increased.
To achieve the above object, according to one aspect of the present invention, there is provided an FPGA-based binary neural network acceleration system, which includes a convolution kernel parameter acquisition module formed by using an FPGA, a binary convolution neural network structure, and a cache module, the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling the convolution calculation logic rule to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rule, the convolution calculation of one thread corresponds to a plurality of the convolution basic calculation modules, and the convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is to sequentially perform convolution, PRelu activation, regular normalization and binary activation calculation, or sequentially perform convolution, PRelu activation, pooling, regular normalization and binary activation calculation;
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
As a further improvement of the invention, the FPGA configures the corresponding control register through the ARM end and then loads the image from the external memory DDR3 into a buffer area of the on-chip memory through the AXI bus; the FPGA allocates a plurality of processing engines to the convolution basic computation module, and each processing engine comprises an arithmetic operation component, a logic operation component, a bit operation component and storage resources.
As a further improvement of the invention, the convolution calculation layers are classified into a convolution layer, a PRelu activation layer, a pooling layer, a regular normalization layer and a binary activation layer according to a preset convolution calculation sequence, the convolution layer, the PRelu activation layer, the pooling layer, the regular normalization layer and the binary activation layer are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation, a plurality of calculation engines located in the same convolution calculation layer are used for forming a convolution acceleration array, and one convolution acceleration array is realized by using one PE module of an FPGA.
As a further improvement of the present invention, the cache module sets a corresponding first cache region and a second cache region for a convolution calculation array, respectively, where the first cache region is used to store the operation result of the previous convolution acceleration array, and the second cache region is used to store the operation result of the corresponding convolution acceleration array.
As a further improvement of the invention, the calculation implementation process of the pooling layer is as follows: and performing SIMD vectorization on the column vectors corresponding to the sliding window of the pooling layer, solving respective maximum values of all the column vectors to form a new vector, and taking the new vector as data of the output feature map.
As a further improvement of the invention, when different sliding windows of the pooling layer have the same column vector, the calculation result of the same column vector is put into an LUT for temporary storage, and the temporary storage value in the LUT is directly called when the next sliding window performs the calculation of the same column vector.
As a further improvement of the invention, the FPGA is provided with a matrix vector multiplication unit for each convolution calculation layer according to the convolution calculation logic rule; the matrix vector multiplication unit comprises a plurality of calculation engines, each of which contains a plurality of parallel single instruction multiple data (SIMD) channels, and each calculation engine is used for acquiring the input feature map of the picture to be processed corresponding to its parallel SIMD channels and performing multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further improvement of the invention, the system performs the dot product calculation process by: performing XOR on the corresponding position elements in the sliding window and storing the XOR result in an array; counting the number of 1s in the array through popcount; and obtaining the final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]; wherein popcount(x) denotes the count of 1s in the vector x corresponding to the one-dimensional array, and N denotes the number of elements of the vector x in popcount(x).
As a further improvement of the method, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rule and packs them into a parameter matrix; the sliding window is translated across the input feature map of the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain the convolution calculation result.
As a further improvement of the invention, the convolution basic computation module combines the PRelu activation, regular normalization and binary activation into a simple binary function through a common affine function.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention relates to a binary neural network acceleration system based on FPGA, which utilizes a convolution kernel parameter acquisition module, a binary convolution neural network structure and a cache module formed by FPGA, wherein the cache module is an on-chip memory of FPGA, each module carries out corresponding binary convolution calculation according to the acquired convolution calculation logic rule, and the calculated amount of each layer in the binary neural network is completely unloaded to the on-chip memory through the whole framework without depending on the interaction of the off-chip memory and the on-chip memory, so that the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed of an image to be detected is improved.
In the FPGA-based binary neural network acceleration system of the invention, the dot product operation in the binary neural network is replaced by an Xnor logic operation and a popcount operation; the binary operation is a dot product between 1-bit weights and 1-bit input image parameters. This replacement makes binary convolution calculation much faster than full-precision convolution calculation. Meanwhile, the blank parts of the feature map are filled with parity-interleaved values instead of the all +1 filling used in previous work, which preserves model accuracy to a certain extent.
In the FPGA-based binary neural network acceleration system of the invention, the matrix vector multiplication unit provided on the FPGA reorganizes elements by interleaving the parameter matrix offline and interleaving the input feature map through the sliding window unit, and feeds the reorganized data vectors into the convolution acceleration matrix, thereby achieving fully parallelized calculation.
In the FPGA-based binary neural network acceleration system of the invention, the calculation of each convolution calculation layer is accelerated through a double-buffer parallel mechanism; the buffer areas adopt a pipelined structure, the sliding window is driven by the output data of the previous layer, and the data inside the sliding window are processed fully in parallel, which further improves calculation efficiency.
According to the FPGA-based binary neural network acceleration system, the PRelu activation, the regular normalization and the binary activation are combined into a simple binary function through a common affine function mode, so that the calculation complexity caused by the regular normalization is greatly reduced.
Drawings
FIG. 1 is a schematic structural diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an FPGA-based on-chip memory according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a convolution basic calculation module according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of parity interleaved padding in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a double buffered parallel architecture of an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computational implementation of a pooling layer of an embodiment of the present invention;
FIG. 7 is a schematic diagram of a convolution calculation implementation of an embodiment of the present invention;
FIG. 8 is a schematic diagram of a dot product calculation implementation of an embodiment of the present invention;
FIG. 9 is a schematic diagram of a convolution calculation matrix interleaving ordering according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a convolution calculation matrix storage implementation of an embodiment of the present invention;
FIG. 11 is a schematic diagram of a folded matrix vector multiplication implementation of an embodiment of the present invention;
FIG. 12 is a schematic diagram of an acceleration system process flow according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.
Fig. 1 is a schematic structural diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention. As shown in fig. 1, a binary neural network acceleration system based on FPGA comprises a convolution kernel parameter acquisition module formed by FPGA, a binary convolution neural network structure and a cache module, wherein the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling the convolution calculation logic rule to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rule, the convolution calculation of one thread corresponds to a plurality of the convolution basic calculation modules, and a plurality of convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is that convolution, PRelu activation, regular normalization and binary activation calculation are sequentially carried out, or convolution, PRelu activation, pooling, regular normalization and binary activation calculation are sequentially carried out; as an example, the convolution basic computation module can combine the PRelu activation, the regular normalization and the binary activation into a simple binary function through a common affine function, so as to reduce the computation complexity caused by the regular normalization (a sketch of this merged threshold form is given after the module descriptions below);
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
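As an illustrative aside (not part of the claimed embodiment), the following C++ sketch shows one way the PRelu activation, regular normalization and binary activation mentioned above can be merged, via the affine form of the normalization, into a single per-channel threshold test. The function names, the per-channel parameterization and the assumption of a positive PRelu slope are introduced here for illustration only.

```cpp
// Minimal sketch (not the patent's exact formulation): collapsing
// PRelu -> batch normalization -> binary activation into one per-channel
// threshold test, assuming a positive PRelu slope and per-channel
// normalization parameters (gamma, beta, mean, var).
#include <cmath>
#include <cstdint>

struct ChannelThreshold {
    float threshold;  // value the raw convolution sum is compared against
    bool  flip;       // true when gamma < 0 reverses the comparison
};

// Precomputed offline, once per output channel.
ChannelThreshold fuse_activation(float prelu_slope, float gamma, float beta,
                                 float mean, float var, float eps = 1e-5f) {
    // BN(PRelu(x)) >= 0  <=>  PRelu(x) >= mean - beta*sqrt(var+eps)/gamma
    float t = mean - beta * std::sqrt(var + eps) / gamma;
    // Invert PRelu (monotonic for slope > 0): if t < 0 the threshold in the
    // pre-activation domain is t / slope, otherwise it is t itself.
    ChannelThreshold ct;
    ct.threshold = (t < 0.0f) ? t / prelu_slope : t;
    ct.flip = (gamma < 0.0f);
    return ct;
}

// Applied per output value at run time: one compare instead of the full chain.
inline int8_t binary_activate(float conv_sum, const ChannelThreshold& ct) {
    bool positive = (conv_sum >= ct.threshold);
    return (positive != ct.flip) ? +1 : -1;
}
```

With this precomputation the inter-module data exchanged at run time is a single bit per value, consistent with the binary inter-block transfers described below.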
Fig. 2 is a schematic diagram of an FPGA-based on-chip memory according to an embodiment of the present invention. As shown in fig. 2, the convolutional kernel data storage corresponding to the convolutional calculation logic rule is realized through a stream computing architecture based on the FPGA on-chip memory, which not only reduces the communication cost of the on-chip memory and the off-chip memory, but also greatly improves the overall parallelism of the convolutional calculation.
As an example, when the convolution basic computation module is implemented in hardware on the FPGA, the overall hardware architecture first configures the corresponding control registers through the ARM end and then loads the image from the external memory DDR3 into a buffer area of the on-chip memory through the AXI bus. The FPGA can allocate a large number of processing engines to the operation of a convolution basic computation module, each comprising arithmetic operation components, logic operation components, bit operation components and storage resources. As a preferred embodiment, the convolution calculation layers are classified, according to the preset convolution calculation sequence, into a convolution layer, a PRelu activation layer, a pooling layer, a regular normalization layer and a binary activation layer, which are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation. A plurality of calculation engines located in the same convolution calculation layer can be used to form a convolution acceleration array, and one convolution acceleration array can be realized with one PE module of the FPGA. Preferably, each convolution acceleration array is allocated a double-buffer structure: one buffer stores the operation result of the previous layer of the network, and the other stores the operation result of the current layer.
Fig. 3 is a schematic structural diagram of a convolution basic calculation module according to an embodiment of the present invention. Adding PRelu activation to the convolution basic calculation module can improve the accuracy of the original model by 2 percent. The module order of a common binary neural network is: regular normalization, convolution, binary activation and pooling. Before recombination, the input of each module is transmitted after the previous module has performed the PRelu + Pool calculation, so the data received and transmitted by the inter-block buffer area is non-binary. The convolution basic calculation module therefore adjusts its processing order to the preset convolution calculation sequence described above (convolution, PRelu activation, optional pooling, regular normalization and binary activation). After recombination, the data transmitted between convolution basic calculation modules has already been processed by the binary activation function and is therefore binary, so the amount of data exchanged between blocks is greatly reduced, the communication cost between blocks is lowered, and a uniform interface can easily be designed for all convolution basic calculation modules; at the same time, the size of the buffers used for exchanging data is reduced, saving hardware resources.
FIG. 4 is a diagram of parity interleaved padding in accordance with an embodiment of the present invention. As shown in fig. 4, parity interleaving is used in the convolution calculation to fill the blank data of the output feature map so as to preserve its dimensions; specifically, the padding values alternate between +1 and -1 in parity order along the width-height dimensions and the channel dimension of the feature map. On the CIFAR-10 data set, the error rate of the network model trained with parity filling is only 11.50%, close to that of all-0 padding at full precision, and lower than the 13.76% of all +1 filling and the 12.85% of odd (even) filling.
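As a non-limiting sketch of the parity-interleaved filling described above, the following C++ routine pads a {-1,+1} feature map with a one-pixel border whose values alternate with the parity of the (channel + row + column) index. The exact interleaving order used by the embodiment may differ, so this ordering is an assumption.

```cpp
// Minimal sketch of parity-interleaved padding (assumed ordering): border
// elements alternate between +1 and -1 according to the parity of their
// (channel + row + column) index, instead of padding everything with +1.
#include <vector>
#include <cstdint>

// Pads a CHW feature map of {-1,+1} values with a 1-pixel border.
std::vector<int8_t> pad_parity(const std::vector<int8_t>& in,
                               int channels, int height, int width) {
    const int H = height + 2, W = width + 2;
    std::vector<int8_t> out(static_cast<size_t>(channels) * H * W);
    for (int c = 0; c < channels; ++c)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int8_t v;
                if (y == 0 || y == H - 1 || x == 0 || x == W - 1)
                    v = ((c + y + x) % 2 == 0) ? +1 : -1;  // interleaved fill
                else
                    v = in[(c * height + (y - 1)) * width + (x - 1)];
                out[(c * H + y) * W + x] = v;
            }
    return out;
}
```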
Fig. 5 is a schematic diagram of a double-buffered parallel structure according to an embodiment of the present invention. The cache module sets a corresponding first cache region and second cache region for each convolution calculation array: the first cache region stores the operation result of the previous convolution acceleration array, and the second cache region stores the operation result of the corresponding convolution acceleration array; the two cache regions are filled according to the convolution calculation logic rule. Taking the pooling layer as an example, two buffers are allocated for the input feature map at the start of the calculation; the width of each buffer is the width W of the pooling layer's input feature map and its height is the convolution kernel size k. The two buffers take turns receiving the calculation results of the previous layer. When the first k-1 lines of Buffer1 are full and the k-th datum of the k-th line arrives, the sliding window becomes valid and produces a calculation result; from that moment on, every newly received datum produces one more result, which is sent to the output feature map. When Buffer1 is full, Buffer2 begins receiving data. The sliding window then waits until the k-th datum of the first line of Buffer2 arrives, at which point it becomes valid again and covers k-1 lines of data in Buffer1 and one line of data in Buffer2. From then on, as before, the window slides each time new data arrives and produces a calculation result. When the coverage of the sliding window no longer includes any data in Buffer1, Buffer1 is emptied and does not start receiving new data until Buffer2 is full, and the process repeats.
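The ping-pong behaviour of the two buffers can be illustrated with the following greatly simplified C++ sketch; it only models two k-row line buffers that are filled alternately by rows from the previous layer, and omits the case described above in which a sliding window spans rows held in both buffers. All names and the row-granular interface are assumptions made for illustration.

```cpp
// Greatly simplified sketch of the double-buffer (ping-pong) line buffers.
#include <algorithm>
#include <array>
#include <vector>
#include <cstdint>

struct LineBufferPair {
    int W, k;                                // feature-map width, window size
    std::array<std::vector<int8_t>, 2> buf;  // two k*W row buffers
    int active = 0;                          // buffer currently being filled
    int filled_rows = 0;

    LineBufferPair(int width, int window) : W(width), k(window) {
        buf[0].assign(static_cast<size_t>(k) * W, 0);
        buf[1].assign(static_cast<size_t>(k) * W, 0);
    }

    // Push one full row from the previous layer; switch buffers when full.
    void push_row(const std::vector<int8_t>& row) {
        int r = filled_rows % k;
        std::copy(row.begin(), row.end(), buf[active].begin() + r * W);
        ++filled_rows;
        if (filled_rows % k == 0) active ^= 1;  // ping-pong to the other buffer
    }

    // A sliding window can start producing results once k rows are available.
    bool window_valid() const { return filled_rows >= k; }
};
```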
As an example, the computation of the pooling layer is implemented as follows: the column vectors corresponding to the sliding window of the pooling layer are SIMD-vectorized, the maximum of each column vector is computed to form a new vector, and this new vector yields the data of the output feature map. As a preferred embodiment, when different sliding windows share the same column vector, the calculation result of that column vector can be temporarily stored in an LUT and called directly for the next sliding window's calculation. FIG. 6 is a schematic diagram of a computational implementation of a pooling layer of an embodiment of the present invention. As shown in fig. 6, take 3 × 3 max pooling with stride 2 as an example. The sliding window slides over the input feature map; each column inside the window is regarded as a vector, and each sliding window contains 3 such vectors. Since there is no data dependency between these 3 vectors, SIMD vectorization can be used to compute the maxima of the 3 column vectors simultaneously. After the 3 maxima are obtained, they form a vector whose maximum is taken, and the result is placed into the output feature map as a new element. Note that after the 3 columns of a sliding window have been computed, the result of the rightmost column is temporarily stored in an LUT and serves as the result of the first column of the next sliding window, because adjacent sliding windows share one column of data. The parallelism of the first sliding window in each row is therefore 3, and that of the remaining sliding windows is 2.
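A minimal C++ sketch of this column-wise pooling scheme is given below for a 3 × 3 window with stride 2; the single reused variable plays the role of the LUT entry holding the rightmost column maximum, and the sequential loop stands in for the SIMD-parallel column computation of the hardware.

```cpp
// Minimal sketch of column-wise 3x3 max pooling with stride 2 and reuse of
// the shared column between adjacent windows.
#include <algorithm>
#include <vector>
#include <cstdint>

std::vector<int8_t> maxpool3x3_s2(const std::vector<int8_t>& in,
                                  int height, int width) {
    const int k = 3, stride = 2;
    const int oh = (height - k) / stride + 1;
    const int ow = (width  - k) / stride + 1;
    std::vector<int8_t> out(static_cast<size_t>(oh) * ow);

    auto col_max = [&](int row0, int col) {  // max of one k-element column
        int8_t m = in[row0 * width + col];
        for (int r = 1; r < k; ++r)
            m = std::max(m, in[(row0 + r) * width + col]);
        return m;
    };

    for (int oy = 0; oy < oh; ++oy) {
        int8_t reused = 0;                   // plays the role of the LUT entry
        for (int ox = 0; ox < ow; ++ox) {
            int x0 = ox * stride, y0 = oy * stride;
            // The first window of a row computes all 3 columns; later windows
            // reuse the previously stored rightmost column maximum.
            int8_t c0 = (ox == 0) ? col_max(y0, x0) : reused;
            int8_t c1 = col_max(y0, x0 + 1);
            int8_t c2 = col_max(y0, x0 + 2);
            out[oy * ow + ox] = std::max({c0, c1, c2});
            reused = c2;
        }
    }
    return out;
}
```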
FIG. 7 is a schematic diagram of a convolution calculation implementation of an embodiment of the present invention. As shown in fig. 7, as an example, each convolution calculation layer is provided with a matrix vector multiplication unit according to the convolution calculation logic rule; one matrix vector multiplication unit includes a plurality of calculation engines, and each calculation engine includes a plurality of parallel single instruction multiple data (SIMD) channels. The calculation engines are configured to obtain the input feature maps of the picture to be processed corresponding to their parallel SIMD channels and to perform the calculation with the corresponding convolution kernel parameters located in the on-chip memory; each calculation engine receives the same control signal and the same vector data of the picture to be processed, and during calculation performs the multiply-accumulate operation with a different filter of the convolution kernel parameters.
As a further preference, the system performs the dot product calculation as follows: the elements at corresponding positions in the sliding window are combined with an XOR calculation and the result is stored in an array; the number of 1s in the array is counted through popcount; the final convolution calculation result is obtained according to the formula result = popcount(x) - [N - popcount(x)], where popcount(x) denotes the count of 1s in the vector x corresponding to the one-dimensional array, and N denotes the number of elements of the vector x, i.e. its length.
FIG. 8 is a schematic diagram of a dot product calculation implementation of an embodiment of the invention. As shown in fig. 8, the calculation data flow of one calculation engine in the matrix vector multiplication unit is taken as an example; it mainly calculates the dot product of the input vector with one row of the parameter matrix, compares the result with a threshold, and finally outputs one 1-bit value. The dot product is essentially a multiply-accumulate between two vectors, implemented here with Xnor gates for the binary neural network. The first step computes the Xnor of the elements at corresponding positions in the sliding window and stores the result in an array; the second step counts the number of 1s in the array through popcount; the third step obtains the final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]. The result is then compared with the threshold and the final output is produced. The calculation engine structure also supports non-binary calculation; it only needs the dot product gate of the dotted-line part to be replaced with a conventional multiply-accumulator.
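The following C++ sketch illustrates the bit-level form of this dot product for one 64-element slice; it uses Xnor so that popcount counts matching positions, which is consistent with the formula result = popcount(x) - [N - popcount(x)] = 2·popcount(x) - N. The packing convention (1 encodes +1, 0 encodes -1) and the 64-bit word size are assumptions made for illustration.

```cpp
// Minimal sketch of the binary dot product: the multiply-accumulate between
// two {-1,+1} vectors packed as bits becomes Xnor followed by popcount.
#include <bitset>
#include <cstdint>

// One 64-element slice of the sliding window and of one filter row,
// packed one bit per element (1 encodes +1, 0 encodes -1).
int binary_dot64(uint64_t activations, uint64_t weights, int n_valid = 64) {
    uint64_t xnor = ~(activations ^ weights);      // 1 where elements match
    if (n_valid < 64)
        xnor &= (1ULL << n_valid) - 1;             // ignore unused bit lanes
    int pop = static_cast<int>(std::bitset<64>(xnor).count());
    return pop - (n_valid - pop);                  // 2*popcount - N
}

// A thresholded 1-bit output, as produced by one calculation engine.
inline bool binary_neuron(uint64_t act, uint64_t w, int n, int threshold) {
    return binary_dot64(act, w, n) >= threshold;
}
```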
As a further improvement of the method, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rule and packs them into a parameter matrix; the sliding window is translated across the input feature map of the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain the convolution calculation result.
FIG. 9 is a diagram illustrating the interleaved ordering of the convolution calculation matrix according to an embodiment of the present invention. As shown in fig. 9, the convolution calculation can be converted into a general matrix multiplication according to the convolution calculation logic rule, that is, a matrix interleaving ordering method based on the channel dimension: the convolution kernel parameters required by the convolution calculation are packed into a parameter matrix, the sliding window is translated across the input feature map and the elements it covers are packed into an image matrix, and finally the matrices are multiplied to output the result. Because the dot product operation covers all pixel values in a sliding window and addition is commutative, the interleaved ordering inside the matrix can be any order; here the pixel values at the same position of different channels are placed together, and although other orders could be adopted, the final calculation result would not change. Note that converting the filter matrix requires no overhead because it is done before the program runs, whereas the image matrix can be converted while the program runs. FIG. 10 is a schematic diagram of a convolution calculation matrix storage implementation of an embodiment of the present invention. As shown in fig. 10, the input feature map is simply stored in a buffer in a fixed order; an address generator then fetches the memory locations corresponding to each sliding window and builds the image matrix from the data transmitted by the previous layer, using the same ordering rule as the filter matrix.
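For illustration, the following C++ sketch packs every k × k window of a CHW feature map into one row of an image matrix using a channel-interleaved ordering and multiplies it by a pre-transposed parameter matrix. The concrete interleaving order and the float element type are assumptions; in the binary case the elements would be 1-bit values handled as in the dot-product sketch above.

```cpp
// Minimal sketch of turning the convolution into a general matrix multiply
// with a channel-interleaved (per spatial offset) packing of the windows.
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Packs every kxk window of a CHW feature map into one row of the image
// matrix; elements of the same spatial offset are grouped channel by channel.
Matrix im2col_interleaved(const std::vector<float>& fmap,
                          int C, int H, int W, int k) {
    int oh = H - k + 1, ow = W - k + 1;
    Matrix im(oh * ow, std::vector<float>(C * k * k));
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {
            auto& row = im[oy * ow + ox];
            int idx = 0;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    for (int c = 0; c < C; ++c)        // channel-interleaved
                        row[idx++] = fmap[(c * H + oy + ky) * W + ox + kx];
        }
    return im;
}

// Plain matrix multiply: the image matrix (rows = output pixels) times the
// transposed parameter matrix (rows = filters) gives the output feature map.
Matrix matmul(const Matrix& a, const Matrix& b_t) {
    Matrix out(a.size(), std::vector<float>(b_t.size(), 0.0f));
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b_t.size(); ++j)
            for (size_t t = 0; t < a[i].size(); ++t)
                out[i][j] += a[i][t] * b_t[j][t];
    return out;
}
```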
FIG. 11 is a diagram illustrating an implementation of folded matrix vector multiplication according to an embodiment of the present invention. As shown in fig. 11, since almost all computations in the binary neural network can be expressed as matrix vector multiplication, the folding method largely determines the throughput of the system and also directly affects its resource utilization and energy consumption. Let the number of calculation engines in a layer be a, the number of single instruction multiple data (SIMD) channels in each calculation engine be b, and the size of the parameter matrix be m × n. The total folding factor is then (m/a) × (n/b), and the number of cycles needed to complete one matrix vector multiplication is also (m/a) × (n/b). Since the acceleration structure of the binary neural network is pipelined, the overall computational throughput is determined by the slowest layer. Therefore, different numbers of calculation engines and SIMD channels are configured for each convolutional layer and fully connected layer so that the number of cycles required per layer is approximately equal, making the forward computation of the whole network fastest. The folded structure of the matrix vector multiplication is shown in fig. 11; formalizing the folding makes it possible to fully exploit the design space, so that folding according to the computational load yields better inference performance.
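The folding arithmetic can be made concrete with the short C++ sketch below, which simply evaluates (m/a) × (n/b) for a few illustrative PE/SIMD configurations; the example layer shape is invented for illustration and is not the configuration used in the reported design.

```cpp
// Minimal sketch of the folding bookkeeping: with a compute engines (PEs)
// and b SIMD channels per engine, an m x n matrix-vector product is folded
// (m/a)*(n/b) times, which is also the cycle count per product. Balancing
// the per-layer cycle counts keeps the pipeline stages matched.
#include <cstdio>

struct LayerShape { int m, n; };             // parameter-matrix dimensions

long long fold_cycles(const LayerShape& l, int pe, int simd) {
    // Assumes m is divisible by pe and n by simd, as in the text.
    return static_cast<long long>(l.m / pe) * (l.n / simd);
}

int main() {
    // Illustrative configuration only (not the patent's measured design):
    LayerShape conv = {64, 576};             // 64 filters, 3x3x64 = 576 inputs
    for (int pe : {8, 16, 32})
        for (int simd : {16, 32, 64})
            std::printf("PE=%2d SIMD=%2d -> %lld cycles per MVM\n",
                        pe, simd, fold_cycles(conv, pe, simd));
    return 0;
}
```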
FIG. 12 is a schematic diagram of the processing flow of the acceleration system according to an embodiment of the present invention. As shown in fig. 12, the processing flow includes three stages. The first stage is binary neural network initialization and picture preprocessing, which includes importing the bitstream file, initializing the network structure, interleaving and sorting the weight parameters, allocating on-chip memory, and resizing the picture (to 32 × 32 × 3). The second stage is the acceleration process on the FPGA, which produces a one-dimensional feature vector. The third stage is the return stage, which includes the classification of the feature vector on the ARM processor.
A network improved from VGG16 is accelerated on a Xilinx PYNQ-Z1 lightweight development board using the Vivado HLS high-level synthesis tool. Departing from the traditional way of implementing a convolutional neural network on an FPGA, the overall hardware design adopts a pipelined computing architecture based on the FPGA's on-chip memory, which reduces the communication cost between on-chip and off-chip memory and greatly improves overall parallelism. Meanwhile, the convolution layers, pooling layers, regular normalization layers and fully connected layers in the binary neural network are each optimized accordingly. To fully exploit the parallel potential, a matrix vector multiplication unit is designed to support the convolution layer calculation of the network. By configuring different numbers of PE and SIMD channels for each layer of the network, the model achieves locally optimal performance per layer and finally an overall optimal performance, obtaining higher data throughput, faster processing speed and lower power consumption. Table 1 is a schematic table of a fully binarized network structure according to an embodiment of the present invention. As shown in table 1, with the final acceleration scheme the fully binarized network structure runs forward at a processing speed of 844 FPS with a data throughput of 3.8 TOPS; the whole accelerator consumes only 2.3 W and the model accuracy is 83.6%.
Table 1 Fully binarized network structure of an embodiment of the present invention
Layer   Input_Size   Kernel_size    Output_Size   Operations   Size(KB)
Conv_0  32×32×3      3×3×3×64       30×30×64      3110400      5.0
Conv_1  30×30×64     3×3×64×64      28×28×64      57802752     9.5
pool_0  28×28×64     2×2            14×14×64      \            \
Conv_2  14×14×64     3×3×64×128     12×12×128     21233664     19.0
Conv_3  12×12×128    3×3×128×128    10×10×128     29491200     37.0
pool_1  10×10×128    2×2            5×5×128       \            \
Conv_4  5×5×128      3×3×128×256    3×3×256       5308416      74.0
Conv_5  3×3×256      3×3×256×256    1×1×256       1179648      146.0
Fc_6    1×1×256      1×1×256×512    1×1×512       262144       260.0
Fc_7    1×1×512      1×1×512×512    1×1×512       524288       260.0
Fc_8    1×1×512      1×1×512×10     1×1×10        10240        64.0
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A binary neural network accelerating system based on FPGA comprises a convolution kernel parameter acquisition module formed by FPGA, a binary convolution neural network structure and a cache module, wherein the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling convolution calculation logic rules to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rules, convolution calculation of one thread corresponds to the convolution basic calculation modules, and a plurality of convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is to sequentially perform convolution, PRelu activation, regular normalization and binary activation calculation, or sequentially perform convolution, PRelu activation, pooling, regular normalization and binary activation calculation;
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
2. The FPGA-based binary neural network acceleration system of claim 1, characterized in that, the FPGA configures a corresponding control register through an ARM end, and loads an image from an external memory DDR3 to a buffer area of an on-chip memory through an AXI bus; the FPGA allocates a plurality of processing engines for the convolution basic computation module, and each processing engine comprises an arithmetic operation component, a logic operation component, a bit operation component and a storage resource.
3. The FPGA-based binary neural network acceleration system of claim 2, wherein the convolution calculation layers are classified into convolution layers, PRelu activation layers, pooling layers, regular normalization layers and binary activation layers according to a preset convolution calculation sequence, and are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation, a convolution acceleration array is formed by using a plurality of calculation engines located in the same convolution calculation layer, and a convolution acceleration array is implemented by using a PE module of the FPGA.
4. The FPGA-based binary neural network acceleration system of claim 3, wherein the cache module is configured to set a first cache region and a second cache region for a convolution calculation array, respectively, the first cache region is configured to store the operation result of a previous convolution acceleration array, and the second cache region is configured to store the operation result of a corresponding convolution acceleration array.
5. The FPGA-based binary neural network acceleration system of claim 3, wherein the computation of the pooling layer is implemented by: and performing SIMD vectorization on the column vectors corresponding to the sliding window of the pooling layer, solving respective maximum values of all the column vectors to form a new vector, and taking the new vector as data of the output feature map.
6. The FPGA-based binary neural network acceleration system of claim 5, wherein when the same column vector exists in different sliding windows of the pooling layer, the calculation result of the same column vector is temporarily stored in an LUT, and the temporary storage value in the LUT is directly called when the same column vector calculation is performed in the next sliding window.
7. The FPGA-based binary neural network acceleration system of claim 3, wherein the FPGA is provided with a matrix vector multiplication unit for each convolution calculation layer according to a convolution calculation logic rule, the matrix vector multiplication unit comprises a plurality of calculation engines, the calculation engines comprise a plurality of parallel single instruction multiple data flow channels, the calculation engines are used for obtaining input feature maps of the pictures to be processed corresponding to the plurality of parallel single instruction multiple data flow channels, and different filters corresponding to convolution kernel parameters perform multiplication accumulation operation.
8. The FPGA-based binary neural network acceleration system of claim 7, wherein the system performs the dot product calculation process by: solving the XOR of the elements at the corresponding positions in the sliding window, and storing the XOR result in an array; counting the number of 1s in the array through popcount; obtaining a final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]; wherein popcount(x) represents the number of 1s in the vector x corresponding to the one-dimensional array, and N represents the number of elements corresponding to the vector x in popcount(x).
9. The FPGA-based binary neural network acceleration system of claim 3, wherein the FPGA sorts convolution kernel parameters required by convolution calculation according to a convolution calculation logic rule and then packages the parameters into a parameter matrix, the sliding output window translates the input feature map covering the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain a convolution calculation result.
10. The FPGA-based binary neural network acceleration system of any one of claims 1-9, wherein the convolution basic computation module combines PRelu activation, regular normalization and binary activation into a simple binary function through a common affine function pattern.
CN201910636517.2A 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system Active CN110458279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Publications (2)

Publication Number Publication Date
CN110458279A CN110458279A (en) 2019-11-15
CN110458279B true CN110458279B (en) 2022-05-20

Family

ID=68481247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636517.2A Active CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Country Status (1)

Country Link
CN (1) CN110458279B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126309A (en) * 2019-12-26 2020-05-08 长沙海格北斗信息技术有限公司 Convolutional neural network architecture method based on FPGA and face recognition method thereof
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111401543B (en) * 2020-06-08 2020-11-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
US20220019872A1 (en) * 2020-07-14 2022-01-20 United Microelectronics Centre (Hong Kong) Limited Processor, logic chip and method for binarized convolution neural network
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
CN114201726B (en) * 2020-09-18 2023-02-10 深圳先进技术研究院 Convolution operation optimization method, system, terminal and storage medium
CN112418417B (en) * 2020-09-24 2024-02-27 北京计算机技术及应用研究所 Convolutional neural network acceleration device and method based on SIMD technology
CN112153347B (en) * 2020-09-27 2023-04-07 北京天玛智控科技股份有限公司 Coal mine underground intelligent visual terminal sensing method, storage medium and electronic equipment
CN112308762A (en) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device
CN112199896A (en) * 2020-10-26 2021-01-08 云中芯半导体技术(苏州)有限公司 Chip logic comprehensive optimization acceleration method based on machine learning
CN112487448B (en) * 2020-11-27 2024-05-03 珠海零边界集成电路有限公司 Encryption information processing device, method and computer equipment
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113301221B (en) * 2021-03-19 2022-09-09 西安电子科技大学 Image processing method of depth network camera and terminal
CN113298236B (en) * 2021-06-18 2023-07-21 中国科学院计算技术研究所 Low-precision neural network computing device and acceleration method based on data flow structure
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114202071B (en) * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114662660A (en) * 2022-03-14 2022-06-24 昆山市工业技术研究院有限责任公司 CNN accelerator data access method and system
CN114897159B (en) * 2022-05-18 2023-05-12 电子科技大学 Method for rapidly deducing electromagnetic signal incident angle based on neural network
CN115083462B (en) * 2022-07-14 2022-11-11 中科南京智能技术研究院 Digital in-memory computing device based on Sram
CN117114055B (en) * 2023-10-24 2024-04-09 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802992B2 (en) * 2016-08-12 2020-10-13 Xilinx Technology Beijing Limited Combining CPU and special accelerator for implementing an artificial neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights;Yunzhi Duan et al.;《2018 IEEE 23rd International Conference on Digital Signal Processing(DSP)》;20190204;全文 *
一种基于FPGA的卷积神经网络加速器设计与实现;仇越等;《微电子学与计算机》;20180805(第08期);全文 *

Also Published As

Publication number Publication date
CN110458279A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN106970896B (en) Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
US11989638B2 (en) Convolutional neural network accelerating device and method with input data conversion
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN111738433B (en) Reconfigurable convolution hardware accelerator
CN110991631A (en) Neural network acceleration system based on FPGA
Liu et al. Towards an efficient accelerator for DNN-based remote sensing image segmentation on FPGAs
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
Wang et al. A low-latency sparse-winograd accelerator for convolutional neural networks
CN111768458A (en) Sparse image processing method based on convolutional neural network
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
Chang et al. VSCNN: Convolution neural network accelerator with vector sparsity
Li et al. A novel software-defined convolutional neural networks accelerator
US20230376733A1 (en) Convolutional neural network accelerator hardware

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant