CN110458279B - FPGA-based binary neural network acceleration method and system

FPGA-based binary neural network acceleration method and system

Info

Publication number
CN110458279B
CN110458279B
Authority
CN
China
Prior art keywords
convolution
calculation
fpga
neural network
binary
Prior art date
Legal status
Active
Application number
CN201910636517.2A
Other languages
Chinese (zh)
Other versions
CN110458279A (en
Inventor
李开
邹复好
祁迪
Current Assignee
Wuhan Meitong Technology Co ltd
Original Assignee
Wuhan Meitong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Meitong Technology Co ltd filed Critical Wuhan Meitong Technology Co ltd
Priority to CN201910636517.2A priority Critical patent/CN110458279B/en
Publication of CN110458279A publication Critical patent/CN110458279A/en
Application granted granted Critical
Publication of CN110458279B publication Critical patent/CN110458279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an FPGA-based binary neural network acceleration system, which uses a convolution kernel parameter acquisition module, a binary convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA. Each module obtains a convolution calculation logic rule and correspondingly performs binary convolution calculation on the acquired input feature map of the picture to be processed, and the FPGA traverses the convolution calculations of a plurality of threads according to the convolution calculation logic rule to obtain the output feature map data of the picture to be processed. The whole architecture offloads the computation of every layer of the binary neural network entirely to the on-chip memory, without depending on interaction between off-chip and on-chip memory, so the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed for the picture to be detected is increased.

Description

FPGA-based binary neural network acceleration method and system
Technical Field
The invention belongs to the field of image processing, and particularly relates to a binary neural network acceleration method and system based on an FPGA.
Background
Significant advances in artificial intelligence technology have begun to benefit many aspects of human life. From household vacuum robots to fully intelligent production lines in factories, many tasks around the world have become highly automated. Deep learning plays a central role in this technological revolution and is widely applied in face recognition, object detection, image processing and other fields. The main algorithm used is the convolutional neural network, and well-performing deep learning algorithms have already been deployed on a large number of PCs, mobile phones and dedicated embedded accelerators to carry out various intelligent computing tasks with good acceleration results.
The convolutional neural network (CNN) is one of the most important branches of deep learning; it is the most mature and is widely applied to all kinds of image and video processing tasks. Its rapid development benefits not only from the growth of training data and computing power but also from the various convolutional neural network frameworks. Most existing CNN applications are deployed on servers or desktop platforms, while mobile terminals are the application platform with the broadest reach and the largest number of users, so porting CNN applications to mobile devices would do the most to advance the development of deep learning applications.
However, such mobile terminals and embedded computing devices provide only limited computing power and small on-chip storage. As the model structures of convolutional neural networks become more complex, the networks deeper and the parameter counts larger, deploying them on mobile and embedded devices becomes increasingly difficult. Performing this huge amount of computation with 32-bit floating-point operands on a lightweight chip undoubtedly consumes enormous computing resources and makes a good real-time effect hard to achieve.
Disclosure of Invention
Aiming at the defects or improvement requirements of the prior art, the invention provides an FPGA (field programmable gate array) based binary neural network acceleration system. The system uses a convolution kernel parameter acquisition module, a binary convolutional neural network structure and a cache module formed on the FPGA, the cache module being the on-chip memory of the FPGA. Each module performs the corresponding binary convolution calculation according to the acquired convolution calculation logic rule, and the whole architecture offloads the computation of every layer of the binary neural network entirely to the on-chip memory without depending on interaction between off-chip and on-chip memory, so the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed of the image to be detected is increased.
To achieve the above object, according to one aspect of the present invention, there is provided an FPGA-based binary neural network acceleration system, which includes a convolution kernel parameter acquisition module formed by using an FPGA, a binary convolution neural network structure, and a cache module, the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling the convolution calculation logic rule to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rule, the convolution calculation of one thread corresponds to a plurality of the convolution basic calculation modules, and the convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is to sequentially perform convolution, PRelu activation, regular normalization and binary activation calculation, or sequentially perform convolution, PRelu activation, pooling, regular normalization and binary activation calculation;
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
As a further improvement of the invention, the FPGA configures the corresponding control register through the ARM end and then loads the image from the external memory DDR3 into a buffer area of the on-chip memory through the AXI bus; the FPGA allocates a plurality of processing engines to the convolution basic computation module, and each processing engine comprises an arithmetic operation component, a logic operation component, a bit operation component and storage resources.
As a further improvement of the invention, the convolution calculation layers are classified into a convolution layer, a PRelu activation layer, a pooling layer, a regular normalization layer and a binary activation layer according to a preset convolution calculation sequence, the convolution layer, the PRelu activation layer, the pooling layer, the regular normalization layer and the binary activation layer are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation, a plurality of calculation engines located in the same convolution calculation layer are used for forming a convolution acceleration array, and one convolution acceleration array is realized by using one PE module of an FPGA.
As a further improvement of the present invention, the cache module sets a corresponding first cache region and a second cache region for a convolution calculation array, respectively, where the first cache region is used to store the operation result of the previous convolution acceleration array, and the second cache region is used to store the operation result of the corresponding convolution acceleration array.
As a further improvement of the invention, the calculation implementation process of the pooling layer is as follows: and performing SIMD vectorization on the column vectors corresponding to the sliding window of the pooling layer, solving respective maximum values of all the column vectors to form a new vector, and taking the new vector as data of the output feature map.
As a further improvement of the invention, when different sliding windows of the pooling layer have the same column vector, the calculation result of the same column vector is put into an LUT for temporary storage, and the temporary storage value in the LUT is directly called when the next sliding window performs the calculation of the same column vector.
As a further improvement of the invention, the FPGA is provided with a matrix vector multiplication unit for each convolution calculation layer according to the convolution calculation logic rule; the matrix vector multiplication unit comprises a plurality of calculation engines, each of which contains a plurality of parallel single instruction multiple data (SIMD) channels, and each calculation engine is used for acquiring the input feature map of the picture to be processed corresponding to its parallel SIMD channels and performing multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further improvement of the invention, the system performs the dot product calculation process by: performing XOR on the corresponding position elements in the sliding window and storing the XOR result in an array; counting the number of 1s in the array through popcount; and obtaining the final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]; wherein popcount(x) denotes the count of 1s in the vector x corresponding to the one-dimensional array, and N denotes the number of elements of the vector x in popcount(x).
As a further improvement of the method, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rule and packs them into a parameter matrix; the sliding window is translated across the input feature map of the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain the convolution calculation result.
As a further improvement of the invention, the convolution basic computation module combines the PRelu activation, regular normalization and binary activation into a simple binary function through a common affine function.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
the invention relates to a binary neural network acceleration system based on FPGA, which utilizes a convolution kernel parameter acquisition module, a binary convolution neural network structure and a cache module formed by FPGA, wherein the cache module is an on-chip memory of FPGA, each module carries out corresponding binary convolution calculation according to the acquired convolution calculation logic rule, and the calculated amount of each layer in the binary neural network is completely unloaded to the on-chip memory through the whole framework without depending on the interaction of the off-chip memory and the on-chip memory, so that the communication cost between memories is reduced, the calculation efficiency is greatly improved, and the detection speed of an image to be detected is improved.
In the FPGA-based binary neural network acceleration system of the invention, the dot product operation in the binary neural network is replaced by an Xnor logic operation and a popcount operation; the binary operation is a dot product between 1-bit weights and 1-bit input image parameters. This replacement makes binary convolution calculation much faster than full-precision convolution calculation. Meanwhile, the blank parts of the feature map are filled with parity-interleaved values instead of the all +1 filling used in previous work, which preserves model accuracy to a certain extent.
In the FPGA-based binary neural network acceleration system of the invention, the matrix vector multiplication unit provided on the FPGA reorganizes elements by interleaving the parameter matrix offline and interleaving the input feature map through the sliding window unit, and feeds the reorganized data vectors into the convolution acceleration matrix, thereby achieving fully parallelized calculation.
In the FPGA-based binary neural network acceleration system of the invention, the calculation of each convolution calculation layer is accelerated through a double-buffer parallel mechanism; the buffer areas adopt a pipelined structure, the sliding window is driven by the output data of the previous layer, and the data inside the sliding window are processed fully in parallel, which further improves calculation efficiency.
According to the FPGA-based binary neural network acceleration system, the PRelu activation, the regular normalization and the binary activation are combined into a simple binary function through a common affine function mode, so that the calculation complexity caused by the regular normalization is greatly reduced.
Drawings
FIG. 1 is a schematic structural diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an FPGA-based on-chip memory according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a convolution basic calculation module according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of parity interleaved padding in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a double buffered parallel architecture of an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computational implementation of a pooling layer of an embodiment of the present invention;
FIG. 7 is a schematic diagram of a convolution calculation implementation of an embodiment of the present invention;
FIG. 8 is a schematic diagram of a dot product calculation implementation of an embodiment of the present invention;
FIG. 9 is a schematic diagram of a convolution calculation matrix interleaving ordering according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a convolution calculation matrix storage implementation of an embodiment of the present invention;
FIG. 11 is a schematic diagram of a folded matrix vector multiplication implementation of an embodiment of the present invention;
FIG. 12 is a schematic diagram of an acceleration system process flow according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The present invention will be described in further detail with reference to specific embodiments.
Fig. 1 is a schematic structural diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention. As shown in fig. 1, a binary neural network acceleration system based on FPGA comprises a convolution kernel parameter acquisition module formed by FPGA, a binary convolution neural network structure and a cache module, wherein the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling the convolution calculation logic rule to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rule, the convolution calculation of one thread corresponds to a plurality of the convolution basic calculation modules, and a plurality of convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is that convolution, PRelu activation, regular normalization and binary activation calculation are sequentially carried out, or convolution, PRelu activation, pooling, regular normalization and binary activation calculation are sequentially carried out; as an example, the convolution basic computation module can combine the PRelu activation, the regular normalization and the binary activation into a simple binary function through a common affine function, so as to reduce the computation complexity caused by the regular normalization (a sketch of this merged threshold form is given after the module descriptions below);
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
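As an illustrative aside (not part of the claimed embodiment), the following C++ sketch shows one way the PRelu activation, regular normalization and binary activation mentioned above can be merged, via the affine form of the normalization, into a single per-channel threshold test. The function names, the per-channel parameterization and the assumption of a positive PRelu slope are introduced here for illustration only.

```cpp
// Minimal sketch (not the patent's exact formulation): collapsing
// PRelu -> batch normalization -> binary activation into one per-channel
// threshold test, assuming a positive PRelu slope and per-channel
// normalization parameters (gamma, beta, mean, var).
#include <cmath>
#include <cstdint>

struct ChannelThreshold {
    float threshold;  // value the raw convolution sum is compared against
    bool  flip;       // true when gamma < 0 reverses the comparison
};

// Precomputed offline, once per output channel.
ChannelThreshold fuse_activation(float prelu_slope, float gamma, float beta,
                                 float mean, float var, float eps = 1e-5f) {
    // BN(PRelu(x)) >= 0  <=>  PRelu(x) >= mean - beta*sqrt(var+eps)/gamma
    float t = mean - beta * std::sqrt(var + eps) / gamma;
    // Invert PRelu (monotonic for slope > 0): if t < 0 the threshold in the
    // pre-activation domain is t / slope, otherwise it is t itself.
    ChannelThreshold ct;
    ct.threshold = (t < 0.0f) ? t / prelu_slope : t;
    ct.flip = (gamma < 0.0f);
    return ct;
}

// Applied per output value at run time: one compare instead of the full chain.
inline int8_t binary_activate(float conv_sum, const ChannelThreshold& ct) {
    bool positive = (conv_sum >= ct.threshold);
    return (positive != ct.flip) ? +1 : -1;
}
```

With this precomputation the inter-module data exchanged at run time is a single bit per value, consistent with the binary inter-block transfers described below.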
Fig. 2 is a schematic diagram of an FPGA-based on-chip memory according to an embodiment of the present invention. As shown in fig. 2, the convolutional kernel data storage corresponding to the convolutional calculation logic rule is realized through a stream computing architecture based on the FPGA on-chip memory, which not only reduces the communication cost of the on-chip memory and the off-chip memory, but also greatly improves the overall parallelism of the convolutional calculation.
As an example, when the convolution basic computation module is implemented in hardware on the FPGA, the overall hardware architecture first configures the corresponding control registers through the ARM end and then loads the image from the external memory DDR3 into a buffer area of the on-chip memory through the AXI bus. The FPGA can allocate a large number of processing engines to the operation of a convolution basic computation module, each comprising arithmetic operation components, logic operation components, bit operation components and storage resources. As a preferred embodiment, the convolution calculation layers are classified, according to the preset convolution calculation sequence, into a convolution layer, a PRelu activation layer, a pooling layer, a regular normalization layer and a binary activation layer, which are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation. A plurality of calculation engines located in the same convolution calculation layer can be used to form a convolution acceleration array, and one convolution acceleration array can be realized with one PE module of the FPGA. Preferably, each convolution acceleration array is allocated a double-buffer structure: one buffer stores the operation result of the previous layer of the network, and the other stores the operation result of the current layer.
Fig. 3 is a schematic structural diagram of a convolution basic calculation module according to an embodiment of the present invention. Adding PRelu activation to the convolution basic calculation module can improve the accuracy of the original model by 2 percent. The module order of a common binary neural network is: regular normalization, convolution, binary activation and pooling. Before recombination, the input of each module is transmitted after the previous module has performed the PRelu + Pool calculation, so the data received and transmitted by the inter-block buffer area is non-binary. The convolution basic calculation module therefore adjusts its processing order to the preset convolution calculation sequence described above (convolution, PRelu activation, optional pooling, regular normalization and binary activation). After recombination, the data transmitted between convolution basic calculation modules has already been processed by the binary activation function and is therefore binary, so the amount of data exchanged between blocks is greatly reduced, the communication cost between blocks is lowered, and a uniform interface can easily be designed for all convolution basic calculation modules; at the same time, the size of the buffers used for exchanging data is reduced, saving hardware resources.
FIG. 4 is a diagram of parity interleaved padding in accordance with an embodiment of the present invention. As shown in fig. 4, parity interleaving is used in the convolution calculation to fill the blank data of the output feature map so as to preserve its dimensions; specifically, the padding values alternate between +1 and -1 in parity order along the width-height dimensions and the channel dimension of the feature map. On the CIFAR-10 data set, the error rate of the network model trained with parity filling is only 11.50%, close to that of all-0 padding at full precision, and lower than the 13.76% of all +1 filling and the 12.85% of odd (even) filling.
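As a non-limiting sketch of the parity-interleaved filling described above, the following C++ routine pads a {-1,+1} feature map with a one-pixel border whose values alternate with the parity of the (channel + row + column) index. The exact interleaving order used by the embodiment may differ, so this ordering is an assumption.

```cpp
// Minimal sketch of parity-interleaved padding (assumed ordering): border
// elements alternate between +1 and -1 according to the parity of their
// (channel + row + column) index, instead of padding everything with +1.
#include <vector>
#include <cstdint>

// Pads a CHW feature map of {-1,+1} values with a 1-pixel border.
std::vector<int8_t> pad_parity(const std::vector<int8_t>& in,
                               int channels, int height, int width) {
    const int H = height + 2, W = width + 2;
    std::vector<int8_t> out(static_cast<size_t>(channels) * H * W);
    for (int c = 0; c < channels; ++c)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                int8_t v;
                if (y == 0 || y == H - 1 || x == 0 || x == W - 1)
                    v = ((c + y + x) % 2 == 0) ? +1 : -1;  // interleaved fill
                else
                    v = in[(c * height + (y - 1)) * width + (x - 1)];
                out[(c * H + y) * W + x] = v;
            }
    return out;
}
```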
Fig. 5 is a schematic diagram of a double-buffered parallel structure according to an embodiment of the present invention. The cache module sets a corresponding first cache region and second cache region for each convolution calculation array: the first cache region stores the operation result of the previous convolution acceleration array, and the second cache region stores the operation result of the corresponding convolution acceleration array; the two cache regions are filled according to the convolution calculation logic rule. Taking the pooling layer as an example, two buffers are allocated for the input feature map at the start of the calculation; the width of each buffer is the width W of the pooling layer's input feature map and its height is the convolution kernel size k. The two buffers take turns receiving the calculation results of the previous layer. When the first k-1 lines of Buffer1 are full and the k-th datum of the k-th line arrives, the sliding window becomes valid and produces a calculation result; from that moment on, every newly received datum produces one more result, which is sent to the output feature map. When Buffer1 is full, Buffer2 begins receiving data. The sliding window then waits until the k-th datum of the first line of Buffer2 arrives, at which point it becomes valid again and covers k-1 lines of data in Buffer1 and one line of data in Buffer2. From then on, as before, the window slides each time new data arrives and produces a calculation result. When the coverage of the sliding window no longer includes any data in Buffer1, Buffer1 is emptied and does not start receiving new data until Buffer2 is full, and the process repeats.
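The ping-pong behaviour of the two buffers can be illustrated with the following greatly simplified C++ sketch; it only models two k-row line buffers that are filled alternately by rows from the previous layer, and omits the case described above in which a sliding window spans rows held in both buffers. All names and the row-granular interface are assumptions made for illustration.

```cpp
// Greatly simplified sketch of the double-buffer (ping-pong) line buffers.
#include <algorithm>
#include <array>
#include <vector>
#include <cstdint>

struct LineBufferPair {
    int W, k;                                // feature-map width, window size
    std::array<std::vector<int8_t>, 2> buf;  // two k*W row buffers
    int active = 0;                          // buffer currently being filled
    int filled_rows = 0;

    LineBufferPair(int width, int window) : W(width), k(window) {
        buf[0].assign(static_cast<size_t>(k) * W, 0);
        buf[1].assign(static_cast<size_t>(k) * W, 0);
    }

    // Push one full row from the previous layer; switch buffers when full.
    void push_row(const std::vector<int8_t>& row) {
        int r = filled_rows % k;
        std::copy(row.begin(), row.end(), buf[active].begin() + r * W);
        ++filled_rows;
        if (filled_rows % k == 0) active ^= 1;  // ping-pong to the other buffer
    }

    // A sliding window can start producing results once k rows are available.
    bool window_valid() const { return filled_rows >= k; }
};
```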
As an example, the computation of the pooling layer is implemented as follows: the column vectors corresponding to the sliding window of the pooling layer are SIMD-vectorized, the maximum of each column vector is computed to form a new vector, and this new vector yields the data of the output feature map. As a preferred embodiment, when different sliding windows share the same column vector, the calculation result of that column vector can be temporarily stored in an LUT and called directly for the next sliding window's calculation. FIG. 6 is a schematic diagram of a computational implementation of a pooling layer of an embodiment of the present invention. As shown in fig. 6, take 3 × 3 max pooling with stride 2 as an example. The sliding window slides over the input feature map; each column inside the window is regarded as a vector, and each sliding window contains 3 such vectors. Since there is no data dependency between these 3 vectors, SIMD vectorization can be used to compute the maxima of the 3 column vectors simultaneously. After the 3 maxima are obtained, they form a vector whose maximum is taken, and the result is placed into the output feature map as a new element. Note that after the 3 columns of a sliding window have been computed, the result of the rightmost column is temporarily stored in an LUT and serves as the result of the first column of the next sliding window, because adjacent sliding windows share one column of data. The parallelism of the first sliding window in each row is therefore 3, and that of the remaining sliding windows is 2.
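A minimal C++ sketch of this column-wise pooling scheme is given below for a 3 × 3 window with stride 2; the single reused variable plays the role of the LUT entry holding the rightmost column maximum, and the sequential loop stands in for the SIMD-parallel column computation of the hardware.

```cpp
// Minimal sketch of column-wise 3x3 max pooling with stride 2 and reuse of
// the shared column between adjacent windows.
#include <algorithm>
#include <vector>
#include <cstdint>

std::vector<int8_t> maxpool3x3_s2(const std::vector<int8_t>& in,
                                  int height, int width) {
    const int k = 3, stride = 2;
    const int oh = (height - k) / stride + 1;
    const int ow = (width  - k) / stride + 1;
    std::vector<int8_t> out(static_cast<size_t>(oh) * ow);

    auto col_max = [&](int row0, int col) {  // max of one k-element column
        int8_t m = in[row0 * width + col];
        for (int r = 1; r < k; ++r)
            m = std::max(m, in[(row0 + r) * width + col]);
        return m;
    };

    for (int oy = 0; oy < oh; ++oy) {
        int8_t reused = 0;                   // plays the role of the LUT entry
        for (int ox = 0; ox < ow; ++ox) {
            int x0 = ox * stride, y0 = oy * stride;
            // The first window of a row computes all 3 columns; later windows
            // reuse the previously stored rightmost column maximum.
            int8_t c0 = (ox == 0) ? col_max(y0, x0) : reused;
            int8_t c1 = col_max(y0, x0 + 1);
            int8_t c2 = col_max(y0, x0 + 2);
            out[oy * ow + ox] = std::max({c0, c1, c2});
            reused = c2;
        }
    }
    return out;
}
```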
FIG. 7 is a schematic diagram of a convolution calculation implementation of an embodiment of the present invention. As shown in fig. 7, as an example, each convolution calculation layer is provided with a matrix vector multiplication unit according to the convolution calculation logic rule; one matrix vector multiplication unit includes a plurality of calculation engines, and each calculation engine includes a plurality of parallel single instruction multiple data (SIMD) channels. The calculation engines are configured to obtain the input feature maps of the picture to be processed corresponding to their parallel SIMD channels and to perform the calculation with the corresponding convolution kernel parameters located in the on-chip memory; each calculation engine receives the same control signal and the same vector data of the picture to be processed, and during calculation performs the multiply-accumulate operation with a different filter of the convolution kernel parameters.
As a further preference, the system performs the dot product calculation as follows: the elements at corresponding positions in the sliding window are combined with an XOR calculation and the result is stored in an array; the number of 1s in the array is counted through popcount; the final convolution calculation result is obtained according to the formula result = popcount(x) - [N - popcount(x)], where popcount(x) denotes the count of 1s in the vector x corresponding to the one-dimensional array, and N denotes the number of elements of the vector x, i.e. its length.
FIG. 8 is a schematic diagram of a dot product calculation implementation of an embodiment of the invention. As shown in fig. 8, the calculation data flow of one calculation engine in the matrix vector multiplication unit is taken as an example; it mainly calculates the dot product of the input vector with one row of the parameter matrix, compares the result with a threshold, and finally outputs one 1-bit value. The dot product is essentially a multiply-accumulate between two vectors, implemented here with Xnor gates for the binary neural network. The first step computes the Xnor of the elements at corresponding positions in the sliding window and stores the result in an array; the second step counts the number of 1s in the array through popcount; the third step obtains the final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]. The result is then compared with the threshold and the final output is produced. The calculation engine structure also supports non-binary calculation; it only needs the dot product gate of the dotted-line part to be replaced with a conventional multiply-accumulator.
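The following C++ sketch illustrates the bit-level form of this dot product for one 64-element slice; it uses Xnor so that popcount counts matching positions, which is consistent with the formula result = popcount(x) - [N - popcount(x)] = 2·popcount(x) - N. The packing convention (1 encodes +1, 0 encodes -1) and the 64-bit word size are assumptions made for illustration.

```cpp
// Minimal sketch of the binary dot product: the multiply-accumulate between
// two {-1,+1} vectors packed as bits becomes Xnor followed by popcount.
#include <bitset>
#include <cstdint>

// One 64-element slice of the sliding window and of one filter row,
// packed one bit per element (1 encodes +1, 0 encodes -1).
int binary_dot64(uint64_t activations, uint64_t weights, int n_valid = 64) {
    uint64_t xnor = ~(activations ^ weights);      // 1 where elements match
    if (n_valid < 64)
        xnor &= (1ULL << n_valid) - 1;             // ignore unused bit lanes
    int pop = static_cast<int>(std::bitset<64>(xnor).count());
    return pop - (n_valid - pop);                  // 2*popcount - N
}

// A thresholded 1-bit output, as produced by one calculation engine.
inline bool binary_neuron(uint64_t act, uint64_t w, int n, int threshold) {
    return binary_dot64(act, w, n) >= threshold;
}
```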
As a further improvement of the method, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rule and packs them into a parameter matrix; the sliding window is translated across the input feature map of the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain the convolution calculation result.
FIG. 9 is a diagram illustrating the interleaved ordering of the convolution calculation matrix according to an embodiment of the present invention. As shown in fig. 9, the convolution calculation can be converted into a general matrix multiplication according to the convolution calculation logic rule, that is, a matrix interleaving ordering method based on the channel dimension: the convolution kernel parameters required by the convolution calculation are packed into a parameter matrix, the sliding window is translated across the input feature map and the elements it covers are packed into an image matrix, and finally the matrices are multiplied to output the result. Because the dot product operation covers all pixel values in a sliding window and addition is commutative, the interleaved ordering inside the matrix can be any order; here the pixel values at the same position of different channels are placed together, and although other orders could be adopted, the final calculation result would not change. Note that converting the filter matrix requires no overhead because it is done before the program runs, whereas the image matrix can be converted while the program runs. FIG. 10 is a schematic diagram of a convolution calculation matrix storage implementation of an embodiment of the present invention. As shown in fig. 10, the input feature map is simply stored in a buffer in a fixed order; an address generator then fetches the memory locations corresponding to each sliding window and builds the image matrix from the data transmitted by the previous layer, using the same ordering rule as the filter matrix.
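For illustration, the following C++ sketch packs every k × k window of a CHW feature map into one row of an image matrix using a channel-interleaved ordering and multiplies it by a pre-transposed parameter matrix. The concrete interleaving order and the float element type are assumptions; in the binary case the elements would be 1-bit values handled as in the dot-product sketch above.

```cpp
// Minimal sketch of turning the convolution into a general matrix multiply
// with a channel-interleaved (per spatial offset) packing of the windows.
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Packs every kxk window of a CHW feature map into one row of the image
// matrix; elements of the same spatial offset are grouped channel by channel.
Matrix im2col_interleaved(const std::vector<float>& fmap,
                          int C, int H, int W, int k) {
    int oh = H - k + 1, ow = W - k + 1;
    Matrix im(oh * ow, std::vector<float>(C * k * k));
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {
            auto& row = im[oy * ow + ox];
            int idx = 0;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    for (int c = 0; c < C; ++c)        // channel-interleaved
                        row[idx++] = fmap[(c * H + oy + ky) * W + ox + kx];
        }
    return im;
}

// Plain matrix multiply: the image matrix (rows = output pixels) times the
// transposed parameter matrix (rows = filters) gives the output feature map.
Matrix matmul(const Matrix& a, const Matrix& b_t) {
    Matrix out(a.size(), std::vector<float>(b_t.size(), 0.0f));
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b_t.size(); ++j)
            for (size_t t = 0; t < a[i].size(); ++t)
                out[i][j] += a[i][t] * b_t[j][t];
    return out;
}
```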
FIG. 11 is a diagram illustrating an implementation of folded matrix vector multiplication according to an embodiment of the present invention. As shown in fig. 11, since almost all computations in the binary neural network can be expressed as matrix vector multiplication, the folding method largely determines the throughput of the system and also directly affects its resource utilization and energy consumption. Let the number of calculation engines in a layer be a, the number of single instruction multiple data (SIMD) channels in each calculation engine be b, and the size of the parameter matrix be m × n. The total folding factor is then (m/a) × (n/b), and the number of cycles needed to complete one matrix vector multiplication is also (m/a) × (n/b). Since the acceleration structure of the binary neural network is pipelined, the overall computational throughput is determined by the slowest layer. Therefore, different numbers of calculation engines and SIMD channels are configured for each convolutional layer and fully connected layer so that the number of cycles required per layer is approximately equal, making the forward computation of the whole network fastest. The folded structure of the matrix vector multiplication is shown in fig. 11; formalizing the folding makes it possible to fully exploit the design space, so that folding according to the computational load yields better inference performance.
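The folding arithmetic can be made concrete with the short C++ sketch below, which simply evaluates (m/a) × (n/b) for a few illustrative PE/SIMD configurations; the example layer shape is invented for illustration and is not the configuration used in the reported design.

```cpp
// Minimal sketch of the folding bookkeeping: with a compute engines (PEs)
// and b SIMD channels per engine, an m x n matrix-vector product is folded
// (m/a)*(n/b) times, which is also the cycle count per product. Balancing
// the per-layer cycle counts keeps the pipeline stages matched.
#include <cstdio>

struct LayerShape { int m, n; };             // parameter-matrix dimensions

long long fold_cycles(const LayerShape& l, int pe, int simd) {
    // Assumes m is divisible by pe and n by simd, as in the text.
    return static_cast<long long>(l.m / pe) * (l.n / simd);
}

int main() {
    // Illustrative configuration only (not the patent's measured design):
    LayerShape conv = {64, 576};             // 64 filters, 3x3x64 = 576 inputs
    for (int pe : {8, 16, 32})
        for (int simd : {16, 32, 64})
            std::printf("PE=%2d SIMD=%2d -> %lld cycles per MVM\n",
                        pe, simd, fold_cycles(conv, pe, simd));
    return 0;
}
```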
FIG. 12 is a schematic diagram of the processing flow of the acceleration system according to an embodiment of the present invention. As shown in fig. 12, the processing flow includes three stages. The first stage is binary neural network initialization and picture preprocessing, which includes importing the bitstream file, initializing the network structure, interleaving and sorting the weight parameters, allocating on-chip memory, and resizing the picture (to 32 × 32 × 3). The second stage is the acceleration process on the FPGA, which produces a one-dimensional feature vector. The third stage is the return stage, which includes the classification of the feature vector on the ARM processor.
A network improved from VGG16 is accelerated on a Xilinx PYNQ-Z1 lightweight development board using the Vivado HLS high-level synthesis tool. Departing from the traditional way of implementing a convolutional neural network on an FPGA, the overall hardware design adopts a pipelined computing architecture based on the FPGA's on-chip memory, which reduces the communication cost between on-chip and off-chip memory and greatly improves overall parallelism. Meanwhile, the convolution layers, pooling layers, regular normalization layers and fully connected layers in the binary neural network are each optimized accordingly. To fully exploit the parallel potential, a matrix vector multiplication unit is designed to support the convolution layer calculation of the network. By configuring different numbers of PE and SIMD channels for each layer of the network, the model achieves locally optimal performance per layer and finally an overall optimal performance, obtaining higher data throughput, faster processing speed and lower power consumption. Table 1 is a schematic table of a fully binarized network structure according to an embodiment of the present invention. As shown in table 1, with the final acceleration scheme the fully binarized network structure runs forward at a processing speed of 844 FPS with a data throughput of 3.8 TOPS; the whole accelerator consumes only 2.3 W and the model accuracy is 83.6%.
Table 1 Fully binarized network structure of an embodiment of the present invention
Layer   Input_Size   Kernel_size    Output_Size   Operations   Size(KB)
Conv_0  32×32×3      3×3×3×64       30×30×64      3110400      5.0
Conv_1  30×30×64     3×3×64×64      28×28×64      57802752     9.5
pool_0  28×28×64     2×2            14×14×64      \            \
Conv_2  14×14×64     3×3×64×128     12×12×128     21233664     19.0
Conv_3  12×12×128    3×3×128×128    10×10×128     29491200     37.0
pool_1  10×10×128    2×2            5×5×128       \            \
Conv_4  5×5×128      3×3×128×256    3×3×256       5308416      74.0
Conv_5  3×3×256      3×3×256×256    1×1×256       1179648      146.0
Fc_6    1×1×256      1×1×256×512    1×1×512       262144       260.0
Fc_7    1×1×512      1×1×512×512    1×1×512       524288       260.0
Fc_8    1×1×512      1×1×512×10     1×1×10        10240        64.0
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A binary neural network accelerating system based on FPGA comprises a convolution kernel parameter acquisition module formed by FPGA, a binary convolution neural network structure and a cache module, wherein the cache module is an on-chip memory of the FPGA,
the convolution kernel parameter acquisition module is used for acquiring an input characteristic diagram of a picture to be processed, and performing binarization training on an existing data set by using a convolution neural network model to obtain a convolution calculation logic rule and a plurality of convolution kernel parameters, wherein the convolution calculation logic rule comprises convolution calculation of a plurality of threads;
the cache module is used for calling the convolution calculation logic rule and the convolution kernel parameters, storing the convolution kernel parameters in an on-chip memory of the FPGA according to the convolution calculation logic rule, and caching the calculation result of the convolution basic calculation module and the image data to be processed;
the binary convolution neural network structure is used for calling convolution calculation logic rules to generate a plurality of convolution basic calculation modules, the convolution basic calculation modules establish corresponding connection relations according to the convolution calculation logic rules, convolution calculation of one thread corresponds to the convolution basic calculation modules, and a plurality of convolution kernel parameters correspond to the convolution basic calculation modules one to one;
the convolution basic calculation module is used for reading a calculation result of a last convolution basic calculation module of a current thread in the cache module, an input feature map of an image to be processed in a current sliding window and corresponding convolution kernel data in an on-chip memory of the FPGA according to a convolution calculation logic rule, sequentially performing a preset convolution calculation sequence to obtain a calculation result of the current convolution basic calculation module, and storing the calculation result of the current convolution basic calculation module in a corresponding cache region; the preset convolution calculation sequence is to sequentially perform convolution, PRelu activation, regular normalization and binary activation calculation, or sequentially perform convolution, PRelu activation, pooling, regular normalization and binary activation calculation;
and the FPGA traverses the convolution calculation of the multiple threads according to the convolution calculation logic rule to obtain the output characteristic diagram data of the image to be processed so as to improve the detection speed of the image to be detected.
2. The FPGA-based binary neural network acceleration system of claim 1, characterized in that, the FPGA configures a corresponding control register through an ARM end, and loads an image from an external memory DDR3 to a buffer area of an on-chip memory through an AXI bus; the FPGA allocates a plurality of processing engines for the convolution basic computation module, and each processing engine comprises an arithmetic operation component, a logic operation component, a bit operation component and a storage resource.
3. The FPGA-based binary neural network acceleration system of claim 2, wherein the convolution calculation layers are classified into convolution layers, PRelu activation layers, pooling layers, regular normalization layers and binary activation layers according to a preset convolution calculation sequence, and are respectively used for convolution, PRelu activation, pooling, regular normalization and binary activation calculation, a convolution acceleration array is formed by using a plurality of calculation engines located in the same convolution calculation layer, and a convolution acceleration array is implemented by using a PE module of the FPGA.
4. The FPGA-based binary neural network acceleration system of claim 3, wherein the cache module is configured to set a first cache region and a second cache region for a convolution calculation array, respectively, the first cache region is configured to store the operation result of a previous convolution acceleration array, and the second cache region is configured to store the operation result of a corresponding convolution acceleration array.
5. The FPGA-based binary neural network acceleration system of claim 3, wherein the computation of the pooling layer is implemented by: and performing SIMD vectorization on the column vectors corresponding to the sliding window of the pooling layer, solving respective maximum values of all the column vectors to form a new vector, and taking the new vector as data of the output feature map.
6. The FPGA-based binary neural network acceleration system of claim 5, wherein when the same column vector exists in different sliding windows of the pooling layer, the calculation result of the same column vector is temporarily stored in an LUT, and the temporary storage value in the LUT is directly called when the same column vector calculation is performed in the next sliding window.
7. The FPGA-based binary neural network acceleration system of claim 3, wherein the FPGA is provided with a matrix vector multiplication unit for each convolution calculation layer according to a convolution calculation logic rule, the matrix vector multiplication unit comprises a plurality of calculation engines, the calculation engines comprise a plurality of parallel single instruction multiple data flow channels, the calculation engines are used for obtaining input feature maps of the pictures to be processed corresponding to the plurality of parallel single instruction multiple data flow channels, and different filters corresponding to convolution kernel parameters perform multiplication accumulation operation.
8. The FPGA-based binary neural network acceleration system of claim 7, wherein the system performs the dot product calculation process by: solving the XOR of the elements at the corresponding positions in the sliding window, and storing the XOR result in an array; counting the number of 1s in the array through popcount; obtaining a final convolution calculation result according to the formula result = popcount(x) - [N - popcount(x)]; wherein popcount(x) represents the number of 1s in the vector x corresponding to the one-dimensional array, and N represents the number of elements corresponding to the vector x in popcount(x).
9. The FPGA-based binary neural network acceleration system of claim 3, wherein the FPGA sorts convolution kernel parameters required by convolution calculation according to a convolution calculation logic rule and then packages the parameters into a parameter matrix, the sliding output window translates the input feature map covering the picture to be processed according to the convolution calculation logic rule to obtain an image matrix, and the parameter matrix and the image matrix are multiplied to obtain a convolution calculation result.
10. The FPGA-based binary neural network acceleration system of any one of claims 1-9, wherein the convolution basic computation module combines PRelu activation, regular normalization and binary activation into a simple binary function through a common affine function pattern.
CN201910636517.2A 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system Active CN110458279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Publications (2)

Publication Number Publication Date
CN110458279A CN110458279A (en) 2019-11-15
CN110458279B true CN110458279B (en) 2022-05-20

Family

ID=68481247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636517.2A Active CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Country Status (1)

Country Link
CN (1) CN110458279B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126309A (en) * 2019-12-26 2020-05-08 长沙海格北斗信息技术有限公司 Convolutional neural network architecture method based on FPGA and face recognition method thereof
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111401543B (en) * 2020-06-08 2020-11-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
US20220019872A1 (en) * 2020-07-14 2022-01-20 United Microelectronics Centre (Hong Kong) Limited Processor, logic chip and method for binarized convolution neural network
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
CN114201726B (en) * 2020-09-18 2023-02-10 深圳先进技术研究院 Convolution operation optimization method, system, terminal and storage medium
CN112418417B (en) * 2020-09-24 2024-02-27 北京计算机技术及应用研究所 Convolutional neural network acceleration device and method based on SIMD technology
CN112153347B (en) * 2020-09-27 2023-04-07 北京天玛智控科技股份有限公司 Coal mine underground intelligent visual terminal sensing method, storage medium and electronic equipment
CN112308762A (en) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 Data processing method and device
CN112199896A (en) * 2020-10-26 2021-01-08 云中芯半导体技术(苏州)有限公司 Chip logic comprehensive optimization acceleration method based on machine learning
CN112487448B (en) * 2020-11-27 2024-05-03 珠海零边界集成电路有限公司 Encryption information processing device, method and computer equipment
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113301221B (en) * 2021-03-19 2022-09-09 西安电子科技大学 Image processing method of depth network camera and terminal
CN113298236B (en) * 2021-06-18 2023-07-21 中国科学院计算技术研究所 Low-precision neural network computing device and acceleration method based on data flow structure
CN113469350B (en) * 2021-07-07 2023-03-24 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114202071B (en) * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114662660A (en) * 2022-03-14 2022-06-24 昆山市工业技术研究院有限责任公司 CNN accelerator data access method and system
CN114897159B (en) * 2022-05-18 2023-05-12 电子科技大学 Method for rapidly deducing electromagnetic signal incident angle based on neural network
CN115083462B (en) * 2022-07-14 2022-11-11 中科南京智能技术研究院 Digital in-memory computing device based on Sram
CN117114055B (en) * 2023-10-24 2024-04-09 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10802992B2 (en) * 2016-08-12 2020-10-13 Xilinx Technology Beijing Limited Combining CPU and special accelerator for implementing an artificial neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights;Yunzhi Duan et al.;《2018 IEEE 23rd International Conference on Digital Signal Processing(DSP)》;20190204;全文 *
一种基于FPGA的卷积神经网络加速器设计与实现;仇越等;《微电子学与计算机》;20180805(第08期);全文 *

Also Published As

Publication number Publication date
CN110458279A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN106970896B (en) Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
US11989638B2 (en) Convolutional neural network accelerating device and method with input data conversion
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN110321997B (en) High-parallelism computing platform, system and computing implementation method
CN111738433B (en) Reconfigurable convolution hardware accelerator
CN110991631A (en) Neural network acceleration system based on FPGA
Liu et al. Towards an efficient accelerator for DNN-based remote sensing image segmentation on FPGAs
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
Li et al. A multistage dataflow implementation of a deep convolutional neural network based on FPGA for high-speed object recognition
CN112487750A (en) Convolution acceleration computing system and method based on memory computing
CN111582465B (en) Convolutional neural network acceleration processing system and method based on FPGA and terminal
Wang et al. A low-latency sparse-winograd accelerator for convolutional neural networks
CN111768458A (en) Sparse image processing method based on convolutional neural network
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
Chang et al. VSCNN: Convolution neural network accelerator with vector sparsity
Li et al. A novel software-defined convolutional neural networks accelerator
US20230376733A1 (en) Convolutional neural network accelerator hardware

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant