CN110458279A - FPGA-based binary neural network acceleration method and system - Google Patents
FPGA-based binary neural network acceleration method and system
- Publication number
- CN110458279A CN110458279A CN201910636517.2A CN201910636517A CN110458279A CN 110458279 A CN110458279 A CN 110458279A CN 201910636517 A CN201910636517 A CN 201910636517A CN 110458279 A CN110458279 A CN 110458279A
- Authority
- CN
- China
- Prior art keywords
- convolution
- fpga
- convolutional calculation
- convolutional
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an FPGA-based binary neural network acceleration system comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA. The modules obtain the input feature map of a picture to be processed, derive convolution calculation logic rules, and perform the corresponding binarized convolution calculations; the FPGA traverses the convolution calculations of multiple threads according to the convolution calculation logic rules and obtains the output feature map data of the image to be processed. The overall architecture offloads the computation of every layer of the binary neural network onto on-chip memory and does not depend on interaction between off-chip and on-chip memory, thereby reducing the communication cost between memories, greatly improving computational efficiency, and increasing the detection speed for images to be detected.
Description
Technical field
The invention belongs to the field of image processing, and more particularly relates to an FPGA-based binary neural network acceleration method and system.
Background technique
Major advances in artificial intelligence technology have begun to benefit every aspect of human life. From household robot vacuum cleaners to complete sets of intelligent production equipment in factories, many tasks around the world have become highly automated. Deep learning plays a crucial role in this technological revolution and is widely applied in face recognition, object detection, image processing and other fields. The algorithm used most is the convolutional neural network; this well-performing deep learning algorithm has been deployed on large numbers of PCs, mobile phones and embedded dedicated accelerators to realize a variety of intelligent computing tasks, and has achieved good acceleration results.
The convolutional neural network (CNN) is one of the most important branches in the development of deep learning; it is the most mature and is widely used in all kinds of graphics, image and video processing tasks. Beyond the growth of training data and computing power, the rapid development of CNNs has also benefited from the various convolutional neural network frameworks. Most existing CNN applications are deployed on server or desktop platforms, yet the mobile terminal is the most widely used application platform with the largest user base; only by moving CNN applications onto mobile platforms can the development of deep learning applications be pushed to the greatest extent.
However, all such mobile terminals and embedded computing devices can only provide limited computing power and small on-chip storage capacity. As the model structures of convolutional neural networks become increasingly complex, with ever deeper layers and ever larger parameter counts, deploying CNNs on mobile and embedded devices becomes more and more difficult. Performing the huge amount of computation on lightweight chips with 32-bit floating-point operands consumes enormous computing resources, while it is also very difficult to achieve satisfactory real-time performance.
Summary of the invention
In view of the above defects or improvement requirements of the prior art, the present invention provides an FPGA-based binary neural network acceleration system comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA. Each module performs the corresponding binarized convolution calculation according to the obtained convolution calculation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded onto on-chip memory without depending on interaction between off-chip and on-chip memory, thereby reducing the communication cost between memories, greatly improving computational efficiency, and increasing the detection speed for images to be detected.
To achieve the above object, according to one aspect of the present invention, an FPGA-based binary neural network acceleration system is provided, comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA.

The convolution kernel parameter acquisition module is used to obtain the input feature map of the picture to be processed and to perform binarization training on an existing data set with a convolutional neural network model, obtaining convolution calculation logic rules and multiple convolution kernel parameters; the convolution calculation logic rules comprise the convolution calculations of multiple threads.

The cache module is used to retrieve the convolution calculation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the on-chip memory of the FPGA according to the rules; the cache module also caches the calculation results of the basic convolution calculation modules and the image data to be processed.

The binarized convolutional neural network structure is used to retrieve the convolution calculation logic rules and generate multiple basic convolution calculation modules, which establish the corresponding connection relationships according to the rules; the convolution calculation of one thread corresponds to multiple basic convolution calculation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution calculation modules.

According to the convolution calculation logic rules, a basic convolution calculation module reads the calculation result of the previous basic convolution calculation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the on-chip memory of the FPGA, performs the preset convolution calculation sequence to obtain its calculation result, and stores that result in the corresponding buffer. The preset convolution calculation sequence is to successively perform the convolution, PRelu activation, batch normalization and binary activation calculations, or to successively perform the convolution, PRelu activation, pooling, batch normalization and binary activation calculations.

The FPGA traverses the convolution calculations of the multiple threads according to the convolution calculation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
As a further improvement of the present invention, the FPGA has its corresponding control registers configured by the ARM side and then loads the image from the external memory DDR3 into the buffer of the on-chip memory over the AXI bus; the FPGA allocates multiple processing engines to the basic convolution calculation modules, a processing engine comprising arithmetic operation components, logic operation components, bit operation components and storage resources.
As a further improvement of the present invention, the convolution calculation layers are divided, according to the preset convolution calculation sequence, into a convolution layer, a PRelu activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for the convolution, PRelu activation, pooling, batch normalization and binary activation calculations; multiple compute engines located in the same convolution calculation layer form a convolution acceleration array, and one convolution acceleration array is realized with one PE module of the FPGA.
As a further improvement of the present invention, the cache module sets a corresponding first buffer and second buffer for each convolution calculation array; the first buffer is used to store the operation result of the previous convolution acceleration array, and the second buffer is used to store the operation result of the corresponding convolution acceleration array.
As a further improvement of the present invention, the calculation of the pooling layer is realized as follows: the column vectors corresponding to the sliding window of the pooling layer are processed with SIMD vectorization, the respective maxima of all the column vectors are taken to form a new vector, and the new vector yields the data of the output feature map.
As a further improvement of the present invention, when different sliding windows of the pooling layer share an identical column vector, the calculated result of that column vector is temporarily stored in a LUT, and the next sliding window directly reuses the temporary value in the LUT when performing the calculation for that column vector.
As a further improvement of the present invention, the FPGA provides each convolution calculation layer with a matrix-vector multiplication unit according to the convolution calculation logic rules; a matrix-vector multiplication unit comprises multiple compute engines, each compute engine comprising multiple parallel single-instruction-multiple-data (SIMD) channels. A compute engine obtains the input feature map of the picture to be processed corresponding to its parallel SIMD channels and performs multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further improvement of the present invention, the system performs a dot product calculation as follows: XOR is applied to the corresponding-position elements within the sliding window and the XOR results are stored in an array; popcount counts the number of 1s in the array; and the final convolution result is obtained by the formula result = -(popcount - (N - popcount)), i.e. result = N - 2*popcount.
As a further improvement of the present invention, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rules and packs them into a parameter matrix; a sliding output window translates over and covers the input feature map of the picture to be processed according to the convolution calculation logic rules to obtain an image matrix, and the parameter matrix is multiplied with the image matrix to obtain the convolution result.
As a further improvement of the present invention, the basic convolution calculation module merges the PRelu activation, batch normalization and binary activation into one simple binary-valued function by means of a common affine function.
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:

The FPGA-based binary neural network acceleration system of the present invention uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA; each module performs the corresponding binarized convolution calculation according to the obtained convolution calculation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded onto on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency, and increases the detection speed for images to be detected.
In the FPGA-based binary neural network acceleration system of the present invention, the dot product operation in the binary neural network is replaced by XNOR logic operations and popcount bit operations. Since a binary operation is a dot product between 1-bit weights and 1-bit input image parameters, this replacement greatly increases the speed of binary convolution compared with full-precision convolution. Meanwhile, odd-even interleaved padding is used to fill the blank parts of the feature map, replacing the all +1 padding used in prior work, which guarantees model accuracy to a certain extent.
In the FPGA-based binary neural network acceleration system of the present invention, the matrix-vector multiplication unit provided by the FPGA performs offline interleaved sorting of the parameter matrix and element recombination of the interleaved order of the input feature map within the sliding window unit; the recombined data vectors are fed into the convolution acceleration matrix, thereby realizing fully parallel calculation.
In the FPGA-based binary neural network acceleration system of the present invention, the calculation of the convolution calculation layers is accelerated by a double-buffering parallel mechanism; the buffers adopt a pipeline structure, the sliding window is driven by the output data of the previous layer, and the data inside the sliding window are calculated in a fully parallel manner, further improving computational efficiency.
In the FPGA-based binary neural network acceleration system of the present invention, the PRelu activation, batch normalization and binary activation are merged into one simple binary-valued function by means of a common affine function, greatly reducing the computational complexity brought by batch normalization.
Description of the drawings
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the FPGA on-chip memory according to an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the basic convolution calculation module according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the odd-even interleaved padding according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the double-buffering parallel structure according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the calculation performed by the pooling layer according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the convolution calculation according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the dot product calculation according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the interleaved sorting of the convolution calculation matrix according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of the storage of the convolution calculation matrix according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the processing flow of the acceleration system according to an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention. As shown in Fig. 1, the FPGA-based binary neural network acceleration system comprises a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA, wherein:

The convolution kernel parameter acquisition module is used to obtain the input feature map of the picture to be processed and to perform binarization training on an existing data set with a convolutional neural network model, obtaining convolution calculation logic rules and multiple convolution kernel parameters; the convolution calculation logic rules comprise the convolution calculations of multiple threads.

The cache module is used to retrieve the convolution calculation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the on-chip memory of the FPGA according to the rules; the cache module also caches the calculation results of the basic convolution calculation modules and the image data to be processed.

The binarized convolutional neural network structure is used to retrieve the convolution calculation logic rules and generate multiple basic convolution calculation modules, which establish the corresponding connection relationships according to the rules; the convolution calculation of one thread corresponds to multiple basic convolution calculation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution calculation modules.

According to the convolution calculation logic rules, a basic convolution calculation module reads the calculation result of the previous basic convolution calculation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the on-chip memory of the FPGA, performs the preset convolution calculation sequence to obtain its calculation result, and stores that result in the corresponding buffer. The preset convolution calculation sequence is to successively perform the convolution, PRelu activation, batch normalization and binary activation calculations, or to successively perform the convolution, PRelu activation, pooling, batch normalization and binary activation calculations. As an example, the basic convolution calculation module may merge the PRelu activation, batch normalization and binary activation into one simple binary-valued function by means of a common affine function, so as to reduce the computational complexity brought by batch normalization; a sketch of this fusion is given below.
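For illustration only (not part of the patent text), the following C++ sketch shows how such a fusion can work. It assumes the usual PRelu negative slope alpha and batch-normalization parameters gamma, beta, mu, sigma (none of these symbols are named in the patent), with gamma, alpha > 0, under which sign(BN(PRelu(x))) reduces to a single precomputed threshold comparison:

```cpp
struct FusedBinAct {
    float tau;  // per-output-channel threshold, computed offline

    // alpha: PRelu negative slope; gamma, beta, mu, sigma: batch-norm scale,
    // shift, mean and standard deviation (hypothetical parameter names).
    static FusedBinAct make(float alpha, float gamma, float beta,
                            float mu, float sigma) {
        // Zero crossing of gamma*(z - mu)/sigma + beta, where z = PRelu(x).
        float z = mu - beta * sigma / gamma;
        // Invert PRelu: z >= 0 came from x = z, z < 0 came from x = z/alpha.
        float x = (z >= 0.0f) ? z : z / alpha;
        return FusedBinAct{x};
    }

    // The whole PRelu -> batch-norm -> sign chain collapses to one compare.
    int operator()(float x) const { return x >= tau ? +1 : -1; }
};
```

Because the composition is monotonically increasing for positive gamma and alpha, the threshold can be precomputed per output channel and the run-time cost of normalization disappears entirely.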
The FPGA traverses the convolution calculations of the multiple threads according to the convolution calculation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
Fig. 2 is a schematic diagram of the FPGA on-chip memory according to an embodiment of the present invention. As shown in Fig. 2, a pipelined computing architecture based on the FPGA on-chip memory stores the convolution kernel data corresponding to the convolution calculation logic rules on chip, which both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism of the convolution calculation.
As an example, in the FPGA-based hardware realization of the basic convolution calculation module, the hardware overall architecture first configures the corresponding control registers through the ARM side and then loads the image from the external memory DDR3 into the buffer of the on-chip memory over the AXI bus. The FPGA can allocate a large number of processing engines for the operations of the basic convolution calculation modules, including arithmetic operation components, logic operation components, bit operation components and storage resources. As a preferred embodiment, the convolution calculation layers are divided, according to the preset convolution calculation sequence, into a convolution layer, a PRelu activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for the convolution, PRelu activation, pooling, batch normalization and binary activation calculations; multiple compute engines located in the same convolution calculation layer can form a convolution acceleration array, and one convolution acceleration array can be realized with one PE module of the FPGA. As a further preference, each convolution acceleration array can be allocated a double-buffer structure, one buffer being responsible for storing the operation result of the previous layer of the network and the other for storing the operation result of the present layer. A host-side control sketch is given below.
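For illustration only, a minimal host-side C++ sketch of this control flow on the ARM side, assuming a memory-mapped AXI-Lite register file; the base address, register offsets and bit meanings below are invented for the example and are not specified by the patent:

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr off_t  ACCEL_BASE = 0x43C00000;  // assumed AXI-Lite base address
constexpr size_t REG_CTRL   = 0x00;        // assumed: start/done control
constexpr size_t REG_SRC    = 0x10;        // assumed: DDR3 image address
constexpr size_t REG_LEN    = 0x18;        // assumed: transfer length

int main() {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) return 1;
    auto* regs = static_cast<volatile uint32_t*>(
        mmap(nullptr, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, ACCEL_BASE));

    regs[REG_SRC  / 4] = 0x10000000;   // physical address of the input image in DDR3 (assumed)
    regs[REG_LEN  / 4] = 32 * 32 * 3;  // 32x32x3 image, as resized in the embodiment
    regs[REG_CTRL / 4] = 1;            // kick off the AXI load into the on-chip buffer

    while ((regs[REG_CTRL / 4] & 0x2) == 0) { /* poll assumed 'done' bit */ }

    munmap(const_cast<uint32_t*>(regs), 0x1000);
    close(fd);
    return 0;
}
```

On a PYNQ-class board the same effect would typically be obtained through the vendor's overlay tooling; the raw register writes above just make the configure-then-load sequence explicit.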
Fig. 3 is a structural schematic diagram of the basic convolution calculation module according to an embodiment of the present invention. Adding PRelu activation to the basic convolution calculation module can improve the accuracy of the original model by 2 percentage points. The convolution module of a common binary neural network is ordered as: batch normalization, convolution, binary activation, pooling; before recombination, the input of each module is transmitted from the previous module after the PRelu and pooling calculations, with the result that the data received by the buffers and transmitted between blocks are non-binarized. The basic convolution calculation module therefore adjusts its processing order to: convolution, PRelu activation, pooling, batch normalization and binary activation, or convolution, PRelu activation, batch normalization and binary activation. After recombination, the data transmitted between basic convolution calculation modules are converted to binary by the binary activation function, so the data transmitted between blocks are also binary. This substantially reduces the volume of data exchanged between blocks, lowers the inter-block communication cost, and makes it easy to design a unified interface for all basic convolution calculation modules; meanwhile, the size of the buffers used for exchanging data is reduced, saving hardware resources.
Fig. 4 is a schematic diagram of the odd-even interleaved padding according to an embodiment of the present invention. As shown in Fig. 4, the blank data of the output feature map are filled with odd-even interleaved padding in the convolution calculation to ensure the dimensions of the output feature map; specifically, ±1 values are interleaved according to an odd-even ordering over the width, height and channel dimensions of the feature map. The network model trained with odd-even padding achieves an error rate of only 11.50% on the Cifar10 data set, close to the all-zero padding error rate under full precision, and lower than the 13.76% of all +1 padding and the 12.85% of odd (even) padding. A minimal sketch of this padding scheme is given below.
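For illustration only, a minimal C++ sketch of the interleaved ±1 border. The exact parity rule (which combination of row, column and channel parities flips the sign) is an assumption; the patent only states that ±1 alternate by odd-even ordering over the width, height and channel dimensions:

```cpp
#include <cstdint>
#include <vector>

// Pads an HxWxC map of +/-1 values (CHW layout assumed) with a 1-pixel
// border whose sign alternates with the parity of (row + col + channel).
std::vector<int8_t> pad_odd_even(const std::vector<int8_t>& in,
                                 int H, int W, int C) {
    const int Hp = H + 2, Wp = W + 2;
    std::vector<int8_t> out(static_cast<size_t>(Hp) * Wp * C);
    for (int c = 0; c < C; ++c)
        for (int y = 0; y < Hp; ++y)
            for (int x = 0; x < Wp; ++x) {
                int8_t& dst = out[(static_cast<size_t>(c) * Hp + y) * Wp + x];
                if (y == 0 || y == Hp - 1 || x == 0 || x == Wp - 1)
                    dst = ((y + x + c) % 2 == 0) ? +1 : -1;  // interleaved +/-1 border
                else
                    dst = in[(static_cast<size_t>(c) * H + (y - 1)) * W + (x - 1)];
            }
    return out;
}
```

The point of the alternation is that a binary feature map has no zero value, so the border must carry roughly balanced +1 and -1 values rather than a constant sign.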
Fig. 5 is a schematic diagram of the double-buffering parallel structure according to an embodiment of the present invention. The cache module sets a corresponding first buffer and second buffer for each convolution calculation array; the first buffer stores the operation result of the previous convolution acceleration array and the second buffer stores the operation result of the corresponding convolution acceleration array, and both are filled according to the convolution calculation logic rules. Taking the pooling layer as an example, two buffers are allocated for it at the start of the calculation; the buffer width is the width W of the input feature map of the pooling layer and the height is the kernel size k. The two buffers alternately receive the calculated results of the previous layer. When the first k-1 rows of Buffer1 are full and the k-th datum of row k arrives, the sliding window comes into effect and produces a calculated result; from then on, each newly received datum produces one result sent to the output feature map. When Buffer1 is full, Buffer2 starts receiving data. At this point the sliding window does not come into effect until the k-th datum of the first row of Buffer2 arrives; this sliding window contains the data of k-1 rows of Buffer1 and one row of Buffer2. From then on, as before, each slide of the window depends on the arrival of the next new datum and produces one calculated result. When the coverage of the sliding window no longer includes any data in Buffer1, Buffer1 is emptied; it starts receiving new data once Buffer2 is full, and the cycle continues in this way. A conceptual sketch of this line-buffer discipline is given below.
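For illustration only, a conceptual C++ sketch of the double-buffered line buffer. It is a simplification, not the RTL: the two k-row buffers are modeled as one circular store of 2k rows, and a window first fires once k-1 full rows plus k elements of the next row have arrived, then once per newly received element, as the text describes (stride handling and the exact swap logic are omitted):

```cpp
#include <vector>

struct LineBuffer {
    int K, W;                  // kernel size, feature-map width
    std::vector<float> rows;   // 2*K rows of width W (Buffer1 + Buffer2)
    long received = 0;

    LineBuffer(int k, int w) : K(k), W(w), rows(2L * k * w) {}

    // Push one element from the previous layer; returns true when the
    // sliding window can produce an output driven by this element.
    bool push(float v) {
        rows[received % rows.size()] = v;
        ++received;
        // First firing point: (K-1) complete rows plus K elements of row K.
        return received >= static_cast<long>(K - 1) * W + K;
    }
};
```

The essential property is that the window is driven by the producer: each arriving element from the previous layer either fills the buffer or triggers exactly one downstream result, which is what lets the layers run as a pipeline.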
As an example, the calculation of the pooling layer is realized as follows: the column vectors corresponding to the sliding window of the pooling layer are processed with SIMD vectorization, the respective maxima of all column vectors are taken to form a new vector, and the new vector yields the data of the output feature map. As a preferred embodiment, when different sliding windows share an identical column vector, the calculated result of that column vector can be temporarily stored in a LUT, and the next sliding window directly reuses the temporary value when it computes. Fig. 6 is a schematic diagram of the calculation performed by the pooling layer according to an embodiment of the present invention. As shown in Fig. 6, take a 3*3 max pooling with stride=2 as an example. The sliding window slides over the same thread array arrays1; each column within the sliding window is regarded as a vector, so each sliding window contains 3 such column vectors. Since there is no data dependence among these 3 vectors, SIMD vectorization can be used to take their respective maxima simultaneously. The 3 maxima are then assembled into a vector, the maximum of this vector is computed, and the calculated result is finally put into the output feature map as a new element. It is worth noting that after the 3 columns of a sliding window are computed, the calculated result of the rightmost column is temporarily stored in a LUT to serve as the result of the first column of the next sliding window, because adjacent sliding windows share one column of data. Counting from left to right, the degree of parallelism of the first sliding window is 3, and that of each remaining sliding window is 2. A behavioural sketch of this scheme follows.
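For illustration only, a C++ sketch of the 3x3, stride-2 max pooling of Fig. 6 over one row of windows. Each column of the window is reduced independently (the parallel part that maps to SIMD in hardware), and the rightmost column maximum is carried over in a LUT-like register so the next window, which shares that column, reuses it. The array layout is an assumption:

```cpp
#include <algorithm>
#include <vector>

// row3: three consecutive feature-map rows of width W, concatenated.
std::vector<float> maxpool3x3_s2_row(const std::vector<float>& row3, int W) {
    const int K = 3, S = 2;
    std::vector<float> out;
    auto col_max = [&](int x) {  // max over one column of the window
        return std::max({row3[0 * W + x], row3[1 * W + x], row3[2 * W + x]});
    };
    float carried = 0.0f;        // the "LUT" holding the shared column's result
    bool have_carried = false;
    for (int x0 = 0; x0 + K <= W; x0 += S) {
        float c0 = have_carried ? carried : col_max(x0);  // reused column
        float c1 = col_max(x0 + 1);                       // computed in parallel
        float c2 = col_max(x0 + 2);                       // computed in parallel
        out.push_back(std::max({c0, c1, c2}));
        carried = c2;            // stride 2: rightmost column is shared with the next window
        have_carried = true;
    }
    return out;
}
```

The first window computes 3 column maxima, every later window only 2, matching the degrees of parallelism stated above.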
Fig. 7 is a schematic diagram of the convolution calculation according to an embodiment of the present invention. As shown in Fig. 7, as an example, each convolution calculation layer is provided with a matrix-vector multiplication unit according to the convolution calculation logic rules; a matrix-vector multiplication unit comprises multiple compute engines, and each compute engine comprises multiple parallel single-instruction-multiple-data (SIMD) channels. Further, a compute engine obtains the input feature map of the picture to be processed corresponding to its parallel SIMD channels together with the corresponding convolution kernel parameters located in on-chip memory; every compute engine receives the same control signal and the same vector data of the picture to be processed, and during calculation performs multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further preference, the system performs a dot product calculation as follows: XOR is applied to the corresponding-position elements within the sliding window and the XOR results are stored in an array; popcount counts the number of 1s in the array; and the final convolution result is obtained by the formula result = -(popcount - (N - popcount)).
Fig. 8 is a schematic diagram of the dot product calculation according to an embodiment of the present invention. As shown in Fig. 8, taking the data flow of one compute engine in the matrix-vector multiplication unit as an example, the engine mainly calculates the dot product of the input vector with one row of the parameter matrix, compares the result with a threshold, and finally outputs 1-bit data. The dot product is in essence a multiply-accumulate operation between two vectors, realized here with XNOR (exclusive-NOR) gates as used in binary neural networks. The first step is to XOR the corresponding-position elements within the sliding window and store the XOR results in an array; the second step is to count the number of 1s in the array with popcount; the third step is to obtain the final convolution result from the formula result = -(popcount - (N - popcount)), i.e. result = N - 2*popcount. The result is finally compared with the threshold to output the final result. This compute engine structure also supports non-binarized calculation: it suffices to replace the dot product gates of the dotted portion with conventional parallel multipliers. A minimal software sketch of the binary dot product follows.
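For illustration only, a minimal C++ sketch of the three steps for ±1 vectors bit-packed into 64-bit words (the packing width is an assumption; the patent describes the gate-level version):

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// a and b hold +/-1 vectors bit-packed into 64-bit words (bit=1 encodes +1);
// any tail bits beyond N must be identical in a and b so they XOR to zero.
int binary_dot(const std::vector<uint64_t>& a,
               const std::vector<uint64_t>& b, int N) {
    int pop = 0;                             // number of disagreeing positions
    for (size_t i = 0; i < a.size(); ++i)
        pop += std::popcount(a[i] ^ b[i]);   // steps 1-2: XOR, then count 1s
    // Step 3: result = -(popcount - (N - popcount)) = N - 2*popcount,
    // i.e. agreements minus disagreements in the +/-1 domain.
    return N - 2 * pop;
}

// The engine then emits one bit: out = (binary_dot(a, b, N) >= threshold).
```

A single 64-bit XOR plus popcount thus replaces 64 full-precision multiply-accumulates, which is the source of the speedup over full-precision convolution.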
As a further improvement of the present invention, the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rules and packs them into a parameter matrix; a sliding output window translates over and covers the input feature map of the picture to be processed according to the convolution calculation logic rules to obtain an image matrix, and the parameter matrix is multiplied with the image matrix to obtain the convolution result.
Fig. 9 is a schematic diagram of the interleaved sorting of the convolution calculation matrix according to an embodiment of the present invention. As shown in Fig. 9, following the convolution calculation logic rules, a channel-dimension matrix interleaving method converts the convolution calculation into an ordinary matrix multiplication: the convolution kernel parameters required by the convolution calculation are packed into a parameter matrix, while the sliding window translates over and covers the input feature map and bundles its elements into an image matrix; these matrices are finally multiplied to produce the output. Since a dot product covers all pixel values within one sliding window, and addition is commutative, the order used in the matrix interleaving can be arbitrary; here the pixel values at the same position of different channels are placed together, though other orders could equally be used without changing the final result. It should be noted that the conversion of the filter matrix requires no overhead, because it is performed before the program runs, whereas the image matrix is converted at run time. Fig. 10 is a schematic diagram of the storage of the convolution calculation matrix according to an embodiment of the present invention. As shown in Fig. 10, the input map is simply stored in the buffer in a certain order; an address generator then fetches the memory positions corresponding to each sliding window, and the data transmitted from the previous layer generate the image matrix according to the same ordering rule as the filter matrix. A sketch of this gathering step is given below.
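For illustration only, a C++ sketch of the run-time gathering for one sliding window. The position-major, channel-minor order below is one valid interleaving choice, not the specific order of Fig. 9; as the text notes, any fixed order works as long as the filter matrix uses the same one. The HWC layout is an assumption:

```cpp
#include <cstdint>
#include <vector>

// Gathers one KxK window at (y0, x0) from an HxWxC feature map into a
// column vector, placing the same-position pixels of all channels together.
std::vector<int8_t> im2col_window(const std::vector<int8_t>& fmap,
                                  int H, int W, int C,
                                  int y0, int x0, int K) {
    std::vector<int8_t> col;
    col.reserve(static_cast<size_t>(K) * K * C);
    for (int dy = 0; dy < K; ++dy)
        for (int dx = 0; dx < K; ++dx)
            for (int c = 0; c < C; ++c)   // channel-interleaved ordering
                col.push_back(fmap[((y0 + dy) * W + (x0 + dx)) * C + c]);
    return col;
}
```

In hardware this loop corresponds to the address generator of Fig. 10; in software it is the per-window half of an im2col lowering, with the filter half done once offline.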
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication according to an embodiment of the present invention. As shown in Fig. 11, since almost all the calculation in a binary neural network can be expressed as matrix-vector multiplication, this unit largely controls the throughput of the system and also directly affects its resource utilization and energy consumption. Let the number of compute engines in a layer be a, the number of SIMD channels in each compute engine be b, and the parameter matrix size be m*n; then the total folding degree is (m/a)*(n/b), and the number of cycles required to complete one matrix-vector multiplication is likewise (m/a)*(n/b). Because the acceleration structure of the binary neural network is a pipeline, the overall computing throughput is determined by the slowest layer; it is therefore necessary to configure different numbers of compute engines and SIMD channels for each convolution layer and fully connected layer so that the cycle counts required by all layers are roughly equal, making the forward calculation of the whole network fastest. The foldable structure of the matrix-vector multiplication is shown in Fig. 11: folding the calculation according to the computational load makes full use of the computing space and achieves better inference performance. A behavioural sketch with a worked cycle count follows.
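For illustration only, a behavioural C++ sketch of the folded schedule (the two outer loops are the (m/a)*(n/b) sequential cycles; the two inner loops are what the hardware unrolls across PEs and SIMD lanes). The concrete sizes in the closing comment are invented for the arithmetic, not taken from the patent:

```cpp
#include <vector>

// y must be zero-initialized by the caller; M is m x n, x has length n.
void folded_mvm(const std::vector<std::vector<int>>& M,
                const std::vector<int>& x,
                std::vector<int>& y,
                int a, int b) {               // a PEs, b SIMD lanes per PE
    const int m = static_cast<int>(M.size());
    const int n = static_cast<int>(x.size());
    for (int rf = 0; rf < m / a; ++rf)        // row folds
        for (int cf = 0; cf < n / b; ++cf)    // column folds: (m/a)*(n/b) cycles total
            for (int pe = 0; pe < a; ++pe) {  // unrolled across PEs in hardware
                int row = rf * a + pe, acc = 0;
                for (int s = 0; s < b; ++s)   // unrolled across SIMD lanes in hardware
                    acc += M[row][cf * b + s] * x[cf * b + s];
                y[row] += acc;
            }
}

// Example: m=128, n=1152, a=16, b=32 gives (128/16)*(1152/32) = 288 cycles.
```

Balancing a and b per layer so that every layer's cycle count is roughly equal is exactly the per-layer PE/SIMD configuration described in the text.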
Fig. 12 is a schematic diagram of the processing flow of the acceleration system according to an embodiment of the present invention. As shown in Fig. 12, the processing flow comprises three phases. The first phase is binary neural network initialization and picture preprocessing, which includes importing the bitstream file, initializing the network structure, the interleaved sorting of the weight parameters, on-chip memory allocation, and picture size readjustment (the picture size is adjusted to 32*32*3). The second phase is the acceleration on the FPGA, which yields a one-dimensional feature vector. The third phase is the return phase, including the classification processing of the feature vector on the ARM processor.
A network improved on the basis of VGG16 was accelerated on a Xilinx PYNQ-Z1 development board via the Vivado HLS high-level synthesis tool, breaking the traditional implementation pattern of convolutional neural networks on FPGAs. The overall hardware structure adopts a pipelined computing architecture based on the FPGA on-chip memory, which both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism. Meanwhile, corresponding optimizations were carried out for the convolution layers, pooling layers, batch normalization layers and fully connected layers in the binary neural network. To fully exploit the parallel potential, a matrix-vector multiplication unit was designed to support the convolution-layer calculation of the network. By configuring different numbers of PEs and SIMD channels for each layer of the network, the model can reach local optimal performance at each layer and finally obtain overall optimal performance; the optimization yields high data throughput, fast processing speed and low power consumption. Table 1 is a schematic table of the fully binarized network structure of the embodiment of the present invention. As shown in Table 1, with the final acceleration scheme, the forward implementation of the fully binarized network structure achieves a processing speed of 844 FPS and a data throughput of 3.8 TOPS; the overall power of the accelerator is only 2.3 W, and the model accuracy is 83.6%.
Table 1: Schematic table of the fully binarized network structure of an embodiment of the present invention
Those skilled in the art will easily understand that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall all be included within the protection scope of the present invention.
Claims (10)
1. An FPGA-based binary neural network acceleration system, the system comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the on-chip memory of the FPGA, characterized in that:
the convolution kernel parameter acquisition module is used to obtain the input feature map of a picture to be processed and to perform binarization training on an existing data set with a convolutional neural network model, obtaining convolution calculation logic rules and multiple convolution kernel parameters, the convolution calculation logic rules comprising the convolution calculations of multiple threads;
the cache module is used to retrieve the convolution calculation logic rules and the multiple convolution kernel parameters and to store the multiple convolution kernel parameters in the on-chip memory of the FPGA according to the convolution calculation logic rules, the cache module also being used to cache the calculation results of the basic convolution calculation modules and the image data to be processed;
the binarized convolutional neural network structure is used to retrieve the convolution calculation logic rules and generate multiple basic convolution calculation modules, the multiple basic convolution calculation modules establishing corresponding connection relationships according to the convolution calculation logic rules, the convolution calculation of one thread corresponding to multiple basic convolution calculation modules, and the multiple convolution kernel parameters corresponding one-to-one with the multiple basic convolution calculation modules;
the basic convolution calculation module is used to read, according to the convolution calculation logic rules, the calculation result of the previous basic convolution calculation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window and the corresponding convolution kernel data in the on-chip memory of the FPGA, to successively perform the preset convolution calculation sequence to obtain the calculation result of the current basic convolution calculation module, and to store that calculation result in the corresponding buffer; the preset convolution calculation sequence is to successively perform the convolution, PRelu activation, batch normalization and binary activation calculations, or to successively perform the convolution, PRelu activation, pooling, batch normalization and binary activation calculations;
the FPGA traverses the convolution calculations of the multiple threads according to the convolution calculation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
2. The FPGA-based binary neural network acceleration system according to claim 1, characterized in that the FPGA has its corresponding control registers configured by the ARM side and loads the image from the external memory DDR3 into the buffer of the on-chip memory over the AXI bus; the FPGA allocates multiple processing engines to the basic convolution calculation modules, the processing engine comprising arithmetic operation components, logic operation components, bit operation components and storage resources.
3. The FPGA-based binary neural network acceleration system according to claim 2, characterized in that the convolution calculation layers are divided, according to the preset convolution calculation sequence, into a convolution layer, a PRelu activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for the convolution, PRelu activation, pooling, batch normalization and binary activation calculations; multiple compute engines located in the same convolution calculation layer form a convolution acceleration array, and one convolution acceleration array is realized with one PE module of the FPGA.
4. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the cache module sets a corresponding first buffer and second buffer for each convolution calculation array, the first buffer being used to store the operation result of the previous convolution acceleration array and the second buffer being used to store the operation result of the corresponding convolution acceleration array.
5. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the calculation of the pooling layer is realized as follows: the column vectors corresponding to the sliding window of the pooling layer are processed with SIMD vectorization, the respective maxima of all the column vectors are taken to form a new vector, and the new vector is used as the data of the output feature map.
6. The FPGA-based binary neural network acceleration system according to claim 5, characterized in that when different sliding windows of the pooling layer share an identical column vector, the calculated result of the identical column vector is temporarily stored in a LUT, and the next sliding window directly reuses the temporary value in the LUT when performing the calculation for the identical column vector.
7. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA provides each convolution calculation layer with a matrix-vector multiplication unit according to the convolution calculation logic rules, the matrix-vector multiplication unit comprising multiple compute engines and the compute engine comprising multiple parallel single-instruction-multiple-data (SIMD) channels; the compute engine is used to obtain the input feature map of the picture to be processed corresponding to the multiple parallel SIMD channels and to perform multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
8. The FPGA-based binary neural network acceleration system according to claim 7, characterized in that the system performs a dot product calculation as follows: XOR is applied to the corresponding-position elements within the sliding window and the XOR results are stored in an array; popcount counts the number of 1s in the array; and the final convolution result is obtained by the formula result = -(popcount - (N - popcount)).
9. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA sorts the convolution kernel parameters required by the convolution calculation according to the convolution calculation logic rules and packs them into a parameter matrix; a sliding output window translates over and covers the input feature map of the picture to be processed according to the convolution calculation logic rules to obtain an image matrix, and the parameter matrix is multiplied with the image matrix to obtain the convolution result.
10. The FPGA-based binary neural network acceleration system according to any one of claims 1 to 9, characterized in that the basic convolution calculation module merges the PRelu activation, batch normalization and binary activation into one simple binary-valued function by means of a common affine function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910636517.2A CN110458279B (en) | 2019-07-15 | 2019-07-15 | FPGA-based binary neural network acceleration method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910636517.2A CN110458279B (en) | 2019-07-15 | 2019-07-15 | FPGA-based binary neural network acceleration method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110458279A true CN110458279A (en) | 2019-11-15 |
CN110458279B CN110458279B (en) | 2022-05-20 |
Family
ID=68481247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910636517.2A Active CN110458279B (en) | 2019-07-15 | 2019-07-15 | FPGA-based binary neural network acceleration method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110458279B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126309A (en) * | 2019-12-26 | 2020-05-08 | 长沙海格北斗信息技术有限公司 | Convolutional neural network architecture method based on FPGA and face recognition method thereof |
CN111160534A (en) * | 2019-12-31 | 2020-05-15 | 中山大学 | Binary neural network forward propagation frame suitable for mobile terminal |
CN111275167A (en) * | 2020-01-16 | 2020-06-12 | 北京中科研究院 | High-energy-efficiency pulse array framework for binary convolutional neural network |
CN111401543A (en) * | 2020-06-08 | 2020-07-10 | 深圳市九天睿芯科技有限公司 | Neural network accelerator with full on-chip storage and implementation method thereof |
CN111931925A (en) * | 2020-08-10 | 2020-11-13 | 西安电子科技大学 | FPGA-based binary neural network acceleration system |
CN112199896A (en) * | 2020-10-26 | 2021-01-08 | 云中芯半导体技术(苏州)有限公司 | Chip logic comprehensive optimization acceleration method based on machine learning |
CN112418417A (en) * | 2020-09-24 | 2021-02-26 | 北京计算机技术及应用研究所 | Convolution neural network acceleration device and method based on SIMD technology |
CN112487448A (en) * | 2020-11-27 | 2021-03-12 | 珠海零边界集成电路有限公司 | Encrypted information processing device and method and computer equipment |
CN112862080A (en) * | 2021-03-10 | 2021-05-28 | 中山大学 | Hardware calculation method of attention mechanism of EfficientNet |
CN113298236A (en) * | 2021-06-18 | 2021-08-24 | 中国科学院计算技术研究所 | Low-precision neural network computing device based on data stream structure and acceleration method |
CN113301221A (en) * | 2021-03-19 | 2021-08-24 | 西安电子科技大学 | Image processing method, system and application of depth network camera |
CN113469350A (en) * | 2021-07-07 | 2021-10-01 | 武汉魅瞳科技有限公司 | Deep convolutional neural network acceleration method and system suitable for NPU |
CN113949592A (en) * | 2021-12-22 | 2022-01-18 | 湖南大学 | Anti-attack defense system and method based on FPGA |
WO2022013722A1 (en) * | 2020-07-14 | 2022-01-20 | United Microelectronics Centre (Hong Kong) Limited | Processor, logic chip and method for binarized convolution neural network |
CN114202071A (en) * | 2022-02-17 | 2022-03-18 | 浙江光珀智能科技有限公司 | Deep convolutional neural network reasoning acceleration method based on data stream mode |
WO2022057054A1 (en) * | 2020-09-18 | 2022-03-24 | 深圳先进技术研究院 | Convolution operation optimization method and system, terminal, and storage medium |
CN114897159A (en) * | 2022-05-18 | 2022-08-12 | 电子科技大学 | Method for rapidly deducing incident angle of electromagnetic signal based on neural network |
CN115083462A (en) * | 2022-07-14 | 2022-09-20 | 中科南京智能技术研究院 | Novel digital in-memory computing device based on Sram |
CN115550607A (en) * | 2020-09-27 | 2022-12-30 | 北京天玛智控科技股份有限公司 | Model reasoning accelerator realized based on FPGA and intelligent visual perception terminal |
CN117114055A (en) * | 2023-10-24 | 2023-11-24 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN112487448B (en) * | 2020-11-27 | 2024-05-03 | 珠海零边界集成电路有限公司 | Encryption information processing device, method and computer equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106228240A (en) * | 2016-07-30 | 2016-12-14 | 复旦大学 | Deep convolutional neural network implementation method based on FPGA
US20180046913A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Combining cpu and special accelerator for implementing an artificial neural network |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
- 2019
- 2019-07-15 CN CN201910636517.2A patent/CN110458279B/en Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106228240A (en) * | 2016-07-30 | 2016-12-14 | 复旦大学 | Deep convolutional neural network implementation method based on FPGA
US20180046913A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Combining cpu and special accelerator for implementing an artificial neural network |
CN109086867A (en) * | 2018-07-02 | 2018-12-25 | 武汉魅瞳科技有限公司 | A kind of convolutional neural networks acceleration system based on FPGA |
Non-Patent Citations (2)
Title |
---|
YUNZHI DUAN ET AL.: "Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights", 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP) *
仇越 et al.: "Design and implementation of an FPGA-based convolutional neural network accelerator", Microelectronics & Computer (微电子学与计算机) *
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126309A (en) * | 2019-12-26 | 2020-05-08 | 长沙海格北斗信息技术有限公司 | Convolutional neural network architecture method based on FPGA and face recognition method thereof |
CN111160534A (en) * | 2019-12-31 | 2020-05-15 | 中山大学 | Binary neural network forward propagation frame suitable for mobile terminal |
CN111275167A (en) * | 2020-01-16 | 2020-06-12 | 北京中科研究院 | High-energy-efficiency pulse array framework for binary convolutional neural network |
CN111401543A (en) * | 2020-06-08 | 2020-07-10 | 深圳市九天睿芯科技有限公司 | Neural network accelerator with full on-chip storage and implementation method thereof |
WO2022013722A1 (en) * | 2020-07-14 | 2022-01-20 | United Microelectronics Centre (Hong Kong) Limited | Processor, logic chip and method for binarized convolution neural network |
CN111931925A (en) * | 2020-08-10 | 2020-11-13 | 西安电子科技大学 | FPGA-based binary neural network acceleration system |
CN111931925B (en) * | 2020-08-10 | 2024-02-09 | 西安电子科技大学 | Acceleration system of binary neural network based on FPGA |
WO2022057054A1 (en) * | 2020-09-18 | 2022-03-24 | 深圳先进技术研究院 | Convolution operation optimization method and system, terminal, and storage medium |
CN112418417A (en) * | 2020-09-24 | 2021-02-26 | 北京计算机技术及应用研究所 | Convolution neural network acceleration device and method based on SIMD technology |
CN112418417B (en) * | 2020-09-24 | 2024-02-27 | 北京计算机技术及应用研究所 | Convolutional neural network acceleration device and method based on SIMD technology |
CN115550607A (en) * | 2020-09-27 | 2022-12-30 | 北京天玛智控科技股份有限公司 | Model reasoning accelerator realized based on FPGA and intelligent visual perception terminal |
CN112199896A (en) * | 2020-10-26 | 2021-01-08 | 云中芯半导体技术(苏州)有限公司 | Chip logic comprehensive optimization acceleration method based on machine learning |
CN112487448B (en) * | 2020-11-27 | 2024-05-03 | 珠海零边界集成电路有限公司 | Encryption information processing device, method and computer equipment |
CN112487448A (en) * | 2020-11-27 | 2021-03-12 | 珠海零边界集成电路有限公司 | Encrypted information processing device and method and computer equipment |
CN112862080A (en) * | 2021-03-10 | 2021-05-28 | 中山大学 | Hardware calculation method of attention mechanism of EfficientNet |
CN112862080B (en) * | 2021-03-10 | 2023-08-15 | 中山大学 | Hardware computing method of attention mechanism of Efficient Net |
CN113301221A (en) * | 2021-03-19 | 2021-08-24 | 西安电子科技大学 | Image processing method, system and application of depth network camera |
CN113298236A (en) * | 2021-06-18 | 2021-08-24 | 中国科学院计算技术研究所 | Low-precision neural network computing device based on data stream structure and acceleration method |
CN113469350A (en) * | 2021-07-07 | 2021-10-01 | 武汉魅瞳科技有限公司 | Deep convolutional neural network acceleration method and system suitable for NPU |
CN113949592A (en) * | 2021-12-22 | 2022-01-18 | 湖南大学 | Anti-attack defense system and method based on FPGA |
CN113949592B (en) * | 2021-12-22 | 2022-03-22 | 湖南大学 | Anti-attack defense system and method based on FPGA |
CN114202071A (en) * | 2022-02-17 | 2022-03-18 | 浙江光珀智能科技有限公司 | Deep convolutional neural network reasoning acceleration method based on data stream mode |
CN114897159A (en) * | 2022-05-18 | 2022-08-12 | 电子科技大学 | Method for rapidly deducing incident angle of electromagnetic signal based on neural network |
CN115083462B (en) * | 2022-07-14 | 2022-11-11 | 中科南京智能技术研究院 | Digital in-memory computing device based on Sram |
CN115083462A (en) * | 2022-07-14 | 2022-09-20 | 中科南京智能技术研究院 | Novel digital in-memory computing device based on Sram |
CN117114055A (en) * | 2023-10-24 | 2023-11-24 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
CN117114055B (en) * | 2023-10-24 | 2024-04-09 | 北京航空航天大学 | FPGA binary neural network acceleration method for industrial application scene |
Also Published As
Publication number | Publication date |
---|---|
CN110458279B (en) | 2022-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110458279A (en) | FPGA-based binary neural network acceleration method and system | |
CN105681628B (en) | A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
CN108256628B (en) | Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof | |
CN109409512B (en) | Flexibly configurable neural network computing unit, computing array and construction method thereof | |
CN108733348B (en) | Fused vector multiplier and method for performing operation using the same | |
CN109784489A (en) | Convolutional neural networks IP kernel based on FPGA | |
CN109032781A (en) | A kind of FPGA parallel system of convolutional neural networks algorithm | |
CN111667051A (en) | Neural network accelerator suitable for edge equipment and neural network acceleration calculation method | |
CN108564168A (en) | A kind of design method to supporting more precision convolutional neural networks processors | |
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor | |
CN105892989A (en) | Neural network accelerator and operational method thereof | |
CN109447241A (en) | A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field | |
CN110163353A (en) | A kind of computing device and method | |
CN109934336A (en) | Neural network dynamic based on optimum structure search accelerates platform designing method and neural network dynamic to accelerate platform | |
Alawad et al. | Stochastic-based deep convolutional networks with reconfigurable logic fabric | |
CN111738433A (en) | Reconfigurable convolution hardware accelerator | |
CN113792621B (en) | FPGA-based target detection accelerator design method | |
CN110276447A (en) | A kind of computing device and method | |
CN110163350A (en) | A kind of computing device and method | |
Fujii et al. | A threshold neuron pruning for a binarized deep neural network on an FPGA | |
Duan et al. | Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights | |
Wang et al. | High-performance mixed-low-precision cnn inference accelerator on fpga | |
Jiang et al. | Hardware implementation of depthwise separable convolution neural network | |
CN113033795B (en) | Pulse convolution neural network hardware accelerator of binary pulse diagram based on time step |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||