CN110458279A - FPGA-based binary neural network acceleration method and system - Google Patents

FPGA-based binary neural network acceleration method and system

Info

Publication number
CN110458279A
CN110458279A (application CN201910636517.2A)
Authority
CN
China
Prior art keywords
convolution
fpga
convolutional calculation
convolutional
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910636517.2A
Other languages
Chinese (zh)
Other versions
CN110458279B (en)
Inventor
李开
邹复好
祁迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Charm Pupil Technology Co Ltd
Original Assignee
Wuhan Charm Pupil Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Charm Pupil Technology Co Ltd filed Critical Wuhan Charm Pupil Technology Co Ltd
Priority to CN201910636517.2A priority Critical patent/CN110458279B/en
Publication of CN110458279A publication Critical patent/CN110458279A/en
Application granted granted Critical
Publication of CN110458279B publication Critical patent/CN110458279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an FPGA-based binary neural network acceleration system. It uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory. The modules obtain the input feature map of the image to be processed, obtain convolution computation logic rules and perform the corresponding binarized convolution computations; the FPGA traverses the convolution computations of multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency and increases the detection speed for images to be detected.

Description

FPGA-based binary neural network acceleration method and system
Technical field
The invention belongs to the field of image processing, and in particular relates to an FPGA-based binary neural network acceleration method and system.
Background art
Major advances in artificial intelligence technology have begun to benefit every aspect of human life. From household vacuum-cleaning robots to complete lines of intelligent production equipment in factories, many tasks around the world have become highly automated. Deep learning plays a crucial role in this technological revolution and is widely applied in fields such as face recognition, object detection and image processing. The algorithm mainly used is the convolutional neural network; this well-performing deep learning algorithm has been deployed on large numbers of PCs, mobile phones and dedicated embedded accelerators to carry out a variety of intelligent computing tasks, and good acceleration results have been achieved.
Convolutional neural networks (CNN) have developed into one of the most important branches of deep learning; their development is the most mature, and they are widely used in all kinds of graphics, image and video processing tasks. Besides the growth of training data and the increase in computing power, the rapid development of convolutional neural networks has also benefited from the various convolutional neural network frameworks. Most existing convolutional neural network applications are deployed on server or desktop platforms, yet the mobile terminal is the most widely used application platform with the largest number of users; only by moving convolutional neural network applications to mobile platforms can the development of deep learning applications be pushed forward to the greatest extent.
However, such mobile terminals and embedded computing devices can provide only limited computing power and limited on-chip storage. As the model structures of convolutional neural networks become increasingly complex, the number of layers grows deeper and the number of parameters grows larger, the deployment of convolutional neural networks on mobile and embedded devices becomes more and more difficult. Running the huge amount of computation on lightweight chips with 32-bit floating-point operands is undoubtedly an enormous drain on computing resources, and it is also very difficult to achieve good real-time performance.
Summary of the invention
Aiming at the above defects or improvement requirements of the prior art, the present invention provides an FPGA-based binary neural network acceleration system that uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory. Each module performs the corresponding binarized convolution computation according to the obtained convolution computation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency, and increases the detection speed for images to be detected.
To achieve the above object, according to one aspect of the present invention, an FPGA-based binary neural network acceleration system is provided. The system comprises a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory.
The convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters; the convolution computation logic rules include the convolution computations of multiple threads.
The cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed.
The binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules.
The basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer. The preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn.
The FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
As a further improvement of the present invention, the FPGA has its corresponding control registers configured by the ARM side and then loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus; the FPGA allocates multiple processing engines to the basic convolution computation modules, a processing engine including arithmetic units, logic units, bit-operation units and storage resources.
As a further improvement of the present invention, the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, which are respectively used for the convolution, PReLU activation, pooling, batch normalization and binary activation computations; the multiple compute engines located in the same convolution computation layer form one convolution acceleration array, and one convolution acceleration array is implemented with one PE module of the FPGA.
As a further improvement of the present invention, the cache module sets a corresponding first buffer and second buffer for each convolution computation array; the first buffer is used to store the operation result of one convolution acceleration array, and the second buffer is used to store the operation result of the corresponding (current) convolution acceleration array.
As a further improvement of the present invention, the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and the new vector is used to produce the data of the output feature map.
As a further improvement of the present invention, when different sliding windows of the pooling layer contain an identical column vector, the computed result of that column vector is placed in a LUT for temporary storage, and the next sliding window directly uses the value stored in the LUT when it computes that column vector.
As a further improvement of the present invention, the FPGA provides each convolution computation layer with a matrix-vector multiplication unit according to the convolution computation logic rules; the matrix-vector multiplication unit includes multiple compute engines, each compute engine includes multiple parallel single-instruction multiple-data (SIMD) lanes, and the compute engines are used to obtain the input feature map of the image to be processed corresponding to the parallel SIMD lanes and to perform multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further improvement of the present invention, the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
As a further improvement of the present invention, the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over and covers the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
As a further improvement of the present invention, the basic convolution computation module merges the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function.
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:
The FPGA-based binary neural network acceleration system of the present invention uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory; each module performs the corresponding binarized convolution computation according to the obtained convolution computation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency, and increases the detection speed for images to be detected.
In the FPGA-based binary neural network acceleration system of the present invention, the dot-product operation of the binary neural network is replaced by XNOR logic operations together with popcount and shift operations. Because the binary operation is a dot product between 1-bit weights and 1-bit input image values, this replacement makes binary convolution much faster than full-precision convolution. Meanwhile, odd-even interleaved padding is used to fill the blank parts of the feature map instead of the all-+1 padding used in previous work, which preserves model accuracy to a certain extent.
In the FPGA-based binary neural network acceleration system of the present invention, the matrix-vector multiplication unit provided on the FPGA performs an offline interleaved reordering of the parameter matrix and the same interleaved reordering of the elements of the input feature map within the sliding window unit; the recombined data are fed as vectors into the convolution acceleration matrix, thereby realizing fully parallel computation.
In the FPGA-based binary neural network acceleration system of the present invention, the computation of the convolution computation layers is accelerated by a double-buffering parallel mechanism; the buffers use a streaming structure, the sliding window is driven by the output data of the previous layer, and the data inside the sliding window are computed in a fully parallel manner, further improving computational efficiency.
In the FPGA-based binary neural network acceleration system of the present invention, the PReLU activation, batch normalization and binary activation are merged into one simple binary function by means of a common affine function, which greatly reduces the computational complexity introduced by batch normalization.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the architecture based on the FPGA's on-chip memory according to an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the basic convolution computation module of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the odd-even interleaved padding of an embodiment of the present invention;
Fig. 5 is a schematic diagram of the double-buffering parallel structure of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the pooling layer computation of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the convolution computation of an embodiment of the present invention;
Fig. 8 is a schematic diagram of the dot-product computation of an embodiment of the present invention;
Fig. 9 is a schematic diagram of the interleaved reordering of the convolution matrices of an embodiment of the present invention;
Fig. 10 is a schematic diagram of the storage of the convolution matrices of an embodiment of the present invention;
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication of an embodiment of the present invention;
Fig. 12 is a schematic diagram of the processing flow of the acceleration system of an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it.
In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict. The present invention is described in more detail below with reference to specific embodiments.
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention. As shown in Fig. 1, the FPGA-based binary neural network acceleration system comprises a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory, wherein:
the convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters; the convolution computation logic rules include the convolution computations of multiple threads;
the cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed;
the binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules;
the basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer; the preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn. As an example, the basic convolution computation module can merge the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function, thereby reducing the computational complexity introduced by batch normalization;
the FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
Fig. 2 is a schematic diagram of the architecture based on the FPGA's on-chip memory according to an embodiment of the present invention. As shown in Fig. 2, the pipelined computing architecture based on the FPGA's on-chip memory stores the convolution kernel data according to the convolution computation logic rules; it both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism of the convolution computation.
As an example, in the FPGA-based hardware implementation of the basic convolution computation module, the overall hardware architecture first has its corresponding control registers configured from the ARM side, and then loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus. The FPGA can allocate a large number of processing engines for the operation of the basic convolution computation modules, including arithmetic units, logic units, bit-operation units and storage resources. As a preferred embodiment, the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for convolution, PReLU activation, pooling, batch normalization and binary activation; the multiple compute engines located in the same convolution computation layer can form one convolution acceleration array, and one convolution acceleration array can be implemented with one PE module of the FPGA. As a further preference, a double-buffer structure can be allocated to each convolution acceleration array: one buffer is responsible for storing the operation result of the previous layer of the network, and the other is responsible for storing the operation result of the current layer.
Fig. 3 is a structural schematic diagram of the basic convolution computation module of an embodiment of the present invention. The basic convolution computation module adds PReLU activation, which can improve the accuracy of the original model by 2 percentage points. The convolution module of a common binary neural network is ordered as: batch normalization, convolution, binary activation, pooling; before the reordering, the input of each module is what the previous module produces after the PReLU + Pool computation, so the data received and transmitted through the buffers between blocks are non-binarized. The basic convolution computation module adjusts the processing order to: convolution, PReLU activation, pooling, batch normalization and binary activation, or convolution, PReLU activation, batch normalization and binary activation. After the reordering, the data passed between basic convolution computation modules have already been converted to binary values by the binary activation function, so the data transmitted between blocks are also binary; the amount of data exchanged between blocks is thus greatly reduced, the communication cost between blocks is lowered, and it becomes easy to design a unified interface for all basic convolution computation modules. Meanwhile, the buffers used for exchanging data become smaller, saving hardware resources.
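The merging of PReLU activation, batch normalization and binary activation into a single threshold function, mentioned above, can be illustrated with the following C++ sketch. It is an editorial illustration rather than part of the original disclosure: it assumes per-channel batch normalization parameters (gamma, beta, mean, stddev) and a positive PReLU slope, and all names are hypothetical.
```cpp
// Editorial sketch (not from the patent): fold PReLU + batch normalization +
// sign() into one per-channel threshold on the raw convolution accumulator.
// Assumes per-channel BN parameters (gamma, beta, mean, stddev) with
// stddev > 0, gamma != 0, and a PReLU slope a > 0. All names are illustrative.
#include <cstddef>
#include <vector>

struct FusedThreshold {
    std::vector<float> thr;   // threshold compared against the convolution accumulator
    std::vector<bool>  flip;  // true when gamma < 0 reverses the comparison
};

FusedThreshold fuse(const std::vector<float>& gamma, const std::vector<float>& beta,
                    const std::vector<float>& mean,  const std::vector<float>& stddev,
                    float prelu_slope) {
    FusedThreshold f;
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        // BN(p) >= 0  <=>  p >= mean - beta*stddev/gamma  (comparison flips if gamma < 0)
        float t = mean[c] - beta[c] * stddev[c] / gamma[c];
        // Undo the monotonic PReLU (p = x for x >= 0, p = a*x otherwise, a > 0):
        // a negative threshold on p becomes t/a on x, a non-negative one stays t.
        if (t < 0.0f) t /= prelu_slope;
        f.thr.push_back(t);
        f.flip.push_back(gamma[c] < 0.0f);
    }
    return f;
}

// Binary activation of a convolution accumulator `acc` belonging to channel c.
inline bool binarize(float acc, const FusedThreshold& f, std::size_t c) {
    return f.flip[c] ? (acc <= f.thr[c]) : (acc >= f.thr[c]);
}
```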
Fig. 4 is a schematic diagram of the odd-even interleaved padding of an embodiment of the present invention. As shown in Fig. 4, odd-even interleaved padding is used in the convolution computation to fill the blank regions of the output feature map so that the output feature map keeps its dimensions; specifically, ±1 values are filled in an interleaved pattern along the width, height and channel dimensions of the feature map according to an odd-even ordering. The error rate of the network model trained with odd-even padding on the Cifar10 data set is only 11.50%, close to the error rate of all-zero padding under full precision, and lower than the 13.76% of all-+1 padding and the 12.85% of all-odd (or all-even) padding.
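A minimal sketch of such odd-even interleaved ±1 padding is given below as an editorial illustration. The exact parity rule (here, the parity of x + y + channel) and the HWC memory layout are assumptions, since the patent only states that ±1 values are interleaved according to an odd-even ordering over the width, height and channel dimensions.
```cpp
// Editorial sketch (not from the patent) of odd-even interleaved +/-1 border
// padding: the pad value alternates with the parity of (x + y + channel).
// The exact parity rule and the HWC layout are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int8_t> pad_parity(const std::vector<int8_t>& in,
                               int h, int w, int c, int pad) {
    const int H = h + 2 * pad, W = w + 2 * pad;
    std::vector<int8_t> out(static_cast<std::size_t>(H) * W * c);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            for (int ch = 0; ch < c; ++ch) {
                const int iy = y - pad, ix = x - pad;
                const bool border = iy < 0 || iy >= h || ix < 0 || ix >= w;
                out[(static_cast<std::size_t>(y) * W + x) * c + ch] = border
                    ? (((x + y + ch) & 1) ? int8_t(+1) : int8_t(-1))  // interleaved +/-1
                    : in[(static_cast<std::size_t>(iy) * w + ix) * c + ch];
            }
    return out;
}
```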
Fig. 5 is a schematic diagram of the double-buffering parallel structure of an embodiment of the present invention. The cache module sets a corresponding first buffer and second buffer for each convolution computation array; the first buffer stores the operation result of one convolution acceleration array and the second buffer stores the operation result of the corresponding convolution acceleration array, and both buffers are filled according to the convolution computation logic rules. Taking the pooling layer as an example, two buffers are allocated for it at the start of the computation; the buffer width is the width W of the pooling layer's input feature map, and the height is the kernel size k. The two buffers alternately receive the results computed by the previous layer. When the first k-1 rows of Buffer1 are full and the k-th element of row k arrives, the sliding window can start producing results; from then on, one result is produced for every new element received and is written into the output feature map. When Buffer1 is full, Buffer2 starts receiving data. At this point the sliding window only takes effect once the k-th element of the first row of Buffer2 has arrived; this sliding window contains the data of k-1 rows in Buffer1 and of one row in Buffer2. From then on, as before, each slide of the window depends on the arrival of the next new element and produces one result. When the coverage of the sliding window no longer includes any data in Buffer1, i.e. Buffer1 is emptied, Buffer1 starts receiving new data once Buffer2 is full, and the process cycles on in this way.
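The alternating filling of the two k-row buffers described above can be sketched as follows for the stride-1 case; this is an editorial simplification (a ring of 2k rows, one k x k window per incoming element once the buffers are warmed up), and the class and method names are hypothetical.
```cpp
// Editorial sketch (not from the patent) of the two alternating k-row line
// buffers: a ring of 2*K rows of width W receives the previous layer's output
// stream, and once the first K-1 rows plus K elements of row K have arrived,
// every further element yields one K x K window (stride-1 simplification).
#include <array>
#include <cstdint>

template <int K, int W>
struct PingPongLineBuffer {
    std::array<std::array<int8_t, W>, 2 * K> rows{};  // rows 0..K-1 = Buffer1, K..2K-1 = Buffer2
    long long count = 0;                              // elements received so far

    // Push one streamed element; returns true when `window` holds a valid K x K patch.
    bool push(int8_t v, std::array<int8_t, K * K>& window) {
        const int r = static_cast<int>((count / W) % (2 * K));
        const int c = static_cast<int>(count % W);
        rows[r][c] = v;
        ++count;
        if (count < static_cast<long long>(K - 1) * W + K) return false;  // still warming up
        if (c + 1 < K) return false;                   // window not yet fully inside the row
        for (int i = 0; i < K; ++i) {                  // gather the K most recent rows
            const int rr = (r - (K - 1) + i + 2 * K) % (2 * K);
            for (int j = 0; j < K; ++j) window[i * K + j] = rows[rr][c + 1 - K + j];
        }
        return true;
    }
};
```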
As an example, the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and that vector is reduced to produce the data of the output feature map. As a preferred embodiment, when different sliding windows contain an identical column vector, the computed result of that column vector can be placed in a LUT for temporary storage and reused directly by the next sliding window. Fig. 6 is a schematic diagram of the pooling layer computation of an embodiment of the present invention. As shown in Fig. 6, take a 3*3 max pooling with stride = 2 as an example. The sliding window slides over the same row array; each column in the sliding window is regarded as a vector, so each sliding window contains 3 such column vectors. Since there is no data dependence between these 3 vectors, SIMD vectorization can be used to compute the maximum of each of the 3 column vectors simultaneously. After the 3 maxima are obtained, they form a vector, the maximum of that vector is computed, and the result is written into the output feature map as a new element. It is worth noting that after the 3 columns of a sliding window have been computed, the result of the rightmost column can be stored temporarily in a LUT and used as the result of the first column of the next sliding window, because adjacent sliding windows share one column of data. Thus, apart from the leftmost sliding window whose column parallelism is 3, every remaining sliding window only needs to compute 2 new columns.
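A compact C++ sketch of the 3x3, stride-2 max pooling with column-maximum reuse described above follows; it is an editorial illustration, and the row-major buffer layout and function name are assumptions.
```cpp
// Editorial sketch (not from the patent) of 3x3, stride-2 max pooling with
// column-maximum reuse: the rightmost column maximum of each window is cached
// and reused as the left column of the next window. Row-major layout assumed.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int8_t> maxpool_row(const std::vector<int8_t>& rows, int W) {
    // `rows` holds 3 rows of width W, row-major (3 * W elements).
    auto col_max = [&](int x) {
        return std::max({rows[x], rows[W + x], rows[2 * W + x]});
    };
    std::vector<int8_t> out;
    int8_t cached = col_max(0);                 // column shared with the next window
    for (int x0 = 0; x0 + 2 < W; x0 += 2) {     // stride 2
        const int8_t c0 = cached;               // reused column maximum
        const int8_t c1 = col_max(x0 + 1);
        const int8_t c2 = col_max(x0 + 2);
        out.push_back(std::max({c0, c1, c2}));
        cached = c2;                            // becomes the left column of the next window
    }
    return out;
}
```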
Fig. 7 is a schematic diagram of the convolution computation of an embodiment of the present invention. As shown in Fig. 7, as an example, each convolution computation layer is provided with a matrix-vector multiplication unit according to the convolution computation logic rules; a matrix-vector multiplication unit includes multiple compute engines, and each compute engine includes multiple parallel single-instruction multiple-data (SIMD) lanes. Further, the compute engines obtain the input feature map data of the image to be processed corresponding to the parallel SIMD lanes and the corresponding convolution kernel parameters located in on-chip memory; every compute engine receives the same control signal and the same image vector data and, during computation, performs multiply-accumulate operations between these data and its own filter corresponding to the convolution kernel parameters.
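The behaviour of the matrix-vector multiplication unit, in which the compute engines share the same input slice while each accumulates against its own filter over several SIMD elements per step, can be sketched in software as follows. This is an editorial illustration with hypothetical names; on the FPGA the engine and lane loops are unrolled in hardware.
```cpp
// Editorial sketch (not from the patent) of the matrix-vector multiplication
// unit: PE engines share the same SIMD-wide input slice per step, and each
// engine multiply-accumulates it against its own filter row.
#include <array>
#include <cstdint>
#include <vector>

template <int PE, int SIMD>
std::array<int32_t, PE> mvu(const std::vector<std::vector<int8_t>>& weights,  // PE filter rows, length n
                            const std::vector<int8_t>& input) {               // input vector, length n
    std::array<int32_t, PE> acc{};                        // one accumulator per engine
    const int n = static_cast<int>(input.size());
    for (int i = 0; i < n; i += SIMD)                     // stream the shared input slice
        for (int p = 0; p < PE; ++p)                      // every engine sees the same slice
            for (int s = 0; s < SIMD && i + s < n; ++s)
                acc[p] += weights[p][i + s] * input[i + s];
    return acc;
}
```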
As a further preference, the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
Fig. 8 is a schematic diagram of the dot-product computation of an embodiment of the present invention. As shown in Fig. 8, taking the computation data flow of one compute engine in the matrix-vector multiplication unit as an example, the engine mainly computes the dot product between the input vector and one row of the parameter matrix, compares the result with a threshold, and finally outputs a 1-bit value. The dot product is essentially a multiply-accumulate operation between two vectors, and here, as in binary neural networks, it is implemented with XNOR (exclusive-NOR) gates. The first step is to XOR the elements at corresponding positions within the sliding window and store the results in an array; the second step is to count the number of 1s in the array with popcount; the third step is to obtain the final convolution result according to the formula result = -(popcount - (N - popcount)). Finally, the result is compared with the threshold and the final value is output. The same compute engine structure also supports non-binarized computation: it only requires the dot-product gates in the dotted portion of the figure to be replaced with conventional parallel multipliers.
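The XOR/popcount dot product described above follows directly from the ±1 encoding: popcount of the XOR word counts the mismatching positions, and result = -(popcount - (N - popcount)) = N - 2*popcount. A small C++20 sketch is given below as an editorial illustration; the word width and names are assumptions.
```cpp
// Editorial sketch (not from the patent, requires C++20 <bit>): bits encode
// {+1, -1} as {0, 1}, so XOR marks mismatching positions; with pc = popcount
// of the XOR word, result = -(pc - (N - pc)) = N - 2*pc.
#include <bit>
#include <cstdint>

// N = number of valid bit positions in `a` and `b` (N <= 64 in this sketch).
int binary_dot(uint64_t a, uint64_t b, int N) {
    const uint64_t mask = (N == 64) ? ~0ull : ((1ull << N) - 1);
    const int pc = std::popcount((a ^ b) & mask);  // number of -1 contributions
    return -(pc - (N - pc));                       // equivalently N - 2 * pc
}

// 1-bit output as in the compute engine: compare against a per-row threshold.
inline bool binary_dot_thresholded(uint64_t a, uint64_t b, int N, int thr) {
    return binary_dot(a, b, N) >= thr;
}
```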
As a further improvement of the present invention, the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over and covers the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
Fig. 9 is a schematic diagram of the interleaved reordering of the convolution matrices of an embodiment of the present invention. As shown in Fig. 9, according to the convolution computation logic rules, i.e. an interleaved matrix reordering based on the channel dimension, the convolution computation can be converted into an ordinary matrix multiplication: the convolution kernel parameters required for the convolution computation are packed into a parameter matrix, the sliding window translates over and covers the input feature map, the covered elements of the input feature map are packed into an image matrix, and finally these matrices are multiplied to produce the output. Since a dot product includes all pixel values inside one sliding window, and addition is commutative, the interleaved reordering can use any order; here the pixel values at the same position in different channels are placed together, and other orderings could also be used without changing the final result. It should be noted that the conversion of the filter matrix requires no overhead because it is done before the program runs, whereas the image matrix is converted at run time. Fig. 10 is a schematic diagram of the storage of the convolution matrices of an embodiment of the present invention. As shown in Fig. 10, the input image is simply stored in the buffer in a certain order; an address generator then fetches the memory positions corresponding to each sliding window, and the data transmitted from the previous layer generate the image matrix according to the same ordering rule as the filter matrix.
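The channel-interleaved packing of the sliding windows into an image matrix described above can be sketched as follows; this is an editorial illustration assuming an HWC feature-map layout, and the same per-pixel channel grouping would have to be applied offline to the filter matrix.
```cpp
// Editorial sketch (not from the patent) of channel-interleaved packing:
// each k x k window of an H x W x C feature map (HWC layout assumed) becomes
// one row of the image matrix, with the C channel values of every spatial
// position kept adjacent.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int8_t> im2col_interleaved(const std::vector<int8_t>& fmap,  // H*W*C elements
                                       int H, int W, int C, int k, int stride) {
    const int outH = (H - k) / stride + 1, outW = (W - k) / stride + 1;
    std::vector<int8_t> mat;
    mat.reserve(static_cast<std::size_t>(outH) * outW * k * k * C);
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox)          // one matrix row per sliding-window position
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    for (int c = 0; c < C; ++c)    // channels of one pixel stay together
                        mat.push_back(
                            fmap[(static_cast<std::size_t>(oy * stride + ky) * W + (ox * stride + kx)) * C + c]);
    return mat;                                    // (outH*outW) rows of length k*k*C, row-major
}
```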
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication of an embodiment of the present invention. As shown in Fig. 11, since almost all computations in a binary neural network can be expressed as matrix-vector multiplications, this operation largely determines the throughput of the system and also directly affects its resource utilization and power consumption. Let the number of compute engines of a layer be a, the number of SIMD lanes in each compute engine be b, and the parameter matrix size be m*n. The total folding factor is then (m/a)*(n/b), and the number of cycles needed to complete one matrix-vector multiplication is also (m/a)*(n/b). Because the acceleration structure of the binary neural network is a streaming structure, the overall computation throughput is determined by the slowest layer; therefore, different numbers of compute engines and SIMD lanes must be configured for each convolutional layer and each fully connected layer so that the number of cycles required by every layer is roughly equal, making the forward computation of the whole network fastest. The folded structure of the matrix-vector multiplication is shown in Fig. 11; the folded implementation makes full use of the computation space, so folding according to the computational load achieves better inference performance.
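The folding rule above can be checked with a small worked example; the layer sizes below are made up for illustration and are not taken from the patent.
```cpp
// Editorial worked example (made-up layer sizes, not from the patent): an
// m x n parameter matrix folded over a engines and b SIMD lanes per engine
// needs (m/a)*(n/b) cycles per matrix-vector product; layers are balanced by
// choosing (a, b) so the cycle counts come out roughly equal.
#include <cstdio>

long fold_cycles(long m, long n, long a, long b) {
    return (m / a) * (n / b);   // assumes a divides m and b divides n
}

int main() {
    std::printf("layer1: %ld cycles\n", fold_cycles(128, 1152, 16, 32));  // 8 * 36 = 288
    std::printf("layer2: %ld cycles\n", fold_cycles(256, 2304, 32, 64));  // 8 * 36 = 288
    return 0;
}
```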
Fig. 12 is a schematic diagram of the processing flow of the acceleration system of an embodiment of the present invention. As shown in Fig. 12, the processing flow includes three stages. The first stage is the binary neural network initialization and picture preprocessing stage, which includes importing the bitstream file, initializing the network structure, interleaved reordering of the weight parameters, on-chip memory allocation, and resizing the picture (the picture size is adjusted to 32*32*3). The second stage is the FPGA acceleration stage, which produces a one-dimensional feature vector. The third stage is the return stage, which includes the classification of the feature vector on the ARM processor.
The network improved on the basis of VGG16 was accelerated on a Xilinx PYNQ-Z1 development board through the Vivado HLS high-level synthesis tool, breaking the traditional implementation pattern of convolutional neural networks on FPGAs. The overall hardware structure designed uses a pipelined computing architecture based on the FPGA's on-chip memory, which both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism. Meanwhile, the convolutional layers, pooling layers, batch normalization layers and fully connected layers of the binary neural network were all optimized correspondingly. To fully exploit the parallel potential, a matrix-vector multiplication unit was designed to support the convolutional layer computation of the network. By configuring different numbers of PEs and SIMD lanes for each layer of the network, each layer can reach its local optimum, and the globally optimal performance is finally obtained. The optimization yields higher data throughput, faster processing speed and lower power consumption. Table 1 is a schematic table of the fully binarized network structure of the embodiment of the present invention. As shown in Table 1, with the final acceleration scheme, forward inference was performed on the fully binarized network structure, achieving a processing speed of 844 FPS and a data throughput of 3.8 TOPS. The overall power consumption of the accelerator is only 2.3 W, and the model accuracy is 83.6%.
Table 1: Schematic table of the fully binarized network structure of the embodiment of the present invention
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall all be included within the protection scope of the present invention.

Claims (10)

1. An FPGA-based binary neural network acceleration system, comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory, characterized in that:
the convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters, the convolution computation logic rules including the convolution computations of multiple threads;
the cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed;
the binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules;
the basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer; the preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn;
the FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
2. The FPGA-based binary neural network acceleration system according to claim 1, characterized in that the FPGA has its corresponding control registers configured by the ARM side and loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus; the FPGA allocates multiple processing engines to the basic convolution computation modules, the processing engines including arithmetic units, logic units, bit-operation units and storage resources.
3. The FPGA-based binary neural network acceleration system according to claim 2, characterized in that the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for the convolution, PReLU activation, pooling, batch normalization and binary activation computations; the multiple compute engines located in the same convolution computation layer form one convolution acceleration array, and one convolution acceleration array is implemented with one PE module of the FPGA.
4. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the cache module sets a corresponding first buffer and second buffer for each convolution computation array, the first buffer being used to store the operation result of one convolution acceleration array and the second buffer being used to store the operation result of the corresponding convolution acceleration array.
5. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and the new vector is used to produce the data of the output feature map.
6. The FPGA-based binary neural network acceleration system according to claim 5, characterized in that, when different sliding windows of the pooling layer contain an identical column vector, the computed result of that column vector is placed in a LUT for temporary storage, and the next sliding window directly uses the value stored in the LUT when it computes that column vector.
7. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA provides each convolution computation layer with a matrix-vector multiplication unit according to the convolution computation logic rules, the matrix-vector multiplication unit including multiple compute engines and each compute engine including multiple parallel single-instruction multiple-data (SIMD) lanes; the compute engines are used to obtain the input feature map of the image to be processed corresponding to the parallel SIMD lanes and to perform multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
8. The FPGA-based binary neural network acceleration system according to claim 7, characterized in that the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
9. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
10. The FPGA-based binary neural network acceleration system according to any one of claims 1 to 9, characterized in that the basic convolution computation module merges the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function.
CN201910636517.2A 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system Active CN110458279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Publications (2)

Publication Number Publication Date
CN110458279A true CN110458279A (en) 2019-11-15
CN110458279B CN110458279B (en) 2022-05-20

Family

ID=68481247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636517.2A Active CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Country Status (1)

Country Link
CN (1) CN110458279B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
US20180046913A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Combining cpu and special accelerator for implementing an artificial neural network
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUNZHI DUAN ET AL.: "Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights", 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP) *
QIU YUE ET AL. (仇越等): "Design and Implementation of an FPGA-based Convolutional Neural Network Accelerator" (in Chinese), Microelectronics & Computer (微电子学与计算机) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126309A (en) * 2019-12-26 2020-05-08 长沙海格北斗信息技术有限公司 Convolutional neural network architecture method based on FPGA and face recognition method thereof
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111401543A (en) * 2020-06-08 2020-07-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
WO2022013722A1 (en) * 2020-07-14 2022-01-20 United Microelectronics Centre (Hong Kong) Limited Processor, logic chip and method for binarized convolution neural network
CN111931925A (en) * 2020-08-10 2020-11-13 西安电子科技大学 FPGA-based binary neural network acceleration system
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
WO2022057054A1 (en) * 2020-09-18 2022-03-24 深圳先进技术研究院 Convolution operation optimization method and system, terminal, and storage medium
CN112418417A (en) * 2020-09-24 2021-02-26 北京计算机技术及应用研究所 Convolution neural network acceleration device and method based on SIMD technology
CN112418417B (en) * 2020-09-24 2024-02-27 北京计算机技术及应用研究所 Convolutional neural network acceleration device and method based on SIMD technology
CN115550607A (en) * 2020-09-27 2022-12-30 北京天玛智控科技股份有限公司 Model reasoning accelerator realized based on FPGA and intelligent visual perception terminal
CN112199896A (en) * 2020-10-26 2021-01-08 云中芯半导体技术(苏州)有限公司 Chip logic comprehensive optimization acceleration method based on machine learning
CN112487448B (en) * 2020-11-27 2024-05-03 珠海零边界集成电路有限公司 Encryption information processing device, method and computer equipment
CN112487448A (en) * 2020-11-27 2021-03-12 珠海零边界集成电路有限公司 Encrypted information processing device and method and computer equipment
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113298236A (en) * 2021-06-18 2021-08-24 中国科学院计算技术研究所 Low-precision neural network computing device based on data stream structure and acceleration method
CN113469350A (en) * 2021-07-07 2021-10-01 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN113949592A (en) * 2021-12-22 2022-01-18 湖南大学 Anti-attack defense system and method based on FPGA
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114202071A (en) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114897159A (en) * 2022-05-18 2022-08-12 电子科技大学 Method for rapidly deducing incident angle of electromagnetic signal based on neural network
CN115083462B (en) * 2022-07-14 2022-11-11 中科南京智能技术研究院 Digital in-memory computing device based on Sram
CN115083462A (en) * 2022-07-14 2022-09-20 中科南京智能技术研究院 Novel digital in-memory computing device based on Sram
CN117114055A (en) * 2023-10-24 2023-11-24 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene
CN117114055B (en) * 2023-10-24 2024-04-09 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene

Also Published As

Publication number Publication date
CN110458279B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN110458279A (en) A kind of binary neural network accelerated method and system based on FPGA
CN105681628B (en) A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN108256628B (en) Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN109784489A (en) Convolutional neural networks IP kernel based on FPGA
CN109032781A (en) A kind of FPGA parallel system of convolutional neural networks algorithm
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108564168A (en) A kind of design method to supporting more precision convolutional neural networks processors
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN105892989A (en) Neural network accelerator and operational method thereof
CN109447241A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN110163353A (en) A kind of computing device and method
CN109934336A (en) Neural network dynamic based on optimum structure search accelerates platform designing method and neural network dynamic to accelerate platform
Alawad et al. Stochastic-based deep convolutional networks with reconfigurable logic fabric
CN111738433A (en) Reconfigurable convolution hardware accelerator
CN113792621B (en) FPGA-based target detection accelerator design method
CN110276447A (en) A kind of computing device and method
CN110163350A (en) A kind of computing device and method
Fujii et al. A threshold neuron pruning for a binarized deep neural network on an FPGA
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Wang et al. High-performance mixed-low-precision cnn inference accelerator on fpga
Jiang et al. Hardware implementation of depthwise separable convolution neural network
CN113033795B (en) Pulse convolution neural network hardware accelerator of binary pulse diagram based on time step

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant