CN110458279A - FPGA-based binary neural network acceleration method and system - Google Patents

FPGA-based binary neural network acceleration method and system

Info

Publication number
CN110458279A
CN110458279A (application CN201910636517.2A)
Authority
CN
China
Prior art keywords
convolution
fpga
convolutional calculation
convolutional
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910636517.2A
Other languages
Chinese (zh)
Other versions
CN110458279B (en)
Inventor
李开
邹复好
祁迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Charm Pupil Technology Co Ltd
Original Assignee
Wuhan Charm Pupil Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Charm Pupil Technology Co Ltd filed Critical Wuhan Charm Pupil Technology Co Ltd
Priority to CN201910636517.2A priority Critical patent/CN110458279B/en
Publication of CN110458279A publication Critical patent/CN110458279A/en
Application granted granted Critical
Publication of CN110458279B publication Critical patent/CN110458279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an FPGA-based binary neural network acceleration system. It uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory. The modules obtain the input feature map of the image to be processed, obtain convolution computation logic rules and perform the corresponding binarized convolution computations; the FPGA traverses the convolution computations of multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency and increases the detection speed for images to be detected.

Description

FPGA-based binary neural network acceleration method and system
Technical field
The invention belongs to the field of image processing, and in particular relates to an FPGA-based binary neural network acceleration method and system.
Background art
Major advances in artificial intelligence technology have begun to benefit every aspect of human life. From household vacuum-cleaning robots to complete lines of intelligent production equipment in factories, many tasks around the world have become highly automated. Deep learning plays a crucial role in this technological revolution and is widely applied in fields such as face recognition, object detection and image processing. The algorithm mainly used is the convolutional neural network; this well-performing deep learning algorithm has been deployed on large numbers of PCs, mobile phones and dedicated embedded accelerators to carry out a variety of intelligent computing tasks, and good acceleration results have been achieved.
Convolutional neural networks (CNN) have developed into one of the most important branches of deep learning; their development is the most mature, and they are widely used in all kinds of graphics, image and video processing tasks. Besides the growth of training data and the increase in computing power, the rapid development of convolutional neural networks has also benefited from the various convolutional neural network frameworks. Most existing convolutional neural network applications are deployed on server or desktop platforms, yet the mobile terminal is the most widely used application platform with the largest number of users; only by moving convolutional neural network applications to mobile platforms can the development of deep learning applications be pushed forward to the greatest extent.
However, such mobile terminals and embedded computing devices can provide only limited computing power and limited on-chip storage. As the model structures of convolutional neural networks become increasingly complex, the number of layers grows deeper and the number of parameters grows larger, the deployment of convolutional neural networks on mobile and embedded devices becomes more and more difficult. Running the huge amount of computation on lightweight chips with 32-bit floating-point operands is undoubtedly an enormous drain on computing resources, and it is also very difficult to achieve good real-time performance.
Summary of the invention
Aiming at the above defects or improvement requirements of the prior art, the present invention provides an FPGA-based binary neural network acceleration system that uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory. Each module performs the corresponding binarized convolution computation according to the obtained convolution computation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency, and increases the detection speed for images to be detected.
To achieve the above object, according to one aspect of the present invention, an FPGA-based binary neural network acceleration system is provided. The system comprises a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory.
The convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters; the convolution computation logic rules include the convolution computations of multiple threads.
The cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed.
The binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules.
The basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer. The preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn.
The FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
As a further improvement of the present invention, the FPGA has its corresponding control registers configured by the ARM side and then loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus; the FPGA allocates multiple processing engines to the basic convolution computation modules, a processing engine including arithmetic units, logic units, bit-operation units and storage resources.
As a further improvement of the present invention, the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, which are respectively used for the convolution, PReLU activation, pooling, batch normalization and binary activation computations; the multiple compute engines located in the same convolution computation layer form one convolution acceleration array, and one convolution acceleration array is implemented with one PE module of the FPGA.
As a further improvement of the present invention, the cache module sets a corresponding first buffer and second buffer for each convolution computation array; the first buffer is used to store the operation result of one convolution acceleration array, and the second buffer is used to store the operation result of the corresponding (current) convolution acceleration array.
As a further improvement of the present invention, the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and the new vector is used to produce the data of the output feature map.
As a further improvement of the present invention, when different sliding windows of the pooling layer contain an identical column vector, the computed result of that column vector is placed in a LUT for temporary storage, and the next sliding window directly uses the value stored in the LUT when it computes that column vector.
As a further improvement of the present invention, the FPGA provides each convolution computation layer with a matrix-vector multiplication unit according to the convolution computation logic rules; the matrix-vector multiplication unit includes multiple compute engines, each compute engine includes multiple parallel single-instruction multiple-data (SIMD) lanes, and the compute engines are used to obtain the input feature map of the image to be processed corresponding to the parallel SIMD lanes and to perform multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
As a further improvement of the present invention, the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
As a further improvement of the present invention, the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over and covers the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
As a further improvement of the present invention, the basic convolution computation module merges the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function.
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:
The FPGA-based binary neural network acceleration system of the present invention uses a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory; each module performs the corresponding binarized convolution computation according to the obtained convolution computation logic rules. Through this overall architecture, the computation of every layer of the binary neural network is offloaded to on-chip memory without depending on interaction between off-chip and on-chip memory, which reduces the communication cost between memories, greatly improves computational efficiency, and increases the detection speed for images to be detected.
In the FPGA-based binary neural network acceleration system of the present invention, the dot-product operation of the binary neural network is replaced by XNOR logic operations together with popcount and shift operations. Because the binary operation is a dot product between 1-bit weights and 1-bit input image values, this replacement makes binary convolution much faster than full-precision convolution. Meanwhile, odd-even interleaved padding is used to fill the blank parts of the feature map instead of the all-+1 padding used in previous work, which preserves model accuracy to a certain extent.
In the FPGA-based binary neural network acceleration system of the present invention, the matrix-vector multiplication unit provided on the FPGA performs an offline interleaved reordering of the parameter matrix and the same interleaved reordering of the elements of the input feature map within the sliding window unit; the recombined data are fed as vectors into the convolution acceleration matrix, thereby realizing fully parallel computation.
In the FPGA-based binary neural network acceleration system of the present invention, the computation of the convolution computation layers is accelerated by a double-buffering parallel mechanism; the buffers use a streaming structure, the sliding window is driven by the output data of the previous layer, and the data inside the sliding window are computed in a fully parallel manner, further improving computational efficiency.
In the FPGA-based binary neural network acceleration system of the present invention, the PReLU activation, batch normalization and binary activation are merged into one simple binary function by means of a common affine function, which greatly reduces the computational complexity introduced by batch normalization.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the architecture based on the FPGA's on-chip memory according to an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the basic convolution computation module of an embodiment of the present invention;
Fig. 4 is a schematic diagram of the odd-even interleaved padding of an embodiment of the present invention;
Fig. 5 is a schematic diagram of the double-buffering parallel structure of an embodiment of the present invention;
Fig. 6 is a schematic diagram of the pooling layer computation of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the convolution computation of an embodiment of the present invention;
Fig. 8 is a schematic diagram of the dot-product computation of an embodiment of the present invention;
Fig. 9 is a schematic diagram of the interleaved reordering of the convolution matrices of an embodiment of the present invention;
Fig. 10 is a schematic diagram of the storage of the convolution matrices of an embodiment of the present invention;
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication of an embodiment of the present invention;
Fig. 12 is a schematic diagram of the processing flow of the acceleration system of an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are merely illustrative of the present invention and are not intended to limit it.
In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict. The present invention is described in more detail below with reference to specific embodiments.
Fig. 1 is a structural schematic diagram of an FPGA-based binary neural network acceleration system according to an embodiment of the present invention. As shown in Fig. 1, the FPGA-based binary neural network acceleration system comprises a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory, wherein:
the convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters; the convolution computation logic rules include the convolution computations of multiple threads;
the cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed;
the binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules;
the basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer; the preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn. As an example, the basic convolution computation module can merge the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function, thereby reducing the computational complexity introduced by batch normalization;
the FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
Fig. 2 is a schematic diagram of the architecture based on the FPGA's on-chip memory according to an embodiment of the present invention. As shown in Fig. 2, the pipelined computing architecture based on the FPGA's on-chip memory stores the convolution kernel data according to the convolution computation logic rules; it both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism of the convolution computation.
As an example, in the FPGA-based hardware implementation of the basic convolution computation module, the overall hardware architecture first has its corresponding control registers configured from the ARM side, and then loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus. The FPGA can allocate a large number of processing engines for the operation of the basic convolution computation modules, including arithmetic units, logic units, bit-operation units and storage resources. As a preferred embodiment, the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for convolution, PReLU activation, pooling, batch normalization and binary activation; the multiple compute engines located in the same convolution computation layer can form one convolution acceleration array, and one convolution acceleration array can be implemented with one PE module of the FPGA. As a further preference, a double-buffer structure can be allocated to each convolution acceleration array: one buffer is responsible for storing the operation result of the previous layer of the network, and the other is responsible for storing the operation result of the current layer.
Fig. 3 is a structural schematic diagram of the basic convolution computation module of an embodiment of the present invention. The basic convolution computation module adds PReLU activation, which can improve the accuracy of the original model by 2 percentage points. The convolution module of a common binary neural network is ordered as: batch normalization, convolution, binary activation, pooling; before the reordering, the input of each module is what the previous module produces after the PReLU + Pool computation, so the data received and transmitted through the buffers between blocks are non-binarized. The basic convolution computation module adjusts the processing order to: convolution, PReLU activation, pooling, batch normalization and binary activation, or convolution, PReLU activation, batch normalization and binary activation. After the reordering, the data passed between basic convolution computation modules have already been converted to binary values by the binary activation function, so the data transmitted between blocks are also binary; the amount of data exchanged between blocks is thus greatly reduced, the communication cost between blocks is lowered, and it becomes easy to design a unified interface for all basic convolution computation modules. Meanwhile, the buffers used for exchanging data become smaller, saving hardware resources.
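The merging of PReLU activation, batch normalization and binary activation into a single threshold function, mentioned above, can be illustrated with the following C++ sketch. It is an editorial illustration rather than part of the original disclosure: it assumes per-channel batch normalization parameters (gamma, beta, mean, stddev) and a positive PReLU slope, and all names are hypothetical.
```cpp
// Editorial sketch (not from the patent): fold PReLU + batch normalization +
// sign() into one per-channel threshold on the raw convolution accumulator.
// Assumes per-channel BN parameters (gamma, beta, mean, stddev) with
// stddev > 0, gamma != 0, and a PReLU slope a > 0. All names are illustrative.
#include <cstddef>
#include <vector>

struct FusedThreshold {
    std::vector<float> thr;   // threshold compared against the convolution accumulator
    std::vector<bool>  flip;  // true when gamma < 0 reverses the comparison
};

FusedThreshold fuse(const std::vector<float>& gamma, const std::vector<float>& beta,
                    const std::vector<float>& mean,  const std::vector<float>& stddev,
                    float prelu_slope) {
    FusedThreshold f;
    for (std::size_t c = 0; c < gamma.size(); ++c) {
        // BN(p) >= 0  <=>  p >= mean - beta*stddev/gamma  (comparison flips if gamma < 0)
        float t = mean[c] - beta[c] * stddev[c] / gamma[c];
        // Undo the monotonic PReLU (p = x for x >= 0, p = a*x otherwise, a > 0):
        // a negative threshold on p becomes t/a on x, a non-negative one stays t.
        if (t < 0.0f) t /= prelu_slope;
        f.thr.push_back(t);
        f.flip.push_back(gamma[c] < 0.0f);
    }
    return f;
}

// Binary activation of a convolution accumulator `acc` belonging to channel c.
inline bool binarize(float acc, const FusedThreshold& f, std::size_t c) {
    return f.flip[c] ? (acc <= f.thr[c]) : (acc >= f.thr[c]);
}
```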
Fig. 4 is a schematic diagram of the odd-even interleaved padding of an embodiment of the present invention. As shown in Fig. 4, odd-even interleaved padding is used in the convolution computation to fill the blank regions of the output feature map so that the output feature map keeps its dimensions; specifically, ±1 values are filled in an interleaved pattern along the width, height and channel dimensions of the feature map according to an odd-even ordering. The error rate of the network model trained with odd-even padding on the Cifar10 data set is only 11.50%, close to the error rate of all-zero padding under full precision, and lower than the 13.76% of all-+1 padding and the 12.85% of all-odd (or all-even) padding.
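A minimal sketch of such odd-even interleaved ±1 padding is given below as an editorial illustration. The exact parity rule (here, the parity of x + y + channel) and the HWC memory layout are assumptions, since the patent only states that ±1 values are interleaved according to an odd-even ordering over the width, height and channel dimensions.
```cpp
// Editorial sketch (not from the patent) of odd-even interleaved +/-1 border
// padding: the pad value alternates with the parity of (x + y + channel).
// The exact parity rule and the HWC layout are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int8_t> pad_parity(const std::vector<int8_t>& in,
                               int h, int w, int c, int pad) {
    const int H = h + 2 * pad, W = w + 2 * pad;
    std::vector<int8_t> out(static_cast<std::size_t>(H) * W * c);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            for (int ch = 0; ch < c; ++ch) {
                const int iy = y - pad, ix = x - pad;
                const bool border = iy < 0 || iy >= h || ix < 0 || ix >= w;
                out[(static_cast<std::size_t>(y) * W + x) * c + ch] = border
                    ? (((x + y + ch) & 1) ? int8_t(+1) : int8_t(-1))  // interleaved +/-1
                    : in[(static_cast<std::size_t>(iy) * w + ix) * c + ch];
            }
    return out;
}
```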
Fig. 5 is a schematic diagram of the double-buffering parallel structure of an embodiment of the present invention. The cache module sets a corresponding first buffer and second buffer for each convolution computation array; the first buffer stores the operation result of one convolution acceleration array and the second buffer stores the operation result of the corresponding convolution acceleration array, and both buffers are filled according to the convolution computation logic rules. Taking the pooling layer as an example, two buffers are allocated for it at the start of the computation; the buffer width is the width W of the pooling layer's input feature map, and the height is the kernel size k. The two buffers alternately receive the results computed by the previous layer. When the first k-1 rows of Buffer1 are full and the k-th element of row k arrives, the sliding window can start producing results; from then on, one result is produced for every new element received and is written into the output feature map. When Buffer1 is full, Buffer2 starts receiving data. At this point the sliding window only takes effect once the k-th element of the first row of Buffer2 has arrived; this sliding window contains the data of k-1 rows in Buffer1 and of one row in Buffer2. From then on, as before, each slide of the window depends on the arrival of the next new element and produces one result. When the coverage of the sliding window no longer includes any data in Buffer1, i.e. Buffer1 is emptied, Buffer1 starts receiving new data once Buffer2 is full, and the process cycles on in this way.
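The alternating filling of the two k-row buffers described above can be sketched as follows for the stride-1 case; this is an editorial simplification (a ring of 2k rows, one k x k window per incoming element once the buffers are warmed up), and the class and method names are hypothetical.
```cpp
// Editorial sketch (not from the patent) of the two alternating k-row line
// buffers: a ring of 2*K rows of width W receives the previous layer's output
// stream, and once the first K-1 rows plus K elements of row K have arrived,
// every further element yields one K x K window (stride-1 simplification).
#include <array>
#include <cstdint>

template <int K, int W>
struct PingPongLineBuffer {
    std::array<std::array<int8_t, W>, 2 * K> rows{};  // rows 0..K-1 = Buffer1, K..2K-1 = Buffer2
    long long count = 0;                              // elements received so far

    // Push one streamed element; returns true when `window` holds a valid K x K patch.
    bool push(int8_t v, std::array<int8_t, K * K>& window) {
        const int r = static_cast<int>((count / W) % (2 * K));
        const int c = static_cast<int>(count % W);
        rows[r][c] = v;
        ++count;
        if (count < static_cast<long long>(K - 1) * W + K) return false;  // still warming up
        if (c + 1 < K) return false;                   // window not yet fully inside the row
        for (int i = 0; i < K; ++i) {                  // gather the K most recent rows
            const int rr = (r - (K - 1) + i + 2 * K) % (2 * K);
            for (int j = 0; j < K; ++j) window[i * K + j] = rows[rr][c + 1 - K + j];
        }
        return true;
    }
};
```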
As an example, the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and that vector is reduced to produce the data of the output feature map. As a preferred embodiment, when different sliding windows contain an identical column vector, the computed result of that column vector can be placed in a LUT for temporary storage and reused directly by the next sliding window. Fig. 6 is a schematic diagram of the pooling layer computation of an embodiment of the present invention. As shown in Fig. 6, take a 3*3 max pooling with stride = 2 as an example. The sliding window slides over the same row array; each column in the sliding window is regarded as a vector, so each sliding window contains 3 such column vectors. Since there is no data dependence between these 3 vectors, SIMD vectorization can be used to compute the maximum of each of the 3 column vectors simultaneously. After the 3 maxima are obtained, they form a vector, the maximum of that vector is computed, and the result is written into the output feature map as a new element. It is worth noting that after the 3 columns of a sliding window have been computed, the result of the rightmost column can be stored temporarily in a LUT and used as the result of the first column of the next sliding window, because adjacent sliding windows share one column of data. Thus, apart from the leftmost sliding window whose column parallelism is 3, every remaining sliding window only needs to compute 2 new columns.
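A compact C++ sketch of the 3x3, stride-2 max pooling with column-maximum reuse described above follows; it is an editorial illustration, and the row-major buffer layout and function name are assumptions.
```cpp
// Editorial sketch (not from the patent) of 3x3, stride-2 max pooling with
// column-maximum reuse: the rightmost column maximum of each window is cached
// and reused as the left column of the next window. Row-major layout assumed.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int8_t> maxpool_row(const std::vector<int8_t>& rows, int W) {
    // `rows` holds 3 rows of width W, row-major (3 * W elements).
    auto col_max = [&](int x) {
        return std::max({rows[x], rows[W + x], rows[2 * W + x]});
    };
    std::vector<int8_t> out;
    int8_t cached = col_max(0);                 // column shared with the next window
    for (int x0 = 0; x0 + 2 < W; x0 += 2) {     // stride 2
        const int8_t c0 = cached;               // reused column maximum
        const int8_t c1 = col_max(x0 + 1);
        const int8_t c2 = col_max(x0 + 2);
        out.push_back(std::max({c0, c1, c2}));
        cached = c2;                            // becomes the left column of the next window
    }
    return out;
}
```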
Fig. 7 is a schematic diagram of the convolution computation of an embodiment of the present invention. As shown in Fig. 7, as an example, each convolution computation layer is provided with a matrix-vector multiplication unit according to the convolution computation logic rules; a matrix-vector multiplication unit includes multiple compute engines, and each compute engine includes multiple parallel single-instruction multiple-data (SIMD) lanes. Further, the compute engines obtain the input feature map data of the image to be processed corresponding to the parallel SIMD lanes and the corresponding convolution kernel parameters located in on-chip memory; every compute engine receives the same control signal and the same image vector data and, during computation, performs multiply-accumulate operations between these data and its own filter corresponding to the convolution kernel parameters.
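The behaviour of the matrix-vector multiplication unit, in which the compute engines share the same input slice while each accumulates against its own filter over several SIMD elements per step, can be sketched in software as follows. This is an editorial illustration with hypothetical names; on the FPGA the engine and lane loops are unrolled in hardware.
```cpp
// Editorial sketch (not from the patent) of the matrix-vector multiplication
// unit: PE engines share the same SIMD-wide input slice per step, and each
// engine multiply-accumulates it against its own filter row.
#include <array>
#include <cstdint>
#include <vector>

template <int PE, int SIMD>
std::array<int32_t, PE> mvu(const std::vector<std::vector<int8_t>>& weights,  // PE filter rows, length n
                            const std::vector<int8_t>& input) {               // input vector, length n
    std::array<int32_t, PE> acc{};                        // one accumulator per engine
    const int n = static_cast<int>(input.size());
    for (int i = 0; i < n; i += SIMD)                     // stream the shared input slice
        for (int p = 0; p < PE; ++p)                      // every engine sees the same slice
            for (int s = 0; s < SIMD && i + s < n; ++s)
                acc[p] += weights[p][i + s] * input[i + s];
    return acc;
}
```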
As a further preference, the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
Fig. 8 is a schematic diagram of the dot-product computation of an embodiment of the present invention. As shown in Fig. 8, taking the computation data flow of one compute engine in the matrix-vector multiplication unit as an example, the engine mainly computes the dot product between the input vector and one row of the parameter matrix, compares the result with a threshold, and finally outputs a 1-bit value. The dot product is essentially a multiply-accumulate operation between two vectors, and here, as in binary neural networks, it is implemented with XNOR (exclusive-NOR) gates. The first step is to XOR the elements at corresponding positions within the sliding window and store the results in an array; the second step is to count the number of 1s in the array with popcount; the third step is to obtain the final convolution result according to the formula result = -(popcount - (N - popcount)). Finally, the result is compared with the threshold and the final value is output. The same compute engine structure also supports non-binarized computation: it only requires the dot-product gates in the dotted portion of the figure to be replaced with conventional parallel multipliers.
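The XOR/popcount dot product described above follows directly from the ±1 encoding: popcount of the XOR word counts the mismatching positions, and result = -(popcount - (N - popcount)) = N - 2*popcount. A small C++20 sketch is given below as an editorial illustration; the word width and names are assumptions.
```cpp
// Editorial sketch (not from the patent, requires C++20 <bit>): bits encode
// {+1, -1} as {0, 1}, so XOR marks mismatching positions; with pc = popcount
// of the XOR word, result = -(pc - (N - pc)) = N - 2*pc.
#include <bit>
#include <cstdint>

// N = number of valid bit positions in `a` and `b` (N <= 64 in this sketch).
int binary_dot(uint64_t a, uint64_t b, int N) {
    const uint64_t mask = (N == 64) ? ~0ull : ((1ull << N) - 1);
    const int pc = std::popcount((a ^ b) & mask);  // number of -1 contributions
    return -(pc - (N - pc));                       // equivalently N - 2 * pc
}

// 1-bit output as in the compute engine: compare against a per-row threshold.
inline bool binary_dot_thresholded(uint64_t a, uint64_t b, int N, int thr) {
    return binary_dot(a, b, N) >= thr;
}
```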
As a further improvement of the present invention, the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over and covers the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
Fig. 9 is a schematic diagram of the interleaved reordering of the convolution matrices of an embodiment of the present invention. As shown in Fig. 9, according to the convolution computation logic rules, i.e. an interleaved matrix reordering based on the channel dimension, the convolution computation can be converted into an ordinary matrix multiplication: the convolution kernel parameters required for the convolution computation are packed into a parameter matrix, the sliding window translates over and covers the input feature map, the covered elements of the input feature map are packed into an image matrix, and finally these matrices are multiplied to produce the output. Since a dot product includes all pixel values inside one sliding window, and addition is commutative, the interleaved reordering can use any order; here the pixel values at the same position in different channels are placed together, and other orderings could also be used without changing the final result. It should be noted that the conversion of the filter matrix requires no overhead because it is done before the program runs, whereas the image matrix is converted at run time. Fig. 10 is a schematic diagram of the storage of the convolution matrices of an embodiment of the present invention. As shown in Fig. 10, the input image is simply stored in the buffer in a certain order; an address generator then fetches the memory positions corresponding to each sliding window, and the data transmitted from the previous layer generate the image matrix according to the same ordering rule as the filter matrix.
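The channel-interleaved packing of the sliding windows into an image matrix described above can be sketched as follows; this is an editorial illustration assuming an HWC feature-map layout, and the same per-pixel channel grouping would have to be applied offline to the filter matrix.
```cpp
// Editorial sketch (not from the patent) of channel-interleaved packing:
// each k x k window of an H x W x C feature map (HWC layout assumed) becomes
// one row of the image matrix, with the C channel values of every spatial
// position kept adjacent.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int8_t> im2col_interleaved(const std::vector<int8_t>& fmap,  // H*W*C elements
                                       int H, int W, int C, int k, int stride) {
    const int outH = (H - k) / stride + 1, outW = (W - k) / stride + 1;
    std::vector<int8_t> mat;
    mat.reserve(static_cast<std::size_t>(outH) * outW * k * k * C);
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox)          // one matrix row per sliding-window position
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    for (int c = 0; c < C; ++c)    // channels of one pixel stay together
                        mat.push_back(
                            fmap[(static_cast<std::size_t>(oy * stride + ky) * W + (ox * stride + kx)) * C + c]);
    return mat;                                    // (outH*outW) rows of length k*k*C, row-major
}
```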
Fig. 11 is a schematic diagram of the folded matrix-vector multiplication of an embodiment of the present invention. As shown in Fig. 11, since almost all computations in a binary neural network can be expressed as matrix-vector multiplications, this operation largely determines the throughput of the system and also directly affects its resource utilization and power consumption. Let the number of compute engines of a layer be a, the number of SIMD lanes in each compute engine be b, and the parameter matrix size be m*n. The total folding factor is then (m/a)*(n/b), and the number of cycles needed to complete one matrix-vector multiplication is also (m/a)*(n/b). Because the acceleration structure of the binary neural network is a streaming structure, the overall computation throughput is determined by the slowest layer; therefore, different numbers of compute engines and SIMD lanes must be configured for each convolutional layer and each fully connected layer so that the number of cycles required by every layer is roughly equal, making the forward computation of the whole network fastest. The folded structure of the matrix-vector multiplication is shown in Fig. 11; the folded implementation makes full use of the computation space, so folding according to the computational load achieves better inference performance.
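The folding rule above can be checked with a small worked example; the layer sizes below are made up for illustration and are not taken from the patent.
```cpp
// Editorial worked example (made-up layer sizes, not from the patent): an
// m x n parameter matrix folded over a engines and b SIMD lanes per engine
// needs (m/a)*(n/b) cycles per matrix-vector product; layers are balanced by
// choosing (a, b) so the cycle counts come out roughly equal.
#include <cstdio>

long fold_cycles(long m, long n, long a, long b) {
    return (m / a) * (n / b);   // assumes a divides m and b divides n
}

int main() {
    std::printf("layer1: %ld cycles\n", fold_cycles(128, 1152, 16, 32));  // 8 * 36 = 288
    std::printf("layer2: %ld cycles\n", fold_cycles(256, 2304, 32, 64));  // 8 * 36 = 288
    return 0;
}
```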
Fig. 12 is a schematic diagram of the processing flow of the acceleration system of an embodiment of the present invention. As shown in Fig. 12, the processing flow includes three stages. The first stage is the binary neural network initialization and picture preprocessing stage, which includes importing the bitstream file, initializing the network structure, interleaved reordering of the weight parameters, on-chip memory allocation, and resizing the picture (the picture size is adjusted to 32*32*3). The second stage is the FPGA acceleration stage, which produces a one-dimensional feature vector. The third stage is the return stage, which includes the classification of the feature vector on the ARM processor.
The network improved on the basis of VGG16 was accelerated on a Xilinx PYNQ-Z1 development board through the Vivado HLS high-level synthesis tool, breaking the traditional implementation pattern of convolutional neural networks on FPGAs. The overall hardware structure designed uses a pipelined computing architecture based on the FPGA's on-chip memory, which both reduces the communication cost between on-chip and off-chip memory and greatly improves the overall parallelism. Meanwhile, the convolutional layers, pooling layers, batch normalization layers and fully connected layers of the binary neural network were all optimized correspondingly. To fully exploit the parallel potential, a matrix-vector multiplication unit was designed to support the convolutional layer computation of the network. By configuring different numbers of PEs and SIMD lanes for each layer of the network, each layer can reach its local optimum, and the globally optimal performance is finally obtained. The optimization yields higher data throughput, faster processing speed and lower power consumption. Table 1 is a schematic table of the fully binarized network structure of the embodiment of the present invention. As shown in Table 1, with the final acceleration scheme, forward inference was performed on the fully binarized network structure, achieving a processing speed of 844 FPS and a data throughput of 3.8 TOPS. The overall power consumption of the accelerator is only 2.3 W, and the model accuracy is 83.6%.
Table 1: Schematic table of the fully binarized network structure of the embodiment of the present invention
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall all be included within the protection scope of the present invention.

Claims (10)

1. An FPGA-based binary neural network acceleration system, comprising a convolution kernel parameter acquisition module, a binarized convolutional neural network structure and a cache module formed on an FPGA, the cache module being the FPGA's on-chip memory, characterized in that:
the convolution kernel parameter acquisition module is used to obtain the input feature map of the image to be processed and to perform binarization training with a convolutional neural network model on an existing data set, obtaining convolution computation logic rules and multiple convolution kernel parameters, the convolution computation logic rules including the convolution computations of multiple threads;
the cache module is used to fetch the convolution computation logic rules and the multiple convolution kernel parameters and to store the convolution kernel parameters in the FPGA's on-chip memory according to the convolution computation logic rules; the cache module is also used to cache the results of the basic convolution computation modules and the image data to be processed;
the binarized convolutional neural network structure is used to fetch the convolution computation logic rules and generate multiple basic convolution computation modules; the basic convolution computation modules establish the corresponding connection relationships according to the convolution computation logic rules, the convolution computation of one thread corresponds to multiple basic convolution computation modules, and the convolution kernel parameters correspond one-to-one with the basic convolution computation modules;
the basic convolution computation module is used to read, according to the convolution computation logic rules, the result of the previous basic convolution computation module of the current thread from the cache module, the input feature map of the image to be processed within the current sliding window, and the corresponding convolution kernel data in the FPGA's on-chip memory, and to perform the preset convolution computation sequence to obtain the result of the current basic convolution computation module, which is stored in the corresponding buffer; the preset convolution computation sequence is convolution, PReLU activation, batch normalization and binary activation in turn, or convolution, PReLU activation, pooling, batch normalization and binary activation in turn;
the FPGA traverses the convolution computations of the multiple threads according to the convolution computation logic rules and obtains the output feature map data of the image to be processed, thereby increasing the detection speed for images to be detected.
2. The FPGA-based binary neural network acceleration system according to claim 1, characterized in that the FPGA has its corresponding control registers configured by the ARM side and loads the image from the external DDR3 memory into the buffer of the on-chip memory through the AXI bus; the FPGA allocates multiple processing engines to the basic convolution computation modules, the processing engines including arithmetic units, logic units, bit-operation units and storage resources.
3. The FPGA-based binary neural network acceleration system according to claim 2, characterized in that the convolution computation layer is divided, according to the preset convolution computation sequence, into a convolution layer, a PReLU activation layer, a pooling layer, a batch normalization layer and a binary activation layer, respectively used for the convolution, PReLU activation, pooling, batch normalization and binary activation computations; the multiple compute engines located in the same convolution computation layer form one convolution acceleration array, and one convolution acceleration array is implemented with one PE module of the FPGA.
4. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the cache module sets a corresponding first buffer and second buffer for each convolution computation array, the first buffer being used to store the operation result of one convolution acceleration array and the second buffer being used to store the operation result of the corresponding convolution acceleration array.
5. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the pooling layer computation is realized as follows: the column vectors covered by the pooling layer's sliding window are processed with SIMD vectorization, the respective maxima of all the column vectors form a new vector, and the new vector is used to produce the data of the output feature map.
6. The FPGA-based binary neural network acceleration system according to claim 5, characterized in that, when different sliding windows of the pooling layer contain an identical column vector, the computed result of that column vector is placed in a LUT for temporary storage, and the next sliding window directly uses the value stored in the LUT when it computes that column vector.
7. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA provides each convolution computation layer with a matrix-vector multiplication unit according to the convolution computation logic rules, the matrix-vector multiplication unit including multiple compute engines and each compute engine including multiple parallel single-instruction multiple-data (SIMD) lanes; the compute engines are used to obtain the input feature map of the image to be processed corresponding to the parallel SIMD lanes and to perform multiply-accumulate operations with the different filters corresponding to the convolution kernel parameters.
8. The FPGA-based binary neural network acceleration system according to claim 7, characterized in that the process by which the system performs a dot-product computation is: XOR the elements at corresponding positions within the sliding window and store the XOR results in an array; count the number of 1s in the array with popcount; and obtain the final convolution result according to the formula result = -(popcount - (N - popcount)).
9. The FPGA-based binary neural network acceleration system according to claim 3, characterized in that the FPGA packs the convolution kernel parameters required for the convolution computation into a parameter matrix after ordering them according to the convolution computation logic rules; the sliding output window translates over the input feature map of the image to be processed according to the convolution computation logic rules to obtain an image matrix, and the parameter matrix is multiplied by the image matrix to obtain the convolution result.
10. The FPGA-based binary neural network acceleration system according to any one of claims 1 to 9, characterized in that the basic convolution computation module merges the PReLU activation, batch normalization and binary activation into one simple binary function by means of a common affine function.
CN201910636517.2A 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system Active CN110458279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636517.2A CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Publications (2)

Publication Number Publication Date
CN110458279A true CN110458279A (en) 2019-11-15
CN110458279B CN110458279B (en) 2022-05-20

Family

ID=68481247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636517.2A Active CN110458279B (en) 2019-07-15 2019-07-15 FPGA-based binary neural network acceleration method and system

Country Status (1)

Country Link
CN (1) CN110458279B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
US20180046913A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Combining cpu and special accelerator for implementing an artificial neural network
CN109086867A (en) * 2018-07-02 2018-12-25 武汉魅瞳科技有限公司 A kind of convolutional neural networks acceleration system based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUNZHI DUAN ET AL.: "Energy-Efficient Architecture for FPGA-based Deep Convolutional Neural Networks with Binary Weights", 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP) *
QIU YUE ET AL. (仇越等): "Design and Implementation of an FPGA-based Convolutional Neural Network Accelerator" (in Chinese), Microelectronics & Computer (微电子学与计算机) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126309A (en) * 2019-12-26 2020-05-08 长沙海格北斗信息技术有限公司 Convolutional neural network architecture method based on FPGA and face recognition method thereof
CN111160534A (en) * 2019-12-31 2020-05-15 中山大学 Binary neural network forward propagation frame suitable for mobile terminal
CN111275167A (en) * 2020-01-16 2020-06-12 北京中科研究院 High-energy-efficiency pulse array framework for binary convolutional neural network
CN111401543A (en) * 2020-06-08 2020-07-10 深圳市九天睿芯科技有限公司 Neural network accelerator with full on-chip storage and implementation method thereof
WO2022013722A1 (en) * 2020-07-14 2022-01-20 United Microelectronics Centre (Hong Kong) Limited Processor, logic chip and method for binarized convolution neural network
CN111931925A (en) * 2020-08-10 2020-11-13 西安电子科技大学 FPGA-based binary neural network acceleration system
CN111931925B (en) * 2020-08-10 2024-02-09 西安电子科技大学 Acceleration system of binary neural network based on FPGA
WO2022057054A1 (en) * 2020-09-18 2022-03-24 深圳先进技术研究院 Convolution operation optimization method and system, terminal, and storage medium
CN112418417A (en) * 2020-09-24 2021-02-26 北京计算机技术及应用研究所 Convolution neural network acceleration device and method based on SIMD technology
CN112418417B (en) * 2020-09-24 2024-02-27 北京计算机技术及应用研究所 Convolutional neural network acceleration device and method based on SIMD technology
CN115550607A (en) * 2020-09-27 2022-12-30 北京天玛智控科技股份有限公司 Model reasoning accelerator realized based on FPGA and intelligent visual perception terminal
CN112199896A (en) * 2020-10-26 2021-01-08 云中芯半导体技术(苏州)有限公司 Chip logic comprehensive optimization acceleration method based on machine learning
CN112487448B (en) * 2020-11-27 2024-05-03 珠海零边界集成电路有限公司 Encryption information processing device, method and computer equipment
CN112487448A (en) * 2020-11-27 2021-03-12 珠海零边界集成电路有限公司 Encrypted information processing device and method and computer equipment
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113301221A (en) * 2021-03-19 2021-08-24 西安电子科技大学 Image processing method, system and application of depth network camera
CN113298236A (en) * 2021-06-18 2021-08-24 中国科学院计算技术研究所 Low-precision neural network computing device based on data stream structure and acceleration method
CN113469350A (en) * 2021-07-07 2021-10-01 武汉魅瞳科技有限公司 Deep convolutional neural network acceleration method and system suitable for NPU
CN113949592A (en) * 2021-12-22 2022-01-18 湖南大学 Anti-attack defense system and method based on FPGA
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114202071A (en) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114897159A (en) * 2022-05-18 2022-08-12 电子科技大学 Method for rapidly deducing incident angle of electromagnetic signal based on neural network
CN115083462B (en) * 2022-07-14 2022-11-11 中科南京智能技术研究院 Digital in-memory computing device based on Sram
CN115083462A (en) * 2022-07-14 2022-09-20 中科南京智能技术研究院 Novel digital in-memory computing device based on Sram
CN117114055A (en) * 2023-10-24 2023-11-24 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene
CN117114055B (en) * 2023-10-24 2024-04-09 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene

Also Published As

Publication number Publication date
CN110458279B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN110458279A (en) A kind of binary neural network accelerated method and system based on FPGA
CN105681628B (en) A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN108256628B (en) Convolutional neural network hardware accelerator based on multicast network-on-chip and working method thereof
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN109784489A (en) Convolutional neural networks IP kernel based on FPGA
CN109032781A (en) A kind of FPGA parallel system of convolutional neural networks algorithm
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108564168A (en) A kind of design method to supporting more precision convolutional neural networks processors
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN105892989A (en) Neural network accelerator and operational method thereof
CN109447241A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN110163353A (en) A kind of computing device and method
CN109934336A (en) Neural network dynamic based on optimum structure search accelerates platform designing method and neural network dynamic to accelerate platform
Alawad et al. Stochastic-based deep convolutional networks with reconfigurable logic fabric
CN111738433A (en) Reconfigurable convolution hardware accelerator
CN113792621B (en) FPGA-based target detection accelerator design method
CN110276447A (en) A kind of computing device and method
CN110163350A (en) A kind of computing device and method
Fujii et al. A threshold neuron pruning for a binarized deep neural network on an FPGA
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
Wang et al. High-performance mixed-low-precision cnn inference accelerator on fpga
Jiang et al. Hardware implementation of depthwise separable convolution neural network
CN113033795B (en) Pulse convolution neural network hardware accelerator of binary pulse diagram based on time step

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant