CN109086867A - An FPGA-based convolutional neural network acceleration system - Google Patents

An FPGA-based convolutional neural network acceleration system

Info

Publication number
CN109086867A
CN109086867A (application CN201810710069.1A)
Authority
CN
China
Prior art keywords
module
convolutional neural networks
submodule
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810710069.1A
Other languages
Chinese (zh)
Other versions
CN109086867B (en)
Inventor
李开
邹复好
孙浩
李全
祁迪
贺坤坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Charm Pupil Technology Co Ltd
Original Assignee
Wuhan Charm Pupil Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Charm Pupil Technology Co Ltd filed Critical Wuhan Charm Pupil Technology Co Ltd
Priority to CN201810710069.1A priority Critical patent/CN109086867B/en
Publication of CN109086867A publication Critical patent/CN109086867A/en
Application granted granted Critical
Publication of CN109086867B publication Critical patent/CN109086867B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an FPGA-based convolutional neural network acceleration system that accelerates convolutional neural networks on an FPGA under the OpenCL programming framework. The system includes a data preprocessing module, a data post-processing module, a convolutional neural network computing module, a data storage module, and a network model configuration module; the convolutional neural network computing module in turn includes a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully connected computation submodule. The computational parallelism of the acceleration system can be configured according to the hardware resources of the FPGA in use, so that the system adapts to different FPGAs and different convolutional neural networks and runs them on the FPGA in an efficient, parallel, pipelined manner. The system effectively reduces power consumption, greatly improves the processing speed of convolutional neural networks, and meets real-time requirements.

Description

An FPGA-based convolutional neural network acceleration system
Technical field
The invention belongs to the technical field of neural network computing, and in particular relates to an FPGA-based convolutional neural network acceleration system.
Background technique
With the continuing maturation of deep learning technology, convolutional neural networks are widely used in fields such as computer vision, speech recognition, and natural language processing, and have achieved good results in practical application scenarios such as face detection and speech recognition. In recent years, ever-larger training datasets and continually innovated network structures have significantly improved the accuracy and performance of convolutional neural networks. However, as network structures become increasingly complex, practical applications demand ever higher real-time performance at ever lower cost, placing correspondingly higher requirements on the computing capability and energy consumption of the hardware that runs the networks.
FPGAs offer abundant computing resources, high flexibility, and high energy efficiency. Compared with conventional digital circuit systems, they are programmable, highly integrated, fast, and highly reliable, and there have been continuing attempts to use them to accelerate neural networks. OpenCL is a heterogeneous computing language based on traditional C. It runs on accelerated processors such as CPUs, GPUs, FPGAs, and DSPs, and provides a high level of language abstraction, so that programmers can develop high-performance applications without understanding hardware circuits and low-level details, greatly reducing the complexity of programming.
In November 2012, Altera formally released a software development kit (SDK) for OpenCL development on FPGAs that combines the powerful parallel architecture of FPGAs with the OpenCL parallel programming model. With this SDK, programmers familiar with C can quickly adapt to the OpenCL high-level language environment and use it to develop high-performance, low-power FPGA applications efficiently. Using the Altera OpenCL SDK, the computation of convolutional neural networks can be accelerated on an FPGA, with the FPGA acting as an external accelerator that works cooperatively with the host.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides an FPGA-based convolutional neural network acceleration system. Its purpose is to restructure the computation of existing convolutional neural networks so as to fully exploit the parallelism within each computation layer and the pipelining between layers, thereby improving the processing speed of convolutional neural networks.
To achieve the above object, according to one aspect of the present invention, an FPGA-based convolutional neural network acceleration system is provided, including a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module. The data preprocessing module, convolutional neural network computing module, and data post-processing module are implemented on the FPGA; the data storage module is implemented in the FPGA's off-chip memory; and the network model configuration module is implemented in the FPGA's on-chip memory.
The data preprocessing module is used to read the convolution kernel parameters and input feature maps corresponding to the current calculation stage from the data storage module and to preprocess them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature maps are unrolled and replicated using sliding windows, so that the local feature maps within the sliding windows correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be computed on directly. After preprocessing, the prepared convolution kernel parameters and input feature maps are sent to the convolutional neural network computing module.
The network model configuration module is used to perform parameter configuration for the convolutional neural network computing module. The computing module implements the convolutional layers, activation function layers, pooling layers, and fully connected layers of the convolutional neural network as independent units, so that a variety of network structures can be constructed through parameter configuration. According to the configuration parameters, it performs convolution, activation, pooling, and fully connected computation on the convolution kernel parameters and input feature maps received from the data preprocessing module, pipelined between layers and parallel within each layer, and sends the processing results to the data post-processing module.
The data post-processing module is used to write the output data of the convolutional neural network computing module into the data storage module.
The data storage module is used to store the model parameters (caffemodel) of the convolutional neural network, intermediate feature map results, and final results; it exchanges data with the external host through a PCIe interface.
Preferably, in the above acceleration system, the convolutional neural network computing module includes a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully connected computation submodule; these submodules are connected according to the network model configuration parameters predefined by the network model configuration module.
After the convolutional neural network computing module receives the convolution kernel parameters and feature maps sent by the data preprocessing module, the submodules organized according to the configuration parameters begin processing; when processing is complete, the results are sent to the data post-processing module.
Specifically, the convolution computation submodule performs convolution on the input convolution kernel parameters and feature maps and sends the result to the activation function computation submodule.
The activation function computation submodule selects an activation function according to the activation function configuration parameter predefined by the network model configuration module, applies the selected activation function to the feature map, and, according to the configuration, sends the result either to the pooling computation submodule or to the fully connected computation submodule.
The pooling computation submodule is used to perform pooling on the received feature maps and, according to the configuration parameters predefined by the network model configuration module, sends the pooling result either to the fully connected computation submodule or directly to the data post-processing module.
The fully connected computation submodule is used to perform fully connected computation on the received feature maps and sends the result to the data post-processing module.
Preferably, in the above acceleration system, the data preprocessing module includes a data transfer submodule, a convolution kernel parameter preprocessing submodule, and a feature map preprocessing submodule.
The data transfer submodule controls the transfer of feature maps and convolution kernel parameters between the data storage module and the convolutional neural network computing module; the convolution kernel parameter preprocessing submodule rearranges and orders the convolution kernel parameters; the feature map preprocessing submodule unrolls, replicates, and arranges the feature maps.
Preferably, in the above acceleration system, the data storage module includes a convolution kernel parameter storage submodule, which stores the convolution kernel parameters, and a feature map storage submodule, which stores the input feature maps and the temporary feature maps produced during computation. These storage submodules are preferably partitioned from DDR memory connected to the FPGA; in the OpenCL programming framework, the data storage module is used as global memory.
Preferably, in the above acceleration system, the data transfer submodule includes a DDR controller, a data transfer bus, and a memory buffer.
The DDR controller controls data transfers between the DDR and the FPGA; the data transfer bus connects the DDR and the FPGA and is the channel for data transfer; the memory buffer temporarily stores data, reducing FPGA accesses to the DDR and improving data transfer speed.
Preferably, in the above acceleration system, the convolution computation submodule includes one or more matrix multiplication computation submodules; their number is set by the configuration parameters predefined by the network model configuration module, and the matrix multiplication computation submodules execute in parallel.
Each matrix multiplication computation submodule uses the Winograd minimal filtering algorithm to accelerate the computation of the matrix multiplication between a single convolution kernel and the corresponding local feature map.
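As a concrete illustration of the minimal filtering idea, the 1-D case F(2, 3) computes two outputs of a 3-tap filter with 4 multiplications instead of 6, using fixed transform matrices. The sketch below uses the standard Winograd transforms and is illustrative only; the patent applies the technique to the 2-D matrix multiplications of the convolution submodule rather than to this 1-D case.

```python
import numpy as np

# Transform matrices for the 1-D Winograd minimal filtering algorithm F(2, 3):
# two outputs of a 3-tap filter from four inputs, with 4 multiplies instead of 6.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of a valid 1-D correlation of the 4-element input d
    with the 3-tap filter g, via Winograd F(2, 3)."""
    U = G @ g            # transform the filter (4 values)
    V = BT @ d           # transform the input  (4 values)
    return AT @ (U * V)  # 4 elementwise multiplies, then inverse transform
```

For example, `winograd_f23([1, 2, 3, 4], [1, 1, 1])` produces the same two sums as a direct sliding 3-tap filter, but the elementwise product `U * V` is the only multiplication stage, which is what makes the algorithm attractive for DSP-limited FPGA fabrics.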
Preferably, in the above acceleration system, the activation function computation submodule includes an activation function selection submodule, a Sigmoid function computation submodule, a Tanh function computation submodule, and a ReLU function computation submodule.
The activation function selection submodule is connected to the Sigmoid, Tanh, and ReLU function computation submodules respectively and sends the feature map data to one of these three computation submodules.
The activation function selection submodule is used to set the activation computation applied to feature maps in the convolutional neural network.
The Sigmoid function computation submodule computes the Sigmoid function; the Tanh function computation submodule computes the Tanh function; the ReLU function computation submodule computes the ReLU function.
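The selector-plus-three-units arrangement amounts to a dispatch on a configuration parameter. A minimal software sketch, with illustrative names (in hardware the selector routes the feature map to one of three physical computation units rather than calling a function):

```python
import numpy as np

# The three activation units of the submodule, keyed by a configuration value.
# The key strings and function names here are illustrative, not from the patent.
ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh":    np.tanh,
    "relu":    lambda x: np.maximum(x, 0.0),
}

def activate(feature_map, kind):
    """Route the feature map to the activation selected by `kind`,
    mirroring the selector in front of the three computation submodules."""
    return ACTIVATIONS[kind](feature_map)
```

Because the choice is a static configuration parameter, the hardware selector can be fixed at configuration time, so no per-element branching is needed in the datapath.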
Preferably, in the above acceleration system, the pooling computation submodule includes a double buffer composed of two FPGA on-chip memories.
The double buffer stores temporary feature map data during pooling. The buffer size is set by the network configuration parameters predefined by the network model configuration module and differs between pooling layers. The double-buffer structure enables ping-pong read/write operation, realizing pipelined pooling computation.
Preferably, in the above acceleration system, the network model configuration module is implemented in FPGA on-chip memory and stores the network model configuration parameters, including the size of the network input feature maps, the size and number of convolution kernel parameters in the convolution computation submodule, the pooling window size in the pooling computation submodule, the parameter scale of the fully connected computation submodule, and the computational parallelism. The data in the network model configuration module are preferably written before the system starts.
Preferably, in the above acceleration system, the convolution computation submodule, activation function computation submodule, pooling computation submodule, and fully connected computation submodule of the convolutional neural network computing module are cascaded according to the network model configuration parameters; data are transferred between these submodules using OpenCL channels; computation within each submodule executes in parallel, and computation across submodules is pipelined.
The above FPGA-based convolutional neural network acceleration system provided by the invention combines the structural characteristics of convolutional neural network models with the features of FPGA chips and the OpenCL programming framework. The computation of existing convolutional neural networks is restructured and mapped onto corresponding modules, fully exploiting the parallelism within the computation and the pipelining between computation layers, so that the computation better matches the structural characteristics of the FPGA and makes rational and efficient use of its computing resources, improving the processing speed of convolutional neural networks. In general, compared with the prior art, the technical scheme contemplated by the present invention can achieve the following beneficial effects:
(1) The FPGA-based convolutional neural network acceleration system provided by the invention exploits the computational characteristics of each layer of the network to design a system architecture suited to pipelined, parallel computation. The data preprocessing module, convolutional neural network computing module, and data post-processing module form a pipeline: the preprocessing and post-processing modules control data transfer between the storage module and the computing module, and the convolution kernel parameters and feature maps pass through the three stages of the pipeline in turn, completing a streaming process of data reading, data computation, and data storage. The convolutional layers, activation function layers, pooling layers, and fully connected layers of the network are designed as separate computing units, so that a variety of network structures can be constructed by parameter configuration. The processing of the network is split across submodules into many small processing steps, and the data of the submodule corresponding to each layer pass through distinct stages of data reading, data processing, and data storage, forming a structure analogous to a computer instruction pipeline. Computation within a network layer can thus execute in parallel and computation between layers can be pipelined, effectively improving the processing speed of convolutional neural networks.
(2) The FPGA-based convolutional neural network acceleration system provided by the invention exploits the low data dependence between convolution kernel parameters and local feature maps in convolutional neural network computation. In the parallel computation structure of the convolution computation submodule, each calculation operates on a convolution kernel and the corresponding window of the input feature map; since the data of different convolution kernels are independent, multiple calculations can proceed in parallel. In the traditional convolution process, the input feature map data are obtained through a sliding window that slides over the feature map to pick out the values in each convolution window. The parallel computation structure of the invention removes the sliding window: the data inside the original sliding windows are laid out directly into multiple data blocks, and each calculation directly takes the corresponding data block as input. In this way multiple data blocks are computed with the convolution kernels simultaneously, further improving processing speed.
(3) With the FPGA-based convolutional neural network acceleration system provided by the invention, partial pooling can begin as soon as data enter the pooling computation submodule during network computation. Because multiple convolution kernels compute in parallel, partial results on a subset of channels are produced simultaneously; that is, part of the input to the pooling computation submodule is already available. Like the convolution computation submodule, the pooling computation submodule computes in units of sliding windows, so pooling can start as soon as all the data in a window have been obtained, without waiting for the convolution computation submodule to finish all of its calculations. The convolution computation submodule produces data on multiple channels simultaneously, and pooling has no dependence between channels, so the pooling computation on each channel can proceed in parallel, greatly improving the processing speed of the network.
(4) In the FPGA-based convolutional neural network acceleration system provided by the invention, the network model parameters are configurable: a configuration file sets the structure of the network model and the parallelism of the computation, so that different types of network models and FPGAs with different computing capabilities can run convolutional neural networks through parameter configuration.
(5) In the FPGA-based convolutional neural network acceleration system provided by the invention, the preferred embodiment uses the Winograd minimal filtering algorithm in the computation of the convolutional layers, which accelerates the convolution computation.
A ping-pong buffer is used in the computation of the pooling layers, which accelerates pooling and reduces memory usage.
Batch computation is used in the fully connected layers, reducing accesses to external storage during computation; segmented computation simplifies the high-dimensional matrix multiplication, improving processing speed and reducing the demands on the FPGA's arithmetic capability.
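The segmented (tiled) fully connected computation can be sketched as a batched matrix multiply accumulated over slices of the input dimension. The function name and tile size below are illustrative assumptions, not taken from the patent; tiling bounds the working set that must be resident at once, which is what reduces external memory traffic.

```python
import numpy as np

def fc_tiled(x_batch, weights, tile=64):
    """Fully connected layer as a tiled matrix multiply.

    The (batch, in_dim) activations are multiplied by the (in_dim, out_dim)
    weight matrix in `tile`-wide slices of the input dimension, accumulating
    partial sums, so only one weight slice need be resident at a time.
    """
    batch, in_dim = x_batch.shape
    out = np.zeros((batch, weights.shape[1]), dtype=x_batch.dtype)
    for start in range(0, in_dim, tile):
        stop = min(start + tile, in_dim)
        out += x_batch[:, start:stop] @ weights[start:stop, :]
    return out
```

Processing a whole batch against each weight slice amortizes the cost of fetching that slice across all batch elements, which is the stated benefit of batch computation here.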
Each computing module of the convolutional neural network is implemented as an OpenCL kernel program, which reduces development difficulty.
Description of the drawings
Fig. 1 is an architecture diagram of an embodiment of the FPGA-based convolutional neural network acceleration system provided by the invention;
Fig. 2 is a processing schematic of the data preprocessing module in the embodiment;
Fig. 3 is a processing schematic of the convolution computation submodule in the embodiment;
Fig. 4 is a processing schematic of the activation function computation submodule in the embodiment;
Fig. 5 is a processing schematic of the pooling computation submodule in the embodiment;
Fig. 6 is a processing schematic of the fully connected computation submodule in the embodiment;
Fig. 7 is a processing schematic of the data post-processing module in the embodiment;
Fig. 8 is a processing flowchart of the acceleration system in the embodiment.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Referring to Fig. 1, an embodiment of the FPGA-based convolutional neural network acceleration system provided by the invention includes a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module, and a network model configuration module.
The input of the data preprocessing module is connected to the data storage module; the input of the convolutional neural network computing module is connected to the output of the data preprocessing module; the input of the data post-processing module is connected to the output of the convolutional neural network computing module; and the input of the data storage module is connected to the output of the data post-processing module. The convolutional neural network computing module is also connected to the network model configuration module.
The data preprocessing module is used to read the convolution kernel parameters and input feature maps corresponding to the current calculation stage from the data storage module and to preprocess them: the 4-dimensional convolution kernel parameters are rearranged into 3 dimensions, and the input feature maps are spread out and replicated using sliding windows, so that the local feature maps within the sliding windows correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence that can be computed on directly. After preprocessing, the prepared convolution kernel parameters and input feature maps are sent to the convolutional neural network computing module.
The network model configuration module is used to perform parameter configuration for the convolutional neural network computing module. The convolutional neural network computing module processes the convolution kernel parameters and input feature maps received from the data preprocessing module according to the configuration parameters and sends the processing results to the data post-processing module.
The convolutional neural network computing module includes a convolution computation submodule, an activation function computation submodule, a pooling computation submodule, and a fully connected computation submodule; these submodules are connected according to the network model configuration parameters predefined by the network model configuration module.
The data post-processing module is used to write the output data of the convolutional neural network computing module into the data storage module.
The data storage module is used to store the model parameters (caffemodel) of the convolutional neural network, intermediate feature map results, and final results, and exchanges data with the external host through a PCIe interface.
Referring to Fig. 2, the data preprocessing module reads the convolution kernel parameters and input feature maps from the data storage module. When reading convolution kernels, it reads PARALLEL_KERNEL kernels of size k*k*C_i according to the parameter predefined in the model configuration module, where C_i denotes the number of channels of the input feature map. After the kernels are read in, they are serialized: the four-dimensional kernel tensor of size k*k*C_i*PARALLEL_KERNEL is rearranged into a three-dimensional form of size k*k*(C_i*PARALLEL_KERNEL).
When processing an input feature map, a feature map of size H*W*C_i is first read in and then spread out according to the size and stride of the sliding window on the feature map; the size of the expanded feature map is ((W-k)/stride+1)*((H-k)/stride+1)*C_i.
After the input feature map is expanded, a partial feature map of size PARALLEL_FEATURE_W*PARALLEL_FEATURE_H*C_i is extracted according to the configuration parameters, and the extracted feature map is replicated so that the number of copies equals the number of convolution kernels, finally yielding a feature map of size PARALLEL_FEATURE_W*PARALLEL_FEATURE_H*(C_i*PARALLEL_KERNEL), so that multiple convolution kernels and the feature map can be computed in parallel. After the convolution kernels and feature maps have been processed, the processed convolution kernel parameters and feature maps are sent to the convolutional neural network computing module for processing.
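The sliding-window expansion and replication described above can be sketched in software as an im2col-style transform. The sketch below is illustrative only: NumPy stands in for the FPGA datapath, the function names and channel-last layout are assumptions, and k, stride, and PARALLEL_KERNEL are taken as plain arguments.

```python
import numpy as np

def expand_feature_map(fmap, k, stride):
    """Unroll an H*W*Ci feature map into sliding-window patches (im2col-style).

    Returns an array of shape (num_windows, k*k*Ci): one flattened patch per
    sliding-window position, matching a flattened k*k*Ci kernel layout so
    each convolution output becomes a plain dot product.
    """
    H, W, Ci = fmap.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    patches = np.empty((out_h * out_w, k * k * Ci), dtype=fmap.dtype)
    idx = 0
    for y in range(0, H - k + 1, stride):
        for x in range(0, W - k + 1, stride):
            patches[idx] = fmap[y:y + k, x:x + k, :].ravel()
            idx += 1
    return patches

def replicate_for_kernels(patches, parallel_kernel):
    """Duplicate the patch matrix once per concurrently computed kernel, so
    PARALLEL_KERNEL convolutions can proceed on independent copies."""
    return np.stack([patches] * parallel_kernel, axis=0)
```

Each row of the patch matrix lines up one-to-one with a flattened kernel, which is exactly the correspondence the preprocessing module establishes, and the replicated copies remove data sharing between the parallel kernel computations.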
The processing flow of the convolution computation submodule of the convolutional neural network computing module is shown in Fig. 3. Its inputs are the convolution kernel parameters and feature maps produced by the data preprocessing module, together with the relevant configuration parameters predefined in the network model configuration module. The preprocessed convolution kernels and feature maps are three-dimensional matrices whose channel count is PARALLEL_KERNEL*C_i. The convolution kernel and feature map on each channel are fed into separate OpenCL compute units, which use the Winograd matrix multiplication module to perform two-dimensional matrix multiplication; the OpenCL compute units compute in parallel, and the result is a feature map of length PARALLEL_FEATURE_W/k, width PARALLEL_FEATURE_H/k, and channel count C_i*PARALLEL_KERNEL. After processing by the convolution computation submodule, the input feature map yields part of the output feature map of the convolutional layer, which is handled differently according to the type of the next layer. If the next layer predefined in the network model configuration is a convolutional layer or a fully connected layer, the output feature map skips the pooling layer and is written back to external storage by the data post-processing module to await processing; if the next predefined layer is a pooling layer, the feature map is sent to the pooling computation submodule for pooling.
Referring to Fig. 4, the activation function computation submodule in the embodiment comprises one activation function selection submodule and three function computation submodules. The selector in the activation function selection submodule is determined by the configuration parameter in the model configuration module, and the three function computation submodules correspond to the Sigmoid, Tanh and ReLU activation functions respectively. The input feature map is sent along the path determined by the activation function selection submodule to the corresponding function computation submodule for activation processing; after processing is complete, the result is sent to the data storage module or the pooling computation submodule according to the configuration parameters.
Referring to Fig. 5, the pooling computation submodule uses two ping-pong buffers of size pool_size*W to hold the calculation results from the activation function computation submodule, where pool_size and W are configuration parameters. The results of the convolution computation submodule are first filled into buffer1; during this filling process, partial pooling can already be performed on that buffer. Once buffer1 is full, the results of the convolution computation module are filled into buffer2; during the filling of buffer2, pooling is performed on the data in buffer2, while the data spanning buffer1 and buffer2 can also be pooled. When buffer2 is full, the results of the convolution computation module are filled into buffer1 again, and the two buffers alternate in this way until the entire pooling computation is complete. A pooling window also exists between the two buffers; the data of this window come from both buffers, so while one buffer is being used for computation and the other is being filled, the pooling window spanning the two buffers can be computed. Since there is no computational dependence between pooling windows, loop unrolling is used so that the computations in different windows proceed simultaneously.
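The alternating-buffer scheme can be sketched behaviorally in Python as follows. This is a simplification, not the OpenCL implementation: it assumes 2x2 max pooling, so that exactly two row buffers form one row of pooling windows, and it pools only once both buffers are filled rather than overlapping the fill and the computation as the hardware does.

```python
import numpy as np

def streamed_max_pool(rows, pool_size=2):
    """2x2 max-pool a stream of rows using two alternating (ping-pong) buffers."""
    buf = [None, None]          # the two row buffers
    out = []
    for i, row in enumerate(rows):
        buf[i % 2] = np.asarray(row, dtype=float)  # fill buffers alternately
        if i % 2 == 1:          # both buffers full: pool the windows they span
            pair = np.stack(buf)                       # shape (2, W)
            w = pair.shape[1] // pool_size
            window = pair[:, :w * pool_size].reshape(2, w, pool_size)
            out.append(window.max(axis=(0, 2)))        # max over each 2 x pool_size window
    return np.stack(out)
```

Because each pooling window touches a disjoint slice of the buffers, the per-window maxima in the inner step are independent, which is exactly what makes the loop-unrolling parallelization described above legal.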
Referring to Fig. 6, during processing the fully connected computation submodule horizontally divides the input matrix formed by the N input vectors into dim1/m segments, where N is the number of input feature vectors, dim1 is the dimension of the input feature vectors, and m is the segment length. Each segment forms a submatrix of size m*N; each such submatrix is multiplied by the corresponding part of the weight matrix to obtain a partial result consisting of a submatrix of size n*N, and the accumulation of the dim1/m partial results is the final calculation result formed by the N output vectors. When computing the matrix multiplication between a submatrix and the corresponding part of the weight matrix, the Winograd minimal filtering matrix multiplication is used for acceleration.
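The segmentation-and-accumulation scheme can be sketched as follows (a minimal NumPy sketch; the function name is illustrative and the per-segment product stands in for the Winograd-accelerated multiplication):

```python
import numpy as np

def blocked_fully_connected(x, w, m):
    """x: (dim1, N) matrix of N input vectors; w: (n, dim1) weight matrix.
    Accumulates the partial products of the dim1/m horizontal segments."""
    dim1, N = x.shape
    n = w.shape[0]
    acc = np.zeros((n, N))
    for s in range(0, dim1, m):                # one m-row segment at a time
        acc += w[:, s:s + m] @ x[s:s + m]      # (n, m) x (m, N) partial result
    return acc
```

Splitting along dim1 keeps each submatrix small enough for on-chip buffering while the running sum stays in place, which is why the dim1/m partial results can simply be added to form the final output vectors.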
Referring to Fig. 7, when the pooling computation submodule or the fully connected computation submodule in the convolutional neural network computing module has finished processing, the data post-processing module begins to write the data output by the pooling or fully connected computation submodule back into the data storage module. In this process, the barrier operation of the OpenCL framework is used to guarantee that the transfer starts only after all calculation results have been obtained, and that the next processing step starts only after all data have been transferred.
Referring to Fig. 8, the processing flow of the above acceleration system provided by the embodiment mainly comprises three parts. The first part is the kernel program compilation process. In order to make maximum use of the computing and storage resources on the FPGA, suitable network computation parallelism parameters need to be set. In the embodiment, the process of setting the parallelism parameters is completed automatically by a program: the initial values of PARALLEL_FEATURE and PARALLEL_KERNEL in the convolutional neural network kernel program are set first, and the kernel program is then compiled using the Altera OpenCL SDK. After compilation, the resource utilization, including storage resources, logic resources and computing resources, is obtained from the compilation report. If the resource utilization has not reached its maximum, the values of PARALLEL_FEATURE and PARALLEL_KERNEL are updated and the kernel is recompiled, until the maximum hardware resource utilization is obtained. When compilation is complete, a hardware program that can run on the FPGA is produced.
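The compile-report feedback loop above can be sketched as follows. Here estimate_utilization is a hypothetical stand-in for invoking the Altera OpenCL SDK and parsing the resource figures from its compilation report, and the doubling policy is an assumption: the patent only states that the parameters are updated and the kernel recompiled until utilization peaks.

```python
def tune_parallelism(estimate_utilization, start=1, limit=1024):
    """Grow a parallelism parameter (e.g. PARALLEL_KERNEL) while the
    estimated FPGA resource utilization stays at or below 100%."""
    p = start
    while p * 2 <= limit and estimate_utilization(p * 2) <= 1.0:
        p *= 2           # accept the larger configuration and try again
    return p
```

In practice each probe is a full offline FPGA compilation, so the number of candidate configurations explored is kept small.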
The second part is the parameter configuration process, which covers the network model calculation parameters and the model configuration parameters. The network model calculation parameters are read directly from the Caffe model file (caffemodel); the model configuration parameters include the input feature map size of each layer, the convolution kernel size, the pooling window size, and so on. The parameters are set using the clSetKernelArg() function of OpenCL. Table 1 below illustrates the types and values of the model configuration parameters, taking VGG16 as an example.
Table 1: Example types and values of the model configuration parameters
In the table above, in the Activate func column, 0 indicates no activation function, 1 indicates the ReLU activation function, 2 indicates the Sigmoid activation function, and 3 indicates the Tanh activation function; in the Output dst column, 1 indicates output to the data storage module, 2 indicates output to the pooling computation submodule, and 3 indicates output to the convolution computation submodule.
The third part is the neural network operation process. The system on the FPGA starts running once the host has transferred a picture into the data storage module; after the run completes, the calculation results are returned to the host through the data storage module, and the run ends when there is no further picture input.
The FPGA-based convolutional neural network acceleration system provided by the embodiment implements the VGG16 and AlexNet network models on a DE5a-Net development board, and its performance was tested with image data of size 224*224*3. The experimental results show that the processing speed of VGG16 is 160 ms/image and that of AlexNet is 12 ms/image, which is better than other FPGA implementations.
Those skilled in the art will readily appreciate that the foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An FPGA-based convolutional neural network acceleration system, characterized in that it comprises a data preprocessing module, a convolutional neural network computing module, a data post-processing module, a data storage module and a network model configuration module; the data preprocessing module, the convolutional neural network computing module and the data post-processing module are implemented on the FPGA, the data storage module is implemented in off-chip storage of the FPGA, and the network model configuration module is implemented in on-chip storage of the FPGA;
the data preprocessing module is configured to read the corresponding convolution kernel parameters and input feature maps from the data storage module according to the current calculation stage, and to preprocess the convolution kernel parameters and input feature maps: the 4-dimensional convolution kernel model parameters are reshaped into 3 dimensions, and the input feature map is expanded and replicated using a sliding window so that the local feature maps within the sliding windows correspond one-to-one with the convolution kernel parameters, yielding a convolution kernel parameter sequence and a local feature map sequence suitable for direct computation; after preprocessing is complete, the processed convolution kernel parameters and input feature maps are sent to the convolutional neural network computing module;
the network model configuration module is configured to supply configuration parameters to the convolutional neural network computing module; the convolutional neural network computing module arranges the convolutional layers, activation function layers, pooling layers and fully connected layers of the convolutional neural network independently, so that a variety of different network structures can be constructed through parameter configuration; according to the configuration parameters, it performs convolution, activation, pooling and fully connected computation on the convolution kernel parameters and input feature maps received from the data preprocessing module with inter-layer pipelining, and sends the processing results to the data post-processing module;
the data post-processing module is configured to write the output data of the convolutional neural network computing module into the data storage module;
the data storage module is configured to store the model parameters (caffemodel) of the convolutional neural network, intermediate feature map results and final calculation results; this module exchanges data with an external host through a PCIe interface.
2. The convolutional neural network acceleration system of claim 1, characterized in that the convolutional neural network computing module comprises a convolution computation submodule, an activation function computation submodule, a pooling computation submodule and a fully connected computation submodule; inside the convolutional neural network computing module, these submodules are connected according to the network model configuration parameters predefined by the network model configuration module;
the convolution computation submodule performs convolution using the input convolution kernel parameters and feature maps, and upon completion sends the results to the activation function computation submodule;
the activation function computation submodule selects an activation function according to the activation function configuration parameter predefined by the network model configuration module, applies the selected activation function to the feature map, and upon completion sends the results to the pooling computation submodule or the fully connected computation submodule according to the parameter configuration;
the pooling computation submodule is configured to perform pooling on the received feature map and, according to the configuration parameters predefined by the network model configuration module, to send the pooling results to the fully connected computation submodule or directly to the data post-processing module;
the fully connected computation submodule is configured to perform fully connected computation on the received feature map and to send the results to the data post-processing module.
3. The convolutional neural network acceleration system of claim 2, characterized in that the convolutional neural network computing module cascades the convolution computation submodule, the activation function computation submodule, the pooling computation submodule and the fully connected computation submodule according to the network model configuration parameters; data are transmitted between these submodules using OpenCL channels; computation within each submodule is executed in parallel, and computation across the submodules proceeds in a pipelined manner.
4. The convolutional neural network acceleration system of claim 2 or 3, characterized in that the convolution computation submodule comprises one or more matrix multiplication computation submodules; the number of matrix multiplication computation submodules is set by a configuration parameter predefined by the network model configuration module, and the matrix multiplication computation submodules execute in parallel;
each matrix multiplication computation submodule accelerates its computation using the Winograd minimal filtering algorithm, and is used to compute the matrix multiplication between a single convolution kernel and the corresponding local feature map.
5. The convolutional neural network acceleration system of claim 2 or 3, characterized in that the activation function computation submodule comprises an activation function selection submodule, a Sigmoid function computation submodule, a Tanh function computation submodule and a ReLU function computation submodule;
the activation function selection submodule is connected to the Sigmoid function computation submodule, the Tanh function computation submodule and the ReLU function computation submodule respectively, and sends the feature map data to one of these three computation submodules;
the activation function selection submodule is configured to set the activation computation mode of the feature maps in the convolutional neural network; the Sigmoid function computation submodule computes the Sigmoid function, the Tanh function computation submodule computes the Tanh function, and the ReLU function computation submodule computes the ReLU function.
6. The convolutional neural network acceleration system of any one of claims 1 to 5, characterized in that the pooling computation submodule comprises a double buffer composed of two FPGA on-chip memories for storing temporary feature map data during pooling; the buffer size is set by the network configuration parameters predefined by the network model configuration module, and the buffers of different pooling layers differ in size; ping-pong read and write operations are realized through this double-buffer structure, enabling pipelined pooling computation.
7. The convolutional neural network acceleration system of claim 1 or 2, characterized in that the data preprocessing module comprises a data transmission module, a convolution kernel parameter preprocessing submodule and a feature map preprocessing submodule;
the data transmission module controls the transfer of feature maps and convolution kernel parameters between the data storage module and the convolutional neural network computing module; the convolution kernel parameter preprocessing submodule rearranges and reorders the convolution kernel parameters; the feature map preprocessing submodule expands, replicates and rearranges the feature maps.
8. The convolutional neural network acceleration system of claim 7, characterized in that the data transmission module comprises a DDR controller, a data transmission bus and a memory buffer;
the DDR controller controls data transfers between the DDR memory and the FPGA; the data transmission bus connects the DDR memory and the FPGA and serves as the channel for data transfer; the memory buffer temporarily stores data, reducing the number of FPGA reads of the DDR memory and improving the data transfer speed.
9. The convolutional neural network acceleration system of claim 1 or 2, characterized in that the data storage module comprises a convolution kernel parameter storage submodule and a feature map storage submodule; the convolution kernel parameter storage submodule stores the convolution kernel parameters, and the feature map storage submodule stores the input feature maps and the temporary feature maps produced during computation; these storage submodules are preferably partitioned from the DDR memory connected to the FPGA.
10. The convolutional neural network acceleration system of claim 1 or 2, characterized in that the network model configuration module stores the network model configuration parameters, including the size of the network input feature maps, the size and number of the convolution kernel parameters in the convolution computation submodule, the pooling window size in the pooling computation submodule, the parameter scale of the fully connected computation submodule, and the computation parallelism; the data in the network model configuration module are preferably written before the system starts.
CN201810710069.1A 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA Expired - Fee Related CN109086867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810710069.1A CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810710069.1A CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Publications (2)

Publication Number Publication Date
CN109086867A true CN109086867A (en) 2018-12-25
CN109086867B CN109086867B (en) 2021-06-08

Family

ID=64836906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810710069.1A Expired - Fee Related CN109086867B (en) 2018-07-02 2018-07-02 Convolutional neural network acceleration system based on FPGA

Country Status (1)

Country Link
CN (1) CN109086867B (en)

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656721A (en) * 2018-12-28 2019-04-19 上海新储集成电路有限公司 A kind of efficient intelligence system
CN109685209A (en) * 2018-12-29 2019-04-26 福州瑞芯微电子股份有限公司 A kind of device and method for accelerating neural network computing speed
CN109767002A (en) * 2019-01-17 2019-05-17 济南浪潮高新科技投资发展有限公司 A kind of neural network accelerated method based on muti-piece FPGA collaboration processing
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN109799977A (en) * 2019-01-25 2019-05-24 西安电子科技大学 The method and system of instruction repertorie exploitation scheduling data
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm
CN109961139A (en) * 2019-01-08 2019-07-02 广东浪潮大数据研究有限公司 A kind of accelerated method, device, equipment and the storage medium of residual error network
CN109976903A (en) * 2019-02-22 2019-07-05 华中科技大学 A kind of deep learning Heterogeneous Computing method and system based on slice width Memory Allocation
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 A kind of integrated circuit based on convolutional neural networks algorithm accelerates the method and system of calculating
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 A kind of hardware-accelerated realization framework of the convolutional neural networks forward prediction based on FPGA
CN110334801A (en) * 2019-05-09 2019-10-15 苏州浪潮智能科技有限公司 A kind of hardware-accelerated method, apparatus, equipment and the system of convolutional neural networks
CN110390392A (en) * 2019-08-01 2019-10-29 上海安路信息科技有限公司 Deconvolution parameter accelerator, data read-write method based on FPGA
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN110443357A (en) * 2019-08-07 2019-11-12 上海燧原智能科技有限公司 Convolutional neural networks calculation optimization method, apparatus, computer equipment and medium
CN110458279A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of binary neural network accelerated method and system based on FPGA
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN110852930A (en) * 2019-10-25 2020-02-28 华中科技大学 FPGA graph processing acceleration method and system based on OpenCL
CN111079923A (en) * 2019-11-08 2020-04-28 中国科学院上海高等研究院 Spark convolution neural network system suitable for edge computing platform and circuit thereof
CN111105015A (en) * 2019-12-06 2020-05-05 浪潮(北京)电子信息产业有限公司 General CNN reasoning accelerator, control method thereof and readable storage medium
CN111160544A (en) * 2019-12-31 2020-05-15 上海安路信息科技有限公司 Data activation method and FPGA data activation system
CN111210019A (en) * 2020-01-16 2020-05-29 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111242289A (en) * 2020-01-19 2020-06-05 清华大学 Convolutional neural network acceleration system and method with expandable scale
CN111325327A (en) * 2020-03-06 2020-06-23 四川九洲电器集团有限责任公司 Universal convolution neural network operation architecture based on embedded platform and use method
CN111340198A (en) * 2020-03-26 2020-06-26 上海大学 Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN111626403A (en) * 2020-05-14 2020-09-04 北京航空航天大学 Convolutional neural network accelerator based on CPU-FPGA memory sharing
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
CN111860781A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system realized based on FPGA
CN111931913A (en) * 2020-08-10 2020-11-13 西安电子科技大学 Caffe-based deployment method of convolutional neural network on FPGA
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112732638A (en) * 2021-01-22 2021-04-30 上海交通大学 Heterogeneous acceleration system and method based on CTPN network
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112819140A (en) * 2021-02-02 2021-05-18 电子科技大学 OpenCL-based FPGA one-dimensional signal recognition neural network acceleration method
CN112905526A (en) * 2021-01-21 2021-06-04 北京理工大学 FPGA implementation method for various types of convolution
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113065647A (en) * 2021-03-30 2021-07-02 西安电子科技大学 Computing-storage communication system and communication method for accelerating neural network
CN113467783A (en) * 2021-07-19 2021-10-01 中科曙光国际信息产业有限公司 Kernel function compiling method and device of artificial intelligent accelerator
CN113517007A (en) * 2021-04-29 2021-10-19 西安交通大学 Flow processing method and system and memristor array
WO2021259105A1 (en) * 2020-06-22 2021-12-30 深圳鲲云信息科技有限公司 Neural network accelerator
CN113949592A (en) * 2021-12-22 2022-01-18 湖南大学 Anti-attack defense system and method based on FPGA
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
CN114943635A (en) * 2021-09-30 2022-08-26 太初(无锡)电子科技有限公司 Fusion operator design and implementation method based on heterogeneous collaborative computing core
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11580386B2 (en) 2019-03-18 2023-02-14 Electronics And Telecommunications Research Institute Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
EP4156079A4 (en) * 2020-05-22 2024-03-27 Inspur Electronic Information Industry Co., Ltd Image data storage method, image data processing method and system, and related apparatus
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341127A (en) * 2017-07-05 2017-11-10 西安电子科技大学 Convolutional neural networks accelerated method based on OpenCL standards
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
US20180046900A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN108154229A (en) * 2018-01-10 2018-06-12 西安电子科技大学 Accelerate the image processing method of convolutional neural networks frame based on FPGA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046900A1 (en) * 2016-08-11 2018-02-15 Nvidia Corporation Sparse convolutional neural network accelerator
CN107341127A (en) * 2017-07-05 2017-11-10 西安电子科技大学 Convolutional neural networks accelerated method based on OpenCL standards
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN108154229A (en) * 2018-01-10 2018-06-12 西安电子科技大学 Accelerate the image processing method of convolutional neural networks frame based on FPGA

Cited By (95)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US12020476B2 (en) 2017-03-23 2024-06-25 Tesla, Inc. Data synthesis for autonomous control systems
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US12086097B2 (en) 2017-07-24 2024-09-10 Tesla, Inc. Vector computational unit
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US12079723B2 (en) 2018-07-26 2024-09-03 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US11983630B2 (en) 2018-09-03 2024-05-14 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
CN109656721A (en) * 2018-12-28 2019-04-19 上海新储集成电路有限公司 A kind of efficient intelligence system
CN109685209A (en) * 2018-12-29 2019-04-26 福州瑞芯微电子股份有限公司 A kind of device and method for accelerating neural network computing speed
CN109685209B (en) * 2018-12-29 2020-11-06 瑞芯微电子股份有限公司 Device and method for accelerating operation speed of neural network
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A kind of convolutional neural networks accelerator circuit based on fast filtering algorithm
CN109961139A (en) * 2019-01-08 2019-07-02 广东浪潮大数据研究有限公司 A kind of accelerated method, device, equipment and the storage medium of residual error network
CN109784489A (en) * 2019-01-16 2019-05-21 北京大学软件与微电子学院 Convolutional neural networks IP kernel based on FPGA
CN109784489B (en) * 2019-01-16 2021-07-30 北京大学软件与微电子学院 Convolutional neural network IP core based on FPGA
CN109767002A (en) * 2019-01-17 2019-05-17 济南浪潮高新科技投资发展有限公司 A kind of neural network accelerated method based on muti-piece FPGA collaboration processing
CN109799977B (en) * 2019-01-25 2021-07-27 西安电子科技大学 Method and system for developing and scheduling data by instruction program
CN109799977A (en) * 2019-01-25 2019-05-24 西安电子科技大学 The method and system of instruction repertorie exploitation scheduling data
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
CN109976903A (en) * 2019-02-22 2019-07-05 华中科技大学 A kind of deep learning Heterogeneous Computing method and system based on slice width Memory Allocation
US11568268B2 (en) 2019-02-22 2023-01-31 Huazhong University Of Science And Technology Deep learning heterogeneous computing method based on layer-wide memory allocation and system thereof
US11580386B2 (en) 2019-03-18 2023-02-14 Electronics And Telecommunications Research Institute Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 Method and system for integrated-circuit accelerated computation based on a convolutional neural network algorithm
CN110334801A (en) * 2019-05-09 2019-10-15 苏州浪潮智能科技有限公司 A hardware acceleration method, apparatus, device, and system for convolutional neural networks
CN110263925A (en) * 2019-06-04 2019-09-20 电子科技大学 An FPGA-based hardware acceleration framework for convolutional neural network forward prediction
CN110263925B (en) * 2019-06-04 2022-03-15 电子科技大学 Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image feature extraction method, apparatus, device, and computer-readable storage medium
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 RISC-V architecture-based convolutional neural network accelerator and control method thereof
CN110458279A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 An FPGA-based binary neural network acceleration method and system
CN110458279B (en) * 2019-07-15 2022-05-20 武汉魅瞳科技有限公司 FPGA-based binary neural network acceleration method and system
CN110390392A (en) * 2019-08-01 2019-10-29 上海安路信息科技有限公司 FPGA-based deconvolution parameter accelerator and data read/write method
CN110443357B (en) * 2019-08-07 2020-09-15 上海燧原智能科技有限公司 Convolutional neural network calculation optimization method and device, computer equipment and medium
CN110443357A (en) * 2019-08-07 2019-11-12 上海燧原智能科技有限公司 Convolutional neural network computation optimization method, apparatus, computer device, and medium
CN110852930A (en) * 2019-10-25 2020-02-28 华中科技大学 FPGA graph processing acceleration method and system based on OpenCL
CN111079923A (en) * 2019-11-08 2020-04-28 中国科学院上海高等研究院 Spark convolutional neural network system suitable for edge computing platform and circuit thereof
CN111079923B (en) * 2019-11-08 2023-10-13 中国科学院上海高等研究院 Spark convolutional neural network system suitable for edge computing platform and circuit thereof
CN111105015A (en) * 2019-12-06 2020-05-05 浪潮(北京)电子信息产业有限公司 General-purpose CNN inference accelerator, control method thereof, and readable storage medium
CN111160544A (en) * 2019-12-31 2020-05-15 上海安路信息科技有限公司 Data activation method and FPGA data activation system
CN111160544B (en) * 2019-12-31 2021-04-23 上海安路信息科技股份有限公司 Data activation method and FPGA data activation system
CN111210019B (en) * 2020-01-16 2022-06-24 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111210019A (en) * 2020-01-16 2020-05-29 电子科技大学 Neural network inference method based on software and hardware cooperative acceleration
CN111242289A (en) * 2020-01-19 2020-06-05 清华大学 Convolutional neural network acceleration system and method with expandable scale
CN111325327A (en) * 2020-03-06 2020-06-23 四川九洲电器集团有限责任公司 General-purpose convolutional neural network computing architecture based on an embedded platform, and method of use
CN111340198A (en) * 2020-03-26 2020-06-26 上海大学 FPGA-based neural network accelerator with high data reuse
CN111340198B (en) * 2020-03-26 2023-05-05 上海大学 FPGA-based neural network accelerator with high data reuse
CN111626403A (en) * 2020-05-14 2020-09-04 北京航空航天大学 Convolutional neural network accelerator based on CPU-FPGA memory sharing
EP4156079A4 (en) * 2020-05-22 2024-03-27 Inspur Electronic Information Industry Co., Ltd Image data storage method, image data processing method and system, and related apparatus
CN111736986A (en) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) accelerated execution method of deep learning model and related device
CN111736986B (en) * 2020-05-29 2023-06-23 浪潮(北京)电子信息产业有限公司 FPGA (field programmable Gate array) acceleration execution method and related device of deep learning model
WO2021259105A1 (en) * 2020-06-22 2021-12-30 深圳鲲云信息科技有限公司 Neural network accelerator
CN111860781B (en) * 2020-07-10 2024-06-28 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system based on FPGA
CN111860781A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 Convolutional neural network feature decoding system implemented on FPGA
CN111931913B (en) * 2020-08-10 2023-08-01 西安电子科技大学 Deployment method of convolutional neural network on FPGA (field programmable gate array) based on Caffe
CN111931913A (en) * 2020-08-10 2020-11-13 西安电子科技大学 Caffe-based deployment method of convolutional neural network on FPGA
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112101284A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Image recognition method, training method, device and system of image recognition model
CN112905526A (en) * 2021-01-21 2021-06-04 北京理工大学 FPGA implementation method for various types of convolution
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112732638A (en) * 2021-01-22 2021-04-30 上海交通大学 Heterogeneous acceleration system and method based on CTPN network
CN112732638B (en) * 2021-01-22 2022-05-06 上海交通大学 Heterogeneous acceleration system and method based on CTPN network
CN112819140B (en) * 2021-02-02 2022-06-24 电子科技大学 OpenCL-based FPGA one-dimensional signal recognition neural network acceleration method
CN112819140A (en) * 2021-02-02 2021-05-18 电子科技大学 OpenCL-based FPGA one-dimensional signal recognition neural network acceleration method
CN112949845A (en) * 2021-03-08 2021-06-11 内蒙古大学 Deep convolutional neural network accelerator based on FPGA
CN113065647A (en) * 2021-03-30 2021-07-02 西安电子科技大学 Computing-storage communication system and communication method for accelerating neural network
CN113065647B (en) * 2021-03-30 2023-04-25 西安电子科技大学 Calculation-storage communication system and communication method for accelerating neural network
CN113517007B (en) * 2021-04-29 2023-07-25 西安交通大学 Pipeline processing method and system, and memristor array
CN113517007A (en) * 2021-04-29 2021-10-19 西安交通大学 Pipeline processing method and system, and memristor array
CN113467783A (en) * 2021-07-19 2021-10-01 中科曙光国际信息产业有限公司 Kernel function compiling method and device of artificial intelligence accelerator
CN113467783B (en) * 2021-07-19 2023-09-12 中科曙光国际信息产业有限公司 Kernel function compiling method and device of artificial intelligence accelerator
CN114943635B (en) * 2021-09-30 2023-08-22 太初(无锡)电子科技有限公司 Fusion operator design and implementation method based on heterogeneous collaborative computing core
CN114943635A (en) * 2021-09-30 2022-08-26 太初(无锡)电子科技有限公司 Fusion operator design and implementation method based on heterogeneous collaborative computing core
CN113949592A (en) * 2021-12-22 2022-01-18 湖南大学 Anti-attack defense system and method based on FPGA
CN113949592B (en) * 2021-12-22 2022-03-22 湖南大学 Anti-attack defense system and method based on FPGA
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN117195989A (en) * 2023-11-06 2023-12-08 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment
CN117195989B (en) * 2023-11-06 2024-06-04 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Also Published As

Publication number Publication date
CN109086867B (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN109086867A (en) An FPGA-based convolutional neural network acceleration system
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN110058883B (en) CNN acceleration method and system based on OPU
CN107578095B (en) Neural network computing device and processor comprising the same
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN108537331A (en) A reconfigurable convolutional neural network acceleration circuit based on asynchronous logic
CN107657581A (en) Convolutional neural network (CNN) hardware accelerator and acceleration method
CN109784489A (en) FPGA-based convolutional neural network IP core
CN110458279A (en) An FPGA-based binary neural network acceleration method and system
CN109472356A (en) An accelerator and method for reconfigurable neural network algorithms
CN107066239A (en) A hardware architecture for convolutional neural network forward computation
CN107239824A (en) Apparatus and method for implementing a sparse convolutional neural network accelerator
CN108647773A (en) A hardware interconnection architecture for reconfigurable convolutional neural networks
CN110674927A (en) Data reorganization method for systolic array structures
CN109446996B (en) Face recognition data processing device and method based on FPGA
CN113743599B (en) Computing device and server of convolutional neural network
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN113792621A (en) Target detection accelerator design method based on FPGA
Huang et al. A high performance multi-bit-width booth vector systolic accelerator for NAS optimized deep learning neural networks
CN109739556A (en) A general-purpose deep learning processor based on interaction between multiple parallel caches and computation
Duan et al. Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN110490308A (en) Acceleration library design method, terminal device, and storage medium
CN113157638B (en) Low-power-consumption in-memory calculation processor and processing operation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210608