CN109598338A - A computation-optimized convolutional neural network accelerator based on FPGA - Google Patents
A computation-optimized convolutional neural network accelerator based on FPGA
- Publication number: CN109598338A (application number CN201811493592.XA)
- Authority
- CN
- China
- Prior art keywords
- weight
- data
- buffer area
- area
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention discloses a computation-optimized convolutional neural network accelerator based on FPGA, comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array. The data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface; the prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel; the result cache area caches the calculation results of each PE row; the state controller controls the accelerator's working states and implements the transitions between them; and the PE array reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations. By exploiting parameter sparsity, repeated weight values, and the properties of the ReLU activation function, the accelerator terminates redundant computation early, reducing the amount of calculation, and lowers energy consumption by reducing the number of memory accesses.
Description
Technical field
The invention belongs to the fields of electronic information and deep learning, and in particular relates to the hardware architecture of a computation-optimized convolutional neural network accelerator based on an FPGA (Field Programmable Gate Array).
Background art
In recent years, deep neural networks have developed rapidly and have had a significant impact on the world economy and on social life. Deep convolutional neural network (DCNN) technology has attracted wide attention in many machine-learning fields, including speech recognition, natural language processing, and intelligent image processing; in image recognition in particular, deep convolutional neural networks have achieved remarkable results and can in some cases surpass human accuracy. Their strength lies in their ability to learn statistically from massive amounts of data and extract high-level features from raw input.
Deep convolutional neural networks are notoriously computation-intensive: convolution operations account for more than 90% of the total operation count. Using the runtime information and algorithmic structure of the convolution computation to reduce this workload, that is, to reduce the work required for inference, has therefore become a major new research direction.
The high accuracy of deep convolutional neural networks comes at the cost of high computational complexity. Besides being computation-intensive, a convolutional neural network may need to store millions or even close to a hundred million parameters. Networks of this scale pose throughput and energy-efficiency challenges for the underlying acceleration hardware.
At present, various accelerators based on FPGAs, GPUs (Graphics Processing Units), and ASICs (Application Specific Integrated Circuits) have been proposed to improve the performance of deep convolutional neural networks. FPGA-based accelerators are widely studied because of their good performance, high energy efficiency, short development cycle, and strong reconfigurability. Unlike general-purpose architectures, an FPGA allows the user to customize the function of the hardware to suit different resource budgets and data-usage patterns.
The present invention arises from the above analysis and from the problem that convolution computation in the prior art involves an excessive amount of redundant calculation.
Summary of the invention
The purpose of the present invention is to provide a computation-optimized convolutional neural network accelerator based on FPGA that exploits parameter sparsity, repeated weight values, and the properties of the ReLU activation function to terminate redundant computation early, reduce the amount of calculation, and lower energy consumption by reducing the number of memory accesses.
To achieve the above objective, the solution of the invention is as follows:
A computation-optimized convolutional neural network accelerator based on FPGA comprises an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array.
The AXI4 bus interface is a generic bus interface; the accelerator can be mounted on and operate in any bus device that uses the AXI4 protocol.
The data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface. It comprises M feature-map sub-buffers and C convolution-kernel sub-buffers; each PE column is assigned one convolution-kernel sub-buffer, and the number of feature-map sub-buffers actually used is determined by the parameters of the layer currently being computed. The number of feature-map sub-buffers M is determined by the current layer's convolution-kernel size, output feature-map size, and convolution-window offset.
The prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel.
The result cache area comprises R result sub-buffers, one per PE row, and caches the calculation results of each PE row.
The state controller controls the accelerator's working states and implements the transitions between them.
The PE array is implemented on the FPGA. It reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations: PEs in different columns compute different output feature maps, and PEs in different rows compute different rows of the same output feature map. The array contains R*C PE units, and each PE unit supports two calculation-optimization modes: pre-activation mode and weight-repeat mode.
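As a rough sketch of this row/column partitioning (the modulo assignment and the function name below are illustrative assumptions, not taken from the patent):

```python
# Illustrative sketch: columns of the R x C PE array compute different
# output feature maps (channels); rows compute different output rows of
# the same map. The modulo assignment is an assumption for illustration.
R, C = 16, 16  # PE array size used in the patent's embodiment

def pe_for(out_channel, out_row):
    """Return the (array row, array column) of the PE that would compute
    one output element under this assumed mapping."""
    return (out_row % R, out_channel % C)
```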
The above PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a configurable multiply-accumulate unit. The input buffer and weight buffer store the feature-map data and weight data required for the convolution computation; the input index area and weight index area store the index values used to look up the feature-map data and weight data. The PE control unit controls the PE unit's working state: it reads index values from the index areas, reads the corresponding data from the buffers according to those index values, feeds the data into the multiply-accumulate unit, configures the multiply-accumulate unit's mode, and decides whether to enable the pre-activation unit. The pre-activation unit monitors the partial sum of the convolution computation; if the partial sum is less than 0, it stops the computation and outputs 0. The multiply-accumulate unit performs the convolution computation and can be configured either for normal multiply-accumulate operation or for the optimized mode that exploits repeated weights.
The above PE control unit determines whether the multiply-accumulate unit's optimization mode is pre-activation mode or weight-repeat mode, selecting a different optimization mode for each layer. The mode is determined by a two-bit mode flag: a high bit of 0 means normal multiply-accumulate computation; a high bit of 1 means the optimized mode that exploits repeated weights; a low bit of 0 means no pre-activation; a low bit of 1 means pre-activation mode.
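A minimal sketch of decoding this two-bit flag (the dictionary keys are assumed names, only the bit meanings come from the text):

```python
# Sketch of the two-bit mode flag: the high bit selects the weight-repeat
# MAC mode, the low bit enables the pre-activation unit.
def decode_mode(s1, s0):
    return {
        "weight_repeat": s1 == 1,  # high bit 1: exploit repeated weights
        "pre_activate": s0 == 1,   # low bit 1: early-exit on negative sums
    }
```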
The above weight index area stores multiple weight indices. The weights are written into the weight sub-buffer ordered from positive to negative, with zero-valued weights last, and the corresponding input index values and weight index values are written into the index areas in the same order. Sorting the weights and index values is done offline. During convolution computation, the weights are read from the weight buffer in sequence according to the weight index values.
The above weight index value is a one-bit weight-change flag indicating whether the weight used in the computation must be replaced. When the flag is 0, the weight is unchanged and is reused for another clock cycle; when the flag is 1, the weight changes, and the next clock cycle reads the next weight, in order, from the weight sub-buffer.
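A sketch of this offline preprocessing step, under two stated assumptions: equal weights are grouped by sorting on value within each sign group, and flag 1 marks the last operand of a run of equal weights (the exact flag timing in hardware may differ):

```python
# Offline preprocessing sketch: order operand positions so positive weights
# come first, then negative weights, with zero weights last; store each
# distinct run of equal weights once; emit a per-operand change flag
# (1 = the weight changes after this operand, an assumed convention) and
# keep the original position of each operand as its input index.
def preprocess(weights):
    # Sort by (is-zero, is-negative, -value) so zeros go last, positives
    # come first, and equal weights become adjacent within each group.
    order = sorted(range(len(weights)),
                   key=lambda i: (weights[i] == 0, weights[i] < 0, -weights[i]))
    stored, flags = [], []
    for k, i in enumerate(order):
        w = weights[i]
        if not stored or stored[-1] != w:
            stored.append(w)          # weight sub-buffer holds distinct runs
        nxt = weights[order[k + 1]] if k + 1 < len(order) else None
        flags.append(0 if nxt == w else 1)
    return stored, flags, order       # `order` doubles as the input index
```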
The above PE unit supports two calculation-optimization modes: pre-activation mode and weight-repeat mode. Pre-activation mode monitors the sign of the partial sum in real time; if it becomes negative, the PE control unit is notified to terminate the computation and the ReLU result zero is output directly, while if it stays positive the convolution computation continues. Weight-repeat mode applies to convolution operations that share the same weight: the feature-map data corresponding to the same weight are added first and the sum is then multiplied by that weight, reducing both the number of multiplications and the number of accesses to weight data.
In the above weight-repeat mode, while the weight-change flag is 0 the input feature-map values are accumulated and the running sum is kept in a register. When the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is fed into the multiplication unit to be multiplied by the weight, and the result is stored in a register.
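A behavioral sketch of the weight-repeat mode, assuming the flag convention that 1 marks the last operand of a run of equal weights:

```python
# Sketch of the weight-repeat mode: while the change flag is 0 the inputs
# are only accumulated; when it is 1 the accumulated sum is multiplied by
# the current stored weight once, the accumulator is cleared, and the next
# stored weight is selected.
def weight_repeat_mac(inputs, stored_weights, flags):
    acc = total = 0
    w_ptr = 0
    for x, f in zip(inputs, flags):
        acc += x                                  # add first ...
        if f:
            total += acc * stored_weights[w_ptr]  # ... one multiply per run
            acc, w_ptr = 0, w_ptr + 1
    return total
```

For four inputs sharing one weight this costs one multiplication instead of four, which is the saving the text describes.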
The above state controller consists of 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send calculation results. Each state sends the corresponding control signals to the corresponding submodules to complete the corresponding function.
The data width of the above AXI4 bus interface is greater than that of a single weight or feature-map value, so multiple data items are spliced into one wide word before being sent, improving the data-transfer speed.
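A sketch of this splicing for the 64-bit case used later in the embodiment, where four 16-bit results share one bus word (the little-endian lane order is an assumption):

```python
# Splice four 16-bit values into one 64-bit AXI4 beat, and the inverse.
def pack4x16(vals):
    word = 0
    for lane, v in enumerate(vals):
        word |= (v & 0xFFFF) << (16 * lane)  # lane 0 in the low 16 bits
    return word

def unpack4x16(word):
    return [(word >> (16 * lane)) & 0xFFFF for lane in range(4)]
```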
With the above scheme, the present invention uses the runtime information and algorithmic structure of the convolution computation to eliminate redundant calculations and unnecessary parameter reads, and accelerates the convolutional neural network on an FPGA hardware platform. This improves the real-time performance of the DCNN, achieves higher computational performance, and reduces energy consumption.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the invention;
Fig. 2 is a schematic diagram of the PE structure of the invention;
Fig. 3 is a schematic diagram of input-index and weight-index operation;
Fig. 4 is a schematic diagram of the operation of the pre-activation unit.
Specific embodiment
The technical solution and beneficial effects of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the convolutional neural network accelerator hardware architecture designed by the present invention works as follows, taking a PE array of size 16*16, a convolution kernel of size 3*3, and a convolution stride of 1 as an example:
The PC partitions and caches the data in the external memory DDR through a PCI-E interface. The data buffer area reads the feature-map data through the AXI4 bus interface and caches it row by row in 3 feature-map sub-buffers; the input index values are cached in the feature-map sub-buffers in the same manner. The weight data read through the AXI4 bus interface are cached in turn in 16 convolution-kernel sub-buffers, and the weight index values are cached in the convolution-kernel sub-buffers in the same manner. The prefetch buffer reads the 3 feature-map sub-buffers sequentially by row, reading 3*18 16-bit feature-map values in total; each clock cycle it outputs 16 feature-map values in parallel and feeds in 3 feature-map values in parallel. The output of the prefetch buffer is fed into the first PE of each PE row and passed on in turn to the adjacent PEs in the row; the input index values are fed into the PE array in the same manner. The input feature-map data are cached in each PE's input sub-buffer, and the input index values in its input index area. The weight data and weight index values pass through the 16 convolution-kernel sub-buffers, are fed in parallel into the first PE of each PE column, and are passed on in turn to the adjacent PEs in the column, finally being cached in the weight buffer and weight index area inside each PE. According to its configured optimization mode and the index values, each PE unit reads data from its input sub-buffer and weight sub-buffer, performs the convolution computation, and feeds the accumulated results in parallel into the 16 result sub-buffers; the calculation results of each PE row are stored in the same result sub-buffer.
As shown in Fig. 2, a PE unit can be configured into the two calculation-optimization modes, pre-activation mode and weight-repeat mode, through the two-bit mode flag S1S0. When S1S0 is configured as 01, the mode is pre-activation: the pre-activation unit is started to monitor the partial-sum result of the multiply-accumulate operation, and if the partial-sum value is negative, the ReLU result 0 is output in advance and the computation of the current convolution window is stopped. When S1S0 is configured as 10, the mode is weight-repeat: the input summing unit is started and multiplications sharing the same weight are handled by addition first; the input data are accumulated into a register until the weight changes, at which point the accumulated result is fed into the multiply-accumulate unit for the multiply-accumulate operation. When a weight is 0, the PE unit shuts down the computing unit and outputs the partial sum directly as the result.
Referring to Fig. 3, the PE control unit fetches feature-map data from the input sub-buffer in turn according to the input index values and feeds it into the computing unit. The weight index value is a one-bit weight-change flag: if the weight index value is 0 the weight is unchanged, and if it is 1 the next weight is read in order. The weights and index values are sorted from positive to negative with zero-valued weights placed last; this work is completed offline. As shown in Fig. 3, the first four input values share the same weight x, the middle two input values share the same weight y, and the last three input values share the same weight z.
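The Fig. 3 example can be expressed as a flag stream (x, y, z stand for the three stored weight values; flag 1 meaning "advance to the next weight after this operand" is an assumed convention consistent with the description):

```python
# Four operands share weight x, two share y, three share z.
change_flags = [0, 0, 0, 1, 0, 1, 0, 0, 1]
stored_weights = ["x", "y", "z"]

fetched, ptr = [], 0
for f in change_flags:
    fetched.append(stored_weights[ptr])  # reuse the current weight
    if f:
        ptr += 1                         # flag 1: next cycle reads a new weight
```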
Referring to Fig. 4, once the pre-activation unit is enabled, it compares the partial-sum value with zero. If the partial sum is greater than zero, the computation continues and the final result is output; if the partial sum is less than zero, a terminate-computation signal is sent to the PE control unit, which shuts down the computation and directly outputs the post-ReLU result of zero.
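A behavioral sketch of this early termination. Note one assumption beyond the text: the shortcut is exact when the weights are processed positive-first (as in the offline sort) and the inputs are nonnegative, e.g. outputs of a previous ReLU layer, because the partial sum then only decreases once it has gone negative; the patent itself simply monitors the sign:

```python
# Pre-activation sketch: watch the running partial sum and, once it goes
# negative, stop and output ReLU's zero immediately.
def pre_activated_mac(pairs):
    """pairs: (input, weight) operands, weights sorted positive-first."""
    partial = 0
    for x, w in pairs:
        partial += x * w
        if partial < 0:          # early-termination signal to the control unit
            return 0             # post-ReLU result
    return max(partial, 0)       # ordinary ReLU on the completed sum
```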
The convolution operation is unrolled into vector multiply-accumulate operations so that the network structure and the hardware architecture match better; the computation is simplified according to runtime information and algorithmic structure, improving computational efficiency and reducing energy consumption. The specific state-transition process of this embodiment is as follows:
After initialization, the accelerator enters the wait state, in which the state controller waits for the status signal sent by the AXI4 bus interface. When the status is 00001, it enters the write-convolution-kernel state; when the status is 00010, the write-weight-index state; when the status is 00100, the write-feature-map state; when the status is 01000, the write-input-index state. After the data have been received, it waits for status 10000 and enters the convolution-computation state. When the computation finishes, it jumps automatically to the send-calculation-results state, and after the transfer completes it jumps back to the wait state.
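The transitions above can be sketched as a small table-driven state machine (state names are paraphrased; only the one-hot status codes and the transition order come from the text):

```python
# One-hot bus status codes from the description.
STATUS_TO_STATE = {
    0b00001: "WRITE_KERNEL",
    0b00010: "WRITE_WEIGHT_INDEX",
    0b00100: "WRITE_FEATURE_MAP",
    0b01000: "WRITE_INPUT_INDEX",
    0b10000: "COMPUTE",
}

def next_state(state, bus_status=0):
    if state == "WAIT":
        return STATUS_TO_STATE.get(bus_status, "WAIT")
    if state == "COMPUTE":
        return "SEND_RESULTS"  # jumps automatically when the layer finishes
    return "WAIT"              # write states and SEND_RESULTS return to WAIT
```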
Write feature map: after entering this state, the accelerator waits for the AXI4 bus interface's data-valid signal to be raised while enabling the 3 feature-map sub-buffers in turn. The first sub-buffer stores the first row of the feature map, the second sub-buffer the second row, and the third sub-buffer the third row; the fourth row wraps around and is stored in the first sub-buffer, and so on. After the feature-map data have been stored in this order, the first clock cycle takes the first values of feature-map rows one, two, and three from the three sub-buffers and feeds them into the prefetch buffer; the second clock cycle takes the first values of rows four, five, and six, and so on. After the rows of the feature map have been traversed, the second, third, ... values are taken in the same order and fed into the prefetch buffer. The prefetch buffer holds 3*18 feature-map values; once storage is complete, 16 feature-map values are output in parallel each clock cycle, fed into the first PE of each PE row, passed on in turn to the adjacent PEs in the row, and finally stored in the PEs' input sub-buffers.
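The row-interleaved storage described above can be sketched as follows (a minimal model; in hardware each sub-buffer is a separate memory so the three rows of a window can be read in parallel):

```python
# Row r of the input feature map goes to sub-buffer r % 3, so a 3x3
# convolution window's three rows live in three different sub-buffers.
def fill_sub_buffers(feature_map, num_buffers=3):
    bufs = [[] for _ in range(num_buffers)]
    for r, row in enumerate(feature_map):
        bufs[r % num_buffers].append(row)
    return bufs
```

For a 6-row feature map, sub-buffer 0 then holds rows 0 and 3, sub-buffer 1 rows 1 and 4, and sub-buffer 2 rows 2 and 5.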
Write input index: after entering this state, the data are stored following the feature-map storage pattern and end up in the PEs' input index areas.
Write convolution kernel: after entering this state, the accelerator waits for the AXI4 bus interface's data-valid signal to be raised while enabling the 16 convolution-kernel sub-buffers in turn. The first sub-buffer stores the convolution-kernel values for the first output channel, the second sub-buffer those for the second output channel, and so on. After the 16 sub-buffers have been filled, each sub-buffer outputs one value per clock cycle; the 16 weight values are fed in parallel into the first PE of each PE column, passed on in turn to the adjacent PEs in the column, and finally cached in the weight buffers inside the PE units.
Write weight index: after entering this state, the data are stored following the convolution-kernel storage pattern and end up in the PEs' weight index areas.
Convolution computation: after entering this state, the PE control unit configures the PE unit's optimization mode according to the mode flag S1S0, reads data from the weight sub-buffer and input sub-buffer according to the weight index values and input index values, and feeds the data into the multiply-accumulate unit. After 3*3*(number of input channels) multiply-accumulate operations have been carried out, all data have been computed, and the next clock cycle jumps to the send-calculation-results state.
Send calculation results: after entering this state, the calculation results are read sequentially from the 16 result sub-buffers. The data for the first output channel are taken from each result sub-buffer, and every four values are spliced into one 64-bit output word and sent to the external memory DDR through the AXI4 bus interface. Once the data of all 16 output channels have been sent to the external memory DDR, the accelerator jumps back to the wait state.
Parameters can be modified through the state controller, supporting run-time changes to the image size, convolution-kernel size, stride, output feature-map size, and number of output channels. By using runtime state and algorithmic structure, redundant computation is skipped; unnecessary calculation and memory accesses are therefore reduced, the efficiency of the convolutional neural network accelerator is improved, and energy consumption is lowered.
The above embodiment merely illustrates the technical idea of the present invention and does not limit its scope of protection; any change made on the basis of the technical solution according to the technical idea provided by the invention falls within the scope of protection of the present invention.
Claims (9)
1. A computation-optimized convolutional neural network accelerator based on FPGA, characterized by comprising an AXI4 bus interface, a data buffer area, a prefetch data area, a result cache area, a state controller, and a PE array;
the data buffer area caches the feature-map data, convolution-kernel data, and index values read from the external memory DDR through the AXI4 bus interface, and comprises M feature-map sub-buffers and C convolution-kernel sub-buffers;
the prefetch data area prefetches, from the feature-map sub-buffers, the feature-map data that must be fed into the PE array in parallel;
the PE array is implemented on the FPGA and comprises R*C PE units, each PE column being assigned one convolution-kernel sub-buffer, with the number of feature-map sub-buffers actually used determined by the parameters of the layer being computed; the PE array reads the data in the prefetch data area and the convolution-kernel sub-buffers and performs the convolution operations, PE units in different columns computing different output feature maps and PEs in different rows computing different rows of the same output feature map;
the result cache area comprises R result sub-buffers, one per PE row, and caches the calculation results of each PE row;
the state controller controls the accelerator's working states and implements the transitions between them.
2. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the PE unit comprises an input buffer, a weight buffer, an input index area, a weight index area, a PE control unit, a pre-activation unit, and a multiply-accumulate unit, wherein the input buffer and weight buffer store the feature-map data and weight data required for the convolution computation, and the input index area and weight index area store the index values used to look up the feature-map data and weight data; the PE control unit controls the PE unit's working state, reads index values from the index areas, reads data from the buffers according to the index values, feeds the data into the multiply-accumulate unit, configures the multiply-accumulate unit's mode, and decides whether to enable the pre-activation unit; the pre-activation unit monitors the partial sum of the convolution computation and, if the partial sum is less than 0, stops the computation and outputs 0; the multiply-accumulate unit performs the convolution computation and can be configured either for normal multiply-accumulate operation or for the optimized mode that exploits repeated weights.
3. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 2, characterized in that: the PE control unit determines whether the multiply-accumulate unit's optimization mode is pre-activation mode or weight-repeat mode, selecting a different optimization mode for each layer; the mode is determined by a two-bit mode flag, where a high bit of 0 means normal multiply-accumulate computation, a high bit of 1 means the optimized mode that exploits repeated weights, a low bit of 0 means no pre-activation, and a low bit of 1 means pre-activation mode.
4. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 2, characterized in that: the weight index area stores multiple weight indices; the weights are written into the weight sub-buffer ordered from positive to negative with zero-valued weights last, and the corresponding input index values and weight index values are written into the index areas in the same order; sorting the weights and index values is done offline; during convolution computation, the weights are read from the weight buffer in sequence according to the weight index values.
5. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 4, characterized in that: the weight index value is a one-bit weight-change flag indicating whether the weight must be replaced; when the flag is 0, the weight is unchanged and is reused for another clock cycle; when the flag is 1, the weight changes, and the next clock cycle reads the next weight, in order, from the weight sub-buffer.
6. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the PE unit supports two calculation-optimization modes, pre-activation mode and weight-repeat mode; pre-activation mode monitors the sign of the partial sum in real time, terminating the computation and directly outputting the ReLU result zero if it becomes negative, and continuing the convolution computation if it stays positive; weight-repeat mode applies to convolution operations sharing the same weight, adding the feature-map data corresponding to the same weight first and then multiplying the sum by that weight, reducing the number of multiplications and the number of accesses to weight data.
7. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 6, characterized in that: in the weight-repeat mode, while the weight-change flag is 0 the input feature-map values are accumulated and the running sum is kept in a register; when the weight-change flag is 1, the accumulation is completed, the accumulated partial sum is fed into the multiplication unit to be multiplied by the weight, and the result is stored in a register.
8. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the state controller consists of 7 states: wait, write feature map, write input index, write convolution kernel, write weight index, convolution computation, and send calculation results; each state sends the corresponding control signals to the corresponding submodules to complete the corresponding function.
9. The computation-optimized convolutional neural network accelerator based on FPGA according to claim 1, characterized in that: the AXI4 bus interface splices multiple data items into one wide word before sending them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811493592.XA CN109598338B (en) | 2018-12-07 | 2018-12-07 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109598338A true CN109598338A (en) | 2019-04-09 |
CN109598338B CN109598338B (en) | 2023-05-19 |
Family
ID=65961420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811493592.XA Active CN109598338B (en) | 2018-12-07 | 2018-12-07 | Convolutional neural network accelerator based on FPGA (field programmable Gate array) for calculation optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109598338B (en) |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059808A (en) * | 2019-06-24 | 2019-07-26 | 深兰人工智能芯片研究院(江苏)有限公司 | A kind of method for reading data and reading data device of convolutional neural networks |
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Preferential convolutional neural networks implementation method, system and device are exported based on FPGA and row |
CN110163295A (en) * | 2019-05-29 | 2019-08-23 | 四川智盈科技有限公司 | Image recognition inference acceleration method based on early termination |
CN110222835A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of convolutional neural networks hardware system and operation method based on zero value detection |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization |
CN110390383A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of deep neural network hardware accelerator based on power exponent quantization |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | In-memory computing circuit suitable for fully-connected binarized neural networks |
CN110673786A (en) * | 2019-09-03 | 2020-01-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110705687A (en) * | 2019-09-05 | 2020-01-17 | 北京三快在线科技有限公司 | Convolution neural network hardware computing device and method |
CN110738312A (en) * | 2019-10-15 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Method, system, device and computer readable storage medium for data processing |
CN110910434A (en) * | 2019-11-05 | 2020-03-24 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN111062472A (en) * | 2019-12-11 | 2020-04-24 | 浙江大学 | Sparse neural network accelerator based on structured pruning and acceleration method thereof |
CN111178519A (en) * | 2019-12-27 | 2020-05-19 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN111340198A (en) * | 2020-03-26 | 2020-06-26 | 上海大学 | Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array) |
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111416743A (en) * | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111984548A (en) * | 2020-07-22 | 2020-11-24 | 深圳云天励飞技术有限公司 | Neural network computing device |
CN112149814A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨理工大学 | Convolutional neural network acceleration system based on FPGA |
CN112187954A (en) * | 2020-10-15 | 2021-01-05 | 中国电子科技集团公司第五十四研究所 | Flow control method of offline file in measurement and control data link transmission |
WO2021031154A1 (en) * | 2019-08-21 | 2021-02-25 | 深圳市大疆创新科技有限公司 | Method and device for loading feature map of neural network |
CN112580793A (en) * | 2020-12-24 | 2021-03-30 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN113094118A (en) * | 2021-04-26 | 2021-07-09 | 深圳思谋信息科技有限公司 | Data processing system, method, apparatus, computer device and storage medium |
CN113095471A (en) * | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113111995A (en) * | 2020-01-09 | 2021-07-13 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing operation time |
CN113780529A (en) * | 2021-09-08 | 2021-12-10 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolution neural network multi-level storage computing system |
CN113869494A (en) * | 2021-09-28 | 2021-12-31 | 天津大学 | Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis |
WO2022134688A1 (en) * | 2020-12-25 | 2022-06-30 | 中科寒武纪科技股份有限公司 | Data processing circuit, data processing method, and related products |
CN114780910A (en) * | 2022-06-16 | 2022-07-22 | 千芯半导体科技(北京)有限公司 | Hardware system and calculation method for sparse convolution calculation |
CN115311536A (en) * | 2022-10-11 | 2022-11-08 | 绍兴埃瓦科技有限公司 | Sparse convolution processing method and device in image processing |
CN116187408A (en) * | 2023-04-23 | 2023-05-30 | 成都甄识科技有限公司 | Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100076915A1 (en) * | 2008-09-25 | 2010-03-25 | Microsoft Corporation | Field-Programmable Gate Array Based Accelerator System |
US20180032859A1 (en) * | 2016-07-27 | 2018-02-01 | Samsung Electronics Co., Ltd. | Accelerator in convolutional neural network and method for operating the same |
CN108241890A (en) * | 2018-01-29 | 2018-07-03 | 清华大学 | Reconfigurable neural network acceleration method and architecture |
CN108537334A (en) * | 2018-04-26 | 2018-09-14 | 济南浪潮高新科技投资发展有限公司 | A kind of acceleration array design methodology for CNN convolutional layer operations |
CN108665059A (en) * | 2018-05-22 | 2018-10-16 | 中国科学技术大学苏州研究院 | Convolutional neural networks acceleration system based on field programmable gate array |
CN108805272A (en) * | 2018-05-03 | 2018-11-13 | 东南大学 | A kind of general convolutional neural networks accelerator based on FPGA |
2018
- 2018-12-07 CN CN201811493592.XA patent/CN109598338B/en active Active
Cited By (51)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097174A (en) * | 2019-04-22 | 2019-08-06 | 西安交通大学 | Convolutional neural network implementation method, system and device based on FPGA and row-output priority |
CN110097174B (en) * | 2019-04-22 | 2021-04-20 | 西安交通大学 | Method, system and device for realizing convolutional neural network based on FPGA and row output priority |
CN110222835A (en) * | 2019-05-13 | 2019-09-10 | 西安交通大学 | A kind of convolutional neural networks hardware system and operation method based on zero value detection |
CN110163295A (en) * | 2019-05-29 | 2019-08-23 | 四川智盈科技有限公司 | Image recognition inference acceleration method based on early termination |
CN110059808A (en) * | 2019-06-24 | 2019-07-26 | 深兰人工智能芯片研究院(江苏)有限公司 | Data reading method and data reading device for a convolutional neural network |
CN110390383B (en) * | 2019-06-25 | 2021-04-06 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantization |
WO2020258527A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantisation |
WO2020258528A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Configurable universal convolutional neural network accelerator |
WO2020258841A1 (en) * | 2019-06-25 | 2020-12-30 | 东南大学 | Deep neural network hardware accelerator based on power exponent quantisation |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110390384B (en) * | 2019-06-25 | 2021-07-06 | 东南大学 | Configurable general convolutional neural network accelerator |
CN110390383A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of deep neural network hardware accelerator based on power exponent quantization |
CN110399883A (en) * | 2019-06-28 | 2019-11-01 | 苏州浪潮智能科技有限公司 | Image characteristic extracting method, device, equipment and computer readable storage medium |
CN110378468A (en) * | 2019-07-08 | 2019-10-25 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization |
WO2021004366A1 (en) * | 2019-07-08 | 2021-01-14 | 浙江大学 | Neural network accelerator based on structured pruning and low-bit quantization, and method |
CN110414677A (en) * | 2019-07-11 | 2019-11-05 | 东南大学 | In-memory computing circuit suitable for fully-connected binarized neural networks |
WO2021031154A1 (en) * | 2019-08-21 | 2021-02-25 | 深圳市大疆创新科技有限公司 | Method and device for loading feature map of neural network |
CN110673786A (en) * | 2019-09-03 | 2020-01-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110673786B (en) * | 2019-09-03 | 2020-11-10 | 浪潮电子信息产业股份有限公司 | Data caching method and device |
CN110705687A (en) * | 2019-09-05 | 2020-01-17 | 北京三快在线科技有限公司 | Convolution neural network hardware computing device and method |
CN110738312A (en) * | 2019-10-15 | 2020-01-31 | 百度在线网络技术(北京)有限公司 | Method, system, device and computer readable storage medium for data processing |
CN110910434A (en) * | 2019-11-05 | 2020-03-24 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN110910434B (en) * | 2019-11-05 | 2023-05-12 | 东南大学 | Method for realizing deep learning parallax estimation algorithm based on FPGA (field programmable Gate array) high energy efficiency |
CN111062472A (en) * | 2019-12-11 | 2020-04-24 | 浙江大学 | Sparse neural network accelerator based on structured pruning and acceleration method thereof |
CN111178519B (en) * | 2019-12-27 | 2022-08-02 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN111178519A (en) * | 2019-12-27 | 2020-05-19 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
CN113095471A (en) * | 2020-01-09 | 2021-07-09 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113095471B (en) * | 2020-01-09 | 2024-05-07 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113111995A (en) * | 2020-01-09 | 2021-07-13 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing operation time |
CN111414994A (en) * | 2020-03-03 | 2020-07-14 | 哈尔滨工业大学 | FPGA-based Yolov3 network computing acceleration system and acceleration method thereof |
CN111416743A (en) * | 2020-03-19 | 2020-07-14 | 华中科技大学 | Convolutional network accelerator, configuration method and computer readable storage medium |
CN111340198B (en) * | 2020-03-26 | 2023-05-05 | 上海大学 | Neural network accelerator for data high multiplexing based on FPGA |
CN111340198A (en) * | 2020-03-26 | 2020-06-26 | 上海大学 | Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array) |
CN111898733B (en) * | 2020-07-02 | 2022-10-25 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111984548A (en) * | 2020-07-22 | 2020-11-24 | 深圳云天励飞技术有限公司 | Neural network computing device |
CN111984548B (en) * | 2020-07-22 | 2024-04-02 | 深圳云天励飞技术股份有限公司 | Neural network computing device |
CN112149814A (en) * | 2020-09-23 | 2020-12-29 | 哈尔滨理工大学 | Convolutional neural network acceleration system based on FPGA |
CN112187954A (en) * | 2020-10-15 | 2021-01-05 | 中国电子科技集团公司第五十四研究所 | Flow control method of offline file in measurement and control data link transmission |
CN112580793B (en) * | 2020-12-24 | 2022-08-12 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
CN112580793A (en) * | 2020-12-24 | 2021-03-30 | 清华大学 | Neural network accelerator based on time domain memory computing and acceleration method |
WO2022134688A1 (en) * | 2020-12-25 | 2022-06-30 | 中科寒武纪科技股份有限公司 | Data processing circuit, data processing method, and related products |
CN112668708B (en) * | 2020-12-28 | 2022-10-14 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN112668708A (en) * | 2020-12-28 | 2021-04-16 | 中国电子科技集团公司第五十二研究所 | Convolution operation device for improving data utilization rate |
CN113094118A (en) * | 2021-04-26 | 2021-07-09 | 深圳思谋信息科技有限公司 | Data processing system, method, apparatus, computer device and storage medium |
CN113780529B (en) * | 2021-09-08 | 2023-09-12 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolutional neural network multi-stage storage computing system |
CN113780529A (en) * | 2021-09-08 | 2021-12-10 | 北京航空航天大学杭州创新研究院 | FPGA-oriented sparse convolution neural network multi-level storage computing system |
CN113869494A (en) * | 2021-09-28 | 2021-12-31 | 天津大学 | Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis |
CN114780910A (en) * | 2022-06-16 | 2022-07-22 | 千芯半导体科技(北京)有限公司 | Hardware system and calculation method for sparse convolution calculation |
CN115311536A (en) * | 2022-10-11 | 2022-11-08 | 绍兴埃瓦科技有限公司 | Sparse convolution processing method and device in image processing |
CN116187408A (en) * | 2023-04-23 | 2023-05-30 | 成都甄识科技有限公司 | Sparse acceleration unit, calculation method and sparse neural network hardware acceleration system |
Also Published As
Publication number | Publication date |
---|---|
CN109598338B (en) | 2023-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109598338A (en) | A kind of convolutional neural networks accelerator of the calculation optimization based on FPGA | |
Lu et al. | An efficient hardware accelerator for sparse convolutional neural networks on FPGAs | |
CN111242289B (en) | Convolutional neural network acceleration system and method with expandable scale | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
CN110390383A (en) | A kind of deep neural network hardware accelerator based on power exponent quantization | |
CN108805272A (en) | A kind of general convolutional neural networks accelerator based on FPGA | |
CN110390384A (en) | A kind of configurable general convolutional neural networks accelerator | |
CN108665059A (en) | Convolutional neural networks acceleration system based on field programmable gate array | |
CN107437110A (en) | The piecemeal convolution optimization method and device of convolutional neural networks | |
CN108197705A (en) | Convolutional neural networks hardware accelerator and convolutional calculation method and storage medium | |
CN103699360B (en) | Vector processor and method for vector data access and interaction | |
CN108537331A (en) | Reconfigurable convolutional neural network acceleration circuit based on asynchronous logic | |
CN112487750B (en) | Convolution acceleration computing system and method based on in-memory computing | |
CN109325591A (en) | Neural network processor for Winograd convolution | |
CN110163355A (en) | A kind of computing device and method | |
Liu et al. | FPGA-NHAP: A general FPGA-based neuromorphic hardware acceleration platform with high speed and low power | |
Zhang et al. | Implementation and optimization of the accelerator based on FPGA hardware for LSTM network | |
Duan et al. | Energy-efficient architecture for FPGA-based deep convolutional neural networks with binary weights | |
Liu et al. | CASSANN-v2: A high-performance CNN accelerator architecture with on-chip memory self-adaptive tuning | |
CN109240644A (en) | Local search method and circuit for an Ising chip | |
Tao et al. | Hima: A fast and scalable history-based memory access engine for differentiable neural computer | |
Zhou et al. | Mat: Processing in-memory acceleration for long-sequence attention | |
CN116822600A (en) | Neural network search chip based on RISC-V architecture | |
Feng et al. | An Efficient Model-Compressed EEGNet Accelerator for Generalized Brain-Computer Interfaces With Near Sensor Intelligence | |
CN103365821A (en) | Address generator of heterogeneous multi-core processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |