CN109993297A - Load-balanced sparse convolutional neural network accelerator and acceleration method - Google Patents

Load-balanced sparse convolutional neural network accelerator and acceleration method

Info

Publication number
CN109993297A
Authority
CN
China
Prior art keywords
convolution
data
load balancing
array
computing
Prior art date
2019-04-02
Legal status
Pending
Application number
CN201910259591.7A
Other languages
Chinese (zh)
Inventor
王瑶
朱志炜
秦子迪
苏岩
王宇宣
Current Assignee
Nanjing Jixiang Sensing And Imaging Technology Research Institute Co Ltd
Original Assignee
Nanjing Jixiang Sensing And Imaging Technology Research Institute Co Ltd
Priority date: 2019-04-02
Filing date: 2019-04-02
Publication date: 2019-07-09
Application filed by Nanjing Jixiang Sensing And Imaging Technology Research Institute Co Ltd
Priority to CN201910259591.7A
Publication of CN109993297A


Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks:
    • G06N3/045 Combinations of networks (under G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means (under G06N3/06)
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections (under G06N3/08 Learning methods)


Abstract

The invention discloses a load-balanced sparse convolutional neural network accelerator and an acceleration method using it. The accelerator comprises a master controller, a data distribution module, a computing array for convolution, an output result cache module, a linear activation function unit, a pooling unit, an online coding unit, and an off-chip dynamic memory. The scheme of the invention achieves high-efficiency operation of the convolution computing array with very little storage resource, and guarantees high reuse of input activations and weight data together with load balancing and high utilization of the computing array. Through static configuration, the computing array simultaneously supports parallel scheduling at two levels: between rows and columns of convolutions of different sizes and scales, and between different feature maps. It therefore has good applicability and scalability.

Description

Load-balanced sparse convolutional neural network accelerator and acceleration method
Technical field
The present invention relates to a load-balanced sparse convolutional neural network accelerator and its acceleration method, belonging to the technical field of deep learning algorithms.
Background technique
In recent years, deep learning algorithms have been widely applied with excellent results in computer vision, natural language processing, and other fields, and the convolutional neural network (CNN) is one of the most important such algorithms. Higher accuracy of a CNN model usually means a deeper network, more parameters, and a larger amount of computation, and about 90% of that computation is concentrated in the convolutional layers. To run convolutional neural networks efficiently in embedded systems, optimizing the energy-efficiency ratio of the convolution operation is therefore imperative.
Convolutional-layer computation in CNNs has two main characteristics. First, the data volume is large: the feature maps and weight data required by the convolution are of large scale, so sparsifying them and storing them in compressed form saves data storage and makes maximal use of the data transfer bandwidth. Second, the data flow and control flow of the computation are complex: the convolution must process multiple channels of multiple kernels simultaneously according to the convolution dimension information, while keeping the computation pipelined.
Because the nonzero elements of a sparsified convolutional neural network are irregularly distributed, invalid computations arise during processing, leaving a high proportion of the computing resources idle.
Summary of the invention
In view of the above problems of the prior art, the present invention aims to provide a highly efficient, load-balanced sparse convolutional neural network accelerator that achieves high reuse of weight and activation data, a small data transfer volume, scalable and high parallelism, and low demand for hardware storage and DSP resources. A further object of the present invention is to provide an acceleration method using the accelerator.
The technical solution adopted by the accelerator of the present invention is as follows:
A load-balanced sparse convolutional neural network accelerator, comprising: a master controller, which controls the control-signal flow and data flow of the convolution computation and processes and saves data; a data distribution module, which distributes weight data to the computing array according to the block partition scheme of the convolution computation; a computing array for convolution, which performs the multiply-accumulate operations of the sparse convolution and outputs partial-sum results; an output result cache module, which accumulates and caches the partial-sum results of the computing array, organizes them into a unified format, and outputs the feature map results awaiting activation and pooling; a linear activation function unit, which applies the bias and the activation function to the accumulated partial sums; a pooling unit, which pools the results processed by the activation function; an online coding unit, which encodes online the activation values still needed by subsequent convolutional layers; and an off-chip dynamic memory, which stores the raw image data, the intermediate results of the computing array, and the final output feature maps.
The acceleration method of the load-balanced sparse convolutional neural network accelerator of the present invention comprises the following steps:
1) Prune the weight data of the convolutional neural network model: group the data according to the scale parameters of the weight data, then apply an identical pruning pattern to each group of weight data to sparsify it while preserving the overall model accuracy;
2) Formulate a load-balanced sparse convolution mapping scheme, and map the sparsified convolutional neural network onto the accelerator's convolution computing array;
3) Reconfigure the computing array and storage array according to the configuration information of the mapping scheme, so that the convolution computation proceeds in a pipelined fashion;
4) The master controller directs the data distribution module to distribute the weight data and activation data; the computing array performs the computation and outputs convolution partial-sum results;
5) Accumulate the convolution partial sums and apply the linear correction, i.e., the bias and the activation function;
6) Perform pooling with the kernel size and stride required by the current convolutional layer;
7) Determine whether the current convolutional layer is the last layer; if not, perform online encoding and send the encoded activation results to the next convolutional layer; if so, write the results to the off-chip dynamic memory, completing the acceleration of the convolutional neural network. (A minimal sketch of this flow follows.)
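To make the flow concrete, the following is a minimal Python sketch of steps 4) through 7) under simplifying assumptions: dense numpy tensors stand in for the hardware datapath, pooling is max pooling with kernel size equal to stride, and the pruning, mapping, and online CSR encoding of steps 1) through 3) are elided. All function names here (`conv_layer`, `bias_relu`, `max_pool`, `accelerate`) are ours, not the patent's.

```python
import numpy as np

def conv_layer(x, w):
    # Step 4: valid convolution; x is (C, H, W), w is (N, C, R, R)
    n, c, r, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((n, h - r + 1, wd - r + 1))
    for i in range(h - r + 1):
        for j in range(wd - r + 1):
            out[:, i, j] = np.tensordot(w, x[:, i:i + r, j:j + r], axes=3)
    return out

def bias_relu(x, bias):
    # Step 5: accumulate the bias, then apply the activation function
    return np.maximum(x + bias, 0.0)

def max_pool(x, s=2):
    # Step 6: max pooling, kernel size assumed equal to stride s
    c, h, w = x.shape
    x = x[:, :h - h % s, :w - w % s]
    return x.reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

def accelerate(feature_map, layers):
    # Steps 4-7, layer by layer
    for i, (weights, bias) in enumerate(layers):
        partial = conv_layer(feature_map, weights)    # step 4
        pooled = max_pool(bias_relu(partial, bias))   # steps 5 and 6
        if i < len(layers) - 1:
            feature_map = pooled   # step 7: would be CSR-encoded online
        else:
            return pooled          # step 7: written to off-chip memory

rng = np.random.default_rng(0)
net = [(rng.standard_normal((4, 3, 3, 3)), rng.standard_normal((4, 1, 1))),
       (rng.standard_normal((8, 4, 3, 3)), rng.standard_normal((8, 1, 1)))]
print(accelerate(rng.standard_normal((3, 16, 16)), net).shape)   # (8, 2, 2)
```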
Compared with the prior art, the present invention has the following advantages:
The load-balanced sparse convolutional neural network accelerator and acceleration method provided by the invention make maximal use of the sparsity of the convolution data. They achieve high-efficiency operation of the convolution computing array with very little storage resource, guarantee high reuse of input activations and weight data, and ensure load balancing and high utilization of the computing array. Through static configuration, the computing array simultaneously supports parallel scheduling at two levels: between rows and columns of convolutions of different sizes and scales, and between different feature maps. It therefore has good applicability and scalability, and meets the current demand for running convolutional neural networks in embedded systems at low power with a high energy-efficiency ratio.
Detailed description of the invention
Fig. 1 is a schematic diagram of the load-balanced sparse convolution network acceleration method.
Fig. 2 is a schematic diagram of the weight pruning pattern.
Fig. 3 is a schematic diagram of the overall structure of the hardware accelerator.
Fig. 4 is a schematic diagram of the convolution mapping scheme.
Fig. 5 is a schematic diagram of the convolution computation within a PE group.
Fig. 6 is a schematic diagram of the load balancing and shared storage realized by the PE array.
Specific embodiments
The present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 shows the flow of the load-balanced sparse convolution computation method. First, the weight data of the convolutional neural network model are pruned: the data are grouped according to the scale parameters of the weight data, and an identical pruning pattern is then applied to each group to sparsify it while preserving the overall model accuracy. Next, a load-balanced sparse convolution mapping scheme is formulated from the dimensions of the input feature maps and convolution kernels, and the sparsified network is mapped onto the PE (processing element) array of the hardware accelerator. The accelerator then reconfigures the PE array and the storage array according to the configuration information of the mapping scheme, so that the convolution proceeds in a pipelined fashion. The master controller of the accelerator directs the distribution of weight data and activation data; the PE array performs the computation and outputs convolution partial sums. The linear correction unit accumulates the partial sums and applies the linear correction, i.e., the bias and the activation function. The pooling unit performs pooling with the kernel size and stride required by the current convolutional layer, selecting either max pooling or average pooling. Finally, the accelerator determines whether the current convolutional layer is the last layer; if not, it performs online encoding and sends the encoded activation results to the next convolutional layer; if so, it writes the results to the off-chip storage, completing the acceleration of the whole convolution.
The load-balanced sparse convolution mapping scheme comprises the convolution mapping mode, the PE array grouping scheme, the distribution and reuse scheme for input feature maps and weight data, and the PE array parallel scheduling mechanism.
Convolution mapping mode: the input feature map is transformed into a matrix along the row (or column) dimension, and the weight data are unrolled into a vector along the output-channel dimension, so that the convolution is converted into a matrix-vector multiplication. A sparse matrix-vector multiplication unit designed this way can skip the zeros in both the input feature map and the weight data, guaranteeing high efficiency of the overall computation. (A sketch of this mapping appears below.)
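The dense version of this mapping can be sketched in a few lines of Python; this illustrates only the transformation (the names `im2col_rows` and `conv_as_matvec` are ours), with the zero-skipping shown separately further below.

```python
import numpy as np

def im2col_rows(x, r):
    # Unroll feature map x (C, H, W) into a matrix with one flattened
    # r*r*C patch per output position: the shared activation matrix.
    c, h, w = x.shape
    return np.stack([x[:, i:i + r, j:j + r].ravel()
                     for i in range(h - r + 1)
                     for j in range(w - r + 1)])

def conv_as_matvec(x, kernels):
    # kernels (N, C, R, R): each kernel unrolls into one weight vector,
    # turning the convolution into N matrix-vector products.
    n, _, r, _ = kernels.shape
    a = im2col_rows(x, r)
    h_out, w_out = x.shape[1] - r + 1, x.shape[2] - r + 1
    return np.stack([(a @ kernels[k].ravel()).reshape(h_out, w_out)
                     for k in range(n)])
```

Each PE is handed one such weight vector while the activation matrix is shared, which is what makes the input feature map reusable across all PEs of a group.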
PE array grouping scheme: grouping is configured statically by the master controller according to the dimension parameters of each convolutional layer. When the number of PEs exceeds the total number of three-dimensional convolution kernels, one group can compute all output feature map channels; on this basis, the remaining PEs are grouped into equal-sized groups responsible for computing different rows of the output feature map. When the number of PEs is smaller than the total number of three-dimensional kernels, one group computes the largest divisor of the output feature map channel count. Grouping this way keeps the computation speeds of the PEs matched and the idle rate of the PE array low.
Distribution and reuse of input feature maps and weight data: the entire PE array is fed from one shared on-chip memory that simultaneously distributes identical activation data as the matrix required by the computation, while the data distribution module distributes the weight data each PE needs as the vector required by the computation, according to the control information of the block-partitioned computation. Reuse of the input feature map lies mainly in its simultaneous use by different PEs; reuse of the weights lies mainly in sharing weight data across different groups and in each PE reusing its weight data across successive matrices without redistribution.
PE array parallel scheduling mechanism: during computation, the PE array determines from the size information of the convolutional layer's output feature maps whether the different groups complete different rows (or columns) of the same output feature map or complete different output feature maps. This guarantees that the PE array can schedule work in parallel at two levels: intra-layer parallelism within a single feature map, and simultaneous parallelism across different feature maps.
The load-balanced sparse convolutional neural network acceleration scheme of this embodiment comprises a software part and a hardware part. Fig. 2 illustrates the pruning strategy of the software part. The pruning strategy is as follows: the connections of the initially dense neural network are grouped according to the connection count and neuron count of the network, and every group is pruned with an identical pattern at identical positions; that is, the neurons of each convolution kernel group share the same connection pattern and differ only in the weight values of the connections. Take an input feature map of size W*W*C (W is the width and height of the feature map, C the number of input channels) and kernels of size R*R*C*N (R is the width and height of a kernel, C the number of kernel channels, N the number of kernels, i.e., the number of output channels). During pruning, the R*R*C kernels are first treated as one kernel group, N in total, and the positions of the zero elements are identical across the kernels in the group. If the post-pruning accuracy does not meet the model's requirement, the kernel group size can be adjusted, pruning with groups of R*R*C*N1 (N1 a divisor of N). (A sketch of this grouped pruning appears below.)
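A minimal magnitude-based sketch of such grouped pruning follows. The patent fixes only the constraint (identical zero positions within a kernel group, group size adjustable to a divisor N1 of N), not the scoring rule, so the summed-magnitude criterion and the name `prune_grouped` are our assumptions.

```python
import numpy as np

def prune_grouped(kernels, sparsity=0.5, group_size=None):
    # kernels: (N, C, R, R). Every kernel inside a group receives the SAME
    # zero mask, derived here from the group's summed weight magnitudes.
    n = kernels.shape[0]
    g = group_size or n                    # default: one group of all N kernels
    assert n % g == 0, "group_size must be a divisor N1 of N"
    pruned = kernels.copy()
    for s in range(0, n, g):
        group = pruned[s:s + g]
        score = np.abs(group).sum(axis=0)  # shared importance map (C, R, R)
        k = int(score.size * sparsity)
        mask = score >= np.partition(score.ravel(), k)[k]
        group *= mask                      # identical zero positions group-wide
    return pruned
```

If the pruned model misses the accuracy target, the call is repeated with a smaller `group_size`, trading storage regularity for pattern flexibility.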
Fig. 3 shows the structure of the hardware sparse convolutional neural network accelerator. The overall structure mainly comprises: a master controller, which receives instructions from the host CPU and generates the control-signal flow and data flow that govern the convolution computation; a data distribution module, which distributes weight data to the PEs according to the block partition scheme of the convolution; a PE (processing element) array for convolution, which is grouped according to the master controller's configuration information, performs the multiply-accumulate operations of the sparse convolution, and outputs convolution results or partial sums; an output result cache module, which accumulates and caches the PEs' partial sums and organizes them into a unified format before sending them to the subsequent units; a linear activation function unit, which applies the bias and the activation function to the convolution results; a pooling unit, which performs the max pooling of the results; an online coding unit, which applies online CSR (compressed sparse row) encoding to the intermediate results so that the output meets the data format required by subsequent convolutional layers; and an off-chip dynamic memory (DDR4), which stores the raw image data, inter-layer intermediate results, and the final output of the convolutional layers. (A sketch of the CSR encoding appears below.)
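For reference, online CSR encoding amounts to emitting nonzero values, their column indices, and row pointers as the activations stream out. A compact Python sketch (our naming) is:

```python
import numpy as np

def csr_encode(mat):
    # Compressed sparse row: nonzero values, their column indices,
    # and per-row pointers into the value array.
    values, cols, row_ptr = [], [], [0]
    for row in mat:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        cols.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(cols), np.array(row_ptr)

def csr_decode_row(values, cols, row_ptr, i, width):
    # What the activation decompression module does before feeding a PE.
    row = np.zeros(width)
    lo, hi = row_ptr[i], row_ptr[i + 1]
    row[cols[lo:hi]] = values[lo:hi]
    return row

a = np.array([[0., 3., 0., 1.], [2., 0., 0., 0.]])
v, c, p = csr_encode(a)
assert np.allclose(csr_decode_row(v, c, p, 0, 4), a[0])
```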
The data distribution module comprises a fetch-address calculation unit, a configurable on-chip memory bank, and a FIFO group for caching and converting the data format. According to the configuration information sent by the master controller, the fetch-address calculation unit determines the access pattern to the off-chip dynamic memory (DDR4); the fetched data are cached via the AXI4 interface into the on-chip weight memory bank, format-converted, and distributed into the corresponding FIFOs, ready to be sent for computation.
The PE array for convolution comprises multiple matrix-vector multiplication units. Following the static configuration information, it performs intra-layer or inter-layer parallel convolution of the feature maps and outputs the partial-sum results of the convolution. The PEs share a common on-chip memory; thanks to the co-design of the pruning strategy and the hardware architecture, the PEs achieve zero-skipping acceleration during sparse convolution and matched computation speed across PEs while using very little storage resource.
Each matrix-vector multiplication unit comprises a pipeline controller module, a weight non-zero detection module, a pointer control module, an activation decompression module, an MLA (multiply-accumulate) unit module, and the shared on-chip memory. The weight non-zero detection module performs non-zero detection on the weight data sent by the data distribution module and transmits only the nonzero values and their position information to the PE unit. The pointer control module and the activation decompression module fetch, from the shared on-chip memory, the activation values that each nonzero weight value needs for its computation, and send them simultaneously to each PE unit. The MLA unit is responsible for the multiplications and additions of the matrix-vector product. (A sketch of the zero-skipping product appears below.)
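Functionally, one PE's zero-skipping product can be modeled as below (a behavioral sketch, not the hardware): non-zero detection keeps only the nonzero weights and their positions, pointer control gathers the matching activation columns, and the MLA unit accumulates over nonzeros only.

```python
import numpy as np

def pe_sparse_matvec(act_matrix, weight_vec):
    nz = np.nonzero(weight_vec)[0]      # weight non-zero detection
    gathered = act_matrix[:, nz]        # pointer-controlled activation fetch
    return gathered @ weight_vec[nz]    # multiply-accumulate over nonzeros only

rng = np.random.default_rng(1)
a = rng.standard_normal((2, 12))
w = rng.standard_normal(12)
w[[1, 4, 7, 9]] = 0.0                   # zero pattern shared by the whole PE group
assert np.allclose(pe_sparse_matvec(a, w), a @ w)
```

Because pruning gives every weight vector in a group the same `nz` positions, the same gathered activations serve every PE and all PEs take the same number of multiply-accumulate cycles, which is exactly the load-balancing property that Fig. 6 relies on.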
Fig. 4 illustrates the convolution mapping scheme, again for an input feature map of W*W*C (W the feature map width and height, C the number of input channels) and kernels of R*R*C*N (R the kernel width and height, C the number of kernel channels, N the number of kernels, i.e., output channels), with F the output feature map size. The number of PE units per PE group, Num_PE, is first determined from N: if the total number of PEs exceeds N, Num_PE can be set equal to N, so that one batch of computation in each group directly yields the results for all channels of the output feature map; otherwise Num_PE is set to a divisor M of N, and an integer number of batches yields part of the output channels, guaranteeing that no PE sits idle. The number of PE groups, Group_PE, is determined by the total number of PEs and Num_PE. If one group can complete all output channels, the different groups are responsible for different rows of the output feature map, i.e., the PE-group division of labor labeled 2 in the figure. (A sketch of this grouping calculation appears below.)
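The grouping arithmetic reduces to a few lines; this sketch (the name `plan_pe_groups` is ours) follows the rule just described:

```python
def plan_pe_groups(total_pe, n):
    # Num_PE: all N output channels if enough PEs exist, otherwise the
    # largest divisor M of N that fits, so no PE in a group sits idle.
    if total_pe >= n:
        num_pe = n
    else:
        num_pe = max(d for d in range(1, total_pe + 1) if n % d == 0)
    group_pe = total_pe // num_pe   # groups then split rows or whole feature maps
    return num_pe, group_pe

print(plan_pe_groups(16, 4))   # (4, 4): 4 groups, each computing different rows
print(plan_pe_groups(6, 8))    # (4, 1): one group covers 4 of the 8 output channels
```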
For one complete convolutional layer, a PE group consists of Num_PE PE units (i.e., matrix-vector multiplication units), each responsible for outputting several rows of one channel of the output feature map; the first pass of computation outputs several rows starting from the first row, the exact row count being determined by the matrix size of the matrix-vector product. The matrix in the product corresponds to the shared activation data stored in the local on-chip memory, and the vector corresponds to the weight data sent by the data distribution module. The other PE groups may compute subsequent rows of the same output feature map (as indicated in the figure), or perform the convolution of other input feature maps; the two different parallel modes of operation, intra-layer row-column parallelism and different-feature-map parallelism, are thus both supported.
Fig. 5 illustrates the convolution computation within a PE group; different numerical values denote the values at different positions of the input feature map and of the different convolution kernels. The example uses a matrix-vector product of a 2*12 matrix with a 12*1 vector, so each PE computation outputs a 2*1 vector. In the first pass, PE1's vector corresponds to the three channels (12*1) of kernel 1, and its matrix corresponds to the three channels of positions (1,2,4,5) and (2,3,5,6) in the activation image; after the multiply-accumulate, the output is the first two rows of the first column of the first channel of the output feature map. The matrix is then updated first, taking the activation values at positions (4,5,7,8) and (5,6,8,9), and the output becomes the first two rows of the second column of the first channel. After all columns of the corresponding rows have been output, the weight data corresponding to the vector are updated, and the output of the third channel follows. PE2 correspondingly computes the second channel of the output feature map and, after its weight update, the fourth output channel. (A small numeric sketch of this tiling appears below.)
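The walkthrough can be reproduced numerically. The sketch below assumes that Fig. 5 numbers the image positions down the columns (so position 2 sits directly under position 1) and uses an all-ones kernel purely for illustration; both are our assumptions.

```python
import numpy as np

# 3x3 image positions 1..9 numbered down the columns, copied to 3 channels.
pos = np.arange(1, 10, dtype=float).reshape(3, 3, order="F")
act = np.stack([pos, pos, pos])          # (C=3, 3, 3)
vec = np.ones(12)                        # one 2x2x3 kernel as a 12x1 vector

def patch(i, j):                         # flattened 2x2x3 window
    return act[:, i:i + 2, j:j + 2].ravel()

m = np.stack([patch(0, 0), patch(1, 0)])  # rows use positions {1,2,4,5}, {2,3,5,6}
print(m @ vec)   # first two rows of output column 0: [36. 48.]

m = np.stack([patch(0, 1), patch(1, 1)])  # updated matrix: {4,5,7,8}, {5,6,8,9}
print(m @ vec)   # first two rows of output column 1: [72. 84.]
```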
Fig. 6 illustrates how load balancing and shared storage are realized in the PE array. The shared on-chip memory of the PE array stores the nonzero values of the input activations in CSR (compressed sparse row) format together with their index pointers; the activations corresponding to the positions of the nonzero weight values sent by the data distribution module are fetched for the multiply-accumulate. Because the software pruning strategy makes the nonzero-element positions of all weight vectors within a PE group identical, the activation values required by each PE are also identical: a very small memory need hold only one copy of the activations, which is decoded and broadcast to the PEs simultaneously to satisfy the matrix requirements of the whole PE array. Moreover, for all PEs the nonzero positions of both the matrix and the vector in the matrix-vector product are identical, so the computation speeds of the PEs match, achieving the design goal of a low-storage, load-balanced computing array. At the same time, different PE groups can share the distributed weight data, realizing high reuse of both activations and weights.
In summary, the acceleration method for sparse convolutional neural networks proposed in this embodiment of the invention effectively saves storage hardware resources, improves the reuse of input feature maps and weights, and achieves load balancing of the PE array. Static configuration of the PE array satisfies different parallel computation requirements and guarantees high utilization of the array, improving the data throughput of the whole system and reaching a very high energy-efficiency ratio, making it suitable for low-power embedded systems.

Claims (8)

1. A load-balanced sparse convolutional neural network accelerator, characterized by comprising:
a master controller, for controlling the control-signal flow and data flow of the convolution computation, and for processing and saving data;
a data distribution module, for distributing weight data to the computing array according to the block partition scheme of the convolution computation;
a computing array for convolution, for performing the multiply-accumulate operations of the sparse convolution and outputting partial-sum results;
an output result cache module, for accumulating and caching the partial-sum results of the computing array, organizing them into a unified format, and outputting the feature map results awaiting activation and pooling;
a linear activation function unit, for applying the bias and the activation function to the accumulated partial sums;
a pooling unit, for pooling the results processed by the activation function;
an online coding unit, for online encoding of the activation values still needed by subsequent convolutional layer computation;
an off-chip dynamic memory, for storing the raw image data, the intermediate results of the computing array, and the final output feature maps.
2. The load-balanced sparse convolutional neural network accelerator according to claim 1, characterized in that the computing array for convolution comprises matrix-vector multiplication units, each comprising a pipeline controller module, a weight non-zero detection module, a pointer control module, an activation decompression module, an MLA (multiply-accumulate) unit module, and a shared on-chip memory; the weight non-zero detection module performs non-zero detection on the weight data sent by the data distribution module and transmits only the nonzero values and their position information to the computing units; the pointer control module and the activation decompression module fetch, from the shared on-chip memory, the activation values needed by each nonzero weight value for its computation and send them simultaneously to each computing unit; and the MLA unit performs the multiplications and additions of the matrix-vector product.
3. An acceleration method of a load-balanced sparse convolutional neural network accelerator, characterized by comprising the following steps:
1) pruning the weight data of the convolutional neural network model: grouping the data according to the scale parameters of the weight data, then applying an identical pruning pattern to each group of weight data to sparsify it while preserving the overall model accuracy;
2) formulating a load-balanced sparse convolution mapping scheme, and mapping the sparsified convolutional neural network onto the convolution computing array of the accelerator;
3) reconfiguring, by the accelerator, the computing array and the storage array according to the configuration information of the mapping scheme, so that the convolution computation proceeds in a pipelined fashion;
4) controlling, by the master controller, the data distribution module to complete the distribution of the weight data and activation data; the computing array performs the computation and outputs convolution partial-sum results;
5) accumulating the convolution partial-sum results and applying the linear correction, i.e., the bias and the activation function;
6) performing pooling with the kernel size and stride required by the current convolutional layer;
7) determining whether the current convolutional layer is the last layer; if not, performing online encoding and sending the encoded activation results to the next convolutional layer; if so, outputting to the off-chip dynamic memory, completing the acceleration of the convolutional neural network.
4. The acceleration method of the load-balanced sparse convolutional neural network accelerator according to claim 3, characterized in that in step 2), the sparse convolution mapping scheme comprises the convolution mapping mode, the computing array grouping scheme, the distribution and reuse scheme for input feature maps and weight data, and the computing array parallel scheduling mechanism.
5. The acceleration method of the load-balanced sparse convolutional neural network accelerator according to claim 4, characterized in that the convolution mapping mode is specifically: the input feature map is transformed into a matrix along the row or column dimension, and the weight data are unrolled into a vector along the output-channel dimension, so that the convolution is converted into a matrix-vector multiplication.
6. The acceleration method of the load-balanced sparse convolutional neural network accelerator according to claim 4, characterized in that the computing array grouping scheme is specifically: grouping is configured statically by the master controller according to the dimension parameters of each convolutional layer; when the number of computing units exceeds the total number of three-dimensional convolution kernels, one group of the array computes all output feature map channels, and on this basis the remaining computing units are grouped into equal-sized groups responsible for computing different rows of the output feature map; when the number of computing units is smaller than the total number of three-dimensional kernels, one group of the array computes the largest divisor of the output feature map channel count.
7. The acceleration method of the load-balanced sparse convolutional neural network accelerator according to claim 4, characterized in that the distribution and reuse scheme for input feature maps and weight data is specifically: the entire computing array is fed from one shared on-chip memory that simultaneously distributes identical activation data as the matrix required by the computation, and the data distribution module distributes the weight data required by each computing unit as the vector required by the computation, according to the control information of the block-partitioned computation.
8. The acceleration method of the load-balanced sparse convolutional neural network accelerator according to claim 4, characterized in that the computing array parallel scheduling mechanism is specifically: during computation, the computing array determines from the dimension information of the convolutional layer's output feature maps whether the different groups complete different rows or columns of the same output feature map or complete different output feature maps.
CN201910259591.7A 2019-04-02 2019-04-02 Load-balanced sparse convolutional neural network accelerator and acceleration method Pending CN109993297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910259591.7A (CN109993297A) 2019-04-02 2019-04-02 Load-balanced sparse convolutional neural network accelerator and acceleration method


Publications (1)

Publication Number Publication Date
CN109993297A 2019-07-09

Family

ID=67132262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910259591.7A Pending CN109993297A (en) Load-balanced sparse convolutional neural network accelerator and acceleration method

Country Status (1)

Country Link
CN (1) CN109993297A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190709