CN114723029A - DCNN accelerator based on hybrid multi-row data flow strategy - Google Patents

DCNN accelerator based on hybrid multi-row data flow strategy

Info

Publication number
CN114723029A
CN114723029A (Application CN202210482658.5A)
Authority
CN
China
Prior art keywords
data
processing module
convolution
input
convolution processing
Prior art date
Legal status
Pending
Application number
CN202210482658.5A
Other languages
Chinese (zh)
Inventor
黄以华
罗聪慧
黄文津
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210482658.5A
Publication of CN114723029A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a DCNN accelerator based on a hybrid multi-row data flow strategy, formed by stacking a plurality of convolution processing modules. Each convolution processing module comprises a plurality of parallel computing unit arrays, computing buffers, and data buffers. Data transmission between adjacent convolution processing modules is in units of rows; row data are stored in the data buffers, and data read from the data buffers in sequence are rearranged and sent to the computing buffers for use by the computing unit arrays. Each computing unit array is responsible for computing a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM. Off-chip bandwidth usage can be tuned by adjusting the parallelism of the computing unit arrays of each convolution processing module, which solves the problem that existing layer-by-layer pipelined accelerators cannot optimize off-chip bandwidth.

Description

DCNN accelerator based on hybrid multi-row data flow strategy
Technical Field
The invention relates to the technical field of electronic information and deep learning, in particular to a DCNN accelerator based on a hybrid multi-row data flow strategy.
Background
In the wave of artificial intelligence development in recent years, deep convolutional neural networks (DCNNs) have shown performance superior to traditional algorithms in fields such as object detection, semantic segmentation, face recognition, speech recognition, and computer-aided medical diagnosis. DCNNs have therefore received very wide attention and research.
Because it can fully exploit the inter-layer and intra-layer parallelism of the DCNN model, the layer-by-layer pipelined system architecture is widely used in FPGA-based DCNN accelerators. In a layer-by-layer pipelined system architecture, the computation paradigm of the convolutional-layer computation tasks (row by row, layer by layer) determines the number of times the weight data are read from off-chip DRAM, which in turn determines the off-chip bandwidth of the accelerator. However, existing layer-by-layer pipelined architectures all use a fixed computation paradigm, so the throughput of the accelerator is limited by the off-chip bandwidth and it is difficult to reduce off-chip bandwidth usage through on-chip storage.
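As a rough, hedged illustration of why the computation paradigm fixes off-chip weight traffic, the Python sketch below uses hypothetical layer sizes and assumes (not stated in this patent) that the full weight tensor of a layer is streamed from DRAM once per computation pass, with each pass producing rows_per_pass output rows; row-by-row processing corresponds to rows_per_pass = 1, and a multi-row paradigm divides the re-read count accordingly.

    # Rough estimate of off-chip weight traffic for one convolution layer.
    # Assumption (illustrative only): weights are re-read from DRAM once per pass,
    # and each pass produces `rows_per_pass` rows of the output feature map.
    def weight_dram_traffic(n_out, c_in, k, h_out, rows_per_pass, bytes_per_weight=2):
        weight_bytes = n_out * c_in * k * k * bytes_per_weight   # size of one layer's weights
        passes = -(-h_out // rows_per_pass)                      # ceil(h_out / rows_per_pass)
        return passes * weight_bytes                             # total bytes read from DRAM

    # Hypothetical layer: 256 output channels, 128 input channels, 3x3 kernel, 56x56 output.
    for r in (1, 2, 4, 8):
        mb = weight_dram_traffic(256, 128, 3, 56, r) / 2**20
        print(f"rows per pass = {r}: ~{mb:.1f} MiB of weight reads")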
A convolution computation apparatus based on hybrid parallelism is disclosed in the prior art. The apparatus comprises: an input module configured to acquire input convolution data and corresponding parameters, judge the convolution shape from the input convolution data, and extract the feature map size, convolution kernel size, and channel number of the input convolution data; a simulation module configured to obtain the parallelism corresponding to the input convolution data from the data features extracted by the input module, the data features including the convolution shape and parameters; an on-chip processor including a plurality of parallel processing modules; a grouping control module connected to each processing module and configured to divide all processing modules on the on-chip processor into G groups according to the parallelism, where G equals the parallelism and each group contains the same number of processing modules; and a mapping module connected to each processing module and configured to control the data and parameters input to each processing module according to the parallelism, the input convolution data, and the corresponding parameters. Processing modules in the same group receive the same parameters but different data, while processing modules in different groups receive different parameters; each processing module completes the convolution acceleration according to the input data and parameters and outputs a result. This scheme also has difficulty reducing off-chip bandwidth usage through on-chip storage.
Disclosure of Invention
The invention provides a DCNN (deep convolutional neural network) accelerator based on a hybrid multi-row data flow strategy, which solves the problem that existing layer-by-layer pipelined DCNN accelerators cannot optimize off-chip bandwidth.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a DCNN accelerator based on a hybrid multi-row data stream policy, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
Preferably, a pooling processing module is connected between adjacent convolution processing modules; the pooling processing module processes a pooling layer part of the deep convolutional neural network.
Preferably, the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j

In the formula, Ps_j is the stride of the pooling processing module after the j-th convolution processing module and S_j is the stride corresponding to the input feature map of the j-th convolution processing module; the convolution kernel size is N_i × C_i × K_i × K_i and the padding is pad_i.
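The sketch below is an illustration of this timing relation rather than the patent's own formula image: it derives the row interval ΔT_i layer by layer from the stated behaviour that the i-th module emits its output rows every S_i·ΔT_i cycles and that a pooling stage with stride Ps_i keeps only every Ps_i-th row, so ΔT_{i+1} = S_i·Ps_i·ΔT_i. The strides and ΔT_1 used here are hypothetical.

    # Hedged reconstruction of the row-interval recursion implied by the text:
    # one output row of layer i appears every S_i * dT_i cycles, and pooling with
    # stride Ps_i keeps every Ps_i-th row, so dT_{i+1} = S_i * Ps_i * dT_i.
    def row_intervals(dT1, conv_strides, pool_strides):
        intervals = [dT1]
        for S, Ps in zip(conv_strides, pool_strides):
            intervals.append(S * Ps * intervals[-1])
        return intervals  # intervals[i-1] is dT_i for layer i (1-indexed)

    # Hypothetical 4-layer network: conv strides 1,1,2,1; 2x2/stride-2 pooling after layers 1 and 2.
    print(row_intervals(dT1=4, conv_strides=[1, 1, 2, 1], pool_strides=[2, 2, 1, 1]))
    # -> [4, 8, 16, 32, 32]: the clock-cycle gap between successive input rows of each layer.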
Preferably, each convolution processing module includes an input data buffer, a plurality of parallel computation buffers, a plurality of parallel computation cell arrays, and an output data buffer, where:
the input data buffer reads and stores data from the output data buffer of the preceding convolution processing module, the plurality of parallel computing buffers read data from the input data buffer, the inputs of the plurality of parallel computing unit arrays are the data in the computing buffers, and the outputs of the plurality of parallel computing unit arrays are stored in the output data buffer.
Data transmission between adjacent convolution processing modules is in units of rows; row data are stored in the data buffer, and data read from the data buffer in sequence are rearranged and sent to the computing buffers for use by the computing unit arrays. Each computing unit array is responsible for computing a single row of the output feature map, all computing unit arrays share the same weight data, and all weight data are stored in off-chip DRAM.
Preferably, each of the plurality of parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units, each computing unit is a W_{w,i}-input multiply-accumulate tree, intermediate calculation data are buffered in a dual-port RAM, and the final calculation results are buffered in a RAM which serves as the data source of the input data buffer of the next convolution processing module.
Preferably, each of said computing units performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
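A plain-Python sketch of this blocked scheme follows; the matrix sizes and block shapes are hypothetical and not taken from the patent. The large product W_r × I_rp is assembled by multiplying small sub-blocks of the two operands and accumulating the partial results.

    import numpy as np

    def blocked_matmul(W, I, rb, pb):
        """Compute W @ I by accumulating small sub-products, mimicking a compute
        unit that only performs small W_rb x I_rpb multiplications per step."""
        M, K = W.shape
        K2, N = I.shape
        assert K == K2
        out = np.zeros((M, N))
        for k0 in range(0, K, rb):          # walk the shared (reduction) dimension
            for n0 in range(0, N, pb):      # walk the output columns
                W_rb  = W[:, k0:k0 + rb]            # small weight block
                I_rpb = I[k0:k0 + rb, n0:n0 + pb]   # small input block
                out[:, n0:n0 + pb] += W_rb @ I_rpb  # accumulate the partial product
        return out

    W = np.random.rand(8, 36)    # hypothetical weight matrix W_r
    I = np.random.rand(36, 28)   # hypothetical Toeplitz column block I_rp
    assert np.allclose(blocked_matmul(W, I, rb=9, pb=7), W @ I)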
Preferably, the computation paradigm of each computing unit array is based on Toeplitz matrix multiplication: the input feature map data are converted into a Toeplitz matrix, and the input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix. In the hybrid multi-row data flow strategy, all parallel computing unit arrays share the same weight parameters and respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage.
Preferably, the computing resource usage #PE Mult of each computing unit array is:
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

In the formula, Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module, and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, which satisfies:

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design.
Preferably, the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffers by channel size, and the number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers, the stride corresponding to the output feature map is S_i, and the padding is pad_i.
Preferably, when the pooling processing module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers it contains satisfies:

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the invention, a mixed multi-row calculation paradigm (data flow strategy) is introduced into a layer-by-layer pipeline system architecture, and efficient on-chip storage and off-chip bandwidth balance can be realized through flexible data flow strategy configuration, so that the flexibility of layer-by-layer pipeline system architecture design and the theoretical throughput upper limit thereof are improved.
Drawings
Fig. 1 is a schematic diagram of an overall accelerator framework according to the present invention.
Fig. 2 is a schematic diagram of a hybrid multi-row data flow strategy.
FIG. 3 is a block diagram of a convolution processing module.
FIG. 4 is a diagram of a hardware structure of a computing unit array.
FIG. 5 is a diagram illustrating a data storage sequence of an input data buffer.
Fig. 6 is a schematic diagram of convolution calculation paradigm based on Toeplitz matrix.
Fig. 7 is a schematic diagram of the sequence of processing the weight data and Toeplitz matrix data by the computing unit array.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The present embodiment provides a DCNN accelerator based on a hybrid multi-row data flow strategy, as shown in Fig. 1, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
In Fig. 1, BottleNeck is the branch processing module used in networks such as ResNet; FPM (Fully Connected Process Module) is the fully-connected processing unit used to process the fully-connected layers of the network; CPM (Convolution Process Module) is the convolution processing module used to process the convolution computation part of the network; External Memory is the off-chip memory, i.e., memory outside the FPGA chip, represented here by the DDR on the FPGA development board.
Example 2
This example continues to disclose the following on the basis of example 1:
when the DCNN model comprises pooling layers, a pooling processing module is connected between every two convolution processing modules and processes the pooling layer part in the deep convolution neural network.
When the pooling layer parameters are PK_i = 2 and Ps_i = 2, in the pooling module each output terminal of the pooling buffer is connected to a comparator (Comp); three comparators form a 4-input comparator tree, and the output of the comparator tree is connected to the data buffer input of CCM_{i+1}. In addition, the parallelism of the comparator trees is Hin_{i+1}. The pooling module outputs data in parallel to the r_{i+1} row data buffers of CCM_{i+1}, so the output data dimension of the pooling module is r_{i+1} × Hin_{i+1}.
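A behavioural sketch of this 2x2/stride-2 pooling path follows (pure Python; the buffer widths and the CCM interface are simplified assumptions): three two-input comparators arranged as a 4-input comparator tree reduce each 2x2 window to its maximum.

    def comp(a, b):
        """One two-input comparator (max)."""
        return a if a >= b else b

    def comparator_tree4(w0, w1, w2, w3):
        """4-input comparator tree built from 3 comparators, as in the pooling module."""
        return comp(comp(w0, w1), comp(w2, w3))

    def max_pool_2x2_stride2(row_a, row_b):
        """Pool two adjacent feature-map rows into one output row (PK_i = Ps_i = 2)."""
        assert len(row_a) == len(row_b) and len(row_a) % 2 == 0
        return [comparator_tree4(row_a[c], row_a[c + 1], row_b[c], row_b[c + 1])
                for c in range(0, len(row_a), 2)]

    print(max_pool_2x2_stride2([1, 5, 2, 0], [3, 4, 7, 6]))   # -> [5, 7]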
Example 3
This example continues to disclose the following on the basis of example 1 or example 2:
assuming that the number of convolutional layers of the convolutional neural network is L, the width and height of the input feature map (Ifmap) corresponding to each convolutional layer are HiniThe number of input channels is CiThe number of output channels is NiThe width and height of the output feature map (Ofmap) are both HoutiThe size of convolution kernel corresponding to Ifmap is Ni×Ci×Ki×KiStride is SiPadding is padiWithout losing oneIn general, it is assumed that a Pooling Layer (PL) exists between each of the convolution layersi) The size of the pooling filter is PKiThe pooled stride is PsiWherein all i are more than or equal to 1 and less than or equal to L.
The layer-by-layer pipelined DCNN accelerator comprises a plurality of convolution processing modules (CCMs), the number of which equals the number L of convolution layers of the DCNN model; its overall framework is shown in Fig. 1. The number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}. The input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module (see Fig. 2), and the external data source inputs one row of input feature map data to the accelerator every ΔT_1 clock cycles. Every r_2·S_1·ΔT_1 clock cycles, the first CCM outputs r_2 rows of data to the next CCM; similarly, every r_{i+1}·S_i·ΔT_i clock cycles, CCM_i completes the computation of r_{i+1} rows of Ofmap data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j
in the PE array, each computation module (PE module) is a multiply-accumulate tree structure, see FIG. 4, in which W ish,iAnd Iw,iHeight and width of the PE array, Ww,iIn order to multiply the number of input ports of the accumulator tree, intermediate calculation data is cached in a Dual port RAM, and a final calculation result is cached in the RAM. Wherein, the RAM is used as the data source of the data buffer of the next CCM. Each PE array is provided with a PE buffer of a Ping-Pong structure, the PE buffers are used for caching input data of the PE arrays, and the data are sourced from a data buffer of a convolution processing module. The Data buffers are used for buffering output Data from the previous convolution processing module, the Data stored by each Data buffer has the same row position and column position in the input characteristic diagram and is arranged in the Data buffers according to the channel size, see figure 5, the Data stored by each Data buffer has the same row position and column position in the input characteristic diagram, and in addition, the Data buffers have the same row position and column position in the input characteristic diagram and are arranged in the Data buffers according to the channel sizeArranged in data buffers by channel size.
Further, the number of data buffers is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, and r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers.
Further, the computational paradigm of the PE array is based on Toeplitz matrix multiplication, i.e. the input feature map data must be converted into a Toeplitz matrix. The input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix; see Fig. 6 for the convolution operation based on the Toeplitz matrix. Each PE array performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
The order in which a PE array processes the weight data and the Toeplitz matrix data for a single row of output feature map data is shown in Fig. 7. In the hybrid multi-row data flow strategy, all parallel PE arrays share the same weight parameters and respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage. Fig. 7 shows an example of the operation of two parallel PE arrays.
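The NumPy sketch below illustrates this paradigm with hypothetical tensor sizes (no padding, and a simple unit-stride layer): the input feature map is unrolled into a Toeplitz (im2col) matrix one output row at a time, and two "PE arrays" reuse the same weight matrix while each multiplies a different column matrix, i.e. each produces a different output row.

    import numpy as np

    def toeplitz_row(ifmap, out_row, K, S):
        """im2col for one output row: each column holds one K x K x C receptive field."""
        C, H, W = ifmap.shape
        W_out = (W - K) // S + 1
        cols = [ifmap[:, out_row * S:out_row * S + K, c * S:c * S + K].reshape(-1)
                for c in range(W_out)]
        return np.stack(cols, axis=1)            # shape (C*K*K, W_out)

    C, H, W, K, S, N = 3, 8, 8, 3, 1, 4          # hypothetical layer
    ifmap   = np.random.rand(C, H, W)
    weights = np.random.rand(N, C * K * K)       # weight matrix W_r, shared by all PE arrays

    # Two parallel PE arrays, same weights, different column matrices (different output rows).
    out_row0 = weights @ toeplitz_row(ifmap, 0, K, S)   # PE array 0 -> output row 0
    out_row1 = weights @ toeplitz_row(ifmap, 1, K, S)   # PE array 1 -> output row 1
    print(out_row0.shape, out_row1.shape)               # each (4, 6): N channels x W_out columns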
Further, each PE array has a computational resource usage of
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

Here ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, and it satisfies

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design, IOP is the total computation of the DCNN model executed by the accelerator, and freq is the operating frequency of the accelerator.
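A hedged back-of-the-envelope check of this constraint; how the quantities combine below is an assumption for illustration, not the patent's formula image. If one inference streams in as Hin_1 rows at ΔT_1-cycle intervals, the achieved throughput is roughly IOP·freq / (Hin_1·ΔT_1), so ΔT_1 must be small enough for this value to stay above TRP_obj.

    def achieved_throughput(IOP, freq_hz, Hin1, dT1):
        """Rough throughput in ops/s, assuming one inference = Hin1 input rows
        streamed at dT1-cycle intervals (an assumption, not the patent's exact formula)."""
        seconds_per_inference = Hin1 * dT1 / freq_hz
        return IOP / seconds_per_inference

    # Hypothetical numbers: 3.8 GOP per inference, 200 MHz clock, 224 input rows, 1 TOP/s target.
    IOP, freq, Hin1, TRP_obj = 3.8e9, 200e6, 224, 1.0e12
    for dT1 in (1000, 3000, 5000):
        trp = achieved_throughput(IOP, freq, Hin1, dT1)
        print(f"dT1={dT1}: ~{trp/1e12:.2f} TOP/s ({'meets' if trp >= TRP_obj else 'misses'} TRP_obj)")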
Furthermore, when the pooling module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers it contains satisfies

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
Further, the implementation of the fully-connected processing unit includes a PE array and a data access system. Specifically, the PE array consists of batch PEs, where batch is the number of inference tasks processed in parallel. All PEs share the same weight data but process different Ifmap data, i.e. each PE is responsible for processing the Ifmap data of a different inference task. The hardware structure of each PE is a MACC_f-input multiply-accumulate tree, so each PE can perform MACC_f MACC operations per clock cycle.
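A behavioural sketch of this batching scheme (NumPy; the batch, MACC_f, and layer sizes below are hypothetical): every PE applies the same weight matrix, but each PE works on the Ifmap vector of a different inference task, consuming MACC_f input pairs per "cycle".

    import numpy as np

    def fc_pe(weights_row, ifmap_vec, MACC_f=8):
        """One PE: a MACC_f-input multiply-accumulate tree sweeping a dot product."""
        acc = 0.0
        for k in range(0, len(ifmap_vec), MACC_f):
            acc += float(np.dot(weights_row[k:k + MACC_f], ifmap_vec[k:k + MACC_f]))
        return acc

    def fc_layer_batched(weights, ifmaps, MACC_f=8):
        """len(ifmaps) PEs share the same weights; PE b handles inference task b."""
        return np.array([[fc_pe(w_row, ifmaps[b], MACC_f) for w_row in weights]
                         for b in range(len(ifmaps))])

    batch, in_dim, out_dim = 4, 64, 10             # hypothetical sizes
    weights = np.random.rand(out_dim, in_dim)      # shared weight matrix
    ifmaps  = np.random.rand(batch, in_dim)        # one Ifmap vector per inference task
    assert np.allclose(fc_layer_batched(weights, ifmaps), ifmaps @ weights.T)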
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. This need not be, nor should it be exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A DCNN accelerator based on a hybrid multi-row data flow strategy, comprising a convolution processing unit and a fully-connected processing unit, wherein:
the convolution processing unit is responsible for processing the convolution computation part of the DCNN model and comprises a plurality of sequentially connected convolution processing modules, a bypass convolution processing module, and a branch processing module; the number of convolution processing modules equals the number L of convolution layers of the DCNN model; the number of input row data of each convolution processing module is r_i and the number of output row data is r_{i+1}, and the input row data quantity of a convolution processing module equals the output row data quantity of the preceding convolution processing module; the input row data quantity of the bypass convolution processing module equals the output row data quantity of the first convolution processing module, the output of the bypass convolution processing module is the input of the branch processing module, the output of the branch processing module is the input of the last convolution processing module, and the branch processing module processes the branch parts of the deep convolutional neural network;
and the output row data of the last convolution processing module are output to the fully-connected processing unit, which processes the fully-connected layer part of the deep convolutional neural network.
2. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 1, wherein a pooling processing module is connected between adjacent convolution processing modules, and the pooling processing module processes a pooling layer portion of the deep convolutional neural network.
3. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 2, wherein the external data source inputs one row of input feature map data to the first convolution processing module every ΔT_1 clock cycles, and every r_{i+1}·S_i·ΔT_i clock cycles the i-th convolution processing module completes the computation of r_{i+1} rows of output feature map data, where

ΔT_i = ΔT_1 · ∏_{j=1}^{i-1} S_j·Ps_j

In the formula, Ps_j is the stride of the pooling processing module after the j-th convolution processing module and S_j is the stride corresponding to the input feature map of the j-th convolution processing module; the convolution kernel size is N_i × C_i × K_i × K_i and the padding is pad_i.
4. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 3, wherein each convolution processing module comprises an input data buffer, a plurality of parallel computation buffers, a plurality of parallel computing unit arrays, and an output data buffer, wherein:
the input data buffer reads and stores data from the output data buffer of the preceding convolution processing module, the plurality of parallel computing buffers read data from the input data buffer, the inputs of the plurality of parallel computing unit arrays are the data in the computing buffers, and the outputs of the plurality of parallel computing unit arrays are stored in the output data buffer.
5. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 4, wherein each of the plurality of parallel computing unit arrays consists of W_{h,i} × I_{w,i} computing units, each computing unit is a W_{w,i}-input multiply-accumulate tree, intermediate calculation data are buffered in a dual-port RAM, and the final calculation results are buffered in a RAM which serves as the data source of the input data buffer of the next convolution processing module.
6. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 5, wherein each computing unit performs small-size matrix multiplications W_rb × I_rpb in turn, thereby finally realizing the large-size matrix multiplication W_r × I_rp; the final calculation result is a single row of output feature map data.
7. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 6, wherein the computation paradigm of each computing unit is based on Toeplitz matrix multiplication: the input feature map data are converted into a Toeplitz matrix, the input feature map data processed by each PE array are located in one column matrix of the Toeplitz matrix, and in the hybrid multi-row data flow strategy all parallel computing unit arrays share the same weight parameters to respectively process Ifmap data from different column matrices of the Toeplitz matrix, thereby optimizing bandwidth usage.
8. The DCNN accelerator according to claim 7, wherein the computing resource usage # PE Mult of each computing unit is:
#PE Mult = W_{h,i} × W_{w,i} × I_{w,i}

where W_{h,i}, W_{w,i}, I_{w,i} need to satisfy

[Formula image in the original: a constraint relating W_{h,i}, W_{w,i}, I_{w,i} to Hout_i, the layer's convolution workload, and ΔT_1.]

In the formula, Hout_i is the width and height of the output feature map corresponding to the i-th convolution processing module, and ΔT_1 is the clock-cycle interval at which the external data source inputs one row of data to the accelerator, which satisfies:

[Formula image in the original: the constraint on ΔT_1 required to reach the target throughput TRP_obj.]

where TRP_obj is the desired throughput of the accelerator design.
9. The DCNN accelerator based on a hybrid multi-row data flow strategy according to claim 8, wherein the data stored in each input data buffer have the same row position and column position in the input feature map and are arranged in the input data buffers by channel size, and the number of data buffers of the i-th convolution processing module is #RowDataBuffer × Hin_i, where

[Formula images in the original: the definition of #RowDataBuffer in terms of DataIn0′, and the definitions of the reduced factors r_i′ and (r_{i+1}S_i)′ obtained by dividing r_i and r_{i+1}S_i by their greatest common divisor.]

DataIn0′ = K_i + S_i(r_{i+1} − 1) + GCD(r_{i+1}S_i, r_i)(r_i′ − 1)

in which GCD(r_{i+1}S_i, r_i) is the greatest common divisor of r_i and r_{i+1}S_i, r_i′ and (r_{i+1}S_i)′ are two relatively prime positive integers, the stride corresponding to the output feature map is S_i, and the padding is pad_i.
10. The DCNN accelerator according to claim 9, wherein, when the pooling processing module responsible for the pooling operation processes the output data of a convolution processing module, its number of input row data equals the number of output row data of that convolution processing module, and the number of data buffers satisfies:

Hout_i × #RowPoolingBuffer

where

[Formula image in the original: the definition of #RowPoolingBuffer.]
CN202210482658.5A 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy Pending CN114723029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210482658.5A CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210482658.5A CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Publications (1)

Publication Number Publication Date
CN114723029A true CN114723029A (en) 2022-07-08

Family

ID=82231586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210482658.5A Pending CN114723029A (en) 2022-05-05 2022-05-05 DCNN accelerator based on hybrid multi-row data flow strategy

Country Status (1)

Country Link
CN (1) CN114723029A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292662A (en) * 2022-08-18 2022-11-04 上海燧原科技有限公司 Convolution acceleration operation method and device, electronic equipment and storage medium
CN115292662B (en) * 2022-08-18 2023-09-22 上海燧原科技有限公司 Convolution acceleration operation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111445012B (en) FPGA-based packet convolution hardware accelerator and method thereof
CN110543939B (en) Hardware acceleration realization device for convolutional neural network backward training based on FPGA
CN110705703B (en) Sparse neural network processor based on systolic array
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN108170640B (en) Neural network operation device and operation method using same
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
CN114723029A (en) DCNN accelerator based on hybrid multi-row data flow strategy
CN113298237A (en) Convolutional neural network on-chip training accelerator based on FPGA
CN114462587B (en) FPGA implementation method for photoelectric hybrid computation neural network
CN111275167A (en) High-energy-efficiency pulse array framework for binary convolutional neural network
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN112862091B (en) Resource multiplexing type neural network hardware accelerating circuit based on quick convolution
CN116167424B (en) CIM-based neural network accelerator, CIM-based neural network accelerator method, CIM-based neural network storage processing system and CIM-based neural network storage processing equipment
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
CN110766136B (en) Compression method of sparse matrix and vector
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
US11928176B2 (en) Time domain unrolling sparse matrix multiplication system and method
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN112766453A (en) Data processing device and data processing method
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
Wang et al. An FPGA-Based Reconfigurable CNN Training Accelerator Using Decomposable Winograd

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination