CN113313251B - Depth separable convolution fusion method and system based on data flow architecture

Info

Publication number: CN113313251B
Application number: CN202110522385.8A
Authority: CN (China)
Prior art keywords: data, convolution, array, SPM, task
Priority date / Filing date: 2021-05-13
Publication date: 2023-05-23
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113313251A
Inventors: 刘天雨, 吴欣欣, 范志华, 李文明, 叶笑春, 范东睿
Current and original assignee: Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS, priority to CN202110522385.8A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3005 Arrangements for executing specific machine instructions to perform operations for flow control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a depth separable convolution fusion method and system based on a data flow architecture, comprising the following steps: input image data and convolution parameters are carried from the main memory DRAM to the data cache SPM; the PE array reads the input image data and convolution parameters from the data cache SPM, executes DW convolution, and stores the obtained DW convolution result in registers inside the PEs; the PE array performs activation calculation on the DW convolution result in the registers to obtain a preliminary result Act_out of the input image data, which is written back to the data cache SPM and further stored back to the main memory; the PE array reads the preliminary result Act_out and convolution parameters from the data cache SPM and performs PW convolution to obtain the final result Output; after the final result Output is written back to the data cache SPM, it is further stored back to the main memory DRAM. The invention reduces the overhead caused by data storage and access, and accelerates depth separable convolution computation on the data flow architecture.

Description

Depth separable convolution fusion method and system based on data flow architecture
Technical Field
The invention relates to hardware accelerators with a data flow architecture and to the field of neural network application processing. It particularly relates to an acceleration design for depth separable convolution computation in neural networks, which, combined with an efficient data flow execution mode, offers advantages such as high processing speed and high energy efficiency.
Background
Convolutional neural networks have attracted wide attention and application thanks to their powerful feature extraction and generalization capabilities. Depth separable convolution splits an ordinary convolution operation into a depthwise convolution (DW convolution) and a pointwise convolution (PW convolution). This decouples the channel coupling and spatial coupling in convolution, greatly reducing the amount of computation and the number of parameters at a small loss of precision, and is therefore often used to replace ordinary convolution.
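To make the savings concrete, the following minimal C program compares the multiply-accumulate (MAC) counts of an ordinary convolution and its depth separable counterpart. The shapes (K=3, 32 input channels, 64 output channels, 112x112 output) are hypothetical examples chosen for illustration, not values taken from the invention:

    #include <stdio.h>

    /* MAC counts for one H x W output feature map:
     *   ordinary convolution:        K*K*Cin*Cout*H*W
     *   depth separable convolution: K*K*Cin*H*W (DW) + Cin*Cout*H*W (PW)
     * The ratio simplifies to 1/Cout + 1/(K*K). */
    int main(void) {
        const long K = 3, Cin = 32, Cout = 64, H = 112, W = 112; /* hypothetical */
        long ordinary  = K * K * Cin * Cout * H * W;
        long separable = K * K * Cin * H * W + Cin * Cout * H * W;
        printf("ordinary: %ld MACs, separable: %ld MACs, ratio: %.3f\n",
               ordinary, separable, (double)separable / (double)ordinary);
        return 0;
    }

For a 3x3 kernel the ratio 1/Cout + 1/(K*K) is dominated by the 1/9 term, roughly an 8x reduction in computation for these shapes.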
With the vigorous development of convolutional neural networks in recent years, depth separable convolution has become a substitute for traditional convolution thanks to its reduced parameter count and computation. The widely used MobileNet and ShuffleNet families replace ordinary convolution with depth separable convolution, reducing the amount of computation and the number of parameters at a slight cost in model accuracy. Accordingly, accelerating depth separable convolution inference from both the software and hardware sides has become a research hotspot.
Prior art 1:
For accelerating depth separable convolution inference, the software layer can use deep learning frameworks and deep learning libraries, e.g., im2col (image to columns) + GEMM, conversion to FFT, or conversion to Winograd. These methods convert convolution into matrix multiplication or remap the data for calculation, and achieve good acceleration.
Accelerating convolution with deep learning libraries, however, often faces large storage overhead, and some of these acceleration schemes are limited when the convolution parameters differ. Many researchers perform model compression at the algorithm level, the core of which is to shrink the scale of a trained convolutional network model. This approach also brings good improvements, but there is still room for improvement.
Prior art 2:
Hardware acceleration of depth separable convolution comes in two kinds: designs specialized for depth separable convolution, most commonly on FPGAs, and accelerators that support and accelerate CNNs in general. Such hardware accelerators optimize convolution in both the operation structure and the storage structure, and support low-precision inference.
Hardware acceleration of convolution is closely tied to the accelerator structure. Accelerators with a tree structure are often tied to a specific acceleration algorithm in their structural design, so their generality is poor; array-structured accelerators are more general, but they adopt a control flow architecture, and the parallelism among instructions is not sufficiently exploited. The invention therefore adopts the general-purpose accelerator GPDPU with a data flow architecture: its operation units use an array structure for generality, its execution follows a coarse-grained data flow model, and the data flow exposes more inter-instruction parallelism to accelerate the calculation.
Disclosure of Invention
The invention accelerates the depth separable convolution. Against the redundant transmission of intermediate results caused by current multi-library function calls, it designs a fusion scheme for depth separable convolution, reducing the overhead caused by data storage and access. The invention comprises a depth separable convolution data fusion scheme based on data flow theory, a data flow control mechanism based on data flow theory, and an improvement of the convolution calculation hardware structure, so as to accelerate depth separable convolution computation on a data flow structure.
Aiming at the defects of the prior art, the invention provides a depth separable convolution fusion method based on a data flow architecture, comprising:
Step 1, carrying input image data and convolution parameters from the main memory DRAM to the data cache SPM;
Step 2, the PE array reads the input image data and convolution parameters from the data cache SPM, executes DW convolution, and stores the obtained DW convolution result in registers inside the PEs;
Step 3, the PE array performs activation calculation on the DW convolution result in the registers to obtain a preliminary result Act_out of the input image data.
In the depth separable convolution fusion method based on the data flow architecture, step 3 further comprises: after the preliminary result Act_out is written back to the data cache SPM, it is further stored back to the main memory.
The depth separable convolution fusion method based on the data flow architecture further comprises:
Step 4, the PE array reads the preliminary result Act_out and convolution parameters from the data cache SPM and performs PW convolution to obtain the final result Output; after the final result Output is written back to the data cache SPM, it is further stored back to the main memory DRAM.
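For concreteness, the following is a minimal single-threaded C sketch of the fused computation order of steps 1 to 4: DW convolution with the per-pixel result kept in a local accumulator (standing in for the PE-internal register), activation applied there, the preliminary result written to an Act_out buffer (standing in for the SPM), then PW convolution over it. All names and shapes are illustrative assumptions, not the GPDPU implementation:

    #include <math.h>

    /* Fused depthwise separable convolution, in the order the method describes.
     * `acc` stands in for the PE-internal register holding the DW result, so
     * the unactivated intermediate never touches a memory buffer. */
    void dw_pw_fused(const float *in,   /* Cin x H x W input image */
                     const float *dw_w, /* Cin x K x K depthwise weights */
                     const float *pw_w, /* Cout x Cin pointwise weights */
                     float *act_out,    /* Cin x H x W, stands in for the SPM */
                     float *out,        /* Cout x H x W final Output */
                     int Cin, int Cout, int H, int W, int K) {
        int pad = K / 2;
        /* Steps 2-3: DW convolution fused with activation (ReLU shown). */
        for (int c = 0; c < Cin; c++)
            for (int y = 0; y < H; y++)
                for (int x = 0; x < W; x++) {
                    float acc = 0.0f;               /* "PE-internal register" */
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++) {
                            int iy = y + ky - pad, ix = x + kx - pad;
                            if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                                acc += in[(c * H + iy) * W + ix]
                                     * dw_w[(c * K + ky) * K + kx];
                        }
                    act_out[(c * H + y) * W + x] = fmaxf(acc, 0.0f);
                }
        /* Step 4: PW convolution reads the preliminary result Act_out. */
        for (int o = 0; o < Cout; o++)
            for (int y = 0; y < H; y++)
                for (int x = 0; x < W; x++) {
                    float acc = 0.0f;
                    for (int c = 0; c < Cin; c++)
                        acc += pw_w[o * Cin + c] * act_out[(c * H + y) * W + x];
                    out[(o * H + y) * W + x] = acc;
                }
    }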
The depth separable convolution fusion method based on the data flow architecture runs on a GPDPU accelerator, which comprises: the main memory DRAM, the data cache SPM, and the PE array.
In the depth separable convolution fusion method based on the data flow architecture, the process by which the PE array performs activation calculation on the DW convolution result in the registers in step 3 specifically comprises:
when data are loaded from the main memory DRAM to the data cache SPM, adjacent channels are packed into SIMD4 data, and the total number of input channels is evenly distributed over the PE columns, completing the parallel calculation of column data in the PE array;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows, completing the parallel calculation of row data in the PE array.
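A minimal sketch of the partitioning arithmetic this implies, assuming a 4x4 PE array, SIMD4 lanes, and simple even splits (the actual GPDPU mapping tables are not given in the text):

    /* Hypothetical work partitioning: channels are packed four at a time into
     * SIMD4 words and split evenly over the 4 PE columns; output rows are
     * split evenly over tasks and then over the 4 PE rows inside a task. */
    typedef struct { int pe_col, simd_word, lane; } ChannelSlot;
    typedef struct { int task, pe_row; } RowSlot;

    ChannelSlot map_channel(int c, int Cin) {
        int words_total   = (Cin + 3) / 4;          /* SIMD4 words overall */
        int words_per_col = (words_total + 3) / 4;  /* even split, 4 columns */
        ChannelSlot s;
        s.lane      = c % 4;
        s.pe_col    = (c / 4) / words_per_col;
        s.simd_word = (c / 4) % words_per_col;
        return s;
    }

    RowSlot map_output_row(int y, int H, int num_tasks) {
        int rows_per_task = (H + num_tasks - 1) / num_tasks;
        RowSlot s;
        s.task   = y / rows_per_task;
        s.pe_row = (y % rows_per_task) % 4;         /* even split, 4 rows */
        return s;
    }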
In the depth separable convolution fusion method based on the data flow architecture,
the weight and bias data of the PW convolution are converted into SIMD data when loaded from the main memory DRAM to the data cache SPM, the preliminary result Act_out is assigned to SIMD format through the LDM instruction, and the output channels are evenly divided over the PE columns so that the output channels of the PW convolution are calculated in parallel;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows and calculated simultaneously.
The invention also provides a depth separable convolution fusion system based on the data flow architecture, comprising:
module 1, which carries input image data and convolution parameters from the main memory DRAM to the data cache SPM;
module 2, in which the PE array reads the input image data and convolution parameters from the data cache SPM, executes DW convolution, and stores the obtained DW convolution result in registers inside the PEs;
module 3, in which the PE array performs activation calculation on the DW convolution result in the registers to obtain a preliminary result Act_out of the input image data, which is written back to the data cache SPM and further stored back to the main memory;
module 4, in which the PE array reads the preliminary result Act_out and convolution parameters from the data cache SPM and performs PW convolution to obtain the final result Output; after the final result Output is written back to the data cache SPM, it is further stored back to the main memory DRAM.
The depth separable convolution fusion system based on the data flow architecture runs on a GPDPU accelerator, which comprises: the main memory DRAM, the data cache SPM, and the PE array.
In the depth separable convolution fusion system based on the data flow architecture, the process by which the PE array performs activation calculation on the DW convolution result in the registers in module 3 specifically comprises:
when data are loaded from the main memory DRAM to the data cache SPM, adjacent channels are packed into SIMD4 data, and the total number of input channels is evenly distributed over the PE columns, completing the parallel calculation of column data in the PE array;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows, completing the parallel calculation of row data in the PE array.
In the depth separable convolution fusion system based on the data flow architecture,
the weight and bias data of the PW convolution are converted into SIMD data when loaded from the main memory DRAM to the data cache SPM, the preliminary result Act_out is assigned to SIMD format through the LDM instruction, and the output channels are evenly divided over the PE columns so that the output channels of the PW convolution are calculated in parallel;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows and calculated simultaneously.
The advantages of the invention are as follows:
Based on data flow theory, the data flow and calculation mode control mechanism realizes the data movement, data multiplexing, and instruction mapping involved in completing the depth separable convolution calculation on a specific hardware acceleration structure, together with instruction optimization for the hardware platform, thereby improving calculation parallelism, reducing storage usage, and reducing transmission time.
Drawings
FIG. 1 is an overall structure diagram of a GPDPU;
FIG. 2 is a diagram of multi-library function call under a GPDPU storage mechanism;
FIG. 3 is a diagram of the fusion implementation of the convolution layer and the activation layer;
FIG. 4 is a fusion implementation diagram of a depth separable convolution;
FIG. 5 is a diagram of the data flow execution scheme for PW convolution calculation.
Detailed Description
The method accelerates the depth separable convolution. Aiming at the problems that current multi-library function calls cause redundant transmission of intermediate results and that calculation parallelism and shared data are not fully exploited, a data flow diagram is designed and optimized.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
FIG. 1 shows the overall structure of the GPDPU accelerator used in the invention. The GPDPU is a coarse-grained data flow accelerator: it applies the data flow concept and adopts an execution mode combining control flow and data flow. Conventional control flow accelerators are instruction-driven, whereas in this architecture execution is driven by the flow of data through the computation's data flow graph. Compared with the traditional structure, this structure is better suited to accelerating scientific computation with high parallelism. Existing accelerations of depth separable convolution mostly adjust the adder trees and multiplier units of the structure; this design instead computes the separable convolution in a data flow operation mode, and the application focuses its innovation on the distribution and allocation of data for the operation.
The GPDPU core consists of six parts: the main memory, the microcontroller GPDPU_HOST, a 4x4 operation array (the PE array), the instruction cache Cbuf, the data cache SPM, and the transmission network. The main memory stores configuration information, instructions, and data transmitted by the CPU; the instruction cache Cbuf buffers the instructions to be executed by the execution units (PEs), transmitted from the main memory; the data cache SPM buffers the input data and weight data required by the execution units, as well as output data and intermediate results; the 4x4 PE array completes the whole operation process; the transmission network is responsible for data transfer between main memory and the caches, and between the caches and the PE array. The execution array consists of 16 PEs; each PE contains an instruction cache, a register file, a four-stage pipelined operation unit, a router, and a PE-internal microcontroller, and the PEs are interconnected by a mesh network, supporting SIMD32 operation.
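Schematically, this organization can be summarized as below; the C declarations are purely illustrative labels for the six parts and the PE internals, not a real hardware interface:

    /* Illustrative labels only; not a hardware description of the GPDPU. */
    typedef struct {
        void *instruction_cache;  /* per-PE instruction buffer */
        void *register_file;      /* cheapest storage level: PE registers */
        void *alu_pipeline;       /* four-stage pipelined operation unit */
        void *router;             /* mesh-network router for inter-PE data */
        void *micro_ctrl;         /* PE-internal microcontroller */
    } PE;

    typedef struct {
        void *dram;      /* main memory: configuration, instructions, data */
        void *host;      /* microcontroller GPDPU_HOST */
        void *cbuf;      /* instruction cache feeding the PEs */
        void *spm;       /* data cache: inputs, weights, outputs, temps */
        void *network;   /* transmission network DRAM<->SPM<->PE array */
        PE    pe[4][4];  /* 16 PEs interconnected by a mesh network */
    } GPDPU;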
FIG. 2 is a schematic diagram of multi-library function calls under the GPDPU storage mechanism. The current implementation of depth separable convolution calls, in turn, DW convolution (depthwise convolution), BN (batch normalization), the activation function, and PW convolution (pointwise convolution). Considering the case where all four layers are present, every layer except the DW convolution depends on the result of the previous one, and this data dependency dictates that the four passes execute serially. Memory access is one of the bottlenecks of convolution acceleration, and main memory is the most expensive storage device in the accelerator. If, while respecting the data dependencies, the intermediate results are kept in registers without memory access, or stored back to a cheaper storage device instead of main memory, the overhead of storing and accessing this intermediate data can be avoided or reduced.
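As a back-of-envelope illustration, the sketch below counts the intermediate-result traffic of the serial library-call chain against the fused scheme described later. The feature-map size is a hypothetical example, and the counts are element transfers, not bytes:

    #include <stdio.h>

    int main(void) {
        const long C = 32, H = 112, W = 112;     /* hypothetical shapes */
        const long fmap = C * H * W;             /* elements per feature map */

        /* Unfused: DW_out, DW_Bn_out and Act_out each get written out by one
         * library call and read back by the next. */
        long unfused = 3 * 2 * fmap;

        /* Fused: DW_out and DW_Bn_out stay in PE registers; only Act_out is
         * written to (and read back from) the cheaper SPM. */
        long fused = 2 * fmap;

        printf("unfused intermediate traffic: %ld elements\n", unfused);
        printf("fused SPM traffic:            %ld elements\n", fused);
        return 0;
    }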
The invention is based on the data storage characteristics of the GPDPU: 1) a four-level storage hierarchy of main memory, SPM, inter-PE transmission, and PE-internal registers, with storage overhead ordered from high to low: main memory > SPM > inter-PE transfer > PE-internal registers;
2) the host is not accessible during execution of the PE array; 3) instructions only support register access. On this basis, a library function is designed to realize a fused implementation of five computations: DW convolution, BN, activation, PW convolution, and BN. The BN layer normalizes data in batches; the computation flow contains two BNs, the first processing the result of the DW convolution and the second processing the result of the PW convolution. Common separable convolution structures each contain these two BNs.
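The BN that follows a convolution can be folded into the convolution's weights and bias, which is presumably how convolution and BN are combined into a single pass here; the patent does not spell out the arithmetic, so the following is the standard folding, shown as a sketch:

    #include <math.h>

    /* Fold y = gamma * (conv(x) - mu) / sqrt(var + eps) + beta into the
     * convolution itself: scale each output channel's weights and shift its
     * bias, so no separate BN pass (and no unfused intermediate) is needed. */
    void fold_bn(float *w, float *b, int cout, int weights_per_out,
                 const float *gamma, const float *beta,
                 const float *mu, const float *var, float eps) {
        for (int o = 0; o < cout; o++) {
            float s = gamma[o] / sqrtf(var[o] + eps);
            for (int k = 0; k < weights_per_out; k++)
                w[o * weights_per_out + k] *= s;
            b[o] = (b[o] - mu[o]) * s + beta[o];
        }
    }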
FIG. 3 is a schematic diagram of the fused implementation of the DW convolution layer and the activation layer. The invention uses the registers inside the PEs to store DW_Bn_out, realizing the fusion of the DW convolution layer and the activation layer. The data DW_P is first carried from main memory to the SPM; the PE array reads the input from the SPM, performs the DW convolution calculation, keeps the result in a PE-internal register, and immediately applies the activation calculation to it (e.g., taking the max for ReLU activation) to obtain Act_out, which is stored back to the SPM and further stored back to main memory. With the DW convolution layer and activation layer fused, the intermediate result DW_Bn_out stays in the cheap PE-internal register and is never stored back to expensive main memory, reducing the latency caused by memory access. The DW convolution layer mentioned hereafter refers to a DW convolution layer that fuses the BN layer and the activation layer, and its result is Act_out.
FIG. 4 is a schematic diagram of the fused implementation of the depth separable convolution. Because the GPDPU does not allow access to main memory during program execution, the input image and the parameters of the DW and PW convolutions are first carried from the main memory DRAM to the SPM. The PE array reads the input image data from the SPM, performs the fused DW convolution, BN, and activation with the DW convolution parameters, and writes the result Act_out back to the SPM. The PE array then reads Act_out and the PW convolution parameters from the SPM, performs the PW convolution and BN calculation to obtain the final result Output, and writes it back to the SPM; after all of Output has been computed, the SPM transmits the result data back to main memory. This fused implementation merges DW convolution and BN into one pass, so the intermediate result DW_out is not needed; it uses registers to hold the DW convolution and activation intermediate DW_Bn_out, saving the memory accesses for DW_out and DW_Bn_out incurred by the multi-library implementation; and it stores the activation result Act_out in the SPM, replacing main memory accesses with cheaper SPM accesses, which greatly reduces memory access overhead.
The invention applies the computation and storage flows described above on the architecture of the data flow accelerator specific to this design.
Another aspect of the invention is the data flow operation process on the GPDPU architecture during calculation: all operations on the PE array mentioned in the invention run in data flow fashion, and the specific operation processes are mapped onto the hardware to perform highly parallel operation.
The invention also designs a depth separable convolution scheme that exploits the calculation parallelism provided by the GPDPU for the calculation pattern of the depth separable convolution algorithm. The GPDPU provides 4 levels of computational parallelism:
1. When multiple tasks are executed, different tasks execute the same instructions using the same data flow graph, i.e., the same mapping scheme, and the tasks are logically executed completely independently. In actual execution, according to the design of the data flow graph and the mapping relation between the flow graph and the PE array, different tasks are mapped to flow graph nodes in the same PE, and the PE selects ready nodes to execute through a polling mechanism (see the scheduling sketch after this list). Thus, different tasks execute in parallel while sharing the same set of computing resources.
2. Multiple iterations of the same subtask are executed in a data flow graph pipeline.
3. Within one iteration of one subtask, the 4x4 array of processing units (PEs) provides 16 copies of the same computing resource; under the mapping of the data flow graph, and according to the dependency relations between its nodes, computing nodes at the same level can compute simultaneously by being mapped to different PEs.
4. Eight sets of identical computing resources inside each PE compute independently and simultaneously.
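As referenced in level 1 above, the following is a toy sketch of ready-node polling; the Node structure and the scheduling loop are illustrative assumptions, not the GPDPU microcontroller logic:

    /* Toy round-robin selection of ready data flow graph nodes mapped to one
     * PE: nodes from different tasks share the PE, and whichever node is
     * ready fires. Runs until no node is ready. */
    typedef struct { int ready; void (*fire)(void); } Node;

    void pe_poll(Node *nodes, int n) {
        int fired = 1;
        while (fired) {
            fired = 0;
            for (int i = 0; i < n; i++)     /* polling mechanism */
                if (nodes[i].ready) {
                    nodes[i].ready = 0;
                    nodes[i].fire();        /* execute the ready node */
                    fired = 1;
                }
        }
    }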
FIG. 4 also shows the data flow execution scheme of the DW convolution calculation. In DW convolution computation, the parallelism of the channel dimension is computed fully in parallel by SIMD and the 4 PE columns. As shown in FIG. 4 (1), every four adjacent input channels are packed into one SIMD4 datum when loading data from DRAM to SPM, converting the storage from channel-serial to lane-parallel, i.e., SIMD form (a layout sketch follows). This increases the parallelism of calculation and improves calculation efficiency, using a single-instruction multiple-data execution mode. The SIMD4-packed input channels are then evenly divided over the 4 PE columns for parallel computation, so all channels are computed in parallel simultaneously.
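The layout conversion just mentioned might look as follows in scalar C; the lane-major destination layout is an assumption for illustration:

    /* Pack every four adjacent input channels into one SIMD4 word while
     * copying from DRAM to SPM, turning channel-serial storage into
     * lane-parallel storage. Tail lanes beyond Cin are zero-filled. */
    void pack_simd4(const float *src, float *dst, int Cin, int H, int W) {
        int words = (Cin + 3) / 4;
        for (int w = 0; w < words; w++)
            for (int y = 0; y < H; y++)
                for (int x = 0; x < W; x++)
                    for (int lane = 0; lane < 4; lane++) {
                        int c = w * 4 + lane;
                        float v = (c < Cin) ? src[(c * H + y) * W + x] : 0.0f;
                        dst[((w * H + y) * W + x) * 4 + lane] = v;
                    }
    }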
Second, in DW convolution computation the object is a set of images of a given height and width. Parallel computation can proceed in units of data rows (height) or data columns (width); this embodiment parallelizes over output rows, meaning that in the output, the computation of each row of data proceeds independently of the other rows. The height dimension of the DW convolution is thus computed fully in parallel: FIG. 4 (2) shows the parallelism of the output height dimension taking 4 tasks as an example. The output rows are evenly divided among the 4 tasks, each task computes the output of a portion of the rows, and within each task the total number of rows is evenly divided over the 4 PE rows and computed simultaneously, so different output rows are computed at the same time. Finally, the data of different columns within the same row are produced in different iterations, pipelined in parallel through the parallelism of the GPDPU.
FIG. 5 shows the data flow execution scheme of the PW convolution calculation. The parallelism of the channel dimension in PW convolution is computed fully in parallel using SIMD and the 4 PE columns. As shown in FIG. 5 (1), the weights and biases of the PW convolution are first packed, four adjacent weights/biases at a time, into one SIMD4 datum while loading from DRAM to SPM, and Act_out is assigned to SIMD4 format through the LDM instruction. The output channels are evenly divided over the 4 PE columns for parallel computation, so all output channels of the PW convolution are computed simultaneously. In addition, both the height and width dimensions of the PW convolution are computed fully in parallel. FIG. 5 (2) shows the parallelism of the output height and width dimensions, taking 4 tasks as an example: the output rows are evenly divided among the 4 tasks, each task computes the output of a portion of the rows, and within each task the total number of rows is evenly divided over the 4 PE rows and computed simultaneously, so outputs at different positions are computed at the same time.
With the parallel computing scheme for DW and PW convolution in this design, only the output-width dimension of the DW convolution is parallelized in pipelined form; all other parallel computations are fully parallel, which greatly improves computational parallelism.
The realization of the depth separable convolution is divided into 4 subtasks. The first subtask preprocesses the data and executes one loop iteration; the second subtask completes the DW convolution calculation, writes Act_out back to the SPM, and loops multiple times; the third subtask completes the PW convolution calculation and loops multiple times; the last subtask writes the final result Output back to the SPM. In the implementation, the number of loop iterations of the second and third subtasks is determined by the specific data size.
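The loop counts of the second and third subtasks can then be obtained by ceiling division, under the assumption (made here for illustration) that each iteration covers a fixed-size tile of the output:

    /* Hypothetical iteration count: how many fixed-size tiles are needed to
     * cover the total output. Preprocessing and write-back run once each. */
    long iterations(long total_outputs, long outputs_per_iteration) {
        return (total_outputs + outputs_per_iteration - 1)
               / outputs_per_iteration;    /* ceiling division */
    }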
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation. The technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details mentioned in this embodiment also apply to the above embodiment.
The invention also provides a depth separable convolution fusion system based on the data flow architecture, comprising:
module 1, which carries input image data and convolution parameters from the main memory DRAM to the data cache SPM;
module 2, in which the PE array reads the input image data and convolution parameters from the data cache SPM, executes DW convolution, and stores the obtained DW convolution result in registers inside the PEs;
module 3, in which the PE array performs activation calculation on the DW convolution result in the registers to obtain a preliminary result Act_out of the input image data, which is written back to the data cache SPM and further stored back to the main memory;
module 4, in which the PE array reads the preliminary result Act_out and convolution parameters from the data cache SPM and performs PW convolution to obtain the final result Output; after the final result Output is written back to the data cache SPM, it is further stored back to the main memory DRAM.
The depth separable convolution fusion system based on the data flow architecture runs on a GPDPU accelerator, which comprises: the main memory DRAM, the data cache SPM, and the PE array.
In the depth separable convolution fusion system based on the data flow architecture, the process by which the PE array performs activation calculation on the DW convolution result in the registers in module 3 specifically comprises:
when data are loaded from the main memory DRAM to the data cache SPM, adjacent channels are packed into SIMD4 data, and the total number of input channels is evenly distributed over the PE columns, completing the parallel calculation of column data in the PE array;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows, completing the parallel calculation of row data in the PE array.
In the depth separable convolution fusion system based on the data flow architecture,
the weight and bias data of the PW convolution are converted into SIMD data when loaded from the main memory DRAM to the data cache SPM, the preliminary result Act_out is assigned to SIMD format through the LDM instruction, and the output channels are evenly divided over the PE columns so that the output channels of the PW convolution are calculated in parallel;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows and calculated simultaneously.

Claims (4)

1. A depth separable convolution fusion method based on a data stream architecture, comprising:
step 1, carrying input image data and convolution parameters from a main memory DRAM to a data cache SPM;
step 2, the PE array reads the input image data and convolution parameters from the data cache SPM to execute DW convolution, and stores the DW convolution result subjected to normalization processing in a register in the PE;
step 3, the PE array performs activation calculation on the DW convolution result in the register to obtain a preliminary result Act_out of the input image data; writing the preliminary result Act_out back to the data cache SPM, and further storing back to the main memory;
step 4, the PE array executes PW convolution by reading the preliminary result Act_out and convolution parameters from the data cache SPM, and takes the PW convolution result subjected to normalization processing as a final result Output of the depth separable convolution; writing the final result Output back to the data cache SPM, and further storing back to the main memory DRAM;
in step 3, the process by which the PE array performs activation calculation on the DW convolution result in the registers specifically comprises:
when data are loaded from the main memory DRAM to the data cache SPM, adjacent channels are packed into SIMD4 data, and the total number of input channels is evenly distributed over the PE columns, completing the parallel calculation of column data in the PE array;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows, completing the parallel calculation of row data in the PE array;
the depth separable convolution fusion method is based on a GPDPU accelerator, which comprises the following steps: the main memory DRAM, the data cache SPM and the PE array;
the GPDPU accelerator provides 4 levels of computational parallelism:
when multiple tasks are executed, different tasks execute the same instructions using the same data flow graph, and the tasks are logically executed completely independently;
iterations of the subtasks within a task are executed through the data flow graph pipeline;
within an iteration of a subtask, a processing unit PE provides multiple identical computing resources, and under the mapping of the data flow graph, computing nodes at the same level compute simultaneously by being mapped to different PEs according to the dependency relations between the data flow graph nodes;
multiple sets of identical SIMD computing resources are adopted within the PE, and each set computes independently and simultaneously.
2. The depth separable convolution fusion method based on a data flow architecture of claim 1, wherein
the weight and bias data of the PW convolution are converted into SIMD data when loaded from the main memory DRAM to the data cache SPM, the preliminary result Act_out is assigned to SIMD format through the LDM instruction, and the number of output channels is divided over the PE columns for parallel calculation, so that the output channels of the PW convolution are calculated in parallel;
the output rows of the PE array are evenly divided among the tasks for execution, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows and calculated simultaneously.
3. A depth separable convolution fusion system based on a data stream architecture, comprising:
a module 1 for transferring input image data, convolution parameters from a main memory DRAM to a data cache SPM;
a module 2, configured to read the input image data and the convolution parameters from the data cache SPM by using the PE array to perform DW convolution, and store the DW convolution result after normalization processing in a register in the PE;
the module 3 is used for performing activation calculation on the DW convolution result in the register by the PE array to obtain a preliminary result Act_out of the input image data, and further storing the preliminary result Act_out back to the main memory after writing the preliminary result Act_out back to the data cache SPM;
the module 4 is configured to perform PW convolution by reading the preliminary result act_out and the convolution parameter from the data cache SPM, and take the PW convolution result after normalization processing as a final result Output of the depth separable convolution; writing the final result Output back to the data cache SPM, and further storing back to the main memory DRAM;
the process by which the PE array performs activation calculation on the DW convolution result in the registers in module 3 specifically comprises:
when data are loaded from the main memory DRAM to the data cache SPM, adjacent channels are packed into SIMD4 data, and the total number of input channels is evenly distributed over the PE columns, completing the parallel calculation of column data in the PE array;
the output rows of the PE array are evenly divided among the tasks, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows, completing the parallel calculation of row data in the PE array;
the depth separable convolution fusion system is based on a GPDPU accelerator comprising: the main memory DRAM, the data cache SPM and the PE array;
the GPDPU accelerator provides 4 levels of computational parallelism:
when multiple tasks are executed, different tasks execute the same instructions using the same data flow graph, and the tasks are logically executed completely independently;
iterations of the subtasks within a task are executed through the data flow graph pipeline;
within an iteration of a subtask, a processing unit PE provides multiple identical computing resources, and under the mapping of the data flow graph, computing nodes at the same level compute simultaneously by being mapped to different PEs according to the dependency relations between the data flow graph nodes;
multiple sets of identical SIMD computing resources are adopted within the PE, and each set computes independently and simultaneously.
4. The depth separable convolution fusion system based on a data flow architecture of claim 3, wherein
the weight and bias data of the PW convolution are converted into SIMD data when loaded from the main memory DRAM to the data cache SPM, the preliminary result Act_out is assigned to SIMD format through the LDM instruction, and the number of output channels is divided over the PE columns for parallel calculation, so that the output channels of the PW convolution are calculated in parallel;
the output rows of the PE array are evenly divided among the tasks for execution, each task computes the output of a portion of the rows, and the total number of rows computed by each task is evenly divided over the PE rows and calculated simultaneously.
CN202110522385.8A 2021-05-13 2021-05-13 Depth separable convolution fusion method and system based on data flow architecture Active CN113313251B (en)

Priority Applications (1)

Application Number: CN202110522385.8A
Publication: CN113313251B (en)
Priority Date: 2021-05-13, Filing Date: 2021-05-13
Title: Depth separable convolution fusion method and system based on data flow architecture

Publications (2)

Publication Number / Publication Date:
CN113313251A (en), 2021-08-27
CN113313251B (en), 2023-05-23

Family

ID=77373047

Family Applications (1)

Application Number: CN202110522385.8A (Active)
Publication: CN113313251B (en)
Priority Date: 2021-05-13, Filing Date: 2021-05-13
Title: Depth separable convolution fusion method and system based on data flow architecture

Country Status (1)

Country Link
CN (1) CN113313251B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409511B (en) * 2018-09-25 2020-07-28 西安交通大学 Convolution operation data flow scheduling method for dynamic reconfigurable array
CN110516801B (en) * 2019-08-05 2022-04-22 西安交通大学 High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN111178519B (en) * 2019-12-27 2022-08-02 华中科技大学 Convolutional neural network acceleration engine, convolutional neural network acceleration system and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
WO2020258528A1 (en) * 2019-06-25 2020-12-30 东南大学 Configurable universal convolutional neural network accelerator

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李永博; 王琴; 蒋剑飞. Design of a sparse convolutional neural network accelerator. Microelectronics & Computer, No. 06, full text. *

Also Published As

Publication number Publication date
CN113313251A (en) 2021-08-27

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant