CN113313252B - Depthwise separable convolution implementation method based on a systolic array - Google Patents

Depthwise separable convolution implementation method based on a systolic array

Info

Publication number
CN113313252B
CN113313252B (application number CN202110562786.6A)
Authority
CN
China
Prior art keywords
data
convolution
array
register
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110562786.6A
Other languages
Chinese (zh)
Other versions
CN113313252A (en)
Inventor
陆生礼
张广明
张娟
庞伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110562786.6A priority Critical patent/CN113313252B/en
Publication of CN113313252A publication Critical patent/CN113313252A/en
Application granted granted Critical
Publication of CN113313252B publication Critical patent/CN113313252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50: Adding; Subtracting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a depthwise separable convolution implementation method based on a systolic array, in which M rows and N columns of processing units (PE units) form the systolic array structure: in the horizontal direction, adjacent PE units are connected to each other, and each PE unit can transmit data to the PE unit on its right; in the vertical direction, each PE unit has its own data input port and data output port. A data prefetch module supplies feature map data and weight parameters to the compute array, and an adder tree accumulates the partial-sum data output in parallel by each column of PE units. Internally, each PE unit consists mainly of registers, data selectors, an adder, and a multiplier. With this systolic array structure, cooperating with the data prefetch module and the adder tree, different data flows and data reuse modes can be realized, thereby accelerating the computation of standard convolution, pointwise convolution, and depthwise convolution.

Description

Depthwise separable convolution implementation method based on a systolic array
Technical Field
The invention discloses a depthwise separable convolution implementation method based on a systolic array, relates to hardware accelerator architectures for convolutional neural networks, and belongs to the technical field of computing.
Background
Convolutional neural networks achieve high accuracy and are widely applied in computer vision fields such as image classification, object detection, and object tracking. However, convolutional neural networks are computationally intensive models that require enormous amounts of computation and parameters during training and deployment, which limits their application on resource-constrained embedded and mobile terminals.
To meet the requirements of practical applications, network architectures have developed toward lightweight networks. Lightweight architectures widely adopt depthwise separable convolution in place of standard convolution, decomposing the standard convolution into a depthwise convolution part and a pointwise convolution part, so that the neural network has fewer parameters and less computation while retaining accuracy comparable to that of large networks.
However, depthwise separable convolution offers less data reuse and lower computational parallelism, which greatly reduces the utilization of the compute array when an accelerator computes it, and thus degrades performance. A method for implementing depthwise separable convolution on a systolic array is therefore of great significance.
Disclosure of Invention
To fully exploit the data reuse and computational parallelism of depthwise separable convolution, the invention provides a depthwise separable convolution implementation method based on a systolic array, which adopts flexible data flows, improves the utilization of the accelerator's compute array, and enables the accelerator to accelerate both standard convolution and depthwise separable convolution.
The invention adopts the following technical scheme to solve the above problems:
a depth separable convolution implementation method based on a systolic array comprises a data pre-fetching module and the systolic array, wherein the systolic array comprises a plurality of PE units which are arranged in the horizontal direction and the vertical direction, and the PE units have different processing modes for input data, parts and data thereof; the PE unit updates the data A once every period or fixes the data A in a register inside the PE unit for repeated use; the PE unit updates the data B once in each period and transmits the data B in the previous period to the adjacent PE unit; outputting the partial sum data once per period or accumulating the partial sum in the PE unit, storing the partial sum in the PE unit, and outputting the partial sum in a specific period; the adjacent PE units in the horizontal direction of the systolic array are connected with each other, and each PE unit in the vertical direction is provided with a data input port and a data output port; when the systolic array calculates different convolutions, the data transmitted in the horizontal direction and the vertical direction are different, when the standard convolution and the point convolution are calculated, the characteristic diagram data are transmitted in the horizontal direction, and the weight parameters are transmitted in the vertical direction; when the depth convolution is calculated, the weight parameters are transmitted in the horizontal direction, and the characteristic diagram data are transmitted in the vertical direction; the data pre-fetching module provides the feature map data and the weight parameters for the systolic array according to the requirements of the systolic array on the feature map data and the weight parameters when different convolution calculations are executed.
Further, when computing standard convolution and pointwise convolution, the systolic array performs parallel computation over the input-channel and output-channel dimensions simultaneously: in the horizontal direction, each row of PE units computes a different input channel in parallel, and in the vertical direction, each column of PE units computes a different output channel. When computing depthwise convolution, parallel computation is performed over the input-channel and convolution-window dimensions: in the horizontal direction, each row of PE units computes a different input channel in parallel, and in the vertical direction, each column of PE units computes a different convolution window.
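As a minimal illustration of the pointwise mapping (a behavioral sketch, not the hardware itself; the function name and the tiny M and N values are assumptions for illustration), row i of the M by N PE grid handles input channel i, column j handles output channel j, and each column's M products are reduced by an adder tree:

```python
def pointwise_mapping(features, weights):
    """features: M input-channel values for one pixel.
    weights[i][j]: weight from input channel i to output channel j.
    Row i of the M x N PE grid computes input channel i in parallel;
    column j computes output channel j and reduces its M products."""
    M, N = len(features), len(weights[0])
    outputs = []
    for j in range(N):                       # one column per output channel
        products = [features[i] * weights[i][j] for i in range(M)]
        outputs.append(sum(products))        # adder tree per column
    return outputs

# M = 2 input channels, N = 3 output channels
print(pointwise_mapping([1, 2], [[1, 0, 2],
                                 [3, 1, 1]]))  # → [7, 2, 4]
```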
Further, the PE unit comprises a multiplier, an adder, registers, and data selectors. There are 3 registers, namely register I, register II, and register III, and 4 data selectors, namely data selector I, data selector II, data selector III, and data selector IV.
When input data A does not need to be reused, it passes directly through data selector II into the multiplier to be multiplied with input data B.
When input data A needs to be reused, in the first clock cycle input data A is passed through data selector II into the multiplier and, at the same time, registered in register I through data selector I; in each subsequent cycle, data selector II selects the output of register I as the input of the multiplier, and data selector I feeds the output of register I back as the input of register I, so that input data A remains latched in register I, realizing the reuse of input data A.
Input data B is used directly as the other input of the multiplier to be multiplied with input data A; at the same time, input data B is registered in register II and output from register II in the next clock cycle as the input data of the adjacent PE unit.
When the output does not need to be accumulated inside the PE, data selector IV selects the output of the multiplier as the input of register III, and the data in register III is output in the next clock cycle.
When the output needs to be accumulated inside the PE, data selector III first selects the constant 0 to be added to the multiplier output in the adder, and data selector IV selects the adder output as the input of register III; in the subsequent accumulation, data selector III selects the output of register III to be added to the multiplier output in the adder, and the data in register III is output in a specific clock cycle.
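The cycle-by-cycle behavior described above can be sketched as a small behavioral model (an illustrative sketch only; the class and attribute names are assumptions, not the patent's hardware description):

```python
class PE:
    """Behavioral model of one PE unit: multiply A*B each cycle,
    optionally latch A (operand-stationary) and accumulate in register III."""
    def __init__(self, reuse_a, accumulate):
        self.reuse_a = reuse_a        # data selectors I/II: keep A in register I
        self.accumulate = accumulate  # data selectors III/IV: accumulate into register III
        self.reg1 = None              # register I: latched operand A
        self.reg2 = None              # register II: B forwarded to the right neighbor
        self.reg3 = 0                 # register III: partial sum / product

    def cycle(self, a, b):
        """One clock cycle; returns the B value passed to the adjacent PE."""
        if self.reuse_a:
            if self.reg1 is None:     # first cycle: latch A into register I
                self.reg1 = a
            a = self.reg1             # selector II then picks register I
        product = a * b               # multiplier
        # selector III picks 0 (pass-through) or register III (accumulate)
        self.reg3 = product + (self.reg3 if self.accumulate else 0)
        forwarded, self.reg2 = self.reg2, b
        return forwarded

pe = PE(reuse_a=True, accumulate=True)
pe.cycle(5, 1)   # latches A=5, partial sum 5
pe.cycle(99, 2)  # A stays 5, partial sum 5 + 5*2
print(pe.reg3)   # → 15
```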
Further, when the systolic array performs pointwise convolution, the feature map data is transmitted systolically between PE units in the horizontal direction of the array, the weight parameters enter each column of PE units in parallel in the vertical direction, each column of PE units outputs its computed partial sums in parallel, and the partial sums are accumulated through an adder tree.
Further, when performing standard convolution, the systolic array divides each 3×3 standard convolution kernel into 9 groups of 1×1 convolution kernels and, for each group of 1×1 kernels, executes the same computation data flow as pointwise convolution. Unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same. Finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
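The decomposition of a 3×3 convolution into 9 shifted 1×1 convolutions can be checked numerically (a sketch under the assumption of a single channel, stride 1, no padding; the helper names are illustrative):

```python
def conv3x3(img, k):
    """Direct 3x3 convolution (cross-correlation), stride 1, no padding."""
    H, W = len(img), len(img[0])
    return [[sum(img[r+i][c+j] * k[i][j] for i in range(3) for j in range(3))
             for c in range(W - 2)] for r in range(H - 2)]

def conv3x3_as_nine_1x1(img, k):
    """Same result via 9 shifted 1x1 convolutions whose partial sums
    are accumulated, mirroring the standard-convolution data flow."""
    H, W = len(img), len(img[0])
    out = [[0] * (W - 2) for _ in range(H - 2)]
    for i in range(3):
        for j in range(3):            # one 1x1 kernel group per (i, j) shift
            for r in range(H - 2):
                for c in range(W - 2):
                    out[r][c] += img[r+i][c+j] * k[i][j]
    return out

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ker = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(conv3x3(img, ker) == conv3x3_as_nine_1x1(img, ker))  # → True
```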
Furthermore, when the systolic array performs depthwise convolution, the weight parameters are transmitted systolically between PE units in the horizontal direction of the array, the feature map data of different convolution windows enters each column of PE units in parallel in the vertical direction, each column of PE units needs 9 cycles to finish computing one convolution window, and during the computation the partial sums are accumulated inside the PE units; after one convolution window is completed, the results are output in parallel.
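For one 3×3 window, the 9-cycle accumulation inside a single PE can be sketched as follows (illustrative only; the function name and row-major flattening order are assumptions):

```python
def depthwise_window(window, kernel):
    """One 3x3 convolution window: a single PE consumes one
    (feature, weight) pair per cycle and accumulates for 9 cycles."""
    partial_sum = 0                       # register III
    flat_w = [v for row in window for v in row]
    flat_k = [v for row in kernel for v in row]
    for cycle in range(9):                # 9 multiply-accumulate cycles
        partial_sum += flat_w[cycle] * flat_k[cycle]
    return partial_sum                    # output in parallel afterwards

print(depthwise_window([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       [[0, 0, 0], [0, 1, 0], [0, 0, 0]]))  # → 5
```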
Further, the systolic array uses different data reuse modes in different directions when performing different types of convolution.
Further, when performing standard convolution and pointwise convolution, the feature map data is transmitted systolically in the horizontal direction of the systolic array, realizing reuse of the feature map data among different PE units in the horizontal direction; in the vertical direction, the weight data, once updated, is temporarily stored in a register inside each PE unit during computation, realizing reuse of the weight parameters inside the PE units. When performing depthwise convolution, the weight parameters are transmitted systolically in the horizontal direction, realizing reuse of the weight parameters among different PE units in the horizontal direction, while the feature map data is updated every cycle in the vertical direction and is not reused within the systolic array.
Further, the data prefetch module comprises a feature map buffer, a register storage array, a data grouping module, a data selection module, and a weight buffer. When standard convolution and pointwise convolution are executed, the data prefetch module transmits the feature map data in the feature map buffer and the weight parameters in the weight buffer directly to the systolic array. When depthwise convolution is executed, for the feature map data, the data prefetch module first reads part of the feature map data from the feature map buffer (for stride 1, 3 rows by (N+2) columns of feature map data for each of the M input channels; for stride 2, 3 rows by (2N-1) columns for each of the M input channels) and temporarily stores it in the register storage array; the data grouping module then divides the data in the register storage array into N groups of 3×M data each, supplied respectively to the N columns of PE units in the systolic array; the weight parameters are transmitted directly from the weight buffer to the systolic array. While the systolic array is computing, the data prefetch module updates the next batch of feature map data into the register storage array, reducing the time the systolic array waits for the next batch of feature map data. The data grouping module realizes reuse of the feature map data; the data selection module selects whether the data in the feature map buffer or the grouped data is transmitted to the PE unit array.
Furthermore, when the systolic array computes depthwise convolution, the data prefetch module reads part of the feature map data from the feature map buffer into the register storage array and groups it through the data grouping module, realizing reuse of the feature map data. While the systolic array computes the previous batch of feature map data, the data prefetch module reads the next batch from the feature map buffer into the register storage array, reducing the time the systolic array waits for the next batch of feature map data.
By adopting the above technical scheme, the invention has the following beneficial effects:
(1) The systolic array provides computational parallelism along different dimensions, can match the computational characteristics of pointwise convolution, standard convolution, and depthwise convolution, and improves the PE unit utilization of the systolic array when computing the three types of convolution.
(2) The systolic array provides computational parallelism along different dimensions, can fully exploit the data reuse of the different convolution types, reduces accesses to the data buffers, and lowers memory-access power consumption.
Drawings
FIG. 1 is a schematic diagram of the data prefetch module and systolic array structure employed in the present invention;
FIG. 2 is a schematic diagram of the PE unit employed in the present invention;
FIG. 3 is a schematic data flow diagram of pointwise convolution in the systolic array;
FIG. 4 is a data flow diagram of standard convolution in the systolic array;
FIG. 5 is a data flow diagram of depthwise convolution in the systolic array.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings.
Example 1
FIG. 1 is a schematic diagram of the data prefetch module and systolic array employed in the present invention, and FIG. 2 is a schematic diagram of the internal structure of the PE unit. The PE unit comprises a multiplier, an adder, registers, and data selectors. For each PE unit, data A and data B are, according to the convolution type, the feature map data and the weight parameters (or the reverse); data A can be held inside the PE unit for repeated use, and data B is transmitted between PE units, realizing data reuse among different PE units. The computed partial sums can be accumulated inside the PE unit.
The data flow and data reuse of pointwise convolution, standard convolution, and depthwise convolution are detailed below:
(1) Pointwise convolution
The data flow of pointwise convolution is shown in FIG. 3; the weights and feature maps enter the array simultaneously. In the first clock cycle, the feature map data of the M input channels and the first group of M weights enter the leftmost column of PE units of the systolic array simultaneously; in the second clock cycle, the leftmost column continues to update its feature map data but no longer updates its weights, while the second group of M weights enters the second column of PE units. After N cycles, all N groups of weights have entered the array and each PE unit holds its own weight; the weights are then held fixed until all the input feature map data of the M channels has entered the array, after which the weights and the corresponding feature maps are updated simultaneously.
In the horizontal direction of the systolic array, the input feature map data is passed between adjacent PE units from left to right in each clock cycle, so the input feature map data is reused among different PE units. The weight data is fixed inside each PE unit and reused throughout the computation.
For the output, each column of PE units computes M products per clock cycle, which are accumulated through an M-input adder tree.
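An M-input adder tree reduces the M products of one column through levels of pairwise additions, each level halving the number of operands; a minimal sketch (the function name is an assumption, and an odd operand count is padded with zero purely for illustration):

```python
def adder_tree(products):
    """Pairwise reduction of one column's M products, as an adder tree
    would perform it: each level halves the number of operands."""
    level = list(products)
    while len(level) > 1:
        if len(level) % 2:            # odd operand carried into the next level
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# M = 4 products from one PE column in one cycle
print(adder_tree([3, 1, 4, 1]))  # → 9
```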
(2) Standard convolution
The data flow of standard convolution is shown in FIG. 4 and is similar to that of pointwise convolution. Each 3×3 standard convolution kernel is divided into 9 groups of 1×1 convolution kernels, and each group of 1×1 kernels is computed in the same way as pointwise convolution. Unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same, so the feature map data prefetch module must read the corresponding feature map data from the feature map buffer for each case. Finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
(3) Depthwise convolution
The data flow of depthwise convolution is shown in FIG. 5. Each input channel of a depthwise convolution corresponds to only one 3×3 convolution kernel, so only the weight reuse within the convolution kernel can be exploited. Taking a 3×3 depthwise convolution with stride 1 as an example, before the computation starts the feature map data prefetch module must fetch the feature map data of M channels in advance, 3 rows by (N+2) columns per channel. With a sliding-window stride of 1, the 3 rows by (N+2) columns of each channel can be split into N 3×3 convolution windows; this data splitting is performed by the data prefetch module. The computation then starts: the weights enter the array from the horizontal direction, and the input feature map data of the N convolution windows enters from the vertical direction.
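The splitting of 3 rows by (N+2) columns into N overlapping 3×3 windows at stride 1 can be sketched as follows (one channel shown; the function name is an assumption):

```python
def split_windows(rows, n_windows):
    """Split 3 rows x (N+2) columns into N overlapping 3x3 windows
    (stride 1), as the data grouping module does for one channel."""
    return [[row[c:c + 3] for row in rows] for c in range(n_windows)]

rows = [[0, 1, 2, 3, 4],      # 3 rows, N+2 = 5 columns, so N = 3 windows
        [5, 6, 7, 8, 9],
        [10, 11, 12, 13, 14]]
wins = split_windows(rows, 3)
print(len(wins))      # → 3
print(wins[1][0])     # second window, top row → [1, 2, 3]
```

Adjacent windows share two of their three columns, which is exactly the overlap the register storage array lets the prefetch module reuse instead of re-reading it from the feature map buffer.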
Because the array's bandwidth requirement for feature map data is high during the computation, the data prefetch module continuously reads new feature map data from the feature map buffer while the array is computing, reducing the time the array waits for feature map data.
During the computation, each column of PE units must update its feature map data every cycle, so the feature map data cannot be reused within the array. Because the data prefetch module contains the register storage array, the data overlapping between different convolution windows of the feature map can be reused inside the data prefetch module, compensating for the array's inability to reuse feature map data. The weight data, which is transmitted horizontally in the systolic array, is reused among different PE units.
The invention discloses a depthwise separable convolution implementation method based on a systolic array, in which M rows and N columns of processing units (PE units) form the systolic array structure: in the horizontal direction, adjacent PE units are connected to each other, and each PE unit can transmit data to the PE unit on its right; in the vertical direction, each PE unit has its own data input port and data output port. The data prefetch module supplies feature map data and weight parameters to the compute array, and the adder tree accumulates the partial-sum data output in parallel by each column of PE units. Internally, each PE unit consists mainly of registers, data selectors, an adder, and a multiplier. With this systolic array structure, cooperating with the data prefetch module and the adder tree, different data flows and data reuse modes can be realized, thereby accelerating the computation of standard convolution, pointwise convolution, and depthwise convolution.
The above content merely illustrates the technical idea of the present invention and does not thereby limit its protection scope; any modification made on the basis of this technical idea falls within the protection scope of the present invention.

Claims (8)

1. A depthwise separable convolution implementation method based on a systolic array, characterized by comprising a data prefetch module and the systolic array, wherein the systolic array comprises a plurality of PE units arranged in the horizontal and vertical directions, and the PE units support different processing modes for input data and partial-sum data: a PE unit either updates data A once every cycle or fixes data A in an internal register for repeated use; a PE unit updates data B once every cycle and transmits the data B of the previous cycle to the adjacent PE unit; the partial-sum data is either output once every cycle, or accumulated and held inside the PE unit and output in a specific cycle; adjacent PE units in the horizontal direction of the systolic array are connected to each other, and in the vertical direction each PE unit has its own data input port and data output port; the data transmitted in the horizontal and vertical directions differs according to the convolution being computed: for standard convolution and pointwise convolution, feature map data is transmitted in the horizontal direction and weight parameters in the vertical direction; for depthwise convolution, weight parameters are transmitted in the horizontal direction and feature map data in the vertical direction; the data prefetch module supplies feature map data and weight parameters to the systolic array according to the array's requirements when executing the different convolution computations.
2. The method of claim 1, wherein the PE unit comprises a multiplier, an adder, registers, and data selectors; there are 3 registers, namely register I, register II, and register III, and 4 data selectors, namely data selector I, data selector II, data selector III, and data selector IV;
when input data A does not need to be reused, it passes directly through data selector II into the multiplier to be multiplied with input data B;
when input data A needs to be reused, in the first clock cycle input data A is passed through data selector II into the multiplier and, at the same time, registered in register I through data selector I; in each subsequent cycle, data selector II selects the output of register I as the input of the multiplier, and data selector I feeds the output of register I back as the input of register I, so that input data A remains latched in register I, realizing the reuse of input data A;
input data B is used directly as the other input of the multiplier to be multiplied with input data A; at the same time, input data B is registered in register II and output from register II in the next clock cycle as the input data of the adjacent PE unit;
when the output does not need to be accumulated inside the PE, data selector IV selects the output of the multiplier as the input of register III, and the data in register III is output in the next clock cycle;
when the output needs to be accumulated inside the PE, data selector III first selects the constant 0 to be added to the multiplier output in the adder, and data selector IV selects the adder output as the input of register III; in the subsequent accumulation, data selector III selects the output of register III to be added to the multiplier output in the adder, and the data in register III is output in a specific clock cycle.
3. The method of claim 1, wherein, when the systolic array performs pointwise convolution, the feature map data is transmitted systolically between PE units in the horizontal direction of the array, the weight parameters enter each column of PE units in parallel in the vertical direction, each column of PE units outputs its computed partial sums in parallel, and the partial sums are accumulated through an adder tree.
4. The method of claim 1, wherein, when performing standard convolution, the systolic array divides each 3×3 standard convolution kernel into 9 groups of 1×1 convolution kernels and, for each group of 1×1 kernels, executes the same computation data flow as pointwise convolution; unlike pointwise convolution, the positions of the input feature points corresponding to the different groups among the 9 groups of 1×1 kernels are not all the same; finally, the 9 groups of partial sums computed by the 9 groups of 1×1 kernels are accumulated to obtain the final output.
5. The method of claim 1, wherein, when performing depthwise convolution, the weight parameters are transmitted systolically between PE units in the horizontal direction of the array, the feature map data of different convolution windows enters each column of PE units in parallel in the vertical direction, each column of PE units needs 9 cycles to finish computing one convolution window, and during the computation the partial sums are accumulated inside the PE units; after one convolution window is completed, the results are output in parallel.
6. The method of claim 1, wherein the systolic array uses different data reuse modes in different directions when performing different types of convolution.
7. The method of claim 6, wherein, when performing standard convolution and pointwise convolution, the feature map data is transmitted systolically in the horizontal direction of the systolic array, realizing reuse of the feature map data among different PE units in the horizontal direction, and in the vertical direction the weight data, once updated, is temporarily stored in a register inside each PE unit during computation, realizing reuse of the weight parameters inside the PE units; when performing depthwise convolution, the weight parameters are transmitted systolically in the horizontal direction, realizing reuse of the weight parameters among different PE units in the horizontal direction, while the feature map data is updated every cycle in the vertical direction and is not reused within the systolic array.
8. The method of claim 7, wherein the data prefetch module comprises a feature map buffer, a register storage array, a data grouping module, a data selection module, and a weight buffer; when standard convolution and pointwise convolution are executed, the data prefetch module transmits the feature map data in the feature map buffer and the weight parameters in the weight buffer directly to the systolic array; when depthwise convolution is executed, for the feature map data, the data prefetch module first reads part of the feature map data from the feature map buffer into the register storage array, the data grouping module then divides the data in the register storage array into N groups of 3×M data each, supplied respectively to the N columns of PE units in the systolic array, and the weight parameters are transmitted directly from the weight buffer to the systolic array; while the systolic array is computing, the data prefetch module updates the next batch of feature map data into the register storage array, reducing the time the systolic array waits for the next batch of feature map data; the data grouping module realizes reuse of the feature map data; the data selection module selects whether the data in the feature map buffer or the grouped data is transmitted to the PE unit array.
CN202110562786.6A 2021-05-24 2021-05-24 Depthwise separable convolution implementation method based on systolic array Active CN113313252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110562786.6A CN113313252B (en) Depthwise separable convolution implementation method based on systolic array


Publications (2)

Publication Number Publication Date
CN113313252A CN113313252A (en) 2021-08-27
CN113313252B CN113313252B (en) 2022-10-25

Family

ID=77374382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110562786.6A Active CN113313252B (en) Depthwise separable convolution implementation method based on systolic array

Country Status (1)

Country Link
CN (1) CN113313252B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869507B (en) * 2021-12-02 2022-04-15 Zhejiang Lab Neural network accelerator convolution calculation device and method based on systolic array
CN116050474A (en) * 2022-12-29 2023-05-02 Shanghai Tianshu Zhixin Semiconductor Co., Ltd. Convolution calculation method, SOC chip, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934339A (en) * 2019-03-06 2019-06-25 Southeast University A general convolutional neural network accelerator based on a one-dimensional systolic array
CN110543934A (en) * 2019-08-14 2019-12-06 Beihang University Systolic array computing structure and method for convolutional neural network
CN111506343A (en) * 2020-03-05 2020-08-07 Peking University Shenzhen Graduate School Deep learning convolution operation implementation method based on systolic array hardware architecture


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hardware Design of Convolution Computing Module Based on Systolic Array; Wang Chunlin et al.; Application of Electronic Technique; 2020-01-06 (Issue 01); full text *

Also Published As

Publication number Publication date
CN113313252A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN110458279B (en) FPGA-based binary neural network acceleration method and system
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN107844826B (en) Neural network processing unit and processing system comprising same
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
CN108108809B (en) Hardware architecture for reasoning and accelerating convolutional neural network and working method thereof
CN111898733B Depthwise separable convolutional neural network accelerator architecture
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN113313252B Depthwise separable convolution implementation method based on systolic array
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN110516801A High-throughput dynamically reconfigurable convolutional neural network accelerator architecture
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN114781629B (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
US20220164663A1 (en) Activation Compression Method for Deep Learning Acceleration
CN110543939A Hardware acceleration implementation framework for convolutional neural network backward training based on FPGA
CN113033794B Lightweight neural network hardware accelerator based on depthwise separable convolution
CN115238863A (en) Hardware acceleration method, system and application of convolutional neural network convolutional layer
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN109558944B (en) Algorithm optimization method and device of convolutional neural network based on configurable convolutional layer
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN113298241B Depthwise separable convolutional neural network acceleration method and accelerator
CN113240101B (en) Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant