CN117688988A - Reconfigurable convolution calculation circuit based on TDC algorithm - Google Patents

Reconfigurable convolution calculation circuit based on TDC algorithm

Info

Publication number
CN117688988A
CN117688988A (application CN202311681415.5A)
Authority
CN
China
Prior art keywords
convolution
calculation
size
deconvolution
reconfigurable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311681415.5A
Other languages
Chinese (zh)
Inventor
Li Hui (李辉)
Zhang Chaoyang (张朝阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202311681415.5A
Publication of CN117688988A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses an improved calculation strategy for the TDC algorithm and a reconfigurable convolution calculation circuit based on the TDC algorithm. It aims to provide an optimized calculation strategy for deconvolution with a convolution kernel size of 4×4 and a reconfigurable convolution calculation circuit capable of supporting one type of forward convolution and two types of deconvolution. Compared with the TDC algorithm, the optimized calculation strategy reduces the number of zeros padded into the convolution kernel and reduces redundant operations. The circuit is composed of two reconfigurable convolution computing units, each of which changes its internal structure according to the relevant configuration signals to complete the multiply-add computation required by the current task. Compared with other work in the same field, the invention realizes the calculation of multiple convolution types without reducing hardware calculation efficiency, and thus has a wider range of application.

Description

Reconfigurable convolution calculation circuit based on TDC algorithm
Technical Field
The invention relates to the field of hardware acceleration of convolutional neural network models, and in particular to the realization of a reconfigurable convolution calculation circuit based on the TDC algorithm.
Background
In recent years, convolutional neural networks (CNNs) containing deconvolution layers have developed rapidly and show excellent results in fields such as image super-resolution, image generation, and image segmentation. The deconvolution operation differs significantly from the forward convolution operation. A forward convolution with a convolution kernel size of 2×2 is shown in Fig. 1(a), and a forward convolution with a convolution kernel size of 3×3 is shown in Fig. 1(b). In Fig. 1(a), the matrix I₀, I₁, I₂, I₃ is the input feature map, the matrix K₁₁, K₁₂, K₂₁, K₂₂ is the convolution kernel, and O is the output pixel obtained by the convolution calculation. In Fig. 1(b), the matrix I₀, I₁, I₂, I₃, I₄, I₅, I₆, I₇, I₈ is the input feature map, the matrix K₁₁, K₁₂, K₁₃, K₂₁, K₂₂, K₂₃, K₃₁, K₃₂, K₃₃ is the convolution kernel, and O is the output pixel obtained by the convolution calculation. The deconvolution computation process is more complex than forward convolution. According to the calculation procedure given by Jiale Yan et al., deconvolution can be carried out in three steps: 1) multiply each element of the input feature map by the convolution kernel to obtain a sub-matrix of the same size as the kernel; 2) place each sub-matrix according to the position of its corresponding element in the original feature map, then overlap-add the sub-matrices to obtain a feature map of larger size; 3) crop the boundary of the resulting feature map according to actual requirements. The deconvolution with a convolution kernel size of 3×3 and a step size of 2 given by Jiale Yan et al. is shown in Fig. 2, and the deconvolution with a convolution kernel size of 4×4 and a step size of 2 is shown in Fig. 3.
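The three-step procedure above can be written directly. The following is a minimal NumPy sketch; the function name, argument layout, and the symmetric crop amount are illustrative assumptions, not notation from the patent:

```python
import numpy as np

def deconv2d(inp, kernel, stride=2, crop=1):
    """Three-step deconvolution sketch: scale, overlap-add, crop."""
    h, w = inp.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    # Steps 1-2: multiply each input element by the kernel and overlap-add
    # the resulting k-by-k sub-matrices at stride-spaced positions.
    for y in range(h):
        for x in range(w):
            out[y*stride : y*stride+k, x*stride : x*stride+k] += inp[y, x] * kernel
    # Step 3: crop the boundary to the required output size.
    return out[crop:-crop, crop:-crop] if crop else out
```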
Because the deconvolution layer is more complex to compute than forward convolution, the hardware implementation of a convolutional neural network with deconvolution layers is more difficult than that of a traditional CNN. Hardware implementations of convolutional neural networks typically consume a large amount of on-chip resources, while reconfigurable techniques can effectively reduce hardware resource overhead through time-division multiplexing.
Some related work has been reported. For example, the reconfigurable hardware accelerator proposed by Mao Wendong et al. at the 2019 International Symposium on Circuits and Systems supports forward convolution and deconvolution with a convolution kernel size of 4×4. The reconfigurable hardware accelerator proposed by Lin Bai et al. at the 2020 International Symposium on Circuits and Systems supports forward convolution and deconvolution with a convolution kernel size of 3×3. However, many convolutional neural networks require no fewer than three convolution types. For example, the super-resolution convolutional neural network LapSRN includes a forward convolution and a deconvolution with a convolution kernel size of 3×3 and a deconvolution with a convolution kernel size of 4×4. Neither of the aforementioned hardware accelerators can meet the deployment requirements of LapSRN.
Disclosure of Invention
Aiming at the hardware implementation requirements of the super-resolution convolutional neural network LapSRN, the invention provides a reconfigurable convolution calculation circuit capable of supporting one type of forward convolution and two types of deconvolution; the circuit can effectively reduce hardware resource consumption when a specific network is deployed in hardware.
A TDC method for converting a deconvolution calculation into multiple convolution calculations is presented by Jung-Woo Chang et al. in the journal IEEE Transactions on Circuits and Systems for Video Technology, "An Energy-Efficient FPGA-Based Deconvolutional Neural Networks Accelerator for Single Image Super-Resolution". However, this method pads zeros into the convolution kernel, which reduces computational efficiency. An optimized calculation strategy for deconvolution with a convolution kernel size of 3×3 was proposed by Lin Bai et al. at the 2020 International Symposium on Circuits and Systems, "A Unified Hardware Architecture for Convolutions and Deconvolutions in CNN". Inspired by these two methods, the invention proposes an improved strategy for the TDC method when the convolution kernel size is 4×4 and, combined with the optimized calculation strategy for the 3×3 case, designs a reconfigurable convolution calculation circuit suitable for the super-resolution convolutional neural network LapSRN. The circuit supports forward convolution of size 3×3 with step size 1, deconvolution of size 3×3 with step size 2, and deconvolution of size 4×4 with step size 2.
The application of the TDC method to a deconvolution with a convolution kernel size of 3×3 is shown in Fig. 4. As can be seen from Fig. 4, the TDC method converts the deconvolution operation into multiple convolution operations. Specifically, it decomposes the 3×3 deconvolution into four forward convolutions with a convolution kernel size of 2×2. The 9 elements of the original convolution kernel are divided into four parts containing 4, 2, 2, and 1 elements respectively, and each part is zero-padded to obtain four kernels of size 2×2. Convolving these four kernels with a 2×2 sliding window at a given position of the input feature map (the red dashed box in Fig. 4) yields the four elements of the 2×2 block at the corresponding position of the output feature map (the red solid box in Fig. 4).
The application of the TDC method to a deconvolution with a convolution kernel size of 4×4 is shown in Fig. 5. As can be seen from Fig. 5, the TDC method decomposes the 4×4 deconvolution into four forward convolutions with a convolution kernel size of 3×3. The 16 elements of the original convolution kernel are divided into four parts of 4 elements each, and each part is zero-padded to obtain four kernels of size 3×3. Convolving these four kernels with a 3×3 sliding window at a given position of the input feature map (the red dashed boxes in Fig. 5) yields the four elements of the 2×2 block at the corresponding position of the output feature map (the red solid boxes in Fig. 5).
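The regrouping of kernel elements performed by the TDC method can be sketched as follows. Only the grouping and the padded sub-kernel sizes are shown; exactly where the zeros land inside each padded sub-kernel, and how each sub-kernel is flipped and assigned to an output position, follow the conventions of Figs. 4 and 5 (the function name and index conventions here are assumptions):

```python
import numpy as np

def tdc_split(K, stride=2):
    """Group the elements of deconvolution kernel K by output phase and
    zero-pad each group to a common sub-kernel size."""
    k = K.shape[0]
    size = k // stride + 1   # padded size: 2x2 for a 3x3 kernel, 3x3 for a 4x4 kernel
    subs = []
    for p in range(stride):
        for q in range(stride):
            part = K[p::stride, q::stride]        # elements feeding one output phase
            padded = np.zeros((size, size))
            padded[:part.shape[0], :part.shape[1]] = part  # remaining entries stay 0
            subs.append(padded)
    return subs

K3 = np.arange(1.0, 10.0).reshape(3, 3)
print([int(np.count_nonzero(s)) for s in tdc_split(K3)])   # -> [4, 2, 2, 1]
```

For a 4×4 kernel the same split yields four groups of 4 elements each, padded to 3×3, matching the decomposition described above.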
The optimized calculation strategy for deconvolution with a convolution kernel size of 3×3 is shown in Fig. 6(b). For the four forward convolutions of size 2×2 in Fig. 6(a) (the same form as in Fig. 1(a)), the zeros padded in by the TDC method can be removed and the four convolutions merged into multiply-add calculations similar to a single 3×3 forward convolution; each output is shown in a different color in Fig. 6(b). Namely:
O₀ = K₃₃·I₀ + K₃₁·I₁ + K₁₃·I₂ + K₁₁·I₃
O₁ = K₃₂·I₁ + K₁₂·I₃
O₂ = K₂₃·I₂ + K₂₁·I₃
O₃ = K₂₂·I₃
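Written out in code, the four equations above take one 2×2 input window and the original 3×3 kernel and produce one 2×2 output block. The 1-based subscripts Kᵣ꜀ of the text are mapped to 0-based indices K[r-1, c-1], which is an indexing assumption:

```python
import numpy as np

def deconv3x3_block(I, K):
    """I: 2x2 input window [[I0, I1], [I2, I3]]; K: original 3x3 kernel."""
    I0, I1, I2, I3 = I.ravel()
    O0 = K[2,2]*I0 + K[2,0]*I1 + K[0,2]*I2 + K[0,0]*I3   # K33, K31, K13, K11
    O1 = K[2,1]*I1 + K[0,1]*I3                           # K32, K12
    O2 = K[1,2]*I2 + K[1,0]*I3                           # K23, K21
    O3 = K[1,1]*I3                                       # K22
    return np.array([[O0, O1], [O2, O3]])
```

Note that one output block costs exactly 4 + 2 + 2 + 1 = 9 multiplications, which is what allows the strategy to fill the nine multipliers of a 3×3 multiply-add array.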
The improved strategy of the TDC method (for deconvolution with a convolution kernel size of 4×4) is shown in Fig. 7(b). As can be seen from Fig. 7(a), the TDC method yields four forward convolutions of size 3×3 (the same form as in Fig. 1(b)) that together are equivalent to one 4×4 deconvolution. Fig. 7(b) contains two multiply-add calculations similar to a 3×3 forward convolution. Exploiting the zero elements in the padded kernels, each such 3×3-like calculation produces the results of two of the 3×3 convolutions in Fig. 7(a); the two groups are shown in different colors in Fig. 7(b). Namely:
O₀ = W₄₄·I₀ + W₄₂·I₁ + W₂₄·I₃ + W₂₂·I₄
O₁ = W₄₃·I₁ + W₄₁·I₂ + W₂₃·I₄ + W₂₁·I₅
O₂ = W₃₄·I₃ + W₃₂·I₄ + W₁₄·I₆ + W₁₂·I₇
O₃ = W₃₃·I₄ + W₃₁·I₅ + W₁₃·I₇ + W₁₁·I₈
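The corresponding sketch for the 4×4 case takes a 3×3 input window (flattened row-major as I₀ … I₈) and the original 4×4 kernel, again mapping the text's 1-based subscripts Wᵣ꜀ to 0-based W[r-1, c-1] as an indexing assumption:

```python
import numpy as np

def deconv4x4_block(I, W):
    """I: 3x3 input window flattened as I0..I8; W: original 4x4 kernel."""
    I0, I1, I2, I3, I4, I5, I6, I7, I8 = I.ravel()
    O0 = W[3,3]*I0 + W[3,1]*I1 + W[1,3]*I3 + W[1,1]*I4   # W44, W42, W24, W22
    O1 = W[3,2]*I1 + W[3,0]*I2 + W[1,2]*I4 + W[1,0]*I5   # W43, W41, W23, W21
    O2 = W[2,3]*I3 + W[2,1]*I4 + W[0,3]*I6 + W[0,1]*I7   # W34, W32, W14, W12
    O3 = W[2,2]*I4 + W[2,0]*I5 + W[0,2]*I7 + W[0,0]*I8   # W33, W31, W13, W11
    return np.array([[O0, O1], [O2, O3]])
```

Here one output block costs 16 multiplications; split across two 9-multiplier units this gives the 16/18 ≈ 89% utilization discussed in the detailed description below.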
the reconfigurable convolution computing circuit is composed of two reconfigurable convolution computing units;
the reconfigurable convolution calculation unit changes the internal structure according to the relevant configuration signals to finish multiplication and addition calculation required by the current task;
drawings
Fig. 1 is a schematic diagram of the forward convolution operation. Fig. 1(a) illustrates a forward convolution with a convolution kernel size of 2×2 and a step size of 1, and Fig. 1(b) a forward convolution with a convolution kernel size of 3×3 and a step size of 1;
Fig. 2 illustrates a deconvolution operation with a convolution kernel size of 3×3 and a step size of 2;
Fig. 3 illustrates a deconvolution operation with a convolution kernel size of 4×4 and a step size of 2;
Fig. 4 is a schematic diagram of the TDC method for a deconvolution with a convolution kernel size of 3×3 and a step size of 2;
Fig. 5 is a schematic diagram of the TDC method for a deconvolution with a convolution kernel size of 4×4 and a step size of 2;
Fig. 6 illustrates the optimized deconvolution calculation strategy for a convolution kernel size of 3×3. Fig. 6(a) is a schematic diagram of the TDC method for a deconvolution with a convolution kernel size of 3×3 and a step size of 2, and Fig. 6(b) is the improved strategy of Fig. 6(a);
Fig. 7 illustrates the optimized deconvolution calculation strategy for a convolution kernel size of 4×4. Fig. 7(a) is a schematic diagram of the TDC method for a deconvolution with a convolution kernel size of 4×4 and a step size of 2, and Fig. 7(b) is the improved strategy of Fig. 7(a);
Fig. 8 illustrates the structure of the reconfigurable convolution calculation circuit.
Fig. 9 illustrates the structure of the reconfigurable convolution computing unit and its data flow in the different cases.
Fig. 10 compares the reconfigurable convolution calculation circuit proposed by the invention with existing work in terms of functional diversity and hardware efficiency.
Detailed Description
In order to further clarify the technical scheme and advantages of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments.
The reconfigurable convolution calculation circuit provided by the invention is based on the TDC method and its improved strategy, and supports three different types of convolution calculation. The circuit is composed of two reconfigurable convolution computing units. The structure of the circuit is shown in Fig. 8, and the structure of a computing unit is shown in Fig. 9(a).
When performing a forward convolution with a convolution kernel size of 3×3 and a step size of 1, the configuration signal CS is 2'b00, and the convolution operation between one input feature map and one convolution kernel is completed by one reconfigurable convolution computing unit. In this case the data flow of the unit is as shown in Fig. 9(b): the unit performs the computation shown in Fig. 1(b), with inputs and outputs identical to those in Fig. 1(b). The multiplier resource utilization of the computing unit can reach 100%.
When performing a deconvolution with a convolution kernel size of 3×3 and a step size of 2, the configuration signal CS is 2'b01, and the convolution operation between one input feature map and one convolution kernel is completed by one reconfigurable convolution computing unit. In this case the data flow of the unit is as shown in Fig. 9(c): the unit performs the computation shown in Fig. 6(b), with inputs and outputs identical to those in Fig. 6(b). The multiplier resource utilization of the computing unit can reach 100% (the 4 + 2 + 2 + 1 = 9 multiplications of the optimized strategy exactly fill the unit's nine multipliers, consistent with claim 3).
When performing a deconvolution with a convolution kernel size of 4×4 and a step size of 2, the configuration signal CS is 2'b10, and the convolution operation between one input feature map and one convolution kernel is divided into two parts, completed by the two convolution computing units respectively. The configuration signal column_index determines the allocation of computing tasks. When column_index is 1'b0, the data flow of the unit is as shown in Fig. 9(d): the unit computes the two output pixels of the first row in Fig. 7(b), with inputs and outputs identical to those shown in Fig. 7(b). When column_index is 1'b1, the data flow is as shown in Fig. 9(e): the unit computes the two output pixels of the second row in Fig. 7(b), again with inputs and outputs identical to those shown in Fig. 7(b). The multiplier resource utilization of the computing units can reach 89% (16 effective multiplications mapped onto 2 × 9 = 18 multipliers, 16/18 ≈ 89%).
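To make the mode switching concrete, the following is a small behavioral model of one computing unit. This is a sketch, not the patent's circuit: only the signal names CS and column_index come from the description above; the function name and interface are assumptions, and it reuses the deconv3x3_block and deconv4x4_block sketches given earlier:

```python
def compute_unit(cs, window, kernel, column_index=0):
    """Behavioral model of one reconfigurable convolution computing unit."""
    if cs == 0b00:   # 3x3 forward convolution, step size 1 (Fig. 9(b))
        return float((window * kernel).sum())                 # one output pixel
    if cs == 0b01:   # 3x3 deconvolution, step size 2 (Fig. 9(c))
        return deconv3x3_block(window, kernel)                # 2x2 output block
    if cs == 0b10:   # 4x4 deconvolution, step size 2 (Figs. 9(d)/9(e))
        return deconv4x4_block(window, kernel)[column_index]  # one output row
    raise ValueError("unsupported CS configuration")
```

In the 2'b10 mode the two physical units would run with column_index = 0 and column_index = 1 respectively, reproducing the task allocation described above.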
The invention has been embodied in a hardware implementation of the deep Laplacian pyramid super-resolution network (LapSRN). The network contains 12 forward convolution layers and 2 deconvolution layers. With conventional convolution and deconvolution calculation circuits, three different calculation circuits would be required for the hardware implementation of this network; with the design of the invention, only one calculation circuit is needed, which greatly reduces hardware resource overhead.
The table shown in Fig. 10 compares the reconfigurable convolution calculation circuit proposed by the invention with the convolution calculation circuit proposed by Lin Bai et al. at the 2020 International Symposium on Circuits and Systems. In terms of function, the proposed circuit supports a wider range of convolution types. As can be seen from Fig. 10, a single computing unit in the invention consumes the same multiplier resources as a single computing unit in the existing work; moreover, when computing the 3×3 forward convolution and deconvolution, the hardware calculation efficiency of the proposed circuit is the same as that of the existing work.
In summary, the reconfigurable convolution calculation circuit provided by the invention has the following advantage: compared with existing work of the same kind, it supports more types of convolution operations while consuming the same multiplier resources and achieving the same hardware calculation efficiency, and therefore covers a wider range of application scenarios.
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (4)

1. A reconfigurable convolution calculation circuit and an optimized calculation strategy based on the TDC (transforming the deconvolutional layer into the convolutional layer) algorithm, characterized in that: aiming at the deployment requirements of the super-resolution neural network LapSRN, an optimized calculation strategy for deconvolution is proposed on the basis of the TDC algorithm and the work of Lin Bai et al.; based on this optimized calculation strategy and the calculation method of Lin Bai et al., a reconfigurable convolution calculation circuit capable of supporting one type of forward convolution and two types of deconvolution, together with a corresponding computing task allocation strategy, is provided.
2. The deconvolution optimization calculation strategy according to claim 1, characterized in that: the strategy realizes deconvolution calculation with a convolution kernel size of 4×4; compared with the TDC algorithm, the optimized calculation strategy reduces the number of zeros padded into the convolution kernel and reduces redundant operations.
3. The reconfigurable convolution calculation circuit supporting one type of forward convolution and two types of deconvolution according to claim 1, characterized in that: the circuit supports a standard convolution of size 3×3 with step size 1, a deconvolution of size 3×3 with step size 2, and a deconvolution of size 4×4 with step size 2. The reconfigurable convolution calculation circuit is composed of two reconfigurable convolution computing units. Each unit realizes local convolution calculation through a multiply-add circuit and outputs the currently required convolution result according to the selection signal CS corresponding to the convolution type currently being processed. The multiplier resource utilization of the calculation circuit can reach 100% when processing the standard convolution of size 3×3 with step size 1 and the deconvolution of size 3×3 with step size 2, and can reach 89% when processing the deconvolution of size 4×4 with step size 2. Although the amount of computation is larger when processing the deconvolution of size 4×4 with step size 2, this case consumes no additional hardware resources compared with the first two cases.
4. The computing task allocation strategy according to claim 1, characterized in that: when processing the standard convolution of size 3×3 with step size 1 and the deconvolution of size 3×3 with step size 2, the convolution operation between one input feature map and one convolution kernel is completed by one reconfigurable convolution computing unit; when processing the deconvolution of size 4×4 with step size 2, the convolution operation between one input feature map and one convolution kernel is divided into two parts, completed by the two convolution computing units respectively.
CN202311681415.5A (priority and filing date 2023-12-08): Reconfigurable convolution calculation circuit based on TDC algorithm, Pending, published as CN117688988A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311681415.5A (published as CN117688988A) | 2023-12-08 | 2023-12-08 | Reconfigurable convolution calculation circuit based on TDC algorithm

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311681415.5A (published as CN117688988A) | 2023-12-08 | 2023-12-08 | Reconfigurable convolution calculation circuit based on TDC algorithm

Publications (1)

Publication Number | Publication Date
CN117688988A | 2024-03-12

Family

ID=90127776

Family Applications (1)

Application Number | Title | Status
CN202311681415.5A | Reconfigurable convolution calculation circuit based on TDC algorithm | Pending, CN117688988A (en)

Country Status (1)

Country | Link
CN | CN117688988A (en)

Similar Documents

Publication | Title
CN109063825B (en) Convolutional neural network accelerator
US11531541B2 (en) Processing apparatus and processing method
Chang et al. An energy-efficient FPGA-based deconvolutional neural networks accelerator for single image super-resolution
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN110533164B (en) Winograd convolution splitting method for convolution neural network accelerator
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN105681628A (en) Convolution network arithmetic unit, reconfigurable convolution neural network processor and image de-noising method of reconfigurable convolution neural network processor
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
Chang et al. Towards design methodology of efficient fast algorithms for accelerating generative adversarial networks on FPGAs
Mao et al. Fta-gan: A computation-efficient accelerator for gans with fast transformation algorithm
US20210044303A1 (en) Neural network acceleration device and method
Yin et al. FPGA-based high-performance CNN accelerator architecture with high DSP utilization and efficient scheduling mode
CN117688988A (en) Reconfigurable convolution calculation circuit based on TDC algorithm
US11830114B2 (en) Reconfigurable hardware acceleration method and system for gaussian pyramid construction
CN112346704B (en) Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112001492A (en) Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112508174A (en) Pre-calculation column-by-column convolution calculation unit for weight binary neural network
CN113361687B (en) Configurable addition tree suitable for convolutional neural network training accelerator
Chen et al. Large Kernel Frequency-enhanced Network for Efficient Single Image Super-Resolution
CN113592067B (en) Configurable convolution calculation circuit for convolution neural network
WO2021184143A1 (en) Data processing apparatus and data processing method
CN115841416B (en) Reconfigurable intelligent image processor architecture for automatic driving field
CN110059817B (en) Method for realizing low-resource consumption convolver

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination