CN116957018A - Method for realizing channel-by-channel convolution


Info

Publication number: CN116957018A
Application number: CN202210320433.XA
Authority: CN (China)
Prior art keywords: convolution, channel, NNA, input, convolution kernel
Legal status: Pending (an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘子航, 王荔枝
Current Assignee: Hefei Ingenic Technology Co., Ltd.
Original Assignee: Hefei Ingenic Technology Co., Ltd.
Priority date / Filing date: 2022-03-29
Publication date: 2023-10-27


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a method for implementing channel-by-channel (depthwise) convolution. When depthwise convolution is encountered, NNA convolution acceleration is achieved by expanding the convolution kernels: after expansion, the number of channels of each convolution kernel equals the number of input channels, and the number of convolution kernels equals the number of input channels. To ensure that the convolution result obtained after kernel expansion is consistent with the depthwise convolution result, the fill values of the channel elements other than the corresponding channel must be calculated.

Description

Method for realizing channel-by-channel convolution
Technical Field
The application relates to the technical field of neural networks, in particular to a method for realizing channel-by-channel convolution.
Background
With the advent of the big-data era, neural network technology has become increasingly widespread, and large-scale data processing has become one of its important applications. Convolutional neural networks are widely used in image, video, and speech processing, and as consumer electronics and automotive electronics increasingly incorporate artificial intelligence, model training and inference demand a large amount of computation. Artificial intelligence has developed rapidly, and technologies such as deep learning and neural networks have entered a stage of vigorous development. As neural networks grow more complex, training and evaluation require significant resources, and hardware accelerators are steadily improving in both performance and versatility. A convolutional neural network mainly comprises an input layer, convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the core layer of the network: most of the computation in the network occurs there, so the operating speed of the network essentially depends on the speed of the convolutional layers. Because of the nature of the algorithms and of the computation itself, the general-purpose chips widely used in the past cannot meet these computational demands, which has pushed chip manufacturers to build dedicated chips for neural network algorithms, especially inference-side chips, namely neural network accelerators (NNAs). In the prior art, the NNA (NNA 1.0) only supports fast operation of multi-channel convolution, while channel-by-channel (depthwise) convolution involves a large number of multiply-accumulate operations and is therefore slow.
That is, the prior art has the following drawbacks:
the convolution process involves a large number of multiply-accumulate computations whose speed directly affects the performance of the convolutional network; computation is slow and lacks real-time performance in practical applications; and NNA 1.0 only supports fast multi-channel convolution, so depthwise convolution cannot be accelerated through NNA 1.0. The NNA essentially supports fast multiply-accumulate of common unsigned inputs with common unsigned weights, but as a hardware accelerator it has certain limitations and lacks generality.
Furthermore, common terminology in the art is as follows:
1. Neural network: a mathematical model that simulates the structure and function of a biological neural network; by learning the internal rules of training sample data it acquires the ability to analyze or represent sample data, and it can be applied in fields such as object detection, scene classification, and character recognition.
2. Deep learning: a process and method for training a neural network.
3. Multi-channel convolution: for each pixel of each input channel, compute the products of the neighborhood pixels and the corresponding convolution kernel channel elements, accumulate them, and then accumulate the values across all channels to obtain the final convolution result (see the sketch after this list).
4. Depthwise convolution (channel-by-channel convolution): for each pixel of each input channel, compute the products of the neighborhood pixels and the corresponding convolution kernel channel elements and accumulate them to obtain the final convolution result. The number of convolution kernel channels equals 1, and the number of convolution kernels equals both the number of input channels and the number of output channels (see the sketch after this list).
5. NNA (neural network accelerator): a hardware-platform neural network accelerator; by configuring the relevant register parameters it performs fast multi-channel convolution, greatly reducing neural network running time, so that practical applications achieve better real-time performance and user experience.
6. FRAM: on-chip RAM inside the NNA that stores input image data.
7. WRAM: on-chip RAM inside the NNA that stores convolution kernel data.
8. ORAM: on-chip general-purpose RAM.
9. Pixel: the minimum unit of the input image.
10. pad: edge padding of the input image, divided into pad_top, pad_bottom, pad_left, and pad_right, denoting the padding sizes of the top, bottom, left, and right edges.
11. stride: the sliding step of the convolution kernel matrix, divided into stride_x and stride_y, denoting the horizontal and vertical sliding steps.
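To make the contrast between items 3 and 4 concrete, here is a minimal NumPy sketch of both operations (illustrative only: stride 1, no padding, small arbitrary shapes; the function names are ours, and this is not the NNA implementation):

```python
import numpy as np

def multichannel_conv(x, w):
    """Item 3: multiply-accumulate over the neighborhood AND every channel."""
    ic, ih, iw = x.shape                 # x: (IC, IH, IW) input image
    oc, _, kh, kw = w.shape              # w: (OC, IC, KH, KW) kernels
    out = np.zeros((oc, ih - kh + 1, iw - kw + 1), dtype=np.int64)
    for o in range(oc):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i+kh, j:j+kw] * w[o])
    return out

def depthwise_conv(x, w):
    """Item 4: each single-channel kernel sees only its own input channel."""
    ic, ih, iw = x.shape                 # x: (IC, IH, IW) input image
    _, kh, kw = w.shape                  # w: (IC, KH, KW), one kernel per channel
    out = np.zeros((ic, ih - kh + 1, iw - kw + 1), dtype=np.int64)
    for c in range(ic):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i+kh, j:j+kw] * w[c])
    return out
```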
Disclosure of Invention
In order to solve the above problems, the object of the present application is an NNA-based implementation method of depthwise convolution that completes fast depthwise convolution. The NNA hardware constraints include: the convolution kernel size must be at most 3, the convolution stride must be at most 2, and the number of convolution kernel input channels must be a multiple of 32. When a convolution does not natively satisfy these constraints, the weight data requires special processing in order to achieve convolution acceleration with the NNA. In particular, given that the input image data is common unsigned data and the weight data is common signed data, the question is how to obtain the correct depthwise convolution result using the NNA hardware accelerator.
Specifically, the application provides a method for implementing channel-by-channel convolution. When depthwise convolution is encountered, NNA convolution acceleration is achieved by expanding the convolution kernels: after expansion, the number of channels of each convolution kernel equals the number of input channels, and the number of convolution kernels equals the number of input channels. The NNA hardware constraints include: the convolution kernel size must be at most 3, the convolution stride must be at most 2, and the number of convolution kernel input channels must be a multiple of 32. When these constraints are not natively satisfied, the weight data requires special processing in order to achieve convolution acceleration with the NNA: when the input image data is common unsigned data and the weight data is common signed data, the fill values of the input channels other than the corresponding input channel are calculated while the depthwise convolution kernel is expanded.
The NNA supports fast multiply-accumulate of common unsigned inputs with common unsigned weights, whereas multi-channel convolution is the multiply-accumulate of common unsigned inputs with common signed weights; in actual use, the correct convolution result is obtained by configuring the relevant NNA registers. The NNA supports common convolution in which the number of convolution kernel input channels IC is a multiple of 32. To compute depthwise convolution with the NNA, the depthwise convolution kernel must undergo input channel expansion, after which the depthwise convolution is computed the same way the NNA computes common convolution. To obtain the correct depthwise result, the fill values of the input channels other than the corresponding input channel must be calculated so that the depthwise convolution formula is consistent with the NNA common convolution formula. Assume the convolution kernel size is size, the number of input channels is IC, and the weight bit width is nw.

The common unsigned input is $F_u$, the common signed weight is $W_s$, and the common unsigned weight is $W_u$. The conversion formula is $W_s = W_u - 2^{nw-1}$.
Multi-channel convolution process:

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_s(i) = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{nw-1}\sum_{i=1}^{K} F_u(i),$$

where $K = size \cdot IC$.
For channel-by-channel convolution, since NNA 1.0 only supports multi-channel convolution, the convolution kernel of the channel-by-channel convolution must be channel-expanded.

For the n-th convolution kernel: the element values of its n-th channel equal the element values of the n-th convolution kernel before expansion, and the element values of the other channels are all $2^{nw-1}$, with $1 \le n \le IC$. Denote the expanded weight by $\hat{W}_u$. The actual calculation process is:

$$F_u * \hat{W}_s(n) = \sum_{i=1}^{M} F_u(i)\left(\hat{W}_u(i) - 2^{nw-1}\right) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_u(i) - 2^{nw-1}\sum_{i=(n-1)k+1}^{nk} F_u(i) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i),$$

where $k = size$ and $M = size \cdot IC$; the fill elements contribute $2^{nw-1} - 2^{nw-1} = 0$ in the signed domain, leaving exactly the depthwise result for channel n.
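The expansion step itself is mechanical. The following is a minimal NumPy sketch of it under the definitions above (the unsigned depthwise weights as an (IC, KH, KW) array; the function name and shapes are illustrative, not part of the NNA interface):

```python
import numpy as np

def expand_depthwise_kernel(w_u, nw):
    """Expand unsigned depthwise kernels (IC, KH, KW) into multi-channel
    kernels (IC, IC, KH, KW): the n-th expanded kernel keeps its own
    channel's weights, and every other channel is filled with 2**(nw-1),
    the unsigned encoding of signed zero (since W_s = W_u - 2**(nw-1))."""
    ic, kh, kw = w_u.shape
    w_hat = np.full((ic, ic, kh, kw), 2 ** (nw - 1), dtype=np.int64)
    for n in range(ic):
        w_hat[n, n] = w_u[n]  # corresponding channel keeps the original weights
    return w_hat
```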
The NNA convolution process:
Write the convolution kernel data into the WRAM and the input image data into the FRAM; set the WRAM and FRAM read addresses, the input data bit width, the convolution kernel size, and the pad and stride parameters through the NNA hardware registers; then call the NNA hardware instruction NNMACG to obtain and output the convolution result. The input image size is IC × IH × IW (input channels × input image height × input image width); the convolution kernel size is OC × IC × KH × KW (output channels × input channels × kernel height × kernel width); the output image size is OC × OH × OW (output channels × output image height × output image width); pad is divided into pad_top, pad_bottom, pad_left, and pad_right; stride is divided into stride_y and stride_x.
the calculation formula of the output image size is:
the data sizes KH and KW of convolution kernels written into the WRAM by the hardware limitation of the NNA are smaller than or equal to 3, and the number of input channels of the convolution kernels is required to be a multiple of 32; before the NNA is called for convolution acceleration, the input image data and the convolution kernel data are processed, and the number of input channels IC is ensured to be a multiple of 32.
The NNA convolution acceleration process: write the expanded convolution kernel data (i.e., with the number of input channels equal to IC, a multiple of 32) into the WRAM, write the input image data into the FRAM, set the WRAM and FRAM read addresses, input data bit width, convolution kernel size, and pad and stride parameters through the NNA hardware registers, and call the NNA hardware instruction NNMACG to obtain and output the convolution result.
The method further comprises the following steps, with the convolution kernel size denoted KH × KW and the convolution kernel data bit width nw denoted wbit:

S1. Let the common unsigned input be $F_u$, the common signed weight $W_s$, and the common unsigned weight $W_u$. The conversion formula is $W_s = W_u - 2^{wbit-1}$, where wbit is the convolution kernel data bit width.

Common convolution: the convolution kernel size is KH × KW, the number of input image channels is IC, the number of convolution kernel input channels equals IC, and IC is a multiple of 32.

S2. The NNA computes common convolution by the formula

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=1}^{K} F_u(i), \qquad K = KH \cdot KW \cdot IC.$$

Write the $W_u$ data into the WRAM and the $F_u$ data into the FRAM; configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the number of convolution kernel input channels (IC must be a multiple of 32), the $F_u$ data bit width, and the $W_u$ data bit width; then call the NNMACG instruction to obtain the convolution result, i.e. $F_u * W_s$.

S3. Compute the depthwise convolution: the convolution kernel size is KH × KW, the number of input image channels is IC (a multiple of 32), and the number of convolution kernel input channels equals 1. The convolution kernel after input channel expansion is $\hat{W}_u$; the expanded kernel has IC input channels, and the fill value of the input channels other than the corresponding input channel is $W_t$.

Derivation of the NNA depthwise convolution formula, for the n-th output channel with $k = KH \cdot KW$ and $M = KH \cdot KW \cdot IC$:

$$F_u * \hat{W}_s(n) = \sum_{i=1}^{M} F_u(i)\left(\hat{W}_u(i) - 2^{wbit-1}\right) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=(n-1)k+1}^{nk} F_u(i) + \left(W_t - 2^{wbit-1}\right)\sum_{i \notin \text{channel}\,n} F_u(i).$$

If $W_t = 2^{wbit-1}$, the last term vanishes and the formula reduces to

$$F_u * \hat{W}_s(n) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\left(W_u(i) - 2^{wbit-1}\right) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i).$$

That is, when the weight data is common signed data, to compute the depthwise convolution result with the NNA the depthwise convolution kernel must undergo input channel expansion, and the values of the input channels other than the corresponding channel are all filled with $2^{wbit-1}$, where wbit is the weight data bit width.

S4. Write the $\hat{W}_u$ data into the WRAM and the $F_u$ data into the FRAM; configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the expanded convolution kernel input channel count IC (which must be a multiple of 32), the $F_u$ data bit width, and the $\hat{W}_u$ data bit width; then call the NNMACG instruction to obtain the depthwise convolution result, i.e. $F_u * \hat{W}_s$.
Thus, the advantage of the present application is that, by a simple method, fast depthwise convolution is accomplished through NNA 1.0 acceleration. In particular, given that the weight data is common signed data, the filling of the input channels other than the corresponding input channel is completed during expansion.
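The identity behind S3 and S4 is easy to check numerically. The sketch below emulates the NNA-style unsigned multiply-accumulate in NumPy and verifies it against the signed depthwise reference, reusing expand_depthwise_kernel from the earlier sketch (assumptions: stride 1, no padding, and a small IC for readability, whereas the real NNA requires IC to be a multiple of 32 and executes NNMACG in hardware):

```python
import numpy as np

rng = np.random.default_rng(0)
wbit, IC, KH, KW, IH, IW = 8, 4, 3, 3, 5, 5      # small illustrative sizes
F_u = rng.integers(0, 2**wbit, (IC, IH, IW))     # common unsigned input
W_u = rng.integers(0, 2**wbit, (IC, KH, KW))     # unsigned depthwise weights
W_s = W_u - 2**(wbit - 1)                        # common signed weights

W_hat = expand_depthwise_kernel(W_u, wbit)       # fill value 2**(wbit-1)

OH, OW = IH - KH + 1, IW - KW + 1                # stride 1, no pad
for n in range(IC):
    for i in range(OH):
        for j in range(OW):
            patch = F_u[:, i:i+KH, j:j+KW]
            # NNA-style result: unsigned MAC minus 2**(wbit-1) * sum(inputs)
            nna = np.sum(patch * W_hat[n]) - 2**(wbit - 1) * np.sum(patch)
            # reference: depthwise convolution with signed weights
            ref = np.sum(F_u[n, i:i+KH, j:j+KW] * W_s[n])
            assert nna == ref
print("expanded-kernel NNA result matches depthwise convolution")
```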
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and, together with the description, serve to explain it.
FIG. 1 is a schematic flow chart of the method of the present application.
Detailed Description
In order that the technical content and advantages of the present application may be more clearly understood, a further detailed description of the present application will now be made with reference to the accompanying drawings.
Because the NNA has certain hard constraints and only supports multi-channel convolution, its generality in practical applications is limited. To ensure generality while improving convolution speed, when depthwise convolution is encountered, NNA convolution acceleration is achieved by channel-expanding the convolution kernels: the number of channels of each expanded convolution kernel equals the number of input channels, and the number of convolution kernels equals the number of input channels.
NNA 1.0 actually supports fast multiply-accumulate of common unsigned inputs with common unsigned weights, whereas multi-channel convolution is the multiply-accumulate of common unsigned inputs with common signed weights. In practical use, the correct convolution result is quickly obtained by configuring the relevant NNA registers.

Assume the convolution kernel size is size, the number of input channels is IC, and the weight bit width is nw. The common signed weight is $W_s$ and the common unsigned weight is $W_u$. The conversion formula is $W_s = W_u - 2^{nw-1}$.
Multi-channel convolution process:

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_s(i) = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{nw-1}\sum_{i=1}^{K} F_u(i),$$

where $K = size \cdot IC$.
For depthwise convolution, since NNA 1.0 only supports multi-channel convolution, channel expansion must be performed on the depthwise convolution kernel.

For the n-th convolution kernel: the element values of its n-th channel equal the element values of the n-th convolution kernel before expansion, and the element values of the other channels are all $2^{nw-1}$, with $1 \le n \le IC$. Denote the expanded weight by $\hat{W}_u$.

The actual calculation process is:

$$F_u * \hat{W}_s(n) = \sum_{i=1}^{M} F_u(i)\left(\hat{W}_u(i) - 2^{nw-1}\right) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i),$$

where $k = size$ and $M = size \cdot IC$.
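As a one-element numeric check of this identity (values chosen arbitrarily): with $nw = 8$, a signed weight $W_s = -3$ is stored as the unsigned value $W_u = 125$, and for an input pixel $F_u = 10$,

$$F_u W_u - 2^{nw-1} F_u = 10 \cdot 125 - 128 \cdot 10 = 1250 - 1280 = -30 = 10 \cdot (-3) = F_u W_s.$$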
In summary, as shown in FIG. 1, the method comprises the following steps, with the convolution kernel size denoted KH × KW and the convolution kernel data bit width nw denoted wbit:

S1. Let the common unsigned input be $F_u$, the common signed weight $W_s$, and the common unsigned weight $W_u$. The conversion formula is $W_s = W_u - 2^{wbit-1}$, where wbit is the convolution kernel data bit width.

Common convolution: the convolution kernel size is KH × KW, the number of input image channels is IC, the number of convolution kernel input channels equals IC, and IC is a multiple of 32.

S2. The NNA computes common convolution by the formula

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=1}^{K} F_u(i), \qquad K = KH \cdot KW \cdot IC.$$

Write the $W_u$ data into the WRAM and the $F_u$ data into the FRAM; configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the number of convolution kernel input channels (IC must be a multiple of 32), the $F_u$ data bit width, and the $W_u$ data bit width; then call the NNMACG instruction to obtain the convolution result, i.e. $F_u * W_s$.

S3. Compute the depthwise convolution: the convolution kernel size is KH × KW, the number of input image channels is IC (a multiple of 32), and the number of convolution kernel input channels equals 1. The convolution kernel after input channel expansion is $\hat{W}_u$; the expanded kernel has IC input channels, and the fill value of the input channels other than the corresponding input channel is $W_t$.

Derivation of the NNA depthwise convolution formula, for the n-th output channel with $k = KH \cdot KW$ and $M = KH \cdot KW \cdot IC$:

$$F_u * \hat{W}_s(n) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=(n-1)k+1}^{nk} F_u(i) + \left(W_t - 2^{wbit-1}\right)\sum_{i \notin \text{channel}\,n} F_u(i).$$

If $W_t = 2^{wbit-1}$, the formula reduces to

$$F_u * \hat{W}_s(n) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i).$$

That is, when the weight data is common signed data, to compute the depthwise convolution result with the NNA the depthwise convolution kernel must undergo input channel expansion, and the values of the input channels other than the corresponding channel are all filled with $2^{wbit-1}$, where wbit is the weight data bit width.

S4. Write the $\hat{W}_u$ data into the WRAM and the $F_u$ data into the FRAM; configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the expanded convolution kernel input channel count IC (which must be a multiple of 32), the $F_u$ data bit width, and the $\hat{W}_u$ data bit width; then call the NNMACG instruction to obtain the depthwise convolution result, i.e. $F_u * \hat{W}_s$.
Thus, the key points of the application are:
1. Convolution kernel expansion: NNA 1.0 requires the number of convolution kernel channels to equal the number of input channels, while for depthwise convolution the number of convolution kernel channels equals 1, which is outside the range NNA 1.0 supports; NNA convolution acceleration is achieved by channel-expanding the convolution kernels.
2. To ensure that the convolution result obtained after kernel expansion is consistent with the depthwise convolution result, the fill values of the channel elements other than the corresponding channel must be calculated.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (7)

1. A method for implementing channel-by-channel convolution, characterized in that, when channel-by-channel convolution is encountered, NNA convolution acceleration is achieved by expanding the convolution kernels, the number of channels of each expanded multi-channel convolution kernel equals the number of input channels, and the number of convolution kernels equals the number of input channels; the NNA hardware constraints include that the convolution kernel size must be at most 3, the convolution stride must be at most 2, and the number of convolution kernel input channels must be a multiple of 32; when the NNA hardware constraints are not satisfied, the weight data undergoes special processing in order to achieve convolution acceleration with the NNA: when the input image data is common unsigned data and the weight data is common signed data, the fill values of the input channels other than the corresponding input channel are calculated while the depthwise convolution kernel is expanded.
2. The method for implementing channel-by-channel convolution according to claim 1, wherein the NNA supports fast multiply-accumulate of common unsigned inputs with common unsigned weights, the multi-channel convolution is the multiply-accumulate of common unsigned inputs with common signed weights, and in actual use the correct convolution result is obtained by configuring the relevant NNA registers: the NNA supports common convolution in which the number of convolution kernel input channels IC is a multiple of 32; to compute depthwise convolution with the NNA, the depthwise convolution kernel undergoes input channel expansion, after which the depthwise convolution is computed the way the NNA computes common convolution; and to obtain the correct depthwise convolution result, the fill values of the input channels other than the corresponding input channel are calculated so that the depthwise convolution formula is consistent with the NNA common convolution formula:

assuming the convolution kernel size is size, the number of input channels is IC, and the weight bit width is nw, the common unsigned input is $F_u$, the common signed weight is $W_s$, and the common unsigned weight is $W_u$, with conversion formula $W_s = W_u - 2^{nw-1}$; the multi-channel convolution process is

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{nw-1}\sum_{i=1}^{K} F_u(i),$$

where $K = size \cdot IC$.
3. The method for implementing channel-by-channel convolution according to claim 2, wherein, for channel-by-channel convolution, since the NNA only supports multi-channel convolution, the convolution kernel of the channel-by-channel convolution is channel-expanded: for the n-th convolution kernel, the element values of its n-th channel equal the element values of the n-th convolution kernel before expansion, and the element values of the other channels are all $2^{nw-1}$, with $1 \le n \le IC$; the expanded weight is $\hat{W}_u$, and the actual calculation process is

$$F_u * \hat{W}_s(n) = \sum_{i=1}^{M} F_u(i)\left(\hat{W}_u(i) - 2^{nw-1}\right) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i),$$

where $k = size$ and $M = size \cdot IC$.
4. The method for implementing channel-by-channel convolution according to claim 3, wherein the NNA convolution process comprises: writing the convolution kernel data into the WRAM, writing the input image data into the FRAM, setting the WRAM and FRAM read addresses, input data bit width, convolution kernel size, and pad and stride parameters through the NNA hardware registers, and calling the NNA hardware instruction NNMACG to obtain and output the convolution result; wherein the input image size is IC × IH × IW (input channels × input image height × input image width), the convolution kernel size is OC × IC × KH × KW (output channels × input channels × kernel height × kernel width), the output image size is OC × OH × OW (output channels × output image height × output image width), pad is divided into pad_top, pad_bottom, pad_left, and pad_right, and stride is divided into stride_y and stride_x; the output image size is computed as

$$OH = \left\lfloor \frac{IH + pad\_top + pad\_bottom - KH}{stride\_y} \right\rfloor + 1, \qquad OW = \left\lfloor \frac{IW + pad\_left + pad\_right - KW}{stride\_x} \right\rfloor + 1.$$
5. The method for implementing channel-by-channel convolution according to claim 4, wherein, due to NNA hardware constraints, the convolution kernel sizes KH and KW written into the WRAM are at most 3 and the number of convolution kernel input channels must be a multiple of 32; before calling the NNA for convolution acceleration, the input image data and convolution kernel data are processed to ensure that the number of input channels IC is a multiple of 32.
6. The method of claim 5, wherein the NNA convolution acceleration process comprises: writing the expanded convolution kernel data (i.e., with the number of input channels equal to IC, a multiple of 32) into the WRAM, writing the input image data into the FRAM, setting the WRAM and FRAM read addresses, input data bit width, convolution kernel size, and pad and stride parameters through the NNA hardware registers, and calling the NNA hardware instruction NNMACG to obtain and output the convolution result.
7. The method for implementing channel-by-channel convolution according to claim 6, further comprising the following steps, with the convolution kernel size denoted KH × KW and the convolution kernel data bit width nw denoted wbit:

S1. Let the common unsigned input be $F_u$, the common signed weight $W_s$, and the common unsigned weight $W_u$; the conversion formula is $W_s = W_u - 2^{wbit-1}$, where wbit is the convolution kernel data bit width. Common convolution: the convolution kernel size is KH × KW, the number of input image channels is IC, the number of convolution kernel input channels equals IC, and IC is a multiple of 32.

S2. The NNA computes common convolution by the formula

$$F_u * W_s = \sum_{i=1}^{K} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=1}^{K} F_u(i), \qquad K = KH \cdot KW \cdot IC;$$

write the $W_u$ data into the WRAM and the $F_u$ data into the FRAM, configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the number of convolution kernel input channels (IC must be a multiple of 32), the $F_u$ data bit width, and the $W_u$ data bit width, then call the NNMACG instruction to obtain the convolution result, i.e. $F_u * W_s$.

S3. Compute the depthwise convolution: the convolution kernel size is KH × KW, the number of input image channels is IC (a multiple of 32), and the number of convolution kernel input channels equals 1; the convolution kernel after input channel expansion is $\hat{W}_u$, the expanded kernel has IC input channels, and the fill value of the input channels other than the corresponding input channel is $W_t$. Derivation of the NNA depthwise convolution formula, for the n-th output channel with $k = KH \cdot KW$ and $M = KH \cdot KW \cdot IC$:

$$F_u * \hat{W}_s(n) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_u(i) - 2^{wbit-1}\sum_{i=(n-1)k+1}^{nk} F_u(i) + \left(W_t - 2^{wbit-1}\right)\sum_{i \notin \text{channel}\,n} F_u(i);$$

if $W_t = 2^{wbit-1}$, the formula reduces to

$$F_u * \hat{W}_s(n) = \sum_{i=(n-1)k+1}^{nk} F_u(i)\,W_s(i);$$

that is, when the weight data is common signed data, to compute the depthwise convolution result with the NNA the depthwise convolution kernel undergoes input channel expansion, and the values of the input channels other than the corresponding channel are all filled with $2^{wbit-1}$, where wbit is the weight data bit width.

S4. Write the $\hat{W}_u$ data into the WRAM and the $F_u$ data into the FRAM, configure the NNA registers, including the WRAM and FRAM read addresses, the convolution kernel size, the expanded convolution kernel input channel count IC (which must be a multiple of 32), the $F_u$ data bit width, and the $\hat{W}_u$ data bit width, then call the NNMACG instruction to obtain the depthwise convolution result, i.e. $F_u * \hat{W}_s$.
Priority Applications (1)

    CN202210320433.XA, priority date 2022-03-29, filing date 2022-03-29: Method for realizing channel-by-channel convolution

Publications (1)

    CN116957018A (en), publication date 2023-10-27

Family ID: 88451536

Country Status (1)

    CN: CN116957018A (en), Pending


Legal Events

    PB01: Publication
    SE01: Entry into force of request for substantive examination