CN116861143A - Method for realizing convolution of a small input map with small weights - Google Patents

Method for realizing convolution of a small input map with small weights

Info

Publication number
CN116861143A
CN116861143A (application CN202210312160.4A)
Authority
CN
China
Prior art keywords
data
load
width
input
wram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210312160.4A
Other languages
Chinese (zh)
Inventor
田凤彬 (Tian Fengbin)
于晓静 (Yu Xiaojing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd
Priority to CN202210312160.4A
Publication of CN116861143A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Image Processing (AREA)

Abstract

The application provides a method for realizing convolution of a small input map with small weights, comprising the following steps. S1, storing the data in a set order. Feature map storage order: 32, W, H, N, where 32 is a portion of the depth, W is the width, H is the height, and N is the number of 32-deep slabs, i.e., 32×N is the depth of the feature map. Weight storage order: a 32×32 block is contiguous first, then the data is contiguous over the convolution-kernel width, then over the kernel height, then over the input depth/32 count, and finally over the output depth/32 count; before processing, the weights (normally stored with the input depth contiguous first, then the kernel width and height, and finally the output depth) must be rearranged into this required order. S2, using simd instructions to load all data from ddr to fram and wram, 32 data per load. S3, realizing the convolution calculation. The method realizes the calculation of a small input feature map with small weights, accelerating the computation and improving efficiency.

Description

Method for realizing convolution of a small input map with small weights
Technical Field
The application relates to the technical field of image processing, in particular to a method for realizing convolution of a small input map with small weights.
Background
The T40 chip of Beijing Ingenic Semiconductor Co., Ltd. (hereafter the Ingenic T40 chip) is a low-power chip for AI deep learning. It has an independent convolution calculation unit and a unique simd instruction set, together with an oram store, a wram for storing weights, and a fram for storing input data. Accordingly, the data must be stored in wram and fram before the convolution calculation can be performed. For this chip, oram, wram, and fram are of fixed sizes: for example, wram is 288 x 1024 bytes, fram is 128 x 1024 bytes, and oram is 2048 x 1024 bytes. These assumed sizes are used in the calculations below.
All data is initially stored in ddr and must either be carried to oram with a dma instruction, or loaded into a custom register with a simd instruction and then carried to wram or fram with a special instruction.
Since this is a new chip, conventional algorithms, while usable, are inefficient, and existing methods cannot exploit its unique computing units and instructions. Input feature maps and weights come in different sizes, each requiring a different implementation method, and using an unsuitable algorithm drastically reduces efficiency.
In addition, the common terminology in the prior art is as follows:
1. Convolution kernel: the convolution kernel is a matrix used in image processing, serving as the parameter operated against the original image. It is typically a square matrix (e.g., 3×3) with a weight value for each cell in its region. Common shapes are 1×1, 3×3, 5×5, 7×7, 1×3, 3×1, 2×2, 1×5, 5×1, …
2. Convolution: the center of the convolution kernel is placed over the pixel to be calculated; the products of each kernel element and the image pixel value it covers are computed and summed, yielding the new pixel value for that location. This process is called convolution.
3. Feature map: the result obtained by convolution of input data is called a feature map; the result produced by a fully connected layer is also called a feature map. The feature map size is generally expressed as length × width × depth, or 1 × depth.
4. FRAM (Feature RAM): a RAM for feature maps, i.e., a memory that stores all or part of a feature map and supplies it directly to the hardware calculation unit. It belongs to the storage part of the computing unit; to use the computing unit, the feature map data must be placed in the FRAM.
5. WRAM (Weight RAM): a memory that stores all or part of the weights and supplies them directly to the hardware calculation unit. It belongs to the storage part of the computing unit; to use the computing unit, the weight data must be placed in the WRAM.
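To make the terminology above concrete, here is a minimal pure-Python sketch of a 2D convolution (valid mode, stride 1). The function name and the toy values are illustrative assumptions, not the chip's instruction set:

```python
def conv2d_valid(image, kernel):
    """Minimal 2D convolution ('valid' mode, stride 1): slide the kernel over
    the image, multiply element-wise, and sum to get each output pixel."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            out[y][x] = sum(image[y + dy][x + dx] * kernel[dy][dx]
                            for dy in range(kh) for dx in range(kw))
    return out

# A 4x4 input and a 3x3 kernel produce a 2x2 feature map.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # identity kernel: each output equals the covered center pixel
result = conv2d_valid(image, kernel)
```

With the identity kernel, each output pixel is simply the input pixel under the kernel center, which makes the sliding-window behavior easy to verify by hand.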
Disclosure of Invention
In order to solve the above problems in the prior art, the object of the present application is to design a special calculation method for this special situation: realizing the calculation of a small input feature map with small weights on the Ingenic T40 chip.
Specifically, the application provides a method for realizing convolution of a small input map with small weights, comprising the following steps:
s1, storing set data:
setting the storage mode of the feature map: the feature map data storage order is 32, W, H, N, where 32 is a portion of the depth, W is the width, H is the height, and N is the number of 32-deep slabs, i.e., 32×N is the depth of the feature map; the data is contiguous over the 32 channels first, then over the width, then over the height, and finally over the depth/32 count;
setting the storage mode of the weights: a 32×32 block is contiguous first, then the data is contiguous over the convolution-kernel width, then over the kernel height, then over the input depth/32 count, and finally over the output depth/32 count; before processing, the weights (normally stored with the input depth contiguous first, then the kernel width and height, and finally the output depth) must be rearranged into this required order;
s2, using simd instructions to load all data from ddr to fram and wram, 32 data per load:
s2.1, using simd instruction to load all data from ddr to fram, 32 data at a time:
loading into VR0, VR1 using a simd load data instruction;
loading data to fram using a fram load data instruction;
since the feature map is already stored in the required order and its data size fits entirely into fram, it can be stored directly in the default order until all the data has been stored;
s2.2, using simd instruction to load all data from ddr to wram, 32 data at a time:
loading into VR0, VR1 using a simd load data instruction;
loading data to the wram using a wram load data instruction;
since the weights are already stored in the required order and their data size fits entirely into wram, they can be stored directly in the default order until all the data has been stored; s3, realizing the convolution calculation:
to calculate the convolution, the start address of fram must be given (initially 0), and the start address of wram is also 0;
let the depth of the input feature map be 32 × in_ic32, where in_ic32 is the input depth divided by 32,
the input width is in_width, and the input height is in_height;
the output feature map depth is 32 × out_ic32, where out_ic32 is the output depth divided by 32,
the output width is out_width, and the output height is out_height;
the convolution kernel has width kernel_w and height kernel_h;
the width-direction step of the convolution kernel is stride_w, and the height-direction step is stride_h;
the output and input feature map widths satisfy in_width = out_width × stride_w, and the heights satisfy in_height = out_height × stride_h; if the input feature map does not satisfy these relations, it must be padded with 0 according to the specific convolution requirements until it reaches the required width and height; for example, when the convolution kernel is 3, the step is 1, and the output feature map has the same width and height as the input, the input feature map must be padded with 0, and the padding may be distributed evenly left, right, top, and bottom, or applied to only one side, according to the user's requirements;
the generated result is saved in vrd.
The method is suitable for the case where the input feature map is small, i.e., the feature map data fits in fram; the weights are small, i.e., the weight data fits in wram; the data is 8-bit; and neither the length nor the width of the convolution kernel exceeds 3. The input depth is required to be a multiple of 32, and the output depth is also required to be a multiple of 32; if some layer input depths in the model are not multiples of 32, they must be padded to multiples of 32, and the corresponding weights are padded in the same way.
The method includes the following instructions:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
the input variable fram_id is the starting address used in fram, wram_id is the starting address used in wram, ic32_num is the number of 32-deep input slabs in the calculation, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the step of the convolution calculation in the x direction, stride_y is the step in the y direction, feature_w is the width of the input feature map, feature_h is the height of the input feature map, and vrd receives the result;
description of use:
each call calculates 4 pixel results; the computing unit works in depth-32 units, so each result is 32 deep and 4 pixel results are generated; if ic32_num = 1, the input depth is 32×1 and 4 pixels with output depth 32 are generated; if ic32_num = 2, the input depth is 32×2 and 4 pixels with output depth 32 are generated; if ic32_num = 3, the input depth is 32×3 and 4 pixels with output depth 32 are generated; the minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4; the width of fram, i.e., how many pixels of the input feature map are loaded, is a parameter of the convolution calculation instruction, currently set to feature_w;
b) simd load data instruction:
set as ingenic_load(indata, VR0, m)
the data to be loaded is pointed to by indata; the instruction loads 128 bits of data starting from position m of indata in memory (16 values if the data is 8-bit, 8 values if 16-bit, and 4 values if 32-bit) into the destination VR register; m is counted in bytes, i.e., units of 8 bits; VR0 is a simd VR register storing at most 512 bits of data;
c) A frame load data instruction:
set to ingenic_vr2fram(VR0, fram_load_id, num)
the input variable VR0 is the input data, fram_load_id is the start address loaded into fram, and num is 0 or 1: when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id = fram_load_id + 32 after the instruction ends;
d) wram load data instruction:
set to ingenic_vr2wram(VR0, wram_load_id, num)
the input variable VR0 is the input data, wram_load_id is the start address loaded into wram, and num is 0 or 1: when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id = wram_load_id + 32 after the instruction ends.
In step S2.1, the simd load data instruction is used to load into VR0 and VR1:
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR1,1)
ingenic_load(indata,VR1,1)
using a fram load data instruction, loading data to fram:
ingenic_vr2fram(VR0,fram_load_id,1)
ingenic_vr2fram(VR1,fram_load_id,1);
in step S2.2, the simd load data instruction is used to load into VR0 and VR1:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
using a wram load data instruction, load data to wram:
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)。
In step S2.1, when fram cannot hold the data, this method cannot be used; in step S2.2, when wram cannot hold the data, this method cannot be used.
In the step S3, the convolution calculation is performed, and the order of generation is as follows:
the first 32 × out_width × out_height block is generated,
then the second 32 × out_width × out_height block,
……
until the last 32 × out_width × out_height block;
within each 32 × out_width × out_height block, the 32 × out_width of the first row is generated first, then the 32 × out_width of the second row, until the 32 × out_width of the last row;
within each 32 × out_width, the first 32 × 4 is generated first, then the second 32 × 4, until the last 32 × 4, which completes the 32 × out_width.
The specific implementation of the step S3 is as follows:
s3.1, initializing wram_id=0;
s3.2, initializing ocnum_i = 0; if ocnum_i < out_ic32 holds, continue execution and ocnum_i++; otherwise, exit this step;
s3.3, initializing ydir_i = 0; if ydir_i < out_height holds, continue execution and ydir_i++; otherwise, exit this step;
executing fram_id = (out_width × stride_w) × (ydir_i × stride_h) × 32 × in_ic32;
executing vrd = out_width × ydir_i × 32 × in_ic32;
s3.4, initializing xdir_i = 0; if xdir_i < out_width holds, continue execution and xdir_i += 4; otherwise, exit this step;
executing fram_id = fram_id + 32 × 4 × stride_w;
executing ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, in_width, in_height, vrd);
executing vrd = vrd + 32 × 4.
Thus, the advantage of the present application is that the designed method realizes the calculation of a small input feature map with small weights, accelerating the computation and improving efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and together with the description serve to explain the application.
Fig. 1 is a flow chart of the method of the present application.
Fig. 2 is a code schematic for the method of the present application.
Detailed Description
In order that the technical content and advantages of the present application may be more clearly understood, a further detailed description of the present application will now be made with reference to the accompanying drawings.
The application relates to a method for realizing convolution of a small input diagram with small weight, which comprises the following steps:
1. Applicability requirements. The method is used on the Ingenic T40 chip when the feature map fits in oram, the weights are relatively small and fit in wram, the data is 8-bit, and the feature map data required to generate 8 pixels per calculation fits entirely in fram. The instructions are as follows:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
the input variable fram_id is the starting address used in fram, wram_id is the starting address used in wram, ic32_num is the number of 32-deep input slabs in the calculation, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the step of the convolution calculation in the x direction, stride_y is the step in the y direction, feature_w is the width of the input feature map, and feature_h is the height of the input feature map. vrd receives the result.
Description of use: 4 pixel results are calculated per call. The computing unit works in depth-32 units, so each result is 32 deep and 4 pixel results are generated. If ic32_num = 1, the input depth is 32×1 and 4 pixels with output depth 32 are generated. If ic32_num = 2, the input depth is 32×2 and 4 pixels with output depth 32 are generated. If ic32_num = 3, the input depth is 32×3 and 4 pixels with output depth 32 are generated. The minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4. The width of fram, i.e., how many pixels of the input feature map are loaded, is a parameter of the convolution calculation instruction, currently set to feature_w.
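The behavior described above can be modeled in plain Python. The function below is a hypothetical software model of one convolution-instruction call as described (4 adjacent output pixels, each 32 deep, accumulated over ic32_num input-depth slabs); the name conv_bit8_model and the nested-list argument layout are assumptions for illustration, not the chip intrinsic:

```python
def conv_bit8_model(feature, weights, ic32_num, kernel_w, kernel_h,
                    stride_x, stride_y, out_x0, out_y0):
    """Model one convolution-instruction call: produce 4 adjacent output
    pixels (depth 32 each) starting at output column out_x0, row out_y0.
    feature[ic][y][x][c]          -- input, c in 0..31, ic in 0..ic32_num-1
    weights[oc][ic][ky][kx][c]    -- one 32-deep output slab (oc in 0..31)
    Returns a list of 4 pixels, each a list of 32 accumulated sums."""
    pixels = []
    for p in range(4):                      # 4 output pixels per call
        ox = out_x0 + p
        pixel = []
        for oc in range(32):                # output depth 32 per call
            acc = 0
            for ic in range(ic32_num):      # accumulate over depth slabs
                for ky in range(kernel_h):
                    for kx in range(kernel_w):
                        for c in range(32):
                            acc += (feature[ic][out_y0 * stride_y + ky]
                                           [ox * stride_x + kx][c]
                                    * weights[oc][ic][ky][kx][c])
            pixel.append(acc)
        pixels.append(pixel)
    return pixels
```

With a 1×1 kernel, ic32_num = 1, and all-ones inputs and weights, each of the 4 output pixels is a vector of 32 sums, each equal to 32 (the 32 input channels multiplied by unit weights).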
b) simd load data instruction:
ingenic_load(indata,VR0,m)
the data to be loaded is pointed to by indata; the instruction loads 128 bits of data starting from position m of indata in memory (16 values if the data is 8-bit, 8 values if 16-bit, and 4 values if 32-bit) into the destination VR register. m is counted in bytes, i.e., units of 8 bits. VR0 is a simd VR register storing at most 512 bits of data.
c) A frame load data instruction:
ingenic_vr2fram(VR0,fram_load_id,num)
the input variable VR0 is the input data, fram_load_id is the start address loaded into fram, and num is 0 or 1: when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id = fram_load_id + 32 after the instruction ends.
d) wram load data instruction:
ingenic_vr2wram(VR0,wram_load_id,num)
the input variable VR0 is the input data, wram_load_id is the start address loaded into wram, and num is 0 or 1: when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id = wram_load_id + 32 after the instruction ends.
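The num flag of the two store instructions controls whether the load address post-increments. A toy Python model of the described semantics (the class name RamModel and method vr2ram are illustrative assumptions; each store writes one 32-byte vector):

```python
class RamModel:
    """Toy model of the fram/wram store path: each store writes 32 bytes at
    load_id; num=1 post-increments load_id by 32, num=0 leaves it unchanged."""
    def __init__(self, size):
        self.mem = [0] * size
        self.load_id = 0

    def vr2ram(self, vr32, num):
        assert len(vr32) == 32 and num in (0, 1)
        self.mem[self.load_id:self.load_id + 32] = vr32
        if num == 1:
            self.load_id += 32   # auto-advance to the next 32-byte slot

fram = RamModel(128 * 1024)      # 128 KB fram, per the sizes given earlier
fram.vr2ram(list(range(32)), 1)  # store and advance
fram.vr2ram([7] * 32, 0)         # store at the advanced address, no advance
```

The num=1 form lets a loop of loads fill fram or wram sequentially without recomputing addresses, which matches the "store directly in the default order" loading described in steps S2.1 and S2.2.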
2. Convolution calculation
The method is suitable for the case where the input feature map is small (the feature map data fits in fram), the weights are small (the weight data fits in wram), and neither the length nor the width of the convolution kernel exceeds 3. The input depth is required to be a multiple of 32, and the output depth is also a multiple of 32. If some layer input depths in the model are not multiples of 32, they must be padded to multiples of 32, and the corresponding weights are padded in the same way.
Specifically, as shown in fig. 1, the method includes the steps of:
s1, storing data
The storage mode of the feature map: the feature map data storage order is 32, W, H, N, where 32 is a portion of the depth, W is the width, H is the height, and N is the number of 32-deep slabs, i.e., 32×N is the depth of the feature map. The data is contiguous over the 32 channels first, then over the width, then over the height, and finally over the depth/32 count.
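The 32, W, H, N order can be sketched as a flat-index formula. The helper below (a hypothetical name, not part of the chip toolchain) rearranges a conventional depth-major tensor into that order, making the innermost-32-channels layout explicit:

```python
def pack_feature_map(src, width, height, depth):
    """Rearrange src[c][y][x] (depth-major) into the flat 32,W,H,N order:
    index = ((n*H + y)*W + x)*32 + (c % 32), where n = c // 32, so the 32
    channels of a slab are innermost, then width, then height, then slabs."""
    assert depth % 32 == 0, "input depth must be a multiple of 32"
    out = [0] * (depth * height * width)
    for c in range(depth):
        n, c32 = divmod(c, 32)
        for y in range(height):
            for x in range(width):
                out[((n * height + y) * width + x) * 32 + c32] = src[c][y][x]
    return out
```

For a 2-wide, 1-high, depth-32 map, consecutive output positions 0..31 hold the 32 channels of pixel x=0, and positions 32..63 hold the 32 channels of pixel x=1, matching "contiguous over the 32 channels first, then over the width".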
The weights are stored so that a 32×32 block is contiguous first, then the data is contiguous over the convolution-kernel width, then over the kernel height, then over the input depth/32 count, and finally over the output depth/32 count. Before processing, the weights (normally stored with the input depth contiguous first, then the kernel width and height, and finally the output depth) must be rearranged into this required order.
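A sketch of the described weight reordering follows. The text does not specify which channel varies fastest inside the 32×32 tile, so the input-channel-fastest interior ordering here is an assumption, as is the helper name pack_weights:

```python
def pack_weights(w, kernel_w, kernel_h, in_depth, out_depth):
    """Rearrange w[oc][ic][ky][kx] into the described order: innermost a
    32x32 tile (assumed input-channel-fastest), then kernel width, then
    kernel height, then in_depth/32 slabs, then out_depth/32 slabs."""
    assert in_depth % 32 == 0 and out_depth % 32 == 0
    out = []
    for oc32 in range(out_depth // 32):          # output-depth/32, outermost
        for ic32 in range(in_depth // 32):       # input-depth/32
            for ky in range(kernel_h):           # kernel height
                for kx in range(kernel_w):       # kernel width
                    for o in range(32):          # 32x32 tile ...
                        for i in range(32):      # ... assumed ic-fastest
                            out.append(w[oc32 * 32 + o][ic32 * 32 + i][ky][kx])
    return out
```

For a 1×1 kernel with 32 input and 32 output channels, the packed stream is exactly one 32×32 tile of 1024 values, so the tile structure can be checked directly.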
S2, using simd to load all data from ddr to fram and wram, 32 data at a time:
s2.1, using simd to load all data from ddr to fram, 32 data at a time:
Use the simd load data instruction to load into VR0 and VR1:
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR1,1)
ingenic_load(indata,VR1,1)
the data is loaded to the fram using a fram load data instruction.
ingenic_vr2fram(VR0,fram_load_id,1)
ingenic_vr2fram(VR1,fram_load_id,1)
Since the feature map is already stored in the required order and its data size fits entirely into fram, it can be stored directly in the default order until all the data has been stored. When fram cannot hold the data, this method cannot be used.
S2.2, using simd to load all data from ddr to wram, 32 data at a time:
loading into VR0, VR1 using simd load data instructions
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
A wram load data instruction is used to load data into the wram.
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)
Because the weights are already stored in the required order and their data size fits entirely into wram, they can be stored directly in the default order until all the data has been stored. When wram cannot hold the data, this method cannot be used.
S3, realizing convolution calculation.
To calculate the convolution, the start address of fram must be given (initially 0), and the start address of wram is also 0. Let the depth of the input feature map be 32 × in_ic32, where in_ic32 is the input depth divided by 32; the input width is in_width and the input height is in_height. The output feature map depth is 32 × out_ic32, where out_ic32 is the output depth divided by 32; the output width is out_width and the output height is out_height. The convolution kernel has width kernel_w and height kernel_h; its width-direction step is stride_w and its height-direction step is stride_h. The output and input feature map widths satisfy in_width = out_width × stride_w, and the heights satisfy in_height = out_height × stride_h. If the input feature map does not satisfy these relations, it must be padded with 0 according to the specific convolution requirements until it reaches the required width and height (for example, when the convolution kernel is 3, the step is 1, and the output feature map has the same width and height as the input, the input feature map must be padded with 0; the padding may be distributed evenly left, right, top, and bottom, or applied to only one side, according to the user's requirements). The generated result is saved in vrd.
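The size relations and zero-padding above can be sketched as follows (one channel shown for brevity; the helper name pad_to_match and its pad_left/pad_top parameters are illustrative assumptions):

```python
def pad_to_match(fm, out_w, out_h, stride_w, stride_h, pad_left=0, pad_top=0):
    """Zero-pad fm[y][x] so its size matches the required relations
    in_width = out_width*stride_w and in_height = out_height*stride_h.
    pad_left/pad_top choose how much padding precedes the data; the rest
    follows it (the text allows even or one-sided padding)."""
    need_w = out_w * stride_w
    need_h = out_h * stride_h
    h, w = len(fm), len(fm[0])
    assert pad_left <= need_w - w and pad_top <= need_h - h
    padded = [[0] * need_w for _ in range(need_h)]
    for y in range(h):
        for x in range(w):
            padded[y + pad_top][x + pad_left] = fm[y][x]
    return padded
```

For instance, a 2×2 map padded to 3×3 (stride 1) with one row on top and one column on the left reproduces the "pad to the required width and height" behavior.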
The convolution calculation proceeds in the following order: the first 32 × out_width × out_height block is generated, then the second, until the last 32 × out_width × out_height block. Within each 32 × out_width × out_height block, the 32 × out_width of the first row is generated first, then that of the second row, until the 32 × out_width of the last row. Within each 32 × out_width, the first 32 × 4 is generated, then the second 32 × 4, until the last 32 × 4, which completes the 32 × out_width.
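The generation order above amounts to three nested loops, with each innermost step corresponding to one instruction call producing 32 × 4 values. A hypothetical enumeration (assuming out_width is padded to a multiple of 4, which the 4-pixels-per-call minimum implies):

```python
def generation_order(out_ic32, out_height, out_width):
    """Enumerate the described output order: one 32-deep slab at a time,
    row by row within a slab, 4 pixels (one instruction call) at a time.
    Yields (slab, row, first_pixel_of_group_of_4) tuples in order."""
    assert out_width % 4 == 0, "out_width assumed padded to a multiple of 4"
    order = []
    for slab in range(out_ic32):               # 1st..last 32 x out_w x out_h
        for row in range(out_height):          # 32 x out_width per row
            for x4 in range(0, out_width, 4):  # 32 x 4 per instruction call
                order.append((slab, row, x4))
    return order
```

The total number of tuples, out_ic32 × out_height × (out_width / 4), is the number of ingenic_conv_bit8 calls the loop nest in step S3 issues.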
As shown in fig. 2, step S3 is specifically implemented as follows:
s3.1, initializing wram_id=0;
s3.2, initializing ocnum_i = 0; if ocnum_i < out_ic32 holds, continue execution and ocnum_i++; otherwise, exit this step;
s3.3, initializing ydir_i = 0; if ydir_i < out_height holds, continue execution and ydir_i++; otherwise, exit this step;
executing fram_id = (out_width × stride_w) × (ydir_i × stride_h) × 32 × in_ic32;
executing vrd = out_width × ydir_i × 32 × in_ic32;
s3.4, initializing xdir_i = 0; if xdir_i < out_width holds, continue execution and xdir_i += 4; otherwise, exit this step;
executing fram_id = fram_id + 32 × 4 × stride_w;
executing ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, in_width, in_height, vrd);
executing vrd = vrd + 32 × 4.
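Steps S3.1-S3.4 can be sketched as a Python loop nest that only traces the fram_id and vrd addresses. The multiplication signs are reconstructed from the garbled formulas as given in the text; whether fram_id is bumped before or after each convolution call is ambiguous in the listed order, and this sketch follows the text literally (bump first). Treat it as an address-sequence model, not a verified implementation:

```python
def s3_address_trace(out_ic32, out_height, out_width, in_ic32,
                     stride_w, stride_h):
    """Trace the (fram_id, wram_id, vrd) triple passed to each convolution
    call by the S3.1-S3.4 loop nest, with explicit multiplications."""
    trace = []
    wram_id = 0                                       # S3.1
    for ocnum_i in range(out_ic32):                   # S3.2 loop
        for ydir_i in range(out_height):              # S3.3 loop
            fram_id = (out_width * stride_w) * (ydir_i * stride_h) * 32 * in_ic32
            vrd = out_width * ydir_i * 32 * in_ic32
            for xdir_i in range(0, out_width, 4):     # S3.4 loop, 4 pixels/call
                fram_id += 32 * 4 * stride_w          # bump as listed in text
                trace.append((fram_id, wram_id, vrd)) # one conv call here
                vrd += 32 * 4
    return trace
```

For a single depth slab, a 4-wide, 2-high output, and unit strides, the trace shows one call per row with vrd advancing by one 32 × out_width row between them.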
In summary, the key points of the present application are as follows:
the method of storing feature maps, the method of storing weights, the method of loading data into fram and wram, and the method of implementing the convolution.
The above description covers only the preferred embodiments of the present application and is not intended to limit it; those skilled in the art can make various modifications and variations to the embodiments. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in its protection scope.

Claims (8)

1. A method for implementing convolution of a small input map with small weights, the method comprising the steps of:
s1, storing set data:
setting the storage mode of the feature map: the feature map data storage order is 32, W, H, N, where 32 is a portion of the depth, W is the width, H is the height, and N is the number of 32-deep slabs, i.e., 32×N is the depth of the feature map; the data is contiguous over the 32 channels first, then over the width, then over the height, and finally over the depth/32 count;
setting the storage mode of the weights: a 32×32 block is contiguous first, then the data is contiguous over the convolution-kernel width, then over the kernel height, then over the input depth/32 count, and finally over the output depth/32 count; before processing, the weights (normally stored with the input depth contiguous first, then the kernel width and height, and finally the output depth) must be rearranged into this required order;
s2, using simd instructions to load all data from ddr to fram and wram, 32 data per load:
s2.1, using simd instruction to load all data from ddr to fram, 32 data at a time:
loading into VR0, VR1 using a simd load data instruction;
loading data to fram using a fram load data instruction;
since the feature map is already stored in the required order and its data size fits entirely into fram, it can be stored directly in the default order until all the data has been stored; s2.2, using a simd instruction to load all data from ddr to wram, 32 data at a time:
loading into VR0, VR1 using a simd load data instruction;
loading data to the wram using a wram load data instruction;
since the weights are already stored in the required order and their data size fits entirely into wram, they can be stored directly in the default order until all the data has been stored; s3, realizing the convolution calculation:
to calculate the convolution, the start address of fram must be given (initially 0), and the start address of wram is also 0;
let the depth of the input feature map be 32×in_ic32, where in_ic32 is the input depth divided by 32,
the input width is in_width, and the input height is in_height;
the output feature map depth is 32×out_ic32, where out_ic32 is the output depth divided by 32,
the output width is out_width, and the output height is out_height;
the convolution kernel has width kernel_w and height kernel_h;
the width-direction stride of the convolution kernel is stride_w, and the height-direction stride is stride_h;
the width of the input feature map is related to that of the output feature map by in_width = out_width × stride_w, and the heights by in_height = out_height × stride_h; if the actual input feature map does not match these sizes, it must be zero-padded to this width and height according to the convolution requirements; the generated results are stored in vrd.
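The size relations and zero-padding above can be sketched minimally as follows (hypothetical helper names; padding is shown on the right/bottom only, while the claim leaves the exact placement to the convolution requirements):

```python
def required_input(out_width, out_height, stride_w, stride_h):
    """Input size implied by in_width = out_width * stride_w and
    in_height = out_height * stride_h."""
    return out_width * stride_w, out_height * stride_h

def zero_pad(rows, target_w, target_h):
    """Pad a 2D grid (list of rows) with zeros on the right and bottom
    until it reaches the target width and height."""
    padded = [row + [0] * (target_w - len(row)) for row in rows]
    padded += [[0] * target_w for _ in range(target_h - len(padded))]
    return padded

w, h = required_input(4, 3, 2, 2)          # implied input size (8, 6)
grid = zero_pad([[1, 1], [1, 1]], w, h)    # a 2x2 input padded up to 8x6
assert (len(grid), len(grid[0])) == (6, 8)
assert grid[0][:3] == [1, 1, 0]
```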
2. The method for realizing the convolution of a small input graph with small weights according to claim 1, wherein the method is suitable for the situations in which the input feature map data fit into fram and the weights fit into wram, the data are 8-bit, and neither the height nor the width of the convolution kernel exceeds 3; meanwhile, the input depth and the output depth are both required to be multiples of 32; if the input depth of some layer in the model is not a multiple of 32, it must be zero-filled up to a multiple of 32, and the corresponding weights are filled in the same way.
3. A method of implementing a small-input-graph, small-weight convolution according to claim 1, the method comprising the following instructions:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
the input variable fram_id is the starting address used in fram, wram_id is the starting address used in wram, ic32_num is the number of input depth/32 groups to accumulate, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the stride of the convolution calculation in the x direction, stride_y is the stride in the y direction, feature_w is the width of the input feature map, feature_h is the height of the input feature map, and vrd is the result;
description of use:
each call computes the results of 4 pixel points; the computation unit is depth 32 and each generated result has depth 32, with 4 pixel results generated per call; if ic32_num=1, an input depth of 32×1 is accumulated and 4 pixels of output depth 32 are generated; if ic32_num=2, an input depth of 32×2 is accumulated and 4 pixels of output depth 32 are generated; if ic32_num=3, an input depth of 32×3 is accumulated and 4 pixels of output depth 32 are generated; the minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4; setting the fram width, i.e. how many pixels of the input feature map are loaded, belongs to the parameter setting of the convolution calculation instruction, and the width processed is currently set to feature_w;
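The semantics described above — one call accumulating over the whole input depth and producing 4 consecutive pixels of output depth 32 — can be sketched as a pure-Python reference. This is an assumed model of the instruction, not the hardware implementation:

```python
def conv_bit8_reference(feature, weight, x0, y0,
                        kernel_w, kernel_h, stride_x, stride_y):
    """Assumed reference for one ingenic_conv_bit8 call:
    feature[c][y][x] is the input feature map, weight[o][c][ky][kx] the
    kernel; one call produces 4 consecutive output pixels, each of
    depth 32, accumulating over the whole input depth (32 * ic32_num)."""
    in_depth = len(feature)
    out = []
    for p in range(4):                      # 4 pixel results per call
        x = x0 + p * stride_x
        pix = []
        for o in range(32):                 # output depth of one result
            acc = 0
            for c in range(in_depth):
                for ky in range(kernel_h):
                    for kx in range(kernel_w):
                        acc += weight[o][c][ky][kx] * feature[c][y0 + ky][x + kx]
            pix.append(acc)
        out.append(pix)
    return out

# 1x1 kernel over an all-ones input of depth 32: each output value is 32.
feature = [[[1] * 8 for _ in range(3)] for _ in range(32)]
weight = [[[[1]] for _ in range(32)] for _ in range(32)]
out = conv_bit8_reference(feature, weight, 0, 0, 1, 1, 1, 1)
assert len(out) == 4 and out[0][0] == 32
```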
b) simd load data instruction:
defined as ingenic_load(indata, VR0, m)
the input is the data to be loaded, whose current position is pointed to by indata; 128 bits of data are loaded from the position in memory that indata points to,
so 16 values are loaded if the data are 8-bit, 8 if 16-bit, and 4 if 32-bit; the data are loaded into the destination register; m is counted in bytes, i.e. units of 8 bits; VR0 is a simd VR register and holds at most 512 bits of data;
c) A frame load data instruction:
defined as ingenic_vr2fram(VR0, fram_load_id, num)
the input variable VR0 is the input data, fram_load_id is the starting address loaded into fram, and num is 0 or 1: when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id = fram_load_id + 32 after the instruction ends;
d) wram load data instruction:
defined as ingenic_vr2wram(VR0, wram_load_id, num)
the input variable VR0 is the input data, wram_load_id is the starting address loaded into wram, and num is 0 or 1: when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id = wram_load_id + 32 after the instruction ends.
4. The method of claim 3, wherein,
in the step S2.1, the simd load data instruction is used to load into VR0, VR 1:
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR0,1)
ingenic_load(indata,VR1,1)
ingenic_load(indata,VR1,1)
using a fram load data instruction, loading data to fram:
ingenic_vr2fram(VR0,fram_load_id,1)
ingenic_vr2fram(VR1,fram_load_id,1);
in the step S2.2, the simd load data instruction is used to load into VR0, VR 1:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
using a wram load data instruction, load data to wram:
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)。
5. The method according to claim 1, wherein, in step S2.1, the method cannot be used when the data do not fit into fram; in step S2.2, the method cannot be used when the data do not fit into wram.
6. The method for realizing the convolution of a small input graph with small weights according to claim 3, wherein in step S3 the convolution is calculated in the following order of generation:
the first 32×out_width×out_height is generated,
then the second 32×out_width×out_height,
……
until the last 32×out_width×out_height;
within each 32×out_width×out_height, the first row's 32×out_width is generated first, then the second row's 32×out_width, until the last row's 32×out_width;
for each 32×out_width, the first 32×4 is generated first, then the second 32×4, until the last 32×4, which completes the 32×out_width.
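The generation order described in this claim — output-depth groups outermost, then rows, then 4-pixel tiles — can be enumerated directly (hypothetical helper name):

```python
def generation_order(out_ic32, out_height, out_width):
    """Order in which 32x4 output tiles are produced: outermost over the
    out_ic32 depth groups, then over rows, then over 4-pixel tiles
    within a row."""
    order = []
    for oc in range(out_ic32):
        for y in range(out_height):
            for x in range(0, out_width, 4):
                order.append((oc, y, x))
    return order

tiles = generation_order(2, 2, 8)
assert tiles[0] == (0, 0, 0)   # first tile of the first row
assert tiles[1] == (0, 0, 4)   # next 32x4 tile in the same row
assert tiles[2] == (0, 1, 0)   # then the second row begins
```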
7. The method for realizing the convolution of a small input graph with small weights according to claim 6, wherein step S3 is implemented specifically as follows:
S3.1, initializing wram_id=0;
S3.2, initializing ocnum_i=0; if ocnum_i < out_ic32 holds, continue execution and ocnum_i++; otherwise exit this step;
S3.3, initializing ydir_i=0; if ydir_i < out_height holds, continue execution and ydir_i++; otherwise exit this step;
executing fram_id = (out_width × stride_w) × (ydir_i × stride_h) × 32 × in_ic32;
executing vrd = out_width × ydir_i × 32 × in_ic32;
S3.4, initializing xdir_i=0; if xdir_i < out_width holds, continue execution and xdir_i += 4; otherwise exit this step;
executing fram_id = fram_id + 32 × 4 × stride_w;
executing ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, in_width, in_height, vrd);
executing vrd = vrd + 32 × 4.
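The loop nest of steps S3.1–S3.4 can be transcribed as a sketch that records the (fram_id, wram_id, vrd) arguments of each convolution call. The transcription is literal: the underscores in the claim's formulas are read as multiplications, and fram_id is incremented before each call exactly as the steps are ordered, both of which are assumptions about the intended semantics:

```python
def schedule(out_ic32, out_height, out_width, in_ic32, stride_w, stride_h):
    """Address schedule for S3.1-S3.4, transcribed literally from the
    claim.  Returns one (fram_id, wram_id, vrd) tuple per convolution
    call; wram_id stays 0 here as the claim only initializes it."""
    calls = []
    wram_id = 0                                    # S3.1
    for ocnum_i in range(out_ic32):                # S3.2
        for ydir_i in range(out_height):           # S3.3
            fram_id = (out_width * stride_w) * (ydir_i * stride_h) * 32 * in_ic32
            vrd = out_width * ydir_i * 32 * in_ic32
            for xdir_i in range(0, out_width, 4):  # S3.4
                fram_id += 32 * 4 * stride_w       # advance before the call
                calls.append((fram_id, wram_id, vrd))
                vrd += 32 * 4                      # advance the result address
    return calls

calls = schedule(1, 2, 8, 1, 1, 1)
assert len(calls) == 4            # 2 rows x 2 tiles of 4 pixels
assert calls[0] == (128, 0, 0)
```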
8. The method for realizing the convolution of a small input graph with small weights according to claim 1, wherein, when the input feature map is zero-padded according to the convolution requirement, if the convolution kernel is 3, the stride is 1, and the output feature map has the same width and height as the input feature map, then the input feature map needs to be padded with zeros; the padding may be applied equally on all sides of the input feature map, or only on one side according to the user's requirement.
CN202210312160.4A 2022-03-28 2022-03-28 Method for realizing convolution of small input diagram and small weight Pending CN116861143A (en)


Publications (1)

Publication Number Publication Date
CN116861143A true CN116861143A (en) 2023-10-10

Family

ID=88225500




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination