CN114581281A

CN114581281A - Optimization method based on first layer 4bit convolution calculation

Info

Publication number: CN114581281A
Application number: CN202011373523.2A
Authority: CN
Inventors: 田凤彬; 于晓静
Original assignee: Beijing Ingenic Semiconductor Co Ltd
Current assignee: Beijing Ingenic Semiconductor Co Ltd
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-06-03
Anticipated expiration: 2040-11-30
Also published as: CN114581281B

Abstract

The invention provides an optimization method based on first-layer 4-bit convolution calculation, which implements a complete convolution calculation for an input 3-channel color image with an output data depth of 16. The simd instruction operations are placed in the innermost loop, where the three image channels at three consecutive width positions (three groups of depth data) are treated as one group of data for processing and calculation. The convolution kernel data are stored in the order in which the convolution calculation consumes them: starting from the conventionally stored kernel data, every two adjacent rows, i.e. two groups of data along the output depth, are stored interleaved. Eight consecutive rows are interleaved pairwise, and each datum of the 9th row is interleaved with 0, so that the data are equivalent to ten rows. The optimization is simple: adding the relevant simd instructions in the innermost loop is enough to double the operation speed.

Description

Optimization method based on first layer 4bit convolution calculation

Technical Field

The invention relates to the technical field of image recognition, in particular to an optimization method based on first-layer 4-bit convolution calculation.

Background

With the development of technology, image recognition is applied ever more widely, and many optimization methods for image recognition exist. For convolution calculation in particular, current optimization methods include those designed around the simd instruction set of the T and X series chips of Beijing Ingenic (Junzheng), such as the T30 and T31. Such algorithms are suited to vector instructions. The registers of the T30 and T31 are 128 bits wide and limited in number, so the number of registers must be taken into account in an optimized design. On Beijing Ingenic chips, a plain C implementation runs very slowly. In addition, the prior-art algorithm is designed for a first layer whose input data is a color image, i.e. three-channel data. The convolution kernel data are limited to 4 bits but occupy 8 bits each when stored. The depth of the output data, i.e. the feature map, is 16. The convolution uses a step size of 2 and a convolution kernel of width 3 and height 3.

Technical terms commonly used in the prior art include:

1. Simd instruction: single instruction, multiple data, i.e. one instruction operates on multiple data elements at once, which increases the running speed of a program. More plainly, it is a computation on vectors. The specific instruction set differs from chip to chip.

2. Convolution kernel: the parameters used to operate on the original image during image processing. A convolution kernel is typically a small matrix with a given number of rows and columns (e.g., a 3 × 3 matrix), with a weight value for each cell of the region. Typical matrix shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, ….

3. Convolution: the centre of the convolution kernel is placed on the pixel to be calculated, each kernel element is multiplied by the image pixel value it covers, and the products are summed; the result is the new pixel value at that location. This process is called convolution.

4. Feature map: the result of the convolution calculation on the input data is called a feature map (or output data), and the result of a fully connected layer is also called a feature map (or output data). The feature map size is typically expressed as length × width × depth, or 1 × depth.
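The definition of convolution in item 3 can be illustrated with a minimal scalar sketch in C. The function name convolve_at, the single-channel row-by-row image layout and the element types are assumptions made for illustration only; they are not taken from the patent.

```c
#include <stdint.h>

/* Sum of products of a 3x3 kernel centred on pixel (row, col) of a
 * single-channel image stored row by row (assumed layout).
 * The caller must keep (row, col) at least one pixel from the border. */
int32_t convolve_at(const uint8_t *image, int width, int row, int col,
                    const int8_t kernel[3][3])
{
    int32_t acc = 0;
    for (int kr = 0; kr < 3; ++kr) {
        for (int kc = 0; kc < 3; ++kc) {
            /* Image pixel covered by kernel element (kr, kc). */
            int pixel = image[(row + kr - 1) * width + (col + kc - 1)];
            acc += pixel * kernel[kr][kc];
        }
    }
    return acc;
}
```

With a kernel of all ones the function returns the sum of the nine covered pixels; with a kernel that is 1 only at its centre it returns the centre pixel unchanged.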

Disclosure of Invention

To solve the above problems in the prior art, the present invention aims to double the speed of the calculation.

Specifically, the invention provides an optimization method based on first-layer 4-bit convolution calculation, which implements a complete convolution calculation for an input 3-channel color image with an output data depth of 16. The simd instruction operations are placed in the innermost loop, where the three image channels at three consecutive width positions (three groups of depth data) are treated as one group of data for processing and calculation. The convolution kernel data are stored in the order in which the convolution calculation consumes them: starting from the conventionally stored kernel data, every two adjacent rows, i.e. two groups of data along the output depth, are stored interleaved. Eight consecutive rows are interleaved pairwise, and each datum of the 9th row is interleaved with 0, so that the data are equivalent to ten rows.

The method is characterized in that data are loaded contiguously. In the innermost loop, one datum of the loaded data is copied each time into a simd variable register, an 8-bit multiplication simd instruction is executed, and after conversion to 16 bits an accumulation simd instruction is executed; alternatively, in the convolution calculation, the conversion from 8-bit to 16-bit data is realized with the simd instruction that multiplies and then adds adjacent products.

The method comprises the following steps:

S1, initialize the parameters:

let the input data indata be a set of data with input depth in_depth of 32, width in_width of 512 and height in_height of 512; let the convolution kernel data filter_data be a set of data with output depth out_depth of 128, input depth in_depth of 32 (consistent with the depth of the input data), convolution kernel width ft_w of 3 and convolution kernel height ft_h of 3; let the output data, i.e. the feature map outdata, have depth out_depth, width out_width and height out_height; the convolution calculation uses a step size, denoted stride;

set the simd-type variable registers: sum_0, sum_1, sum_2, sum_3, in_0, in_1, in_2, in_3, in_4, in_value, ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9; all other parameters are pointers or ordinary data;

S2, first-layer loop: set j＝0；

S2.1, judge whether j < out_width*out_height holds;

S2.2, if it does not hold, stop the first-layer loop; if it holds, execute:

out_w＝j%out_width；

out_h＝j/out_width；

in_h＝out_h*stride；

in_w＝out_w*stride；

in_ptr＝indata+in_w*in_depth, where in_ptr is the input data pointer;

out_ptr＝outdata+j*out_depth, where out_ptr is the output data pointer; then judge whether out_w%2 equals 0; if not, skip the following body; if so, execute the following body: initialize registers sum_0, sum_1, sum_2 and sum_3 to 0:

sum_0＝[0,0,0,0,0,0,0,0]；(eight 16-bit elements, all initialized to 0)

sum_1＝sum_0；(initialized to 0)

sum_2＝sum_0；(initialized to 0)

sum_3＝sum_0；(initialized to 0)

execute step S3;

S2.3, set j＝j+1 and return to step S2.1;
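The index arithmetic of step S2.2 can be sketched in scalar C as follows. The struct window_pos and the function locate_window are hypothetical names introduced for this illustration; only the four formulas inside are taken from the step above.

```c
/* Position of one output element and of the input window that feeds it. */
typedef struct {
    int out_w, out_h; /* output column and row */
    int in_w, in_h;   /* left column and top row of the input window */
} window_pos;

/* Maps the flat output index j to output and input coordinates,
 * following out_w = j % out_width, out_h = j / out_width,
 * in_w = out_w * stride, in_h = out_h * stride. */
window_pos locate_window(int j, int out_width, int stride)
{
    window_pos pos;
    pos.out_w = j % out_width;
    pos.out_h = j / out_width;
    pos.in_w = pos.out_w * stride;
    pos.in_h = pos.out_h * stride;
    return pos;
}
```

For example, with out_width of 4 and stride of 2, index j of 5 maps to output column 1, output row 1, and an input window whose top-left corner is at column 2, row 2.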

S3, innermost loop: set p＝0；

S3.1, judge whether p < ft_h holds;

S3.2, if it does not hold, stop the innermost loop and return to step S2.3; if it holds, execute: obtain the input data pointer, a_loc＝in_ptr+(in_h+p)*in_width*in_depth, where a_loc is the input data pointer;

obtain the convolution kernel data pointer, b_ptr＝filter_data+160*p, where b_ptr is the convolution kernel data pointer;

load the input data into register in_value:

in_value＝ingenic_load(a_loc,0)；

copy two consecutive data at a time into registers in_0, in_1, in_2, in_3 and in_4 respectively:

in_0＝ingenic_copy2(in_value,0)；

in_1＝ingenic_copy2(in_value,1)；

in_2＝ingenic_copy2(in_value,2)；

in_3＝ingenic_copy2(in_value,3)；

in_4＝ingenic_copy2(in_value,4)；

load the convolution kernel data into registers ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8 and ft_9 respectively:

ft_0＝ingenic_load(b_ptr,0)；

ft_1＝ingenic_load(b_ptr,16)；

ft_2＝ingenic_load(b_ptr,32)；

ft_3＝ingenic_load(b_ptr,48)；

ft_4＝ingenic_load(b_ptr,64)；

ft_5＝ingenic_load(b_ptr,80)；

ft_6＝ingenic_load(b_ptr,96)；

ft_7＝ingenic_load(b_ptr,112)；

ft_8＝ingenic_load(b_ptr,128)；

ft_9＝ingenic_load(b_ptr,144)；

execute the multiply-then-add-adjacent simd instruction to obtain sum_0 and sum_1:

sum_0＝ingenic_muladd_h(sum_0,in_0,ft_0)；

sum_1＝ingenic_muladd_h(sum_1,in_0,ft_1)；

sum_0＝ingenic_muladd_h(sum_0,in_1,ft_2)；

sum_1＝ingenic_muladd_h(sum_1,in_1,ft_3)；

sum_0＝ingenic_muladd_h(sum_0,in_2,ft_4)；

sum_1＝ingenic_muladd_h(sum_1,in_2,ft_5)；

sum_0＝ingenic_muladd_h(sum_0,in_3,ft_6)；

sum_1＝ingenic_muladd_h(sum_1,in_3,ft_7)；

sum_0＝ingenic_muladd_h(sum_0,in_4,ft_8)；

sum_1＝ingenic_muladd_h(sum_1,in_4,ft_9)；

assign in_3 to in_0 and in_4 to in_1:

in_0＝in_3；

in_1＝in_4；

copy two consecutive data at a time into registers in_2, in_3 and in_4 respectively:

in_2＝ingenic_copy2(in_value,5)；

in_3＝ingenic_copy2(in_value,6)；

in_4＝ingenic_copy2(in_value,7)；

execute the multiply-then-add-adjacent simd instruction to obtain sum_2 and sum_3:

sum_2＝ingenic_muladd_h(sum_2,in_0,ft_0)；

sum_3＝ingenic_muladd_h(sum_3,in_0,ft_1)；

sum_2＝ingenic_muladd_h(sum_2,in_1,ft_2)；

sum_3＝ingenic_muladd_h(sum_3,in_1,ft_3)；

sum_2＝ingenic_muladd_h(sum_2,in_2,ft_4)；

sum_3＝ingenic_muladd_h(sum_3,in_2,ft_5)；

sum_2＝ingenic_muladd_h(sum_2,in_3,ft_6)；

sum_3＝ingenic_muladd_h(sum_3,in_3,ft_7)；

sum_2＝ingenic_muladd_h(sum_2,in_4,ft_8)；

sum_3＝ingenic_muladd_h(sum_3,in_4,ft_9)；

S3.3, set p＝p+1 and return to step S3.1.

In step S1, the data along one input depth direction of the input data are:

[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,…x32]; the depth here is 32 as an example, so the data end at x32.

The data of the convolution kernel along two output depth directions are:

[d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,…d128]；

[e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13,e14,e15,e16,…e128]; the depth here is 128 as an example, so the data end at d128 and e128.

In step S1, stride is 2 and out_depth is 16.

In the innermost loop, 16 data are loaded into the in_value register; read in sequence, they are A1r, A1g, A1b; B1r, B1g, B1b; C1r, C1g, C1b; D1r, D1g, D1b; E1r, E1g, E1b; F1r. Of these 16 data, the first 15 are effectively used; they are denoted as the first layer;

wherein the first 9 data, A1r, A1g, A1b, B1r, B1g, B1b, C1r, C1g and C1b, form the first-layer data used to calculate the accumulated values of the first group of output data, sum_0 and sum_1;

wherein the 7th to 15th data, C1r, C1g, C1b; D1r, D1g, D1b; E1r, E1g, E1b, form the first-layer data used to calculate the accumulated values of the second group of output data, sum_2 and sum_3;

this arises because the input data depth is 3, the convolution kernel width is 3 and the step size is 2; two groups are calculated at a time, so 15 data are used, while the register holds 16 8-bit data.

The first layer of the first 9 data corresponds to one iteration of the innermost loop, and this layer of data is processed as a whole. It contains 9 data, but the multiply-then-add-adjacent simd instruction needs an even number of data, so one arbitrary extra datum is appended to make the count even; it is multiplied by 0, so its value vanishes in the addition. Outputting 1 datum requires 9 convolution kernel data, but 9 convolution kernel data plus one 0 are actually used; outputting 16 data requires 144 convolution kernel data, but 160 are actually used. The multiply-then-add-adjacent simd instruction is expressed as vrd＝ingenic_muladd_h(vrd,vrs,vrt); it performs the 8-bit multiplication and the accumulation and produces the required 16-bit results.
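The counts quoted above (15 of 16 loaded input data used, 9 kernel values padded to 10, 144 kernel values padded to 160) follow from simple arithmetic. The sketch below reproduces them; the function names are hypothetical and the formulas are a reading of the text, not code from the patent.

```c
/* Input values consumed from one image row when `groups` horizontally
 * adjacent outputs, `stride` pixels apart, share a single load. */
int input_values_used(int in_depth, int ft_w, int stride, int groups)
{
    return in_depth * (ft_w + stride * (groups - 1));
}

/* Kernel values per kernel row and output channel, rounded up to an
 * even count because the multiply-then-add-adjacent instruction
 * consumes its operands in pairs. */
int padded_taps(int in_depth, int ft_w)
{
    int taps = in_depth * ft_w;
    return taps + (taps % 2);
}

/* Stored kernel values per kernel row for all output channels. */
int kernel_values_per_row(int in_depth, int ft_w, int out_depth)
{
    return padded_taps(in_depth, ft_w) * out_depth;
}
```

With in_depth of 3, ft_w of 3, stride of 2 and two groups, input_values_used gives 15; padded_taps gives 10; and kernel_values_per_row with out_depth of 16 gives 160, which matches the offset 160*p used for b_ptr.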

Step S3.2 further comprises:

in_0＝ingenic_copy2(in_value,0); the data stored in the register are: in_0＝[A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g];

in_1＝ingenic_copy2(in_value,1); the data stored in the register are: in_1＝[A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r];

in_2＝ingenic_copy2(in_value,2); the data stored in the register are: in_2＝[B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b];

in_3＝ingenic_copy2(in_value,3); the data stored in the register are: in_3＝[C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g];

in_4＝ingenic_copy2(in_value,4); the data stored in the register are: in_4＝[C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r];

ft_0＝ingenic_load(b_ptr,0); the data stored in the register are: ft_0＝[a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_8,a2_8];

ft_1＝ingenic_load(b_ptr,16); the data stored in the register are: ft_1＝[a1_9,a2_9,a1_10,a2_10,a1_11,a2_11,…,a1_16,a2_16];

ft_2＝ingenic_load(b_ptr,32); the data stored in the register are: ft_2＝[a3_1,b1_1,a3_2,b1_2,a3_3,b1_3,…,a3_8,b1_8];

ft_3＝ingenic_load(b_ptr,48); the data stored in the register are: ft_3＝[a3_9,b1_9,a3_10,b1_10,a3_11,b1_11,…,a3_16,b1_16];

ft_4＝ingenic_load(b_ptr,64); the data stored in the register are: ft_4＝[b2_1,b3_1,b2_2,b3_2,b2_3,b3_3,…,b2_8,b3_8];

ft_5＝ingenic_load(b_ptr,80); the data stored in the register are: ft_5＝[b2_9,b3_9,b2_10,b3_10,b2_11,b3_11,…,b2_16,b3_16];

ft_6＝ingenic_load(b_ptr,96); the data stored in the register are: ft_6＝[c1_1,c2_1,c1_2,c2_2,c1_3,c2_3,…,c1_8,c2_8];

ft_7＝ingenic_load(b_ptr,112); the data stored in the register are: ft_7＝[c1_9,c2_9,c1_10,c2_10,c1_11,c2_11,…,c1_16,c2_16];

ft_8＝ingenic_load(b_ptr,128); the data stored in the register are: ft_8＝[c3_1,0,c3_2,0,c3_3,0,…,c3_8,0];

ft_9＝ingenic_load(b_ptr,144); the data stored in the register are: ft_9＝[c3_9,0,c3_10,0,c3_11,0,…,c3_16,0];

sum_0 and sum_1 are then calculated by executing the multiply-then-add-adjacent simd instruction on in_0, in_1, in_2, in_3, in_4 and ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9. Completing the innermost loop completes one convolution calculation: sum_0 and sum_1 are obtained together, and since each of the two registers holds 8 16-bit data, 16 data are obtained in total. After all of the above steps have been executed, the entire output data is obtained.

The method calculates two groups of data at a time: the convolution kernel data are loaded once and used for two groups of output data, and a single load in the innermost loop supplies the input data required by both groups:

the first group of output data, sum_0 and sum_1, is calculated from the input data loaded for the first group once the innermost loop has completed;

the second group of output data, sum_2 and sum_3, is calculated from the input data loaded for the second group once the innermost loop has completed;

the union of the input data loaded for the first group and the input data loaded for the second group is the input data required to calculate both groups once the innermost loop has completed; the overlapping part shared by the two groups is C1r, C1g, C1b, C2r, C2g, C2b, C3r, C3g and C3b;

the input data for the second group are obtained as follows:

in_0＝in_3; the data stored in the register are:

in_0＝[C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g]；

in_1＝in_4; the data stored in the register are:

in_1＝[C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r]；

in_2＝ingenic_copy2(in_value,5); the data stored in the register are: in_2＝[D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b];

in_3＝ingenic_copy2(in_value,6); the data stored in the register are: in_3＝[E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g];

in_4＝ingenic_copy2(in_value,7); the data stored in the register are: in_4＝[E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r];

sum_2 and sum_3 are then calculated by executing the multiply-then-add-adjacent simd instruction on in_0, in_1, in_2, in_3, in_4 and ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9.
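One iteration of the innermost loop, covering both output groups, can be emulated in portable scalar C as sketched below. The helpers load_emu, copy2_emu and muladd_h_emu are hypothetical stand-ins for ingenic_load, ingenic_copy2 and ingenic_muladd_h, written from the behaviour described in this text; they are not the vendor intrinsics, and details such as overflow behaviour are assumptions.

```c
#include <stdint.h>
#include <string.h>

typedef struct { int8_t b[16]; } vec8;   /* sixteen 8-bit lanes */
typedef struct { int16_t h[8]; } vec16;  /* eight 16-bit lanes */

/* Stand-in for ingenic_load: 16 contiguous bytes at p + offset. */
static vec8 load_emu(const int8_t *p, int offset)
{
    vec8 v;
    memcpy(v.b, p + offset, sizeof v.b);
    return v;
}

/* Stand-in for ingenic_copy2: broadcast the i-th byte pair. */
static vec8 copy2_emu(vec8 s, int i)
{
    vec8 v;
    for (int k = 0; k < 8; ++k) {
        v.b[2 * k] = s.b[2 * i];
        v.b[2 * k + 1] = s.b[2 * i + 1];
    }
    return v;
}

/* Stand-in for ingenic_muladd_h: multiply lanes, add adjacent products. */
static vec16 muladd_h_emu(vec16 d, vec8 s, vec8 t)
{
    for (int k = 0; k < 8; ++k) {
        d.h[k] = (int16_t)(d.h[k] + s.b[2 * k] * t.b[2 * k]
                                  + s.b[2 * k + 1] * t.b[2 * k + 1]);
    }
    return d;
}

/* One iteration of the innermost loop for one kernel row p.
 * a_loc: 16 input bytes; b_ptr: 160 interleaved kernel bytes.
 * sum[0], sum[1]: channels 0..7 and 8..15 of the first output group;
 * sum[2], sum[3]: the same channels of the second output group. */
void inner_iteration(const int8_t a_loc[16], const int8_t b_ptr[160],
                     vec16 sum[4])
{
    vec8 in_value = load_emu(a_loc, 0);
    vec8 ft[10];
    for (int r = 0; r < 10; ++r) {
        ft[r] = load_emu(b_ptr, 16 * r);
    }
    for (int g = 0; g < 2; ++g) {      /* g = 0: first group, 1: second */
        for (int k = 0; k < 5; ++k) {  /* five byte pairs per group */
            /* Group 0 uses pairs 0..4, group 1 uses pairs 3..7. */
            vec8 in = copy2_emu(in_value, 3 * g + k);
            sum[2 * g] = muladd_h_emu(sum[2 * g], in, ft[2 * k]);
            sum[2 * g + 1] = muladd_h_emu(sum[2 * g + 1], in, ft[2 * k + 1]);
        }
    }
}
```

The loop over g reuses the same ten kernel registers for both groups, which is the point of calculating two groups per load; the text instead reuses in_3 and in_4 by assignment, which yields the same pair indices 3 and 4.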

In summary, the method of the present application achieves the following advantages: the prior-art C program computes one datum at a time, so few optimization techniques apply to it, the degree of optimization is low, and the algorithm runs very slowly on the chip. The optimization in the present method is simple: adding the relevant simd instructions in the innermost loop is enough to double the operation speed.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.

FIG. 1 is a schematic representation of a three-dimensional space of data.

FIG. 2 is a schematic diagram of a data storage structure.

FIG. 3 is a schematic diagram of FIG. 2 in which the output depth is compressed into a point, turning the four-dimensional space into a three-dimensional space.

FIG. 4 is a schematic diagram in which the output depth out_depth is 5; the data along one output depth direction, [a1_1, a1_2, a1_3, …, a1_n], are indicated by a dotted line.

FIG. 5 is a schematic diagram of the first rectangle in FIG. 3, with two dashed lines in the rectangle.

FIG. 6 is a schematic diagram visualizing a convolution kernel width w of 3, where the data along the width direction of one convolution kernel comprise three groups of data along the input depth direction.

FIG. 7 is a schematic diagram of the first layer of the first 9 data in FIG. 9: A1r, A1g, A1b, B1r, B1g, B1b, C1r, C1g, C1b.

FIG. 8 is a schematic diagram of the first layer of the 7th to 15th data in FIG. 9: C1r, C1g, C1b; D1r, D1g, D1b; E1r, E1g, E1b.

FIG. 9 is a schematic diagram of the first layer of the first 15 data, which are the effectively used part of the 16 data read starting from the top left of FIG. 1.

FIG. 10 is a schematic flow chart diagram of an embodiment of the method.

Detailed Description

In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.

1. Introduction to input data storage and input convolution kernel data storage mode

The input data is an ordinary color picture, which can be understood as having three channels in depth, i.e. RGB data. The input data are stored in the order depth (the three RGB or GBR values form one depth), then width, then height. The three-dimensional space of the data is shown in FIG. 1.

Storage mode of the convolution kernel data: the output depth is stored first, then the input depth, then the convolution kernel width, and finally the convolution kernel height. The data storage structure can be understood from FIG. 2. In that figure, the output depth is the number of data in one depth group, the input depth is the number of output-depth groups, the convolution kernel width is the number of input-depth groups, and the convolution kernel height is the number of convolution kernel widths. This spatial structure is four-dimensional and cannot be drawn in three-dimensional space. If the output depth is compressed into a point, the four-dimensional space becomes a three-dimensional space, as shown in FIG. 3.

According to the characteristics of the input image, the convolution kernel parameters are set as follows: the output depth out_depth is n (n is 16), the input depth in_depth is 3, the convolution kernel width ft_w is 3 and the convolution kernel height ft_h is 3. Specifically:

with an output depth out_depth of 5, the data along one output depth direction are [a1_1, a1_2, a1_3, …, a1_n], indicated by the dotted line in FIG. 4.

The input depth in_depth is 3, and the data along one input depth direction comprise three groups of output depth data, which can be expressed as:

[a1_1,a1_2,a1_3,…,a1_n；

a2_1,a2_2,a2_3,…,a2_n；

a3_1,a3_2,a3_3,…,a3_n；]；

This is visualized in FIG. 5, which is the first rectangle in FIG. 3 with two dotted lines drawn in it.

The convolution kernel width w is 3, and the data along the width direction of one convolution kernel comprise three groups of data along the input depth direction, which can be expressed as:

{[a1_1,a1_2,a1_3,…,a1_n；

a2_1,a2_2,a2_3,…,a2_n；

a3_1,a3_2,a3_3,…,a3_n；]；

[b1_1,b1_2,b1_3,…,b1_n；

b2_1,b2_2,b2_3,…,b2_n；

b3_1,b3_2,b3_3,…,b3_n；]；

[c1_1,c1_2,c1_3,…,c1_n；

c2_1,c2_2,c2_3,…,c2_n；

c3_1,c3_2,c3_3,…,c3_n；]}

This is visualized in FIG. 6.

The convolution kernel height h is 3, as visualized in FIG. 2. The entire data are represented as:

{{[a1_1,a1_2,a1_3,…,a1_n；

a2_1,a2_2,a2_3,…,a2_n；

a3_1,a3_2,a3_3,…,a3_n；]；

[b1_1,b1_2,b1_3,…,b1_n；

b2_1,b2_2,b2_3,…,b2_n；

b3_1,b3_2,b3_3,…,b3_n；]；

[c1_1,c1_2,c1_3,…,c1_n；

c2_1,c2_2,c2_3,…,c2_n；

c3_1,c3_2,c3_3,…,c3_n；]}

{[d1_1,d1_2,d1_3,…,d1_n；

d2_1,d2_2,d2_3,…,d2_n；

d3_1,d3_2,d3_3,…,d3_n；]；

[e1_1,e1_2,e1_3,…,e1_n；

e2_1,e2_2,e2_3,…,e2_n；

e3_1,e3_2,e3_3,…,e3_n；]；

[f1_1,f1_2,f1_3,…,f1_n；

f2_1,f2_2,f2_3,…,f2_n；

f3_1,f3_2,f3_3,…,f3_n；]}

{[g1_1,g1_2,g1_3,…,g1_n；

g2_1,g2_2,g2_3,…,g2_n；

g3_1,g3_2,g3_3,…,g3_n；]；

[h1_1,h1_2,h1_3,…,h1_n；

h2_1,h2_2,h2_3,…,h2_n；

h3_1,h3_2,h3_3,…,h3_n；]；

[j1_1,j1_2,j1_3,…,j1_n；

j2_1,j2_2,j2_3,…,j2_n；

j3_1,j3_2,j3_3,…,j3_n；]}}

For ease of understanding, the data are delimited here with semicolons, square brackets and braces; the actual on-chip storage is a one-dimensional vector. In the cross storage described later, the square brackets and braces are removed.

2. A new convolution kernel data storage mode.

The convolution kernel data are stored in the order in which the convolution calculation consumes them, so that no data loaded into the cache are wasted: everything that is loaded is used, nothing is loaded without being needed, and the number of repeated loads is reduced. Conventional data storage follows the layout shown in FIG. 2. Starting from kernel data stored in that layout, every two adjacent rows, i.e. two groups of data along the output depth, are now cross-stored (interleaved).

Data sequence before cross store:

a1_1,a1_2,a1_3,…,a1_n；

a2_1,a2_2,a2_3,…,a2_n；

data sequence after cross storage:

a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_n,a2_n；
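The cross storage of two adjacent rows shown above amounts to interleaving two arrays element by element. A minimal sketch in C, with the hypothetical function name interleave_rows:

```c
#include <stddef.h>
#include <stdint.h>

/* Writes row_a[0], row_b[0], row_a[1], row_b[1], ... into dst,
 * which must have room for 2 * n values. */
void interleave_rows(const int8_t *row_a, const int8_t *row_b, size_t n,
                     int8_t *dst)
{
    for (size_t i = 0; i < n; ++i) {
        dst[2 * i] = row_a[i];
        dst[2 * i + 1] = row_b[i];
    }
}
```

After interleaving, each adjacent pair in dst holds the two kernel values that one multiply-then-add-adjacent lane needs for a single output channel.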

Of nine consecutive rows, the first eight are cross-stored in adjacent pairs, and the 9th row is cross-stored with 0, so that the data are equivalent to ten rows.

Data sequence before cross storage:

{[a1_1,a1_2,a1_3,…,a1_n；

a2_1,a2_2,a2_3,…,a2_n；

a3_1,a3_2,a3_3,…,a3_n；]；

[b1_1,b1_2,b1_3,…,b1_n；

b2_1,b2_2,b2_3,…,b2_n；

b3_1,b3_2,b3_3,…,b3_n；]；

[c1_1,c1_2,c1_3,…,c1_n；

c2_1,c2_2,c2_3,…,c2_n；

c3_1,c3_2,c3_3,…,c3_n；]}

data sequence after cross storage:

a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_n,a2_n；

a3_1,b1_1,a3_2,b1_2,a3_3,b1_3,…,a3_n,b1_n；

b2_1,b3_1,b2_2,b3_2,b2_3,b3_3,…,b2_n,b3_n；

c1_1,c2_1,c1_2,c2_2,c1_3,c2_3,…,c1_n,c2_n；

c3_1,0,c3_2,0,c3_3,0,…,c3_n,0；

In the convolution calculation, the simd instruction used requires each datum of the 9th row to be paired with 0, and data are loaded contiguously; the 9th row therefore has to be cross-stored with 0, in the following order:

Data sequence before cross storage:

c3_1,c3_2,c3_3,…,c3_n；

data sequence after cross storage:

c3_1,0,c3_2,0,c3_3,0,…,c3_n,0；

the cross-storage order of the convolution kernel data is as follows:

{{a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_n,a2_n；

a3_1,b1_1,a3_2,b1_2,a3_3,b1_3,…,a3_n,b1_n；

b2_1,b3_1,b2_2,b3_2,b2_3,b3_3,…,b2_n,b3_n；

c1_1,c2_1,c1_2,c2_2,c1_3,c2_3,…,c1_n,c2_n；

c3_1,0,c3_2,0,c3_3,0,…,c3_n,0；}

{d1_1,d2_1,d1_2,d2_2,d1_3,d2_3,…,d1_n,d2_n；

d3_1,e1_1,d3_2,e1_2,d3_3,e1_3,…,d3_n,e1_n；

e2_1,e3_1,e2_2,e3_2,e2_3,e3_3,…,e2_n,e3_n；

f1_1,f2_1,f1_2,f2_2,f1_3,f2_3,…,f1_n,f2_n；

f3_1,0,f3_2,0,f3_3,0,…,f3_n,0；}

{g1_1,g2_1,g1_2,g2_2,g1_3,g2_3,…,g1_n,g2_n；

g3_1,h1_1,g3_2,h1_2,g3_3,h1_3,…,g3_n,h1_n；

h2_1,h3_1,h2_2,h3_2,h2_3,h3_3,…,h2_n,h3_n；

j1_1,j2_1,j1_2,j2_2,j1_3,j2_3,…,j1_n,j2_n；

j3_1,0,j3_2,0,j3_3,0,…,j3_n,0；}}
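The complete cross-storage order above can be produced from the conventional layout with a short packing routine. The sketch below assumes the conventional layout is a flat array indexed by kernel row, then by the 9 rows of n values within it (kernel width times input depth), as in the listing before cross storage; pack_kernel and the constant names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { TAPS = 9, PADDED_TAPS = 10 }; /* 3 width positions x 3 channels, padded to even */

/* src: ft_h * 9 * n values in the conventional order.
 * dst: ft_h * 10 * n values in cross-storage order. */
void pack_kernel(const int8_t *src, int ft_h, int n, int8_t *dst)
{
    for (int h = 0; h < ft_h; ++h) {
        const int8_t *rows = src + (size_t)h * TAPS * n;
        int8_t *out = dst + (size_t)h * PADDED_TAPS * n;
        /* Rows r and r + 1 are interleaved; row 8 is interleaved with 0. */
        for (int r = 0; r < PADDED_TAPS; r += 2) {
            for (int o = 0; o < n; ++o) {
                out[r * n + 2 * o] = rows[r * n + o];
                out[r * n + 2 * o + 1] =
                    (r + 1 < TAPS) ? rows[(r + 1) * n + o] : 0;
            }
        }
    }
}
```

With n of 16 each kernel row occupies 160 packed values, consistent with the offset 160*p used for b_ptr in step S3.2.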

3. Simd instruction algorithm.

1) Introduction to the simd instructions

The simd instructions used are as follows.

a) Multiply-then-add-adjacent simd instruction:

vrd＝ingenic_muladd_h(vrd,vrs,vrt)；

The inputs are vrd, vrs and vrt, and the output is vrd. vrd, vrs and vrt are simd-type variables held in 128-bit registers. vrd holds 8 int16_t data, while vrs and vrt each hold 16 int8_t data. Because the operation involves both multiplication and addition, the results of operating on 4-bit data need 16 bits of storage, and the 4-bit input data are stored in 8 bits each.

vrd＝[vrd0,vrd1,vrd2,vrd3,vrd4,vrd5,vrd6,vrd7]；

vrs＝[vrs0,vrs1,vrs2,vrs3,vrs4,vrs5,vrs6,vrs7,vrs8,vrs9,vrs10,vrs11,vrs12,vrs13,vrs14,vrs15]；

vrt＝[vrt0,vrt1,vrt2,vrt3,vrt4,vrt5,vrt6,vrt7,vrt8,vrt9,vrt10,vrt11,vrt12,vrt13,vrt14,vrt15]；

Equivalent operation:

vrd0:＝vrd0+vrs0*vrt0+vrs1*vrt1；

vrd1:＝vrd1+vrs2*vrt2+vrs3*vrt3；

...

vrd7:＝vrd7+vrs14*vrt14+vrs15*vrt15；
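The equivalent operation listed above can be written as a scalar loop. The function name muladd_h_emulated is hypothetical, and the behaviour on 16-bit overflow is an assumption (plain truncation here), since the text does not specify it.

```c
#include <stdint.h>

/* Scalar emulation of vrd = ingenic_muladd_h(vrd, vrs, vrt) as described
 * in the text: lane k of vrd accumulates the two adjacent products
 * vrs[2k]*vrt[2k] + vrs[2k+1]*vrt[2k+1]. */
void muladd_h_emulated(int16_t vrd[8], const int8_t vrs[16],
                       const int8_t vrt[16])
{
    for (int k = 0; k < 8; ++k) {
        vrd[k] = (int16_t)(vrd[k] + vrs[2 * k] * vrt[2 * k]
                                  + vrs[2 * k + 1] * vrt[2 * k + 1]);
    }
}
```

With 4-bit operands stored in 8 bits, each product has magnitude at most 64, so many accumulations fit in a 16-bit lane before overflow becomes a concern.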

b) Load data simd instruction:

The input is a pointer to the data to be loaded; 128 bits of data are loaded from the memory location it points to. For 8-bit data, 16 data are loaded; for 16-bit data, 8 data are loaded. The data are loaded into variable vrd.

vrd＝ingenic_load(indata)

c) Copy specified elements simd instruction:

The data at the specified position of the input variable are copied into the output variable. The specified position is denoted by i, and each element copied is 8 bits.

Copying two consecutive data:

vrd＝ingenic_copy2(vrs,i)

For example:

i＝0；

vrd＝[vrs0,vrs1,vrs0,vrs1,vrs0,vrs1,vrs0,vrs1,vrs0,vrs1,vrs0,vrs1,vrs0,vrs1,vrs0,vrs1]；

i＝3；

vrd＝[vrs6,vrs7,vrs6,vrs7,vrs6,vrs7,vrs6,vrs7,vrs6,vrs7,vrs6,vrs7,vrs6,vrs7,vrs6,vrs7]；
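The two examples above suggest that position i selects the byte pair (vrs[2i], vrs[2i+1]) and repeats it across the whole register. A scalar sketch under that reading, with the hypothetical name copy2_emulated:

```c
#include <stdint.h>

/* Scalar emulation of vrd = ingenic_copy2(vrs, i) as described in the
 * text: the i-th pair of 8-bit elements is repeated eight times.
 * vrd and vrs must not overlap; i must be in 0..7. */
void copy2_emulated(int8_t vrd[16], const int8_t vrs[16], int i)
{
    for (int k = 0; k < 8; ++k) {
        vrd[2 * k] = vrs[2 * i];
        vrd[2 * k + 1] = vrs[2 * i + 1];
    }
}
```

Broadcasting a pair rather than a single element is what lets one multiply-then-add-adjacent instruction combine two input values with two kernel values for eight output channels at once.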

2) Implementing the convolution calculation with simd instructions

Convolution multiplication and accumulation can be implemented in many ways. With simd instructions, both multiplication and addition are available, but different algorithms differ in efficiency and in the time they consume. The following method keeps redundant work low.

Without loss of generality, let the input data indata be a set of data with input depth in_depth of 32, width in_width of 512 and height in_height of 512; let the convolution kernel data filter_data be a set of data with output depth out_depth of 128, input depth in_depth of 32 (consistent with the depth of the input data), convolution kernel width ft_w of 3 and convolution kernel height ft_h of 3. Let the output data (feature map) outdata have depth out_depth, width out_width and height out_height. The convolution calculation uses a step size, denoted stride. The data along one input depth direction of the input data are:

[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,…x32]

the data of the convolution kernel data in two output depth directions are:

[d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,…d128]

[e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13,e14,e15,e16,…e128]

……

a) Pseudo code of the ordinary C algorithm:

The C program performs the convolution calculation using the conventional storage layouts for the input data and the convolution kernel data. Values are calculated one after another in the storage order of the output data, yielding each datum of the output data (feature map). Because the C program computes one datum at a time, few optimization techniques apply to it, the degree of optimization is low, and the algorithm runs slowly on the chip. The following is an optimization method using simd instructions.
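The C pseudo code itself is not reproduced in this text. As a stand-in, the following is a minimal scalar reference under stated assumptions: input stored as height, width, depth (depth fastest), kernel stored as kernel height, kernel width, input depth, output depth (output depth fastest), no padding, and 32-bit accumulators. The function name conv_first_layer_ref and the exact signature are hypothetical.

```c
#include <stdint.h>

/* Scalar reference convolution, one output datum at a time.
 * outdata must hold out_height * out_width * out_depth values, with
 * out_width = (in_width - ft_w) / stride + 1 and likewise for height. */
void conv_first_layer_ref(const int8_t *indata, int in_width, int in_height,
                          int in_depth, const int8_t *filter_data,
                          int ft_w, int ft_h, int out_depth, int stride,
                          int32_t *outdata)
{
    int out_width = (in_width - ft_w) / stride + 1;
    int out_height = (in_height - ft_h) / stride + 1;
    for (int oh = 0; oh < out_height; ++oh) {
        for (int ow = 0; ow < out_width; ++ow) {
            for (int o = 0; o < out_depth; ++o) {
                int32_t acc = 0;
                for (int p = 0; p < ft_h; ++p) {
                    for (int q = 0; q < ft_w; ++q) {
                        for (int c = 0; c < in_depth; ++c) {
                            int in_idx = ((oh * stride + p) * in_width
                                          + (ow * stride + q)) * in_depth + c;
                            int ft_idx = ((p * ft_w + q) * in_depth + c)
                                         * out_depth + o;
                            acc += indata[in_idx] * filter_data[ft_idx];
                        }
                    }
                }
                outdata[(oh * out_width + ow) * out_depth + o] = acc;
            }
        }
    }
}
```

Such a reference is mainly useful for checking the output of an optimized version, since every product is computed in the most direct order.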

b) Simd instruction algorithm

And copying one data in the loaded data at each time into a variable register of the simd instruction, performing multiplication simd instruction calculation of 8 bits, converting the result into 16 bits, and performing accumulation simd instruction calculation. This multiplication and accumulation is implemented in the innermost loop. Its pseudo code is as follows: simd type variable register: sum _0, Sum _1, Sum _2, Sum _3, in _0, in _1, in _2, in _3, in _4, in _ value, ft _0, ft _1, ft _2, ft _3, ft _4, ft _5, ft _6, ft _7, ft _8, ft _9, other parameters are pointers or regular data.

This pseudo-code implements a complete convolution calculation for an input 3-channel color image output data depth of 16. In the algorithm, simd instruction operation is mainly designed to be circulated in the innermost layer. Others can be considered as the same as the C programming algorithm design. In the innermost layer cycle, three groups of depth data of three channels and the sequential width of the image are regarded as one group of data to be processed and calculated, and the data are integrated to the innermost layer by fully utilizing the characteristic of specific output depth.

In the innermost for loop, 16 data are loaded into the in _ value register, and assuming that reading is started from the top left of fig. 1, the sequence is A1r, A1g, and A1 b; b1r, B1g, B1B; c1r, C1g, C1 b; d1r, D1g, D1 b; e1r, E1g, E1 b; f1 r. These 16 data effectively use the first 15 data, which are shown in FIG. 9 as the first layer. The first 9 data, A1r, A1g, A1B, B1r, B1g, B1B, C1r, C1g, and C1B, as shown in fig. 7, are the accumulated values used to calculate the first set sum _0 and sum _1 of output data, from the 7 th to the 15 th data, C1r, C1g, and C1B; d1r, D1g, D1 b; e1r, E1g, E1b, the first level shown in FIG. 8, are the accumulated values of the second set of sum _2 and sum _3 used to compute the output data. This occurs because the input data depth is 3, the convolution kernel width is 3, the step size is 2, two sets are calculated at a time, the number of data used is 15, and the register stores 16 data of 8 bits.

The width and height of the data shown in fig. 7 are the width and height of the convolution kernel, and one group of output data is obtained from this data after the convolution calculation. The data shown in fig. 7 is the input data loaded to calculate the first group of output data in one complete pass of the innermost loop, the data shown in fig. 8 is the input data loaded to calculate the second group of output data, and fig. 9 is the input data that must be loaded to calculate both groups of output data in one complete pass of the innermost loop. The existing SIMD instruction set has no instruction that directly converts 8-bit data into 16-bit data; such a conversion can only be done by combining a select SIMD instruction with a shift instruction. The multiply-then-add-adjacent SIMD instruction, i.e.:

vrd＝ingenic_muladd_h(vrd,vrs,vrt)

performs the 8-bit multiplication and the accumulation together and produces the required 16-bit result, which avoids data overflow. However, the complete algorithm has to be designed around this SIMD instruction; otherwise the result is no better than an ordinary design. In the data of fig. 7, the first layer is treated as one iteration of the loop, i.e. the iteration p = 0 of the innermost for loop of the pseudo code. The data of this layer is processed as a whole. It contains 9 values, but the multiply-then-add-adjacent SIMD instruction needs an even number of values, so one arbitrary extra value is appended to make the count even. That value is multiplied by 0, so it contributes nothing to the sum. To output one value, 9 convolution kernel values are required, and 9 kernel values plus one 0 are actually used; to output 16 values, 144 convolution kernel values are required, and 160 are actually used. The relevant instructions and loaded data are as follows:

in_0＝ingenic_copy2(in_value,0), and the data stored in the register is:

in_0＝[A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g]

in_1＝ingenic_copy2(in_value,1), and the data stored in the register is:

in_1＝[A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r]

in_2＝ingenic_copy2(in_value,2), and the data stored in the register is:

in_2＝[B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b]

in_3＝ingenic_copy2(in_value,3), and the data stored in the register is:

in_3＝[C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g]

in_4＝ingenic_copy2(in_value,4), and the data stored in the register is:

in_4＝[C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r]

ft_0＝ingenic_load(b_ptr,0), and the data stored in the register is:

ft_0＝[a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_8,a2_8]

ft_1＝ingenic_load(b_ptr,16), and the data stored in the register is:

ft_1＝[a1_9,a2_9,a1_10,a2_10,a1_11,a2_11,…,a1_16,a2_16]

ft_2＝ingenic_load(b_ptr,32), and the data stored in the register is:

ft_2＝[a3_1,b1_1,a3_2,b1_2,a3_3,b1_3,…,a3_8,b1_8]

ft_3＝ingenic_load(b_ptr,48), and the data stored in the register is:

ft_3＝[a3_9,b1_9,a3_10,b1_10,a3_11,b1_11,…,a3_16,b1_16]

ft_4＝ingenic_load(b_ptr,64), and the data stored in the register is:

ft_4＝[b2_1,b3_1,b2_2,b3_2,b2_3,b3_3,…,b2_8,b3_8]

ft_5＝ingenic_load(b_ptr,80), and the data stored in the register is:

ft_5＝[b2_9,b3_9,b2_10,b3_10,b2_11,b3_11,…,b2_16,b3_16]

ft_6＝ingenic_load(b_ptr,96), and the data stored in the register is:

ft_6＝[c1_1,c2_1,c1_2,c2_2,c1_3,c2_3,…,c1_8,c2_8]

ft_7＝ingenic_load(b_ptr,112), and the data stored in the register is:

ft_7＝[c1_9,c2_9,c1_10,c2_10,c1_11,c2_11,…,c1_16,c2_16]

ft_8＝ingenic_load(b_ptr,128), and the data stored in the register is:

ft_8＝[c3_1,0,c3_2,0,c3_3,0,…,c3_8,0]

ft_9＝ingenic_load(b_ptr,144), and the data stored in the register is:

ft_9＝[c3_9,0,c3_10,0,c3_11,0,…,c3_16,0]

Subsequently, sum_0 and sum_1 are calculated by applying the multiply-then-add-adjacent SIMD instruction to in_0, in_1, in_2, in_3, in_4 and ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9. Completing the for (p = 0; p < ft_h; ++p) loop completes one convolution calculation. One convolution calculation yields sum_0 and sum_1; each of the two registers holds 8 16-bit values, 16 values in total. After the whole pseudo code has been executed, the complete data output result is obtained.
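The register contents listed above can be checked with a scalar emulation in C. This is a sketch under stated assumptions, not the Ingenic instruction definitions: the semantics of emu_copy2 (broadcast pair k of the loaded register) and emu_muladd_h (multiply 8-bit lanes, add adjacent products, accumulate into 16-bit lanes) are inferred from the listings, and the types v16i8 and v8i16, the helper pack_kernel_row, and the use of signed 8-bit input values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int8_t  b[16]; } v16i8;  /* hypothetical 16 x 8-bit register */
typedef struct { int16_t h[8];  } v8i16;  /* hypothetical 8 x 16-bit register */

/* Assumed behaviour of ingenic_copy2: broadcast pair k to all 8 pair slots. */
static v16i8 emu_copy2(v16i8 v, int k)
{
    v16i8 r;
    for (int i = 0; i < 8; ++i) {
        r.b[2 * i]     = v.b[2 * k];
        r.b[2 * i + 1] = v.b[2 * k + 1];
    }
    return r;
}

/* Assumed behaviour of ingenic_muladd_h: multiply, add adjacent, accumulate. */
static v8i16 emu_muladd_h(v8i16 acc, v16i8 s, v16i8 t)
{
    for (int i = 0; i < 8; ++i)
        acc.h[i] = (int16_t)(acc.h[i] + s.b[2 * i] * t.b[2 * i]
                                      + s.b[2 * i + 1] * t.b[2 * i + 1]);
    return acc;
}

/* Pack one kernel row (9 taps a1..c3 by 16 outputs) into ft[0..9]:
 * two adjacent taps interleaved, outputs 1-8 and 9-16 in separate registers,
 * and the 9th tap interleaved with 0 (160 values in total). */
static void pack_kernel_row(int8_t w[9][16], v16i8 ft[10])
{
    for (int k = 0; k < 5; ++k)
        for (int half = 0; half < 2; ++half)
            for (int i = 0; i < 8; ++i) {
                int d = 8 * half + i;
                ft[2 * k + half].b[2 * i]     = w[2 * k][d];
                ft[2 * k + half].b[2 * i + 1] = (2 * k + 1 < 9) ? w[2 * k + 1][d] : 0;
            }
}

/* One iteration of the innermost loop for the first output group. */
static void row_accumulate(v16i8 in_value, const v16i8 ft[10],
                           v8i16 *sum_0, v8i16 *sum_1)
{
    for (int k = 0; k < 5; ++k) {
        v16i8 in_k = emu_copy2(in_value, k);
        *sum_0 = emu_muladd_h(*sum_0, in_k, ft[2 * k]);      /* outputs 1-8  */
        *sum_1 = emu_muladd_h(*sum_1, in_k, ft[2 * k + 1]);  /* outputs 9-16 */
    }
}
```

Because the tenth value of the row (D1r in in_4) is always paired with a 0 in ft_8 and ft_9, it does not affect sum_0 or sum_1, whatever its value.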

c) Mode for calculating two groups of data

This mode reduces repeated loading of data by calculating two groups of data together. The convolution kernel data is loaded once and two groups of data are calculated at the same time, which also makes better use of the loaded input data. The data loaded at one time is shown in fig. 9. Fig. 7 is used to calculate the first group of output data, sum_0 and sum_1, and fig. 8 is used to calculate the second group of output data, sum_2 and sum_3. Fig. 9 is the combination of figs. 7 and 8, and the overlapping common portion in the middle is C1r, C1g, C1b, C2r, C2g, C2b, C3r, C3g, and C3b. In pseudo code, the input data for the second group is prepared as follows:

in_0＝in_3, and the data stored in the register is:

in_0＝[C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g]

in_1＝in_4, and the data stored in the register is:

in_1＝[C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r]

in_2＝ingenic_copy2(in_value,5), and the data stored in the register is:

in_2＝[D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b,D1g,D1b]

in_3＝ingenic_copy2(in_value,6), and the data stored in the register is:

in_3＝[E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g,E1r,E1g]

in_4＝ingenic_copy2(in_value,7), and the data stored in the register is:

in_4＝[E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r,E1b,F1r]

Then sum_2 and sum_3 are calculated by applying the multiply-then-add-adjacent SIMD instruction to in_0, in_1, in_2, in_3, in_4 and ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9. In this way, part of the input data is reused and the convolution kernel data is reused as well.
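The reuse described above can be illustrated with a scalar C sketch that computes both output groups from a single 16-value load. It is not the SIMD code itself: the helper names add_pair and two_groups_row are hypothetical, each broadcast register is represented simply by a pointer to its pair of values, and signed 8-bit inputs are an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Add the contribution of one input pair (pair slot k of the kernel row) to
 * the 16 outputs. The row has 9 taps, so the second value of slot 4 is
 * multiplied by 0, as in ft_8 and ft_9. */
static void add_pair(int16_t sum_lo[8], int16_t sum_hi[8],
                     const int8_t *pair, int8_t w[9][16], int k)
{
    for (int d = 0; d < 16; ++d) {
        int w0 = w[2 * k][d];
        int w1 = (2 * k + 1 < 9) ? w[2 * k + 1][d] : 0;
        int16_t *dst = d < 8 ? &sum_lo[d] : &sum_hi[d - 8];
        *dst = (int16_t)(*dst + pair[0] * w0 + pair[1] * w1);
    }
}

/* One kernel row, two adjacent output positions, one 16-value load. */
static void two_groups_row(const int8_t in_value[16], int8_t w[9][16],
                           int16_t sum_0[8], int16_t sum_1[8],
                           int16_t sum_2[8], int16_t sum_3[8])
{
    const int8_t *in[5];

    /* First group: pairs 0..4 of in_value (A1r .. D1r). */
    for (int k = 0; k < 5; ++k) in[k] = &in_value[2 * k];
    for (int k = 0; k < 5; ++k) add_pair(sum_0, sum_1, in[k], w, k);

    /* Second group: reuse pairs 3 and 4, then take pairs 5..7. */
    in[0] = in[3];                                   /* in_0 = in_3 */
    in[1] = in[4];                                   /* in_1 = in_4 */
    for (int k = 2; k < 5; ++k) in[k] = &in_value[2 * (k + 3)];
    for (int k = 0; k < 5; ++k) add_pair(sum_2, sum_3, in[k], w, k);
}
```

In this sketch the same kernel row w serves both groups, and only three new pairs are taken from in_value for the second group, which mirrors the reuse of in_3, in_4 and ft_0 to ft_9 in the pseudo code.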

By the above simd instruction algorithm, the speed can be improved by nearly 30 times.

More systematically, as shown in fig. 10, the method according to the invention comprises the following steps:

S1, initialize the setting parameters:

S2, first-level loop: set j＝0;

S2.1, determine whether j < out_width*out_height holds;

S2.2, if the condition does not hold, stop the first-level loop; if it holds, execute: out_w＝j%out_width; here j traverses the width of the feature map multiplied by its height, and the remainder gives the position in the width direction. For example, if j is 1.5 × out_width, the remainder is 0.5 × out_width and j/out_width is 1 (integer division: only the integer part is kept and the fractional part is discarded), so the current position is column 0.5 × out_width of row 1 (counting from row 0).

out_h＝j/out_width；

in_h＝out_h*stride；

in_w＝out_w*stride；

in_ptr＝indata+in_w*in_depth; in_ptr is the input data pointer;

out_ptr＝outdata+j*out_depth; out_ptr is the output data pointer; then determine whether out_w%2==0 holds; if not, the following body is not executed; if so, the following is executed: initialize registers sum_0, sum_1, sum_2, sum_3 to 0:

sum_0＝0,0,0,0,0,0,0,0,0,0; initialized to 0;

sum_1＝sum_0; initialized to 0;

sum_2＝sum_0; initialized to 0;

sum_3＝sum_0; initialized to 0;

execute step S3;

S2.3, set j＝j+1 and return to step S2.1;

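S3, innermost loop: set p＝0;

S3.1, determine whether p < ft_h holds;

S3.2, if not, stop the innermost loop and return to step S2.3; if so, obtain the input data pointer a_loc＝a_ptr+(in_h+p)*in_width*in_depth, and load the input data into register in_value, expressed as: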

in_value＝ingenic_load(a_loc,0)；

in_0＝ingenic_copy2(in_value,0)；

in_1＝ingenic_copy2(in_value,1)；

in_2＝ingenic_copy2(in_value,2)；

in_3＝ingenic_copy2(in_value,3)；

in_4＝ingenic_copy2(in_value,4)；

ft_0＝ingenic_load(b_ptr,0)；

ft_1＝ingenic_load(b_ptr,16)；

ft_2＝ingenic_load(b_ptr,32)；

ft_3＝ingenic_load(b_ptr,48)；

ft_4＝ingenic_load(b_ptr,64)；

ft_5＝ingenic_load(b_ptr,80)；

ft_6＝ingenic_load(b_ptr,96)；

ft_7＝ingenic_load(b_ptr,112)；

ft_8＝ingenic_load(b_ptr,128)；

ft_9＝ingenic_load(b_ptr,144)；

sum_0＝ingenic_muladd_h(sum_0,in_0,ft_0)；

sum_1＝ingenic_muladd_h(sum_1,in_0,ft_1)；

sum_0＝ingenic_muladd_h(sum_0,in_1,ft_2)；

sum_1＝ingenic_muladd_h(sum_1,in_1,ft_3)；

sum_0＝ingenic_muladd_h(sum_0,in_2,ft_4)；

sum_1＝ingenic_muladd_h(sum_1,in_2,ft_5)；

sum_0＝ingenic_muladd_h(sum_0,in_3,ft_6)；

sum_1＝ingenic_muladd_h(sum_1,in_3,ft_7)；

sum_0＝ingenic_muladd_h(sum_0,in_4,ft_8)；

sum_1＝ingenic_muladd_h(sum_1,in_4,ft_9)；

assign in_3 to in_0 and in_4 to in_1:

in_0＝in_3；

in_1＝in_4；

in_2＝ingenic_copy2(in_value,5)；

in_3＝ingenic_copy2(in_value,6)；

in_4＝ingenic_copy2(in_value,7)；

sum_2＝ingenic_muladd_h(sum_2,in_0,ft_0)；

sum_3＝ingenic_muladd_h(sum_3,in_0,ft_1)；

sum_2＝ingenic_muladd_h(sum_2,in_1,ft_2)；

sum_3＝ingenic_muladd_h(sum_3,in_1,ft_3)；

sum_2＝ingenic_muladd_h(sum_2,in_2,ft_4)；

sum_3＝ingenic_muladd_h(sum_3,in_2,ft_5)；

sum_2＝ingenic_muladd_h(sum_2,in_3,ft_6)；

sum_3＝ingenic_muladd_h(sum_3,in_3,ft_7)；

sum_2＝ingenic_muladd_h(sum_2,in_4,ft_8)；

sum_3＝ingenic_muladd_h(sum_3,in_4,ft_9)；

S3.3, set p＝p+1 and return to step S3.1.
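The index arithmetic of steps S2.2 and S3 can be sketched in C as follows. The struct conv_pos and the helper names are hypothetical, offsets are returned instead of pointers so that the sketch is self-contained, and the row offset assumes that a_ptr in the expression a_loc＝a_ptr+(in_h+p)*in_width*in_depth of claim 3 is the pointer in_ptr computed in step S2.2.

```c
#include <assert.h>

typedef struct {
    int out_w, out_h;   /* position in the output feature map   */
    int in_w, in_h;     /* top-left position of the input window */
    int in_off;         /* in_ptr  - indata  */
    int out_off;        /* out_ptr - outdata */
} conv_pos;

/* Step S2.2: decode the flat loop index j into positions and offsets. */
static conv_pos decode_position(int j, int out_width, int stride,
                                int in_depth, int out_depth)
{
    conv_pos p;
    p.out_w   = j % out_width;      /* column: remainder        */
    p.out_h   = j / out_width;      /* row: integer division    */
    p.in_w    = p.out_w * stride;
    p.in_h    = p.out_h * stride;
    p.in_off  = p.in_w * in_depth;
    p.out_off = j * out_depth;
    return p;
}

/* Two output positions are produced per pass, so only even columns start one. */
static int starts_pair(conv_pos p) { return p.out_w % 2 == 0; }

/* Step S3: offset of kernel row p of the current window (assumes a_ptr == in_ptr). */
static int row_offset(conv_pos pos, int p, int in_width, int in_depth)
{
    return pos.in_off + (pos.in_h + p) * in_width * in_depth;
}
```

For example, with out_width = 4, stride = 2, in_depth = 3 and out_depth = 16, the index j = 6 decodes to column 2 of row 1, and the 16 values loaded for kernel row p start at row_offset from indata.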

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An optimization method based on first layer 4bit convolution calculation, characterized in that the method implements a complete convolution calculation process for an input 3-channel color image with an output data depth of 16; the relevant SIMD instruction operations are in the innermost loop; in the innermost loop, the three channels of the image and three groups of depth data along the width are treated as one group of data for processing and calculation; the convolution kernel data is stored in the order required by the convolution calculation, namely, two adjacent rows of data are stored interleaved, in two groups, along the output depth; for the SIMD instructions in the innermost loop, eight consecutive rows must be stored with every two adjacent rows interleaved, and each value in the 9th row is interleaved with 0, which is then equivalent to ten rows of data.

2. The optimization method based on the first layer 4bit convolution calculation of claim 1, characterized in that data is loaded contiguously; in the innermost loop, each time one data item from the loaded data is copied across a SIMD variable register, an 8-bit multiplication SIMD instruction is executed, and after conversion to 16 bits an accumulation SIMD instruction is executed; or, in the convolution calculation, the conversion of 8-bit data into 16-bit data is realized by using the multiply-then-add-adjacent SIMD instruction.

3. The method of claim 2, wherein the method comprises the steps of:

S1, initialize the setting parameters:

let the input data indata be a set of data with input depth in_depth of 32, width in_width of 512, and height in_height of 512; the convolution kernel data filter_data is a set of data with output depth out_depth of 128, input depth in_depth of 32 (consistent with the depth of the input data), convolution kernel width ft_w of 3, and convolution kernel height ft_h of 3; let the structure of the output data, i.e. the feature map outdata, be: depth out_depth, width out_width, and height out_height; the convolution calculation has a step size, which is set as stride;

S2, first-level loop: set j＝0;

S2.1, determine whether j < out_width*out_height holds;

S2.2, if the condition does not hold, stop the first-level loop; if it holds, execute: out_w＝j%out_width;

out_h＝j/out_width；

in_h＝out_h*stride；

in_w＝out_w*stride；

in_ptr＝indata+in_w*in_depth; in_ptr is the input data pointer;

out_ptr＝outdata+j*out_depth; out_ptr is the output data pointer; then determine whether out_w%2==0 holds; if not, the following body is not executed; if so, the following is executed: initialize registers sum_0, sum_1, sum_2, sum_3 to 0:

sum_0＝0,0,0,0,0,0,0,0,0,0; initialized to 0;

sum_1＝sum_0; initialized to 0;

sum_2＝sum_0; initialized to 0;

sum_3＝sum_0; initialized to 0;

execute step S3;

S2.3, set j＝j+1 and return to step S2.1;

S3, innermost loop: set p＝0;

S3.1, determine whether p < ft_h holds;

S3.2, if not, stop the innermost loop and return to step S2.3; if so,

execute the following: obtain the input data pointer a_loc＝a_ptr+(in_h+p)*in_width*in_depth, where a_loc is the input data pointer;

load the input data into register in_value, expressed as:

in_value＝ingenic_load(a_loc,0)；

in_0＝ingenic_copy2(in_value,0)；

in_1＝ingenic_copy2(in_value,1)；

in_2＝ingenic_copy2(in_value,2)；

in_3＝ingenic_copy2(in_value,3)；

in_4＝ingenic_copy2(in_value,4)；

the convolution kernel data is loaded into registers ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, and ft_9, expressed as:

ft_0＝ingenic_load(b_ptr,0)；

ft_1＝ingenic_load(b_ptr,16)；

ft_2＝ingenic_load(b_ptr,32)；

ft_3＝ingenic_load(b_ptr,48)；

ft_4＝ingenic_load(b_ptr,64)；

ft_5＝ingenic_load(b_ptr,80)；

ft_6＝ingenic_load(b_ptr,96)；

ft_7＝ingenic_load(b_ptr,112)；

ft_8＝ingenic_load(b_ptr,128)；

ft_9＝ingenic_load(b_ptr,144)；

sum_0＝ingenic_muladd_h(sum_0,in_0,ft_0)；

sum_1＝ingenic_muladd_h(sum_1,in_0,ft_1)；

sum_0＝ingenic_muladd_h(sum_0,in_1,ft_2)；

sum_1＝ingenic_muladd_h(sum_1,in_1,ft_3)；

sum_0＝ingenic_muladd_h(sum_0,in_2,ft_4)；

sum_1＝ingenic_muladd_h(sum_1,in_2,ft_5)；

sum_0＝ingenic_muladd_h(sum_0,in_3,ft_6)；

sum_1＝ingenic_muladd_h(sum_1,in_3,ft_7)；

sum_0＝ingenic_muladd_h(sum_0,in_4,ft_8)；

sum_1＝ingenic_muladd_h(sum_1,in_4,ft_9)；

assign in_3 to in_0 and in_4 to in_1:

in_0＝in_3；

in_1＝in_4；

in_2＝ingenic_copy2(in_value,5)；

in_3＝ingenic_copy2(in_value,6)；

in_4＝ingenic_copy2(in_value,7)；

sum_2＝ingenic_muladd_h(sum_2,in_0,ft_0)；

sum_3＝ingenic_muladd_h(sum_3,in_0,ft_1)；

sum_2＝ingenic_muladd_h(sum_2,in_1,ft_2)；

sum_3＝ingenic_muladd_h(sum_3,in_1,ft_3)；

sum_2＝ingenic_muladd_h(sum_2,in_2,ft_4)；

sum_3＝ingenic_muladd_h(sum_3,in_2,ft_5)；

sum_2＝ingenic_muladd_h(sum_2,in_3,ft_6)；

sum_3＝ingenic_muladd_h(sum_3,in_3,ft_7)；

sum_2＝ingenic_muladd_h(sum_2,in_4,ft_8)；

sum_3＝ingenic_muladd_h(sum_3,in_4,ft_9)；

S3.3, set p＝p+1 and return to step S3.1.

4. The method of claim 3, wherein in step S1, the data in one input depth direction of the input data is:

[x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,…x32]；

the data of the convolution kernel data in two output depth directions are:

[d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,…d128]；

[e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11,e12,e13,e14,e15,e16,…e128]。

5. The method according to claim 3, wherein in step S1, stride is 2 and out_depth is 16.

6. The optimization method based on the first layer 4bit convolution calculation of claim 5, wherein in the innermost loop, 16 values are loaded into the in_value register, the values being read in the sequence A1r, A1g, A1b; B1r, B1g, B1b; C1r, C1g, C1b; D1r, D1g, D1b; E1r, E1g, E1b; F1r; of these 16 values, only the first 15 are actually used, and they are denoted as the first layer;

wherein the 7th to 15th values, C1r, C1g, C1b; D1r, D1g, D1b; E1r, E1g, E1b, denoted as the first layer, are used to calculate the accumulated values of the second group of output data, sum_2 and sum_3;

this arises because the input data depth is 3, the convolution kernel width is 3, and the step size is 2, so that two groups are calculated at a time, 15 values are used, and the register holds 16 8-bit values.

7. The optimization method based on the first layer 4bit convolution calculation of claim 6, characterized in that, for the first 9 values, the first layer is treated as one iteration of the loop, that is, of the innermost loop, and the data of this layer is processed as a whole; the number of values is 9, and the multiply-then-add-adjacent SIMD instruction needs an even number of values, so one arbitrary extra value is appended to make the count even; that value is multiplied by 0 and therefore contributes nothing to the sum; to output 1 value, 9 convolution kernel values are required, and 9 kernel values plus one 0 are actually used; to output 16 values, 144 convolution kernel values are required, and 160 are actually used; the multiply-then-add-adjacent SIMD instruction is expressed as: vrd＝ingenic_muladd_h(vrd,vrs,vrt), which performs the 8-bit multiplication and the accumulation and produces the required 16-bit result.

8. The method of claim 7, wherein the step S3.2 further comprises:

in_0＝ingenic_copy2(in_value,0), and the data stored in the register is:

in_0＝[A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g,A1r,A1g]；

in_1＝ingenic_copy2(in_value,1), and the data stored in the register is:

in_1＝[A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r,A1b,B1r]；

in_2＝ingenic_copy2(in_value,2), and the data stored in the register is:

in_2＝[B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b,B1g,B1b]；

in_3＝ingenic_copy2(in_value,3), and the data stored in the register is:

in_3＝[C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g,C1r,C1g]；

in_4＝ingenic_copy2(in_value,4), and the data stored in the register is:

in_4＝[C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r,C1b,D1r]；

ft_0＝ingenic_load(b_ptr,0), and the data stored in the register is:

ft_0＝[a1_1,a2_1,a1_2,a2_2,a1_3,a2_3,…,a1_8,a2_8]；

ft_1＝ingenic_load(b_ptr,16), and the data stored in the register is:

ft_1＝[a1_9,a2_9,a1_10,a2_10,a1_11,a2_11,…,a1_16,a2_16]；

ft_2＝ingenic_load(b_ptr,32), and the data stored in the register is:

ft_2＝[a3_1,b1_1,a3_2,b1_2,a3_3,b1_3,…,a3_8,b1_8]；

ft_3＝ingenic_load(b_ptr,48), and the data stored in the register is:

ft_3＝[a3_9,b1_9,a3_10,b1_10,a3_11,b1_11,…,a3_16,b1_16]；

ft_4＝ingenic_load(b_ptr,64), and the data stored in the register is:

ft_4＝[b2_1,b3_1,b2_2,b3_2,b2_3,b3_3,…,b2_8,b3_8]；

ft_5＝ingenic_load(b_ptr,80), and the data stored in the register is:

ft_5＝[b2_9,b3_9,b2_10,b3_10,b2_11,b3_11,…,b2_16,b3_16]；

ft_6＝ingenic_load(b_ptr,96), and the data stored in the register is:

ft_6＝[c1_1,c2_1,c1_2,c2_2,c1_3,c2_3,…,c1_8,c2_8]；

ft_7＝ingenic_load(b_ptr,112), and the data stored in the register is:

ft_7＝[c1_9,c2_9,c1_10,c2_10,c1_11,c2_11,…,c1_16,c2_16]；

ft_8＝ingenic_load(b_ptr,128), and the data stored in the register is:

ft_8＝[c3_1,0,c3_2,0,c3_3,0,…,c3_8,0]；

ft_9＝ingenic_load(b_ptr,144), and the data stored in the register is:

ft_9＝[c3_9,0,c3_10,0,c3_11,0,…,c3_16,0]；

then in_0, in_1, in_2, in_3, in_4 and ft_0, ft_1, ft_2, ft_3, ft_4, ft_5, ft_6, ft_7, ft_8, ft_9 are used to execute the multiply-then-add-adjacent SIMD instruction to calculate sum_0 and sum_1; completing the innermost loop completes one convolution calculation, which yields sum_0 and sum_1; each of the two registers holds 8 16-bit values, 16 values in total; after the above steps are executed, the complete data output result is obtained.

9. The optimization method based on the first layer 4bit convolution calculation of claim 8, wherein the method includes a mode of calculating two groups of data, that is, the convolution kernel data is loaded once and two groups of data are calculated simultaneously, the required input data for calculating the two groups of output data being loaded in one complete pass of the innermost loop,