CN115904489A - Method for optimizing superposition of feature maps - Google Patents


Info

Publication number
CN115904489A
Authority
CN
China
Prior art keywords
data
vrsv
ingenic
savedata
tmp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110923498.9A
Other languages
Chinese (zh)
Inventor
田凤彬
于晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd filed Critical Beijing Ingenic Semiconductor Co Ltd
Priority to CN202110923498.9A priority Critical patent/CN115904489A/en
Publication of CN115904489A publication Critical patent/CN115904489A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method for optimizing the superposition of feature maps, comprising the following: in the calculation of low-bit feature maps, the average is computed with banker's rounding, where the inputs are a and b, the output is res, and the formula is:
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1    formula (13)
An instruction that computes the average of two numbers is used; it takes two 8-bit values as input and outputs one 8-bit value, and it discards the fractional part (full truncation), which realizes (a + b) >> 1. An XOR instruction and AND instructions are then used to complete the calculation of formula (13). For data loading, 128 values are loaded at a time from each of the two feature maps. Since the chip has only 32 registers, no more than 32 registers may be in use at once. The operation of formula (13) is designed first, and then the overall operation is designed to realize the optimization. The method is a relatively optimal optimization method and, compared with the original implementation written in C, can increase the operation speed by a factor of 50.

Description

Method for optimizing superposition of feature maps
Technical Field
The invention relates to the technical field of image processing, in particular to a method for optimizing superposition of feature maps.
Background
Integrated circuit technology is increasingly a focus of technological development, and many chip manufacturers are developing their own chips. Each chip design brings its own requirements in application. For example, in the T30 and T31 chips produced by Beijing Ingenic Semiconductor Co., Ltd. (abbreviated as Beijing Ingenic), the registers are 128 bits wide and limited in number, 32 in total. If more than 32 registers are used, the contents of registers loaded earlier are saved to memory and must be reloaded when they are needed again, which is very inefficient. The cache space is also very small, generally only 512 bits, and is shared with the system. Data is first loaded from DDR (DDR SDRAM, double data rate synchronous dynamic random access memory, commonly referred to as DDR) into the cache, 512 bits at a time, and then from the cache into the registers. If fewer than 512 bits are used, the remaining data may be overwritten by the system or by other applications and has to be reloaded when it is needed again. The data loaded into the cache should therefore be used as fully as possible. The bottleneck in data transfer is the transfer from DDR to the cache.
Therefore, the prior art has the defects that:
1. On the chips produced by Beijing Ingenic, running a plain C program directly is very slow.
2. There is no optimization method for computing the mean of two feature maps based on banker's rounding (round half to even: four rounds down, six rounds up, five rounds to the even neighbor).
In addition, the common terms in the prior art are as follows:
1. SIMD instruction: single instruction stream, multiple data streams; that is, one instruction operates on multiple data elements, which increases the speed of the program. More loosely, it is a vector calculation. The specific instruction set differs from chip to chip.
2. Feature map: the result of a convolution over the input data is called a feature map (or output data), and the result of a fully connected layer is also called a feature map (or output data). The size of a feature map is typically expressed as length x width x depth, or 1 x depth. The depth is also referred to as the channel.
Disclosure of Invention
In order to solve the above problems in the prior art, the present application aims to increase the operation speed severalfold. In particular, it addresses the constraint that the registers in existing chips are 128 bits wide and limited in number, and achieves the optimization with a calculation method based on banker's rounding.
Specifically, the invention provides a method for optimizing the superposition of feature maps, comprising the following: in the calculation of low-bit feature maps, the average is computed with banker's rounding, where the inputs are a and b, the output is res, and the formula is:
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1    formula (13)
An instruction that computes the average of two numbers is used; it takes two 8-bit values as input and outputs one 8-bit average, and it discards the fractional part (full truncation), which realizes (a + b) >> 1. An XOR instruction and AND instructions are then used to complete the calculation of formula (13). For data loading, 128 values are loaded at a time from each of the two feature maps. Since the chip has only 32 registers, no more than 32 registers may be in use at once. The operation of formula (13) is designed first, and then the overall operation is designed to realize the optimization.
The method further comprises the following steps:
S1, designing the operation of formula (13):
Let the function for the formula be vrd = average_bank(vrs, vrt), where the two input registers are vrs and vrt and the output is vrd; the temporary registers for the intermediate calculation are ave_tmp and vorand_tmp;
S1.1, compute the average with the average instruction: the inputs are vrs and vrt and the output is stored in ave_tmp, realizing res_0 = ab >> 1 of formula (3);
S1.2, apply the XOR instruction: the inputs are vrs and vrt and the output is stored in vorand_tmp, realizing mod_0 = a ^ b of formula (4);
S1.3, apply the AND instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vorand_tmp, realizing (a ^ b) & res_0;
S1.4, apply the AND instruction: the inputs are vorand_tmp and the constant 1 and the output is stored in vorand_tmp, realizing mod_5 = (a ^ b) & res_0 & 1 of formula (12);
S1.5, apply the addition instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vrd, realizing formula (13), res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1;
S2, overall operation: because there are only 32 registers, the core processing is made into a function, and the function vrd = average_bank(vrs, vrt) is called each time in an online processing mode;
S2.1, loading data, 128 values per load: the input data to be loaded, indata1 and indata2, are loaded into the variables vrsv and vrtv. Let m take the values 0, 16, 32, 48, 64, 80, 96 and 112 in turn; for each m, 128 bits of data are loaded from the position m bytes past the address pointed to by indata1 or indata2 in memory. All data must be read contiguously; reads must not alternate between indata1 and indata2. Let the register arrays used for loading be vrsv[8] and vrtv[8]. Loading from indata1: when m = 0, 16 8-bit values starting from element 0 are loaded into vrsv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrsv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrsv[7]; after processing, indata1 = indata1 + 128. Loading from indata2: when m = 0, 16 8-bit values starting from element 0 are loaded into vrtv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrtv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrtv[7]; after processing, indata2 = indata2 + 128.
S2.2, call vrd = average_bank(vrs, vrt) to process vrsv and vrtv, and store the results in vrsv:
S2.2.1, each call takes vrsv[i] and vrtv[i] as inputs and stores the output in vrsv[i], that is,
vrsv[i]=average_bank(vrsv[i],vrtv[i]);
S2.2.2, starting from i = 0, apply step S2.2.1, then apply it in turn for i = 1 through i = 7, computing all results and storing them in vrsv;
S2.3, output the results to savedata; the results are likewise output in a single pass:
The register variables vrsv[i], i = 0, ..., 7, hold the data to be stored, and savedata is the pointer to the destination, that is, the start address of the allocated storage space. After one batch has been stored, the next batch is stored by advancing the pointer by the number of 8-bit values stored (that is, savedata = savedata + 128). Let m take the values 0, 16, 32, 48, 64, 80, 96 and 112 in turn; the 128 bits of data in vrsv[i] are stored starting at the position m bytes past the address pointed to by savedata in memory. When m = 0, the 16 8-bit values in vrsv[0] are stored starting at savedata + 0; when m = 16, the 16 8-bit values in vrsv[1] are stored starting at savedata + 16; ...; when m = 112, the 16 8-bit values in vrsv[7] are stored starting at savedata + 112; after processing, savedata = savedata + 128.
Banker's rounding is derived as follows:
ab = a + b; formula (2)
res_0 = ab >> 1; formula (3)
Calculating whether the remainder produces a carry:
mod_0 = a ^ b; formula (4)
mod_1 = mod_0 & 1; formula (5)
mod_2 = res_0 ^ 1; formula (6)
mod_3 = mod_1 & mod_2; formula (7)
From formulas (2) to (7):
mod_4 = [(a ^ b) & 1] & (res_0 & 1); formula (8)
Simplifying formula (8) according to the rules of AND, OR and XOR operations:
mod_4 = (a ^ b) & res_0 & 1; formula (9)
Calculating the sign:
sign_0 = (res_0 >> 8) | 1; formula (10)
Calculating whether to carry, from formulas (9) and (10):
mod_5 = [(a ^ b) & res_0 & 1] & [(res_0 >> 8) | 1]; formula (11)
Since formula (8) contains an & 1, the result can only be 0 or 1; simplifying formula (11) according to the rules of AND, OR and XOR operations:
mod_5 = (a ^ b) & res_0 & 1; formula (12)
From formulas (2), (3) and (12):
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1; formula (13)
Formula (13) is the final calculation formula.
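The final formula can be checked in scalar form. The following is a minimal sketch, not the patented implementation: the function names are hypothetical, the grouping res_0 + ((a ^ b) & res_0 & 1) is written out with explicit parentheses because C gives + a higher precedence than &, and >> on a negative int is assumed to be an arithmetic shift (implementation-defined in C, true for GCC and Clang).

```c
#include <stdint.h>

/* Formula (13) for one pair of 8-bit values (hypothetical helper name).
   The sum is formed in int, so it cannot overflow. */
static int8_t bankers_avg_i8(int8_t a, int8_t b)
{
    int res_0 = ((int)a + (int)b) >> 1;         /* formula (3): floor of the mean */
    int mod_5 = ((int)a ^ (int)b) & res_0 & 1;  /* formula (12): 1 only for a tie
                                                   whose floor is odd */
    return (int8_t)(res_0 + mod_5);             /* formula (13) */
}

/* Independent reference: round half to even, written with explicit cases. */
static int ref_half_even(int a, int b)
{
    int s = a + b;
    if (s % 2 == 0)
        return s / 2;                    /* exact mean, nothing to round */
    int lo = (s - 1) / 2;                /* s - 1 is even, so this is floor(s / 2) */
    return (lo % 2 == 0) ? lo : lo + 1;  /* choose the even neighbour */
}

/* Counts the (a, b) pairs over the full int8_t range where the two disagree. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = -128; a <= 127; ++a)
        for (int b = -128; b <= 127; ++b)
            if (bankers_avg_i8((int8_t)a, (int8_t)b) != ref_half_even(a, b))
                ++bad;
    return bad;
}
```

Calling count_mismatches() compares the two functions over all 65536 input pairs; bankers_avg_i8(3, 4) rounds 3.5 up to 4, while bankers_avg_i8(1, 4) rounds 2.5 down to 2.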
Step S1 includes:
S1.1, the average instruction is used; the output variable stores 16 int8_t values, and the input variables vrs and vrt each store 16 int8_t values; the fractional part is discarded (full truncation). Expressed as: ave_tmp = ingenic_aves_b(vrs, vrt);
S1.2, the XOR instruction is used; the output variable stores 16 int8_t values, and the input variables vrs and vrt each store 16 int8_t values. Expressed as:
vorand_tmp=ingenic_xorv_b(vrs,vrt);
S1.3, the AND instruction is used; the output variable stores 16 int8_t values, and the input variables ave_tmp and vorand_tmp each store 16 int8_t values. Expressed as: vorand_tmp = ingenic_andv_b(ave_tmp, vorand_tmp);
S1.4, the AND-immediate instruction is used; the output variable stores 16 int8_t values, the input variable vorand_tmp stores 16 int8_t values, and 1 is a constant. Expressed as: vorand_tmp = ingenic_andi_b(vorand_tmp, 1);
S1.5, the addition instruction is used; the output variable stores 16 int8_t values, and the input variables ave_tmp and vorand_tmp each store 16 int8_t values. Expressed as: vrd = ingenic_add_b(ave_tmp, vorand_tmp).
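For readers without the Ingenic toolchain, steps S1.1 to S1.5 can be traced with a portable emulation. This is only a sketch under stated assumptions: the v16i8 struct and the emu_ functions are hypothetical stand-ins written for illustration, not part of any Ingenic header, and their lane-wise behavior follows the "equivalent operation" descriptions in this document rather than hardware documentation.

```c
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit register holding 16 int8_t lanes. */
typedef struct { int8_t lane[16]; } v16i8;

/* Lane-wise (s + t) >> 1 with the fraction discarded; the sum is formed in int. */
static v16i8 emu_aves_b(v16i8 s, v16i8 t)
{
    v16i8 d;
    for (int i = 0; i < 16; ++i)
        d.lane[i] = (int8_t)((s.lane[i] + t.lane[i]) >> 1);
    return d;
}

static v16i8 emu_xorv_b(v16i8 s, v16i8 t)   /* lane-wise XOR */
{
    v16i8 d;
    for (int i = 0; i < 16; ++i)
        d.lane[i] = (int8_t)(s.lane[i] ^ t.lane[i]);
    return d;
}

static v16i8 emu_andv_b(v16i8 s, v16i8 t)   /* lane-wise AND */
{
    v16i8 d;
    for (int i = 0; i < 16; ++i)
        d.lane[i] = (int8_t)(s.lane[i] & t.lane[i]);
    return d;
}

static v16i8 emu_andi_b(v16i8 s, int8_t imm) /* lane-wise AND with a constant */
{
    v16i8 d;
    for (int i = 0; i < 16; ++i)
        d.lane[i] = (int8_t)(s.lane[i] & imm);
    return d;
}

static v16i8 emu_add_b(v16i8 s, v16i8 t)    /* lane-wise add; never overflows here */
{
    v16i8 d;
    for (int i = 0; i < 16; ++i)
        d.lane[i] = (int8_t)(s.lane[i] + t.lane[i]);
    return d;
}

/* The five-step sequence of S1, using two temporaries as in the text. */
static v16i8 average_bank(v16i8 vrs, v16i8 vrt)
{
    v16i8 ave_tmp    = emu_aves_b(vrs, vrt);            /* S1.1: res_0            */
    v16i8 vorand_tmp = emu_xorv_b(vrs, vrt);            /* S1.2: a ^ b            */
    vorand_tmp       = emu_andv_b(ave_tmp, vorand_tmp); /* S1.3: (a ^ b) & res_0  */
    vorand_tmp       = emu_andi_b(vorand_tmp, 1);       /* S1.4: ... & 1          */
    return emu_add_b(ave_tmp, vorand_tmp);              /* S1.5: formula (13)     */
}
```

Only two temporaries are live at any point, which is what the register budget in step S2 relies on.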
Step S2 further includes:
S2.1, the load instruction is used: the inputs indata1 and indata2 are pointers to the data to be loaded, and m takes the values 0, 16, 32, 48, 64, 80, 96 and 112 in turn. Each load reads 128 bits of data starting from the position pointed to in memory: 16 values if the data is 8-bit, or 8 values if the data is 16-bit. The data is loaded into the register variable vrd (here vrsv[i] or vrtv[i]). The offset m is counted in bytes, that is, in units of 8 bits, and 128 bits of data are loaded in turn from the position m bytes past the address pointed to by indata1 or indata2. Loading from indata1: when m = 0, 16 8-bit values starting from element 0 are loaded into vrsv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrsv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrsv[7]; after processing, indata1 = indata1 + 128. Loading from indata2: when m = 0, 16 8-bit values starting from element 0 are loaded into vrtv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrtv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrtv[7]; after processing, indata2 = indata2 + 128. Expressed as:
vrsv[0]=ingenic_load(indata1,0);
vrsv[1]=ingenic_load(indata1,16);
vrsv[2]=ingenic_load(indata1,32);
vrsv[3]=ingenic_load(indata1,48);
vrsv[4]=ingenic_load(indata1,64);
vrsv[5]=ingenic_load(indata1,80);
vrsv[6]=ingenic_load(indata1,96);
vrsv[7]=ingenic_load(indata1,112);
vrtv[0]=ingenic_load(indata2,0);
vrtv[1]=ingenic_load(indata2,16);
vrtv[2]=ingenic_load(indata2,32);
vrtv[3]=ingenic_load(indata2,48);
vrtv[4]=ingenic_load(indata2,64);
vrtv[5]=ingenic_load(indata2,80);
vrtv[6]=ingenic_load(indata2,96);
vrtv[7]=ingenic_load(indata2,112);
Step S2.3 further comprises: the save data instruction is used, where vrsv[i], i = 0, ..., 7, are the register variables holding the data to be stored and savedata is the pointer to the destination; the data in each register is stored starting from the position given by the pointer, so that the 128 bits of data in the register variable vrd (here vrsv[i]) are stored to savedata. The offset m is counted in bytes, that is, in units of 8 bits, and 128 bits of data are stored starting at the position m bytes past the address pointed to by savedata in memory. When m = 0, the 16 8-bit values in vrsv[0] are stored starting at savedata + 0; when m = 16, the 16 8-bit values in vrsv[1] are stored starting at savedata + 16; ...; when m = 112, the 16 8-bit values in vrsv[7] are stored starting at savedata + 112; after processing, savedata = savedata + 128. Expressed as:
ingenic_save(vrsv[0],savedata,0);
ingenic_save(vrsv[1],savedata,16);
ingenic_save(vrsv[2],savedata,32);
ingenic_save(vrsv[3],savedata,48);
ingenic_save(vrsv[4],savedata,64);
ingenic_save(vrsv[5],savedata,80);
ingenic_save(vrsv[6],savedata,96);
ingenic_save(vrsv[7],savedata,112);
savedata+=128;
In step S2.3, the register arrays vrsv[8] and vrtv[8] use 16 registers, and the call vrd = average_bank(vrs, vrt) uses 2 temporary registers, for a total of 18, which does not exceed 32.
The method is applicable to chips with a SIMD instruction set.
The instruction for computing the average of two numbers is available on the Beijing Ingenic T30 and T31 chips.
Thus, the advantage of the present application is that it provides a relatively optimal optimization method which, compared with the original implementation written in C, can increase the operation speed by a factor of 50.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principle of the invention.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a flow chart of the steps of step S1 of the method of the present invention.
Fig. 3 is a flow chart of the steps of step S2 of the method of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in FIGS. 1-3, the present invention relates to a method for optimizing the superposition of feature maps, the method comprising:
S1, designing the operation of formula (13):
Let the function for the formula be vrd = average_bank(vrs, vrt), where the two input registers are vrs and vrt and the output is vrd; the temporary registers for the intermediate calculation are ave_tmp and vorand_tmp;
S1.1, compute the average with the average instruction: the inputs are vrs and vrt and the output is stored in ave_tmp, realizing res_0 = ab >> 1 of formula (3);
S1.2, apply the XOR instruction: the inputs are vrs and vrt and the output is stored in vorand_tmp, realizing mod_0 = a ^ b of formula (4);
S1.3, apply the AND instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vorand_tmp, realizing (a ^ b) & res_0;
S1.4, apply the AND instruction: the inputs are vorand_tmp and the constant 1 and the output is stored in vorand_tmp, realizing mod_5 = (a ^ b) & res_0 & 1 of formula (12);
S1.5, apply the addition instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vrd, realizing formula (13), res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1;
S2, overall operation: because there are only 32 registers, the core processing is made into a function, and the function vrd = average_bank(vrs, vrt) is called each time in an online processing mode;
S2.1, loading data, 128 values per load: the input data to be loaded, indata1 and indata2, are loaded into the variables vrsv and vrtv. Let m take the values 0, 16, 32, 48, 64, 80, 96 and 112 in turn; for each m, 128 bits of data are loaded from the position m bytes past the address pointed to by indata1 or indata2 in memory. All data must be read contiguously; reads must not alternate between indata1 and indata2. Let the register arrays used for loading be vrsv[8] and vrtv[8]. Loading from indata1: when m = 0, 16 8-bit values starting from element 0 are loaded into vrsv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrsv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrsv[7]; after processing, indata1 = indata1 + 128. Loading from indata2: when m = 0, 16 8-bit values starting from element 0 are loaded into vrtv[0]; when m = 16, 16 8-bit values starting from element 16 are loaded into vrtv[1]; ...; when m = 112, 16 8-bit values starting from element 112 are loaded into vrtv[7]; after processing, indata2 = indata2 + 128.
S2.2, call vrd = average_bank(vrs, vrt) to process vrsv and vrtv, and store the results in vrsv:
S2.2.1, each call takes vrsv[i] and vrtv[i] as inputs and stores the output in vrsv[i], that is,
vrsv[i]=average_bank(vrsv[i],vrtv[i]);
S2.2.2, starting from i = 0, apply step S2.2.1, then apply it in turn for i = 1 through i = 7, computing all results and storing them in vrsv;
S2.3, output the results to savedata; the results are likewise output in a single pass:
The register variables vrsv[i], i = 0, ..., 7, hold the data to be stored, and savedata is the pointer to the destination. Let m take the values 0, 16, 32, 48, 64, 80, 96 and 112 in turn; the 128 bits of data in vrsv[i] are stored starting at the position m bytes past the address pointed to by savedata in memory. When m = 0, the 16 8-bit values in vrsv[0] are stored starting at savedata + 0; when m = 16, the 16 8-bit values in vrsv[1] are stored starting at savedata + 16; ...; when m = 112, the 16 8-bit values in vrsv[7] are stored starting at savedata + 112; after processing, savedata = savedata + 128.
The specific embodiments of the method of the present application can also be described as follows:
SIMD instruction algorithm.
1) Introduction to the SIMD instructions
The SIMD instructions used are as follows.
a) Average instruction:
vrd=ingenic_aves_b(vrs,vrt);
The input variables are vrs and vrt, and the output variable is vrd. vrd stores 16 int8_t values, and vrs and vrt each store 16 int8_t values. The fractional part of each average is discarded (full truncation).
Equivalent operation:
vrd0:=(vrs0+vrt0)>>1;
vrd1:=(vrs1+vrt1)>>1;
……
vrd15:=(vrs15+vrt15)>>1;
b) XOR instruction:
vrd=ingenic_xorv_b(vrs,vrt);
The input variables are vrs and vrt, and the output variable is vrd. vrd stores 16 int8_t values, and vrs and vrt each store 16 int8_t values.
Equivalent operation:
vrd0:=vrs0^vrt0;
vrd1:=vrs1^vrt1;
……
vrd15:=vrs15^vrt15;
c) AND instructions:
vrd=ingenic_andv_b(vrs,vrt);
The input variables are vrs and vrt, and the output variable is vrd. vrd stores 16 int8_t values, and vrs and vrt each store 16 int8_t values.
Equivalent operation:
vrd0:=vrs0&vrt0;
vrd1:=vrs1&vrt1;
……
vrd15:=vrs15&vrt15;
vrd=ingenic_andi_b(vrs,i);
The input variables are vrs and i, and the output variable is vrd. vrd stores 16 int8_t values, vrs stores 16 int8_t values, and i is a constant.
Equivalent operation:
vrd0:=vrs0&i;
vrd1:=vrs1&i;
……
vrd15:=vrs15&i;
d) Addition instruction:
vrd=ingenic_add_b(vrs,vrt);
The input variables are vrs and vrt, and the output variable is vrd. vrd stores 16 int8_t values, and vrs and vrt each store 16 int8_t values.
Equivalent operation:
vrd0:=vrs0+vrt0;
vrd1:=vrs1+vrt1;
……
vrd15:=vrs15+vrt15;
e) Load data instruction: the input indata is a pointer to the data to be loaded, and 128 bits of data are loaded starting from the position it points to in memory: 16 values if the data is 8-bit, or 8 values if the data is 16-bit. The data is loaded into the variable vrd. The offset m is counted in bytes, that is, in units of 8 bits; 128 bits of data are loaded starting at the position m bytes past the address pointed to by indata in memory.
vrd=ingenic_load(indata,m)
f) Save data instruction
ingenic_save(vrd,savedata,m)
vrd is the register variable holding the data to be stored, and savedata is the pointer to the destination; the data in the register is stored starting from the position given by the pointer, so that the 128 bits of data in vrd are stored to savedata. The offset m is counted in bytes, that is, in units of 8 bits; 128 bits of data are stored starting at the position m bytes past the address pointed to by savedata in memory. With the register array vrsv: when m = 0, the 16 8-bit values in vrsv[0] are stored starting at savedata + 0; when m = 16, the 16 8-bit values in vrsv[1] are stored starting at savedata + 16; ...; when m = 112, the 16 8-bit values in vrsv[7] are stored starting at savedata + 112; after processing, savedata = savedata + 128.
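The byte-offset convention of the load and save instructions can be illustrated with plain memory copies. The sketch below is an assumption-laden emulation: v16i8, emu_load, emu_save and copy_batch_128 are hypothetical names, and the only behavior modeled is the one described above (16 bytes moved to or from the address base + m).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit register holding 16 int8_t lanes. */
typedef struct { int8_t lane[16]; } v16i8;

/* Emulates vrd = ingenic_load(indata, m): read 16 bytes starting m bytes past indata. */
static v16i8 emu_load(const int8_t *indata, int m)
{
    v16i8 vrd;
    memcpy(vrd.lane, indata + m, sizeof vrd.lane);
    return vrd;
}

/* Emulates ingenic_save(vrd, savedata, m): write 16 bytes starting m bytes past savedata. */
static void emu_save(v16i8 vrd, int8_t *savedata, int m)
{
    memcpy(savedata + m, vrd.lane, sizeof vrd.lane);
}

/* Moves one 128-element batch through eight emulated registers,
   using the offsets m = 0, 16, ..., 112. */
static void copy_batch_128(const int8_t *indata, int8_t *savedata)
{
    v16i8 vrsv[8];
    for (int i = 0; i < 8; ++i)
        vrsv[i] = emu_load(indata, 16 * i);
    for (int i = 0; i < 8; ++i)
        emu_save(vrsv[i], savedata, 16 * i);
}
```

After one batch, a caller would advance both pointers by 128, matching savedata = savedata + 128 in the description.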
2. Calculation formulas
1) Conventional calculation formula
Let the inputs be a and b and the output be res. The output is computed with conventional rounding, using the following formula.
[Formula (1): rendered as an image in the original publication; not reproduced in this text]
2) Calculation formula based on banker's rounding
In the calculation of low-bit feature maps, conventional rounding introduces a large error, so banker's rounding is used instead.
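The size of that error can be made concrete with a small experiment. The sketch below is illustrative only: formula (1) is not recoverable from this text, so the conventional rule is assumed here to be round half up, (a + b + 1) >> 1, and all function names are hypothetical. It sums twice the rounding error over every pair of int8_t inputs; arithmetic right shift on negative values is assumed.

```c
#include <stdint.h>

/* Assumed conventional rule (round half up); an assumption, not the source's formula (1). */
static int avg_half_up(int a, int b)
{
    return (a + b + 1) >> 1;
}

/* Banker's rounding as in formula (13), with explicit grouping. */
static int avg_half_even(int a, int b)
{
    int f = (a + b) >> 1;
    return f + ((a ^ b) & f & 1);
}

/* Sums 2 * res - (a + b) over all int8_t pairs, i.e. twice the accumulated
   rounding error. A positive total means the rule is biased upward. */
static long doubled_error_sum(int (*avg)(int, int))
{
    long total = 0;
    for (int a = -128; a <= 127; ++a)
        for (int b = -128; b <= 127; ++b)
            total += 2L * avg(a, b) - (a + b);
    return total;
}
```

Under the round-half-up assumption, each of the 32768 odd-sum pairs contributes +0.5, so the total is biased upward, whereas with banker's rounding the upward and downward ties cancel over this input range.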
ab = a + b;    (2)
res_0 = ab >> 1;    (3)
Calculating whether the remainder produces a carry:
mod_0 = a ^ b;    (4)
mod_1 = mod_0 & 1;    (5)
mod_2 = res_0 ^ 1;    (6)
mod_3 = mod_1 & mod_2;    (7)
From formulas (2) to (7):
mod_4 = [(a ^ b) & 1] & (res_0 & 1);    (8)
Simplifying formula (8) according to the rules of AND, OR and XOR operations:
mod_4 = (a ^ b) & res_0 & 1;    (9)
Calculating the sign:
sign_0 = (res_0 >> 8) | 1;    (10)
Calculating whether to carry, from (9) and (10):
mod_5 = [(a ^ b) & res_0 & 1] & [(res_0 >> 8) | 1];    (11)
Since formula (8) contains an & 1, the result can only be 0 or 1; simplifying formula (11) according to the rules of AND, OR and XOR operations:
mod_5 = (a ^ b) & res_0 & 1;    (12)
From (2), (3) and (12):
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1;    (13)
Formula (13) is the final calculation formula.
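As a hand check of the final formula, three small cases can be worked through; the input values are chosen purely for illustration, and the correction term is read as (a ^ b) & res_0 & 1 added to res_0.

```latex
\begin{align*}
a=3,\ b=4:&\quad res_0 = (3+4) \gg 1 = 3,\quad (3 \oplus 4) \mathbin{\&} 3 \mathbin{\&} 1 = 7 \mathbin{\&} 3 \mathbin{\&} 1 = 1,\quad res = 3 + 1 = 4 \quad (3.5 \to 4)\\
a=1,\ b=4:&\quad res_0 = (1+4) \gg 1 = 2,\quad (1 \oplus 4) \mathbin{\&} 2 \mathbin{\&} 1 = 5 \mathbin{\&} 2 \mathbin{\&} 1 = 0,\quad res = 2 + 0 = 2 \quad (2.5 \to 2)\\
a=2,\ b=4:&\quad res_0 = (2+4) \gg 1 = 3,\quad (2 \oplus 4) \mathbin{\&} 3 \mathbin{\&} 1 = 6 \mathbin{\&} 3 \mathbin{\&} 1 = 0,\quad res = 3 + 0 = 3 \quad (\text{exact})
\end{align*}
```

In both tie cases the result is the even neighbor, and the exact case is left unchanged because a ^ b is even.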
SIMD instruction optimization algorithm
Analyzing formula (13): the whole calculation computes the mean of a and b, rounded with banker's rounding. If it were computed step by step as derived, the original 8-bit data would first have to be converted to 16-bit data, then added and averaged, and the 16-bit result converted back to 8-bit data. This processing takes a lot of time. The instruction set of the T30 and T31 chips includes an instruction that computes the average of two numbers: it takes two 8-bit values and outputs their 8-bit average, discarding the fractional part. This is exactly the instruction needed, since it realizes (a + b) >> 1; the rest of the formula can then be implemented with the XOR and AND instructions. For loading, 128 values are loaded at a time from each of the two feature maps. Since there are 32 registers, no more than 32 may be in use at once, otherwise efficiency drops. The operation of formula (13) is designed first, and then the overall operation.
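The widening path described above, and the reason a truncating average fits in 8 bits at all, can be sketched in scalar C. How the T30/T31 hardware computes its average internally is not stated in this document; the second function below only uses the well-known identity a + b = 2(a & b) + (a ^ b) to show that no ninth bit is required. Function names are hypothetical, and arithmetic right shift on negative values is assumed.

```c
#include <stdint.h>

/* Widening path from the text: 8 -> 16 bits, add, shift, narrow back to 8 bits. */
static int8_t avg_floor_widen(int8_t a, int8_t b)
{
    int16_t wide = (int16_t)((int16_t)a + (int16_t)b);
    return (int8_t)(wide >> 1);
}

/* Carry-free form: since a + b = 2 * (a & b) + (a ^ b), the floor of the mean
   is (a & b) + ((a ^ b) >> 1), and every intermediate fits in 8 bits. */
static int8_t avg_floor_narrow(int8_t a, int8_t b)
{
    return (int8_t)((a & b) + ((a ^ b) >> 1));
}

/* Counts int8_t pairs where the two forms disagree. */
static int count_avg_mismatches(void)
{
    int bad = 0;
    for (int a = -128; a <= 127; ++a)
        for (int b = -128; b <= 127; ++b)
            if (avg_floor_widen((int8_t)a, (int8_t)b) !=
                avg_floor_narrow((int8_t)a, (int8_t)b))
                ++bad;
    return bad;
}
```

In a 16-lane register the widening path halves the number of values processed per instruction and adds conversion steps, which is the cost the dedicated average instruction avoids.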
1) Design of formula (13)
Let the function for formula (13) be vrd = average_bank(vrs, vrt), with input registers vrs and vrt and output vrd. The temporary registers for the intermediate calculation are ave_tmp and vorand_tmp.
a) Compute the average with the average instruction: the inputs are vrs and vrt and the output is stored in ave_tmp, realizing formula (3).
ave_tmp=ingenic_aves_b(vrs,vrt);
b) Apply the XOR instruction: the inputs are vrs and vrt and the output is stored in vorand_tmp, realizing formula (4).
vorand_tmp=ingenic_xorv_b(vrs,vrt);
c) Apply the AND instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vorand_tmp, realizing (a ^ b) & res_0.
vorand_tmp=ingenic_andv_b(ave_tmp,vorand_tmp);
d) Apply the AND-immediate instruction: the inputs are vorand_tmp and the constant 1 and the output is stored in vorand_tmp, realizing formula (12).
vorand_tmp=ingenic_andi_b(vorand_tmp,1);
e) Apply the addition instruction: the inputs are ave_tmp and vorand_tmp and the output is stored in vrd, realizing formula (13).
vrd=ingenic_add_b(ave_tmp,vorand_tmp);
2) Overall operation of the algorithm
Using more than 32 registers reduces efficiency and slows the algorithm down. The core processing is therefore made into a function, and the function vrd = average_bank(vrs, vrt) is called each time using online processing.
a) Load the data, 128 values per load. All data must be read contiguously; reads must not alternate between indata1 and indata2, otherwise efficiency is low. Let the register arrays used for loading be vrsv[8] and vrtv[8].
vrsv[0]=ingenic_load(indata1,0);
vrsv[1]=ingenic_load(indata1,16);
vrsv[2]=ingenic_load(indata1,32);
vrsv[3]=ingenic_load(indata1,48);
vrsv[4]=ingenic_load(indata1,64);
vrsv[5]=ingenic_load(indata1,80);
vrsv[6]=ingenic_load(indata1,96);
vrsv[7]=ingenic_load(indata1,112);
vrtv[0]=ingenic_load(indata2,0);
vrtv[1]=ingenic_load(indata2,16);
vrtv[2]=ingenic_load(indata2,32);
vrtv[3]=ingenic_load(indata2,48);
vrtv[4]=ingenic_load(indata2,64);
vrtv[5]=ingenic_load(indata2,80);
vrtv[6]=ingenic_load(indata2,96);
vrtv[7]=ingenic_load(indata2,112);
b) Call vrd = average_bank(vrs, vrt) to process vrsv and vrtv, storing the results in vrsv.
<1> Each call takes vrsv[i] and vrtv[i] as inputs and stores the output in vrsv[i]:
vrsv[i]=average_bank(vrsv[i],vrtv[i]);
<2> Starting from i = 0, apply <1>, then apply it in turn for i = 1 through i = 7, computing all results and storing them in vrsv.
c) Output the results to savedata. The results are output in a single pass; otherwise efficiency drops.
ingenic_save(vrsv[0],savedata,0);
ingenic_save(vrsv[1],savedata,16);
ingenic_save(vrsv[2],savedata,32);
ingenic_save(vrsv[3],savedata,48);
ingenic_save(vrsv[4],savedata,64);
ingenic_save(vrsv[5],savedata,80);
ingenic_save(vrsv[6],savedata,96);
ingenic_save(vrsv[7],savedata,112);
savedata+=128;
The register arrays vrsv[8] and vrtv[8] use 16 registers, and the call vrd = average_bank(vrs, vrt) uses 2 temporary registers, for a total of 18, which does not exceed 32. If the number of registers used in <1> were increased, there would be a risk of exceeding the available registers (spilling to memory) with no further gain in speed. The current design is therefore a relatively optimal one. Compared with the original C implementation, this processing method increases the speed by a factor of 50.
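The load, compute and store phases can be put together in a portable sketch. This is an emulation under stated assumptions, not the Ingenic implementation: v16i8, emu_load, emu_save and overlay_feature_maps are hypothetical names, average_bank is written lane by lane instead of with the five intrinsics, and the element count is assumed to be a multiple of 128 because the text does not describe how a shorter tail is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one 128-bit register holding 16 int8_t lanes. */
typedef struct { int8_t lane[16]; } v16i8;

static v16i8 emu_load(const int8_t *p, int m)      /* 16 bytes from p + m */
{
    v16i8 v;
    memcpy(v.lane, p + m, sizeof v.lane);
    return v;
}

static void emu_save(v16i8 v, int8_t *p, int m)    /* 16 bytes to p + m */
{
    memcpy(p + m, v.lane, sizeof v.lane);
}

/* Lane-wise formula (13); stands in for the five-instruction sequence. */
static v16i8 average_bank(v16i8 vrs, v16i8 vrt)
{
    v16i8 vrd;
    for (int i = 0; i < 16; ++i) {
        int ave = (vrs.lane[i] + vrt.lane[i]) >> 1;
        vrd.lane[i] = (int8_t)(ave + ((vrs.lane[i] ^ vrt.lane[i]) & ave & 1));
    }
    return vrd;
}

/* Superimposes two feature maps of n int8_t elements, 128 elements per batch. */
static void overlay_feature_maps(const int8_t *indata1, const int8_t *indata2,
                                 int8_t *savedata, size_t n)
{
    assert(n % 128 == 0);  /* assumption: no partial batch */
    for (size_t done = 0; done < n; done += 128) {
        v16i8 vrsv[8], vrtv[8];
        for (int i = 0; i < 8; ++i)      /* read all of indata1 first ...      */
            vrsv[i] = emu_load(indata1, 16 * i);
        for (int i = 0; i < 8; ++i)      /* ... then all of indata2, no mixing */
            vrtv[i] = emu_load(indata2, 16 * i);
        for (int i = 0; i < 8; ++i)      /* results overwrite vrsv             */
            vrsv[i] = average_bank(vrsv[i], vrtv[i]);
        for (int i = 0; i < 8; ++i)      /* store the whole batch in one pass  */
            emu_save(vrsv[i], savedata, 16 * i);
        indata1 += 128;
        indata2 += 128;
        savedata += 128;
    }
}
```

Reusing vrsv for the results is what keeps the register count at 16 plus the two temporaries inside average_bank.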
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for feature map overlay optimization, the method comprising: in the calculation of the low-bit feature map, the calculation is performed using the banker's rounding method, where the inputs are a and b, the output result is res, and the formula is as follows:
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1    formula (13)
An instruction for averaging two numbers is used, which takes two 8-bit data as input and outputs one 8-bit datum; the average is computed in the full rounding mode to realize (a + b) >> 1, and an exclusive-OR instruction and an AND instruction are then used to complete the calculation of formula (13); for data loading, 128 data are loaded at a time from each of the two feature maps; since there are 32 registers in total, no more than 32 registers may be in use at once; the operation of formula (13) is designed first, and then the overall operation is designed to realize the optimization.
2. The method for feature map overlay optimization according to claim 1, further comprising:
S1, designing the operation of formula (13):
the function for the formula is set as vrd = average_bank(vrs, vrt), where the two input registers are vrs and vrt and the output result is vrd; the temporary registers for intermediate calculation are ave_tmp and vorand_tmp;
S1.1, calculating the average using the averaging instruction, with inputs vrs and vrt, and storing the output result into ave_tmp, realizing the operation res_0 = ab >> 1 of formula (3);
S1.2, using the exclusive-OR instruction with inputs vrs and vrt, and storing the output result into vorand_tmp, realizing the operation mod_0 = a ^ b of formula (4);
S1.3, using the AND instruction with inputs ave_tmp and vorand_tmp, and storing the output result into vorand_tmp, realizing the operation (a ^ b) & res_0;
S1.4, using the AND instruction with inputs vorand_tmp and 1, and storing the output result into vorand_tmp, realizing the operation mod_5 = (a ^ b) & res_0 & 1 of formula (12);
S1.5, using the addition instruction with inputs ave_tmp and vorand_tmp, and storing the output result into vrd, realizing the operation res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1 of formula (13);
S2, the overall operation:
because there are only 32 registers, the core processing part is made into a function, and the function vrd = average_bank(vrs, vrt) is called in an inline manner each time;
S2.1, loading data, 128 data at a time; the input data indata1 and indata2 are loaded into the variables vrsv and vrtv, with m taking the values 0, 16, 32, 48, 64, 80, 96 and 112 in sequence; 128 bits of data, i.e. 16 8-bit data, are loaded from position m relative to the pointers indata1 and indata2 in memory respectively; all data must be read contiguously and cannot be read by alternating between indata1 and indata2; the register arrays used for loading data are set as vrsv[8] and vrtv[8]; loading from indata1: when m = 0, 16 8-bit data starting from the 0th datum are loaded to vrsv[0]; when m = 16, 16 8-bit data starting from the 16th datum are loaded to vrsv[1]; ...; when m = 112, 16 8-bit data starting from the 112th datum are loaded to vrsv[7]; after processing, indata1 = indata1 + 128; loading from indata2: when m = 0, 16 8-bit data starting from the 0th datum are loaded to vrtv[0]; when m = 16, 16 8-bit data starting from the 16th datum are loaded to vrtv[1]; ...; when m = 112, 16 8-bit data starting from the 112th datum are loaded to vrtv[7]; after processing, indata2 = indata2 + 128;
S2.2, calling vrd = average_bank(vrs, vrt) to process vrsv and vrtv, and storing the processing results into vrsv:
S2.2.1, each call takes vrsv[i] and vrtv[i] as input and stores the output result in vrsv[i], namely
vrsv[i]=average_bank(vrsv[i],vrtv[i]);
S2.2.2, starting with i = 0, carrying out step S2.2.1, then proceeding in sequence from i = 1 to i = 7, so that all results are calculated and stored in vrsv;
S2.3, outputting the result to savedata, the results likewise being written out in a single pass: the register variables vrsv[i], i = 0, ..., 7, hold the data and the pointer savedata indicates where the data are stored; with m taking the values 0, 16, 32, 48, 64, 80, 96 and 112 in sequence, the 128 bits of data in vrsv[i] are stored to memory starting from position m relative to savedata; when m = 0, the 16 8-bit data in vrsv[0] are stored starting from position savedata + 0; when m = 16, the 16 8-bit data in vrsv[1] are stored starting from position savedata + 16; ...; when m = 112, the 16 8-bit data in vrsv[7] are stored starting from position savedata + 112; after processing, savedata = savedata + 128.
3. The method for feature map overlay optimization according to claim 2, wherein the banker's rounding method is expressed as:
ab = a + b    formula (2)
res_0 = ab >> 1    formula (3)
Calculating whether the remainder produces a carry:
mod_0 = a ^ b    formula (4)
mod_1 = mod_0 & 1    formula (5)
mod_2 = res_0 & 1    formula (6)
mod_3 = mod_1 & mod_2    formula (7)
From formulas (2) to (7):
mod_4 = [(a ^ b) & 1] & (res_0 & 1)    formula (8)
According to the rules of the AND, OR and XOR operations, formula (8) simplifies to:
mod_4 = (a ^ b) & res_0 & 1    formula (9)
Calculating the sign:
sign_0 = (res_0 >> 8) | 1    formula (10)
Calculating whether a carry occurs, from formulas (9) and (10):
mod_5 = [(a ^ b) & res_0 & 1] & [(res_0 >> 8) | 1]    formula (11)
Since formula (8) contains an & 1, the result can only be 0 or 1; according to the rules of the AND, OR and XOR operations, formula (11) simplifies to:
mod_5 = (a ^ b) & res_0 & 1    formula (12)
From formulas (2), (3) and (12):
res = [(a + b) >> 1] + (a ^ b) & [(a + b) >> 1] & 1    formula (13)
Formula (13) is the final calculation formula.
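The derivation can be checked exhaustively for 8-bit inputs. The following is a minimal sketch, not part of the original text: the function names `formula13`, `banker_reference` and `count_mismatches` are hypothetical, the reference rounds (a + b) / 2 half to even using only integer division, and the right shift of a negative value is assumed to be arithmetic. Note that in C the correction term must be parenthesized, because `+` binds more tightly than `&`.

```c
/* Formula (13): floor average plus a correction bit. */
static int formula13(int a, int b) {
    int res_0 = (a + b) >> 1;               /* formula (3) */
    return res_0 + ((a ^ b) & res_0 & 1);   /* formula (12) added to res_0 */
}

/* Reference: (a + b) / 2 rounded half to even, without shifts. */
static int banker_reference(int a, int b) {
    int s = a + b;
    int q = s / 2;                   /* C division truncates toward zero */
    int r = s % 2;                   /* -1, 0 or 1 */
    if (r == 0) return q;            /* exact average, nothing to round */
    int lo = (r < 0) ? q - 1 : q;    /* floor(s / 2) */
    return (lo % 2 == 0) ? lo : lo + 1;   /* choose the even neighbour */
}

/* Counts int8 input pairs for which the two functions disagree. */
int count_mismatches(void) {
    int bad = 0;
    for (int a = -128; a <= 127; ++a)
        for (int b = -128; b <= 127; ++b)
            if (formula13(a, b) != banker_reference(a, b))
                ++bad;
    return bad;
}
```

For example, a = 1, b = 2 gives res_0 = 1 and a correction of 1, so res = 2, while a = 2, b = 3 gives res_0 = 2 and a correction of 0, so res = 2; both 1.5 and 2.5 round to the even value 2.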
4. The method for feature map overlay optimization according to claim 3, wherein the step S1 comprises:
S1.1, using the averaging instruction, where the output variable stores 16 int8_t data and the input variables vrs and vrt each store 16 int8_t data; the full rounding mode is used, expressed as:
ave_tmp=ingenic_aves_b(vrs,vrt);
S1.2, using the exclusive-OR instruction, where the output variable stores 16 int8_t data and the input variables vrs and vrt each store 16 int8_t data, expressed as:
vorand_tmp=ingenic_xorv_b(vrs,vrt);
S1.3, using the AND instruction, where the output variable stores 16 int8_t data and the input variables ave_tmp and vorand_tmp each store 16 int8_t data, expressed as:
vorand_tmp=ingenic_andv_b(ave_tmp,vorand_tmp);
S1.4, using the AND instruction, where the output variable stores 16 int8_t data, the input variable vorand_tmp stores 16 int8_t data, and 1 is a constant, expressed as:
vorand_tmp=ingenic_andi_b(vorand_tmp,1);
S1.5, using the addition instruction, where the output variable stores 16 int8_t data and the input variables ave_tmp and vorand_tmp each store 16 int8_t data, expressed as:
vrd=ingenic_add_b(ave_tmp,vorand_tmp);
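The five-instruction sequence of step S1 can be mirrored lane by lane in plain C. This is a sketch under assumptions, not the actual intrinsics: the type `v128` and the `emu_*` functions are hypothetical emulations of what the text says each instruction does (the averaging instruction is assumed to yield (a + b) >> 1 with an arithmetic shift), and the exclusive-OR is applied to the two input registers, as formula (4) mod_0 = a ^ b requires.

```c
#include <stdint.h>

/* Hypothetical model of one register holding 16 int8_t lanes. */
typedef struct { int8_t lane[16]; } v128;

static v128 emu_aves_b(v128 x, v128 y) {   /* per-lane (x + y) >> 1 */
    v128 r;
    for (int k = 0; k < 16; ++k)
        r.lane[k] = (int8_t)((x.lane[k] + y.lane[k]) >> 1);
    return r;
}

static v128 emu_xorv_b(v128 x, v128 y) {   /* per-lane x ^ y */
    v128 r;
    for (int k = 0; k < 16; ++k)
        r.lane[k] = (int8_t)(x.lane[k] ^ y.lane[k]);
    return r;
}

static v128 emu_andv_b(v128 x, v128 y) {   /* per-lane x & y */
    v128 r;
    for (int k = 0; k < 16; ++k)
        r.lane[k] = (int8_t)(x.lane[k] & y.lane[k]);
    return r;
}

static v128 emu_andi_b(v128 x, int8_t imm) {   /* per-lane x & constant */
    v128 r;
    for (int k = 0; k < 16; ++k)
        r.lane[k] = (int8_t)(x.lane[k] & imm);
    return r;
}

static v128 emu_add_b(v128 x, v128 y) {    /* per-lane x + y */
    v128 r;
    for (int k = 0; k < 16; ++k)
        r.lane[k] = (int8_t)(x.lane[k] + y.lane[k]);
    return r;
}

/* Steps S1.1 to S1.5 with the two temporaries named in the text. */
v128 average_bank_steps(v128 vrs, v128 vrt) {
    v128 ave_tmp = emu_aves_b(vrs, vrt);            /* S1.1 */
    v128 vorand_tmp = emu_xorv_b(vrs, vrt);         /* S1.2 */
    vorand_tmp = emu_andv_b(ave_tmp, vorand_tmp);   /* S1.3 */
    vorand_tmp = emu_andi_b(vorand_tmp, 1);         /* S1.4 */
    return emu_add_b(ave_tmp, vorand_tmp);          /* S1.5 */
}
```

Only two temporaries, ave_tmp and vorand_tmp, are live besides the inputs, which matches the register budget stated in the description.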
5. The method for feature map overlay optimization according to claim 3, wherein the step S2 further comprises:
S2.1, using the load instruction, whose input is the data to be loaded, with m taking the values 0, 16, 32, 48, 64, 80, 96 and 112 in sequence; the data argument is a pointer to the current data, and 128 bits of data are loaded starting from the position it points to in memory, i.e. 16 data if the data are 8-bit, or 8 data if the data are 16-bit; the data are loaded into the variable vrd; m is counted in bytes, i.e. in units of 8 bits, and 128 bits of data are loaded in sequence from position m relative to the pointers indata1 and indata2 in memory; loading from indata1: when m = 0, 16 8-bit data starting from the 0th datum are loaded to vrsv[0]; when m = 16, 16 8-bit data starting from the 16th datum are loaded to vrsv[1]; ...; when m = 112, 16 8-bit data starting from the 112th datum are loaded to vrsv[7]; after processing, indata1 = indata1 + 128; loading from indata2: when m = 0, 16 8-bit data starting from the 0th datum are loaded to vrtv[0]; when m = 16, 16 8-bit data starting from the 16th datum are loaded to vrtv[1]; ...; when m = 112, 16 8-bit data starting from the 112th datum are loaded to vrtv[7]; after processing, indata2 = indata2 + 128; expressed as:
vrsv[0]=ingenic_load(indata1,0);
vrsv[1]=ingenic_load(indata1,16);
vrsv[2]=ingenic_load(indata1,32);
vrsv[3]=ingenic_load(indata1,48);
vrsv[4]=ingenic_load(indata1,64);
vrsv[5]=ingenic_load(indata1,80);
vrsv[6]=ingenic_load(indata1,96);
vrsv[7]=ingenic_load(indata1,112);
vrtv[0]=ingenic_load(indata2,0);
vrtv[1]=ingenic_load(indata2,16);
vrtv[2]=ingenic_load(indata2,32);
vrtv[3]=ingenic_load(indata2,48);
vrtv[4]=ingenic_load(indata2,64);
vrtv[5]=ingenic_load(indata2,80);
vrtv[6]=ingenic_load(indata2,96);
vrtv[7]=ingenic_load(indata2,112);
6. The method for feature map overlay optimization according to claim 3, wherein the step S2.3 further comprises: using the data store instruction, whose inputs are the register variable vrsv[i], i = 0, ..., 7, holding the data and the pointer savedata to which the data are stored; the data in the register are stored starting from the position of the pointer, i.e. the 128 bits of data in vrd are stored to savedata, where m is counted in bytes, i.e. in units of 8 bits, and the 128 bits of data are stored starting from the position in memory that savedata points to; storing to savedata: when m = 0, the 16 8-bit data in vrsv[0] are stored starting from position savedata + 0; when m = 16, the 16 8-bit data in vrsv[1] are stored starting from position savedata + 16; ...; when m = 112, the 16 8-bit data in vrsv[7] are stored starting from position savedata + 112; after processing, savedata = savedata + 128; expressed as:
ingenic_save(vrsv[0],savedata,0);
ingenic_save(vrsv[1],savedata,16);
ingenic_save(vrsv[2],savedata,32);
ingenic_save(vrsv[3],savedata,48);
ingenic_save(vrsv[4],savedata,64);
ingenic_save(vrsv[5],savedata,80);
ingenic_save(vrsv[6],savedata,96);
ingenic_save(vrsv[7],savedata,112);
savedata+=128;
7. The method for feature map overlay optimization according to claim 6, wherein in step S2.3 the register arrays vrsv[8] and vrtv[8], that is, vrsv[i] and vrtv[i] (0 <= i < 8), use 16 registers, and the call vrd = average_bank(vrs, vrt) uses 2 temporary registers, for a total of 18, which does not exceed 32.
8. The method of claim 1, wherein the method is applied to a chip with a SIMD instruction set.
9. The method of claim 1, wherein the instruction for averaging two numbers is applicable to chips of T30 and T31 models.
CN202110923498.9A 2021-08-12 2021-08-12 Method for optimizing superposition of feature maps Pending CN115904489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110923498.9A CN115904489A (en) 2021-08-12 2021-08-12 Method for optimizing superposition of feature maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110923498.9A CN115904489A (en) 2021-08-12 2021-08-12 Method for optimizing superposition of feature maps

Publications (1)

Publication Number Publication Date
CN115904489A CN115904489A (en) 2023-04-04

Family

ID=86482639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110923498.9A Pending CN115904489A (en) 2021-08-12 2021-08-12 Method for optimizing superposition of feature maps

Country Status (1)

Country Link
CN (1) CN115904489A (en)

Similar Documents

Publication Publication Date Title
US10067761B2 (en) Performing rounding operations responsive to an instruction
EP2695054B1 (en) Vector friendly instruction format and execution thereof
US20070074007A1 (en) Parameterizable clip instruction and method of performing a clip operation using the same
US9015452B2 (en) Vector math instruction execution by DSP processor approximating division and complex number magnitude
US9264066B2 (en) Type conversion using floating-point unit
CN107111489B (en) Morton coordinate adjustment processor, method, system, and instructions
CN108563465B (en) Systems, apparatuses, and methods for performing a loop and an XOR in response to a single instruction
US11922133B2 (en) Processor and method for processing mask data
US20080046682A1 (en) Data processing unit and method for parallel vector data processing
CN113918883B (en) Data processing method, device and equipment and computer readable storage medium
CN110737612A (en) processors with in-memory computation
US10567163B2 (en) Processor with secure hash algorithm and digital signal processing method with secure hash algorithm
CN115904489A (en) Method for optimizing superposition of feature maps
CN111797985A (en) Convolution operation memory access optimization method based on GPU
EP3671438A1 (en) Systems and methods to transpose vectors on-the-fly while loading from memory
JP2000322235A (en) Information processor
CN116308989A (en) GPU acceleration method for full-homomorphic rapid number theory transformation
US10628126B2 (en) Architecture and instruction set to support integer division
US20090037702A1 (en) Processor and data load method using the same
US8427490B1 (en) Validating a graphics pipeline using pre-determined schedules
JP2015219823A (en) Processor
CN115481721B (en) Psum calculation circuit for convolutional neural network
CN117221747B (en) SOPC-based single-period dead pixel compensation and non-uniform correction method
CN115705677A (en) Optimization method for equalizing image pixels by using mean value
CN115705676A (en) Method for storing and analyzing 4bit feature map in convolution calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination