CN116861144A - Method for implementing convolution when the weights fit in WRAM (weight RAM) - Google Patents

Info

Publication number
CN116861144A
Authority
CN
China
Prior art keywords
fram
data
oram
wram
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210312413.8A
Other languages
Chinese (zh)
Inventor
田凤彬
于晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd filed Critical Beijing Ingenic Semiconductor Co Ltd
Priority to CN202210312413.8A priority Critical patent/CN116861144A/en
Publication of CN116861144A publication Critical patent/CN116861144A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30065Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Image Processing (AREA)

Abstract

The application provides a method for implementing convolution when the weights fit entirely in the WRAM, comprising the following steps: S1, storing data: setting the storage layout of the feature map and the storage layout of the weights; S2, loading all weight data from DDR into the WRAM using SIMD instructions, 32 data at a time, and moving the feature map data from DDR into the ORAM using the ORAM data-transfer instruction; S3, performing the convolution calculation. By designing a method for setting the FRAM width, a method for moving data from the ORAM to the FRAM, and a corresponding new convolution calculation method, the application handles small input feature maps with small weights, thereby accelerating computation and improving efficiency.

Description

Method for implementing convolution when the weights fit in WRAM (weight RAM)
Technical Field
The application relates to the technical field of image processing, and in particular to a method for implementing convolution when the weights fit entirely in the WRAM (weight RAM).
Background
The T40 chip of Beijing Ingenic Semiconductor Co., Ltd. (hereinafter the Beijing Ingenic T40 chip) is a low-power chip for AI deep learning. It has an independent convolution computing unit and its own SIMD instructions. It has one ORAM, one WRAM for storing weights, and one FRAM for storing input data. Data must therefore be stored in the WRAM and the FRAM before a convolution can be computed. The sizes of the ORAM, WRAM and FRAM are fixed for a given chip. For example, the WRAM size is 288 x 1024bat, the FRAM size is 128 x 1024bat, and the ORAM size is 2048 x 1024bat. These assumed values are used in the calculations below.
All data is initially stored in DDR. It is either moved to the ORAM with a DMA instruction, or loaded into a dedicated register with a SIMD instruction and then moved to the WRAM or FRAM with special instructions.
Because this is a new chip, conventional algorithms can run on it but are inefficient, and existing methods cannot use its dedicated computing unit and instructions. Different input feature map sizes and weight sizes call for different implementations, and an unsuitable algorithm reduces efficiency drastically.
In addition, the common terminology in the prior art is as follows:
1. Convolution kernel: a matrix used in image processing whose entries are the parameters applied to the original image. The convolution kernel is typically a matrix with a given number of rows and columns (e.g., a 3 x 3 matrix), with a weight value for each cell. Common shapes are 1 x 1, 3 x 3, 5 x 5, 7 x 7, 1 x 3, 3 x 1, 2 x 2, 1 x 5, 5 x 1, …
2. Convolution: the center of the convolution kernel is placed over the pixel to be calculated; each element of the kernel is multiplied by the image pixel value it covers, and the products are summed. The resulting sum is the new pixel value at that location. This process is called convolution.
3. Feature map: the result of a convolution calculation on input data is called a feature map; the result of a fully connected layer is also called a feature map. The feature map size is generally expressed as length x width x depth, or 1 x depth.
4. FRAM (Feature RAM): a memory that stores all or part of a feature map and supplies it directly to the hardware computing unit. It is part of the computing unit's storage. To use the computing unit, the feature map data must be placed in the FRAM.
5. WRAM (Weight RAM): a memory that stores all or part of the weights and supplies them directly to the hardware computing unit. It is part of the computing unit's storage. To use the computing unit, the weight data must be placed in the WRAM.
6. ORAM (Oblivious RAM): a random access memory that provides fast read and write access to dynamic random access memory (DRAM). For any inputs X and Y, the instruction sequences they produce have the same probability distribution.
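The definition of convolution in item 2 can be illustrated with a minimal C sketch. The function name conv_at, the single-channel row-major layout, and the zero padding at the image border are assumptions made for this illustration only; they are not the chip's instruction or the method of the application.

```c
#include <assert.h>

/* Hypothetical helper: new value of the pixel at (y, x) of a single-channel
 * image. img is row-major with h rows and w columns; kernel is row-major
 * with kh rows and kw columns (odd sizes so that it has a center).
 * Pixels outside the image are treated as 0 (assumed zero padding). */
int conv_at(const int *img, int h, int w,
            const int *kernel, int kh, int kw,
            int y, int x)
{
    assert(kh % 2 == 1 && kw % 2 == 1);
    int sum = 0;
    for (int ky = 0; ky < kh; ky++) {
        for (int kx = 0; kx < kw; kx++) {
            int iy = y + ky - kh / 2;   /* kernel centered on (y, x) */
            int ix = x + kx - kw / 2;
            if (iy < 0 || iy >= h || ix < 0 || ix >= w)
                continue;               /* outside the image: contributes 0 */
            sum += kernel[ky * kw + kx] * img[iy * w + ix];
        }
    }
    return sum;
}
```

For a 3 x 3 image holding the values 1 to 9 and a 3 x 3 kernel of ones, the center pixel evaluates to the sum of all nine values.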
Disclosure of Invention
To solve the above problems in the prior art, the object of the present application is to design a dedicated calculation method for this special case, in particular to implement convolution of a small input feature map with small weights on the Beijing Ingenic T40 chip.
Specifically, the application provides a method for implementing convolution when the weights fit in the WRAM, comprising the following steps:
S1, storing data:
setting the storage layout of the feature map: the feature map data is stored in the order 32, W, H, N, where 32 is a block of 32 channels in the depth direction, W is the width, H is the height, and N is the number of 32-channel blocks, i.e. 32 x N is the depth of the feature map; the data is contiguous first over the 32 channels, then over the width, then over the height, and finally over the depth/32 blocks;
setting the storage layout of the weights: the data is contiguous first over a 32 x 32 block, then over the convolution kernel width, then over the convolution kernel height, then over the input depth/32 blocks, and finally over the output depth/32 blocks; before processing, weights in the common order (contiguous over the input depth, then the kernel width and height, and finally the output depth) must be rearranged into this required order;
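The two storage orders of step S1 can be expressed as offset calculations. The sketch below is illustrative only: the function names are hypothetical, offsets are counted in elements, and the ordering inside each 32 x 32 weight block is not specified in the text, so only the start of the block is computed.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of channel ch at pixel (x, y) in a feature map stored in the
 * order 32, W, H, N (fastest to slowest). */
size_t feature_offset(int ch, int x, int y, int width, int height)
{
    assert(ch >= 0 && x >= 0 && x < width && y >= 0 && y < height);
    size_t n = (size_t)ch / 32;   /* which 32-channel block */
    size_t c = (size_t)ch % 32;   /* channel inside the block */
    return ((n * (size_t)height + (size_t)y) * (size_t)width + (size_t)x) * 32 + c;
}

/* Offset of the start of the 32 x 32 weight block for output block ob,
 * input block ib and kernel position (kx, ky). Order, fastest to slowest:
 * 32 x 32 block, kernel width, kernel height, input depth/32, output depth/32. */
size_t weight_block_offset(int ob, int ib, int ky, int kx,
                           int in_ic32, int kernel_h, int kernel_w)
{
    assert(ib >= 0 && ib < in_ic32 && ky >= 0 && ky < kernel_h
           && kx >= 0 && kx < kernel_w && ob >= 0);
    size_t idx = (size_t)ob;
    idx = idx * (size_t)in_ic32 + (size_t)ib;
    idx = idx * (size_t)kernel_h + (size_t)ky;
    idx = idx * (size_t)kernel_w + (size_t)kx;
    return idx * 32 * 32;
}
```

Under this layout, moving from one output block to the next advances the weight offset by 32*32*kernel_w*kernel_h*in_ic32, which matches the wram_id increment used in loop body (7).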
S2, loading the weights from DDR into the WRAM and the feature map from DDR into the ORAM:
S2.1, loading all weight data from DDR into the WRAM using SIMD instructions, 32 data per load: let the start address of the weight data be widthdata;
loading into VR0, VR1 using SIMD load data instructions;
loading data into the WRAM using a WRAM load data instruction;
because the weights are already stored in the required order and all of the data fits in the WRAM, they can be stored directly in the default order until all data has been stored; when the WRAM cannot hold all of the weights, this method cannot be used;
S2.2, moving the data in DDR into the ORAM using the ORAM data-transfer instruction: let the start address of the feature map be ddr_id, the number of bytes of the feature map be count, and the start address in the ORAM be oram_id;
ingenic_ddr2oram(ddr_id,oram_id,count,1);
because the feature map is already stored in the required order and all of the data fits in the ORAM, it can be stored directly in the default order until all data has been stored; this method cannot be used when the ORAM cannot hold the feature map or when fram_w cannot hold the smallest computed unit of pixels;
S3, performing the convolution calculation:
S3.1, the data must be moved from the ORAM to the FRAM before it can be used for convolution; all weights have been loaded into the WRAM, so the number of weights need not be considered; the FRAM cannot hold the whole feature map, so only as much input data as is needed is moved from the ORAM to the FRAM at a time;
S3.2, convolution calculation: first, data is loaded from the ORAM into the FRAM; the FRAM and the WRAM can then be used for the convolution; the start address in the ORAM must be given, the start address of the FRAM is 0, and the start address of the WRAM is also 0; let the depth of the input feature map be 32*in_ic32, where in_ic32 is the input depth divided by 32, the input width be in_width, and the input height be in_height; let the depth of the output feature map be 32*out_ic32, where out_ic32 is the output depth divided by 32, the output width be out_width, and the output height be out_height; the convolution kernel has width kernel_w and height kernel_h; the stride of the convolution kernel is stride_w in the width direction and stride_h in the height direction; the widths of the input and output feature maps satisfy in_width=out_width*stride_w, and the heights satisfy in_height=out_height*stride_h; if these relations do not hold, the input feature map must be zero-padded, according to the specific convolution, to a width and height for which they do; the generated result is stored in vrd;
To reduce the number of loads from the ORAM into the FRAM, all results along the depth direction for a given position are generated together; accordingly, in the loop order, the outermost loop runs over the height of the output feature map, then the width of the output feature map, then the depth/32 of the output feature map, and the innermost step is the convolution computing unit;
let the number of rows generated each time be fram_h, where fram_h=fram_count/fram_w; the more rows are loaded at once, the fewer times loading is repeated in the height direction.
The method has the following application requirements: the feature map fits in the ORAM; the weights are relatively small, i.e. the total weight data is no larger than the WRAM; the data is 8-bit; the feature map data required to generate 8 pixels in each calculation fits entirely in the FRAM; the convolution kernel height and width do not exceed 3; the input depth and the output depth are both multiples of 32; if the input depth of some layer in the model is not a multiple of 32, it must be padded to a multiple of 32, and the corresponding weights must be padded in the same way.
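The application requirements listed above can be collected into a simple pre-check. The following is a minimal sketch under stated assumptions: the struct and function names are hypothetical, all sizes are supplied by the caller in one consistent unit, and the amount of FRAM needed for 8 output pixels is passed in rather than derived here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round a channel depth up to the next multiple of 32 (padding rule). */
int pad_to_32(int depth)
{
    assert(depth > 0);
    return (depth + 31) / 32 * 32;
}

/* Hypothetical description of one convolution layer. */
typedef struct {
    size_t feature_size;            /* whole input feature map */
    size_t weight_size;             /* all weights of the layer */
    size_t fram_size_for_8_pixels;  /* input data needed to produce 8 pixels */
    int kernel_w, kernel_h;
    int in_depth, out_depth;        /* after padding */
} layer_t;

/* Hypothetical description of the on-chip memories. */
typedef struct {
    size_t oram_size, wram_size, fram_size;
} mem_t;

/* True when every requirement named in the text is met. */
bool method_applies(const layer_t *l, const mem_t *m)
{
    return l->feature_size <= m->oram_size
        && l->weight_size <= m->wram_size
        && l->fram_size_for_8_pixels <= m->fram_size
        && l->kernel_w <= 3 && l->kernel_h <= 3
        && l->in_depth % 32 == 0 && l->out_depth % 32 == 0;
}
```

A layer whose input depth is 48 would first be padded with pad_to_32 to 64, with its weights padded to match, before method_applies is evaluated.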
The instructions used in the method are as follows:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
the input variable fram_id is the start address used in the FRAM, wram_id is the start address used in the WRAM, ic32_num is the number of 32-channel input blocks to compute, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the stride of the convolution in the x direction, stride_y is the stride of the convolution in the y direction, feature_w is the width of the input feature map, and feature_h is the height of the input feature map; vrd receives the generated result;
Description of use: each call computes the results for 4 pixels; the computing unit has depth 32 and produces 32 output channels for each of the 4 pixels; if ic32_num=1, the input depth is 32x1 and 4 pixels with output depth 32 are generated; if ic32_num=2, the input depth is 32x2 and 4 pixels with output depth 32 are generated; if ic32_num=3, the input depth is 32x3 and 4 pixels with output depth 32 are generated; the minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4; the FRAM width, i.e. how many pixels of the input feature map are loaded, is set as a parameter of the convolution calculation instruction;
b) SIMD load data instruction:
ingenic_load(indata,VR0,m)
the input is the pointer indata to the data to be loaded; 128 bits of data are loaded from the position offset m from indata in memory, i.e. 16 values for 8-bit data, 8 values for 16-bit data, or 4 values for 32-bit data; the data is loaded into the register VR0; m is counted in bytes (units of 8 bits); VR0 is a SIMD VR register, which holds at most 512 bits of data;
c) FRAM load data instruction:
ingenic_vr2fram(VR0,fram_load_id,num)
the input variable VR0 is the input data, fram_load_id is the start address loaded into the FRAM, and num is 0 or 1: when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id=fram_load_id+32 after the instruction ends;
d) WRAM load data instruction:
ingenic_vr2wram(VR0,wram_load_id,num)
the input variable VR0 is the input data, wram_load_id is the start address loaded into the WRAM, and num is 0 or 1: when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id=wram_load_id+64 after the instruction ends;
e) ORAM data-transfer instruction:
ingenic_ddr2oram(ddr_id,oram_id,count,num)
among the input variables, ddr_id is the start address of the data in DDR, oram_id is the start address of the data in the ORAM, and count is the number of bytes to move; num is 0 or 1: when num is 0, ddr_id and oram_id are unchanged after the instruction ends; when num is 1, count is added to ddr_id and oram_id after the instruction ends.
Setting the width of the FRAM:
let the total number of bytes of the FRAM be fram_count, the width set for the FRAM be fram_w, and the number of rows of the input feature map loaded in each pass be fram_h; fram_w is set to the number of input feature map pixels required to generate at least 8 output pixels; the minimum number of pixels generated by one calculation is 4, and while the first 4 pixels are being generated the feature map data for the next 4 pixels must be loaded, so the minimum is 8;
This gives the formula
fram_w={[(kernel_w-1)+stride_w*8+3]/4}*4 (1)
where [(kernel_w-1)+stride_w*8+3]/4 is integer division, yielding an integer; the formula as a whole ensures that fram_w is a multiple of 4 and that 8 pixels can be generated;
since data is also loaded 4 pixels at a time, for a convolution kernel wider than 1 the data being loaded overlaps the data being used; to solve this, the extra data is also handled in multiples of 4, i.e. fram_w is increased by (kernel_w-1) rounded up to a multiple of 4, {[(kernel_w-1)+3]/4}*4, giving
fram=fram_w+{[(kernel_w-1)+3]/4}*4 (2)
From formulas (1) and (2),
fram={[(kernel_w-1)+stride_w*8+3]/4}*4+{[(kernel_w-1)+3]/4}*4 (3)
The two terms of formula (3) cannot be merged: because of the integer rounding, the merged expression is not always equal to the sum; the number of rows of the input feature map loaded in each pass is fram_h=fram_count/fram_w.
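Formulas (1) to (3) translate directly into C integer arithmetic, where division of positive integers truncates. The function names below are hypothetical and the sketch simply mirrors the formulas as written; fram_rows follows the text's fram_h=fram_count/fram_w literally.

```c
#include <assert.h>

/* Formula (1): base FRAM width, a multiple of 4 covering 8 output pixels. */
int fram_w_base(int kernel_w, int stride_w)
{
    assert(kernel_w >= 1 && stride_w >= 1);
    return ((kernel_w - 1) + stride_w * 8 + 3) / 4 * 4;
}

/* Extra width: (kernel_w - 1) rounded up to a multiple of 4. */
int fram_w_extra(int kernel_w)
{
    assert(kernel_w >= 1);
    return ((kernel_w - 1) + 3) / 4 * 4;
}

/* Formula (3): the two rounded terms are added, not merged. */
int fram_w_total(int kernel_w, int stride_w)
{
    return fram_w_base(kernel_w, stride_w) + fram_w_extra(kernel_w);
}

/* Rows loaded per pass, as stated in the text. */
int fram_rows(int fram_count, int fram_w)
{
    assert(fram_w > 0);
    return fram_count / fram_w;
}
```

For kernel_w=3 and stride_w=1 this gives 12+4=16. The case kernel_w=1, stride_w=1 shows why the terms cannot be merged: formula (3) gives 8+0=8, whereas rounding the merged numerator (0+8+3)+(0+3)=14 once would give (14/4)*4=12.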
In step S2.1 above,
loading into VR0, VR1 using SIMD load data instructions is written as:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
loading data into the WRAM using the WRAM load data instruction is written as:
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)
The convolution calculation in step S3 is implemented as follows:
S3.1, initialization, written as:
wram_id=0;
oram_id=0;
fram_id=0;
wr_fram_id=0;
rd_fram_id=0;
int init_fram_w=fram_w-4*stride_w;
S3.2, each pass generates fram_h rows of results;
let ydir_i be the position in the height direction of the result being generated, with initial value ydir_i=0, and let the initial number of rows processed be fram_h_ori, so that fram_h=fram_h_ori. If ydir_i<out_height holds, this step is executed; after it has been executed, ydir_i+=fram_h and the condition is checked again, and so on in a loop; if it does not hold, it is then checked whether ydir_i<(out_height+fram_h-1) holds (out_height may not be an integer multiple of fram_h when fram_h is greater than 1, so a remainder is left and this further check is needed); if it holds, fram_h=ydir_i-out_height and the step is executed; if not, the loop exits; this is written as:
for(int ydir_i=0;ydir_i<(out_height+fram_h-1);ydir_i+=fram_h)
if(ydir_i>=out_height)fram_h=ydir_i-out_height;
performing: reading data continuously from DDR into the ORAM, fram_h rows at a time;
performing: initializing the ORAM read address to 0 and the FRAM write address to 0, written as:
int rd_oram_idx=0;
int wr_fram_idx=0;
performing: S3.2.1, each pass generates 4 pixels in the width direction and fram_h rows in the height direction;
if xdir_i<out_width+3 holds, this step is executed, and xdir_i+=4 after it has been executed; the condition is then checked again, and so on in a loop; if it does not hold, the loop exits and control returns to S3.2. (When out_width is not a multiple of 4 there is a remainder, and the remainder must still be processed in the minimum calculation unit, so out_width/4 must be rounded up, which is expressed here by the bound out_width+3. The memory space in which the input feature map is stored is made somewhat larger than the actual feature map, so that reading past the end of the data does not cause errors.) This is written as:
for(int xdir_i=0;xdir_i<out_width+3;xdir_i+=4)
performing: S3.2.1.1, if xdir_i>1 holds, written as if(xdir_i>1), this step is executed; if not, step S3.2.1.2 is executed;
performing: reading data from the ORAM into the FRAM; each pass of the loop body reads (32*in_ic32)*(4*stride_w); this loads the data needed by the following convolution; so that the loads form a regular loop, the minimum unit of each load is 4 pixels, and the first load is therefore larger by {[(kernel_w-1)+3]/4}*4;
performing: rd_oram_idx=rd_oram_idx+xdir_i*stride_w*32;
performing: wr_fram_idx=wr_fram_idx+xdir_i*stride_w*32;
performing: loop body (5);
performing: S3.2.1.2, if xdir_i==0 holds, written as if(xdir_i==0), this step is executed; if not, step S3.2.1.3 is executed;
performing: reading data from the ORAM into the FRAM; because each later pass of the loop body reads a width of 4*stride_w, the width of the first read is (fram_w-4*stride_w), and the amount of data read is (32*in_ic32)*(fram_w-4*stride_w)*fram_h;
performing: rd_oram_idx=rd_oram_idx;
performing: wr_fram_idx=wr_fram_idx;
performing: loop body (6);
when xdir_i>1, 4*stride_w*32 is added each time, whereas when xdir_i==0 the width of the loaded data is (fram_w-8*stride_w), so after the xdir_i==0 case the following values are added:
rd_oram_idx=rd_oram_idx+(fram_w-8*stride_w)*32;
wr_fram_idx=wr_fram_idx+(fram_w-8*stride_w)*32;
performing: S3.2.1.3,
performing: int wram_id=0;
performing: each pass generates [32*(4 pixels)]*(fram_h rows);
performing: loop body (7);
where loop body (5) loads data in the normal case; loop body (6) loads data for the first time; and loop body (7) carries out the convolution itself, computing together the results that share one set of weights, i.e. first the results along the height, then the results along the depth/32 direction, until all results have been computed in turn.
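The outer tiling of step S3.2 can be sketched in portable C to show how many passes it produces. This sketch is not the loop of the application: it substitutes the bound xdir_i<out_width, which yields out_width/4 rounded up, and it clamps the last row tile to the rows that remain instead of adjusting fram_h inside the loop. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical summary of one tiling plan. */
typedef struct {
    int row_tiles;       /* passes over the height, fram_h rows each */
    int rows_last_tile;  /* rows in the final, possibly shorter, pass */
    int col_steps;       /* 4-pixel steps across the width */
} tiling_t;

tiling_t plan_tiling(int out_height, int out_width, int fram_h)
{
    assert(out_height > 0 && out_width > 0 && fram_h > 0);
    tiling_t t = {0, 0, 0};

    /* Height: one pass per fram_h rows, the last pass takes the remainder. */
    for (int ydir_i = 0; ydir_i < out_height; ydir_i += fram_h) {
        int rows = out_height - ydir_i;
        if (rows > fram_h)
            rows = fram_h;
        t.row_tiles++;
        t.rows_last_tile = rows;
    }

    /* Width: the computing unit always produces 4 pixels per step. */
    for (int xdir_i = 0; xdir_i < out_width; xdir_i += 4)
        t.col_steps++;

    return t;
}
```

With out_height=10, out_width=9 and fram_h=4, the plan has three row tiles, the last covering 2 rows, and three 4-pixel steps across the width; the third step computes past the true width, which is why the text reserves extra space after the feature map.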
In the method,
loop body (5) is as follows:
performing: step (5)1, set the initial icnum_i=0 and check whether icnum_i<in_ic32 holds; if it holds, execute this loop step, then icnum_i++ and check again, and so on in a loop; if not, exit the loop body;
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (5)2, set the initial fh_i=0 and check whether fh_i<fram_h holds; if it holds, execute this loop step, then fh_i++ and check again, and so on in a loop; if not, exit this step and return to step (5)1;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (5)3, set the initial fw_i=0 and check whether fw_i<4*stride_w holds; if it holds, execute this loop step, then fw_i+=4 and check again, and so on in a loop; if not, exit this step and return to step (5)2;
written as: for(int fw_i=0;fw_i<4*stride_w;fw_i+=4)
performing: loading into VR0, VR1 using SIMD load data instructions, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: loading data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
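The address arithmetic of loop body (5) can be checked on an ordinary CPU with a software model. In the sketch below the ORAM and FRAM are plain byte arrays and memcpy stands in for the ingenic_load and ingenic_vr2fram pair; the function name copy_columns and the parameter ncols are hypothetical, and one 8-bit pixel column is assumed to occupy 32 bytes. It models only the data movement, not the hardware instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy ncols pixel columns (a multiple of 4) for every 32-channel block and
 * every row from the ORAM model to the FRAM model. Passing 4*stride_w for
 * ncols corresponds to loop body (5). */
void copy_columns(const uint8_t *oram, uint8_t *fram,
                  size_t rd_oram_idx, size_t wr_fram_idx,
                  int in_ic32, int in_width, int fram_w, int fram_h,
                  int ncols)
{
    assert(ncols % 4 == 0);
    for (int icnum_i = 0; icnum_i < in_ic32; icnum_i++) {
        size_t rd_oram_ic = rd_oram_idx + (size_t)icnum_i * 32 * in_width * fram_h;
        size_t wr_fram_ic = wr_fram_idx + (size_t)icnum_i * 32 * fram_w * fram_h;
        for (int fh_i = 0; fh_i < fram_h; fh_i++) {
            size_t rd_oram = rd_oram_ic + (size_t)fh_i * 32 * in_width;
            size_t wr_fram = wr_fram_ic + (size_t)fh_i * 32 * fram_w;
            for (int fw_i = 0; fw_i < ncols; fw_i += 4) {
                /* 4 pixels x 32 channels x 1 byte per step */
                memcpy(fram + wr_fram, oram + rd_oram, 4 * 32);
                rd_oram += 4 * 32;   /* models the auto-increment (num = 1) */
                wr_fram += 4 * 32;
            }
        }
    }
}
```

Calling the same function with ncols set to fram_w-4*stride_w models the first, wider load described for loop body (6), provided that value is a multiple of 4.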
loop body (6) is as follows:
performing: step (6)1, set the initial icnum_i=0 and check whether icnum_i<in_ic32 holds; if not, exit loop body (6); if it holds, execute this loop step, then icnum_i++ and check again, and so on in a loop;
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (6)2, set the initial fh_i=0 and check whether fh_i<fram_h holds; if not, exit step (6)2; if it holds, execute this loop step, then fh_i++ and check again, and so on in a loop; written as:
for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (6)3, set the initial fw_i=0 and check whether fw_i<(fram_w-4*stride_w) holds; if not, exit step (6)3; if it holds, execute this loop step, then fw_i+=4 and check again, and so on in a loop; written as:
for(int fw_i=0;fw_i<(fram_w-4*stride_w);fw_i+=4)
performing: loading into VR0, VR1 using SIMD load data instructions, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: loading data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
loop body (7) is as follows:
performing step (7)1: set the initial ocnum_i=0 and check whether ocnum_i<out_ic32 holds; if not, exit step (7)1; if it holds, execute this loop step, then ocnum_i++ and check again, and so on in a loop;
written as: for(int ocnum_i=0;ocnum_i<out_ic32;ocnum_i++)
performing: each pass generates [32*(4 pixels)];
performing: fram_id=0;
performing: wram_id=wram_id+32*32*kernel_w*kernel_h*in_ic32;
performing step (7)2: set the initial fh_i=0 and check whether fh_i<fram_h holds; if not, exit step (7)2; if it holds, execute this loop step, then fh_i++ and check again, and so on in a loop;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: fram_id=fram_id+32*4*stride_w;
performing: ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,vrd);
performing: taking out the generated result for subsequent processing and storage;
performing: vrd=vrd+32*4.
Thus, the advantages of the present application are as follows: it is a new design method suited to the Beijing Ingenic T40 chip, in which the feature map data is first loaded from DDR into the ORAM and then from the ORAM into the FRAM, so that the feature map is loaded quickly and the bandwidth limitation of loading the feature map is avoided; for small weights, the weight data is stored directly in the WRAM and never reloaded, so it is used efficiently and bandwidth contention between data loads is avoided, thereby accelerating computation and improving efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and together with the description serve to explain the application.
Fig. 1 is a flow chart of the method of the present application.
Detailed Description
So that the technical content and advantages of the present application may be understood more clearly, the application is described in further detail below with reference to the accompanying drawings.
The application relates to a method for implementing convolution when the weights fit in the WRAM. The method has the following application requirements: it is used on the Ingenic T40 chip; the feature map fits in the ORAM; the weights are relatively small (the total weight data is no larger than the WRAM); the data is 8-bit; and the feature map data required to generate 8 pixels in each calculation fits entirely in the FRAM.
The method comprises the following instructions:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
The input variable fram_id is the starting address used in the FRAM, wram_id is the starting address used in the WRAM, ic32_num is the number of depth-32 input units to calculate, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the step size of the convolution calculation in the x direction, stride_y is the step size of the convolution calculation in the y direction, feature_w is the width of the input feature map, and feature_h is the height of the input feature map. vrd receives the generated result.
Description of use: each call calculates the results for 4 pixels. The calculation unit is a depth of 32: the output depth is 32, and 4 pixel results are generated. If ic32_num=1, the input depth is 32x1 and 4 pixels with an output depth of 32 are generated. If ic32_num=2, the input depth is 32x2 and 4 pixels with an output depth of 32 are generated. If ic32_num=3, the input depth is 32x3 and 4 pixels with an output depth of 32 are generated. The minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4. The width of the FRAM, i.e. how many pixels of the input feature map are loaded, is set as a parameter of the convolution calculation instruction.
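The usage rules above can be turned into a small sizing helper. The following C sketch is illustrative only: the function names are hypothetical, and it assumes that each call yields 4 horizontally adjacent output pixels at an output depth of 32, so that a full output feature map needs one call per group of 4 pixels, per row, per depth-32 output block.

```c
#include <assert.h>

/* Constants taken from the description: 4 pixels per call, depth unit of 32. */
enum { PIXELS_PER_CALL = 4, DEPTH_UNIT = 32 };

/* Input depth covered by one call for a given ic32_num. */
int input_depth_per_call(int ic32_num) {
    assert(ic32_num >= 1);
    return DEPTH_UNIT * ic32_num;
}

/* Hypothetical helper: calls needed to cover an output feature map of
   out_width x out_height x (32 * out_ic32), rounding the width up to
   whole groups of 4 pixels. */
long conv_calls_needed(int out_width, int out_height, int out_ic32) {
    assert(out_width > 0 && out_height > 0 && out_ic32 > 0);
    long groups_per_row = (out_width + PIXELS_PER_CALL - 1) / PIXELS_PER_CALL;
    return groups_per_row * out_height * out_ic32;
}
```

For example, under these assumptions an output of width 10, height 5 and depth 64 would need 3 x 5 x 2 = 30 calls.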
Method for setting the width of the FRAM:
Let the total number of bytes of the FRAM be fram_count, the FRAM width to be set be fram_w, and the number of rows of the input feature map loaded per pass be fram_h. fram_w is set to the number of input feature map pixels required to generate at least 8 output pixels. The smallest result generated per calculation is 4 pixels; in use, while the first 4 pixels are being generated, the feature map data for the next 4 pixels must be loaded, so the minimum is 8 pixels. This gives the following formula:
fram_w={[(kernel_w-1)+stride_w*8+3]/4}*4 (1)
where [(kernel_w-1)+stride_w*8+3]/4 is integer division and yields an integer. The formula as a whole ensures that fram_w is a multiple of 4 and that 8 pixels can be generated.
Since data is also loaded 4 pixels at a time, for convolutions whose kernel width is greater than 1 the data being loaded and the data in use interleave. To handle this, the extra data is also sized as a multiple of 4: (kernel_w-1) is rounded up to a multiple of 4, i.e. {[(kernel_w-1)+3]/4}*4, and added to fram_w, giving the formula
fram=fram_w+{[(kernel_w-1)+3]/4}*4 (2)
Combining formulas (1) and (2) gives
fram={[(kernel_w-1)+stride_w*8+3]/4}*4+{[(kernel_w-1)+3]/4}*4 (3)
The two terms of formula (3) cannot be merged into one, because each involves a rounding (integer) division and the merged expression is not always equal to the sum. The number of rows of the input feature map loaded per pass is fram_h=fram_count/fram_w.
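Formulas (1) to (3) and the row count can be checked numerically. The C sketch below transcribes them using integer division; the function names are hypothetical and the example FRAM size is an arbitrary value chosen for illustration, not a T40 specification.

```c
#include <assert.h>

/* Formula (1): FRAM width in pixels, rounded to a multiple of 4. */
int fram_w_formula1(int kernel_w, int stride_w) {
    assert(kernel_w >= 1 && stride_w >= 1);
    return (((kernel_w - 1) + stride_w * 8 + 3) / 4) * 4;
}

/* Extra term of formula (2): (kernel_w - 1) rounded up to a multiple of 4. */
int fram_extra(int kernel_w) {
    assert(kernel_w >= 1);
    return (((kernel_w - 1) + 3) / 4) * 4;
}

/* Formula (3): the two rounded terms are added separately. */
int fram_total(int kernel_w, int stride_w) {
    return fram_w_formula1(kernel_w, stride_w) + fram_extra(kernel_w);
}

/* Rows per pass, as stated in the text: fram_h = fram_count / fram_w. */
int fram_rows(int fram_count, int fram_w) {
    assert(fram_w > 0);
    return fram_count / fram_w;
}
```

For kernel_w=3 and stride_w=1, formula (1) gives 12 and formula (3) gives 16. The rounding matters: for kernel_w=1 and stride_w=1 the separate terms give 8, whereas one possible merged form, ((2*(kernel_w-1)+stride_w*8+6)/4)*4, would give 12.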
b) simd load data instruction:
ingenic_load(indata,VR0,m)
The input is indata, a pointer to the data to be loaded. 128 bits of data are loaded from the position at offset m from the address that indata points to in memory: 16 values if the data is 8-bit, 8 values if it is 16-bit, and 4 values if it is 32-bit. The data is loaded into the VR register given as the second argument (written vrd in the description, VR0 in the signature above). m is counted in bytes, i.e. in units of 8 bits. VR0 is a SIMD VR register, which stores at most 512 bits of data.
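The element counts follow directly from the 128-bit load width. A minimal C sketch, with hypothetical helper names, makes the arithmetic explicit:

```c
#include <assert.h>

/* Width of one SIMD load, as stated in the description. */
enum { LOAD_BITS = 128 };

/* Number of elements fetched by one load for 8-, 16- or 32-bit data. */
int elements_per_load(int element_bits) {
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    return LOAD_BITS / element_bits;
}

/* Loads needed to fetch `count` elements, rounding up to whole loads. */
int loads_needed(int count, int element_bits) {
    assert(count >= 0);
    int per_load = elements_per_load(element_bits);
    return (count + per_load - 1) / per_load;
}
```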
c) A frame load data instruction:
ingenic_vr2fram(VR0,fram_load_id,num)
The input variable VR0 is the input data, fram_load_id is the start address to load to in the FRAM, and num is 0 or 1: when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id=fram_load_id+32 after the instruction ends.
d) wram load data instruction:
ingenic_vr2wram(VR0,wram_load_id,num)
The input variable VR0 is the input data, wram_load_id is the start address to load to in the WRAM, and num is 0 or 1: when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id=wram_load_id+64 after the instruction ends.
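The effect of the num flag on the two load addresses can be modelled in plain C. This is a software sketch of the address bookkeeping only, not of the hardware instructions; the function names are hypothetical, and it assumes that the WRAM address advances from its own previous value by 64.

```c
#include <assert.h>
#include <stddef.h>

/* Address steps stated in the description. */
enum { FRAM_STEP = 32, WRAM_STEP = 64 };

/* FRAM load address after one store: unchanged for num == 0, +32 for num == 1. */
size_t next_fram_load_id(size_t fram_load_id, int num) {
    assert(num == 0 || num == 1);
    return num ? fram_load_id + FRAM_STEP : fram_load_id;
}

/* WRAM load address after one store: unchanged for num == 0, +64 for num == 1
   (assumed to advance from the WRAM address itself). */
size_t next_wram_load_id(size_t wram_load_id, int num) {
    assert(num == 0 || num == 1);
    return num ? wram_load_id + WRAM_STEP : wram_load_id;
}
```

Calling the stores with num=1 in a loop therefore walks the target memory sequentially without any explicit address arithmetic in the loop body.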
f) oram data handling instruction:
ingenic_ddr2oram(ddr_id,oram_id,count,num)
The input variable ddr_id is the address in DDR from which loading starts, oram_id is the address in the ORAM to which loading starts, and count is the number of bytes loaded. num is 0 or 1: when num is 0, ddr_id and oram_id are unchanged after the instruction ends; when num is 1, count is added to both ddr_id and oram_id after the instruction ends.
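The behaviour described for the data handling instruction can be imitated with an ordinary memory copy. The following C sketch is a software model under stated assumptions: DDR and ORAM are represented as byte arrays, and model_ddr2oram, xfer_addr and demo_two_chunks are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Addresses tracked across calls; a stand-in for ddr_id and oram_id. */
typedef struct {
    size_t ddr_id;
    size_t oram_id;
} xfer_addr;

/* Software model only: copy `count` bytes, then advance both addresses
   by `count` if num == 1, or leave them unchanged if num == 0. */
xfer_addr model_ddr2oram(const uint8_t *ddr, uint8_t *oram,
                         xfer_addr a, size_t count, int num) {
    assert(num == 0 || num == 1);
    memcpy(oram + a.oram_id, ddr + a.ddr_id, count);
    if (num == 1) {
        a.ddr_id += count;
        a.oram_id += count;
    }
    return a;
}

/* Moves 8 bytes as two 4-byte chunks; returns 1 if the copied data and
   the final addresses are as expected. */
int demo_two_chunks(void) {
    const uint8_t ddr[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t oram[8] = {0};
    xfer_addr a = {0, 0};
    a = model_ddr2oram(ddr, oram, a, 4, 1);
    a = model_ddr2oram(ddr, oram, a, 4, 1);
    return memcmp(ddr, oram, 8) == 0 && a.ddr_id == 8 && a.oram_id == 8;
}
```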
The convolution calculation method is suited to cases where the input feature map is small, the weights are small, and neither the height nor the width of the convolution kernel exceeds 3. It also requires the input depth to be a multiple of 32 and the output depth to be a multiple of 32. If the input depth of some layer in the model is not a multiple of 32, it must be padded to a multiple of 32, and the corresponding weights must be padded in the same way.
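The padding requirement amounts to rounding the depth up to the next multiple of 32. A minimal C sketch, with hypothetical helper names:

```c
#include <assert.h>

/* Depth rounded up to the next multiple of 32. */
int padded_depth(int depth) {
    assert(depth > 0);
    return ((depth + 31) / 32) * 32;
}

/* Number of depth-32 blocks after padding (in_ic32 or out_ic32 in the text). */
int depth_blocks(int depth) {
    assert(depth > 0);
    return (depth + 31) / 32;
}
```

A layer with an input depth of 3, for instance, would be padded to 32, and the added channels of both the feature map and the weights would be filled so that they do not change the result (typically with zeros, which is an assumption here).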
As shown in fig. 1, the method comprises the following steps:
s1, storing data:
setting a storage mode of the feature map;
setting a storage mode of the weight;
s2, loading all data from DDR to WRAM by using SIMD instruction, and loading 32 data at a time:
s2.1, loading all data from DDR to WRAM by using SIMD instruction, and loading 32 data at a time;
s2.2, carrying data in the DDR into the ORAM by using an ORAM data carrying instruction: setting the initial address of the feature map as ddr_id, setting the byte number of the feature map as count, and setting the initial address of the oram as oram_id;
ingenic_ddr2oram(ddr_id,oram_id,count,1);
because the feature map is already stored in the required order and fits entirely in the ORAM, it can be moved directly in the default order until all the data has been stored; this method cannot be used when the ORAM cannot hold the feature map or when fram_w cannot hold the minimum number of pixels needed for a calculation;
S3, realizing convolution calculation:
s3.1, to calculate the convolution, data must first be moved from the ORAM to the FRAM before it can be used; the weights have all been loaded into the WRAM, so the number of weights need not be considered; the FRAM cannot hold the whole feature map, so only as much input data as is about to be used is moved from the ORAM to the FRAM;
s3.2, the convolution calculation first loads data from the ORAM into the FRAM; only then can the FRAM and the WRAM be used for the calculation; the initial address of the ORAM is set to 0, and the initial address of the WRAM is also set to 0; let the depth of the input feature map be 32*in_ic32, where in_ic32 is the input depth divided by 32, the input width be in_width, and the input height be in_height; let the depth of the output feature map be 32*out_ic32, where out_ic32 is the output depth divided by 32, the output width be out_width, and the output height be out_height; the convolution kernel has width kernel_w and height kernel_h; the stride of the convolution is stride_w in the width direction and stride_h in the height direction; the output and input feature map widths are related by in_width=out_width*stride_w, and the heights by in_height=out_height*stride_h; if these equalities do not hold, the input feature map must be padded with zeros, according to the specific convolution requirements, until they do; the generated result is stored in vrd;
To reduce the number of loads from the ORAM into the FRAM, all results along the same depth direction are generated at the same time; so in the loop order, the outermost loop runs over the height of the output feature map, then over the width of the output feature map, then over the depth/32 of the output feature map, and finally over the convolution calculation unit;
let the number of rows generated each time be fram_h, where fram_h=fram_count/fram_w; the greater the number of rows loaded, the lower the number of times the load is repeated in the height direction.
In step S2.1,
the data is loaded into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
the data is then loaded into the WRAM using the WRAM load data instruction, written as:
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)
The specific implementation of the convolution calculation in step S3 is as follows:
S3.1, initialization, written as:
wram_id=0;
oram_id=0;
fram_id=0;
wr_fram_id=0;
rd_fram_id=0;
int init_fram_w=fram_w-4*stride_w;
S3.2, each pass generates fram_h rows of results;
let ydir_i be the row position of the generated result, with initial value ydir_i=0, and let the initial number of rows per pass be fram_h_ori, so that fram_h=fram_h_ori. While ydir_i<out_height holds, this step is executed, ydir_i is advanced by fram_h, and the test is repeated. When it no longer holds, ydir_i<(out_height+fram_h-1) is tested instead (out_height may not be an integer multiple of fram_h when fram_h is greater than 1, so a remainder can be left over and must be handled): if this holds, fram_h=ydir_i-out_height is set and the step is executed; otherwise the loop exits. This is written as:
for(int ydir_i=0;ydir_i<(out_height+fram_h-1);ydir_i+=fram_h);
if(ydir_i>=out_height)fram_h=ydir_i-out_height;
performing: continuing to read data from the DDR into the ORAM, fram_h rows at a time;
performing: initializing the ORAM read address to 0 and the FRAM write address to 0, written as:
int rd_oram_idx=0;
int wr_fram_idx=0;
performing: s3.2.1, 4 pixels are generated per treatment in the width direction, and a fram_h row is generated in the height direction;
if xdir_i < out_width+3 (there is a remainder when out_width is not a multiple of 4, and the remainder needs to be loaded according to the calculation of the minimum calculation unit, so that the whole division of 4 by out_width needs to be upwards rounded, which is equivalent to out_width+3. In the memory space where we store the input feature map, there is some more memory space than the actual feature map space, and errors are caused by preventing the data from being read), then the step is continued to be executed, and xdir_i+=4 after the step is executed; judging whether xdir_i < out_width+3 is true, and if so, sequentially cycling the steps; if not, the step is skipped to return to S3.2, and the step is marked as:
for(int xdir_i=0;xdir_i<out_width+3;xdir_i+=4);
performing: s3.2.1.1 if xdir_i >1 is true, noted if (xdir_i > 1), then proceed to this step; if not, then execute step S3.2.1.2;
performing: reading the data in the ORAM into the FRAM, wherein each time (32 x in_ic32) (4 x stride_w) is read in the loop body; the data needed by the following convolution calculation is loaded, and in order to form a loop, the minimum unit of each loading is 4 pixels, so the first loading is more, namely { [ (kernel_w-1) +3]/4 }. 4;
Performing: int rd_oram_idx=rd_oram_idx+xdir_i_stride_w_32;
performing: int wr_fram_idx=wr_fram_idx+xdir_i_stride_w_w_32;
performing: a circulation body (5);
performing: s3.2.1.2 if xdir_i= 0 is true, then the present step is continued, noted as: if (xdir_i= 0); if not, continuing with step S3.2.1.3;
performing: reading the data in the ORAM into the FRAM, wherein the length of the reading of the rear circulating body is 4 x stride_w, so the length of the first reading is (frame_w-4 x stride_w), and the reading data is (32 x in_ic32) (frame_w-4 x stride_w) x frame_h;
performing: rd_oram_idx=rd_oram_idx;
performing: wrfram idx=wrfram idx;
performing: a circulation body (6);
in xdir_i >1, each time 4 x stride_w x 32 is added, and xdir_i=0, the loaded data width is (fram_w-8 x stride_w), so the following value is added after xdir_i=0 is added:
rd_oram_idx=rd_oram_idx+(fram_w-8*stride_w)*32;
wr_fram_idx=wr_fram_idx+(fram_w-8*stride_w)*32;
performing: s3.2.1.3 the number of the individual pieces of the plastic,
performing: int wram_id=0;
performing: each treatment generated [32 x (4 pixels) ]x (fram_h row);
performing: a circulation body (7);
wherein, the circulating body (5) is used for loading data under normal conditions; the loop body (6) is loaded with data for the first time; the loop body (7) is a specific implementation convolution, and calculates the result of the generation sharing one weight, namely, the result in the height is firstly performed, then the result in the depth/32 direction is calculated, and all the results are sequentially calculated.
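The tiling of the output into passes of fram_h rows and steps of 4 pixels can be sketched in C. This is not the source's loop: it uses the conventional clamp for the remainder pass (the last pass covers out_height - ydir_i rows) and a rounded-up division for the width, which are assumptions about the intended behaviour; the function names are hypothetical.

```c
#include <assert.h>

/* Number of passes over the output height, fram_h_ori rows per pass. */
int row_passes(int out_height, int fram_h_ori) {
    assert(out_height > 0 && fram_h_ori > 0);
    return (out_height + fram_h_ori - 1) / fram_h_ori;
}

/* Total rows covered when the last pass is clamped to the remainder. */
int rows_covered(int out_height, int fram_h_ori) {
    assert(out_height > 0 && fram_h_ori > 0);
    int covered = 0;
    for (int ydir_i = 0; ydir_i < out_height; ydir_i += fram_h_ori) {
        int fram_h = fram_h_ori;
        if (ydir_i + fram_h > out_height) {
            fram_h = out_height - ydir_i;  /* remainder pass */
        }
        covered += fram_h;
    }
    return covered;
}

/* Number of 4-pixel steps across the width, i.e. out_width / 4 rounded up. */
int width_steps(int out_width) {
    assert(out_width > 0);
    return (out_width + 3) / 4;
}
```

With out_height=10 and fram_h_ori=4, for example, there are three passes of 4, 4 and 2 rows, so every output row is produced exactly once.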
In the method,
loop body (5) is as follows:
performing: step (5)1, set the initial icnum_i=0 and test whether icnum_i<in_ic32 holds; if so, execute this loop step, then icnum_i++ and test again, looping in turn; if not, exit loop body (5);
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (5)2, set the initial fh_i=0 and test whether fh_i<fram_h holds; if so, execute this loop step, then fh_i++ and test again, looping in turn; if not, leave this step and return to step (5)1;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (5)3, set the initial fw_i=0 and test whether fw_i<4*stride_w holds; if so, execute this loop step, then fw_i+=4 and test again, looping in turn; if not, leave this step and return to step (5)2;
written as: for(int fw_i=0;fw_i<4*stride_w;fw_i+=4)
performing: loading into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: loading the data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
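The address arithmetic of loop body (5) can be written as two small functions. The C sketch below transcribes the expressions for rd_oram and wr_fram as read from the text, with the multiplication signs restored; the function names are hypothetical and the sketch assumes 8-bit data, so that one pixel of a depth-32 block occupies 32 bytes.

```c
#include <assert.h>

/* ORAM read offset for depth block icnum_i and row fh_i:
   rd_oram_idx + icnum_i*(32*in_width*fram_h) + fh_i*(32*in_width). */
long oram_row_offset(long rd_oram_idx, int icnum_i, int fh_i,
                     int in_width, int fram_h) {
    assert(icnum_i >= 0 && fh_i >= 0 && fh_i < fram_h && in_width > 0);
    long rd_oram_ic = rd_oram_idx + (long)icnum_i * (32L * in_width * fram_h);
    return rd_oram_ic + (long)fh_i * (32L * in_width);
}

/* FRAM write offset for depth block icnum_i and row fh_i:
   wr_fram_idx + icnum_i*(32*fram_w*fram_h) + fh_i*(32*fram_w). */
long fram_row_offset(long wr_fram_idx, int icnum_i, int fh_i,
                     int fram_w, int fram_h) {
    assert(icnum_i >= 0 && fh_i >= 0 && fh_i < fram_h && fram_w > 0);
    long wr_fram_ic = wr_fram_idx + (long)icnum_i * (32L * fram_w * fram_h);
    return wr_fram_ic + (long)fh_i * (32L * fram_w);
}

/* Bytes moved per row in the normal case: 4*stride_w pixels of 32 bytes each. */
int bytes_per_row_step(int stride_w) {
    assert(stride_w >= 1);
    return 4 * stride_w * 32;
}
```

The two offsets differ only in the row pitch: rows are in_width pixels apart in the ORAM but fram_w pixels apart in the FRAM, which is why the copy has to be done row by row.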
loop body (6) is as follows:
performing: step (6)1, set the initial icnum_i=0 and test whether icnum_i<in_ic32 holds; if not, exit loop body (6); if so, execute this loop step, then icnum_i++ and test again, looping in turn until it no longer holds;
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (6)2, set the initial fh_i=0 and test whether fh_i<fram_h holds; if not, leave step (6)2; if so, execute this loop step, then fh_i++ and test again, looping in turn until it no longer holds; written as:
for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (6)3, set the initial fw_i=0 and test whether fw_i<(fram_w-4*stride_w) holds; if not, leave step (6)3; if so, execute this loop step, then fw_i+=4 and test again, looping in turn until it no longer holds; written as:
for(int fw_i=0;fw_i<(fram_w-4*stride_w);fw_i+=4)
performing: loading into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: loading the data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
loop body (7) is as follows:
performing: step (7)1, set the initial ocnum_i=0 and test whether ocnum_i<out_ic32 holds; if not, leave step (7)1; if so, execute this loop step, then ocnum_i++ and test again, looping in turn until it no longer holds;
written as: for(int ocnum_i=0;ocnum_i<out_ic32;ocnum_i++)
performing: each pass generates [32 x (4 pixels)];
performing: fram_id=0;
performing: wram_id=wram_id+32*32*kernel_w*kernel_h*in_ic32;
performing: step (7)2, set the initial fh_i=0 and test whether fh_i<fram_h holds; if not, leave step (7)2; if so, execute this loop step, then fh_i++ and test again, looping in turn until it no longer holds;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: fram_id=fram_id+32*4*stride_w;
performing: ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,vrd);
performing: taking out the generated result for subsequent processing and storage;
performing: vrd=vrd+32*4.
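The WRAM step used in loop body (7) is the size of the weights for one depth-32 output block. The C sketch below computes that size and the resulting block offsets; the function names are hypothetical, and it assumes 8-bit weights (one byte each) and that the block for ocnum_i=0 starts at WRAM address 0.

```c
#include <assert.h>

/* Bytes of weights for one depth-32 output block:
   32*32*kernel_w*kernel_h*in_ic32, with one byte per 8-bit weight. */
long weight_block_bytes(int kernel_w, int kernel_h, int in_ic32) {
    assert(kernel_w >= 1 && kernel_h >= 1 && in_ic32 >= 1);
    return 32L * 32L * kernel_w * kernel_h * in_ic32;
}

/* WRAM offset of the block for output block ocnum_i, assuming block 0 is at 0. */
long wram_block_offset(int ocnum_i, int kernel_w, int kernel_h, int in_ic32) {
    assert(ocnum_i >= 0);
    return (long)ocnum_i * weight_block_bytes(kernel_w, kernel_h, in_ic32);
}

/* Total weight bytes that must fit in the WRAM for the method to apply. */
long total_weight_bytes(int kernel_w, int kernel_h, int in_ic32, int out_ic32) {
    assert(out_ic32 >= 1);
    return (long)out_ic32 * weight_block_bytes(kernel_w, kernel_h, in_ic32);
}
```

Comparing total_weight_bytes against the WRAM capacity gives a direct check of the method's precondition that all weights fit in the WRAM.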
Thus, the present application has the following advantages. In this new design method, feature map data is first loaded from DDR into the ORAM and then from the ORAM into the FRAM, which makes feature map loading fast and avoids the bandwidth limitation that loading the feature map would otherwise impose. For convolutions with small weights, the weight data is stored directly in the WRAM and is not loaded repeatedly, so it is used efficiently, bandwidth contention between data loads is avoided, and the computation is accelerated.
Specifically, the method can be expressed as follows:
s1, storing data.
The storage mode of the feature map is as follows: the feature map data is stored in the order 32, W, H, N, where 32 is a slice of the depth, W is the width, H is the height, and N is the number of depth-32 slices, i.e. 32*N is the depth of the feature map. The data is contiguous first over the 32, then over the width, then over the height, and finally over the depth/32 count.
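The 32, W, H, N order determines where each element sits in memory. As an illustration, the following C sketch computes the linear offset of the element at channel c, column x and row y; the function name is hypothetical and the offset is counted in elements (equal to bytes for 8-bit data).

```c
#include <assert.h>

/* Linear offset of element (c, x, y) in the 32, W, H, N layout:
   fastest over the 32 channels of a slice, then width, then height,
   and slowest over the depth/32 slice index. */
long feature_offset(int c, int x, int y, int width, int height) {
    assert(c >= 0 && x >= 0 && x < width && y >= 0 && y < height);
    int n = c / 32;     /* which depth-32 slice */
    int lane = c % 32;  /* position inside the slice */
    return (((long)n * height + y) * width + x) * 32 + lane;
}
```

Moving one pixel to the right advances the offset by 32, moving one row down advances it by 32*W, and moving to the next depth-32 slice advances it by 32*W*H.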
The weights are stored contiguously first over 32*32, then over the convolution kernel width, then over the kernel height, then over the kernel input depth/32 count, and finally over the output depth/32 count. Weights are commonly stored contiguously over the input depth, then over the kernel width and height, and finally over the kernel output depth, so before processing they must be rearranged into the required order.
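The weight order can be illustrated in the same way. In the C sketch below the function name is hypothetical, and the ordering inside each 32*32 block (output lane major, input lane minor) is an assumption, since the text does not specify it; only the ordering of the blocks themselves follows the description.

```c
#include <assert.h>

/* Linear offset of the weight for output channel oc, input channel ic and
   kernel position (kx, ky). Blocks of 32*32 weights are ordered over kernel
   width, then kernel height, then input depth/32, then output depth/32.
   The order inside a block is assumed: output lane major, input lane minor. */
long weight_offset(int oc, int ic, int kx, int ky,
                   int kernel_w, int kernel_h, int in_ic32) {
    assert(oc >= 0 && ic >= 0 && ic < 32 * in_ic32);
    assert(kx >= 0 && kx < kernel_w && ky >= 0 && ky < kernel_h);
    int oc32 = oc / 32;
    int ic32 = ic / 32;
    long block = (((long)oc32 * in_ic32 + ic32) * kernel_h + ky) * kernel_w + kx;
    return block * (32L * 32L) + (long)(oc % 32) * 32 + (ic % 32);
}
```

A rearrangement routine would loop over the commonly used order and write each weight to the offset returned by such a function.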
S2, using SIMD to load all data from DDR to the WRAM, 32 data at a time. a) Let the initial address of the weight data be widthdata.
The data is loaded into VR0 and VR1 using the SIMD load data instruction:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
A wram load data instruction is used to load data into the wram.
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1)
Because the weights are already stored in the required order and fit entirely in the WRAM, they can be loaded directly in the default order until all the data has been stored. This method cannot be used when the WRAM cannot hold the weights.
b) The data in ddr is moved to the oram using the oram move data instruction. Let the initial address of the feature map be ddr_id, the number of bytes of the feature map be count, and the initial address of the oram be oram_id.
ingenic_ddr2oram(ddr_id,oram_id,count,1);
Because the feature map is already stored in the required order and fits entirely in the ORAM, it can be moved directly in the default order until all the data has been stored. This method cannot be used when the ORAM cannot hold the feature map or when fram_w cannot hold the minimum number of pixels needed for a calculation.
S3, realizing convolution calculation.
To calculate the convolution, data must first be moved from the ORAM to the FRAM before it can be used. The weights have already all been loaded into the WRAM, so the number of weights need not be considered. The FRAM, however, cannot hold the whole feature map, so only as much input data as is about to be used is moved from the ORAM to the FRAM.
The convolution calculation first loads data from the ORAM into the FRAM. Only then can the calculation be performed using the FRAM and the WRAM. The initial address of the ORAM is set to 0, and the initial address of the WRAM is also set to 0. Let the depth of the input feature map be 32*in_ic32, where in_ic32 is the input depth divided by 32, the input width be in_width, and the input height be in_height; let the depth of the output feature map be 32*out_ic32, where out_ic32 is the output depth divided by 32, the output width be out_width, and the output height be out_height; the convolution kernel has width kernel_w and height kernel_h; the stride of the convolution is stride_w in the width direction and stride_h in the height direction. The output and input feature map widths are related by in_width=out_width*stride_w, and the heights by in_height=out_height*stride_h. If these equalities do not hold, the input feature map must be padded with zeros, according to the specific convolution requirements, until they do. The generated results are saved to vrd. To reduce the number of loads from the ORAM into the FRAM, all results along the same depth direction are generated at the same time. So in the loop order, the outermost loop runs over the height of the output feature map, followed by the width of the output feature map, then the depth/32 of the output feature map, and finally the convolution calculation unit.
Let the number of rows generated each time be fram_h (where fram_h=fram_count/fram_w). The greater the number of rows loaded, the lower the number of times the load is repeated in the height direction.
The specific implementation is as follows:
[Implementation listing: shown as images in the original publication; not reproduced here.]
In the xdir_i>1 case, 4*stride_w*32 is added each time, whereas in the xdir_i==0 case the width of the loaded data is (fram_w-8*stride_w), so after the xdir_i==0 load the following values are added:
rd_oram_idx=rd_oram_idx+(fram_w-8*stride_w)*32;
wr_fram_idx=wr_fram_idx+(fram_w-8*stride_w)*32;
Loop (5) loads data in the normal case, and loop (6) loads data the first time. Loop (7) is the concrete convolution, which computes together the results that share one set of weights, i.e. first the results along the height, then the results along the depth/32 direction, until all results have been calculated in turn.
In summary, the key points of the present application are as follows:
1. the method for setting FRAM width and the method for realizing convolution.
2. ORAM to FRAM carry-over method.
3. A specific convolution calculation method.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (7)

1. A method for implementing convolution of WRAM capable of dropping weights, the method comprising the steps of:
s1, storing data:
setting a storage mode of the feature map: the data storage order of the feature map is 32, W, H, N, wherein 32 is a slice of the depth, W is the width, H is the height, and N is the number of depth-32 slices, namely 32*N is the depth of the feature map; the data is contiguous first over the 32, then over the width, then over the height, and finally over the depth/32 count;
the storage mode of the set weight is as follows: the method comprises the steps of adopting 32 x 32 continuity, then continuity in the width of a convolution kernel, then continuity in the height of the convolution kernel, then continuity in the number of input depths/32 of the convolution kernel, and finally continuity in the number of output depths/32; before processing, the common input depth is required to be continuous, the width and the height of the convolution kernel are stored, and finally the output depth of the convolution kernel is stored into a required sequence;
s2, loading all data from DDR to WRAM by using SIMD instruction, and loading 32 data at a time:
s2.1, all data is loaded from DDR to WRAM using SIMD instructions, 32 data per load: setting the initial address of the weight data as widthdata;
Loading into VR0, VR1 using SIMD load data instructions;
loading data into the WRAM using a WRAM load data instruction;
because the weight storage sequence is stored according to the requirement, and the data size can be completely put into the wram, the weight storage sequence can be directly stored according to the default sequence until all the data are completely stored; when wram cannot be stored down, the method cannot be used;
s2.2, carrying data in the DDR into the ORAM by using an ORAM data carrying instruction: setting the initial address of the feature map as ddr_id, setting the byte number of the feature map as count, and setting the initial address of the oram as oram_id;
ingenic_ddr2oram(ddr_id,oram_id,count,1);
because the feature map storage sequence is stored according to the requirement, and the data size can be completely put into ORAM, the feature map storage sequence can be directly stored according to the default sequence until all data are stored; this method cannot be used when ORAM cannot store down or fram_w cannot store down the smallest calculated pixel;
s3, realizing convolution calculation:
s3.1, calculating convolution, wherein data is required to be moved from ORAM to FRAM and then can be used for convolution calculation; the weights are all loaded into the WRAM, so that the situation of the number of the weights does not need to be considered; the FRAM cannot accommodate all feature graphs, and how much input data is needed to be used to carry the data from the ORAM to the FRAM;
S3.2, the convolution calculation first loads data from the ORAM into the FRAM; only then can the FRAM and the WRAM be used for the calculation; the initial address of the ORAM is set to 0, and the initial address of the WRAM is also set to 0; let the depth of the input feature map be 32*in_ic32, where in_ic32 is the input depth divided by 32, the input width be in_width, and the input height be in_height; let the depth of the output feature map be 32*out_ic32, where out_ic32 is the output depth divided by 32, the output width be out_width, and the output height be out_height; the convolution kernel has width kernel_w and height kernel_h; the stride of the convolution is stride_w in the width direction and stride_h in the height direction; the output and input feature map widths are related by in_width=out_width*stride_w, and the heights by in_height=out_height*stride_h; if these equalities do not hold, the input feature map must be padded with zeros, according to the specific convolution requirements, until they do; the generated result is stored in vrd;
to reduce the number of times the ORAM is loaded into the FRAM, this is achieved by generating all results in the same depth direction at the same time; so when designing the circulation order, the outermost circulation is the height of the output feature map, then the width of the output feature map, then the depth/32 of the output feature map, and finally the calculation unit of convolution;
Let the number of rows generated each time be fram_h, where fram_h=fram_count/fram_w; the greater the number of rows loaded, the lower the number of times the load is repeated in the height direction.
2. The method for implementing convolution of WRAM capable of lowering weights according to claim 1, wherein the method requires: the feature map fits in the ORAM; the number of weights is less than or equal to the WRAM capacity, so that the WRAM can hold them; the bit width is 8 bits; the feature map data required to generate 8 pixels per calculation fits entirely in the FRAM; neither the height nor the width of the convolution kernel exceeds 3; the input depth is a multiple of 32 and the output depth is a multiple of 32; if the input depth of some layer in the model is not a multiple of 32, it must be padded to a multiple of 32, and the corresponding weights must be padded in the same way.
3. A method for implementing convolution of WRAM capable of dropping weights according to claim 1, characterized in that the instructions used in the method are as follows:
a) Convolution calculation instructions:
ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,feature_w,feature_h,vrd);
the input variable fram_id is the starting address used by fram, the wram_id is the starting address used by wram, the ic32_num is the calculated number, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the step length of convolution calculation in the x direction, stride_y is the step length of convolution calculation in the y direction, feature_w is the width of the input feature map, and feature_h is the height of the input feature map; vrd generates a result;
Description of use: each call calculates the results for 4 pixels; the calculation unit is a depth of 32: the output depth is 32, and 4 pixel results are generated; if ic32_num=1, the input depth is 32x1 and 4 pixels with an output depth of 32 are generated; if ic32_num=2, the input depth is 32x2 and 4 pixels with an output depth of 32 are generated; if ic32_num=3, the input depth is 32x3 and 4 pixels with an output depth of 32 are generated; the minimum input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4;
the width of the FRAM, namely how many pixels of the input feature map are loaded, is set as a parameter of the convolution calculation instruction;
b) SIMD load data instruction:
ingenic_load(indata,VR0,m)
Input: the data to be loaded, given as the pointer indata to the current data; 128 bits of data are loaded from the position in memory at offset m from indata: 16 values if the data is 8-bit, 8 values if the data is 16-bit, and 4 values if the data is 32-bit; the data is loaded into the destination register (VR0 here); m is counted in bytes, i.e. in units of 8 bits; VR0 is a SIMD VR register and stores at most 512 bits of data;
c) FRAM load data instruction:
ingenic_vr2fram(VR0,fram_load_id,num)
Input variables: VR0 is the input data, fram_load_id is the start address for loading into the FRAM, and num is 0 or 1; when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id=fram_load_id+32 after the instruction ends;
d) WRAM load data instruction:
ingenic_vr2wram(VR0,wram_load_id,num)
Input variables: VR0 is the input data, wram_load_id is the start address for loading into the WRAM, and num is 0 or 1; when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id=wram_load_id+64 after the instruction ends;
e) ORAM data transfer instruction:
ingenic_ddr2oram(ddr_id,oram_id,count,num)
Input variables: ddr_id is the address in the DDR at which loading starts, oram_id is the address in the ORAM at which loading starts, and count is the number of bytes loaded; num is 0 or 1; when num is 0, ddr_id and oram_id are unchanged after the instruction ends; when num is 1, count is added to both ddr_id and oram_id after the instruction ends.
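The address behaviour described for these instructions can be modelled in a few lines. This is only a software sketch under stated assumptions: `elems_per_load` and `next_address` are hypothetical names, and the code models the bookkeeping described in the text (128-bit loads, optional address advance controlled by num), not the hardware instructions themselves.

```c
#include <assert.h>

/* Elements covered by one 128-bit load for a given element width in bits. */
int elems_per_load(int elem_bits)
{
    assert(elem_bits == 8 || elem_bits == 16 || elem_bits == 32);
    return 128 / elem_bits;
}

/* Model of the num flag: the address advances by step only when num is 1.
 * The text gives a step of 32 for the FRAM load, 64 for the WRAM load,
 * and count for the ORAM transfer. */
int next_address(int addr, int step, int num)
{
    assert(num == 0 || num == 1);
    return num ? addr + step : addr;
}
```

With num set to 1, two consecutive FRAM loads starting at address 0 would land at 0 and 32 in this model, which is why the later listings can repeat the same load call without updating the address by hand.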
4. The method for implementing convolution of WRAM capable of lowering weights according to claim 3, wherein the width of FRAM is set:
let the total number of bytes of the FRAM be fram_count, the width of the FRAM be fram_w, and the number of rows of the input feature map loaded per pass be fram_h; fram_w is set to the number of input feature map pixels required to generate at least 8 pixels; the smallest result that can be generated is 4 pixels, and in use, while the first 4 pixels are being generated the feature map data for the next 4 pixels must already be loaded, so the minimum is 8;
This gives the formula
fram_w={[(kernel_w-1)+stride_w*8+3]/4}*4 (1)
where [(kernel_w-1)+stride_w*8+3]/4 is integer arithmetic and yields an integer; the formula as a whole ensures that fram_w is a multiple of 4 and that 8 pixels can be generated;
since data is also loaded 4 pixels at a time, for a convolution whose kernel is larger than 1 the data being loaded overlaps with the data being used; to solve this, the extra data is handled in multiples of 4, i.e. (kernel_w-1) rounded up to a multiple of 4 is added to fram_w, using the formula {[(kernel_w-1)+3]/4}*4, so that
fram=fram_w+{[(kernel_w-1)+3]/4}*4 (2)
From formulas (1) and (2)
fram={[(kernel_w-1)+stride_w*8+3]/4}*4+{[(kernel_w-1)+3]/4}*4 (3)
The two terms of formula (3) cannot be merged because of the rounding operations; after merging, the result is not always equal;
the number of lines of the input feature map is loaded for each processing: fram_h=fram_count/fram_w.
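Formulas (1) and (3) and the row count can be checked numerically with a short sketch. The function names are hypothetical, and the value of fram_count used in the note below is an assumed example, not a figure from the claims.

```c
#include <assert.h>

/* Round x up to a multiple of 4 using integer division, as in the formulas. */
int round_up4(int x)
{
    return ((x + 3) / 4) * 4;
}

/* Formula (1): FRAM width needed to generate 8 output pixels. */
int fram_width(int kernel_w, int stride_w)
{
    assert(kernel_w >= 1 && stride_w >= 1);
    return round_up4((kernel_w - 1) + stride_w * 8);
}

/* Formula (3): formula (1) plus the separately rounded overlap term. */
int fram_total_width(int kernel_w, int stride_w)
{
    return fram_width(kernel_w, stride_w) + round_up4(kernel_w - 1);
}

/* Rows loaded per pass: fram_h = fram_count / fram_w (integer division). */
int fram_rows(int fram_count, int fram_w)
{
    assert(fram_w > 0);
    return fram_count / fram_w;
}
```

For kernel_w=3 and stride_w=1 this gives fram_w=12 from formula (1) and 16 from formula (3). Rounding the merged sum once, round_up4(2+8+2), would give 12 instead of 16, which illustrates why the two rounded terms cannot be combined. With an assumed fram_count of 1024 and a width of 16, fram_rows returns 64.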
5. The method for realizing the convolution of the WRAM drop weights according to claim 4, wherein in S2.1,
data is loaded into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR0,1)
ingenic_load(widthdata,VR1,1)
ingenic_load(widthdata,VR1,1)
the data is then loaded into the WRAM using the WRAM load data instruction, written as:
ingenic_vr2wram(VR0,wram_load_id,1)
ingenic_vr2wram(VR1,wram_load_id,1).
6. The method for implementing convolution of WRAM capable of lowering weights according to claim 5, wherein the S3 convolution calculation is specifically implemented as follows:
S3.1, initialization, written as:
wram_id=0;
oram_id=0;
fram_id=0;
wr_fram_id=0;
rd_fram_id=0;
int init_fram_w=fram_w-4*stride_w;
S3.2, each pass generates fram_h rows of results;
let ydir_i be the position of the generated result in the height direction, with initial value ydir_i=0, and let the initial number of rows processed be fram_h_ori, with fram_h=fram_h_ori; if ydir_i<out_height holds, this step is executed, and after it completes the condition is tested again and the step repeats while it holds; if it does not hold, ydir_i<(out_height+fram_h-1) is tested: when fram_h is greater than 1 and out_height is not an integer multiple of fram_h, a remainder exists, so this second test is needed; if it holds, fram_h=ydir_i-out_height and the step is executed; if it does not hold, the loop of this step exits; this is written as:
for(int ydir_i=0;ydir_i<(out_height+fram_h-1);ydir_i+=fram_h);
if(ydir_i>=out_height)fram_h=ydir_i-out_height;
performing: continuously reading data from the DDR into the ORAM, wherein the number of lines read each time is fram_h;
performing: initialize the ORAM read address to 0 and the FRAM write address to 0, written as:
int rd_oram_idx=0;
int wr_fram_idx=0;
performing: S3.2.1, each pass generates 4 pixels in the width direction and fram_h rows in the height direction;
let the initial xdir_i=0; the loop condition is xdir_i<out_width+3: when out_width is not a multiple of 4 there is a remainder, which must also be loaded and calculated in the minimum calculation unit, so out_width divided by 4 is rounded up, which is equivalent to using out_width+3; the memory space that stores the input feature map is made larger than the actual feature map so that no error arises from data that cannot be read; if the condition holds, this step is executed, after which xdir_i+=4 and the condition is tested again, looping in sequence; if it does not hold, the loop exits and control returns to S3.2; this is written as:
for(int xdir_i=0;xdir_i<out_width+3;xdir_i+=4);
performing: S3.2.1.1, if xdir_i>1 holds, written as if(xdir_i>1), this step is executed; if not, step S3.2.1.2 is executed;
performing: read data from the ORAM into the FRAM, where each pass of the loop body reads (32*in_ic32)*(4*stride_w); this loads the data needed by the subsequent convolution calculation; so that the loads form a regular loop, the minimum unit of each load is 4 pixels, and the first load is therefore larger by {[(kernel_w-1)+3]/4}*4;
performing: int rd_oram_idx=rd_oram_idx+xdir_i*stride_w*32;
performing: int wr_fram_idx=wr_fram_idx+xdir_i*stride_w*32;
performing: loop body (5);
performing: S3.2.1.2, if xdir_i==0 holds, written as if(xdir_i==0), this step is executed; if not, step S3.2.1.3 is executed;
performing: read data from the ORAM into the FRAM; since the later loop body reads a length of 4*stride_w each time, the length of the first read is (fram_w-4*stride_w), and the data read is (32*in_ic32)*(fram_w-4*stride_w)*fram_h;
performing: rd_oram_idx=rd_oram_idx;
performing: wr_fram_idx=wr_fram_idx;
performing: loop body (6);
in the xdir_i>1 case, 4*stride_w*32 is added each time, whereas in the xdir_i==0 case the loaded data width is (fram_w-8*stride_w), so the following values are added after the xdir_i==0 case:
rd_oram_idx=rd_oram_idx+(fram_w-8*stride_w)*32;
wr_fram_idx=wr_fram_idx+(fram_w-8*stride_w)*32;
performing: S3.2.1.3,
performing: int wram_id=0;
performing: each pass generates [32 x (4 pixels)] x (fram_h rows);
performing: loop body (7);
wherein loop body (5) loads data in the normal case; loop body (6) loads data for the first time; and loop body (7) is the actual convolution, which first calculates the results that share one set of weights, i.e. the results along the height are calculated first, then the results along the depth/32 direction, until all results have been calculated in sequence.
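The tiling idea in S3.2 and S3.2.1 (rows in passes of fram_h with a shorter final pass, columns in groups of 4 rounded up) can be sketched as follows. This is an illustrative formulation with hypothetical function names; it handles the remainder with a plain minimum rather than reproducing the exact loop bounds written in the claim.

```c
#include <assert.h>

/* Rows handled by the pass that starts at output row y; the last pass may be short. */
int rows_in_pass(int y, int out_height, int fram_h)
{
    assert(fram_h > 0 && y >= 0 && y < out_height);
    int remaining = out_height - y;
    return remaining < fram_h ? remaining : fram_h;
}

/* Number of row passes needed to cover out_height rows, fram_h rows at a time. */
int count_row_passes(int out_height, int fram_h)
{
    int passes = 0;
    for (int y = 0; y < out_height; y += fram_h) {
        (void)rows_in_pass(y, out_height, fram_h); /* rows this pass would process */
        passes++;
    }
    return passes;
}

/* Number of 4-pixel column groups: out_width divided by 4, rounded up. */
int count_col_groups(int out_width)
{
    assert(out_width > 0);
    return (out_width + 3) / 4;
}
```

For an output of 10 rows with fram_h=4, this sketch yields passes of 4, 4 and 2 rows; an output width of 10 needs three 4-pixel groups, the last of which is only partly valid and relies on the extra space reserved for the input feature map.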
7. The method for implementing convolution of WRAM capable of lowering weights according to claim 6, wherein
loop body (5) is as follows:
performing: step (5)1: set initial icnum_i=0 and test whether icnum_i<in_ic32 holds; if it holds, execute this loop step, then icnum_i++ and test again, looping in sequence; if not, exit loop body (5), i.e. end the loop;
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (5)2: set initial fh_i=0 and test whether fh_i<fram_h holds; if it holds, execute this loop step, then fh_i++ and test again, looping in sequence; if not, exit this step and return to step (5)1;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (5)3: set initial fw_i=0 and test whether fw_i<4*stride_w holds; if it holds, execute this loop step, then fw_i+=4 and test again, looping in sequence; if not, exit this step and return to step (5)2;
written as: for(int fw_i=0;fw_i<4*stride_w;fw_i+=4)
performing: load into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: load the data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
loop body (6) is as follows:
performing: step (6)1: set initial icnum_i=0 and test whether icnum_i<in_ic32 holds; if not, exit loop body (6); if it holds, execute this loop step, then icnum_i++ and test again, looping in sequence; otherwise exit this loop and execute the following steps;
written as: for(int icnum_i=0;icnum_i<in_ic32;icnum_i++)
performing: rd_oram_ic=rd_oram_idx+icnum_i*(32*in_width*fram_h);
performing: wr_fram_ic=wr_fram_idx+icnum_i*(32*fram_w*fram_h);
performing: step (6)2: set initial fh_i=0 and test whether fh_i<fram_h holds; if not, exit this loop step; if it holds, execute this loop step, then fh_i++ and test again, looping in sequence; otherwise exit this loop and execute the following steps; written as:
for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: rd_oram=rd_oram_ic+fh_i*(32*in_width);
performing: wr_fram=wr_fram_ic+fh_i*(32*fram_w);
performing: step (6)3: set initial fw_i=0 and test whether fw_i<(fram_w-4*stride_w) holds; if not, exit step (6)3; if it holds, execute this loop step, then fw_i+=4 and test again, looping in sequence; otherwise exit this loop and execute the following steps; written as:
for(int fw_i=0;fw_i<(fram_w-4*stride_w);fw_i+=4)
performing: load into VR0 and VR1 using the SIMD load data instruction, written as:
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR0,1);
ingenic_load(rd_oram,VR1,1);
ingenic_load(rd_oram,VR1,1);
performing: load the data into the FRAM using the FRAM load data instruction, written as:
ingenic_vr2fram(VR0,wr_fram,1);
ingenic_vr2fram(VR1,wr_fram,1);
loop body (7) is as follows:
performing: step (7)1: set initial ocnum_i=0 and test whether ocnum_i<out_ic32 holds; if not, exit step (7)1; if it holds, execute this loop step, then ocnum_i++ and test again, looping in sequence; otherwise exit this loop and execute the following steps;
written as: for(int ocnum_i=0;ocnum_i<out_ic32;ocnum_i++)
performing: each pass generates [32 x (4 pixels)];
performing: fram_id=0;
performing: wram_id=wram_id+32*32*kernel_w*kernel_h*in_ic32;
performing: step (7)2: set initial fh_i=0 and test whether fh_i<fram_h holds; if not, exit step (7)2; if it holds, execute this loop step, then fh_i++ and test again, looping in sequence; otherwise exit this loop;
written as: for(int fh_i=0;fh_i<fram_h;fh_i++)
performing: fram_id=fram_id+32*4*stride_w;
performing: ingenic_conv_bit8(fram_id,wram_id,ic32_num,kernel_w,kernel_h,stride_x,stride_y,vrd);
performing: fetch the generated result for subsequent processing and storage;
performing: vrd=vrd+32*4.
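The address arithmetic used by loop bodies (5), (6) and (7) can be collected into small helpers for checking by hand. This sketch assumes the offset expressions in the steps are products (for example icnum_i multiplied by 32*in_width*fram_h); the function names are hypothetical and only the variable names come from the claims.

```c
#include <assert.h>

/* ORAM read offset for depth group icnum_i and row fh_i of the current pass. */
int oram_read_offset(int rd_oram_idx, int icnum_i, int fh_i, int in_width, int fram_h)
{
    assert(icnum_i >= 0 && fh_i >= 0 && fh_i < fram_h);
    int rd_oram_ic = rd_oram_idx + icnum_i * (32 * in_width * fram_h);
    return rd_oram_ic + fh_i * (32 * in_width);
}

/* FRAM write offset for depth group icnum_i and row fh_i of the current pass. */
int fram_write_offset(int wr_fram_idx, int icnum_i, int fh_i, int fram_w, int fram_h)
{
    assert(icnum_i >= 0 && fh_i >= 0 && fh_i < fram_h);
    int wr_fram_ic = wr_fram_idx + icnum_i * (32 * fram_w * fram_h);
    return wr_fram_ic + fh_i * (32 * fram_w);
}

/* Amount added to wram_id per group of 32 output channels in loop body (7). */
int wram_group_stride(int kernel_w, int kernel_h, int in_ic32)
{
    return 32 * 32 * kernel_w * kernel_h * in_ic32;
}
```

As a worked example under these assumptions, with in_width=16 and fram_h=4 the second depth group (icnum_i=1) at row 2 reads from ORAM offset 3072, and a 3x3 kernel with in_ic32=2 advances wram_id by 18432 per output group.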
CN202210312413.8A 2022-03-28 2022-03-28 Implementation method of convolution of WRAM (write-read-write memory) capable of lowering weight Pending CN116861144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210312413.8A CN116861144A (en) 2022-03-28 2022-03-28 Implementation method of convolution of WRAM (write-read-write memory) capable of lowering weight

Publications (1)

Publication Number Publication Date
CN116861144A true CN116861144A (en) 2023-10-10

Family

ID=88234555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210312413.8A Pending CN116861144A (en) 2022-03-28 2022-03-28 Implementation method of convolution of WRAM (write-read-write memory) capable of lowering weight

Country Status (1)

Country Link
CN (1) CN116861144A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination