CN114330669B - A vector processor-oriented half-precision vectorized conv1×1 convolution method and system - Google Patents


Info

Publication number
CN114330669B
CN114330669B (application CN202111681136.XA)
Authority
CN
China
Prior art keywords
data
vector
weight
space
precision
Prior art date
Legal status
Active
Application number
CN202111681136.XA
Other languages
Chinese (zh)
Other versions
CN114330669A (en)
Inventor
许金伟
李娅琳
姜晶菲
苏华友
乔鹏
王庆林
李荣春
高蕾
窦勇
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111681136.XA
Publication of CN114330669A
Application granted
Publication of CN114330669B
Status: Active

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method and system. The method includes: storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR); invoking direct memory access (DMA) operations to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively; vectorizing, in the SM space, the weight data loaded into the on-chip SM space; and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for the vector-processor architecture and improves FLOPs while preserving accuracy.

Description

A vector processor-oriented half-precision vectorized conv1×1 convolution method and system

Technical Field

The invention relates to the technical field of vector processors, and in particular to a vector processor-oriented half-precision vectorized conv1×1 convolution method and system.

Background Art

The vector processor is a novel architecture. As shown in Figure 1, it comprises a scalar processing unit (SPU) for scalar operations, a vector processing unit (VPU) for vector operations, and a direct memory access (DMA) component responsible for data transfers, among others. The SPU consists of a scalar processing element (SPE) and a scalar memory (SM). The VPU consists of L vector processing elements (VPEs) and an array memory (AM); the L VPEs operate cooperatively in single-instruction multiple-data (SIMD) fashion, and each VPE integrates three vector arithmetic units that simultaneously support fixed-point and floating-point vector operations.

A single VPE can process, per operation, one 8-byte datum (e.g. FP64, Int64), two 4-byte data (e.g. FP32, Int32), or four 2-byte data (e.g. FP16). The DMA component handles data transfers between SM and DDR (double data rate synchronous dynamic random access memory) and between AM and DDR; the minimum granularity of its operations is also 8 bytes.

Convolution is one of the core computations of neural networks, and conv1×1 is the most common convolution configuration, so its efficiency strongly affects overall network performance; optimizing the convolution computation is therefore particularly important.
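The premise underlying the method is that a conv1×1 is exactly a matrix multiply. This can be checked directly with the following NumPy sketch (all sizes and names are illustrative assumptions, not the patent's code), which compares a per-channel weighted sum against a single M×K by K×N matrix product:

```python
import numpy as np

# Hypothetical small sizes: Co = M, Cin = K, image flattened to N = Hi*Wi*n.
Co, Cin, Hi, Wi, n = 4, 8, 3, 3, 2
rng = np.random.default_rng(0)
weight = rng.standard_normal((Co, Cin)).astype(np.float16)           # [Co, Cin] == M x K
feat = rng.standard_normal((Cin, Hi * Wi * n)).astype(np.float16)    # K x N

# conv1x1 computed naively: each output channel is a weighted sum over input channels.
out_naive = np.zeros((Co, Hi * Wi * n), dtype=np.float32)
for co in range(Co):
    for ci in range(Cin):
        out_naive[co] += weight[co, ci].astype(np.float32) * feat[ci].astype(np.float32)

# The same result as one matrix product Weight(M x K) @ Input(K x N).
out_matmul = weight.astype(np.float32) @ feat.astype(np.float32)
assert np.allclose(out_naive, out_matmul, atol=1e-3)
```

Accumulating in FP32 while keeping the operands in FP16 mirrors the precision regime the method targets.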

Summary of the Invention

In view of this, the present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method that, by exploiting the architectural features of the vector processor, vectorizes the conv1×1 convolution for the vector-processor architecture and improves FLOPs while preserving accuracy.

The present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method, including:

storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR);

invoking a direct memory access (DMA) operation to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

in the SM space, vectorizing the weight data loaded into the on-chip SM space; and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data;

wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution-kernel size. When the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the batch size of one convolution pass; [Hi, Wi, n] can be treated as one dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N the size of the image dimension.
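Because ks = 1, collapsing [Co, Cin, 1, 1] to M×K and [Cin, Hi, Wi, n] to K×N is a pure reinterpretation of the same memory, with no data movement. A short NumPy illustration (all sizes invented for the example):

```python
import numpy as np

Co, Cin, ks, Hi, Wi, n = 4, 8, 1, 3, 3, 2  # ks = 1 for conv1x1; sizes are illustrative
weight_ddr = np.arange(Co * Cin * ks * ks, dtype=np.float16).reshape(Co, Cin, ks, ks)
input_ddr = np.arange(Cin * Hi * Wi * n, dtype=np.float16).reshape(Cin, Hi, Wi, n)

# With ks == 1 the weight tensor [Co, Cin, 1, 1] is just an M x K matrix (M = Co,
# K = Cin), and [Hi, Wi, n] collapses into a single dimension N = Hi*Wi*n.
M, K = Co, Cin
N = Hi * Wi * n
W = weight_ddr.reshape(M, K)   # no copy needed: same contiguous buffer
I = input_ddr.reshape(K, N)
assert W.shape == (M, K) and I.shape == (K, N)
```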

Preferably, invoking the direct memory access operation to load the half-precision weight data and half-precision input data from the DDR into the on-chip SM space and the on-chip AM space, respectively, includes:

invoking a DMA operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K and x1 = ⌈M / m⌉, where m is determined jointly by the sizes of the SM space and the AM space;

invoking a DMA operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e. N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉; p denotes the number of vector functional units in the vector-processor architecture, and L the number of vector processing elements.
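Under this partitioning, the block counts follow directly from the sub-block shapes. A small sketch with assumed machine parameters (p, L, and m are placeholders for illustration, not the patent's actual hardware values):

```python
import math

# Illustrative machine parameters (assumptions, not the patent's actual values):
p, L = 2, 16           # p vector functional units, L vector processing elements
m = 3                  # rows of one weight sub-block, chosen to fit SM/AM capacity
n = p * L * 4          # columns of one input sub-block: each lane holds 4 FP16 values

M, K, N = 10, 8, 512   # overall matrix sizes (M = Co, K = Cin, N = Hi*Wi*batch)

x1 = math.ceil(M / m)  # number of weight sub-blocks Wb_sm, each m x K
x2 = math.ceil(N / n)  # number of input sub-blocks Ib_am, each K x n
print(x1, x2, n)       # prints: 4 4 128
```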

Preferably, vectorizing the weight data loaded into the on-chip SM space in the SM space and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data includes the following steps:

Step 1. Initialize i = 0, where i is the block index of the weight sub-block matrix Wb_sm(i) along the M dimension.

Step 2. Initialize j = 0, where j is the block index of the input sub-block matrix Ib_am(j) along the N dimension.

Step 3. Initialize k = 0, where k is the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e. a weight element is written Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1).

Step 4. Initialize the vector registers to 0 so that they can accumulate and store the computation results.

Step 5. The minimum granularity of a scalar load instruction is 4 bytes while a half-precision value occupies 2 bytes, so a single load brings two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register. Load the k-th column of the weight sub-block Wb_sm(i) in the SM space, Wb_sm(i,0,k) ... Wb_sm(i,m-1,k), into R[0:15] of the scalar registers R30, R31 ... R30+m-1; the (k+1)-th column, Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1), is simultaneously loaded into R[16:31] of the same registers.

Step 6. Using the half-precision weight data held in scalar registers R30, R31 ... R30+m-1, perform a low-half extension: replicate-extend the low 16 bits R[0:15] of each register's low 32 bits to d bits and store the result in scalar registers R40, R41 ... R40+m-1, where d is the bit width of one scalar register.

Step 7. Using the replicate-extended data held in scalar registers R40, R41 ... R40+m-1, perform broadcast operations in turn and store the results in vector registers VR50, VR51 ... VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is complete.
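Steps 5-7 can be mimicked in software: two adjacent FP16 weights arrive packed in one 32-bit word, the low (or, later, the high) 16 bits are extracted, and the value is broadcast to every lane. The sketch below models this with NumPy bit views; the lane count and weight values are assumptions, and it assumes a little-endian host:

```python
import numpy as np

L = 16  # number of vector lanes (VPEs); an assumption for illustration

# Two adjacent FP16 weights share one 4-byte scalar load: w[k] sits in bits
# R[0:15] and w[k+1] in bits R[16:31] of the loaded 32-bit value (little-endian).
w_pair = np.array([1.5, -2.25], dtype=np.float16)
reg32 = w_pair.view(np.uint32)[0]          # the packed word a scalar load would fetch

# Step 6 analogue (and Step 12 for the high half): extract each 16-bit half.
low16 = np.array([reg32 & 0xFFFF], dtype=np.uint16)
high16 = np.array([(reg32 >> 16) & 0xFFFF], dtype=np.uint16)

# Step 7 / Step 13 analogue: broadcast so every lane holds the same weight.
vr_k = np.broadcast_to(low16.view(np.float16), (L,))
vr_k1 = np.broadcast_to(high16.view(np.float16), (L,))
assert vr_k[0] == np.float16(1.5) and vr_k1[0] == np.float16(-2.25)
```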

Step 8. Load the k-th row of the input sub-block matrix Ib_am(j) in the AM space, Ib_am(j,k,0) ... Ib_am(j,k,n-1), into the p vector registers VR0, VR1 ... VRp-1, where p is the number of vector functional units in the very-long-instruction-word architecture; the minimum granularity of a single vector load is L × 8 bytes, so a single load brings in at least 4 × L half-precision values.

Step 9. Multiply-accumulate the vectorized data VR50 of Wb_sm(i,0,k) with the k-th row of Ib_am(j) held in VR0, VR1 ... VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 ... VR10+p-1.

Step 10. Vector registers VR51 ... VR50+m-1 hold the vectorized data of Wb_sm(i,1,k) ... Wb_sm(i,m-1,k), while VR0, VR1 ... VRp-1 hold the k-th row of the input sub-block Ib_am(j). Repeat Step 9: multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, the L vector processing elements operating in parallel, traversing the k-th column of Wb_sm(i) until the multiply-accumulate of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete.
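Steps 8-10 amount to a rank-1 update per shared index k: m broadcast weights times one input row, accumulated into m×p register-sized strips. A behavioural NumPy sketch under assumed sizes (register names and machine parameters are illustrative):

```python
import numpy as np

p, L = 2, 16
m, K = 3, 8
n = p * L * 4                      # one input sub-block is K x n

rng = np.random.default_rng(1)
Wb = rng.standard_normal((m, K)).astype(np.float16)   # weight sub-block in SM
Ib = rng.standard_normal((K, n)).astype(np.float16)   # input sub-block in AM

# m*p accumulator "vector registers", each holding L*4 FP16 lanes (modelled in FP32).
acc = np.zeros((m, p, L * 4), dtype=np.float32)

for k in range(K):                                    # loop over the shared K dimension
    row = Ib[k].reshape(p, L * 4).astype(np.float32)  # step 8: row k split into p registers
    for m1 in range(m):                               # steps 9-10: broadcast weight x row
        acc[m1] += np.float32(Wb[m1, k]) * row        # accumulate into register strip m1

out = acc.reshape(m, n)
assert np.allclose(out, Wb.astype(np.float32) @ Ib.astype(np.float32), atol=1e-2)
```

The final assertion confirms that the accumulated strips, laid side by side, equal the sub-block matrix product Wb × Ib.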

Step 11. Determine whether k + 1 is less than K; if so, continue with Step 12; otherwise jump to Step 19.

Step 12. Using the Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) data held in R[16:31] of scalar registers R30, R31 ... R30+m-1, perform a high-half extension: replicate-extend the high 16 bits R[16:31] of each register's low 32 bits to d bits and store the result in scalar registers R40, R41 ... R40+m-1, where d is the bit width of one scalar register.

Step 13. Using the replicate-extended data held in scalar registers R40, R41 ... R40+m-1, perform broadcast operations in turn and store the broadcast data in vector registers VR50, VR51 ... VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is complete.

Step 14. Load the (k+1)-th row of the input sub-block matrix Ib_am(j) in the AM space, Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1), into the p vector registers VR0, VR1 ... VRp-1, where p is the number of vector functional units in the very-long-instruction-word architecture; the minimum granularity of a single vector load is L × 8 bytes, so a single load brings in at least 4 × L half-precision values.

Step 15. Multiply-accumulate the vectorized data VR50 of Wb_sm(i,0,k+1) with the (k+1)-th row of Ib_am(j) held in VR0, VR1 ... VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 ... VR10+p-1.

Step 16. Vector registers VR51 ... VR50+m-1 hold the vectorized data of Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1), while VR0, VR1 ... VRp-1 hold the (k+1)-th row of Ib_am(j). Repeat Step 15: multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, the L vector processing elements operating in parallel, traversing the (k+1)-th column of Wb_sm(i) until the multiply-accumulate of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete.

Step 17. Let k = k + 2.

Step 18. Determine whether k is less than K; if so, return to Step 5; otherwise proceed to Step 19.

Step 19. Temporarily store the results held in vector registers VR10, VR11 ... VR10+m×p-1 at the AM-space location AM_temp.

Step 20. Invoke a DMA operation to store the feature-map results held at the AM-space location AM_temp to the designated DDR location.

Step 21. Let j = j + 1.

Step 22. Determine whether j is less than x2; if so, invoke a DMA operation to load the next input sub-block matrix Ib_am(j) into the on-chip AM space and, once loading completes, return to Step 3; otherwise proceed to Step 23.

Step 23. Let i = i + 1.

Step 24. Determine whether i is less than x1; if so, invoke a DMA operation to load the next weight sub-block matrix Wb_sm(i) into the on-chip SM space and, once loading completes, return to Step 2; otherwise the conv1×1 computation of all weight data W_ddr with all input data I_ddr is complete.
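The control flow of Steps 1-24 — an outer loop over weight sub-blocks, a middle loop over input sub-blocks, and an inner march down the shared K dimension two columns at a time — can be modelled end to end. This is a behavioural sketch of the blocking scheme with invented sizes, not the processor's actual instruction sequence:

```python
import math
import numpy as np

p, L = 2, 16
m = 4
n = p * L * 4                        # 128 columns per input sub-block
M, K, N = 8, 6, 256                  # chosen so blocks divide evenly, for brevity

rng = np.random.default_rng(2)
W_ddr = rng.standard_normal((M, K)).astype(np.float16)
I_ddr = rng.standard_normal((K, N)).astype(np.float16)
O_ddr = np.zeros((M, N), dtype=np.float32)

x1, x2 = math.ceil(M / m), math.ceil(N / n)
for i in range(x1):                               # Step 24 loop: weight sub-blocks
    Wb = W_ddr[i * m:(i + 1) * m]                 # "DMA" Wb_sm(i) into SM
    for j in range(x2):                           # Step 22 loop: input sub-blocks
        Ib = I_ddr[:, j * n:(j + 1) * n]          # "DMA" Ib_am(j) into AM
        acc = np.zeros((m, n), dtype=np.float32)  # Step 4: clear accumulators
        k = 0
        while k < K:                              # Steps 5-18: two columns per pass
            for kk in (k, k + 1):
                if kk >= K:                       # Step 11: no column k+1 left
                    break
                acc += Wb[:, kk:kk + 1].astype(np.float32) * Ib[kk].astype(np.float32)
            k += 2                                # Step 17
        O_ddr[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc   # Steps 19-20: store tile

assert np.allclose(O_ddr, W_ddr.astype(np.float32) @ I_ddr.astype(np.float32), atol=1e-2)
```

The closing assertion checks the blocked traversal against a plain matrix product, which is what the conv1×1 reduces to.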

A vector processor-oriented half-precision vectorized conv1×1 convolution system includes:

a storage module configured to store half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR);

a loading module configured to invoke direct memory access (DMA) operations to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

a processing module configured to vectorize, in the SM space, the weight data loaded into the on-chip SM space and, in the AM space, to perform the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data;

wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution-kernel size. When the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the batch size of one convolution pass; [Hi, Wi, n] can be treated as one dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N the size of the image dimension.

Preferably, the loading module is specifically configured to:

invoke a DMA operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K and x1 = ⌈M / m⌉, where m is determined jointly by the sizes of the SM space and the AM space; and

invoke a DMA operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e. N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉; p denotes the number of vector functional units in the vector-processor architecture, and L the number of vector processing elements.

Preferably, the processing module is specifically configured to perform the following steps:

步骤1、初始化i=0,其中,i表示权值子块矩阵Wbsm(i)在M维上的块索引;Step 1, initialize i=0, wherein, i represents the block index of the weight sub-block matrix Wb sm(i) in the M dimension;

步骤2、初始化j=0,其中,j表示输入子块矩阵Ibam(j0在N维上的块索引;Step 2, initialize j=0, wherein, j represents the block index of the input sub-block matrix Ib am (j0 on the N dimension;

步骤3、初始化k=0,其中,k表示权值子块Wbsm的列索引和输入子块Ibam的行索引,m1表示权值子块的行索引,n1表示输入子块的列索引,即,权值子块表示为Wbsm(i,m1,k),输入子块表示为Ibam(j,k,n1)Step 3. Initialize k=0, where k represents the column index of the weight sub-block Wb sm and the row index of the input sub-block Ib am , m1 represents the row index of the weight sub-block, n1 represents the column index of the input sub-block, That is, the weight sub-block is represented as Wb sm(i,m1,k) , and the input sub-block is represented as Ib am(j,k,n1) ;

步骤4、将向量寄存器初始化为0,以便向量寄存器累加并存储计算结果;Step 4. Initialize the vector register to 0, so that the vector register can accumulate and store the calculation result;

步骤5、标量加载指令的最小粒度为4字节,半精度数据为2字节,单次将加载两个半精度数据到指定标量寄存器的R[0:15]和R[16:31],将所述SM空间中的权值子块Wbsm(i)的第k列数据Wbsm(i,0,k)……Wbsm(i,m-1,k)依次加载到标量寄存器R30、R31...R30+m-1的R[0:15]中,同时权值子块Wbsm(i)的第k+1列数据Wbsm(i,0,k+1)……Wbsm(i,m-1,k+1)依次加载到标量寄存器R30、R31...R30+m-1的R[16:31]中;Step 5. The minimum granularity of the scalar load instruction is 4 bytes, and the half-precision data is 2 bytes. Two half-precision data will be loaded into R[0:15] and R[16:31] of the specified scalar register at a time. Load the kth column data Wb sm(i, 0, k) ... Wb sm(i, m-1, k) of the weight sub-block Wb sm(i) in the SM space into the scalar register R 30 , R 31 ...R 30+m-1 in R[0:15], the k+1th column data Wb sm(i,0,k+1) of the weight sub-block Wb sm(i) at the same time ...Wb sm(i,m-1,k+1) are sequentially loaded into R[16:31] of scalar registers R 30 , R 31 ... R 30+m-1 ;

步骤6、基于标量寄存器R30、R31...R30+m-1存放的半精度权值数据,对标量寄存器R30、R31...R30+m-1进行低位扩展操作,将寄存器中低32位中低16位数据R[0:15]复制扩展为d位数据存储在标量寄存器R40、R41...R40+m-1中,其中,d为一个标量寄存器的位长;Step 6. Based on the half - precision weight data stored in the scalar registers R 30 , R 31 . Copy and expand the lower 16-bit data R[0:15] in the lower 32 bits of the register into d-bit data and store them in the scalar registers R 40 , R 41 ... R 40+m-1 , where d is a scalar register bit length;

步骤7、基于标量寄存器R40、R41...R40+m-1存放的复制扩展后的数据,对标量寄存器R40、R41...R40+m-1依次进行广播操作并将数据储存在向量寄存器vr50、vr51...VR50+m-1中,L个向量处理部件存储相同的数据,Wbsm(i)的第k列数据向量化完成;Step 7. Based on the replicated and expanded data stored in the scalar registers R 40 , R 41 . . . R 40+m-1 , the scalar registers R 40 , R 41 . The data is stored in the vector registers vr 50 , vr 51 . . . VR 50+m-1 , the L vector processing components store the same data, and the data in the kth column of Wb sm(i) is vectorized;

步骤8、将所述AM空间中的输入子块矩阵Ibam(j)的第k行数据Ibam(j,k,0)……Ibam(j,k,n-1)加载到p个向量寄存器VR0、VR1...VRp-1中,p表示超长数据指令字的体系结构中功能向量运算单元部件的数量,单次加载最小粒度为

Figure BDA0003447171650000081
个字节,故单次最少可加载
Figure BDA0003447171650000082
个半精度数据;Step 8. Load the k-th row data Ib am(j, k, 0) ... Ib am(j, k, n-1) of the input sub-block matrix Ib am(j) in the AM space to p In the vector registers VR 0 , VR 1 ... VR p-1 , p represents the number of functional vector arithmetic unit components in the architecture of the super-long data instruction word, and the minimum granularity of a single load is
Figure BDA0003447171650000081
bytes, so at least one can be loaded at a time
Figure BDA0003447171650000082
half-precision data;

步骤9、将Wbsm(i,0,k)向量化后的数据VR50分别与Ibam(j)的第k行数据VR0、VR1...VRp-1做乘加操作,同时L个向量处理部件并行操作,将计算结果存在向量寄存器VR10、VR11...VR10+p-1中;Step 9. Perform multiplication and addition operations on the vectorized data VR 50 of Wb sm(i, 0, k) and the k - th row data VR 0 , VR 1 . The L vector processing components operate in parallel, and store the calculation results in the vector registers VR 10 , VR 11 . . . VR 10+p-1 ;

步骤10、基于向量寄存器VR51...VR50+m-1储存的是权值子块Wbsm(i,1,k)……Wbsm(i,m-1,k)的向量化数据,向量寄存器VR0、VR1...VRp-1中储存的是输入子块Ibam(j)的第k行数据,重复步骤9,将权值的各组向量化数据分别与Ibam(j)的第k行数据相乘,并将相乘结果累加到向量寄存器VR10+p、VR10+p+1...VR10+m×p-1上,该过程L个向量处理部件同时并行操作,遍历Wbsm(i)的第k列数据,直至Wbsm(i)的第k列和Ibam(j)的k行的乘加计算完成;Step 10. Based on the vector registers VR 51 ...VR 50+m-1 store the vectorized data of the weight sub-blocks Wb sm(i, 1, k) ... , the vector registers VR 0 , VR 1 . . . VR p-1 store the data of the kth row of the input sub-block Ib am(j) , repeat step 9, and compare each group of vectorized data of the weight with Ib am Multiply the data in the kth row of (j) , and accumulate the multiplied results to the vector registers VR 10+p , VR 10+p+1 ... VR 10+m×p-1 , in this process L vector processing The components operate in parallel at the same time, traverse the data of the kth column of Wb sm( i ), until the multiplication and addition calculation of the kth column of Wb sm(i) and the k row of Ib am(j) is completed;

步骤11、判断k+1是否小于K,若是,则跳转执行步骤19,若否,则继续执行步骤12;Step 11, judge whether k+1 is less than K, if yes, then jump to step 19, if not, continue to execute step 12;

步骤12、基于标量寄存器R30、R31...R30+m-1的R[16:31]中存放的Wbsm(i,1,k+1)……Wbsm(i,m-1,k+1)数据,对标量寄存器R30、R31...R30+m-1进行高位扩展操作,将寄存器中低32位中高16位数据R[16:31],复制扩展为d位数据存储在标量寄存器R40、R41...R40+m-1中,d为一个标量寄存器的位长;Step 12. Based on the Wb sm (i, 1, k+1) stored in R[16:31] of the scalar registers R 30 , R 31 . 1,k+1) data, perform high-order expansion operation on scalar registers R 30 , R 31 ... R 30+m-1 , and copy and expand the lower 32-bit, middle-high 16-bit data R[16:31] in the register as d-bit data is stored in scalar registers R 40 , R 41 . . . R 40+m-1 , where d is the bit length of a scalar register;

步骤13、基于标量寄存器R40、R41...R40+m-1存放的复制扩展后的数据,对标量寄存器R40、R41...R40+m-1依次进行广播操作,将广播后的数据储存在向量寄存器VR50、VR51...VR50+m-1中,L个向量处理部件存储相同的数据,Wbsm(i)的第k+1列数据向量化完成;Step 13: Based on the replicated and expanded data stored in the scalar registers R 40 , R 41 . Store the broadcasted data in the vector registers VR 50 , VR 51 . . . VR 50+m-1 , the L vector processing units store the same data, and the vectorization of the data in the k+1th column of Wb sm(i) is completed ;

步骤14、将所述AM空间中的输入子块矩阵Ibam(j)的第k+1行数据Ibam(j,k+1,0)……Ibam(j,k+1,n-1)加载到p个向量寄存器VR0、VR1...VRp-1中,p表示超长数据指令字的体系结构中功能向量运算单元部件的数量,单次加载最小粒度为

Figure BDA0003447171650000091
个字节,故单次最少可加载
Figure BDA0003447171650000092
个半精度数据;Step 14: Convert the k+1 row data Ib am(j,k+1,0) of the input sub-block matrix Ib am(j) in the AM space to Ib am(j,k+1,n- 1) Load into p vector registers VR 0 , VR 1 ... VR p-1 , p represents the number of functional vector arithmetic unit components in the architecture of the super-long data instruction word, and the minimum granularity of a single load is
Figure BDA0003447171650000091
bytes, so at least one can be loaded at a time
Figure BDA0003447171650000092
half-precision data;

步骤15、将Wbsm(i,0,k+1)向量化后的数据VR50分别与Ibam(j)的第k+1行数据VR0、VR1...VRp-1做乘加操作,同时L个向量处理部件并行操作,将计算结果存在向量寄存器VR10、VR11...VR10+p-1中;Step 15. Multiply the data VR 50 after vectorization of Wb sm(i, 0, k+1) by the data VR 0 , VR 1 . . . VR p-1 of the k+1th row of Ib am(j) respectively Add operation, while L vector processing components operate in parallel, and store the calculation results in vector registers VR 10 , VR 11 . . . VR 10+p-1 ;

步骤16、基于向量寄存器VR51...VR50+m-1储存的是权值子块Wbsm(i,1,k+1)……Wbsm(i,m-1,k+1)的向量化数据,向量寄存器VR0、VR1...VRp-1中储存的是输入子块Ibam(j)的第k+1行数据,重复步骤15,将权值的各组向量化数据分别与Ibam(j)的第k+1行数据相乘,并将相乘结果累加至向量寄存器VR10+p、VR10+p+1...VR10+m×p-1上,该过程L个向量处理部件同时并行操作,遍历Wbsm(i)的第k+1列数据,直至Wbsm(i)的第k+1列和Ibam(j)的k+1行的乘加计算完成;Step 16. Based on the vector registers VR 51 ... VR 50+m-1 , the weight sub-blocks Wb sm(i, 1, k+1) ... Wb sm(i, m-1, k+1) are stored The vectorized data of , the vector registers VR 0 , VR 1 . . . VR p-1 store the data of the k+1th row of the input sub-block Ib am(j) . Multiply the data with the k+1th row data of Ib am(j) respectively, and accumulate the multiplication results to the vector registers VR 10+p , VR 10+p+1 ... VR 10+m×p-1 In this process, L vector processing components operate in parallel at the same time, traversing the data of the k+1th column of Wb sm( i ) until the k+1th column of Wb sm(i) and the k+1 row of Ib am(j) The multiplication and addition calculation is completed;

步骤17、令k=k+2;Step 17, let k=k+2;

步骤18、判断k是否小于K,若是,则返回步骤5,若否,则执行步骤19;Step 18, determine whether k is less than K, if so, go back to step 5, if not, go to step 19;

步骤19、将储存在向量寄存器VR10、VR11...VR10+m×p-1中的数据结果暂时存储到AM空间位置AMtempStep 19, temporarily store the data results in the vector registers VR 10 , VR 11 . . . VR 10+m×p-1 to the AM space position AM temp ;

步骤20、调用直接存储器访问操作,将所述AM空间位置AMtemp储存的特征图数据结果存储至双倍速率同步动态随机存储器指定位置;Step 20, call direct memory access operation, the characteristic map data result that described AM space position AM temp is stored is stored to double rate synchronous dynamic random access memory designated position;

步骤21、令j=j+1;Step 21, let j=j+1;

步骤22、判断j是否小于x2,若是,则调用直接存储器访问操作,将输入子块矩阵Ibam(j)加载到片上AM空间中,加载完后返回步骤3,若否,则执行步骤23;Step 22, determine whether j is less than x 2 , if so, call the direct memory access operation, load the input sub-block matrix Ib am(j) into the on-chip AM space, and return to step 3 after loading, if not, execute step 23 ;

Step 23. Let i = i + 1;

Step 24. If i is less than x1, invoke a direct memory access operation to load the weight sub-block matrix Wbsm(i) into the on-chip SM space and return to step 2; otherwise, the conv1×1 computation over all the weight data Wddr and input data Iddr is complete.

In summary, the present invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method. First, half-precision weight data and half-precision input data are stored in a double data rate synchronous dynamic random access memory (DDR SDRAM). A direct memory access operation is then invoked to load the half-precision weight data and input data from the DDR SDRAM into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively. In the SM space, the loaded weight data is vectorized; in the AM space, the vectorized weight data is convolved (conv1×1) with the input data to obtain the convolved feature map data. The half-precision weight data Weightddr has the format [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K. The half-precision input data Inputddr has the format [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be treated as a single dimension with N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 computation for the vector processor architecture and improves FLOPs while preserving accuracy.
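As an illustrative sketch (not the patent's register-level kernel), the reduction described above — a 1×1 convolution collapsing to a matrix product of the reshaped weights (M×K) and reshaped inputs (K×N) — can be checked in a few lines of NumPy; all sizes below are small example values, not values from the patent:

```python
# conv1x1 as GEMM: weights [Co, Cin, 1, 1] -> M x K, inputs [Cin, Hi, Wi, n] -> K x N
import numpy as np

rng = np.random.default_rng(0)
Co, Cin, Hi, Wi, n = 6, 4, 4, 4, 2            # small illustrative sizes
M, K, N = Co, Cin, Hi * Wi * n

weight = rng.standard_normal((Co, Cin, 1, 1)).astype(np.float16)
inp = rng.standard_normal((Cin, Hi, Wi, n)).astype(np.float16)

W = weight.reshape(M, K)                      # [Co, Cin, 1, 1] viewed as [Co, Cin]
I = inp.reshape(K, N)                         # [Hi, Wi, n] flattened into one axis
out_gemm = W.astype(np.float32) @ I.astype(np.float32)

# reference: direct 1x1 convolution (sum over input channels per pixel/batch slot)
ref = np.einsum('ck,khwn->chwn', W.astype(np.float32), inp.astype(np.float32))
assert np.allclose(out_gemm, ref.reshape(M, N))
```

The check confirms that, for kernel size 1, the two formulations are the same computation up to reshaping, which is what permits the blocked GEMM treatment in the steps that follow.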

Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the general architecture of a vector processor;

Fig. 2 is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method provided by the present invention;

Fig. 3 is a schematic diagram of the scalar loading of Wbsm(0,m1,k) disclosed in the present invention;

Fig. 4 is a schematic diagram of the low 16-bit extension of a scalar register disclosed in the present invention;

Fig. 5 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;

Fig. 6 is a schematic diagram of the vector loading of Ibam(0,0,n1) disclosed in the present invention;

Fig. 7 is a schematic diagram of the vector multiply-accumulate of Wbsm(i,0,k) with the k-th row of the input disclosed in the present invention;

Fig. 8 is a schematic diagram of the vector multiply-accumulate of the k-th column of the weights with the k-th row of the input disclosed in the present invention;

Fig. 9 is a schematic diagram of the high 16-bit extension of a scalar register disclosed in the present invention;

Fig. 10 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;

Fig. 11 is a schematic diagram of the vector loading of Ibam(0,1,n1) disclosed in the present invention;

Fig. 12 is a schematic diagram of the vector multiply-accumulate of Wbsm(i,0,k+1) with the (k+1)-th row of the input disclosed in the present invention;

Fig. 13 is a schematic diagram of the vector multiply-accumulate of the (k+1)-th column of the weights with the (k+1)-th row of the input;

Fig. 14 is a schematic diagram of the vector multiply-accumulate of the last column of the weights with the last row of the input;

Fig. 15 is a schematic structural diagram of an embodiment of a vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

As shown in Fig. 2, a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention, the method may include the following steps:

S201. Store the half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;

When vectorized convolution of half-precision data on a vector processor is required, the half-precision weight data and half-precision input data are first stored in DDR (double data rate synchronous dynamic random access memory). The half-precision weight data Weightddr has the format [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the format can also be regarded as [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K. The half-precision input data Inputddr has the format [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be treated as a single dimension with N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.

S202. Invoke a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

Specifically, a direct memory access operation is invoked to load the half-precision weight matrix Wddr into the on-chip SM space, partitioning the original data along the M dimension (the output-channel dimension) into x1 Wbsm matrices, so that Wsm = x1 × Wbsm, with Wbsm = m×K and

x1 = ⌈M / m⌉,

where the size of m is jointly determined by the sizes of the SM space and the AM space. For example, the weight data block Wbsm determined by m must not exceed the SM space, and the sum of the size of the output of convolving a weight block with an input block and the size of the input block must be smaller than the AM space.

A direct memory access operation is invoked to load the half-precision input matrix Iddr into the on-chip AM space, partitioning the original data along the N dimension (the image dimension) into x2 Ibam matrices, so that Iam = x2 × Ibam, with Ibam = K×n. That is, N = x2 × n, where n = p×L×4 and

x2 = ⌈N / n⌉,

where p denotes the number of vector functional units in the architecture of the vector processor and L denotes the number of vector processing elements.
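The partitioning just described — x1 weight blocks of m rows along M, x2 input blocks of n columns along N, with n = p×L×4 — can be sketched as plain index arithmetic; the sizes M, K, N, m, p, and L below are assumed example values chosen only so the blocks divide evenly:

```python
# block-count arithmetic for the weight (M x K) and input (K x N) partitions
import math

M, K, N = 96, 64, 1024        # assumed matrix sizes for illustration
p, L = 2, 8                   # example unit counts, as in the embodiment below
m = 6                         # assumed to fit the SM/AM capacity constraints
n = p * L * 4                 # one input-block row fills p vector registers

x1 = math.ceil(M / m)         # number of weight blocks Wb_sm
x2 = math.ceil(N / n)         # number of input blocks Ib_am

# half-open [start, end) row/column ranges of each block
weight_blocks = [(i * m, min((i + 1) * m, M)) for i in range(x1)]
input_blocks = [(j * n, min((j + 1) * n, N)) for j in range(x2)]
```

Each (i, j) pair of blocks then yields one m×n tile of the output, which is the unit of work in steps 1–24 of the embodiment.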

S203. In the SM space, vectorize the weight data loaded into the on-chip SM space; in the AM space, perform the convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.

Specifically, the following steps may be included:

Step 1. Initialize i = 0, where i denotes the block index of the weight sub-block matrix Wbsm(i) in the M dimension;

Step 2. Initialize j = 0, where j denotes the block index of the input sub-block matrix Ibam(j) in the N dimension;

Step 3. Initialize k = 0, where k denotes the column index of the weight sub-block Wbsm and the row index of the input sub-block Ibam, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block; that is, the weight sub-block is denoted Wbsm(i,m1,k) and the input sub-block Ibam(j,k,n1);

Step 4. Initialize the vector registers to 0 so that they can accumulate and store the computation results;

Step 5. The minimum granularity of a scalar load instruction is 4 bytes while half-precision data occupies 2 bytes, so a single load places two half-precision values into R[0:15] and R[16:31] of the designated scalar register. Load the k-th column data Wbsm(i,0,k)...Wbsm(i,m-1,k) of the weight sub-block Wbsm(i) in the SM space sequentially into R[0:15] of scalar registers R30, R31...R30+m-1, while the (k+1)-th column data Wbsm(i,0,k+1)...Wbsm(i,m-1,k+1) of Wbsm(i) is loaded sequentially into R[16:31] of scalar registers R30, R31...R30+m-1;

For example, take the first weight sub-block Wbsm(0) = 6×4, with m = 6 and K = 4. When k = 0, scalar load instructions load the data of the first column of Wbsm(0) into R[0:15] of scalar registers R30, R31...R30+m-1 while loading the data of the second column of Wbsm(0) into R[16:31] of R30, R31...R30+m-1, as shown in Fig. 3.

Step 6. Based on the half-precision weight data held in scalar registers R30, R31...R30+m-1, perform a low-half extension on R30, R31...R30+m-1, replicating the low 16 bits R[0:15] of the low 32 bits of each register into d-bit data stored in scalar registers R40, R41...R40+m-1, where d is the bit width of a scalar register;

For example, with d = 64, the extension instruction for the low 16 bits of the low 32 bits in step 6 is implemented as shown in Fig. 4.

Step 7. Based on the replicated and extended data held in scalar registers R40, R41...R40+m-1, broadcast R40, R41...R40+m-1 in turn and store the data in vector registers VR50, VR51...VR50+m-1, so that the L vector processing elements hold identical data; the vectorization of the k-th column of Wbsm(i) is then complete;

For example, with L = 8, the broadcast of scalar register R40 to vector register VR50 is implemented as shown in Fig. 5.
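The register-level effect of steps 5–7 (and the later step 12) can be emulated with bit arithmetic. This is a sketch of the assumed semantics only — the pack/extend/broadcast behavior described in the text, not actual ISA code — using d = 64 and L = 8 as in the examples, and arbitrary weight values:

```python
# emulate: 4-byte scalar load packing two fp16 weights, low/high-half extend, broadcast
import numpy as np

w_k = np.array(1.5, dtype=np.float16)      # Wb_sm(i, m1, k), destined for R[0:15]
w_k1 = np.array(-2.0, dtype=np.float16)    # Wb_sm(i, m1, k+1), destined for R[16:31]

# scalar load: two consecutive fp16 values packed into one 32-bit register R30
r30 = (int(w_k1.view(np.uint16)) << 16) | int(w_k.view(np.uint16))

# step 6: low-half extend -- replicate R[0:15] into all four 16-bit slots of d = 64 bits
low16 = r30 & 0xFFFF
r40 = low16 | (low16 << 16) | (low16 << 32) | (low16 << 48)

# step 7: broadcast -- all L = 8 lanes of the vector register hold the same 64-bit pattern
vr50 = np.full(8, r40, dtype=np.uint64)
lanes = vr50.view(np.uint16).reshape(8, 4)  # 8 lanes x 4 fp16 slots (little-endian view)
assert np.all(lanes.view(np.float16) == w_k)

# step 12 (second pass): the high-half extend picks out the column-(k+1) weight instead
high16 = (r30 >> 16) & 0xFFFF
assert np.array(high16, dtype=np.uint16).view(np.float16) == w_k1
```

The point of the packing is visible here: one 4-byte scalar load supplies the weight for both column k and column k+1, which is why the main loop advances k by 2.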

Step 8. Load the k-th row data Ibam(j,k,0)...Ibam(j,k,n-1) of the input sub-block matrix Ibam(j) in the AM space into the p vector registers VR0, VR1...VRp-1, where p denotes the number of vector functional units in the very long instruction word (VLIW) architecture; the minimum granularity of a single vector load is

(L×d)/8 bytes,

so a single load can fetch at least

(L×d)/16 half-precision values;

For example, take the first input sub-block Ibam(0) = 4×64, with K = 4 and N = 64. When k = 0, a vector load instruction loads the data of the first row of Ibam(0) into the p vector registers VR0, VR1...VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in Fig. 6.
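The layout assumed by this example — one input row of n = p×L×4 half-precision values filling p vector registers, each register holding L lanes of four 16-bit slots — can be sketched as a reshape. The contiguous per-register ordering below is an assumption for illustration; the actual lane interleaving is defined by the hardware:

```python
# sketch of the step-8 layout: row of n = p*L*4 fp16 values across p vector registers
import numpy as np

p, L = 2, 8
n = p * L * 4                               # 64 elements, matching Ib_am(0) = 4 x 64
row = np.arange(n, dtype=np.float16)        # stand-in for Ib_am(j, k, 0..n-1)

vregs = row.reshape(p, L, 4)                # register index x lane index x fp16 slot
VR0, VR1 = vregs[0], vregs[1]
assert VR0[0].tolist() == [0.0, 1.0, 2.0, 3.0]
```

With L = 8 and p = 2 this reproduces the example's counts: each register carries 32 half-precision values, and two registers carry the full 64-element row.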

Step 9. Multiply the vectorized data VR50 of Wbsm(i,0,k) with the k-th row data VR0, VR1...VRp-1 of Ibam(j) in multiply-accumulate operations. Because the architecture integrates p vector functional units, these multiply-accumulates can issue in the same cycle, with the L vector processing elements operating in parallel; the results are stored in vector registers VR10, VR11...VR10+p-1;

For example, VR50 is multiply-accumulated with VR0 and VR1 (taking L = 8, p = 2), and the results are kept in VR10 and VR11; since the initial values of VR10 and VR11 are 0, the multiply-accumulate result is just the product, as shown in Fig. 7.

Step 10. Since vector registers VR51...VR50+m-1 hold the vectorized data of the weight sub-blocks Wbsm(i,1,k)...Wbsm(i,m-1,k) and vector registers VR0, VR1...VRp-1 hold the k-th row of the input sub-block Ibam(j), repeat step 9: multiply each group of vectorized weight data with the k-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1...VR10+m×p-1, with the L vector processing elements operating in parallel, traversing the k-th column of Wbsm(i) until the multiply-accumulate of the k-th column of Wbsm(i) with the k-th row of Ibam(j) is complete, as shown in Fig. 8;
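Stripped of the register allocation, steps 9–10 compute the block product Wbsm(i) × Ibam(j) as a sum of rank-1 updates: for each k, every broadcast weight scalar is multiplied element-wise with the loaded input row and accumulated. A plain-array sketch (example sizes only, float32 accumulators standing in for the VR10... registers):

```python
# rank-1-update view of the steps 9-10 inner kernel, checked against a reference GEMM
import numpy as np

rng = np.random.default_rng(1)
m, K, n = 6, 4, 64
Wb = rng.standard_normal((m, K)).astype(np.float16)   # weight sub-block Wb_sm(i)
Ib = rng.standard_normal((K, n)).astype(np.float16)   # input sub-block Ib_am(j)

acc = np.zeros((m, n), dtype=np.float32)   # stand-in for accumulators VR10..VR10+m*p-1
for k in range(K):                         # traverse weight columns / input rows
    row_k = Ib[k].astype(np.float32)       # vector-loaded input row (steps 8/14)
    for m1 in range(m):                    # one broadcast weight per sub-block row
        acc[m1] += np.float32(Wb[m1, k]) * row_k   # fused multiply-accumulate (step 9)

assert np.allclose(acc, Wb.astype(np.float32) @ Ib.astype(np.float32), atol=1e-4)
```

The m1 loop is what the m weight vector registers VR50...VR50+m-1 unroll in hardware, and the element-wise width of `row_k` is what the p×L lanes cover per cycle.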

Step 11. If k+1 is less than K, continue to step 12; otherwise, jump to step 19;

Step 12. Based on the Wbsm(i,0,k+1)...Wbsm(i,m-1,k+1) data held in R[16:31] of scalar registers R30, R31...R30+m-1, perform a high-half extension on R30, R31...R30+m-1, replicating the high 16 bits R[16:31] of the low 32 bits of each register into d-bit data stored in scalar registers R40, R41...R40+m-1, where d is the bit width of a scalar register;

For example, with d = 64, the extension instruction for the high 16 bits of the low 32 bits in step 12 is implemented as shown in Fig. 9.

Step 13. Based on the replicated and extended data held in scalar registers R40, R41...R40+m-1, broadcast R40, R41...R40+m-1 in turn and store the broadcast data in vector registers VR50, VR51...VR50+m-1, so that the L vector processing elements hold identical data; the vectorization of the (k+1)-th column of Wbsm(i) is then complete;

For example, when k = 0, the (k+1)-th column of Wbsm(i) is vectorized as described; the broadcast is implemented as shown in Fig. 10.

Step 14. Load the (k+1)-th row data Ibam(j,k+1,0)...Ibam(j,k+1,n-1) of the input sub-block matrix Ibam(j) in the AM space into the p vector registers VR0, VR1...VRp-1, where p denotes the number of vector functional units in the VLIW architecture; the minimum granularity of a single vector load is

(L×d)/8 bytes,

so a single load can fetch at least

(L×d)/16 half-precision values;

For example, take the first input sub-block Ibam(0) = 4×64, with K = 4 and N = 64. When k+1 = 1, a vector load instruction loads the data of the second row of Ibam(0) into the p vector registers VR0, VR1...VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in Fig. 11.

Step 15. Multiply the vectorized data VR50 of Wbsm(i,0,k+1) with the (k+1)-th row data VR0, VR1...VRp-1 of Ibam(j) in multiply-accumulate operations. Because the architecture integrates p vector functional units, these multiply-accumulates can issue in the same cycle, with the L vector processing elements operating in parallel; the results are stored in vector registers VR10, VR11...VR10+p-1;

For example, when k+1 = 1, VR50 is multiply-accumulated with VR0 and VR1, adding onto the row-k multiply-accumulate data already in VR10 and VR11, and the results remain in VR10 and VR11; taking L = 8 and p = 2, the implementation is shown in Fig. 12.

Step 16. Since vector registers VR51...VR50+m-1 hold the vectorized data of the weight sub-blocks Wbsm(i,1,k+1)...Wbsm(i,m-1,k+1) and vector registers VR0, VR1...VRp-1 hold the (k+1)-th row of the input sub-block Ibam(j), repeat step 15: multiply each group of vectorized weight data with the (k+1)-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1...VR10+m×p-1, with the L vector processing elements operating in parallel, traversing the (k+1)-th column of Wbsm(i) until the multiply-accumulate of the (k+1)-th column of Wbsm(i) with the (k+1)-th row of Ibam(j) is complete, as shown in Fig. 13;

Step 17. Let k = k + 2;

Step 18. If k is less than K, return to step 5; otherwise, proceed to step 19;

Step 19. At this point the conv1×1 computation of the weight sub-block matrix Wbsm(i) with the input sub-block matrix Ibam(j) is complete. When Wbsm(i) has been traversed to its last column and Ibam(j) to its last row, as shown in Fig. 14, temporarily store the results held in vector registers VR10, VR11...VR10+m×p-1 at the AM location AMtemp;

Step 20. Invoke a direct memory access operation to store the feature map data held at the AM location AMtemp to the designated location in the double data rate synchronous dynamic random access memory;

Step 21. Let j = j + 1;

Step 22. If j is less than x2, invoke a direct memory access operation to load the input sub-block matrix Ibam(j) into the on-chip AM space, return to step 3, and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; otherwise, proceed to step 23;

Step 23. Let i = i + 1;

Step 24. If i is less than x1, invoke a direct memory access operation to load the weight sub-block matrix Wbsm(i) into the on-chip SM space, return to step 2, and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; otherwise, the conv1×1 computation over all the weight data Wddr and input data Iddr is complete.
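The loop nest of steps 1–24 can be condensed into an end-to-end sketch: a pure-NumPy stand-in (no DMA, SM, or AM modeling) that iterates weight blocks i, input blocks j, and column pairs k, and checks the assembled output against a reference GEMM. Sizes are assumed example values chosen to divide evenly, with K even so the k/k+1 pairing is exact:

```python
# end-to-end blocked conv1x1-as-GEMM, mirroring the i/j/k loop structure of steps 1-24
import math
import numpy as np

rng = np.random.default_rng(2)
M, K, N = 12, 4, 128
m, n = 6, 64                               # block sizes (assumed to fit SM/AM)
x1, x2 = math.ceil(M / m), math.ceil(N / n)

W = rng.standard_normal((M, K)).astype(np.float16)   # weight matrix in "DDR"
I = rng.standard_normal((K, N)).astype(np.float16)   # input matrix in "DDR"
out = np.zeros((M, N), dtype=np.float32)

for i in range(x1):                        # step 24: next weight block into SM
    Wb = W[i * m:(i + 1) * m]
    for j in range(x2):                    # step 22: next input block into AM
        Ib = I[:, j * n:(j + 1) * n]
        acc = np.zeros((m, n), dtype=np.float32)     # step 4: zeroed accumulators
        for k in range(0, K, 2):           # steps 5-17: two columns/rows per pass
            for kk in (k, k + 1):
                if kk < K:                 # step 11 guard for the second column
                    acc += np.outer(Wb[:, kk].astype(np.float32),
                                    Ib[kk].astype(np.float32))
        out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc   # steps 19-20: AM_temp -> DDR

ref = W.astype(np.float32) @ I.astype(np.float32)
assert np.allclose(out, ref, atol=1e-3)
```

The check confirms that the blocked traversal produces exactly the full GEMM, i.e., the conv1×1 result, tile by tile.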

In summary, the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention exploits the architectural features of the vector processor to vectorize the conv1×1 computation for the vector processor architecture, improving FLOPs while preserving accuracy.

As shown in Fig. 15, a schematic structural diagram of an embodiment of a vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention, the system may include:

a storage module 1501, configured to store the half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;

a loading module 1502, configured to invoke a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

a processing module 1503, configured to vectorize, in the SM space, the weight data loaded into the on-chip SM space, and to perform, in the AM space, the convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
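As a minimal structural sketch of the three modules in Fig. 15 — the class name, method names, and the dictionary standing in for DDR are all assumptions for illustration, and the processing module is collapsed to a plain matrix product:

```python
# store -> load -> process pipeline mirroring modules 1501/1502/1503
import numpy as np

class Conv1x1System:
    def __init__(self):
        self.ddr = {}                       # stand-in for the DDR SDRAM

    def store(self, weight, inp):           # storage module 1501
        self.ddr['W'], self.ddr['I'] = weight, inp

    def load(self):                         # loading module 1502 (DMA stand-in)
        self.sm = self.ddr['W']             # weights -> on-chip SM space
        self.am = self.ddr['I']             # inputs  -> on-chip AM space

    def process(self):                      # processing module 1503 (conv1x1 as GEMM)
        return self.sm.astype(np.float32) @ self.am.astype(np.float32)

system = Conv1x1System()
system.store(np.ones((2, 3), dtype=np.float16), np.ones((3, 4), dtype=np.float16))
system.load()
out = system.process()
assert np.all(out == 3.0)
```

The separation mirrors the data path of the method: off-chip storage, DMA staging into SM/AM, then the vectorized compute stage.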

The working principle of the vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention is the same as that of the vector processor-oriented half-precision vectorized conv1×1 convolution method described above and is not repeated here.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A vector processor-oriented half-precision vectorized conv1×1 convolution method, comprising:
storing half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into an on-chip scalar memory (SM) space and an on-chip array memory (AM) space, respectively;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing a convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weightddr has the data format [Co, Cin, ks, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K; the half-precision input data Inputddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image, respectively, and n the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
2. The method of claim 1, wherein the invoking of the direct memory access operation loads the half-precision weight data and the half-precision input data from the double rate synchronous dynamic random access memory into an on-chip Scalar Memory (SM) space and an on-chip Array Memory (AM) space, respectively, comprising:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M/m⌉,
wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e. N = x2 × n with n = p × L × 4 and x2 = ⌈N/n⌉,
wherein p denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
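The blocking in claim 2 can be sketched as a small helper that computes the number of weight and input sub-blocks. The example values for L (vector processing elements), p (vector functional units), and m are hypothetical, chosen only to exercise the formulas:

```python
import math

def tile_counts(M, N, m, p, L):
    """Block counts per the claimed scheme: the weight matrix (M x K) is
    split along M into x1 blocks of m rows; the input matrix (K x N) is
    split along N into x2 blocks of n = p * L * 4 columns."""
    n = p * L * 4
    x1 = math.ceil(M / m)   # number of Wb_sm weight sub-blocks
    x2 = math.ceil(N / n)   # number of Ib_am input sub-blocks
    return x1, x2, n

# Hypothetical example: L = 16 lanes, p = 2 vector units, m = 6 weight rows
x1, x2, n = tile_counts(M=64, N=1024, m=6, p=2, L=16)
# n = 2 * 16 * 4 = 128, x1 = ceil(64 / 6) = 11, x2 = ceil(1024 / 128) = 8
```

The ceilings mean the last block along each dimension may be partial; the claim leaves the handling of those edge blocks to the DMA configuration.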
3. The method according to claim 2, wherein vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data, comprises the following steps:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block, i.e. the weight sub-block is denoted Wb_sm(i,m1,k) and the input sub-block is denoted Ib_am(j,k,n1);
step 4, initializing the vector registers to 0, so that the vector registers can accumulate and store the calculation results;
step 5, since the minimum granularity of a scalar load instruction is 4 bytes and one half-precision datum occupies 2 bytes, loading two half-precision data at a time into R[0:15] and R[16:31]: the k-th column data Wb_sm(i,0,k) ... Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space are loaded in sequence into the low halves R[0:15] of scalar registers R30, R31 ... R30+m-1, and simultaneously the (k+1)-th column data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) of the weight sub-block Wb_sm(i) are loaded in sequence into the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1;
step 6, based on the half-precision weight data stored in scalar registers R30, R31 ... R30+m-1, performing a low-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the low 16-bit data R[0:15] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 7, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0) ... Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 9, multiplying the vectorized data VR50 of Wb_sm(i,0,k) respectively with the k-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 10, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k) ... Wb_sm(i,m-1,k) stored in vector registers VR51 ... VR50+m-1 and the k-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 9 to multiply each group of vectorized weight data respectively with the k-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if not, skipping to step 19, and if so, continuing to step 12;
step 12, based on the data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) stored in the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1, performing a high-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the high 16-bit data R[16:31] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 13, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the broadcast data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 15, multiplying the vectorized data VR50 of Wb_sm(i,0,k+1) respectively with the (k+1)-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 16, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1) stored in vector registers VR51 ... VR50+m-1 and the (k+1)-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 15 to multiply each group of vectorized weight data respectively with the (k+1)-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11 ... VR10+m×p-1 to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading is finished, and if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading is finished, and if not, the conv1×1 calculation of all weight data W_ddr and input data I_ddr is completed.
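Steps 5–7 and 12–13 pack two fp16 weights into one 4-byte scalar load, then extend the low (or high) 16 bits and broadcast the value across the L vector processing elements. A minimal Python model of that bit manipulation follows; the lane count L = 16 is an assumption for illustration, and the functions only mimic the register semantics described in the claim, not a real instruction set:

```python
import struct

L_LANES = 16  # assumed number of vector processing elements (L)

def pack_two_fp16(a, b):
    """Model one 4-byte scalar load: column-k weight in the low 16 bits
    (R[0:15]) and column-(k+1) weight in the high 16 bits (R[16:31])."""
    lo, = struct.unpack('<H', struct.pack('<e', a))
    hi, = struct.unpack('<H', struct.pack('<e', b))
    return lo | (hi << 16)

def extend_low(reg32):
    """Low-half extension: select R[0:15] (step 6)."""
    return reg32 & 0xFFFF

def extend_high(reg32):
    """High-half extension: select R[16:31] (step 12)."""
    return (reg32 >> 16) & 0xFFFF

def broadcast(half_bits):
    """Broadcast one fp16 value to all L lanes of a vector register
    (steps 7 and 13): every processing element stores the same data."""
    value, = struct.unpack('<e', struct.pack('<H', half_bits))
    return [value] * L_LANES

r = pack_two_fp16(1.5, -2.0)       # one scalar load covers two K columns
vr_k  = broadcast(extend_low(r))   # vectorized weight for column k
vr_k1 = broadcast(extend_high(r))  # vectorized weight for column k+1
```

Packing two columns per scalar load is what lets the main loop advance k by 2 (step 17): one load feeds two rounds of vector multiply-add.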
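The overall control flow of steps 1–24 (loop over weight blocks, loop over input blocks, K-loop unrolled by two with a guard for odd K, accumulate, write back) can be modelled in plain Python. The outer-product accumulation here stands in for the broadcast-then-multiply-add of the vector FMAC units; tile sizes m and n are illustrative assumptions:

```python
import math
import numpy as np

def conv1x1_blocked(W, I, m, n):
    """Blocked conv1x1 following the claimed loop nest:
    i over weight sub-blocks (step 24), j over input sub-blocks (step 22),
    k over the shared dimension unrolled by two (steps 5-18)."""
    M, K = W.shape
    _, N = I.shape
    out = np.zeros((M, N), dtype=np.float32)
    x1, x2 = math.ceil(M / m), math.ceil(N / n)
    for i in range(x1):
        rows = slice(i * m, min((i + 1) * m, M))
        for j in range(x2):
            cols = slice(j * n, min((j + 1) * n, N))
            # step 4: zero the accumulator registers
            acc = np.zeros((rows.stop - rows.start,
                            cols.stop - cols.start), dtype=np.float32)
            k = 0
            while k < K:
                # steps 5-10: column k of W times row k of I
                acc += np.outer(W[rows, k], I[k, cols])
                if k + 1 < K:  # step 11 guard for odd K
                    # steps 12-16: column k+1 times row k+1
                    acc += np.outer(W[rows, k + 1], I[k + 1, cols])
                k += 2         # step 17
            out[rows, cols] = acc   # steps 19-20: write back this tile
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 5)).astype(np.float32)  # odd K = 5
I = rng.standard_normal((5, 9)).astype(np.float32)
assert np.allclose(conv1x1_blocked(W, I, m=3, n=4), W @ I)
```

The double-buffering of two K columns per iteration is what amortizes the scalar load and broadcast cost described in steps 5–7 over two vector multiply-add rounds.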
4. A vector processor-oriented half-precision vectorized conv1×1 convolution system, comprising:
the storage module, used for storing the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
the loading module, used for calling a direct memory access operation to load the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
the processing module, used for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr of size M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the batch size of one convolution operation; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr of size K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
5. The system of claim 4, wherein the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M/m⌉, the size of m being determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e. N = x2 × n with n = p × L × 4 and x2 = ⌈N/n⌉, wherein p denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
6. The system of claim 5, wherein the processing module is specifically configured to perform the steps of:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block, i.e. the weight sub-block is denoted Wb_sm(i,m1,k) and the input sub-block is denoted Ib_am(j,k,n1);
step 4, initializing the vector registers to 0, so that the vector registers can accumulate and store the calculation results;
step 5, since the minimum granularity of a scalar load instruction is 4 bytes and one half-precision datum occupies 2 bytes, loading two half-precision data at a time into R[0:15] and R[16:31]: the k-th column data Wb_sm(i,0,k) ... Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space are loaded in sequence into the low halves R[0:15] of scalar registers R30, R31 ... R30+m-1, and simultaneously the (k+1)-th column data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) of the weight sub-block Wb_sm(i) are loaded in sequence into the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1;
step 6, based on the half-precision weight data stored in scalar registers R30, R31 ... R30+m-1, performing a low-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the low 16-bit data R[0:15] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 7, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0) ... Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 9, multiplying the vectorized data VR50 of Wb_sm(i,0,k) respectively with the k-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 10, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k) ... Wb_sm(i,m-1,k) stored in vector registers VR51 ... VR50+m-1 and the k-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 9 to multiply each group of vectorized weight data respectively with the k-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if not, skipping to step 19, and if so, continuing to step 12;
step 12, based on the data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) stored in the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1, performing a high-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the high 16-bit data R[16:31] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 13, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the broadcast data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 15, multiplying the vectorized data VR50 of Wb_sm(i,0,k+1) respectively with the (k+1)-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 16, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1) stored in vector registers VR51 ... VR50+m-1 and the (k+1)-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 15 to multiply each group of vectorized weight data respectively with the (k+1)-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11 ... VR10+m×p-1 to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading is finished, and if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading is finished, and if not, the conv1×1 calculation of all weight data W_ddr and input data I_ddr is completed.
CN202111681136.XA 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system Active CN114330669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681136.XA CN114330669B (en) 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system


Publications (2)

Publication Number | Publication Date
CN114330669A (en) | 2022-04-12
CN114330669B (en) | 2022-09-16

Family

ID=81023239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111681136.XA Active CN114330669B (en) 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system

Country Status (1)

Country Link
CN (1) CN114330669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114575B (en) * 2022-08-30 2023-01-31 中国人民解放军国防科技大学 Vector processor-oriented image-to-matrix row conversion method, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Vectorized Implementation Method of Valid Convolution of Convolutional Neural Network
CN113626769B (en) * 2021-10-12 2022-01-21 中国人民解放军国防科技大学 Vector processor-oriented low-bit-width data matrix vectorization transposition method and system

Also Published As

Publication number Publication date
CN114330669A (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant