CN114330669B - A vector processor-oriented half-precision vectorized conv1×1 convolution method and system - Google Patents


Info

Publication number
CN114330669B
CN114330669B (application CN202111681136.XA)
Authority
CN
China
Prior art keywords
data
vector
weight
space
precision
Prior art date
Legal status
Active
Application number
CN202111681136.XA
Other languages
Chinese (zh)
Other versions
CN114330669A (en)
Inventor
许金伟
李娅琳
姜晶菲
苏华友
乔鹏
王庆林
李荣春
高蕾
窦勇
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111681136.XA
Publication of CN114330669A
Application granted
Publication of CN114330669B
Status: Active

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method and system. The method includes: storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR); invoking direct memory access (DMA) operations to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively; vectorizing, in the SM space, the weight data loaded into the on-chip SM space; and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for the vector-processor architecture and improves FLOPs while preserving accuracy.

Description

A vector processor-oriented half-precision vectorized conv1×1 convolution method and system

Technical Field

The invention relates to the technical field of vector processors, and in particular to a vector processor-oriented half-precision vectorized conv1×1 convolution method and system.

Background Art

The vector processor is a novel architecture. As shown in Figure 1, it comprises a scalar processing unit (SPU) for scalar operations, a vector processing unit (VPU) for vector operations, and a direct memory access (DMA) component responsible for data transfers, among others. The SPU consists of a scalar processing element (SPE) and a scalar memory (SM). The VPU consists of L vector processing elements (VPEs) and an array memory (AM); the L VPEs operate cooperatively in single-instruction multiple-data (SIMD) fashion, and each VPE integrates three vector arithmetic units that simultaneously support fixed-point and floating-point vector operations.

A single VPE can process, per operation, one 8-byte datum (e.g. FP64, Int64), two 4-byte data (e.g. FP32, Int32), or four 2-byte data (e.g. FP16). The DMA component handles data transfers between SM and DDR (double data rate synchronous dynamic random access memory) and between AM and DDR; the minimum granularity of its operations is also 8 bytes.

Convolution is one of the core computations of neural networks, and conv1×1 is the most common convolution configuration, so its efficiency strongly affects overall network performance; optimizing the convolution computation is therefore particularly important.
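The premise underlying the method is that a conv1×1 is exactly a matrix multiply. This can be checked directly with the following NumPy sketch (all sizes and names are illustrative assumptions, not the patent's code), which compares a per-channel weighted sum against a single M×K by K×N matrix product:

```python
import numpy as np

# Hypothetical small sizes: Co = M, Cin = K, image flattened to N = Hi*Wi*n.
Co, Cin, Hi, Wi, n = 4, 8, 3, 3, 2
rng = np.random.default_rng(0)
weight = rng.standard_normal((Co, Cin)).astype(np.float16)           # [Co, Cin] == M x K
feat = rng.standard_normal((Cin, Hi * Wi * n)).astype(np.float16)    # K x N

# conv1x1 computed naively: each output channel is a weighted sum over input channels.
out_naive = np.zeros((Co, Hi * Wi * n), dtype=np.float32)
for co in range(Co):
    for ci in range(Cin):
        out_naive[co] += weight[co, ci].astype(np.float32) * feat[ci].astype(np.float32)

# The same result as one matrix product Weight(M x K) @ Input(K x N).
out_matmul = weight.astype(np.float32) @ feat.astype(np.float32)
assert np.allclose(out_naive, out_matmul, atol=1e-3)
```

Accumulating in FP32 while keeping the operands in FP16 mirrors the precision regime the method targets.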

Summary of the Invention

In view of this, the present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method that, by exploiting the architectural features of the vector processor, vectorizes the conv1×1 convolution for the vector-processor architecture and improves FLOPs while preserving accuracy.

The present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method, including:

storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR);

invoking a direct memory access (DMA) operation to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

in the SM space, vectorizing the weight data loaded into the on-chip SM space; and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data;

wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution-kernel size. When the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the batch size of one convolution pass; [Hi, Wi, n] can be treated as one dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N the size of the image dimension.
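Because ks = 1, collapsing [Co, Cin, 1, 1] to M×K and [Cin, Hi, Wi, n] to K×N is a pure reinterpretation of the same memory, with no data movement. A short NumPy illustration (all sizes invented for the example):

```python
import numpy as np

Co, Cin, ks, Hi, Wi, n = 4, 8, 1, 3, 3, 2  # ks = 1 for conv1x1; sizes are illustrative
weight_ddr = np.arange(Co * Cin * ks * ks, dtype=np.float16).reshape(Co, Cin, ks, ks)
input_ddr = np.arange(Cin * Hi * Wi * n, dtype=np.float16).reshape(Cin, Hi, Wi, n)

# With ks == 1 the weight tensor [Co, Cin, 1, 1] is just an M x K matrix (M = Co,
# K = Cin), and [Hi, Wi, n] collapses into a single dimension N = Hi*Wi*n.
M, K = Co, Cin
N = Hi * Wi * n
W = weight_ddr.reshape(M, K)   # no copy needed: same contiguous buffer
I = input_ddr.reshape(K, N)
assert W.shape == (M, K) and I.shape == (K, N)
```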

Preferably, invoking the direct memory access operation to load the half-precision weight data and half-precision input data from the DDR into the on-chip SM space and the on-chip AM space, respectively, includes:

invoking a DMA operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K and x1 = ⌈M / m⌉, where m is determined jointly by the sizes of the SM space and the AM space;

invoking a DMA operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e. N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉; p denotes the number of vector functional units in the vector-processor architecture, and L the number of vector processing elements.
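Under this partitioning, the block counts follow directly from the sub-block shapes. A small sketch with assumed machine parameters (p, L, and m are placeholders for illustration, not the patent's actual hardware values):

```python
import math

# Illustrative machine parameters (assumptions, not the patent's actual values):
p, L = 2, 16           # p vector functional units, L vector processing elements
m = 3                  # rows of one weight sub-block, chosen to fit SM/AM capacity
n = p * L * 4          # columns of one input sub-block: each lane holds 4 FP16 values

M, K, N = 10, 8, 512   # overall matrix sizes (M = Co, K = Cin, N = Hi*Wi*batch)

x1 = math.ceil(M / m)  # number of weight sub-blocks Wb_sm, each m x K
x2 = math.ceil(N / n)  # number of input sub-blocks Ib_am, each K x n
print(x1, x2, n)       # prints: 4 4 128
```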

Preferably, vectorizing the weight data loaded into the on-chip SM space in the SM space and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data includes the following steps:

Step 1. Initialize i = 0, where i is the block index of the weight sub-block matrix Wb_sm(i) along the M dimension.

Step 2. Initialize j = 0, where j is the block index of the input sub-block matrix Ib_am(j) along the N dimension.

Step 3. Initialize k = 0, where k is the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e. a weight element is written Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1).

Step 4. Initialize the vector registers to 0 so that they can accumulate and store the computation results.

Step 5. The minimum granularity of a scalar load instruction is 4 bytes while a half-precision value occupies 2 bytes, so a single load brings two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register. Load the k-th column of the weight sub-block Wb_sm(i) in the SM space, Wb_sm(i,0,k) ... Wb_sm(i,m-1,k), into R[0:15] of the scalar registers R30, R31 ... R30+m-1; the (k+1)-th column, Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1), is simultaneously loaded into R[16:31] of the same registers.

Step 6. Using the half-precision weight data held in scalar registers R30, R31 ... R30+m-1, perform a low-half extension: replicate-extend the low 16 bits R[0:15] of each register's low 32 bits to d bits and store the result in scalar registers R40, R41 ... R40+m-1, where d is the bit width of one scalar register.

Step 7. Using the replicate-extended data held in scalar registers R40, R41 ... R40+m-1, perform broadcast operations in turn and store the results in vector registers VR50, VR51 ... VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is complete.
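Steps 5-7 can be mimicked in software: two adjacent FP16 weights arrive packed in one 32-bit word, the low (or, later, the high) 16 bits are extracted, and the value is broadcast to every lane. The sketch below models this with NumPy bit views; the lane count and weight values are assumptions, and it assumes a little-endian host:

```python
import numpy as np

L = 16  # number of vector lanes (VPEs); an assumption for illustration

# Two adjacent FP16 weights share one 4-byte scalar load: w[k] sits in bits
# R[0:15] and w[k+1] in bits R[16:31] of the loaded 32-bit value (little-endian).
w_pair = np.array([1.5, -2.25], dtype=np.float16)
reg32 = w_pair.view(np.uint32)[0]          # the packed word a scalar load would fetch

# Step 6 analogue (and Step 12 for the high half): extract each 16-bit half.
low16 = np.array([reg32 & 0xFFFF], dtype=np.uint16)
high16 = np.array([(reg32 >> 16) & 0xFFFF], dtype=np.uint16)

# Step 7 / Step 13 analogue: broadcast so every lane holds the same weight.
vr_k = np.broadcast_to(low16.view(np.float16), (L,))
vr_k1 = np.broadcast_to(high16.view(np.float16), (L,))
assert vr_k[0] == np.float16(1.5) and vr_k1[0] == np.float16(-2.25)
```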

Step 8. Load the k-th row of the input sub-block matrix Ib_am(j) in the AM space, Ib_am(j,k,0) ... Ib_am(j,k,n-1), into the p vector registers VR0, VR1 ... VRp-1, where p is the number of vector functional units in the very-long-instruction-word architecture; the minimum granularity of a single vector load is L × 8 bytes, so a single load brings in at least 4 × L half-precision values.

Step 9. Multiply-accumulate the vectorized data VR50 of Wb_sm(i,0,k) with the k-th row of Ib_am(j) held in VR0, VR1 ... VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 ... VR10+p-1.

Step 10. Vector registers VR51 ... VR50+m-1 hold the vectorized data of Wb_sm(i,1,k) ... Wb_sm(i,m-1,k), while VR0, VR1 ... VRp-1 hold the k-th row of the input sub-block Ib_am(j). Repeat Step 9: multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, the L vector processing elements operating in parallel, traversing the k-th column of Wb_sm(i) until the multiply-accumulate of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete.
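Steps 8-10 amount to a rank-1 update per shared index k: m broadcast weights times one input row, accumulated into m×p register-sized strips. A behavioural NumPy sketch under assumed sizes (register names and machine parameters are illustrative):

```python
import numpy as np

p, L = 2, 16
m, K = 3, 8
n = p * L * 4                      # one input sub-block is K x n

rng = np.random.default_rng(1)
Wb = rng.standard_normal((m, K)).astype(np.float16)   # weight sub-block in SM
Ib = rng.standard_normal((K, n)).astype(np.float16)   # input sub-block in AM

# m*p accumulator "vector registers", each holding L*4 FP16 lanes (modelled in FP32).
acc = np.zeros((m, p, L * 4), dtype=np.float32)

for k in range(K):                                    # loop over the shared K dimension
    row = Ib[k].reshape(p, L * 4).astype(np.float32)  # step 8: row k split into p registers
    for m1 in range(m):                               # steps 9-10: broadcast weight x row
        acc[m1] += np.float32(Wb[m1, k]) * row        # accumulate into register strip m1

out = acc.reshape(m, n)
assert np.allclose(out, Wb.astype(np.float32) @ Ib.astype(np.float32), atol=1e-2)
```

The final assertion confirms that the accumulated strips, laid side by side, equal the sub-block matrix product Wb × Ib.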

Step 11. Determine whether k + 1 is less than K; if so, continue with Step 12; otherwise jump to Step 19.

Step 12. Using the Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) data held in R[16:31] of scalar registers R30, R31 ... R30+m-1, perform a high-half extension: replicate-extend the high 16 bits R[16:31] of each register's low 32 bits to d bits and store the result in scalar registers R40, R41 ... R40+m-1, where d is the bit width of one scalar register.

Step 13. Using the replicate-extended data held in scalar registers R40, R41 ... R40+m-1, perform broadcast operations in turn and store the broadcast data in vector registers VR50, VR51 ... VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is complete.

Step 14. Load the (k+1)-th row of the input sub-block matrix Ib_am(j) in the AM space, Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1), into the p vector registers VR0, VR1 ... VRp-1, where p is the number of vector functional units in the very-long-instruction-word architecture; the minimum granularity of a single vector load is L × 8 bytes, so a single load brings in at least 4 × L half-precision values.

Step 15. Multiply-accumulate the vectorized data VR50 of Wb_sm(i,0,k+1) with the (k+1)-th row of Ib_am(j) held in VR0, VR1 ... VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 ... VR10+p-1.

Step 16. Vector registers VR51 ... VR50+m-1 hold the vectorized data of Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1), while VR0, VR1 ... VRp-1 hold the (k+1)-th row of Ib_am(j). Repeat Step 15: multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, the L vector processing elements operating in parallel, traversing the (k+1)-th column of Wb_sm(i) until the multiply-accumulate of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete.

Step 17. Let k = k + 2.

Step 18. Determine whether k is less than K; if so, return to Step 5; otherwise proceed to Step 19.

Step 19. Temporarily store the results held in vector registers VR10, VR11 ... VR10+m×p-1 at the AM-space location AM_temp.

Step 20. Invoke a DMA operation to store the feature-map results held at the AM-space location AM_temp to the designated DDR location.

Step 21. Let j = j + 1.

Step 22. Determine whether j is less than x2; if so, invoke a DMA operation to load the next input sub-block matrix Ib_am(j) into the on-chip AM space and, once loading completes, return to Step 3; otherwise proceed to Step 23.

Step 23. Let i = i + 1.

Step 24. Determine whether i is less than x1; if so, invoke a DMA operation to load the next weight sub-block matrix Wb_sm(i) into the on-chip SM space and, once loading completes, return to Step 2; otherwise the conv1×1 computation of all weight data W_ddr with all input data I_ddr is complete.
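The control flow of Steps 1-24 — an outer loop over weight sub-blocks, a middle loop over input sub-blocks, and an inner march down the shared K dimension two columns at a time — can be modelled end to end. This is a behavioural sketch of the blocking scheme with invented sizes, not the processor's actual instruction sequence:

```python
import math
import numpy as np

p, L = 2, 16
m = 4
n = p * L * 4                        # 128 columns per input sub-block
M, K, N = 8, 6, 256                  # chosen so blocks divide evenly, for brevity

rng = np.random.default_rng(2)
W_ddr = rng.standard_normal((M, K)).astype(np.float16)
I_ddr = rng.standard_normal((K, N)).astype(np.float16)
O_ddr = np.zeros((M, N), dtype=np.float32)

x1, x2 = math.ceil(M / m), math.ceil(N / n)
for i in range(x1):                               # Step 24 loop: weight sub-blocks
    Wb = W_ddr[i * m:(i + 1) * m]                 # "DMA" Wb_sm(i) into SM
    for j in range(x2):                           # Step 22 loop: input sub-blocks
        Ib = I_ddr[:, j * n:(j + 1) * n]          # "DMA" Ib_am(j) into AM
        acc = np.zeros((m, n), dtype=np.float32)  # Step 4: clear accumulators
        k = 0
        while k < K:                              # Steps 5-18: two columns per pass
            for kk in (k, k + 1):
                if kk >= K:                       # Step 11: no column k+1 left
                    break
                acc += Wb[:, kk:kk + 1].astype(np.float32) * Ib[kk].astype(np.float32)
            k += 2                                # Step 17
        O_ddr[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc   # Steps 19-20: store tile

assert np.allclose(O_ddr, W_ddr.astype(np.float32) @ I_ddr.astype(np.float32), atol=1e-2)
```

The closing assertion checks the blocked traversal against a plain matrix product, which is what the conv1×1 reduces to.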

A vector processor-oriented half-precision vectorized conv1×1 convolution system includes:

a storage module configured to store half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR);

a loading module configured to invoke direct memory access (DMA) operations to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

a processing module configured to vectorize, in the SM space, the weight data loaded into the on-chip SM space and, in the AM space, to perform the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature-map data;

wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution-kernel size. When the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the batch size of one convolution pass; [Hi, Wi, n] can be treated as one dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N the size of the image dimension.

Preferably, the loading module is specifically configured to:

invoke a DMA operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K and x1 = ⌈M / m⌉, where m is determined jointly by the sizes of the SM space and the AM space; and

invoke a DMA operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e. N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉; p denotes the number of vector functional units in the vector-processor architecture, and L the number of vector processing elements.

Preferably, the processing module is specifically configured to perform the following steps:

步骤1、初始化i=0,其中,i表示权值子块矩阵Wbsm(i)在M维上的块索引;Step 1, initialize i=0, wherein, i represents the block index of the weight sub-block matrix Wb sm(i) in the M dimension;

步骤2、初始化j=0,其中,j表示输入子块矩阵Ibam(j0在N维上的块索引;Step 2, initialize j=0, wherein, j represents the block index of the input sub-block matrix Ib am (j0 on the N dimension;

步骤3、初始化k=0,其中,k表示权值子块Wbsm的列索引和输入子块Ibam的行索引,m1表示权值子块的行索引,n1表示输入子块的列索引,即,权值子块表示为Wbsm(i,m1,k),输入子块表示为Ibam(j,k,n1)Step 3. Initialize k=0, where k represents the column index of the weight sub-block Wb sm and the row index of the input sub-block Ib am , m1 represents the row index of the weight sub-block, n1 represents the column index of the input sub-block, That is, the weight sub-block is represented as Wb sm(i,m1,k) , and the input sub-block is represented as Ib am(j,k,n1) ;

步骤4、将向量寄存器初始化为0,以便向量寄存器累加并存储计算结果;Step 4. Initialize the vector register to 0, so that the vector register can accumulate and store the calculation result;

步骤5、标量加载指令的最小粒度为4字节,半精度数据为2字节,单次将加载两个半精度数据到指定标量寄存器的R[0:15]和R[16:31],将所述SM空间中的权值子块Wbsm(i)的第k列数据Wbsm(i,0,k)……Wbsm(i,m-1,k)依次加载到标量寄存器R30、R31...R30+m-1的R[0:15]中,同时权值子块Wbsm(i)的第k+1列数据Wbsm(i,0,k+1)……Wbsm(i,m-1,k+1)依次加载到标量寄存器R30、R31...R30+m-1的R[16:31]中;Step 5. The minimum granularity of the scalar load instruction is 4 bytes, and the half-precision data is 2 bytes. Two half-precision data will be loaded into R[0:15] and R[16:31] of the specified scalar register at a time. Load the kth column data Wb sm(i, 0, k) ... Wb sm(i, m-1, k) of the weight sub-block Wb sm(i) in the SM space into the scalar register R 30 , R 31 ...R 30+m-1 in R[0:15], the k+1th column data Wb sm(i,0,k+1) of the weight sub-block Wb sm(i) at the same time ...Wb sm(i,m-1,k+1) are sequentially loaded into R[16:31] of scalar registers R 30 , R 31 ... R 30+m-1 ;

步骤6、基于标量寄存器R30、R31...R30+m-1存放的半精度权值数据,对标量寄存器R30、R31...R30+m-1进行低位扩展操作,将寄存器中低32位中低16位数据R[0:15]复制扩展为d位数据存储在标量寄存器R40、R41...R40+m-1中,其中,d为一个标量寄存器的位长;Step 6. Based on the half - precision weight data stored in the scalar registers R 30 , R 31 . Copy and expand the lower 16-bit data R[0:15] in the lower 32 bits of the register into d-bit data and store them in the scalar registers R 40 , R 41 ... R 40+m-1 , where d is a scalar register bit length;

步骤7、基于标量寄存器R40、R41...R40+m-1存放的复制扩展后的数据,对标量寄存器R40、R41...R40+m-1依次进行广播操作并将数据储存在向量寄存器vr50、vr51...VR50+m-1中,L个向量处理部件存储相同的数据,Wbsm(i)的第k列数据向量化完成;Step 7. Based on the replicated and expanded data stored in the scalar registers R 40 , R 41 . . . R 40+m-1 , the scalar registers R 40 , R 41 . The data is stored in the vector registers vr 50 , vr 51 . . . VR 50+m-1 , the L vector processing components store the same data, and the data in the kth column of Wb sm(i) is vectorized;

步骤8、将所述AM空间中的输入子块矩阵Ibam(j)的第k行数据Ibam(j,k,0)……Ibam(j,k,n-1)加载到p个向量寄存器VR0、VR1...VRp-1中,p表示超长数据指令字的体系结构中功能向量运算单元部件的数量,单次加载最小粒度为

Figure BDA0003447171650000081
个字节,故单次最少可加载
Figure BDA0003447171650000082
个半精度数据;Step 8. Load the k-th row data Ib am(j, k, 0) ... Ib am(j, k, n-1) of the input sub-block matrix Ib am(j) in the AM space to p In the vector registers VR 0 , VR 1 ... VR p-1 , p represents the number of functional vector arithmetic unit components in the architecture of the super-long data instruction word, and the minimum granularity of a single load is
Figure BDA0003447171650000081
bytes, so at least one can be loaded at a time
Figure BDA0003447171650000082
half-precision data;

步骤9、将Wbsm(i,0,k)向量化后的数据VR50分别与Ibam(j)的第k行数据VR0、VR1...VRp-1做乘加操作,同时L个向量处理部件并行操作,将计算结果存在向量寄存器VR10、VR11...VR10+p-1中;Step 9. Perform multiplication and addition operations on the vectorized data VR 50 of Wb sm(i, 0, k) and the k - th row data VR 0 , VR 1 . The L vector processing components operate in parallel, and store the calculation results in the vector registers VR 10 , VR 11 . . . VR 10+p-1 ;

步骤10、基于向量寄存器VR51...VR50+m-1储存的是权值子块Wbsm(i,1,k)……Wbsm(i,m-1,k)的向量化数据,向量寄存器VR0、VR1...VRp-1中储存的是输入子块Ibam(j)的第k行数据,重复步骤9,将权值的各组向量化数据分别与Ibam(j)的第k行数据相乘,并将相乘结果累加到向量寄存器VR10+p、VR10+p+1...VR10+m×p-1上,该过程L个向量处理部件同时并行操作,遍历Wbsm(i)的第k列数据,直至Wbsm(i)的第k列和Ibam(j)的k行的乘加计算完成;Step 10. Based on the vector registers VR 51 ...VR 50+m-1 store the vectorized data of the weight sub-blocks Wb sm(i, 1, k) ... , the vector registers VR 0 , VR 1 . . . VR p-1 store the data of the kth row of the input sub-block Ib am(j) , repeat step 9, and compare each group of vectorized data of the weight with Ib am Multiply the data in the kth row of (j) , and accumulate the multiplied results to the vector registers VR 10+p , VR 10+p+1 ... VR 10+m×p-1 , in this process L vector processing The components operate in parallel at the same time, traverse the data of the kth column of Wb sm( i ), until the multiplication and addition calculation of the kth column of Wb sm(i) and the k row of Ib am(j) is completed;

步骤11、判断k+1是否小于K,若是,则跳转执行步骤19,若否,则继续执行步骤12;Step 11, judge whether k+1 is less than K, if yes, then jump to step 19, if not, continue to execute step 12;

步骤12、基于标量寄存器R30、R31...R30+m-1的R[16:31]中存放的Wbsm(i,1,k+1)……Wbsm(i,m-1,k+1)数据,对标量寄存器R30、R31...R30+m-1进行高位扩展操作,将寄存器中低32位中高16位数据R[16:31],复制扩展为d位数据存储在标量寄存器R40、R41...R40+m-1中,d为一个标量寄存器的位长;Step 12. Based on the Wb sm (i, 1, k+1) stored in R[16:31] of the scalar registers R 30 , R 31 . 1,k+1) data, perform high-order expansion operation on scalar registers R 30 , R 31 ... R 30+m-1 , and copy and expand the lower 32-bit, middle-high 16-bit data R[16:31] in the register as d-bit data is stored in scalar registers R 40 , R 41 . . . R 40+m-1 , where d is the bit length of a scalar register;

步骤13、基于标量寄存器R40、R41...R40+m-1存放的复制扩展后的数据,对标量寄存器R40、R41...R40+m-1依次进行广播操作,将广播后的数据储存在向量寄存器VR50、VR51...VR50+m-1中,L个向量处理部件存储相同的数据,Wbsm(i)的第k+1列数据向量化完成;Step 13: Based on the replicated and expanded data stored in the scalar registers R 40 , R 41 . Store the broadcasted data in the vector registers VR 50 , VR 51 . . . VR 50+m-1 , the L vector processing units store the same data, and the vectorization of the data in the k+1th column of Wb sm(i) is completed ;

步骤14、将所述AM空间中的输入子块矩阵Ibam(j)的第k+1行数据Ibam(j,k+1,0)……Ibam(j,k+1,n-1)加载到p个向量寄存器VR0、VR1...VRp-1中,p表示超长数据指令字的体系结构中功能向量运算单元部件的数量,单次加载最小粒度为

Figure BDA0003447171650000091
个字节,故单次最少可加载
Figure BDA0003447171650000092
个半精度数据;Step 14: Convert the k+1 row data Ib am(j,k+1,0) of the input sub-block matrix Ib am(j) in the AM space to Ib am(j,k+1,n- 1) Load into p vector registers VR 0 , VR 1 ... VR p-1 , p represents the number of functional vector arithmetic unit components in the architecture of the super-long data instruction word, and the minimum granularity of a single load is
Figure BDA0003447171650000091
bytes, so at least one can be loaded at a time
Figure BDA0003447171650000092
half-precision data;

步骤15、将Wbsm(i,0,k+1)向量化后的数据VR50分别与Ibam(j)的第k+1行数据VR0、VR1...VRp-1做乘加操作,同时L个向量处理部件并行操作,将计算结果存在向量寄存器VR10、VR11...VR10+p-1中;Step 15. Multiply the data VR 50 after vectorization of Wb sm(i, 0, k+1) by the data VR 0 , VR 1 . . . VR p-1 of the k+1th row of Ib am(j) respectively Add operation, while L vector processing components operate in parallel, and store the calculation results in vector registers VR 10 , VR 11 . . . VR 10+p-1 ;

步骤16、基于向量寄存器VR51...VR50+m-1储存的是权值子块Wbsm(i,1,k+1)……Wbsm(i,m-1,k+1)的向量化数据,向量寄存器VR0、VR1...VRp-1中储存的是输入子块Ibam(j)的第k+1行数据,重复步骤15,将权值的各组向量化数据分别与Ibam(j)的第k+1行数据相乘,并将相乘结果累加至向量寄存器VR10+p、VR10+p+1...VR10+m×p-1上,该过程L个向量处理部件同时并行操作,遍历Wbsm(i)的第k+1列数据,直至Wbsm(i)的第k+1列和Ibam(j)的k+1行的乘加计算完成;Step 16. Based on the vector registers VR 51 ... VR 50+m-1 , the weight sub-blocks Wb sm(i, 1, k+1) ... Wb sm(i, m-1, k+1) are stored The vectorized data of , the vector registers VR 0 , VR 1 . . . VR p-1 store the data of the k+1th row of the input sub-block Ib am(j) . Multiply the data with the k+1th row data of Ib am(j) respectively, and accumulate the multiplication results to the vector registers VR 10+p , VR 10+p+1 ... VR 10+m×p-1 In this process, L vector processing components operate in parallel at the same time, traversing the data of the k+1th column of Wb sm( i ) until the k+1th column of Wb sm(i) and the k+1 row of Ib am(j) The multiplication and addition calculation is completed;

步骤17、令k=k+2;Step 17, let k=k+2;

步骤18、判断k是否小于K,若是,则返回步骤5,若否,则执行步骤19;Step 18, determine whether k is less than K, if so, go back to step 5, if not, go to step 19;

步骤19、将储存在向量寄存器VR10、VR11...VR10+m×p-1中的数据结果暂时存储到AM空间位置AMtempStep 19, temporarily store the data results in the vector registers VR 10 , VR 11 . . . VR 10+m×p-1 to the AM space position AM temp ;

步骤20、调用直接存储器访问操作,将所述AM空间位置AMtemp储存的特征图数据结果存储至双倍速率同步动态随机存储器指定位置;Step 20, call direct memory access operation, the characteristic map data result that described AM space position AM temp is stored is stored to double rate synchronous dynamic random access memory designated position;

步骤21、令j=j+1;Step 21, let j=j+1;

步骤22、判断j是否小于x2,若是,则调用直接存储器访问操作,将输入子块矩阵Ibam(j)加载到片上AM空间中,加载完后返回步骤3,若否,则执行步骤23;Step 22, determine whether j is less than x 2 , if so, call the direct memory access operation, load the input sub-block matrix Ib am(j) into the on-chip AM space, and return to step 3 after loading, if not, execute step 23 ;

Step 23. Let i = i + 1;

Step 24. If i is less than x1, invoke a direct memory access operation to load the weight sub-block matrix Wbsm(i) into the on-chip SM space and return to step 2; otherwise, the conv1×1 computation over all the weight data Wddr and input data Iddr is complete.

In summary, the present invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method. First, half-precision weight data and half-precision input data are stored in a double data rate synchronous dynamic random access memory (DDR SDRAM). A direct memory access operation is then invoked to load the half-precision weight data and input data from the DDR SDRAM into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively. In the SM space, the loaded weight data is vectorized; in the AM space, the vectorized weight data is convolved (conv1×1) with the input data to obtain the convolved feature map data. The half-precision weight data Weightddr has the format [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the format reduces to [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K. The half-precision input data Inputddr has the format [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be treated as a single dimension with N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 computation for the vector processor architecture and improves FLOPs while preserving accuracy.
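As an illustrative sketch (not the patent's register-level kernel), the reduction described above — a 1×1 convolution collapsing to a matrix product of the reshaped weights (M×K) and reshaped inputs (K×N) — can be checked in a few lines of NumPy; all sizes below are small example values, not values from the patent:

```python
# conv1x1 as GEMM: weights [Co, Cin, 1, 1] -> M x K, inputs [Cin, Hi, Wi, n] -> K x N
import numpy as np

rng = np.random.default_rng(0)
Co, Cin, Hi, Wi, n = 6, 4, 4, 4, 2            # small illustrative sizes
M, K, N = Co, Cin, Hi * Wi * n

weight = rng.standard_normal((Co, Cin, 1, 1)).astype(np.float16)
inp = rng.standard_normal((Cin, Hi, Wi, n)).astype(np.float16)

W = weight.reshape(M, K)                      # [Co, Cin, 1, 1] viewed as [Co, Cin]
I = inp.reshape(K, N)                         # [Hi, Wi, n] flattened into one axis
out_gemm = W.astype(np.float32) @ I.astype(np.float32)

# reference: direct 1x1 convolution (sum over input channels per pixel/batch slot)
ref = np.einsum('ck,khwn->chwn', W.astype(np.float32), inp.astype(np.float32))
assert np.allclose(out_gemm, ref.reshape(M, N))
```

The check confirms that, for kernel size 1, the two formulations are the same computation up to reshaping, which is what permits the blocked GEMM treatment in the steps that follow.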

Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the general architecture of a vector processor;

Fig. 2 is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method provided by the present invention;

Fig. 3 is a schematic diagram of the scalar loading of Wbsm(0,m1,k) disclosed in the present invention;

Fig. 4 is a schematic diagram of the low 16-bit extension of a scalar register disclosed in the present invention;

Fig. 5 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;

Fig. 6 is a schematic diagram of the vector loading of Ibam(0,0,n1) disclosed in the present invention;

Fig. 7 is a schematic diagram of the vector multiply-accumulate of Wbsm(i,0,k) with the k-th row of the input disclosed in the present invention;

Fig. 8 is a schematic diagram of the vector multiply-accumulate of the k-th column of the weights with the k-th row of the input disclosed in the present invention;

Fig. 9 is a schematic diagram of the high 16-bit extension of a scalar register disclosed in the present invention;

Fig. 10 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;

Fig. 11 is a schematic diagram of the vector loading of Ibam(0,1,n1) disclosed in the present invention;

Fig. 12 is a schematic diagram of the vector multiply-accumulate of Wbsm(i,0,k+1) with the (k+1)-th row of the input disclosed in the present invention;

Fig. 13 is a schematic diagram of the vector multiply-accumulate of the (k+1)-th column of the weights with the (k+1)-th row of the input;

Fig. 14 is a schematic diagram of the vector multiply-accumulate of the last column of the weights with the last row of the input;

Fig. 15 is a schematic structural diagram of an embodiment of a vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

As shown in Fig. 2, a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention, the method may include the following steps:

S201. Store the half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;

When vectorized convolution of half-precision data on a vector processor is required, the half-precision weight data and half-precision input data are first stored in DDR (double data rate synchronous dynamic random access memory). The half-precision weight data Weightddr has the format [Co, Cin, ks, ks], where Co is the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the format can also be regarded as [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K. The half-precision input data Inputddr has the format [Cin, Hi, Wi, n], where Hi and Wi are the image height and width and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be treated as a single dimension with N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.

S202. Invoke a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

Specifically, a direct memory access operation is invoked to load the half-precision weight matrix Wddr into the on-chip SM space, partitioning the original data along the M dimension (the output-channel dimension) into x1 Wbsm matrices, so that Wsm = x1 × Wbsm, with Wbsm = m×K and

x1 = ⌈M / m⌉,

where the size of m is jointly determined by the sizes of the SM space and the AM space. For example, the weight data block Wbsm determined by m must not exceed the SM space, and the sum of the size of the output of convolving a weight block with an input block and the size of the input block must be smaller than the AM space.

A direct memory access operation is invoked to load the half-precision input matrix Iddr into the on-chip AM space, partitioning the original data along the N dimension (the image dimension) into x2 Ibam matrices, so that Iam = x2 × Ibam, with Ibam = K×n. That is, N = x2 × n, where n = p×L×4 and

x2 = ⌈N / n⌉,

where p denotes the number of vector functional units in the architecture of the vector processor and L denotes the number of vector processing elements.
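The partitioning just described — x1 weight blocks of m rows along M, x2 input blocks of n columns along N, with n = p×L×4 — can be sketched as plain index arithmetic; the sizes M, K, N, m, p, and L below are assumed example values chosen only so the blocks divide evenly:

```python
# block-count arithmetic for the weight (M x K) and input (K x N) partitions
import math

M, K, N = 96, 64, 1024        # assumed matrix sizes for illustration
p, L = 2, 8                   # example unit counts, as in the embodiment below
m = 6                         # assumed to fit the SM/AM capacity constraints
n = p * L * 4                 # one input-block row fills p vector registers

x1 = math.ceil(M / m)         # number of weight blocks Wb_sm
x2 = math.ceil(N / n)         # number of input blocks Ib_am

# half-open [start, end) row/column ranges of each block
weight_blocks = [(i * m, min((i + 1) * m, M)) for i in range(x1)]
input_blocks = [(j * n, min((j + 1) * n, N)) for j in range(x2)]
```

Each (i, j) pair of blocks then yields one m×n tile of the output, which is the unit of work in steps 1–24 of the embodiment.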

S203. In the SM space, vectorize the weight data loaded into the on-chip SM space; in the AM space, perform the convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.

Specifically, the following steps may be included:

Step 1. Initialize i = 0, where i denotes the block index of the weight sub-block matrix Wbsm(i) in the M dimension;

Step 2. Initialize j = 0, where j denotes the block index of the input sub-block matrix Ibam(j) in the N dimension;

Step 3. Initialize k = 0, where k denotes the column index of the weight sub-block Wbsm and the row index of the input sub-block Ibam, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block; that is, the weight sub-block is denoted Wbsm(i,m1,k) and the input sub-block Ibam(j,k,n1);

Step 4. Initialize the vector registers to 0 so that they can accumulate and store the computation results;

Step 5. The minimum granularity of a scalar load instruction is 4 bytes while half-precision data occupies 2 bytes, so a single load places two half-precision values into R[0:15] and R[16:31] of the designated scalar register. Load the k-th column data Wbsm(i,0,k)...Wbsm(i,m-1,k) of the weight sub-block Wbsm(i) in the SM space sequentially into R[0:15] of scalar registers R30, R31...R30+m-1, while the (k+1)-th column data Wbsm(i,0,k+1)...Wbsm(i,m-1,k+1) of Wbsm(i) is loaded sequentially into R[16:31] of scalar registers R30, R31...R30+m-1;

For example, take the first weight sub-block Wbsm(0) = 6×4, with m = 6 and K = 4. When k = 0, scalar load instructions load the data of the first column of Wbsm(0) into R[0:15] of scalar registers R30, R31...R30+m-1 while loading the data of the second column of Wbsm(0) into R[16:31] of R30, R31...R30+m-1, as shown in Fig. 3.

Step 6. Based on the half-precision weight data held in scalar registers R30, R31...R30+m-1, perform a low-half extension on R30, R31...R30+m-1, replicating the low 16 bits R[0:15] of the low 32 bits of each register into d-bit data stored in scalar registers R40, R41...R40+m-1, where d is the bit width of a scalar register;

For example, with d = 64, the extension instruction for the low 16 bits of the low 32 bits in step 6 is implemented as shown in Fig. 4.

Step 7. Based on the replicated and extended data held in scalar registers R40, R41...R40+m-1, broadcast R40, R41...R40+m-1 in turn and store the data in vector registers VR50, VR51...VR50+m-1, so that the L vector processing elements hold identical data; the vectorization of the k-th column of Wbsm(i) is then complete;

For example, with L = 8, the broadcast of scalar register R40 to vector register VR50 is implemented as shown in Fig. 5.
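The register-level effect of steps 5–7 (and the later step 12) can be emulated with bit arithmetic. This is a sketch of the assumed semantics only — the pack/extend/broadcast behavior described in the text, not actual ISA code — using d = 64 and L = 8 as in the examples, and arbitrary weight values:

```python
# emulate: 4-byte scalar load packing two fp16 weights, low/high-half extend, broadcast
import numpy as np

w_k = np.array(1.5, dtype=np.float16)      # Wb_sm(i, m1, k), destined for R[0:15]
w_k1 = np.array(-2.0, dtype=np.float16)    # Wb_sm(i, m1, k+1), destined for R[16:31]

# scalar load: two consecutive fp16 values packed into one 32-bit register R30
r30 = (int(w_k1.view(np.uint16)) << 16) | int(w_k.view(np.uint16))

# step 6: low-half extend -- replicate R[0:15] into all four 16-bit slots of d = 64 bits
low16 = r30 & 0xFFFF
r40 = low16 | (low16 << 16) | (low16 << 32) | (low16 << 48)

# step 7: broadcast -- all L = 8 lanes of the vector register hold the same 64-bit pattern
vr50 = np.full(8, r40, dtype=np.uint64)
lanes = vr50.view(np.uint16).reshape(8, 4)  # 8 lanes x 4 fp16 slots (little-endian view)
assert np.all(lanes.view(np.float16) == w_k)

# step 12 (second pass): the high-half extend picks out the column-(k+1) weight instead
high16 = (r30 >> 16) & 0xFFFF
assert np.array(high16, dtype=np.uint16).view(np.float16) == w_k1
```

The point of the packing is visible here: one 4-byte scalar load supplies the weight for both column k and column k+1, which is why the main loop advances k by 2.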

Step 8. Load the k-th row data Ibam(j,k,0)...Ibam(j,k,n-1) of the input sub-block matrix Ibam(j) in the AM space into the p vector registers VR0, VR1...VRp-1, where p denotes the number of vector functional units in the very long instruction word (VLIW) architecture; the minimum granularity of a single vector load is

(L×d)/8 bytes,

so a single load can fetch at least

(L×d)/16 half-precision values;

For example, take the first input sub-block Ibam(0) = 4×64, with K = 4 and N = 64. When k = 0, a vector load instruction loads the data of the first row of Ibam(0) into the p vector registers VR0, VR1...VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in Fig. 6.
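The layout assumed by this example — one input row of n = p×L×4 half-precision values filling p vector registers, each register holding L lanes of four 16-bit slots — can be sketched as a reshape. The contiguous per-register ordering below is an assumption for illustration; the actual lane interleaving is defined by the hardware:

```python
# sketch of the step-8 layout: row of n = p*L*4 fp16 values across p vector registers
import numpy as np

p, L = 2, 8
n = p * L * 4                               # 64 elements, matching Ib_am(0) = 4 x 64
row = np.arange(n, dtype=np.float16)        # stand-in for Ib_am(j, k, 0..n-1)

vregs = row.reshape(p, L, 4)                # register index x lane index x fp16 slot
VR0, VR1 = vregs[0], vregs[1]
assert VR0[0].tolist() == [0.0, 1.0, 2.0, 3.0]
```

With L = 8 and p = 2 this reproduces the example's counts: each register carries 32 half-precision values, and two registers carry the full 64-element row.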

Step 9. Multiply the vectorized data VR50 of Wbsm(i,0,k) with the k-th row data VR0, VR1...VRp-1 of Ibam(j) in multiply-accumulate operations. Because the architecture integrates p vector functional units, these multiply-accumulates can issue in the same cycle, with the L vector processing elements operating in parallel; the results are stored in vector registers VR10, VR11...VR10+p-1;

For example, VR50 is multiply-accumulated with VR0 and VR1 (taking L = 8, p = 2), and the results are kept in VR10 and VR11; since the initial values of VR10 and VR11 are 0, the multiply-accumulate result is just the product, as shown in Fig. 7.

Step 10. Since vector registers VR51...VR50+m-1 hold the vectorized data of the weight sub-blocks Wbsm(i,1,k)...Wbsm(i,m-1,k) and vector registers VR0, VR1...VRp-1 hold the k-th row of the input sub-block Ibam(j), repeat step 9: multiply each group of vectorized weight data with the k-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1...VR10+m×p-1, with the L vector processing elements operating in parallel, traversing the k-th column of Wbsm(i) until the multiply-accumulate of the k-th column of Wbsm(i) with the k-th row of Ibam(j) is complete, as shown in Fig. 8;
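Stripped of the register allocation, steps 9–10 compute the block product Wbsm(i) × Ibam(j) as a sum of rank-1 updates: for each k, every broadcast weight scalar is multiplied element-wise with the loaded input row and accumulated. A plain-array sketch (example sizes only, float32 accumulators standing in for the VR10... registers):

```python
# rank-1-update view of the steps 9-10 inner kernel, checked against a reference GEMM
import numpy as np

rng = np.random.default_rng(1)
m, K, n = 6, 4, 64
Wb = rng.standard_normal((m, K)).astype(np.float16)   # weight sub-block Wb_sm(i)
Ib = rng.standard_normal((K, n)).astype(np.float16)   # input sub-block Ib_am(j)

acc = np.zeros((m, n), dtype=np.float32)   # stand-in for accumulators VR10..VR10+m*p-1
for k in range(K):                         # traverse weight columns / input rows
    row_k = Ib[k].astype(np.float32)       # vector-loaded input row (steps 8/14)
    for m1 in range(m):                    # one broadcast weight per sub-block row
        acc[m1] += np.float32(Wb[m1, k]) * row_k   # fused multiply-accumulate (step 9)

assert np.allclose(acc, Wb.astype(np.float32) @ Ib.astype(np.float32), atol=1e-4)
```

The m1 loop is what the m weight vector registers VR50...VR50+m-1 unroll in hardware, and the element-wise width of `row_k` is what the p×L lanes cover per cycle.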

Step 11. If k+1 is less than K, continue to step 12; otherwise, jump to step 19;

Step 12. Based on the Wbsm(i,0,k+1)...Wbsm(i,m-1,k+1) data held in R[16:31] of scalar registers R30, R31...R30+m-1, perform a high-half extension on R30, R31...R30+m-1, replicating the high 16 bits R[16:31] of the low 32 bits of each register into d-bit data stored in scalar registers R40, R41...R40+m-1, where d is the bit width of a scalar register;

For example, with d = 64, the extension instruction for the high 16 bits of the low 32 bits in step 12 is implemented as shown in Fig. 9.

Step 13. Based on the replicated and extended data held in scalar registers R40, R41...R40+m-1, broadcast R40, R41...R40+m-1 in turn and store the broadcast data in vector registers VR50, VR51...VR50+m-1, so that the L vector processing elements hold identical data; the vectorization of the (k+1)-th column of Wbsm(i) is then complete;

For example, when k = 0, the (k+1)-th column of Wbsm(i) is vectorized as described; the broadcast is implemented as shown in Fig. 10.

Step 14. Load the (k+1)-th row data Ibam(j,k+1,0)...Ibam(j,k+1,n-1) of the input sub-block matrix Ibam(j) in the AM space into the p vector registers VR0, VR1...VRp-1, where p denotes the number of vector functional units in the VLIW architecture; the minimum granularity of a single vector load is

(L×d)/8 bytes,

so a single load can fetch at least

(L×d)/16 half-precision values;

For example, take the first input sub-block Ibam(0) = 4×64, with K = 4 and N = 64. When k+1 = 1, a vector load instruction loads the data of the second row of Ibam(0) into the p vector registers VR0, VR1...VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in Fig. 11.

Step 15. Multiply the vectorized data VR50 of Wbsm(i,0,k+1) with the (k+1)-th row data VR0, VR1...VRp-1 of Ibam(j) in multiply-accumulate operations. Because the architecture integrates p vector functional units, these multiply-accumulates can issue in the same cycle, with the L vector processing elements operating in parallel; the results are stored in vector registers VR10, VR11...VR10+p-1;

For example, when k+1 = 1, VR50 is multiply-accumulated with VR0 and VR1, adding onto the row-k multiply-accumulate data already in VR10 and VR11, and the results remain in VR10 and VR11; taking L = 8 and p = 2, the implementation is shown in Fig. 12.

Step 16. Since vector registers VR51...VR50+m-1 hold the vectorized data of the weight sub-blocks Wbsm(i,1,k+1)...Wbsm(i,m-1,k+1) and vector registers VR0, VR1...VRp-1 hold the (k+1)-th row of the input sub-block Ibam(j), repeat step 15: multiply each group of vectorized weight data with the (k+1)-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1...VR10+m×p-1, with the L vector processing elements operating in parallel, traversing the (k+1)-th column of Wbsm(i) until the multiply-accumulate of the (k+1)-th column of Wbsm(i) with the (k+1)-th row of Ibam(j) is complete, as shown in Fig. 13;

Step 17. Let k = k + 2;

Step 18. If k is less than K, return to step 5; otherwise, proceed to step 19;

Step 19. At this point the conv1×1 computation of the weight sub-block matrix Wbsm(i) with the input sub-block matrix Ibam(j) is complete. When Wbsm(i) has been traversed to its last column and Ibam(j) to its last row, as shown in Fig. 14, temporarily store the results held in vector registers VR10, VR11...VR10+m×p-1 at the AM location AMtemp;

Step 20. Invoke a direct memory access operation to store the feature map data held at the AM location AMtemp to the designated location in the double data rate synchronous dynamic random access memory;

Step 21. Let j = j + 1;

Step 22. If j is less than x2, invoke a direct memory access operation to load the input sub-block matrix Ibam(j) into the on-chip AM space, return to step 3, and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; otherwise, proceed to step 23;

Step 23. Let i = i + 1;

Step 24. If i is less than x1, invoke a direct memory access operation to load the weight sub-block matrix Wbsm(i) into the on-chip SM space, return to step 2, and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; otherwise, the conv1×1 computation over all the weight data Wddr and input data Iddr is complete.
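The loop nest of steps 1–24 can be condensed into an end-to-end sketch: a pure-NumPy stand-in (no DMA, SM, or AM modeling) that iterates weight blocks i, input blocks j, and column pairs k, and checks the assembled output against a reference GEMM. Sizes are assumed example values chosen to divide evenly, with K even so the k/k+1 pairing is exact:

```python
# end-to-end blocked conv1x1-as-GEMM, mirroring the i/j/k loop structure of steps 1-24
import math
import numpy as np

rng = np.random.default_rng(2)
M, K, N = 12, 4, 128
m, n = 6, 64                               # block sizes (assumed to fit SM/AM)
x1, x2 = math.ceil(M / m), math.ceil(N / n)

W = rng.standard_normal((M, K)).astype(np.float16)   # weight matrix in "DDR"
I = rng.standard_normal((K, N)).astype(np.float16)   # input matrix in "DDR"
out = np.zeros((M, N), dtype=np.float32)

for i in range(x1):                        # step 24: next weight block into SM
    Wb = W[i * m:(i + 1) * m]
    for j in range(x2):                    # step 22: next input block into AM
        Ib = I[:, j * n:(j + 1) * n]
        acc = np.zeros((m, n), dtype=np.float32)     # step 4: zeroed accumulators
        for k in range(0, K, 2):           # steps 5-17: two columns/rows per pass
            for kk in (k, k + 1):
                if kk < K:                 # step 11 guard for the second column
                    acc += np.outer(Wb[:, kk].astype(np.float32),
                                    Ib[kk].astype(np.float32))
        out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc   # steps 19-20: AM_temp -> DDR

ref = W.astype(np.float32) @ I.astype(np.float32)
assert np.allclose(out, ref, atol=1e-3)
```

The check confirms that the blocked traversal produces exactly the full GEMM, i.e., the conv1×1 result, tile by tile.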

In summary, the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention exploits the architectural features of the vector processor to vectorize the conv1×1 computation for the vector processor architecture, improving FLOPs while preserving accuracy.

As shown in Fig. 15, a schematic structural diagram of an embodiment of a vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention, the system may include:

a storage module 1501, configured to store the half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;

a loading module 1502, configured to invoke a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;

a processing module 1503, configured to vectorize, in the SM space, the weight data loaded into the on-chip SM space, and to perform, in the AM space, the convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
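As a minimal structural sketch of the three modules in Fig. 15 — the class name, method names, and the dictionary standing in for DDR are all assumptions for illustration, and the processing module is collapsed to a plain matrix product:

```python
# store -> load -> process pipeline mirroring modules 1501/1502/1503
import numpy as np

class Conv1x1System:
    def __init__(self):
        self.ddr = {}                       # stand-in for the DDR SDRAM

    def store(self, weight, inp):           # storage module 1501
        self.ddr['W'], self.ddr['I'] = weight, inp

    def load(self):                         # loading module 1502 (DMA stand-in)
        self.sm = self.ddr['W']             # weights -> on-chip SM space
        self.am = self.ddr['I']             # inputs  -> on-chip AM space

    def process(self):                      # processing module 1503 (conv1x1 as GEMM)
        return self.sm.astype(np.float32) @ self.am.astype(np.float32)

system = Conv1x1System()
system.store(np.ones((2, 3), dtype=np.float16), np.ones((3, 4), dtype=np.float16))
system.load()
out = system.process()
assert np.all(out == 3.0)
```

The separation mirrors the data path of the method: off-chip storage, DMA staging into SM/AM, then the vectorized compute stage.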

The working principle of the vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention is the same as that of the vector processor-oriented half-precision vectorized conv1×1 convolution method described above and is not repeated here.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A vector processor-oriented half-precision vectorized conv1×1 convolution method, comprising:
storing half-precision weight data and half-precision input data in a double data rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into an on-chip scalar memory (SM) space and an on-chip array memory (AM) space, respectively;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing a convolution operation conv1×1 on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weightddr has the data format [Co, Cin, ks, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as the matrix Weightddr = M×K; the half-precision input data Inputddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image, respectively, and n the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi×Wi×n, so the input data can be expressed as the matrix Inputddr = K×N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
2. The method of claim 1, wherein the invoking of the direct memory access operation loads the half-precision weight data and the half-precision input data from the double rate synchronous dynamic random access memory into an on-chip Scalar Memory (SM) space and an on-chip Array Memory (AM) space, respectively, comprising:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M/m⌉,
wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e. N = x2 × n with n = p × L × 4 and x2 = ⌈N/n⌉,
wherein p denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
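The blocking in claim 2 can be sketched as a small helper that computes the number of weight and input sub-blocks. The example values for L (vector processing elements), p (vector functional units), and m are hypothetical, chosen only to exercise the formulas:

```python
import math

def tile_counts(M, N, m, p, L):
    """Block counts per the claimed scheme: the weight matrix (M x K) is
    split along M into x1 blocks of m rows; the input matrix (K x N) is
    split along N into x2 blocks of n = p * L * 4 columns."""
    n = p * L * 4
    x1 = math.ceil(M / m)   # number of Wb_sm weight sub-blocks
    x2 = math.ceil(N / n)   # number of Ib_am input sub-blocks
    return x1, x2, n

# Hypothetical example: L = 16 lanes, p = 2 vector units, m = 6 weight rows
x1, x2, n = tile_counts(M=64, N=1024, m=6, p=2, L=16)
# n = 2 * 16 * 4 = 128, x1 = ceil(64 / 6) = 11, x2 = ceil(1024 / 128) = 8
```

The ceilings mean the last block along each dimension may be partial; the claim leaves the handling of those edge blocks to the DMA configuration.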
3. The method according to claim 2, wherein vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data, comprises the following steps:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block, i.e. the weight sub-block is denoted Wb_sm(i,m1,k) and the input sub-block is denoted Ib_am(j,k,n1);
step 4, initializing the vector registers to 0, so that the vector registers can accumulate and store the calculation results;
step 5, since the minimum granularity of a scalar load instruction is 4 bytes and one half-precision datum occupies 2 bytes, loading two half-precision data at a time into R[0:15] and R[16:31]: the k-th column data Wb_sm(i,0,k) ... Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space are loaded in sequence into the low halves R[0:15] of scalar registers R30, R31 ... R30+m-1, and simultaneously the (k+1)-th column data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) of the weight sub-block Wb_sm(i) are loaded in sequence into the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1;
step 6, based on the half-precision weight data stored in scalar registers R30, R31 ... R30+m-1, performing a low-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the low 16-bit data R[0:15] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 7, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0) ... Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 9, multiplying the vectorized data VR50 of Wb_sm(i,0,k) respectively with the k-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 10, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k) ... Wb_sm(i,m-1,k) stored in vector registers VR51 ... VR50+m-1 and the k-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 9 to multiply each group of vectorized weight data respectively with the k-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if not, skipping to step 19, and if so, continuing to step 12;
step 12, based on the data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) stored in the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1, performing a high-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the high 16-bit data R[16:31] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 13, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the broadcast data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 15, multiplying the vectorized data VR50 of Wb_sm(i,0,k+1) respectively with the (k+1)-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 16, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1) stored in vector registers VR51 ... VR50+m-1 and the (k+1)-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 15 to multiply each group of vectorized weight data respectively with the (k+1)-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11 ... VR10+m×p-1 to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading is finished, and if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading is finished, and if not, the conv1×1 calculation of all weight data W_ddr and input data I_ddr is completed.
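Steps 5–7 and 12–13 pack two fp16 weights into one 4-byte scalar load, then extend the low (or high) 16 bits and broadcast the value across the L vector processing elements. A minimal Python model of that bit manipulation follows; the lane count L = 16 is an assumption for illustration, and the functions only mimic the register semantics described in the claim, not a real instruction set:

```python
import struct

L_LANES = 16  # assumed number of vector processing elements (L)

def pack_two_fp16(a, b):
    """Model one 4-byte scalar load: column-k weight in the low 16 bits
    (R[0:15]) and column-(k+1) weight in the high 16 bits (R[16:31])."""
    lo, = struct.unpack('<H', struct.pack('<e', a))
    hi, = struct.unpack('<H', struct.pack('<e', b))
    return lo | (hi << 16)

def extend_low(reg32):
    """Low-half extension: select R[0:15] (step 6)."""
    return reg32 & 0xFFFF

def extend_high(reg32):
    """High-half extension: select R[16:31] (step 12)."""
    return (reg32 >> 16) & 0xFFFF

def broadcast(half_bits):
    """Broadcast one fp16 value to all L lanes of a vector register
    (steps 7 and 13): every processing element stores the same data."""
    value, = struct.unpack('<e', struct.pack('<H', half_bits))
    return [value] * L_LANES

r = pack_two_fp16(1.5, -2.0)       # one scalar load covers two K columns
vr_k  = broadcast(extend_low(r))   # vectorized weight for column k
vr_k1 = broadcast(extend_high(r))  # vectorized weight for column k+1
```

Packing two columns per scalar load is what lets the main loop advance k by 2 (step 17): one load feeds two rounds of vector multiply-add.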
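The overall control flow of steps 1–24 (loop over weight blocks, loop over input blocks, K-loop unrolled by two with a guard for odd K, accumulate, write back) can be modelled in plain Python. The outer-product accumulation here stands in for the broadcast-then-multiply-add of the vector FMAC units; tile sizes m and n are illustrative assumptions:

```python
import math
import numpy as np

def conv1x1_blocked(W, I, m, n):
    """Blocked conv1x1 following the claimed loop nest:
    i over weight sub-blocks (step 24), j over input sub-blocks (step 22),
    k over the shared dimension unrolled by two (steps 5-18)."""
    M, K = W.shape
    _, N = I.shape
    out = np.zeros((M, N), dtype=np.float32)
    x1, x2 = math.ceil(M / m), math.ceil(N / n)
    for i in range(x1):
        rows = slice(i * m, min((i + 1) * m, M))
        for j in range(x2):
            cols = slice(j * n, min((j + 1) * n, N))
            # step 4: zero the accumulator registers
            acc = np.zeros((rows.stop - rows.start,
                            cols.stop - cols.start), dtype=np.float32)
            k = 0
            while k < K:
                # steps 5-10: column k of W times row k of I
                acc += np.outer(W[rows, k], I[k, cols])
                if k + 1 < K:  # step 11 guard for odd K
                    # steps 12-16: column k+1 times row k+1
                    acc += np.outer(W[rows, k + 1], I[k + 1, cols])
                k += 2         # step 17
            out[rows, cols] = acc   # steps 19-20: write back this tile
    return out

rng = np.random.default_rng(0)
W = rng.standard_normal((7, 5)).astype(np.float32)  # odd K = 5
I = rng.standard_normal((5, 9)).astype(np.float32)
assert np.allclose(conv1x1_blocked(W, I, m=3, n=4), W @ I)
```

The double-buffering of two K columns per iteration is what amortizes the scalar load and broadcast cost described in steps 5–7 over two vector multiply-add rounds.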
4. A vector processor-oriented half-precision vectorized conv1×1 convolution system, comprising:
the storage module, used for storing the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
the loading module, used for calling a direct memory access operation to load the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
the processing module, used for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr of size M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the batch size of one convolution operation; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr of size K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
5. The system of claim 4, wherein the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M/m⌉, the size of m being determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e. N = x2 × n with n = p × L × 4 and x2 = ⌈N/n⌉, wherein p denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
6. The system of claim 5, wherein the processing module is specifically configured to perform the steps of:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 denotes the column index of the input sub-block, i.e. the weight sub-block is denoted Wb_sm(i,m1,k) and the input sub-block is denoted Ib_am(j,k,n1);
step 4, initializing the vector registers to 0, so that the vector registers can accumulate and store the calculation results;
step 5, since the minimum granularity of a scalar load instruction is 4 bytes and one half-precision datum occupies 2 bytes, loading two half-precision data at a time into R[0:15] and R[16:31]: the k-th column data Wb_sm(i,0,k) ... Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space are loaded in sequence into the low halves R[0:15] of scalar registers R30, R31 ... R30+m-1, and simultaneously the (k+1)-th column data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) of the weight sub-block Wb_sm(i) are loaded in sequence into the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1;
step 6, based on the half-precision weight data stored in scalar registers R30, R31 ... R30+m-1, performing a low-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the low 16-bit data R[0:15] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 7, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0) ... Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 9, multiplying the vectorized data VR50 of Wb_sm(i,0,k) respectively with the k-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 10, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k) ... Wb_sm(i,m-1,k) stored in vector registers VR51 ... VR50+m-1 and the k-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 9 to multiply each group of vectorized weight data respectively with the k-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if not, skipping to step 19, and if so, continuing to step 12;
step 12, based on the data Wb_sm(i,0,k+1) ... Wb_sm(i,m-1,k+1) stored in the high halves R[16:31] of scalar registers R30, R31 ... R30+m-1, performing a high-half extension operation on scalar registers R30, R31 ... R30+m-1, i.e. replicating and extending the high 16-bit data R[16:31] into d-bit data and storing it in scalar registers R40, R41 ... R40+m-1, where d is the bit length of a scalar register;
step 13, based on the replicated and extended data stored in scalar registers R40, R41 ... R40+m-1, performing broadcast operations on scalar registers R40, R41 ... R40+m-1 in sequence and storing the broadcast data in vector registers VR50, VR51 ... VR50+m-1, in which the L vector processing elements store the same data, thereby completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0) ... Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 ... VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × d / 8 bytes, so at least L × d / 16 half-precision data can be loaded at a time;
step 15, multiplying the vectorized data VR50 of Wb_sm(i,0,k+1) respectively with the (k+1)-th row data VR0, VR1 ... VRp-1 of Ib_am(j) and accumulating, with the L vector processing units operating in parallel, and storing the calculation results in vector registers VR10, VR11 ... VR10+p-1;
step 16, based on the vectorized data of the weight sub-block entries Wb_sm(i,1,k+1) ... Wb_sm(i,m-1,k+1) stored in vector registers VR51 ... VR50+m-1 and the (k+1)-th row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1 ... VRp-1, repeating step 15 to multiply each group of vectorized weight data respectively with the (k+1)-th row data of Ib_am(j) and accumulate the multiplication results into vector registers VR10+p, VR10+p+1 ... VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11 ... VR10+m×p-1 to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading is finished, and if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading is finished, and if not, the conv1×1 calculation of all weight data W_ddr and input data I_ddr is completed.
CN202111681136.XA 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system Active CN114330669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681136.XA CN114330669B (en) 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system


Publications (2)

Publication Number | Publication Date
CN114330669A (en) | 2022-04-12
CN114330669B (en) | 2022-09-16

Family

ID=81023239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111681136.XA Active CN114330669B (en) 2021-12-30 2021-12-30 A vector processor-oriented half-precision vectorized conv1×1 convolution method and system

Country Status (1)

Country Link
CN (1) CN114330669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114575B (en) * 2022-08-30 2023-01-31 中国人民解放军国防科技大学 Vector processor-oriented image-to-matrix row conversion method, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Vectorized Implementation Method of Valid Convolution of Convolutional Neural Network
CN113626769B (en) * 2021-10-12 2022-01-21 中国人民解放军国防科技大学 Vector processor-oriented low-bit-width data matrix vectorization transposition method and system

Also Published As

Publication number Publication date
CN114330669A (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant