CN114330669B - A vector processor-oriented half-precision vectorized conv1×1 convolution method and system - Google Patents
- Publication number: CN114330669B
- Authority: CN (China)
- Legal status: Active
Abstract
The invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method and system. The method includes: storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory (DDR); invoking direct memory access operations to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively; in the SM space, vectorizing the weight data loaded into the on-chip SM space; and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data. By exploiting the architectural features of the vector processor, the invention vectorizes the convolution computation (conv1×1) for the vector processor architecture and improves FLOPs while preserving accuracy.
Description
Technical Field
The present invention relates to the technical field of vector processors, and in particular to a vector processor-oriented half-precision vectorized conv1×1 convolution method and system.
Background Art
The vector processor is a novel architecture. As shown in Figure 1, it comprises a scalar processing unit (SPU) for scalar operations, a vector processing unit (VPU) for vector operations, and a direct memory access (DMA) component responsible for data transfers, among other parts. The SPU consists of a scalar processing element (SPE) and a scalar memory (SM). The VPU consists of L vector processing elements (VPEs) and an array memory (AM); the L VPEs operate cooperatively in single-instruction multiple-data (SIMD) fashion, and each VPE integrates three vector arithmetic units that simultaneously support fixed-point and floating-point vector operations.
A single VPE can process one 8-byte datum (e.g., FP64, Int64), two 4-byte data (e.g., FP32, Int32), or four 2-byte data (e.g., FP16) at a time. The DMA component handles data transfers between SM and DDR (double data rate synchronous dynamic random access memory) and between AM and DDR; the minimum granularity of a DMA operation is also 8 bytes.
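As a quick arithmetic check of the data widths above, the following Python sketch derives the number of FP16 lanes per VPE and the resulting input-block width n = p × L × 4 used later in this description. The values of L and p are illustrative assumptions chosen to match the worked example given below (L = 8, p = 2), not fixed properties of every such processor.

```python
# Data-width arithmetic from the description above. L and p are assumed
# example values (they match the L = 8, p = 2 worked example given later).
BYTES_PER_VPE_OP = 8      # a single VPE processes 8 bytes at a time
FP16_BYTES = 2            # one half-precision datum occupies 2 bytes

fp16_lanes_per_vpe = BYTES_PER_VPE_OP // FP16_BYTES  # 4 FP16 values per VPE
L = 8                     # assumed number of vector processing elements
p = 2                     # assumed number of vector functional units
n = p * L * fp16_lanes_per_vpe                       # block width n = p*L*4

print(fp16_lanes_per_vpe)  # 4
print(n)                   # 64
```

With these assumed values, one vector-wide operation covers n = 64 half-precision elements, which is why the worked example below uses input blocks 64 columns wide.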
Convolution is one of the core computations of neural networks, and conv1×1 is the most common convolution configuration, so its efficiency has a large impact on network performance; optimizing the convolution computation is therefore particularly important.
Summary of the Invention
In view of this, the present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method that exploits the architectural features of the vector processor to vectorize the convolution computation (conv1×1) for the vector processor architecture, improving FLOPs while preserving accuracy.
The present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method, comprising:
storing half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin is the number of input channels, and ks is the convolution kernel size; when the kernel size is 1, the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image and n is the number of samples processed in one batch of the convolution; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
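The reduction of conv1×1 to a matrix product described above can be checked numerically. The following NumPy sketch verifies that a 1×1 convolution over [Cin, Hi, Wi, n] equals the product of Weight_ddr (M × K) with Input_ddr (K × N). The sizes are illustrative assumptions, and float32 is used here so the comparison is exact; the patent itself operates on half-precision data.

```python
import numpy as np

# Checks that conv1x1 over [Cin, Hi, Wi, n] equals the matrix product
# Weight_ddr(M x K) @ Input_ddr(K x N) with N = Hi*Wi*n. Sizes are
# illustrative assumptions; float32 keeps the comparison exact.
Co, Cin, Hi, Wi, batch = 3, 4, 2, 2, 2
rng = np.random.default_rng(0)
weight = rng.standard_normal((Co, Cin, 1, 1)).astype(np.float32)
inp = rng.standard_normal((Cin, Hi, Wi, batch)).astype(np.float32)

# Direct 1x1 convolution: each output pixel is a dot product over Cin.
out_direct = np.einsum('ok,khwn->ohwn', weight[:, :, 0, 0], inp)

# GEMM view: M = Co, K = Cin, N = Hi*Wi*batch.
M, K, N = Co, Cin, Hi * Wi * batch
out_gemm = (weight.reshape(M, K) @ inp.reshape(K, N)).reshape(Co, Hi, Wi, batch)

assert np.allclose(out_direct, out_gemm)
```

The reshapes do no data movement beyond flattening [Hi, Wi, n] into the single N dimension, exactly as the text describes.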
Preferably, invoking the direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively, comprises:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K, where the value of m is jointly determined by the sizes of the SM space and the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4, p denotes the number of vector functional units in the vector processor architecture, and L denotes the number of vector processing elements.
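The partitioning above amounts to a blocked matrix multiplication: one Wb_sm block is resident in SM and one Ib_am block is resident in AM at a time. A minimal NumPy sketch with assumed sizes (M = 12, K = 4, N = 64, m = 6, n = 32, so x1 = x2 = 2) shows that the blocked computation reproduces the full product:

```python
import numpy as np

# Sketch of the blocking described above (all sizes are illustrative):
# M is split into x1 blocks of m rows (each Wb_sm = m x K held in SM),
# N is split into x2 blocks of n columns (each Ib_am = K x n held in AM).
M, K, N = 12, 4, 64
m, n = 6, 32                      # per-block sizes
x1, x2 = M // m, N // n           # here x1 = 2, x2 = 2

rng = np.random.default_rng(1)
W = rng.standard_normal((M, K)).astype(np.float32)
X = rng.standard_normal((K, N)).astype(np.float32)

out = np.zeros((M, N), dtype=np.float32)
for i in range(x1):               # one Wb_sm(i) resident in SM at a time
    Wb = W[i * m:(i + 1) * m, :]
    for j in range(x2):           # one Ib_am(j) resident in AM at a time
        Ib = X[:, j * n:(j + 1) * n]
        out[i * m:(i + 1) * m, j * n:(j + 1) * n] = Wb @ Ib

assert np.allclose(out, W @ X)    # blocked result matches the full product
```

Each (i, j) block pair produces an m × n tile of the output independently, which is what allows the DMA double-buffering of weight and input blocks described later.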
Preferably, vectorizing, in the SM space, the weight data loaded into the on-chip SM space and, in the AM space, performing the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data comprises the following steps:
Step 1: Initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) along the M dimension;
Step 2: Initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) along the N dimension;
Step 3: Initialize k = 0, where k denotes the column index within the weight sub-block Wb_sm and the row index within the input sub-block Ib_am, m1 denotes the row index within the weight sub-block, and n1 denotes the column index within the input sub-block; that is, a weight element is written Wb_sm(i, m1, k) and an input element is written Ib_am(j, k, n1);
Step 4: Initialize the vector registers to 0 so that they can accumulate and hold the computation results;
Step 5: The minimum granularity of a scalar load instruction is 4 bytes while a half-precision datum occupies 2 bytes, so a single load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register. Load the k-th column of the weight sub-block Wb_sm(i), i.e., Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k), into R[0:15] of scalar registers R30, R31, ..., R30+m-1 in turn; at the same time, the (k+1)-th column, Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1), is loaded into R[16:31] of scalar registers R30, R31, ..., R30+m-1;
Step 6: Based on the half-precision weight data held in scalar registers R30, R31, ..., R30+m-1, perform a low-half extension on these registers, replicating the low 16 bits R[0:15] of each register's low 32 bits into d-bit data stored in scalar registers R40, R41, ..., R40+m-1, where d is the bit width of one scalar register;
Step 7: Based on the replicated and extended data held in scalar registers R40, R41, ..., R40+m-1, broadcast these registers in turn into vector registers VR50, VR51, ..., VR50+m-1, so that all L vector processing elements hold the same data; the vectorization of the k-th column of Wb_sm(i) is complete;
Step 8: Load the k-th row of the input sub-block matrix Ib_am(j), i.e., Ib_am(j,k,0), ..., Ib_am(j,k,n-1), into the p vector registers VR0, VR1, ..., VRp-1, where p denotes the number of vector functional units in the very long instruction word architecture; a vector load has a fixed minimum granularity in bytes, so a single load fetches multiple half-precision values at once;
Step 9: Multiply-accumulate the vectorized data of Wb_sm(i,0,k) in VR50 with the k-th row data of Ib_am(j) in VR0, VR1, ..., VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11, ..., VR10+p-1;
Step 10: Since vector registers VR51, ..., VR50+m-1 hold the vectorized data of Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VRp-1 hold the k-th row of the input sub-block Ib_am(j), repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1, ..., VR10+m×p-1; throughout this process the L vector processing elements operate in parallel, traversing the k-th column of Wb_sm(i) until the multiply-accumulate of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11: Check whether k+1 is less than K; if so, continue with step 12; if not, jump to step 19;
Step 12: Based on the data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) held in R[16:31] of scalar registers R30, R31, ..., R30+m-1, perform a high-half extension on these registers, replicating the high 16 bits R[16:31] of each register's low 32 bits into d-bit data stored in scalar registers R40, R41, ..., R40+m-1, where d is the bit width of one scalar register;
Step 13: Based on the replicated and extended data held in scalar registers R40, R41, ..., R40+m-1, broadcast these registers in turn into vector registers VR50, VR51, ..., VR50+m-1, so that all L vector processing elements hold the same data; the vectorization of the (k+1)-th column of Wb_sm(i) is complete;
Step 14: Load the (k+1)-th row of the input sub-block matrix Ib_am(j), i.e., Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1), into the p vector registers VR0, VR1, ..., VRp-1, where p denotes the number of vector functional units in the very long instruction word architecture; a vector load has a fixed minimum granularity in bytes, so a single load fetches multiple half-precision values at once;
Step 15: Multiply-accumulate the vectorized data of Wb_sm(i,0,k+1) in VR50 with the (k+1)-th row data of Ib_am(j) in VR0, VR1, ..., VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11, ..., VR10+p-1;
Step 16: Since vector registers VR51, ..., VR50+m-1 hold the vectorized data of Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VRp-1 hold the (k+1)-th row of the input sub-block Ib_am(j), repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1, ..., VR10+m×p-1; throughout this process the L vector processing elements operate in parallel, traversing the (k+1)-th column of Wb_sm(i) until the multiply-accumulate of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17: Set k = k + 2;
Step 18: Check whether k is less than K; if so, return to step 5; if not, continue with step 19;
Step 19: Temporarily store the results held in vector registers VR10, VR11, ..., VR10+m×p-1 to the AM location AM_temp;
Step 20: Invoke a direct memory access operation to store the feature map data held at the AM location AM_temp to the designated location in the double data rate synchronous dynamic random access memory;
Step 21: Set j = j + 1;
Step 22: Check whether j is less than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and, once loaded, return to step 3; if not, continue with step 23;
Step 23: Set i = i + 1;
Step 24: Check whether i is less than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and, once loaded, return to step 2; if not, the conv1×1 computation of all weight data W_ddr with all input data I_ddr is complete.
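Steps 1 to 24 above can be sketched in NumPy as a plain blocked multiply-accumulate loop. This is a functional model only, not the register-level implementation: the broadcast of each weight element across a row stands in for the scalar-load/extend/broadcast sequence, and the inner pair loop mirrors the consumption of columns k and k+1 from one packed scalar load. All sizes are illustrative assumptions.

```python
import numpy as np

# Functional model of steps 1-24: for each weight sub-block Wb_sm(i) and
# input sub-block Ib_am(j), column k of the weights is broadcast and
# multiply-accumulated with row k of the inputs; columns are consumed in
# pairs (k, k+1), mirroring the packed low/high 16-bit scalar loads.
M, K, N = 12, 4, 64
m, n = 6, 32
x1, x2 = M // m, N // n

rng = np.random.default_rng(2)
W = rng.standard_normal((M, K)).astype(np.float32)
X = rng.standard_normal((K, N)).astype(np.float32)
out = np.zeros((M, N), dtype=np.float32)

for i in range(x1):                        # step 24: loop over weight blocks
    Wb = W[i * m:(i + 1) * m, :]
    for j in range(x2):                    # step 22: loop over input blocks
        Ib = X[:, j * n:(j + 1) * n]
        acc = np.zeros((m, n), dtype=np.float32)     # step 4: zeroed registers
        for k in range(0, K, 2):           # step 17: k advances by 2
            for kk in (k, k + 1):          # low half, then high half
                if kk >= K:                # step 11: odd K, no (k+1)-th column
                    break
                for m1 in range(m):        # steps 9-10: broadcast-and-MAC
                    acc[m1, :] += Wb[m1, kk] * Ib[kk, :]
        out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc  # steps 19-20: store

assert np.allclose(out, W @ X)
```

The final assertion confirms that the step sequence computes exactly the matrix product Weight_ddr × Input_ddr established earlier.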
A vector processor-oriented half-precision vectorized conv1×1 convolution system, comprising:
a storage module configured to store half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory;
a loading module configured to invoke direct memory access operations to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
a processing module configured to vectorize, in the SM space, the weight data loaded into the on-chip SM space and, in the AM space, to perform the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin is the number of input channels, and ks is the convolution kernel size; when the kernel size is 1, the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image and n is the number of samples processed in one batch of the convolution; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
Preferably, the loading module is specifically configured to:
invoke a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K, where the value of m is jointly determined by the sizes of the SM space and the AM space;
invoke a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4, p denotes the number of vector functional units in the vector processor architecture, and L denotes the number of vector processing elements.
Preferably, the processing module is specifically configured to perform the following steps:
Step 1: Initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) along the M dimension;
Step 2: Initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) along the N dimension;
Step 3: Initialize k = 0, where k denotes the column index within the weight sub-block Wb_sm and the row index within the input sub-block Ib_am, m1 denotes the row index within the weight sub-block, and n1 denotes the column index within the input sub-block; that is, a weight element is written Wb_sm(i, m1, k) and an input element is written Ib_am(j, k, n1);
Step 4: Initialize the vector registers to 0 so that they can accumulate and hold the computation results;
Step 5: The minimum granularity of a scalar load instruction is 4 bytes while a half-precision datum occupies 2 bytes, so a single load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register. Load the k-th column of the weight sub-block Wb_sm(i), i.e., Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k), into R[0:15] of scalar registers R30, R31, ..., R30+m-1 in turn; at the same time, the (k+1)-th column, Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1), is loaded into R[16:31] of scalar registers R30, R31, ..., R30+m-1;
Step 6: Based on the half-precision weight data held in scalar registers R30, R31, ..., R30+m-1, perform a low-half extension on these registers, replicating the low 16 bits R[0:15] of each register's low 32 bits into d-bit data stored in scalar registers R40, R41, ..., R40+m-1, where d is the bit width of one scalar register;
Step 7: Based on the replicated and extended data held in scalar registers R40, R41, ..., R40+m-1, broadcast these registers in turn into vector registers VR50, VR51, ..., VR50+m-1, so that all L vector processing elements hold the same data; the vectorization of the k-th column of Wb_sm(i) is complete;
Step 8: Load the k-th row of the input sub-block matrix Ib_am(j), i.e., Ib_am(j,k,0), ..., Ib_am(j,k,n-1), into the p vector registers VR0, VR1, ..., VRp-1, where p denotes the number of vector functional units in the very long instruction word architecture; a vector load has a fixed minimum granularity in bytes, so a single load fetches multiple half-precision values at once;
Step 9: Multiply-accumulate the vectorized data of Wb_sm(i,0,k) in VR50 with the k-th row data of Ib_am(j) in VR0, VR1, ..., VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11, ..., VR10+p-1;
Step 10: Since vector registers VR51, ..., VR50+m-1 hold the vectorized data of Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VRp-1 hold the k-th row of the input sub-block Ib_am(j), repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1, ..., VR10+m×p-1; throughout this process the L vector processing elements operate in parallel, traversing the k-th column of Wb_sm(i) until the multiply-accumulate of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11: Check whether k+1 is less than K; if so, continue with step 12; if not, jump to step 19;
Step 12: Based on the data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) held in R[16:31] of scalar registers R30, R31, ..., R30+m-1, perform a high-half extension on these registers, replicating the high 16 bits R[16:31] of each register's low 32 bits into d-bit data stored in scalar registers R40, R41, ..., R40+m-1, where d is the bit width of one scalar register;
Step 13: Based on the replicated and extended data held in scalar registers R40, R41, ..., R40+m-1, broadcast these registers in turn into vector registers VR50, VR51, ..., VR50+m-1, so that all L vector processing elements hold the same data; the vectorization of the (k+1)-th column of Wb_sm(i) is complete;
Step 14: Load the (k+1)-th row of the input sub-block matrix Ib_am(j), i.e., Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1), into the p vector registers VR0, VR1, ..., VRp-1, where p denotes the number of vector functional units in the very long instruction word architecture; a vector load has a fixed minimum granularity in bytes, so a single load fetches multiple half-precision values at once;
Step 15: Multiply-accumulate the vectorized data of Wb_sm(i,0,k+1) in VR50 with the (k+1)-th row data of Ib_am(j) in VR0, VR1, ..., VRp-1, with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11, ..., VR10+p-1;
Step 16: Since vector registers VR51, ..., VR50+m-1 hold the vectorized data of Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VRp-1 hold the (k+1)-th row of the input sub-block Ib_am(j), repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1, ..., VR10+m×p-1; throughout this process the L vector processing elements operate in parallel, traversing the (k+1)-th column of Wb_sm(i) until the multiply-accumulate of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17: Set k = k + 2;
Step 18: Check whether k is less than K; if so, return to step 5; if not, continue with step 19;
Step 19: Temporarily store the results held in vector registers VR10, VR11, ..., VR10+m×p-1 to the AM location AM_temp;
Step 20: Invoke a direct memory access operation to store the feature map data held at the AM location AM_temp to the designated location in the double data rate synchronous dynamic random access memory;
Step 21: Set j = j + 1;
Step 22: Check whether j is less than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and, once loaded, return to step 3; if not, continue with step 23;
Step 23: Set i = i + 1;
Step 24: Check whether i is less than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and, once loaded, return to step 2; if not, the conv1×1 computation of all weight data W_ddr with all input data I_ddr is complete.
In summary, the present invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method. First, half-precision weight data and half-precision input data are stored in double data rate synchronous dynamic random access memory; then direct memory access operations are invoked to load the half-precision weight data and half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively. In the SM space, the weight data loaded into the on-chip SM space is vectorized; in the AM space, the conv1×1 convolution is performed between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data. The data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin is the number of input channels, and ks is the convolution kernel size; when the kernel size is 1, the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image and n is the number of samples processed in one batch of the convolution; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension. By exploiting the architectural features of the vector processor, the invention vectorizes the convolution computation (conv1×1) for the vector processor architecture and improves FLOPs while preserving accuracy.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of the general architecture of a vector processor;
Figure 2 is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method provided by the present invention;
Figure 3 is a schematic diagram of the scalar loading of Wb_sm(0, m1, k) disclosed in the present invention;
Figure 4 is a schematic diagram of the low 16-bit extension of a scalar register disclosed in the present invention;
Figure 5 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;
Figure 6 is a schematic diagram of the vector loading of Ib_am(0, 0, n1) disclosed in the present invention;
Figure 7 is a schematic diagram of the vector multiply-accumulate of Wb_sm(i, 0, k) with the k-th row of the input disclosed in the present invention;
Figure 8 is a schematic diagram of the vector multiply-accumulate of the k-th column of the weights with the k-th row of the input disclosed in the present invention;
Figure 9 is a schematic diagram of the high 16-bit extension of a scalar register disclosed in the present invention;
Figure 10 is a schematic diagram of a broadcast implementation for a scalar register disclosed in the present invention;
Figure 11 is a schematic diagram of the vector loading of Ib_am(0, 1, n1) disclosed in the present invention;
Figure 12 is a schematic diagram of the vector multiply-accumulate of Wb_sm(i, 0, k+1) with the (k+1)-th row of the input disclosed in the present invention;
Figure 13 is a schematic diagram of the vector multiply-accumulate of the (k+1)-th column of the weights with the (k+1)-th row of the input;
Figure 14 is a schematic diagram of the vector multiply-accumulate of the last column of the weights with the last row of the input;
Figure 15 is a schematic structural diagram of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
As shown in Figure 2, a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention, the method may include the following steps:
S201: Store half-precision weight data and half-precision input data in double data rate synchronous dynamic random access memory.
When vectorized convolution of half-precision data is required on a vector processor, the half-precision weight data and half-precision input data are first stored in DDR (double data rate synchronous dynamic random access memory). The data format of the half-precision weight data Weight_ddr is [Co, Cin, ks, ks], where Co is the number of output channels, Cin is the number of input channels, and ks is the convolution kernel size; when the kernel size is 1, the format can likewise be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The data format of the half-precision input data Input_ddr is [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image and n is the number of samples processed in one batch of the convolution; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
S202: Invoke direct memory access operations to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively.
Specifically, a direct memory access operation is invoked to load the half-precision weight matrix W_ddr into the on-chip SM space, partitioning the original data along the M dimension (the output channel dimension) into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm with Wb_sm = m × K, where the value of m is jointly determined by the sizes of the SM space and the AM space. For example, the weight block Wb_sm, whose size depends on m, must not exceed the SM space, and the sum of the size of the output produced by convolving a weight block with an input block and the size of the input block must be smaller than the AM space.
A direct memory access operation is invoked to load the half-precision input matrix I_ddr into the on-chip AM space, partitioning the original data along the N dimension (the image dimension) into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am with Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4, p denotes the number of vector functional units in the vector processor architecture, and L denotes the number of vector processing elements.
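The two capacity constraints just stated, that a weight block must fit in SM and that an input block plus its output block must fit in AM, can be written as a simple feasibility check. The capacities and block sizes below are illustrative assumptions, not the parameters of any particular chip:

```python
# Sketch of the block-size feasibility checks described above.
# SM_BYTES and AM_BYTES are assumed capacities for illustration only.
FP16 = 2                     # bytes per half-precision element
SM_BYTES = 64 * 1024         # assumed scalar-memory capacity
AM_BYTES = 768 * 1024        # assumed array-memory capacity

def block_fits(m, K, n):
    wb = m * K * FP16        # Wb_sm = m x K, held in SM
    ib = K * n * FP16        # Ib_am = K x n, held in AM
    ob = m * n * FP16        # m x n output of one block pair, also in AM
    return wb <= SM_BYTES and ib + ob < AM_BYTES

print(block_fits(6, 256, 64))    # True for these assumed sizes
```

A block-size search for m would iterate such a check, picking the largest m that satisfies both constraints for the given K and n.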
S203: In the SM space, vectorize the weight data loaded into the on-chip SM space; in the AM space, perform the conv1×1 convolution between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
Specifically, this may include the following steps:
步骤1、初始化i=0,其中,i表示权值子块矩阵Wbsm(i)在M维上的块索引;
步骤2、初始化j=0,其中,j表示输入子块矩阵Ibam(j)在N维上的块索引;
步骤3、初始化k=0,其中,k表示权值子块Wbsm的列索引和输入子块Ibam的行索引,m1表示权值子块的行索引,n1表示输入子块的列索引,即,权值子块表示为Wbsm(i,m1,k),输入子块表示为Ibam(j,k,n1);
步骤4、将向量寄存器初始化为0,以便向量寄存器累加并存储计算结果;
步骤5、标量加载指令的最小粒度为4字节,半精度数据为2字节,单次将加载两个半精度数据到指定标量寄存器的R[0:15]和R[16:31],将所述SM空间中的权值子块Wbsm(i)的第k列数据Wbsm(i,0,k)……Wbsm(i,m-1,k)依次加载到标量寄存器R30、R31...R30+m-1的R[0:15]中,同时权值子块Wbsm(i)的第k+1列数据Wbsm(i,0,k+1)……Wbsm(i,m-1,k+1)依次加载到标量寄存器R30、R31...R30+m-1的R[16:31]中;
例如,以第一个权值子块Wbsm(0)=6×4,m=6,K=4为例,k=0时,使用标量加载指令,依次将Wbsm(0)的第1列的数据加载到标量寄存器R30、R31...R30+m-1的R[0:15]中,同时将Wbsm(0)的第2列的数据加载到标量寄存器R30、R31...R30+m-1的R[16:31]中,如下图3所示。For example, take the first weight sub-block Wb sm(0) = 6×4, m=6, K=4 as an example, when k=0, use the scalar load instruction to sequentially load the first weight of Wb sm(0) The data of the column is loaded into R[0:15] of the scalar registers R30 , R31 ...R30 +m-1 , and the data of the second column of Wb sm(0) is loaded into the scalar register R30 , In R[16:31] of R 31 ...R 30+m-1 , as shown in Figure 3 below.
步骤6、基于标量寄存器R30、R31...R30+m-1存放的半精度权值数据,对标量寄存器R30、R31...R30+m-1进行低位扩展操作,将寄存器中低32位中低16位数据R[0:15]复制扩展为d位数据存储在标量寄存器R40、R41...R40+m-1中,其中,d为一个标量寄存器的位长;
例如,以d=64为例,步骤6的低32位中低16位的扩展指令实现如图4所示。For example, taking d=64 as an example, the implementation of the extended instruction of the lower 32 bits in the lower 16 bits in
步骤7、基于标量寄存器R40、R41...R40+m-1存放的复制扩展后的数据,对标量寄存器R40、R41...R40+m-1依次进行广播操作并将数据储存在向量寄存器VR50、VR51...VR50+m-1中,L个向量处理部件存储相同的数据,Wbsm5i)的第k列数据向量化完成;
例如,以L=8为例,标量寄存器R40广播到向量寄存器VR50的实现如图5所示。For example, taking L=8 as an example, the implementation of broadcasting the scalar register R 40 to the vector register VR 50 is shown in FIG. 5 .
Step 8. Load the k-th row of the input sub-block matrix Ibam(j) in the AM space, Ibam(j,k,0) … Ibam(j,k,n-1), into p vector registers VR0, VR1 … VRp-1, where p is the number of functional vector arithmetic units in the very long instruction word (VLIW) architecture; a single vector load has a fixed minimum granularity in bytes, so each load brings in a corresponding minimum number of half-precision values;
For example, take the first input sub-block Ibam(0) = 4×64 with K = 4 and N = 64. When k = 0, vector load instructions load the data of the 1st row of Ibam(0) into the p vector registers VR0, VR1 … VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in FIG. 6.
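As a rough model of the register capacity assumed in this example: if each of the L = 8 lanes holds d = 64 bits, one vector register holds L·d/16 = 32 half-precision elements, so a row of n = 64 spans p = 2 vector registers. A sketch with sizes taken from the running example (the slicing stands in for the vector load instructions):

```python
import numpy as np

L, d, p, n = 8, 64, 2, 64
elems_per_vr = L * d // 16            # 32 fp16 elements per vector register

row = np.arange(n, dtype=np.float16)  # row k of the input sub-block in AM
vrs = [row[i * elems_per_vr:(i + 1) * elems_per_vr] for i in range(p)]
# vrs[0] models VR0 (elements 0..31), vrs[1] models VR1 (elements 32..63)
```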
Step 9. Multiply-accumulate the vectorized data VR50 of Wbsm(i,0,k) with the k-th-row data VR0, VR1 … VRp-1 of Ibam(j). Because the architecture integrates p functional vector arithmetic units, these multiply-accumulate operations can all issue in the same cycle while the L vector processing units operate in parallel; the results are stored in vector registers VR10, VR11 … VR10+p-1;
For example, VR50 is multiply-accumulated with VR0 and VR1 respectively; taking L = 8 and p = 2, the results are held in VR10 and VR11. Since VR10 and VR11 are initialized to 0, the multiply-accumulate result equals the plain product; the implementation is shown in FIG. 7.
Step 10. Vector registers VR51 … VR50+m-1 hold the vectorized weight data Wbsm(i,1,k) … Wbsm(i,m-1,k), and vector registers VR0, VR1 … VRp-1 hold the k-th row of the input sub-block Ibam(j). Repeat step 9 to multiply each group of vectorized weights with the k-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1. The L vector processing units operate in parallel throughout, traversing the k-th column of Wbsm(i), until the multiply-accumulate of the k-th column of Wbsm(i) with the k-th row of Ibam(j) is complete, as shown in FIG. 8;
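Arithmetically, steps 4–10 amount to a rank-1 update of the accumulators: column k of the weight sub-block times row k of the input sub-block, added into an m×n result. A NumPy equivalent of the arithmetic (not of the register allocation; fp32 accumulation is used here for checking, whereas the hardware accumulates in its vector registers):

```python
import numpy as np

m, K, n = 6, 4, 64
rng = np.random.default_rng(0)
Wb = rng.standard_normal((m, K)).astype(np.float16)   # weight sub-block in SM
Ib = rng.standard_normal((K, n)).astype(np.float16)   # input sub-block in AM

acc = np.zeros((m, n), dtype=np.float32)              # step 4: zeroed accumulators
# steps 5-10 for one k: broadcast Wb[:, k] and multiply-accumulate with Ib[k]
for k in range(K):
    acc += np.outer(Wb[:, k].astype(np.float32), Ib[k].astype(np.float32))
# after all K columns, acc equals the sub-block product Wb @ Ib
```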
Step 11. Determine whether k+1 is less than K; if not, jump to step 19; if so, continue with step 12;
Step 12. Based on the Wbsm(i,0,k+1) … Wbsm(i,m-1,k+1) data held in R[16:31] of scalar registers R30, R31 … R30+m-1, perform a high-half extension: replicate the high 16 bits R[16:31] of the low 32 bits of each register to form d-bit data, and store it in scalar registers R40, R41 … R40+m-1, where d is the bit width of one scalar register;
For example, taking d = 64, the extension instruction that replicates the high 16 bits of the low 32 bits in step 12 is implemented as shown in FIG. 9.
Step 13. Based on the replicated-and-extended data held in scalar registers R40, R41 … R40+m-1, broadcast each of them in turn and store the broadcast data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing units hold the same data; the vectorization of the (k+1)-th column of Wbsm(i) is then complete;
For example, when k = 0, the (k+1)-th column of Wbsm(i) is vectorized in the same way; the broadcast is implemented as shown in FIG. 10.
Step 14. Load the (k+1)-th row of the input sub-block matrix Ibam(j) in the AM space, Ibam(j,k+1,0) … Ibam(j,k+1,n-1), into p vector registers VR0, VR1 … VRp-1, where p is the number of functional vector arithmetic units in the VLIW architecture; a single vector load has a fixed minimum granularity in bytes, so each load brings in a corresponding minimum number of half-precision values;
For example, take the first input sub-block Ibam(0) = 4×64 with K = 4 and N = 64. When k+1 = 1, vector load instructions load the data of the 2nd row of Ibam(0) into the p vector registers VR0, VR1 … VRp-1; with L = 8 and p = 2 as above, the vector load is implemented as shown in FIG. 11.
Step 15. Multiply-accumulate the vectorized data VR50 of Wbsm(i,0,k+1) with the (k+1)-th-row data VR0, VR1 … VRp-1 of Ibam(j). Because the architecture integrates p functional vector arithmetic units, these multiply-accumulate operations can all issue in the same cycle while the L vector processing units operate in parallel; the results are stored in vector registers VR10, VR11 … VR10+p-1;
For example, when k+1 = 1, VR50 is multiply-accumulated with VR0 and VR1 respectively; the products are accumulated onto the row-k multiply-accumulate data already in VR10 and VR11, and the results remain in VR10 and VR11. Taking L = 8 and p = 2, the implementation is shown in FIG. 12.
Step 16. Vector registers VR51 … VR50+m-1 hold the vectorized weight data Wbsm(i,1,k+1) … Wbsm(i,m-1,k+1), and vector registers VR0, VR1 … VRp-1 hold the (k+1)-th row of the input sub-block Ibam(j). Repeat step 15 to multiply each group of vectorized weights with the (k+1)-th row of Ibam(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1. The L vector processing units operate in parallel throughout, traversing the (k+1)-th column of Wbsm(i), until the multiply-accumulate of the (k+1)-th column of Wbsm(i) with the (k+1)-th row of Ibam(j) is complete, as shown in FIG. 13;
Step 17. Let k = k + 2;
Step 18. Determine whether k is less than K; if so, return to step 5; if not, proceed to step 19;
Step 19. At this point the conv1×1 computation of the weight sub-block matrix Wbsm(i) with the input sub-block matrix Ibam(j) is complete. When Wbsm(i) has been traversed to its last column and Ibam(j) to its last row, as shown in FIG. 14, the results held in vector registers VR10, VR11 … VR10+m×p-1 are temporarily stored to the AM-space location AMtemp;
Step 20. Invoke a direct memory access (DMA) operation to store the feature map data held at the AM-space location AMtemp to the designated location in the double data rate synchronous dynamic random access memory;
Step 21. Let j = j + 1;
Step 22. Determine whether j is less than x2; if so, invoke a DMA operation to load the input sub-block matrix Ibam(j) into the on-chip AM space, then return to step 3 and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; if not, proceed to step 23;
Step 23. Let i = i + 1;
Step 24. Determine whether i is less than x1; if so, invoke a DMA operation to load the weight sub-block matrix Wbsm(i) into the on-chip SM space, then return to step 2 and repeat the scalar data loading, replicate-extension, broadcast, vector data loading, and vector multiply-accumulate operations above; if not, the conv1×1 computation of all the weight data Wddr with all the input data Iddr is complete.
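The overall control flow of steps 1–24 reduces to a blocked half-precision matrix product. A compact NumPy reference model follows, with block sizes m = 6 and n = 64 taken from the running example; DMA transfers, SM/AM staging, and register-level detail are omitted, and fp32 accumulation stands in for the vector-register accumulators:

```python
import numpy as np

def conv1x1_blocked(W, I, m=6, n=64):
    """Blocked conv1x1: W is (M, K) half-precision weights, I is (K, N)
    half-precision inputs; each (m x K) by (K x n) tile pair is processed
    as in steps 1-24, accumulating in fp32."""
    M, K = W.shape
    K2, N = I.shape
    assert K == K2 and M % m == 0 and N % n == 0
    out = np.zeros((M, N), dtype=np.float32)
    for i in range(M // m):                 # steps 23-24: weight-block loop
        Wb = W[i * m:(i + 1) * m, :]
        for j in range(N // n):             # steps 21-22: input-block loop
            Ib = I[:, j * n:(j + 1) * n]
            acc = np.zeros((m, n), dtype=np.float32)   # step 4
            for k in range(K):              # steps 5-18: column/row k MAC
                acc += np.outer(Wb[:, k].astype(np.float32),
                                Ib[k].astype(np.float32))
            out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc  # steps 19-20
    return out

rng = np.random.default_rng(1)
W = rng.standard_normal((12, 4)).astype(np.float16)
I = rng.standard_normal((4, 128)).astype(np.float16)
Y = conv1x1_blocked(W, I)
```

Since a 1×1 convolution with N spatial positions and K input channels is exactly this matrix product, the model can be checked against a plain matmul of the same fp16 data.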
In summary, the vector-processor-oriented half-precision vectorized conv1×1 convolution method disclosed by the present invention exploits the architectural features of the vector processor to vectorize the convolution computation (conv1×1) for that architecture, improving FLOPs while preserving accuracy.
FIG. 15 is a schematic structural diagram of an embodiment of a vector-processor-oriented half-precision vectorized conv1×1 convolution system disclosed by the present invention. The system may include:
a storage module 1501, configured to store the half-precision weight data and half-precision input data in the double data rate synchronous dynamic random access memory;
a loading module 1502, configured to invoke a direct memory access operation to load the half-precision weight data and half-precision input data from the double data rate synchronous dynamic random access memory into the on-chip scalar memory (SM) space and the on-chip array memory (AM) space, respectively;
a processing module 1503, configured to, in the SM space, vectorize the weight data loaded into the on-chip SM space and, in the AM space, perform the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the post-convolution feature map data.
The working principle of the vector-processor-oriented half-precision vectorized conv1×1 convolution system disclosed by the present invention is the same as that of the vector-processor-oriented half-precision vectorized conv1×1 convolution method described above, and is not repeated here.
The various embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood with reference to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of functionality. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the present invention.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111681136.XA CN114330669B (en) | 2021-12-30 | 2021-12-30 | A vector processor-oriented half-precision vectorized conv1×1 convolution method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111681136.XA CN114330669B (en) | 2021-12-30 | 2021-12-30 | A vector processor-oriented half-precision vectorized conv1×1 convolution method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114330669A CN114330669A (en) | 2022-04-12 |
CN114330669B true CN114330669B (en) | 2022-09-16 |
Family
ID=81023239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111681136.XA Active CN114330669B (en) | 2021-12-30 | 2021-12-30 | A vector processor-oriented half-precision vectorized conv1×1 convolution method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114330669B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115114575B (en) * | 2022-08-30 | 2023-01-31 | 中国人民解放军国防科技大学 | Vector processor-oriented image-to-matrix row conversion method, device and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796235B (en) * | 2019-10-21 | 2022-03-18 | 中国人民解放军国防科技大学 | Vectorized Implementation Method of Valid Convolution of Convolutional Neural Network |
CN113626769B (en) * | 2021-10-12 | 2022-01-21 | 中国人民解放军国防科技大学 | Vector processor-oriented low-bit-width data matrix vectorization transposition method and system |
- 2021-12-30 CN CN202111681136.XA patent/CN114330669B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN114330669A (en) | 2022-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4584580B2 (en) | Multiply-and-accumulate (MAC) unit for single instruction multiple data (SIMD) instructions | |
US7337205B2 (en) | Matrix multiplication in a vector processing system | |
CN110415157B (en) | A calculation method and device for matrix multiplication | |
US8935468B2 (en) | Audio digital signal processor | |
US20040122887A1 (en) | Efficient multiplication of small matrices using SIMD registers | |
CN111639701B (en) | A method, system, device and readable storage medium for image feature extraction | |
CN113626769B (en) | Vector processor-oriented low-bit-width data matrix vectorization transposition method and system | |
JP7401513B2 (en) | Sparse matrix multiplication in hardware | |
CN114281755B (en) | Vector processor-oriented semi-precision vectorization convolution method and system | |
CN114139108B (en) | Matrix LU decomposition vectorization calculation method of vector DSP core | |
CN114330669B (en) | A vector processor-oriented half-precision vectorized conv1×1 convolution method and system | |
JP7174831B2 (en) | Video memory processing method, apparatus and recording medium based on convolutional neural network | |
US8909687B2 (en) | Efficient FIR filters | |
CN110782009A (en) | Optimization method of computing kernel based on ARMv8 system | |
CN114329326B (en) | Low-bit-width data matrix vectorization column expansion method and system of vector processor | |
US6404934B1 (en) | High speed image processing apparatus using a cascade of elongated filters programmed in a computer | |
CN117493748A (en) | Implementation method and device for low bit width data matrix vector multiplication of vector processor | |
CN116842304A (en) | A calculation method and system for irregular sparse matrices | |
CN102231624B (en) | Vectorization Implementation Method of FIR of Floating Point Complex Number Block Oriented to Vector Processor | |
EP4024206A1 (en) | Computing device and method for reusing data | |
WO2023120403A1 (en) | Calculation unit involved in merging and sorting and performing sparse matrix computation by cgra | |
US8423597B1 (en) | Method and system for adaptive matrix trimming in an inverse discrete cosine transform (IDCT) operation | |
CN119719585B (en) | Data processing method for SIMD-oriented parallel iterative solution | |
CN120147660A (en) | Image convolution optimization method based on FT-M6678 chip | |
CN114138692B (en) | Low bit width data matrix vectorization column clipping method and system for vector processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |