WO2022134465A1 - Sparse data processing method and apparatus for accelerating operation of a reconfigurable processor - Google Patents

Sparse data processing method and apparatus for accelerating operation of a reconfigurable processor

Info

Publication number
WO2022134465A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
sparse
weight
effective
unit
Prior art date
Application number
PCT/CN2021/096490
Other languages
English (en)
French (fr)
Inventor
唐士斌
欧阳鹏
Original Assignee
北京清微智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京清微智能科技有限公司
Priority to US17/904,360 (published as US20230068450A1)
Publication of WO2022134465A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 17/153 - Multidimensional correlation or convolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7867 - Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G06F 15/7871 - Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present invention relates to the field of reconfigurable processors, and in particular to a sparse data processing method and apparatus for accelerating the operation of a reconfigurable processor.
  • Neural network computing based on deep learning is widely used in fields such as image detection, image recognition and speech recognition, while the convolution and fully connected operations in neural networks consume large amounts of storage, computing and bandwidth resources, becoming a bottleneck for implementing neural networks on smart devices such as smart cameras, smart earphones and smart speakers.
  • Reconfigurable processors can be applied to deep-learning-based neural network computing.
  • Sparsification is a technique that constrains, through training, the proportion of non-zero weights among the weights used in convolution and fully connected operations, thereby reducing the overhead of storing the weights.
  • Sparsification can also be used to reduce the number of multiply-add operations in convolution and fully connected computations, and to reduce the data transmission bandwidth.
  • However, weights that are sparsified randomly during training are not conducive to fully exploiting the computing and bandwidth resources of the hardware.
  • Sparsification techniques include regular sparsification.
  • For example, the prior art proposes an aggregation-rule sparsification method.
  • However, this aggregation-rule sparsification has shortcomings in terms of algorithm accuracy convergence and achievable sparsity rate.
  • The purpose of the present invention is to provide a sparse data processing method and apparatus for accelerating the operation of a reconfigurable processor, which adopt a hardware-friendly grouping-rule sparsification strategy that is more conducive to algorithm accuracy convergence and, at the same algorithm accuracy, can provide a higher sparsity rate.
  • According to one aspect, a sparse data processing method for accelerating the operation of a reconfigurable processor is provided. The reconfigurable processor includes a PE array, the PE array includes P×Q PE units, and the method includes: dividing a sparse weight matrix to be calculated into at least one unit block; grouping the at least one unit block into at least one calculation group; and obtaining an effective weight address of each effective weight in the calculation group.
  • Optionally, the step of dividing the sparse weight matrix to be calculated into at least one unit block further includes: dividing the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block includes at least one effective weight.
  • Optionally, the step of grouping the at least one unit block into at least one calculation group further includes: grouping the unit blocks in the sparse weight matrix into at least one group along the column direction of the weight matrix, each group including at least one unit block; judging whether the total number of effective weights in each group of unit blocks is more than P×Q/2; if the total number of effective weights in a group of unit blocks is more than P×Q/2, splitting that group evenly into two groups along the column direction of the sparse weight matrix; repeating the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2; and obtaining the minimum number of unit blocks contained in each group of the sparse weight matrix as a grouping number n, and dividing the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
  • Optionally, the step of obtaining the effective weight address further includes: sequentially reading, by the PE array, each effective weight in the calculation group; and taking the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and storing it in the storage address corresponding to the current effective weight of the calculation group.
  • Optionally, the sparse data processing method further includes: reading a convolution calculation value; and performing a convolution or fully connected layer calculation.
  • Optionally, the step of reading the convolution calculation value further includes: obtaining, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and reading, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
  • Optionally, the step of performing the convolution or fully connected layer calculation further includes: performing the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
  • Optionally, the P×Q PE units in the PE array are 8×8 PE units.
  • According to another aspect, a sparse data processing apparatus for a reconfigurable processor is provided. The reconfigurable processor includes at least one PE array, and each PE array includes P×Q PE units.
  • The apparatus includes: a weight matrix dividing unit configured to divide a sparse weight matrix to be calculated into at least one unit block; a calculation group grouping unit configured to group the at least one unit block into at least one calculation group; and an effective weight address obtaining unit configured to obtain an effective weight address of each effective weight in the calculation group.
  • Optionally, the weight matrix dividing unit is further configured to: divide the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block includes at least one effective weight.
  • Optionally, the calculation group grouping unit is further configured to: group the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block; judge whether the total number of effective weights in each group of unit blocks is more than P×Q/2; if the total number of effective weights in a group of unit blocks is more than P×Q/2, split that group evenly into two groups along the column direction of the sparse weight matrix; repeat the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2; and obtain the minimum number of unit blocks contained in each group of the sparse weight matrix as the grouping number n, and divide the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
  • Optionally, the effective weight address obtaining unit is further configured to: sequentially read, by the PE array, each effective weight in the calculation group; and take the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and store it in the storage address corresponding to the current effective weight of the calculation group.
  • Optionally, the sparse data processing apparatus further includes: an extraction unit configured to read the convolution calculation value; and a calculation unit configured to perform the convolution or fully connected layer calculation.
  • Optionally, the extraction unit is further configured to: obtain, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and read, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
  • Optionally, the calculation unit is further configured to: perform the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
  • Optionally, the P×Q PE units in the PE array are 8×8 PE units.
  • FIG. 1 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a first embodiment of the present invention.
  • FIG. 2 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a second embodiment of the present invention.
  • FIG. 3 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a third embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram illustrating a sparse data processing apparatus for accelerating the operation of a reconfigurable processor according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram for explaining one example of unit block grouping of a sparse weight matrix according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram for explaining another example of unit block grouping of the sparse weight matrix according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram for explaining an example storage vector of a sparse matrix storage format according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram illustrating an example matrix of a sparse matrix storage format according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram for explaining an example feature vector of a sparse matrix storage format according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a first embodiment of the present invention.
  • Reconfigurable processors include PE arrays.
  • The PE array includes P×Q PE units.
  • Weight matrices are used in convolution calculations and fully connected operations in neural networks. Under the premise of ensuring appropriate learning accuracy, the number of neurons in the neural network should be as few as possible (structure sparse) to reduce costs, improve robustness and generalization accuracy. Therefore, sparsification techniques are usually used to constrain the proportion of non-zero weights in the weight matrix, so as to reduce the storage overhead of storing weights, reduce the number of multiplications and additions in computation, and reduce the bandwidth of data transmission.
  • the present invention provides a hardware-friendly grouping rule sparse method and accelerated hardware design, so as to facilitate the convergence of algorithm precision and provide a higher sparse rate under the same algorithm precision.
  • the sparse data processing method for accelerating the operation of the reconfigurable processor according to the present invention includes:
  • In step S101, the sparse weight matrix to be calculated is divided into at least one unit block.
  • In an embodiment, the sparse weight matrix may be divided into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit. Each unit block may include at least one effective weight.
  • For example, an M×N weight matrix can be divided into (M/P)×(N/Q) unit blocks with P×Q as the granularity.
  • As shown in FIG. 5, each of the divided unit blocks 1...64 (corresponding to the divided areas 1, 2, ..., 64) includes 8×8 cells, so that the entire 64×64 weight matrix is divided into 8×8 = 64 sub-matrices.
  • Next, in step S102, the at least one unit block is grouped into at least one calculation group.
  • The unit blocks can be grouped into calculation groups along the column direction or the row direction of the sparse weight matrix.
  • For ease of description, grouping the unit blocks into calculation groups along the column direction is taken as an example below.
  • When grouping the unit blocks into calculation groups, the total number of effective weights (i.e., non-zero weights) in all unit blocks of each calculation group should not exceed P×Q/2, because half of the P×Q PE units must be reserved as storage locations for the effective weight addresses.
  • Therefore, grouping the unit blocks into calculation groups can be achieved by the following steps:
  • grouping the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block (e.g., for the N columns of an M×N weight matrix, the M unit blocks of each column may be grouped into one group, giving N groups in total; alternatively, fewer than M unit blocks of each column, or even a single unit block, may be grouped into one group);
  • In the example of FIG. 5, the 64×64 weight matrix includes 8 columns in total, and each column includes 8 unit blocks.
  • The unit blocks of each column can be grouped into one group along the column direction of the weight matrix, giving 8 groups in total: the first group of unit blocks 1-8, the second group of unit blocks 9-16, the third group of unit blocks 17-24, the fourth group of unit blocks 25-32, the fifth group of unit blocks 33-40, the sixth group of unit blocks 41-48, the seventh group of unit blocks 49-56, and the eighth group of unit blocks 57-64.
  • It is then judged whether the total number of effective weights in each group of unit blocks is more than P×Q/2 = (8×8)/2 = 32.
  • Assume that the total number of effective weights is 20 in the first group of unit blocks 1-8, 15 in the second group of unit blocks 9-16, 10 in the third group of unit blocks 17-24, 31 in the fourth group of unit blocks 25-32, 30 in the fifth group of unit blocks 33-40, 28 in the sixth group of unit blocks 41-48, 8 in the seventh group of unit blocks 49-56, and 11 in the eighth group of unit blocks 57-64.
  • Since none of these totals exceeds 32, no group needs to be split further; the number of unit blocks in each group, 8, is taken as the grouping number n = 8, and the weight matrix is divided into 8 calculation groups along the column direction.
  • FIG. 6 also shows a 64×64 weight matrix, which includes 64 unit blocks of 8×8.
  • In a manner similar to FIG. 5, the unit blocks of each column can first be grouped into one group, giving 8 groups in total.
  • In the example of FIG. 6, however, the total number of effective weights in the first group of unit blocks 1 to 8 is assumed to be 56, which exceeds P×Q/2 = 32. Therefore, the first group of unit blocks 1 to 8 is split evenly into two groups along the column direction of the weight matrix, each containing 4 unit blocks, that is, a first subgroup of unit blocks 1-4 and a second subgroup of unit blocks 5-8. Since the total number of effective weights in the unit blocks of all groups other than the first group is less than 32, the other groups are not split further. As a result, the minimum number of unit blocks contained in any group is 4, so the grouping number is set to n = 4, and the weight matrix is divided into 16 calculation groups along the column direction.
  • Different grouping strategies can be flexibly selected according to engineering requirements. In the example of FIG. 5, eight unit blocks can be grouped into one calculation group, denoted G8, and the area of each G8 contains eight 8×8 unit blocks.
  • In the example of FIG. 6, four unit blocks can be grouped into one calculation group, denoted G4, and the area of each G4 contains four 8×8 unit blocks.
  • The grouping sparsification method adopted in this patent is applicable to weight sparsification for both convolution and fully connected computations.
  • Compared with aggregation-rule sparsification, the hardware-friendly grouping-rule sparsification strategy adopted in this patent is more conducive to algorithm accuracy convergence and can provide a higher sparsity rate at the same algorithm accuracy.
  • In step S103, the effective weight address of each effective weight in the calculation group is obtained.
  • In an embodiment, the effective weight address can be obtained in the following manner:
  • the number of zero weights between the current effective weight and the previous effective weight is taken as the effective weight address of the current effective weight, and is stored in the storage address corresponding to the current effective weight of the calculation group.
  • If the current effective weight is located at the starting point of the calculation group, the number of interval bits (the effective weight address) may be set to 0.
  • The sparsified weight matrix can be stored by means of sparse coding, in which the number of interval bits between consecutive effective weights is used as the effective weight address, thereby compressing the weight matrix.
  • In the case of G8 shown in FIG. 5, where each calculation group includes eight unit blocks, a 4-fold compression effect can be achieved.
  • FIG. 7 exemplarily shows a vector of 16 elements, in which the cells marked with the letters A, B, C and D represent effective weights and the blank cells represent zero weights. That is, the vector can be written as A000B0000000C00D.
  • Effective weight A is the starting point, so its address is 0; B is separated from A by 3 zero weights, C from B by 7, and D from C by 2. According to the storage format of the present invention, this example vector can therefore be represented as (A,0)(B,3)(C,7)(D,2).
  • Compared with the original stored vector A000B0000000C00D, the storage format according to the present invention can effectively reduce the required storage capacity and reduce the data transmission bandwidth.
  • FIG. 8 exemplarily shows a 6×4 sparse matrix.
  • The storage format of this sparse matrix is as follows.
  • Starting from the upper left corner of the matrix, proceeding from top to bottom and from left to right, the effective weight address of each effective weight in the matrix is obtained in turn.
  • The effective weight 1 in the upper left corner is separated by 0 zero weights from the previous effective weight (here, the starting point); next, effective weight 2 is separated by 3 zero weights from effective weight 1; effective weight 4 is separated by 5 zero weights from effective weight 2, and so on.
  • The sparse code of the matrix is thus obtained as (1,0)(2,3)(4,5)(3,6)(5,5), where the first value in each pair of parentheses represents the effective weight and the second value represents the effective weight address of that effective weight.
  • In a specific hardware acceleration design, the present invention can use a P×Q MAC (multiply-accumulate) array to accelerate convolution and sparsification operations.
  • In normal mode, the P×Q MAC array can read in a P-dimensional input feature vector and P×Q weights each time, and calculate a Q-dimensional output feature vector.
  • In the sparsification mode, the P×Q MAC array can read in a K-dimensional input feature vector and P×Q/2 sparsified effective weights each time.
  • The constraint matrix K×Q can be restored by extracting the effective weight address of each effective weight (that is, the interval length value in the storage format), so as to obtain the value in the K-dimensional input feature vector corresponding to each effective weight.
  • Then, the Q-dimensional output feature vector is calculated.
  • To restore the constraint matrix K×Q, the following sparse decoding can be performed: according to the sparse code, the K×Q matrix is filled in starting from the upper left corner, from top to bottom and from left to right.
  • The sparse code is decoded into the form of effective weight and effective weight address, i.e., (effective weight, effective weight address).
  • Then, each effective weight and the serial number of its position within its column of the constraint matrix K×Q are read out.
  • According to this serial number, the value at the corresponding serial number of the K-dimensional input feature vector is taken out. Each effective weight in the column is multiplied by the value taken from the corresponding serial number of the input feature vector, and the products are accumulated to obtain an output value. Repeating the above operation for each column of the K×Q matrix in sequence yields Q output values in total, forming a Q-dimensional output feature vector.
  • FIG. 2 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a second embodiment of the present invention.
  • Reconfigurable processors include PE arrays.
  • The PE array includes P×Q PE units.
  • the method for processing sparse data includes the following steps.
  • In step S201, the sparse weight matrix to be calculated is divided into at least one unit block.
  • In step S202, the at least one unit block is grouped into at least one calculation group.
  • In step S203, the effective weight address of each effective weight in the calculation group is obtained.
  • Steps S201 to S203 are the same as steps S101 to S103 in the sparse data processing method according to the first embodiment, so their description is not repeated here.
  • The sparse data processing method according to the second embodiment differs in that it further includes steps S204 and S205.
  • In step S204, the convolution calculation value is read.
  • Through the P×Q PE units in the PE array, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix are obtained according to the effective weight address of each calculation group of the sparse weight matrix. According to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight is read.
  • In step S205, the convolution or fully connected layer calculation is performed.
  • The convolution or fully connected layer calculation in the deep learning neural network model may be performed according to the convolution calculation values corresponding to the effective weights in each calculation group.
  • FIG. 3 is a schematic flowchart illustrating a method for processing sparse data for accelerating the operation of a reconfigurable processor according to a third embodiment of the present invention.
  • Reconfigurable processors include PE arrays.
  • The PE array includes P×Q PE units.
  • the method for processing sparse data includes the following steps.
  • In step S301, the sparse weight matrix to be calculated is divided into at least one unit block.
  • In step S302, the at least one unit block is grouped into at least one calculation group.
  • In step S303, the effective weight address of each effective weight in the calculation group is obtained.
  • In step S304, the convolution calculation value is read.
  • In step S305, the convolution or fully connected layer calculation is performed.
  • Steps S301 to S305 are the same as steps S201 to S205 in the sparse data processing method according to the second embodiment, so their description is not repeated here.
  • The sparse data processing method according to the third embodiment differs in that it further includes step S306.
  • In step S306, the result of the convolution or fully connected layer calculation is output.
  • In an embodiment, the result of the convolution or fully connected layer calculation in the neural network model may be output.
  • FIG. 4 is a schematic structural diagram illustrating a sparse data processing apparatus for accelerating the operation of a reconfigurable processor according to an embodiment of the present invention.
  • Reconfigurable processors include PE arrays.
  • The PE array includes P×Q PE units.
  • The sparse data processing apparatus includes a weight matrix dividing unit 401, a calculation group grouping unit 402 and an effective weight address obtaining unit 403.
  • The weight matrix dividing unit 401 is configured to divide the sparse weight matrix to be calculated into at least one unit block.
  • In an embodiment, the weight matrix dividing unit 401 may be configured to divide the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit. Each unit block may include at least one effective weight.
  • the computation group grouping unit 402 is configured to group the at least one unit block into at least one computation group.
  • In an embodiment, the calculation group grouping unit 402 may be configured to:
  • group the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block; judge whether the total number of effective weights in each group of unit blocks is more than P×Q/2; if the total number of effective weights in a group of unit blocks is more than P×Q/2, split that group evenly into two groups along the column direction of the sparse weight matrix; repeat the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2; and
  • obtain the minimum number of unit blocks contained in each group of the sparse weight matrix as the grouping number n, and divide the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
  • The effective weight address obtaining unit 403 is configured to obtain the effective weight address of each effective weight in the calculation group.
  • In an embodiment, the effective weight address obtaining unit 403 may be configured to:
  • sequentially read, by the PE array, each effective weight in the calculation group; and take the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and store it in the storage address corresponding to the current effective weight of the calculation group.
  • In an embodiment, the sparse data processing apparatus may further include an extraction unit 404 and a calculation unit 405, as indicated by the dashed lines in FIG. 4.
  • The extraction unit 404 is configured to read the convolution calculation value.
  • In an embodiment, the extraction unit 404 may be configured to:
  • obtain, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and read, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
  • the computation unit 405 is configured to perform convolutional or fully connected layer computations.
  • the computing unit 405 may be configured to perform convolutional or fully connected layer computations in the deep learning neural network model according to the convolution computation values corresponding to the effective weights in each computation group.
  • In an embodiment, the sparse data processing apparatus may further include an output unit (not shown in the figure).
  • The output unit is configured to output the result of the convolution or fully connected layer calculation.
  • In an embodiment, the output unit may be configured to output the result of the convolution or fully connected layer calculation in the neural network model.
  • In an embodiment, the PE units in the PE array are 8×8 PE units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

A sparse data processing method and apparatus for accelerating the operation of a reconfigurable processor. The reconfigurable processor includes a PE array, and the PE array includes P×Q PE units. The method includes: dividing a sparse weight matrix to be calculated into at least one unit block (S101); grouping the at least one unit block into at least one calculation group (S102); and obtaining an effective weight address of each effective weight in the calculation group (S103). The method adopts a hardware-friendly grouping-rule sparsification strategy that is more conducive to algorithm accuracy convergence and, at the same algorithm accuracy, can provide a higher sparsity rate.

Description

Sparse data processing method and apparatus for accelerating operation of a reconfigurable processor
Technical Field
The present invention relates to the field of reconfigurable processors, and in particular to a sparse data processing method and apparatus for accelerating the operation of a reconfigurable processor.
Background Art
Neural network computing based on deep learning is widely used in fields such as image detection, image recognition and speech recognition, while the convolution and fully connected operations in neural networks consume large amounts of storage, computing and bandwidth resources, becoming a bottleneck for implementing neural networks on smart devices such as smart cameras, smart earphones and smart speakers. Reconfigurable processors can be applied to deep-learning-based neural network computing.
Sparsification is a technique that constrains, through training, the proportion of non-zero weights among the weights used in convolution and fully connected operations, thereby reducing the overhead of storing the weights. Research has also shown that sparsification can likewise be used to reduce the number of multiply-add operations in convolution and fully connected computations and to reduce the data transmission bandwidth. However, weights that are sparsified randomly during training are not conducive to fully exploiting the computing and bandwidth resources of the hardware.
Sparsification techniques include regular sparsification. For example, the prior art proposes an aggregation-rule sparsification method. However, this aggregation-rule sparsification has shortcomings in terms of algorithm accuracy convergence and achievable sparsity rate.
Summary of the Invention
The purpose of the present invention is to provide a sparse data processing method and apparatus for accelerating the operation of a reconfigurable processor, which adopt a hardware-friendly grouping-rule sparsification strategy that is more conducive to algorithm accuracy convergence and, at the same algorithm accuracy, can provide a higher sparsity rate.
According to one aspect of the present invention, a sparse data processing method for accelerating the operation of a reconfigurable processor is provided. The reconfigurable processor includes a PE array, and the PE array includes P×Q PE units. The method includes: dividing a sparse weight matrix to be calculated into at least one unit block; grouping the at least one unit block into at least one calculation group; and obtaining an effective weight address of each effective weight in the calculation group.
Optionally, the step of dividing the sparse weight matrix to be calculated into at least one unit block further includes: dividing the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block includes at least one effective weight.
Optionally, the step of grouping the at least one unit block into at least one calculation group further includes: grouping the unit blocks in the sparse weight matrix into at least one group along the column direction of the weight matrix, each group including at least one unit block; judging whether the total number of effective weights in each group of unit blocks is more than P×Q/2; if the total number of effective weights in a group of unit blocks is more than P×Q/2, splitting that group evenly into two groups along the column direction of the sparse weight matrix; repeating the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2; and obtaining the minimum number of unit blocks contained in each group of the sparse weight matrix as a grouping number n, and dividing the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
Optionally, the step of obtaining the effective weight address of the at least one unit block further includes: sequentially reading, by the PE array, each effective weight in the calculation group; and taking the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and storing it in the storage address corresponding to the current effective weight of the calculation group.
Optionally, the sparse data processing method further includes: reading a convolution calculation value; and performing a convolution or fully connected layer calculation.
Optionally, the step of reading the convolution calculation value further includes: obtaining, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and reading, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
Optionally, the step of performing the convolution or fully connected layer calculation further includes: performing the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
Optionally, the P×Q PE units in the PE array are 8×8 PE units.
According to another aspect of the present invention, a sparse data processing apparatus for a reconfigurable processor is provided. The reconfigurable processor includes at least one PE array, and each PE array includes P×Q PE units. The apparatus includes: a weight matrix dividing unit configured to divide a sparse weight matrix to be calculated into at least one unit block; a calculation group grouping unit configured to group the at least one unit block into at least one calculation group; and an effective weight address obtaining unit configured to obtain an effective weight address of each effective weight in the calculation group.
Optionally, the weight matrix dividing unit is further configured to: divide the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block includes at least one effective weight.
Optionally, the calculation group grouping unit is further configured to: group the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block; judge whether the total number of effective weights in each group of unit blocks is more than P×Q/2; if the total number of effective weights in a group of unit blocks is more than P×Q/2, split that group evenly into two groups along the column direction of the sparse weight matrix; repeat the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2; and obtain the minimum number of unit blocks contained in each group of the sparse weight matrix as a grouping number n, and divide the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
Optionally, the effective weight address obtaining unit is further configured to: sequentially read, by the PE array, each effective weight in the calculation group; and take the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and store it in the storage address corresponding to the current effective weight of the calculation group.
Optionally, the sparse data processing apparatus further includes: an extraction unit configured to read the convolution calculation value; and a calculation unit configured to perform the convolution or fully connected layer calculation.
Optionally, the extraction unit is further configured to: obtain, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and read, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
Optionally, the calculation unit is further configured to: perform the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
Optionally, the P×Q PE units in the PE array are 8×8 PE units.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to a first embodiment of the present invention.
FIG. 2 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to a second embodiment of the present invention.
FIG. 3 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to a third embodiment of the present invention.
FIG. 4 is a schematic structural diagram illustrating a sparse data processing apparatus for accelerating the operation of a reconfigurable processor according to an embodiment of the present invention.
FIG. 5 is a schematic diagram for explaining one example of unit block grouping of a sparse weight matrix according to an embodiment of the present invention.
FIG. 6 is a schematic diagram for explaining another example of unit block grouping of a sparse weight matrix according to an embodiment of the present invention.
FIG. 7 is a schematic diagram for explaining an example storage vector of a sparse matrix storage format according to an embodiment of the present invention.
FIG. 8 is a schematic diagram for explaining an example matrix of a sparse matrix storage format according to an embodiment of the present invention.
FIG. 9 is a schematic diagram for explaining an example feature vector of a sparse matrix storage format according to an embodiment of the present invention.
Detailed Description of the Embodiments
In order to provide a clearer understanding of the technical features, objects and effects of the invention, specific embodiments of the present invention are now described with reference to the accompanying drawings, in which the same reference numerals denote components that are identical in structure, or similar in structure but identical in function.
Herein, "schematic" means "serving as an instance, example or illustration", and no illustration or embodiment described herein as "schematic" should be construed as a more preferred or more advantageous technical solution. To keep the drawings concise, only the parts relevant to the exemplary embodiments are schematically shown in each figure, and they do not represent the actual structure or true scale of the product.
FIG. 1 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to the first embodiment of the present invention. The reconfigurable processor includes a PE array. The PE array includes P×Q PE units.
Weight matrices are used in the convolution and fully connected operations of a neural network. Under the premise of ensuring appropriate learning accuracy, the number of neurons in the neural network should be as small as possible (structural sparsity), so as to reduce cost and improve robustness and generalization accuracy. Therefore, sparsification techniques are usually used to constrain the proportion of non-zero weights in the weight matrix, so as to reduce the storage overhead of the weights, reduce the number of multiply-add operations in computation, and reduce the data transmission bandwidth.
The present invention provides a hardware-friendly grouping-rule sparsification method and an accelerated hardware design, so as to facilitate algorithm accuracy convergence and to provide a higher sparsity rate at the same algorithm accuracy.
Specifically, as shown in FIG. 1, the sparse data processing method for accelerating the operation of a reconfigurable processor according to the present invention includes the following steps.
In step S101, the sparse weight matrix to be calculated is divided into at least one unit block.
In an embodiment, the sparse weight matrix may be divided into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit. Each unit block may include at least one effective weight.
For example, an M×N weight matrix can be divided into (M/P)×(N/Q) unit blocks with P×Q as the granularity.
As a specific example, as shown in FIG. 5, when the PE array includes 8×8 PE units (that is, P=8, Q=8), a 64×64 weight matrix (that is, M=64, N=64) can be divided into (64/8)×(64/8)=64 unit blocks, namely unit blocks 1-64 (indicated by the numbers in the boxes in the figure).
As shown in FIG. 5, each of the divided unit blocks 1...64 (corresponding to the divided areas 1, 2, ..., 64) includes 8×8 cells, so that the entire 64×64 weight matrix is divided into 8×8 = 64 sub-matrices.
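For illustration only (this sketch is not part of the original disclosure), the block division of step S101 can be written in a few lines of Python/NumPy; the function name and the assumption that M and N are exact multiples of P and Q are editorial choices.

```python
import numpy as np

def divide_into_unit_blocks(weights: np.ndarray, P: int = 8, Q: int = 8):
    """Divide an M x N sparse weight matrix into (M/P) x (N/Q) unit blocks of size P x Q."""
    M, N = weights.shape
    assert M % P == 0 and N % Q == 0, "M and N are assumed to be multiples of P and Q"
    blocks = []
    for i in range(0, M, P):          # row direction
        for j in range(0, N, Q):      # column direction
            blocks.append(weights[i:i + P, j:j + Q])
    return blocks

# Example: a 64x64 matrix with P = Q = 8 yields 64 unit blocks of size 8x8.
w = np.zeros((64, 64)); w[0, 0] = 1.0
unit_blocks = divide_into_unit_blocks(w)
print(len(unit_blocks), int(np.count_nonzero(unit_blocks[0])))  # 64 1
```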
Next, in step S102, the at least one unit block is grouped into at least one calculation group.
The unit blocks may be grouped into calculation groups along the column direction or the row direction of the sparse weight matrix. For ease of description, grouping the unit blocks into calculation groups along the column direction is taken as an example below.
When grouping the unit blocks into calculation groups, the total number of effective weights (that is, non-zero weights) in all unit blocks of each calculation group should not exceed P×Q/2.
This is because, when each calculation group is processed by the P×Q PE units, half of the P×Q PE units must be reserved, in addition to the effective weights, as storage locations for the addresses of the effective weights.
Therefore, grouping the unit blocks into calculation groups can be achieved by the following steps:
- grouping the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block (for example, for the N columns of an M×N weight matrix, the M unit blocks of each column may be grouped into one group, giving N groups in total; alternatively, fewer than M unit blocks of each column, or even a single unit block, may be grouped into one group);
- judging whether the total number of effective weights in each group of unit blocks is more than P×Q/2;
- if the total number of effective weights in a group of unit blocks is more than P×Q/2, splitting that group evenly into two groups along the column direction of the sparse weight matrix;
- repeating the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2;
- obtaining the minimum number of unit blocks contained in each group of the sparse weight matrix as the grouping number n, and dividing the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
Through the above grouping, a constraint matrix K×Q can be obtained, where K=nP. Thus, an M×N weight matrix can be divided, with K×Q as the granularity, into (M/K)×(N/Q)=(M/(n×P))×(N/Q) sub-matrices.
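Purely as an illustrative sketch (not part of the original disclosure), the judging-and-splitting procedure above can be expressed as follows for one block column, assuming the unit blocks are available as NumPy arrays; the grouping number n of the whole matrix would then be the minimum over all block columns, giving K = nP.

```python
import numpy as np

def count_effective(group):
    """Total number of effective (non-zero) weights in a group of unit blocks."""
    return sum(int(np.count_nonzero(b)) for b in group)

def group_column(column_blocks, P=8, Q=8):
    """Split one column of unit blocks until every group holds at most P*Q/2 effective
    weights, then return the groups and the smallest group size (in unit blocks)."""
    limit = P * Q // 2
    groups = [list(column_blocks)]               # start with the whole column as one group
    while any(count_effective(g) > limit and len(g) > 1 for g in groups):
        nxt = []
        for g in groups:
            if count_effective(g) > limit and len(g) > 1:
                half = len(g) // 2               # split evenly along the column direction
                nxt.extend([g[:half], g[half:]])
            else:
                nxt.append(g)
        groups = nxt
    return groups, min(len(g) for g in groups)

# The grouping number n of the whole matrix is the minimum over all block columns; K = n * P.
```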
For example, taking the example in FIG. 5, the 64×64 weight matrix includes 8 columns in total, and each column includes 8 unit blocks. The unit blocks of each column can be grouped into one group along the column direction of the weight matrix, giving 8 groups in total: the first group of unit blocks 1-8, the second group of unit blocks 9-16, the third group of unit blocks 17-24, the fourth group of unit blocks 25-32, the fifth group of unit blocks 33-40, the sixth group of unit blocks 41-48, the seventh group of unit blocks 49-56, and the eighth group of unit blocks 57-64.
Then, it is judged whether the total number of effective weights in each group of unit blocks is more than P×Q/2=(8×8)/2=32.
Now assume that the total number of effective weights is 20 in the first group of unit blocks 1-8, 15 in the second group of unit blocks 9-16, 10 in the third group of unit blocks 17-24, 31 in the fourth group of unit blocks 25-32, 30 in the fifth group of unit blocks 33-40, 28 in the sixth group of unit blocks 41-48, 8 in the seventh group of unit blocks 49-56, and 11 in the eighth group of unit blocks 57-64.
Since none of the above totals exceeds 32, no group needs to be split further. Therefore, the number of unit blocks currently contained in each group, 8, can be taken as the grouping number n, that is, n=8, and the weight matrix is divided into 8 calculation groups along the column direction according to this grouping number.
Referring further to FIG. 6, FIG. 6 shows another example of grouping the unit blocks of a weight matrix into calculation groups.
FIG. 6 likewise shows a 64×64 weight matrix including 64 unit blocks of 8×8. In a manner similar to FIG. 5, the unit blocks of each column can first be grouped into one group, giving 8 groups in total.
In the example of FIG. 6, however, it is assumed that the total number of effective weights in the first group of unit blocks 1-8 is 56, which exceeds P×Q/2=(8×8)/2=32. Therefore, the first group of unit blocks 1-8 is split evenly into two groups along the column direction of the weight matrix, each containing 4 unit blocks, that is, a first subgroup of unit blocks 1-4 and a second subgroup of unit blocks 5-8. Since the total number of effective weights in the unit blocks of every group other than the first group is less than 32, the other groups are not split.
As a result, in the current grouping of the weight matrix, the minimum number of unit blocks contained in any group is 4. Therefore, the grouping number can be set to n=4. Then, according to this grouping number 4, the weight matrix can be divided into 16 calculation groups in total along the column direction.
Different grouping strategies can be flexibly selected according to different engineering application requirements. For example, in the example of FIG. 5, eight unit blocks can be grouped into one calculation group, denoted G8, and the area of each G8 contains eight 8×8 unit blocks; in the example of FIG. 6, four unit blocks can be grouped into one calculation group, denoted G4, and the area of each G4 contains four 8×8 unit blocks.
Further, in neural network computation:
- for the weight matrix of a fully connected computation, M=fo and N=fi, where fo is the number of output feature channels and fi is the number of input feature channels;
- for the convolution weight template of a convolution computation, M=fo and N=kx*ky*fi, where fo is the number of output feature channels, fi is the number of input feature channels, and kx and ky are the dimensions of the convolution template.
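As a small worked illustration of the two formulas above (the layer sizes used are hypothetical, not taken from the patent):

```python
def weight_matrix_shape(fo, fi, kx=None, ky=None):
    """Return (M, N) of the weight matrix: fully connected if kx/ky are None, convolution otherwise."""
    if kx is None or ky is None:
        return fo, fi                  # fully connected: M = fo, N = fi
    return fo, kx * ky * fi            # convolution:     M = fo, N = kx*ky*fi

print(weight_matrix_shape(64, 64))          # (64, 64)   e.g. the 64x64 example above
print(weight_matrix_shape(64, 32, 3, 3))    # (64, 288)  a hypothetical 3x3 convolution layer
```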
Therefore, the grouping sparsification scheme adopted in this patent is applicable to weight sparsification for both convolution and fully connected computations. In addition, compared with the aggregation-rule sparsification proposed in the prior art, the hardware-friendly grouping-rule sparsification strategy adopted in this patent is more conducive to algorithm accuracy convergence and can provide a higher sparsity rate at the same algorithm accuracy.
In step S103, the effective weight address of each effective weight in the calculation group is obtained.
In an embodiment, the effective weight address can be obtained in the following manner:
each effective weight in the calculation group is read sequentially by the PE array;
the number of zero weights between the current effective weight and the previous effective weight is taken as the effective weight address of the current effective weight, and is stored in the storage address corresponding to the current effective weight of the calculation group.
It should be noted that, if the current effective weight is located at the starting point of the calculation group, the number of interval bits (the effective weight address) may be set to 0.
In the present invention, the sparsified weight matrix can be stored by means of sparse coding, in which the number of interval bits between consecutive effective weights is used as the effective weight address, thereby compressing the weight matrix. For example, in the case of G8 shown in FIG. 5 (each calculation group including eight unit blocks), a 4-fold compression effect can be achieved.
This sparse matrix storage format will be described next with reference to FIG. 7.
FIG. 7 exemplarily shows a vector of 16 elements, in which the cells marked with the letters A, B, C and D represent effective weights and the blank cells represent zero weights. That is, the vector can be written as A000B0000000C00D.
As shown in FIG. 7, effective weight A is the starting point, so its effective weight address is set to 0. The number of zero weights between effective weight B and the previous effective weight A is 3, so its effective weight address is 3. The number of zero weights between effective weight C and the previous effective weight B is 7, so its effective weight address is 7. The number of zero weights between effective weight D and the previous effective weight C is 2, so its effective weight address is 2. Therefore, according to the storage format of the present invention, this example vector can be represented as (A,0)(B,3)(C,7)(D,2).
Compared with the original stored vector A000B0000000C00D, the storage format according to the present invention can effectively reduce the required storage capacity and reduce the data transmission bandwidth.
Referring further to FIG. 8, FIG. 8 exemplarily shows a 6×4 sparse matrix. The storage format of this sparse matrix is as follows.
Starting from the upper left corner of the matrix, proceeding from top to bottom and from left to right, the effective weight address of each effective weight in the matrix is obtained in turn. As shown in FIG. 8, the matrix contains the effective weights (non-zero weights) 1, 2, 4, 3 and 5 (indicated by bold shaded boxes in the figure). In top-to-bottom, left-to-right order, the effective weight 1 in the upper left corner is separated by 0 zero weights from the previous effective weight (here, the starting point); next, effective weight 2 is separated by 3 zero weights from effective weight 1; effective weight 4 is separated by 5 zero weights from effective weight 2, and so on. Finally, the sparse code of the matrix is obtained as (1,0)(2,3)(4,5)(3,6)(5,5), where the first value in each pair of parentheses represents the effective weight and the second value represents the effective weight address of that effective weight.
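The run-length style sparse coding described above can be sketched in Python as follows. This is an editorial illustration; the positions of the effective weights in the 6×4 matrix are inferred from the sparse code and the column-wise worked example given further below, not stated explicitly in the text.

```python
import numpy as np

def sparse_encode(matrix: np.ndarray):
    """Encode a sparse matrix as (effective_weight, zero_gap) pairs,
    scanning top-to-bottom within each column, columns left-to-right."""
    flat = matrix.flatten(order="F")          # column-major scan order
    code, gap = [], 0
    for value in flat:
        if value != 0:
            code.append((float(value), gap))  # gap = zero weights since previous effective weight
            gap = 0
        else:
            gap += 1
    return code

# The 6x4 matrix of FIG. 8: effective weights 1, 2, 4, 3, 5 at the inferred positions.
m = np.zeros((6, 4))
m[0, 0], m[4, 0], m[4, 1], m[5, 2], m[5, 3] = 1, 2, 4, 3, 5
print(sparse_encode(m))  # [(1.0, 0), (2.0, 3), (4.0, 5), (3.0, 6), (5.0, 5)]
```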
In a specific hardware acceleration design, the present invention can use a P×Q MAC (multiply-accumulate) array to accelerate convolution and sparsification operations.
In normal mode, the P×Q MAC array can read in a P-dimensional input feature vector and P×Q weights each time, and calculate a Q-dimensional output feature vector.
In the sparsification mode according to the present invention, the P×Q MAC array can read in a K-dimensional input feature vector and P×Q/2 sparsified effective weights each time. During computation, the constraint matrix K×Q can be restored by extracting the effective weight address of each effective weight (that is, the interval length value in the storage format), so as to obtain the value in the K-dimensional input feature vector corresponding to each effective weight. Then, the Q-dimensional output feature vector is calculated.
When restoring the constraint matrix K×Q, the following sparse decoding can be performed: according to the sparse code, the K×Q matrix is filled in starting from the upper left corner, from top to bottom and from left to right.
Taking the 6×4 matrix in FIG. 8 as an example again, as described above, its sparse code is (1,0)(2,3)(4,5)(3,6)(5,5).
At this point, the above sparse code is decoded into the form of effective weight and effective weight address, that is, (effective weight, effective weight address). In the G8 example of FIG. 5, the constraint matrix K×Q=8×8×8 contains 2^9=512 cells in total, so its address length can be 9 bits. It should be noted that, in the constraint matrix K×Q, each column is allowed to contain at most P effective weights, so as to match the P×Q MAC array.
Then, for example by means of a logic circuit, each effective weight and the serial number of its position within its column of the constraint matrix K×Q are read out. According to that serial number, the value under the corresponding serial number of the K-dimensional input feature vector is taken out. Each effective weight in the column is multiplied by the value taken from the corresponding serial number of the input feature vector and the products are accumulated, giving an output value. Repeating the above operation for each column of the K×Q matrix in sequence yields Q output values in total, which form a Q-dimensional output feature vector.
The above steps are explained in further detail below with reference to the specific example in FIG. 8 and FIG. 9.
As shown in FIG. 8, there are two effective weights in the first column of the 6×4 matrix. The first effective weight is 1, and its serial number in this column is 1; the second effective weight is 2, and its serial number in this column is 5. Therefore, according to these serial numbers, the values under the corresponding serial numbers 1 and 5, namely 2 and 9, are taken from the input feature vector shown in FIG. 9. Then, the effective weights 1 and 2 of the first column are respectively multiplied by the values 2 and 9 taken from the same serial numbers of the input feature vector and the products are accumulated, giving the output value 1x2+2x9=20.
Next, referring to the second column of the matrix shown in FIG. 8, there is only one effective weight 4 in the second column, with serial number 5, so the value 9 under serial number 5 is taken from the input feature vector, giving the output value 4x9=36.
Next, in the third column of the matrix, the effective weight 3 is taken out, its serial number being 6, and it is multiplied and accumulated with the value 8 taken from serial number 6 of the input feature vector, giving the output value 3x8=24.
Next, in the fourth column of the matrix, the effective weight 5 is taken out, its serial number being 6, and it is multiplied and accumulated with the value 8 taken from serial number 6 of the input feature vector, giving the output value 5x8=40.
Through the above operations, four output values are obtained in total: 20, 36, 24 and 40, giving the output feature vector (20, 36, 24, 40).
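The sparse decoding and multiply-accumulate procedure of this example can likewise be sketched in software (again an editorial illustration, not the actual MAC-array hardware). The entries of the input feature vector other than serial numbers 1, 5 and 6 are not given in the text and are set to 0 here, which does not affect the result.

```python
import numpy as np

def sparse_decode(code, K, Q):
    """Rebuild the K x Q constraint matrix from (effective_weight, zero_gap) pairs,
    filling top-to-bottom within each column, columns left-to-right."""
    flat = np.zeros(K * Q)
    pos = 0
    for value, gap in code:
        pos += gap                      # skip the zero weights
        flat[pos] = value
        pos += 1
    return flat.reshape((K, Q), order="F")

def sparse_matvec(code, x, Q):
    """Compute the Q-dimensional output: one multiply-accumulate pass per column,
    out[q] = sum over k of M[k, q] * x[k]."""
    K = len(x)
    m = sparse_decode(code, K, Q)
    out = np.zeros(Q)
    for q in range(Q):
        for k in range(K):
            if m[k, q] != 0:
                out[q] += m[k, q] * x[k]
    return out

# FIG. 8 / FIG. 9 example: serial numbers 1, 5 and 6 of the input vector hold 2, 9 and 8.
code = [(1, 0), (2, 3), (4, 5), (3, 6), (5, 5)]
x = np.array([2, 0, 0, 0, 9, 8], dtype=float)
print(sparse_matvec(code, x, Q=4))      # [20. 36. 24. 40.]
```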
FIG. 2 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to the second embodiment of the present invention. The reconfigurable processor includes a PE array. The PE array includes P×Q PE units.
As shown in FIG. 2, the sparse data processing method includes the following steps.
In step S201, the sparse weight matrix to be calculated is divided into at least one unit block.
In step S202, the at least one unit block is grouped into at least one calculation group.
In step S203, the effective weight address of each effective weight in the calculation group is obtained.
The above steps S201 to S203 are the same as steps S101 to S103 in the sparse data processing method according to the first embodiment, so their description is not repeated here.
Compared with the sparse data processing method according to the first embodiment, the sparse data processing method according to the second embodiment differs in that it further includes steps S204 and S205.
In step S204, the convolution calculation value is read.
In an embodiment, the P×Q PE units in the PE array may obtain, according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix. According to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight is read.
Next, in step S205, the convolution or fully connected layer calculation is performed.
In an embodiment, the convolution or fully connected layer calculation in a deep learning neural network model may be performed according to the convolution calculation values corresponding to the effective weights in each calculation group.
FIG. 3 is a schematic flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor according to the third embodiment of the present invention. The reconfigurable processor includes a PE array. The PE array includes P×Q PE units.
As shown in FIG. 3, the sparse data processing method includes the following steps.
In step S301, the sparse weight matrix to be calculated is divided into at least one unit block.
In step S302, the at least one unit block is grouped into at least one calculation group.
In step S303, the effective weight address of each effective weight in the calculation group is obtained.
In step S304, the convolution calculation value is read.
In step S305, the convolution or fully connected layer calculation is performed.
The above steps S301 to S305 are the same as steps S201 to S205 in the sparse data processing method according to the second embodiment, so their description is not repeated here.
Compared with the sparse data processing method according to the second embodiment, the sparse data processing method according to the third embodiment differs in that it further includes step S306.
In step S306, the result of the convolution or fully connected layer calculation is output.
In an embodiment, the result of the convolution or fully connected layer calculation in the neural network model may be output.
FIG. 4 is a schematic structural diagram illustrating a sparse data processing apparatus for accelerating the operation of a reconfigurable processor according to an embodiment of the present invention. The reconfigurable processor includes a PE array. The PE array includes P×Q PE units.
As shown in FIG. 4, the sparse data processing apparatus includes a weight matrix dividing unit 401, a calculation group grouping unit 402 and an effective weight address obtaining unit 403.
The weight matrix dividing unit 401 is configured to divide the sparse weight matrix to be calculated into at least one unit block.
In an embodiment, the weight matrix dividing unit 401 may be configured to divide the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit. Each unit block may include at least one effective weight.
The calculation group grouping unit 402 is configured to group the at least one unit block into at least one calculation group.
In an embodiment, the calculation group grouping unit 402 may be configured to:
group the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group including at least one unit block;
judge whether the total number of effective weights in each group of unit blocks is more than P×Q/2;
if the total number of effective weights in a group of unit blocks is more than P×Q/2, split that group evenly into two groups along the column direction of the sparse weight matrix;
repeat the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2;
obtain the minimum number of unit blocks contained in each group of the sparse weight matrix as the grouping number n, and divide the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
The effective weight address obtaining unit 403 is configured to obtain the effective weight address of each effective weight in the calculation group.
In an embodiment, the effective weight address obtaining unit 403 may be configured to:
sequentially read, by the PE array, each effective weight in the calculation group;
take the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and store it in the storage address corresponding to the current effective weight of the calculation group.
In an embodiment, the sparse data processing apparatus may further include an extraction unit 404 and a calculation unit 405, as indicated by the dashed lines in FIG. 4.
The extraction unit 404 is configured to read the convolution calculation value.
In an embodiment, the extraction unit 404 may be configured to:
obtain, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and the storage address of the effective weight in the non-sparse weight matrix; and
read, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
The calculation unit 405 is configured to perform the convolution or fully connected layer calculation.
In an embodiment, the calculation unit 405 may be configured to perform the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
In an embodiment, the sparse data processing apparatus may further include an output unit (not shown in the figure).
The output unit is configured to output the result of the convolution or fully connected layer calculation.
In an embodiment, the output unit may be configured to output the result of the convolution or fully connected layer calculation in the neural network model.
In an embodiment, the PE units in the PE array are 8×8 PE units.
It should be understood that, although this specification is described in terms of various embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should regard the specification as a whole, and the technical solutions in the various embodiments may also be appropriately combined to form other implementations understandable to those skilled in the art.
The series of detailed descriptions listed above are merely specific descriptions of feasible embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

  1. A sparse data processing method for accelerating the operation of a reconfigurable processor, the reconfigurable processor comprising a PE array, the PE array comprising P×Q PE units, the method comprising:
    dividing a sparse weight matrix to be calculated into at least one unit block;
    grouping the at least one unit block into at least one calculation group; and
    obtaining an effective weight address of each effective weight in the calculation group.
  2. The sparse data processing method according to claim 1, wherein the step of dividing the sparse weight matrix to be calculated into at least one unit block further comprises:
    dividing the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block comprises at least one effective weight.
  3. The sparse data processing method according to claim 1, wherein the step of grouping the at least one unit block into at least one calculation group further comprises:
    grouping the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group comprising at least one unit block;
    judging whether the total number of effective weights in each group of unit blocks is more than P×Q/2;
    if the total number of effective weights in a group of unit blocks is more than P×Q/2, splitting the group evenly into two groups along the column direction of the sparse weight matrix;
    repeating the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2;
    obtaining the minimum number of unit blocks contained in each group of the sparse weight matrix as a grouping number n, and dividing the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
  4. The sparse data processing method according to claim 1, wherein the step of obtaining the effective weight address of each effective weight in the calculation group further comprises:
    sequentially reading, by the PE array, each effective weight in the calculation group;
    taking the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and storing it in a storage address corresponding to the current effective weight of the calculation group.
  5. The sparse data processing method according to claim 1, further comprising:
    reading a convolution calculation value; and
    performing a convolution or fully connected layer calculation.
  6. The sparse data processing method according to claim 5, wherein the step of reading the convolution calculation value further comprises:
    obtaining, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and a storage address of the effective weight in a non-sparse weight matrix; and
    reading, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
  7. The sparse data processing method according to claim 5, wherein the step of performing the convolution or fully connected layer calculation further comprises:
    performing the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
  8. The sparse data processing method according to claim 1, wherein the P×Q PE units in the PE array are 8×8 PE units.
  9. A sparse data processing apparatus for a reconfigurable processor, the reconfigurable processor comprising at least one PE array, each PE array comprising P×Q PE units, the apparatus comprising:
    a weight matrix dividing unit configured to divide a sparse weight matrix to be calculated into at least one unit block;
    a calculation group grouping unit configured to group the at least one unit block into at least one calculation group; and
    an effective weight address obtaining unit configured to obtain an effective weight address of each effective weight in the calculation group.
  10. The sparse data processing apparatus according to claim 9, wherein the weight matrix dividing unit is further configured to:
    divide the sparse weight matrix into at least one unit block along the row and column directions of the sparse weight matrix, taking P×Q as a division unit, wherein each unit block comprises at least one effective weight.
  11. The sparse data processing apparatus according to claim 9, wherein the calculation group grouping unit is further configured to:
    group the unit blocks in the sparse weight matrix into at least one group along the column direction of the sparse weight matrix, each group comprising at least one unit block;
    judge whether the total number of effective weights in each group of unit blocks is more than P×Q/2;
    if the total number of effective weights in a group of unit blocks is more than P×Q/2, split the group evenly into two groups along the column direction of the sparse weight matrix;
    repeat the above judging and splitting steps until the total number of effective weights in every group of unit blocks in the sparse weight matrix is less than P×Q/2;
    obtain the minimum number of unit blocks contained in each group of the sparse weight matrix as a grouping number n, and divide the sparse weight matrix into multiple calculation groups along the column direction of the sparse weight matrix according to the grouping number n.
  12. The sparse data processing apparatus according to claim 9, wherein the effective weight address obtaining unit is further configured to:
    sequentially read, by the PE array, each effective weight in the calculation group;
    take the number of zero weights between the current effective weight and the previous effective weight as the effective weight address of the current effective weight, and store it in a storage address corresponding to the current effective weight of the calculation group.
  13. The sparse data processing apparatus according to claim 9, further comprising:
    an extraction unit configured to read a convolution calculation value; and
    a calculation unit configured to perform a convolution or fully connected layer calculation.
  14. The sparse data processing apparatus according to claim 13, wherein the extraction unit is further configured to:
    obtain, through the P×Q PE units in the PE array and according to the effective weight address of each calculation group of the sparse weight matrix, the effective weight corresponding to the effective weight address and a storage address of the effective weight in a non-sparse weight matrix; and
    read, according to the storage address of the effective weight in the non-sparse weight matrix, the convolution calculation value corresponding to the effective weight.
  15. The sparse data processing apparatus according to claim 13, wherein the calculation unit is further configured to:
    perform the convolution or fully connected layer calculation in a deep learning neural network model according to the convolution calculation values corresponding to the effective weights in each calculation group.
  16. The sparse data processing apparatus according to claim 9, wherein the P×Q PE units in the PE array are 8×8 PE units.
PCT/CN2021/096490 2020-12-24 2021-05-27 Sparse data processing method and apparatus for accelerating operation of a reconfigurable processor WO2022134465A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/904,360 US20230068450A1 (en) 2020-12-24 2021-05-27 Method and apparatus for processing sparse data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011552162.8 2020-12-24
CN202011552162.8A 2020-12-24 2020-12-24 Sparse data processing method and system for accelerating operation of a reconfigurable processor

Publications (1)

Publication Number Publication Date
WO2022134465A1 (zh) 2022-06-30

Family

ID=74426070

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096490 WO2022134465A1 (zh) Sparse data processing method and apparatus for accelerating operation of a reconfigurable processor 2020-12-24 2021-05-27

Country Status (3)

Country Link
US (1) US20230068450A1 (zh)
CN (1) CN112286864B (zh)
WO (1) WO2022134465A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286864B (zh) * 2020-12-24 2021-06-04 北京清微智能科技有限公司 加速可重构处理器运行的稀疏化数据处理方法及系统
CN113076083B (zh) * 2021-06-04 2021-08-31 南京后摩智能科技有限公司 数据乘加运算电路
CN115309349B (zh) * 2022-10-12 2023-01-20 深圳鲲云信息科技有限公司 深度学习的稀疏数据存储方法、计算机设备和存储介质


Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972958B1 (en) * 2012-10-23 2015-03-03 Convey Computer Multistage development workflow for generating a custom instruction set reconfigurable processor
WO2009035185A1 (en) * 2007-09-11 2009-03-19 Core Logic Inc. Reconfigurable array processor for floating-point operations
KR101553648B1 (ko) * 2009-02-13 2015-09-17 삼성전자 주식회사 Processor having a reconfigurable architecture
CN102572415B (zh) * 2010-12-17 2013-12-04 清华大学 Method for mapping and implementing a motion compensation algorithm on a reconfigurable processor
CN102638659B (zh) * 2012-03-28 2014-05-14 西安电子科技大学 High-resolution imaging system and method based on CMOS-TDI mode
US10540180B2 (en) * 2014-12-07 2020-01-21 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Reconfigurable processors and methods for collecting computer program instruction execution statistics
CN104679670B (zh) * 2015-03-10 2018-01-30 东南大学 Shared data cache structure and management method for FFT and FIR
JP7132043B2 (ja) * 2018-09-10 2022-09-06 東京計器株式会社 Reconfigurable processor
CN110737628A (zh) * 2019-10-17 2020-01-31 辰芯科技有限公司 Reconfigurable processor and reconfigurable processor system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993297A (zh) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 Load-balanced sparse convolutional neural network accelerator and acceleration method thereof
US20210065005A1 (en) * 2019-08-29 2021-03-04 Alibaba Group Holding Limited Systems and methods for providing vector-wise sparsity in a neural network
CN112116084A (zh) * 2020-09-15 2020-12-22 中国科学技术大学 Convolutional neural network hardware accelerator with all network layers solidified on a reconfigurable platform
CN112286864A (zh) * 2020-12-24 2021-01-29 北京清微智能科技有限公司 Sparse data processing method and system for accelerating operation of a reconfigurable processor

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306811A (zh) * 2023-02-28 2023-06-23 苏州亿铸智能科技有限公司 Weight allocation method for deploying a neural network on ReRAM
CN116306811B (zh) * 2023-02-28 2023-10-27 苏州亿铸智能科技有限公司 Weight allocation method for deploying a neural network on ReRAM

Also Published As

Publication number Publication date
CN112286864B (zh) 2021-06-04
US20230068450A1 (en) 2023-03-02
CN112286864A (zh) 2021-01-29


Legal Events

Date Code Title Description
121 - Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21908476; Country of ref document: EP; Kind code of ref document: A1)
NENP - Non-entry into the national phase (Ref country code: DE)
122 - Ep: pct application non-entry in european phase (Ref document number: 21908476; Country of ref document: EP; Kind code of ref document: A1)