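WO2023019899A1 - Neural network real-time pruning method, system and neural network accelerator - Google Patents

Neural network real-time pruning method, system and neural network accelerator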

Info

Publication number
WO2023019899A1
Authority
WO
WIPO (PCT)
Prior art keywords
bit
matrix
row
neural network
pruning
Prior art date
Application number
PCT/CN2022/077281
Other languages
English (en)
French (fr)
Inventor
路航
李红燕
李晓维
Original Assignee
Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Publication of WO2023019899A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the technical field of deep neural network model pruning, and in particular to a neural network real-time pruning method, system and neural network accelerator.
  • DNN: deep neural network.
  • Neural network pruning is recognized as an effective way to preserve model accuracy while reducing the amount of computation.
  • Almost all traditional pruning methods operate at the software level.
  • Such pruning usually includes the following steps: (1) determine the importance of neurons according to an importance metric; (2) delete the unimportant neurons according to a preset compression ratio; (3) fine-tune the network to restore accuracy, or, if accuracy is low, adjust the importance metric and start pruning again.
  • the sparsity of the DNN model itself is not conducive to software pruning.
  • pruning utilizes an importance index to identify unimportant parameters.
  • This metric measures the sparsity of weights and activations from different angles, for example the proportion of zeros among the activation values, filter importance judged by the L1-norm, or the information entropy of a filter.
  • Such metrics attempt to prune parameters that are zero or close to zero and then retrain the model until the best accuracy is reached.
  • A metric may work for some DNN models but not for others.
  • The sparsity available in the model itself is not always sufficient. Some pruning methods must therefore perform time-consuming sparse training to increase parameter sparsity, and then retrain or fine-tune to make up for the accuracy lost.
  • the pre-trained DNN should be pruned on the hardware as quickly as possible.
  • The hardware should be able to perform pruning directly in an efficient and convenient way, rather than accelerating DNN inference through cumbersome software-level operations.
  • Traditional pruning steps include identifying and pruning unimportant parameters.
  • The value-based sparsity space is very limited, and if the compression ratio is set too high it inevitably leads to serious accuracy loss. When this happens, traditional pruning adopts one of two solutions: (1) reduce the compression ratio and prune again from scratch; (2) use sparse training to create more sparsity for pruning. This is also why software-level pruning is time-consuming.
  • The purpose of the present invention is to address the pruning-efficiency problem of the prior art described above. It proposes BitX, a method that performs hardware pruning on the bits of DNN parameters, and designs a hardware accelerator that implements the BitX pruning algorithm.
  • The present invention includes the following key technical points:
  • Key point 1, the BitX hardware pruning algorithm: the pruning proposed in the present invention is based on effective bits, and several methods for judging whether a bit is effective are proposed. Technical effect: the method for judging bit effectiveness in this application does not rely on software-level pruning, is independent of existing software pruning methods, and supports multi-precision DNNs; that is, effective-bit pruning can be implemented in hardware.
  • the present invention proposes a neural network real-time pruning method, including:
  • Step 1: obtain the bit matrix that is to undergo matrix multiplication in the neural network model, and use the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication.
  • Step 2: classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and take the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix.
  • In the neural network real-time pruning method, step 1 includes obtaining the importance of each bit row of the bit matrix in the matrix multiplication by the following formula:
  • p_i is the importance of the i-th bit row in the matrix multiplication.
  • E_i is the bit value of the elements of the i-th bit row.
  • BitCnt(i) is the number of effective bits in the i-th bit row.
  • l is the number of columns of the bit matrix.
  • In the neural network real-time pruning method, before step 1 is performed, a plurality of original weights to be matrix-multiplied are obtained and it is judged whether the original weights are fixed-point numbers. If so, step 1 is performed; otherwise all mantissas of the original weights are uniformly aligned to the maximum exponent among the original weights, the aligned matrix is used as the bit matrix, and step 1 is performed.
  • In the neural network real-time pruning method, the bit matrix is a weight matrix and/or an activation matrix, and step 2 includes classifying the N bit rows with the highest importance in the bit matrix as important rows, where N is a positive integer smaller than the total number of bit rows of the bit matrix.
  • the present invention also proposes a neural network real-time pruning system, which includes
  • Module 1, used to obtain the bit matrix that is to undergo matrix multiplication in the neural network model, and to use the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication.
  • Module 2, used to classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and to take the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix.
  • In the neural network real-time pruning system, module 1 obtains the importance of each bit row of the bit matrix in the matrix multiplication by the following formula:
  • p_i is the importance of the i-th bit row in the matrix multiplication.
  • E_i is the bit value of the elements of the i-th bit row.
  • BitCnt(i) is the number of effective bits in the i-th bit row.
  • l is the number of columns of the bit matrix.
  • In the neural network real-time pruning system, before module 1 is called, a plurality of original weights to be matrix-multiplied are obtained and it is judged whether the original weights are fixed-point numbers. If so, module 1 is called; otherwise all mantissas of the original weights are uniformly aligned to the maximum exponent among the original weights, the aligned matrix is used as the bit matrix, and module 1 is called.
  • In the neural network real-time pruning system, the bit matrix is a weight matrix and/or an activation matrix, and module 2 classifies the N bit rows with the highest importance in the bit matrix as important rows, where N is a positive integer smaller than the total number of bit rows of the bit matrix.
  • The present invention also proposes a neural network accelerator for use in the above neural network real-time pruning system.
  • The neural network accelerator includes a PE composed of multiple CUs; each CU accepts multiple weight/activation pairs as input, and the input weights are pruned by module 2.
  • Each selector of the extractor in a CU handles one pruned binary weight, and the extractor records the actual bit value of the bits in each important row, which is used to shift the corresponding activation value.
  • The present invention also proposes a server including a storage medium, wherein the storage medium is used to store a program for executing the above neural network real-time pruning method.
  • BitX-mild and BitX-wild acceleration architectures can be formed according to different configurations, and the technical effects are as follows:
  • Fig. 1 is a bit 1 distribution analysis diagram
  • Fig. 2 is a conceptual diagram of the BitX core of the present invention.
  • Fig. 3 is the structural diagram of the accelerator of the present application.
  • FIG. 4 is a structural diagram of a CU in the accelerator of this application.
  • Table 1: comparison of weight sparsity and bit sparsity for different pre-trained models.
  • The weights are represented as 32-bit floating-point numbers, and the bit sparsity is significantly greater than the weight sparsity.
  • Weight sparsity is the number of weight values smaller than 10^-5 divided by the total number of weights.
  • Bit sparsity is the number of zero bits in the mantissas divided by the total number of bits. The two sparsity metrics clearly differ for all models: the weight sparsity of most models is below 1%, whereas the bit sparsity reaches 49%. This presents a good opportunity to exploit sparsity at the bit level. Because more than 49% of the bits are 0, pruning these ineffective bits has no impact on accuracy. The present invention takes full advantage of this condition to accelerate DNN inference.
  • That 49% of the bits are 0 also means that 51% of the bits are 1, which is likewise a large share of the parameter bits. But not every 1 bit affects the final accuracy. Some 1 bits have an extremely small actual value, and they are a factor affecting computational efficiency (one that previous research has never considered). Therefore, after exploring bit-level sparsity, the technical direction is moved further toward the ineffective (low-impact) 1 bits.
  • The distribution of 1 bits is therefore studied in units of bit slices (every range of 10 exponent values is treated as one slice).
  • The x-axis represents the bit slices of the binary weights (expressed in 32-bit floating point), and each bit slice represents the bit values at its positions.
  • Suppose a certain weight is 1.1101 × 2^-4.
  • Its binary representation is 0.00011101.
  • The bit values recorded for its four effective 1 bits are 2^-4, 2^-5, 2^-6 and 2^-8.
  • The four benchmark DNN models all show a similar distribution: the peak of the three-dimensional graph is reached when the abscissa is between 2^-21 and 2^-30, indicating that the bit values in this range cover most of the 1 bits (about 40%), yet most of these 1 bits have only a weak influence on inference accuracy.
  • BitX aims to prune these bits to speed up inference. After binary conversion, the bit slices range from 2^9~2^0 to 2^-61~2^-70. All models show an "arch" shape in every layer. Most (40%) of the 1 bits are located in the middle bit slices.
  • Taking 2^-21 to 2^-30 as an example, the corresponding decimal range is 0.000000477 (about 4.77 × 10^-7) to 0.000000000931 (about 9.31 × 10^-10).
  • the present invention aims to accurately identify important bits and prune most of the bits with little influence on the accelerator, so as to achieve the goal of reducing calculation amount under the condition of little loss of precision.
  • the floating-point operand consists of three parts: sign bit, mantissa and exponent, and follows the most commonly used floating-point standard in the industry—IEEE754 standard. If a single-precision floating-point number (fp32) is used, the mantissa bit width is 23 bits, the exponent bit width is 8 bits, and the remaining bit is a sign bit.
  • the mantissa is represented as shown in Figure 2.
  • a weight bit matrix will be obtained, and each column in the matrix represents the binary mantissa value actually stored in memory.
  • The different colors in the legend represent bit values from 2^-1 to 2^-9 (the 2^0 bit value represents the hidden 1 of the mantissa).
  • The result of A × W can be represented as the sum of n rank-one matrices.
  • The result of A × W can be obtained with the Fast Monte-Carlo Algorithm, which randomly samples some of the rank-one matrices to approximate the matrix product; the most common sampling method is to compute a probability for each rank-one matrix and select accordingly.
  • A^(i) denotes the i-th row of matrix A.
  • W^(i) denotes the i-th column of matrix W.
  • The present invention uses the product of the Euclidean distances of A^(i) and W^(i) as the sampling probability, which reflects the importance of a given rank-one product within the sum of n rank-one products.
  • The present invention abstracts the bit matrix in Fig. 2(a) as W, looks for (un)important bit rows in Fig. 2(b), samples each bit row of W using the probability in formula (1), and determines the bit rows to be pruned, thereby simplifying the MAC computation.
  • Within the weight matrix, the present invention targets the mantissa parts of n 32-bit floating-point weight values; the mantissa of each weight is instantiated as a column vector composed of its bit values.
  • For a MAC, n weights imply n corresponding activations.
  • The n activation values form another column vector [A_1, A_2 ... A_j ... A_n]^T.
  • A_j is an element of the activation vector.
  • v_j is the j-th element of the i-th row vector of the weight bit matrix.
  • The elements of one row of the weight bit matrix share the same exponent. One term of formula (2) (a formula image not reproduced here) therefore denotes the exponent of the j-th element, and the Euclidean distance of the row vector is calculated from it (by a second formula image not reproduced here).
  • Exponent alignment in BitX is almost the same as exponent alignment in floating-point addition. The only difference is that BitX aligns a whole group of numbers to the maximum exponent at once, rather than aligning weights/activations pair by pair. After alignment, the elements of one row of the weight bit matrix therefore share the same exponent, as shown in Fig. 2(b).
  • v denotes a bit row vector of W. If an element v_j of v equals 0, it has no effect on the Euclidean distance and hence no effect on p_i. Computing the Euclidean distance therefore reduces to counting the 1 bits of the i-th row vector; BitCnt(i) denotes this count. p_i can thus be rewritten as formula 3:
  • For a given matrix W with l column vectors, the normalizing sum (a formula image not reproduced here) is a constant and is replaced by a single symbol, so that p_i finally reduces to formula 4:
  • In formula 4, p_i reflects the importance of bit row i of the bit matrix in the computation. E_i reflects the bit value of the elements in row i, and BitCnt(i) reflects the number of effective bits in row i, where an effective bit is a 1 bit and an ineffective bit is a 0 bit. Larger E_i and BitCnt(i) have a greater impact on the final MAC. BitX uses formula 4 to determine the important bits and prunes the unimportant bits directly in the accelerator.
  • BitX first extracts the exponent E and mantissa M of the 32-bit floating-point weights as input (lines 1 to 3), then uniformly aligns all mantissas to the maximum exponent e_max (line 4), then computes p_i and sorts the p_i in descending order (lines 5 to 10).
  • The input parameter N is the number of bit row vectors remaining in W after pruning; that is, BitX selects the N bit rows with the largest p_i.
  • The indices of the selected N rows are stored in array I (line 13). Pruning is finally carried out with a mask.
  • After pruning, BitX extracts all key 1 bits and stores them back into W′.
  • The design parameter N controls the granularity of pruning. A smaller N makes the algorithm produce greater bit sparsity by pruning more rows, and ultimately speeds up inference by skipping more zeros.
  • the system architecture of the accelerator is shown in Figure 3.
  • The "E-alignment" and "Bit-Extraction" modules implement the bit pruning algorithm. Every 16 CUs (computing units) form one BitX PE (processing element). Each CU accepts M weight/activation pairs as input. The input weights are preprocessed by the "Bit-Extraction" module, which prunes bits of tiny actual value to 0. For a fixed-point DNN the E-alignment module is not needed, because fixed-point operations involve no exponent alignment, so the original weights are fed directly to "Bit-Extraction".
  • The E-alignment module aligns all weights to the maximum exponent.
  • This module mainly consists of a data shift unit and a zero-bit padding unit.
  • For floating-point data, each weight is first rewritten as its mantissa and exponent. The maximum exponent is then obtained, and the other weights are uniformly aligned to it.
  • The data shift unit does this by right-shifting the i-th mantissa by E_max - E_i. The gaps that the shift opens in the leading part of the mantissa are filled with zero bits by the padding unit (marked light gray in Fig. 3).
  • E_i may differ between weights, so after this zero-bit padding the bit widths of the parameters are not consistent. To handle this, the padding unit also appends a run of zero bits up to the maximum bit width (marked dark gray in Fig. 3).
  • The mantissas output by the E-alignment module are fed into the Bit-Extraction module for bit pruning.
  • The first functional unit of this module is BITCNT, which implements the BitCnt function in formula (4).
  • The second function of the Bit-Extraction module is to sort the shifted BitCnt(i) values and select the n rows with the largest p_i; the weights in the other rows are pruned. This yields the pruned weights.
  • Each "selector" in the extractor handles one pruned binary weight (M weights in total), and k denotes an essential bit in the pruned weight.
  • The extractor records the actual bit value of each essential bit, denoted s, which is used to shift the corresponding activation value.
  • Activation values can be floating-point or fixed-point data. Fixed-point activations can be shifted directly. For floating-point activations, the shift is an accumulation on the exponent of the activation, which is in fact also a fixed-point operation. The shifter therefore introduces no large overhead.
  • The adder tree performs the final partial-sum accumulation and is also used to distinguish between different precisions.
  • The present invention proposes a hardware-based neural network real-time pruning method, system and neural network accelerator, comprising: obtaining the bit matrix that is to undergo matrix multiplication in the neural network model, and using the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication; classifying each bit row of the bit matrix as an important row or an unimportant row according to the importance, and taking the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix.
  • The present invention is a pruning method based on effective bits; its method for judging bit effectiveness does not rely on software-level pruning, is independent of existing software pruning methods, and supports multi-precision DNNs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

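A hardware-based neural network real-time pruning method, system and neural network accelerator, comprising: obtaining the bit matrix that is to undergo matrix multiplication in a neural network model, and using the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication; classifying each bit row of the bit matrix as an important row or an unimportant row according to the importance, and taking the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix. The invention is a pruning method based on effective bits, and its method for judging bit effectiveness does not rely on software-level pruning, is independent of existing software pruning methods, and supports multi-precision DNNs.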

Description

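Neural network real-time pruning method, system and neural network accelerator
Technical Field
The present invention relates to the technical field of deep neural network model pruning, and in particular to a neural network real-time pruning method, system and neural network accelerator.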
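Background
As the number of parameters in deep learning models has grown rapidly from millions (e.g., the ResNet family in computer vision) to hundreds of billions (e.g., BERT or GPT-3 in natural language processing), the enormous amount of computation has become one of the main obstacles to deploying deep neural networks (DNNs) in practical applications. Although models with deeper layers and more complex neuron connections satisfy the ever-growing demand for accuracy, the more important requirement of real-time performance has not kept pace with the development of DNNs. This problem is particularly pronounced on resource-constrained devices.
To address this problem, neural network pruning is recognized as an effective way to preserve model accuracy while reducing the amount of computation. However, almost all traditional pruning methods operate at the software level. Such pruning usually includes the following steps: (1) determine the importance of neurons according to an importance metric; (2) delete the unimportant neurons according to a preset compression ratio; (3) fine-tune the network to restore accuracy, or, if accuracy is low, adjust the importance metric and start pruning again.
However, owing to the diversity of deep learning applications, it is difficult to find a universal software-based pruning method. End users must therefore reconsider application-specific pruning criteria according to the hyperparameters and structural parameters of the DNN, and carry out the above steps again from scratch. This tedious, time-consuming repetition limits the rapid deployment of DNNs in practice. The problems of such pruning methods and their causes lie mainly in the following three aspects:
(1) From the model perspective, the sparsity of the DNN model itself is not conducive to software pruning. Specifically, pruning uses an importance metric to identify unimportant parameters. This metric measures the sparsity of weights and activations from different angles, for example the proportion of zeros among the activation values, filter importance judged by the L1-norm, or the information entropy of a filter. Such metrics attempt to prune parameters that are zero or close to zero and then retrain the model until the best accuracy is reached. However, a metric may work for some DNN models but not for others. Moreover, the sparsity available in the model itself is not always sufficient. Some pruning methods must therefore perform time-consuming sparse training to increase parameter sparsity, and then retrain or fine-tune to make up for the accuracy lost.
(2) From the efficiency perspective, software pruning is time- and labor-intensive in the fine-tuning/retraining stage, because the parameters remaining after pruning cannot guarantee that the model reaches its original pre-pruning accuracy. Traditional methods must therefore rely on retraining/fine-tuning on the same dataset to make up for the accuracy loss. Retraining/fine-tuning usually takes days or weeks of iterations, and the procedure is usually carried out layer by layer. If pruning is applied to VGG-19, the model must be retrained 19 times, each for dozens of epochs, to recover the lost accuracy. Such time-consuming iteration hinders the deployment of pruned models on devices, and if accuracy after pruning is poor, the above steps must be repeated. Considering other general networks with hundreds of layers (ResNet, DenseNet), or 3D convolution, non-local convolution and deformable convolution with more numerous and more complex connections, developers usually face the unavoidable challenge of obtaining good accuracy and spending little time simultaneously.
(3) From the accelerator perspective: first, unstructured pruning depends heavily on hardware. Previous research has proposed a large number of accelerators for specific kinds of pruning, such as Cambricon-S, which addresses the irregularity of unstructured pruning; EIE, for fully connected layers; and ESE, for long short-term memory (LSTM) models. None of these accelerators, however, supports convolutional-layer computation, which is the bulk of inference in convolutional neural networks. Second, accelerator design also depends on the sparsification method. SCNN exploits both neuron and synapse sparsity, whereas Cnvlutin supports only neuron sparsity. Thus, if software developers change the pruning strategy, or merely switch from structured to unstructured pruning, the hardware deployment must also change, which introduces porting overhead.
Ideally, a pre-trained DNN should be pruned on hardware as quickly as possible; furthermore, the hardware should be able to perform pruning directly in an efficient and convenient way, rather than accelerating DNN inference through cumbersome software-level operations. For most software pruning methods, the traditional pruning steps include identifying and pruning unimportant parameters. However, as stated above, the value-based sparsity space is very limited, and if the compression ratio is set too high it inevitably leads to serious accuracy loss. When this happens, traditional pruning adopts one of two solutions: (1) reduce the compression ratio and prune again from scratch; (2) use sparse training to create more sparsity for pruning. This is also why software-level pruning is time-consuming.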
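Disclosure of the Invention
The purpose of the present invention is to address the pruning-efficiency problem of the prior art described above. It proposes BitX, a method that performs hardware pruning on the bits of DNN parameters, and designs a hardware accelerator that implements the BitX pruning algorithm. The present invention includes the following key technical points:
Key point 1, the BitX hardware pruning algorithm: the pruning proposed in the present invention is based on effective bits, and several methods for judging whether a bit is effective are proposed. Technical effect: the method for judging bit effectiveness in this application does not rely on software-level pruning, is independent of existing software pruning methods, and supports multi-precision DNNs; that is, effective-bit pruning can be implemented in hardware.
Key point 2, the hardware accelerator architecture design. Technical effect: the hardware accelerator can implement the BitX pruning algorithm at the hardware level.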
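Specifically, in view of the deficiencies of the prior art, the present invention proposes a neural network real-time pruning method, which includes:
Step 1: obtain the bit matrix that is to undergo matrix multiplication in the neural network model, and use the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication;
Step 2: classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and take the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix.
In the neural network real-time pruning method, step 1 includes obtaining the importance of each bit row of the bit matrix in the matrix multiplication by the following formula:
Figure PCTCN2022077281-appb-000001
Figure PCTCN2022077281-appb-000002
where p_i is the importance of the i-th bit row of the bit matrix in the matrix multiplication, E_i is the bit value of the elements of the i-th bit row, BitCnt(i) is the number of effective bits in the i-th bit row, and l is the number of columns of the bit matrix.
In the neural network real-time pruning method, before step 1 is performed, a plurality of original weights to be matrix-multiplied are obtained and it is judged whether the original weights are fixed-point numbers. If so, step 1 is performed; otherwise all mantissas of the original weights are uniformly aligned to the maximum exponent among the original weights, the aligned matrix is used as the bit matrix, and step 1 is performed.
In the neural network real-time pruning method, the bit matrix is a weight matrix and/or an activation matrix, and step 2 includes classifying the N bit rows with the highest importance in the bit matrix as important rows, where N is a positive integer smaller than the total number of bit rows of the bit matrix.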
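The present invention also proposes a neural network real-time pruning system, which includes:
Module 1, used to obtain the bit matrix that is to undergo matrix multiplication in the neural network model, and to use the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication;
Module 2, used to classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and to take the matrix obtained by setting to zero the bits equal to 1 in the unimportant rows as the pruning result of the bit matrix.
In the neural network real-time pruning system, module 1 obtains the importance of each bit row of the bit matrix in the matrix multiplication by the following formula:
Figure PCTCN2022077281-appb-000003
Figure PCTCN2022077281-appb-000004
where p_i is the importance of the i-th bit row of the bit matrix in the matrix multiplication, E_i is the bit value of the elements of the i-th bit row, BitCnt(i) is the number of effective bits in the i-th bit row, and l is the number of columns of the bit matrix.
In the neural network real-time pruning system, before module 1 is called, a plurality of original weights to be matrix-multiplied are obtained and it is judged whether the original weights are fixed-point numbers. If so, module 1 is called; otherwise all mantissas of the original weights are uniformly aligned to the maximum exponent among the original weights, the aligned matrix is used as the bit matrix, and module 1 is called.
In the neural network real-time pruning system, the bit matrix is a weight matrix and/or an activation matrix, and module 2 classifies the N bit rows with the highest importance in the bit matrix as important rows, where N is a positive integer smaller than the total number of bit rows of the bit matrix.
The present invention also proposes a neural network accelerator for use in the above neural network real-time pruning system.
The neural network accelerator includes a PE composed of multiple CUs; each CU accepts multiple weight/activation pairs as input, and the input weights are pruned by module 2.
In the neural network accelerator, each selector of the extractor in a CU handles one pruned binary weight, and the extractor records the actual bit value of the bits in each important row, which is used to shift the corresponding activation value.
The present invention also proposes a server including a storage medium, wherein the storage medium is used to store a program for executing the above neural network real-time pruning method.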
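The BitX accelerator proposed in the present invention can be configured into two acceleration architectures, BitX-mild and BitX-wild. The technical effects are as follows:
(1) Speedup: in 32-bit floating-point mode, BitX-mild and BitX-wild achieve 2.61x to 4.82x speedup over the unpruned model; in 16-bit fixed-point mode they reach 2.00x speedup. For object detection, speed improves by 4.98x and 14.76x compared with the original model, YoloV3.
(2) Accuracy: on the ImageNet dataset, pruning with BitX-mild and BitX-wild costs 0.13% and 0.44% accuracy, respectively; on Cifar-10 the losses are 0.09% and 0.15%. These accuracy figures are all for 32-bit floating-point mode. In 16-bit fixed-point mode, BitX-mild gives accuracy 0.9% and 0.2% higher than the original DenseNet121 and ResNext101 models, and BitX-wild gives accuracy 0.8% and 0.1% higher. For YoloV3, BitX-mild is 0.06% and 0.07% more accurate than the original model, while BitX-wild is 0.31% and 1.64% less accurate.
(3) Accelerator performance: compared with other state-of-the-art accelerator designs, the BitX accelerator achieves 2.00x and 3.79x performance improvements. In a TSMC 28nm process, the accelerator area is 0.039 mm^2, and its power is 68.62 mW (32-bit floating-point mode) and 36.41 mW (16-bit fixed-point mode).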
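Brief Description of the Drawings
Fig. 1 is an analysis diagram of the distribution of 1 bits;
Fig. 2 is a conceptual diagram of the BitX core of the present invention;
Fig. 3 is a structural diagram of the accelerator of this application;
Fig. 4 is a structural diagram of a CU in the accelerator of this application.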
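Best Mode for Carrying Out the Invention
Considering the shortcomings of traditional pruning and the need for efficient pruning, we rethought existing pruning methods and analyzed parameter sparsity at the bit level. We then explored a new way of pruning that improves pruning efficiency. The main results of the bit-level sparsity analysis are as follows:
Table 1: comparison of weight sparsity and bit sparsity for different pre-trained models. The weights are represented as 32-bit floating-point numbers, and the bit sparsity is significantly greater than the weight sparsity.
Figure PCTCN2022077281-appb-000005
As shown in Table 1, weight sparsity is the number of weight values smaller than 10^-5 divided by the total number of weights. Bit sparsity is the number of zero bits in the mantissas divided by the total number of bits. The two sparsity metrics clearly differ for all models: the weight sparsity of most models is below 1%, whereas the bit sparsity reaches 49%. This presents a good opportunity to exploit sparsity at the bit level. Because more than 49% of the bits are 0, pruning these ineffective bits has no impact on accuracy. The present invention takes full advantage of this condition to accelerate DNN inference.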
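The two sparsity metrics described for Table 1 can be illustrated with a minimal Python sketch. The function names are hypothetical, NumPy is an assumed dependency, and the threshold is applied to the weight magnitude, which is an assumption (the text only says "smaller than 10^-5"). Only the 23 stored mantissa bits of each fp32 weight are counted.

```python
import numpy as np

def weight_sparsity(weights, threshold=1e-5):
    """Fraction of weights whose magnitude is below `threshold` (assumed reading)."""
    w = np.asarray(weights, dtype=np.float32)
    return float(np.mean(np.abs(w) < threshold))

def mantissa_bit_sparsity(weights):
    """Fraction of zero bits among the 23 stored mantissa bits of fp32 weights."""
    raw = np.asarray(weights, dtype=np.float32).view(np.uint32)
    mantissa = raw & np.uint32(0x7FFFFF)
    # Count the 1 bits position by position across all weights.
    ones = sum(int(((mantissa >> np.uint32(k)) & np.uint32(1)).sum())
               for k in range(23))
    return 1.0 - ones / (23 * mantissa.size)
```

On real pre-trained weights the two numbers would be expected to differ widely, as the table reports; the sketch itself makes no claim about any particular model.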
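That 49% of the bits are 0 also means that 51% of the bits are 1, which is likewise a large share of the parameter bits. But not every 1 bit affects the final accuracy. Some 1 bits have an extremely small actual value, and they are a factor affecting computational efficiency (one that previous research has never considered). Therefore, after exploring bit-level sparsity, we move the technical direction further toward the ineffective (low-impact) 1 bits.
We therefore study the distribution of 1 bits in units of bit slices (every range of 10 exponent values is treated as one slice). As shown in Fig. 1, the x-axis represents the bit slices of the binary weights (expressed in 32-bit floating point), and each bit slice represents the bit values at its positions. Suppose a certain weight is 1.1101 × 2^-4; its binary representation is 0.00011101, and the bit values recorded for its four effective 1 bits are 2^-4, 2^-5, 2^-6 and 2^-8.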
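The worked example above (1.1101 × 2^-4 yielding bit values 2^-4, 2^-5, 2^-6 and 2^-8) can be checked with a short Python sketch. The helper name is hypothetical, and the sketch assumes normal fp32 values; zeros and subnormals are simply skipped.

```python
import struct

def set_bit_exponents(x):
    """Exponents k such that 2**k is a set bit (hidden 1 included) of fp32 |x|."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    biased = (raw >> 23) & 0xFF
    if biased == 0:                      # zero or subnormal: ignored in this sketch
        return []
    exp = biased - 127                   # unbiased exponent
    sig = (1 << 23) | (raw & 0x7FFFFF)   # 24-bit significand with hidden 1
    # Bit position 23 carries the value 2**exp; each lower position halves it.
    return [exp - (23 - pos) for pos in range(23, -1, -1) if (sig >> pos) & 1]
```

For instance, `set_bit_exponents(0.11328125)` describes the weight 1.1101b × 2^-4 from the text, since 0.11328125 = 29/256.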
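As shown in Fig. 1, the four benchmark DNN models all show a similar distribution: the peak of the three-dimensional graph is reached when the abscissa is between 2^-21 and 2^-30, indicating that the bit values in this range cover most of the 1 bits (about 40%), yet most of these 1 bits have only a weak influence on inference accuracy. BitX aims to prune these bits to accelerate inference. After binary conversion, the bit slices range from 2^9~2^0 to 2^-61~2^-70. All models show an "arch" shape in every layer. Most (40%) of the 1 bits are located in the middle bit slices. Taking 2^-21 to 2^-30 as an example, the corresponding decimal range is 0.000000477 (about 4.77 × 10^-7) to 0.000000000931 (about 9.31 × 10^-10). In practice, such tiny 1 bits have very little effect on model accuracy. The present invention therefore aims to identify the important bits precisely and to prune most of the low-impact bits on the accelerator, so as to reduce the amount of computation with very little accuracy loss.
A floating-point operand consists of three parts: sign bit, mantissa and exponent, following IEEE 754, the floating-point standard most widely used in industry. For a single-precision floating-point number (fp32), the mantissa is 23 bits wide, the exponent is 8 bits wide, and the remaining bit is the sign bit. A single-precision floating-point weight can be expressed as fp = (-1)^s × 1.m × 2^(e-127), where e is the actual position of the binary point plus 127.
Taking six unaligned 32-bit single-precision floating-point weights as an example, their mantissas are represented as shown in Fig. 2. This yields a weight bit matrix in which each column is the binary mantissa actually stored in memory. The different colors in the legend represent bit values from 2^-1 to 2^-9 (the 2^0 bit value represents the hidden 1 of the mantissa). In the weight bit matrix, different background colors indicate the actual bit value at each bit position according to the exponent. For example, the dark gray at the top of W_2 represents the bit value 2^-3.
As shown in Fig. 2(b), when all mantissas are aligned according to their exponents, a large number of padded zeros appear in the upper part of the matrix. First, this increases sparsity after zero padding and provides good conditions for bit-level pruning. Second, most of the 1 bits are moved to the tail, where bit values are smaller than 2^-6. Such 1 bits have a negligible effect on the final MAC (multiply-accumulate operation). If these unimportant 1 bits are pruned, a large number of bit-level operations can be omitted, thereby accelerating inference. As shown in Fig. 2(c), the red boxes mark the pruned 1 bits; only a few key 1 bits remain to form the pruned weights W′_1, W′_3, W′_4 and W′_5. These bits are called "essential bits".
Using the "essential bits" of Fig. 2(c) is an effective way to simplify the MAC at the bit level. However, with millions of parameters, the effect of a single bit on the whole network is hard to evaluate. The present invention therefore proposes BitX, an effective yet hardware-friendly mechanism that fully exploits the ineffective bits and still maintains the original accuracy without resorting to time- and labor-intensive software pruning.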
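(1) BitX pruning method:
Given an n×l matrix A and an l×n matrix W, the result of A×W can be represented as the sum of n rank-one matrices. The result of A×W can be obtained with the Fast Monte-Carlo Algorithm, which randomly samples some of these rank-one matrices to approximate the matrix product; the most common sampling method is to compute a probability for each rank-one matrix and select accordingly. As shown in formula (1), A^(i) denotes the i-th row of matrix A and W^(i) denotes the i-th column of matrix W. The present invention uses the product of the Euclidean distances of A^(i) and W^(i) as the sampling probability, which reflects the importance of a given rank-one product within the sum of n rank-one products.
Figure PCTCN2022077281-appb-000006
Inspired by the Fast Monte-Carlo Algorithm, BitX uses the sampling probability to measure the importance of the bits in a weight rather than the importance of its value. Compared with other, more important bits in the same weight value, a bit with a small probability has little effect when multiplied by the activation. The present invention therefore abstracts the bit matrix in Fig. 2(a) as W, looks for (un)important bit rows in Fig. 2(b), samples each bit row of W using the probability in formula (1), and determines the bit rows to be pruned, thereby simplifying the MAC computation.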
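To make the sampling idea concrete, here is a minimal NumPy sketch of norm-based randomized matrix multiplication. It is an illustration rather than the patent's procedure: it uses the textbook convention in which column k of A is paired with row k of W (the text above indexes rows of A and columns of W), and the function names are hypothetical.

```python
import numpy as np

def sampling_probabilities(A, W):
    """p_k proportional to |column k of A| * |row k of W| (Euclidean norms)."""
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(W, axis=1)
    return weights / weights.sum()

def approx_matmul(A, W, num_samples, rng):
    """Approximate A @ W from `num_samples` randomly chosen rank-one terms."""
    p = sampling_probabilities(A, W)
    picks = rng.choice(len(p), size=num_samples, p=p)
    # Rescaling each sampled term by 1 / (num_samples * p_k) keeps the
    # estimate unbiased.
    return sum(np.outer(A[:, k], W[k, :]) / (num_samples * p[k]) for k in picks)
```

BitX, as described, borrows only the probability as an importance score for bit rows; it does not sample at random.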
在权重矩阵中,本发明针对的是n个32位浮点权重值的尾数部分,每个权重的尾数被实例化为由其比特值组成的列向量。对于MAC,n个权重意味着相应有n个激活值。n个激活值组成了另一个列向量[A 1,A 2…A j…A n] T。将激活值矩阵的列向量与权重矩阵的行向量两个向量带入公式1,可以得到公式2:
Figure PCTCN2022077281-appb-000007
A j是激活值向量的元素,v j是权重比特矩阵第i行向量的第j元素。权重比特矩阵中的同一行具有相同的指数(阶码)。因此用公式(2)中的
Figure PCTCN2022077281-appb-000008
表示第j元素的阶码。则行向量的欧氏距离通过
Figure PCTCN2022077281-appb-000009
计算。
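The exponent alignment in BitX is almost identical to the alignment in floating-point addition. The only difference is that BitX aligns a whole group of numbers to the maximum exponent at once, rather than aligning weight/activation values one at a time. After alignment, elements in the same row of the weight bit matrix therefore share the same exponent, as shown in Fig. 2(b). A unified E_i is used to denote the actual exponent of the i-th bit row vector. The pruning scheme of the present invention can be applied to the weight matrix and/or the activation matrix.
Let v denote a bit row vector of W. If an element v_j of v equals 0, it contributes nothing to the Euclidean distance and hence has no effect on p_i. Computing the Euclidean distance therefore reduces to counting the 1 bits in the i-th row vector. This count is denoted BitCnt(i), so p_i can be rewritten as formula (3):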
Figure PCTCN2022077281-appb-000010
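In formula (3), E_i denotes the exponent of the i-th row vector. All column vectors in matrix A are identical, so |A^(i′)| equals |A^(i)|. For a given matrix W with l column vectors,
Figure PCTCN2022077281-appb-000011
is a constant, so let
Figure PCTCN2022077281-appb-000012
Finally, p_i simplifies to formula (4):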
Figure PCTCN2022077281-appb-000013
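In formula (4), p_i reflects the importance of bit row i of the bit matrix in the computation: E_i reflects the bit value of the elements in row i, and BitCnt(i) reflects the number of effective bits in row i, where an effective bit is a 1 bit and, correspondingly, an ineffectual bit is a 0 bit. A larger E_i and a larger BitCnt(i) have a larger effect on the final MAC. BitX uses formula (4) to identify the important bits and prunes the unimportant bits directly in the accelerator.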
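The formula images are not reproduced in this text, so the chain from formula (1) to formula (4) is worked out below from the stated definitions only. This is a reconstruction under assumptions, not the patent's own notation: the symbols $A^{(i)}$, $W_{(i)}$ and the constant $C$ are written as they would appear in the standard sampling argument, and $v_j \in \{0, 1\}$ is the j-th bit of aligned row i. The hardware description speaks of sorting the shifted BitCnt(i), so the patent's own formula (4) may use BitCnt(i) directly rather than its square root.

```latex
\begin{aligned}
p_i &= \frac{\lvert A^{(i)}\rvert\,\lvert W_{(i)}\rvert}
            {\sum_{i'} \lvert A^{(i')}\rvert\,\lvert W_{(i')}\rvert}
   && \text{(sampling probability)} \\[4pt]
\lvert W_{(i)}\rvert &= \sqrt{\sum_{j} \bigl(v_j \cdot 2^{E_i}\bigr)^2}
   = 2^{E_i}\sqrt{\mathrm{BitCnt}(i)}
   && \text{(since } v_j^2 = v_j\text{)} \\[4pt]
p_i &= \frac{2^{E_i}\sqrt{\mathrm{BitCnt}(i)}}
            {\sum_{i'} 2^{E_{i'}}\sqrt{\mathrm{BitCnt}(i')}}
   = C \cdot 2^{E_i}\sqrt{\mathrm{BitCnt}(i)}
   && \text{(} \lvert A^{(i)}\rvert \text{ identical for all } i\text{)}
\end{aligned}
```

Either way, the score grows with both the exponent of the row and the number of 1 bits in it, which is the property the text relies on.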
(2) BitX pruning procedure:
Figure PCTCN2022077281-appb-000014
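The algorithm of the present invention is detailed in Algorithm 1 above. BitX first extracts the exponent E and the mantissa M of the 32-bit floating-point weights as input (lines 1 to 3), then aligns all mantissas to the maximum exponent e_max (line 4), and then computes p_i and sorts the p_i in descending order (lines 5 to 10). The input parameter N is the number of bit row vectors that remain in W after pruning; that is, BitX selects the N bit rows with the largest p_i. The indices of the selected N rows are stored in array I (line 13). Pruning is finally carried out with a mask. After pruning, BitX extracts all essential 1 bits and stores them back into W′.
The design parameter N controls the pruning granularity. A smaller N produces greater bit sparsity, prunes more rows, and ultimately accelerates inference by skipping more zeros.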
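The procedure can be sketched in Python as follows. This is not a transcription of Algorithm 1, whose listing is only available as an image; it is a minimal sketch that assumes the mantissas have already been aligned and are given as integers of `width` bits, scores row i with 2^(-i) × sqrt(BitCnt(i)) (a relative exponent in place of E_i, which leaves the ranking unchanged, and the square-root form as an assumption), and applies the mask by bitwise AND. The name `bitx_prune` is hypothetical.

```python
import math

def bitx_prune(columns, width, n_keep):
    """Keep the n_keep highest-scoring bit rows; zero every other bit.

    columns: aligned mantissas, one integer of `width` bits per weight.
    Row 0 is the most significant bit position.
    Returns (pruned columns, sorted indices of the kept rows).
    """
    scores = []
    for i in range(width):
        shift = width - 1 - i
        bit_cnt = sum((c >> shift) & 1 for c in columns)    # BitCnt(i)
        scores.append(2.0 ** (-i) * math.sqrt(bit_cnt))     # assumed importance score
    kept = sorted(range(width), key=lambda i: scores[i], reverse=True)[:n_keep]
    mask = 0
    for i in kept:
        mask |= 1 << (width - 1 - i)
    return [c & mask for c in columns], sorted(kept)

pruned, rows = bitx_prune([0b1011, 0b0110, 0b0011], width=4, n_keep=2)
print([format(c, "04b") for c in pruned], rows)
```

Lowering `n_keep` plays the role of a smaller N: more rows are masked out and more bits become zero.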
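(3) BitX hardware pruning accelerator
The system architecture of the accelerator is shown in Fig. 3. The "E-alignment" and "Bit-Extraction" modules execute the bit pruning algorithm. Every 16 CUs (computing units) form one BitX PE (processing element). Each CU accepts M weight/activation pairs as input. The input weights are preprocessed by the "Bit-Extraction" module, which prunes bits of tiny actual value to 0. For fixed-point DNNs the E-alignment module is not needed, because fixed-point arithmetic involves no exponent alignment; the original weights are fed directly into "Bit-Extraction".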
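① E-alignment module
The E-alignment module aligns all weights to the maximum exponent. It consists mainly of a data shifting component and a zero-bit padding component. For floating-point data, each weight parameter is first rewritten as its mantissa and exponent. The maximum exponent is then obtained, and all other weights are aligned to it. The data shifting component does this by right-shifting the i-th mantissa by E_max − E_i. The vacancies that the shift creates in the leading part of the mantissa are filled with zero bits by the padding component (marked light gray in Fig. 3). Because E_i may differ between weights, the parameters do not all have the same bit width after this zero padding. To handle this, the zero-bit padding component also pads a run of zero bits up to the maximum bit width (marked dark gray in Fig. 3).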
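The shift and the two kinds of zero padding can be modeled in software. The sketch below is an assumption-laden model rather than the hardware design: it restores the hidden leading 1 before shifting, keeps every shifted-out bit by widening the word, and represents each aligned value as a Python integer whose most significant bit corresponds to the maximum exponent. The name `align_mantissas` and the `frac_bits` parameter are illustrative.

```python
def align_mantissas(exponents, mantissas, frac_bits=23):
    """Align significands to the maximum exponent without dropping any bit.

    Returns (e_max, width, aligned), where each aligned value is an integer
    of `width` bits. Leading zeros come from the right shift by
    E_max - E_i; trailing zeros pad every value to the common width.
    """
    e_max = max(exponents)
    max_shift = e_max - min(exponents)
    width = frac_bits + 1 + max_shift
    aligned = []
    for e, m in zip(exponents, mantissas):
        significand = (1 << frac_bits) | m     # restore the hidden leading 1
        shift = e_max - e                      # right shift for this weight
        aligned.append(significand << (max_shift - shift))
    return e_max, width, aligned

# Toy example with 3 fraction bits: 1.000 * 2^2 and 1.100 * 2^0
e_max, width, aligned = align_mantissas([2, 0], [0b000, 0b100], frac_bits=3)
print(e_max, width, [format(a, f"0{width}b") for a in aligned])
```

The columns produced this way form the aligned bit matrix of Fig. 2(b), in which every row shares one exponent.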
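② Bit-Extraction module
The mantissas output by the E-alignment module are fed into the Bit-Extraction module for bit pruning. The first functional component of this module is BITCNT, which implements the BitCnt function in formula (4). The second function of the Bit-Extraction module is to sort the shifted BitCnt(i) values and select the n rows with the largest p_i; the weights in the other rows are pruned. This yields the pruned weights.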
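③ Compute Unit (CU) component
The pruned weights have higher bit sparsity, so this design places a zero-skipping mechanism in the "extractor" of the "Bit-Extraction" module and sends only the essential bits on to the computing unit (CU) module.
The microarchitecture of the CU is shown in Fig. 4. Each "selector" in the extractor handles one pruned binary weight (M weights in total), and k denotes an essential bit in a pruned weight. The extractor records the actual bit value of each essential bit, denoted s, which is used to shift the corresponding activation value.
Activation values may be floating-point or fixed-point data. Fixed-point activations can be shifted directly. For floating-point activations, the shift amounts to accumulating onto the exponent of the activation, which is in effect also a fixed-point operation. The shifter therefore introduces no large overhead. The adder tree performs the final partial-sum accumulation and is also used to distinguish between different precisions.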
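The zero-skipping shift-and-add computation can be illustrated for the fixed-point case. This is a behavioral sketch only, assuming integer weights and activations and bit positions counted from the least significant bit; it does not model the selectors, the adder tree, or floating-point activations, and the function names are hypothetical.

```python
def essential_bit_positions(weight, width):
    """Return the positions s (0 = least significant) of the 1 bits left in a pruned weight."""
    return [s for s in range(width) if (weight >> s) & 1]

def shift_add_mac(weights, activations, width):
    """Compute sum(w * a) using only shifts and adds, skipping every zero bit."""
    acc = 0
    for w, a in zip(weights, activations):
        for s in essential_bit_positions(w, width):
            acc += a << s      # one shifted copy of the activation per essential bit
    return acc

weights = [0b1000, 0b0100, 0b0000]     # pruned weights with few essential bits
activations = [3, 5, 7]
print(shift_add_mac(weights, activations, width=4))
```

The number of additions equals the number of essential bits, which is why pruning more bit rows directly reduces the work per MAC.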
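The following is a system embodiment corresponding to the method embodiment above; the two can be implemented in cooperation with each other. The technical details given for the method embodiment remain valid for this embodiment and are not repeated here. Correspondingly, the technical details given for this embodiment also apply to the method embodiment.
The present invention further proposes a neural network real-time pruning system, comprising:
module 1, configured to obtain a bit matrix to be matrix-multiplied in a neural network model, and to take the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication;
module 2, configured to classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and to take the matrix obtained by setting to zero the bits that are 1 in the unimportant rows of the bit matrix as the pruning result of the bit matrix.
In the neural network real-time pruning system, module 1 obtains the importance of each bit row of the bit matrix in the matrix multiplication by the following formula: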
Figure PCTCN2022077281-appb-000015
Figure PCTCN2022077281-appb-000016
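where p_i is the importance of the i-th bit row of the bit matrix in the matrix multiplication, E_i is the bit value of the elements of the i-th bit row, BitCnt(i) is the number of effective bits in the i-th bit row, and l is the number of columns of the bit matrix.
In the neural network real-time pruning system, before module 1 is invoked, a plurality of original weights to be matrix-multiplied are obtained and it is determined whether the original weights are fixed-point numbers; if so, module 1 is invoked; otherwise, all mantissas of the original weights are aligned to the maximum exponent of the plurality of original weights, the aligned matrix is taken as the bit matrix, and module 1 is invoked.
In the neural network real-time pruning system, the bit matrix is a weight matrix and/or an activation matrix, and module 2 classifies the N bit rows of highest importance in the bit matrix as important rows, N being a positive integer smaller than the total number of bit rows of the bit matrix.
The present invention further proposes a neural network accelerator for use in the above neural network real-time pruning system.
The neural network accelerator comprises a PE composed of multiple CUs; each CU accepts multiple weight/activation pairs as input, and the input weight values are pruned by module 2.
In the neural network accelerator, each selector of the extractor in a CU handles one pruned binary weight, and the extractor records the actual bit value of the bits in each important row, which is used to shift the corresponding activation value.
The present invention further proposes a server comprising a storage medium, wherein the storage medium is configured to store instructions for executing the above neural network real-time pruning method.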
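Industrial applicability
The present invention proposes a hardware-based neural network real-time pruning method and system and a neural network accelerator, comprising: obtaining a bit matrix to be matrix-multiplied in a neural network model, and taking the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication; and classifying each bit row of the bit matrix as an important row or an unimportant row according to the importance, and taking the matrix obtained by setting to zero the bits that are 1 in the unimportant rows of the bit matrix as the pruning result of the bit matrix. The present invention is a pruning method based on effective bits; its way of judging bit effectiveness requires no software-level pruning, is independent of existing software pruning methods, and supports multi-precision DNNs.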

Claims (12)

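  1. A neural network real-time pruning method, characterized by comprising:
    step 1: obtaining a bit matrix to be matrix-multiplied in a neural network model, and taking the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication;
    step 2: classifying each bit row of the bit matrix as an important row or an unimportant row according to the importance, and taking the matrix obtained by setting to zero the bits that are 1 in the unimportant rows of the bit matrix as the pruning result of the bit matrix.
  2. The neural network real-time pruning method according to claim 1, characterized in that step 1 comprises obtaining the importance of each bit row of the bit matrix in the matrix multiplication by the following formula: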
    Figure PCTCN2022077281-appb-100001
    Figure PCTCN2022077281-appb-100002
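    where p_i is the importance of the i-th bit row of the bit matrix in the matrix multiplication, E_i is the bit value of the elements of the i-th bit row, BitCnt(i) is the number of effective bits in the i-th bit row, and l is the number of columns of the bit matrix.
  3. The neural network real-time pruning method according to claim 1, characterized in that, before step 1 is performed, a plurality of original weights to be matrix-multiplied are obtained and it is determined whether the original weights are fixed-point numbers; if so, step 1 is performed; otherwise, all mantissas of the original weights are aligned to the maximum exponent of the plurality of original weights, the aligned matrix is taken as the bit matrix, and step 1 is performed.
  4. The neural network real-time pruning method according to claim 1, characterized in that the bit matrix is a weight matrix and/or an activation matrix, and step 2 comprises: classifying the N bit rows of highest importance in the bit matrix as important rows, N being a positive integer smaller than the total number of bit rows of the bit matrix.
  5. A neural network real-time pruning system, characterized by comprising:
    module 1, configured to obtain a bit matrix to be matrix-multiplied in a neural network model, and to take the product of the Euclidean distances of the bit rows and bit columns of the bit matrix as the importance of each bit row of the bit matrix in the matrix multiplication;
    module 2, configured to classify each bit row of the bit matrix as an important row or an unimportant row according to the importance, and to take the matrix obtained by setting to zero the bits that are 1 in the unimportant rows of the bit matrix as the pruning result of the bit matrix.
  6. The neural network real-time pruning system according to claim 1, characterized in that module 1 obtains the importance of each bit row of the bit matrix in the matrix multiplication by the following formula: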
    Figure PCTCN2022077281-appb-100003
    Figure PCTCN2022077281-appb-100004
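    where p_i is the importance of the i-th bit row of the bit matrix in the matrix multiplication, E_i is the bit value of the elements of the i-th bit row, BitCnt(i) is the number of effective bits in the i-th bit row, and l is the number of columns of the bit matrix.
  7. The neural network real-time pruning system according to claim 1, characterized in that, before module 1 is invoked, a plurality of original weights to be matrix-multiplied are obtained and it is determined whether the original weights are fixed-point numbers; if so, module 1 is invoked; otherwise, all mantissas of the original weights are aligned to the maximum exponent of the plurality of original weights, the aligned matrix is taken as the bit matrix, and module 1 is invoked.
  8. The neural network real-time pruning system according to claim 1, characterized in that the bit matrix is a weight matrix and/or an activation matrix, and module 2 classifies the N bit rows of highest importance in the bit matrix as important rows, N being a positive integer smaller than the total number of bit rows of the bit matrix.
  9. A neural network accelerator, characterized in that it is used for the neural network real-time pruning system according to any one of claims 5 to 8.
  10. The neural network accelerator according to claim 9, characterized by comprising a PE composed of multiple CUs, wherein each CU accepts multiple weight/activation pairs as input and the input weight values are pruned by module 2.
  11. The neural network accelerator according to claim 9, characterized in that each selector of the extractor in a CU handles one pruned binary weight, and the extractor records the actual bit value of the bits in each important row, which is used to shift the corresponding activation value.
  12. A server comprising a storage medium, characterized in that the storage medium is configured to store instructions for executing the neural network real-time pruning method according to any one of claims 1 to 4.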
PCT/CN2022/077281 2021-08-20 2022-02-22 Neural network real-time pruning method and system, and neural network accelerator WO2023019899A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110960966.XA CN113657595B (zh) 2021-08-20 2021-08-20 Neural network accelerator based on real-time neural network pruning
CN202110960966.X 2021-08-20

Publications (1)

Publication Number Publication Date
WO2023019899A1 (zh) 2023-02-23

Family

ID=78481585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077281 WO2023019899A1 (zh) 2021-08-20 2022-02-22 Neural network real-time pruning method and system, and neural network accelerator

Country Status (2)

Country Link
CN (1) CN113657595B (zh)
WO (1) WO2023019899A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
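CN118171697A (zh) * 2024-05-13 2024-06-11 State Grid Shandong Electric Power Company Jinan Power Supply Company Method and apparatus for deep neural network compression, computer device, and storage medium
CN118314473A (zh) * 2024-06-11 2024-07-09 Zhejiang Lab Spaceborne target recognition method and apparatus based on model pruning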

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
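CN113657595B (zh) * 2021-08-20 2024-03-12 Institute of Computing Technology, Chinese Academy of Sciences Neural network accelerator based on real-time neural network pruning
CN114819141B (zh) * 2022-04-07 2024-08-13 Xidian University Intelligent pruning method and system for deep network compression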

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170344876A1 (en) * 2016-05-31 2017-11-30 Samsung Electronics Co., Ltd. Efficient sparse parallel winograd-based convolution scheme
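CN111860826A (zh) * 2016-11-17 2020-10-30 Beijing Tusen Zhitu Technology Co., Ltd. Image data processing method and apparatus for a processing device with low computing power
CN112329910A (zh) * 2020-10-09 2021-02-05 Southeast University Deep convolutional neural network compression method combining structured pruning with quantization
CN112396179A (zh) * 2020-11-20 2021-02-23 Zhejiang University of Technology Flexible deep learning network model compression method based on channel gradient pruning
CN113657595A (zh) * 2021-08-20 2021-11-16 Institute of Computing Technology, Chinese Academy of Sciences Neural network real-time pruning method and system, and neural network accelerator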

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
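CN108090560A (zh) * 2018-01-05 2018-05-29 Suzhou Institute, University of Science and Technology of China Design method for an FPGA-based LSTM recurrent neural network hardware accelerator
CN108932548A (zh) * 2018-05-22 2018-12-04 Suzhou Institute, University of Science and Technology of China FPGA-based sparse neural network acceleration system
CN110378468B (zh) * 2019-07-08 2020-11-20 Zhejiang University Neural network accelerator based on structured pruning and low-bit quantization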

Also Published As

Publication number Publication date
CN113657595A (zh) 2021-11-16
CN113657595B (zh) 2024-03-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857220

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22857220

Country of ref document: EP

Kind code of ref document: A1