WO2023070997A1 - Deep learning convolution acceleration method and processor utilizing bit-level sparsity - Google Patents

Deep learning convolution acceleration method and processor utilizing bit-level sparsity

Info

Publication number
WO2023070997A1
WO2023070997A1 (PCT/CN2022/077275)
Authority
WO
WIPO (PCT)
Prior art keywords
weight
matrix
bit
data pairs
bits
Prior art date
Application number
PCT/CN2022/077275
Other languages
English (en)
French (fr)
Inventor
路航
李晓维
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 filed Critical 中国科学院计算技术研究所
Publication of WO2023070997A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/4836Computations with rational numbers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487Multiplying; Dividing
    • G06F7/4876Multiplying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the field of deep learning accelerator design, and in particular to a deep learning acceleration method and an intelligent processor utilizing bit-level sparsity.
  • Figure 1 compares the computing paradigms of three types of accelerator PEs by way of example.
  • Early bit-parallel accelerators (Fig. 1(a)) and bit-serial accelerators compute the inner product with numerically identical bit-level arithmetic.
  • For example, an 8b×8b product is decomposed into eight 1b×8b products, which are organized and fed in serially (the first step in Fig. 1(b)) to produce the same result.
  • Fig. 1(c) is a calculation example of the present invention.
  • Figure 1 compares the computation distribution of the bit-interleaved PE in fixed-point mode with that of previous bit-parallel/serial PEs.
  • the white background marks the sparse bit (0 bit), and the gray background marks the basic bit (1 bit).
  • In the bit-parallel PE (a), Step 1 organizes the weights in parallel and Step 2 performs the MAC.
  • In the bit-serial PE (b), Step 1 organizes the weights in a serial fashion, Step 2 synchronizes the bit values of the necessary bits, and Step 3 performs a "bit-serial" MAC.
  • In the bit-interleaved PE (c), Step 1 organizes the weights in parallel, but Step 2 performs a serial MAC along the value of each bit without any synchronization.
  • Previously used synchronization methods include intermediate dense scheduling and hardware-level Booth encoding.
  • A key weakness of these approaches is the difficulty of finding a unified pattern that describes where the bits to be synchronized are located.
  • An immediate consequence is that ongoing MAC operations must be halted to adjust bit significance, at the cost of reduced throughput compared with their bit-parallel counterparts. For example, in Fig. 1(b), each weight's MAC must wait until the MAC of the preceding weight has completed.
  • On the hardware side, complexity also increases, because Booth encoding requires additional circuits to encode and store the weight bits.
  • Another weakness is that this serialized organization cannot support floating-point operations; the use cases of bit-serial accelerators are therefore severely limited, and they cannot be deployed in broader scenarios.
  • The purpose of the present invention is to solve the efficiency and versatility problems of the existing deep learning accelerator designs described above, to propose a computation method exploiting bit-level sparse parallelism, the "bit interleaving" method, and to design a hardware accelerator, Bitlet, that implements it.
  • the present invention proposes a deep learning convolution acceleration method utilizing bit-level sparsity, including:
  • Step 1: Obtain multiple sets of data pairs to be convolved; each data pair consists of an activation value and its corresponding original weight, and both the activation value and the original weight are floating-point numbers;
  • Step 2: Sum the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and select the largest exponent sum among all data pairs as the maximum exponent;
  • Step 3: Arrange the mantissas of the original weights in computation order to form a weight matrix, and align each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
  • Step 4: Remove the slack bits (0 bits) from the alignment matrix to obtain a reduced matrix containing vacancies, let the basic bits (1 bits) in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, remove the empty rows of the intermediate matrix, and set the remaining vacant positions to 0 to obtain the interleaved weight matrix; each row of the interleaved weight matrix is taken as a necessary weight;
  • Step 5: From the correspondence between the activation values and the basic bits of the original weights, obtain the position information indicating which activation value each bit of the necessary weight corresponds to; send the necessary weight to the split accumulator, which splits it bit-wise into multiple weight segments; according to the position information, send each weight segment together with the mantissa of its corresponding activation value to the adder tree for summation; and perform shift-and-add on the results to obtain the output feature map as the convolution result of the multiple sets of data pairs.
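To make the five steps concrete, here is a minimal software sketch of the method as a dot product over one group of data pairs. It assumes NumPy; the helper names (split, bit_interleave_dot) are illustrative and not from the patent, and the integer-shift alignment only models the arithmetic (the dropped low-order bits are the negligible slack discussed later), not the Bitlet circuits.

```python
import numpy as np

MANT = 24  # 23 stored fraction bits plus the hidden leading 1

def split(x):
    """x ~= sign * mant * 2**exp, with mant a 24-bit integer mantissa."""
    frac, exp = np.frexp(float(x))                 # x = frac * 2**exp, 0.5 <= |frac| < 1
    return (1 if x >= 0 else -1), int(abs(frac) * (1 << MANT)), int(exp) - MANT

def bit_interleave_dot(acts, weights):
    """Steps 1-5 for one output element: returns approximately sum(a_i * w_i)."""
    rows = []
    for a, w in zip(acts, weights):                # step 1: data pairs (A_i, W_i)
        if a == 0.0 or w == 0.0:
            continue                               # zero pairs contribute nothing
        sa, ma, ea = split(a)
        sw, mw, ew = split(w)
        rows.append((sa * sw, ma, mw, ea + ew))    # sign, mantissas, exponent sum
    if not rows:
        return 0.0
    emax = max(e for *_, e in rows)                # step 2: maximum exponent

    # step 3: align every weight mantissa to emax (bits shifted out are the slack)
    aligned = [(s, ma, mw >> (emax - e)) for s, ma, mw, e in rows]

    # steps 4-5: per bit lane, the 1 bits select which activation mantissas are summed;
    # compacting a lane's 1 bits into "necessary weights" only reschedules this sum,
    # so the model folds it directly into the lane-wise accumulation
    psum = 0
    for b in range(MANT):                          # bit lane b, counted from the LSB here
        lane = sum(s * ma for s, ma, mw in aligned if (mw >> b) & 1)
        psum += lane << b                          # works for negative lane sums too
    return float(psum) * 2.0 ** emax               # single common shift by the maximum exponent

rng = np.random.default_rng(0)
a = rng.standard_normal(16).astype(np.float32)
w = rng.standard_normal(16).astype(np.float32)
print(bit_interleave_dot(a, w), float(np.dot(a.astype(np.float64), w.astype(np.float64))))
```

The two printed values agree up to the low-order bits dropped during alignment, which mirrors the precision argument made later for the BCE.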
  • the activation value is a pixel value of an image.
  • the present invention also proposes a processor for implementing the above-mentioned deep learning convolution acceleration method utilizing bit-level sparsity.
  • the processor includes:
  • the preprocessing module is used to obtain multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both being floating-point numbers; it sums the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and selects the largest exponent sum among all data pairs as the maximum exponent;
  • the exponent alignment module is used to arrange the mantissas of the original weights in computation order to form a weight matrix, and to align each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
  • the weight interleaving module is used to remove the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, remove the empty rows of the intermediate matrix, set the remaining vacant positions to 0 to obtain the interleaved weight matrix, and take each row of the interleaved weight matrix as a necessary weight;
  • the circular register is used to extract the basic bits of the necessary weight and, from the corresponding mantissa among the mantissas of all activation values, obtain the position information indicating which activation value each bit of the necessary weight corresponds to;
  • the split accumulator is used to split the necessary weight bit-wise into multiple weight segments, send, according to the position information, each weight segment together with the mantissa of the corresponding activation value to the adder tree for summation, and perform shift-and-add on the processing results to obtain the output feature map as the convolution result of the multiple sets of data pairs.
  • the activation value is a pixel value of an image.
  • the present invention has the advantages of:
  • the designed accelerator occupies an area of 1.5 mm²; under the TSMC 28 nm process the accelerator area is 0.039 square millimeters, with power consumption of 570 milliwatts (32-bit floating-point mode), 432 milliwatts (16-bit fixed-point mode) and 365 milliwatts (8-bit fixed-point mode), respectively.
  • the accelerator is highly configurable.
  • Figure 1 is a comparison diagram of the computation distribution of a bit-interleaved PE and bit-parallel/serial PEs in fixed-point mode;
  • Figure 2 is a schematic diagram of sparse parallelism;
  • Figure 3 is a conceptual schematic diagram of bit interleaving;
  • Figure 4 is a structural diagram of the BCE module;
  • Figure 5 is a structural diagram of the Bitlet accelerator.
  • bit sparsity is an inherent finer-grained sparsity that targets "zero bits" in each operand rather than coarse-grained zero values.
  • the percentage of zero bits can range from 45% to 77% in different DNN models. Skipping zero bits in an operand does not affect the result, which also means that if bit-efficient computations are strictly enforced, the speedup can be directly obtained without any software-level effort. Therefore, the present invention utilizes rich bit-level sparse parallelism to accelerate the training and inference phases to serve general deep learning at the cloud/edge end.
  • each point in the figure represents the proportion of zero bits of all weights in the bit lane in this convolution kernel.
  • the illustration shows that about 50% of the bits in all convolution kernels are 0.
  • the sparsity only includes the mantissa (23/10 bits out of 32/16 bits of floating point), and in the int8-bit precision representation, only 7 significant bits are included, excluding the sign bit.
  • Figure 2 shows the bit sparsity of different convolution kernels and observes that the weight sparsity on each bit value is consistent.
  • the X axis represents the bit position of the mantissa, so there are 23 bits in total, excluding the hidden 1 bit of the standard fp32 format.
  • Each dot represents the proportion of zero bits at that bit index within a convolution kernel.
  • Taking ResNet152 and MobileNetV2 as examples, there is obvious clustering in the first half of the mantissa (bit 0 to bit 16), meaning the numbers of 0s and 1s at these bit positions are nearly equal. This makes it advantageous to read the weights into the accelerator in parallel and compute them serially. In addition, for bits 17-23 the points mostly overlap at 100% on the Y axis (the long tail of fp32 numbers), meaning most of these bits are 0. Because floating-point multipliers are designed to cover any operand, they do not distinguish such suboptimal cases; this is the fundamental reason why floating-point multiply-accumulate (MAC) and convolution operations are difficult to accelerate.
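The per-lane statistics summarized above are easy to reproduce for any fp32 weight tensor. A short sketch (the function name and the random stand-in kernel are illustrative; the bit-lane orientation, with lane 0 as the most significant stored fraction bit, is an assumption matching the description of bit 0 above):

```python
import numpy as np

def mantissa_zero_ratio(weights):
    """Fraction of zero bits at each of the 23 stored fp32 mantissa positions."""
    w = np.ascontiguousarray(weights, dtype=np.float32).ravel()
    bits = w.view(np.uint32)                     # raw IEEE-754 encodings
    ratios = []
    for lane in range(23):                       # lane 0 = MSB of the stored fraction
        ones = (bits >> (22 - lane)) & 1         # hidden leading 1 is excluded
        ratios.append(1.0 - ones.mean())
    return np.array(ratios)

kernel = np.random.randn(64, 3, 3, 3).astype(np.float32)   # stand-in for a conv kernel
print(mantissa_zero_ratio(kernel).round(2))
# random data sits near 0.5 in every lane; trained fp32 weights additionally show the
# near-100% "long tail" in the low-order lanes described above
```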
  • Although fixed-point representations have been successful for efficient DNN inference, accelerators designed for fixed-point precision can only perform inference, making these designs difficult to use in general-purpose scenarios.
  • For example, DNN training still relies on floating-point backpropagation to keep the model tuned in floating point, yet it must still meet real-time requirements, especially when fixed-point precision cannot deliver the required accuracy.
  • Ideally, accelerators should be suitable for most use cases and should provide end users with enough convenience and flexibility.
  • the present invention proposes a sparse parallel design mode based on bit interleaving.
  • the advantage of bit-serial accelerators is the efficient use of the sparsity of bits.
  • bit-serial accelerators provide relatively lower throughput than their bit-parallel counterparts.
  • the present invention proposes a bit-interleaved design, which combines the advantages of the above design and avoids the disadvantages.
  • This design mode can significantly surpass the previous bit serial/parallel mode.
  • Accelerator Bitlet adopts the design concept of bit interleaving, and also supports multiple precisions including floating point and fixed point. This configurable feature makes Bitlet suitable for both high-performance and low-power scenarios.
  • a floating-point operand consists of three parts: a sign bit, a mantissa, and an exponent, and follows the IEEE754 standard, which is also the most commonly used floating-point standard in the industry. If we use single-precision floating-point numbers (fp32), the mantissa bit width is 23 bits, the exponent bit width is 8 bits, and the remaining bit is the sign bit.
  • Representing the activations A_i in the IEEE-754 binary format as well (Equation 2 expresses the partial sum through a bit-wise decomposition of the weight mantissas; see the detailed description), Equation 2 can be rewritten as Equation 3.
  • Letting E_i denote the exponent sum of the weight and activation in each pair and E_max = max_i E_i, Equation 3 can be rewritten as Equation 5: psum = 2^{E_max} · Σ_{i=1..N} Σ_{b=0..23} (-1)^{s_i} · M_i^A · M_i^b · 2^{E_i - E_max - b}, where M_i^A is the activation mantissa and M_i^b is the b-th bit of the weight mantissa.
  • The result of N fp32 MACs is thus equivalent to a series of bit-level operations on the corresponding mantissas: whenever a weight bit M_i^b equals 1, the sum of the N MACs reduces to a sum of N signed activation mantissas M_i^A, each shifted left (right) by E_i - E_max - b bits.
  • According to Equation 5, this computation theory also covers fixed-point precision.
  • In that case E_max and E_i - E_max are not necessary, because fixed-point representations do not have exponents.
  • the present invention will use examples to describe in detail how bit interleaving works under floating-point 32-bit precision weights, as well as design details of the Bitlet accelerator supporting multiple precisions.
  • Figure 1(c) illustrates the bit-interleaving process of an 8-bit fixed-point MAC and demonstrates it step by step.
  • floating-point MAC is not as easy to exploit as fixed-point MAC, because there is a special part in the binary operand - the exponent, and different operands have different exponents.
  • bit-interleaving consists of three separate but sequential steps based on Equation 5.
  • Figure 3(a) uses an example as an illustration, 6 common 32-bit floating-point weights are arranged in rows, and the exponent and mantissa of each weight are arbitrary.
  • the triangle mark indicates the actual position of the binary point.
  • the actual 32-bit binary format stored in memory is not shown; a more illustrative representation is used to express the values (for example, 0.01 in binary with an exponent of -2 represents the decimal value 0.25).
  • This step is similar to the first step in Fig. 1(c), except that here the 32-bit floating-point weights are organized in parallel for interleaving. These binary weights are preprocessed to obtain their respective exponents, from which the "maximum" exponent (E_6 in this case) is further determined.
  • the mantissa is also stored for later MAC calculation.
  • the tail bits (bits 9-23) of each mantissa are omitted.
  • the exponent indicates the position of the binary point in the binary representation.
  • Traditionally, this involves the "exponent alignment" step of floating-point addition.
  • In bit interleaving, however, the exponents of a group of floating-point numbers are aligned uniformly to the maximum value (E_6 in this example) instead of being handled one by one.
  • This step is called "dynamic exponent alignment" and is not covered in Fig. 1(c), because fixed-point values do not have exponents.
  • In Equation 5, the two summations can be performed in parallel during actual execution.
  • the outer summation represents the vertical dimension in Figure 3(a), that is, the N weights and their corresponding activation values;
  • the inner summation represents the horizontal dimension, that is, the different bit widths of the mantissa.
  • the key is how to use the necessary bits to obtain accurate partial sums and further obtain better inference speed.
  • this step exploits this property to extract the necessary bits, exactly as in the corresponding step of Fig. 1(c).
  • the total computation can be reduced from a MAC over 6 operands to only 3.
  • Take W_6 as an example: its exponent is 6 and its first bit (b = 0) is a necessary 1 bit; following Equation 5, the weight of this bit equals 2^6, which means the bit sits in the 7th position before the binary point.
  • For W_1 to W_5, the bits at the 2^6 position are all zero after alignment. If the first bit of W_6 is moved up to take the place of the same vertical lane in W_1, then A_6×2^6 + A_1×2^3 can be computed at the same time.
  • To perform bit interleaving, we design a new accelerator named Bitlet. This part describes Bitlet's key hardware design modules, including the micro-architecture of the computing engine supporting multiple precisions and the overall architecture for efficient memory access.
  • Key module 1, the preprocessing module, is a component that covers two steps of the "bit interleaving" operation. Bitlet takes pairs of weights and activations as input, denoted by N in Figure 4.
  • In the Bitlet computing engine (hereinafter referred to as BCE), W_0 to W_{N-1} are the original weights and A_0 to A_{N-1} are the corresponding activations.
  • the preprocessing module decomposes each W_i and A_i into two parts, mantissa and exponent, and computes the exponent sum of each A/W pair. Afterwards, the maximum exponent E_max is selected and stored in a register for the subsequent dynamic exponent alignment. Once E_max is determined, each mantissa is shifted left (right) so that its exponent is consistent with E_max.
  • The RR-reg extracts the necessary 1 bits (basic bits) from the interleaved weights and selects the BCE output from the N activation mantissas.
  • Each RR-reg has its own internal clock and is connected to the accelerator clock tree.
  • the pseudo-code shown in Figure 4 describes the concrete procedure: the RR-reg first extracts the necessary 1 bits sequentially from the input bits.
  • the "select" signal indicates which activation path and output O_i the decoding component should choose. If no necessary 1 bit is detected, the RR-reg asserts the "zero-fill" signal and O_i also outputs 0.
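The referenced pseudo-code appears only in Figure 4 and is not reproduced here; the following is a hypothetical software model of the per-cycle behavior just described for a single RR-reg (one bit lane). The names are invented for the sketch: each cycle it slides to the next necessary 1 bit, drives the "select" index so the matching activation mantissa becomes the output O_i, and emits 0 via the zero-fill path when the lane holds no further 1 bits.

```python
def rr_reg_cycles(lane_bits, act_mantissas):
    """Yield one output O_i per clock cycle for a single bit lane.

    lane_bits[i] is bit b of weight W_i after alignment (one column of the interleaved
    weight matrix); act_mantissas[i] is the mantissa of the corresponding activation A_i.
    """
    pos = 0
    while pos < len(lane_bits):
        while pos < len(lane_bits) and lane_bits[pos] == 0:
            pos += 1                      # sliding search for the next necessary 1 bit
        if pos == len(lane_bits):
            yield 0                       # "zero-fill": no necessary bit left in this lane
            return
        select = pos                      # "select": route activation mantissa 'pos' out
        yield act_mantissas[select]
        pos += 1

lane = [1, 0, 0, 1, 0, 1]                 # six weights, three necessary bits in this lane
acts = [11, 22, 33, 44, 55, 66]
print(list(rr_reg_cycles(lane, acts)))    # [11, 44, 66]: three useful cycles, not six MACs
```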
  • BCE has the following three major characteristics:
  • (1) The architecture causes no precision loss, because the dynamic exponent alignment described above is the same as floating-point arithmetic under IEEE 754.
  • The rightmost bits after the shift are discarded in the computation, but their values are tiny, so they can be ignored with no impact on accuracy.
  • (2) BCE does not require any preprocessing of parameter sparsity.
  • The preprocessing module in Figure 4 is only responsible for converting the weights and activation values into their corresponding mantissas and exponents.
  • In the actual RTL implementation, each RR-reg implements a sliding window to achieve automatic interleaving and extraction of the necessary bits. Benefiting from the favorable conditions of sparse parallelism, the extraction can be completed almost simultaneously in every RR-reg.
  • (3) Apart from the RR-regs, BCE consists mainly of combinational circuits and involves no complex wiring that could cause long critical-path delays.
  • Each RR-reg produces one output O_i per clock cycle, but the total number of cycles for computing the partial sum is greatly reduced compared with traditional one-to-one MACs.
  • N is the only design parameter of BCE; a larger N helps extract more 1 bits.
  • PE: Bitlet is composed of mesh-connected PEs. As shown in Figure 5, each PE consists of a BCE and an adder tree; the BCE connects the on-chip cache and the adder tree. Each PE takes N weights and activations as serial input and produces partial sums O_i as inputs to the adder tree. Since the BCE output is limited by the 24-bit mantissa, the adder tree also has 24 inputs. The PE finalizes the result by multiplying by 2^{E_max + b} (note that b is negative here) to ensure correctness; this factor can be decomposed into a fixed part, b, and a common part, E_max, used in producing the BCE output. The fixed part is executed with a fixed number of shifts, while E_max only needs to be applied to the accumulator result. Computing O_i requires only fixed-point additions of the activation mantissas and no multiplication, which also means that the arithmetic complexity and power consumption are optimized accordingly.
  • the Bitlet accelerator provides separate DMA channels for activations and weights.
  • the local cache stores the data obtained from the DDR3 memory and provides sufficient bandwidth for the corresponding Bitlet PE access.
  • the bandwidth of each channel between the memory and the local cache reaches 12.8GB/s, and the PE array can use a total of 25.6GB/s bandwidth to obtain the activation value and weight data from the local cache.
  • In terms of dataflow, Bitlet uses a weight-stationary and activation-broadcast mechanism to reduce main memory accesses.
  • The Bitlet accelerator supports multi-precision operation. It can easily be configured into fixed-point mode, providing sufficient flexibility for end users. For example, if 16-bit fixed-point precision is desired, the parts of the preprocessing module that perform exponent alignment and shifting (Figure 4) are gated, and the inputs W_i are connected directly to the line coordinator. Bitlet was originally designed to support a 24-bit mantissa, so if 16-bit fixed-point precision is used, only RR-reg 0 to RR-reg 15 participate. The other RR-regs can safely be turned off or left idle. Int8 quantization or any other target precision (e.g. int4, int9) is handled similarly. End users therefore do not need to resort to other precision-specific accelerators for different use cases, and can freely configure DNNs on this platform to meet accuracy targets and power/performance trade-offs.
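A rough illustration of this configurability (the class and field names below are invented for the sketch and are not taken from the Bitlet RTL): selecting a precision only decides whether the exponent-alignment stage of the preprocessing module is bypassed and how many of the 24 RR-regs remain active.

```python
from dataclasses import dataclass

@dataclass
class BitletConfig:
    precision: str                                 # "fp32", "int16", "int9", "int8", "int4", ...

    @property
    def mantissa_bits(self) -> int:
        return {"fp32": 24, "int16": 16, "int9": 9, "int8": 8, "int4": 4}[self.precision]

    @property
    def exponent_alignment_enabled(self) -> bool:
        return self.precision == "fp32"            # fixed-point inputs bypass the alignment shifts

    @property
    def active_rr_regs(self) -> list:
        # RR-reg 0 .. RR-reg (mantissa_bits - 1) participate; the rest are gated off or idle
        return list(range(self.mantissa_bits))

cfg = BitletConfig("int16")
print(cfg.exponent_alignment_enabled, cfg.active_rr_regs)   # False, RR-reg 0 ... RR-reg 15
```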
  • the present invention also proposes a processor for implementing the above-mentioned deep learning convolution acceleration method utilizing bit-level sparsity.
  • the processor includes:
  • the preprocessing module is used to obtain multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both being floating-point numbers; it sums the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and selects the largest exponent sum among all data pairs as the maximum exponent;
  • the exponent alignment module is used to arrange the mantissas of the original weights in computation order to form a weight matrix, and to align each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
  • the weight interleaving module is used to remove the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, let the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, remove the empty rows of the intermediate matrix, set the remaining vacant positions to 0 to obtain the interleaved weight matrix, and take each row of the interleaved weight matrix as a necessary weight;
  • the circular register is used to extract the basic bits of the necessary weight and, from the corresponding mantissa among the mantissas of all activation values, obtain the position information indicating which activation value each bit of the necessary weight corresponds to;
  • the split accumulator is used to split the necessary weight bit-wise into multiple weight segments, send, according to the position information, each weight segment together with the mantissa of the corresponding activation value to the adder tree for summation, and perform shift-and-add on the processing results to obtain the output feature map as the convolution result of the multiple sets of data pairs.
  • the activation value is a pixel value of an image.
  • the present invention proposes a deep learning convolution acceleration method and processor utilizing bit-level sparsity, including: obtaining multiple sets of data pairs to be convolved; summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and selecting the largest exponent sum among all data pairs as the maximum exponent; arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning the rows of the weight matrix to the maximum exponent to obtain an alignment matrix; removing the slack bits from the alignment matrix to obtain a reduced matrix, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, and setting the remaining vacant positions to 0 to obtain the interleaved weight matrix; and sending the weight segments of each row of the interleaved weight matrix together with the mantissas of the corresponding activation values to an adder tree for summation, and performing shift-and-add on the results to obtain the output feature map as the convolution result of the multiple sets of data pairs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Nonlinear Science (AREA)
  • Computational Linguistics (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present invention proposes a deep learning convolution acceleration method and processor utilizing bit-level sparsity, comprising: obtaining multiple sets of data pairs to be convolved; summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and selecting the largest exponent sum among all pairs as the maximum exponent; arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix; removing the slack bits from the alignment matrix to obtain a reduced matrix, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, and setting the remaining vacant positions to 0 to obtain the interleaved weight matrix; and sending the weight segments of each row of the interleaved weight matrix together with the mantissas of the corresponding activation values to an adder tree for summation, and performing shift-and-add on the results to obtain the output feature map as the convolution result of the multiple sets of data pairs.

Description

Deep learning convolution acceleration method and processor utilizing bit-level sparsity
Technical Field
The present invention relates to the field of deep learning accelerator design, and in particular to a deep learning acceleration method and an intelligent processor utilizing bit-level sparsity.
Background Art
To achieve higher accuracy, deep learning models keep growing in scale, and the performance of deep learning accelerators should keep pace with this growth. However, because of constraints such as battery life, power budget and cost, especially on embedded devices such as robots, drones and smartphones, hardware designers are unwilling to keep investing more computing resources to follow the development of DNNs (deep neural networks). Improving accelerator efficiency is therefore highly desirable in both high-performance and low-power scenarios.
A large body of prior research focuses on exploiting the maximum potential of weight/activation sparsity and executing as many effective MACs (multiply-accumulate operations) in parallel as possible. Sparsity, however, is not always sufficient: it varies across models and even across the layers of one model. For example, activations exhibit more sparsity because of non-linear activation functions, whereas weight sparsity is usually low unless the model is trained with the L1 criterion. Even activations can produce zero values only after passing through functions such as ReLU or PReLU. To address this challenge, some works identify the near-"zero-value" portions of the operand set, or carry out tedious sparse (re)training, to create more room for pruning.
Past work has proposed a series of bit-serial accelerators that exploit the abundant bit-level sparsity to different degrees. Figure 1 compares the computing paradigms of three types of accelerator PEs by example. Early bit-parallel accelerators (Fig. 1(a)) and bit-serial accelerators compute the inner product with numerically identical bit-level arithmetic: for example, an 8b×8b product is decomposed into eight 1b×8b products, and the weights are organized and fed in serially (the first step of Fig. 1(b)) to produce the same result. Fig. 1(c) is a computation example of the present invention. Figure 1 compares the computation distribution of the bit-interleaved PE in fixed-point mode with previous bit-parallel/serial PEs; a white background marks sparse bits (0 bits) and a gray background marks basic bits (1 bits). In the bit-parallel PE (a), Step 1 organizes the weights in parallel and Step 2 performs the MAC. In the bit-serial PE (b), Step 1 organizes the weights serially, Step 2 synchronizes the bit values of the necessary bits, and Step 3 performs a "bit-serial" MAC. In the bit-interleaved PE (c), Step 1 organizes the weights in parallel, but Step 2 performs a serial MAC along the value of each bit without any synchronization.
However, the room for exploring value-based sparsity has reached its end. From the software perspective, if lossless accuracy is the primary design requirement, there is a clear margin that the compression ratio cannot exceed; whatever pruning method is used, considerable time must be spent exploring this margin to balance model accuracy and size. From the hardware-implementation perspective, exploiting value sparsity inevitably leads to more complex accelerator designs, for example enlarging the storage system to accommodate ever-growing index sizes, at the cost of more memory accesses and a lower peak compute throughput.
Existing techniques also have other problems. As shown in Fig. 1(a), to release the maximum potential of bit sparsity it is best to skip as many zero bits as possible. However, the positions of the zero bits inside each 8b operand are hard to predict, especially after fixed-point quantization, because quantization makes full use of the limited bit width to represent the value range, so zero bits are arbitrarily interleaved with the necessary 1 bits. To fully exploit the bit sparsity inherent in the parameters, synchronization must be performed carefully: as shown in the second step of Fig. 1(b), synchronization must take place before the bit-serial MAC of the final step can be completed.
Previously used synchronization methods include intermediate dense scheduling and hardware-level Booth encoding. The key weakness of these methods, however, lies in the difficulty of determining a unified pattern that describes where the bits to be synchronized are located. An immediate consequence is that ongoing MAC operations must be stopped to adjust bit significance, at the cost of weakened throughput compared with the bit-parallel counterparts; for example, in Fig. 1(b), each weight must wait until the preceding weight has completed its MAC. On the hardware side, complexity also increases, because Booth encoding requires additional circuits to encode and store the weight bits. Another weakness is that this serialized organization cannot support floating-point operations, so the application scenarios of bit-serial accelerators are severely limited and they cannot be deployed more widely.
Disclosure of the Invention
The purpose of the present invention is to solve the efficiency and versatility problems of the existing deep learning accelerator designs described above, to propose a computation method exploiting bit-level sparse parallelism, the "bit interleaving" computation method, and to design a hardware accelerator, Bitlet, that implements the "bit interleaving" method.
In view of the deficiencies of the prior art, the present invention proposes a deep learning convolution acceleration method utilizing bit-level sparsity, comprising:
Step 1: obtaining multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both the activation value and the original weight being floating-point numbers;
Step 2: summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each data pair, and selecting the largest exponent sum among all data pairs as the maximum exponent;
Step 3: arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
Step 4: removing the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, setting the remaining vacant positions of the matrix to 0 to obtain an interleaved weight matrix, and taking each row of the interleaved weight matrix as a necessary weight;
Step 5: obtaining, from the correspondence between the activation values and the basic bits of the original weights, position information indicating which activation value each bit of the necessary weight corresponds to; sending the necessary weight to a split accumulator, which splits the necessary weight bit-wise into multiple weight segments; sending, according to the position information, each weight segment together with the mantissa of the corresponding activation value to an adder tree for summation; and performing shift-and-add on the processing results to obtain an output feature map as the convolution result of the multiple sets of data pairs.
In the described deep learning convolution acceleration method utilizing bit-level sparsity, the activation value is a pixel value of an image.
The present invention further proposes a processor for implementing the above deep learning convolution acceleration method utilizing bit-level sparsity.
The processor comprises:
a preprocessing module for obtaining multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both being floating-point numbers, for summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and for selecting the largest exponent sum among all data pairs as the maximum exponent;
an exponent alignment module for arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
a weight interleaving module for removing the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, setting the remaining vacant positions of the matrix to 0 to obtain an interleaved weight matrix, and taking each row of the interleaved weight matrix as a necessary weight;
a circular register for extracting the basic bits of the necessary weight and obtaining, from the corresponding mantissa among the mantissas of all activation values, position information indicating which activation value each bit of the necessary weight corresponds to;
a split accumulator for splitting the necessary weight bit-wise into multiple weight segments, sending, according to the position information, each weight segment together with the mantissa of the corresponding activation value to an adder tree for summation, and performing shift-and-add on the processing results to obtain an output feature map as the convolution result of the multiple sets of data pairs.
In the described processor, the activation value is a pixel value of an image.
From the above solutions, the advantages of the present invention are:
(1) compared with the latest high-performance GPUs, the energy efficiency of training/inference is improved by 81×/21× respectively;
(2) compared with the state-of-the-art fixed-point accelerators, speed/efficiency are improved by 15×/8× respectively;
(3) the designed accelerator has an area of 1.5 mm²; under the TSMC 28 nm process the accelerator area is 0.039 square millimeters, with power of 570 mW (32-bit floating-point mode), 432 mW (16-bit fixed-point mode) and 365 mW (8-bit fixed-point mode) respectively;
(4) the accelerator is highly configurable.
Brief Description of the Drawings
Figure 1 compares the computation distribution of the bit-interleaved PE with bit-parallel/serial PEs in fixed-point mode;
Figure 2 is a schematic diagram of sparse parallelism;
Figure 3 is a conceptual diagram of bit interleaving;
Figure 4 is a structural diagram of the BCE module;
Figure 5 is a structural diagram of the Bitlet accelerator.
Best Mode for Carrying Out the Invention
The weaknesses of the techniques above stem mainly from exploiting value sparsity. In the course of the research behind the present invention, we found that "bit sparsity" is an inherent, finer-grained form of sparsity that targets the "zero bits" within each operand rather than coarse-grained zero values. Whether weights or activations are represented with floating-point or fixed-point numbers, the percentage of zero bits reaches 45%-77% across different DNN models. Skipping the zero bits in an operand does not affect the result, which also means that if bit-level effective computation is strictly enforced, speedup can be obtained directly without any software-level effort. The present invention therefore exploits the abundant bit-level sparse parallelism to accelerate both the training and inference phases, serving general-purpose deep learning at the cloud/edge.
In Table 1 we classify the state-of-the-art sparsity-based accelerators. In the early bit-parallel accelerators, namely Cambricon and SCNN, the study of sparsity focused only on values; software-level pruning was exploited to create more zero-value sparsity and release the potential of these accelerators. Considering that bit sparsity is abundant in both weights and activations, recent bit-serial accelerator research has focused on bit-level sparsity. The recent Laconic uses "terms" after Booth encoding to extract the necessary bits serially and proposes a low-cost LPE to reduce the power increase caused by frequent encoding/decoding. Tactical addresses sparsity at the level of weight and activation bits; its design philosophy is similar to that of Pragmatic, optimizing away ineffectual operations by skipping zero bits, but Tactical relies on a data-type-agnostic front-end to skip zero weights and on a software scheduler to maximize the opportunities for weight skipping. There are also sparsity design patterns that follow the bit-serial computation style: for example, Stripes and UNPU implement bit serialization of fixed-point operands without taking advantage of sparsity, and Bit-fusion supports faster spatial and temporal combinations to accelerate bit serialization but still cannot make good use of bit sparsity.
[Table 1: classification of state-of-the-art sparsity-based accelerators; the table appears as an image in the original document.]
Meanwhile, previous work has already demonstrated that bit-level sparsity is abundant. However, earlier work focused only on strategies for skipping zero bits within a particular weight, without exploring the sparsity across weights.
As shown in Figure 2, each point in the figure represents the proportion of zero bits at a bit lane over all the weights of one convolution kernel. The figure shows that roughly 50% of the bits in all convolution kernels are 0. On the X axis, the sparsity covers only the mantissa (23/10 bits out of the 32/16 bits of a floating-point number); for the int8 precision representation only the 7 significant bits are included, excluding the sign bit. Figure 2 shows the bit sparsity of different convolution kernels, and it can be observed that the weight sparsity at each bit position is consistent. The X axis represents the bit position of the mantissa, 23 bits in total, excluding the hidden 1 bit of the standard fp32 format. Each point represents the proportion of zero bits at that bit position within one convolution kernel. Taking ResNet152 and MobileNetV2 as examples, there is an obvious clustering in the first half of the mantissa (bit 0 to bit 16), which means that the numbers of 0s and 1s at these bit positions are nearly equal. This provides favorable conditions for reading the weights into the accelerator in parallel and computing them serially. In addition, from bit 17 onward the points mostly overlap at 100% on the Y axis (the long tail of fp32 numbers), which means most of these bits are 0. Because floating-point multipliers are designed to cover any operand, they do not distinguish such suboptimal cases; this is also the fundamental reason why floating-point multiply-accumulate and convolution operations (MACs) are difficult to accelerate.
Although fixed-point representations have been successful for efficient DNN inference, this also means that accelerators designed for fixed-point precision can only perform inference, which makes such designs hard to use in general-purpose scenarios. For example, DNN training still relies on floating-point backpropagation to keep the model tuned in floating point, while real-time requirements must still be met, especially when fixed-point precision cannot deliver the required accuracy. Ideally, an accelerator should suit most use cases and provide end users with sufficient convenience and flexibility.
Based on the above exploration, the present invention proposes a sparsity-parallel design pattern based on bit interleaving. The strength of bit-serial accelerators is their effective use of bit sparsity; however, the throughput they provide is relatively lower than that of their bit-parallel counterparts. Building on these two design philosophies, the present invention proposes the bit-interleaved design, which combines the advantages of both and avoids their drawbacks; this design pattern can significantly surpass previous bit-serial/parallel patterns. The Bitlet accelerator adopts the bit-interleaving design concept and also supports multiple precisions, including floating point and fixed point. This configurability makes Bitlet suitable for both high-performance and low-power scenarios.
To make the above features and effects of the present invention clearer and easier to understand, embodiments are given below and described in detail with reference to the accompanying drawings.
The present invention is elaborated as follows.
1. "Bit interleaving"
Without loss of generality, a floating-point operand consists of three parts, a sign bit, a mantissa and an exponent, and follows the IEEE 754 standard, the floating-point standard most commonly used in industry. With single-precision floating point (fp32), the mantissa is 23 bits wide, the exponent is 8 bits wide, and the remaining bit is the sign bit. A single-precision floating-point weight can be expressed as fp = (-1)^s · 1.m × 2^(e-127), where e equals the actual position of the binary point of the floating-point number plus 127. We use the MACs of a series of fp32 single-precision numbers to compute the partial sums of the convolution.
Equation 1 converts each W_i into its fp32 representation:
psum = Σ_{i=1..N} A_i·W_i = Σ_{i=1..N} A_i·(-1)^{s_i}·M_i·2^{E_i^W},   (Equation 1)
where E_i^W and M_i are shorthand for e_i - 127 and 1.m_i respectively. M_i includes the hidden mantissa bit 1, which according to the IEEE-754 standard is not actually stored in memory. M_i is the mantissa and has a fixed width of 24 bits in total, so by further decomposing M_i bit by bit we obtain the partial sum in bit representation:
psum = Σ_{i=1..N} A_i·(-1)^{s_i}·2^{E_i^W}·Σ_{b=0..23} M_i^b·2^{-b},   (Equation 2)
where M_i^b is the b-th bit of the binary representation of M_i. Representing A_i in the IEEE-754 binary format as well, Equation 2 can be rewritten as Equation 3:
psum = Σ_{i=1..N} (-1)^{s_i}·M_i^A·2^{E_i}·Σ_{b=0..23} M_i^b·2^{-b},   (Equation 3)
where M_i^A is the activation mantissa, E_i = E_i^W + E_i^A is the exponent sum of the pair, and s_i now denotes the combined sign of W_i and A_i. Furthermore, letting E_max = max_i E_i (Equation 4), Equation 3 can be rewritten as
psum = 2^{E_max}·Σ_{i=1..N} Σ_{b=0..23} (-1)^{s_i}·M_i^A·M_i^b·2^{E_i - E_max - b}.   (Equation 5)
From Equation 5 it can be inferred that the result of N fp32 MACs is equivalent to a series of bit-level operations on the corresponding mantissas. Specifically, whenever M_i^b = 1, the summation of the N MACs turns into a summation of N signed activation mantissas M_i^A, shifted left (right) by E_i - E_max - b bits on that basis.
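The equivalence claimed here can be checked numerically. Below is a small sketch (NumPy assumed, names illustrative) that evaluates Equation 5 literally, bit by bit over the weight mantissas with b = 0 as the hidden leading 1, and compares the result with an ordinary sum of products; because no bits are discarded in this exact-arithmetic form, the two values agree to floating-point rounding.

```python
import numpy as np

def parts(x):
    """IEEE-754-style split: x = sign * m * 2**e with 1 <= m < 2 (m = 0 for x = 0)."""
    if x == 0.0:
        return 1.0, 0.0, 0
    frac, exp = np.frexp(float(x))                 # 0.5 <= |frac| < 1
    return (-1.0 if x < 0 else 1.0), 2.0 * abs(frac), int(exp) - 1

def psum_eq5(acts, weights):
    """Literal evaluation of Equation 5 for fp32 inputs."""
    rows = [(parts(a), parts(w)) for a, w in zip(acts, weights)]
    e_i = [ea + ew for (_, _, ea), (_, _, ew) in rows]        # exponent sums E_i
    e_max = max(e_i)                                          # maximum exponent E_max
    total = 0.0
    for ((sa, ma, _), (sw, mw, _)), e in zip(rows, e_i):
        for b in range(24):                                   # 24 weight mantissa bits
            if int(mw * 2 ** b) & 1:                          # M_i^b: only basic (1) bits count
                total += sa * sw * ma * 2.0 ** (e - e_max - b)
    return total * 2.0 ** e_max

rng = np.random.default_rng(7)
a = rng.standard_normal(8).astype(np.float32)
w = rng.standard_normal(8).astype(np.float32)
print(psum_eq5(a, w), float(a.astype(np.float64) @ w.astype(np.float64)))
```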
The above analysis shows that, when sparsity is taken into account, the floating-point partial sum can be converted into bit-level operations. The product is composed mainly of the activation mantissa M_i^A, but whether it contributes to the product is decided by M_i^b in Equation 5. This bit-level sparsity can also be exploited in bit interleaving: every bit position contains a considerable proportion of zero bits, so if M_i^b = 0 while another weight W_j has a 1 bit at the same position b, that bit of W_j can take the place of the bit of W_i, interleaving the bits of different weights on the same bit row. Within the same cycle, several activation mantissas then participate in the partial-sum computation, i.e., sparsity is used to accelerate the computation.
This computation theory also covers fixed-point precision. In Equation 5, E_max and E_i - E_max are not necessary, because fixed-point representations have no exponents. The present invention will describe in detail, with examples, how bit interleaving works with fp32 weights, as well as the design details of the Bitlet accelerator supporting multiple precisions.
Figure 1(c) illustrates the bit-interleaving process of an 8-bit fixed-point MAC and demonstrates it step by step. In practice, however, floating-point MACs are not as easy to exploit as fixed-point MACs, because the binary operands contain a special part, the exponent, and different operands have different exponents. To maximize the potential of floating-point sparsity, bit interleaving consists of three separate but consecutive steps based on Equation 5.
(1) Preprocessing
Figure 3(a) uses an example as illustration: six ordinary 32-bit floating-point weights are arranged in rows, and the exponent and mantissa of each weight are arbitrary. The triangle marks indicate the actual position of the binary point. For simplicity, the actual 32-bit binary format stored in memory is not shown; a more illustrative representation of the values is used instead. For example, 0.01 in binary with E_5 = -2 represents the decimal value 0.25 (W_5). This step is similar to the first step in Fig. 1(c), except that here the 32-bit floating-point weights are organized in parallel for interleaving. These binary weights are preprocessed to obtain their respective exponents, from which the "maximum" exponent (E_6 in this example) is further determined. The mantissas are also stored for the later MAC computation. To simplify the presentation, the tail bits (bits 9-23) of each mantissa are omitted.
(2) Dynamic exponent alignment
The exponent indicates the position of the binary point in the binary representation. Traditionally, this involves the "exponent alignment" step of floating-point addition. In bit interleaving, however, the exponents of a group of floating-point numbers are aligned uniformly to the maximum value (E_6 in this example) instead of being handled one by one. This step is called "dynamic exponent alignment"; it is not involved in Fig. 1(c), because fixed-point values have no exponents.
Recalling Equation 5, the two summations can be executed in parallel during actual execution. The outer summation corresponds to the vertical dimension in Fig. 3(a), i.e. the N weights and their corresponding activation values; the inner summation corresponds to the horizontal dimension, i.e. the different bit positions of the mantissa. From this perspective, the key concept of Equation 5 is to compute all the bit-level contributions along the two dimensions of Fig. 3(a). Since the final goal is to compute psum, which involves N weights and activations, all exponents are aligned to their maximum in each execution instead of being matched one by one. As can be seen in Fig. 3(b), all six weights are aligned to the maximum exponent, that of W_6; for example, W_5 needs to be shifted right by 8 bits to align with W_6. The benefit is that the exponent alignment of all six weights is performed only once, which saves time and resources for an efficient hardware implementation.
(3) Necessary bit extraction
Now the key is how to use the necessary bits to obtain accurate partial sums and, further, better inference speed. Given the sparse parallelism mentioned above, this step exploits that property to extract the necessary bits, exactly as in the corresponding step of Fig. 1(c).
As shown in Fig. 3(c), if the necessary 1 bits are distilled efficiently, the total computation can be reduced from a MAC over 6 operands to only 3. Taking W_6 as an example again, its exponent is 6 and its first bit (b = 0) is a necessary 1 bit. Following Equation 5, the weight of this bit equals 2^6, which means the bit sits in the 7th position before the binary point. For W_1 to W_5, the bits at the 2^6 position are all zero after alignment. If the first bit of W_6 is moved up to take the place of the same vertical lane in W_1, then A_6×2^6 + A_1×2^3 can be computed at the same time. The necessary bits belonging to the other weights can be handled in the same way, and the weights finally extracted are shown in Fig. 3(c). In summary, these steps accelerate fp32 MACs in two ways: (1) the computationally expensive exponent alignment operation is avoided; (2) sparse parallelism is used to eliminate the ineffectual computation caused by 0 bits.
2. The Bitlet accelerator
To perform bit interleaving, we design a new accelerator named Bitlet. This part describes Bitlet's key hardware design modules, including the micro-architecture of the computing engine supporting multiple precisions and the overall architecture for efficient memory access.
Key module 1: the preprocessing module. The present invention first designs a component that covers two steps of the "bit interleaving" operation. Bitlet takes multiple pairs of weights and activation values as input, denoted by N in Figure 4. In the Bitlet computing engine (hereinafter BCE), W_0 to W_{N-1} are the original weights and A_0 to A_{N-1} are the corresponding activations. The preprocessing module decomposes each W_i and A_i into two parts, mantissa and exponent, and computes the exponent sum E_i for each A/W pair. Afterwards, the maximum exponent E_max is selected and stored in a register for the subsequent dynamic exponent alignment operation. Once E_max is determined, each mantissa is shifted left (right) by E_i - E_max bits so that its exponent is consistent with E_max. Taking the weights in Figure 3 as an example again, E_max is E_6 = 6 of W_6, and the other weights are all aligned to E_6; for instance, a weight whose exponent is 0 has its mantissa shifted by 6 - 0 = 6 bits, as shown in Figure 4. The positions vacated by the shift are automatically filled with 0, and since the mantissa is 24 bits long, the tail beyond b = 23 is discarded.
Key module 2: the line coordinator. After dynamic exponent alignment we obtain the 24-bit shifted mantissas. The mantissas are then fed into another module, called the "line coordinator" in Figure 4, which gathers the bits of the same bit position together and reorganizes the wiring so that the matrix is output column by column. The coordinator's outputs are the bit columns for b ranging from 0 to 23. This module contains no combinational or sequential logic; it only performs a gather and a transpose on the aligned mantissas, so intuitively it introduces no noticeable power consumption.
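As a software analogy for the gather-and-transpose (the real line coordinator is pure wiring with no logic, so this is only a way to picture its output): it turns N row-wise aligned 24-bit mantissas into 24 bit-lane words, one per bit position b, which is the column-wise view consumed by the RR-regs.

```python
def to_bit_lanes(aligned_mantissas, n_bits=24):
    """Transpose N row-wise mantissas into n_bits column lists (lane b = 0 is the MSB)."""
    lanes = []
    for b in range(n_bits):
        shift = n_bits - 1 - b
        lanes.append([(m >> shift) & 1 for m in aligned_mantissas])
    return lanes

rows = [0b101100000000000000000000,     # example aligned 24-bit mantissas of W_0..W_2
        0b100000000000000000000000,
        0b110100000000000000000000]
for b, lane in enumerate(to_bit_lanes(rows)[:4]):
    print("bit lane", b, lane)
# bit lane 0 [1, 1, 1]
# bit lane 1 [0, 0, 1]
# bit lane 2 [1, 0, 0]
# bit lane 3 [1, 0, 1]
```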
Key module 3: the circular register RR-reg. The RR-reg extracts the necessary 1 bits (basic bits) from the interleaved weights and selects the BCE output from the N activation mantissas. Each RR-reg has its own internal clock and is connected to the accelerator clock tree. As shown in Figure 4, the pseudo-code describes the concrete procedure: the RR-reg first extracts the necessary 1 bits sequentially from the input bits. The "select" signal indicates which activation path and output O_i the decoding component configuration should choose. If no necessary 1 bit is detected, the RR-reg asserts the "zero-fill" signal and O_i outputs 0. The "zero-fill" operation applies when all bits of a bit row are 0, i.e. the b = 1 or b = 2 scenarios in Fig. 3(c).
BCE has the following three major characteristics. (1) The architecture incurs no precision loss, because the dynamic exponent alignment described above is identical to floating-point arithmetic under IEEE 754. The rightmost bits after shifting are discarded in the computation, but their values are tiny, so they can be neglected and have no impact on accuracy. (2) BCE requires no preprocessing of parameter sparsity. The preprocessing module in Figure 4 is only responsible for converting the weights and activation values into their corresponding mantissas and exponents. In the actual RTL implementation, each RR-reg implements a sliding window to achieve automatic interleaving and extraction of the necessary bits. Benefiting from the favorable conditions of sparse parallelism, the extraction can be completed almost simultaneously in every RR-reg. (3) Apart from the RR-regs, BCE consists mainly of combinational circuits and involves no complex wiring that could lengthen the critical path delay. Each RR-reg produces one output O_i per clock cycle, but the total number of cycles for computing the partial sum is greatly reduced compared with traditional one-to-one MACs. N is the only design parameter of BCE; a larger N helps extract more 1 bits.
3. Accelerator architecture
PE: Bitlet is composed of mesh-connected PEs. As shown in Figure 5, each PE consists of one BCE and one adder tree; the BCE connects the on-chip cache and the adder tree. Each PE takes N weights and activations as serial input and produces the partial sums O_i as inputs to the adder tree. Since the BCE output is limited by the 24-bit mantissa, the adder tree also has 24 inputs. The PE finalizes the result by multiplying by 2^{E_max + b} (note that b is negative here) to ensure correctness; 2^{E_max + b} can be decomposed into a fixed part, b, and a common part, E_max, used in producing the BCE output. The fixed part is executed with a fixed number of shifts, and E_max only needs to be applied to the accumulator result. Computing O_i requires only fixed-point additions of the activation mantissas and no multiplication, which also means that the arithmetic complexity and power consumption are optimized accordingly.
Memory system: to achieve high throughput, the Bitlet accelerator provides separate DMA channels for activation values and weights. As shown in Figure 5, the local cache stores the data fetched from the DDR3 memory and provides sufficient bandwidth for the accesses of the corresponding Bitlet PEs. In the RTL implementation, each channel between the memory and the local cache reaches a bandwidth of 12.8 GB/s, and the PE array can use a total of 25.6 GB/s of bandwidth to obtain activation and weight data from the local cache. In terms of dataflow, Bitlet uses a weight-stationary and activation-broadcast mechanism to reduce main memory accesses.
4. Bitlet flexibility
The Bitlet accelerator supports multi-precision operation. It can easily be configured into fixed-point mode, which provides end users with sufficient flexibility. For example, to use 16-bit fixed-point precision, the parts of the preprocessing module that perform exponent alignment and shifting (in Figure 4) are gated, and the inputs W_i are connected directly to the line coordinator. Bitlet was originally designed to support a 24-bit mantissa, so if 16-bit fixed-point precision is used, only RR-reg 0 to RR-reg 15 participate; the other RR-regs can safely be turned off or left idle. Int8 quantization or any other target precision (e.g. int4, int9) is handled in a similar way. End users therefore do not need to resort to other precision-specific accelerators for different use cases, and can freely configure DNNs on this platform to meet accuracy targets and power/performance trade-offs.
The following is a system embodiment corresponding to the above method embodiment, and this embodiment can be implemented in cooperation with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiment.
The present invention further proposes a processor for implementing the above deep learning convolution acceleration method utilizing bit-level sparsity.
The processor comprises:
a preprocessing module for obtaining multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both being floating-point numbers, for summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and for selecting the largest exponent sum among all data pairs as the maximum exponent;
an exponent alignment module for arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
a weight interleaving module for removing the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, setting the remaining vacant positions of the matrix to 0 to obtain an interleaved weight matrix, and taking each row of the interleaved weight matrix as a necessary weight;
a circular register for extracting the basic bits of the necessary weight and obtaining, from the corresponding mantissa among the mantissas of all activation values, position information indicating which activation value each bit of the necessary weight corresponds to;
a split accumulator for splitting the necessary weight bit-wise into multiple weight segments, sending, according to the position information, each weight segment together with the mantissa of the corresponding activation value to an adder tree for summation, and performing shift-and-add on the processing results to obtain an output feature map as the convolution result of the multiple sets of data pairs.
In the described processor, the activation value is a pixel value of an image.
Industrial Applicability
The present invention proposes a deep learning convolution acceleration method and processor utilizing bit-level sparsity, comprising: obtaining multiple sets of data pairs to be convolved; summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and selecting the largest exponent sum among all pairs as the maximum exponent; arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix; removing the slack bits from the alignment matrix to obtain a reduced matrix, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, and setting the remaining vacant positions to 0 to obtain the interleaved weight matrix; and sending the weight segments of each row of the interleaved weight matrix together with the mantissas of the corresponding activation values to an adder tree for summation, and performing shift-and-add on the results to obtain the output feature map as the convolution result of the multiple sets of data pairs.

Claims (5)

  1. A deep learning convolution acceleration method utilizing bit-level sparsity, characterized by comprising:
    Step 1: obtaining multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both the activation value and the original weight being floating-point numbers;
    Step 2: summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each data pair, and selecting the largest exponent sum among all data pairs as the maximum exponent;
    Step 3: arranging the mantissas of the original weights in computation order to form a weight matrix, and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
    Step 4: removing the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, setting the remaining vacant positions of the matrix to 0 to obtain an interleaved weight matrix, and taking each row of the interleaved weight matrix as a necessary weight;
    Step 5: obtaining, from the correspondence between the activation values and the basic bits of the original weights, position information indicating which activation value each bit of the necessary weight corresponds to; sending the necessary weight to a split accumulator, which splits the necessary weight bit-wise into multiple weight segments; sending, according to the position information, each weight segment together with the mantissa of the corresponding activation value to an adder tree for summation; and performing shift-and-add on the processing results to obtain an output feature map as the convolution result of the multiple sets of data pairs.
  2. The deep learning convolution acceleration method utilizing bit-level sparsity according to claim 1, characterized in that the activation value is a pixel value of an image.
  3. A processor, characterized by being configured to implement the deep learning convolution acceleration method utilizing bit-level sparsity according to claim 1.
  4. The processor according to claim 3, characterized in that the processor comprises:
    a preprocessing module for obtaining multiple sets of data pairs to be convolved, each data pair consisting of an activation value and its corresponding original weight, both being floating-point numbers, for summing the exponents of the activation value and the original weight in each data pair to obtain the exponent sum of each pair, and for selecting the largest exponent sum among all data pairs as the maximum exponent;
    an exponent alignment module for arranging the mantissas of the original weights in computation order to form a weight matrix and aligning each row of the weight matrix uniformly to the maximum exponent to obtain an alignment matrix;
    a weight interleaving module for removing the slack bits from the alignment matrix to obtain a reduced matrix containing vacancies, letting the basic bits in each column of the reduced matrix fill the vacancies in computation order to form an intermediate matrix, removing the empty rows of the intermediate matrix, setting the remaining vacant positions of the matrix to 0 to obtain an interleaved weight matrix, and taking each row of the interleaved weight matrix as a necessary weight;
    a circular register for extracting the basic bits of the necessary weight and obtaining, from the corresponding mantissa among the mantissas of all activation values, position information indicating which activation value each bit of the necessary weight corresponds to;
    a split accumulator for splitting the necessary weight bit-wise into multiple weight segments, sending, according to the position information, each weight segment together with the mantissa of the corresponding activation value to an adder tree for summation, and performing shift-and-add on the processing results to obtain an output feature map as the convolution result of the multiple sets of data pairs.
  5. The processor according to claim 4, characterized in that the activation value is a pixel value of an image.
PCT/CN2022/077275 2021-10-27 2022-02-22 利用比特级稀疏性的深度学习卷积加速方法及处理器 WO2023070997A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111254887.3 2021-10-27
CN202111254887.3A CN114021710A (zh) 2021-10-27 2021-10-27 利用比特级稀疏性的深度学习卷积加速方法及处理器

Publications (1)

Publication Number Publication Date
WO2023070997A1 true WO2023070997A1 (zh) 2023-05-04

Family

ID=80058127

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077275 WO2023070997A1 (zh) 2021-10-27 2022-02-22 利用比特级稀疏性的深度学习卷积加速方法及处理器

Country Status (2)

Country Link
CN (1) CN114021710A (zh)
WO (1) WO2023070997A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021710A (zh) * 2021-10-27 2022-02-08 中国科学院计算技术研究所 利用比特级稀疏性的深度学习卷积加速方法及处理器
CN116127255B (zh) * 2022-12-14 2023-10-03 北京登临科技有限公司 卷积运算电路、及具有该卷积运算电路的相关电路或设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543140A (zh) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 一种卷积神经网络加速器
CN114021710A (zh) * 2021-10-27 2022-02-08 中国科学院计算技术研究所 利用比特级稀疏性的深度学习卷积加速方法及处理器

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543140A (zh) * 2018-09-20 2019-03-29 中国科学院计算技术研究所 一种卷积神经网络加速器
CN114021710A (zh) * 2021-10-27 2022-02-08 中国科学院计算技术研究所 利用比特级稀疏性的深度学习卷积加速方法及处理器

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LU HANG; WEI XIN; LIN NING; YAN GUIHAI; LI XIAOWEI: "Tetris: Re-architecting Convolutional Neural Network Computation for Machine Learning Accelerators", 2018 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN (ICCAD), ACM, 5 November 2018 (2018-11-05), pages 1 - 8, XP033487856, DOI: 10.1145/3240765.3240855 *
SHENG LI; JONGSOO PARK; PING TAK PETER TANG: "Enabling Sparse Winograd Convolution by Native Pruning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 October 2017 (2017-10-13), 201 Olin Library Cornell University Ithaca, NY 14853 , XP080971129 *

Also Published As

Publication number Publication date
CN114021710A (zh) 2022-02-08

Similar Documents

Publication Publication Date Title
CN110050256B (zh) 用于神经网络实现的块浮点
WO2023070997A1 (zh) 利用比特级稀疏性的深度学习卷积加速方法及处理器
CN110447010B (zh) 在硬件中执行矩阵乘法
CN109146067B (zh) 一种基于FPGA的Policy卷积神经网络加速器
CN107340993B (zh) 运算装置和方法
CN103176767B (zh) 一种低功耗高吞吐的浮点数乘累加单元的实现方法
CN102629189B (zh) 基于fpga的流水浮点乘累加方法
US20130282778A1 (en) System and Method for Signal Processing in Digital Signal Processors
CN110383300A (zh) 一种计算装置及方法
CN106970775A (zh) 一种可重构定浮点通用加法器
US8930433B2 (en) Systems and methods for a floating-point multiplication and accumulation unit using a partial-product multiplier in digital signal processors
CN110688086A (zh) 一种可重构的整型-浮点加法器
CN113283587A (zh) 一种Winograd卷积运算加速方法及加速模块
Shi et al. Design of parallel acceleration method of convolutional neural network based on fpga
CN103279323A (zh) 一种加法器
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
CN107092462B (zh) 一种基于fpga的64位异步乘法器
Kang et al. Design of convolution operation accelerator based on FPGA
Zhang et al. A block-floating-point arithmetic based FPGA accelerator for convolutional neural networks
Li et al. Bit-serial systolic accelerator design for convolution operations in convolutional neural networks
WO2020008643A1 (ja) データ処理装置、データ処理回路およびデータ処理方法
Hsiao et al. Design of a low-cost floating-point programmable vertex processor for mobile graphics applications based on hybrid number system
Guardia Implementation of a fully pipelined BCD multiplier in FPGA
Liang et al. An innovative Booth algorithm
CN111931441B (zh) Fpga快速进位链时序模型的建立方法、装置以及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22884914

Country of ref document: EP

Kind code of ref document: A1