WO2022133686A1 - Signed/unsigned multiply-accumulate device and method - Google Patents

Signed/unsigned multiply-accumulate device and method

Info

Publication number
WO2022133686A1
WO2022133686A1 PCT/CN2020/138119 CN2020138119W WO2022133686A1 WO 2022133686 A1 WO2022133686 A1 WO 2022133686A1 CN 2020138119 W CN2020138119 W CN 2020138119W WO 2022133686 A1 WO2022133686 A1 WO 2022133686A1
Authority
WO
WIPO (PCT)
Prior art keywords
multiply
accumulate
unsigned
signed
calculation
Prior art date
Application number
PCT/CN2020/138119
Other languages
English (en)
French (fr)
Inventor
尹首一
谷江源
孙庆斌
张淞
刘雷波
魏少军
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to PCT/CN2020/138119
Publication of WO2022133686A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation

Definitions

  • The present invention relates to the field of processor design, and in particular to a signed/unsigned multiply-accumulate device and method.
  • Coarse-grained reconfigurable processor architectures are attracting growing attention because of their low power consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability.
  • A coarse-grained reconfigurable computing architecture is a high-performance computing architecture that combines the flexibility of general-purpose processors with the performance of application-specific integrated circuits. It is well suited to data-intensive and compute-intensive applications with a high degree of parallelism, in fields such as artificial intelligence, digital signal processing, video and image processing, scientific computing, and communication encryption.
  • MUL: Multiplication
  • MAC: Multiplication-and-Addition Operation (multiply-accumulate)
  • Current reconfigurable processing architectures perform multiplication and addition at a single fixed bit width and as separate operations. They therefore often cannot adjust bit-width precision flexibly to suit specific application requirements.
  • A MAC operation often requires two or more operation cycles: the first cycle multiplies the multiplier and the multiplicand, and the second cycle adds the result of the previous cycle to the summand through the accumulator. This greatly limits the ability of a reconfigurable processor to handle such tasks flexibly and efficiently. The industry therefore urgently needs a new method and apparatus to improve efficiency.
  • The purpose of the present invention is to provide a signed/unsigned multiply-accumulate device and method that can be used effectively in a coarse-grained reconfigurable processor architecture. Through flexible dynamic configuration of the operand bit width, and while making full use of the available computing resources, it processes multiple groups of multiplication/multiply-accumulate operations of different bit widths in parallel, so that computing throughput, computing performance and energy efficiency are almost doubled.
  • At the same time, a single multiply-accumulator circuit effectively and flexibly supports both signed and unsigned multiplication/multiply-accumulate operations, which preserves and realizes the dynamic reconfigurability of the reconfigurable processor at very low power and area overhead.
  • To achieve the above purpose, the present invention provides a signed/unsigned multiply-accumulate device suitable for a coarse-grained reconfigurable processor architecture. The device includes a splitting module, an operation module, a processing module and an output module.
  • The splitting module obtains the configuration control signal and, according to that signal, splits the input binary multiplicand, multiplier and addend that are larger than a preset bit width, following preset splitting rules, into multiple groups of binary numbers smaller than the preset bit width.
  • The operation module, according to the dynamic configuration file in the configuration control signal, groups the binary numbers smaller than the preset bit width correspondingly across multiple MAC operation units, and performs multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results.
  • The processing module performs shift and significant-bit extension processing on the calculation results according to preset adjustment rules to obtain multiple processing results larger than the preset bit width.
  • The output module accumulates the processing results to obtain the operation result.
  • The present invention also provides a signed/unsigned multiply-accumulate method suitable for a coarse-grained reconfigurable processor architecture.
  • The method includes: obtaining a configuration control signal and, according to that signal, splitting the input binary multiplicand, multiplier and addend that are larger than a preset bit width, following preset splitting rules, into multiple groups of binary numbers smaller than the preset bit width; according to the dynamic configuration file in the configuration control signal, grouping the binary numbers smaller than the preset bit width correspondingly across multiple MAC operation units and performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results; performing shift and significant-bit extension processing on the calculation results according to preset adjustment rules to obtain multiple processing results larger than the preset bit width; and accumulating the processing results to obtain the operation result.
  • The beneficial technical effects of the present invention are as follows: it supports signed/unsigned multiplication and multiply-accumulate operations and unifies signed and unsigned operations in one operation circuit, which not only saves area and power overhead but also, through configuration and reconfiguration, meets the requirements of various applications at the same time, giving good reconfigurability and wider applicability.
  • FIG. 1 is a schematic structural diagram of a signed/unsigned multiply-accumulate device provided by an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an application structure of a signed/unsigned multiply-accumulate device provided by an embodiment of the present invention
  • FIG. 3 is a schematic diagram of an operation principle of a signed/unsigned multiply-accumulate device provided by an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a multiply-accumulate operation principle of an arbitrary-precision configurable MAC operation provided by an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of a signed/unsigned multiply-accumulate method provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a MAC multiply-accumulate operation of a complete set of high-bit-width unsigned numbers provided by an embodiment of the present invention
  • FIG. 7 is a schematic diagram of an application of a complete set of MAC multiply-accumulate operations of high-bit-width unsigned numbers provided by an embodiment of the present invention
  • FIG. 8 is a schematic diagram of two groups of parallel low-bit-width unsigned MAC multiply-accumulate operations provided by an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of the application of two groups of parallel low-bit-width unsigned MAC multiply-accumulate calculations provided by an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a MAC multiply-accumulate operation of a complete set of high-bit-width signed numbers provided by an embodiment of the present invention
  • FIG. 11 is a schematic diagram of the application of a complete set of MAC multiply-accumulate operations of high-bit-width unsigned numbers provided by an embodiment of the present invention
  • FIG. 12 is a schematic diagram of MAC multiply-accumulate operations of two groups of parallel low-bit-width signed numbers according to an embodiment of the present invention
  • FIG. 13 is a schematic diagram of an application of two parallel MAC multiply-accumulate operations of signed numbers with low bit width according to an embodiment of the present invention.
  • The steps shown in the flowcharts of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one shown herein.
  • The signed/unsigned multiply-accumulate device provided by the present invention is suitable for a coarse-grained reconfigurable processor architecture and includes a splitting module, an operation module, a processing module and an output module.
  • The splitting module obtains the configuration control signal and, according to that signal, splits the input binary multiplicand, multiplier and addend that are larger than the preset bit width, following the preset splitting rules, into multiple groups of binary numbers smaller than the preset bit width.
  • The operation module, according to the dynamic configuration file in the configuration control signal, groups these binary numbers correspondingly across a plurality of MAC operation units and performs multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results.
  • The processing module performs shift and significant-bit extension processing on the calculation results according to the preset adjustment rules to obtain multiple processing results larger than the preset bit width.
  • The output module accumulates the processing results to obtain the operation result.
  • The operation module includes a plurality of MAC operation units. Each MAC operation unit parses the function identifier and the operation type identifier in the dynamic configuration file; determines from these two identifiers the operation mode it applies to the binary numbers it receives; and, according to that operation mode, performs the corresponding multiply-accumulate calculation on the received binary numbers to obtain the corresponding calculation result.
  • The splitting module mainly splits the input high-bit-width binary multiplicand A, multiplier B and addend C into several groups of low-bit-width binary numbers according to the configuration control signal Config. Here it is assumed that each operand is split into 2 groups of data of arbitrary low bit width.
  • Input signals: the 3-bit configuration control signal Config, the M-bit multiplicand A, the N-bit multiplier B and the L-bit addend C.
  • Output signals after splitting: m-bit A_L and (M-m)-bit A_H; n-bit B_L and (N-n)-bit B_H; l-bit C_L and (L-l)-bit C_H.
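
The splitting step can be sketched in a few lines. This is a minimal illustration, assuming the split simply slices the operand's bit pattern at a configurable position; the function name `split_operand` and its parameters are hypothetical and do not come from the patent.

```python
def split_operand(value: int, width: int, low_bits: int) -> tuple[int, int]:
    """Split a `width`-bit pattern into (high, low) fields; the low field has `low_bits` bits."""
    if not 0 < low_bits < width:
        raise ValueError("low_bits must be between 1 and width - 1")
    pattern = value & ((1 << width) - 1)       # raw bit pattern of the operand
    low = pattern & ((1 << low_bits) - 1)      # e.g. A_L, m bits
    high = pattern >> low_bits                 # e.g. A_H, M - m bits
    return high, low


# An 8-bit multiplicand split at m = 4.
A_H, A_L = split_operand(0xB7, width=8, low_bits=4)
```

Whether a field is later treated as signed or unsigned is decided downstream by the MAC operation units, so the sketch returns raw bit fields only.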
  • The operation module mainly performs multiplication and multiply-accumulate on the split multiplicands, multipliers and addends, covering signed-by-signed, unsigned-by-unsigned, and mixed signed-by-unsigned operations.
  • According to the configuration information Config1 to Config4, it determines whether to perform one group of high-bit-width signed/unsigned multiplication/multiply-accumulate operations or two groups of low-bit-width signed/unsigned multiplication/multiply-accumulate operations in parallel; and according to whether the summand A_in is zero, it determines whether a multiplication or a multiply-accumulate operation is performed.
  • Input signals: m-bit A_L and (M-m)-bit A_H; n-bit B_L and (N-n)-bit B_H; l-bit C_L and (L-l)-bit C_H.
  • The processing module mainly applies the appropriate shift and most-significant-bit (MSB) extension to the results of the low-bit-width multiplication/multiply-accumulate operations.
  • The result P1 generated by the first MAC is not shifted.
  • The result P2 generated by the second MAC is shifted left by m bits.
  • The result P3 generated by the third MAC is shifted left by n bits.
  • The result P4 generated by the fourth MAC is shifted left by m+n bits.
  • After shifting, every result is MSB-extended to M+N bits.
  • Input signals: the multiply-accumulate results P1, P2, P3 and P4. Output signals: the shifted and MSB-extended results P1_ext, P2_ext, P3_ext and P4_ext.
  • The output module mainly accumulates the results calculated by the previous modules to obtain the final result.
  • Input signals: the shifted and extended results P1_ext, P2_ext, P3_ext and P4_ext.
  • Output signal: Product, the final result of the multi-bit-width multiply-accumulate operation.
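
The processing and output stages can be illustrated as follows. This is a sketch under stated assumptions rather than the patented circuit: the helper names `shift_and_extend` and `accumulate` are hypothetical, results are modeled as Python integers, and MSB extension is modeled by reducing a possibly negative value to an (M+N)-bit two's-complement pattern.

```python
def shift_and_extend(p: int, shift: int, out_bits: int) -> int:
    """Shift a (possibly negative) MAC result left and sign-extend it to out_bits."""
    return (p << shift) & ((1 << out_bits) - 1)


def accumulate(parts: list[int], out_bits: int) -> int:
    """Sum the extended results modulo 2**out_bits, as a fixed-width adder would."""
    return sum(parts) & ((1 << out_bits) - 1)


# Example: A = 0xB7 and B = 0x5C, both split at m = n = 4, no addend.
M = N = 8
m = n = 4
A_H, A_L = 0xB, 0x7
B_H, B_L = 0x5, 0xC
P1, P2, P3, P4 = A_L * B_L, A_H * B_L, A_L * B_H, A_H * B_H
shifts = [0, m, n, m + n]                      # P1, P2, P3, P4 respectively
extended = [shift_and_extend(p, s, M + N) for p, s in zip([P1, P2, P3, P4], shifts)]
product = accumulate(extended, M + N)          # equals 0xB7 * 0x5C
```

Masking after the shift is what makes a negative partial result appear with its sign bit replicated up to bit M+N-1, which is the effect the MSB extension has in hardware.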
  • The control signal Config is shown in Table 1; it is a 3-bit-wide configuration signal.
  • For dynamic configuration, the present invention further generates internal configuration control signals Config1 to Config4 from the control signal Config; each of them is a 2-bit-wide configuration signal.
  • The MAC operation unit further: identifies whether each binary number smaller than the preset bit width is signed or unsigned; applies the corresponding sign-bit extension or unsigned extension to the binary number; and, after performing partial-product and addend shift processing on the extended binary numbers, obtains the calculation result by multiply-accumulate.
  • Signed and unsigned multiply-accumulate are thus unified in one hardware architecture, and the summand of the multiply-accumulate is processed in the same way as a partial product of the multiplication.
  • Each MAC operation unit first determines whether each input number is signed or unsigned and applies sign-bit extension or unsigned extension accordingly. It then uses a radix-4 Booth algorithm modified for multiply-accumulate: the partial products and the addend are shifted and then accumulated. Because the addend is folded into the multiplication, the multiply-accumulate operation completes together with the multiplication. For details, please refer to Figure 3.
  • In Figure 3, S is the most significant bit of each partial product and of the summand after Booth encoding; N indicates whether a partial product is encoded as a negative number in Booth encoding, that is, whether the invert-and-add-1 operation is performed; M is the data bit width of the arbitrary-bit-width signed multiply-accumulate operation.
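
The idea of treating signed and unsigned inputs uniformly and feeding the addend in as one more row of the partial-product array can be sketched with a textbook radix-4 Booth recoding. This is not the patent's modified algorithm, whose details are given only in Figure 3; the function names are hypothetical, and the two-bit extension is modeled by computing the extended operand's value.

```python
def extend(pattern: int, bits: int, signed: bool) -> int:
    """Value of a bits-wide pattern after sign extension (signed) or zero extension (unsigned)."""
    pattern &= (1 << bits) - 1
    if signed and pattern >> (bits - 1):
        return pattern - (1 << bits)
    return pattern


def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Radix-4 Booth digits (each in -2..2, least significant first) of y.

    y must fit in `bits` bits as a two's-complement number, and bits must be even.
    """
    pattern = y & ((1 << bits) - 1)
    digits, prev = [], 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)      # digit = y[2i-1] + y[2i] - 2*y[2i+1]
        prev = b1
    return digits


def booth_mac(a_pat, b_pat, c_pat, bits, a_signed, b_signed, c_signed):
    """Compute a*b + c with the addend treated as an extra partial-product row."""
    a = extend(a_pat, bits, a_signed)
    b = extend(b_pat, bits, b_signed)
    acc = extend(c_pat, bits, c_signed)        # addend row
    ext_bits = bits + 2 + (bits % 2)           # two extension bits, rounded up to even
    for i, d in enumerate(booth_radix4_digits(b, ext_bits)):
        acc += (d * a) << (2 * i)              # one shifted partial product per digit
    return acc
```

With the extra bits in place, an unsigned operand always has a zero top bit, so the same signed recoding handles signed, unsigned and mixed cases without a separate datapath.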
  • The operation module further: obtains the operation category of each MAC operation unit according to the calling requirement of the application, and obtains the summand value of each MAC operation unit according to that operation category; the MAC operation unit then performs partial-product and addend shift processing on the extended binary numbers according to the summand value.
  • The processing module may further, according to whether the calculation result is signed or unsigned and according to the operation category, perform shift and significant-bit extension processing on the calculation result following the preset adjustment rules, to obtain multiple processing results larger than the preset bit width.
  • Specifically, the operation principle of the MAC operation unit of the present invention is shown in Figure 4. The multiplicand A, the multiplier B and the summand C are each divided into two groups of low-bit-width data: A_H and A_L, B_H and B_L, and C_H and C_L. A_H, A_L, B_H, B_L, C_H and C_L are then combined correspondingly to obtain the four low-bit-width MAC operation parts shown in Figure 4, marked 1, 2, 3 and 4.
  • Parts 1 and 3 are ordinary multiply-accumulate operations whose summands are C_L and 0 respectively. Parts 2 and 4 are multiply-accumulate operations whose summand may be C_H or 0: when one group of high-bit-width multiply-accumulate operations is performed, the summands of 2 and 4 are C_H and 0 respectively; when two groups of low-bit-width multiply-accumulate operations are performed in parallel, the summands of 2 and 4 are 0 and C_H respectively. When one group of high-bit-width multiply-accumulate operations is performed, the results of all four MAC parts are subsequently shifted and MSB-extended.
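
The summand assignment described above amounts to a small lookup. The sketch below is only an illustration: the mode names `"high"` and `"parallel"` are hypothetical labels, since the encoding of Config1 to Config4 is not given in this text.

```python
def select_summands(mode: str, c_h: int, c_l: int) -> dict[int, int]:
    """Return the summand fed to MAC parts 1..4 for a given operation category."""
    if mode == "high":        # one group of high-bit-width multiply-accumulate
        return {1: c_l, 2: c_h, 3: 0, 4: 0}
    if mode == "parallel":    # two groups of low-bit-width multiply-accumulate
        return {1: c_l, 2: 0, 3: 0, 4: c_h}
    raise ValueError(f"unknown mode: {mode!r}")


summands = select_summands("high", c_h=0x9, c_l=0x3)
```

In both categories parts 1 and 3 keep the same summand, so only the routing of C_H changes between part 2 and part 4.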
  • The precision-adjustable signed/unsigned multiply-accumulate device designed by the present invention can flexibly select its final output according to the precision and computing-performance requirements of different applications: either the result of one group of high-bit-width signed/unsigned multiplication/multiply-accumulate operations, or the results of several groups of low-bit-width signed/unsigned multiplication/multiply-accumulate operations computed in parallel.
  • Moreover, unifying signed and unsigned operations in one operation circuit not only saves area and power overhead but also, through configuration and reconfiguration, meets the needs of various applications at the same time, giving good reconfigurability and wider applicability.
  • the present invention also provides a signed/unsigned multiply-accumulate method, which is suitable for a coarse-grained reconfigurable processor architecture.
  • the method includes:
  • S501: obtain a configuration control signal and, according to that signal, split the input binary multiplicand, multiplier and addend that are larger than a preset bit width, following a preset splitting rule, into multiple groups of binary numbers smaller than the preset bit width;
  • S502: according to the dynamic configuration file in the configuration control signal, group the binary numbers smaller than the preset bit width correspondingly across multiple MAC operation units, and perform multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results;
  • S503: perform shift and significant-bit extension processing on the calculation results according to a preset adjustment rule to obtain multiple processing results larger than the preset bit width;
  • S504: accumulate the processing results to obtain the operation result.
  • Performing the multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results includes: parsing the function identifier and the operation type identifier in the dynamic configuration file; determining from these two identifiers the operation mode each MAC operation unit applies to the binary numbers it receives; and, according to that operation mode, performing the corresponding multiply-accumulate calculation on the received binary numbers to obtain the corresponding calculation result.
  • Performing the corresponding multiply-accumulate calculation on the received binary numbers according to the operation mode includes: identifying whether each binary number smaller than the preset bit width is signed or unsigned; applying the corresponding sign-bit extension or unsigned extension to the binary number; and, after performing partial-product and addend shift processing on the extended binary numbers, obtaining the calculation result by multiply-accumulate.
  • Performing shift and significant-bit extension processing on the calculation results according to the preset adjustment rule to obtain multiple processing results larger than the preset bit width further includes: obtaining the operation category of each MAC operation unit according to the calling requirement of the application, and obtaining the summand value of each MAC operation unit according to that operation category; the MAC operation unit performs partial-product and addend shift processing on the extended binary numbers according to the summand value.
  • the operation category includes: high-bit-width signed/unsigned MAC operation and parallel low-bit-width signed/unsigned MAC operation.
  • For one group of high-bit-width unsigned MAC operations, refer to FIG. 6. The signed/unsigned multiply-accumulate method provided by the present invention unifies unsigned numbers into signed numbers for calculation, so the four groups of unsigned MAC operations require extension of the top bits: following the logic of the MAC operation unit described above, the calculation is performed after a two-bit unsigned extension. Since the MAC partial products 3 and 4 in FIG. 6 have no addend, their summand is treated as 0 in the multiply-accumulator.
  • The specific processing method is as follows:
  • Step 1: For 1, A_L × B_L + C_L is calculated, and the result is not shifted. A_L, B_L and C_L are all unsigned-extended.
  • Step 2: For 2, A_H × B_L + C_H is calculated, and the result is logically shifted left by m bits (the bit width of A_L). A_H, B_L and C_H are all unsigned-extended.
  • Step 3: For 3, A_L × B_H + 0 is calculated, and the result is logically shifted left by n bits (the bit width of B_L). A_L and B_H are both unsigned-extended.
  • Step 4: For 4, A_H × B_H + 0 is calculated, and the result is logically shifted left by m+n bits (the combined bit width of A_L and B_L). A_H and B_H are both unsigned-extended.
  • Step 5: The MAC operations 1, 2, 3 and 4 yield four calculation results, which are accumulated to obtain the final unsigned high-bit-width result.
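
The five steps above can be checked numerically. The sketch below is a minimal model, not the hardware: the function name is hypothetical, and it assumes the addend C is split at the same position m as the multiplicand (l = m), which is what makes adding C_H inside part 2 and then shifting by m reproduce the full addend.

```python
def mac_high_unsigned(a: int, b: int, c: int, m: int, n: int) -> int:
    """One high-bit-width unsigned MAC built from four low-bit-width MAC parts."""
    def low(v: int, k: int) -> int:
        return v & ((1 << k) - 1)

    a_h, a_l = a >> m, low(a, m)
    b_h, b_l = b >> n, low(b, n)
    c_h, c_l = c >> m, low(c, m)               # assumption: addend split at l = m
    p1 = a_l * b_l + c_l                       # part 1, not shifted
    p2 = (a_h * b_l + c_h) << m                # part 2, shifted by m
    p3 = (a_l * b_h + 0) << n                  # part 3, shifted by n
    p4 = (a_h * b_h + 0) << (m + n)            # part 4, shifted by m + n
    return p1 + p2 + p3 + p4


result = mac_high_unsigned(0xB7, 0x5C, 0x93, m=4, n=4)   # equals 0xB7 * 0x5C + 0x93
```

If the addend were split at a position other than m, part 2's shift would no longer align C_H with its original weight, so a different routing of the addend would be needed.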
  • FIG. 7 is an example in which the present invention performs only one group of high-bit-width unsigned MAC operations.
  • Calculation description: the high-bit-width data is split into 2 groups of low-bit-width data, the low-bit-width data is unsigned-extended and then multiplied or multiply-accumulated, and finally the obtained P1, P2, P3 and P4 are added to give the final high-bit-width result. This embodiment shows that the multiply-accumulate method provided by the present invention can accurately perform precision-adjustable multiplication or multiply-accumulate on unsigned numbers.
  • For two groups of parallel low-bit-width unsigned MAC operations, refer to FIG. 8.
  • The MAC operation designed in the present invention unifies unsigned numbers into signed numbers for calculation, so the unsigned MAC operations again require extension of the top bits: following the logic of the MAC operation unit described above, the calculation is performed after a two-bit unsigned extension. To obtain two groups of low-bit-width multiply-accumulate results at the same time, the specific process is as follows:
  • Step 1: For 1, A_L × B_L + C_L is calculated, and the result is not shifted. A_L, B_L and C_L are all unsigned-extended.
  • Step 2: Parts 2 and 3 are not enabled; no calculation is performed and all of their input data signals are set to 0.
  • Step 3: For 4, A_H × B_H + C_H is calculated, and the result is logically shifted left by m+n bits (the combined bit width of A_L and B_L). A_H, B_H and C_H are all unsigned-extended.
  • Step 4: The two results finally output are P1 and P4, which are the results of the two groups of low-bit-width unsigned MAC multiply-accumulate operations.
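
The parallel mode can be sketched as two independent low-bit-width MACs that share one set of input words. The function name and the sample values are hypothetical; the sketch returns the two results separately and leaves out the m+n shift, which only positions P4 in the output.

```python
def mac_parallel_unsigned(a: int, b: int, c: int, m: int, n: int, l: int) -> tuple[int, int]:
    """Two parallel unsigned MACs: (A_L*B_L + C_L, A_H*B_H + C_H)."""
    def low(v: int, k: int) -> int:
        return v & ((1 << k) - 1)

    p1 = low(a, m) * low(b, n) + low(c, l)     # part 1 uses the low fields
    p4 = (a >> m) * (b >> n) + (c >> l)        # part 4 uses the high fields
    return p1, p4                              # parts 2 and 3 are disabled


P1, P4 = mac_parallel_unsigned(0xB7, 0x5C, 0x93, m=4, n=4, l=4)
```

Because parts 2 and 3 contribute nothing, the cross terms A_H × B_L and A_L × B_H never mix the two results.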
  • For one group of high-bit-width signed MAC operations, refer to FIG. 10. After splitting, A_H, B_H and C_H are signed numbers, while A_L, B_L and C_L are unsigned numbers.
  • The four low-bit-width MAC parts therefore correspond to four different kinds of MAC operation:
  • 1 is an unsigned operation;
  • 2 is a signed number multiplied by an unsigned number;
  • 3 is an unsigned number multiplied by a signed number;
  • 4 is a signed operation. Each operation is therefore discussed separately and calculated as follows:
  • Step 1: As shown in 1 in Figure 10, A_L × B_L + C_L is an unsigned calculation, so A_L, B_L and C_L are first unsigned-extended, that is, extended by two 0 bits. Then the first 3 sign bits of the raw calculation result are discarded to save area overhead, and m+n+1 bits are kept as the actual calculation result P1. Finally, the bits are filled in: the MSB (Most Significant Bit) of P1 is extended until the result is M+N bits wide.
  • Step 2: As shown in 2 in Figure 10, in A_H × B_L + C_H the operands A_H and C_H are signed and B_L is unsigned, so the calculation cannot be performed directly. The unsigned number B_L is unsigned-extended, that is, extended by two 0 bits; the signed numbers are sign-extended, that is, extended by two sign bits E. Then the first 3 sign bits of the raw calculation result are discarded to save area overhead, and M-m+n+1 bits are kept as the actual calculation result P2. Next, P2 is shifted left by m bits. Finally, the MSB of P2 is extended until the result is M+N bits wide.
  • Step 3: As shown in 3 in Figure 10, in A_L × B_H + 0 the operand A_L is unsigned and B_H is signed, so the calculation cannot be performed directly. The unsigned number A_L is unsigned-extended, that is, extended by two 0 bits; the signed number is sign-extended, that is, extended by two sign bits E. Then the first 3 sign bits of the raw calculation result are discarded to save area overhead, and N-n+m+1 bits are kept as the actual calculation result P3. Next, P3 is shifted left by n bits. Finally, the MSB of P3 is extended until the result is M+N bits wide.
  • Step 4: As shown in 4 in Figure 10, in A_H × B_H + 0 both A_H and B_H are signed, so a signed calculation is performed: both are sign-extended, that is, extended by two sign bits E. Unlike Steps 1, 2 and 3, the first 4 bits of the raw calculation result are then discarded to save area overhead, and (N-n)+(M-m) bits are kept as the actual calculation result P4. Finally, P4 is shifted left by m+n bits. As shown in FIG. 10, no MSB extension is needed after this shift.
  • Step 5: Finally, 1, 2, 3 and 4 each produce one result, and these four results are accumulated to obtain the final signed result.
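
The signed decomposition works because a two's-complement number equals its signed high field times 2^m plus its unsigned low field. The sketch below checks that identity end to end. It is a value-level model with a hypothetical function name: it assumes the addend is split at l = m and ignores the bit-discarding and extension widths described in the steps.

```python
def mac_high_signed(a: int, b: int, c: int, m: int, n: int) -> int:
    """One high-bit-width signed MAC built from four mixed-sign low-bit-width parts."""
    def low(v: int, k: int) -> int:
        return v & ((1 << k) - 1)              # unsigned low field, also for negative v

    # On Python ints, >> is an arithmetic shift, so the high fields keep the sign.
    a_h, a_l = a >> m, low(a, m)
    b_h, b_l = b >> n, low(b, n)
    c_h, c_l = c >> m, low(c, m)               # assumption: addend split at l = m
    p1 = a_l * b_l + c_l                       # unsigned * unsigned
    p2 = (a_h * b_l + c_h) << m                # signed * unsigned
    p3 = (a_l * b_h + 0) << n                  # unsigned * signed
    p4 = (a_h * b_h + 0) << (m + n)            # signed * signed
    return p1 + p2 + p3 + p4


result = mac_high_signed(-73, 92, -5, m=4, n=4)   # equals -73 * 92 + (-5)
```

Only the high fields ever carry a sign, which is why parts 1 to 3 need unsigned extension on at least one operand while part 4 is purely signed.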
  • The MAC operation unit provided by the present invention unifies three different operations in one set of operation circuits: signed multiplication/multiply-accumulate, unsigned multiplication/multiply-accumulate, and mixed signed/unsigned multiplication/multiply-accumulate. It thereby computes multiplication or multiply-accumulate accurately for every combination of signs. Finally, the obtained P1, P2, P3 and P4 are shifted and then added to obtain one high-bit-width result.
  • This embodiment shows that the multiply-accumulate method provided by the present invention can accurately perform precision-adjustable multiplication or multiply-accumulate on signed numbers.
  • For two groups of parallel low-bit-width signed MAC operations, A_H, B_H, C_H and A_L, B_L, C_L all need to be sign-extended, where E denotes the sign bit by which the data is extended.
  • Step 1: For 1, A_L × B_L + C_L is calculated, and the result is not shifted. A_L, B_L and C_L are all sign-extended.
  • Step 2: Parts 2 and 3 are not enabled; the corresponding calculations are not performed and all of their input data signals are set to 0.
  • Step 4: The results of the two groups of MAC operations output in parallel are P1 and P4, which are the results of the two groups of low-bit-width signed MAC multiply-accumulate operations.
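
In the signed parallel mode every field, low as well as high, is read as its own two's-complement number. The sketch below illustrates that reading. It assumes, by analogy with the unsigned parallel case, that part 4 computes A_H × B_H + C_H, and that all three words are split at the same position; the names `to_signed` and `mac_parallel_signed` are hypothetical.

```python
def to_signed(pattern: int, bits: int) -> int:
    """Interpret the low `bits` bits of pattern as a two's-complement number."""
    pattern &= (1 << bits) - 1
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


def mac_parallel_signed(a: int, b: int, c: int, width: int, split: int) -> tuple[int, int]:
    """Two parallel signed MACs on `width`-bit words that each pack two signed fields."""
    def lo(v: int) -> int:
        return to_signed(v, split)

    def hi(v: int) -> int:
        return to_signed(v >> split, width - split)

    p1 = lo(a) * lo(b) + lo(c)                 # part 1: low fields
    p4 = hi(a) * hi(b) + hi(c)                 # part 4: high fields (assumed by analogy)
    return p1, p4


P1, P4 = mac_parallel_signed(0xB7, 0x5C, 0x93, width=8, split=4)
```

For the sample words, the fields decode to (7, -4, 3) and (-5, 5, -7), so the same bit patterns give different results than in the unsigned parallel case.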
  • The technique of the present invention divides the high-bit-width binary numbers required by a multiplication/multiply-accumulate operation into several groups of low-bit-width binary numbers, passes them through several low-bit-width multiply-accumulators (the MAC operation units), and combines the results appropriately, thereby realizing a signed/unsigned multiply-accumulator with adjustable bit-width precision. By making full use of hardware resources, it supports signed/unsigned multiplication and multiply-accumulate operations of various bit-width precisions at very low overhead. More importantly, depending on the specific application and provided that the calculation accuracy is met, multiple groups of multiplication/multiply-accumulate operations of different low bit widths can be executed in parallel to meet the application's computing-performance needs.
  • Although the present invention takes splitting into two groups of low-bit-width data as an example, the method can split the input data into any number of low-bit-width groups, which further improves flexibility in meeting computing requirements of different precisions. If the data is split into 4 groups of low-bit-width data, then with 16 low-bit-width MAC operation units it is possible to run 4 groups of low-bit-width signed/unsigned multiply-accumulate/multiply operations in parallel, or 2 groups of medium-bit-width signed/unsigned multiply-accumulate/multiply operations in parallel, or a single group of high-bit-width signed/unsigned multiply-accumulate/multiply operations; for example, operations with data bit-width precisions of 8/16/32 or 4/8/16. By the same reasoning, the proposed method can be extended to precision-adjustable computing applications of arbitrary precision, with the high-bit-width number divided into any number of low-bit-width numbers for flexible bit-width precision design; therefore, the present invention designs
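
The generalization to more than two groups can be sketched as follows. This is an illustration under assumptions, not the patented design: the names are hypothetical, all groups ("limbs") have the same width, the addend is split the same way as the multiplicand, and only the single high-bit-width mode is modeled.

```python
def split_limbs(value: int, limb_bits: int, count: int) -> list[int]:
    """Split value into `count` limbs, least significant first; the top limb keeps the sign."""
    mask = (1 << limb_bits) - 1
    limbs = [(value >> (limb_bits * i)) & mask for i in range(count - 1)]
    limbs.append(value >> (limb_bits * (count - 1)))
    return limbs


def mac_limbs(a: int, b: int, c: int, limb_bits: int, count: int) -> tuple[int, int]:
    """High-bit-width MAC from count*count low-bit-width MAC units; returns (result, units)."""
    a_l, b_l, c_l = (split_limbs(v, limb_bits, count) for v in (a, b, c))
    total, units = 0, 0
    for i in range(count):
        for j in range(count):
            addend = c_l[i] if j == 0 else 0   # each addend limb enters exactly one unit
            total += (a_l[i] * b_l[j] + addend) << (limb_bits * (i + j))
            units += 1
    return total, units


result, units = mac_limbs(-12345, 23456, 777, limb_bits=4, count=4)
```

With `count=4` the double loop visits 16 units, matching the 16 low-bit-width MAC operation units mentioned above.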

Abstract

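A signed/unsigned multiply-accumulate device and method, suitable for coarse-grained reconfigurable processor architectures. The device comprises a splitting module, an operation module, a processing module and an output module. The splitting module obtains a configuration control signal and, according to that signal, splits the input binary multiplicand, multiplier and addend, which are wider than a preset bit width, into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule. The operation module, according to the dynamic configuration file in the configuration control signal, groups these narrower binary numbers correspondingly across multiple MAC operation units and performs multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results. The processing module shifts and significant-bit-extends each calculation result according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width. The output module accumulates the processing results to obtain the operation result.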

Description

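Signed/unsigned multiply-accumulate device and method
Technical Field
The present invention relates to the field of processor design, and in particular to a signed/unsigned multiply-accumulate device and method.
Background
Coarse-grained reconfigurable processor architectures are attracting growing attention for their low energy consumption, high performance, high energy efficiency, and flexible dynamic reconfigurability. A coarse-grained reconfigurable computing architecture combines the flexibility of a general-purpose processor with the high performance of an application-specific integrated circuit, and is well suited to data-intensive and compute-intensive applications with a very high degree of parallelism, such as artificial intelligence, digital signal processing, video and image processing, scientific computing, and communication encryption. Meanwhile, the rapid rise of applications such as artificial intelligence, neural networks, big data, cloud computing and 5G communication brings ever denser data and computation, and these applications often involve large numbers of multiplication (MUL) operations and multiply-accumulate (Multiplication-and-Addition Operation, MAC) operations with differing bit-width requirements.
In 2017, Google built the TPU (Tensor Processing Unit), an application-specific integrated circuit accelerator for neural network applications. It is built mainly from MAC units: multiply-accumulate operations execute on a 256x256 MAC array in systolic-array fashion, achieving a computing capability of up to 92 TOPS@8bit and an energy efficiency of 4 TOPS/W@8bit. However, it supports only 8-bit MACs. In many applications such as image and video processing, speech recognition and neural networks, the required calculation precision varies, and some need only lower-bit-width data to meet their precision requirements. If a hardware processing unit that supports high-bit-width operations could also execute multiple groups of low-bit-width data in parallel, computing capability and performance could be nearly multiplied within limited hardware resources, without much additional power, and the energy efficiency of the computation would improve greatly.
Current reconfigurable processing architectures, however, provide multiplication and addition at a single bit width, and as separate operations. They therefore often cannot adjust bit-width precision flexibly according to the needs of a specific application. In addition, one MAC operation often takes two or more cycles: the first cycle multiplies the multiplier and the multiplicand; the second adds the previous cycle's result to the addend through an accumulator. This greatly limits the ability of reconfigurable processors to handle the tasks above flexibly and efficiently. The industry therefore urgently needs a new method and device to improve efficiency.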
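Summary of the Invention
The purpose of the present invention is to provide a signed/unsigned multiply-accumulate device and method that can be used effectively in coarse-grained reconfigurable processor architectures. Through flexible dynamic configuration of operand bit widths, and while making full use of the computing resources, it processes multiple groups of multiply/multiply-accumulate operations of different bit widths in parallel, nearly multiplying computing throughput, performance and energy efficiency. At the same time, a single multiply-accumulator circuit flexibly supports both signed and unsigned multiply and multiply-accumulate operations, preserving the dynamic reconfigurability of the reconfigurable processor at very low power and area overhead.
To achieve the above purpose, the present invention provides a signed/unsigned multiply-accumulate device suitable for coarse-grained reconfigurable processor architectures. The device comprises a splitting module, an operation module, a processing module and an output module. The splitting module is configured to obtain a configuration control signal and, according to the configuration control signal, split the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule. The operation module is configured to, according to the dynamic configuration file in the configuration control signal, group the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then perform multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results. The processing module is configured to shift and significant-bit-extend each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width. The output module is configured to accumulate the processing results to obtain the operation result.
The present invention also provides a signed/unsigned multiply-accumulate method suitable for coarse-grained reconfigurable processor architectures. The method comprises: obtaining a configuration control signal and, according to the configuration control signal, splitting the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule; according to the dynamic configuration file in the configuration control signal, grouping the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results; shifting and significant-bit-extending each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width; and accumulating the processing results to obtain the operation result.
The beneficial technical effects of the present invention are as follows: it supports signed and unsigned multiplication and multiply-accumulate operations and unifies them in one arithmetic circuit, which not only saves area and power but also allows the needs of multiple applications to be met through configuration and reconfiguration, giving good reconfigurability and wide applicability.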
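Brief Description of the Drawings
The drawings described here provide a further understanding of the present invention and form part of this application; they do not limit the present invention. In the drawings:
Fig. 1 is a schematic structural diagram of the signed/unsigned multiply-accumulate device provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the application structure of the signed/unsigned multiply-accumulate device provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the operating principle of the signed/unsigned multiply-accumulate device provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the multiply-accumulate principle of the arbitrary-precision configurable MAC operation provided by an embodiment of the present invention;
Fig. 5 is a flow chart of the signed/unsigned multiply-accumulate method provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of one complete high-bit-width unsigned MAC operation provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of an application example of one complete high-bit-width unsigned MAC operation provided by an embodiment of the present invention;
Fig. 8 is a schematic diagram of two parallel low-bit-width unsigned MAC operations provided by an embodiment of the present invention;
Fig. 9 is a schematic diagram of an application example of two parallel low-bit-width unsigned MAC operations provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of one complete high-bit-width signed MAC operation provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of an application example of one complete high-bit-width signed MAC operation provided by an embodiment of the present invention;
Fig. 12 is a schematic diagram of two parallel low-bit-width signed MAC operations provided by an embodiment of the present invention;
Fig. 13 is a schematic diagram of an application example of two parallel low-bit-width signed MAC operations provided by an embodiment of the present invention.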
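Detailed Description
The embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the process by which the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented. It should be noted that, provided no conflict arises, the embodiments of the present invention and the features within them may be combined with one another, and the resulting technical solutions all fall within the protection scope of the present invention.
In addition, the steps shown in the flow charts of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in a different order.
Referring to Fig. 1, the present invention provides a signed/unsigned multiply-accumulate device suitable for coarse-grained reconfigurable processor architectures. The device comprises a splitting module, an operation module, a processing module and an output module. The splitting module is configured to obtain a configuration control signal and, according to the configuration control signal, split the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule. The operation module is configured to, according to the dynamic configuration file in the configuration control signal, group the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then perform multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results. The processing module is configured to shift and significant-bit-extend each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width. The output module is configured to accumulate the processing results to obtain the operation result. In this way, the high-bit-width binary numbers required by a multiply/multiply-accumulate operation are split into several groups of low-bit-width binary numbers, which are processed by several low-bit-width multiply-accumulators and combined, realizing a signed/unsigned multiply-accumulator with adjustable bit-width precision.
In an embodiment of the present invention, the operation module comprises multiple MAC operation units. Each MAC operation unit is configured to parse the function identifier and the operation-type identifier in the dynamic configuration file; to determine, from the function identifier and the operation-type identifier, how it is to operate on the binary numbers it receives; and to perform the corresponding multiply-accumulate calculation on the received binary numbers in that manner to obtain the corresponding calculation result.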
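As shown in Fig. 2, in actual operation the main functions of the modules are as follows:
The splitting module splits the input high-bit-width binary multiplicand A, multiplier B and addend C, according to the configuration control signal Config, into several groups of low-bit-width binary numbers. Here it is assumed that each is split into 2 groups of data of arbitrary low bit width.
Input signals: the 3-bit configuration control signal Config, the M-bit multiplicand A, the N-bit multiplier B and the L-bit addend C. Output signals: after splitting, the m-bit A_L and the (M-m)-bit A_H; the n-bit B_L and the (N-n)-bit B_H; the l-bit C_L and the (L-l)-bit C_H.
The operation module performs multiply-accumulate calculations on the split multiplicands, multipliers and addends, including signed multiply-accumulate, unsigned multiply-accumulate, and mixed signed-by-unsigned multiply-accumulate. According to the configuration information Config1~Config4, it determines whether to perform one high-bit-width signed/unsigned multiply or multiply-accumulate operation, or two low-bit-width signed/unsigned multiply or multiply-accumulate operations computed in parallel; and according to whether the addend A_in is zero, it determines whether to perform a multiplication or a multiply-accumulate operation.
Input signals: the m-bit A_L and the (M-m)-bit A_H; the n-bit B_L and the (N-n)-bit B_H; the l-bit C_L and the (L-l)-bit C_H; the 2-bit config signals. Output signals: the multiply-accumulate results P1, P2, P3 and P4.
The processing module applies the appropriate shift and most-significant-bit (MSB) extension to the final results of the multiply/multiply-accumulate operations computed on the low-bit-width data. The result P1 of the first MAC is not shifted; the result P2 of the second MAC is shifted left by m bits; the result P3 of the third MAC is shifted left by n bits; the result P4 of the fourth MAC is shifted left by m+n bits. All shifted results are then MSB-extended to M+N bits.
Input signals: the multiply-accumulate results P1, P2, P3 and P4. Output signals: the results of P1, P2, P3 and P4 after shifting and MSB extension: P1_ext, P2_ext, P3_ext and P4_ext.
The output module accumulates the results computed by the preceding modules to obtain the final result.
Input signals: the shifted and extended results P1_ext, P2_ext, P3_ext and P4_ext. Output signal: the final multi-bit multiply-accumulate result Product.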
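The four modules above can be sketched in a few lines of Python for the unsigned case. This is a minimal illustration rather than the circuit itself: the function names `split`, `mac` and `split_mac_unsigned` are hypothetical, and the sketch assumes the addend C is split at the same bit position as A (l = m) so that C_H lines up with the m-bit shift applied to P2.

```python
def split(value, low_bits):
    """Splitting module: return the (high, low) parts of an unsigned value."""
    return value >> low_bits, value & ((1 << low_bits) - 1)


def mac(a, b, c):
    """One low-bit-width MAC operation unit: a * b + c."""
    return a * b + c


def split_mac_unsigned(a, b, c, m, n):
    """Unsigned A*B + C assembled from four low-bit-width MACs.

    Assumption: C is split at bit m, the same position as A.
    """
    a_h, a_l = split(a, m)
    b_h, b_l = split(b, n)
    c_h, c_l = split(c, m)

    # Operation module: the four MAC units.
    p1 = mac(a_l, b_l, c_l)
    p2 = mac(a_h, b_l, c_h)
    p3 = mac(a_l, b_h, 0)
    p4 = mac(a_h, b_h, 0)

    # Processing module: shifts. Python integers are unbounded, so the
    # zero extension to M+N bits is implicit here.
    p1_ext, p2_ext, p3_ext, p4_ext = p1, p2 << m, p3 << n, p4 << (m + n)

    # Output module: accumulate the four processed results.
    return p1_ext + p2_ext + p3_ext + p4_ext


print(split_mac_unsigned(1000, 77, 300, 5, 3))  # 77300
```

The identity behind the sketch is A×B + C = (A_L×B_L + C_L) + (A_H×B_L + C_H)·2^m + (A_L×B_H)·2^n + (A_H×B_H)·2^(m+n), which holds for any m and n under the stated assumption.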
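In the above embodiment, the function of the control signal Config is given in Table 1; it is a 3-bit-wide configuration signal.
Table 1
Figure PCTCN2020138119-appb-000001
To make the precision adjustable, the present invention further configures the control signal Config dynamically, generating from it the internal configuration control signals Config1~Config4, each a 2-bit-wide configuration signal; the corresponding MAC operation modes are given in Table 2.
Table 2
Value of Config1~Config4    Type of operation performed by the MAC
00    Multiply-accumulate with both the multiplier and the multiplicand unsigned
01    Mixed multiply-accumulate with a signed multiplier and an unsigned multiplicand
10    Mixed multiply-accumulate with a signed multiplicand and an unsigned multiplier
11    Multiply-accumulate with both the multiplier and the multiplicand signed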
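One way to read Table 2 is that the high bit of the 2-bit value marks the multiplicand as signed and the low bit marks the multiplier as signed. That bit assignment is an inference from the four rows, not something the text states, and the names `to_signed` and `mac_unit` below are hypothetical.

```python
def to_signed(bits, width):
    """Interpret a `width`-bit pattern as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def mac_unit(a_bits, b_bits, addend, cfg, width):
    """MAC on raw bit patterns; cfg is the 2-bit value from Table 2.

    Assumed encoding: cfg bit 1 -> multiplicand is signed,
                      cfg bit 0 -> multiplier is signed.
    """
    mask = (1 << width) - 1
    a = to_signed(a_bits, width) if cfg & 0b10 else a_bits & mask
    b = to_signed(b_bits, width) if cfg & 0b01 else b_bits & mask
    return a * b + addend


# The same 4-bit patterns give different products in each mode.
for cfg in (0b00, 0b01, 0b10, 0b11):
    print(format(cfg, "02b"), mac_unit(0b1111, 0b1110, 0, cfg, 4))
```

With the patterns 1111 and 1110 the four modes yield 210, -30, -14 and 2, which shows why each unit needs its own signedness setting when the high and low halves of an operand are mixed.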
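The final operation mode and function of each MAC operation unit can thus be determined from the correspondences in the tables above, as shown in Table 3 below.
Table 3
Figure PCTCN2020138119-appb-000002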
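In an embodiment of the present invention, the MAC operation unit is further configured to: identify the signedness of each binary number narrower than the preset bit width; apply the corresponding sign-bit extension or unsigned extension to the binary number according to that signedness; and, after partial-product and addend shift processing on the extended binary numbers, obtain the calculation result by multiply-accumulate calculation. In actual operation, the signed/unsigned multiply-accumulate device computes on a single unified hardware architecture and treats the addend of the multiply-accumulate in the same way as a multiplication partial product. The addend is thereby hidden inside the multiplication: it enters the partial-product adder tree (for example a Wallace tree) and is compressed and accumulated together with the partial products, so the multiply-accumulate operation completes with essentially no added area. During operation, each MAC operation unit first determines whether each input low-bit-width number is signed or unsigned and applies sign-bit extension or unsigned extension accordingly. The radix-4 Booth algorithm is then modified for the multiply-accumulate design: the partial products and the addend are shifted and then accumulated, which hides the addend inside the multiplication while completing the multiply-accumulate calculation. As shown in Fig. 3, the sign-bit extension of signed numbers, the most-significant-bit extension of unsigned numbers, and the calculation on the Booth-encoded partial products add four sign bits to the final result, so the result must be truncated to obtain the final value. This reduces hardware resources, area and power, shortens the computation delay of the multiply-accumulator, and raises the operating frequency and energy efficiency. Here, S is the most significant bit of the partial products and the addend after Booth encoding; N indicates whether a partial product that is Booth-encoded as a negative multiple undergoes the invert-and-add-1 operation; and M denotes the data bit width of the signed/unsigned multiply-accumulate operation of arbitrary bit width.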
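The idea of folding the addend into the partial-product tree can be illustrated with textbook radix-4 Booth recoding. The patent's modified Booth scheme is not spelled out in the text, so the sketch below is only an arithmetic model under that substitution: the helper names are hypothetical, `width` is assumed even, and a plain `sum` stands in for the Wallace-tree compression.

```python
def extend(bits, width, signed):
    """Value of a `width`-bit pattern after the two-bit extension.

    Signed inputs repeat their sign bit and unsigned inputs are padded
    with zeros, so both can be handled as signed values afterwards.
    """
    bits &= (1 << width) - 1
    if signed and (bits >> (width - 1)) & 1:
        return bits - (1 << width)
    return bits


def booth_radix4_digits(value, width):
    """Radix-4 Booth digits (each in -2..2), least significant first.

    `value` is a signed integer that fits in `width` bits; width is even.
    """
    pattern = value & ((1 << width) - 1)
    digits, prev = [], 0
    for i in range(0, width, 2):
        low = (pattern >> i) & 1
        high = (pattern >> (i + 1)) & 1
        digits.append(low + prev - 2 * high)
        prev = high
    return digits


def booth_mac(a_bits, b_bits, addend, width, a_signed, b_signed):
    """a*b + addend with the addend entering as one more partial-product row."""
    a = extend(a_bits, width, a_signed)
    b = extend(b_bits, width, b_signed)
    digits = booth_radix4_digits(b, width + 2)  # operand plus two extension bits
    rows = [(d * a) << (2 * i) for i, d in enumerate(digits)]
    rows.append(addend)  # treated exactly like a partial product
    return sum(rows)


print(booth_radix4_digits(6, 4))                   # [-2, 2]
print(booth_mac(0b1111, 0b0101, 3, 4, True, False))  # -2
```

Because the addend is just another row, a signed, unsigned or mixed MAC costs one extra input to the adder tree rather than a separate adder stage, which is the point the paragraph above makes.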
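In an embodiment of the present invention, the operation module is further configured to: obtain the operation category of each MAC operation unit according to the calling requirements of the application, and obtain the addend value of each MAC operation unit according to the operation category; each MAC operation unit then performs the partial-product and addend shift processing on the extended binary numbers according to that addend value. Further, the processing module may shift and significant-bit-extend the calculation results according to the preset adjustment rule, based on the signedness of the calculation results and the operation category, to obtain multiple processing results wider than the preset bit width. Specifically, referring to Fig. 4 and assuming M=N=L, n=l and m>n, the MAC operation units perform a signed/unsigned multiply-accumulate calculation of adjustable precision as follows. The multiplicand A, the multiplier B and the addend C are each split into 2 groups of low-bit-width data: A_H, A_L, B_H, B_L, C_H and C_L. These are then combined to form the 4 low-bit-width MAC operations shown in Fig. 4, labelled ①, ②, ③ and ④. Operations ① and ③ are ordinary multiply-accumulate operations whose addends are C_L and 0 respectively. Operations ② and ④ are multiply-accumulate operations whose addend may be either C_H or 0: when one high-bit-width multiply-accumulate operation is performed, the addends of ② and ④ are C_H and 0 respectively; when two low-bit-width multiply-accumulate operations are performed in parallel, the addends of ② and ④ are 0 and C_H respectively. If one high-bit-width multiply-accumulate operation is performed, the results of these 4 MAC operations must additionally be shifted and MSB-extended.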
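The addend routing described above is small enough to write down directly. The following sketch lists the operand triple fed to each of the four units in the two operation categories; the function name `schedule` and the `parallel` flag are hypothetical stand-ins for whatever the configuration signals select.

```python
def schedule(a_h, a_l, b_h, b_l, c_h, c_l, parallel):
    """Return (multiplicand, multiplier, addend) for units 1..4 (① to ④ in Fig. 4).

    parallel=False: one high-bit-width MAC, C_H goes to unit 2.
    parallel=True:  two low-bit-width MACs, C_H goes to unit 4.
    """
    addend_2, addend_4 = (0, c_h) if parallel else (c_h, 0)
    return [
        (a_l, b_l, c_l),       # unit 1: always A_L * B_L + C_L
        (a_h, b_l, addend_2),  # unit 2: cross term
        (a_l, b_h, 0),         # unit 3: cross term, addend always 0
        (a_h, b_h, addend_4),  # unit 4: A_H * B_H (+ C_H in parallel mode)
    ]


for mode in (False, True):
    print("parallel" if mode else "single  ", schedule(9, 11, 10, 1, 5, 8, mode))
```

In parallel mode the two cross-term units carry no useful work, which is consistent with the later embodiments leaving units ② and ③ disabled to save power.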
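Thus the precision-adjustable signed/unsigned multiply-accumulate device designed in the present invention can flexibly select, according to the precision and performance requirements of different applications, whether the final output is one high-bit-width signed/unsigned multiply/multiply-accumulate result or several low-bit-width signed/unsigned multiply/multiply-accumulate results computed in parallel. Moreover, unifying signed and unsigned operations in one arithmetic circuit not only saves area and power, but also allows the needs of multiple applications to be met through configuration and reconfiguration, giving good reconfigurability and wide applicability.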
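Referring to Fig. 5, the present invention also provides a signed/unsigned multiply-accumulate method suitable for coarse-grained reconfigurable processor architectures, the method comprising:
S501: obtaining a configuration control signal and, according to the configuration control signal, splitting the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule;
S502: according to the dynamic configuration file in the configuration control signal, grouping the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results;
S503: shifting and significant-bit-extending each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width;
S504: accumulating the processing results to obtain the operation result.
In the above embodiment, grouping the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units and then performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results comprises: parsing the function identifier and the operation-type identifier in the dynamic configuration file; determining, from the function identifier and the operation-type identifier, how each MAC operation unit is to operate on the binary numbers it receives; and performing the corresponding multiply-accumulate calculation on the received binary numbers in that manner to obtain the corresponding calculation result. Here, performing the corresponding multiply-accumulate calculation on the received binary numbers comprises: identifying the signedness of each binary number narrower than the preset bit width; applying the corresponding sign-bit extension or unsigned extension to the binary number according to that signedness; and, after partial-product and addend shift processing on the extended binary numbers, obtaining the calculation result by multiply-accumulate calculation. For a specific application example, see Fig. 4 and the corresponding embodiment above; the details are not repeated here.
In an embodiment of the present invention, shifting and significant-bit-extending each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width further comprises: obtaining the operation category of each MAC operation unit according to the calling requirements of the application, and obtaining the addend value of each MAC operation unit according to the operation category; each MAC operation unit then performs the partial-product and addend shift processing on the extended binary numbers according to that addend value. The operation categories comprise: high-bit-width signed/unsigned MAC operations and parallel low-bit-width signed/unsigned MAC operations.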
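To make the specific application of the above embodiments clearer, the high-bit-width signed/unsigned MAC operation and the parallel low-bit-width signed/unsigned MAC operations are described in detail below with concrete examples. Those skilled in the art will appreciate that these examples are only one way of applying the above embodiments, given to aid understanding, and do not limit them in any way.
For one complete high-bit-width unsigned MAC operation, refer to Fig. 6. Assume M=N=L, n=l and m>n. When the inputs A, B and C are unsigned, A_H, A_L, B_H, B_L, C_H and C_L are all unsigned, so all 4 low-bit-width MAC operations are unsigned operations. The signed/unsigned multiply-accumulate method provided by the present invention unifies unsigned numbers into signed numbers for calculation, so all 4 unsigned MAC operations require extension at the sign position: following the MAC operation unit logic described above, two unsigned extension bits are added before calculation. Because the partial-product calculations of MACs ③ and ④ in Fig. 6 have no addend, the multiply-accumulator must treat their addend as 0. The procedure is as follows:
Step 1: unit ① computes A_L×B_L+C_L; its result is not shifted. A_L, B_L and C_L are all unsigned-extended;
Step 2: unit ② computes A_H×B_L+C_H; its result is logically shifted left by m bits (the bit width of A_L). A_H, B_L and C_H are all unsigned-extended;
Step 3: unit ③ computes A_L×B_H+0; its result is logically shifted left by n bits (the bit width of B_L). A_L and B_H are both unsigned-extended;
Step 4: unit ④ computes A_H×B_H+0; its result is logically shifted left by m+n bits (the combined bit width of A_L and B_L). A_H and B_H are both unsigned-extended;
Step 5: the MAC operations ①, ②, ③ and ④ yield four calculation results, which are accumulated to obtain the final unsigned high-bit-width result.
Referring next to Fig. 7, which shows an example in which the present invention performs a single high-bit-width unsigned MAC operation: the unsigned multiply-accumulate A×B+C is computed with A=155, B=161 and C=88. Each high-bit-width operand is split into 2 groups of low-bit-width data, the low-bit-width numbers are unsigned-extended, and the multiply or multiply-accumulate calculations are performed; finally the resulting P1, P2, P3 and P4 are shifted and added to give the final high-bit-width result. This embodiment demonstrates that the multiply-accumulate method provided by the present invention computes precision-adjustable multiplications and multiply-accumulates on unsigned numbers correctly.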
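The five steps can be traced by hand for A=155, B=161 and C=88. The walk-through below assumes 8-bit operands cut into two 4-bit halves (m = n = l = 4); the figure's actual split is not reproduced in the text, but this choice yields exactly the half-width values used in the Fig. 9 example.

```python
HALF = 4  # assumed split point: 8-bit operands, two 4-bit halves


def halves(x):
    """Return (high, low) 4-bit halves of an 8-bit unsigned value."""
    return x >> HALF, x & 0xF


a_h, a_l = halves(155)  # 9, 11
b_h, b_l = halves(161)  # 10, 1
c_h, c_l = halves(88)   # 5, 8

p1 = a_l * b_l + c_l                  # step 1: 19, no shift
p2 = (a_h * b_l + c_h) << HALF        # step 2: 14 << 4 = 224
p3 = (a_l * b_h) << HALF              # step 3: 110 << 4 = 1760
p4 = (a_h * b_h) << (2 * HALF)        # step 4: 90 << 8 = 23040
product = p1 + p2 + p3 + p4           # step 5: accumulate

print(p1, p2, p3, p4, product)  # 19 224 1760 23040 25043
```

The accumulated value 25043 equals 155×161+88 computed directly, so the split computation loses nothing relative to a single wide MAC.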
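In another embodiment, for two parallel low-bit-width unsigned MAC operations, refer to Fig. 8. In Fig. 8, assume M=N=L and m>n. When the inputs A, B and C are unsigned, A_H, A_L, B_H, B_L, C_H and C_L are all unsigned. Units ② and ③ can therefore be left disabled, reducing the corresponding computation power, while units ① and ④ compute the two low-bit-width MAC operations P1=A_L×B_L+C_L and P4=A_H×B_H+C_H, both of which are unsigned operations. Because the MAC operation designed in the present invention unifies unsigned numbers into signed numbers for calculation, these unsigned MAC operations also require extension at the sign position: following the MAC logic described above, two unsigned extension bits are added before calculation. To obtain the two low-bit-width multiply-accumulate results simultaneously, the procedure is as follows:
Step 1: unit ① computes A_L×B_L+C_L; its result is not shifted. A_L, B_L and C_L are all unsigned-extended;
Step 2: units ② and ③ are disabled and perform no calculation; all of their input data signals are set to 0.
Step 3: unit ④ computes A_H×B_H+C_H; its result is logically shifted left by m+n bits (the combined bit width of A_L and B_L).
Step 4: A_H, B_H and C_H are all unsigned-extended. The two results finally output are P1 and P4, the calculation results of the two low-bit-width unsigned MAC operations.
Referring next to Fig. 9, take A_H=9, B_H=10, C_H=5 and A_L=11, B_L=1, C_L=8 as an example. To compute the unsigned multiply-accumulate A_H×B_H+C_H, MAC operation unit ④ performs an unsigned multiply-accumulate and obtains P4=95; in parallel, MAC operation unit ① performs the unsigned multiply-accumulate A_L×B_L+C_L and obtains P1=19. This embodiment demonstrates that the multiply-accumulate method provided by the present invention can compute two groups of low-bit-width unsigned numbers in parallel and obtain two correct results.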
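The parallel mode can be sketched as two independent lanes sharing the input words. The sketch below assumes each input word holds its high and low operands in adjacent 4-bit halves (m = n = l = 4), and it reads the "shift left by m+n" of step 3 as placing P4 above P1 in a shared output word; that packing is an interpretation, and `parallel_unsigned_mac` is a hypothetical name.

```python
def parallel_unsigned_mac(a_word, b_word, c_word, m=4):
    """Two independent unsigned MACs on the halves of each 2m-bit word."""
    mask = (1 << m) - 1
    p1 = (a_word & mask) * (b_word & mask) + (c_word & mask)  # unit 1 (low lane)
    p4 = (a_word >> m) * (b_word >> m) + (c_word >> m)        # unit 4 (high lane)
    # Units 2 and 3 stay disabled. P1 is always below 2**(2*m), so the
    # two results cannot overlap in the packed word.
    packed = (p4 << (2 * m)) | p1
    return p4, p1, packed


# Fig. 9 values: A_H=9, A_L=11 -> 155; B_H=10, B_L=1 -> 161; C_H=5, C_L=8 -> 88
p4, p1, packed = parallel_unsigned_mac(155, 161, 88)
print(p4, p1)  # 95 19
```

The same three input words that produced one 16-bit result in the high-bit-width example now produce two 8-bit results, using only two of the four MAC units.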
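In an embodiment of the present invention, one complete high-bit-width signed MAC operation is shown in Fig. 10. Assume M=N=L, n=l and m>n. When the inputs A, B and C are signed, the 4 MAC operations must be treated case by case; E denotes the sign bit of the data to be extended. Here A_H, B_H and C_H are signed numbers while A_L, B_L and C_L are unsigned numbers, so the 4 low-bit-width MAC operations correspond to 4 different kinds of MAC operation: ① is an unsigned operation; ② multiplies a signed number by an unsigned number and adds a signed number; ③ multiplies an unsigned number by a signed number; ④ is a signed operation. Each operation is therefore handled separately, as follows:
Step 1: as shown at ① in Fig. 10, A_L×B_L+C_L is an unsigned calculation, so A_L, B_L and C_L are unsigned-extended, that is, extended with two 0 bits. The top 3 sign bits of the raw result are then discarded to save area, so the actual result P1 keeps m+n+1 bits. Finally, padding is applied: P1 is MSB (Most Significant Bit) extended up to M+N bits.
Step 2: as shown at ② in Fig. 10, in A_H×B_L+C_H the operands A_H and C_H are signed and B_L is unsigned, so the calculation cannot be performed directly. The unsigned number B_L is unsigned-extended with two 0 bits, and the signed numbers are sign-extended with two copies of the sign bit E. The top 3 sign bits of the raw result are then discarded to save area, so the actual result P2 keeps M-m+n+1 bits. P2 is then shifted left by m bits and finally MSB-extended up to M+N bits.
Step 3: as shown at ③ in Fig. 10, in A_L×B_H+0 the operand A_L is unsigned and B_H is signed, so the calculation cannot be performed directly. The unsigned number A_L is unsigned-extended with two 0 bits, and the signed number is sign-extended with two copies of the sign bit E. The top 3 sign bits of the raw result are then discarded to save area, so the actual result P3 keeps N-n+m+1 bits. P3 is then shifted left by n bits and finally MSB-extended up to M+N bits.
Step 4: as shown at ④ in Fig. 10, in A_H×B_H+0 both A_H and B_H are signed, so this is a signed calculation, and A_H and B_H are sign-extended with two copies of the sign bit E. Unlike steps 1 to 3, the top 4 sign bits of the raw result are discarded to save area, so the actual result P4 keeps N-n+M-m bits. Finally P4 is shifted left by m+n bits. As Fig. 10 shows, no MSB extension is needed after this shift.
Step 5: finally, the four results computed by ①, ②, ③ and ④ are accumulated to obtain the final signed result.
Referring next to Fig. 11, which shows an example of one high-bit-width signed multiply-accumulate A×B+C with A=-1, B=21 and C=-1. Each high-bit-width operand is split into two groups of low-bit-width data, and the low-bit-width numbers undergo sign-bit extension (for signed numbers) or unsigned extension. Because the MAC operation unit provided by the present invention unifies three different kinds of operation in one arithmetic circuit, namely signed multiply/multiply-accumulate, unsigned multiply/multiply-accumulate, and mixed signed/unsigned multiply/multiply-accumulate, multiplications and multiply-accumulates on numbers of every signedness are computed correctly. Finally, the resulting P1, P2, P3 and P4 are shifted and added to give the final result of the high-bit-width MAC operation. This embodiment demonstrates that the multiply-accumulate method provided by the present invention computes precision-adjustable multiplications and multiply-accumulates on signed numbers correctly.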
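The signed procedure above, including the kept widths, shifts and MSB extension, can be modelled at the bit level as follows. This is a sketch under two assumptions not stated for Fig. 11: operands are 8 bits wide with a 4/4 split by default, and C is split at the same position as A (l = m). The helper names are hypothetical, and the parameter names M, N, m, n mirror the symbols in the text.

```python
def wrap(value, bits):
    """Keep the low `bits` bits (two's-complement truncation)."""
    return value & ((1 << bits) - 1)


def signed_value(pattern, bits):
    """Interpret a `bits`-bit pattern as a two's-complement integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern


def sign_extend(pattern, from_bits, to_bits):
    """MSB-extend a `from_bits`-bit pattern to `to_bits` bits."""
    if (pattern >> (from_bits - 1)) & 1:
        pattern |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return pattern


def signed_split_mac(a, b, c, M=8, N=8, m=4, n=4):
    """Signed A*B + C from four mixed-signedness MACs (assumes l == m)."""
    a_h, a_l = signed_value(wrap(a, M) >> m, M - m), wrap(a, m)
    b_h, b_l = signed_value(wrap(b, N) >> n, N - n), wrap(b, n)
    c_h, c_l = signed_value(wrap(c, M) >> m, M - m), wrap(c, m)
    # (raw result, bits kept after truncation, left shift) for units 1..4
    parts = [
        (a_l * b_l + c_l, m + n + 1, 0),          # unsigned
        (a_h * b_l + c_h, M - m + n + 1, m),      # signed x unsigned + signed
        (a_l * b_h,       N - n + m + 1, n),      # unsigned x signed
        (a_h * b_h,       M - m + N - n, m + n),  # signed
    ]
    total = 0
    for raw, kept, shift in parts:
        pattern = wrap(raw, kept) << shift                     # truncate, then shift
        total += sign_extend(pattern, kept + shift, M + N)     # extend to M+N bits
    return signed_value(wrap(total, M + N), M + N)


print(signed_split_mac(-1, 21, -1))  # -22
```

For A=-1, B=21, C=-1 the four raw results are 90, -6, 15 and -1, and the accumulated value is -22. Note that for unit ④ the kept width plus the shift already equals M+N, which matches the remark in step 4 that no MSB extension is needed there.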
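In an embodiment of the present invention, two parallel low-bit-width signed MAC operations are shown in Fig. 12. Assume M=N=L and m>n. For signed multiply-accumulate calculation, A_H, B_H, C_H and A_L, B_L, C_L are two groups of signed numbers. ①, ②, ③ and ④ are then 4 low-bit-width signed MAC calculations, and A_H, B_H, C_H and A_L, B_L, C_L all need sign extension, where E denotes the sign bit of the data to be extended. Units ② and ③ can be left disabled, reducing the corresponding computation power, while units ① and ④ compute P1=A_L×B_L+C_L and P4=A_H×B_H+C_H. To obtain the two low-bit-width multiply-accumulate results simultaneously using the multiply-accumulator principle designed in the present invention, the procedure is as follows:
Step 1: unit ① computes A_L×B_L+C_L; its result is not shifted. A_L, B_L and C_L are all sign-extended;
Step 2: units ② and ③ are disabled and perform no calculation; all of their input data signals are set to 0.
Step 3: unit ④ computes A_H×B_H+C_H, where A_H, B_H and C_H are all sign-extended.
Step 4: the two MAC results finally output in parallel are P1 and P4, the calculation results of the two low-bit-width signed MAC operations.
Fig. 13 shows an example of two parallel low-bit-width signed multiply-accumulate operations, with A_H=-1, B_H=1, C_H=-1 and A_L=-1, B_L=5, C_L=-1. For the signed operation A_H×B_H+C_H, operation unit ④ performs a signed MAC calculation and obtains P4=-2; at the same time, operation unit ① performs the signed MAC operation A_L×B_L+C_L in parallel and obtains P1=-6. This embodiment demonstrates that the multiply-accumulate method provided by the present invention can compute two groups of low-bit-width signed numbers in parallel and still obtain two correct results.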
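The Fig. 13 values can be reproduced with a two-lane signed sketch. The packing of the two operand groups into the halves of 8-bit words (m = 4) is an assumption made for illustration, and `parallel_signed_mac` is a hypothetical name.

```python
def to_signed(bits, width):
    """Interpret the low `width` bits as a two's-complement integer."""
    bits &= (1 << width) - 1
    return bits - (1 << width) if bits >> (width - 1) else bits


def parallel_signed_mac(a_word, b_word, c_word, m=4):
    """Two independent signed m-bit MACs on the halves of 2m-bit words."""
    def lane(shift):
        a = to_signed(a_word >> shift, m)
        b = to_signed(b_word >> shift, m)
        c = to_signed(c_word >> shift, m)
        return a * b + c
    return lane(m), lane(0)  # (P4 from unit 4, P1 from unit 1)


# A_H=-1, A_L=-1 -> 0xFF; B_H=1, B_L=5 -> 0x15; C_H=-1, C_L=-1 -> 0xFF
p4, p1 = parallel_signed_mac(0xFF, 0x15, 0xFF)
print(p4, p1)  # -2 -6
```

Here both halves of each word are sign-extended, in contrast to the single high-bit-width signed case where only the high halves carry a sign.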
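The technique of the present invention splits the high-bit-width binary numbers required by a multiply/multiply-accumulate operation into several groups of low-bit-width binary numbers, processes them with several low-bit-width multiply-accumulators (MAC operation units), and combines the results to realize a signed/unsigned multiply-accumulator with adjustable bit-width precision. By making full use of the hardware resources, it supports signed/unsigned multiplication and multiply-accumulate operations at various bit-width precisions with very low additional overhead. More importantly, where the calculation accuracy required by the application permits, multiple groups of multiply/multiply-accumulate operations of different low bit widths can be executed in parallel to meet the application's computing-performance needs.
Although the present invention takes splitting into two groups of low-bit-width data as an example, the method can split the input data into any number of low-bit-width segments, which further improves flexibility to meet computing requirements of different precisions. If the design is extended to split each operand into 4 groups of low-bit-width data, then with 16 low-bit-width MAC operation units it can perform 4 groups of low-bit-width signed/unsigned multiply-accumulate/multiply operations in parallel, or 2 groups of lower-bit-width signed/unsigned multiply-accumulate/multiply operations in parallel, or a single high-bit-width signed/unsigned multiply-accumulate/multiply operation, for example operations with data bit-width precisions of 8/16/32 or 4/8/16. By the same reasoning, the method extends to computing applications with arbitrarily adjustable precision: a high-bit-width number is split into any number of low-bit-width numbers for flexible bit-width precision design. The energy-efficient signed/unsigned multiply-accumulator with adjustable bit-width precision designed in the present invention can therefore be applied to hardware acceleration circuits with many different requirements, such as CGRA, FPGA, GPU, DSP, TPU and neural network acceleration chips (NPU), and has very high generality and wide applicability.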
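The extension to more segments can be sketched for the unsigned case as a k×k grid of low-bit-width MACs, 16 units when each operand is cut into 4 segments. The routing of each addend segment to the unit that pairs A's i-th segment with B's lowest segment is an assumption chosen so the shifts line up; the text does not specify it, and the function names are hypothetical.

```python
def split_segments(value, seg_bits, count):
    """Cut an unsigned value into `count` segments, least significant first."""
    mask = (1 << seg_bits) - 1
    return [(value >> (i * seg_bits)) & mask for i in range(count)]


def segmented_mac(a, b, c, seg_bits, count):
    """Unsigned A*B + C using count*count low-bit-width MAC operations."""
    a_seg = split_segments(a, seg_bits, count)
    b_seg = split_segments(b, seg_bits, count)
    c_seg = split_segments(c, seg_bits, count)
    total = 0
    for i, a_i in enumerate(a_seg):
        for j, b_j in enumerate(b_seg):
            addend = c_seg[i] if j == 0 else 0  # each addend segment used once
            total += (a_i * b_j + addend) << ((i + j) * seg_bits)
    return total


# 32-bit operands as four 8-bit segments: 16 low-bit-width MACs
result = segmented_mac(0xDEADBEEF, 0x12345678, 0xCAFEBABE, 8, 4)
print(result == 0xDEADBEEF * 0x12345678 + 0xCAFEBABE)  # True
```

Enabling only the diagonal units (i equal to j) with their own addends would give the fully parallel mode, while enabling 2×2 blocks would give the intermediate two-group mode mentioned above.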
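The specific embodiments described above explain the purpose, technical solutions and beneficial effects of the present invention in further detail. It should be understood that they are only specific embodiments of the present invention and are not intended to limit its protection scope; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.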

Claims (10)

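  1. A signed/unsigned multiply-accumulate device suitable for coarse-grained reconfigurable processor architectures, characterized in that the device comprises a splitting module, an operation module, a processing module and an output module;
    the splitting module is configured to obtain a configuration control signal and, according to the configuration control signal, split the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule;
    the operation module is configured to, according to the dynamic configuration file in the configuration control signal, group the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then perform multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results;
    the processing module is configured to shift and significant-bit-extend each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width;
    the output module is configured to accumulate the processing results to obtain the operation result.
  2. The signed/unsigned multiply-accumulate device according to claim 1, characterized in that the operation module comprises multiple MAC operation units;
    the MAC operation unit is configured to parse the function identifier and the operation-type identifier in the dynamic configuration file;
    to determine, from the function identifier and the operation-type identifier, how each MAC operation unit is to operate on the binary numbers it receives;
    and to perform the corresponding multiply-accumulate calculation on the received binary numbers in that manner to obtain the corresponding calculation result.
  3. The signed/unsigned multiply-accumulate device according to claim 2, characterized in that the MAC operation unit is further configured to:
    identify the signedness of each binary number narrower than the preset bit width;
    apply the corresponding sign-bit extension or unsigned extension to the binary number according to that signedness;
    and, after partial-product and addend shift processing on the extended binary numbers, obtain the calculation result by multiply-accumulate calculation.
  4. The signed/unsigned multiply-accumulate device according to claim 3, characterized in that the operation module is further configured to: obtain the operation category of each MAC operation unit according to the calling requirements of the application, and obtain the addend value of each MAC operation unit according to the operation category; the MAC operation unit performs the partial-product and addend shift processing on the extended binary numbers according to the addend value.
  5. The signed/unsigned multiply-accumulate device according to claim 4, characterized in that the processing module is further configured to shift and significant-bit-extend the calculation results according to the preset adjustment rule, based on the signedness of the calculation results and the operation category, to obtain multiple processing results wider than the preset bit width.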
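  6. A signed/unsigned multiply-accumulate method suitable for coarse-grained reconfigurable processor architectures, characterized in that the method comprises:
    obtaining a configuration control signal and, according to the configuration control signal, splitting the input binary multiplicand, multiplier and addend that are wider than a preset bit width into multiple groups of binary numbers narrower than the preset bit width according to a preset splitting rule;
    according to the dynamic configuration file in the configuration control signal, grouping the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units, and then performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results;
    shifting and significant-bit-extending each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width;
    accumulating the processing results to obtain the operation result.
  7. The signed/unsigned multiply-accumulate method according to claim 1, characterized in that grouping the multiple groups of binary numbers narrower than the preset bit width correspondingly through multiple MAC operation units and then performing multiply-accumulate calculations and/or parallel multiply-accumulate calculations to obtain multiple calculation results comprises:
    parsing the function identifier and the operation-type identifier in the dynamic configuration file;
    determining, from the function identifier and the operation-type identifier, how each MAC operation unit is to operate on the binary numbers it receives;
    performing the corresponding multiply-accumulate calculation on the received binary numbers in that manner to obtain the corresponding calculation result.
  8. The signed/unsigned multiply-accumulate method according to claim 7, characterized in that performing the corresponding multiply-accumulate calculation on the received binary numbers in that manner to obtain the corresponding calculation result comprises:
    identifying the signedness of each binary number narrower than the preset bit width;
    applying the corresponding sign-bit extension or unsigned extension to the binary number according to that signedness;
    after partial-product and addend shift processing on the extended binary numbers, obtaining the calculation result by multiply-accumulate calculation.
  9. The signed/unsigned multiply-accumulate method according to claim 8, characterized in that shifting and significant-bit-extending each of the calculation results according to a preset adjustment rule to obtain multiple processing results wider than the preset bit width further comprises:
    obtaining the operation category of each MAC operation unit according to the calling requirements of the application, and obtaining the addend value of each MAC operation unit according to the operation category;
    the MAC operation unit performing the partial-product and addend shift processing on the extended binary numbers according to the addend value.
  10. The signed/unsigned multiply-accumulate method according to claim 9, characterized in that the operation categories comprise: high-bit-width signed/unsigned MAC operations and parallel low-bit-width signed/unsigned MAC operations.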
PCT/CN2020/138119 2020-12-21 2020-12-21 Signed/unsigned multiply-accumulate device and method WO2022133686A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/138119 WO2022133686A1 (zh) 2020-12-21 2020-12-21 Signed/unsigned multiply-accumulate device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/138119 WO2022133686A1 (zh) 2020-12-21 2020-12-21 Signed/unsigned multiply-accumulate device and method

Publications (1)

Publication Number Publication Date
WO2022133686A1 true WO2022133686A1 (zh) 2022-06-30

Family

ID=82156973

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138119 WO2022133686A1 (zh) 2020-12-21 2020-12-21 Signed/unsigned multiply-accumulate device and method

Country Status (1)

Country Link
WO (1) WO2022133686A1 (zh)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20966267

Country of ref document: EP

Kind code of ref document: A1