WO2023231363A1 - Method for multiplying and accumulating operands, and device therefor - Google Patents

Method for multiplying and accumulating operands, and device therefor

Info

Publication number
WO2023231363A1
Authority
WO
WIPO (PCT)
Prior art keywords
operands
module
multiplication
point numbers
exponent
Prior art date
Application number
PCT/CN2022/138472
Other languages
English (en)
French (fr)
Inventor
刘少礼
郝勇峥
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 寒武纪(西安)集成电路有限公司
Publication of WO2023231363A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/556 Logarithmic or exponential functions

Definitions

  • The present invention relates generally to the field of computers. More specifically, the present invention relates to a device for multiplying and accumulating multiple operands, an integrated circuit device and a board card including it, a method for multiplying and accumulating multiple operands, a computer-readable storage medium, a computer program product, and a computer device for multiplying and accumulating multiple operands.
  • Multiply-accumulate refers to multiplying a series of operands in pairs and then summing the products, for example a sum of the form a1 × b1 + a2 × b2 + ... + an × bn.
  • Multiply-accumulate operations are widely used, so computers often have to handle them in multiple data formats: floating-point formats such as FP16 and FP32, and fixed-point formats such as INT4 and INT8. Because the data formats of floating-point and fixed-point numbers differ, computers often need two sets of hardware, one for floating-point multiply-accumulate and one for fixed-point multiply-accumulate, which wastes resources and is inefficient.
  • The solution of the present invention provides a device for multiplying and accumulating multiple operands, an integrated circuit device and board card including it, a method for multiplying and accumulating multiple operands, a computer-readable storage medium, a computer program product, and a computer device.
  • The present invention discloses a device for multiplying and accumulating multiple operands, including an alignment module, a multiplication module, a first addition module, a shift module and a second addition module.
  • The alignment module is used to identify the reference value of the exponents of the multiple operands;
  • the multiplication module is used to multiply the mantissas of two operands that are to be multiplied among the multiple operands, to obtain the partial products of the mantissas;
  • the first addition module is used to accumulate the partial products to obtain the product result of the mantissas of the two operands;
  • the shift module is used to shift the product result according to the difference between the reference value and the exponents of the two operands;
  • the second addition module is used to add the shifted product results to obtain the multiply-accumulate result of the multiple operands.
  • The present invention discloses an integrated circuit device, including the above device, and a board card, including the above integrated circuit device.
  • The present invention discloses a method for multiplying and accumulating multiple operands, including: identifying the reference value of the exponents of the multiple operands; multiplying the mantissas of two operands that are to be multiplied among the multiple operands, to obtain the partial products of the mantissas; accumulating the partial products to obtain the product result of the mantissas of the two operands; shifting the product result according to the difference between the reference value and the exponents of the two operands; and adding the shifted product results to obtain the multiply-accumulate result of the multiple operands.
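The method steps above can be illustrated with a minimal software sketch. It is not the patented hardware: the helper names `split_float` and `multiply_accumulate`, the 24-bit mantissa width, and the choice of taking the reference value as the largest exponent among the products are all assumptions made for illustration. The partial-product accumulation of the first addition module is folded into a single integer multiplication here.

```python
import math

def split_float(x, mant_bits=24):
    """Return (sign, exponent, mantissa) with x ~= sign * mantissa * 2**exponent."""
    if x == 0.0:
        return 1, 0, 0
    sign = -1 if x < 0 else 1
    frac, exp = math.frexp(abs(x))            # abs(x) == frac * 2**exp, 0.5 <= frac < 1
    mantissa = int(frac * (1 << mant_bits))   # keep mant_bits bits, truncate the rest
    return sign, exp - mant_bits, mantissa

def multiply_accumulate(pairs, mant_bits=24):
    """Sum of a*b over pairs, computed with integer mantissas and one shared exponent."""
    products = []
    for a, b in pairs:
        sa, ea, ma = split_float(a, mant_bits)
        sb, eb, mb = split_float(b, mant_bits)
        if ma and mb:
            # Multiply mantissas; the exponent of the product is the sum of exponents.
            products.append((sa * sb, ea + eb, ma * mb))
    if not products:
        return 0.0
    # Reference value (assumption): the largest exponent among the products.
    reference = max(e for _, e, _ in products)
    total = 0
    for sign, exponent, mantissa in products:
        # Align each product to the reference by shifting its mantissa right.
        total += sign * (mantissa >> (reference - exponent))
    return math.ldexp(total, reference)
```

For inputs that are exactly representable, such as `[(1.5, 2.0), (0.25, 4.0), (-3.0, 0.5)]`, the aligned integer sum reproduces the exact result; in general the right shift discards low-order bits, which is the precision loss the alignment schemes described later try to limit.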
  • The present invention discloses a computer-readable storage medium on which computer program code for a method of multiplying and accumulating multiple operands is stored.
  • When the computer program code is executed by a processing device, the above method is performed.
  • The present invention discloses a computer program product, which includes a computer program for multiplying and accumulating multiple operands.
  • The computer program implements the steps of the above method when executed by a processor.
  • The present invention discloses a computer device including a memory, a processor and a computer program stored on the memory.
  • The processor executes the computer program to implement the steps of the above method.
  • The present invention starts from the multiply-accumulate operation of floating-point numbers and uses exponent alignment to make part of the floating-point operations the same as the operations on fixed-point numbers. The hardware for these shared operations can therefore be reused to perform fixed-point multiply-accumulate operations, achieving the technical effect of covering both floating-point and fixed-point multiply-accumulate operations.
  • Figure 1 is a structural diagram showing a board card according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention.
  • Figure 3 is a schematic diagram showing a multiply-accumulate device according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing an alignment module using a unified alignment scheme according to an embodiment of the present invention.
  • Figure 5 is a schematic diagram illustrating an alignment module using a cluster alignment scheme according to an embodiment of the present invention.
  • Figure 6 is a schematic diagram showing a screening module according to an embodiment of the present invention.
  • Figure 7 is a schematic diagram showing the multiplication module and the second addition module according to the embodiment of the present invention.
  • Figure 8 is a schematic diagram showing a Wallace tree adder according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram showing the switching module controlling other modules according to the embodiment of the present invention.
  • Figure 10 is a schematic diagram showing an embodiment of the present invention multiplying and accumulating fixed-point numbers.
  • Figure 11 is a schematic diagram showing a multiply-accumulate device according to another embodiment of the present invention.
  • Figure 12 is a schematic diagram showing a switching module controlling a rounding module and a conversion module according to another embodiment of the present invention.
  • Figure 13 is a flowchart illustrating a method of multiplying and accumulating multiple operands according to another embodiment of the present invention.
  • FIG. 14 is a flowchart illustrating a method of multiplying and accumulating multiple operands according to another embodiment of the present invention.
  • The term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • The data format of a fixed-point number consists of a 1-bit sign bit (sign) and multiple mantissa bits (mantissa).
  • The sign bit determines whether the fixed-point number is positive or negative.
  • The mantissa bits determine the magnitude of the fixed-point number.
  • A fixed-point number is a number whose decimal point position is fixed.
  • Fixed-point numbers are divided into fixed-point integers and fixed-point decimals. Since the position of the decimal point is fixed, it does not need to be stored; the value is calculated according to the agreed position.
  • Computers usually represent fixed-point numbers as pure decimals or pure integers.
  • If the value is a pure decimal, the decimal point is preset between the sign bit and the highest digit of the mantissa. If the value is a pure integer, the decimal point is preset to the right of the lowest digit of the mantissa.
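As a small illustration of the two agreed decimal point positions, the sketch below interprets the same sign and mantissa bits either as a pure integer or as a pure decimal. The function name `fixed_point_value` and its parameters are hypothetical and chosen only for this example.

```python
def fixed_point_value(sign, mantissa, width, pure_decimal):
    """Interpret a sign bit plus a `width`-bit mantissa as a fixed-point value.

    pure_decimal=True places the point just left of the highest mantissa digit;
    pure_decimal=False places it just right of the lowest mantissa digit.
    """
    magnitude = mantissa / (1 << width) if pure_decimal else mantissa
    return -magnitude if sign else magnitude
```

With a 3-bit mantissa `0b101`, the pure-integer reading is 5 and the pure-decimal reading is 0.625; the stored bits are identical and only the agreed position differs.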
  • FP32: 32-bit single-precision floating-point number
  • exp: 8 exponent bits
  • mantissa: 23 mantissa bits
  • The 8 exponent bits represent the range from 0 to 255, which on its own would only yield large non-negative exponents. IEEE 754 therefore specifies an exponent offset of 127, which shifts the exponent range to between -127 and 128. IEEE 754 further stipulates an implicit digit to the left of the decimal point. This digit is usually 1, so the single-precision mantissa above effectively has 24 digits.
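The field layout can be checked in software. The following sketch uses Python's standard `struct` module to split a value into the standard IEEE 754 single-precision (binary32) fields of 1 sign bit, 8 exponent bits and 23 stored mantissa bits, and then rebuilds the value using the offset of 127 and the implicit leading 1. The function names are illustrative, and the reconstruction handles normal numbers only.

```python
import struct

def decode_fp32(x):
    """Return (sign, biased_exponent, fraction) of x encoded as IEEE 754 binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF   # 8 exponent bits, range 0..255
    fraction = bits & 0x7FFFFF              # 23 stored mantissa bits
    return sign, biased_exponent, fraction

def fp32_value(sign, biased_exponent, fraction):
    """Rebuild a normal number from its fields (no zero/subnormal/inf/NaN handling)."""
    significand = (1 << 23) | fraction      # restore the implicit leading 1: 24 bits
    return (-1) ** sign * significand * 2.0 ** (biased_exponent - 127 - 23)
```

For example, -6.5 is 1.625 × 2², so its biased exponent is 129 and its stored fraction is 0.625 × 2²³.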
  • The second digit to the right of the decimal point is the reserved digit.
  • The third digit to the right of the decimal point is the approximate digit.
  • All digits starting from the fourth digit to the right of the decimal point are sticky digits.
  • IEEE 754 defines four rounding methods: round to nearest even, round toward zero, round down (toward negative infinity) and round up (toward positive infinity). The default is round to nearest even.
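The four rounding methods can be compared on an integer significand from which the lowest k bits are dropped. This is a minimal sketch; the function name `round_shift` and the mode strings are assumptions for illustration and are not taken from the patent.

```python
def round_shift(value, k, mode="nearest_even"):
    """Drop the k lowest bits of a signed integer under the given rounding mode."""
    if k == 0:
        return value
    floor_q = value >> k              # Python's >> rounds toward negative infinity
    rem = value - (floor_q << k)      # discarded part, 0 <= rem < 2**k
    if rem == 0 or mode == "down":
        return floor_q
    if mode == "up":
        return floor_q + 1
    if mode == "toward_zero":
        return floor_q + 1 if value < 0 else floor_q
    if mode == "nearest_even":
        half = 1 << (k - 1)
        # Round up above the halfway point, or on a tie when floor_q is odd.
        if rem > half or (rem == half and floor_q & 1):
            return floor_q + 1
        return floor_q
    raise ValueError(f"unknown rounding mode: {mode}")
```

Dropping two bits from 10 (that is, 2.5) gives 2 under round to nearest even, while dropping two bits from 14 (3.5) gives 4: ties go to the even neighbour.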
  • FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention.
  • The board card 10 includes a chip 101, which is a system on chip (SoC) integrated with one or more combined processing devices.
  • The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing needs of complex scenarios in computer vision, speech, natural language processing, data mining and other fields.
  • Deep learning technology is widely used in the field of cloud intelligence.
  • A significant feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage and computing capabilities of the platform.
  • The board card 10 of this embodiment is suitable for cloud intelligence applications, with large off-chip storage, on-chip storage and powerful computing capabilities.
  • The chip 101 is connected to an external device 103 through an external interface device 102.
  • The external device 103 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or Wi-Fi interface.
  • Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102.
  • The calculation results of the chip 101 can be transmitted back to the external device 103 via the external interface device 102.
  • The external interface device 102 may have different interface forms, such as a PCIe interface.
  • The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105.
  • The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus.
  • The control device 106 in the board card 10 is configured to control the status of the chip 101.
  • The control device 106 may include a microcontroller unit (MCU).
  • FIG. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment.
  • The combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203 and an off-chip memory 204.
  • The computing device 201 is configured to perform operations specified by the user. It is mainly implemented as a single-core or multi-core intelligent processor that performs deep learning or machine learning calculations, and it interacts with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
  • The interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203.
  • The computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201.
  • The computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them into the on-chip control cache of the computing device 201.
  • The interface device 202 may also read the data in the storage device of the computing device 201 and transmit it to the processing device 203.
  • The processing device 203 performs basic control including but not limited to data transfer and starting and/or stopping the computing device 201.
  • The processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components. Their number can be determined according to actual needs.
  • The computing device 201 of the present invention, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • When the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
  • The off-chip memory 204 is used to store data to be processed. It is a DDR memory, usually 16 GB or larger, and holds data of the computing device 201 and/or the processing device 203.
  • The computing device 201 of this embodiment includes a device for multiplying and accumulating multiple operands. It uses exponent alignment so that, after alignment, part of the operations on floating-point numbers are the same as the operations on fixed-point numbers, and the hardware for those shared operations can be reused. As a result, the multiply-accumulate device of this embodiment covers both floating-point and fixed-point multiply-accumulate operations.
  • FIG. 3 shows a schematic diagram of the multiply-accumulate device of this embodiment.
  • The multiply-accumulate device of the computing device 201 includes an alignment module 301, a multiplication module 302, a first addition module 303, a shift module 304, a second addition module 305, a normalization module 306, a rounding module 307 and a switching module 308.
  • The alignment module 301 is used to identify the reference value of the exponents of multiple operands; the other operands can then be aligned to the reference value to facilitate subsequent accumulation. This embodiment does not limit the alignment method; alignment schemes of the alignment module 301 are described below by way of example.
  • The alignment module 301 may adopt a unified alignment scheme.
  • FIG. 4 shows a schematic diagram of the alignment module 301 using unified alignment.
  • The alignment module 301 is a binary tree structure whose basic unit is the comparison unit 401, an existing comparator used to compare the exponents of two operands.
  • The comparison unit 401 receives and compares the exponent bits of two operands (for example, exp_1 for operand 1 and exp_2 for operand 2) and outputs the larger one, max(exp_1, exp_2).
  • The larger exponent value then enters the next-level comparison unit 401 for pairwise comparison, until the highest-level comparison unit 401 selects the largest exponent among the n operands, max(exp_1, exp_2, ..., exp_n).
  • The purpose of the unified alignment scheme is to find max(exp_1, exp_2, ..., exp_n), the largest exponent among all operands, which the shift module 304 uses as the reference value for alignment shifting.
  • The present invention does not limit the structure of the comparison unit 401. Those skilled in the art can make appropriate changes according to the actual situation, for example by using an 8-to-1 comparator that compares operands in groups of 8 and successively finds the largest exponent among the n operands.
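The binary tree of two-input comparison units can be modelled in a few lines. This is a software sketch of the unified alignment idea only; the function name `tree_max` is hypothetical, and passing an unpaired element up to the next level is an assumption about how an odd count would be handled.

```python
def tree_max(exponents):
    """Find the largest exponent with a binary tree of two-input comparisons."""
    level = list(exponents)
    if not level:
        raise ValueError("at least one exponent is required")
    while len(level) > 1:
        # Each comparison unit keeps the larger of two neighbouring exponents.
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])     # unpaired element moves up unchanged
        level = nxt
    return level[0]
```

For n operands the tree has about log2(n) levels, so 32 exponents are reduced to the reference value in five levels of comparison.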
  • The alignment module 301 can also adopt a cluster alignment scheme.
  • The alignment module 301 first finds the maximum exponent among the n operands and uses it as the reference value to filter out a cluster.
  • The exponent values of the operands in a cluster are relatively close, so accumulating them directly does not affect accuracy.
  • The alignment module 301 then groups the operands that have not yet been grouped in the same manner, looping until all operands are classified.
  • FIG. 5 shows a schematic diagram of the alignment module 301 when this embodiment adopts the cluster alignment scheme.
  • The alignment module 301 includes an identification module 501, a filtering module 502 and a clustering module 503.
  • The identification module 501 has a structure similar to Figure 4 and is used to identify the reference value, which is the maximum exponent among all operands that have not yet been grouped. At the beginning, none of the n operands are grouped. The identification module 501 finds the maximum exponent of all operands and sets it as the reference value. The filtering module 502 compares the reference value with the exponent values of all operands and, if the difference is within a certain range, places the operand into a cluster (for example, the first cluster). The clustering module 503 excludes the operands in the first cluster and collects the operands that have not been grouped.
  • The identification module 501, the filtering module 502 and the clustering module 503 then find a new reference value among the updated ungrouped operands, establish a cluster (such as a second cluster), and update the ungrouped operands, until all operands are grouped. The n operands are thus divided into multiple clusters (the first cluster to the m-th cluster). Each cluster has a different reference value, and the difference between the exponent value of each operand in a cluster and the corresponding reference value is within a certain range; in other words, the exponent values of the operands in each cluster are similar.
  • FIG. 6 shows a schematic diagram of the filtering module 502, which includes a subtractor 601, a comparator 602, a first register 603 and a second register 604.
  • The subtractor 601 is used to obtain the difference between each exponent value and the reference value produced by the identification module 501.
  • The subtractor 601 has multiple subtraction units; it receives the exponent values of several ungrouped operands together with the reference value at one time and computes the difference between each exponent value and the reference value.
  • The present invention does not limit the implementation of the subtractor 601.
  • The comparator 602 receives the difference from the subtractor 601 and determines whether it is less than a threshold, that is, whether the difference between the exponent value of each operand and the reference value is within the threshold range.
  • The threshold can be any number, for example 32. If the difference is less than the threshold, the operand is sent to the first register 603 for storage. If the difference is not less than the threshold, the operand is sent to the second register 604 for storage. In other words, the comparator 602 divides the operands into two categories according to the size of the exponent value.
  • The first register 603 stores the operands whose difference is less than the threshold, and the second register 604 stores the operands whose difference is not less than the threshold.
  • Since the exponent difference of the operands stored in the first register 603 is less than the threshold, their exponent values do not differ much from the reference value.
  • All operands in the first register 603 are gathered into a cluster, ready for subsequent modules to process together.
  • The first register 603 produces a different cluster in each round of grouping, from the first cluster to the m-th cluster.
  • The exponent values of the operands stored in the second register 604 differ too much from the reference value. If they were accumulated together, too much precision would be lost when the mantissa bits are shifted. They are therefore sent back to the clustering module 503 and the identification module 501 to be regrouped.
  • The clustering module 503 updates the ungrouped operands to be the operands in the second register 604. That is, it overwrites the original operands with those in the second register 604, so that after the update the ungrouped operands are no longer all the operands but only those in the second register 604.
  • The clustering module 503 sends the updated ungrouped operands to the identification module 501, which identifies the largest exponent value among them; the filtering module 502 then filters out the operands whose difference between exponent value and reference value is less than the threshold, as the next cluster.
  • In this way, the alignment module 301 divides the n operands into several clusters according to their exponent values.
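The cluster alignment loop described above can be sketched as follows. The function name `cluster_by_exponent` is hypothetical; the roles of the identification, filtering and clustering modules are marked in comments, and the default threshold of 32 is the example value from the text.

```python
def cluster_by_exponent(exponents, threshold=32):
    """Group operand indices into clusters whose exponents lie close to a reference.

    Returns a list of (reference_value, member_indices) tuples.
    """
    ungrouped = list(range(len(exponents)))
    clusters = []
    while ungrouped:
        # Identification: the reference is the largest exponent still ungrouped.
        reference = max(exponents[i] for i in ungrouped)
        # Filtering: difference below the threshold -> first register (this cluster);
        # otherwise -> second register (kept for the next round).
        first = [i for i in ungrouped if reference - exponents[i] < threshold]
        second = [i for i in ungrouped if reference - exponents[i] >= threshold]
        clusters.append((reference, first))
        # Clustering: only the contents of the second register remain ungrouped.
        ungrouped = second
    return clusters
```

The loop always terminates because the operand holding the current reference value has a difference of zero and therefore joins the cluster in every round.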
  • The multiplication module 302 is used to multiply the mantissas of two operands that are to be multiplied among the multiple operands (such as the mantissa man_1 of operand 1 and the mantissa man_2 of operand 2) to obtain the partial products of the mantissas.
  • The multiplication module 302 includes a compression unit and a Wallace tree unit. The compression unit performs Radix-4 Booth compression on the two mantissas, and the Wallace tree unit accumulates the compressed mantissas to obtain the partial products.
  • FIG. 7 shows a schematic diagram of the multiplication module 302, which includes a compression unit 701 and a Wallace tree unit 702.
  • The compression unit 701 is used to perform Booth compression on the two mantissas.
  • The Wallace tree unit 702 is used to accumulate the compressed values to obtain the accumulated result. Since the Wallace tree only supports two's complement (complement code) operations, while the operations of the multiply-accumulate device are all carried out in sign-magnitude form (original code), the Wallace tree unit 702 needs to convert between the two. Specifically, the Wallace tree unit 702 includes a first converter 703, a Wallace tree adder 704 and a second converter 705.
  • The first converter 703 converts the compressed mantissa into two's complement so that the Wallace tree adder 704 can accumulate the operands.
  • The Wallace tree adder 704 accumulates the two's complement values to obtain the accumulation result, that is, the two's complement of the accumulated value.
  • The Wallace tree adder 704 is built from multiple stages of two-input addition units.
  • Figure 8 shows a five-stage Wallace tree adder 704, from the first-stage addition unit 801 through the fifth-stage addition unit 805.
  • Each addition unit adds two operands, so the fifth-stage addition unit 805 produces the cumulative sum of 32 operands.
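The relationship between five stages and 32 operands follows from halving the number of values at each stage (32, 16, 8, 4, 2, 1). A minimal model of such a tree of two-input additions is sketched below; the name `adder_tree` is hypothetical, and padding an odd-sized level with zero is an assumption for inputs that are not a power of two.

```python
def adder_tree(values):
    """Sum values with stages of two-input additions; return (total, stage_count)."""
    level = list(values)
    if not level:
        return 0, 0
    stages = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)           # pad so every adder has two inputs
        # One stage: each addition unit adds a pair of neighbouring values.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages
```

Calling `adder_tree(range(32))` needs exactly five stages, matching the five-stage structure described for 32 operands.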
  • The second converter 705 converts the two's complement of the accumulation result back into sign-magnitude form (original code), completing the partial product of the operands.
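The two conversions performed around the Wallace tree adder can be illustrated on fixed-width words. This sketch assumes that "original code" means sign-magnitude and "complement" means two's complement, which is the usual reading of these terms; the function names and the explicit `width` parameter are hypothetical.

```python
def sign_magnitude_to_twos(word, width):
    """Convert a width-bit sign-magnitude word to two's complement."""
    mask = (1 << width) - 1
    sign = (word >> (width - 1)) & 1
    magnitude = word & (mask >> 1)
    return (-magnitude) & mask if sign else magnitude

def twos_to_sign_magnitude(word, width):
    """Convert a width-bit two's complement word to sign-magnitude."""
    mask = (1 << width) - 1
    sign_bit = 1 << (width - 1)
    if word & sign_bit:
        magnitude = (-word) & mask    # negate to recover the magnitude
        if magnitude == sign_bit:
            raise OverflowError("most negative value has no sign-magnitude form")
        return sign_bit | magnitude
    return word
```

In 8 bits, -5 is `0b10000101` in sign-magnitude and `0b11111011` in two's complement. The most negative two's complement value has no sign-magnitude counterpart of the same width, so the sketch raises an error for it.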
  • The first addition module 303 is used to accumulate the above partial products to obtain the product result of the mantissas of the two operands. At this point, the product of the two operands is obtained.
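How Radix-4 Booth recoding yields partial products that sum to the mantissa product can be sketched in software. The text does not spell out the recoding details, so the version below is the standard textbook formulation for unsigned operands; the function names and the explicit `width` parameter are assumptions for illustration.

```python
def booth_radix4_digits(multiplier, width):
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2..2}, LSB first."""
    padded = multiplier << 1              # implicit 0 below the least significant bit
    n_digits = width // 2 + 1             # one extra digit covers unsigned values
    digits = []
    for i in range(n_digits):
        low = (padded >> (2 * i)) & 1         # bit 2i-1 of the multiplier
        mid = (padded >> (2 * i + 1)) & 1     # bit 2i
        high = (padded >> (2 * i + 2)) & 1    # bit 2i+1
        digits.append(low + mid - 2 * high)
    return digits

def booth_multiply(a, b, width):
    """Return (partial_products, product) of unsigned a and b via Booth recoding."""
    partial_products = [(digit * a) << (2 * i)
                        for i, digit in enumerate(booth_radix4_digits(b, width))]
    # Accumulating the partial products gives the full product.
    return partial_products, sum(partial_products)
```

Recoding roughly halves the number of partial products compared with one per multiplier bit: a 24-bit mantissa yields 13 partial products instead of 24, which is what makes the subsequent accumulation cheaper.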
  • The shift module 304 is used to shift the product result from the first addition module 303 according to the difference between the reference value found by the alignment module 301 and the exponents of the two operands.
  • The shift module 304 includes a plurality of barrel shift units, which are combinational logic circuits with multiple data inputs and multiple data outputs, as well as control inputs that specify how to move the data.
  • The barrel shift units shift the corresponding mantissas according to the difference. The mantissa is first restored and zeros are appended after it. All shifted mantissas have a width of the threshold minus one bit.
  • In the unified alignment scheme, the shift module 304 may include n shift units, each used to shift the mantissa bits of one operand. Since the reference value is the maximum exponent among the n operands, the exponents of all operands are aligned to the reference value and the mantissa bits are shifted accordingly. After shifting, the mantissa bits of all operands are the reference value minus one digit.
  • In the cluster alignment scheme, the shift module 304 may include 32 shift units, each used to shift the mantissa bits of one operand. Since the reference value is the maximum exponent among the 32 operands, the exponents of the operands in the cluster are all aligned to the corresponding reference value, and their mantissa bits are shifted accordingly. If the threshold is set to 32, no exponent difference in the cluster reaches 32, so after shifting, the mantissa bits of all operands are the threshold minus one digit, that is, 31 bits. Each cluster is shifted in this way.
  • When the shift module 304 determines that the bits shifted out of the mantissa are all 0, then under the IEEE 754 round-to-nearest-even principle it sets all the sticky bits of the shifted mantissa to 0; when the shift module 304 determines that the bits shifted out are all 1, it sets all the sticky bits to 1.
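A software view of the alignment shift with sticky information is sketched below. The text only states the all-0 and all-1 cases; treating the sticky flag as 1 whenever any shifted-out bit is 1 is the common convention and is an assumption here, as are the function names.

```python
def shift_right_sticky(mantissa, shift):
    """Right-shift an unsigned mantissa; return (shifted_mantissa, sticky_flag)."""
    if shift <= 0:
        return mantissa, 0
    shifted_out = mantissa & ((1 << shift) - 1)   # the bits that fall off the end
    # Assumption: sticky is the OR of all shifted-out bits.
    return mantissa >> shift, int(shifted_out != 0)

def align_to_reference(mantissa, exponent, reference):
    """Align a mantissa to the reference exponent, as a barrel shift unit would."""
    return shift_right_sticky(mantissa, reference - exponent)
```

A mantissa whose exponent already equals the reference is passed through unchanged with a sticky flag of 0.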
  • The second addition module 305 is used to add the shifted product results to obtain the multiply-accumulate result of the n operands.
  • The structure of the second addition module 305 is also shown in FIG. 7; it includes a compression unit 701 and a Wallace tree unit 702.
  • The compression unit 701 performs Booth compression on the shifted product results.
  • The Wallace tree unit 702 includes a first converter 703, a Wallace tree adder 704 and a second converter 705.
  • The first converter 703 converts the compressed values from sign-magnitude form (original code) into two's complement (complement code) so that the Wallace tree adder 704 can accumulate them.
  • The Wallace tree adder 704 accumulates the two's complement values of the product results to obtain the two's complement of the multiply-accumulate value.
  • The second converter 705 converts the two's complement of the multiply-accumulate result back into sign-magnitude form. This completes the multiplication and accumulation of the operands.
  • When the alignment module 301 adopts cluster alignment, the multiplication module 302, the first addition module 303, the shift module 304 and the second addition module 305 all operate cluster by cluster, and the second addition module 305 then sums the multiply-accumulate results of the clusters to produce the multiply-accumulate result of the n operands.
  • The normalization module 306 normalizes the multiply-accumulate result of the mantissas: by adding to or subtracting from the exponent value, the mantissa is shifted left or right to undo the shift performed by the shift module 304.
  • The rounding module 307 performs a rounding operation on the normalized multiply-accumulate result, such as round to nearest even, round toward zero, round down or round up.
  • The rounding method depends on actual needs.
  • The switching module 308 identifies whether the operands are floating-point or fixed-point numbers, based either on a control signal from the previous stage or on the data format of the operands (for example, whether they have exponent bits), and accordingly turns the alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 on or off. If the operands are floating-point numbers, the switching module 308 sets the alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 to the on mode to perform the operations described above.
  • Fixed-point multiply-accumulate follows the same principle as floating-point multiply-accumulate: both first use the Wallace tree to compute the partial sums and then accumulate all the partial sums. Therefore, the multiplication module 302, the first addition module 303 and the second addition module 305 can be reused for fixed-point multiply-accumulate operations.
  • If the operands are fixed-point numbers, the switching module 308 sets the alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 to the off mode.
  • The specific shutdown method is as follows. Since fixed-point numbers have no exponent bits and need no alignment, the switching module 308 can simply cut off the power of the alignment module 301 so that it does not operate. The shift module 304, the normalization module 306 and the rounding module 307 also have no effect on fixed-point multiply-accumulate operations, but the output of the previous stage must still be passed to the next stage, so their power cannot simply be cut off as with the alignment module 301.
  • One shutdown method is shown in Figure 9.
  • A demultiplexer is configured in front of each of these modules.
  • A first demultiplexer 901 is configured in front of the shift module 304.
  • A second demultiplexer 902 is configured in front of the normalization module 306.
  • A third demultiplexer 903 is configured in front of the rounding module 307.
  • In response to the control signal of the switching module 308, the first demultiplexer 901 decouples the shift module 304, the second demultiplexer 902 decouples the normalization module 306, and the third demultiplexer 903 decouples the rounding module 307, so that these modules are bypassed and the input of each demultiplexer (the output of the previous stage) is passed directly to the next stage.
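The bypass arrangement of Figure 9 can be modelled loosely in software. The following is only an illustrative sketch, not the circuit: the names `bypassable`, `run_pipeline` and `build_stages`, and the toy stage functions standing in for the shift, normalization and rounding modules, are assumptions introduced here.

```python
from typing import Callable, List

Stage = Callable[[int], int]

def bypassable(stage: Stage, enabled: bool) -> Stage:
    """Model a demultiplexer placed in front of a stage: when enabled the
    input goes through the stage, otherwise it is forwarded unchanged."""
    def routed(value: int) -> int:
        return stage(value) if enabled else value
    return routed

def run_pipeline(value: int, stages: List[Stage]) -> int:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        value = stage(value)
    return value

# Hypothetical stand-ins for modules 304, 306 and 307 (not the real logic).
def toy_shift(v: int) -> int:
    return v >> 2

def toy_normalize(v: int) -> int:
    return v << 1

def toy_round(v: int) -> int:
    return v | 1

def build_stages(is_floating_point: bool) -> List[Stage]:
    """The switching signal enables the stages for floating-point operands
    and bypasses them for fixed-point operands."""
    modules = (toy_shift, toy_normalize, toy_round)
    return [bypassable(m, is_floating_point) for m in modules]
```

With `build_stages(False)` every stage forwards its input untouched, which mirrors the fixed-point case in which only the multiplication and addition modules do real work.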
  • After the switching module 308 turns off the alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307, the modules of the multiply-accumulate device of Figure 3 that actually operate when multiplying and accumulating fixed-point numbers are as shown in Figure 10.
  • Only the multiplication module 302, the first addition module 303 and the second addition module 305 are running, operating on the mantissa bits of the operands.
  • the operation mode of the multiplication module 302, the first addition module 303 and the second addition module 305 is no different when the operands are fixed-point numbers and when the operands are floating-point numbers, so no details are given.
  • the output of the second addition module 305 is the multiplication and accumulation result of n fixed-point numbers.
  • This embodiment is based on modules for multiplying and accumulating floating-point numbers, and uses the concept of exponent alignment to make part of the floating-point operations identical to the fixed-point operations. In this way, some modules can be reused to perform multiply-accumulate operations on fixed-point numbers, achieving the technical effect that a single set of hardware covers both floating-point and fixed-point multiply-accumulate operations.
  • This embodiment can at least support multiply-accumulate operations in data formats such as FP32, TF32, FP16, BF16, INT16, INT8, and INT4.
  • FIG 11 shows a schematic diagram of a multiply-accumulate device according to another embodiment of the present invention.
  • This embodiment also has the structure of FIG. 1 and FIG. 2 , and the multiply-accumulate device is also provided in the computing device 201 .
  • The multiply-accumulate device includes an alignment module 301, a multiplication module 302, a first addition module 303, a shift module 304, a second addition module 305, a normalization module 306, a rounding module 307, and a switching module 308.
  • These modules operate in the same way as in the embodiment of Figure 3, so no details will be given.
  • the multiply-accumulate device also includes a rounding module 1101 and a conversion module 1102.
  • the rounding module 1101 is configured between the first addition module 303 and the shift module 304, and is used to perform a rounding operation on the product of the two operands, such as rounding towards an even number, rounding towards zero, rounding down or rounding up.
  • the rounding method depends on actual needs.
  • the shift module 304 shifts the rounded product result.
  • The conversion module 1102 is configured after the rounding module 307 and converts the precision of the multiply-accumulate result. Where possible, the result is converted to a higher precision, so that subsequent modules can perform operations based on higher-precision operands.
  • the conversion process does not involve the sign bit value, that is, the sign bit does not change.
  • Taking FP32 as an example, its 23 mantissa bits can be converted into the 28 mantissa bits of FP37.
  • The conversion method is to set the 23 high-order mantissa bits of FP37 to the value of the 23 mantissa bits of FP32, and to set the remaining 5 low-order mantissa bits of FP37 to 0.
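The FP32-to-FP37 mantissa widening described above can be sketched as plain bit manipulation. This is a minimal sketch under assumptions that the text does not state: FP37 is taken to be packed as sign (bit 36), 8 exponent bits (bits 35 to 28) and 28 mantissa bits (bits 27 to 0), and the 8-bit exponent field is assumed to be copied unchanged. The function name `fp32_bits_to_fp37_bits` is hypothetical.

```python
FP32_MANT_BITS = 23
FP37_MANT_BITS = 28  # per the document: FP37 has 8 exponent bits and 28 mantissa bits

def fp32_bits_to_fp37_bits(fp32: int) -> int:
    """Widen a 32-bit FP32 pattern to an assumed 37-bit FP37 pattern."""
    sign = (fp32 >> 31) & 0x1
    exponent = (fp32 >> FP32_MANT_BITS) & 0xFF           # assumed to carry over as-is
    mantissa = fp32 & ((1 << FP32_MANT_BITS) - 1)
    # The 23 FP32 mantissa bits become the high-order bits; the 5 new low bits are 0.
    mantissa37 = mantissa << (FP37_MANT_BITS - FP32_MANT_BITS)
    return (sign << 36) | (exponent << FP37_MANT_BITS) | mantissa37
```

For example, the FP32 pattern of 1.0 (`0x3F800000`) keeps its exponent field 127 and an all-zero mantissa, now 28 bits wide.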
  • For FP16, the conversion method differs depending on whether the exponent value is 0.
  • When the exponent value of FP16 is not 0: the exponent bits of FP16 are referenced to 0x10 while the exponent bits of FP37 are referenced to 0x80, so 0x70 must be added to the exponent value during conversion. Since the mantissa of FP16 is 10 bits long and the mantissa of FP37 is 28 bits long, the mantissa must be shifted left by 18 bits during conversion, with the remaining low-order mantissa bits set to 0.
  • When the exponent value of FP16 is 0, the number has the form 0.xxxxx. Since the exponent of FP16 can represent values no smaller than 2^-14, the exponent bits cannot be 0 after conversion to FP37. Therefore, in addition to the 18-bit left shift, the mantissa must continue to be shifted left until its highest 1 bit is shifted out (omitted). The exponent value becomes 0x70 minus (the additional left-shift amount - 1). In other words, first locate the highest 1 bit of the FP16 mantissa, shift the mantissa left until that 1 is omitted, and then set the remaining low-order mantissa bits to 0.
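The two FP16 cases can be checked with a short sketch. It assumes, beyond what the text states, that FP37 is packed as sign, 8 exponent bits and 28 mantissa bits with an exponent bias of 127 and an implicit leading 1; all function names are hypothetical. Zero is passed through as zero, and infinities and NaNs are not handled because the text does not discuss them.

```python
import struct

def fp16_bits_to_fp37_bits(h: int) -> int:
    """Convert a 16-bit FP16 pattern to an assumed 37-bit FP37 pattern."""
    sign = (h >> 15) & 0x1
    exp16 = (h >> 10) & 0x1F
    mant16 = h & 0x3FF
    if exp16 != 0:
        exp37 = exp16 + 0x70                  # re-reference the exponent
        mant37 = mant16 << 18                 # 10-bit mantissa -> 28 bits
    elif mant16 == 0:
        exp37, mant37 = 0, 0                  # zero (assumption: stays zero)
    else:
        # Extra left shifts needed to push the highest 1 out of the 10-bit field.
        extra = 10 - mant16.bit_length() + 1
        exp37 = 0x70 - (extra - 1)
        mant37 = ((mant16 << extra) & 0x3FF) << 18
    return (sign << 36) | (exp37 << 28) | mant37

def fp37_bits_to_float(x: int) -> float:
    """Decode the assumed FP37 layout (normal numbers and zero only)."""
    sign = -1.0 if (x >> 36) & 1 else 1.0
    exp = (x >> 28) & 0xFF
    mant = x & ((1 << 28) - 1)
    if exp == 0:
        return sign * 0.0
    return sign * (1 + mant / (1 << 28)) * 2.0 ** (exp - 127)

def fp16_bits_to_float(h: int) -> float:
    """Reference decoding of FP16 using the standard library 'e' format."""
    return struct.unpack("<e", struct.pack("<H", h))[0]
```

Decoding the converted pattern should give the same value as decoding the original FP16 pattern, including for FP16 subnormals such as `0x0001`.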
  • The switching module 308 determines whether the operands are floating-point numbers or fixed-point numbers, either from the control signal of the previous stage or by inspecting the data format of the operands, and accordingly turns the alignment module 301, the shift module 304, the normalization module 306, the rounding module 307, the rounding module 1101 and the conversion module 1102 on or off. If the operands are floating-point numbers, the above modules are switched to the on mode, as shown in Figure 11, and each module performs the operations described above.
  • If the operands are fixed-point numbers, the switching module 308 puts the alignment module 301, the shift module 304, the normalization module 306, the rounding module 307, the rounding module 1101 and the conversion module 1102 in the off mode. The alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 are controlled in the same way as in the previous embodiment.
  • Figure 12 shows the control method of the rounding module 1101 and the conversion module 1102.
  • the fourth demultiplexer 1201 is arranged in front of the rounding module 1101, and the fifth demultiplexer 1202 is arranged in front of the conversion module 1102.
  • In response to the control signal of the switching module 308, the fourth demultiplexer 1201 decouples the rounding module 1101 and the fifth demultiplexer 1202 decouples the conversion module 1102. Therefore, when the multiply-accumulate device of Figure 11 multiplies and accumulates fixed-point numbers, the modules that actually operate are also as shown in Figure 10: only the multiplication module 302, the first addition module 303 and the second addition module 305 are running, operating on the mantissa bits of the operands.
  • This embodiment not only uses the concept of exponent alignment to cover both floating-point and fixed-point multiply-accumulate operations with a single set of hardware, but can also convert the precision of the operation result to facilitate the operations of subsequent modules.
  • This embodiment can at least support multiply-accumulate operations in data formats such as FP32, TF32, FP16, BF16, INT16, INT8, and INT4.
  • Figure 13 shows a flowchart of a method for multiplying and accumulating multiple operands according to another embodiment of the present invention.
  • In step 1301, it is determined whether the operands are floating-point numbers or fixed-point numbers.
  • If the operands are floating-point numbers, step 1302 is executed to identify the reference value of the exponents of the multiple operands.
  • This step can adopt a unified alignment scheme.
  • Unified alignment treats all the operands to be calculated as one unit: the exponents of two operands are compared and the larger one is output.
  • The larger exponent value then enters the next level of comparison, and so on, until the largest of the exponents of the n operands is selected and used as the reference value.
  • This step can also use a cluster alignment scheme. First, the maximum exponent among the n operands is found and used as the reference value, and the operands whose exponents differ from the reference value by no more than a threshold are grouped into one cluster. The exponent values of the operands within a cluster are relatively close, so accumulating them directly does not affect the accuracy. The operands that have not yet been grouped are then grouped in the same manner, and this loop continues until all operands are grouped.
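The cluster alignment loop can be sketched as follows. This is a rough illustration under assumptions: the function name `cluster_by_exponent` and the returned dictionary layout are invented here, and "within the threshold" is interpreted as a difference less than or equal to the threshold.

```python
from typing import Dict, List

def cluster_by_exponent(exponents: List[int], threshold: int) -> List[Dict]:
    """Repeatedly take the largest remaining exponent as the reference value
    and group every remaining operand whose exponent is within `threshold`."""
    remaining = list(range(len(exponents)))   # indices of ungrouped operands
    clusters = []
    while remaining:
        reference = max(exponents[i] for i in remaining)
        members = [i for i in remaining if reference - exponents[i] <= threshold]
        clusters.append({"reference": reference, "members": members})
        remaining = [i for i in remaining if i not in members]
    return clusters
```

Unified alignment is the special case of a single cluster whose reference value is simply `max(exponents)`.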
  • In step 1303, a multiplication operation is performed on the mantissas of two operands to be multiplied among the plurality of operands, to obtain the partial products of the mantissas. Specifically, this step first performs Radix-4 Booth compression on the two mantissas, and then accumulates the compressed mantissas to obtain the partial products.
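To illustrate how Radix-4 Booth encoding reduces the number of partial products, here is a minimal sketch of the textbook recoding for non-negative mantissas. It is not taken from the patent's circuit; the function names are hypothetical, and an ordinary sum stands in for the Wallace tree.

```python
from typing import List

def booth_radix4_digits(multiplier: int) -> List[int]:
    """Recode a non-negative multiplier into radix-4 Booth digits
    in {-2, -1, 0, 1, 2}, least significant digit first."""
    if multiplier < 0:
        raise ValueError("this sketch handles non-negative mantissas only")
    bits = multiplier.bit_length() + 1    # one leading 0 keeps the value non-negative
    bits += bits % 2                      # each digit consumes two bits
    y = multiplier << 1                   # append the implicit bit y[-1] = 0
    digits = []
    for k in range(bits // 2):
        triplet = (y >> (2 * k)) & 0b111  # bits y[2k+1], y[2k], y[2k-1]
        low, mid, high = triplet & 1, (triplet >> 1) & 1, (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits

def booth_partial_products(multiplicand: int, multiplier: int) -> List[int]:
    """One partial product per Booth digit, already weighted by 4**k."""
    return [(d * multiplicand) << (2 * k)
            for k, d in enumerate(booth_radix4_digits(multiplier))]

def booth_multiply(multiplicand: int, multiplier: int) -> int:
    """Accumulate the partial products (a plain sum replaces the Wallace tree)."""
    return sum(booth_partial_products(multiplicand, multiplier))
```

An n-bit multiplier yields roughly n/2 partial products instead of n, which is why this encoding is commonly placed in front of a Wallace tree.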
  • In step 1304, the above-mentioned partial products are accumulated to obtain the product result of the mantissas of the two operands. At this point, the product of the two operands is obtained.
  • In step 1305, the product result is shifted according to the difference between the reference value and the exponents of the two operands.
  • If step 1302 adopts the unified alignment method, then in this step the exponent bits of all operands are aligned with the reference value and the mantissa bits are shifted accordingly. After shifting, the mantissa bits of all operands are the reference value minus one digit.
  • If the cluster alignment method is used in step 1302, then, since the reference value is the maximum exponent of the operands in the same cluster, the exponent bits of the operands in the cluster are all aligned to the corresponding reference value and the mantissa bits are shifted accordingly. Each cluster is shifted in this way.
  • In step 1306, an addition operation is performed on the shifted product results to obtain the multiply-accumulate result of the n operands.
  • Specifically, step 1306 first converts the shifted mantissas into complement form, performs Booth compression on the complements of the shifted product results, then accumulates the compressed product results to obtain the multiply-accumulate result, and finally converts the complement of the multiply-accumulate result into the original code of the multiply-accumulate result.
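The conversion between original code (sign and magnitude) and complement form around the accumulation can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, a fixed word width is assumed, and a plain modular addition replaces the Booth compression and Wallace tree.

```python
from typing import Iterable, Tuple

def to_twos_complement(sign: int, magnitude: int, width: int) -> int:
    """Encode (sign, magnitude) as a `width`-bit two's-complement word."""
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)

def from_twos_complement(word: int, width: int) -> Tuple[int, int]:
    """Decode a `width`-bit two's-complement word back to (sign, magnitude)."""
    if word & (1 << (width - 1)):
        return 1, (1 << width) - word
    return 0, word

def accumulate(terms: Iterable[Tuple[int, int]], width: int) -> Tuple[int, int]:
    """Add sign-magnitude terms in two's-complement form, then convert back."""
    mask = (1 << width) - 1
    total = 0
    for sign, magnitude in terms:
        total = (total + to_twos_complement(sign, magnitude, width)) & mask
    return from_twos_complement(total, width)
```

Working in complement form lets positive and negative products share the same adder, so no separate subtraction path is needed; the width must be large enough that the true sum does not overflow.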
  • When cluster alignment is used in step 1302, steps 1303 to 1306 are performed cluster by cluster. Finally, in step 1306, the multiply-accumulate results of all clusters are summed to give the multiply-accumulate result of the n operands.
  • In step 1307, the multiply-accumulate result of the mantissas is normalized. By adding to or subtracting from the exponent value, the mantissa is shifted left or right to undo the shift of step 1305.
  • In step 1308, a rounding operation is performed on the normalized multiply-accumulate result, for example rounding to the nearest even number, rounding toward zero, rounding down, or rounding up.
  • The rounding method depends on actual needs. At this point, the result of the multiply-accumulate operation of floating-point numbers is obtained.
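The floating-point path (steps 1302 to 1307) can be approximated in a few lines under simplifying assumptions: mantissas are truncated to 24 bits, unified alignment is used, the reference value is taken as the largest exponent among the nonzero products (the text does not spell out how the reference is formed for a pair of operands), and `math.ldexp` stands in for normalization. All names are hypothetical and rounding modes are not modelled.

```python
import math
from typing import List, Tuple

MANT_BITS = 24  # assumed working mantissa width

def decompose(x: float) -> Tuple[int, int, int]:
    """Return (sign, mantissa, exponent) with |x| ~= mantissa * 2**exponent."""
    m, e = math.frexp(abs(x))                 # |x| == m * 2**e, 0.5 <= m < 1
    mantissa = int(m * (1 << MANT_BITS))      # truncate to MANT_BITS bits
    return (1 if x < 0 else 0), mantissa, e - MANT_BITS

def multiply_accumulate(a: List[float], b: List[float]) -> float:
    products = []
    for x, y in zip(a, b):
        sx, mx, ex = decompose(x)
        sy, my, ey = decompose(y)
        # Multiply mantissas, add exponents, XOR signs.
        products.append((sx ^ sy, mx * my, ex + ey))
    # Unified alignment: use the largest exponent of the nonzero products.
    reference = max((e for _, m, e in products if m), default=0)
    total = 0
    for sign, mantissa, exponent in products:
        aligned = mantissa >> (reference - exponent) if mantissa else 0
        total += -aligned if sign else aligned
    # Re-attach the shared exponent (stands in for normalization).
    return math.ldexp(total, reference)
```

Because every product is right-shifted to the shared reference exponent, the accumulation itself is a pure integer addition, which is the part that can be shared with fixed-point operands.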
  • If the operands are fixed-point numbers, step 1309 is executed to perform multiplication on the mantissas of two operands to be multiplied among the multiple operands, to obtain the partial products of the mantissas. Specifically, this step first performs Radix-4 Booth compression on the two mantissas, and then accumulates the compressed mantissas to obtain the partial products.
  • In step 1310, the above-mentioned partial products are accumulated to obtain the product result of the mantissas of the two operands. At this point, the product of the two operands is obtained.
  • In step 1311, an addition operation is performed on the product results to obtain the multiply-accumulate result of the n operands.
  • Specifically, this step first converts the mantissas into complement form, performs Booth compression on the complements of the product results, then accumulates the compressed product results to obtain the multiply-accumulate result, and finally converts the complement of the multiply-accumulate result into the original code of the multiply-accumulate result.
  • At this point, the result of the multiply-accumulate operation of fixed-point numbers is obtained.
  • Step 1303 corresponds to step 1309, step 1304 corresponds to step 1310, and step 1306 corresponds to step 1311.
  • This embodiment is based on the process of multiplying and accumulating floating-point numbers, and uses the concept of exponent alignment to make part of the floating-point operations identical to the fixed-point operations. In this way, some of the steps can be reused to perform multiply-accumulate operations on fixed-point numbers, achieving the technical effect of simplifying multiply-accumulate operations.
  • Figure 14 shows a flowchart of a method for multiplying and accumulating multiple operands according to another embodiment of the present invention.
  • In step 1401, it is determined whether the operands are floating-point numbers or fixed-point numbers.
  • If the operands are floating-point numbers, step 1402 is executed to identify the reference value of the exponents of the multiple operands.
  • This step can likewise adopt either the unified alignment scheme or the cluster alignment scheme.
  • In step 1403, a multiplication operation is performed on the mantissas of two operands to be multiplied among the plurality of operands, to obtain the partial products of the mantissas.
  • In step 1404, the above-mentioned partial products are accumulated to obtain the product result of the mantissas of the two operands. At this point, the product of the two operands is obtained.
  • In step 1405, a rounding operation is performed on the product of the two operands, such as rounding to the nearest even number, rounding toward zero, rounding down, or rounding up.
  • The rounding method depends on actual needs.
  • In step 1406, the product result is shifted according to the difference between the reference value and the exponents of the two operands.
  • In step 1407, an addition operation is performed on the shifted product results to obtain the multiply-accumulate result of the n operands.
  • When cluster alignment is used in step 1402, steps 1403 to 1407 are performed cluster by cluster.
  • Finally, in step 1407, the multiply-accumulate results of all clusters are summed to give the multiply-accumulate result of the n operands.
  • In step 1408, the multiply-accumulate result of the mantissas is normalized.
  • In step 1409, rounding is performed on the normalized multiply-accumulate result.
  • In step 1410, the precision of the multiply-accumulate result is converted, for example raised to FP37, so that subsequent modules can perform operations based on higher-precision operands. At this point, the result of the multiply-accumulate operation of floating-point numbers is obtained.
  • If the operands are fixed-point numbers, step 1411 is executed to perform multiplication on the mantissas of two operands to be multiplied among the multiple operands, to obtain the partial products of the mantissas.
  • In step 1412, the above-mentioned partial products are accumulated to obtain the product result of the mantissas of the two operands.
  • In step 1413, an addition operation is performed on the product results to obtain the multiply-accumulate result of the n operands. At this point, the result of the multiply-accumulate operation of fixed-point numbers is obtained.
  • Step 1403 corresponds to step 1411, step 1404 corresponds to step 1412, and step 1407 corresponds to step 1413.
  • This embodiment is based on the process of multiplying and accumulating floating-point numbers, and uses the concept of exponent alignment to make part of the floating-point operations identical to the fixed-point operations. In this way, some of the steps can be reused to perform multiply-accumulate operations on fixed-point numbers, achieving the technical effect of simplifying multiply-accumulate operations. This embodiment can also convert the precision of the operation result to facilitate subsequent operations.
  • Another embodiment of the present invention is a computer-readable storage medium on which is stored a computer program code for a method of multiplying and accumulating multiple operands.
  • When the computer program code is run by a processor, the method of the embodiment shown in Figure 13 or Figure 14 is executed.
  • Another embodiment of the present invention is a computer program product, including a computer program for multiplying and accumulating multiple operands.
  • When the computer program is executed by a processor, the steps of the method shown in Figure 13 or Figure 14 are implemented.
  • Another embodiment of the present invention is a computer device, including a memory, a processor, and a computer program stored on the memory.
  • the processor executes the computer program to implement the steps of the method shown in FIG. 13 or FIG. 14 .
  • The present invention is based on the multiply-accumulate operation of floating-point numbers and uses the concept of exponent alignment to make part of the floating-point operations identical to the fixed-point operations. In this way, the shared operations can be reused to perform multiply-accumulate operations on fixed-point numbers, achieving the technical effect of covering both floating-point and fixed-point multiply-accumulate operations.
  • Depending on the application scenario, the electronic equipment or device of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transportation, household appliances, and/or medical equipment.
  • the means of transportation include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance machines, B-ultrasound and/or electrocardiograph.
  • The electronic equipment or device of the present invention can also be applied to the Internet, Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care and other fields. Furthermore, the electronic equipment or device of the present invention can also be used in cloud, edge, terminal and other application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the solution of the present invention can be applied to cloud equipment (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal equipment and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that the hardware resources of the cloud device can be obtained based on the hardware information of the terminal device and/or the edge device.
  • For brevity, the present invention describes some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solution of the present invention is not limited by the sequence of the described actions. Therefore, based on the disclosure or teaching of the present invention, those skilled in the art can understand that certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, that is, the actions or modules involved are not necessarily required for the implementation of one or some solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments of the present invention have different emphases. In view of this, for parts that are not described in detail in a certain embodiment of the present invention, those skilled in the art can refer to the relevant descriptions of other embodiments.
  • units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units.
  • the aforementioned components or units may be co-located or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in the embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
  • the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
  • The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • The aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), for example a resistive random access memory (Resistive Random Access Memory, RRAM), a dynamic random access memory (Dynamic Random Access Memory, DRAM), a static random access memory (Static Random Access Memory, SRAM), an enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), a hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
  • Clause A1 A device for multiplying and accumulating a plurality of operands, comprising: an alignment module for identifying a reference value of the exponents of the plurality of operands; a multiplication module for multiplying the mantissas of two operands to be multiplied among the plurality of operands, to obtain partial products of the mantissas; a first addition module for accumulating the partial products to obtain the product result of the mantissas of the two operands; a shift module for shifting the product result according to the difference between the reference value and the exponents of the two operands; and a second addition module for adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
  • Clause A2 The device according to Clause A1, further comprising a switching module for determining whether the plurality of operands are floating point numbers or fixed point numbers.
  • Clause A3 The apparatus according to Clause A2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the alignment module to turn off.
  • Clause A4 The apparatus according to Clause A2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the shift module to turn off.
  • Clause A5. The apparatus of clause A4, further comprising a first demultiplexer responsive to a control signal of the switching module to decouple the shift module.
  • Clause A6 The device according to Clause A2, further comprising a normalization module for normalizing the multiply-accumulate result.
  • Clause A7 The apparatus of Clause A6, further comprising a second demultiplexer configured to decouple the normalization module in response to the control signal of the switching module when the plurality of operands are fixed-point numbers.
  • Clause A8 The device according to Clause A6, further comprising a rounding module for rounding the normalized multiply-accumulate result.
  • Clause A9 The apparatus of Clause A8, further comprising a third demultiplexer configured to decouple the rounding module in response to the control signal of the switching module when the plurality of operands are fixed-point numbers.
  • Clause A10 The apparatus according to Clause A2, further comprising a rounding module for rounding the product result, wherein the shifting module shifts the rounded product result.
  • Clause A11 The apparatus of clause A10, further comprising a fourth demultiplexer responsive to a control signal of the switching module to decouple the rounding module when the plurality of operands are fixed-point numbers.
  • the multiplication module includes: a compression unit, used to perform Booth compression on the mantissas; a Wallace tree unit, used to accumulate the compressed mantissas to obtain the partial products.
  • Clause A13 The device according to Clause A1, wherein the second addition module includes: a compression unit, used to perform Booth compression on the shifted product results; a Wallace tree unit, used to accumulate the compressed product results to obtain the multiply-accumulate result.
  • Clause A15 The apparatus of Clause A1, wherein the alignment module clusters the plurality of operands, the reference value is the maximum value of the exponents of the operands in each cluster, and the multiplication module, the first addition module and the shift module perform operations cluster by cluster.
  • Clause A16 The device according to Clause A1, further comprising: a conversion module for converting the precision of the multiply-accumulate result.
  • Clause A17 The device of Clause A16, wherein the converted precision is FP37, the exponent bits of FP37 are 8 bits, and the mantissa bits of FP37 are 28 bits.
  • Clause A20 A method of multiplying and accumulating a plurality of operands, comprising: identifying a reference value of the exponents of the plurality of operands; multiplying the mantissas of two operands to be multiplied among the plurality of operands, to obtain partial products of the mantissas; accumulating the partial products to obtain the product result of the mantissas of the two operands; shifting the product result according to the difference between the reference value and the exponents of the two operands; and adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
  • Clause A21 The method according to Clause A20, further comprising: normalizing the multiply-accumulate result; and performing a rounding operation on the normalized multiply-accumulate result.
  • Clause A22 The method according to Clause A20, further comprising: determining whether the plurality of operands are floating-point numbers or fixed-point numbers; wherein, when the plurality of operands are fixed-point numbers, the multiplication step, the accumulation step and the addition step are performed.
  • Clause A23 The method of Clause A20, further comprising: converting the precision of the multiply-accumulate result.
  • Clause A24 The method of Clause A23, wherein the converted precision is FP37, the exponent bits of FP37 are 8 bits, and the mantissa bits of FP37 are 28 bits.
  • Clause A25 A computer-readable storage medium having stored thereon computer program code for a method of multiplying and accumulating a plurality of operands, wherein the computer program code, when run by a processing device, performs the method of any one of Clauses A20 to A24.
  • Clause A26 A computer program product comprising a computer program for multiplying and accumulating a plurality of operands, which when executed by a processor implements the steps of the method of any one of Clauses A20 to A24.
  • a computer device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the method of any one of Clauses A20 to A24.

Abstract

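A method and device for multiplying and accumulating multiple operands, in which a computing device is included in an integrated circuit device, and the integrated circuit device includes a universal interconnection interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation specified by the user. The integrated circuit device may further include a storage device, which is connected to the computing device and to the other processing devices respectively and is used for data storage of the computing device and the other processing devices.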

Description

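Method and device for multiplying and accumulating operands
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210622935.8, filed on June 1, 2022 and entitled "Method and device for multiplying and accumulating operands".
Technical field
The present invention relates generally to the field of computers. More specifically, the present invention relates to a device for multiplying and accumulating multiple operands, its integrated circuit device and board card, as well as a method of multiplying and accumulating multiple operands and its computer-readable storage medium, computer program product, and computer device.
Background
The computer field frequently requires multiply-accumulate operations on floating-point numbers. In deep learning algorithms in particular, matrix multiply-accumulate accounts for more than 90% of the operations. Multiply-accumulate means multiplying a series of operands pairwise and then summing the products, for example:
$z = a_0 \cdot b_0 + a_1 \cdot b_1 + a_2 \cdot b_2 + a_3 \cdot b_3$
where a and b are operands.
Multiply-accumulate operations are used extremely widely, so computers often have to handle multiply-accumulate operations in many data formats, such as the floating-point formats FP16 and FP32 and the fixed-point formats INT4 and INT8. Because floating-point and fixed-point data formats differ, a computer usually has to be equipped with two sets of hardware to handle floating-point and fixed-point multiply-accumulate operations separately, which wastes resources and is inefficient.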
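Summary of the invention
In order to at least partially solve the technical problems mentioned in the background, the solution of the present invention provides a device for multiplying and accumulating multiple operands, its integrated circuit device and board card, as well as a method of multiplying and accumulating multiple operands and its computer-readable storage medium, computer program product, and computer device.
In one aspect, the present invention discloses a device for multiplying and accumulating multiple operands, including an alignment module, a multiplication module, a first addition module, a shift module and a second addition module. The alignment module identifies a reference value of the exponents of the multiple operands; the multiplication module multiplies the mantissas of two operands to be multiplied among the multiple operands to obtain partial products of the mantissas; the first addition module accumulates the partial products to obtain the product result of the mantissas of the two operands; the shift module shifts the product result according to the difference between the reference value and the exponents of the two operands; the second addition module adds the shifted product results to obtain the multiply-accumulate result of the multiple operands.
In another aspect, the present invention discloses an integrated circuit device including the above device, and discloses a board card including the above integrated circuit device.
In another aspect, the present invention discloses a method of multiplying and accumulating multiple operands, including: identifying a reference value of the exponents of the multiple operands; multiplying the mantissas of two operands to be multiplied among the multiple operands to obtain partial products of the mantissas; accumulating the partial products to obtain the product result of the mantissas of the two operands; shifting the product result according to the difference between the reference value and the exponents of the two operands; and adding the shifted product results to obtain the multiply-accumulate result of the multiple operands.
In another aspect, the present invention discloses a computer-readable storage medium on which is stored computer program code for a method of multiplying and accumulating multiple operands; when the computer program code is run by a processing device, the above method is executed.
In another aspect, the present invention discloses a computer program product, including a computer program for multiplying and accumulating multiple operands; when the computer program is executed by a processor, the steps of the above method are implemented.
In another aspect, the present invention discloses a computer device, including a memory, a processor, and a computer program stored on the memory; the processor executes the computer program to implement the steps of the above method.
The present invention is based on the multiply-accumulate operation of floating-point numbers and uses the concept of exponent alignment to make part of the floating-point operations identical to the fixed-point operations. In this way, the shared operations can be reused to perform multiply-accumulate operations on fixed-point numbers, achieving the technical effect of covering both floating-point and fixed-point multiply-accumulate operations.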
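Brief description of the drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. The drawings show several embodiments of the present invention by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts. In the drawings:
Fig. 1 is a structural diagram of a board card according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an integrated circuit device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a multiply-accumulate device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the exponent alignment module using unified alignment according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the exponent alignment module using the cluster alignment scheme according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the screening module according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the multiplication module and the second addition module according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of the Wallace tree adder according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of the switching module controlling the other modules according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of an embodiment of the present invention when multiply-accumulating fixed-point numbers;
Fig. 11 is a schematic diagram of a multiply-accumulate device according to another embodiment of the present invention;
Fig. 12 is a schematic diagram of the switching module controlling the rounding module and the conversion module according to another embodiment of the present invention;
Fig. 13 is a flowchart of a method for multiply-accumulating a plurality of operands according to another embodiment of the present invention;
Fig. 14 is a flowchart of a method for multiply-accumulating a plurality of operands according to another embodiment of the present invention.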
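Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth" in the claims, description and drawings of the present invention are used to distinguish different objects, not to describe a particular order. The terms "comprise" and "include" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this description is solely for the purpose of describing particular embodiments and is not intended to limit the present invention. As used in the description and claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes those combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on context, as "when", "once", "in response to determining" or "in response to detecting".
Specific embodiments of the present invention are described in detail below with reference to the drawings.
Numerical values are represented in a computer in two forms: fixed-point numbers and floating-point numbers.
The data format of a fixed-point number consists of a 1-bit sign and a multi-bit mantissa. The sign bit determines whether the number is positive or negative, and the mantissa bits determine its magnitude. A fixed-point number is a number whose radix point sits at a fixed position. Fixed-point numbers are divided into fixed-point integers and fixed-point fractions; because the position of the radix point is fixed, the point need not be stored and the value is computed according to the agreed position. Computers usually represent fixed-point numbers as pure fractions or pure integers. For a pure fraction, the radix point is by default located between the sign bit and the most significant mantissa bit; for a pure integer, it is by default located to the right of the least significant mantissa bit.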
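The formats used to express floating-point numbers in a computer are specified in IEEE 754. Take the 32-bit single-precision floating-point number (FP32) as an example: it consists of 1 sign bit, 8 exponent bits (exp) and 23 mantissa bits, and represents the following value:
$\text{value} = \text{sign} \times \text{mantissa} \times 2^{\text{exp} - 127}$
The 8 exponent bits can represent the range 0 to 255, which would make the exponent very large, so IEEE 754 specifies an exponent bias of 127. This shifts the exponent range to between -127 and 128, which is more reasonable. IEEE 754 further stipulates an implicit bit to the left of the radix point, which is normally 1, so the single-precision mantissa actually has 24 bits.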
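To make the FP32 field layout concrete, the following Python sketch splits a value into its sign, biased exponent and 23-bit fraction and rebuilds the value from them. It is a minimal illustration that covers normal numbers only; the helper names `decode_fp32` and `fp32_value` are assumptions introduced here.

```python
import struct

def decode_fp32(x):
    """Return (sign, biased_exponent, fraction) of x encoded as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF      # 8 exponent bits, biased by 127
    frac = bits & 0x7FFFFF         # 23 stored mantissa bits
    return sign, exp, frac

def fp32_value(sign, exp, frac):
    """Rebuild the value of a normal number (0 < exp < 255) from its fields."""
    significand = 1 + frac / 2**23  # the implicit leading 1 gives 24 significant bits
    return (-1) ** sign * significand * 2.0 ** (exp - 127)
```

Zero, subnormal, infinity and NaN encodings are deliberately left out of `fp32_value` to keep the sketch short.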
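Because this representation limits the range and precision of floating-point numbers, values can only be represented approximately in computation, and rounding has to be considered. In decimal, suppose two fractional digits are to be kept, that is, the tenths and hundredths digits. The kept digit is then the lowest retained digit, the hundredths digit; the round digit is the first discarded digit, the thousandths digit; and all digits after the thousandths digit are together called the sticky bits, whose information is entirely lost. In binary, if two fractional bits are to be kept, the second bit to the right of the radix point is the kept bit, the third bit is the round bit, and all bits from the fourth onward are the sticky bits.
For this purpose IEEE 754 defines four rounding modes: round to nearest even, round toward zero, round down and round up. Round to nearest even is the IEEE 754 default.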
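The roles of the kept bit, round bit and sticky bits can be sketched in Python for round to nearest even. This is a minimal illustration on non-negative integers, not the circuit of the embodiment; the function name `round_nearest_even` and its parameters are assumptions.

```python
def round_nearest_even(value, drop):
    """Discard the low `drop` bits of a non-negative integer, rounding to nearest even."""
    if drop <= 0:
        return value
    kept = value >> drop                                # retained bits; lowest one is the kept bit
    round_bit = (value >> (drop - 1)) & 1               # first discarded bit
    sticky = (value & ((1 << (drop - 1)) - 1)) != 0     # OR of all remaining discarded bits
    # Round up when above the halfway point, or exactly halfway with an odd kept bit.
    if round_bit and (sticky or (kept & 1)):
        kept += 1
    return kept
```

With two bits dropped, `0b1010` (2.5 in units of the kept bit) stays at 2, while `0b1110` (3.5) rounds up to 4, so ties always land on an even result.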
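Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in Fig. 1, the board card 10 includes a chip 101, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing and data mining. Deep learning in particular is widely applied in cloud intelligence, where a notable characteristic is a large volume of input data, which places high demands on the storage and computing capacity of the platform. The board card 10 of this embodiment is suited to cloud intelligence applications, with large off-chip storage, on-chip storage and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface device 102, and the computation results of the chip 101 can be sent back to the external device 103 through the external interface device 102. Depending on the application scenario, the external interface device 102 may take different interface forms, such as a PCIe interface.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, a control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203 and an off-chip memory 204.
The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation, and it can interact with the processing device 203 through the interface device 202 to jointly complete user-specified operations.
The interface device 202 transfers data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 can obtain input data from the processing device 203 via the interface device 202 and write it to the on-chip storage of the computing device 201. Further, the computing device 201 can obtain control instructions from the processing device 203 via the interface device 202 and write them to an on-chip control cache of the computing device 201. Alternatively or optionally, the interface device 202 can also read data from the storage of the computing device 201 and transfer it to the processing device 203.
As a general-purpose processing device, the processing device 203 performs basic control including but not limited to data movement and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor among a central processing unit (CPU), a graphics processing unit (GPU) or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components and the like, and their number may be determined according to actual needs. As mentioned above, the computing device 201 of the present invention considered alone may be regarded as having a single-core structure or a homogeneous multi-core structure. When the computing device 201 and the processing device 203 are considered together, however, the two are regarded as forming a heterogeneous multi-core structure.
The off-chip memory 204 stores the data to be processed. It is DDR memory, typically 16 GB or larger, and holds data of the computing device 201 and/or the processing device 203.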
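The computing device 201 of this embodiment includes a device for multiply-accumulating a plurality of operands. It uses the concept of exponent alignment so that, after alignment, part of the floating-point computation is identical to fixed-point computation and the hardware for the identical operations can be reused. The multiply-accumulate device of this embodiment can thus effectively cover both floating-point and fixed-point multiply-accumulate operations.
Fig. 3 shows a schematic diagram of the multiply-accumulate device of this embodiment. The multiply-accumulate device of the computing device 201 includes an exponent alignment module 301, a multiplication module 302, a first addition module 303, a shift module 304, a second addition module 305, a normalization module 306, a rounding module 307 and a switching module 308.
The multiply-accumulate operation on n floating-point operands is described first.
The exponent alignment module 301 identifies a reference value of the exponents of the plurality of operands, so that the other operands can be aligned to the reference value to facilitate the subsequent accumulation. This embodiment does not limit the alignment method; alignment schemes of the exponent alignment module 301 are described below by way of example.
The exponent alignment module 301 may use a unified alignment scheme. Fig. 4 shows a schematic diagram of the exponent alignment module 301 using unified alignment. The module has a binary tree structure whose basic unit is the comparison unit 401, an existing comparator that compares the exponents of two operands. The comparison unit 401 receives and compares the values of the exponent fields of two operands (for example the exponent value $\mathrm{exp}_1$ of operand 1 and the exponent value $\mathrm{exp}_2$ of operand 2) and outputs the larger one, $\max(\mathrm{exp}_1, \mathrm{exp}_2)$. The larger exponent values then enter the next level of comparison units 401 to be compared pairwise, until the top-level comparison unit 401 selects the largest exponent of the n operands, $\max(\mathrm{exp}_1, \mathrm{exp}_2, \ldots, \mathrm{exp}_n)$. The purpose of the unified alignment scheme is to find the largest exponent of all operands as the reference value, which the shift module 304 uses as the reference for alignment shifts.
The present invention does not limit the structure of the comparison unit 401, and a person skilled in the art may vary it as appropriate, for example by providing an eight-to-one comparator that compares operands in groups of 8 and finds the largest exponent of the n operands group by group.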
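The unified alignment scheme can be modeled in software as a level-by-level pairwise maximum. The sketch below is an illustration under that reading; the function name `max_exponent_tree` is an assumption, and the handling of an odd element at a level (passing it through unchanged) is a choice made here rather than something the text specifies.

```python
def max_exponent_tree(exps):
    """Find the largest exponent with a binary tree of two-input comparisons."""
    level = list(exps)
    if not level:
        raise ValueError("at least one exponent is required")
    while len(level) > 1:
        # Each comparison unit outputs the larger of two neighbouring exponents.
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired exponent passes to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

The result equals Python's built-in `max`; the loop only makes the tree levels of Fig. 4 explicit.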
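The exponent alignment module 301 may also use a cluster alignment scheme. The module first finds the largest exponent among the n operands and uses it as the reference value to screen out a cluster; the exponent values of the operands within a cluster are close to one another, so accumulating them directly does not affect precision. After a cluster has been screened out, the exponent alignment module 301 continues to group the operands that have not yet been clustered in the same way, repeating until all operands have been classified.
Fig. 5 shows a schematic diagram of the exponent alignment module 301 when this embodiment uses the cluster alignment scheme. The exponent alignment module 301 includes an identification module 501, a screening module 502 and a cluster module 503.
The identification module 501 has a structure similar to Fig. 4 and identifies the reference value, which is the largest exponent among all operands that have not yet been clustered. Initially none of the n operands has been clustered, so the identification module 501 finds the largest exponent of all operands and sets it as the reference value. The screening module 502 compares this reference value with the exponent values of all operands; an operand whose difference lies within a certain range is screened into the cluster (for example the first cluster). The cluster module 503 excludes the operands in the first cluster and collects the operands that have not yet been clustered. The identification module 501, the screening module 502 and the cluster module 503 then find a reference value, build a cluster (for example the second cluster) and update the unclustered operands again on the basis of the updated set, until all operands have been clustered. The n operands are therefore divided into several clusters (the first to the m-th cluster). Each cluster has a different reference value, and the difference between the exponent value of every operand in a cluster and the corresponding reference value lies within a certain range; in other words, the exponent values within each cluster are close.
Fig. 6 shows a schematic diagram of the screening module 502, which includes a subtractor 601, a comparator 602, a first register 603 and a second register 604.
Based on the result of the identification module 501 (the reference value), the subtractor 601 obtains the difference between each exponent value and the reference value. The subtractor 601 can be implemented in several ways. For example, it may have multiple subtraction units that receive the exponent values of several unclustered operands together with the reference value at once and subtract them to obtain the difference of each exponent value from the reference value; or it may take one unclustered operand at a time and subtract it from the reference value until all unclustered operands have been processed. The present invention does not limit the implementation of the subtractor 601.
The comparator 602 receives the differences from the subtractor 601 and determines whether each difference is smaller than a threshold, that is, whether the gap between each operand's exponent value and the reference value lies within the threshold. The threshold can be any number, for example 32. An operand whose difference is smaller than the threshold is sent to the first register 603 for storage, and an operand whose difference is not smaller than the threshold is sent to the second register 604. In other words, the comparator 602 divides the operands into two classes according to their exponent values: the first register 603 stores the operands whose difference is smaller than the threshold, and the second register 604 stores the operands whose difference is not smaller than the threshold.
Because the operands stored in the first register 603 have exponent differences smaller than the threshold, their exponent values are not far from the reference value, and during the subsequent accumulation the mantissas will not lose too much precision when shifted because of widely differing exponents. All operands in the first register 603 form one cluster, ready to be processed together by the subsequent modules. In each round of grouping the first register 603 produces a different cluster, for example the first to the m-th cluster. The operands stored in the second register 604 have exponent values too far from the reference value, and accumulating them would lose too much precision when the mantissas are shifted, so they are sent back to the cluster module 503 and the identification module 501 to be grouped again.
The cluster module 503 updates the unclustered operands to be the operands in the second register 604. That is, the cluster module 503 overwrites the original operands with those in the second register 604, so that after the update the unclustered operands are no longer all operands but only those in the second register 604. The cluster module 503 sends the updated unclustered operands to the identification module 501, which identifies the largest exponent among them, and the screening module 502 then screens out the operands whose exponent values differ from the reference value by less than the threshold as the next cluster.
In summary, when this embodiment uses the cluster alignment scheme, the exponent alignment module 301 divides the n operands into several clusters according to their exponent values.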
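The grouping loop of the identification, screening and cluster modules can be sketched as follows. This is a software illustration only; the function name `cluster_by_exponent`, the use of operand indices and the default threshold of 32 (the example value given above) are assumptions.

```python
def cluster_by_exponent(exps, threshold=32):
    """Group operand indices into clusters whose exponents lie within `threshold` of the cluster maximum.

    Returns a list of (reference_value, member_indices) pairs.
    """
    remaining = list(range(len(exps)))   # operands not yet clustered
    clusters = []
    while remaining:
        ref = max(exps[i] for i in remaining)              # identification: largest exponent
        members = [i for i in remaining if ref - exps[i] < threshold]    # "first register"
        remaining = [i for i in remaining if ref - exps[i] >= threshold] # "second register"
        clusters.append((ref, members))
    return clusters
```

Every round removes at least the operand that supplied the reference value, so the loop always terminates.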
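Returning to Fig. 3, the multiplication module 302 multiplies the mantissas of two operands that are to be multiplied among the plurality of operands (for example the mantissa $\mathrm{man}_1$ of operand 1 and the mantissa $\mathrm{man}_2$ of operand 2) to obtain partial products of the mantissas. Specifically, the multiplication module 302 includes a compression unit and a Wallace tree unit: the compression unit performs Radix-4 Booth compression on the two mantissas, and the Wallace tree unit accumulates the compressed mantissas to obtain the partial products.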
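Radix-4 Booth recoding can be illustrated in Python: the multiplier is rewritten as digits in {-2, -1, 0, 1, 2}, which roughly halves the number of partial products. This is a textbook sketch for unsigned mantissas, not the compression unit of the embodiment; the function names and the `bits` parameter are assumptions.

```python
def booth_radix4_digits(y, bits):
    """Recode a non-negative `bits`-bit multiplier into radix-4 Booth digits (least significant first)."""
    width = bits + 1            # add a leading 0 so the value reads as positive
    if width % 2:
        width += 1              # the digit scan needs an even number of bits
    digits, prev = [], 0
    for i in range(0, width, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # digit = y[2i] + y[2i-1] - 2*y[2i+1]
        prev = b1
    return digits

def booth_partial_products(x, y, bits):
    """Return the partial products of x*y; each is a digit times x, weighted by 4**i."""
    return [d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y, bits))]
```

Summing the list returned by `booth_partial_products` reproduces `x * y`; for a 4-bit multiplier only three partial products are generated instead of four.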
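Fig. 7 shows a schematic diagram of the multiplication module 302, which includes a compression unit 701 and a Wallace tree unit 702.
In the multiplication module 302, the compression unit 701 performs Booth compression on the two mantissas.
The Wallace tree unit 702 accumulates the compressed mantissas to obtain the partial products. Because the Wallace tree supports only two's-complement arithmetic while the operations of the multiply-accumulate device are all carried out in sign-magnitude form, the Wallace tree unit 702 has to convert between sign-magnitude and two's complement. Specifically, the Wallace tree unit 702 includes a first converter 703, a Wallace tree adder 704 and a second converter 705.
The first converter 703 converts the compressed mantissas into two's complement for the Wallace tree adder 704 to accumulate. The Wallace tree adder 704 accumulates the two's-complement values to obtain the accumulated result in two's complement. The Wallace tree adder 704 is a multi-level arrangement of two-input addition units. Fig. 8 shows a Wallace tree adder 704 with five levels of two-input addition units, comprising first-level addition units 801, second-level addition units 802, third-level addition units 803, fourth-level addition units 804 and a fifth-level addition unit 805. Each level adds its operands in pairs, so the fifth-level addition unit 805 yields the accumulated sum of 32 operands. The second converter 705 converts the accumulated result from two's complement back to sign-magnitude form, completing the partial products of the operands.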
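The conversion between sign-magnitude and two's complement and the multi-level pairwise addition can be modeled in Python as below. The sketch follows the two-input adder levels described for Fig. 8 rather than a carry-save Wallace tree, and the function names and fixed word width are assumptions.

```python
def to_twos_complement(sign, magnitude, width):
    """Encode a sign-magnitude value as a `width`-bit two's-complement word."""
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)

def from_twos_complement(word, width):
    """Decode a `width`-bit two's-complement word into (sign, magnitude)."""
    if word >> (width - 1):
        return 1, (1 << width) - word
    return 0, word

def adder_tree(words, width):
    """Add words pairwise level by level; return (sum modulo 2**width, number of levels)."""
    mask = (1 << width) - 1
    level, levels = list(words), 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)      # pad an odd level with zero
        level = [(level[i] + level[i + 1]) & mask for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels
```

With 32 input words the loop runs exactly five levels, matching the five-level structure described above. The word width must be large enough that the true sum does not overflow.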
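The first addition module 303 accumulates the above partial products to obtain the product result of the mantissas of the two operands. The product of the two operands is thus obtained.
The shift module 304 shifts the product result from the first addition module 303 according to the difference between the reference value found by the exponent alignment module 301 and the exponents of the two operands. The shift module 304 includes a plurality of barrel shift units. A barrel shift unit is a combinational logic circuit with multiple data inputs, multiple data outputs and a control input that specifies how the data is to be moved. Each barrel shift unit shifts the corresponding mantissa according to the difference. The mantissa is first restored and zeros are appended after it; every shifted mantissa has a width of the threshold minus one bits.
If the exponent alignment module 301 uses unified alignment, the shift module 304 may include n shift units, each shifting the mantissa of one operand. Because the reference value is the largest exponent among the n operands, the exponents of all operands are aligned to the reference value and their mantissas are shifted accordingly. After shifting, the mantissa of every operand has a width of the reference value minus one bits.
If the exponent alignment module 301 uses cluster alignment, and taking a cluster of 32 operands as an example, the shift module 304 may include 32 shift units, each shifting the mantissa of one operand. Because the reference value is the largest exponent among these 32 operands, the exponents of the operands in the cluster are all aligned to the corresponding reference value and their mantissas are shifted accordingly. If the threshold is set to 32, no exponent difference in the cluster exceeds 32, so after shifting the mantissa of every operand has a width of the threshold minus one bits, that is, 31 bits. Every cluster is shifted in this way.
When the shift module 304 determines that the bits shifted out of a mantissa are all 0, then under the IEEE 754 round-to-nearest-even principle it sets all sticky bits of the shifted mantissa to 0; when it determines that the bits shifted out are all 1, it sets all sticky bits to 1.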
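The alignment shift can be sketched in Python as a right shift by the exponent difference while tracking the discarded bits. Note that this sketch uses the conventional single sticky flag (the OR of all shifted-out bits), which is an assumption and not necessarily the exact behavior described above; the function name `align_mantissa` is likewise hypothetical.

```python
def align_mantissa(mantissa, exp, ref_exp):
    """Shift a non-negative mantissa right so that its exponent matches `ref_exp`.

    Returns (shifted_mantissa, sticky), where sticky is 1 if any 1 bit was shifted out.
    """
    shift = ref_exp - exp
    if shift < 0:
        raise ValueError("the reference exponent must be the largest exponent")
    shifted = mantissa >> shift
    sticky = int((mantissa & ((1 << shift) - 1)) != 0)
    return shifted, sticky
```

The sticky flag lets a later rounding stage tell an exact result from one that lost non-zero bits during alignment.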
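The second addition module 305 adds the shifted product results to obtain the multiply-accumulate result of the n operands. The structure of the second addition module 305 is also as shown in Fig. 7 and includes a compression unit 701 and a Wallace tree unit 702. The compression unit 701 performs Booth compression on the shifted product results. The Wallace tree unit 702 includes a first converter 703, a Wallace tree adder 704 and a second converter 705. The first converter 703 converts the compressed mantissas into two's complement for the Wallace tree adder 704 to accumulate. The Wallace tree adder 704 accumulates the compressed product results in two's complement to obtain the multiply-accumulate result in two's complement. The second converter 705 converts the multiply-accumulate result from two's complement to sign-magnitude form. The multiply-accumulate of the operands is thus complete.
When the exponent alignment module 301 uses cluster alignment, the multiplication module 302, the first addition module 303, the shift module 304 and the second addition module 305 all operate cluster by cluster, and the second addition module 305 finally sums the multiply-accumulate results of all clusters to produce the multiply-accumulate result of the n operands.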
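The overall floating-point data path (multiply mantissas, align to a reference exponent, add as integers) can be sketched end to end in Python. This is a simplified model under several assumptions: mantissas are taken as 24-bit integers via `math.frexp`, the reference value is the largest product exponent, and shifted-out bits are simply truncated. The names `split` and `aligned_mac` are introduced here for illustration.

```python
import math

MANT_BITS = 24  # assumed mantissa width, as for FP32 with its implicit bit

def split(x):
    """Return (mant, exp) with x == mant * 2**exp for values that fit in MANT_BITS bits."""
    m, e = math.frexp(x)                       # x = m * 2**e with 0.5 <= |m| < 1
    return int(m * (1 << MANT_BITS)), e - MANT_BITS

def aligned_mac(a, b):
    """Multiply-accumulate by integer mantissa products aligned to the largest exponent."""
    products = []
    for x, y in zip(a, b):
        mx, ex = split(x)
        my, ey = split(y)
        if mx * my:                            # zero products contribute nothing
            products.append((mx * my, ex + ey))
    if not products:
        return 0.0
    ref = max(e for _, e in products)          # reference value for alignment
    acc = 0
    for m, e in products:
        sign = -1 if m < 0 else 1
        acc += sign * (abs(m) >> (ref - e))    # shift the magnitude, then add as fixed point
    return math.ldexp(float(acc), ref)
```

After alignment the accumulation is plain integer addition, which is the part that a fixed-point multiply-accumulate can share.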
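The normalization module 306 normalizes the multiply-accumulate result of the mantissas. By adding to or subtracting from the exponent value, the mantissa is shifted left or right to undo the shift performed by the shift module 304.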
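Normalization can be sketched in Python as moving the leading 1 of the accumulated mantissa to a fixed position and compensating in the exponent. This is a minimal model on non-negative integers; the function name `normalize`, the `width` parameter and the truncation of low bits on a right shift (rounding is left to the rounding stage) are assumptions.

```python
def normalize(mantissa, exp, width):
    """Shift `mantissa` so that it occupies exactly `width` bits, adjusting `exp` to match."""
    if mantissa == 0:
        return 0, 0
    length = mantissa.bit_length()
    if length > width:
        shift = length - width
        return mantissa >> shift, exp + shift   # right shift: the exponent grows
    shift = width - length
    return mantissa << shift, exp - shift       # left shift: the exponent shrinks
```

For instance, a 6-bit sum normalized to 4 bits is shifted right by two places and its exponent is increased by 2.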
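The rounding module 307 rounds the normalized multiply-accumulate result, for example rounding to nearest even, toward zero, down or up, the rounding mode being chosen according to actual needs.
The switching module 308 identifies whether the operands are floating-point or fixed-point numbers, either from a control signal of the preceding stage or by examining the data format of the operands (for example whether they have exponent bits), and accordingly decides whether the exponent alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 are turned on or off. If the operands are floating-point numbers, the switching module 308 keeps the exponent alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 in the on mode, and they perform the operations described above.
The multiply-accumulate operation on n fixed-point operands is described next.
Because fixed-point multiply-accumulate works on the same principle as floating-point multiplication, both first computing partial sums with a Wallace tree and then accumulating all the partial sums, the multiplication module 302, the first addition module 303 and the second addition module 305 can be reused for fixed-point multiply-accumulate.
When the operands are fixed-point numbers, the switching module 308 puts the exponent alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 in the off mode, as follows. Because fixed-point numbers have no exponent bits and need no alignment, the switching module 308 can simply cut power to the exponent alignment module 301 so that it does not operate. The shift module 304, the normalization module 306 and the rounding module 307 also play no part in fixed-point multiply-accumulate, but the output of each preceding stage must be passed on to the next stage, so their power cannot simply be cut as for the exponent alignment module 301. One way of turning them off is shown in Fig. 9: a demultiplexer is placed in front of each of these modules, namely a first demultiplexer 901 in front of the shift module 304, a second demultiplexer 902 in front of the normalization module 306, and a third demultiplexer 903 in front of the rounding module 307. The control signal of the switching module 308 controls the first demultiplexer 901 to decouple the shift module 304, the second demultiplexer 902 to decouple the normalization module 306, and the third demultiplexer 903 to decouple the rounding module 307, so that these modules are bypassed and the input of each demultiplexer (the output of the preceding stage) is passed directly to the next stage.
Because the switching module 308 turns off the exponent alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307, the modules of the multiply-accumulate device of Fig. 3 that actually run when multiply-accumulating fixed-point numbers are as shown in Fig. 10: only the multiplication module 302, the first addition module 303 and the second addition module 305 are running, and they compute on the mantissa bits of the operands. These three modules operate in the same way for fixed-point operands as for floating-point operands, so the description is not repeated. The output of the second addition module 305 is the multiply-accumulate result of the n fixed-point numbers.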
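The effect of the switching module can be summarized in a few lines of Python that list which stages run in each mode. The stage names are illustrative labels for the modules of Fig. 3, and the function `active_stages` is an assumption introduced here, not part of the disclosure.

```python
# Stages in data-path order; the names are hypothetical labels for modules 301 to 307.
STAGES = ["align", "multiply", "add1", "shift", "add2", "normalize", "round"]

# Stages that are powered down or bypassed when the operands are fixed-point numbers.
FLOAT_ONLY = {"align", "shift", "normalize", "round"}

def active_stages(is_float):
    """Return the stages that actually run for floating-point or fixed-point operands."""
    return [s for s in STAGES if is_float or s not in FLOAT_ONLY]
```

In fixed-point mode only the multiplication and the two addition stages remain, which corresponds to the reduced data path of Fig. 10.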
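This embodiment is based on the modules for floating-point multiply-accumulate and uses the concept of exponent alignment so that part of the floating-point computation is identical to fixed-point computation. Some modules can then be reused for fixed-point multiply-accumulate, achieving the technical effect of covering both floating-point and fixed-point multiply-accumulate with a single set of hardware. This embodiment can support multiply-accumulate in at least the FP32, TF32, FP16, BF16, INT16, INT8 and INT4 data formats.
Fig. 11 shows a schematic diagram of a multiply-accumulate device according to another embodiment of the present invention. This embodiment likewise has the structure of Fig. 1 and Fig. 2, and the multiply-accumulate device is likewise located in the computing device 201. The device includes the exponent alignment module 301, the multiplication module 302, the first addition module 303, the shift module 304, the second addition module 305, the normalization module 306, the rounding module 307 and the switching module 308, which operate as in the embodiment of Fig. 3 and are not described again. The device further includes a rounding module 1101 and a conversion module 1102.
The rounding module 1101 is placed between the first addition module 303 and the shift module 304 and rounds the product of the two operands, for example rounding to nearest even, toward zero, down or up, the rounding mode being chosen according to actual needs. The shift module 304 shifts the rounded product result.
The conversion module 1102 is placed after the rounding module 307 and converts the precision of the multiply-accumulate result. In one possible case the result is converted to a higher precision, improving the precision of the multiply-accumulate result so that later modules can compute on higher-precision operands. The conversion does not involve the sign bit, which stays unchanged.
If the operand has the same number of exponent bits before and after conversion, for example when an FP32 operand is converted into an FP37 operand, the exponent bits are unchanged as well, because FP32 and FP37 both have 8 exponent bits. Only the 23 mantissa bits of FP32 need to be converted into the 28 mantissa bits of FP37: the upper 23 mantissa bits of FP37 are set to the 23 mantissa bits of FP32, and the remaining 5 low mantissa bits of FP37 are set to 0.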
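The FP32-to-FP37 widening described above amounts to appending five zero bits to the mantissa. The Python sketch below illustrates it under an assumed FP37 bit layout (sign in bit 36, exponent in bits 35 to 28, mantissa in bits 27 to 0); the text specifies only the field widths, so this packing and the function name `fp32_to_fp37_bits` are assumptions.

```python
import struct

def fp32_to_fp37_bits(x):
    """Encode x as FP32, then widen it to a 37-bit word with 1 + 8 + 28 bits."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits32 >> 31
    exp = (bits32 >> 23) & 0xFF      # the 8 exponent bits are copied unchanged
    frac23 = bits32 & 0x7FFFFF
    frac28 = frac23 << 5             # the upper 23 bits hold the FP32 mantissa, the low 5 bits are 0
    return (sign << 36) | (exp << 28) | frac28
```

Because the exponent width and bias are shared, no exponent arithmetic is needed for this conversion.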
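If both the number of exponent bits and the number of mantissa bits differ before and after conversion, for example when an FP16 operand is converted into an FP37 operand, the conversion differs depending on whether the exponent value is 0, since FP16 has 5 exponent bits and 10 mantissa bits.
When the FP16 exponent field is not 0, take the exponent $2^0 = 1$ as an example: with the FP16 bias of 15 the FP16 exponent field is 0x0F, while with a bias of 127 the FP37 exponent field is 0x7F, so 0x70 has to be added to the exponent value during conversion. Because the FP16 mantissa is 10 bits long and the FP37 mantissa is 28 bits long, the mantissa has to be shifted left by 18 bits during conversion and the remaining low mantissa bits are set to 0.
When the FP16 exponent field is 0, the value has the form 0.xxxxx. Because the smallest exponent FP16 can represent is $2^{-14}$, the exponent field is necessarily non-zero after conversion to FP37, so besides the 18-bit left shift the mantissa has to be shifted further left until its most significant 1 is dropped. The exponent value is 0x70 minus (the extra left-shift amount - 1). The position of the most significant 1 in the FP16 mantissa is therefore determined first; during mantissa conversion the mantissa is shifted left until that 1 is dropped, and the remaining low mantissa bits are then set to 0.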
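Both FP16 cases can be checked with a short Python sketch that returns the FP37 sign, exponent and 28-bit mantissa fields. It assumes that FP37 uses the same exponent bias of 127 as FP32 and does not handle infinities or NaNs; the function name `fp16_to_fp37_fields` is hypothetical.

```python
def fp16_to_fp37_fields(bits16):
    """Convert a 16-bit FP16 encoding into (sign, exponent, 28-bit mantissa) fields of FP37."""
    sign = (bits16 >> 15) & 1
    exp16 = (bits16 >> 10) & 0x1F
    frac10 = bits16 & 0x3FF
    if exp16 == 0x1F:
        raise ValueError("infinity and NaN are not covered by this sketch")
    if exp16 != 0:
        # Normal number: re-bias the exponent (127 - 15 = 0x70), widen the mantissa by 18 bits.
        return sign, exp16 + 0x70, frac10 << 18
    if frac10 == 0:
        return sign, 0, 0                        # zero stays zero
    # Subnormal: shift left until the most significant 1 leaves the 10-bit field.
    extra = 10 - frac10.bit_length() + 1         # extra left-shift amount
    frac = (frac10 << extra) & 0x3FF             # the leading 1 becomes the implicit bit
    return sign, 0x70 - (extra - 1), frac << 18
```

For the smallest subnormal, encoding 0x0001 with value $2^{-24}$, the extra shift is 10, so the exponent field becomes 0x70 - 9 = 0x67, which is 103 = 127 - 24.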
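The switching module 308 determines whether the operands are floating-point or fixed-point numbers, either from a control signal of the preceding stage or by examining the data format of the operands, and accordingly decides whether the exponent alignment module 301, the shift module 304, the normalization module 306, the rounding module 307, the rounding module 1101 and the conversion module 1102 are turned on or off. If the operands are floating-point numbers, these modules are switched to the on mode, as shown in Fig. 11, and each module performs the operations described above. When the operands are fixed-point numbers, the switching module 308 puts the exponent alignment module 301, the shift module 304, the normalization module 306, the rounding module 307, the rounding module 1101 and the conversion module 1102 in the off mode; the exponent alignment module 301, the shift module 304, the normalization module 306 and the rounding module 307 are controlled as in the previous embodiment. Fig. 12 shows how the rounding module 1101 and the conversion module 1102 are controlled: a fourth demultiplexer 1201 is placed in front of the rounding module 1101 and a fifth demultiplexer 1202 in front of the conversion module 1102. The control signal of the switching module 308 controls the fourth demultiplexer 1201 to decouple the rounding module 1101 and the fifth demultiplexer 1202 to decouple the conversion module 1102. The modules of the multiply-accumulate device of Fig. 11 that actually run when multiply-accumulating fixed-point numbers are therefore also as shown in Fig. 10: only the multiplication module 302, the first addition module 303 and the second addition module 305 are running, and they compute on the mantissa bits of the operands.
This embodiment not only introduces the concept of exponent alignment, effectively covering floating-point and fixed-point multiply-accumulate with a single set of hardware, but can also convert the precision of the result to benefit the computation of subsequent modules. This embodiment can support multiply-accumulate in at least the FP32, TF32, FP16, BF16, INT16, INT8 and INT4 data formats.
Fig. 13 shows a flowchart of a method for multiply-accumulating a plurality of operands according to another embodiment of the present invention. In step 1301, it is determined whether the operands are floating-point or fixed-point numbers.
If they are floating-point numbers, step 1302 is performed to identify a reference value of the exponents of the plurality of operands. This step may use the unified alignment scheme. Unified alignment treats all operands to be computed as one unit: the exponents of two operands are compared and the larger is output, the larger exponent values then enter the next level of comparison, and so on until the largest exponent of the n operands is selected as the reference value.
This step may also use the cluster alignment scheme. The largest exponent among the n operands is found first and used as the reference value, and the operands whose exponent values differ from the reference value by an amount within the threshold are gathered into a cluster. The exponent values of the operands within a cluster are close, so accumulating them directly does not affect precision. The operands not yet clustered are then grouped in the same way, and this is repeated until all operands have been clustered.
In step 1303, the mantissas of two operands that are to be multiplied among the plurality of operands are multiplied to obtain partial products of the mantissas. Specifically, this step first performs Radix-4 Booth compression on the two mantissas and then accumulates the compressed mantissas to obtain the partial products.
In step 1304, the above partial products are accumulated to obtain the product result of the mantissas of the two operands. The product of the two operands is thus obtained.
In step 1305, the product result is shifted according to the difference between the reference value and the exponents of the two operands. If step 1302 used unified alignment, in this step the exponents of all operands are aligned to the reference value and their mantissas are shifted accordingly. After shifting, the mantissa of every operand has a width of the reference value minus one bits. If step 1302 used cluster alignment, the reference value is the largest exponent among the operands of the same cluster, so the exponents of the operands in the cluster are all aligned to the corresponding reference value and their mantissas are shifted accordingly. Every cluster is shifted in this way.
In step 1306, the shifted product results are added to obtain the multiply-accumulate result of the n operands. In this step, the shifted mantissas are first converted into two's complement, Booth compression is performed on the two's complement of the shifted product results, the compressed product results are accumulated to obtain the multiply-accumulate result, and finally the multiply-accumulate result is converted from two's complement to sign-magnitude form.
When step 1302 uses cluster alignment, steps 1303 to 1306 are all performed cluster by cluster, and in step 1306 the multiply-accumulate results of all clusters are finally summed to give the multiply-accumulate result of the n operands.
In step 1307, the multiply-accumulate result of the mantissas is normalized. By adding to or subtracting from the exponent value, the mantissa is shifted left or right to undo the shift of step 1305.
In step 1308, the normalized multiply-accumulate result is rounded, for example to nearest even, toward zero, down or up, the rounding mode being chosen according to actual needs. The floating-point multiply-accumulate result is thus obtained.
If the operands are fixed-point numbers, step 1309 is performed: the mantissas of two operands that are to be multiplied among the plurality of operands are multiplied to obtain partial products of the mantissas. Specifically, this step first performs Radix-4 Booth compression on the two mantissas and then accumulates the compressed mantissas to obtain the partial products.
In step 1310, the above partial products are accumulated to obtain the product result of the mantissas of the two operands. The product of the two operands is thus obtained.
In step 1311, the product results are added to obtain the multiply-accumulate result of the n operands. In this step, the mantissas are first converted into two's complement, Booth compression is performed on the two's complement of the product results, the compressed product results are accumulated to obtain the multiply-accumulate result, and finally the multiply-accumulate result is converted from two's complement to sign-magnitude form. The fixed-point multiply-accumulate result is thus obtained.
Because fixed-point multiply-accumulate works on the same principle as floating-point multiplication, both first computing partial sums and then accumulating all the partial sums, step 1303 (step 1309), step 1304 (step 1310) and step 1306 (step 1311) can be reused across floating-point and fixed-point multiply-accumulate.
This embodiment is based on the flow of floating-point multiply-accumulate and uses the concept of exponent alignment so that part of the floating-point computation is identical to fixed-point computation. Some steps can then be reused for fixed-point multiply-accumulate, achieving the technical effect of streamlining multiply-accumulate operations.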
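Fig. 14 shows a flowchart of a method for multiply-accumulating a plurality of operands according to another embodiment of the present invention. In step 1401, it is determined whether the operands are floating-point or fixed-point numbers.
If they are floating-point numbers, step 1402 is performed to identify a reference value of the exponents of the plurality of operands. This step may likewise use the unified alignment scheme or the cluster alignment scheme. In step 1403, the mantissas of two operands that are to be multiplied among the plurality of operands are multiplied to obtain partial products of the mantissas. In step 1404, the above partial products are accumulated to obtain the product result of the mantissas of the two operands. The product of the two operands is thus obtained.
In step 1405, the product of the two operands is rounded, for example to nearest even, toward zero, down or up, the rounding mode being chosen according to actual needs.
In step 1406, the product result is shifted according to the difference between the reference value and the exponents of the two operands. In step 1407, the shifted product results are added to obtain the multiply-accumulate result of the n operands. When step 1402 uses cluster alignment, steps 1403 to 1407 are all performed cluster by cluster, and in step 1407 the multiply-accumulate results of all clusters are finally summed to give the multiply-accumulate result of the n operands. In step 1408, the multiply-accumulate result of the mantissas is normalized. In step 1409, the normalized multiply-accumulate result is rounded.
In step 1410, the precision of the multiply-accumulate result is converted, for example raised to FP37, so that later modules can compute on higher-precision operands. The floating-point multiply-accumulate result is thus obtained.
If the operands are fixed-point numbers, step 1411 is performed: the mantissas of two operands that are to be multiplied among the plurality of operands are multiplied to obtain partial products of the mantissas. In step 1412, the above partial products are accumulated to obtain the product result of the mantissas of the two operands. In step 1413, the product results are added to obtain the multiply-accumulate result of the n operands. The fixed-point multiply-accumulate result is thus obtained.
Because fixed-point multiply-accumulate works on the same principle as floating-point multiplication, both first computing partial sums and then accumulating all the partial sums, step 1403 (step 1411), step 1404 (step 1412) and step 1407 (step 1413) can be reused across floating-point and fixed-point multiply-accumulate.
This embodiment is based on the flow of floating-point multiply-accumulate and uses the concept of exponent alignment so that part of the floating-point computation is identical to fixed-point computation. Some steps can then be reused for fixed-point multiply-accumulate, achieving the technical effect of streamlining multiply-accumulate operations. This embodiment can also convert the precision of the result to benefit subsequent computation.
Another embodiment of the present invention is a computer-readable storage medium storing computer program code for a method of multiply-accumulating a plurality of operands; when the computer program code is run by a processor, the method of the embodiments shown in Fig. 13 or Fig. 14 is performed.
Another embodiment of the present invention is a computer program product comprising a computer program for multiply-accumulating a plurality of operands; when executed by a processor, the computer program implements the steps of the method shown in Fig. 13 or Fig. 14.
Another embodiment of the present invention is a computer device comprising a memory, a processor and a computer program stored on the memory; the processor executes the computer program to implement the steps of the method shown in Fig. 13 or Fig. 14.
The present invention is based on floating-point multiply-accumulate and uses the concept of exponent alignment so that part of the floating-point computation is identical to fixed-point computation. The identical part can then be reused for fixed-point multiply-accumulate, achieving the technical effect of covering both floating-point and fixed-point multiply-accumulate operations.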
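Depending on the application scenario, the electronic equipment or device of the present invention may include a server, cloud server, server cluster, data processing device, robot, computer, printer, scanner, tablet computer, smart terminal, PC device, Internet of Things terminal, mobile terminal, mobile phone, dashboard camera, navigator, sensor, webcam, camera, video camera, projector, watch, earphones, mobile storage, wearable device, visual terminal, autonomous driving terminal, vehicle, household appliance and/or medical device. The vehicles include aircraft, ships and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical devices include nuclear magnetic resonance scanners, B-mode ultrasound scanners and/or electrocardiographs. The electronic equipment or device of the present invention may also be applied in fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites and healthcare. Further, it may be used in cloud, edge and terminal application scenarios related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the solution of the present invention may be applied to cloud devices (such as cloud servers), while those with low power consumption may be applied to terminal devices and/or edge devices (such as smartphones or cameras). In one or more embodiments, the hardware information of the cloud device and that of the terminal device and/or edge device are mutually compatible, so that suitable hardware resources can be matched from the hardware resources of the cloud device, according to the hardware information of the terminal device and/or edge device, to simulate the hardware resources of the terminal device and/or edge device, thereby achieving unified management, scheduling and collaborative work across device and cloud or across cloud, edge and device.
It should be noted that, for brevity, the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but a person skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, a person skilled in the art will understand that some of the steps may be performed in a different order or simultaneously. Further, a person skilled in the art will understand that the embodiments described in the present invention may be regarded as optional embodiments, that is, the actions or modules involved are not necessarily required for implementing one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments emphasize different aspects. In view of this, a person skilled in the art will understand that for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present invention, a person skilled in the art will understand that the several embodiments disclosed herein may also be implemented in other ways not disclosed herein. For example, the units of the electronic equipment or device embodiments described above are divided here on the basis of logical function, and other divisions are possible in an actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections between different units or components, the connections discussed above with reference to the drawings may be direct or indirect couplings between units or components. In some scenarios, such direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic or other forms of signal transmission.
In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. These components or units may be located in one place or distributed over multiple network units. In addition, some or all of the units may be selected as needed to achieve the purpose of the solutions described in the embodiments of the present invention. Furthermore, in some scenarios multiple units of the embodiments of the present invention may be integrated into one unit, or each unit may exist physically on its own.
In other implementation scenarios, the integrated units may also be implemented in hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical implementation of the hardware structure of a circuit may include but is not limited to physical devices, which may include but are not limited to transistors or memristors. In view of this, the various devices described herein (such as the computing device or the other processing devices) may be implemented by suitable hardware processors, such as a CPU, GPU, FPGA, DSP or ASIC. Further, the storage unit or storage device mentioned above may be any suitable storage medium (including magnetic or magneto-optical storage media), for example resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM or RAM.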
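The foregoing may be better understood in light of the following clauses:
Clause A1. A device for multiply-accumulating a plurality of operands, the device comprising: an exponent alignment module for identifying a reference value of the exponents of the plurality of operands; a multiplication module for multiplying the mantissas of two operands that are to be multiplied among the plurality of operands, to obtain partial products of the mantissas; a first addition module for accumulating the partial products to obtain the product result of the mantissas of the two operands; a shift module for shifting the product result according to the difference between the reference value and the exponents of the two operands; and a second addition module for adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
Clause A2. The device of Clause A1, further comprising a switching module for determining whether the plurality of operands are floating-point or fixed-point numbers.
Clause A3. The device of Clause A2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the exponent alignment module to turn off.
Clause A4. The device of Clause A2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the shift module to turn off.
Clause A5. The device of Clause A4, further comprising a first demultiplexer that decouples the shift module in response to a control signal of the switching module.
Clause A6. The device of Clause A2, further comprising a normalization module for normalizing the multiply-accumulate result.
Clause A7. The device of Clause A6, further comprising a second demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the normalization module in response to a control signal of the switching module.
Clause A8. The device of Clause A6, further comprising a rounding module for rounding the normalized multiply-accumulate result.
Clause A9. The device of Clause A8, further comprising a third demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the rounding module in response to a control signal of the switching module.
Clause A10. The device of Clause A2, further comprising a rounding module for rounding the product result, wherein the shift module shifts the rounded product result.
Clause A11. The device of Clause A10, further comprising a fourth demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the rounding module in response to a control signal of the switching module.
Clause A12. The device of Clause A1, wherein the multiplication module comprises: a compression unit for performing Booth compression on the mantissas; and a Wallace tree unit for accumulating the compressed mantissas to obtain the partial products.
Clause A13. The device of Clause A1, wherein the second addition module comprises: a compression unit for performing Booth compression on the shifted product results; and a Wallace tree unit for accumulating the compressed product results to obtain the multiply-accumulate result.
Clause A14. The device of Clause A1, wherein the reference value is the maximum of the exponents of the plurality of operands.
Clause A15. The device of Clause A1, wherein the exponent alignment module groups the plurality of operands into clusters, the reference value is the maximum of the exponents of the operands in each cluster, and the multiplication module, the first addition module and the shift module operate cluster by cluster.
Clause A16. The device of Clause A1, further comprising: a conversion module for converting the precision of the multiply-accumulate result.
Clause A17. The device of Clause A16, wherein the converted precision is FP37, FP37 has 8 exponent bits, and FP37 has 28 mantissa bits.
Clause A18. An integrated circuit device comprising the device of any one of Clauses A1 to A17.
Clause A19. A board card comprising the integrated circuit device of Clause A18.
Clause A20. A method for multiply-accumulating a plurality of operands, the method comprising: identifying a reference value of the exponents of the plurality of operands; multiplying the mantissas of two operands that are to be multiplied among the plurality of operands, to obtain partial products of the mantissas; accumulating the partial products to obtain the product result of the mantissas of the two operands; shifting the product result according to the difference between the reference value and the exponents of the two operands; and adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
Clause A21. The method of Clause A20, further comprising: normalizing the multiply-accumulate result; and rounding the normalized multiply-accumulate result.
Clause A22. The method of Clause A20, further comprising: determining whether the plurality of operands are floating-point or fixed-point numbers; wherein, when the plurality of operands are fixed-point numbers, the multiplication step, the accumulation step and the addition step are performed.
Clause A23. The method of Clause A20, further comprising: converting the precision of the multiply-accumulate result.
Clause A24. The method of Clause A23, wherein the converted precision is FP37, FP37 has 8 exponent bits, and FP37 has 28 mantissa bits.
Clause A25. A computer-readable storage medium storing computer program code for a method of multiply-accumulating a plurality of operands, wherein when the computer program code is run by a processing device, the method of any one of Clauses A20 to A24 is performed.
Clause A26. A computer program product comprising a computer program for multiply-accumulating a plurality of operands, wherein the computer program, when executed by a processor, implements the steps of the method of any one of Clauses A20 to A24.
Clause A27. A computer device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of Clauses A20 to A24.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.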

Claims (27)

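  1. A device for multiply-accumulating a plurality of operands, characterized in that the device comprises:
    an exponent alignment module for identifying a reference value of the exponents of the plurality of operands;
    a multiplication module for multiplying the mantissas of two operands that are to be multiplied among the plurality of operands, to obtain partial products of the mantissas;
    a first addition module for accumulating the partial products to obtain the product result of the mantissas of the two operands;
    a shift module for shifting the product result according to the difference between the reference value and the exponents of the two operands;
    a second addition module for adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
  2. The device according to claim 1, further comprising a switching module for determining whether the plurality of operands are floating-point or fixed-point numbers.
  3. The device according to claim 2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the exponent alignment module to turn off.
  4. The device according to claim 2, wherein when the plurality of operands are fixed-point numbers, the switching module controls the shift module to turn off.
  5. The device according to claim 4, further comprising a first demultiplexer that decouples the shift module in response to a control signal of the switching module.
  6. The device according to claim 2, further comprising a normalization module for normalizing the multiply-accumulate result.
  7. The device according to claim 6, further comprising a second demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the normalization module in response to a control signal of the switching module.
  8. The device according to claim 6, further comprising a rounding module for rounding the normalized multiply-accumulate result.
  9. The device according to claim 8, further comprising a third demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the rounding module in response to a control signal of the switching module.
  10. The device according to claim 2, further comprising a rounding module for rounding the product result, wherein the shift module shifts the rounded product result.
  11. The device according to claim 10, further comprising a fourth demultiplexer that, when the plurality of operands are fixed-point numbers, decouples the rounding module in response to a control signal of the switching module.
  12. The device according to claim 1, wherein the multiplication module comprises:
    a compression unit for performing Booth compression on the mantissas;
    a Wallace tree unit for accumulating the compressed mantissas to obtain the partial products.
  13. The device according to claim 1, wherein the second addition module comprises:
    a compression unit for performing Booth compression on the shifted product results;
    a Wallace tree unit for accumulating the compressed product results to obtain the multiply-accumulate result.
  14. The device according to claim 1, wherein the reference value is the maximum of the exponents of the plurality of operands.
  15. The device according to claim 1, wherein the exponent alignment module groups the plurality of operands into clusters, the reference value is the maximum of the exponents of the operands in each cluster, and the multiplication module, the first addition module and the shift module operate cluster by cluster.
  16. The device according to claim 1, further comprising:
    a conversion module for converting the precision of the multiply-accumulate result.
  17. The device according to claim 16, wherein the converted precision is FP37, FP37 has 8 exponent bits, and FP37 has 28 mantissa bits.
  18. An integrated circuit device comprising the device according to any one of claims 1 to 17.
  19. A board card comprising the integrated circuit device according to claim 18.
  20. A method for multiply-accumulating a plurality of operands, characterized in that the method comprises:
    identifying a reference value of the exponents of the plurality of operands;
    multiplying the mantissas of two operands that are to be multiplied among the plurality of operands, to obtain partial products of the mantissas;
    accumulating the partial products to obtain the product result of the mantissas of the two operands;
    shifting the product result according to the difference between the reference value and the exponents of the two operands;
    adding the shifted product results to obtain the multiply-accumulate result of the plurality of operands.
  21. The method according to claim 20, further comprising:
    normalizing the multiply-accumulate result;
    rounding the normalized multiply-accumulate result.
  22. The method according to claim 20, further comprising:
    determining whether the plurality of operands are floating-point or fixed-point numbers;
    wherein, when the plurality of operands are fixed-point numbers, the multiplication step, the accumulation step and the addition step are performed.
  23. The method according to claim 20, further comprising:
    converting the precision of the multiply-accumulate result.
  24. The method according to claim 23, wherein the converted precision is FP37, FP37 has 8 exponent bits, and FP37 has 28 mantissa bits.
  25. A computer-readable storage medium storing computer program code for a method of multiply-accumulating a plurality of operands, wherein when the computer program code is run by a processing device, the method according to any one of claims 20 to 24 is performed.
  26. A computer program product comprising a computer program for multiply-accumulating a plurality of operands, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 20 to 24.
  27. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 20 to 24.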
PCT/CN2022/138472 2022-06-01 2022-12-12 Method and device for multiply-accumulating operands WO2023231363A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210622935.8A CN117193712A (zh) 2022-06-01 2022-06-01 Method and device for multiply-accumulating operands
CN202210622935.8 2022-06-01

Publications (1)

Publication Number Publication Date
WO2023231363A1 true WO2023231363A1 (zh) 2023-12-07

Family

ID=88983874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138472 WO2023231363A1 (zh) 2022-06-01 2022-12-12 Method and device for multiply-accumulating operands

Country Status (2)

Country Link
CN (1) CN117193712A (zh)
WO (1) WO2023231363A1 (zh)

Also Published As

Publication number Publication date
CN117193712A (zh) 2023-12-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944664

Country of ref document: EP

Kind code of ref document: A1