WO2021078210A1 - Computing device, method, integrated circuit and equipment for neural network operations - Google Patents

Computing device, method, integrated circuit and equipment for neural network operations

Info

Publication number: WO2021078210A1
Authority: WIPO (PCT)
Prior art keywords: data, mantissa, result, floating, computing device
Application number: PCT/CN2020/122949
Other languages: English (en), French (fr)
Inventors: 张尧 (Zhang Yao), 刘少礼 (Liu Shaoli)
Original Assignee: 安徽寒武纪信息科技有限公司 (Anhui Cambricon Information Technology Co., Ltd.)
Application filed by 安徽寒武纪信息科技有限公司
Priority to US 17/620,547 (published as US20220350569A1)
Publication of WO2021078210A1

Classifications

    • G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485: Adding; Subtracting
    • G06F7/4876: Multiplying
    • G06F7/523: Multiplying only
    • G06F7/53: Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel
    • G06F7/5318: Multiplying only in parallel-parallel fashion with column-wise addition of partial products, e.g. using Wallace tree, Dadda counters
    • G06F7/5443: Sum of products
    • G06F7/57: Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30025: Format conversion instructions, e.g. floating-point to integer, decimal conversion
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • H03K19/21: EXCLUSIVE-OR circuits, i.e. giving output if input signal exists at only one input; COINCIDENCE circuits, i.e. giving output only if all input signals are identical

Definitions

  • This disclosure generally relates to the field of data processing. More specifically, the present disclosure relates to computing devices, methods, integrated circuit chips, and equipment for neural network operations.
  • Current neural networks involve computation on weight data (such as convolution kernel data) and neuron data, including a large number of multiplication and addition operations.
  • The efficiency of these multiply-add operations often depends on the execution speed of the multiplier used.
  • Although current multipliers have achieved significant improvements in execution efficiency, there is still room for improvement in processing floating-point data.
  • Neural network operations also involve processing the aforementioned weight data and neuron data, and there is currently no good operation mechanism for this data processing, resulting in inefficient neural network operations.
  • The solution of the present disclosure provides a computing device, method, integrated circuit chip, and integrated circuit device for performing neural network operations, thereby performing neural network operations effectively and realizing the efficient reuse of weight data and neuron data.
  • In one aspect, the present disclosure discloses a computing device for performing neural network operations, including: an input terminal configured to receive at least one piece of weight data and at least one piece of neuron data of the neural network operation to be performed; a multiplication unit, which includes at least one floating-point multiplier configured to perform the multiplication operation in the neural network operation on the weight data and the neuron data to obtain a corresponding product result; an addition module configured to perform an addition operation on the product result to obtain an intermediate result; and an update module configured to perform multiple summation operations over the plurality of intermediate results generated, in order to output the final result of the neural network operation.
  • In another aspect, the present disclosure discloses a method for performing neural network operations, including: receiving at least one piece of weight data and at least one piece of neuron data of the neural network operation to be performed; using a multiplication unit that includes at least one floating-point multiplier to perform the multiplication operation in the neural network operation on the weight data and the neuron data to obtain a corresponding product result; using an addition module to perform an addition operation on the product result to obtain an intermediate result; and using an update module to perform multiple summation operations on the generated intermediate results to output the final result of the neural network operation.
  • In other aspects, the present disclosure discloses an integrated circuit chip and an integrated circuit device. The integrated circuit chip includes the aforementioned computing device for performing neural network operations, and the integrated circuit device includes the integrated circuit chip.
  • By means of the disclosed computing device, method, integrated circuit chip, and device, neural network operations, especially convolution operations in neural networks, can be performed efficiently. The present disclosure also supports the reuse of weight data and neuron data, thereby avoiding excessive data migration and storage, improving operation efficiency, and reducing operation costs.
  • Fig. 1 is a schematic block diagram showing a computing device according to an embodiment of the present disclosure;
  • Fig. 2 is a schematic diagram showing a floating-point data format according to an embodiment of the present disclosure;
  • Fig. 3 is a schematic structural block diagram showing a multiplier according to an embodiment of the present disclosure;
  • Fig. 4 is a block diagram showing more details of the multiplier according to an embodiment of the present disclosure;
  • Fig. 5 is a schematic block diagram showing a mantissa processing unit according to an embodiment of the present disclosure;
  • Fig. 6 is a schematic diagram showing a partial product operation according to an embodiment of the present disclosure;
  • Fig. 7 is an operation flow and schematic block diagram of a Wallace tree compressor according to an embodiment of the present disclosure;
  • Fig. 8 is an overall schematic block diagram showing a multiplier according to an embodiment of the present disclosure;
  • Fig. 9 is a flowchart showing a method for performing floating-point multiplication using a multiplier according to an embodiment of the present disclosure;
  • Fig. 10 is another schematic block diagram showing a computing device according to an embodiment of the present disclosure;
  • Fig. 11 is a schematic block diagram showing an adder group according to an embodiment of the present disclosure;
  • Fig. 12 is another schematic block diagram showing an adder group according to an embodiment of the present disclosure;
  • Fig. 13 is a flowchart showing the execution of neural network operations according to an embodiment of the present disclosure;
  • Fig. 14 is a schematic diagram showing a neural network operation according to an embodiment of the present disclosure;
  • Fig. 15 is a flowchart showing the use of a computing device to perform neural network operations according to an embodiment of the present disclosure;
  • Fig. 16 is a structural diagram showing a combined processing device according to an embodiment of the present disclosure; and
  • Fig. 17 is a schematic diagram showing the structure of a board card according to an embodiment of the present disclosure.
  • The technical solution of the present disclosure uses a multiplication unit including one or more floating-point multipliers to perform multiplication operations on weight data and neuron data, and performs addition operations and update operations on the resulting products to obtain the final result.
  • The disclosed solution not only improves the efficiency of the multiplication operation through the multiplication unit, but also stores the multiple intermediate results preceding the final result through the update operation, thereby realizing the efficient reuse of weight data and neuron data.
  • Fig. 1 is a schematic block diagram showing a computing device 100 according to an embodiment of the present disclosure.
  • The computing device can be used to perform neural network operations, in particular to process weight data and neuron data, to obtain the desired operation result.
  • The weight data may be convolution kernel data, and the neuron data may be, for example, pixel data of an image or the output data of a previous layer's arithmetic operation.
  • In one embodiment, the computing device includes an input terminal 102 configured to receive at least one piece of weight data and at least one piece of neuron data of the neural network operation to be executed.
  • For example, the input terminal may receive image data captured by an image capture device. The image capture device may be, for example, an image sensor, camera, video camera, mobile smart terminal, tablet computer, or other image acquisition device, and the collected pixel data, or pixel data that has undergone preliminary processing, can serve as the neuron data of the present disclosure.
  • The above-mentioned weight data and neuron data may have the same or different data formats, for example, the same or different floating-point formats.
  • In view of this, the input terminal may include one or more first-type conversion units for data format conversion, which convert the received weight data or neuron data into a data format supported by the multiplication unit 104.
  • For example, the format conversion unit in the input terminal may convert the received neuron data and weight data into one of the aforementioned data formats to meet the requirements of the multiplication unit for performing the multiplication operation.
  • The various data formats or types supported by the present disclosure, and the conversion between them, will be described in detail below when discussing the floating-point multiplier of the present disclosure.
  • The multiplication unit of the present disclosure may include at least one floating-point multiplier 106, which may be configured to perform the multiplication operation in the neural network operation on the aforementioned weight data and neuron data to obtain the corresponding product result.
  • In one or more embodiments, the floating-point multiplier of the present disclosure may support a multiplication operation in one of multiple operation modes, where the operation mode indicates the data formats of the neuron data and weight data involved in the multiplication operation.
  • For example, when both the neuron data and the weight data are half-precision floating-point numbers, the floating-point multiplier can perform the operation in a first operation mode, and when the neuron data is a half-precision floating-point number and the weight data is a single-precision floating-point number, the floating-point multiplier can perform the multiplication operation in a second operation mode.
  • After the multiplication unit performs the operation, the product result can be sent to the addition module 108, which can be configured to perform an addition operation on the product result to obtain an intermediate result.
  • In one embodiment, the addition module may be an adder group formed by a plurality of adders, and the adder group may take a tree-like structure.
  • For example, the addition module may include adder groups arranged in a multi-stage tree structure, where each adder group includes one or more first adders 110, and the first adders may be, for example, floating-point adders.
  • In some scenarios, the adders in the addition module of the present disclosure may also support multiple addition modes.
  • For example, the first adder in this disclosure can be a floating-point adder that supports floating-point numbers in any of the above-mentioned data formats.
  • The solution of the present disclosure does not impose any restriction on the type of the first adder; any device that can support the addition operation can be used as the adder here to implement the addition operation and obtain the intermediate result.
  • In order to perform summation operations on multiple intermediate results, the computing device of the present disclosure may further include an update module 112 configured to perform multiple summation operations over the generated intermediate results to output the final result of the neural network operation.
  • In one embodiment, the update module may include a second adder 114 and a register 116.
  • Just as the first adder in the aforementioned addition module can be a floating-point adder supporting multiple modes, the second adder in the update module can have the same or similar properties as the first adder, that is, it can also support floating-point addition in multiple modes.
  • In some embodiments, the present disclosure also discloses a first or second type conversion unit for performing conversion between data types or formats, enabling the first or second adder to perform floating-point addition in multiple operation modes.
  • The type conversion unit will be described in detail later with reference to Fig. 11.
  • In one application scenario, the second adder may be configured to repeatedly perform the following operations until the summation of all of the intermediate results is completed: receive an intermediate result from the addition module (for example, addition module 108) and the previous summation result of the previous summation operation from the register (i.e., register 116); add the intermediate result and the previous summation result to obtain the summation result of the current summation operation; and use the summation result of the current summation operation to update the previous summation result stored in the register.
  • After the summation of all the intermediate results is completed, the result stored in the register is output as the final result of the neural network operation.
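  • As an illustration of the update loop just described, the following minimal behavioral sketch (in Python, with illustrative names; the patent describes hardware, not software) shows how the second adder 114 folds each intermediate result into the register 116:

```python
# Minimal behavioral sketch of the update module: the register holds the
# previous summation result, and the second adder adds each new intermediate
# result arriving from the addition module into it.
def update_module(intermediate_results):
    register = 0.0                          # register 116 (initially cleared)
    for intermediate in intermediate_results:
        register = register + intermediate  # second adder 114, write-back to register
    return register                         # final result of the neural network operation

# Example: three intermediate results arriving from the addition module
print(update_module([1.5, 2.25, -0.75]))    # 3.0
```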
  • In one embodiment, the input terminal may include at least two input ports supporting multiple data bit widths, and the register may include a plurality of sub-registers; the computing device is configured to divide and respectively multiplex the neuron data and weight data according to the input port bit width in order to perform neural network operations.
  • For example, the input data of a port can be a data item including 16 FP32 (single-precision floating-point) numbers, a data item including 32 FP16 (half-precision floating-point) numbers, or a data item including 32 BF16 (brain floating-point) numbers.
  • As a further example, 2048-bit weight data can be divided into four 512-bit data segments, the multiplication unit and the update module thereby being called 4 times, with the final calculation result output after the fourth update is completed.
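  • This division-and-multiplexing behavior can be sketched as follows (a simplified software model; the chunk size in FP32 values and the helper names are illustrative assumptions). A 2048-bit weight vector holds 64 FP32 values, but a 512-bit port carries only 16 FP32 values at a time, so four passes through the multiply-add-update path are needed:

```python
# Simplified model: split wide inputs into port-width chunks and run the
# multiplication unit, addition module, and update module once per chunk.
PORT_WIDTH_FP32 = 16   # one 512-bit port holds 16 single-precision values

def neural_op(weights, neurons):
    total = 0.0
    for i in range(0, len(weights), PORT_WIDTH_FP32):
        w_chunk = weights[i:i + PORT_WIDTH_FP32]
        n_chunk = neurons[i:i + PORT_WIDTH_FP32]
        products = [w * n for w, n in zip(w_chunk, n_chunk)]  # multiplication unit
        intermediate = sum(products)                          # addition module (adder tree)
        total += intermediate                                 # update module
    return total

weights = [0.5] * 64   # 64 FP32 values = 2048 bits -> processed in 4 passes
neurons = [2.0] * 64
print(neural_op(weights, neurons))  # 64.0
```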
  • The above-mentioned multiplication unit, addition module, and update module of the present disclosure can all operate independently and in parallel. For example, after the multiplication unit outputs a product result, it can receive the next pair of neuron data and weight data and perform the next multiplication operation without waiting for the subsequent stages (such as the addition module and the update module) to finish. Similarly, after the addition module outputs an intermediate result, it can receive the next product result from the multiplication unit for the next addition operation. It can be seen that this parallel mode of operation improves computational efficiency.
  • Note that the "subsequent stage" here refers not only to the immediately following stage but also to the several following stages of a multi-stage pipeline.
  • The overall operation of the computing device of the present disclosure has been described above in conjunction with Fig. 1; efficient neural network operations can be realized by using this computing device.
  • With the support of the disclosed floating-point multiplier, the computing device can realize the multiplication of floating-point numbers in multiple data formats in a neural network.
  • The floating-point multiplier of the present disclosure will be described in detail below in conjunction with Figs. 2-9.
  • Fig. 2 is a schematic diagram showing a floating-point data format 200 according to an embodiment of the present disclosure.
  • The neuron data and weight data to which the technical solution of the present disclosure applies can be floating-point numbers comprising three parts: a sign (or sign bit) 202, an exponent (or exponent bits) 204, and a mantissa (or mantissa bits) 206, where unsigned floating-point numbers have no sign bit.
  • The floating-point numbers suitable for the multiplier of the present disclosure may include at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers.
  • In one or more embodiments, the floating-point format to which the technical solution of the present disclosure applies may be a format that conforms to the IEEE 754 standard, such as a double-precision floating-point number (float64, abbreviated "FP64"), a single-precision floating-point number (float32, abbreviated "FP32"), or a half-precision floating-point number (float16, abbreviated "FP16").
  • The floating-point format can also be the existing 16-bit brain floating-point number (bfloat16, abbreviated "BF16"), or a custom floating-point format, such as an 8-bit brain floating-point number (bfloat8, abbreviated "BF8"), an unsigned half-precision floating-point number (unsigned float16, abbreviated "UFP16"), or an unsigned 16-bit brain floating-point number (unsigned bfloat16, abbreviated "UBF16").
  • In one or more embodiments, the multiplier of the present disclosure can support the multiplication of two floating-point numbers in any of the above-mentioned formats (for example, one floating-point number being neuron data and the other being weight data), where the two floating-point numbers can have the same or different floating-point data formats.
  • For example, the multiplication operation between two floating-point numbers can be FP16*FP16, BF16*BF16, FP32*FP32, FP32*BF16, FP16*BF16, FP32*FP16, BF8*BF16, UBF16*UFP16, UBF16*FP16, or the like.
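  • To make the format structure of Fig. 2 concrete, the following sketch extracts the sign, exponent, and mantissa fields for the three formats most used below (the field widths are the standard IEEE 754 and bfloat16 values; this is an illustration, not the patent's circuit):

```python
import struct

# (sign bits, exponent bits, mantissa bits) for the formats discussed above
FORMATS = {
    "FP32": (1, 8, 23),
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
}

def decompose(bits, fmt):
    """Split a raw bit pattern into its (sign, exponent, mantissa) fields."""
    _, e_bits, m_bits = FORMATS[fmt]
    mantissa = bits & ((1 << m_bits) - 1)
    exponent = (bits >> m_bits) & ((1 << e_bits) - 1)
    sign = (bits >> (m_bits + e_bits)) & 1
    return sign, exponent, mantissa

# Example: the FP32 bit pattern of -1.5
bits = struct.unpack("<I", struct.pack("<f", -1.5))[0]
print(decompose(bits, "FP32"))  # (1, 127, 4194304): sign 1, biased exponent 127, fraction 0.5
```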
  • Fig. 3 is a schematic structural block diagram of a multiplier 300 according to an embodiment of the present disclosure.
  • As mentioned above, the multiplier of the present disclosure supports the multiplication of floating-point numbers in various data formats: one of the multiplier operand or the multiplicand can be the neuron data of the present disclosure, and the corresponding other can be the weight data of the present disclosure.
  • The aforementioned data formats can be indicated by the operation mode of the present disclosure, so that the multiplier works in one of a variety of operation modes.
  • In terms of structure, the multiplier of the present disclosure may generally include an exponent processing unit 302 and a mantissa processing unit 304, where the exponent processing unit processes the exponent bits of a floating-point number and the mantissa processing unit processes the mantissa bits of a floating-point number.
  • In some embodiments, the multiplier may further include a sign processing unit 306, which may be used to process floating-point numbers that include a sign bit.
  • In operation, the multiplier can perform floating-point multiplication on the received, input, or buffered first floating-point number and second floating-point number according to one of the operation modes, the first and second floating-point numbers each having one of the floating-point data formats discussed above. For example, when the multiplier is in the first operation mode, it can support the FP16*FP16 multiplication of two floating-point numbers, and when the multiplier is in the second operation mode, it can support the BF16*BF16 multiplication of two floating-point numbers.
  • Similarly, when the multiplier is in the third operation mode, it can support the FP32*FP32 multiplication of two floating-point numbers, and when the multiplier is in the fourth operation mode, it can support the FP32*BF16 multiplication of two floating-point numbers.
  • The correspondence between these example operation modes and floating-point types is shown in Table 2 below.

    Table 2
    Operation mode number | Floating-point types operated on
    1                     | FP16*FP16
    2                     | BF16*BF16
    3                     | FP32*FP32
    4                     | FP32*BF16
  • In one application scenario, the above-mentioned Table 2 may be stored in a memory of the multiplier, and the multiplier selects one of the operation modes in the table according to an instruction received from an external device; the external device may be, for example, the external device 1712 shown in Fig. 17.
  • Alternatively, the operation mode can also be input automatically via the mode selection unit 408 shown in Fig. 4.
  • For example, when both floating-point numbers are of type FP16, the mode selection unit can select the first operation mode for the multiplier according to the data formats of the two floating-point numbers, and when the two floating-point numbers are FP32 and BF16, the mode selection unit may select the fourth operation mode according to their data formats.
  • It can be seen that the different operation modes of the present disclosure are associated with corresponding floating-point data formats. That is, the operation mode of the present disclosure can be used to indicate the data format of the first floating-point number and the data format of the second floating-point number. In another embodiment, the operation mode can not only indicate the data formats of the first and second floating-point numbers, but can also indicate the data format of the result after the multiplication operation.
  • The operation modes of Table 2, extended in this way, are shown in Table 3 below, where each operation mode is extended by one digit to indicate the data format after the floating-point multiplication.
  • For example, when the multiplier works in operation mode 21, it performs a floating-point operation on two input BF16 floating-point numbers and outputs the multiplication result in the FP16 data format.
  • The above use of numbers for operation modes to indicate floating-point data formats is only exemplary and not restrictive; according to the teaching of the present disclosure, it is also conceivable to establish indexes within the operation mode to determine the formats of the multiplier operand and the multiplicand.
  • For example, the operation mode may include two indexes, the first index indicating the type of the first floating-point number and the second index indicating the type of the second floating-point number.
  • For example, the first index "1" in operation mode 13 indicates that the first floating-point number (or multiplicand) is in the first floating-point format, namely FP16, and the second index "3" indicates that the second floating-point number (or multiplier operand) is in the third floating-point format, namely FP32.
  • Further, a third index may be added to the operation mode to indicate the data format of the output result: for example, the third index "1" in operation mode 131 may indicate that the data format of the output result is the first floating-point format, namely FP16.
  • Similarly, when operation modes are conveyed by instructions, the instructions may include three fields, the first field indicating the data format of the first floating-point number, the second field indicating the data format of the second floating-point number, and the third field indicating the data format of the output result.
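  • A small sketch of the three-index mode encoding described above (the text fixes only the examples given, e.g. index "1" for FP16 and "3" for FP32; the mapping of index "2" to BF16 is an assumption consistent with Table 2):

```python
# Decode a three-digit operation mode such as 131 into
# (multiplicand format, multiplier format, output format).
FORMAT_INDEX = {1: "FP16", 2: "BF16", 3: "FP32"}  # index 2 -> BF16 is assumed

def decode_mode(mode):
    first, second, third = (int(d) for d in str(mode))
    return FORMAT_INDEX[first], FORMAT_INDEX[second], FORMAT_INDEX[third]

print(decode_mode(131))  # ('FP16', 'FP32', 'FP16')
```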
  • Fig. 4 is a block diagram showing a more detailed structure of a multiplier 400 according to an embodiment of the present disclosure. It can be seen that, in addition to the exponent processing unit 302, mantissa processing unit 304, and optional sign processing unit 306 shown in Fig. 3, the figure also shows the internal components these units can include and the units related to their operation; exemplary operations of these units are described in detail below with reference to Fig. 4.
  • In one embodiment, the exponent processing unit can be used to obtain the exponent after the multiplication operation from the exponent of the first floating-point number and the exponent of the second floating-point number according to the aforementioned operation mode.
  • In one embodiment, the exponent processing unit may be implemented by an addition and subtraction circuit.
  • For example, the exponent processing unit can add the exponent of the first floating-point number and the exponent of the second floating-point number, subtract the respective offset (bias) values of the corresponding input floating-point data formats, and then add the offset value of the output floating-point data format, so as to obtain the exponent after the multiplication of the first floating-point number and the second floating-point number.
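  • Under the conventional IEEE 754 biasing (FP16 bias 15, FP32 and BF16 bias 127), the exponent datapath can be sketched as follows (a simplification that ignores the mantissa-driven adjustment performed later by the regularization unit):

```python
BIAS = {"FP16": 15, "BF16": 127, "FP32": 127}  # standard exponent offsets

def multiply_exponent(e_a, fmt_a, e_b, fmt_b, fmt_out):
    """Biased exponent of the product, before regularization adjusts it."""
    true_exp = (e_a - BIAS[fmt_a]) + (e_b - BIAS[fmt_b])  # remove input biases
    return true_exp + BIAS[fmt_out]                        # apply output bias

# Example: 2.0 (FP16, stored exponent 16) * 2.0 (FP32, stored exponent 128) = 4.0
print(multiply_exponent(16, "FP16", 128, "FP32", "FP32"))  # 129, i.e. 2^(129-127) = 4
```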
  • Further, the mantissa processing unit of the multiplier can be used to obtain the mantissa after the multiplication operation according to the aforementioned operation mode, the first floating-point number, and the second floating-point number.
  • In one or more embodiments, the mantissa processing unit may include a partial product operation unit 412 and a partial product summation unit 414, where the partial product operation unit is used to obtain an intermediate mantissa result from the mantissa of the first floating-point number and the mantissa of the second floating-point number.
  • The intermediate mantissa result may be the multiple partial products obtained during the multiplication of the first and second floating-point numbers (as shown schematically in Figs. 6 and 7).
  • The partial product summation unit is used to perform an addition operation on the intermediate mantissa result to obtain an addition result, which serves as the mantissa after the multiplication operation.
  • In one or more embodiments, the present disclosure uses a Booth encoding circuit to pad the high and low bits of the mantissa of the second floating-point number (which serves as the multiplier operand in the floating-point operation) with 0 (padding the high bit with zero converts the mantissa from an unsigned number to a signed number) in order to obtain the intermediate mantissa result.
  • In other embodiments, the mantissa of the first floating-point number (which serves as the multiplicand in the floating-point operation) can also be so encoded (for example, padded with 0 in the high and low bits), or both mantissas can be encoded.
  • In one embodiment, the partial product summation unit may include an adder, which is used to add the intermediate mantissa results to obtain the summation result.
  • In another embodiment, the partial product summation unit includes a Wallace tree and an adder, where the Wallace tree is used to add the intermediate mantissa results to obtain a second intermediate mantissa result, and the adder is used to add the second intermediate mantissa result to obtain the addition result.
  • The adder may include at least one of a full adder, a serial adder, and a carry-lookahead adder.
  • In some embodiments, the mantissa processing unit may further include a control circuit 416 for invoking the mantissa processing unit multiple times when the mantissa bit width of at least one of the first or second floating-point numbers indicated by the operation mode is larger than what the mantissa processing unit can process at one time.
  • In one embodiment, the control circuit may be implemented so as to generate a control signal, for example by a counter or a control flag.
  • Correspondingly, the partial product summation unit may also include a shifter. When the mantissa processing unit is called multiple times, the shifter shifts the existing summation result and adds it to the summation result obtained in the current call to obtain a new summation result, with the new summation result obtained in the last call serving as the mantissa after the multiplication operation.
  • In one or more embodiments, the multiplier of the present disclosure further includes a regularization unit 418 and a rounding unit 420.
  • The regularization unit can be used to perform floating-point regularization on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, which are then used as the exponent and mantissa after the multiplication operation.
  • For example, the regularization unit can adjust the bit widths of the exponent and the mantissa to meet the requirements of the indicated output data format.
  • The regularization unit can also make other adjustments to the exponent or mantissa. For example, in some application scenarios, when the value of the mantissa is not 0, the most significant mantissa bit should be 1; otherwise, the exponent bits can be modified while the mantissa bits are shifted, so as to bring the number into normalized form.
  • In other scenarios, the regularization unit may also adjust the exponent according to the mantissa after the multiplication operation: for example, when the highest bit of the mantissa after the multiplication operation is 1, the exponent obtained after the multiplication operation can be increased by 1.
  • Next, the rounding unit may perform a rounding operation on the regularized mantissa result according to a rounding mode, and the mantissa after rounding is used as the final mantissa of the multiplication operation.
  • For example, the rounding unit may perform rounding operations including rounding down, rounding up, and rounding to the nearest significant digit.
  • In some scenarios, the rounding unit can also round in a 1 that is shifted out when the mantissa is shifted to the right.
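  • The combined effect of regularization and rounding can be sketched as follows for the product of two normalized mantissas (a simplified model using round-to-nearest only; the disclosure supports several rounding modes):

```python
def regularize_and_round(raw, exp, m_bits):
    """raw: integer product of two 1.m fixed-point mantissas (2*m_bits fraction bits)."""
    frac_bits = 2 * m_bits
    # Regularization (unit 418): the product of values in [1, 2) lies in [1, 4);
    # if it reached [2, 4), shift right by one and increment the exponent.
    if raw >> (frac_bits + 1):
        exp += 1
        frac_bits += 1
    # Rounding (unit 420): drop the extra fraction bits, rounding to nearest.
    drop = frac_bits - m_bits
    rounded = (raw + (1 << (drop - 1))) >> drop
    if rounded >> (m_bits + 1):   # rounding overflowed past 2.0: regularize again
        rounded >>= 1
        exp += 1
    return rounded, exp

# Example: 1.5 * 1.5 = 2.25 with 10-bit mantissas (FP16-like)
raw = (3 << 9) * (3 << 9)                 # 1.5 in 1.10 fixed point, squared
print(regularize_and_round(raw, 0, 10))   # (1152, 1): 1.125 * 2^1 = 2.25
```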
  • As mentioned above, the multiplier of the present disclosure may also optionally include a sign processing unit, which can be used to obtain the sign after the multiplication operation from the sign of the first floating-point number and the sign of the second floating-point number.
  • In one embodiment, the sign processing unit may include an exclusive-OR (XOR) logic circuit 422 for performing an XOR operation on the sign of the first floating-point number and the sign of the second floating-point number, so as to obtain the sign after the multiplication operation.
  • In other embodiments, the sign processing unit can also be implemented by a truth table or by logical judgment.
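  • The sign datapath amounts to a single XOR, as the sketch below shows: the product is negative exactly when the two input signs differ:

```python
def product_sign(sign_a: int, sign_b: int) -> int:
    return sign_a ^ sign_b  # XOR logic circuit 422

assert product_sign(0, 0) == 0  # positive * positive -> positive
assert product_sign(1, 0) == 1  # negative * positive -> negative
assert product_sign(1, 1) == 0  # negative * negative -> positive
```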
  • In some embodiments, the multiplier of the present disclosure may further include a normalization processing unit 424 which, when the first floating-point number or the second floating-point number is a non-normalized, non-zero floating-point number, normalizes that floating-point number according to the operation mode to obtain the corresponding exponent and mantissa.
  • For example, the normalization processing unit can be used to normalize FP16-type data into BF16-type data, so that the multiplier can operate in the second operation mode.
  • For another example, the normalization processing unit may also preprocess the mantissa of a normalized floating-point number, which carries an implicit leading 1, and the mantissa of a non-normalized floating-point number, which does not (for example, by extending the mantissa), to facilitate the subsequent operation of the mantissa processing unit.
  • The normalization processing unit 424 and the aforementioned regularization unit 418 can perform the same or similar operations in some embodiments; the difference is that the normalization processing unit 424 normalizes the input floating-point data, while the regularization unit 418 regularizes the mantissa and exponent to be output.
  • The multiplier of the present disclosure and its various embodiments have been described above with reference to Fig. 4. Based on the above description, those skilled in the art can understand that the solution of the present disclosure obtains the result of the multiplication operation (including the exponent, the mantissa, and the optional sign) through the operation of the multiplier. Depending on the application scenario, for example when the aforementioned regularization and rounding are not required, the results obtained by the mantissa processing unit and the exponent processing unit can be regarded as the operation result of the floating-point multiplier.
  • In scenarios that require regularization and rounding, the exponent and mantissa obtained after regularization and rounding can be regarded as the operation result of the floating-point multiplier, or as part of that result (when the final sign is also considered).
  • The solution of the present disclosure uses multiple operation modes to enable the multiplier to support operations on floating-point numbers of different types or data formats, so that the multiplier can be multiplexed, thereby saving chip-design overhead and reducing computational cost.
  • Further, the multiplier of the present disclosure also supports the computation of high-bit-width floating-point numbers.
  • Given the importance of processing the mantissa (also called the mantissa bits or the mantissa part) in multiplier operations, the mantissa operation of the present disclosure will be described below in conjunction with Fig. 5.
  • Fig. 5 is a schematic block diagram showing an operation 500 of a mantissa processing unit according to an embodiment of the present disclosure.
  • As mentioned above, the mantissa processing operation of the present disclosure mainly involves two units, namely the partial product operation unit and the partial product summation unit discussed in conjunction with Fig. 4.
  • Operationally, the mantissa processing can be roughly divided into a first stage and a second stage: in the first stage, the intermediate mantissa result is obtained, and in the second stage, the mantissa result output by the adder 508 is obtained.
  • In operation, the first floating-point number and the second floating-point number received by the multiplier may each be divided into multiple parts, namely the aforementioned sign (optional), exponent, and mantissa.
  • The mantissa parts of the two floating-point numbers enter the mantissa processing unit as input (such as the mantissa processing unit in Fig. 3 or Fig. 4), and specifically enter the partial product operation unit.
  • As shown in Fig. 5, the present disclosure uses the Booth encoding circuit 502 to perform zero-padding on the high and low bits of the mantissa of the second floating-point number (that is, the multiplier operand in the floating-point operation) and to perform Booth encoding, so that the partial product generating circuit 504 obtains the intermediate mantissa result.
  • The terms "first floating-point number" and "second floating-point number" here are only illustrative and not restrictive; in some application scenarios, the first floating-point number can be the multiplier operand and the second floating-point number can be the multiplicand, and the encoding operation can likewise be performed on the floating-point number serving as the multiplicand.
  • Regarding Booth encoding: generally, when two binary numbers are multiplied, the multiplication produces a large number of intermediate mantissa terms called partial products, which are then accumulated to obtain the final result of the multiplication of the two binary numbers.
  • The greater the number of partial products, the greater the area and power consumption of the array multiplier, the slower its execution speed, and the more difficult the circuit is to implement.
  • The purpose of Booth encoding is to effectively reduce the number of partial-product summations, thereby reducing the circuit area.
  • The algorithm first encodes the input multiplier operand according to corresponding rules. The encoding rules may be, for example, those shown in Table 4 below:

    Table 4
    y_{2i+1} y_{2i} y_{2i-1} | Encoded signal
    0        0      0        | 0
    0        0      1        | X
    0        1      0        | X
    0        1      1        | 2X
    1        0      0        | -2X
    1        0      1        | -X
    1        1      0        | -X
    1        1      1        | 0

  • In Table 4, y_{2i+1}, y_{2i}, and y_{2i-1} represent the bits of each group of sub-data to be encoded (i.e., of the multiplier operand), and X represents the mantissa of the first floating-point number (i.e., the multiplicand).
  • The encoded signals obtained after Booth encoding are thus of five types: -2X, 2X, -X, X, and 0.
  • Assuming the received multiplicand is the 8-bit datum "X7X6X5X4X3X2X1X0", the corresponding partial products can be obtained from these encoded signals.
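  • The following sketch applies the Table 4 rules in software (an illustration of radix-4 Booth encoding, not the patent's circuit): the zero-extended multiplier operand is scanned in overlapping 3-bit windows, each window selecting one of -2X, -X, 0, X, or 2X, and the resulting partial products sum to the full product:

```python
def booth_partial_products(x, y, y_bits):
    """Partial products of x * y (y treated as an unsigned y_bits-wide multiplier)."""
    y <<= 1                                # append the implicit y[-1] = 0 below the LSB
    partials = []
    for i in range((y_bits + 2) // 2):     # one window per pair of multiplier bits
        window = (y >> (2 * i)) & 0b111    # bits y[2i+1], y[2i], y[2i-1]
        # Table 4 collapses to: digit = -2*y[2i+1] + y[2i] + y[2i-1]
        digit = -2 * ((window >> 2) & 1) + ((window >> 1) & 1) + (window & 1)
        partials.append(digit * x << (2 * i))
    return partials

x, y = 0b10110101, 0b01101100              # 181 * 108
pp = booth_partial_products(x, y, 8)
print(sum(pp) == 181 * 108)                # True: the partial products sum to the product
```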
  • In one or more embodiments, the adder may be, for example, one or more full adders, half adders, or various combinations of the two.
  • As for the Wallace tree compressor (or Wallace tree for short), it is mainly used to sum the above-mentioned intermediate mantissa results (i.e., the multiple partial products) so as to reduce (i.e., compress) the number of partial-product accumulations.
  • In implementation, the Wallace tree compressor can adopt a carry-save (CSA) architecture and the Wallace tree algorithm.
  • The calculation speed of a Wallace tree array is much faster than that of a traditional carry-propagate adder array.
  • Specifically, the Wallace tree compressor can compute the sums of the partial products of each row in parallel; for example, the number of accumulations of N partial products can be reduced from N-1 to log2(N), which is significant for increasing the speed of the multiplier and making effective use of resources. According to different application needs, the Wallace tree compressor can be designed in many variants, such as the 7-2 Wallace tree, the 4-2 Wallace tree, and the 3-2 Wallace tree. In one or more embodiments, the present disclosure uses the 7-2 Wallace tree as an example for implementing its various floating-point operations, which will be described in detail later in conjunction with Figs. 5 and 6.
  • In one or more embodiments, the Wallace tree compressors disclosed herein may each be arranged to have M inputs and N outputs, and their number may be not less than K, where N is a preset positive integer less than M and K is a positive integer not less than the maximum bit width of the intermediate mantissa result.
  • For example, M can be 7 and N can be 2, which is the 7-2 Wallace tree described in detail below.
  • For example, K can be 48, which means that the number of Wallace trees is 48.
  • In some embodiments, one or more groups of the Wallace trees may be selected to add the intermediate mantissa results, where each group has X Wallace trees and X is the number of bits of the intermediate mantissa result.
  • Further, the Wallace trees within each group may have a sequential carry relationship, while there is no carry relationship between groups.
  • In this case, the Wallace tree compressors can be connected through carries: the carry output from a lower-order Wallace tree compressor (Cin in Fig. 7) is sent to the next higher-order Wallace tree, and the carry output (Cout) of that higher-order Wallace tree compressor is in turn received by a still higher-order Wallace tree compressor as its carry input.
  • When one or more Wallace tree compressors are selected from the multiple Wallace tree compressors, the selection can be arbitrary; for example, they can be connected in the order 0, 1, 2, 3, or in the order 0, 2, 4, 6, as long as the selected Wallace tree compressors observe the above-mentioned carry relationship.
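  • The essence of the compression stage can be sketched with whole-row full adders (a behavioral model of carry-save reduction, not the patent's 7-2 circuit): rows of partial products are repeatedly compressed 3-to-2 until two rows remain, which a single carry-propagate adder then sums:

```python
def csa(a, b, c):
    """One full-adder layer applied across whole rows: returns (sum row, carry row)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_sum(rows):
    rows = list(rows)
    while len(rows) > 2:                    # compression stage (Wallace tree)
        s, c = csa(rows.pop(), rows.pop(), rows.pop())
        rows += [s, c]
    return rows[0] + rows[1]                # final carry-propagate adder

partials = [0, -724, -2896, 23168]          # e.g. Booth partial products of 181 * 108
print(wallace_sum(partials) == sum(partials))  # True
```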
  • Assume, for example, that the multiplier supports a 32-bit input width (thus supporting two sets of 16-bit parallel multiplication operations), that each Wallace tree is a 7-2 Wallace tree compressor with 7 inputs (an example value of M above) and 2 outputs (an example value of N above), and that there are 48 Wallace trees (an example value of K above).
  • In this case, the 0th to 23rd Wallace trees (that is, the 24 Wallace trees of the first group) can complete the partial-product addition of the first group of multiplications, with the Wallace trees within the group connected by carries in sequence.
  • The 24th to 47th Wallace trees (that is, the 24 Wallace trees of the second group) can complete the partial-product addition of the second group of multiplications, with the Wallace trees within the group likewise connected by carries in sequence.
  • There is no carry relationship between the 23rd Wallace tree of the first group and the 24th Wallace tree of the second group; that is, there is no carry relationship between Wallace trees of different groups.
  • Finally, the compressed partial products are summed by the adder to obtain the result of the mantissa multiplication.
  • As for the adder, in one or more embodiments it may include one of a full adder, a serial adder, and a carry-lookahead adder, which is used to sum the final two rows of partial products obtained by the Wallace tree compressors to obtain the result of the mantissa multiplication.
  • Through the mantissa multiplication operation shown in Fig. 5, and in particular the exemplary use of Booth encoding and the Wallace tree, the result of the mantissa multiplication can be obtained efficiently: Booth encoding effectively reduces the number of partial-product summations, thereby reducing the circuit area, while the Wallace compression tree computes the row-wise partial-product sums in parallel, thereby increasing the speed of the multiplier.
  • Fig. 6 shows the partial products 600 obtained from the partial product generating circuit in the mantissa processing unit described in conjunction with Figs. 3 to 5, shown in the figure as four rows of white dots between the two dashed lines, where the white dots in each row represent one partial product.
  • To facilitate the execution of the compression operation, the number of bits can be expanded in advance.
  • The black dots in Fig. 6 represent the copied (sign-extended) most significant bit of each 9-bit partial product. It can be seen that each partial product is expanded and aligned to 16 (8+8) bits (that is, the 8-bit width of the multiplicand mantissa plus the 8-bit width of the multiplier mantissa).
  • Similarly, in the FP32*FP32 operation mode described later, each partial product is expanded to 38 (25+13) bits (that is, the 25-bit width of the multiplicand mantissa plus the 13-bit width of the multiplier mantissa).
  • Fig. 7 shows an operation flow and schematic block diagram 700 of a Wallace tree compressor according to an embodiment of the present disclosure.
  • The 7 partial products shown in Fig. 7 can be obtained by Booth encoding of the multiplier operand and the multiplicand; thanks to the Booth encoding algorithm, the number of partial products generated is reduced. For ease of understanding, a dashed frame in the partial-product portion of the figure identifies one Wallace tree comprising 7 elements, and arrows further show the process of compressing it from 7 elements to 2 elements.
  • The compression process (i.e., the addition process) can be implemented with the aid of full adders, each taking three elements as input and producing two elements as output (namely a sum "sum" and a carry "carry" toward the higher bit).
  • The schematic block diagram of the 7-2 Wallace tree compressor is shown on the right side of Fig. 7. It can be seen that the Wallace tree compressor takes 7 inputs from one column of partial products (the seven elements indicated in the dashed box on the left side of Fig. 7). In operation, the carry input of the Wallace tree in the 0th column is 0, and the carry output Cout of each Wallace tree serves as the carry input Cin of the next Wallace tree.
  • Through multi-level compression, a Wallace tree containing 7 elements can be compressed down to 2 elements.
  • In other words, this disclosure uses the 7-2 Wallace tree compressor to compress the 7 rows of partial products into two rows (that is, the second intermediate mantissa result of this disclosure), and then uses an adder (for example, a carry-lookahead adder) to obtain the mantissa result.
  • The following describes in detail how the multiplier of the present disclosure completes the first-stage operation in the four operation modes FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16, that is, up to the point where the Wallace tree compressor completes the summation of the intermediate mantissa results to obtain the second intermediate mantissa result.
  • In the FP16*FP16 operation mode, the mantissa of each floating-point number is 10 bits. Taking the implicit (hidden) 1 into account, the mantissa can be extended by 1 bit, giving 11 mantissa bits.
  • Since the mantissa is an unsigned number, when the Booth encoding algorithm is used, 1 bit of 0 can be appended at the high end (that is, a 0 is added in the high bit), so the total number of mantissa bits is 12.
  • The partial product generation circuit can then obtain 7 partial products in each of the high and low parts, of which the 7th partial product is 0.
  • The bit width of each partial product is 24 bits, and 48 7-2 Wallace trees can be used for the compression processing, with the carry between the 23rd and 24th Wallace trees being 0.
  • In the BF16*BF16 operation mode, the mantissa of each floating-point number is 7 bits. Considering that a non-normalized, non-zero number under the IEEE 754 standard can be expanded into a signed number, the mantissa can be expanded to 9 bits.
  • In this mode, the partial product generating circuit can obtain 7 partial products in each of the high and low parts, of which the 6th and 7th partial products are 0. The bit width of each partial product is 18 bits, and compression is performed using the two groups of 7-2 Wallace trees numbered 0-17 and 24-41, with the carry between the 23rd and 24th Wallace trees being 0.
  • In the FP32*FP32 operation mode, the mantissa of each floating-point number is 23 bits. Taking the implicit 1 into account, the mantissa can be expanded to 24 bits.
  • The multiplier of the present disclosure can be called twice in this operation mode to complete one operation.
  • Each mantissa multiplication is 25 bits * 13 bits; that is, the first floating-point number ina is extended by 1 zero bit to become a 25-bit signed number, and the 24-bit mantissa of the second floating-point number inb is divided into two 12-bit parts, each extended by 1 zero bit to obtain two 13-bit multiplier operands, denoted inb_high13 and inb_low13 for the high and low parts respectively.
  • The multiplier of the present disclosure is called the first time to calculate ina*inb_low13, and called the second time to calculate ina*inb_high13.
  • In each call, Booth encoding generates 7 partial products, each with a bit width of 38 bits, which are compressed by the 0th to 37th 7-2 Wallace trees.
  • In the FP32*BF16 operation mode, the mantissa of the first floating-point number ina is 23 bits and the mantissa of the second floating-point number inb is 7 bits.
  • The mantissas can be expanded to 25 bits and 9 bits respectively, and a 25-bit x 9-bit multiplication is performed to obtain 7 partial products, of which the 6th and 7th partial products are 0. The bit width of each partial product is 34 bits, and compression is performed by the 0th to 33rd Wallace trees.
  • in some embodiments, the aforementioned mantissa processing unit may further include a control circuit, which may be used to call the mantissa processing unit multiple times according to the operation mode when the mantissa bit width of the first floating-point number and/or the second floating-point number indicated by the operation mode is greater than the bit width that a single call of the mantissa processing unit can process.
  • the partial product summation circuit may further include a shifter, which is used when the mantissa processing unit is called multiple times according to the operation mode: when a sum result already exists, the shifter shifts the existing sum result, which is then added to the sum result obtained by the current call to obtain a new sum result, and the new sum result is used as the sum result.
  • the mantissa processing unit can be called twice in the FP32*FP32 operation mode. Specifically, in the first call to the mantissa processing unit, the partial products of the mantissa bits (that is, of ina*inb_low13) are added by the carry look-ahead adder in the second stage to obtain the second low-order mantissa intermediate result, and in the second call, the partial products (that is, of ina*inb_high13) are added in the second stage by the carry look-ahead adder to obtain the second high-order mantissa intermediate result.
  • the second low-order mantissa intermediate result and the second high-order mantissa intermediate result can be accumulated through the shift operation of the shifter to obtain the mantissa after the multiplication operation.
  • the shift operation can be expressed as follows: the second high-order mantissa intermediate result sum_h[37:0] is shifted to the left by 12 bits and accumulated with the second low-order mantissa intermediate result sum_l[37:0], i.e., result = (sum_h << 12) + sum_l.
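  • As a hedged illustration (variable names mine), the following Python sketch mirrors the two-call FP32*FP32 mantissa flow: inb's 24-bit mantissa is split into two 12-bit halves, each call multiplies ina by one zero-extended 13-bit half, and the shifter recombines the two 38-bit intermediate results exactly as in the expression above.

```python
def fp32_mantissa_multiply(ina: int, inb: int) -> int:
    """Two-call 24x24-bit mantissa multiply built from 25x13-bit pieces."""
    assert 0 <= ina < (1 << 24) and 0 <= inb < (1 << 24)
    inb_low13 = inb & 0xFFF            # low 12 bits, zero-extended to 13 bits
    inb_high13 = inb >> 12             # high 12 bits, zero-extended to 13 bits
    sum_l = ina * inb_low13            # first call:  ina * inb_low13
    sum_h = ina * inb_high13           # second call: ina * inb_high13
    return (sum_h << 12) + sum_l       # shifter: sum_h[37:0] << 12, then add

assert fp32_mantissa_multiply(0xFFFFFF, 0xABCDEF) == 0xFFFFFF * 0xABCDEF
```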
  • for simplicity, FIG. 5 does not depict other units, such as the exponent processing unit and the sign processing unit.
  • an overall description of the multiplier of the present disclosure will be given below in conjunction with FIG. 8. The previous description of the mantissa processing unit is also applicable to the arrangement depicted in FIG. 8.
  • FIG. 8 is an overall schematic block diagram showing a multiplier 800 according to an embodiment of the present disclosure. It should be understood that the positions, existence, and connection relationships of the various units depicted in the figure are only exemplary and not restrictive. For example, some of the units can be integrated together, while others can be separated, omitted, or replaced depending on the application scenario.
  • the multiplier of the present disclosure can be exemplarily divided into a first stage and a second stage in the operation of each operation mode according to the operation flow, as shown by the dotted line in the figure.
  • the first stage outputs the calculation result of the sign bit, the intermediate calculation result of the exponent bits, and the intermediate calculation result of the mantissa bits (covering, for example, the Booth encoding process and the Wallace tree compression process of the aforementioned fixed-point multiplication of the input mantissas).
  • the second stage regularizes and rounds the exponent and mantissa to output the calculation result of the exponent and the calculation result of the mantissa.
  • the multiplier of the present disclosure may include a mode selection unit 802 and a normalization processing unit 804, wherein the mode selection unit may select an operation mode according to an input mode signal (in_mode).
  • the input mode signal may correspond to the operation mode number in Table 2.
  • for example, the multiplier can be made to work in the FP16*FP16 operation mode, and when the input mode signal indicates the operation mode number "3" in Table 2, the multiplier operates in the FP32*FP32 operation mode.
  • for simplicity, FIG. 8 only shows four exemplary operation modes: FP16*FP16, BF16*BF16, FP32*FP32, and FP32*BF16.
  • the multiplier of the present disclosure also supports many other different operation modes.
  • the normalization processing unit may be configured to, when the first floating-point number or the second floating-point number is a denormalized non-zero floating-point number, normalize it according to the operation mode to obtain the corresponding exponent and mantissa, for example, by regularizing the floating-point number in the data format indicated by the operation mode according to the IEEE754 standard.
  • the multiplier includes a mantissa processing unit to perform a multiplication operation of the first floating-point number mantissa and the second floating-point number mantissa.
  • the mantissa processing unit may include a bit number expansion circuit 806, a Booth encoder 808, a partial product generation circuit 810, a Wallace tree compressor 812, and an adder 814, where the bit number expansion circuit can be used to expand the mantissa, taking into account denormalized non-zero numbers under the IEEE754 standard, so that it is suitable for the operation of the Booth encoder. Since the Booth encoder, the partial product generation circuit, the Wallace tree compressor, and the adder have been described in detail with reference to FIGS. 5-7, the same description applies equally here and therefore will not be repeated.
  • the multiplier of the present disclosure further includes a regularization unit 816 and a rounding unit 818, which have the same functions as the units shown in FIG. 4.
  • the regularization unit can perform floating-point regularization processing on the addition result and the exponent data from the exponent processing unit according to the data format indicated by the output mode signal "out_mode" shown in FIG. 8, to obtain the regularized exponent result and the regularized mantissa result.
  • for example, the regularization unit can adjust the bit widths of the exponent and the mantissa so that they meet the requirements of the aforementioned indicated data format.
  • for example, the regularization unit can repeatedly shift the mantissa to the left by 1 bit and subtract 1 from the exponent until the highest bit of the mantissa is 1.
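  • A minimal sketch of this left-normalization loop (assumed 48-bit working width; the helper name is mine):

```python
def normalize(mantissa: int, exponent: int, width: int = 48):
    """Shift the mantissa left 1 bit at a time, decrementing the exponent,
    until the highest bit is 1 (a zero mantissa is returned unchanged)."""
    if mantissa == 0:
        return mantissa, exponent
    while not (mantissa >> (width - 1)) & 1:
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent
```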
  • as for the rounding unit, in one embodiment it can be used to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain the rounded mantissa, and use the rounded mantissa as the mantissa after the multiplication operation.
  • the aforementioned output mode signal may be a part of the operation mode, and is used to indicate the data format after the multiplication operation.
  • the output mode signal may be combined with the aforementioned input mode signal and provided to the mode selection unit. Based on the combined mode signal, the mode selection unit can determine the data formats of the input data and the output result in the initial stage of the multiplier's operation without separately providing the output mode signal to the regularization unit, which can further simplify the operation.
  • the following five rounding modes can be exemplarily included.
  • mantissa rounding in the "rounding" mode: for example, two 24-bit mantissas are multiplied to obtain a 48-bit mantissa (bits 47-0). After normalization, only bits 46 to 24 are taken for output. When bit 23 of the mantissa is 0, bits (23-0) are simply discarded; when bit 23 of the mantissa is 1, a 1 is carried into bit 24 and bits (23-0) are discarded.
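  • A short sketch of this "rounding" mode on a 48-bit mantissa product (function name mine; an overflow of the increment past the kept field would need a further normalization step, which is omitted here):

```python
def round_mantissa_48(mant48: int) -> int:
    """Keep bits [46:24]; if bit 23 is 1, carry a 1 into bit 24 first."""
    kept = (mant48 >> 24) & ((1 << 23) - 1)   # bits 46..24 (23 bits)
    if (mant48 >> 23) & 1:                    # bit 23 decides the round-up
        kept += 1                             # may overflow; renormalize then
    return kept
```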
  • the multiplier of the present disclosure further includes an exponent processing unit 820 and a sign processing unit 822, where the exponent processing unit can be used to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
  • for example, the exponent processing circuit can add the exponent bit data of the first floating-point number and the exponent bit data of the second floating-point number, subtract the respective bias (offset) values of the corresponding input floating-point data types, and add the bias value of the output floating-point data type, to obtain the exponent bit data of the product of the first floating-point number and the second floating-point number.
  • in other words, the exponent processing unit 820 can be implemented as, or include, an addition and subtraction circuit, and can be used to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
  • the sign processing unit 822 may, in one embodiment, be implemented as or in the form of an exclusive-OR circuit, and is used to perform an exclusive-OR operation on the sign bit data of the first floating-point number and the second floating-point number to obtain the sign bit data of their product.
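  • A hedged sketch of the exponent and sign handling for an FP16*FP16 multiplication whose product is emitted as FP32 (IEEE754 biases 15 and 127; the function and parameter names are mine, not the disclosure's):

```python
def product_exponent(e_a: int, e_b: int,
                     bias_a: int = 15, bias_b: int = 15,
                     bias_out: int = 127) -> int:
    # stored_out = (e_a - bias_a) + (e_b - bias_b) + bias_out
    return e_a + e_b - bias_a - bias_b + bias_out

def product_sign(s_a: int, s_b: int) -> int:
    return s_a ^ s_b    # exclusive-OR of the two sign bits

assert product_exponent(15, 15) == 127   # 1.x * 1.y with both true exponents 0
assert product_sign(1, 1) == 0           # negative * negative -> positive
```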
  • as described above, the multiplier of the present disclosure supports operations in multiple operation modes, thereby overcoming the defect of prior-art multipliers that support only a single floating-point format. Furthermore, since the multiplier of the present disclosure can be multiplexed, it also supports floating-point data of large bit widths, which reduces operation cost and overhead. In one or more embodiments, the multiplier of the present disclosure may also be arranged in, or included in, an integrated circuit chip or a computing device to implement multiplication operations on floating-point numbers in multiple operation modes.
  • FIG. 9 is a flowchart illustrating a method 900 for performing a floating-point multiplication operation using a multiplier according to an embodiment of the present disclosure. It is understood that the multiplier described here is the multiplier described in detail above in conjunction with FIG. 2 to FIG. 8; therefore, the previous description of the multiplier and of its internal composition, functions, and operations also applies to the description here.
  • the method 900 may include, at step S902, using the exponent processing unit of the multiplier to obtain the exponent after the multiplication operation according to the operation mode, the exponent of the first floating-point number, and the exponent of the second floating-point number.
  • the operation mode can be one of a variety of operation modes and can be used to indicate the data format of the input floating-point numbers. In one or more embodiments, the operation mode can also be used to determine the data format of the floating-point number of the output result.
  • the method 900 may use the mantissa processing unit of the multiplier to obtain the mantissa after the multiplication operation according to the operation mode, the first floating-point number, and the second floating-point number.
  • the present disclosure uses the Booth coding algorithm and the Wallace tree compressor in some preferred embodiments, so as to improve the efficiency of the mantissa processing.
  • the method 900 may also, at step S906, use the sign processing unit of the multiplier to obtain the sign after the multiplication operation according to the sign of the first floating-point number and the sign of the second floating-point number.
  • FIG. 10 is another schematic block diagram showing a computing device 1000 according to an embodiment of the present disclosure.
  • apart from the newly added first type conversion unit 1002, the computing device 1000 can have the same composition, structure, and functional units (such as the addition module 108 and the update module 112) as the computing device 100 described in conjunction with FIG. 1; therefore, the foregoing description of the computing device 100 is also applicable to the computing device 1000.
  • as for the added first type conversion unit, it can be applied in scenarios where the first adder in the addition module does not support multiple data types (or formats) and data type conversion is therefore required.
  • it may be configured to convert the data type (or data format) of the product result, so that the adder performs the addition operation.
  • the product result may be the product result obtained by the floating-point multiplier of the aforementioned multiplication unit.
  • the data type of the product result may be, for example, one of the aforementioned FP16, BF16, FP32, UBF16, or UFP16.
  • the data type conversion can be performed by means of the first type conversion unit, so that the result is suitable for the addition operation of the adder.
  • for example, the first type conversion unit can be configured to perform the following exemplary steps on FP16 type data to convert it into FP32 type data: S1: shift the sign bit to the left by 16 bits; S2: add 112 (the difference between the exponent bias 127 of FP32 and the bias 15 of FP16) to the exponent and shift it to the left by 13 bits (right-aligned); and S3: shift the mantissa to the left by 13 bits (left-aligned).
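  • A minimal sketch of the S1-S3 conversion on the raw bit patterns (normal numbers only; zeros, subnormals, infinities, and NaNs would need extra handling that the steps above do not cover):

```python
import struct

def fp16_bits_to_fp32_bits(h: int) -> int:
    sign = (h & 0x8000) << 16                  # S1: sign bit up 16 positions
    exp = (((h >> 10) & 0x1F) + 112) << 23     # S2: re-bias by 127-15, align
    mant = (h & 0x3FF) << 13                   # S3: left-align the mantissa
    return sign | exp | mant

# 0x3C00 is 1.0 in FP16; the converted bits should decode to 1.0 in FP32.
as_float = struct.unpack('<f', struct.pack('<I', fp16_bits_to_fp32_bits(0x3C00)))[0]
assert as_float == 1.0
```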
  • FIG. 11 is a schematic block diagram showing an adder group 1100 according to an embodiment of the present disclosure. As shown schematically in the figure, it is a three-level tree-structured adder group, where the first stage includes four first adders 1102 of the present disclosure, which exemplarily receive the inputs of eight FP32 floating-point numbers, such as in0, in1, ..., in7. The second stage includes two first adders 1104, which exemplarily receive the inputs of four FP16 floating-point numbers. The third stage includes only one first adder 1106, which can receive the inputs of two FP16 floating-point numbers and output the summation result of the aforementioned eight FP32 floating-point numbers.
  • to this end, the present disclosure proposes to provide one or more inter-stage second type conversion units 1108 between the first adders of the first stage and those of the second stage.
  • the second type conversion unit may have the same or a similar function as the first type conversion unit 1002 described in conjunction with FIG. 10, that is, it converts the input floating-point data into a data type consistent with the subsequent addition operation.
  • the second type conversion unit can support one or more data type conversions according to different application requirements. For example, in the example shown in FIG. 11, it can support one-way data type conversion from FP32 type data to FP16 type data.
  • the second type conversion unit may be designed to support bidirectional data type conversion between FP32 type data and FP16 type data. In other words, it can not only support data type conversion from FP32 type data to FP16 type data, but also support data type conversion from FP16 type data to FP32 type data.
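  • For illustration only, the following numpy sketch emulates the three-level adder group of FIG. 11, with the second type conversion units narrowing the first-stage FP32 sums to FP16 before the remaining two levels add in FP16 (numpy dtypes stand in for the hardware adders and converters):

```python
import numpy as np

def adder_group_1100(inputs):                  # eight FP32 values
    stage1 = [np.float32(a) + np.float32(b)    # four first adders 1102
              for a, b in zip(inputs[0::2], inputs[1::2])]
    stage1 = [np.float16(x) for x in stage1]   # second type conversion 1108
    stage2 = [stage1[0] + stage1[1],           # two first adders 1104
              stage1[2] + stage1[3]]
    return stage2[0] + stage2[1]               # final first adder 1106

print(adder_group_1100([np.float32(i + 0.5) for i in range(8)]))  # 32.0
```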
  • in some application scenarios, the first type conversion unit 1002 of FIG. 10 or the second type conversion unit 1108 of FIG. 11 may also be configured to support two-way conversion between the various floating-point data formats described in the aforementioned combined operation modes. Such two-way conversion helps to maintain forward or backward compatibility of the data during the data processing of the present disclosure, and further expands the application scenarios and scope of application of the solution of the present disclosure.
  • the above-mentioned type conversion unit is only an optional solution of this disclosure.
  • for example, when the first or second adder itself supports addition in multiple data formats, or can be multiplexed to process operations in multiple data formats, no such type conversion unit is needed.
  • similarly, when the data format supported by the second adder is the same as the data format of the output data of the first adder, there is no need to provide such a type conversion unit between the two.
  • FIG. 12 is a schematic block diagram showing an adder group 1200 according to an embodiment of the present disclosure. As can be seen from the figure, it schematically shows a five-level tree-structured adder group, which specifically includes 16 first adders in the first stage, 8 first adders in the second stage, 4 first adders in the third stage, 2 first adders in the fourth stage, and 1 first adder in the fifth stage. From the multi-level tree structure, the adder group shown in FIG. 12 can be regarded as an extension of the tree structure shown in FIG. 11; conversely, the adder group shown in FIG. 11 can be regarded as a part or component of the adder group shown in FIG. 12, such as the part framed by the dashed line 1202 in FIG. 12.
  • the 16 adders of the first stage can receive the product results from the multiplication unit.
  • the product result may be a floating point number converted by the first type conversion unit 1002 shown in FIG. 10.
  • when the aforementioned product result has the same data type as that supported by the first-stage adders of the adder group 1200, it can be directly input into the adder group 1200 without passing through the first type conversion unit, as with the 32 FP32 floating-point numbers (such as in0-in31) shown in FIG. 12.
  • 16 summation results can be obtained as the input of the 8 first adders in the second stage.
  • the summation results output by the two first adders of the fourth stage are finally input to the single first adder of the fifth stage, and the output of the fifth-stage adder can serve as the aforementioned intermediate result.
  • the intermediate result can undergo one of the following operations:
  • when the intermediate result is the intermediate result obtained by calling the multiplication unit in the first round, it can be input to the adder of the update module mentioned above and then cached in the register of the update module, to be added to the intermediate result obtained when the multiplication unit is called in the second round;
  • when the intermediate result is an intermediate result obtained by calling the multiplication unit in an intermediate round (for example, when more than two rounds of operations are performed), it can be input to the adder of the update module, added to the previous round's summation result fed from the register of the update module to that adder, and stored in the register as the summation result of this intermediate round of addition operations; or
  • when the intermediate result is the intermediate result obtained by calling the multiplication unit in the last round, it can be input to the adder of the update module, added to the previous round's summation result fed from the register of the update module to the adder, and stored in the register as the final summation result.
  • although FIG. 12 arranges multiple adders in the form of a tree hierarchy to complete the addition of multiple numbers, the solution of the present disclosure is not limited to this.
  • those skilled in the art can also arrange multiple adders in other suitable structures or manners according to the teachings of the present disclosure, for example, by connecting multiple full adders, half adders, or other types of adders in series or in parallel to achieve the addition of multiple input floating-point numbers.
  • the addition tree structure shown in FIG. 12 does not show the second type conversion unit shown in FIG. 11.
  • those skilled in the art can think of arranging one or more inter-stage second type conversion units in the multi-stage adder shown in FIG. 12 to realize the conversion of data types between different levels.
  • the scope of application of the computing device of the present disclosure is further expanded.
  • FIG. 13 and 14 are respectively a flowchart and a schematic block diagram showing a neural network operation 1300 according to an embodiment of the present disclosure.
  • FIGS. 13 and 14 take the convolution operation in a neural network (involving the convolution kernel, as one kind of the weight data of the present disclosure, and neuron data) as an example for illustration. It can be understood that the convolution operation can occur at multiple layers of a neural network, such as the convolutional layer and the fully connected layer.
  • in the process of computing the convolution operation (for example, an image convolution), both the convolution kernel and the neuron data are multiplexed. Specifically, in the multiplexing of the convolution kernel, the same convolution kernel performs inner products with different neuron data as it slides over the neuron data block; in the multiplexing of neuron data, different convolution kernels perform inner products with the same piece of neuron data. Therefore, in order to avoid repeated transfer and reading of data during the convolution calculation and to save power consumption, the computing device of the present disclosure can reuse neuron and convolution kernel data over multiple rounds of calculation.
  • in some embodiments, the input terminal of the computing device of the present disclosure may include at least two input ports that support multiple data bit widths, and the register in the update module may include multiple sub-registers to store the intermediate results obtained in each round of operation. Based on such an arrangement, the computing device may be configured to divide and multiplex the neuron data and weight data according to the bit width of the input ports, respectively, to perform neural network operations.
  • for example, when each convolution kernel and the corresponding neuron data are 2048 bits wide, each can be divided into four 512-bit vectors, and the computing device will thus perform four rounds of operations to obtain a complete output result.
  • the number of intermediate results per round may be determined based on the number of times the neuron data is multiplexed and the number of times the convolution kernel is multiplexed. For example, this number can be obtained by calculating the product of the neuron multiplexing count and the convolution kernel multiplexing count.
  • the maximum number of multiplexing times can be determined according to the number of registers (or sub-registers) in the update module. For example, if the number of sub-registers is n and the current neuron multiplexing count is m (m≤n), the maximum multiplexing count of the convolution kernel is floor(n/m), where the floor function represents rounding n/m down. For example, when the number of sub-registers in the update module is 8 and the current neuron multiplexing count is 2, the maximum multiplexing count of the convolution kernel is 4 (i.e., floor(8/2)).
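  • A tiny sketch of this bookkeeping (names mine): with n sub-registers and a neuron multiplexing count m, the kernel can be multiplexed at most floor(n/m) times, and each round then fills m * kernel_reuse sub-registers with intermediate results.

```python
def max_kernel_reuse(n_subregisters: int, neuron_reuse: int) -> int:
    return n_subregisters // neuron_reuse     # floor(n/m)

assert max_kernel_reuse(8, 2) == 4            # the example above
```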
  • based on the bit width of the input ports and the length of the input data, it can be determined that the multiplication unit and accumulation module of the computing device of the present disclosure need to perform four consecutive rounds of operations, in which the neuron data is multiplexed 2 times and the convolution kernel data is multiplexed 4 times; after the update module has been updated over the 4 rounds of operations, the final convolution result is output.
  • the method 1300 buffers neuron data and convolution kernel data.
  • for example, two pieces of 512-bit neuron data and two pieces of 512-bit convolution kernel data can be read and buffered in a buffer.
  • the two pieces of 512-bit neuron data can be the neuron data "1-512 bits" and "2-512 bits" shown in the first block at the top left of FIG. 14, and the two pieces of 512-bit convolution kernel data may be the "first convolution kernel" and the "second convolution kernel" shown in the first block at the upper right of FIG. 14.
  • the method 1300 may perform a multiply-accumulate operation on the first 512-bit neuron data and the first 512-bit convolution kernel data, and then store the obtained first partial sum as the first intermediate result in sub-register 0. For example, the 512-bit neuron data and convolution kernel data are received through the two input interfaces of the computing device, their multiplication is performed in the floating-point multiplier of the multiplication unit, and the result is then input to the adder for the addition operation to obtain the intermediate result. Finally, the first intermediate result is stored in the first sub-register of the update module, that is, sub-register 0.
  • the method 1300 may then perform a multiply-accumulate operation on the first 512-bit neuron data and the second 512-bit convolution kernel data, and store the obtained second partial sum as the second intermediate result in sub-register 1, as shown in FIG. 14. Since in this example the neuron data is multiplexed twice, i.e., each neuron vector participates in two calculations, the calculation for the first 512-bit neuron data is now completed.
  • next, the method 1300 may read the third 512-bit neuron data to overwrite the first 512-bit neuron data.
  • the method 1300 can then perform the multiply-accumulate operation of the second 512-bit neuron data and the first 512-bit convolution kernel data, and store the obtained third partial sum as the third intermediate result in sub-register 2.
  • the method 1300 may perform a multiply-accumulate operation on the second 512-bit neuron data and the second 512-bit convolution kernel data, and store the obtained fourth partial sum as the fourth intermediate result in sub-register 3.
  • the method 1300 then reads the fourth 512-bit neuron data to overwrite the second 512-bit neuron data.
  • the method 1300 can perform the convolution operation (that is, the multiply-accumulate operation) of the third 512-bit neuron data and the first 512-bit convolution kernel data, and store the obtained fifth partial sum as the fifth intermediate result in sub-register 4.
  • the method 1300 may perform the convolution operation of the third 512-bit neuron data and the second 512-bit convolution kernel data, and store the obtained sixth partial sum as the sixth intermediate result in sub-register 5.
  • the method 1300 may perform the convolution operation of the fourth 512-bit neuron data and the first 512-bit convolution kernel data, and store the obtained seventh partial sum as the seventh intermediate result in sub-register 6.
  • the method 1300 may perform the convolution operation of the fourth 512-bit neuron data and the second 512-bit convolution kernel data, and store the obtained eighth partial sum as the eighth intermediate result in sub-register 7.
  • at this point, the method 1300 has completed the first round of multiplexed operations on the neuron data and convolution kernel data.
  • in this example, the neuron data and the convolution kernel are both 2048 bits in size; that is, each convolution kernel and the corresponding neuron data consist of four 512-bit vectors, so the update module needs to be updated 4 times for a complete output; in other words, the computing device performs a total of 4 rounds of calculations.
  • in the second round, the second block of neuron data on the left side of FIG. 14 (that is, the four pieces of neuron data labeled 5-512-bit, 6-512-bit, 7-512-bit, and 8-512-bit) and the "512-bit third convolution kernel" and "512-bit fourth convolution kernel" on the right undergo operations similar to steps S1202-S1220, and the intermediate results obtained are accumulated through the update module.
  • after this round, what is stored in sub-register 0 to sub-register 7 is the summation result, that is, the result obtained after performing the addition operation on the intermediate results stored in the first round and the intermediate results obtained in the second round.
  • for example, what is stored in sub-register 0 is the summation result of the first intermediate result of the first round of operations and the corresponding intermediate result of the second round of operations.
  • the computing device of the present disclosure will continue to perform the third and fourth round operations.
  • in the third round, the computing device completes the calculation of the third block of neuron data on the left side of FIG. 14 (that is, the 9-512-bit, 10-512-bit, 11-512-bit, and 12-512-bit pieces shown) with the corresponding convolution kernel data on the right.
  • the 8 intermediate results obtained in the third round are updated into sub-register 0 to sub-register 7 through the update module, respectively, to be added to the summation results obtained after the second round, yielding the summation results after the third round of operations.
  • finally, in the fourth round, the computing device completes the convolution and update operations of the fourth block of neuron data on the left side of FIG. 14 (that is, the 13-512-bit, 14-512-bit, 15-512-bit, and 16-512-bit pieces) with the "512-bit 7th convolution kernel" and "512-bit 8th convolution kernel" on the right. The 8 intermediate results obtained in the fourth round are respectively updated into sub-register 0 to sub-register 7 through the update module, to be added to the summation results obtained after the third round, yielding the summation results of the fourth round. The summation results at this point are the final complete 8 calculation results of this example, which can be output through sub-register 0 to sub-register 7 respectively.
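  • The schedule above can be condensed into the following simulation sketch (random stand-in data and my own variable names; each 2048-bit neuron/kernel is modeled as four slices of 32 values, each round multiplexes 4 neuron slices by 2 kernel slices into 8 partial sums, and the update module accumulates them across the 4 rounds in 8 sub-registers):

```python
import numpy as np

rng = np.random.default_rng(0)
neurons = rng.random((4, 4, 32))   # rounds x neuron slices x values per slice
kernels = rng.random((4, 2, 32))   # rounds x kernel slices x values per slice

subregs = np.zeros(8)              # sub-register 0 .. sub-register 7
for r in range(4):                 # four rounds of operations
    for n in range(4):             # each neuron slice is multiplexed twice
        for k in range(2):         # each kernel slice is multiplexed 4 times
            partial = np.dot(neurons[r, n], kernels[r, k])  # multiply + add tree
            subregs[2 * n + k] += partial                   # update module
print(subregs)                     # the 8 final results of this example
```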
  • the above describes, by way of example, how the computing device of the present disclosure completes a neural network operation by multiplexing convolution kernel and neuron data. It should be understood that the above examples are only illustrative and in no way limit the solution of the present disclosure. Those skilled in the art can modify the multiplexing scheme according to the teachings of the present disclosure, for example, by setting a different number of sub-registers and selecting input ports that support different bit widths.
  • FIG. 15 is a flowchart illustrating a method 1500 for performing a neural network operation using a computing device according to an embodiment of the present disclosure.
  • the computing device described here is the computing device described above in conjunction with FIGS. 1 to 14 and includes the floating-point multiplier described in detail above; therefore, the foregoing description of the computing device and the floating-point multiplier, and of their internal composition, functions, and operation, also applies here.
  • the method 1500 may include receiving at least one weight data and at least one neuron data of a neural network operation to be performed at step S1502.
  • the at least one weight data and the at least one neuron data may have a floating-point number data format.
  • the at least one weight data and the at least one neuron data may have the data format indicated by the aforementioned operation mode.
  • the operation mode may use a primary or secondary index to indicate the floating-point data formats of the weight data and the neuron data.
  • the method 1500 may use a multiplication unit including at least one floating-point multiplier to perform a multiplication operation in a neural network operation on at least one weight and at least one neuron data to obtain a corresponding product result.
  • the floating-point multiplier here is the floating-point multiplier described above in conjunction with FIG. 2 to FIG. 9; it supports multiple operation modes and multiplexing so as to perform multiplication operations on floating-point input data in different data formats, thereby obtaining the product result of the weight data and the neuron data.
  • the method 1500 uses the addition module to perform an addition operation on the product result to obtain an intermediate result.
  • the addition module can be realized by multiple adders such as full adders, half adders, ripple-carry adders, and carry look-ahead adders, which can be connected in various suitable forms, such as adder arrays or the multi-level tree structures shown in FIG. 11 and FIG. 12.
  • the method 1500 uses the update module to perform multiple summation operations on the multiple intermediate results generated to output the final result of the neural network operation.
  • the update module may include a second adder and a register, where the second adder may be configured to repeatedly perform the following operations until the summation of all the multiple intermediate results is completed:
  • receive the intermediate result from the addition module and the previous summation result of the previous summation operation from the register; add the intermediate result and the previous summation result to obtain the summation result of this summation operation; and use the summation result of this summation operation to update the previous summation result stored in the register.
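  • A minimal sketch of this accumulate loop (class and variable names mine; the register caches the running summation result and the second adder folds in each new intermediate result):

```python
class UpdateModule:
    def __init__(self):
        self.register = 0.0                           # previous summation result

    def update(self, intermediate: float) -> float:
        self.register = self.register + intermediate  # second adder
        return self.register

um = UpdateModule()
for intermediate in (1.5, 2.25, 3.0):                 # one per round
    final = um.update(intermediate)
print(final)                                          # 6.75
```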
  • the computing device of the present disclosure can call the multiplication unit multiple times to achieve support for neural network operations with a large amount of data.
  • FIG. 16 is a structural diagram showing a combined processing device 1600 according to an embodiment of the present disclosure.
  • the combined processing device 1600 includes the computing device described in conjunction with FIGS. 1-15, such as the computing device 1602 shown in the figure.
  • the combined processing device also includes a universal interconnection interface 1604 and other processing devices 1606.
  • the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • the other processing device may include one or more types of general-purpose and/or special-purpose processors such as a central processing unit (“CPU"), a graphics processing unit (“GPU”), and an artificial intelligence processor.
  • the number of processors is not limited but determined according to actual needs.
  • the other processing device can serve as an interface between the computing device of the present disclosure (which can be embodied as an artificial intelligence computing device) and external data and control, performing basic control such as moving data into and out of the computing device and starting or stopping the computing device; the other processing device can also cooperate with the machine learning computing device to jointly complete computing tasks.
  • the universal interconnection interface can be used to transmit data and control commands between the computing device and other processing devices.
  • the computing device can obtain required input data from other processing devices via the universal interconnection interface, and write the input data to the on-chip storage device of the computing device.
  • the computing device can obtain control instructions from other processing devices via the universal interconnection interface, and write them into the on-chip control buffer of the computing device.
  • the universal interconnection interface can also read the data in the storage module of the computing device and transmit it to other processing devices.
  • the combined processing device may further include a storage device 1608, which may be connected to the computing device and the other processing device respectively.
  • the storage device may be used to store the data of the computing device and the other processing device, and is especially suitable for data to be computed that cannot be entirely held in the internal storage of the computing device or the other processing device.
  • in some embodiments, the combined processing device of this disclosure can be used as an SoC (system-on-chip) for devices such as mobile phones, robots, drones, and video capture and video surveillance equipment, thereby effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
  • the universal interconnection interface of the combined processing device is connected to some parts of the equipment.
  • Some components here can be, for example, a camera, a monitor, a mouse, a keyboard, a network card or a wifi interface.
  • the present disclosure also discloses a chip (or integrated circuit chip), which includes the aforementioned computing device or combined processing device.
  • a chip packaging structure which includes the above-mentioned chip.
  • the present disclosure also discloses a board card, which includes the above-mentioned chip packaging structure.
  • the board may also include other supporting components.
  • the supporting components may include, but are not limited to: a storage device 1704, an interface device 1706, and a control device 1708.
  • the storage device is connected to the chip in the chip packaging structure through a bus for storing data.
  • the storage device may include multiple sets of storage units 1710. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM ("Double Data Rate SDRAM", double-rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of storage units, and each group may include a plurality of DDR4 chips (granules). In an embodiment, the chip may internally include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking.
  • each group of the storage unit may include a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the chip in the chip packaging structure.
  • the interface device is used to implement data transmission between the chip and an external device 1712 (for example, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface, and the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the chip to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, "MCU").
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate and control the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or device, which includes the above-mentioned board.
  • the electronic equipment or devices can include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
  • Clause A1 a computing device for performing neural network operations, including: an input terminal configured to receive at least one weight data and at least one neuron data of the neural network operation to be performed; a multiplication unit including at least one floating-point multiplier configured to perform the multiplication operation in the neural network operation on the at least one weight data and the at least one neuron data to obtain a corresponding product result; an addition module configured to perform an addition operation on the product result to obtain an intermediate result; and an update module configured to perform multiple summation operations on the multiple generated intermediate results to output the final result of the neural network operation.
  • Clause A2 the computing device according to clause A1, wherein the at least one weight data and the at least one neuron data are data of the same or different data types.
  • Clause A3 the computing device further includes: a first type conversion unit configured to convert the data type of the product result so that the addition module performs the addition operation.
  • Clause A4 the computing device according to any one of clauses A1-A3, wherein the addition module includes multi-level adder groups arranged in a multi-level tree structure, and each level of adder group includes one or more first adders.
  • Clause A5 the computing device according to any one of clauses A1-A4, further comprising one or more second type conversion units arranged in the multi-level adder group, which are configured to convert the data output by one level of adder group into data of another type for the addition operation of the adder group at the next level.
  • Clause A6 the computing device according to any one of clauses A1-A5, wherein after the multiplication unit outputs the product result, it receives the next pair of the at least one weight data and at least one neuron data to perform the multiplication operation, and after the addition module outputs the intermediate result, it receives the next product result from the multiplication unit to perform the addition operation.
  • Clause A7 the computing device according to any one of clauses A1-A6, wherein the update module includes a second adder and a register, and the second adder is configured to repeatedly perform the following operations until the summation of all the multiple intermediate results is completed: receiving the intermediate result from the addition module and the previous summation result of the previous summation operation from the register; adding the intermediate result and the previous summation result to obtain the summation result of this summation operation; and using the summation result of this summation operation to update the previous summation result stored in the register.
  • Clause A8 the computing device according to any one of clauses A1-A7, wherein the input terminal includes at least two input ports that support multiple data bit widths, the register includes a plurality of sub-registers, and the computing device is configured to divide and multiplex the neuron data and weight data according to the bit width of the input ports to perform neural network operations.
  • Clause A9 the computing device according to any one of clauses A1-A8, wherein the multiplication unit, the addition module, and the update module are configured to perform multiple rounds of operations according to the division and multiplexing, wherein: in each round of operations, the obtained intermediate results are stored in the corresponding sub-registers and the sub-registers are updated by the update module; and in the last round of operations, the final result of the neural network operation is output from the plurality of sub-registers.
  • Clause A10 the computing device according to any one of clauses A1-A9, wherein the number of result items of the final result is based on the number of times of multiplexing of neuron data and the number of times of multiplexing of weight data.
  • Clause A11 The computing device according to any one of clauses A1-A10, wherein the maximum number of times of multiplexing is based on the number of the plurality of sub-registers.
  • Clause A12 the computing device according to any one of clauses A1-A11, wherein the computing device includes n of the sub-registers, the neuron multiplexing count is m, and the maximum multiplexing count of the weight data is floor(n/m), where m is equal to or less than n, and the floor function indicates rounding n/m down.
  • Clause A13 the computing device according to any one of clauses A1-A12, wherein the floating-point multiplier is configured to perform a multiplication operation on the at least one neuron data and the at least one weight data according to an operation mode, wherein the at least one neuron data and the at least one weight data include at least respective exponents and mantissas, and the floating-point multiplier includes: an exponent processing unit for obtaining the exponent after the multiplication operation according to the operation mode, the exponent of the at least one neuron data, and the exponent of the at least one weight data; and a mantissa processing unit for obtaining the mantissa after the multiplication operation according to the operation mode, the at least one neuron data, and the at least one weight data, wherein the operation mode is used to indicate the data format of the at least one neuron data and the data format of the at least one weight data.
  • Clause A14 the computing device according to clause A13, wherein the operation mode is also used to indicate the data format after the multiplication operation.
  • Clause A15 the computing device according to any one of clauses A12-A14, wherein the data format includes at least one of half-precision floating-point numbers, single-precision floating-point numbers, brain floating-point numbers, double-precision floating-point numbers, and custom floating-point numbers One kind.
  • Clause A16 the computing device according to any one of clauses A12-A15, wherein the at least one neuron data and the at least one weight data further include respective signs, and the floating-point multiplier further includes: a sign processing unit configured to obtain the sign after the multiplication operation according to the sign of the at least one neuron data and the sign of the at least one weight data.
  • Clause A17 the computing device according to any one of clauses A12-A16, wherein the sign processing unit includes an exclusive-OR logic circuit, which is used to perform an exclusive-OR operation on the sign of the at least one neuron data and the sign of the at least one weight data to obtain the sign after the multiplication operation.
  • Clause A18 the computing device according to any one of clauses A12-A17, further comprising: a normalization processing unit configured to, when the at least one neuron data or the at least one weight data is a denormalized non-zero floating-point number, normalize the at least one neuron data or the at least one weight data according to the operation mode to obtain the corresponding exponent and mantissa.
  • Clause A19 the computing device according to any one of clauses A12-A18, wherein the mantissa processing unit includes a partial product operation unit and a partial product summation unit, wherein the partial product operation unit is configured to obtain the intermediate results of the mantissa according to the mantissa of the at least one neuron data and the mantissa of the at least one weight data, and the partial product summation unit is used to add the intermediate results of the mantissa to obtain the summation result and use the summation result as the mantissa after the multiplication operation.
  • Clause A20 the computing device according to any one of clauses A12-A19, wherein the partial product operation unit includes a Booth encoding circuit, which is used to pad the high and low bits of the mantissa of the at least one weight data with 0 and perform Booth encoding to obtain the intermediate result of the mantissa.
  • Clause A21 the computing device according to any one of clauses A12-A20, wherein the partial product summation circuit includes an adder, which is used to add the intermediate results of the mantissa to obtain the summation result.
  • Clause A22 the computing device according to any one of clauses A12-A21, wherein the partial product summation circuit includes a Wallace tree and an adder, wherein the Wallace tree is used to add the intermediate results of the mantissa to obtain a second mantissa intermediate result, and the adder is used to add the second mantissa intermediate result to obtain the summation result.
  • Clause A23 the computing device according to any one of clauses A12-A22, wherein the adder includes at least one of a full adder, a serial adder, and a carry look-ahead adder.
  • each of the Wallace trees has M inputs and N outputs, and the number of Wallace trees is not less than N*K, wherein N is a preset positive integer less than M, and K is a positive integer not less than the maximum bit width of the intermediate result of the mantissa.
  • Clause A26 the computing device according to any one of clauses A12-A25, wherein the partial product summation circuit is used to select N groups of the Wallace trees to sum the intermediate results according to the operation mode, where there are X Wallace trees in each group, and X is the number of bits of the intermediate result of the mantissa.
  • the Wallace trees within each group have a sequential carry relationship, and the Wallace trees in different groups have no carry relationship.
  • Clause A27 the computing device according to any one of clauses A12-A26, wherein the mantissa processing unit further includes a control circuit, which is used to call the mantissa processing unit multiple times according to the operation mode when the bit width of the at least one neuron data or the at least one weight data indicated by the operation mode exceeds the bit width that a single call of the mantissa processing unit can process.
  • Clause A28 the computing device according to any one of clauses A12-A27, wherein the partial product summation circuit further includes a shifter; when the control circuit calls the mantissa processing unit multiple times according to the operation mode, in each call the shifter is used to shift the existing summation result, which is added to the summation result obtained in the current call to obtain a new summation result, and the new summation result obtained in the last call is used as the mantissa after the multiplication operation.
  • Clause A29 the computing device according to any one of clauses A12-A28, wherein the floating-point multiplier further includes a regularization unit for performing floating-point regularization processing on the mantissa and exponent after the multiplication operation to obtain a regularized exponent result and a regularized mantissa result, and using the regularized exponent result and the regularized mantissa result as the exponent after the multiplication operation and the mantissa after the multiplication operation.
  • Clause A30 the computing device according to any one of clauses A12-A29, wherein the floating-point multiplier further includes a rounding unit configured to perform a rounding operation on the regularized mantissa result according to a rounding mode to obtain the rounded mantissa, and use the rounded mantissa as the mantissa after the multiplication operation.
  • Clause A31 the computing device according to any one of clauses A12-A30, further comprising: a mode selection unit configured to select, from multiple operation modes supported by the floating-point multiplier, the operation mode indicating the data formats of the at least one neuron data and the at least one weight data.
  • Clause A32 a method for performing neural network operations, including: using an input terminal to receive at least one weight data and at least one neuron data of the neural network operation to be performed; using a multiplication unit including at least one floating-point multiplier to perform the multiplication operation in the neural network operation on the at least one weight data and the at least one neuron data to obtain a corresponding product result; using an addition module to perform an addition operation on the product result to obtain an intermediate result; and using an update module to perform multiple summation operations on the multiple generated intermediate results to output the final result of the neural network operation.
  • Clause A33 an integrated circuit chip including the computing device according to any one of clauses A1-A31.
  • Clause A34 an integrated circuit device, including the computing device according to any one of clauses A1-A31.
  • the disclosed device can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software program module.
  • when the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • based on this understanding, the technical solution can be embodied as a computer software product stored in a memory, which includes several instructions enabling a computer device (which can be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory ("ROM"), random access memory ("RAM"), removable hard disk, magnetic disk, or optical disc.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined” or “in response to determination” or “once detected [described condition or event]” depending on the context ]” or “in response to detection of [condition or event described]”.

Abstract

The present invention relates to a computing apparatus, method, integrated circuit chip, and integrated circuit device for performing a neural network operation, where the computing apparatus may be included in a combined processing apparatus that may further include a universal interconnect interface and other processing apparatuses. The computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus connected to the computing apparatus and the other processing apparatuses, respectively, and used for storing data of the computing apparatus and the other processing apparatuses. The solution of the present invention can be widely applied to all kinds of floating-point data operations.

Description

用于神经网络运算的计算装置、方法、集成电路和设备
相关申请的交叉引用
本申请要求于2019年10月25日申请的,申请号为201911023669.1,名称为“用于神经网络运算的计算装置、方法、集成电路和设备”的中国专利申请的优先权,在此将其全文引入作为参考。
技术领域
本披露一般地涉及数据处理领域。更具体地,本披露涉及用于神经网络运算的计算装置、方法、集成电路芯片和设备。
背景技术
当前的神经网络中涉及到权值数据(例如卷积数据)与神经元数据的运算操作,其中包括大量的乘加操作。该乘加操作的效率往往取决于所使用的乘法器的执行速度。尽管当前的乘法器在执行效率方面获得了显著提高,但在处理浮点类型数据方面,其还存在提升的空间。另外,神经网络运算中还会涉及到前述权值数据和神经元数据的处理操作,而当前对于这些的数据处理并没有很好的运算机制,从而造成神经网络运算的低效。
发明内容
为了至少部分地解决背景技术中提到的技术问题,本披露的方案提供了一种用于执行神经网络运算的计算装置、方法、集成电路芯片和集成电路设备,从而有效执行神经网络运算,并实现对权值数据和神经元数据的高效复用。
在一个方面中,本披露公开了一种用于执行神经网络运算的计算装置,包括:输入端,其配置用于接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;乘法单元,其包括至少一个浮点乘法器,所述浮点乘法器配置用于对所述至少一个权值数据和所述至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果;加法模块,其配置用于对所述乘积结果执行加法操作,以获得中间结果;以及更新模块,其配置用于执行针对产生的所述多个中间结果的多次求和操作,以输出所述神经网络运算的最终结果。
在另一个方面中,本披露公开了一种用于执行神经网络运算的方法,包括:接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;利用包括至少一个浮点乘法器的乘法单元对所述至少一个权值数据和所述至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果;利用加法模块对所述乘积结果执行加法操作,以获得中间结果;以及利用更新模块针对产生的所述多个中间结果执行多次求和操作,以输出所述神经网络运算的最终结果。
在又一方面中,本披露公开了一种集成电路芯片和集成电路设备,该集成电路芯片包括前述用于执行神经网络运算的计算装置,而集成电路设备包括该集成电路芯片。
通过利用本披露的包括乘法单元的计算装置、方法、集成电路芯片和集成电路设备,可以高效地执行神经网络运算,特别是神经网络中的卷积运算。另外,在执行神经网络运算中,本披露还支持权值数据和神经元数据的复用,从而避免过多的数据迁移和存储,提高了运算效率并降低了运算的成本。
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1是示出根据本披露实施例的计算装置的示意框图;
图2是示出根据本披露实施例的浮点数据格式的示意图;
图3是示出根据本披露实施例的乘法器的示意性结构框图;
图4是示出根据本披露实施例的乘法器的更多细节的结构框图;
图5是示出根据本披露实施例的尾数处理单元的示意性框图;
图6是示出根据本披露实施例的部分积操作的示意图;
图7是示出根据本披露实施例的华莱士树压缩器的操作流程和示意框图;
图8是示出根据本披露实施例的乘法器的整体示意框图;
图9是示出根据本披露实施例的使用乘法器执行浮点数乘法运算的方法的流程图；
图10是示出根据本披露实施例的计算装置的另一示意框图;
图11是示出根据本披露实施例的加法器组的示意框图;
图12是示出根据本披露实施例的加法器组的又一示意框图;
图13是示出根据本披露实施例的执行神经网络运算的流程图;
图14是示出根据本披露实施例的神经网络运算的示意图;
图15是示出根据本披露实施例的利用计算装置执行神经网络运算的流程图;
图16是示出根据本披露实施例的一种组合处理装置的结构图;以及
图17是示出根据本披露实施例的一种板卡的结构示意图。
具体实施方式
现在将参考附图描述实施例。应当理解,为了说明的简单和清楚,在认为合适的情况下,可以在附图中重复附图标记以指示对应或类似的元件。另外,本申请阐述了许多具体细节以便提供对本文所述实施例的透彻理解。然而,本领域普通技术人员将理解,可以在没有这些具体细节的情况下实践本文描述的实施例。在其他情况下,没有详细描述公知的方法、过程和组件,以免模糊本文描述的实施例。而且,该描述不应被视为限制本文描述的实施例的范围。
本披露的技术方案利用包括一个或多个浮点乘法器的乘法单元来执行包括权值数据和神经元数据之间的乘法操作,并且对获得的乘积结果进行加法操作和更新操作,从而获得最终结果。本披露的方案不仅通过乘法单元提高了乘法操作的效率,还通过更新操作对最终结果前的多个中间结果进行存储,以实现对权值数据和神经元数据的高效复用。
下面将结合附图对本披露所公开的多个实施例进行详细地描述。
图1是示出根据本披露实施例的计算装置100的示意框图。如前所述,该计算装置可以用于执行神经网络运算,特别是对权值数据和神经元数据进行处理,以获得所期望的运算结果。在一个实施例中,当所述神经网络是用于图像的卷积神经网络时,该权值数据可以是卷积核数据,而神经元数据可以是例如图像的像素数据或前层运算操作后的输出数据。
如图1所示,该计算装置包括输入端102,其配置用于接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据。在一个实施例中,当本披露的计算装置用于图像数据处理时,该输入端可以接收来自于图像捕获装置所捕获到的图像数据,该图像捕获装置例如可以是各类图像传感器、照相机、摄像机、移动智能终端、平板计算机等图像采集设备,而采集到的像素数据或经过初步处理的像素数据可以作为本披露的神经元数据。
在一个实施例中,上述的权值数据和神经元数据可以具有相同或不同类型的数据格式,例如具有相同或不同的浮点数格式。进一步,在一个或多个实施例中,输入端可以包括用于数据格式转换的一个或多个第一类型转换单元,用于将接收到的权值数据或神经元数据转换成乘法单元104所支持的数据格式。例如,当乘法单元支持包括半精度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种的数据格式时,输入端中的格式转换单元可以将接收到的神经元数据和权值数据转换成前述数据格式之一,以适应乘法单元执行乘法操作的要求。关于本披露所支持的各种数据格式或类型以及对数据格式的转换,将在下文讨论本披露的浮点乘法器时进行详细地描述。
如图所示出的,本披露的乘法单元可以包括至少一个浮点乘法器106,该浮点乘法器可以配置用于对前述至少一个权值数据和至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果。在一个或多个实施例中,本披露的浮点乘法器可以支持多种运算模式中的一种运算模式的乘法操作,而该运算模式可以用于指示参与乘法操作的神经元数据和权值数据的数据格式。例如,当神经元数据和权值数据都是半精度浮点数时,浮点乘法器可以以第一运算模式来执行操作,而当神经元数据是半精度浮点数并且权值数据是单精度浮点数时,则浮点乘法器可以以第二运算模式来执行乘法操作。关于本披露的浮点乘法器的细节,稍后将结合附图进行详细的描述。
当通过本披露的乘法单元获得乘积结果后,该乘积结果可以传送至加法模块108,该加法模块可以配置用于对所述乘积结果执行加法操作,以获得中间结果。在一个或多个实施例中,该加法模块可以是多个加法器形成的加法器组,该加法器组可以形成树状的结构。例如,加法器包括以多级树状结构方式排列的多级加法器组,每级加法器组包括一个或多个第一加法器110,该第一加法器例如可以是浮点加法器。另外,由于本披露的浮点乘法器是支持多模式运算的乘法器,因此本披露的加法模块中的加法器也可以是支持多种加法运算模式的加法器。例如,当浮点乘法器的输出是半精度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的一种数据格式时,本披露的前述加法模块中的第一加法器也可以是支持上述任意一种数据格式的浮点数的浮点加法器。换句话说,本披露的方案对于第一加法器的类型不做任何的限制,并且任意能够支持加法操作的装置、器件或设备都可以用于充当这里的加法器,以实现加法操作并获得中间结果。
在获得中间结果后,本披露的计算装置还可以包括更新模块112,其配置用于执行针对产生的多个中间结果的多次求和操作,以输出所述神经网络运算的最终结果。在一些实施例中,当针对一次神经网络运算需要多次调用乘法单元时,则每次调用乘法单元并且通过加法模块所获得的结果就可以视为相对于所述最终结果的中间结果。
为了实现这样的多个中间结果的多次求和操作和对所得到的求和结果的保存操作,在一个或多个实施例中,该更新模块可以包括第二加法器114和寄存器116。考虑到前述加法模块中的第一加法器可以是支持多种模式的浮点加法器,与之相对应,更新模块中的第二加法器也可以具有与第一加法器相同或相类似的性质,即也同样支持多种模式的浮点数加法操作。而当第一加法器或第二加法器并不支持多种浮点数据格式的加法运算时,本披露还公开了第一或第二类型转换单元,用于执行数据类型或格式间的转换,从而同样使得可以利用第一或第二加法器执行多种运算模式的浮点数相加,即可以利用第一或第二加法器以多种运算模式来执行浮点数相加。关于该类型转换单元,稍后将结合附图11进行详细描述。
在示例性的操作中,第二加法器可以配置用于重复地执行以下操作,直至完成对全部所述多个中间结果的求和操作:接收来自于所述加法器(例如加法器108)的中间结果和来自于寄存器(即寄存器116)的、前次求和操作的前次求和结果;将所述中间结果和所述前次求和结果进行相加,以获得本次求和操作的求和结果;以及利用本次求和操作的求和结果来更新寄存器中存储的前次求和结果。当输入端没有新的数据输入时或者乘法单元完成了全部的相乘操作后,将输出寄存器中保存的结果作为神经网络运算的最终结果。
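为帮助理解上述更新流程，下面给出一个极简的C++行为级草图（假设中间结果以float表示，函数名为本文为说明而虚构，并非本披露方案的具体实现）：

```cpp
#include <vector>

// 示意：更新模块的"第二加法器+寄存器"累加行为
float run_update_module(const std::vector<float>& intermediate_results) {
    float reg = 0.0f;                     // 寄存器：保存前次求和结果，初值为0
    for (float mid : intermediate_results) {
        float sum = mid + reg;            // 第二加法器：中间结果 + 前次求和结果
        reg = sum;                        // 用本次求和结果更新寄存器
    }
    return reg;                           // 全部中间结果处理完毕后，输出寄存器中的最终结果
}
```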
在一些实施例中，所述输入端可以包括具有支持多个数据位宽的至少两个输入端口，并且所述寄存器包括多个子寄存器，所述计算装置配置用于根据所述输入端口位宽对所述神经元数据和权值数据分别进行划分和复用，以执行神经网络运算。在一些应用场景中，所述至少两个输入端口可以是两个支持k*n比特位宽的端口，其中k是最小位宽的数据类型的整数倍，例如k=16、32、64、……等，而n是输入数据的个数，例如n=1、2、3、……等。例如，当k为32，n为16时，则输入数据可以是512比特的位宽。在该情况下，一个端口的输入数据可以是一个包括16个FP32（单精度浮点数）的数据项，也可以是一个包括32个FP16（半精度浮点数）的数据项，还可以是一个包括32个BF16（脑浮点数）的数据项。以上述输入端口为512比特位宽、权值数据大小为2048比特的BF16的数据为例，可以将2048比特的权值数据划分为4个512比特长度的数据，从而调用4次所述乘法单元和更新模块，并在第四次更新模块更新完毕后输出最终的运算结果。
基于上面的描述,本领域技术人员可以理解本披露的上述乘法单元、加法模块及更新模块均可以独立且并行地操作。例如,乘法单元输出乘积结果后,便接收下一对神经元数据和权值数据,以进行乘法操作,无需等待后级(例如加法模块和更新模块)均运行完毕再接收处理。同样地,加法模块输出中间结果后,便接收下一个来自乘积单元的乘积结果以进行加法操作。可以看出,本披露方案的并行操作方式提升了运算的效率。此处的“后级”不仅仅指后一个级别,还可以指多级流水运算操作中后面的若干级操作。
以上结合图1对本披露的计算装置的整体操作进行了描述,通过利用该计算装置可以实现高效的神经网络运算。特别地,通过利用支持多种运算模式的浮点乘法器的操作,所述计算装置可以实现在神经网络中对多种数据格式的浮点数相乘操作。下面将结合图2-图9对本披露的浮点乘法器进行详细地描述。
图2是示出根据本披露实施例的浮点数据格式200的示意图。如图2中所示,可以应用本披露技术方案的神经元数据和权值数据可以是浮点数,并且可以包括三个部分,例如符号(或符号位)202、指数(或指数位)204和尾数(或尾数位)206,其中对于无符号的浮点数则可以不存在符号或符号位。在一些实施例中,适用于本披露乘法器的浮点数可以包括半精度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种。具体来说,在一些实施例中,可以应用本披露技术方案的浮点数格式可以是符合IEEE754标准的浮点格式,例如双精度浮点数(float64,简写为“FP64”)、单精度浮点数(float32,简写“FP32”)或半精度浮点数(float16,简写“FP16”)。在另外一些实施例中,浮点数格式也可以是现有的16位脑浮点数(bfloat16,简写“BF16”),也可以是自定义的浮点数格式,例如8位脑浮点数(bfloat8,简写“BF8”)、无符号半精度浮点数(unsigned float16,简写“UFP16”)、无符号16位脑浮点数(unsigned bfloat16,简写“UBF16”)。为了便于理解,下面的表1示出上述的部分数据格式,其中的符号位宽、指数位宽和尾数位宽仅用于示例性的说明目的。
表1
数据类型 符号位宽 指数位宽 尾数位宽
FP16 1 5 10
BF16 1 8 7
FP32 1 8 23
BF8 1 5 3
UFP16 0 5(或6) 11(或10)
UBF16 0 8 8
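为便于对照，下面用一段C++代码示意性地整理表1中的格式参数（结构名与常量名为本文为说明而假设）：

```cpp
#include <cstdint>

// 浮点格式参数：符号位宽/指数位宽/尾数位宽，数值与表1一致
struct FpFormat {
    const char* name;
    uint8_t sign_bits;
    uint8_t exp_bits;
    uint8_t man_bits;
};

constexpr FpFormat kFormats[] = {
    {"FP16",  1, 5, 10},
    {"BF16",  1, 8, 7},
    {"FP32",  1, 8, 23},
    {"BF8",   1, 5, 3},
    {"UFP16", 0, 5, 11},  // 表1中亦可为指数6位/尾数10位
    {"UBF16", 0, 8, 8},
};
```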
对于上面所提到的各种浮点数格式,本披露的乘法器在操作中至少可以支持具有任意上述格式的两个浮点数(例如,其中一个浮点数是神经元数据,而另一浮点数是权值数据)之间的相乘操作,其中两个浮点数可以具有相同或不同的浮点数据格式。例如,两个浮点数之间的相乘操作可以是FP16*FP16、BF16*BF16、FP32*FP32、FP32*BF16、FP16*BF16、FP32*FP16、BF8*BF16、UBF16*UFP16或UBF16*FP16等两个浮点数之间的相乘操作。
图3是示出根据本披露实施例的乘法器300的示意性结构框图。如前所述,本披露的乘法器支持各种数据格式的浮点数的相乘操作,其中乘数或被乘数中的一个可以是本披露的神经元数据,而对应的另一个可以是本披露的权值数据。前述的数据格式可以通过本披露的运算模式来指示,以使得乘法器工作在多种运算模式之一。
如图3中所示,本披露的乘法器总体上可以包括指数处理单元302和尾数处理单元304,其 中指数处理单元用于处理浮点数的指数位,而尾数处理单元用于处理浮点数的尾数位。可选地或附加地,在一些实施例中,当乘法器处理的浮点数具有符号位时,乘法器还可以包括符号处理单元306,该符号处理单元可以用于处理包括符号位的浮点数。
在操作中,所述乘法器可以根据运算模式之一对接收、输入或缓存的第一浮点数和第二浮点数执行浮点运算,该第一浮点数和第二浮点数具有如前所讨论的浮点数据格式之一。例如,当乘法器处于第一运算模式中,其可以支持两个浮点数FP16*FP16的乘法运算,而当乘法器处于第二运算模式中,其可以支持两个浮点数BF16*BF16的乘法运算。类似地,当乘法器处于第三运算模式中,其可以支持两个浮点数FP32*FP32的乘法运算,而当乘法器处于第四运算模式中,其可以支持两个浮点数FP32*BF16的乘法运算。这里,示例的运算模式和浮点数对应关系如下表2所示。
表2
运算模式编号 运算浮点数类型
1 FP16*FP16
2 BF16*BF16
3 FP32*FP32
4 FP32*BF16
在一个实施例中,上述的表2可以存储于乘法器的一个存储器中,并且乘法器根据从外部设备接收到的指令来选择表中的运算模式之一,而该外部设备例如可以是图17中示出的外部设备1712。在另一个实施例中,该运算模式的输入也可以经由如图4中所示的模式选择单元408来自动地实现。例如,当两个FP16型的浮点数输入到本披露的乘法器时,模式选择单元可以根据该两个浮点数的数据格式而选择乘法器工作于第一运算模式中。又例如,当一个FP32型浮点数和一个BF16型浮点数输入到本披露的乘法器时,模式选择单元可以根据该两个浮点数的数据格式而选择乘法器工作于第四运算模式中。
可以看出,本披露的不同运算模式与对应的浮点型数据相关联。也就是说,本披露的运算模式可以用于指示第一浮点数的数据格式和第二浮点数的数据格式。在另一个实施例中,本披露的运算模式不仅可以指示第一浮点数的数据格式和第二浮点数的数据格式,还可以用于指示乘法运算后的数据格式。结合表2扩展的运算模式在下表3中示出。
表3
（表3原文为图像，未能从版面中完整提取。按正文描述，表3在表2的基础上将运算模式编号扩展一位，用于指示浮点乘法运算后的输出数据格式，例如运算模式21表示对BF16*BF16执行乘法并以FP16格式输出。）
与表2中所示的运算模式编号不同,表3中的运算模式扩展一位以用于指示浮点乘法运算后的数据格式。例如,当乘法器工作于运算模式21中,其对输入的BF16*BF16两个浮点数执行浮点运算,并且将浮点乘法运算后以FP16的数据格式输出。
上面以编号形式的运算模式来指示浮点数据格式仅仅是示例性的而非限制性的,根据本披露的教导,也可以想到根据运算模式建立索引以确定乘数和被乘数的格式。例如,运算模式包括 两个索引,第一个索引用于指示第一浮点数的类型,第二个索引用于指示第二浮点数的类型,例如运算模式13中的第一索引“1”指示第一浮点数(或称被乘数)为第一浮点格式,即FP16,而第二索引“3”指示第二浮点数(或称乘数)为第二浮点格式,即FP32。进一步,也可以对运算模式增加第三索引,该第三索引指示输出结果的数据格式,例如对于运算模式131中的第三索引“1”,其可以指示输出结果的数据格式是第一浮点格式,即FP16。当运算模式数目增加时,可以根据需要增加相应的索引或索引的层级,以便于对运算模式和数据格式之间关系的确立。
另外,尽管这里示例性地以数字编号来指代运算模式,在其他的例子中,也可以根据应用需要以其他的符号或编码来对运算模式进行指代,例如通过字母、符号或数字及其结合等等,并且通过这样的字母、数字、符号或其组合的表达来指代运算模式并标识出第一浮点数、第二浮点数和输出结果的数据格式。另外,当这些表达以指令形式形成时,该指令可以包括三个域或字段,第一域用于指示第一浮点数的数据格式,第二域用于指示第二浮点数的数据格式,而第三域用于指示输出结果的数据格式。当然,这些域也可以被合并于一个域,或增加新的域以用于指示更多的与浮点数据格式相关的内容。可以看出,本披露的运算模式不仅可以与输入的浮点数数据格式相关联,也可以用于规格化输出结果,以获得期望数据格式的乘积结果。
图4是示出根据本披露实施例的乘法器400的更多细节结构框图。从图4所示内容可以看出,其不仅包括图3中所示出的指数处理单元302、尾数处理单元304和可选的符号处理单元306,还示出这些单元可以包括的内部组件以及与这些单元操作相关的单元,下面结合图4来具体描述这些单元的示例性操作。
为了执行浮点数的乘法运算,例如本披露的神经元数据和权值数据之间的乘法运算,指数处理单元可以用于根据前述的运算模式、第一浮点数的指数和第二浮点数的指数获得乘法运算后的指数。在一个实施例中,该指数处理单元可以通过加减法电路来实现。例如,此处的指数处理单元可以用于将第一浮点数的指数、第二浮点数的指数和各自对应的输入浮点数据格式的偏移值相加,并且接着减去输出浮点数据格式的偏移值,以获得第一浮点数和第二浮点数的乘法运算后的指数。
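以IEEE754风格的含偏置存储指数为例，下面给出指数处理的一种可能实现的C++草图；需要说明的是，偏移值的加减方向取决于所采用的指数是否已去偏，正文的表述与下述草图在各自假设下相一致，具体以实际电路约定为准：

```cpp
#include <cstdint>

// 示意：由两个输入的存储指数计算输出格式下的存储指数（未处理上溢/下溢）
// ea、eb为含偏置的存储指数，bias_a、bias_b、bias_out为各格式的指数偏置
int32_t mul_exponent(int32_t ea, int32_t bias_a,
                     int32_t eb, int32_t bias_b,
                     int32_t bias_out) {
    // 真实指数相加：(ea - bias_a) + (eb - bias_b)，再加回输出格式的偏置
    return ea + eb - bias_a - bias_b + bias_out;
}
```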
进一步,乘法器的尾数处理单元可以用于根据前述的运算模式、第一浮点数和所述第二浮点数来获得乘法运算后的尾数。在一个实施例中,尾数处理单元可以包括部分积运算单元412和部分积求和单元414,其中所述部分积运算单元用于根据第一浮点数的尾数和第二浮点数的尾数获得尾数中间结果。在一些实施例中,该尾数中间结果可以是第一浮点数和第二浮点数在相乘操作过程中所获得的多个部分积(如图6和图7中所示意性示出的)。所述部分积求和单元用于将所述尾数中间结果进行加和运算以获得加和结果,并将所述加和结果作为所述乘法运算后的尾数。
为了获得尾数中间结果,在一个实施例中,本披露利用布斯(“Booth”)编码电路对第二浮点数(如充当浮点运算中的乘数)的尾数的高低位补0(其中对高位补0是将尾数作为无符号数转为有符号数),以便获得所述尾数中间结果。需要理解的是,根据编码方法的不同,也可以对第一浮点数(如充当浮点运算中的被乘数)的尾数进行编码(如高低位补0),或者对二者都进行编码,以获得多个部分积。关于部分积的更多描述,稍后将结合附图来说明。
在另一个实施例中,所述部分积求和单元可以包括加法器,其用于对所述尾数中间结果进行加和,以获得所述加和结果。在又一个实施例中,部分积求和单元包括华莱士树和加法器,其中所述华莱士树用于对所述尾数中间结果进行加和,以获得第二尾数中间结果,所述加法器用于对所述第二尾数中间结果进行加和,以获得所述加和结果。在这些实施例中,加法器可以包括全加器、串行加法器和超前进位加法器中的至少一种。
在一个实施例中,所述尾数处理单元还可以包括控制电路416,用于在运算模块指示所述第一浮点数或第二浮点数中的至少一个的尾数位宽大于尾数处理单元一次可处理的数据位宽时,根据所述运算模式多次调用所述尾数处理单元。该控制电路在一个实施例中可以实现为用于产生控制信号,例如可以是一个计数器或者控制的标志位等。为了实现这里的多次调用,所述的部分积求和单元还可以包括移位器,当所述控制电路根据所述运算模式多次调用所述尾数处理单元时, 移位器在每次调用中用于对已有加和结果进行移位,并与当次调用获得的求和结果进行相加,以获得新的加和结果,并且将在最后一次调用中获得的新的加和结果作为所述乘法运算后的尾数。
在一个实施例中,本披露的乘法器还包括规则化单元418和舍入单元420。该规则化单元可以用于对乘法运算后的尾数和指数进行浮点数规则化处理,以获得规则化指数结果和规则化尾数结果,并且将所述规则化指数结果和所述规则化尾数结果作为所述乘法运算后的指数和乘法运算后的尾数。例如,根据运算模块所指示的数据格式,规则化单元可以调整指数和尾数的位宽,以使其符合前述指示的数据格式的要求。另外,规则化单元还可以对指数或尾数做其他方面的调整。例如,在一些应用场景中,当尾数的值不为0时,尾数位的最高有效位应为1;否则,可以修改指数位并同时对尾数位进行移位,使其变为规格化数的形式。在另一个实施例中,该规则化单元还可以根据乘法运算后的尾数对所述乘法运算后的指数进行调整。例如,当乘法运算后的尾数的最高位为1时,可以将乘法运算后所获得的指数加1。与之相应,舍入单元可以用于根据舍入模式对所述规则化尾数结果执行舍入操作,并将执行了舍入操作后的尾数作为所述乘法运算后的尾数。根据不同的应用场景,该舍入单元可以执行例如包括向下舍入、向上舍入、向最近的有效数舍入等的舍入操作。在一些应用场景中,舍入单元也可以对尾数右移过程中移出的1进行舍入。
除了指数处理单元和尾数处理单元,本披露的乘法器还可选地包括符号处理单元,当输入的浮点数是带有符号位的浮点数时,该符号处理单元可以用于根据第一浮点数的符号和第二浮点数的符号获得乘法运算后的符号。例如,在一个实施例中,该符号处理单元可以包括异或逻辑电路422,所述异或逻辑电路用于根据所述第一浮点数的符号和所述第二浮点数的符号进行异或运算,获得所述乘法运算后的符号。在另一个实施例中,该符号处理单元也可以通过真值表或逻辑判断来实现。
另外,为了使输入或接收到的第一和第二浮点数符合规定的格式,在一个实施例中,本披露的乘法器还可以包括规格化处理单元424,用于当所述第一浮点数或第二浮点数为非规格化的非零浮点数时,根据所述运算模式,对所述第一浮点数或第二浮点数进行规格化处理,以获得对应的指数和尾数。例如,当选择的运算模式是表2中所示出的第2种运算模式,而输入的第一和第二浮点数是FP16型数据,则可以利用规格化处理单元将FP16型数据规格化为BF16型数据,以便乘法器以第2种运算模式进行操作。在一个或多个实施例中,规格化处理单元还可以用于对存在隐式的1的规格化浮点数和不存在隐式的1的非规格化浮点数的尾数进行预处理(例如尾数的扩充),以便于后续的尾数处理单元的操作。基于上文的描述,可以理解的是这里的规格化处理单元424和前述的规则化单元418在一些实施例中也可以执行相同或相类似的操作,不同的是规格化处理单元424针对于输入的浮点数据进行规格化处理而规则化单元418针对于将要输出的尾数和指数进行规格化处理。
以上结合图4对本披露的乘法器及其多个实施例进行了描述。基于上面的描述,本领域技术人员可以理解本披露的方案通过乘法器的执行来获得乘法运算后的结果(包括指数、尾数和可选的符号)。根据应用场景的不同,例如在不需要前述的规则化处理和舍入处理时,通过尾数处理单元和指数处理单元所获得的结果即可以视为浮点乘法器的运算结果。进一步,对于需要前述的规则化处理和舍入处理时,则经过该规则化处理和舍入处理后所获得的指数和尾数可以视为浮点乘法器的运算结果,或浮点乘法器的运算结果的一部分(当考虑最终的符号时)。进一步,本披露的方案通过多种运算模式来使得乘法器支持不同类型或数据格式的浮点数的运算,从而可以实现乘法器的复用,由此节省了芯片设计的开销并节约了计算成本。另外,通过多次调用机制,本披露的乘法器也支持高位宽的浮点数的计算。鉴于在浮点数乘法操作中,尾数(或称尾数位或尾数部分)的相乘操作对于整个浮点运算的性能至关重要,下面将结合图5来描述本披露的尾数操作。
图5是示出根据本披露实施例的尾数处理单元操作500的示意性框图。如图5中所示,本披露的尾数处理操作可以主要涉及两个单元,即前述结合如图4所讨论的部分积运算单元和部分积求和单元。从操作时序上来看,该尾数处理操作大体可以分为第一阶段和第二阶段,在第一阶段中该尾数处理操作将获得尾数中间结果,而在第二阶段中该尾数处理操作将获得从加法器508 输出的尾数结果。
在示例性的具体操作中,由乘法器接收到的第一浮点数和第二浮点数可以被划分成多个部分,即前述的符号(可选的)、指数和尾数。可选地,在经过规格化处理后,两个浮点数的尾数部分将作为输入进入到尾数处理单元(如图3或图4中的尾数处理单元),并且具体地进入到部分积运算单元。如图5中所示,本披露利用布斯编码电路502对第二浮点数(即浮点运算中的乘数)的尾数的高低位执行补0操作,并进行布斯编码处理,从而在部分积产生电路504中获得所述尾数中间结果。当然,这里的第一浮点数和第二浮点数仅仅用于说明性而非限制性的目的,因此在一些应用场景中,第一浮点数可以是乘数而第二浮点数可以是被乘数。相应地,在一些编码处理中,也可以对充当被乘数的浮点数执行编码操作。
为了更好地理解本披露的技术方案,下面对布斯编码进行简要地介绍。一般地,当两个二进制数进行相乘操作时,通过乘法操作会产生大量的称之为部分积的尾数中间结果,然后在对这些部分积进行累加操作进而得到两个二进制数相乘的最终结果。其中部分积数量越多,阵列乘法器的面积和功耗就会越大,执行速度就会越慢,其实现电路也就越困难。而布斯编码的目的就是为了有效地减少部分积的求和项的数量,从而减小电路面积。其算法在于首先对输入的乘数进行相应规则的编码,在一个实施例中,编码规则例如可以是下表4所示的规则:
表4
（表4原文为图像，以下为按正文描述重构的基4布斯编码规则，其中“010”“110”两行依标准布斯编码补全）

y2i+1 y2i y2i-1 | 编码信号PPi
0 0 0 | 0
0 0 1 | X
0 1 0 | X
0 1 1 | 2X
1 0 0 | -2X
1 0 1 | -X
1 1 0 | -X
1 1 1 | 0

其中表4中的y2i+1、y2i和y2i-1可以表示每一组待编码子数据（即乘数）对应的数值，X可以表示第一浮点数（即被乘数）中的尾数。对每一组对应的待编码数据进行布斯编码处理后，得到对应的编码信号PPi（i=0,1,2,...,n）。如表4中所示意性示出的，布斯编码后得到的编码信号可以包括五类，分别为-2X、2X、-X、X和0。示例性地，基于上述的编码规则，若接收到的被乘数为8位数据“X7X6X5X4X3X2X1X0”，则可以获得下述的部分积：
1)当乘数位中包括上表中的连续三位数据“001”时，部分积为X，可以表示为“X7X6X5X4X3X2X1X0”，第9位是符号位，即PPi={X[7],X}；2)当乘数位中包括上表中的连续三位数据“011”时，部分积为2X，可以表示为X左移一位，得到“X7X6X5X4X3X2X1X00”，即PPi={X,0}；3)当乘数位中包括上表中的连续三位数据“101”时，部分积为-X，即对“X7X6X5X4X3X2X1X0”按位取反再加1，即PPi=~{X[7],X}+1；4)当乘数位中包括上表中的连续三位数据“100”时，部分积为-2X，即对“X7X6X5X4X3X2X1X0”左移一位后取反再加1，即PPi=~{X,0}+1；5)当乘数位中包括上表中的连续三位数据“111”或“000”时，部分积为0，即PPi={9′b0}。
应当理解的是上面结合表4对获得部分积的过程的描述仅仅是示例性的而非限制性的,本领域技术人员在本披露的教导下,可以对表4中的规则进行改变,以获得不同于表4所示出的部分积。例如,在乘数位中存在连续多位(例如3位或3位以上)的特定数时,得到的部分积可以是被乘数的补码,或者例如在对部分积进行加和之后再执行上述3)和4)项中的“加1”操作。
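在上述编码规则的基础上，下面给出一个以8bit尾数为例的基4布斯编码部分积生成的C++行为级草图（函数名为本文假设，并非门级电路实现；将返回的各部分积求和即等于x*y，可用于自行验证）：

```cpp
#include <cstdint>
#include <vector>

// 示意：基4布斯编码产生部分积（8bit无符号被乘数x、乘数y）
std::vector<int32_t> booth_partial_products(uint8_t x, uint8_t y) {
    std::vector<int32_t> pps;
    int32_t X = static_cast<int32_t>(x);
    uint32_t yy = static_cast<uint32_t>(y) << 1;  // 乘数低位补0，构成{y,0}
    for (int i = 0; i < 5; ++i) {                 // 8bit乘数最多产生5个部分积
        uint32_t bits = (yy >> (2 * i)) & 0x7;    // 取相邻三位 y[2i+1]y[2i]y[2i-1]
        int32_t pp;
        switch (bits) {
            case 0: case 7: pp = 0;      break;   // 000 / 111 -> 0
            case 1: case 2: pp = X;      break;   // 001 / 010 -> X
            case 3:         pp = 2 * X;  break;   // 011 -> 2X
            case 4:         pp = -2 * X; break;   // 100 -> -2X
            default:        pp = -X;     break;   // 101 / 110 -> -X
        }
        pps.push_back(pp * (1 << (2 * i)));       // 部分积按2i位加权对齐
    }
    return pps;
}
```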
根据上述介绍性描述可以理解,通过对第二浮点数的尾数利用布斯编码电路进行编码,并且利用第一浮点数的尾数,可以从部分积产生电路产生多个部分积作为尾数中间结果,并且将尾数中间结果输送入到部分积求和单元中的华莱士树(“Wallace Tree”)压缩器506。应当理解的是,此处利用布斯编码获得部分积仅是本披露得到部分积的一种优选方式,而本领域技术人员也可以通过其他的方式来获得该部分积。例如,还可以通过移位操作来获得,即根据乘数的位值为1还是0来选择移位加被乘数还是加0而获得相应的部分积。类似地,利用华莱士树压缩器以实现部分积的加法操作也仅仅是示例性的而非限制性的,本领域技术人员也可以想到利用其他类型的加法器来实现这样的部分积相加操作。该加法器例如可以是一个或多个全加器、半加器或二者的各种组合形式。
关于华莱士树压缩器(或简称为华莱士树),其主要用于对上述的尾数中间结果(即多个部分积)进行求和,以减少部分积的累加次数(即,压缩)。通常,华莱士树压缩器可以采用进位保存CAS(carry-save)架构和Wallace树算法,其利用华莱士树阵列的计算速度比传统进位传递的加法快得多。
具体地,华莱士树压缩器能并行计算各行部分积之和,例如可以将N个部分积的累加次数从N-1次减少到Log 2N次,从而提高了乘法器的速度,对资源的有效利用具有重要意义。根据不同的应用需要,可以将华莱士树压缩器设计成多种类型,例如7-2华莱士树、4-2华莱士树以及3-2华莱士树等。在一个或多个实施例中,本披露使用7-2华莱士树作为实现本披露的各种浮点运算的示例,稍后将结合图5和图6对其进行详细的描述。
在一些实施例中,本披露所公开的华莱士树压缩操作可以布置为具有M个输入,N个输出,其数目可以不小于K,其中N为预设的小于M的正整数,K为不小于尾数中间结果的最大位宽的正整数。例如,M可以是7,N可以是2,即下文将详细描述的7-2华莱士树。当尾数中间结果的最大位宽是48时,K可以取正整数48,也就是说华莱士树的数目可以是48个。
在一些实施例中,根据运算模式,可以选用一组或多组所述华莱士树对所述尾数中间结果进行加和,其中每组有X个华莱士树,X为所述尾数中间结果的位数。进一步,各组内的华莱士树之间可以存在依次进位的关系,而各组间并不存在进位的关系。在示例性的连接中,华莱士树压缩器可以通过进位进行连接,例如来自于低位华莱士树压缩器的进位输出(如图7中Cin)被送入至高位华莱士树,而高位华莱士树压缩器的进位输出(Cout)又可以成为更高位华莱士树压缩器接收来自低位华莱士树压缩器的进位输入。另外,当从多个华莱士树压缩器中选择一个或多个华莱士时,可以进行任意的选择,例如既可以按0、1、2和3编号的顺序来选择,也可以按0、2、4和6编号的顺序来连接,只要选择的华莱士树压缩器是按上述的进位关系来选择即可。
下面结合一个说明性的示例来介绍上文的华莱士树及其操作。假设第一浮点数(例如本披露所述的神经元数据或权值数据中的一个)和第二浮点数(例如本披露所述的神经元数据或权值数据中的另一个)是16位数据,乘法器支持32位的输入位宽(由此支持两组16位数的并行相乘操作),华莱士树是7个(即上述M的一个示例值)输入和2个(即上述N的一个示例值)输出的7-2华莱士树压缩器。在该示例场景下,可以采用48个(即上述K的一个示例值)华莱士树来并行完成两组数据的乘法运算。
在上述的48个华莱士树中,第0~23个华莱士树(即第一组华莱士树中的24个华莱士树)可以完成第一组乘法的部分积加和运算,并且该组内的各华莱士树可以依次通过进位连接。进一步,第24~47个华莱士树(即第二组华莱士树中的24个华莱士树)可以完成第二组乘法的部分积加和运算,其中该组内的各华莱士树依次通过进位连接。另外,第一组中的第23个华莱士树和第二组中的第24个华莱士树之间不存在进位关系,即不同组的华莱士树之间不存在进位关系。
返回到图5,在通过华莱士树压缩器对部分积进行加和压缩后,将经过压缩后的部分积通过加法器进行求和,以获得尾数乘法操作的结果。关于加法器,在本披露的一个或多个实施例中,其可以包括全加器、串行加法器和超前进位加法器中的一种,用于对华莱士树压缩器进行加和所得到的最后两行部分积进行求和操作,以获得尾数乘法操作的结果。
可以理解,通过图5所示出的尾数乘法操作,特别是示例性地使用布斯编码和华莱士树,可以有效地获得尾数乘法操作的结果。具体地,布斯编码处理能有效减少部分积求和项的数目,从而减小电路面积,而华莱士压缩树能并行计算各行部分积之和,从而提高了乘法器的速度。
下面将结合图6和图7对部分积和7-2华莱士树的示例操作过程作详细的描述。可以理解的是这里的描述仅仅是示例性的而非限制性的,目的仅在于对本披露方案的更好理解。
图6示出在经过前述结合图3-图5所描述的尾数处理单元中的部分积产生电路后所获得的部分积600,如图中的两个虚线之间四行白色圆点,其中每行白色圆点标识出一个部分积。为了便于后续的华莱士树压缩器的执行,可以预先对位数进行扩展。例如,图6中的黑点为复制的每个9位部分积的最高位数值,可以看出部分积被扩展对齐至16(8+8)bit(即,被乘数尾数的位宽8bit+乘数尾数的位宽8bit)。在另一个实施例中,例如对于25*13二进制乘法的部分积,其部分积被扩展至38(25+13)bit(即,被乘数尾数的位宽25bit+乘数尾数的位宽13bit)。
图7是示出根据本披露实施例的华莱士树压缩器的操作流程和示意框图700。
如图7中所示,在对两个浮点数的尾数执行相乘操作后,例如如前所述,通过将乘数进行布斯编码并且通过被乘数可以获得图7中所示出的7个部分积。由于布斯编码算法的使用,减小了产生的部分积的数目。为了便于理解,图中在部分积部分用虚线框标识出一个包括7个元素的华莱士树,并且进一步以箭头示出其从7个元素压缩至2个元素的过程。在一个实施例中,该压缩过程(或称加和过程)可以借助于全加器来实现,即输入三个元素输出两个元素(即一个和“sum”以及针对高位的进位“carry”)。7-2华莱士树压缩器的示意框图在图7的右侧示出,可以理解该华莱士树压缩器包括7个来自一列部分积的输入(如图7左侧虚线框中标识的七个元素)。在操作中,第0列华莱士树的进位输入为0,每列华莱士树的进位输出Cout作为下一列华莱士树的进位输入Cin。
从图7左侧部分中可以看到,经过四次压缩后可以将包括7个元素的华莱士树压缩为包括2个元素。如前所提到,本披露利用7-2华莱士树压缩器将7行的部分积最终压缩成具有两行的部分积(即本披露的第二尾数中间结果),并且利用加法器(例如超前进位加法器)来获得尾数结果。
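下面的C++片段按“行”模拟上述压缩思想：反复利用全加器关系a+b+c=(a^b^c)+(((a&b)|(a&c)|(b&c))<<1)把多行部分积压缩到只剩两行，最后做一次进位传播加法（对应文中的超前进位加法器）。这只是帮助理解的行为级草图，并非逐列的7-2压缩器门级实现：

```cpp
#include <cstdint>
#include <vector>

// 示意：进位保存(3:2)压缩多行部分积，直至两行，再做最终加法
uint64_t wallace_like_sum(std::vector<uint64_t> rows) {
    while (rows.size() > 2) {
        std::vector<uint64_t> next;
        size_t i = 0;
        for (; i + 3 <= rows.size(); i += 3) {
            uint64_t a = rows[i], b = rows[i + 1], c = rows[i + 2];
            next.push_back(a ^ b ^ c);                          // 全加器的"和"
            next.push_back(((a & b) | (a & c) | (b & c)) << 1); // "进位"左移一位
        }
        for (; i < rows.size(); ++i) next.push_back(rows[i]);   // 余下的行直通下一级
        rows.swap(next);
    }
    uint64_t result = rows.empty() ? 0 : rows[0];
    if (rows.size() == 2) result += rows[1];                    // 最终一次进位传播加法
    return result;
}
```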
为了进一步阐述本披露方案的原理，下面将示例性地描述本披露的乘法器如何完成FP16*FP16、BF16*BF16、FP32*FP32和FP32*BF16四种运算模式下在第一阶段的操作，即直到华莱士树压缩器完成尾数中间结果的求和以获得第二尾数中间结果：
(1)FP16*FP16
在乘法器的该运算模式下,浮点数的尾数位为10bit,考虑IEEE754标准下非规格化非零数,可以扩展1bit位,从而尾数位为11bit。另外,由于尾数位为无符号数,采用布斯编码算法时可以在高位扩展1bit的0(即在高位补一个0),因此总的尾数位数为12bit。当对作为第二浮点数即乘数进行布斯编码,并且参照第一浮点数时,则通过部分积产生电路可以在高低部分分别获得7个部分积,其中第七个部分积为0,每个部分积的位宽为24bit,此时可以通过48个7-2华莱士树进行压缩处理,并且第23个到第24个华莱士树的进位为0。
(2)BF16*BF16
在乘法器的该运算模式下,浮点数的尾数位为7bit,考虑IEEE754标准下非规格化非零数可以扩展为有符号数,则尾数可以扩展为9bit。当对作为第二浮点数即乘数进行布斯编码,并且参照第一浮点数时,则通过部分积产生电路可以在高低部分分别获得7个有效部分积,其中第6、7个部分积为0,每个部分积位宽为18bit,通过使用第0~17个和第24~41个两组的7-2华莱士树进行压缩处理,其中第23到第24个华莱士树的进位为0。
(3)FP32*FP32
在乘法器的该运算模式下,浮点数的尾数位可以为23bit,考虑IEEE754标准下非规格化非零数,则尾数可以扩展为24bit。为节省乘法单元的面积,本披露的乘法器在该运算模式下可以被调用两次以完成一次运算。为此,每次尾数位进行的乘法为25bit*13bit,即将第一浮点数ina扩展1bit 0成为25bit的有符号数,将第二浮点数inb的24bit尾数位分为高低两部分各12bit,并且 分别扩展1bit 0得到两个13bit的乘数,表示为inb_high13和inb_low13高低两部分。具体操作中,第一次调用本披露的乘法器计算ina*inb_low13,第二次调用乘法器计算ina*inb_high13。在每一次的计算中,通过布斯编码生成7个有效部分积,每个部分积的位宽为38bit,通过第0~37个的7-2华莱士树进行压缩。
(4)FP32*BF16
在该乘法器的该运算模式下，第一浮点数ina的尾数位为23bit，第二浮点数inb的尾数位为7bit，考虑IEEE754标准下非规格化非零数可以扩展为有符号数，则尾数可以分别扩展为25bit和9bit，进行25bit×9bit的乘法，获得7个有效部分积，其中第6、7个部分积为0，每个部分积的位宽为34bit，通过第0~33个华莱士树进行压缩。
以上通过具体示例描述了本披露的乘法器如何在四种运算模式下完成第一阶段的操作,其中优选的使用了布斯编码算法和7-2华莱士树。基于上述的描述,本领域技术人员可以理解本披露使用7个部分积,使得可以在不同的运算模式中复用7-2华莱士树。
在一些运算模式中,前述的尾数处理单元还可以包括控制电路,其可以用于当运算模式指示的所述第一浮点数的尾数位宽和/或所述第一浮点数的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,根据所述运算模式多次调用所述尾数处理单元。进一步,对于多次调用的情形,所述部分积求和电路还可以包括移位器,其用于当根据所述运算模式多次调用所述尾数处理单元时,在已有所述加和结果的情况下,对所述已有的加和结果进行移位,并与当次调用获得的所述求和结果进行相加,得到新的加和结果,将所述新的加和结果作为所述乘法运算后的尾数。
例如,如前所述,可以在FP32*FP32运算模式中两次调用尾数处理单元。具体地,在第一次调用尾数处理单元中,尾数位(即ina*inb_low13)在第二阶段通过超前进位加法器相加获得第二低位尾数中间结果,在第二次调用尾数处理单元中,尾数位(即,ina*inb_high13)在第二阶段通过超前进位加法器相加获得第二高位尾数中间结果。此后,在一个实施例中,可以通过移位器的移位操作来累加第二低位尾数中间结果和第二高位尾数中间结果,以获得该乘法运算后的尾数,该移位操作可以下式来表达:
r_fp32xfp32 = (sum_h[37:0] << 12) + sum_l[37:0]
即将第二高位尾数中间结果sum_h[37:0]向左移12位并且与第二低位尾数中间结果sum_l[37:0]累加。
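该两次调用及移位合并的过程可以用如下C++草图示意（假设尾数已按正文完成扩展，变量命名沿用ina、inb的记法，仅为行为级说明）：

```cpp
#include <cstdint>

// 示意：FP32*FP32模式下尾数分两次相乘再移位合并
uint64_t fp32_mantissa_two_pass(uint32_t ina_man, uint32_t inb_man) {
    uint32_t inb_low13  = inb_man & 0xFFF;          // 低12bit，高位补0成13bit
    uint32_t inb_high13 = (inb_man >> 12) & 0xFFF;  // 高12bit，高位补0成13bit
    uint64_t sum_l = static_cast<uint64_t>(ina_man) * inb_low13;   // 第一次调用
    uint64_t sum_h = static_cast<uint64_t>(ina_man) * inb_high13;  // 第二次调用
    return (sum_h << 12) + sum_l;                   // r = (sum_h << 12) + sum_l
}
```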
上文结合图5-图7详细描述了本披露的乘法器在执行浮点运算时,对第一浮点数和第二浮点数的尾数相乘所执行的操作。当然,图5为了注重描述本披露乘法器的尾数处理单元的操作,并没有绘出其他的单元,例如指数处理单元和符号处理单元,并对其进行描述。下面将结合图8对本披露的乘法器进行整体上的描述,对于前文针对尾数处理单元所做的描述,同样也适用于图8所绘的情形。
图8是示出根据本披露实施例的乘法器800的整体示意框图。需要理解的是图中绘出的各类单元的位置、存在和连接关系仅仅是示例性的而非限制性的,例如其中的一些单元可以集成,而另一些单元也可以分离或依应用场景的不同而被省略或替换。
本披露的乘法器在每种运算模式的操作中按操作流程可以示例性地分为第一阶段和第二阶段,如图中的虚线所绘出的。概括来说,在第一阶段中:输出符号位的计算结果,输出指数位的尾数中间计算结果,输出尾数位的尾数中间计算结果(例如包括前述的输入尾数位定点乘法的布斯算法的编码过程和华莱士树压缩过程)。在第二阶段中:对指数和尾数进行规则化和舍入操作,以输出指数的计算结果和输出尾数的计算结果。
如图8中所示,本披露的乘法器可以包括模式选择单元802和规格化处理单元804,其中模式选择单元可以根据输入模式信号(in_mode)来选择运算模式。在一个实施例中,该输入模式信号可以与表2中的运算模式编号相对应。例如,当输入模式信号指示表2中的运算模式编号“1”时,则可以令乘法器工作于FP16*FP16的运算模式中,而当输入模式信号指示表2中的运算模式编号“3”时,则可以令乘法器工作于FP32*FP32的运算模式中。为了图示的目的,图8仅示出 FP16*FP16、BF16*BF16、FP32*FP32和FP32*BP16四种示例性运算模式。然而,正如前所述,本披露的乘法器同样也支持其他多种不同的运算模式。
规格化处理单元可以配置成用于当第一浮点数或第二浮点数为非规格化的非零浮点数时,根据运算模式,对第一浮点数或第二浮点数进行规格化处理,以获得对应的指数和尾数,例如按照IEEE754标准、对运算模式所指示的数据格式的浮点数进行规则化处理。
进一步,乘法器包括尾数处理单元,以执行第一浮点数尾数和第二浮点数尾数的相乘操作。为此,在一个或多个实施例中,该尾数处理单元可以包括位数扩展电路806、布斯编码器808、部分积产生电路810、华莱士树压缩器812以及加法器814,其中位数扩展电路可以用于考虑IEEE754标准下非规格化非零数而对尾数进行扩展,以适合于布斯编码器的操作。由于关于布斯编码器、部分积产生电路、华莱士树压缩器和加法器,已经结合图5-图7进行了详细了描述,因此相同的描述在此同样适用并因此不再赘述。
在一些实施例中,本披露的乘法器还包括规则化单元816和舍入单元818,该规则化单元和舍入单元具有与图4中所示出的单元相同的功能。具体地,对于规则化单元,其可以根据如图8中所示的输出模式信号“out_mode”所指示的数据格式来对所述加和结果和来自于指数处理单元的指数数据进行浮点数规则化处理以获得规则化指数结果和规则化尾数结果。例如,根据输出模式信号所指示的数据格式,规则化单元可以调整指数和尾数的位宽,以使其符合前述指示的数据格式的要求。再例如,当尾数的最高位为0,且该尾数不为0,则规则化单元可以重复将尾数左移1位,并且指数减1,直到最高位数值为1。对于舍入单元,在一个实施例中,其可以用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将舍入后的尾数作为所述乘法运算后的尾数。
在一个或多个实施例中,前述的输出模式信号可以是运算模式的一部分,用于指示乘法运算后的数据格式。例如,如前表3中所描述的,当运算模式编号为“12”时,则其中的数字“1”可以相当于前述的“in_mode”信号,用于指示执行FP16*FP16的乘法操作,而其中的数字“2”可以相当于“out_mode”信号,用于指示输出结果的数据类型是BF16。因此可以理解的是,在一些应用场景中,输出模式信号可以与前述的输入模式信号合并,以提供给模式选择单元。基于此合并后的模式信号,模式选择单元可以在乘法器操作的初始阶段明确输入数据和输出结果的数据格式,而无需向规则化单独的提供输出模式信号,由此也可以进一步简化操作。
在一个或多个实施例中,对于前述的舍入操作,可以示例性包括如下5种舍入模式。
(1)舍入到最接近的值:在此模式下,当两个值同样接近的情况下,偶数优先。此时会将结果舍入为最接近且可以表示的值,但是当存在两个数同样接近的时候,则取其中的偶数作为舍入结果(在二进制中是以0结尾的数);
(2)四舍五入:示例性操作参见下面的例子;
(3)朝+∞方向舍入:在此规则下,会将结果朝正无限大的方向舍入;
(4)朝-∞方向舍入:在此规则下,会将结果朝负无限大的方向舍入;以及
(5)朝0方向舍入:在此规则下,会将结果朝0的方向舍入。
对于“四舍五入”模式下的尾数舍入的例子:例如两个24位的尾数相乘得到一个48位(47~0)的尾数,经过规格化处理,输出时只取第46至第24位。当尾数的第23位为0时,则舍去第(23-0)位;当尾数的第23位为1时,则向第24位进1并舍去第(23-0)位。
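以上述“四舍五入”的取位规则为例，下面的C++片段示意对48bit尾数乘积取第46~24位并按第23位进行舍入（假设乘积已规格化，且未处理舍入进位再次溢出尾数域的情形）：

```cpp
#include <cstdint>

// 示意：48bit尾数乘积 -> 23bit舍入后尾数
uint32_t round_mantissa_48_to_23(uint64_t p48) {
    uint32_t kept  = static_cast<uint32_t>((p48 >> 24) & 0x7FFFFF); // 第46~24位
    uint32_t guard = static_cast<uint32_t>((p48 >> 23) & 1);        // 第23位
    return kept + guard;  // 第23位为1则向第24位进1，否则直接舍去低位
}
```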
返回到图8，本披露的乘法器还包括指数处理单元820和符号处理单元822，其中指数处理单元可以用于根据运算模式、第一浮点数的指数和第二浮点数的指数获得所述乘法运算后的指数。例如，指数处理单元可以将第一浮点数的指数位数据、第二浮点数的指数位数据和各自对应的输入浮点数据类型的偏移值相加，并且减去输出浮点数据类型的偏移值，以获得所述第一浮点数和第二浮点数的乘积的指数位数据。在一个或多个实施例中，指数处理单元820可以实现为或包括加减法电路，用于根据所述运算模式、所述第一浮点数的指数和所述第二浮点数的指数获得所述乘法运算后的指数。
符号处理单元822在一个实施例中可以实现为异或电路，用于对所述第一浮点数和第二浮点数的符号位数据执行异或操作，以获得所述第一浮点数和第二浮点数的乘积的符号位数据。
上文结合图8对本披露的乘法器整体进行了详细的描述。通过该描述,本领域技术人员可以理解本披露的乘法器支持多种运算模式下的操作,从而克服了现有技术中仅支持单一浮点型运算的乘法器的缺陷。进一步,由于本披露的乘法器可以复用,因此也支持高位宽的浮点型数据,降低了运算成本和开销。在一个或多个实施例中,本披露的乘法器还可以布置成或包括于集成电路芯片或计算装置中,以实现在多种运算模式下对浮点数执行乘法运算。
图9是示出根据本披露实施例的使用乘法器执行浮点数乘法运算的方法900的流程图。可以理解的是此处所述的乘法器即前面结合图2-图8详细描述的乘法器,因此在前关于该乘法器及其内部组成、功能和操作的描述也同样适用于此处的描述。
如图9中所示,所述方法900可以包括在步骤S902处利用所述乘法器的指数处理单元来根据运算模式、第一浮点数的指数和第二浮点数的指数获得所述乘法运算后的指数。正如前所述,该运算模式可以是多种运算模式中的一种,并且可以用于指示浮点数的数据格式。在一个或多个实施例中,该运算模式还可以用于确定输出结果的浮点数的数据格式。
接着,在步骤S904处,该方法900可以利用乘法器的尾数处理单元来根据所述运算模式、第一浮点数和第二浮点数获得所述乘法运算后的尾数。关于尾数的示例性操作,本披露在一些优选的实施例中使用了布斯编码算法和华莱士树压缩器,从而提高尾数处理的效率。另外,当第一浮点数和第二浮点数是有符号数时,方法900还可以在步骤S906中利用乘法器的符号处理单元来根据第一浮点数的符号和第二浮点数的符号获得乘法运算后的符号。
尽管上述方法以步骤形式示出利用本披露的乘法器来执行浮点数乘法运算,但这些步骤顺序并不意味着本方法的步骤必须依所述顺序来执行,而是可以以其他顺序或并行的方式来处理。另外,此处为了描述的简明而没有阐述方法900的其他步骤,但本领域技术人员根据本披露的内容可以理解该方法也可以通过使用乘法器来执行前述结合图2-图8描述的各种操作。
在本披露的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
图10是示出根据本披露实施例的计算装置1000的另一示意框图。从图中所示的内容可以看出,除了增加的新的第一类型转换单元1002,计算装置1000可以具有前述结合图1所述的计算装置100相同的组成、结构及其功能属性(例如加法模块108和更新模块112),因此前述关于计算装置100的描述同样也适用于计算装置1000。
关于增加的第一类型转换单元,其可以应用在这样的场景中,即当加法模块中的第一加法器并不支持多种数据类型(或格式)而需要进行数据类型转换。为此,在一个或多个实施例中,其可以配置用于对乘积结果进行数据类型(或者说数据格式)的转换,以便所述加法器执行所述加法操作。这里,所述乘积结果可以是通过前述乘法单元的浮点乘法器所获得的乘积结果。在一个或多个实施例中,该乘积结果的数据类型可以例如是前述的FP16、BF16、FP32、UBF16或UFP16中的一种。在该情况下,当后续的加法器所支持的数据类型不同于乘积结果的数据类型时,可以借助于第一类型转换单元来执行数据类型的转化,以使得结果适用于加法器的加法操作。例如,当乘积结果是FP16型的浮点数,而加法器支持FP32型的浮点数,则第一类型转换单元可以配置成对FP16型数据示例性地执行以下步骤操作,以将其转换成FP32型数据:S1:符号位左移16位;S2:指数加112(指数的基数127与15之间的差距),左移13位(右对齐);以及S3:尾数左移13位(左对齐)。
在上述例子的基础上,也可以通过执行与其相反的操作或逆操作来将FP32型数据转换成 FP16型数据,以便当乘积结果为FP32型数据时,可以将其转换成FP16型数据,从而符合支持FP16型数据加法操作的加法器。应当理解的是,这里的数据类型转换的操作仅仅是示例性的而非限制性的,本领域技术人员可以根据本披露的教导来选择任意合适的方式、机制或操作将乘法结果的数据类型转换成与后续的加法器相适应的数据类型。
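按照上述S1~S3，FP16到FP32的位模式转换可以示意性地写成如下C++代码（仅覆盖规格化数；非规格化数与Inf/NaN需按IEEE754另行处理）：

```cpp
#include <cstdint>

// 示意：FP16位模式 -> FP32位模式
uint32_t fp16_bits_to_fp32_bits(uint16_t h) {
    uint32_t s = (static_cast<uint32_t>(h) & 0x8000u) << 16;                 // S1: 符号位左移16位
    uint32_t e = (((static_cast<uint32_t>(h) >> 10) & 0x1Fu) + 112u) << 23;  // S2: 指数加112(=127-15)后对齐到FP32指数域
    uint32_t m = (static_cast<uint32_t>(h) & 0x3FFu) << 13;                  // S3: 尾数左移13位
    return s | e | m;
}
```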
图11是示出根据本披露实施例的加法器组1100的示意框图。从该图示意性所示内容可以看出,其是一个三级树状结构的加法器组,其中第一级包括4个本披露的第一加法器1102,其示例性地接收8个FP32型浮点数的输入,如in0、in1、…、in7。第二级包括2个第一加法器1104,其示例性地接收4个FP16型浮点数的输入。第三级仅包括1个第一加法器1106,其可以接收2个FP16型浮点数的输入并输出前述的8个FP32型浮点数的求和结果。
在本实施例中,假定第二级的2个第一加法器1104并不支持FP32型浮点数的加法操作,因此本披露提出在第一级和第二级的第一加法器之间设置有级间的一个或多个第二类型转换单元1108。在一个实施例中,该第二类型转换单元可以具有与结合图10所述的第一类型转换单元1002相同或相似的功能,即将输入的浮点型数据转换成与后续加法操作相一致的数据类型。具体地,该第二类型转换单元可以根据不同的应用需求而支持一种或多种的数据类型转换。例如,在图11所示出的例子中,其可以支持FP32型数据到FP16型数据的单向数据类型转换。而在其他的示例中,第二类型转换单元可以设计成支持FP32型数据和FP16型数据之间的双向数据类型转换。换句话说,其既可以支持FP32型数据到FP16型数据的数据类型转换,也可以支持FP16型数据到FP32型数据的数据类型转换。附加地或可选地,图10的第一类型转换单元1002或图11的第二类型转换单元1108也可以配置成支持多种浮点型数据之间的双向转换,例如其可以支持前述结合运算模式所描述的各种浮点型数据之间的双向转换,从而有助于本披露在数据处理过程中保持数据的前向或后向兼容性,进一步扩展本披露方案的应用场景和适用范围。
需要强调的是上述的类型转换单元仅仅是本披露的一个可选方案,当第一或第二加法器本身支持多种数据格式的加法运算,或可被复用为处理多种数据格式运算时,并不需要这样的类型转换单元。另外,当第二加法器支持的数据格式即是第一加法器输出数据的数据格式时,也不需要在二者之间设置这样的类型转换单元。
图12是示出根据本披露实施例的加法器组1200的示意框图。从图中所示内容可以看出,其示意性示出五级树状结构的加法器组,具体包括第一级的16个第一加法器、第二级的8个第一加法器、第三级的4个第一加法器、第四级的2个第一加法器和第5级的1个第一加法器。从该多级树状结构可以看出,图12所示的加法器组可以视为是对图11所示树状结构的扩展。或反言之,图11所示加法器组可以视为图12所示加法器组的一部分或组成单元,如图12中虚线1202所框出的部分。
在操作中,第一组的16个加法器可以接收来自于乘法单元的乘积结果。根据应用场景的不同,该乘积结果可以是通过图10所示出的第一类型转换单元1002转换后的浮点数。可选地,当前述的乘积结果与加法器组1200的第一级加法器所支持的数据类型相同时,则可以不经第一类型转换单元而直接输入到加法器组1200中,例如图12中所示出的32个FP32型浮点数(如in0~in31)。当通过第一级16个第一加法器的加法操作后,可以获得16个求和结果作为第二级8个第一加法器的输入。以此类推,最终作为第四级2个第一加法器输出的求和结果被输入到第五级的1个第一加法器,而该第五级加法器的输出可以作为前述的中间结果输入到位于前述更新模块中的加法器中。视应用场景的不同,该中间结果可以经历如下的操作之一:
当该中间结果是第一轮调用乘法单元所获得的中间结果时,其可以输入到前述的更新模块的加法器中,并且随后缓存于更新模块的寄存器中,以等待与第二轮调用乘法单元所获得的中间结果进行加法操作;或者当该中间结果是中间一轮(例如当执行多于两轮的操作时)调用乘法单元所获得的中间结果时,其可以输入到更新模块的加法器中,并且随后与由更新模块的寄存器输入到更新模块的加法器中的前一轮加法操作所获得的求和结果进行相加,以作为此中间一轮加法操作的求和结果存储到寄存器中;或者当该中间结果是最后一轮调用乘法单元所获得的中间结果 时,其可以输入到更新模块的加法器中,并且随后与由更新模块的寄存器输入到加法器中的前一轮加法操作所获得的求和结果进行相加,以作为此次神经网络运算的最终结果存储到寄存器中。
尽管图12是以树状层级的形式来布置多个加法器来完成多个数的加法操作,但本披露的方案并不限于此。本领域技术人员根据本披露的教导也可以以其他适宜的结构或方式来布置多个加法器,例如通过串行或并行连接多个全加器、半加器或其他类型的加法器来实现对多个输入的浮点数的加法操作。另外,为了简明的目的,图12所示出的加法树结构并没有示出如图11中所示出的第二类型转换单元。然而,根据应用的需要,本领域技术人员可以想到在图12所示的多级加法器中布置一个或多个级间的第二类型转换单元,以实现不同层级之间的数据类型的转换,从而进一步扩大本披露的计算装置的适用范围。
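多级树状加法器组的逐级归约行为可以用如下C++草图示意（以float统一代表一种浮点格式，忽略级间的类型转换单元）：

```cpp
#include <vector>

// 示意：两两相加的多级归约，对应图11/图12的树状加法器组
float adder_tree_reduce(std::vector<float> v) {
    while (v.size() > 1) {
        std::vector<float> next;
        for (size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(v[i] + v[i + 1]);        // 每个第一加法器完成一次两输入加法
        if (v.size() % 2) next.push_back(v.back()); // 元素个数为奇数时，末项直通下一级
        v.swap(next);
    }
    return v.empty() ? 0.0f : v[0];
}
```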
图13和图14是分别示出根据本披露实施例的神经网络运算1300的流程图和示意框图。为了更好的理解本披露的计算装置如何执行神经网络运算,图13和图14旨在以神经网络中的卷积运算(包括作为本披露的权值数据之一的卷积核和神经元数据)为例来阐述。可以理解的是该卷积运算可以发生在神经网络中的多个层处,例如神经网络的卷积层和全连接层。
在计算卷积运算(例如图像卷积)的过程中,会存在卷积核和神经元数据的复用情形。具体地,在卷积核的复用情形中,同一个卷积核在神经元数据块上滑动的过程中与不同的神经元数据执行内积。而在神经元数据的复用情形中,不同的卷积核与同一块神经元数据执行内积。因此,为了避免在计算卷积的过程中数据被反复搬运和读取,以节省功耗,本披露的计算装置可以在多轮的运算过程中复用神经元和卷积核数据。
根据上述的复用策略,在一个或多个实施例中,本披露的计算装置的输入端可以包括具有支持多个数据位宽的至少两个输入端口,并且更新模块中的寄存器可以包括多个子寄存器,以用于存储在每轮操作中所获得的中间结果。基于这样的布置,所述计算装置可以配置用于根据输入端口位宽对所述神经元数据和权值数据分别进行划分和复用,以执行神经网络运算。例如,假定本披露的计算装置的两个输入端口支持512bit位宽数据的输入,而神经元数据和卷积核是2048bit的位宽数据,则每个卷积核和对应的神经元可以划分为4个512bit位宽的向量,并且由此计算装置将进行四轮运算以获得完整的输出结果。
对于最终的输出结果,在一个或多个实施例中,其数目可以基于所述神经元数据复用次数和卷积核复用次数来确定。例如,该数目可以通过计算神经元复用次数和卷积核复用次数的乘积来获得。这里,复用次数的最大值可以根据更新模块中的寄存器(或者子寄存器)的数目来确定。例如,若子寄存器数目为n,当前神经元复用次数为m(m≤n),则卷积核复用次数的最大值为floor(n/m),其中floor函数表示对n/m执行向下取整操作。例如,当更新模块中的子寄存器的数目是8个时,而当前神经元复用次数是2次时,则卷积核复用次数的最大值为4(即,floor(8/2))。
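上述复用次数的约束可以直接写成如下C++函数（C++整数除法即向下取整，对应floor(n/m)）：

```cpp
// 示意：n个子寄存器、神经元复用m次时，权值(卷积核)可复用的最大次数
int max_kernel_reuse(int n, int m) {
    return n / m;  // 例如 max_kernel_reuse(8, 2) == 4，与正文示例一致
}
```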
基于上面的讨论,下面将结合图13和图14,以输入端口为512bit位宽长度、卷积核和神经元数据为2048bit的BF16的数据为例来描述本披露的计算装置的操作,其中鉴于输入端口位宽和输入数据长度,可以确定本披露的计算装置的乘法单元和累加模块需要连续执行四轮操作,其中神经元数据复用2次,卷积核数据复用4次,并在第4轮操作更新模块更新完毕后,输出最终的卷积结果。
首先,在步骤S1302处,方法1300缓存神经元数据和卷积核数据,例如可以读取2个512bit的神经元数据和2个512bit的卷积核数据并且将其缓存于缓冲器(“buffer”)或寄存器组中,该2个512bit的神经元数据可以是图14左侧最上方第一块中所示出的神经元数据“1-512位”和“2-512位”,而2个512bit的卷积核数据可以是图14右侧上方第一块中所示出的“第1卷积核”和“第2卷积核”。
接着,在步骤S1304处,方法1300可以对第1个512bit神经元和第1个512bit卷积核数据执行乘累加操作,并且随后将得到的第1个部分和作为第1个中间结果存储到子寄存器0中。例如,通过计算装置的2个输入接口来接收512bit的神经元数据和卷积核数据,并且在乘法单元的浮点乘法器中执行二者的乘法操作,然后将得到的结果输入至加法器中执行加法操作以获得中间结果。最后,将该第1个中间结果存储到更新模块的第1个子寄存器中,即子寄存器0。
类似地,在步骤S1306处,方法1300可以对第1个512bit神经元和第2个512bit卷积核数据执行乘累加操作,并且随后将得到的第2个部分和作为第2个中间结果存储到子寄存器1中,如图14中所示。由于在本例子中,卷积核就复用2次,而对应的每个神经元就参与计算两次,因此针对第1个512bit神经元数据的运算完成。
接着,在步骤S1308处,方法1300可以读取第3个512bit的神经元数据以覆盖第1个512bit神经元数据。同时,在步骤S1310处,方法1300可以执行第2个512bit的神经元数据和第1个512bit卷积核数据的乘累加操作,并且随后将得到的第3个部分和作为第3个中间结果存储到子寄存器2中。接着,在步骤S1310处,方法1300可以对第2个512bit的神经元数据和第2个512bit的卷积核数据执行乘累加操作,并且随后将得到的第4个部分和作为第4个中间结果存储到子寄存器3中。类似地,由于神经元数据只复用两次,因此此时第2个512bit的神经元数据复用完毕,并且在步骤1312处,方法1300读取第4个512bit神经元以覆盖第2个512bit的神经元数据。
类似于上述的操作,在步骤S1314处,方法1300可以执行第3个512bit神经元数据和第1个512bit卷积核数据的卷积操作(即乘累加操作),并且随后将得到的第5个部分和作为第5个中间结果存储到子寄存器4。在步骤S1316处,方法1300可以执行第3个512bit神经元数据与第2个512bit卷积核数据的卷积操作,并且随后将得到的第6个部分和作为第6个中间结果存储到子寄存器5中。在步骤1318处,方法1300可以执行第4个512bit神经元数据与第1个512bit卷积核数据的卷积操作,并且将得到的第7个中间结果存储到子寄存器6中。最后,在步骤1320处,方法1300可以执行第4个512bit神经元数据与第2个512bit卷积核数据的卷积操作,并且随后将得到的第8个部分和作为第8个中间结果存储到子寄存器7中。
通过上述步骤S1302-S1320的示例性操作,方法1300完成了第一轮神经元数据和卷积核数据的复用操作。如前所述,由于神经元和卷积核的大小均为2048bit,也就是说,每个卷积核和对应的一个神经元数据是4个512bit的向量,因此获得完整的输出更新模块要更新4次,即计算装置执行总共4轮的运算。基于此,在第2轮操作中,将对图14左侧中的第2块神经元数据(即所示出的5-512位、6-512位、7-512位和8-512位四个神经元数据)和右侧中的“512位第3卷积核”和“512位第4卷积核”执行与步骤S1202-S1220类似的操作,并且将获得的中间结果分别通过更新模块更新于子寄存器0至子寄存器7中。此时,子寄存器0至子寄存器7中存储的是求和结果,即第一轮已存中间结果和第二轮获得的中间结果执行加法操作后的求和结果。例如,子寄存器0中存储的是第一轮操作中的第一个中间结果和第二轮操作中的第二个中间结果的求和结果。
与上述的第1轮和第2轮操作类似,本披露的计算装置将继续进行第3轮和第4轮的操作。具体地,在第3轮操作中,计算装置完成对图14左侧中的第3块神经元数据(即所示出的9-512位、10-512位、11-512位和12-512位四个神经元数据)和右侧中的“512位第5卷积核”和“512位第6卷积核”的卷积操作和更新操作。具体地,将第三轮得到的8个中间结果分别通过更新模块更新于子寄存器0至子寄存器7中,以分别与第二轮后得到的求和结果相加,以获得第三轮操作后的、分别存储于子寄存器0至子寄存器7中的求和结果。
进一步,在最后一轮(即,第四轮)操作中,计算装置完成对图14左侧中的第4块神经元数据(即所示出的13-512位、14-512位、15-512位和16-512位四个神经元数据)和右侧中的“512位第7卷积核”和“512位第8卷积核”的卷积操作和更新操作。具体地,将第4轮得到的8个中间结果通过更新模块分别更新于子寄存器0至子寄存器7中,以分别与第3轮后得到的求和结果相加,以获得第4轮操作后的求和结果,而此时的求和结果即为本例的最终完整的8个计算结果,其可以分别通过子寄存器0至子寄存器7输出。
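综合上述四轮流程，下面给出一个与图14复用次序对应的C++行为级草图，其中dot512为本文假设的“乘法单元+加法模块”乘累加函数，数据段以float向量示意（段与神经元/卷积核的对应关系按本例假定）：

```cpp
#include <array>
#include <vector>

// 假设的乘累加：一对数据段做逐元素乘法并经加法树求和，得到一个中间结果
float dot512(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) s += a[i] * b[i];
    return s;
}

// 示意：4轮运算，每轮4段神经元数据复用2次、2段卷积核数据复用4次，
// 8个子寄存器分别累加8个输出的部分和，第4轮结束后即为最终结果
std::array<float, 8> convolve_with_reuse(
    const std::vector<std::vector<float>>& neurons,   // 16段神经元数据
    const std::vector<std::vector<float>>& kernels) { // 8段卷积核数据
    std::array<float, 8> regs{};                      // 子寄存器0~7，初值为0
    for (int round = 0; round < 4; ++round) {
        for (int ni = 0; ni < 4; ++ni) {
            for (int ki = 0; ki < 2; ++ki) {
                float mid = dot512(neurons[round * 4 + ni],
                                   kernels[round * 2 + ki]);
                regs[ni * 2 + ki] += mid;             // 更新模块：与前次求和结果累加
            }
        }
    }
    return regs;
}
```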
以上通过示例描述了本披露的计算装置如何通过复用卷积核和神经元数据完成神经网络运算。需要理解的是,上述的例子仅仅是示例性的,其绝不在任何意义上对本披露的方案进行限制。本领域技术人员根据本披露的教导可以对复用方案进行修改,例如通过设置不同数目的子寄存器数目、选择支持不同位宽的输入端口来进行调整。
图15是示出根据本披露实施例的使用计算装置执行神经网络运算的方法1500的流程图。 可以理解的是此处所述的计算装置即前面结合图1-图14所描述的计算装置,其包括前面详细描述的浮点乘法器,因此在前关于该计算装置、浮点乘法器及其内部组成、功能和操作的描述也同样适用于此处的描述。
如图15中所示,所述方法1500可以包括在步骤S1502处接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据。正如前所述,该至少一个权值数据和至少一个神经元数据可以具有浮点数的数据格式。在一个或多个实施例中,该至少一个权值数据和至少一个神经元数据可以具有前述的运算模式所指示的数据格式,例如运算模式可以使用一级或二级索引来指示权值数据和神经元数据的浮点数数据格式。
接着,在步骤S1504处,该方法1500可以利用包括至少一个浮点乘法器的乘法单元对至少一个权值和至少一个神经元数据执行神经网络运算中的乘法操作,以获得对应的乘积结果。如前所述,这里的浮点乘法器即前面结合图2-图9所述的浮点乘法器,其支持多种运算模式和复用,以便对于不同数据格式的浮点输入数据进行乘法操作,从而获得权值数据和神经元数据的乘积结果。
在获得乘积结果后,在步骤S1506处,该方法1500利用加法模块对乘积结果执行加法操作,以获得中间结果。如前所述,该加法模块可以通过多个全加器、半加器、波纹进位加法器、超前进位加法器等加法器来实现,并且可以以各种合适的形式来连接,例如以阵列加法器和如图11和图12中所示出的多级树状结构来实现。
在步骤S1508处,方法1500利用更新模块针对产生的多个中间结果执行多次求和操作,以输出所述神经网络运算的最终结果。如前所述,在一个或多个实施例中,更新模块可以包括第二加法器和寄存器,其中第二加法器可以配置用于重复地执行以下操作,直至完成对全部多个中间结果的求和操作:接收来自于加法器的中间结果和来自于寄存器的、前次求和操作的前次求和结果;将中间结果和前次求和结果进行相加,以获得本次求和操作的求和结果;以及利用本次求和操作的求和结果来更新寄存器中存储的前次求和结果。通过该更新模块的操作,本披露的计算装置可以多次调用乘法单元,以实现对大数据量的神经网络运算的支持。
尽管上述方法以步骤形式示出利用本披露的计算装置来执行包括浮点数乘法操作和加法操作在内的神经网络运算,但这些步骤顺序并不意味着本方法的步骤必须依所述顺序来执行,而是可以以其他顺序或并行的方式来处理。另外,此处为了描述的简明而没有阐述方法1500的其他步骤,但本领域技术人员根据本披露的内容可以理解该方法也可以通过使用乘法器来执行前述和下述结合附图所描述的各种操作。
图16是示出根据本披露实施例的一种组合处理装置1600的结构图。如图所示,该组合处理装置1600包括结合附图1-15所述的计算装置,例如图中所示出的计算装置1602。另外,该组合处理装置还包括通用互联接口1604和其他处理装置1606。根据本披露的计算装置与其他处理装置进行交互,共同完成用户指定的操作。
根据本披露的方案,该其他处理装置可以包括中央处理器(“CPU”)、图形处理器(“GPU”)、人工智能处理器等通用和/或专用处理器中的一种或多种类型的处理器,其数目不做限制而是依实际需要来确定。在一个或多个实施例中,该其他处理装置可以作为本披露的计算装置(其可以具体化为人工智能运算装置)与外部数据和控制的接口,执行包括但不限于数据搬运,完成对本机器学习运算装置的开启、停止等的基本控制;其他处理装置也可以和机器学习运算装置协作共同完成运算任务。
根据本披露的方案,该通用互联接口可以用于在计算装置与其他处理装置间传输数据和控制指令。例如,该计算装置可以经由所述通用互联接口从其他处理装置中获取所需的输入数据,写入该计算装置片上的存储装置。进一步,该计算装置可以经由所述通用互联接口从其他处理装置中获取控制指令,写入计算装置片上的控制缓存。替代地或可选地,通用互联接口也可以读取计算装置的存储模块中的数据并传输给其他处理装置。
可选地,该组合处理装置还可以包括存储装置1608,其可以分别与所述计算装置和所述 其他处理装置连接。在一个或多个实施例中,存储装置可以用于保存所述计算装置和所述其他处理装置的数据,尤其适用于所需要运算的数据在本计算装置或其他处理装置的内部存储中无法全部保存的数据。
根据应用场景的不同,本披露的组合处理装置可以作为手机、机器人、无人机、视频采集、视频监控设备等设备的SOC片上系统,从而有效地降低控制部分的核心面积,提高处理速度并降低整体的功耗。在此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。此处的某些部件可以例如是摄像头,显示器,鼠标,键盘,网卡或wifi接口。
在一些实施例里,本披露还公开了一种芯片(或称集成电路芯片),其包括了上述计算装置或组合处理装置。在另一些实施例里,本披露还公开了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,本披露还公开了一种板卡,其包括了上述芯片封装结构。参阅图17,其提供了前述的示例性板卡,上述板卡除了包括上述芯片1702以外,还可以包括其他的配套部件,该配套部件可以包括但不限于:存储器件1704、接口装置1706和控制器件1708。
所述存储器件与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元1710。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(“Double Data Rate SDRAM”,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储器件可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。
在一个实施例中,每一组所述存储单元可以包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备1712(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。例如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。在另一个实施例中,所述接口装置还可以是其他的接口,本披露并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接，以便对所述芯片的状态进行监控。具体地，所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机（Micro Controller Unit，“MCU”）。所述芯片可以包括多个处理芯片、多个处理核或多个处理电路，并且可以带动多个负载。由此，所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述芯片中多个处理芯片、多个处理核和/或多个处理电路的工作状态的调控。
在一些实施例里,本披露还公开了一种电子设备或装置,其包括了上述板卡。根据不同的应用场景,电子设备或装置可以包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
依据以下条款可更好地理解前述内容:
条款A1,一种用于执行神经网络运算的计算装置,包括:输入端,其配置用于接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;乘法单元,其包括至少一个浮点乘法器,所述浮点乘法器配置用于对所述至少一个权值数据和所述至少一个神经元数据执行所述神 经网络运算中的乘法操作,以获得对应的乘积结果;加法模块,其配置用于对所述乘积结果执行加法操作,以获得中间结果;以及更新模块,其配置用于执行针对产生的所述多个中间结果的多次求和操作,以输出所述神经网络运算的最终结果。
条款A2,根据条款A1所述的计算装置,其中所述至少一个权值数据和所述至少一个神经元数据是相同或不同数据类型的数据。
条款A3,根据条款A1或A2所述的计算装置,进一步包括:第一类型转换单元,其配置用于对所述乘积结果进行数据类型的转换,以便所述加法模块执行所述加法操作。
条款A4,根据条款A1-A3的任意一项所述的计算装置,其中所述加法模块包括以多级树状结构方式排列的多级加法器组,每级加法器组包括一个或多个第一加法器。
条款A5,根据条款A1-A4的任意一项所述的计算装置,进一步包括布置在所述多级加法器组中的一个或多个第二类型转换单元,其配置用于将一级加法器组输出的数据转换成另一类型的数据,以用于后一级加法器组的加法操作。
条款A6、根据条款A1-A5的任意一项所述的计算装置,其中所述乘法单元输出所述乘积结果后,便接收下一对所述至少一个权值数据和至少一个神经元数据,以进行乘法操作,并且所述加法模块输出所述中间结果后,便接收下一个来自所述乘法单元的乘积结果,以进行加法操作。
条款A7,根据条款A1-A6的任意一项所述的计算装置,其中所述更新模块包括第二加法器和寄存器,所述第二加法器配置用于重复地执行以下操作,直至完成对全部所述多个中间结果的求和操作:接收来自于所述加法模块的中间结果和来自于所述寄存器的、前次求和操作的前次求和结果;将所述中间结果和所述前次求和结果进行相加,以获得本次求和操作的求和结果;以及利用本次求和操作的求和结果来更新所述寄存器中存储的前次求和结果。
条款A8,根据条款A1-A7的任意一项所述的计算装置,其中所述输入端包括具有支持多个数据位宽的至少两个输入端口,并且所述寄存器包括多个子寄存器,所述计算装置配置用于:根据所述输入端口位宽对所述神经元数据和权值数据分别进行划分和复用,以执行神经网络运算。
条款A9,根据条款A1-A8的任意一项所述的计算装置,其中所述乘法器、加法模块和更新模块配置成根据所述划分和复用执行多轮操作,其中:在每轮操作中,将获得的中间结果存储于对应的子寄存器中并且由更新模块来执行所述子寄存器的更新;以及在最后一轮操作中,从所述多个子寄存器输出所述神经网络运算的最终结果。
条款A10,根据条款A1-A9的任意一项所述的计算装置,其中所述最终结果的结果项数目基于所述神经元数据复用次数和权值数据复用次数。
条款A11,根据条款A1-A10的任意一项所述的计算装置,其中所述复用次数的最大值基于所述多个子寄存器的数目。
条款A12,根据条款A1-A11的任意一项所述的计算装置,其中所述计算装置包括n个所述子寄存器,所述神经元复用次数为m,所述权值数据复用的最大次数为floor(n/m),其中m等于或小于n,并且floor函数表示对n/m执行向下取整操作。
条款A13,根据条款A1-A12的任意一项所述的计算装置,其中所述浮点乘法器用于根据运算模式对所述至少一个神经元数据和所述至少一个权值数据执行乘法运算,其中所述至少一个神经元数据和所述至少一个权值数据至少包括各自的指数和尾数,所述浮点乘法器包括:指数处理单元,用于根据所述运算模式、所述至少一个神经元数据的指数和所述至少一个权值数据的指数获得所述乘法运算后的指数;以及尾数处理单元,用于根据所述运算模式、所述至少一个神经元数据和所述至少一个权值数据获得所述乘法运算后的尾数,其中,所述运算模式用于指示所述至少一个神经元数据的数据格式和所述至少一个权值数据的数据格式。
条款A14,根据条款A13的任意一项所述的计算装置,其中所述运算模式还用于指示所述乘法运算后的数据格式。
条款A15,根据条款A12-A14的任意一项所述的计算装置,其中所述数据格式包括半精 度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种。
条款A16,根据条款A12-A15的任意一项所述的计算装置,其中所述至少一个神经元数据和所述至少一个权值数据还包括各自的符号,所述浮点乘法器进一步包括:符号处理单元,用于根据所述至少一个神经元数据的符号和至少一个权值数据的符号获得所述乘法运算后的符号。
条款A17,根据条款A12-A16的任意一项所述的计算装置,其中所述符号处理单元包括异或逻辑电路,所述异或逻辑电路用于根据所述至少一个神经元数据的符号和所述至少一个权值数据的符号进行异或运算,获得所述乘法运算后的符号。
条款A18,根据条款A12-A17的任意一项所述的计算装置,进一步包括:规格化处理单元,用于当所述至少一个神经元数据或至少一个权值数据为非规格化的非零浮点数时,根据所述运算模式,对所述至少一个神经元数据或至少一个权值数据进行规格化处理,以获得对应的指数和尾数。
条款A19,根据条款A12-A18的任意一项所述的计算装置,其中所述尾数处理单元包括部分积运算单元和部分积求和单元,其中所述部分积运算单元用于根据所述至少一个神经元数据的尾数和至少一个权值数据的尾数获得尾数中间结果,所述部分积求和单元用于将所述尾数中间结果进行加和运算以获得加和结果,并将所述加和结果作为所述乘法运算后的尾数。
条款A20,根据条款A12-A19的任意一项所述的计算装置,其中所述部分积运算单元包括布斯编码电路,所述布斯编码电路用于对至少一个权值数据的尾数的高低位补0,并进行布斯编码处理,以获得所述尾数中间结果。
条款A21,根据条款A12-A20的任意一项所述的计算装置,其中所述部分积求和电路包括加法器,所述加法器用于对所述尾数中间结果进行加和,以获得所述加和结果。
条款A22，根据条款A12-A21的任意一项所述的计算装置，其中所述部分积求和电路包括华莱士树和加法器，其中所述华莱士树用于对所述尾数中间结果进行加和，以获得第二尾数中间结果，所述加法器用于对所述第二尾数中间结果进行加和，以获得所述加和结果。
条款A23,根据条款A12-A22的任意一项所述的计算装置,其中所述加法器包括全加器、串行加法器和超前进位加法器中的至少一种。
条款A24,根据条款A12-A23的任意一项所述的计算装置,其中当所述中间结果的个数不足M个时,补充零值作为尾数中间结果,使得所述尾数中间结果的数量等于M,其中M为预设的正整数。
条款A25,根据条款A12-A24的任意一项所述的计算装置,其中每个所述华莱士树具有M个输入和N个输出,所述华莱士树的数目不小于N*K,其中N为预设的小于M的正整数,K为不小于所述尾数中间结果的最大位宽的正整数。
条款A26,根据条款A12-A25的任意一项所述的计算装置,其中所述部分积求和电路用于根据运算模式来选用N组所述华莱士树对所述中间结果进行加和,其中每组有X个华莱士树,X为所述尾数中间结果的位数,其中各组内的所述华莱士树之间存在依次进位的关系,而各组之间的华莱士树不存在进位的关系。
条款A27,根据条款A12-A26的任意一项所述的计算装置,其中所述尾数处理单元还包括控制电路,用于在所述运算模式指示所述至少一个神经元数据或至少一个权值数据中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,根据所述运算模式多次调用所述尾数处理单元。
条款A28,根据条款A12-A27的任意一项所述的计算装置,其中所述部分积求和电路还包括移位器,当所述控制电路根据所述运算模式多次调用所述尾数处理单元时,所述移位器在每次调用中用于对已有加和结果进行移位,并与当次调用获得的所述求和结果进行相加,以获得新的加和结果,并且将在最后一次调用中获得的新的加和结果作为所述乘法运算后的尾数。
条款A29,根据条款A12-A28的任意一项所述的计算装置,其中所述浮点乘法器还包括 规则化单元,用于:对所述乘法运算后的尾数和指数进行浮点数规则化处理,以获得规则化指数结果和规则化尾数结果,并且将所述规则化指数结果和所述规则化尾数结果作为所述乘法运算后的指数和所述乘法运算后的尾数。
条款A30,根据条款A12-A29的任意一项所述的计算装置,其中所述浮点乘法器还包括舍入单元,其用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将所述舍入后的尾数作为所述乘法运算后的尾数。
条款A31,根据条款A12-A30的任意一项所述的计算装置,其进一步包括:模式选择单元,其用于从所述浮点乘法器支持的多种运算模式中选择指示所述至少一个神经元数据和至少一个权值数据的数据格式的运算模式。
条款A32,一种用于执行神经网络运算的方法,包括:利用输入端接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;利用包括至少一个浮点乘法器的乘法单元对所述至少一个权值数据和所述至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果;利用加法模块对所述乘积结果执行加法操作,以获得中间结果;以及利用更新模块针对产生的所述多个中间结果执行多次求和操作,以输出所述神经网络运算的最终结果。
条款A33,一种集成电路芯片,包括根据条款A1-A31的任意一项所述的计算装置。
条款A34,一种集成电路设备,包括根据条款A1-A31的任意一项所述的计算装置。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本披露所提供的几个实施例中,应该理解到,所披露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、光学、声学、磁性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本披露各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,当本披露的技术方案可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本披露各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(“ROM”,Read-Only Memory)、随机存取存储器(“RAM”,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时,本领域技术人员依据本披露的思想,基于本披露的具体实施方式及应用范围上做出的改变或变形之处,都属于本披露保护的范围。综上所述,本说明书内容不应理解为对本披露的限制。

Claims (34)

  1. 一种用于执行神经网络运算的计算装置,包括:
    输入端,其配置用于接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;
    乘法单元,其包括至少一个浮点乘法器,所述浮点乘法器配置用于对所述至少一个权值数据和所述至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果;
    加法模块,其配置用于对所述乘积结果执行加法操作,以获得中间结果;以及
    更新模块,其配置用于执行针对产生的所述多个中间结果的多次求和操作,以输出所述神经网络运算的最终结果。
  2. 根据权利要求1所述的计算装置,其中所述至少一个权值数据和所述至少一个神经元数据是相同或不同数据类型的数据。
  3. 根据权利要求1所述的计算装置,进一步包括:
    第一类型转换单元,其配置用于对所述乘积结果进行数据类型的转换,以便所述加法模块执行所述加法操作。
  4. 根据权利要求3所述的计算装置,其中所述加法模块包括以多级树状结构方式排列的多级加法器组,每级加法器组包括一个或多个第一加法器。
  5. 根据权利要求4所述的计算装置,进一步包括布置在所述多级加法器组中的一个或多个第二类型转换单元,其配置用于将一级加法器组输出的数据转换成另一类型的数据,以用于后一级加法器组的加法操作。
  6. 根据权利要求1所述的计算装置,其中所述乘法单元输出所述乘积结果后,便接收下一对所述至少一个权值数据和至少一个神经元数据,以进行乘法操作,并且所述加法模块输出所述中间结果后,便接收下一个来自所述乘法单元的乘积结果,以进行加法操作。
  7. 根据权利要求1所述的计算装置,其中所述更新模块包括第二加法器和寄存器,所述第二加法器配置用于重复地执行以下操作,直至完成对全部所述多个中间结果的求和操作:
    接收来自于所述加法模块的中间结果和来自于所述寄存器的、前次求和操作的前次求和结果;
    将所述中间结果和所述前次求和结果进行相加,以获得本次求和操作的求和结果;以及
    利用本次求和操作的求和结果来更新所述寄存器中存储的前次求和结果。
  8. 根据权利要求7所述的计算装置,其中所述输入端包括具有支持多个数据位宽的至少两个输入端口,并且所述寄存器包括多个子寄存器,所述计算装置配置用于:
    根据所述输入端口位宽对所述神经元数据和权值数据分别进行划分和复用,以执行神经网络运算。
  9. 根据权利要求8所述的计算装置,其中所述乘法器、加法模块和更新模块配置成根据所述划分和复用执行多轮操作,其中:
    在每轮操作中,将获得的中间结果存储于对应的子寄存器中并且由更新模块来执行所述子寄存器的更新;以及
    在最后一轮操作中,从所述多个子寄存器输出所述神经网络运算的最终结果。
  10. 根据权利要求9所述的计算装置,其中所述最终结果的结果项数目基于所述神经元数据复用次数和权值数据复用次数。
  11. 根据权利要求9所述的计算装置,其中所述复用次数的最大值基于所述多个子寄存器的数目。
  12. 根据权利要求8所述的计算装置,其中所述计算装置包括n个所述子寄存器,所述神经元复用次数为m,所述权值数据复用的最大次数为floor(n/m),其中m等于或小于n,并且floor函数表示对n/m执行向下取整操作。
  13. 根据权利要求1-12的任意一项所述的计算装置,其中所述浮点乘法器用于根据运算模式对所述至少一个神经元数据和所述至少一个权值数据执行乘法运算,其中所述至少一个神经元数据和所述至少一个权值数据至少包括各自的指数和尾数,所述浮点乘法器包括:
    指数处理单元,用于根据所述运算模式、所述至少一个神经元数据的指数和所述至少一个权 值数据的指数获得所述乘法运算后的指数;以及
    尾数处理单元,用于根据所述运算模式、所述至少一个神经元数据和所述至少一个权值数据获得所述乘法运算后的尾数,
    其中,所述运算模式用于指示所述至少一个神经元数据的数据格式和所述至少一个权值数据的数据格式。
  14. 根据权利要求13所述的计算装置,其中所述运算模式还用于指示所述乘法运算后的数据格式。
  15. 根据权利要求13所述的计算装置,其中所述数据格式包括半精度浮点数、单精度浮点数、脑浮点数、双精度浮点数、自定义浮点数中的至少一种。
  16. 根据权利要求13所述的计算装置,其中所述至少一个神经元数据和所述至少一个权值数据还包括各自的符号,所述浮点乘法器进一步包括:
    符号处理单元,用于根据所述至少一个神经元数据的符号和至少一个权值数据的符号获得所述乘法运算后的符号。
  17. 根据权利要求13所述的计算装置,其中所述符号处理单元包括异或逻辑电路,所述异或逻辑电路用于根据所述至少一个神经元数据的符号和所述至少一个权值数据的符号进行异或运算,获得所述乘法运算后的符号。
  18. 根据权利要求13所述的计算装置,进一步包括:
    规格化处理单元,用于当所述至少一个神经元数据或至少一个权值数据为非规格化的非零浮点数时,根据所述运算模式,对所述至少一个神经元数据或至少一个权值数据进行规格化处理,以获得对应的指数和尾数。
  19. 根据权利要求13所述的计算装置,其中所述尾数处理单元包括部分积运算单元和部分积求和单元,其中所述部分积运算单元用于根据所述至少一个神经元数据的尾数和至少一个权值数据的尾数获得尾数中间结果,所述部分积求和单元用于将所述尾数中间结果进行加和运算以获得加和结果,并将所述加和结果作为所述乘法运算后的尾数。
  20. 根据权利要求19所述的计算装置,其中所述部分积运算单元包括布斯编码电路,所述布斯编码电路用于对至少一个权值数据的尾数的高低位补0,并进行布斯编码处理,以获得所述尾数中间结果。
  21. 根据权利要求19所述的计算装置,其中所述部分积求和电路包括加法器,所述加法器用于对所述尾数中间结果进行加和,以获得所述加和结果。
  22. 根据权利要求19所述的计算装置,其中所述部分积求和电路包括华莱士树和加法器,其中所述华莱士树用于对所述尾数中间结果进行加和,以获得第二尾数中间结果,所述部分积求和电路中的加法器用于对所述第二尾数中间结果进行加和,以获得所述加和结果。
  23. 根据权利要求22所述的计算装置,其中所述部分积求和电路中的所述加法器包括全加器、串行加法器和超前进位加法器中的至少一种。
  24. 根据权利要求23所述的计算装置,其中当所述尾数中间结果的个数不足M个时,补充零值作为尾数中间结果,使得所述尾数中间结果的数量等于M,其中M为预设的正整数。
  25. 根据权利要求24所述的计算装置,其中每个所述华莱士树具有M个输入和N个输出,所述华莱士树的数目不小于N*K,其中N为预设的小于M的正整数,K为不小于所述尾数中间结果的最大位宽的正整数。
  26. 根据权利要求25所述的计算装置,其中所述部分积求和电路用于根据运算模式来选用N组所述华莱士树对所述尾数中间结果进行加和,其中每组有X个华莱士树,X为所述尾数中间结果的位数,其中各组内的所述华莱士树之间存在依次进位的关系,而各组之间的华莱士树不存在进位的关系。
  27. 根据权利要求26所述的计算装置,其中所述尾数处理单元还包括控制电路,用于在所述运算模式指示所述至少一个神经元数据或至少一个权值数据中的至少一个的尾数位宽大于所述尾数处理单元一次可处理的数据位宽时,根据所述运算模式多次调用所述尾数处理单元。
  28. 根据权利要求27所述的计算装置,其中所述部分积求和电路还包括移位器,当所述控制电路根据所述运算模式多次调用所述尾数处理单元时,所述移位器在每次调用中用于对已有加和结果进行移位,并与当次调用获得的所述求和结果进行相加,以获得新的加和结果,并且将在最后一次调用中获得的新的加和结果作为所述乘法运算后的尾数。
  29. 根据权利要求28所述的计算装置,其中所述浮点乘法器还包括规则化单元,用于:
    对所述乘法运算后的尾数和指数进行浮点数规则化处理,以获得规则化指数结果和规则化尾数结果,并且将所述规则化指数结果和所述规则化尾数结果作为所述乘法运算后的指数和所述乘法运算后的尾数。
  30. 根据权利要求29所述的计算装置,其中所述浮点乘法器还包括:
    舍入单元,用于根据舍入模式对所述规则化尾数结果执行舍入操作以获得舍入后的尾数,并将所述舍入后的尾数作为所述乘法运算后的尾数。
  31. 根据权利要求13所述的计算装置,其中所述浮点乘法器还包括:
    模式选择单元,其用于从所述浮点乘法器支持的多种运算模式中选择指示所述至少一个神经元数据和至少一个权值数据的数据格式的运算模式。
  32. 一种用于执行神经网络运算的方法,包括:
    利用输入端接收待执行神经网络运算的至少一个权值数据和至少一个神经元数据;
    利用包括至少一个浮点乘法器的乘法单元对所述至少一个权值数据和所述至少一个神经元数据执行所述神经网络运算中的乘法操作,以获得对应的乘积结果;
    利用加法模块对所述乘积结果执行加法操作,以获得中间结果;以及
    利用更新模块针对产生的所述多个中间结果执行多次求和操作,以输出所述神经网络运算的最终结果。
  33. 一种集成电路芯片,包括权利要求1-31的任意一项所述的计算装置。
  34. 一种集成电路设备,包括根据权利要求1-31的任意一项所述的计算装置。
PCT/CN2020/122949 2019-10-25 2020-10-22 用于神经网络运算的计算装置、方法、集成电路和设备 WO2021078210A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/620,547 US20220350569A1 (en) 2019-10-25 2020-10-22 Computing apparatus and method for neural network operation, integrated circuit, and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911023669.1 2019-10-25
CN201911023669.1A CN112712172B (zh) 2019-10-25 2019-10-25 用于神经网络运算的计算装置、方法、集成电路和设备

Publications (1)

Publication Number Publication Date
WO2021078210A1 true WO2021078210A1 (zh) 2021-04-29

Family

ID=75540716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122949 WO2021078210A1 (zh) 2019-10-25 2020-10-22 用于神经网络运算的计算装置、方法、集成电路和设备

Country Status (3)

Country Link
US (1) US20220350569A1 (zh)
CN (1) CN112712172B (zh)
WO (1) WO2021078210A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034163A (zh) * 2022-07-15 2022-09-09 厦门大学 一种支持两种数据格式切换的浮点数乘加计算装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113791756B (zh) * 2021-09-18 2022-12-23 中科寒武纪科技股份有限公司 转数方法、存储介质、装置及板卡
CN114118387A (zh) * 2022-01-25 2022-03-01 深圳鲲云信息科技有限公司 数据处理方法、数据处理装置及计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948787A (zh) * 2019-02-26 2019-06-28 山东师范大学 用于神经网络卷积层的运算装置、芯片及方法
US20190205744A1 (en) * 2017-12-29 2019-07-04 Micron Technology, Inc. Distributed Architecture for Enhancing Artificial Neural Network
CN110084361A (zh) * 2017-10-30 2019-08-02 上海寒武纪信息科技有限公司 一种运算装置和方法
CN110210615A (zh) * 2019-07-08 2019-09-06 深圳芯英科技有限公司 一种用于执行神经网络计算的脉动阵列系统

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570559A (zh) * 2015-10-09 2017-04-19 阿里巴巴集团控股有限公司 一种基于神经网络的数据处理方法和装置
CN106991477B (zh) * 2016-01-20 2020-08-14 中科寒武纪科技股份有限公司 一种人工神经网络压缩编码装置和方法
CN106650922B (zh) * 2016-09-29 2019-05-03 清华大学 硬件神经网络转换方法、计算装置、软硬件协作系统
CN107844826B (zh) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 神经网络处理单元及包含该处理单元的处理系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084361A (zh) * 2017-10-30 2019-08-02 上海寒武纪信息科技有限公司 一种运算装置和方法
US20190205744A1 (en) * 2017-12-29 2019-07-04 Micron Technology, Inc. Distributed Architecture for Enhancing Artificial Neural Network
CN109948787A (zh) * 2019-02-26 2019-06-28 山东师范大学 用于神经网络卷积层的运算装置、芯片及方法
CN110210615A (zh) * 2019-07-08 2019-09-06 深圳芯英科技有限公司 一种用于执行神经网络计算的脉动阵列系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG , YUQUAN: "Design of High-Performance Floating-point DSP Coprocessor", CHINA MASTER’S THESES FULL-TEXT DATABASE, 1 May 2018 (2018-05-01), pages 1 - 87, XP055804809 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034163A (zh) * 2022-07-15 2022-09-09 厦门大学 一种支持两种数据格式切换的浮点数乘加计算装置

Also Published As

Publication number Publication date
CN112712172B (zh) 2023-12-26
US20220350569A1 (en) 2022-11-03
CN112712172A (zh) 2021-04-27

Similar Documents

Publication Publication Date Title
WO2021078212A1 (zh) 用于向量内积的计算装置、方法和集成电路芯片
WO2021078210A1 (zh) 用于神经网络运算的计算装置、方法、集成电路和设备
TWI763079B (zh) 用於浮點運算的乘法器、方法、積體電路晶片和計算裝置
CN110515589B (zh) 乘法器、数据处理方法、芯片及电子设备
CN111381871B (zh) 运算方法、装置及相关产品
CN111008003B (zh) 数据处理器、方法、芯片及电子设备
CN110515590B (zh) 乘法器、数据处理方法、芯片及电子设备
WO2021185262A1 (zh) 计算装置、方法、板卡和计算机可读存储介质
TWI774093B (zh) 用於轉換資料類型的轉換器、晶片、電子設備及其方法
CN111381808A (zh) 乘法器、数据处理方法、芯片及电子设备
CN111258633B (zh) 乘法器、数据处理方法、芯片及电子设备
CN111258541B (zh) 乘法器、数据处理方法、芯片及电子设备
CN111258544B (zh) 乘法器、数据处理方法、芯片及电子设备
WO2021073512A1 (zh) 用于浮点运算的乘法器、方法、集成电路芯片和计算装置
CN209895329U (zh) 乘法器
WO2021073511A1 (zh) 用于浮点运算的乘法器、方法、集成电路芯片和计算装置
CN210109863U (zh) 乘法器、装置、神经网络芯片及电子设备
CN110647307B (zh) 数据处理器、方法、芯片及电子设备
CN110515586B (zh) 乘法器、数据处理方法、芯片及电子设备
CN111258545B (zh) 乘法器、数据处理方法、芯片及电子设备
WO2023231363A1 (zh) 乘累加操作数的方法及其设备
CN112711440A (zh) 用于转换数据类型的转换器、芯片、电子设备及其方法
CN113031916A (zh) 乘法器、数据处理方法、装置及芯片
CN113033799B (zh) 数据处理器、方法、装置及芯片
WO2021185261A1 (zh) 计算装置、方法、板卡和计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20880063

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20880063

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20880063

Country of ref document: EP

Kind code of ref document: A1