EP4275113A1 - Numerical precision in digital multiplier circuitry - Google Patents

Numerical precision in digital multiplier circuitry

Info

Publication number
EP4275113A1
Authority
EP
European Patent Office
Prior art keywords
format
operands
operand
result
multiplier
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21918010.6A
Other languages
German (de)
English (en)
Inventor
Jeffrey Werner
Jonathan Ross
Revathi Natarajan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Groq Inc
Original Assignee
Groq Inc
Application filed by Groq Inc
Publication of EP4275113A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487Multiplying; Dividing
    • G06F7/4876Multiplying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to digital circuits, and in particular, to systems and methods for numerical precision in digital multiplier circuitry.
  • a digital multiply-accumulator is an electronic circuit capable of receiving multiple digital input values, determining a product of the input values, and summing the results.
  • Performing digital multiply-accumulate operations can raise a number of challenges. For example, data values being multiplied may be represented digitally in a number of different data types. However, including different multipliers to handle all the different data types a system may need to process would consume circuit area and increase complexity.
  • Embodiments of the present disclosure pertain to digital multimodal multiplier systems and methods.
  • the present disclosure includes a circuit comprising a plurality of multimodal multiplier circuits, the multimodal multiplier circuits comprising one or more storage register circuits for storing digital bits corresponding to one or more first operands and one or more second operands.
  • In a first mode, the one or more storage register circuits store one first operand and one second operand having a first data type.
  • In a second mode, the one or more storage register circuits store a first plurality of operands and a second plurality of operands having a second data type.
  • a plurality of multiplier circuits are configured to receive the one or more first operands and the one or more second operands.
  • In the first mode, the one first operand and the one second operand are multiplied in one or more of the plurality of multiplier circuits.
  • In the second mode, a first operand of the first plurality of operands is multiplied with a first operand of the second plurality of operands, and a second operand of the first plurality of operands is multiplied with a second operand of the second plurality of operands, in the plurality of multiplier circuits.
  • the first operands are weights and the second operands are activation values.
  • the one first operand and the one second operand having the first data type comprise floating point values.
  • the first and second plurality of operands having the second data type comprise integer values.
  • At least one of the plurality of multiplier circuits is used to multiply operands in both the first mode and the second mode.
  • a number of multiplier circuits used to multiply operands in the first mode is the same as a number of multiplier circuits used to multiply operands in the second mode.
  • the one first operand and the one second operand having the first data type comprise a greater number of bits than the first and second plurality of operands having the second data type.
  • multiplier circuitry is used to multiply an operand and another operand of a first format.
  • One or more storage register circuits of the multiplier circuitry store digital bits corresponding to the operand of the first format and another operand of the first format.
  • a decomposing circuit of the multiplier circuitry decomposes the operand into a first plurality of operands, and decomposes the other operand into a second plurality of operands.
  • the multiplier circuitry further includes a plurality of multiplier circuits. Each multiplier circuit multiplies a respective first operand of the first plurality of operands with a respective second operand of the second plurality of operands to generate a corresponding partial result of a plurality of partial results.
  • An accumulator coupled to the plurality of multiplier circuits accumulates the plurality of partial results using a second format to generate a complete result of the second format that is stored in the accumulator circuit.
  • a conversion circuit converts the complete result of the second format into an output result of an output format.
  • a method for an integer multiplication comprises: storing digital bits corresponding to an operand of a first format and another operand of the first format, decomposing the operand and the other operand into a plurality of operands, multiplying a respective first operand of a first subset of the operands with a respective second operand of a second subset of the operands to generate a corresponding partial result of a plurality of partial results, and generating a complete result of a second format by accumulating the plurality of partial results in an accumulator circuit using the second format.
  • methods for performing element-wise operations between input operands of a first array and a second array are presented herein. At least one of the methods comprises: sorting exponents of the input operands by range into a plurality of exponent ranges, sorting a first subset of the input operands from the first array into a first plurality of groups each having a respective exponent range, sorting a second subset of the input operands from the second array into a second plurality of groups each having the respective exponent range, normalizing operands from each group of the first and second plurality of groups to be within a corresponding exponent range to generate first normalized operands in the first groups and second normalized operands in the second groups, executing element-wise operations on a corresponding subset of the first normalized operands and a corresponding subset of the second normalized operands to generate a respective intermediate result of a plurality of intermediate results for each group of the first plurality of groups and each group of the second plurality
  • the techniques described herein are incorporated in a hardware description language program, the hardware description language program comprising sets of instructions, which when executed produce a digital circuit.
  • the hardware description language program may be stored on a non-transitory computer-readable storage medium, such as a computer memory (e.g., a data storage system).
  • the operations of methods described herein are executed by a processor in accordance with sets of instructions stored at a non-transitory computer-readable storage medium, such as a computer memory (e.g., a data storage system).
  • FIG. 1 illustrates a computer system based on a tensor streaming processor (TSP) device according to one or more embodiments.
  • FIG. 2A illustrates a multimodal multiplier circuit according to one embodiment.
  • FIG. 2B illustrates a multimodal multiplier circuit according to another embodiment.
  • FIG. 2C illustrates a multimodal multiplier circuit according to yet another embodiment.
  • FIG. 3 illustrates an example multimodal multiplier circuit according to one embodiment.
  • FIG. 4 illustrates another example multimodal multiplier circuit according to one embodiment.
  • FIG. 5 illustrates a multimodal multiply-accumulator circuit according to another embodiment.
  • FIG. 6 illustrates a method for the multimodal multiplication according to an embodiment.
  • FIG. 7 illustrates multiplier circuitry with TruePoint™ (TP) format based accumulation of partial multiplication results according to an embodiment.
  • FIG. 8 is a graph illustrating improved precision of the TP based computations for a machine-learning workload.
  • FIG. 9 illustrates a method for integer multiplication with the TP based accumulation according to an embodiment.
  • FIG. 10 illustrates a method for conversion of floating point numbers during element-wise matrix operations according to an embodiment.
  • FIG. 11 illustrates a computing machine for use in commerce according to an embodiment.
  • the present disclosure describes a computing system that provides numerical precision equivalent to or better than FP32 numerical representation using integer formatted operands, e.g., 8-bit (INT8) or 4-bit (INT4) integer format operands.
  • the computing system presented herein converts operands from a floating point format to an integer format and implements a Toom-Cook decomposition algorithm to perform a plurality of integer multiplications to generate a plurality of partial multiplication results.
  • the partial multiplication results are then shifted so that they are aligned with an appropriate power (i.e., 1, 10, 100).
  • the partial multiplication results are accumulated in one or more accumulation registers using the TruePoint™ (TP) numerical precision (i.e., fixed point format representation).
  • a final multiplication result is obtained by rounding (i.e., truncating) the accumulated result to a desired numerical precision (e.g., FP32 numerical representation).
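The flow summarized above (integer significands, integer multiplication, alignment by shifting, fixed point accumulation, one final rounding) can be illustrated with a minimal Python sketch. It is not the TruePoint format itself, whose bit layout is not given here: the accumulator width `FRAC_BITS` and the helper names `product_to_fixed` and `fixed_point_dot` are hypothetical, and Python's unbounded integers stand in for a wide accumulation register.

```python
import math

FRAC_BITS = 80  # hypothetical number of fractional bits kept by the accumulator


def product_to_fixed(a: float, b: float) -> int:
    """Return a*b as an integer scaled by 2**FRAC_BITS (low bits truncated)."""
    sig_a, exp_a = math.frexp(a)   # a == sig_a * 2**exp_a, 0.5 <= |sig_a| < 1
    sig_b, exp_b = math.frexp(b)
    int_a = int(sig_a * 2**53)     # exact integer significand of a
    int_b = int(sig_b * 2**53)     # exact integer significand of b
    product = int_a * int_b        # exact integer multiplication
    # Shift so the product lines up with the accumulator's binary point.
    shift = exp_a + exp_b - 106 + FRAC_BITS
    return product << shift if shift >= 0 else product >> -shift


def fixed_point_dot(xs, ys) -> float:
    """Accumulate all products in fixed point and round only once at the end."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += product_to_fixed(x, y)
    return acc / 2**FRAC_BITS      # single final conversion back to a float
```

With the inputs `[1e16, 1.0, -1e16]` and `[1.0, 1.0, 1.0]`, the fixed point accumulation keeps the small middle term, whereas adding the same three products left to right in double precision loses it.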
  • a tensor streaming processor may be utilized as a core processor module of the computer system presented herein.
  • the TSP is particularly suited for computations in AI and ML applications.
  • the TSP is a device that is commercially available from Groq, Inc. of Mountain View, California.
  • the Groq TSP Node™ Accelerator Card is available as a x16 PCI-Express (PCIe) 2-slot expansion card that hosts a single Groq Chip1™ device.
  • the TSP core 100 (aka, AI processor and/or ML processor) includes memory and arithmetic modules optimized for multiplying and adding input data with weight sets (e.g., trained or being trained) for AI and/or ML applications (e.g., training or inference).
  • the TSP core 100 includes a vector processor (VXM) 110 for performing operations on vectors (i.e., one-dimensional arrays of values).
  • Other elements of the TSP core 100 are arranged symmetrically on either side of the VXM 110 to optimize processing speed.
  • the VXM 110 is directly adjacent to memory modules (MEMs) 111, 112.
  • Switch matrix units (SXMs) 113 and 114 are further arranged on both sides of the VXM 110 to control routing of data.
  • the TSP core 100 further includes numerical interpretation modules (NIMs) 115 and 116 for numeric conversion operations, and matrix multiplication units (MXMs) 117 and 118 for matrix multiplications.
  • An instruction control unit (ICU) 120 controls the flow of data and execution of operations across all functional blocks 110-118.
  • the TSP core 100 may further include communications circuits such as chip-to-chip (C2C) circuits 123, 124, and an external communication circuit (e.g., PCIe) 121.
  • the TSP core 100 may further include a chip control unit (CCU) 122 to control, e.g., boot operations, clock resets, some other low-level setup operations, or some combination thereof.
  • FIG. 2A illustrates a multimodal multiplier circuit according to one embodiment.
  • the multimodal multiplier circuit of FIG. 2A may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • Features and advantages of the present disclosure include multimodal multiplier circuits that may receive and process different data types with different numbers of bits in different modes and share circuitry, which may advantageously reduce circuit area and may improve the speed and efficiency of processing data, for example.
  • a multimodal multiplier circuit 220 may include one or more input storage register circuits 221 for storing digital bits representing input operands to be multiplied.
  • the storage register circuits 221 may store different numbers of operands to be multiplied together in different modes, and the operands may have different data types and different numbers of bits.
  • Storage register circuits are circuits that store digital bits, such as a plurality of flip flops or other digital storage circuits known to those skilled in the art.
  • a single storage register circuit may be partitioned into multiple storage register circuits, for example, to store different digital values (e.g., operands).
  • the one or more storage register circuits 221 in a first mode, store one first operand and one second operand having a first data type, and in a second mode the one or more storage register circuits store a first plurality of operands and a second plurality of operands having a second data type.
  • a plurality of multiplier circuits 222 may be configured to receive the one or more first operands and the one or more second operands, for example.
  • multipliers may be shared across modes. For example, in a first mode, two operands having the first data type are multiplied in one or more of the plurality of multiplier circuits 222. In a second mode, a first plurality of operands and a second plurality of operands are multiplied in the plurality of multiplier circuits 222. The first and second plurality of operands multiplied in the second mode may have fewer bits than the first and second operands multiplied in the first mode, for example. However, one or more of the multiplier circuits may be used for both modes.
  • At least one of the plurality of multiplier circuits is used to multiply operands in both the first mode and the second mode.
  • a number of multiplier circuits used to multiply operands in the first mode is the same as the number of multiplier circuits used to multiply operands in the second mode.
  • multimodal multiplier circuits 220 may be combined to form multimodal multiply-accumulator circuits.
  • an output of multimodal circuit 220 may comprise output product values having different data types or even different numbers of output products in different modes, for example.
  • Output products of a plurality of other multimodal multipliers 223 may be summed with output products of multimodal multiplier 220 in adder 224 to produce a multimodal multiply-accumulator.
  • an input register 225 may receive an input value (e.g., an output of another multiply-accumulator) and adder 224 may sum locally generated products with sums generated by other multimodal multiply accumulators, for example.
  • An output register may store a summed result and may couple the result to additional multiply-accumulator circuits, for example.
  • Arrays of such multimodal multiply-accumulate circuits may be configured to process large volumes of operands having different data types, for example.
  • Embodiments of the disclosure may be particularly advantageous in machine learning (aka artificial intelligence) digital processing circuit applications, where the one or more first operands are weights and the one or more second operands are activation values, for example.
  • FIG. 2B illustrates a multimodal multiplier circuit according to another embodiment.
  • the multimodal multiplier circuit of FIG. 2B may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • storage register circuit 200 may store digital bits corresponding to one or more first operands.
  • a second storage register circuit 201 may store digital bits corresponding to one or more second operands.
  • registers 200 and 201 may be one partitioned register or multiple distinct registers, for example.
  • In a first mode, the first and second storage register circuits 200 and 201 may store one first operand and one second operand, respectively, having a first data type (e.g., OpA and OpB), and in a second mode the first storage register circuit 200 stores a first plurality of operands (e.g., Op1 and Op2) and the second storage register circuit 201 stores a second plurality of operands (e.g., Op3 and Op4) having a second data type.
  • operands having the first data type may comprise a greater number of bits than operands having the second data type, for example.
  • operands having the first data type comprise floating point values, for example, and operands having the second data type comprise integer values, for example.
  • first and second multiplier circuits 210 and 211 are coupled to the first and second storage register circuits 200 and 201.
  • one first operand (e.g., OpA) in the first storage register circuit 200 and one second operand (e.g., OpB) in the second storage register circuit 201 are coupled to the first multiplier circuit 210.
  • a first operand of the first plurality of operands (e.g., Op1 of Op1 and Op2) in the first storage register circuit 200 and a first operand of the second plurality of operands (e.g., Op3 of Op3 and Op4) in the second storage register circuit 201 are coupled to the first multiplier circuit 210, and a second operand of the first plurality of operands (e.g., Op2 of Op1 and Op2) in the first storage register circuit 200 and a second operand of the second plurality of operands (e.g., Op4 of Op3 and Op4) in the second storage register circuit 201 are coupled to the second multiplier circuit 211.
  • select circuits (e.g., multiplexers) 202 and 203 may be used to selectively couple operands from input storage registers to particular multipliers based on a mode control signal. For example, in a first mode, select circuit 202 may couple OpA from register 200 to one input of multiplier 210, and select circuit 203 may couple OpB from register 201 to another input of multiplier 210.
  • registers 200 and 201 may each receive and store two operands on each multiplication processing cycle. Accordingly, in the second mode, select circuit 202 couples Op1 to one input of multiplier 210 and couples Op2 to one input of multiplier 211. Similarly, in the second mode, select circuit 203 couples Op3 to another input of multiplier 210 and couples Op4 to another input of multiplier 211. Accordingly, in some modes, data may be multiplied in parallel and multipliers may be shared across multiple modes, for example.
  • operands having the first data type may have a greater number of bits than operands having the second data type (e.g., integers).
  • multiplier circuit 210 may be configured to multiply inputs having a greater number of bits than multiplier circuit 211, for example.
  • operands having the second data type entering multiplier 210 may be sign extended to match the extended bit capabilities of multiplier circuit 210.
  • the multimodal multiplier circuits may further comprise a sign extension circuit 212 coupled to outputs of the first and second storage register circuits 200 and 201 to receive, in the second mode, one of the first plurality of operands (e.g., Op1) from the first storage register circuit 200 and one of the second plurality of operands (e.g., Op3) from the second storage register circuit 201, for example.
  • Sign extension circuit 212 may increase the number of bits of each binary number (e.g., Op1 and Op3) while preserving the number's sign (positive/negative) and value, for example.
  • Another select circuit 204 receives the mode control signal to couple inputs of multiplier 210 to either outputs of the sign extension circuit 212 to receive operands of the second data type, or alternatively, to outputs of select circuits 202 and 203 to receive operands of the first data type.
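As a rough illustration of what sign extension does to an operand before it enters a wider multiplier, the following Python sketch widens a two's-complement bit pattern. The 8-bit and 16-bit widths used in the note below and the function names `sign_extend` and `to_signed` are assumptions for the example, not details of circuit 212.

```python
def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Widen a two's-complement bit pattern while preserving sign and value."""
    if not 0 < from_bits <= to_bits:
        raise ValueError("require 0 < from_bits <= to_bits")
    pattern &= (1 << from_bits) - 1            # keep only the narrow field
    if pattern & (1 << (from_bits - 1)):       # sign bit set: value is negative
        # Copy the sign bit into every newly added upper bit position.
        pattern |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return pattern


def to_signed(pattern: int, bits: int) -> int:
    """Interpret a bit pattern of the given width as a signed integer."""
    pattern &= (1 << bits) - 1
    return pattern - (1 << bits) if pattern & (1 << (bits - 1)) else pattern
```

For instance, the 8-bit pattern `0xFB` (the value -5) becomes the 16-bit pattern `0xFFFB`, which still reads as -5, while a non-negative pattern such as `0x05` is simply padded with zeros.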
  • operands coupled to input registers 200 and 201 may be floating point numbers.
  • a multimodal multiplier circuit may further comprise an adder circuit 213.
  • exponent bits of one operand e.g., a floating point operand
  • exponent bits in a second operand e.g., another floating point operand
  • adder circuit 213 designated as dashed lines for when floating point is used.
  • Floating point values may have the form “significand × base^exponent,” where the exponents of two FP operands may be added in adder 213 and the significands (aka mantissas) of the FP operands are multiplied in multiplier 210, for example.
  • Floating point numbers may be represented in the system using more bits than integers, for example, and thus multiplier 210 may have more bits than multiplier 211, which may only multiply operands having the second data type, for example.
  • outputs of multipliers 210 and 211 and adder 213 may be further processed and added to other multiplier outputs.
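The split between a significand multiplier and an exponent adder can be sketched in Python using the standard `math.frexp` and `math.ldexp` functions. This assumes base 2 and double-precision inputs, and the name `fp_multiply` is hypothetical; it only mirrors the data flow described above rather than any particular bit width.

```python
import math


def fp_multiply(a: float, b: float) -> float:
    """Multiply two floats by handling significands and exponents separately."""
    sig_a, exp_a = math.frexp(a)   # a == sig_a * 2**exp_a
    sig_b, exp_b = math.frexp(b)   # b == sig_b * 2**exp_b
    sig_out = sig_a * sig_b        # multiplier path: product of significands
    exp_out = exp_a + exp_b        # adder path: sum of exponents
    return math.ldexp(sig_out, exp_out)   # recombine as sig_out * 2**exp_out
```

For values well inside the normal range this gives the same result as the built-in multiplication, since scaling by a power of two does not change how the significand product is rounded.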
  • processors aka artificial intelligence processors, e.g., neural networks
  • Such processors may require large volumes of multiply-accumulate functions, and it may be desirable in many applications to flexibly process input data represented in a variety of different data types, such as signed integer, unsigned integer, or floating point (e.g., FP16 IEEE 754).
  • the first operands are weights and the second operands are activation values and the circuits and methods described herein are implemented in a machine learning processor.
  • one mode may configure a machine learning processor to multiply floating point (FP) numbers.
  • a first FP operand corresponding to a weight may be stored in register 200 and a second FP operand corresponding to an activation (e.g., a pixel value of an input image) may be stored in register 201.
  • the significands of the first and second FP operands are coupled to a wide bit format multiplier 210, for example, and the exponent bits of the FP operands are coupled to adder 213 to produce an output product (e.g., the significand product OpA*OpB scaled by the base raised to the summed exponents).
  • the machine learning processor may multiply integer numbers.
  • two 8-bit integers, for example, may be stored in each of registers 200 and 201.
  • two integer weights may be stored in register 200 and two integer activations may be stored in register 201.
  • One activation and one weight may be coupled to a sign extend circuit so the integers match the wider format of multiplier 210, for example, and another activation and weight are coupled to multiplier 211 to be advantageously multiplied in parallel.
  • Outputs of multipliers 210 and 211 e.g., Op1*Op3 and Op2*Op4
  • Activations and weights may alternatively be multiplied together using the techniques illustrated in FIG. 2B, for example.
  • FIG. 2C illustrates a multimodal multiplier circuit according to yet another embodiment.
  • the multimodal multiplier circuit of FIG. 2C may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • one or more first operands, A, may be received in a first storage register circuit 230 and one or more second operands, B, may be received in a second storage register circuit 231.
  • a plurality of multipliers 232-235 are coupled to particular segments of registers 230 and 231 to receive the one or more operands.
  • operands may be positioned in different locations in registers 230 and 231 based on the mode so that multipliers 232-235 may be efficiently shared.
  • A and B each correspond to four (4) operands, A0-A3 and B0-B3 (e.g., a total of eight 8-bit integers).
  • operands A0-A3 are stored in register segments 230A-D, respectively
  • operands B0-B3 are stored in register segments 231A-D, respectively.
  • Multiplier 232 has one input coupled to segment 230A of register 230 and a second input coupled to segment 231A of register 231 to receive operands A0 and B0.
  • multiplier 233 has one input coupled to segment 230B and a second input coupled to segment 231B to receive operands A1 and B1
  • multiplier 234 has one input coupled to segment 230C and a second input coupled to segment 231C to receive operands A2 and B2
  • multiplier 235 has one input coupled to segment 230D and a second input coupled to segment 231D to receive operands A3 and B3. Accordingly, in one mode, multipliers 232-235 may multiply two sets of four 8-bit integer operands.
  • C0-C3 may be concatenated and added to output products of other multimodal multiplier circuits as described below.
  • the circuit may receive operands A and B having a different data type with a greater number of bits.
  • operands A and B may be 16-bit floating point numbers. Accordingly, these operands may be stored as components in different register segments of registers 230-231.
  • operand A may be stored as two components in two register segments in register 230
  • another operand B may be stored as two components in two register segments in register 231.
  • operand A comprises a first component (e.g., lower order bits) received on A0 and stored in register segment 230A and a second component (e.g., higher order bits) received on A2 and stored in register segment 230C.
  • Operand B comprises a first component (e.g., lower order bits) received on B0 and stored in register segment 231A and a second component (e.g., higher order bits) received on B1 and stored in register segment 231B, for example.
  • Embodiments of the present disclosure may selectively couple different input bits into different register segments in different modes.
  • the first component of A on input A0 may be coupled to and stored in register segment 230B, and the second component of A on input A2 may be coupled to and stored in register segment 230D.
  • the first component of B on input B0 may be coupled to and stored in register segment 231C
  • the second component of B on input B1 may be coupled to and stored in register segment 231D.
  • select circuits e.g., multiplexers
  • multiplier 232 receives the first component (on A0) of operand A and the first component (on B0) of operand B
  • multiplier 233 receives the first component (on A0) of operand A and the second component (on B1) of operand B
  • multiplier 234 receives the second component (on A2) of operand A and the first component (on B0) of operand B
  • multiplier 235 receives the second component (on A2) of operand A and the second component (on B1) of operand B.
  • multipliers 232-235 perform the following multiplications: A0B0, A0B1, A2B0, and A2B1, where A0 are the lower order (less significant) bits of A, A2 are the higher order (more significant) bits of A, B0 are the lower order (less significant) bits of B, and B1 are the higher order (more significant) bits of B.
  • Output product values C0-C3 of components of the inputs may be stored in register 237, for example.
  • outputs of multipliers 232-235 may be coupled to shift circuits 240-243.
  • Outputs of shift circuits 240-243 are coupled to an adder circuit to produce an output product of the inputs A*B.
  • C0 may be coupled to shift circuit 240, which may have a nominal shift value of 0
  • C2 may be coupled to shift circuit 242, which may have a nominal shift value of N
  • C3 may be coupled to shift circuit 243, which may have a nominal shift value of 2N.
  • Each shift circuit may perform a left shift, for example.
  • products of lower order bits A0B0 are not shifted
  • products of higher and lower order bits A2B0 and B1A0 are shifted by N
  • products of higher order bits A2B1 are shifted by 2N.
  • no shifter 240 may be included since C0 may not be shifted.
  • exponent bits of floating point operands, expA and expB, may be input to adder circuit 260 and added together, and the result used to increase the shift performed by each shift circuit.
  • the outputs of the shift circuits are summed in an adder circuit 244, which may comprise a plurality of N-bit adders, for example.
  • the shifted and added output product values may provide a second output (Out2) in one of the modes, which may be a fixed point representation, for example. Accordingly, in some embodiments, multiplication of the inputs may result in output products being converted to a third data type, which may be added to output products of other multimodal multiplier circuits as described below.
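The decompose, multiply, shift, and add sequence described for the wider-operand mode can be checked with a short Python sketch. It assumes unsigned 16-bit operands split into two N = 8 bit components and ignores the exponent-driven extra shift; the function names are hypothetical, while the variable names follow the text (A0/B0 low halves, A2/B1 high halves, C0-C3 partial products).

```python
N = 8                     # assumed component width in bits
MASK = (1 << N) - 1


def split(x: int) -> tuple[int, int]:
    """Split an unsigned 2N-bit value into (lower order, higher order) halves."""
    return x & MASK, (x >> N) & MASK


def multiply_by_parts(a: int, b: int) -> int:
    """Form A*B from four N-bit partial products, as multipliers 232-235 do."""
    a0, a2 = split(a)     # a0: lower order bits of A, a2: higher order bits
    b0, b1 = split(b)     # b0: lower order bits of B, b1: higher order bits
    c0 = a0 * b0          # low x low, not shifted
    c1 = a0 * b1          # low x high, shifted by N
    c2 = a2 * b0          # high x low, shifted by N
    c3 = a2 * b1          # high x high, shifted by 2N
    return c0 + (c1 << N) + (c2 << N) + (c3 << (2 * N))
```

Calling `multiply_by_parts(0x1234, 0xABCD)` returns the same value as `0x1234 * 0xABCD`, which is the identity the shift values 0, N, N, and 2N are meant to preserve.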
  • FIG. 3 illustrates a multimodal multiplier circuit according to another embodiment.
  • the multimodal multiplier circuit of FIG. 3 may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • Some embodiments of the present disclosure may receive and process operands in one mode with high precision, including bit lengths long enough such that, when in another mode, multiple lower bit length operands may be processed in a plurality of parallel multipliers.
  • Multiplier 310 may process one operand from each register 300-301 in a first mode, and multipliers 310 and 311 may combine two operands from each register 300-301 in a second mode.
  • The circuit in FIG. 3 may further comprise a third storage register circuit 302 for storing digital bits corresponding to two additional operands (Op5, Op6) and a fourth storage register circuit 303 for storing digital bits corresponding to two more operands (Op7, Op8), where Op5-Op8 have the second data type with fewer bits than the first data type (e.g., INT8 v. FP16).
  • register 302 stores weight values and register 303 stores activation values.
  • the circuit in FIG. 3 may further include multipliers 312 and 313.
  • Select circuits 322 and 323 couple operands in registers 302 and 303 to multiplier circuits 312 and 313.
  • multiplier circuit 312 may be coupled to storage register circuits 302 and 303 to receive an operand (e.g., Op5) from storage register circuit 302 and another operand (e.g., Op7) from storage register circuit 303.
  • multiplier circuit 313 may be coupled to storage register circuits 302 and 303 to receive an operand (e.g., Op6) from storage register circuit 302 and another operand (e.g., Op8) from storage register circuit 303.
  • Op5-Op6 are weights and Op7-Op8 are activation values.
  • the output of each multiplier is an activation multiplied by a weight.
  • four multiplications may be performed in parallel.
  • the outputs of each multiplier 310-313 may be coupled to an adder 330, which may sum (or accumulate) products, for example.
  • the final output may be stored in an output register.
  • the output products from multipliers 310-313 are added to corresponding values in an input register 350, for example.
  • some embodiments may accumulate products of activations and weights (x*wt) along a column of multipliers (not shown), for example.
  • input register 350 may store four (4) integer values (A1, A2, A3, A4), which are added to the four corresponding output products from multipliers 310-313 (R1, R2, R3, R4).
  • the result is four (4) corresponding output values in output register 340 (A1+R1, A2+R2, A3+R3, A4+R4), which may be coupled to an input register of another group of multipliers, for example.
  • multiplier 310 may, in the first mode, produce floating point values, which are then converted to a third data type, such as fixed point, having an extended bit length to achieve wide dynamic range and accuracy.
  • the same adder 330 and output register 340 may be used to store one extended length data type or multiple integer data types, for example, which may have advantages including reduced circuit area, for example.
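The second-mode data flow of FIG. 3 can be sketched as below. This is only an illustrative model under stated assumptions: Python lists stand in for registers 302, 303, 350 and 340, and the function name `accumulate_lanes` is hypothetical.

```python
def accumulate_lanes(weights, activations, input_register):
    """Model four parallel multipliers whose products are added lane by lane.

    weights, activations: operands held in the storage registers.
    input_register: values (A1..A4) arriving from another group of multipliers.
    Returns the output register contents (A1+R1, ..., A4+R4).
    """
    if not (len(weights) == len(activations) == len(input_register)):
        raise ValueError("all registers must hold the same number of lanes")

    # Each multiplier forms one product: activation multiplied by weight.
    products = [w * x for w, x in zip(weights, activations)]

    # The adder sums each product with the matching input register value.
    return [a + r for a, r in zip(input_register, products)]
```

With weights `[1, 2, 3, 4]`, activations `[5, 6, 7, 8]` and input values `[10, 20, 30, 40]`, the output register would hold `[15, 32, 51, 72]`.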
  • FIG. 4 illustrates a multimodal multiplier circuit according to yet another embodiment.
  • the multimodal multiplier circuit of FIG. 4 may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • the output of multiplier 210 is coupled to a select circuit 401.
  • the output product of multiplier 210 and summed exponents from adder 213 may be coupled to a denormalizer circuit 403.
  • the denormalizer circuit 403 may receive a floating point product from multiplier circuit 210 and summed exponent bits from adder circuit 213 and produce a fixed point value.
  • a fixed point value may be used to advantageously optimize dynamic range and precision, for example.
  • the fixed point value comprises a number of bits equal to at least N times the number of bits produced by products of operands having the second data type.
  • the fixed point representation of the number, in the first mode may have an extended bit length (e.g., 90-100 bits).
  • a first output product of multiplier 210 has a first bit length greater than the other multipliers (e.g., multiplier 211 or multipliers 311-313, as mentioned above).
  • one or more of the output products of the multipliers may be sign extended (e.g., at 450), in the second mode, so that the bit lengths of the output products are the same.
  • the final bit length of the output products of the plurality of multipliers, in the second mode, may be substantially the same as the bit length of the fixed point number from denormalizer circuit 403 in the first mode, for example.
  • equalizing the number of bits between first and second modes may include concatenating the multiplier outputs, for example, using concatenation circuit 402.
  • select circuit 401 couples the output of multiplier 210 to one input of concatenation circuit 402
  • other inputs of concatenation circuit 402 may be coupled to outputs of other multiplier circuits, such as multiplier circuit 211 as shown in FIG. 4, for example.
  • additional padding bits may be added between the concatenated values in the second mode to isolate the individual values during the addition described below, for example.
  • as in FIG. 3, other example embodiments may be extended to include more parallel multiplication paths for additional operands having a second data type and received during a second mode. For example, four (4) pairs of INT8 values may be multiplied, and the products concatenated, added, and stored in output register 406.
  • Adder 405 may also be configured to receive digital values from an input register 407, for example, which may be a value produced using one or more other multimodal multiplier units.
  • input register 407 includes an extended length fixed point number
  • input register 407 may include the same number of values as received by concatenation circuit 402 (e.g., four 8-bit integers).
  • adder 405 may receive and sum two or more fixed point numbers, in a first mode, or multiple arrays of values in a second format (e.g., two or more 4 integer arrays) in a second mode.
  • the results are stored in output register 406.
  • output register 406 may store either one fixed point number or two integers, for example.
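The role of the padding bits in the concatenated second-mode word can be sketched as follows. This is a simplified model, assuming unsigned lane values and a 25-bit lane width chosen only for illustration; the names `pack`, `unpack` and `LANE_BITS` are hypothetical. It shows that one wide addition produces independent per-lane sums as long as no lane overflows into its neighbour.

```python
LANE_BITS = 25  # assumed lane width: product bits plus padding bits


def pack(values, lane_bits=LANE_BITS):
    """Concatenate unsigned lane values into one wide integer word."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << lane_bits):
            raise ValueError("value does not fit in one lane")
        word |= v << (i * lane_bits)
    return word


def unpack(word, count, lane_bits=LANE_BITS):
    """Split a wide integer word back into its lane values."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(count)]
```

For instance, `unpack(pack([65025] * 4) + pack([1, 2, 3, 4]), 4)` gives `[65026, 65027, 65028, 65029]`: a single wide adder has summed four lanes at once, because each 16-bit product leaves spare padding bits above it for carries.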
  • FIG. 5 illustrates a multimodal multiply-accumulator circuit according to another embodiment.
  • the multimodal multiply-accumulator circuit of FIG. 5 may be a building block of the VXM 110, the MXM 117 and/or the MXM 118 of FIG. 1.
  • a plurality of multimodal multipliers are configured in parallel, and outputs of the multipliers are coupled to inputs of an adder circuit to form a multiply-accumulator.
  • groups of multiply-accumulator circuits may be configured in series.
  • multimodal multiplier circuits 510A-N may receive input operands in a first or second data type and a mode control signal (“mode”) to configure the multiplier circuits to process different types of inputs.
  • Each multimodal multiplier 510A-N may receive a pair of operands having the first data type (e.g., FP16) in a first mode.
  • each multimodal multiplier 510A-N may receive a plurality of pairs of operands having the second data type (e.g., INT8) in a second mode.
  • the pairs of operands may be activation values and weights of a neural network, for example, where the circuit in FIG. 5 may be included in a machine learning digital data processing circuit.
  • each multimodal multiplier 510A-N may be coupled to adder 520, which may (in some embodiments) correspond to adder 330 in FIG. 3 or adder 405 in FIG. 4, for example.
  • adder 520 sums values having a third data type (e.g., fixed point), where each multimodal multiplier 510A-N converts a product of the input operands from the first data type (e.g., float) to the third data type (e.g., extended length fixed point) as mentioned above.
  • adder 520 sums values having the second data type (e.g., integer).
  • product values from a particular multiplier in each multimodal multiplier 510A-N are added to product values from corresponding multipliers.
  • the product from multiplier 310 in one multimodal multiplier 510A is added to the products from multiplier 310 in the other multimodal multipliers 510B-N
  • the product from multiplier 311 in one multimodal multiplier 510A is added to the products from multiplier 311 in the other multimodal multipliers 510B-N, and so on.
  • results from columns of multipliers in an array of multiplier circuits may be combined independently (e.g., as arrays of values).
  • Outputs of adder 520 are stored in output register circuit 530, which stores a single output value in the third data type, for example, in the first mode and multiple output values having the second data type in the second mode, for example.
  • each multiply-accumulator circuit 500-502 may comprise an input register circuit having an input coupled to an output register circuit of another multimodal multiply-accumulator circuit.
  • multiply accumulator circuit 500 includes an input register 540, which may be configured to receive one or more sums from multiply-accumulator 501 based on the mode the system is operating in, for example.
  • input register 540 receives and stores a single input value, which may have the third data type (e.g., an extended fixed point value), and when multiply-accumulator circuits 500 and 501 are in a second mode, input register 540 receives and stores a plurality of input values having the second data type (e.g., four (4) integer values).
  • An output of register 540 is coupled to the adder circuit 520. Accordingly, in the first mode, a plurality of values, one from each multimodal multiplier 510A-N, may be added together and further added to the single input value in register 540.
  • each multimodal multiplier 510A-N and the multiple values from input register 540 are added, where values corresponding to particular columns are added to other values corresponding to particular columns. For example, if there are four values in input register 540 and four multipliers used in each multimodal multiplier 510A-N in the second mode, then a first of the four values from register 540 may be added with values from N multipliers 310 (See FIG. 3) in each of 510A-N, a second of the four values from register 540 may be added with values from multipliers 311 (in FIG. 3) in each of 510A-N, and so on, which may result in four summed output values in output register 530. An output of the output register circuit 530 is coupled to multimodal multiply-accumulator circuit 502 and a similar process may be repeated, for example.
  • FIG. 6 illustrates a method for the multimodal multiplication according to an embodiment.
  • digital bits corresponding to one or more first operands are stored in a first storage register circuit.
  • digital bits corresponding to one or more second operands are stored in a second storage register circuit.
  • the first and second storage register circuits may store one first operand and one second operand having a first data type.
  • the first and second storage register circuits may store a first plurality of operands and a second plurality of operands having a second data type.
  • the one first operand in the first storage register circuit and the one second operand in the second storage register circuit are multiplied in a first multiplier circuit coupled to the first and second storage register circuits.
  • one of the plurality of first operands in the first storage register circuit and one of the plurality of second operands in the second storage register circuit are multiplied using the first multiplier circuit.
  • another one of the plurality of first operands in the first storage register circuit and another one of the plurality of second operands in the second storage register circuit are multiplied using the second multiplier circuit.
  • a digital system, such as a computer system based on the TSP core 100, utilizes either a floating point format or an integer format to store representations of input operands in a compressed format while arithmetic calculations (e.g., multiplications and additions) can be performed in an integer format.
  • the results of arithmetic operations are accumulated in one or more accumulate registers using the TP format (i.e., fixed point numerical representation), and a final multiplication result is obtained by truncating the accumulation result to a desired precision (e.g., FP32).
  • the TP format is a fixed point numerical representation of an accumulation of FP16 products that avoids the need for higher precision calculations in the matrix multiplication loop.
  • the TP format represents a fixed point numerical representation for the accumulation result having an accuracy comparable to a higher precision FP numerical representation (e.g., FP64 numerical precision).
  • a sum of products is converted from the TP format (i.e., the fixed point loss-less integer representation) to, e.g., FP32 numerical representation with only 23 bits of significand.
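The idea of accumulating FP16 products without loss in a fixed point format and converting only the final sum to FP32 can be sketched as below. This is a software model under stated assumptions, not the disclosed circuit: a Python integer stands in for the wide accumulate register, the scale of 2^-48 is chosen because every finite FP16 value is a multiple of 2^-24, the final step rounds rather than truncates, and the names `tp_dot`, `to_fp16`, `to_fp32` and `FRAC_BITS` are hypothetical. Inputs are assumed finite.

```python
import struct
from fractions import Fraction

FRAC_BITS = 48  # assumed scale: any product of two FP16 values is a multiple of 2**-48


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tp_dot(a, b) -> float:
    """Dot product of FP16 operands with exact fixed point accumulation."""
    acc = 0  # arbitrarily wide integer accumulator (fixed point, 48 fraction bits)
    for x, y in zip(a, b):
        product = Fraction(to_fp16(x)) * Fraction(to_fp16(y))  # exact product
        scaled = product * (1 << FRAC_BITS)
        acc += scaled.numerator  # scaled is always an integer at this scale
    # Only the final sum is converted to the output floating point format.
    return to_fp32(acc / (1 << FRAC_BITS))
```

As a usage example, with `a = [1024.0, 2**-10, -1024.0]` and `b = [1024.0, 2**-10, 1024.0]` the exact sum is `2**-20`, which `tp_dot(a, b)` returns; accumulating the same three products in FP32 would lose the small middle term.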
  • FIG. 7 illustrates multiplier circuitry with the TP format based accumulation of partial multiplication results according to an embodiment.
  • the multiplier circuitry of FIG. 7 can be a building block of the VXM 110, the MXM 117 and/or the MXM 118.
  • the multiplier circuitry of FIG. 7 is a component of an array of multipliers within, e.g., the MXM 117 and/or the MXM 118.
  • One or more storage register circuits 700, 701 store digital bits corresponding to an operand of a first format and another operand of the first format.
  • the first format may be an INT4 format, an INT8 format, an INT16 format, an FP16 format (e.g., in accordance with the IEEE 754 standard), an FP32 format (e.g., in accordance with the IEEE 754 standard), or some other numerical representation format.
  • Conversion circuits 702, 703 may convert the operand and the other operand from a floating point format into an integer format prior to decomposition of the operand and the other operand.
  • the Mode signal is a bit signal having a first value (e.g., “0”) when the first format is an integer format (e.g., INT4, INT8, INT16) and having a second value (e.g., “1”) when the first format is a floating point format (e.g., FP16, FP32).
  • a decomposition circuit 704 decomposes the operand into a first plurality of operands (e.g., smaller integer numbers).
  • the decomposition circuit 704 further decomposes the other operand into a second plurality of operands (e.g., smaller integer numbers).
  • the decomposition circuit 704 may decompose the operand and the other operand by applying, e.g., a Toom-Cook decomposition algorithm. Details about the Toom-Cook decomposition algorithm are provided further below.
  • the first plurality of multipliers 706A, ... , 706N and the second plurality of multipliers 708A, ... , 708M are integer multipliers.
  • each operand of the first plurality of operands is routed from the decomposition circuit 704 to each multiplier of a first plurality of multipliers 706A, ... , 706N as well as to each multiplier of a second plurality of multipliers 708A, ... , 708M.
  • each operand of the second plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, ...
  • Each pair of operands from the first and second pluralities of operands are mutually multiplied in a corresponding multiplier of the first and second pluralities of multipliers 706 A, ...,
  • 708M are stored in corresponding registers 709A, ... , 709N, 710A, ... , 710M.
  • the first format is a floating point format
  • a significand portion from each operand of the first plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, ... , 706N as well as to each multiplier of the second plurality of multipliers 708A, ... , 708M.
  • a significand portion from each operand of the second plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, ... , 706N as well as to each multiplier of the second plurality of multipliers 708A, ... , 708M.
  • Each pair of significand portions from the first and second pluralities of operands is multiplied in a corresponding multiplier of the first and second pluralities of multipliers 706A, ..., 706N, 708A, ... , 708M to generate a corresponding partial result stored in a corresponding register 709A, ... , 709N, 710A, ... , 710M.
  • an exponent portion from each operand of the first plurality of operands is routed from the decomposition circuit 704 to each adder of a first plurality of adders 705 A, ... , 705N as well as to each adder of a second plurality of adders 707A,
  • the adders 705A, ..., 705N, 707A, ... , 707M can be turned off based on the Mode signal, all-zero bits can be routed to the inputs of the adders, or the adders can be bypassed in some other manner so that their outputs are not utilized.
  • each partial result stored in the corresponding register 709A, ... , 709N, 710A, ... , 710M is shifted at a corresponding shift circuit 713 A, ... , 713N, 714A, ... , 714N by a number of bits equal to a value of a respective exponent Expn, ... , Expm, Expm, ... , ExpNM output from a corresponding adder 705A, ..., 705N, 707 A, ... , 707M.
  • Each shifted partial result is passed onto a corresponding conversion circuit 715A, ... , 715N, 716A, ... , 716M.
  • Conversion circuits 715A, ... , 715N, 716A, ... , 716M convert the plurality of partial results to the TP format, i.e., to the fixed point numerical representation.
  • a position of a decimal point in the TP numerical representation of each shifted partial result is based on a value of the respective exponent Expn, ... , Expm, Expm, ... , ExpNM.
  • the shift circuits 713A, ... , 713N, 714A, ... , 714N and the conversion circuits 715A, ... , 715N, 716A, ... , 716M are bypassed using, e.g., corresponding demultiplexers 711A, ... , 711N, 712A, ... , 712M controlled by an appropriate value of the Mode signal.
  • the accumulator circuit 719 accumulates the plurality of partial results (or the plurality of shifted partial results) using the second format (i.e., the TP numerical representation) to generate a complete result of the second format that is also stored in a register of the accumulator circuit 719.
  • the accumulator circuit 719 accumulates the plurality of partial results from a smallest partial result among the plurality of partial results to a largest partial result among the plurality of partial results.
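The source does not state why the partial results are accumulated from smallest to largest. One general reason ordering can matter, shown here purely as an illustration with ordinary double precision floats rather than the TP format, is that small terms added to a large running sum may be rounded away, whereas adding small terms first lets them combine before meeting the large ones. The function names are hypothetical.

```python
def accumulate_in_order(values):
    """Sum floats in the order given, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def accumulate_smallest_first(values):
    """Sum floats after sorting them by increasing magnitude."""
    return accumulate_in_order(sorted(values, key=abs))
```

For `values = [1e16, 1.0, 1.0, -1e16]` the exact sum is 2. Summing in the given order returns `0.0`, because each `1.0` is lost when added to `1e16`, while summing smallest first returns `2.0`.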
  • FIG. 7 illustrates a single accumulator circuit 719
  • the multiplier circuitry in FIG. 7 may comprise a plurality of accumulator circuits, e.g., connected into a single accumulation stage or multiple accumulation stages.
  • the accumulator circuit 719 comprises at least 80 bits. In another embodiment, the accumulator circuit 719 comprises 96 bits. In yet another embodiment, the accumulator circuit 719 comprises 128 bits. However, an accumulator circuit 719 wider than 128 bits can also be utilized.
  • a register of the accumulator circuit 719 is at least 116 bits wide because 22 compressed carry bits and three status bits are used for carry information to enable calculations using a faster clock frequency.
  • Accumulated multiplier results are converted from the 116-bit register of the accumulator circuit 719 with 91-bit integer precision to FP32 using a truncation/conversion circuit 720 coupled to an output of the accumulator circuit 719.
  • the truncation/conversion circuit 720 may be part of the NIM 115 or the NIM 116, and the conversion may occur when the accumulated multiplier results are streamed from the MXM 117 or the MXM 118 to the VXM 110.
  • a width of each partial output sum at the register of the accumulator circuit 719 is 25 bits.
  • a total of four partial sums are concatenated to 100 bits at the register of the accumulator circuit 719 to achieve INT32 precision.
  • the remaining bits in the register of the accumulator circuit 719 are not used.
  • the value produced and stored at the register of the accumulator circuit 719 is in a fully loss-less INT32 format, i.e., the TP format with INT32 numerical representation.
  • an accumulator in the NIM 115 (or in the NIM 116) performing a full sum operation would resolve compressed carry bits in a 112-bit word to 90 bits, and then accumulate multiple 256x256 matrix multiplication output values, with a maximum capacity to accumulate up to 2^38 90-bit TP numbers into a single INT128. If matrix multiplications are interleaved, then the partial (interim) results are added separately.
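The bit widths quoted above can be checked with a few lines of integer arithmetic. The sketch below assumes unsigned values for simplicity (the source does not spell out the signedness), and the helper name `bits_for_sum` and the constant names are hypothetical.

```python
VALUE_BITS = 90    # resolved TP value width quoted above
CARRY_BITS = 22    # compressed carry bits
STATUS_BITS = 3    # status bits
TERM_COUNT = 2 ** 38


def bits_for_sum(term_bits: int, term_count: int) -> int:
    """Bits needed to hold the sum of term_count unsigned term_bits-bit values."""
    max_sum = term_count * ((1 << term_bits) - 1)
    return max_sum.bit_length()


# 90 value bits plus 22 carry bits give the 112-bit word.
word_bits = VALUE_BITS + CARRY_BITS

# 91 integer bits plus carry and status bits give the 116-bit register.
register_bits = 91 + CARRY_BITS + STATUS_BITS

# Summing 2**38 values of 90 bits each needs exactly 128 bits.
int128_bits = bits_for_sum(VALUE_BITS, TERM_COUNT)
```

Under these assumptions `word_bits` is 112, `register_bits` is 116 and `int128_bits` is 128, which agrees with the figures in the text.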
  • the VXM 110 may comprise an arbitrary precision arithmetic instruction that includes a carry that is persistent to the next clock cycle. Using an initial ADD MOD, a series of ADD MOD CI instructions, and an optional final ADD MOD CI INT320,0 to get the final carry bit, any size INT can be accumulated at the accumulator circuit 719.
  • fused dot product operation instead of a fused multiply accumulate operation.
  • the result of fused dot product operation is obtained and stored within the register of the accumulator circuit 719 to maintain a pre-defined precision, e.g., the precision of at least 80 bits.
  • up to 320 partial results of the fused dot product operation can be accumulated in the accumulator circuit 719 without any truncation.
  • An accumulated result in the second format (e.g., TP format) stored in the register of the accumulator circuit 719 represents a complete multiplication result.
  • the truncation/conversion circuit 720 coupled to the register of the accumulator circuit 719 converts the complete multiplication result of the second format (e.g., the TP number) into an output result of an output format that is stored in an output register 721.
  • the truncation/conversion circuit 720 may convert the complete multiplication result from the second format into the output format by first selectively truncating a portion of the complete multiplication result stored in the register of the accumulator circuit 719.
  • the truncation/conversion circuit 720 converts the complete multiplication result (i.e., the truncated accumulation result) into the output format, e.g., FP32 format, FP64 format, FP128 format, or some other floating point format.
  • the conversion by the truncation/conversion circuit 720 may be based on a desired output precision provided to the truncation/conversion circuit 720 via an “Out_Format” signal, as shown in FIG. 7.
  • the rounding (i.e., truncation) to the FP32 format in accordance with the IEEE 754 standard uses 8 bits to represent an exponent and 23 bits to represent a significand.
  • the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the FP32 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 4.98 teraflops. Note that “one teraflops” represents a computing speed of one trillion (10^12) floating point operations per second while providing numerical results with precision equivalent to a FP32 unit.
  • the rounding (i.e., truncation) to the FP16 format in accordance with the IEEE 754 standard uses 5 bits for the exponent and 10 bits for the significand. It can be shown that the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the FP16 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 403 teraflops. Additionally, the rounding (i.e., truncation) to a 16-bit representation with 8 exponent bits and 7 bits for the significand can be utilized, which can be denoted as bfloat16 or BF16. It can be shown that the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the BF16 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 44.78 teraflops.
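The field widths mentioned above can be made concrete with a short sketch that inspects an FP32 bit pattern and truncates it to BF16 precision by keeping the sign, the 8 exponent bits and the top 7 significand bits. This only models the number formats, not the truncation/conversion circuit 720; the function names are hypothetical.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 binary32 pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_fields(x: float):
    """Split x into (sign, 8-bit exponent, 23-bit significand) fields."""
    bits = fp32_bits(x)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def truncate_to_bf16(x: float) -> float:
    """Drop the low 16 significand bits, leaving BF16 precision."""
    bits = fp32_bits(x) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, `fp32_fields(1.0)` is `(0, 127, 0)`, and `truncate_to_bf16(3.14159)` is `3.140625`, showing how much significand is given up when only 7 bits are kept.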
  • the decomposition circuit 704 performs the decomposition of large integers by applying the Toom-Cook decomposition algorithm in order to obtain smaller integers suitable for faster integer multiplications.
  • the decomposition circuit 704 can apply the Toom-Cook3 decomposition algorithm.
  • the decomposition circuit 704 that applies the Toom-Cook algorithm can be a building block of the VXM 110, MXM 117 and/or the MXM 118 separate from digital multiplier circuitry.
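To illustrate the general idea of splitting large integers into smaller ones before multiplying, the sketch below implements one level of the two-way member of the Toom-Cook family (often called Toom-2 or Karatsuba). It is deliberately not the three-way Toom-Cook3 variant named above, and it is not taken from the disclosure; the function name and the `limb_bits` parameter are hypothetical, and operands are assumed non-negative.

```python
def toom2_multiply(a: int, b: int, limb_bits: int) -> int:
    """Multiply non-negative integers using three smaller multiplications."""
    mask = (1 << limb_bits) - 1
    a0, a1 = a & mask, a >> limb_bits   # a = a1 * 2**limb_bits + a0
    b0, b1 = b & mask, b >> limb_bits   # b = b1 * 2**limb_bits + b0

    z0 = a0 * b0                          # low partial product
    z2 = a1 * b1                          # high partial product
    z1 = (a0 + a1) * (b0 + b1) - z0 - z2  # equals a0*b1 + a1*b0

    # Recombine the partial products at their proper bit positions.
    return (z2 << (2 * limb_bits)) + (z1 << limb_bits) + z0
```

As a usage example, `toom2_multiply(7, 22, 3)` returns `154`. The design point is that the cross term is obtained from one extra multiplication instead of two, at the cost of a few additions.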
  • the integers 7 and 22 are multiplied.
  • two integer multiply operations would occur, and each time the partial result would be 14, but the correct numbers to be added are 140 and 14 yielding a proper final multiplication result of 154.
  • the problem would occur when the least significant digit is truncated to obtain an approximate final result, which is typical in the case of rounding floating point numbers. Then, shifting the digits to account for the ones, tens, and hundreds columns (e.g., performed at the shift circuits 713A,... , 713N, 714A, ...
  • operands that are input into the multiplier circuitry of FIG. 7 are either represented in a signed or unsigned integer format (e.g., INT4 or INT8) or in a floating-point format (e.g., FP16 or FP32 format).
  • the multiplier circuitry of FIG. 7 can be configured to identify the format of input operands, e.g., INT8 format or FP16 format. Note that INT8 format of operands would require INT8 multiplications with INT32 accumulation, while FP16 format of input operands would require FP16 multiplication with FP32 accumulation.
  • the multiplier circuitry of FIG. 7 supports INT8 multiplication, INT16 multiplication (with INT64 accumulation), INT32 multiplication (with INT128 accumulation), as well as the multiplication between an INT8 operand and an INT4 operand (e.g., when weight precision is not required).
  • INT8 multiplications (with INT32 accumulation) have sufficient precision and accuracy for inference applications. It should be noted that precision and accuracy are two different requirements.
  • the precision requirement is related to a number of bits for representation of a multiplication result, e.g., a 16-bit multiplication result.
  • the accuracy requirement is related to whether the multiplication result is mathematically correct, e.g., whether the 16-bit result is mathematically correct or not.
  • models in AI and/or ML applications are generally trained using floating point representation of numbers because the trained models require the fidelity to calculate converging differences between weights of a previous learning iteration and weights of a current learning iteration. Otherwise, the trained models would not converge as the differences would be greater than predetermined threshold values, i.e., the differences would be too large to converge.
  • the multiplier circuitry of FIG. 7 can be part of a common circuitry of the MXM 117 and/or the MXM 118 shared between the floating point type arithmetic and integer type arithmetic.
  • Input operands of the multiplier circuitry of FIG. 7 are either in integer format (e.g., INT8) or in floating-point format (e.g., FP16).
  • the multiplier circuitry of FIG. 7 can handle inputs in either floating point format or integer format.
  • each multiplier 706A, ..., 706N, 708A, ... , 708M can input an operand that is either a signed integer, an unsigned integer or a floating point number.
  • , 708M may be configured (e.g., using an appropriate internal circuitry) to identify the input data type, perform required conversion if any (e.g., from the floating point format to integer format), and perform the integer multiply operations to generate partial products. Then, the partial products can be accumulated in the accumulation circuit using the TP format to obtain a final multiplication result as a sum of the partial products.
  • the multiplier circuitry of FIG. 7 can perform operations on two sets of integer input operands, and the final output products would be two 24-bit quantities, i.e., sums of integer products.
  • 24 bits is sufficient to hold the sum of products of 255 and 255 (i.e., the largest operands for INT8 format).
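The 24-bit figure can be checked numerically. The sketch below assumes unsigned 8-bit operands and, as a further assumption not stated in this sentence, a dot product of 256 terms; the function name `max_dot_product` is hypothetical.

```python
def max_dot_product(operand_bits: int, terms: int) -> int:
    """Largest possible sum of `terms` products of unsigned operands."""
    largest = (1 << operand_bits) - 1   # 255 for 8-bit operands
    return terms * largest * largest


# Worst case for 256 products of 255 * 255.
worst_case = max_dot_product(8, 256)
bits_needed = worst_case.bit_length()
```

Under these assumptions `worst_case` is 16,646,400 and `bits_needed` is 24, so the sum fits in a 24-bit quantity.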
  • the final products can be locally summed (e.g., as part of the VXM 110, the MXM 117 and/or the MXM 118) column by column across the entire array.
  • the operands are converted to the TP format, e.g., at the conversion circuits 702, 703 or the conversion circuits 715A, ... , 715N, 716A, ... , 716M.
  • the products of floating point multiply and accumulate operations are thus maintained in the TP format at the accumulator circuit 719.
  • the multiplier circuitry of FIG. 7 can maintain results of the multiplications and summations in the TP format, which advantageously maintains absolute accuracy for operands spanning a range of numbers from very small numbers to very large numbers.
  • the TP format maintains the complete number in its fixed point format and outputs the final result as a fixed point TP number and an exponent (before conversion to a desired floating point format).
  • the multiplier circuitry of FIG. 7 accepts input operands for, e.g., matrix multiplication in FP16 format, but generates a final multiplication result that is output in, e.g., FP32 format, which is far more precise than FP16 format (e.g., because of 23 bits for the mantissa and 8 bits for the exponent). Accordingly, by utilizing integer multipliers and performing accumulation of partial products in the TP format, the multiplier circuitry of FIG. 7 effectively performs FP32 operations with a loss of precision less than a threshold value. Alternatively, the multiplier circuitry of FIG. 7 generates FP64 (or FP128) results from FP16 operands by truncating multiplication results to the appropriate number of bits.
  • FIG. 8 is a graph 800 of dot product precisions as a function of sample size for dot product multiplications performed using different formats.
  • a plot 802 shows a dot product precision as a function of sample size for the dot product operation performed using FP32 based multiplications.
  • a plot 804 shows a dot product precision as a function of sample size for the dot product operation performed using FP32 sorted multiplications.
  • a plot 806 shows a dot product precision as a function of sample size for the dot product operation performed by the multiplier circuitry of FIG. 7 with input operands in FP16 format, the accumulation of partial products performed in the TP format (e.g., the accumulator circuit 719), and the final multiplication result being output in FP32 format. It can be observed from the plot 806 that the precision of dot product operations that utilize the TP format is superior to that of the traditional FP32 multiplications and FP32 sorted multiplications. Also, the precision of dot product operations based on the TP format is virtually unchanged as the number of accumulation operations increases (i.e., as the sample size increases).
  • the TP based calculations provide improved latency and throughput, while providing the most accurate floating point results.
  • CPU or GPU based systems would have to accumulate to, e.g., FP128 precision format.
  • the presented TP based multiply-and-accumulate (MAC) operations running on the TSP core 100 utilize FP16 operands and generate FP32 results, with accuracy that is significantly better than that of a GPU or CPU.
  • the TP format can be also a key enabler for low power calculations when calculations involve utilizing floating point formats. It is known that energy required to compute products of operands in FP16 format is less than energy required to compute products of operands represented in wider formats, e.g., FP32 or FP64 formats. For example, it can be shown that FP32 based calculations consume approximately four times the energy compared to FP16 based calculations. To take energy advantage of mixed-precision applied at the multiplier circuitry of FIG.
  • the input operands are in FP16 formats whereas the dot product is accumulated and then output in FP32 format.
  • 320-element SIMD instructions of the TSP core 100 allow the instruction fetch and decode energy to be amortized across 320 operations.
  • Each MEM slice of MEMs 111, 112 may access approximately 8,000 320-element vectors, keeping SRAM access cost low compared to traditional cache hierarchies.
  • FIG. 9 illustrates a method for integer multiplication with the TP based accumulation according to an embodiment.
  • digital bits corresponding to an operand of a first format and another operand of the first format are stored in one or more storage register circuits.
  • the operand is decomposed into a first plurality of operands, and the other operand is decomposed into a second plurality of operands.
  • a respective first operand of the first plurality of operands is multiplied with a respective second operand of the second plurality of operands using each multiplier circuit of a plurality of multiplier circuits to generate a corresponding partial result of a plurality of partial results.
  • the plurality of partial results are accumulated in an accumulator circuit using a second format to generate a complete result of the second format that is stored in the accumulator circuit.
  • the complete result of the second format is converted into an output result of an output format.
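The steps of FIG. 9 listed above can be sketched for unsigned integers as follows. The 8-bit part width, the two-part split, the 32-bit output width and the function names are all assumptions chosen for illustration; a Python integer plays the role of the wide accumulator in the second format.

```python
def decompose(value: int, parts: int = 2, width: int = 8) -> list:
    """Split an unsigned integer into `parts` chunks of `width` bits, low chunk first."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(parts)]


def multiply_decomposed(a: int, b: int, parts: int = 2, width: int = 8,
                        out_bits: int = 32) -> int:
    """Multiply two unsigned operands via narrow partial products."""
    a_parts = decompose(a, parts, width)
    b_parts = decompose(b, parts, width)

    acc = 0  # wide accumulator (arbitrary precision here)
    for i, x in enumerate(a_parts):
        for j, y in enumerate(b_parts):
            partial = x * y  # work done by one narrow multiplier circuit
            acc += partial << ((i + j) * width)  # align by chunk position

    # Final conversion: keep only the bits of the output format.
    return acc & ((1 << out_bits) - 1)
```

For example, `multiply_decomposed(12345, 54321)` forms four 8-bit by 8-bit partial products and sums them after shifting each one to its weight.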
  • Embodiments of the present disclosure further relate to various methods for conversion of FP numerical representation (e.g., FP32 or BF16) of input operands (e.g., activations and weights) for performing element-wise operations, e.g., element-wise multiplications between an activation matrix and a weight matrix - MATMUL.
  • all exponents of the input operands are sorted by range.
  • all input numbers e.g., matrix elements
  • each exponent can be within one of the following ranges: 2^n - 2 to 1, 2^n×2 - 4 to 2^n - 1, 2^n×3 - 6 to 2^n×2 - 3, 2^n×4 - 8 to 2^n×3 - 5, 2^n×5 - 10 to 2^n×4 - 7, 2^n×6 - 12 to 2^n×5 - 9, 2^n×7 - 14 to 2^n×6 - 11, 2^n×8 - 16 to 2^n×7 - 13, 2^n×9 - 34 to 2^n×8 - 15, where n is the number of bits for representing the exponent.
  • numbers (e.g., matrix elements) from each group are normalized to be within a defined exponent range of the MATMUL while keeping track which range each group was in before the normalization.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with the original range.
  • Fifth, accumulation with previous group result(s) is performed.
  • Sixth, if any groups remain, the third, fourth and fifth steps are repeated.
  • Seventh, once all the groups are completed, final accumulation and conversion to the final format are performed.
  • the first sub-method of the first method utilizes the TP format on intermediate results, and no error is introduced until the final conversion.
  • the first sub-method of the first method requires (roundup(exponent range of inputs / exponent range of matrix))^2 passes in the matrix × matrix size / MATMUL matrix size, plus pre-processing and post-processing cycles, to complete.
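The grouping, normalization, multiplication, re-alignment and accumulation steps of this sub-method can be sketched for a simple dot product. This is only an illustration under assumptions: `SPAN` (the number of exponent values one hypothetical matrix unit covers), the bucket rule and all function names are invented here, and `Fraction` stands in for an error-free intermediate format. Reading the pass-count factor as the squared roundup of the range ratio is likewise an assumption.

```python
import math
from collections import defaultdict
from fractions import Fraction

SPAN = 8  # hypothetical: exponent values covered by one matrix-unit range


def bucket(x: float) -> int:
    """Index of the exponent range that x falls into (0 for zero)."""
    if x == 0:
        return 0
    _, exponent = math.frexp(x)
    return exponent // SPAN


def grouped_dot(a, b) -> float:
    """Dot product computed group by group, one exponent-range pair at a time."""
    groups = defaultdict(list)
    for x, y in zip(a, b):
        groups[(bucket(x), bucket(y))].append((x, y))  # sort by range

    total = Fraction(0)
    for (ga, gb), pairs in sorted(groups.items()):
        group_acc = Fraction(0)
        for x, y in pairs:
            xn = math.ldexp(x, -ga * SPAN)  # normalize into the unit's range
            yn = math.ldexp(y, -gb * SPAN)
            group_acc += Fraction(xn) * Fraction(yn)  # element-wise multiply
        # Re-align with the original range, then accumulate with prior groups.
        total += group_acc * Fraction(2) ** ((ga + gb) * SPAN)
    return float(total)  # final conversion


def pass_factor(input_exp_range: int, matrix_exp_range: int) -> int:
    """Squared roundup of the ratio between input and matrix exponent ranges."""
    return math.ceil(input_exp_range / matrix_exp_range) ** 2
```

Because every scaling is by a power of two, the per-group normalization and re-alignment introduce no rounding; the only rounding in this sketch is the final `float(total)`.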
  • all matrix weights are first pre-processed to belong into the same range.
  • An exponent of a respective matrix weight can be within one of the following ranges: 2^n - 2 to 1, 2^n×2 - 4 to 2^n - 1, 2^n×3 - 6 to 2^n×2 - 3, 2^n×4 - 8 to 2^n×3 - 5, 2^n×5 - 10 to 2^n×4 - 7, 2^n×6 - 12 to 2^n×5 - 9, 2^n×7 - 14 to 2^n×6 - 11, 2^n×8 - 16 to 2^n×7 - 13, 2^n×9 - 34 to 2^n×8 - 15, where n is the number of bits for the exponent.
  • the largest intermediate exponent N is pre-processed, and all values with exponent less than (e-log2(m)-s) are zeroed out, where m is a number of operations to perform, e is a size exponent in the final format, s is a size significand for conversion, and e > N.
  • activations are re-sorted and the zeroed out values are removed.
  • all matrix activations are pre-processed to belong into the same range.
  • each group of activations is normalized to be in the exponent range of the MATMUL, while keeping track which range each group was in before the normalization.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with the original range.
  • accumulation with previous groups result(s) is performed.
  • final accumulation and conversion to the final format is performed.
  • the second sub-method of the first method throws away values up front that would not make a difference in the final conversion.
  • the second sub-method of the first method utilizes the TP format on intermediate results.
  • the second sub-method of the first method has the potential to introduce error in the least significant bit (LSB) region and requires more pre-processing than the first sub-method of the first method.
  • in a second method of a limited range, only the most significant bits (MSBs) of the exponents of the input operands (e.g., activations and weights) are utilized.
  • pre-processing of the input operands is first performed and only m MSBs of an exponent of each input number are used.
  • an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second step.
  • an intermediate result is adjusted to align with an original significand.
  • the first sub-method of the second method introduces a precision error in two ways. First, the precision error is introduced by limiting the exponents. Second, the precision error is non-zero if the number of sub-significands times the significand bits in the matrix is less than the number of input significand bits.
  • the first sub-method of the second method requires a number of passes to complete the matrix that is significantly less than for the first and second sub-methods of the first method.
  • the first sub-method of the second method requires just four to nine passes to complete the matrix depending on the sub-significands (i.e., depending whether the truncation or roundup is performed at the second step).
  • pre-processing of all activations is first performed, including normalization to the highest exponent bit that is “1” (that particular bit and the m-1 MSBs after it are then used).
  • all weights are pre-processed and normalized to a highest exponent bit that is “1” and use that bit plus the m-1 MSBs after that.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with an original significand.
  • Seventh, accumulation with previous group result(s) is performed.
  • Ninth, once all the groups are completed, final accumulation is performed, followed by adjustment to the original range and conversion to the final format.
  • the second sub-method of the second method introduces a potential precision error in two ways.
  • the potential precision error can occur due to limiting the exponents.
  • the potential precision error is introduced.
  • the number of passes required to complete the matrix is significantly less for the second sub-method of the second method than for the first and second sub-methods of the first method.
  • the second sub-method of the second method requires just four to nine passes to complete the matrix depending on the sub-significands (i.e., depending whether the truncation or roundup is performed at the second and fourth steps).
  • the second sub-method of the second method has the potential to be more accurate than the first sub-method of the second method.
  • the first step is to force a format of input exponents to only use the range of the matrix unit.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with an original significand.
  • Fifth, accumulation with previous groups result(s) is performed.
  • the third, fourth and fifth steps are repeated. Seventh, once all groups are completed, final accumulation is performed, followed by adjustment to the original range and conversion to the final format.
  • the third sub-method of the second method forces the input range to match the limited range of the matrix for the exponent. If the roundup is used at the second step, no error is introduced until the final conversion.
  • the third sub-method of the second method matches the throughput of the first sub-method of the second method. However, the third sub-method of the second method does not introduce any precision or range error during the processing of matrix elements.
  • exponents are broken into N m-bit units.
  • pre-processing of all input numbers is first performed by breaking the exponent portion in equal bits (or near equal bits) under the size of the matrix unit exponent size.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with the original range.
  • the first sub-method of the third method utilizes the TP format for intermediate results, and no error is introduced until the final conversion, if the roundup is used at the second step.
  • the first sub-method of the third method requires N equal exponents times N equal exponents times n significands times n significands passes for each matrix to complete.
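Breaking an exponent field into N units of m bits, and the resulting pass count, can be sketched as below. The field widths, the low-unit-first ordering and the function names are assumptions for illustration only; the source does not specify how the units are laid out.

```python
def split_bits(field: int, total_bits: int, unit_bits: int) -> list:
    """Break a bit field into ceil(total_bits / unit_bits) units, low unit first."""
    count = -(-total_bits // unit_bits)  # ceiling division
    mask = (1 << unit_bits) - 1
    return [(field >> (i * unit_bits)) & mask for i in range(count)]


def join_bits(units: list, unit_bits: int) -> int:
    """Inverse of split_bits: reassemble the original field."""
    return sum(u << (i * unit_bits) for i, u in enumerate(units))


def passes(exponent_units: int, significand_units: int) -> int:
    """N x N x n x n passes for N exponent units and n significand units."""
    return exponent_units ** 2 * significand_units ** 2
```

For instance, an 8-bit exponent field split into 4-bit units gives N = 2; if the significand fits in a single unit (n = 1), `passes(2, 1)` evaluates to four.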
  • pre-processing of all input numbers is first performed by breaking each exponent portion in equal bits (or near equal bits) under the size of the matrix unit exponent size.
  • all significands are truncated to match the size of the matrix unit significand.
  • an element-wise operation e.g., multiplication
  • an intermediate result is adjusted to align with the original range.
  • accumulation with previous group result(s) is performed.
  • Sixth, if any groups remain, the third, fourth and fifth steps are repeated.
  • the second sub-method of the third method keeps the complete range of the original input numbers but limits the precision to an internal matrix.
  • the number of passes for each matrix with FP32 format is just four.
  • the number of passes for each matrix with BF16 format is also four, but the second sub-method of the third method provides a better precision for FP32 format than for BF16 format until the final conversion.
  • the accumulation as part of the matrix multiplication can be performed in the extended variable precision TP format.
  • An amount of accumulated precision required for a given matrix multiple accumulation (MATMUL) can be dynamically changed.
  • N x N FP16 MATMUL that is a size of an internal matrix
  • no extension is required in the final accumulation and conversion to obtain a final output format.
  • an intermediate accumulation i.e., accumulation of partial products
  • 128 bits to keep from overflowing the final result during the accumulation.
  • no error is introduced for precision or accuracy until the final conversion to the final format.
  • the minimum final format is FP32 in order to maintain a complete range for the final result without overflow during the accumulation.
  • an intermediate accumulation i.e., accumulation of partial products
  • the intermediate accumulation is extended by a total of 512 bits to keep from overflowing the final result during the accumulation.
  • the total of 512 bits required for extension of the intermediate accumulation is due to, e.g., 32 bits required for the size of the FP32 MATMUL, plus 564 bits for FP32 TP, minus 90 bits for FP16, plus roundup(log2(36)) bits (i.e., 6 bits), as one FP32 matrix operation requires 36 FP16 operations for full range and precision TP, assuming a 256 x 256 base matrix size. Again, no error is introduced for precision or accuracy until the final conversion to the final format. In such case, the minimum final output format is FP64 in order to maintain a complete range for the final result without overflow during the accumulation.
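The bit budget stated above can be checked with a few lines of arithmetic. The function and parameter names are hypothetical; the four input values are the ones given in the text.

```python
import math


def extension_bits(matmul_bits: int, tp_bits: int, fp16_bits: int,
                   operations: int) -> int:
    """Extension width: MATMUL size + TP width - FP16 width + growth bits."""
    growth = math.ceil(math.log2(operations))  # roundup(log2(36)) = 6
    return matmul_bits + tp_bits - fp16_bits + growth


# 32 + 564 - 90 + 6
total = extension_bits(matmul_bits=32, tp_bits=564, fp16_bits=90, operations=36)
```

The growth term accounts for the carries that can accumulate when 36 partial results are summed.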
  • FIG. 10 illustrates a method for floating point conversion during element-wise matrix operations according to an embodiment.
  • An accumulation as part of the element-wise matrix operations can be performed in the extended variable precision TP format.
  • input numbers e.g., elements of activations matrix and weights matrix
  • a next activations matrix is loaded, which becomes a current activations matrix.
  • a next weights matrix is loaded, which becomes a current weights matrix. Note that, in some cases (e.g., when the next activations matrix and the next weights matrix are loaded for the first time) the steps 1002 and 1003 can be performed simultaneously or near simultaneously.
  • step 1002 of loading the next activations matrix is restarted on every next weights matrix.
  • an element-wise operation e.g., element-wise multiplication
  • the method returns to the step 1002 for loading a next activations matrix that becomes the current activations matrix.
  • the method returns to the step 1003 for loading a next weights matrix that becomes the current weights matrix (which also initiates restarting load of a next activations matrix at the step 1002).
  • accumulation is performed in, e.g., the aforementioned extended variable precision TP format. If the current weights matrix is not the last weights matrix, loading of a next weights matrix is performed (at step 1003) and accumulation is applied on intermediate operation results obtained at the step 1004 where the newly loaded weights matrix is used for the element-wise operation. After all weights matrices are loaded and the accumulation at the step 1005 is finished, final (multicycle) summation and conversion to the final format is performed at 1006.
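The control flow of FIG. 10 can be sketched as two nested loops. This is a simplified reading of the steps, not the disclosed implementation: matrices are plain nested lists of equal shape, `Fraction` stands in for the extended variable precision accumulator, FP32 is assumed as the final format, and the function names are hypothetical.

```python
import struct
from fractions import Fraction


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def elementwise_accumulate(activations, weights):
    """Accumulate element-wise products over all (weights, activations) pairs."""
    rows, cols = len(weights[0]), len(weights[0][0])
    acc = [[Fraction(0)] * cols for _ in range(rows)]

    for w in weights:  # step 1003: load the next weights matrix
        for a in activations:  # step 1002: restarted for every weights matrix
            for r in range(rows):
                for c in range(cols):
                    product = Fraction(a[r][c]) * Fraction(w[r][c])  # step 1004
                    acc[r][c] += product  # step 1005: wide accumulation

    # Step 1006: final summation result converted to the output format.
    return [[to_fp32(float(v)) for v in row] for row in acc]
```

Rounding happens only in the last line, which mirrors the statement that no error is introduced until the final conversion.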
  • FIG. 11 is a block diagram illustrating components of an example computing machine that is capable of reading instructions from a computer-readable medium and executing them in a processor (or controller) according to an embodiment.
  • a computer described herein may include a single computing machine shown in FIG. 11, a virtual machine, a distributed computing system that includes multiple nodes of computing machines shown in FIG. 11, or any other suitable arrangement of computing devices.
  • the computer described herein may be used by any of the elements described in the previous figures to execute the described functions.
  • FIG. 11 depicts a diagrammatic representation of a computing machine in the example form of a computer system 1100 within which instructions 1124 (e.g., software, program code, or machine code), which may be stored in a computer-readable medium, may be executed to cause the machine to perform any one or more of the processes discussed herein.
  • the computing machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • a computing machine may be a tensor streaming processor designed and manufactured by Groq, Inc. of Mountain View, California, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, an internet of things (IoT) device, a switch or bridge, or any machine capable of executing instructions 1124 that specify actions to be taken by that machine.
  • machine shall also be taken to include any collection of machines that individually or jointly execute instructions 1124 to perform any one or more of the methodologies discussed herein
  • the example computer system 1100 includes one or more processors (generally, a processor 1102) (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these), a main memory 1104, and a static memory 1106, which are configured to communicate with each other via a bus 1108.
  • the computer system 1100 may further include graphics display unit 1110 (e.g., a plasma display panel (PDP), a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)).
  • the computer system 1100 may also include alphanumeric input device 1112 (e.g., a keyboard), a cursor control device 1114 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 1116, a signal generation device 1118 (e.g., a speaker), and a network interface device 1120, which also are configured to communicate via the bus 1108.
  • the storage unit 1116 includes a computer-readable medium 1122 on which the instructions 1124 are stored embodying any one or more of the methodologies or functions described herein.
  • the instructions 1124 may also reside, completely or at least partially, within the main memory 1104 or within the processor 1102 (e.g., within a processor’s cache memory). Thus, during execution thereof by the computer system 1100, the main memory 1104 and the processor 1102 may also constitute computer-readable media.
  • the instructions 1124 may be transmitted or received over a network 1126 via the network interface device 1120.
  • the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., the instructions 1124).
  • the computer-readable medium 1122 may include any medium that is capable of storing instructions (e.g., the instructions 1124) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein.
  • the computer-readable medium 1122 may include, but not be limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.
  • the computer-readable medium 1122 does not include a transitory medium such as a signal or a carrier wave.
  • a hardware description language is a specialized computer language used to describe the structure and behavior of electronic circuits, including digital logic circuits.
  • a hardware description language results in an accurate and formal description of an electronic circuit that allows for the automated analysis and simulation of an electronic circuit.
  • An HDL description may be synthesized into a netlist (e.g., a specification of physical electronic components and how they are connected together), which can then be placed and routed to produce the set of masks used to create an integrated circuit including the elements and functions described herein.
  • a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
  • Embodiments of the disclosure may also relate to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus.
  • any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
  • Some embodiments of the present disclosure may further relate to a system comprising a processor (e.g., a tensor streaming processor or an artificial intelligence processor), at least one computer processor (e.g., a host server), and a non-transitory computer-readable storage medium.
  • the storage medium can store computer executable instructions, which when executed by the compiler operating on the at least one computer processor, cause the at least one computer processor to be operable for performing the operations and techniques described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Nonlinear Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

According to one embodiment, the invention relates to multiplier circuitry that multiplies operands of a first format. One or more storage register circuits store digital bits corresponding to an operand and another operand of the first format. A decomposing circuit decomposes the operand into a first plurality of operands, and the other operand into a second plurality of operands. Each multiplier circuit multiplies a respective first operand of the first plurality of operands with a respective second operand of the second plurality of operands to generate a corresponding partial result of a plurality of partial results. An accumulator circuit accumulates the plurality of partial results using a second format to generate a complete result of the second format that is stored in the accumulator circuit. A conversion circuit truncates the complete result of the second format and converts the truncated result into an output result of an output format.
EP21918010.6A 2021-01-07 2021-06-28 Numerical precision in digital multiplier circuitry Pending EP4275113A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163134941P 2021-01-07 2021-01-07
PCT/US2021/039440 WO2022150058A1 (fr) 2021-01-07 2021-06-28 Numerical precision in digital multiplier circuitry

Publications (1)

Publication Number Publication Date
EP4275113A1 true EP4275113A1 (fr) 2023-11-15

Family

ID=82358291

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21918010.6A Pending EP4275113A1 (fr) 2021-01-07 2021-06-28 Numerical precision in digital multiplier circuitry

Country Status (3)

Country Link
EP (1) EP4275113A1 (fr)
KR (1) KR20230121151A (fr)
WO (1) WO2022150058A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI831588B (zh) * 2023-01-30 2024-02-01 創鑫智慧股份有限公司 Neural network computing device and numerical conversion method in neural network computation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6205462B1 (en) * 1999-10-06 2001-03-20 Cradle Technologies Digital multiply-accumulate circuit that can operate on both integer and floating point numbers simultaneously
US8577948B2 (en) * 2010-09-20 2013-11-05 Intel Corporation Split path multiply accumulate unit
CN106126189B (zh) * 2014-07-02 2019-02-15 上海兆芯集成电路有限公司 Method in a microprocessor
EP3586226B1 (fr) * 2017-02-23 2023-07-05 ARM Limited Multiply-accumulation in a data processing apparatus
US10776078B1 (en) * 2018-09-23 2020-09-15 Groq, Inc. Multimodal multiplier systems and methods

Also Published As

Publication number Publication date
KR20230121151A (ko) 2023-08-17
WO2022150058A1 (fr) 2022-07-14

Similar Documents

Publication Publication Date Title
US11494186B2 (en) FPGA specialist processing block for machine learning
US11042360B1 (en) Multiplier circuitry for multiplying operands of multiple data types
US20040015533A1 (en) Multiplier array processing system with enhanced utilization at lower precision
US5280439A (en) Apparatus for determining booth recoder input control signals
US9372665B2 (en) Method and apparatus for multiplying binary operands
US11809798B2 (en) Implementing large multipliers in tensor arrays
US10853037B1 (en) Digital circuit with compressed carry
US20080243976A1 (en) Multiply and multiply and accumulate unit
US20210326111A1 (en) FPGA Processing Block for Machine Learning or Digital Signal Processing Operations
Chen et al. A matrix-multiply unit for posits in reconfigurable logic leveraging (open) CAPI
WO2022150058A1 (fr) Numerical precision in digital multiplier circuitry
US20220283777A1 (en) Signed multiword multiplier
US20220075598A1 (en) Systems and Methods for Numerical Precision in Digital Multiplier Circuitry
CN115878074A (zh) Systems and methods for sparsity operations in a specialized processing block
US10831445B1 (en) Multimodal digital multiplication circuits and methods
US20240176619A1 (en) FPGA Specialist Processing Block for Machine Learning
Gustafsson et al. Bit-level pipelinable general and fixed coefficient digit-serial/parallel multipliers based on shift-accumulation
Edukondalu et al. HIGH SPEED 32 BIT VEDIC MULTPLIER USING VERILOG

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20230630

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)