CN116610285A - Method and system for calculating dot product - Google Patents

Method and system for calculating dot product

Info

Publication number
CN116610285A
Authority
CN
China
Prior art keywords
bits
floating point
numbers
product
mantissa
Prior art date
Legal status
Pending
Application number
CN202310144578.3A
Other languages
Chinese (zh)
Inventor
T·费列雷
Current Assignee
Imagination Technologies Ltd
Original Assignee
Imagination Technologies Ltd
Priority date
Filing date
Publication date
Application filed by Imagination Technologies Ltd filed Critical Imagination Technologies Ltd
Publication of CN116610285A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487 Multiplying; Dividing
    • G06F7/4876 Multiplying
    • G06F7/499 Denomination or exception handling, e.g. rounding or overflow
    • G06F7/544 Methods or arrangements using non-contact-making devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G06F7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Nonlinear Science (AREA)
  • Complex Calculations (AREA)
  • Image Generation (AREA)

Abstract

The application relates to a method and a system for calculating dot products. A method of performing a dot product of an array of '2k' floating point numbers is disclosed, the array comprising two sets of k floating point numbers a_i and b_i. The method includes receiving the two sets of 'k' floating point numbers, and multiplying each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits. The method further includes creating a set of 'k' numbers (y_i) based on the k product numbers (z_i), each number (y_i) having a bit length of 'n' bits. In addition, the method includes identifying the maximum exponent sum (e_max) among the k exponent sums (eab_i) of each pair of floating point numbers a_i and b_i, aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max), and simultaneously adding the set of 'k' numbers to obtain the dot product.

Description

Method and system for calculating dot product
Technical Field
The application relates to a method and a system for calculating dot products.
Background
Most computing systems use number formats in which various computations are performed in binary, or radix 2. These number formats include fixed point formats and floating point formats. A fixed point number format can provide additional precision, but can only represent a limited range of values. Thus, most modern computing systems use floating point number formats, which provide a trade-off between range and precision.
A floating point number comprises a mantissa (m) having a bit length of 'b' bits, an exponent (e) having a bit length of 'a' bits, and an optional sign bit (s), and represents a binary number. In some widely used formats, the exponent is biased (i.e., offset) by a value (c) so that numbers less than 1 can be represented, and the extreme values of the exponent are reserved to encode special values. For non-extreme values of e, the floating point number x is said to be normalized, and the number x is represented as (-1)^s · 2^(e-c) · (1 + 2^(-b)·m). Thus, floating point numbers can be used to represent very small or very large numbers in binary (or some other radix) in a form analogous to scientific notation. The use of floating point numbers in arithmetic computations provides varying degrees of precision depending on the bit length, or type, of floating point format used.
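By way of illustration only (this sketch is not part of the disclosed hardware, and the field widths and helper name are assumptions chosen for the example), the following Python fragment decodes a normalized floating point value from its sign, exponent and mantissa fields according to the expression above.

    def decode_normalized(s, e, m, a=8, b=23):
        # Decode a normalized value (-1)^s * 2^(e-c) * (1 + 2^-b * m) for a format
        # with 'a' exponent bits, 'b' mantissa bits and bias c = 2^(a-1) - 1.
        c = 2 ** (a - 1) - 1
        assert 0 < e < 2 ** a - 1, "extreme exponents encode zeros/subnormals/infinities/NaNs"
        return (-1) ** s * 2.0 ** (e - c) * (1 + m * 2.0 ** (-b))

    # Example: single precision 1.5 has s = 0, e = 127 and m = 0b1 followed by 22 zeros.
    print(decode_normalized(0, 127, 1 << 22))   # prints 1.5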
Computations of convolutions/dot products over large arrays of real values are commonly found in the solution of various numerical problems. The dot product of two arrays (a_0, a_1, a_2 ... a_{k-1}) and (b_0, b_1, b_2 ... b_{k-1}) is defined as:
y = sum_{i=0}^{k-1} a_i · b_i
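Purely as an illustrative reference (not part of the disclosed hardware), a straightforward software evaluation of this definition in Python is:

    def dot(a, b):
        # Reference dot product of two equal-length arrays, accumulated
        # in Python floats (IEEE double precision).
        assert len(a) == len(b)
        return sum(x * y for x, y in zip(a, b))

    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # prints 32.0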
it is therefore advantageous to have hardware dedicated to performing dot-product in high performance computing systems, graphics processing systems, neural network accelerators, and the like. Conventionally, there are different methods to achieve this, with different advantages and disadvantages.
One known method of performing the dot product of two floating point arrays/two sets of floating point numbers in a computing system is by using separate floating point multiplication and floating point addition. A dot product unit 100 using this principle is shown in FIG. 1. The dot product unit 100 includes a set of floating point multiplication units 102a, 102b, 102c, and 102d, and a set of floating point adder units 104a, 104b, and 104c. The dot product unit 100 is implemented as a tree of floating point multiplication units and addition units. The dot product unit 100 receives a first set of floating point numbers (a_1, a_2, a_3 and a_4) and a second set of floating point numbers (b_1, b_2, b_3 and b_4) as inputs. Each number a_i in the first set of floating point numbers includes a mantissa ma_i and an exponent ea_i. Similarly, each number b_i in the second set of floating point numbers includes a mantissa mb_i and an exponent eb_i. Each floating point number a_i in the first set is provided as a first input to a respective one of the floating point multiplication units 102a, 102b, 102c, and 102d. Each floating point number b_i in the second set is provided as a second input to a respective one of the floating point multiplication units 102a, 102b, 102c, and 102d. Each floating point multiplication unit 102a, 102b, 102c, and 102d performs a floating point multiplication of a_i and b_i to obtain a product c_i. Once the product c_i from each multiplication unit 102a, 102b, 102c, and 102d is obtained, the results (floating point numbers) are accumulated by a series of adders, in any dynamically or statically selected order, to obtain the output. Thus, the adders may be arranged in any order. In one example, as depicted in FIG. 1, the products c_1 and c_2 from two consecutive floating point multiplication units 102a and 102b are provided to a first floating point adder unit 104a, which adds the products (c_1 and c_2). Similarly, the products c_3 and c_4 from the next two consecutive floating point multiplication units 102c and 102d are provided to a second floating point adder unit 104b, which adds the products (c_3 and c_4). The accumulated values are then provided as inputs to a third floating point adder unit 104c, which accumulates them to obtain the output y. The output from the multiplier or adder at each step is rounded, which introduces a rounding error into the output generated at each step. Rounding errors are a characteristic of floating point computation.
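The following Python sketch models the behaviour of a dot product unit of the kind shown in FIG. 1: float32 multiplications followed by a balanced tree of float32 additions, with a rounding after every operation. It is an illustrative model only; the function name and the use of NumPy float32 arithmetic are assumptions made for the example.

    import numpy as np

    def dot_separate_mul_add(a, b):
        # Separate multiplications, each rounded to float32 (units 102a-102d).
        c = [np.float32(np.float32(x) * np.float32(y)) for x, y in zip(a, b)]
        # Balanced tree of float32 adders (units 104a-104c), one level per iteration;
        # every partial sum is rounded to float32 before the next level.
        while len(c) > 1:
            if len(c) % 2:
                c.append(np.float32(0.0))   # pad an odd level with zero
            c = [np.float32(c[i] + c[i + 1]) for i in range(0, len(c), 2)]
        return c[0]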
Another known method of performing the dot product of two floating point arrays/two sets of floating point numbers in a computing system is through the use of fused multiply and add operations. A dot product unit 200 using this principle is shown in FIG. 2. The dot product unit 200 includes a set of fused multiply and add (FMA) units 202a, 202b, 202c, and 202d. An FMA unit performs a floating point multiplication and addition in a single step with a single rounding. Thus, FMAs improve the speed and accuracy of dot product calculations involving multiply-accumulate. In FIG. 2, the result of one fused multiply and add unit is provided as an input to another fused multiply and add unit, such that the product of two numbers is added to the product of the next two numbers. The dot product unit 200 receives two sets of floating point numbers a_i and b_i as inputs. Each floating point number a_i in the first set is provided as a first input to a respective one of the FMA units 202a, 202b, 202c and 202d. Each floating point number b_i in the second set is provided as a second input to a respective one of the FMA units 202a, 202b, 202c and 202d. Each FMA 202a, 202b, and 202c calculates the product of the numbers a_i and b_i and adds the product to the result of the previous FMA with a single rounding (note that in the case of FMA 202d there is no 'previous' FMA, because FMA 202d is at the top of the chain, and thus the addition simply adds the result of the multiplication to zero). For example, as depicted in FIG. 2, FMA 202d receives the numbers a_4 and b_4 as multiplicand inputs. In addition, a 0 input is provided as a third input. FMA 202d multiplies a_4 by b_4 and adds the result to 0 to obtain the output d_4. Further, FMA 202c receives the numbers a_3 and b_3 as multiplicand inputs, and receives d_4 as a third input. FMA 202c multiplies a_3 by b_3 and adds the result to d_4 to obtain the output d_3. Similarly, FMA 202b multiplies a_2 by b_2 and adds the result to d_3 to obtain the output d_2. Further, FMA 202a multiplies a_1 by b_1 and adds the result to d_2 to obtain the final output y = d_1.
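The chained accumulation of FIG. 2 can be modelled in Python as below. The sketch uses math.fma (available from Python 3.13), which applies a single rounding per multiply-add in double precision; it is an illustrative model of the structure only, not of the exact formats used by an FMA-based dot product unit.

    import math   # math.fma requires Python >= 3.13

    def dot_fma_chain(a, b):
        # Chain of fused multiply-add operations (units 202d down to 202a):
        # the first FMA adds its product to zero, and each later FMA adds its
        # product to the previous result with a single rounding.
        acc = 0.0
        for x, y in zip(reversed(a), reversed(b)):
            acc = math.fma(x, y, acc)
        return acc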
Thus, a pair of floating point numbers a_i and b_i from the two sets of floating point numbers is multiplied together and added to the previously calculated output to generate a new, accumulated output. In other words, the sum is performed as a sequence of multiplications and additions of the numbers. The final output (y), generated after all the floating point numbers in the arrays have been multiplied and added, is provided as the output.
Either of the above methods may be applied by iterating over the same unit, or by using parallel or sequential combinations of units. Neither dot product unit 100 nor dot product unit 200 guarantees the accuracy of the output, since different input orders may produce different results due to the intermediate rounding operations. Furthermore, a high latency is introduced due to the number of logic gates in the critical path.
Accordingly, there are drawbacks in existing methods and architectures for handling floating point numbers.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A method of performing a dot product of an array of '2k' floating point numbers is disclosed, the array comprising two sets of k floating point numbers a_i and b_i. The method includes receiving the two sets of 'k' floating point numbers, and multiplying each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits. The method further includes creating a set of 'k' numbers (y_i) based on the k product numbers (z_i), each number (y_i) having a bit length of 'n' bits. In addition, the method includes identifying the maximum exponent sum (e_max) among the k exponent sums (eab_i) of each pair of floating point numbers a_i and b_i, aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max), and simultaneously adding the set of 'k' numbers to obtain the dot product.
According to a first aspect, there is provided a method of performing, using a hardware implementation, a dot product of an array of '2k' floating point numbers, k ≥ 3, the array comprising a first set of k floating point numbers a_0, a_1, ..., a_{k-1} and a second set of k floating point numbers b_0, b_1, ..., b_{k-1}, wherein the method comprises: receiving the two sets of 'k' floating point numbers; multiplying each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits; creating, based on the k product numbers (z_i), a set of 'k' numbers (y_i), the numbers (y_i) having a bit length of 'n' bits obtained by adding at least an additional most significant bit to the bit length of the product numbers (z_i), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is r + log2(k-1) + 1 + ⌈log2(k)⌉ + x bits, wherein x is an integer and x ≥ 1; identifying the maximum exponent sum (e_max) among k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i; aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max); and simultaneously adding the set of 'k' numbers.
Optionally, each floating point number of the first set of k floating point numbers a_0, a_1, ..., a_{k-1} comprises a mantissa (ma_i) and an exponent (ea_i), and each floating point number of the second set of k floating point numbers b_0, b_1, ..., b_{k-1} comprises a mantissa (mb_i) and an exponent (eb_i), wherein each mantissa (ma_i) has a bit length of 'p' bits and each mantissa (mb_i) has a bit length of 'q' bits.
Optionally, multiplying each floating point number a_i by the corresponding floating point number b_i comprises multiplying the mantissa (ma_i) by the mantissa (mb_i) to obtain an intermediate mantissa product (mab_i).
Optionally, by setting the value of 'r' to 'r = P + 1 - log2(k-1)' bits, the method of performing the dot product emulates the precision obtained when performing the dot product using separate multiplication and addition units with an output mantissa bit length of P bits.
Optionally, generating the k product numbers (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits comprises: if p + q + 2 > r + log2(k-1) + 1 bits, rounding the bits of the intermediate mantissa product (mab_i) to r + log2(k-1) + 1 bits; or, if p + q + 2 < r + log2(k-1) + 1 bits, padding additional least significant bits into the intermediate mantissa product (mab_i) to produce r + log2(k-1) + 1 bits.
Optionally, identifying the maximum exponent sum (e_max) includes identifying the maximum value among the k exponent sums (eab_i), where the k exponent sums (eab_i) are obtained by summing the exponents (ea_i) and the exponents (eb_i).
Optionally, adding at least an additional most significant bit to the bit length of the product number (z_i) comprises adding at least ⌈log2(k)⌉ of said most significant bits.
Optionally, adding at least an additional most significant bit to the bit length of the product number (z_i) further comprises adding one or more least significant bits to the bit length of the product number (z_i).
Optionally, the method further comprises: calculating an output value by processing the set of 'k' numbers (y_i); renormalizing the output value; and rounding the output value to represent the output value as a floating point number.
Optionally, aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max) comprises, for each floating point number (i): calculating the difference (e_d) between the maximum exponent sum (e_max) and each exponent sum (eab_i); and shifting the magnitude bits of the corresponding number (y_i) towards the LSB side based on the calculated difference (e_d).
Optionally, in addition to shifting the magnitude bits of the numbers, the method further comprises performing rounding or truncation on bits of the numbers that are shifted beyond the bit length of the numbers.
Optionally, the method further comprises, if the sets of 'k' floating point numbers include signed floating point numbers, determining the two's complement of the magnitude bits of each number based on its sign bit (s_i).
According to a second aspect, there is provided a hardware implementation for performing a dot product of an array of '2k' floating point numbers, k ≥ 3, the array comprising a first set of k floating point numbers a_0, a_1, ..., a_{k-1} and a second set of k floating point numbers b_0, b_1, ..., b_{k-1}, wherein the hardware implementation comprises: a multiplication unit, a format conversion unit, a maximum exponent detection unit, an alignment unit and a processing unit. The multiplication unit includes a plurality of multipliers configured to: receive the two sets of 'k' floating point numbers, and multiply each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits. The format conversion unit is configured to create, based on the k product numbers (z_i), a set of 'k' numbers (y_i), the numbers (y_i) having a bit length of 'n' bits obtained by adding at least an additional most significant bit to the bit length of the product numbers (z_i), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is r + log2(k-1) + 1 + ⌈log2(k)⌉ + x bits, where x is an integer and x ≥ 2. The maximum exponent detection unit is configured to identify the maximum exponent sum (e_max) among k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i. The alignment unit is configured to align the magnitude bits of the numbers based on the maximum exponent sum (e_max). The processing unit is configured to simultaneously add the set of 'k' numbers to generate an output value.
Optionally, the hardware implementation further comprises a renormalization unit configured to: renormalize the output value; and round the output value to represent the output value as a floating point number.
Optionally, each floating point number of the first set of k floating point numbers a_0, a_1, ..., a_{k-1} comprises a mantissa (ma_i) and an exponent (ea_i), and each floating point number of the second set of k floating point numbers b_0, b_1, ..., b_{k-1} comprises a mantissa (mb_i) and an exponent (eb_i), wherein each mantissa (ma_i) has a bit length of 'p' bits and each mantissa (mb_i) has a bit length of 'q' bits.
Optionally, the multiplication unit comprises a plurality of multiplier units configured to multiply each mantissa (ma_i) simultaneously by the corresponding mantissa (mb_i) to obtain an intermediate mantissa product (mab_i).
Optionally, by setting the value of 'r' to 'r = P + 1 - log2(k-1)' bits, the hardware implementation for performing the dot product operation emulates the precision obtained when performing the dot product using separate multiplication and addition units with an output mantissa bit length of P bits.
Optionally, the multiplication unit is configured to generate the k product numbers (z_i) having a mantissa bit length of 'r + log2(k-1) + 1' bits by: if p + q + 2 > r + log2(k-1) + 1 bits, rounding the bits of the intermediate mantissa product (mab_i) to r + log2(k-1) + 1 bits; or, if p + q + 2 < r + log2(k-1) + 1 bits, padding additional least significant bits into the intermediate mantissa product (mab_i) to produce r + log2(k-1) + 1 bits.
Optionally, the maximum exponent detection unit is configured to identify the maximum exponent sum (e_max) among the k exponent sums (eab_i), wherein the k exponent sums (eab_i) are obtained by summing the exponents (ea_i) and the exponents (eb_i).
Optionally, the alignment unit is configured to align the magnitude bits of the numbers based on the maximum exponent sum (e_max), wherein the alignment unit comprises: a plurality of subtraction units, wherein each subtraction unit is configured to calculate the difference (e_d) between the maximum exponent sum (e_max) and an exponent sum (eab_i); and a plurality of shifter units, each shifter unit configured to shift the magnitude bits of the corresponding number towards the LSB side based on the difference (e_d).
Optionally, the alignment unit is further configured to truncate bits of the numbers that are shifted beyond the bit length of the numbers.
Optionally, the alignment unit further comprises a plurality of complement units configured to, if the sets of 'k' floating point numbers include signed floating point numbers, determine the two's complement of the magnitude bits of each number based on its sign bit (s_i).
According to a third aspect, there is provided a method of performing, using a hardware implementation, a dot product of an array of '2k' floating point numbers, k ≥ 3, the array comprising a first set of k floating point numbers a_0, a_1, ..., a_{k-1} and a second set of k floating point numbers b_0, b_1, ..., b_{k-1}, wherein the method comprises: receiving the two sets of 'k' floating point numbers; multiplying each floating point number a_i by the floating point number b_i, each multiplication generating a first intermediate product number (z_i') and a second intermediate product number (z_i''), so as to generate 2k product numbers comprising k first intermediate product numbers (z_i') and k second intermediate product numbers (z_i''), each having a bit length of 'r + log2(k-1) + 2' bits; creating, based on the 2k product numbers, a set of '2k' numbers comprising k first numbers (y_i') and k second numbers (y_i''), the '2k' numbers each having a bit length of 'n' bits obtained by adding an additional most significant bit to the bit length of the product numbers (z_i' and z_i''), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is r + log2(k-1) + 2 + ⌈log2(2k)⌉ + x bits, wherein x is an integer and x ≥ 1; identifying the maximum exponent sum (e_max) among k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i; aligning the magnitude bits of the numbers (y_i' and y_i'') based on the maximum exponent sum (e_max); and simultaneously adding the set of '2k' numbers.
According to a fourth aspect, there is provided a hardware implementation for performing a dot product of an array of '2k' floating point numbers, k ≥ 3, the array comprising a first set of k floating point numbers a_0, a_1, ..., a_{k-1} and a second set of k floating point numbers b_0, b_1, ..., b_{k-1}, wherein the hardware implementation comprises: a multiplication unit, a format conversion unit, a maximum exponent detection unit, an alignment unit and a processing unit. The multiplication unit includes a plurality of multipliers configured to: receive the two sets of 'k' floating point numbers; and multiply each floating point number a_i by the floating point number b_i, each multiplication generating a first intermediate product number (z_i') and a second intermediate product number (z_i''), so as to generate 2k product numbers comprising k first intermediate product numbers (z_i') and k second intermediate product numbers (z_i''), each having a bit length of 'r + log2(k-1) + 2' bits. The format conversion unit is configured to create, based on the 2k product numbers, a set of '2k' numbers comprising k first numbers (y_i') and k second numbers (y_i''), each having a bit length of 'n' bits obtained by adding an additional most significant bit to the bit length of the product numbers (z_i' and z_i''), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is r + log2(k-1) + 2 + ⌈log2(2k)⌉ + x bits, where x is an integer and x ≥ 1. The maximum exponent detection unit is configured to identify the maximum exponent sum (e_max) among k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i. The alignment unit is configured to align the magnitude bits of the numbers (y_i' and y_i'') based on the maximum exponent sum (e_max). The processing unit is configured to simultaneously add the set of '2k' numbers to generate an output value.
The hardware implementation for performing dot-product according to the first aspect described above may be embodied in hardware on an integrated circuit. A method of manufacturing a hardware implementation for performing dot-product in an integrated circuit manufacturing system may be provided. An integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a hardware implementation for performing dot-product. A non-transitory computer-readable storage medium may be provided having stored thereon a computer-readable description of a hardware implementation for performing dot-product, which when processed in an integrated circuit manufacturing system causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the hardware implementation for performing dot-product.
An integrated circuit manufacturing system may be provided, comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a hardware implementation for performing dot product according to the first aspect described above; a layout processing system configured to process the computer-readable description to generate a circuit layout description embodying the integrated circuit for performing the hardware implementation of the dot product; and an integrated circuit generation system configured to fabricate a hardware implementation for performing dot-product from the circuit layout description.
Computer program code may be provided for performing any of the methods described herein. A non-transitory computer readable storage medium may be provided having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.
As will be apparent to those skilled in the art, the above features may be suitably combined and combined with any of the aspects of the examples described herein.
Drawings
Examples will now be described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram illustrating a conventional floating point dot product calculator with separate multiply and add units;
FIG. 2 is a schematic block diagram illustrating another conventional floating point dot product calculator with a fused multiply and add unit;
FIG. 3 is a block diagram illustrating an example of a hardware implementation for performing dot-product operations;
FIG. 4a is a block diagram illustrating mantissas of floating point numbers represented in an incoming format;
FIG. 4b is a block diagram illustrating the multiplication product number represented in a first intermediate format;
FIG. 4c is a block diagram illustrating a signed number represented in a second intermediate format;
FIG. 5 is a block diagram showing the different units in a hardware implementation for performing the dot product explained in FIG. 3;
FIGS. 6a to 6d illustrate examples of floating point numbers converted from a first format representation to a second format representation;
FIG. 7 is a flow chart illustrating a method of performing dot-product of two sets of k floating-point numbers;
FIG. 8 is a diagram showing a comparison of a particular implementation of architecture 300 with other standard architectures for processing a set of floating point numbers;
FIG. 9 illustrates a computer system in which a dot product calculator is implemented;
FIG. 10 illustrates an integrated circuit manufacturing system for generating an integrated circuit embodying a dot product calculator; and
Fig. 11 illustrates an architecture 300 that implements carry save multiplication.
The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.
Detailed Description
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
As explained above, conventional hardware for performing the dot product of two sets of numbers includes hardware implementing separate multiplication and addition units, or hardware implementing fused multiply and add units. The multiplicands a_i and b_i from the two sets of floating point numbers may be represented in an incoming format F that includes a mantissa and an exponent. When represented in the incoming format F, the mantissas ma_i and mb_i each have a bit length of p bits. The output of the multiplication unit may have a floating point number format F'. The format F' need not be the same as the format F and may have a larger mantissa width; for example, the mantissa of the product (c_i) can be twice the mantissa bit length of the multiplicands a_i and b_i. In some cases, the multiplicands a_i and b_i may be represented in different formats, such that the mantissas of a_i and b_i have different bit lengths. In such cases, the multiplication output may have a floating point format F' in which the mantissa of the product has a bit length equal to the sum of the mantissa bit length of a_i and the mantissa bit length of b_i.
Regardless of whether the input sets have the same mantissa length, when the mantissa of the product c_i is at least as long as the sum of the mantissa bit lengths of the multiplicands a_i and b_i, the dot product unit 100 using separate multiplication and addition as described in FIG. 1 has the same precision as the dot product unit 200 using fused multiply and add units, assuming no overflow or underflow occurs, since no rounding is required. However, if the second format F' is not wide enough to hold the exact multiplication output, it is more accurate to perform the dot product using the dot product unit 200 implementing fused multiply and add units.
Different ordering of the input pairs as multiplicands may produce different results, whether using fused multiplications and additions or performing multiplications and additions separately. This is because of the effects of certain phenomena, such as catastrophic cancellation that occurs when accumulating values in floating point numbers.
Some arrangements of floating point adders that provide faster computation include arrangements for performing parallel summation. These arrangements may also be used to reduce the latency of the network. For example, FIG. 1 shows a particular implementation of a network of floating point adders that takes the outputs of the multiplication units 102a, 102b, 102c, and 102d. The network of floating point adders shown in FIG. 1 is a balanced tree (or balanced tree adder), which is used to perform parallel summation and is intended to reduce latency. With this configuration, ⌈log2(n)⌉ stages can be used to implement the addition of 'n' floating point numbers (i.e., the multiplication outputs c_i in FIG. 1).
Furthermore, in a more general example, the tree adder need not have a balanced structure. A tree adder may use a single adder to add floating point numbers at each stage. For example, any two floating point numbers are added in a first stage to generate a first sum value. The first sum is then added to another floating point number in a second stage, using a second adder, to generate a second sum, and so on. The latency of the arrangement in this example is increased compared to a balanced tree adder.
Consider that the products L, -L, M and N are provided as inputs to the adder units, either as the outputs of the multiplication units when using the dot product unit 100 of FIG. 1 with separate multiplication and addition units, or as the intermediate multiplication values when using the dot product unit 200 implementing fused multiply and add units. In all of these examples, when the addition is performed, the accumulated output values are rounded or truncated at each stage in order to fit the output values into their finite representations. Multiple roundings may result in catastrophic cancellation. Different outputs are generated for the different orders in which the inputs may be provided to the tree adder. Catastrophic cancellation can occur when a very small number is added to a very large number, and may result in the loss of significant bits of the result due to rounding. For example, when summing a large positive number L, the corresponding negative number -L, and two smaller positive numbers M and N, the exact value of the sum is (M+N). A floating point addition arrangement that sums L and -L in the first adder and M and N in the second adder obtains the correct final result (M+N). However, if the order of the inputs is different and the adders instead compute the sums (L+M) and (-L+N), and L is much greater than M and N, the outputs of these adders may be rounded to L and -L, resulting in a total output of 0. A similar effect can be observed when a network of fused multiply and add units is used.
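A small numeric illustration of this order dependence, using float32 arithmetic in Python (the values chosen are arbitrary examples):

    import numpy as np

    L, M, N = np.float32(2.0 ** 20), np.float32(0.03125), np.float32(0.015625)
    # Pairing (L, -L) and (M, N) keeps the small addends: 0 + 0.046875.
    good = np.float32(np.float32(L + (-L)) + np.float32(M + N))
    # Pairing (L, M) and (-L, N) loses them: M and N fall below half an ulp of L,
    # so both partial sums round back to L and -L, and the total collapses to 0.
    bad = np.float32(np.float32(L + M) + np.float32(-L + N))
    print(good, bad)   # 0.046875 0.0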
As discussed above, existing methods of processing floating point numbers, such as performing the dot product using separate multiplication and addition units as described in FIG. 1, or performing the dot product using fused multiply and add units as described in FIG. 2, generate output values of different precision depending on the order in which the numbers are provided as inputs. That is, there is a particular order of inputs that generates the best result, i.e. the result closest to the true dot product of the numbers. Other orders of providing the inputs may produce results that are less close to the actual dot product of the numbers. Thus, depending on the order in which the inputs are provided, a range of results may be obtained around the actual dot product of the floating point numbers.
The different accuracies of the results obtained are due to causes such as the truncation or rounding errors and the catastrophic cancellation discussed previously. In addition, the latency in obtaining the dot product of large arrays is significant, because the multiplications and additions occur in several consecutive steps. Although the method using separate multiplication and addition units enables the multiplications to occur in parallel, and then some additions to be performed in parallel, the additions must still be performed stage by stage to generate the final output value. Furthermore, renormalization and rounding are performed at each stage, which increases the delay in generating the output value. Thus, there is a need for a method of processing a set of floating point numbers that is more accurate and has lower latency.
Hardware implementations and methods are described herein that process a set of k floating point numbers simultaneously. The method includes receiving the inputs in an incoming format, generating the outputs of the multiplication unit in a first (intermediate) format, and then converting the first format into numbers in a second (intermediate) format for performing the addition (the output of which may or may not be in the same format as any of the preceding formats). More specifically, the method includes receiving the floating point numbers of each set in the incoming format, and generating product numbers having the first format by simultaneously performing mantissa multiplications and exponent summations, while emulating the precision of a selected conventional multiplier. Further, the method includes processing the numbers in the second format simultaneously (e.g., by performing a single summation over all the numbers in the set to obtain a sum, as opposed to performing multiple summations across the set) to generate an output value.
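As a rough behavioural sketch only (toy integer formats, illustrative names, and no rounding or sign handling), the following Python fragment mirrors this flow: parallel mantissa products and exponent sums, detection of the maximum exponent sum, alignment of every term towards the LSB side, and a single simultaneous accumulation.

    def dot_simultaneous(mants_a, exps_a, mants_b, exps_b, extra_lsbs=4):
        # Mantissas are unsigned integers; each product keeps its full width here.
        prods = [(ma * mb, ea + eb)
                 for ma, ea, mb, eb in zip(mants_a, exps_a, mants_b, exps_b)]
        e_max = max(e for _, e in prods)       # maximum exponent sum
        acc = 0
        for m, e in prods:
            # Shift each product towards the LSB side by (e_max - eab_i);
            # a hardware unit would round or keep sticky bits for what is shifted out.
            acc += (m << extra_lsbs) >> (e_max - e)
        # 'acc' is a fixed point value associated with the exponent e_max; a final
        # renormalization and rounding step would convert it back to floating point.
        return acc, e_max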
FIG. 3 is a block diagram illustrating an example of a particular implementation of an architecture for processing a set of k floating point numbers. The architecture 300 is a dot product unit used to perform the multiplications and additions of a large array of 2k floating point numbers to generate an output value. The large array of 2k floating point numbers includes a first set of k floating point numbers (a_0, a_1, a_2 ... a_{k-1}) and a second set of k floating point numbers (b_0, b_1, b_2 ... b_{k-1}). The architecture is particularly suitable for calculating the dot product of large arrays, but can also be used to calculate the dot product of two or more numbers as desired. The architecture 300 includes a mantissa multiplication unit 301, a format conversion unit 302, an exponent addition unit 303, a maximum exponent detection unit 304, an alignment unit 306, a processing unit 308, and a renormalization unit 310. Each number in the first set of 'k' floating point numbers includes a mantissa ma_i and an exponent ea_i. Each number in the second set of 'k' floating point numbers includes a mantissa mb_i and an exponent eb_i. Each number in the first set of 'k' floating point numbers has a mantissa bit length of 'p' bits, and each number in the second set of 'k' floating point numbers has a mantissa bit length of 'q' bits. The two sets of 'k' floating point numbers may be signed numbers or unsigned numbers. In the case of signed numbers, each number will also include a sign bit (sa_i or sb_i) in addition to the mantissa and the exponent. However, some floating point formats (e.g., unsigned formats) may not include a sign bit.
The first set of k floating point numbers (a_0, a_1, a_2 ... a_{k-1}) and the second set of k floating point numbers (b_0, b_1, b_2 ... b_{k-1}) may be received in an input unit (not shown in the figures). The input unit may be a memory device or a memory unit that stores the received inputs. Both sets of 'k' floating point numbers are stored in an incoming format. The two sets of 'k' floating point numbers can have the same incoming format or different incoming formats (for example, when p ≠ q).
The mantissa and exponent bit lengths of the numbers in the incoming format are identified based on the type of floating point number format. The incoming format may be a predefined format that the architecture 300 is designed to receive, or may be identified on a task-by-task basis (e.g., by a controller, not shown). Examples of various types of floating point formats include, but are not limited to, the IEEE formats, including half-precision floating point (16-bit float), single-precision floating point (float), and double-precision floating point (double), or other formats such as brain floating point (bfloat16). In one example, to explain the method, we consider the numbers a_i in the first set and the numbers b_i in the second set to all have the same incoming format, the IEEE single-precision floating point format, with a mantissa (ma_i or mb_i) having a bit length of 23 bits and an exponent (ea_i or eb_i) having a bit length of 8 bits. In another example, the incoming formats of the numbers a_i in the first set and the numbers b_i in the second set may be different. For example, we consider the IEEE single-precision floating point format as the incoming format of the numbers a_i in the first set, having a mantissa ma_i with a bit length of 23 bits and an exponent ea_i with a bit length of 8 bits, and we consider the brain floating point format as the incoming format of the numbers b_i in the second set, having a mantissa mb_i with a bit length of 7 bits and an exponent eb_i with a bit length of 8 bits. However, it should be understood that the invention is not limited to these formats (or combinations of formats), and those skilled in the art will appreciate that the architecture 300 may be implemented to perform the methods described herein using numbers in any type of floating point number format. The mantissa ma_i of each number in the first set of 'k' floating point numbers stored in the incoming format, and the mantissa mb_i of each number in the second set of 'k' floating point numbers, are provided to the mantissa multiplication unit 301. Before the inputs are provided, the fractional part of each mantissa ma_i having a bit length of 'p' bits may be extended with the implicit leading bit to obtain a normalized mantissa of p+1 bits. Similarly, the fractional part of each mantissa mb_i having a bit length of 'q' bits may be extended with the implicit leading bit to obtain a normalized mantissa of q+1 bits.
The mantissa multiplication unit 301 includes a plurality of multiplier units configured to generate 'k' product numbers (z_0, z_1, z_2 ... z_{k-1}) in a different, first format, having a bit length of 'r' bits (where 'r' is an integer), as described in more detail below. Each multiplier unit is configured to perform the mantissa multiplication of the corresponding mantissas from the first and second sets of k floating point numbers to obtain an intermediate mantissa product:
mab_i = ma_i × mb_i
The bit length of the full-precision result obtained when performing the mantissa multiplication may be larger or smaller than r bits. Thus, the mantissa multiplication unit 301 fits the outputs of the plurality of multipliers to a bit length of 'r' bits, thereby generating the product numbers z_i in the first format. Accordingly, the mantissa product mab_i output by each multiplier unit is either rounded to r bits or padded with additional (zero) bits to fit the mantissa product to r bits. The value of the bit length 'r' is set based on the desired precision of the dot product unit 300. In particular, 'r' can (broadly) be regarded as the number of bits required to emulate the precision of the multiplication aspect of a conventional dot product unit. These bits consist of a number of explicit leading bits and a number of fractional bits. Because the input floating point numbers a_i and b_i are normalized before the multiplication, the product number with a bit length of r bits in the first format includes two explicit leading bits (since the multiplication of two numbers between 1.0 (inclusive) and 2.0 (exclusive) may generate a number between 1.0 (inclusive) and 4.0 (exclusive)). Thus, the bit length needs to be increased by one to account for the explicit leading '1' bit position during the summation.
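One multiplier lane of this stage can be sketched in Python as follows (an illustrative model: the helper name is an assumption, and truncation stands in for the exact rounding described above).

    def mantissa_product(frac_a, frac_b, p, q, r):
        # Restore the implicit leading '1' to form normalized mantissas of p+1 and q+1 bits.
        ma = (1 << p) | frac_a
        mb = (1 << q) | frac_b
        mab = ma * mb                 # full precision product of up to p+q+2 bits
        full = p + q + 2
        if full > r:
            return mab >> (full - r)  # drop excess LSBs (a real unit would round exactly)
        return mab << (r - full)      # pad additional (zero) least significant bits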
In a first case, the hardware implementation of the dot product unit 300 may emulate the precision of P bits obtained with the dot product unit 100 using separate multiplication and addition units. In this context, a precision of P bits means that the dot product unit 300 achieves a precision that is not less than the worst-case precision achieved by the (emulated) dot product unit 100 generating a final output with a P-bit mantissa. In other words, when separate multiplication and addition units are implemented to perform the dot product, P is the bit length of the mantissa output when performing the multiplications and of the mantissa inputs when performing the additions. However, for any given bit length P, the actual precision of the dot product unit 100 will depend on the order in which the inputs are processed (due to the accumulation effects already discussed). Thus, in this context, the dot product unit 300 is configured to be at least as accurate as the worst-case precision that the emulated dot product unit 100 would achieve. To achieve this, when the hardware implementation of the dot product unit 300 emulates the precision of P bits obtained with the dot product unit 100, the bit length 'r' is set to 'r = P + 2' bits.
In a second case, the hardware implementation of the dot product unit 300 may emulate the precision of Q bits obtained with the dot product unit 200 using fused multiply and add units. Again, in this context, a precision of Q bits means that the dot product unit 300 achieves a precision that is not less than the worst-case precision achieved by the (emulated) dot product unit 200 generating a final output with a Q-bit mantissa. In other words, when fused multiply and add units are implemented to perform the dot product, Q is the bit length of the mantissa output and of the accumulated mantissa input when performing the multiplications and additions. However, for any given bit length Q, the actual precision of the dot product unit 200 will depend on the order in which the inputs are processed (due to the accumulation effects already discussed). Thus, in this context, the dot product unit 300 is configured to be at least as accurate as the worst-case precision that the emulated dot product unit 200 would achieve. To achieve this, when the hardware implementation of the dot product unit 300 emulates the precision of Q bits obtained with the dot product unit 200, the bit length 'r' is set to 'r = max(Q + 2, p + q + 3)' bits.
As already mentioned, if the value of 'r' is smaller than the full precision bit length of the mantissa multiplication (i.e. p+q+2), the mantissa product is rounded off exactly to obtain the required bit length. Furthermore, if the value of 'r' is greater than the full-precision bit length of the mantissa multiplication (i.e., p+q+2), the mantissa product is padded with zeros to obtain the desired bit length.
That is, if p + q + 2 > r bits, the mantissa product (mab_i) is rounded exactly to r bits to obtain the product number z_i. The rounding of the mantissa product may be accomplished in many ways. In one example, the plurality of multiplier units may be implemented using truncated multipliers. When performing the mantissa multiplication using a truncated multiplier, the truncated multiplier directly calculates the mantissa product mab_i with the additional bits beyond the 'r' bits truncated, thereby directly generating the product number z_i in the first format. In another example, the plurality of multiplier units may be implemented using full multipliers. When performing the mantissa multiplication using a full multiplier, the multiplier computes an intermediate mantissa product mab_i having a bit length greater than 'r' bits, and the intermediate mantissa product is then rounded to 'r' bits to generate the product number z_i in the first format.
Furthermore, if p + q + 2 < r bits, the mantissa product (mab_i) is padded with additional least significant bits to generate the product number z_i with r bits. Thus, the product number z_i can be expressed as a fixed point value mab_i · 2^(-r+2).
Similarly, to emulate the precision of the dot product unit 200, the value of 'r' is greater than the full-precision bit length of the mantissa multiplication (i.e., p + q + 2) by at least one position (if Q + 2 ≤ p + q + 3) or by more (if Q + 2 > p + q + 3). Thus, the mantissa product is zero-padded to obtain the required bit length.
Meanwhile, the exponent ea_i of each number in the first set of 'k' floating point numbers stored in the incoming format, and the exponent eb_i of each number in the second set of 'k' floating point numbers, are provided to the exponent addition unit 303. The exponent addition unit includes a plurality of adder units, each configured to generate an exponent sum,
eab_i = ea_i + eb_i
The format conversion unit 302 receives the 'k' product numbers (z_0, z_1, z_2 ... z_{k-1}) from the mantissa multiplication unit 301. The format conversion unit 302 converts the 'k' product numbers (z_0, z_1, z_2 ... z_{k-1}) into a set of 'k' numbers (y_0, y_1, y_2 ... y_{k-1}), as described in more detail below.
The format conversion unit 302 converts the 'k' product numbers (z_0, z_1, z_2 ... z_{k-1}) into numbers in a second format. This includes converting each product number z_i in the set of 'k' product numbers into a number y_i. The format conversion unit 302 converts each product number z_i, having a bit length of 'r' bits (in the first format), into a number y_i having a bit length of 'n' bits, which represents the second format. The bit length of 'n' bits is obtained by adding one or more additional most significant bits (MSBs) and one or more additional least significant bits (LSBs) to the product number z_i in the first format having a bit length of 'r' bits. Thus, the bit length 'n' is always greater than the bit length 'r' of the generated product numbers, and hence greater than the fraction of the original mantissas of the input floating point numbers a_i and b_i.
If the received sets of 'k' floating point numbers are unsigned floating point numbers, the formed representation of an unsigned number having a bit length of 'n' bits includes n magnitude bits. If the received sets of 'k' floating point numbers are signed floating point numbers, the additional MSBs added to the mantissa in the first format may include a bit representing the sign bit. Thus, the generated representation of a signed number having a bit length of 'n' bits comprises a sign bit and (n-1) magnitude bits. The sign bits of the floating point numbers a_i and b_i are XORed to generate the sign bit of the corresponding number y_i.
The product numbers in the first format are converted into numbers in the second format based on the number of floating point numbers (k) in the set. That is, the number of additional MSBs and LSBs added to the product number (z_i) is determined based on the number 'k'. The bit length of the product number (z_i) is extended on the MSB side by at least ⌈log2(k)⌉ bits, and on the LSB side by at least ⌈log2(k-1)⌉ bits, to obtain the number y_i. If the input floating point numbers are signed floating point numbers, a further bit is added on the MSB side to represent the sign bit. Thus, ⌈log2(k)⌉ + 1 additional MSBs and ⌈log2(k-1)⌉ + 1 additional LSBs may be added on either side of the bit length 'r' of the product number (z_i). That is, the one extra bit of the added MSBs (beyond the ⌈log2(k)⌉ bits) is assigned to the sign bit s_i. The sign bit is obtained by XORing the sign bits of the corresponding input floating point numbers a_i and b_i. The extra bits of the added LSBs (beyond the ⌈log2(k-1)⌉ bits) are precision bits used to obtain additional precision. The added MSBs and LSBs prevent overflow or underflow of bits while the set of 'k' numbers is processed, as explained in detail later. In different examples, the number of additional MSBs and LSBs added on each side may be the same or different. In general, the bit length 'n' of the number y_i in the second format can be obtained as:
n = r + ⌈log2(k)⌉ + ⌈log2(k-1)⌉ + x
where x is an integer, preferably x ≥ 1, and where the value of x depends on the number of extra bits added to represent the leading bits, sign bit and precision bits, if present. For example, x may be as small as 1 when there is no sign bit in the originally received numbers, or as small as 2 when the originally received numbers do have a sign bit. In both cases, x may be larger, to provide greater precision.
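A short worked check of this expression (an illustrative calculation only), matching the example of FIG. 4c discussed below, where r = 17, k = 8 and x = 2 for signed inputs:

    import math

    def second_format_width(r, k, x):
        # n = r + ceil(log2(k)) + ceil(log2(k-1)) + x
        return r + math.ceil(math.log2(k)) + math.ceil(math.log2(k - 1)) + x

    print(second_format_width(17, 8, 2))   # prints 25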
In a second, different implementation, the product numbers z_i in the first format may include r + log2(k-1) + 1 bits instead of the 'r' bits described in the preceding paragraphs, so that when rounding is performed at the multiplication stage (if p + q + 2 > r + log2(k-1) + 1 bits), as many mantissa product bits as possible are preserved for the addition. In this case, in order for the hardware implementation of the dot product unit 300 to emulate the precision of P bits obtained with the dot product unit 100 using separate multiplication and addition units, the bit length 'r' is set to 'r = P + 1 - log2(k-1)'.
In this case, the format conversion unit 302 converts the product number z_i, which has a bit length of 'r+log(k-1)+1' bits in the first format, into a number y_i having a bit length of 'n' bits in the second format. The bit length of 'n' bits is obtained by adding one or more additional most significant bits (MSBs) to the product number z_i in the first format. The bit length of the product number (z_i) is extended on the MSB side by at least ⌈log₂(k)⌉ bits. In addition, in order to obtain extra precision, a number of additional bits may also be added on the LSB side as precision bits. Thus, the bit length 'n' of the number y_i in the second format is obtained by adding these additional MSBs, the sign bit (if present) and any precision LSBs to the 'r+log(k-1)+1' bits of the product number.
FIG. 4a shows a representation of a mantissa (mantissa ma_i or mb_i) in the incoming format, in an example where the two sets of incoming numbers share a common format, and FIG. 4b shows a representation of the product number (z_i) in the first (intermediate) format. FIG. 4c shows a representation of the signed number (y_i) in the second (intermediate) format. In FIG. 4a, the incoming format is shown as a brain floating point number (bfloat16) with a mantissa bit length (p) of 7 bits.
FIG. 4b shows the product number (z_i) expressed in the first format having a bit length of 'r' bits. As discussed above, the bit length of the 'r' bits is set to 'r=p+2' bits or 'r=max(q+2, p+q+3)' bits, based on the required precision. If p+q+2 > r, the output obtained by multiplying the mantissa (ma_i) having a bit length of p bits and the mantissa (mb_i) having a bit length of q bits is truncated/rounded to fit into r bits. However, if p+q+2 < r bits, the output obtained by multiplying the mantissas (ma_i) and (mb_i) is padded with additional least significant bits to generate a product number z_i with r bits. In this example, the value of r is 17 bits. The representation of the product number in the first format includes two explicit leading bits (LB) as part of the r bits.
FIG. 4c shows a representation of the signed number (y_i) in the second format. The representation in FIG. 4c shows a signed number y_i having a bit length of n = r + (⌈log₂(k)⌉+1) + (⌈log₂(k-1)⌉+1) bits. This is obtained by adding ⌈log₂(k)⌉+1 additional MSBs and ⌈log₂(k-1)⌉+1 additional LSBs to the bit length 'r' of the product number. Thus, in the example, when the bit length r is set to 17 bits (for example, as extended from the initial mantissa bit length of 7 bits shown in FIG. 4a) and a set of 8 floating point numbers is processed (i.e. k=8), the signed numbers in the second format have a bit length of n=25 bits.
As is clear from the example in FIG. 4c, the number y_i expressed in the second format as defined herein comprises the r bits (including the two bits allocated to represent the leading bits (LB)), a bit allocated to represent the sign bit (s_i), a further ⌈log₂(k)⌉ bits as additional MSBs, and ⌈log₂(k-1)⌉+1 additional LSBs. Thus, the number y_i in the second format is a signed number and includes a sign bit and magnitude bits (i.e. bits indicating the absolute magnitude of the represented value). The sign bit is assigned as a '0' or '1' bit based on whether the number is positive or negative.
In the example shown in FIG. 4c, the number y_i includes ⌈log₂(k-1)⌉+1 additional LSBs added to the product number (z_i). More generally, the number y_i may include ⌈log₂(k-1)⌉+u additional LSBs added to the product number, where u is any integer and u ≥ 1. Preferably, the number y_i includes ⌈log₂(k-1)⌉+1 additional LSBs added to the product number (z_i). The extra LSBs added to the product number (z_i) increase the precision of the result obtained and reduce underflow of the product number bits while aligning the numbers y_i, as explained in detail below.
Thus, in the example described in the above paragraphs, where the incoming format is a signed floating point number and the first format is a number with two explicit leading bits, at least ⌈log₂(k)⌉+1 additional MSBs and ⌈log₂(k-1)⌉+1 additional LSBs are added on either side of the bit length 'r' of the product number z_i to form the number y_i, so that the number of extra bits x is ≥ 2.
Similarly, in another example case, assuming the incoming format is an unsigned floating point number and the first format is a number with two explicit leading bits, ⌈log₂(k)⌉ additional MSBs and at least ⌈log₂(k-1)⌉+1 additional LSBs are added on either side of the bit length 'r' of the product number (z_i) to form the number y_i, so that the number of extra bits x is ≥ 1.
Thus, in a generalized example, x ≥ 1, and the bit length n = r + ⌈log₂(k)⌉ + ⌈log₂(k-1)⌉ + x gives the maximum bit length of the formed number y_i.
The additional MSBs and LSBs added to the product number (z_i) are initially allocated as '0' bits in the second format. The sign bit is assigned as a '0' or '1' bit based on whether the number is positive or negative.
Furthermore, the exponent sums eab_i (eab_0, eab_1, eab_2, eab_3 … eab_{k-1}) of the exponents ea_i and eb_i of each pair of floating point numbers in the incoming format are provided as inputs to the maximum exponent detection unit 304. As shown in FIG. 3, the inputs eab_i of the maximum exponent detection unit are supplied from the exponent addition unit 303. In some other arrangements, the exponent sums may instead be passed to the maximum exponent detection unit 304 by the format conversion unit 302.
The maximum exponent detection unit 304 identifies, from the k exponent sums (eab_0, eab_1, eab_2, eab_3 … eab_{k-1}), the maximum exponent sum (e_max). The maximum exponent detection unit 304 may detect the maximum exponent sum using various methods or functions. One example of a method of identifying the maximum exponent sum is to use a binary tree structure. A method of identifying the maximum exponent sum (e_max) is described in detail below with reference to FIG. 5. Which option is preferred may depend on the available resources (e.g. parallel processing may generally be faster but more hardware-intensive).
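As an illustration of one such method, the following Python sketch models the binary tree reduction; it is a software model only, since the hardware uses comparator logic rather than a loop.

```python
def max_exponent_sum(eab: list[int]) -> int:
    """Binary-tree reduction over the k exponent sums eab_0..eab_{k-1}.

    Each pass compares disjoint pairs (in parallel, conceptually), halving the
    number of candidates, so only ceil(log2(k)) comparison levels are needed.
    """
    level = list(eab)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(max(level[i], level[i + 1]))   # one comparator per pair
        if len(level) % 2:                            # an odd element passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Example with k = 3 exponent sums, as in FIG. 5.
print(max_exponent_sum([5, 9, 7]))   # -> 9
```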
In addition to the maximum exponent detection unit 304, the exponent addition unit 303 supplies the exponent sum values eab_i as inputs to the alignment unit 306. The alignment unit 306 receives the exponent sums eab_i of each pair of floating point numbers a_i and b_i as a first input. The alignment unit 306 also receives the maximum exponent sum (e_max) from the maximum exponent detection unit 304 as a second input, and receives the numbers y_i from the format conversion unit 302 as a third input. In one particular implementation, the alignment unit 306 may include the format conversion unit implemented as part of the alignment unit rather than as a separate unit. In this case, the alignment unit 306 receives the product numbers z_i as input, and converts the product numbers z_i to y_i before shifting. The alignment unit 306 aligns the magnitude bits of each number y_i, thereby converting each number y_i, based on the maximum exponent sum, into a different number (or integer) v_i. The method of aligning the numbers y_i is explained in detail with reference to FIG. 5, but in summary, the numbers in the second format are adjusted based on the maximum exponent sum, and for convenience the adjusted numbers (v_0, v_1, v_2 … v_{k-1}) are treated as integers for subsequent processing in processing unit 308.
Thereafter, the k integers (v_0, v_1, v_2 … v_{k-1}) are provided to the processing unit 308. The processing unit 308 is an adder unit. The processing unit 308 processes the k integers (i.e. the k aligned numbers) simultaneously. That is, the processing unit performs processing on all integers in the group at the same time, rather than, for example, sequentially processing elements of the group. The processing unit 308 performs addition of the k integers to generate an output value o. It is noted that addition of positive and negative numbers is equivalent to performing subtraction, and thus the term processing is used herein to encompass the effect of both addition and subtraction, alone or in combination.
The output value o from the processing unit 308 and the maximum exponent sum from the maximum exponent detection unit 304 are also provided to a renormalization unit 310. The renormalization unit 310 converts the output value from the processing unit into a floating point number having a mantissa m_i and an exponent e_i. The format of the output value may be selected according to the required precision (e.g. according to whether the purpose is to emulate the precision of an arrangement such as that of FIG. 1 or FIG. 2). The output unit 312 stores the converted output value (i.e. the output floating point number).
FIG. 5 is a block diagram illustrating different elements in a specific implementation of the architecture 300 of FIG. 3. Consider a scenario in which each set of k floating point numbers includes three numbers, i.e. k=3. An input unit (not shown) may receive as inputs a first set of three floating point numbers a_i (a_0, a_1, a_2) and a second set of three floating point numbers b_i (b_0, b_1, b_2).
Mantissa multiplication unit 301 receives the mantissas (ma_0, ma_1, ma_2) of the first set of three floating point numbers (a_0, a_1, a_2) and the mantissas (mb_0, mb_1, mb_2) of the second set of three floating point numbers b_i (b_0, b_1, b_2) as inputs. The mantissa multiplication unit 301 includes a plurality of multiplier units 501a, 501b and 501c. Each multiplier unit is configured to generate a product number z_i having a bit length of 'r' bits. Each multiplier unit is configured to perform mantissa multiplication of the corresponding mantissas from the first and second sets of k floating point numbers to obtain the mantissa product:
mab_i = ma_i × mb_i
Multiplier unit 501a multiplies the mantissas ma_0 and mb_0 of the floating point numbers a_0 and b_0 to generate the product number z_0. Similarly, multiplier units 501b and 501c generate the product numbers z_1 and z_2, respectively.
As previously discussed, the value of the bit length 'r' is set based on the desired precision of the dot product unit 300. In the first case, where the hardware implementation of the dot product unit 300 emulates the precision of P bits obtained when the dot product is performed using separate multiplication and addition, the bit length of the 'r' bits is set to 'r=p+2' bits. In the second case, where the hardware implementation of dot product unit 300 emulates the precision of Q bits obtained when the dot product is performed using fused multiplication and addition, the bit length of the 'r' bits is set to 'r=max(q+2, 2p+3)' bits when the two sets of floating point numbers have the same incoming format (p=q), or to 'r=max(p+2, p+q+3)' bits when the two sets of floating point numbers have different incoming formats (i.e. p≠q).
In one example, the plurality of multiplier units 501a, 501b and 501c may be implemented using truncating multipliers. When the mantissa multiplication is performed using a truncating multiplier, the multiplier directly computes the output r bits by truncating the bits beyond the 'r' bits, thereby directly generating the product number z_i in the first format. In another example, the plurality of multiplier units 501a, 501b and 501c may be implemented using full multipliers. When a full multiplier is used to perform the mantissa multiplication, the multiplier computes an intermediate mantissa product mab_i having a bit length greater than 'r' bits, which is then rounded to 'r' bits, thereby generating the product number z_i in the first format.
If p+q+2 > r bits, the mantissa product (mab_i) is exactly rounded to r bits. Furthermore, if p+q+2 < r bits, the mantissa product (mab_i) is padded with additional least significant bits to generate a product number z_i with r bits.
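The fitting of the mantissa product into r bits can be sketched as follows (illustrative Python only; truncation is shown as one possible rounding choice, and the function name fit_product_to_r is hypothetical).

```python
def fit_product_to_r(mab: int, prod_bits: int, r: int) -> int:
    """Fit a mantissa product of 'prod_bits' bits into an r-bit product z_i.

    If the product is wider than r bits, the excess LSBs are truncated here
    (one possible rounding choice; round-to-nearest is another); if it is
    narrower, zero LSBs are appended.
    """
    if prod_bits > r:                    # p+q+2 > r: drop the excess LSBs
        return mab >> (prod_bits - r)
    return mab << (r - prod_bits)        # p+q+2 < r: pad with zero LSBs
```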
Meanwhile, the exponent addition unit 303 receives from the input unit the exponents (ea_0, ea_1, ea_2) of the first set of three floating point numbers (a_0, a_1, a_2) and the exponents (eb_0, eb_1, eb_2) of the second set of three floating point numbers (b_0, b_1, b_2) as inputs. The exponent addition unit 303 includes a plurality of adder units 503a, 503b and 503c. Each adder unit is configured to compute the sum of the exponent ea_i and the exponent eb_i of the corresponding floating point numbers in each group, to generate the exponent sum eab_i corresponding to each product number z_i:
eab_i = ea_i + eb_i
Adder unit 503a adds the exponent ea_0 of floating point number a_0 and the exponent eb_0 of floating point number b_0 to generate the exponent sum eab_0. Similarly, adder units 503b and 503c generate the exponent sums eab_1 and eab_2, respectively.
In the example where the first set of three floating point numbers a_i (a_0, a_1, a_2) and the second set of three floating point numbers b_i (b_0, b_1, b_2) are signed floating point numbers, the sign bits of the corresponding floating point numbers a_i and b_i are exclusive-ored to obtain the values of the sign bits s_i (s_0, s_1, s_2) corresponding to the product numbers z_i (z_0, z_1, z_2).
The output of the mantissa multiplication unit 301 is also supplied to the format conversion unit 302. In other words, the plurality of multiplier units 501a, 501b and 501c in the mantissa multiplication unit 301 provide the product numbers z_0, z_1 and z_2 in the first format, with a bit length of r bits, to the format conversion unit 302. Furthermore, if present, the sign bits s_i (s_0, s_1, s_2) are also provided to the format conversion unit 302. The format conversion unit 302 converts the set of three numbers in the first format into three numbers y_i in the second format, as described with reference to FIG. 3. In this example, it is considered that the two sets of three floating point numbers in the incoming format are signed numbers and that the set of k product numbers in the first format includes two explicit leading bits, thereby preserving the integer portion of the mantissa. Thus, the set of three product numbers is converted into three signed numbers having a bit length of n bits, each comprising one sign bit s_i and (n-1) magnitude bits f_i that include the two explicit leading bits. In other example cases, both sets of three floating point numbers in the incoming format may be unsigned numbers.
Further, the exponent sums eab_i (eab_0, eab_1 and eab_2) from the exponent addition unit 303 are provided to the maximum exponent detection unit 304.
The maximum exponent detection unit 304 in FIG. 5 may include two maximum-function logic blocks to identify the maximum exponent sum. This is by way of example only; other embodiments may have a different structure for finding the maximum exponent sum, or a similar structure but with a different amount of logic to accommodate a different number of inputs.
In the example using two maximum-function logic blocks, the first maximum-function logic receives the exponent sums eab_0 and eab_1. The first maximum-function logic identifies the larger of eab_0 and eab_1. The output of the first maximum-function logic and the exponent sum eab_2 are then provided to the second maximum-function logic. The second maximum-function logic identifies the larger of the output of the first maximum-function logic and the exponent sum eab_2, thereby detecting e_max, i.e. the largest of the input exponent sums eab_0, eab_1 and eab_2.
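A minimal sketch of this chained arrangement (software illustration only) is simply two nested maximum operations.

```python
def max_of_three(eab0: int, eab1: int, eab2: int) -> int:
    # First maximum-function logic: the larger of eab_0 and eab_1.
    first = max(eab0, eab1)
    # Second maximum-function logic: the larger of that result and eab_2, i.e. e_max.
    return max(first, eab2)
```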
As mentioned above, the maximum exponent detection unit 304 may be implemented in various other ways. For example, the maximum exponent detection unit 304 may be implemented using a binary search tree.
Returning to the depicted example, the maximum exponent sum identified by the maximum exponent detection unit 304 is provided as an input to the alignment unit 306. Furthermore, the exponent sums eab_0, eab_1 and eab_2 are provided as inputs to the alignment unit 306. In addition, the three signed numbers y_i in the second format are provided as inputs to the alignment unit 306. The alignment unit 306 aligns the magnitude bits f_i of each signed number y_i based on the maximum exponent sum and on the corresponding exponent sum of the exponents that generated the first-format product number corresponding to that signed number. In other words, the magnitude bits of any signed number whose corresponding product number does not have the maximum exponent sum are shifted, to account for the difference between the exponent sum of that pair of numbers and the maximum exponent sum (effectively adding zeros before the first (or at least the first non-zero) magnitude bit, and removing trailing bits as needed to properly realign the magnitude bits). Thus, the alignment unit 306 converts each signed number (y_i) into another integer (v_i), which is output by the alignment unit 306. The integer v_i is regarded as being in a fixed point number format. Similarly, in the case of unsigned numbers, the alignment unit shifts the magnitude bits of each unsigned number based on the maximum exponent sum and on the corresponding exponent sum of the number pair that generated the first-format product number corresponding to that unsigned number.
The examples shown in FIGS. 6a to 6d are used to show the conversion of a signed number y_i into an integer v_i. It will be appreciated that, in describing this series of figures, the starting point is the signed number y_i in the second format as output by the format conversion unit 302, and the end point is the converted signed number v_i (integer v_i) mentioned above. However, for ease of reference, in the following description the intermediate stages may also be referred to as signed numbers.
Consider the example of two sets of three floating point numbers, each having an implicit leading bit in the incoming format and a sign bit separate from the mantissa. Each number has a mantissa m_i of bit length 7 bits in the incoming format (such as bfloat16). Assume that the bit length of the product number z_i generated by multiplying the mantissas of two floating point numbers from each group is set to r=17 bits. Thus, in this example, each number in the set of three numbers, when converted to the second format, is a signed number y_i (including the sign bit s_i) having a bit length of n bits, where n = r + ⌈log₂(k)⌉ + ⌈log₂(k-1)⌉ + x with x ≥ 2.
FIG. 6a shows a signed number y_i with a bit length of n bits in the second format. In this example, the number shown in FIG. 6a represents the signed number y_0 provided as an input to the alignment unit 306. The alignment unit 306 comprises a plurality of subtraction modules 505. The alignment unit 306 further comprises a plurality of shifter units 506 and a plurality of complement units 507.
The alignment unit 306 receives the exponent sums eab_i from the exponent addition unit 303 as a first input, receives the maximum exponent sum e_max from the maximum exponent detection unit 304 as a second input, and receives the signed numbers y_i from the format conversion unit 302 as a third input. As explained with reference to FIG. 3, in one example the format conversion unit 302 may be implemented as part of the alignment unit 306. The exponent addition unit 303 supplies the exponent sums eab_i of the three numbers to the subtraction modules. Each subtraction module receives an exponent sum eab_i from the exponent addition unit 303 as a first input and the maximum exponent sum e_max from the maximum exponent detection unit 304 as a second input. Each subtraction module computes the difference e_di between the maximum exponent sum e_max and the exponent sum eab_i of the pair of numbers. In FIG. 5, the first subtraction module receives the exponent sum eab_0 of the two first numbers from the exponent addition unit 303 and receives the maximum exponent sum e_max from the maximum exponent detection unit 304. The first subtraction module computes a first difference, referred to as e_d0 in FIG. 5. Similarly, the remaining subtraction modules compute e_d1 and e_d2, as shown in FIG. 5. Although, as shown, the plurality of subtraction modules compute the differences e_di for each number in parallel, other arrangements are possible - e.g. a single subtraction module performing each subtraction in series. Returning to the depicted example, the computed differences e_d0, e_d1 and e_d2 from each of the plurality of subtraction modules 505 are also provided to the corresponding shifter units among the plurality of shifter units 506.
Each shifter unit of the plurality of shifter units 506 receives the computed difference e_di corresponding to a product number as a first input, and receives the magnitude bits f_i of the corresponding signed number y_i as a second input. Furthermore, each shifter unit of the plurality of shifter units 506 is configured to shift the magnitude bits f_i of the signed number y_i based on the corresponding computed exponent difference. The magnitude bits f_i (excluding the sign bit) are shifted towards the least significant bit side (i.e. the right side of the depicted format) by an amount equal to the computed exponent difference. FIG. 6a shows that the magnitude bits f_i of the first input signed number y_0 include the 'r bits' corresponding to the bits of the product number in the first format, including the two explicit leading bits. The remaining extra bits of the signed number y_0 are filled with '0' bits. The original 'r bits', including the explicit leading bits, are shifted by the shifter unit. In addition, in FIG. 6a, the sign bit of the signed number y_0 is assigned a '1' bit, which indicates that the signed number y_0 is negative. The shifter unit does not shift the sign bit.
In this example, a first shifter among the plurality of shifter units 506 receives, as a first input, the magnitude bits f_0 of the signed number y_0. Further, the first shifter receives, as a second input, the computed difference e_d0 from the first subtraction module (in the example, for the magnitude bits (f_0) shown in FIG. 6a, the computed difference e_d0, i.e. the difference between e_max and eab_0, is taken to be equal to 4). Therefore, the first shifter unit shifts the magnitude bits f_0 of the signed number y_0 four positions to the right. FIG. 6b shows the shifted number. The computed difference e_d is never negative. Thus, based on the difference e_d computed for each number, the magnitude bits f_i are always shifted towards the least significant bit side (i.e. to the right in the example), possibly by zero positions.
Similarly, the other shifter units of the plurality of shifter units 506 shift the magnitude bits (f_1 and f_2) based on the corresponding computed differences e_d1 and e_d2. Thus, all of the plurality of shifter units 506 perform the shifting of the magnitude bits f_i in parallel, whereas in most existing architectures for processing floating point numbers the shifters shift or align the mantissas sequentially as needed, which adds a significant amount of delay. Since in the disclosed architecture the shifting or alignment of all numbers occurs in parallel, the processing delay can be reduced significantly as the number of floating point numbers to be processed increases. In another particular implementation, due to limitations of available resources, the plurality of shifter units 506 may perform the shifting of the magnitude bits f_i serially, despite the increased delay (parallel processing may generally be faster but more hardware-intensive).
It can be seen in FIG. 6b that the shift of the magnitude bits results in 4 bits being shifted out of the bit width of the signed number (and thus out of the stored representation of the number). The shifter unit truncates bits that are shifted outside the bit length of n bits. Bits corresponding to the mantissa of the original first-format number are shifted out of the 'n'-bit length only when the computed exponent difference of the corresponding number is greater than the number of additional LSBs added when converting the number from the first format into the second format. When the computed difference is greater than the number of additional LSBs, an underflow of bits of the original (first format) number may result. When the computed difference is smaller than the number of additional LSBs, only the 'zero' bits added by the format conversion unit 302 underflow. As mentioned above, FIG. 6b shows that 4 bits are shifted out of the bit length of n bits when the bits are shifted by 4 positions. FIG. 6c shows the signed number after performing the truncation. As is apparent from the figure, although the bits are shifted by 4 positions, in this case only 1 bit of the actual number (i.e. 1 bit of the product number in the first format) is lost, owing to the LSBs added during conversion to the second format. Thus, the extra LSBs serve to reduce the loss of precision that would occur, for example, if all numbers were shifted to the same maximum exponent using the first format.
The output from each shifter unit 506 is also provided to a corresponding complement unit of the plurality of complement units 507. The complement unit receives the aligned magnitude bits from the shifter unit as a first input and the sign bit of the signed number y_i as a second input. However, in other arrangements, the function of the complement unit may be performed before the function of the shifter unit, or as part of the adder unit (processing unit 308). In any case, the complement unit performs the two's complement of the magnitude bits f_i of those numbers whose sign bit indicates a negative number. In this case, the shifted positive signed numbers in the set are provided to the processing unit 308 (adder 508 in FIG. 5) without complementing, and the two's complement of the negative numbers in the set is provided to the processing unit. The processing unit 308 receives the outputs from the plurality of complement units 507 and processes the aligned signed numbers v_i simultaneously to generate an output. The output obtained from each complement unit is the aligned number v_i. FIG. 6d shows the number v_0 obtained by complementing the mantissa shown in FIG. 6c.
Thus, the alignment unit 306 aligns the magnitude bits of the numbers y_i by performing the shifting and truncating steps on the magnitude bits f_i of each number y_i, to generate a set of numbers (or integers) v_i. The alignment unit also converts any number having a sign bit indicating that the number is negative into a two's complement representation. In the case of unsigned numbers y_i, the alignment unit performs the shifting and truncating steps on the magnitude bits f_i of the numbers y_i; the only difference is that the complement step need not be performed for unsigned numbers. The alignment unit 306 is able to process each number in parallel for the shifting, truncating and complementing of the magnitude bits. The numbers v_i obtained after conversion are integers. The number v_i is calculated as

v_i = (-1)^(s_i) · ⌊f_i · 2^(-e_di)⌋
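A minimal Python sketch of the per-number alignment is given below (the helper name align is hypothetical; the hardware stores negative values as n-bit two's complement words, which a signed Python integer models by value).

```python
def align(sign: int, f: int, eab: int, e_max: int) -> int:
    """Align one second-format number to the maximum exponent sum e_max.

    The magnitude bits f are shifted towards the LSB side by the exponent
    difference (truncating the shifted-out bits), and the result is negated
    when the sign bit indicates a negative number.
    """
    e_d = e_max - eab        # computed difference, never negative
    v = f >> e_d             # shift right, truncating shifted-out bits
    return -v if sign else v
```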
In various implementations, the alignment unit 306 may be configured to perform a rounding-up step after shifting the magnitude bits f_i of the number y_i, instead of truncating. In this case, the integer v_i is obtained by rounding up the shifted magnitude bits f_i of the number y_i. The number v_i is then calculated as

v_i = (-1)^(s_i) · ⌈f_i · 2^(-e_di)⌉
It will be apparent to those skilled in the art that the step of rounding the number y_i may be performed by any method that achieves rounding up or rounding down. As shown in FIG. 5, the converted numbers, i.e. the integers v_0, v_1 and v_2, are provided to the adder 508 acting as the processing unit 308. The signed integer v_0 generated by performing the shifting, truncation and complementing on the magnitude bits f_0 of the signed number y_0 in the example is shown in FIG. 6d. In the example, adder 508 is a carry save adder capable of performing addition of three 'n'-bit integers. Each integer v_i has a value between -2^(n-1-⌈log₂(k)⌉) and 2^(n-1-⌈log₂(k)⌉). The carry save adder performs the addition of the three signed integers v_i to generate a sum value o (the output).
The magnitude of each of the numbers being added (i.e. the 3 integers v_i) is smaller than 2^(n-1-⌈log₂(k)⌉), and thus the sum value is smaller than k·2^(n-1-⌈log₂(k)⌉) ≤ 2^(n-1) and does not overflow the 'n' bits. That is, the maximum possible value of a number y_i has its leading 1 (taking into account the sign bit) ⌈log₂(k)⌉+1 positions from the MSB end, so that each v_i has its leading 1 spaced at least ⌈log₂(k)⌉ positions from the MSB end (again considering the sign bit), and the sum of k numbers of that value (i.e. considering the extreme case where all numbers have the maximum exponent sum) cannot overflow the extra ⌈log₂(k)⌉ bits provided at the MSB end after the sign bit. Adder 508 processes the set of 'k' floating point numbers so as to generate the same output value regardless of the order in which the set of 'k' floating point numbers is provided as inputs.
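This headroom argument can be checked numerically with a short sketch (illustrative only; the bound assumed below follows from the ⌈log₂(k)⌉ zero MSBs added during format conversion).

```python
from math import ceil, log2

def fits_without_overflow(values: list[int], n: int) -> bool:
    """Check the headroom argument: if each |v_i| is below 2**(n-1-ceil(log2(k))),
    then the k-term sum stays inside the signed n-bit range."""
    k = len(values)
    bound = 1 << (n - 1 - ceil(log2(k)))
    assert all(abs(v) < bound for v in values)
    total = sum(values)
    return -(1 << (n - 1)) <= total < (1 << (n - 1))
```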
The sum value o is also supplied to the renormalization unit 310. It will be noted that, in this example, the value o, like the values v_i, is a signed integer in two's complement format. The renormalization unit 310 includes a shifter 510a and a subtractor 510b. Shifter 510a shifts the bits of the sum value o to generate a normalized value. Shifter 510a represents the sum value o in a normalized format by counting the number of leading '0' bits or '1' bits (denoted 'd') that occur consecutively at the MSB end (i.e. including the sign bit). The number of leading '0' bits is counted when the obtained sum value is non-negative, and the number of leading '1' bits is counted when the obtained sum value is negative. The shifter then shifts the bits of the number accordingly to generate a normalized number (n_k). The normalized number (n_k) is further rounded so as to represent it with a mantissa of a particular bit length, taking into account the ⌈log₂(k)⌉+1 additional MSBs that were added when converting to the second format.
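A software sketch of the leading-bit count performed by shifter 510a is shown below (illustrative only; hardware would use a leading-zero/one counter rather than a loop). The shifter 510a then uses the count d to shift the sum value into its normalized form.

```python
def count_leading_sign_bits(o: int, n: int) -> int:
    """Count d: the number of identical leading bits ('0's if o >= 0, '1's if
    o < 0) in the n-bit two's complement pattern of the sum value o."""
    bits = o & ((1 << n) - 1)            # n-bit two's complement pattern of o
    lead = 1 if o < 0 else 0
    d = 0
    for pos in range(n - 1, -1, -1):     # scan from the MSB (the sign bit) down
        if (bits >> pos) & 1 != lead:
            break
        d += 1
    return d
```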
Subtractor 510b receives the maximum exponent sum as a first input, and receives d (the count of leading '0' or '1' bits) and the number of additional LSBs added to the mantissa of the first format as further inputs. The subtractor then calculates the exponent of the normalized number based on these inputs, and represents the exponent with a bit length of 'max(bit length of ea_i, bit length of eb_i)+1' bits, i.e. one additional bit over the larger of the exponent bit lengths of the floating point numbers a_i and b_i. The exponent e_k of the final output is calculated from the maximum exponent sum, the count d, and the number of additional MSBs and LSBs added when converting to the second format.
This is only an example; a person skilled in the art can calculate the exponent e_k using various other known methods in other examples. Thus, the final output or sum value obtained is represented with the normalized mantissa (n_k) and the exponent (e_k).
The architecture 300 may be used to perform the addition of any number of floating point numbers. The example shown in FIG. 5 is a specific example of a dot product unit 300 for executing the dot product of two sets of 3 floating point numbers. Further, by adding an appropriate number of elements to each unit of the adder 500 in a similar manner, each unit can be expanded to perform a dot product over any number of floating point numbers (e.g. 20 floating point numbers or 50 floating point numbers) simultaneously.
FIG. 7 is a flow chart illustrating a method of processing two sets of 'k' floating point numbers. The method includes performing a dot product using a hardware implementation of the architecture 300 for performing dot product multiplication. The method includes performing multiplication and addition operations on a large array of 2k floating point numbers to generate an output value. The large array of 2k floating point numbers includes a first set of k floating point numbers (a_0, a_1, a_2 … a_{k-1}) and a second set of k floating point numbers (b_0, b_1, b_2 … b_{k-1}).
In step 701, the method includes receiving two sets of 'k' floating point numbers, each in an incoming format. Each number in the first set of 'k' floating point numbers includes a mantissa ma_i and an exponent ea_i. Each number in the second set of 'k' floating point numbers includes a mantissa mb_i and an exponent eb_i. The mantissa ma_i of a number a_i has a bit length of 'p' bits, and the mantissa mb_i of a number b_i has a bit length of 'q' bits. The two sets of 'k' floating point numbers may be signed numbers or unsigned numbers. The bit length of the mantissa and the bit length of the exponent (e_i) in the incoming format are identified based on the type of floating point number format. Further, the floating point numbers may be signed or unsigned numbers with implicit or explicit leading bits. For example, a single precision (32-bit) floating point number in an incoming format may typically be a signed number with an implicit leading bit, comprising a mantissa of bit length 23 bits (excluding the leading bit), an exponent of bit length 8 bits, and an additional sign bit (s_i). In other examples, a single precision (32-bit) floating point number in an incoming format may be a signed number with an explicit leading bit, the mantissa then having a bit length of 23 bits including the explicit leading bit.
When a single precision (32-bit) floating point number in an incoming format is an unsigned number with an implicit leading bit, there is no additional sign bit and the mantissa may be represented by a bit length of 24 bits (not including the leading bit). Further, when a single precision (32-bit) floating point number in an incoming format is an unsigned number with an explicit leading bit, the 24-bit length of the mantissa includes the explicit leading bit. The two sets of 'k' floating point numbers may have the same incoming format or different incoming formats (for example, with p ≠ q).
At step 702, when the sets of 'k' floating point numbers in the incoming format have been received, the method includes generating 'k' product numbers (z_0, z_1, z_2 … z_{k-1}). Before being provided as an input, the fractional part of a mantissa ma_i having a bit length of 'p' bits may be extended with the implicit leading bit to obtain a normalized mantissa of p+1 bits. Similarly, the fractional part of a mantissa mb_i having a bit length of 'q' bits may be extended with the implicit leading bit to obtain a normalized mantissa of q+1 bits. The k product numbers are generated by multiplying the corresponding mantissas ma_i and mb_i from the first and second sets of k floating point numbers and fitting the output of each mantissa multiplication to the bit length of 'r' bits.
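As a small illustration (hypothetical helper name; it assumes the incoming format stores a p-bit fraction with an implicit leading '1'), the extension of the fraction with the implicit leading bit can be sketched as:

```python
def normalized_mantissa(fraction: int, p: int) -> int:
    """Prepend the implicit leading '1' to a p-bit fraction, giving a p+1-bit
    normalized mantissa whose value is (1 << p) + fraction, i.e. 1.fraction."""
    assert 0 <= fraction < (1 << p)
    return (1 << p) | fraction

# bfloat16-style example: p = 7, fraction 0b0100000 -> mantissa 0b10100000 (1.25).
print(bin(normalized_mantissa(0b0100000, 7)))   # -> 0b10100000
```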
As previously explained, the value of the bit length 'r' is set based on the desired precision of the dot product unit 300. To emulate the precision obtained with the dot product unit 100 using separate multiplications and additions, the bit length of the 'r' bits is set to 'r=p+2' bits. To instead emulate the precision obtained with the dot product unit 200 using fused multiplication and addition, the bit length of the 'r' bits is set to 'r=max(q+2, p+q+3)' bits.
That is, if p+q+2 > r bits, the mantissa product (mab_i) is exactly rounded to r bits to obtain the product number z_i. Furthermore, if p+q+2 < r bits, the mantissa product (mab_i) is padded with additional least significant bits to generate a product number z_i with r bits.
At step 703, the method includes generating the sum eab_i of the exponent ea_i of each number in the first set of 'k' floating point numbers and the exponent eb_i of the corresponding number in the second set of 'k' floating point numbers. This step may be performed before or after step 702, or may even be performed in parallel with step 702.
Further, at step 704, the method includes converting the set of 'k' product numbers (z_0, z_1, z_2 … z_{k-1}) in the first format into a set of 'k' numbers (y_0, y_1, y_2 … y_{k-1}) in the second format. The numbers (y_i) are obtained by adding both extra MSBs and extra LSBs to the bit length 'r' of the product numbers z_i in the first format. The bit length of the r bits is extended based on the number 'k' (the number of floating point numbers in each group). In the example where the sets of 'k' floating point numbers in the incoming format are signed numbers, adding the extra MSBs and LSBs preferably includes adding ⌈log₂(k)⌉+1 most significant bits and ⌈log₂(k-1)⌉+1 least significant bits. The number of additional MSBs and additional LSBs added to the bit length 'r' may be the same or different. In this example, the added MSBs include one bit representing the sign bit. Thus, each signed number, which includes the sign bit s_i, is represented by a bit length of 'n' bits. The bit length 'n' is expressed as

n = r + ⌈log₂(k)⌉ + ⌈log₂(k-1)⌉ + x
where x is an integer and x ≥ 2.
In addition, at step 706, the method includes identifying the maximum exponent sum (e_max) among the exponent sums (eab_i) of the set of 'k' pairs of floating point numbers. The maximum exponent sum (e_max) is identified by the maximum exponent detection unit 304. The maximum exponent detection unit 304 implements an algorithm, such as a maximum function, for identifying the maximum value among the set of values (the exponent sums eab_i). Step 706 may be performed before or after step 704, or may even be performed in parallel with step 704.
The method further includes, at step 708, aligning the magnitude bits of the numbers y_i based on the maximum exponent sum (e_max). Each number y_i is thereby converted into an integer expressed as a fixed point number having a bit length of n bits. The method of aligning the magnitude bits of the numbers is discussed above with respect to FIG. 5. The alignment of the magnitude bits of the numbers based on the maximum exponent sum is performed by the alignment unit 306. The alignment unit 306 thus generates the aligned numbers, which are the integers v_i.
The method also includes, at step 710, simultaneously processing the set of 'k' aligned numbers v_i to generate an output value o. Processing the integers v_i includes performing the addition of the k numbers. Note that addition of positive and negative numbers is equivalent to performing subtraction, and thus the term processing is used herein to encompass the effect of both addition and subtraction, alone or in combination. The processing of the k numbers is performed simultaneously. That is, the processing unit processes all integers in the group at the same time, rather than, for example, sequentially processing elements of the group or processing elements of the group in pairs. The processing unit 308 performs the addition of the k integers to generate the output value.
Further, at step 712, the method includes renormalizing and rounding the output value o to represent the output value as a floating point number, in any desired format, having a normalized mantissa n_k and an exponent e_k. The method includes renormalizing the output value so as to represent the output value o as a standard normalized number. Furthermore, the method rounds the normalized number n_k so as to represent the number with a mantissa having a particular bit length. For example, the normalized number is rounded to a certain length depending on the required precision (e.g. depending on whether the purpose is to emulate an arrangement such as FIG. 1 or FIG. 2). Normalization is performed by initially counting the number of '0' bits or '1' bits that occur repeatedly on the MSB side. When the output value o is a positive number, the repeated '0' bits are counted. When the output value o is negative, the repeated '1' bits are counted. Further, normalization is performed by shifting the bits of the output value o relative to the binary point so as to represent the signed number as a standard normalized number. Furthermore, the method calculates the exponent value based on the maximum exponent sum and the count of repeated bits. Thus, the output o is normalized so as to be represented as a floating point number in the output format.
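The overall method of FIG. 7 can be modelled functionally with the following Python sketch. It is illustrative only: inputs are assumed to be given as (sign, mantissa, exponent) tuples with explicit leading bits and unbiased exponents, truncation is used wherever a rounding choice is left open, and the final conversion returns a Python float rather than a formatted floating point number.

```python
from math import ceil, log2

def dot_product(a, b, p, q, r):
    """Functional sketch of the method of FIG. 7 (steps 701-712), for k >= 2.

    a, b: lists of (sign, mantissa, exponent) tuples.  Mantissas are unsigned
    integers of p+1 and q+1 bits with an explicit leading '1' (value
    mantissa * 2**-p, respectively mantissa * 2**-q); exponents are unbiased.
    """
    k = len(a)
    lsb_ext = ceil(log2(k - 1)) + 1              # additional LSBs added at step 704

    z, eab = [], []
    for (sa, ma, ea), (sb, mb, eb) in zip(a, b):
        mab = ma * mb                            # step 702: mantissa product
        prod_bits = (p + 1) + (q + 1)            # fixed-point width with two integer bits
        if prod_bits > r:
            mab >>= prod_bits - r                # truncate to r bits (one rounding choice)
        else:
            mab <<= r - prod_bits                # pad with zero LSBs up to r bits
        z.append((sa ^ sb, mab))                 # sign of the product via XOR
        eab.append(ea + eb)                      # step 703: exponent sums

    e_max = max(eab)                             # step 706: maximum exponent sum

    v = []
    for (s, mag), e in zip(z, eab):
        y = mag << lsb_ext                       # step 704: second format (extra zero LSBs)
        y >>= e_max - e                          # step 708: align, truncating shifted-out bits
        v.append(-y if s else y)                 # negatives held as two's complement in hardware

    o = sum(v)                                   # step 710: simultaneous addition

    # Step 712: renormalisation.  The r-bit product carries two integer bits,
    # so the aligned integers have (r - 2) + lsb_ext fraction bits below 2**e_max.
    frac_bits = (r - 2) + lsb_ext
    return o * 2.0 ** (e_max - frac_bits)

# Usage sketch: two sets of three bfloat16-like numbers (p = q = 7, r = 17).
a = [(0, 0b10000000, 0), (1, 0b11000000, -2), (0, 0b10100000, 3)]
b = [(0, 0b10000000, 1), (0, 0b10010000, 0), (1, 0b10000000, -1)]
print(dot_product(a, b, p=7, q=7, r=17))         # -> -3.421875
```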
Furthermore, the architecture 300 may also be implemented as a dot product unit, as shown in FIG. 11, for multiplying two sets of floating point numbers in an optimized manner. The dot product unit 1100 includes: a multiplication unit 1101 comprising a plurality of multiplier units 1101a, 1101b, … 1101k-1; an alignment unit comprising a plurality of shifter units 1106a, 1106b, … 1106k-1; an accumulator unit 1108; and a normalization unit 1110.
The dot product unit 1100 receives a large array of floating point numbers including a first set of k floating point numbers (a_0, a_1, a_2 … a_{k-1}) and a second set of k floating point numbers (b_0, b_1, b_2 … b_{k-1}). The plurality of multiplier units 1101a, 1101b, … 1101k-1 perform the multiplication of the mantissas ma_i and mb_i, as explained with respect to FIGS. 3 and 5. However, each multiplier unit 1101i generates two intermediate mantissa products, a first intermediate mantissa product m_i' and a second intermediate mantissa product m_i'', such that the sum of m_i' and m_i'' gives the full-precision mantissa product mab_i. This feature takes advantage of the fact that hardware multipliers typically operate on a shift-and-add basis, so that the final computation step is typically an addition of two numbers. In this example, that last addition may be omitted, since the multiplication is in any case followed by an addition, so two inputs instead of one can be sent to the subsequent addition (i.e. the multiplication result is in carry-save form). This increases the number of values to be summed in the next stage, but reduces the size of the multiplication unit required to implement the dot product unit, which may be desirable to reduce the latency or area of the implementation. As will be apparent from consideration of FIGS. 3 and 5, this results in 2k product numbers being output from the mantissa multiplication unit, namely the k product numbers z_i' and the k product numbers z_i'' generated by rounding or padding the intermediate mantissa products m_i' and m_i''. To ensure the same precision as when the full-resolution multiplication output (i.e. the full mantissa product mab_i) is used, the bit length r of the carry-save outputs (i.e. of the product numbers z_i' and z_i'' obtained from the intermediate mantissa products m_i' and m_i'') is extended by one precision bit compared to the bit length detailed above for using the full-resolution multiplication output.
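The idea of emitting the multiplication result in carry-save form can be illustrated with the following toy Python sketch; a real multiplier would obtain its two addends from the sum and carry vectors of its compression tree, whereas here the partial products are merely split into two groups.

```python
def multiply_carry_save(ma: int, mb: int) -> tuple[int, int]:
    """Toy model of a multiplier whose final carry-propagate addition is skipped.

    Returns two numbers whose sum equals ma * mb, which is enough to show that
    the dot product adder can accept two addends (m_i' and m_i'') per product
    instead of a single full-resolution result.
    """
    partials = [mb << i for i in range(ma.bit_length()) if (ma >> i) & 1]
    m1 = sum(partials[0::2])      # first intermediate mantissa product  (m_i')
    m2 = sum(partials[1::2])      # second intermediate mantissa product (m_i'')
    assert m1 + m2 == ma * mb
    return m1, m2

print(multiply_carry_save(0b1101, 0b1011))   # -> (99, 44); 99 + 44 == 13 * 11
```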
Further, the product numbers (z_i' and z_i'') from each multiplier unit 1101i are provided to a shifter unit 1106i in the alignment unit. Each shifter unit 1106i may include two shifters for shifting the product numbers z_i' and z_i''. In another example, the shifter unit may include only one shifter, and the product numbers z_i' and z_i'' may be provided sequentially to be shifted by that shifter.
The alignment unit, comprising the plurality of shifter units 1106a, 1106b, … 1106k-1, converts each product number z_i' and z_i'' to generate 2k numbers y_i in the second format. For simplicity, the 2k numbers y_i can be expressed as the k numbers y_i' and the k numbers y_i'' generated from the product numbers z_i' and z_i'' respectively. Each shifter unit aligns the numbers y_i' and y_i'' in the second format based on the exponent sums and the maximum exponent sum, as explained in detail with reference to FIGS. 3 and 5, to generate the integers v_i' and v_i''. In addition, the numbers v_i' and v_i'' in the second format are provided to the accumulator 1108, which is the processing unit as shown in FIG. 3 or the adder as shown in FIG. 5.
The processing unit then adds the aligned numbers v_i' and v_i'' in the second format to obtain the output o. The normalization unit 1110 further normalizes the output, based on the exponent sums and the maximum exponent sum, to generate a normalized floating point number as the final output, as explained in detail with reference to FIGS. 3 and 5.
This architecture eliminates the final step of generating a full-resolution multiplication output in the multiplication unit. Instead, the intermediate mantissa products, which form a carry-save representation, are converted and added together. The architecture 1100 reduces carry-propagate adder delay and area. However, the architecture 1100 requires twice the shift operations required by the architecture 300 or 500, and thus (to avoid additional latency) twice the shifters.
In another embodiment of the dot product unit shown in FIG. 11, instead of the r+1 bits explained in the above paragraphs with reference to FIG. 11, each multiplier unit 1101i of the plurality of multiplier units 1101a, 1101b, … 1101k-1 performs the multiplication of the mantissas ma_i and mb_i to generate two product numbers (z_i' and z_i'') each having a bit length of r+log(k-1)+2 bits. As discussed above for the full-resolution multiplier example, this maintains additional precision from the multiplication into the accumulation stage of the dot product unit, by extending the multiplication output LSBs as far as the minimum additional LSB extension (which would otherwise be performed at the accumulation stage). Further, the product numbers (z_i' and z_i'') from each multiplier unit 1101i are provided to a shifter unit 1106i in the alignment unit. The alignment unit also performs the steps explained for FIGS. 3, 5 and 11 to generate the integers v_i' and v_i''. In addition, the numbers v_i' and v_i'' in the second format are processed to obtain the output of the dot product.
FIG. 8 is a diagram illustrating a comparison of a particular implementation of the architecture 300 with other standard architectures for processing a set of floating point numbers. This is particularly relevant to the accumulation stage of a dot product unit. However, for clarity, it is noted that FIG. 8 is discussed more generally than in the context of a dot product unit.
FIG. 8 shows a graphical representation of the results of a first experiment comparing a specific implementation of the architecture 300 explained in FIG. 3 with other standard architectures. The first experiment involved comparing the area versus latency trade-offs of the different architectures. In the first experiment, as shown in FIG. 8, three architectures (arch 1, arch 2 and arch 3) were compared. Arch 1 is a balanced-tree implementation of floating point adders (with pairwise round-to-nearest, ties-to-even rounding). The results obtained for arch 1 are represented by the cross (+) symbols on the first curve in FIG. 8. Arch 2 is another balanced-tree implementation of floating point adders (with pairwise exact rounding). The results obtained for arch 2 are represented by the circle (o) symbols on the second curve shown in FIG. 8. Arch 3 is a specific implementation of the architecture 300 disclosed in this document (with exact rounding). The results obtained for arch 3 are represented by the square symbols on the third curve in FIG. 8. In the first experiment, the three architectures were implemented in VHDL.
In the first experiment, a set of single precision (32-bit) floating point numbers was used as input. Each floating point number includes a mantissa m_i with a bit length 'r' of 24 bits (r=24 bits), an exponent e_i with a bit length 't' of 8 bits (t=8 bits), and a sign bit. The first experiment involved synthesizing the three architectures for various timing targets in order to observe the area versus delay trade-off. From the graph in FIG. 8, it is observed that arch 3 (architecture 300) has the smallest delay and the smallest area. In particular, the fastest circuit synthesized from the architecture 300 uses less than 50% of the area of the fastest circuits synthesized from the other architectures under consideration, with less than 50% of the delay.
In addition, the complexity of the hardware implementations of the different architectures was also compared. The complexity of a hardware implementation, such as its critical path, is expressed using big-O notation. For the architecture 300, the maximum exponent detection unit 304 is implemented with O(log(k)log(t)) gates on the critical path (where k is the number of values summed by the adder). Furthermore, the alignment unit 306 is implemented with O(log(r)) gates on the critical path. The processing unit 308 (i.e. adder 508) is implemented with O(log(k)+log(r)) gates on the critical path. The renormalization unit 310 is implemented with O(log(t)) gates on the critical path. Thus, the overall hardware implementation can be realized with O(log(k)log(t)+log(r)) logic gates on its critical path. As the array size k and the mantissa width r increase, this critical path is asymptotically shorter than that of the balanced tree of floating point adders.
For the same input set of k floating point numbers, a direct implementation of a multi-input adder consisting of a binary tree of floating point adders (with a fixed rounding mode, e.g. rounding towards zero) is considered below. By construction, this implementation yields a pairwise exactly rounded sum. The critical path in the balanced tree of floating point adders passes through O(log(k)) adders, each having O(log(rt)) gates on its critical path. In total, the architecture of the balanced tree of floating point adders thus has O(log(k)log(rt)) logic gates on its critical path.
Furthermore, a particular implementation of the architecture 300 generates an output with a worst-case precision that is no worse than the pairwise addition performed when using either the architecture 100 of a binary tree of adders and multipliers or the architecture 200 of fused multiply-add units with exact rounding. A mathematical proof of the precision of the architecture 300 is provided later. This shows that the precision of the floating point summation result is no lower than the worst-case pairwise floating point addition under an exact rounding scheme. This means that, for any given array to be summed, performing a pairwise addition by iteratively replacing two of the entries in the array with their sum rounded to the nearest larger or smaller representable value can always produce a result that is less accurate than, or equal to, the result produced by the addition forming part of the disclosed architecture 300. In known architectures, one imprecise choice of ordering the inputs and performing the rounding steps is to add the numbers in order of increasing magnitude to the largest number and to round in the same direction throughout. Since the precision of the intermediate multiplication results is also not lower than in the architecture 100 or 200, the precision of the output obtained by the architecture 300 is not lower than the precision of the results obtained by making these choices.
By removing the intermediate normalization steps and replacing the intermediate carry-propagation steps with a single carry-save addition, the latency and area performance of the architecture 300 is significantly improved compared to a floating point adder tree, as shown in FIG. 8. The empirical precision of the architecture 300 described above, as measured on zero-centred Gaussian-distributed inputs, shows a significant advantage over a floating point adder tree.
Finally, the architecture 300 is commutative with respect to its additions, such that any ordering of the input pairs (a_i, b_i) produces the same output. This results in better reproducibility of results, because the order in which the floating point numbers of the two groups are bound to the inputs of the architecture 300 does not affect the result.
A mathematical proof of the precision of the architecture 300 is provided below. In the following section, it is demonstrated that the disclosed algorithm is no less accurate than the worst-case iterative pairwise addition with exact rounding.
First, some basic properties of exact rounding schemes are defined and proved. Let F1 and F2 be two number formats, and let r ∈ ℝ ∪ {±∞} be a number. When q is the least upper bound or greatest lower bound of r in a format F, we say that q is an exact rounding of r in format F, written q ≈_F r.
When both the least upper bound and the greatest lower bound of r in F2 belong to F1, we say that F1 is more accurate than F2 in the neighbourhood of r. The following proposition then follows directly.
Proposition 1: if F1 is more accurate than F2 in the neighborhood of r, then for all values q1, q2, q1≡q F1 r and q2≡ F2 q1, we obtain q2.apprxeq F2 r。
Now let H be the floating point format used at the input and output of the calculation, with 't' exponent bits, 'r' mantissa bits (including R fraction bits and r−R explicit leading bits), and an exponent bias 'c'. We assume that the mantissas are normalized and that their precision is limited to at most R+1 consecutive non-zero bits. In other words, the numbers in format H have mantissas such that at least one of the r−R leading bits is '1', and such that, when the i MSBs are '0' for some 0 ≤ i ≤ r−R−1, the r−R−1−i LSBs are '0'. Furthermore, a set of 'k' floating point numbers x_0, …, x_{k−1} in format H is given as input. The algorithm operates by converting to a fixed point format G aligned on the maximum exponent e_max in the array.
A number in format G is given by an n-bit signed integer v, lying within the range representable with n bits, and represents a real number proportional to v (scaled according to the maximum exponent e_max). Each input floating point number x_i is converted into a fixed point value that is an exact rounding of x_i in G, where the choice of rounding is left to the specific implementation.
The fixed point values are then added together, and their sum is converted back to the original format, producing a result y that is an exact rounding, in the original format, of the fixed point sum, where the choice of rounding is again left to the specific implementation.
For the purposes of this analysis, the numbers in the input array are divided into two categories: small numbers, whose absolute value lies below a threshold, and large numbers, whose absolute value is at least that threshold; the threshold is such that every large number is exactly representable in G. The input array is reordered into a small number array x′_0, …, x′_{k′−1} followed by a large number array x′_{k′}, …, x′_{k−2}, x′_{k−1}, such that x′_{k−1} has the exponent e_max. The count k′ of small numbers satisfies 0 ≤ k′ ≤ k−1. For all i = 0, …, k−1, n_i′ and e_i′ denote respectively the mantissa and exponent of x_i′, and v_i′ denotes the conversion of x_i′ to the fixed point format G, so that v_i′ ≈_G x_i′. Note that while the small numbers may incur rounding errors, the large numbers are represented exactly. This is because the mantissas are normalized, which ensures that, in a large number, every bit whose weight is smaller than the smallest weight representable in G is '0'.
A sequence of intermediate values w_i is formed for i = 0, …, k−2, with w_{k−1} = y. In this decomposition, two lemmas concerning the magnitudes of the intermediate sums are proved.
First, it is shown that no underflow occurs in format G when the small numbers are added to the number with the largest exponent.
Lemma 1: if x l Normalized, then
for all i = 1, …, k′−1.
And (3) proving: by generalization, it was demonstrated that, for all i=1, …, k' -1, Let i be an integer between 1 and k' -1. First, by the triangle inequality, +.> Subsequently, since i.ltoreq.k' -1, we have obtained by assuming thatFurthermore, the->Can be represented by G, so after rounding we get +.>
If i=1, then due to x l Can be represented by G, thusAnd then due to x l Is normalized, thus->
If i>1, by generalizing the assumptions,
in either case, obtain Ending the generalization. As a direct result, we obtain So that when k '. Ltoreq.k-1, k' -1, & lt for all i=1, …>
Next, it is shown that no overflow occurs in format G when all the other numbers are added to the number with the largest exponent.
And (4) lemma 2: if x l Normalized, then for all i=1, …, k-1,
and (3) proving: note that due to e max Is the largest input index and thus, for all i=0, …, k-1,and then due to the value->Can also be represented by G, thus +.>It follows that by direct generalization of i, k-1, < > for all i=1, …> In turn->
The worst-case accuracy of the summation performed by the architecture 300 is stated and proved in the following theorem.
Theorem 1: for any array x 0 ,…,x k-1 There is x 0 ,…,x k-1 Is to precisely round the sum z in pairs such that a set of multiplication results x 0 ,…,x k-1 Application to architecture 300 produces an output y such that
And (3) proving: consider sequence l 0 ,…,l k-1 Such that for all i=1, …, k-1, l 0 =x′ k-1 And l i Is l i-1 +x i-1 The maximum lower bound in H and considers sequence u 0 ,…,u k-1 So that for all i=1, …, k-1, u 0 =x l And u i Is u i-1 +x i-1 The minimum upper bound in H. These sequencesThe columns define pairs of precisely rounded sums l obtained by systematically rounding the intermediate sums in the same direction k-1 And u k-1 . It is obvious that the process is not limited to, let us say for all i=0, …, k-1, l i ≤w i ≤u i So that l k-1 ≤y≤u k-1 . This immediately gives rise to a signal for z=l k-1 Or z=u k-1 At least one of the above-mentioned materials,
we now demonstrate by induction that for all i=0, …, k-1, l i ≤w i ≤u i
i=0: by definition we get l 0 =u 0 =x l And when w 0G x′ k-1 When w is 0 =x l And x' k-1 May be denoted G.
i = 1, …, k'−1: By the induction hypothesis, l_{i-1} ≤ w_{i-1} ≤ u_{i-1}, and therefore l_{i-1} + x'_{i-1} ≤ w_{i-1} + x'_{i-1} ≤ u_{i-1} + x'_{i-1}. Furthermore, by Lemmas 1 and 2 the sum w_{i-1} + x'_{i-1} neither underflows nor overflows in G, so G is at least as accurate as H at w_{i-1} + x'_{i-1}, because for this exponent range G accommodates at least as many mantissa bits as H. Since G is a fixed-point format and the sum does not overflow or underflow, rounding x'_{i-1} to G and then adding the result to w_{i-1} is equivalent to adding x'_{i-1} to w_{i-1} and then rounding the result. Now, l_i is the greatest lower bound in H of l_{i-1} + x'_{i-1}, and is therefore less than or equal to the greatest lower bound in G of that sum. Similarly, u_i is the least upper bound in H of u_{i-1} + x'_{i-1}, and is therefore greater than or equal to the least upper bound in G of that sum. Thus, by the definition of exact rounding, l_i ≤ w_i ≤ u_i.
i = k', k'+1, …, k−2: By the induction hypothesis, l_{i-1} ≤ w_{i-1} ≤ u_{i-1}, and therefore l_{i-1} + x'_{i-1} ≤ w_{i-1} + x'_{i-1} ≤ u_{i-1} + x'_{i-1}. Since the number added at this step is a large number, it is representable in G, and following Lemma 2 we get w_i = w_{i-1} + x'_{i-1} exactly. Since l_i ≤ l_{i-1} + x'_{i-1} and u_i ≥ u_{i-1} + x'_{i-1}, we obtain l_i ≤ w_i ≤ u_i.
i = k−1: From the induction hypothesis, l_{k-2} + x'_{k-2} ≤ w_{k-2} + x'_{k-2} ≤ u_{k-2} + x'_{k-2}. Using an argument similar to the previous cases, either x'_{k-2} is representable in G or G is at least as accurate as H in the neighbourhood of w_{k-2} + x'_{k-2}. Then, from proposition 1, w_{k-1} is obtained by rounding w_{k-2} + x'_{k-2} to H, and by the definition of exact rounding, l_{k-1} ≤ w_{k-1} ≤ u_{k-1}.
Consider a floating point format H' that is more accurate than H, obtained by extending the bit length of H with additional mantissa bits on the LSB side or by allowing more than r+1 consecutive non-zero bits. The worst-case accuracy of the summation of architecture 300 configured with format H', taking into account any intermediate rounding direction, is at least as high as the worst-case accuracy of the summation of architecture 300 configured with format H, taking into account any intermediate rounding direction. This is because the interval between the possible roundings of a value to format H is at least as wide as the interval between its possible roundings to format H', so any rounding to H' lies within the range of roundings to H. Thus, over all possible rounding directions, the range of possible outputs of the summation when format H' is used is contained within the range of possible outputs when format H is used.
In the conventional architecture 100 using separate multiplications and additions, each multiplication result is rounded or padded to include P fractional bits. Thus, after rounding or padding, the output mantissa of a multiplication in the conventional architecture 100 has a normalized form with up to P+1 consecutive non-zero bits. When configured to emulate architecture 100, the multiplication unit 301 that is part of architecture 300 produces at least P+2 mantissa bits, including P fractional bits and possibly more than P+1 consecutive non-zero bits, resulting in a format H' that may be more accurate than H. Thus, each input of the alignment unit 306 lies between the possible rounded values that could be produced by the multipliers in the conventional architecture 100. It follows that the overall result of the dot product implementation 300 lies between the minimum and maximum values obtainable with the conventional architecture 100 over any accumulation order and rounding direction. In other words, pairwise exactly rounded accuracy is guaranteed.
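As an illustration only, the following Python sketch shows one way an intermediate mantissa product could be rounded or padded to a target bit width as described above; the function name and the round-half-to-even tie-break are assumptions, since the choice of rounding is left to the specific implementation.

```python
def fit_product_mantissa(mab, prod_bits, target_bits):
    """Fit an intermediate mantissa product of prod_bits bits into target_bits bits.

    target_bits plays the role of r + log(k-1) + 1 and prod_bits the role of p + q + 2.
    If the product is too wide it is rounded (round half to even here, an assumption);
    if it is too narrow it is padded with zero least significant bits."""
    if prod_bits > target_bits:
        shift = prod_bits - target_bits
        kept, dropped = mab >> shift, mab & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if dropped > half or (dropped == half and (kept & 1)):
            kept += 1  # a carry out of the top bit would need renormalization, not shown
        return kept
    return mab << (target_bits - prod_bits)  # pad extra zero LSBs
```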
In the conventional architecture 200 using fused multiplication and addition, each intermediate multiplication result retains the complete p+q fractional bits. Thus, the internal mantissa of a multiplication in the conventional architecture 200 has a normalized form with up to p+q+2 consecutive non-zero bits. When configured to emulate architecture 200, the multiplication unit 301 that is part of architecture 300 generates max(Q+2, p+q+3) mantissa bits, including max(Q, p+q) fractional bits and up to max(Q+1, p+q+2) consecutive non-zero bits, resulting in format H. Thus, each input of the alignment unit 306 holds the same value as the corresponding intermediate multiplication result in the conventional architecture 200. It follows that the overall result of the dot product implementation 300 lies between the minimum and maximum values obtainable with the conventional architecture 200 over any accumulation order and rounding direction. In other words, triplet-wise exactly rounded accuracy is guaranteed.
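To make the overall data path concrete, here is a short, illustrative Python model of the multiply-align-add flow: multiply each pair, identify the maximum exponent sum, align every product to it by shifting towards the LSB, then add all aligned terms at once. The fixed fractional width, the truncating right shift and all names are assumptions made for the sketch and do not reflect the exact widths or rounding of the hardware described herein.

```python
def dot_product(a, b, frac_bits=24):
    """Toy model of the multiply-align-add data path: a and b are lists of
    (mantissa, exponent) pairs, each representing mantissa * 2**exponent."""
    # 1) Multiply: integer mantissas multiply exactly; exponents add.
    products = [(ma * mb, ea + eb) for (ma, ea), (mb, eb) in zip(a, b)]
    # 2) Find the maximum exponent sum e_max.
    e_max = max(e for _, e in products)
    # 3) Align every product to e_max and 4) add all aligned terms at once.
    acc = 0
    for m, e in products:
        sh = frac_bits - (e_max - e)  # position of this term in the shared format
        acc += (m << sh) if sh >= 0 else (m >> -sh)  # right shift truncates (assumption)
    return acc * 2.0 ** (e_max - frac_bits)

# 1.5 * 2.0 + 0.15625 * 1.0 = 3.15625
a = [(3, -1), (5, -5)]  # 3 * 2**-1 = 1.5, 5 * 2**-5 = 0.15625
b = [(2, 0), (1, 0)]    # 2.0, 1.0
print(dot_product(a, b))  # 3.15625
```

Because all aligned terms are added in a single pass, the result does not depend on an accumulation order, which is the property the analysis above relies on.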
FIG. 9 illustrates a computer system in which the graphics processing system described herein may be implemented. The computer system includes a CPU 902, a GPU 904, a memory 906, and other devices 914, such as a display 916, speakers 918, and a camera 908. Processing block 910 (corresponding to processing block 110) is implemented on GPU 904. In other examples, processing block 910 may be implemented on CPU 902. The components of the computer system may communicate with each other via a communication bus 920. The repository 912 (corresponding to the repository 112) is implemented as part of the memory 906.
While FIG. 9 illustrates an embodiment of a graphics processing system, it should be appreciated that a similar block diagram may be drawn for an artificial intelligence accelerator system-e.g., by replacing the GPU 904 with a Neural Network Accelerator (NNA), or adding the NNA as an additional element. In such cases, the architecture 300 of the adder may be implemented in the NNA.
The adder described herein may be embodied in hardware on an integrated circuit. Generally, any of the functions, methods, techniques, or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for execution by a processor, including code expressed in machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in programming language code such as C, java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when properly executed, handled, interpreted, compiled, run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.
The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a set or portion thereof, that has processing capabilities such that it can execute instructions. The processor may be any kind of general purpose or special purpose processor, such as CPU, GPU, NNA, a system on a chip, a state machine, a media processor, an Application Specific Integrated Circuit (ASIC), a programmable logic array, a Field Programmable Gate Array (FPGA), or the like. The computer or computer system may include one or more processors.
The present invention is also intended to cover software defining a configuration of hardware as described herein, such as Hardware Description Language (HDL) software, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer readable storage medium may be provided having encoded thereon computer readable program code in the form of an integrated circuit definition data set (which may also be referred to as a hardware design) that, when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a computing device comprising any of the apparatus described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.
Accordingly, a method of fabricating an adder architecture as described herein at an integrated circuit fabrication system may be provided. Furthermore, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, causes a method of manufacturing an adder to be performed.
The integrated circuit definition data set may be in the form of computer code, for example, as a netlist, code for configuring a programmable chip, as a hardware description language defining hardware suitable for fabrication at any level in an integrated circuit, including as Register Transfer Level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), and as a low-level circuit representation (such as OASIS (RTM) and GDSII). A higher-level representation, such as RTL, that logically defines hardware suitable for fabrication in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of the integrated circuit, in the context of a software environment that includes definitions of circuit elements and rules for combining those elements, so as to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case when software is executed at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required for a computer system configured to generate a manufacturing definition of an integrated circuit to execute code defining the integrated circuit so as to generate that manufacturing definition.
An example of processing an integrated circuit definition data set (e.g., a hardware design) on an integrated circuit manufacturing system to configure the system to manufacture an adder will now be described with respect to fig. 10.
Fig. 10 illustrates an example of an Integrated Circuit (IC) fabrication system 1002 configured to fabricate an adder as described in any of the examples herein. In particular, the IC fabrication system 1002 includes a layout processing system 1004 and an integrated circuit generation system 1006. The IC fabrication system 1002 is configured to receive an IC definition dataset/hardware design (e.g., defining an adder as described in any of the examples herein), process the IC definition dataset, and generate an IC (e.g., embodying an adder as described in any of the examples herein) from the IC definition dataset. The processing of the IC definition data set configures the IC fabrication system 1002 to fabricate an integrated circuit embodying an adder as described in any of the examples herein.
Layout processing system 1004 is configured to receive and process the IC definition dataset/hardware design to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art and may involve, for example, synthesizing RTL codes to determine a gate level representation of a circuit to be generated, for example in terms of logic components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout may be determined from the gate level representation of the circuit. This may be done automatically or with the participation of a user in order to optimize the circuit layout. When the layout processing system 1004 has determined a circuit layout, it may output the circuit layout definition to the IC generation system 1006. The circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1006 generates ICs according to a circuit layout definition, as is known in the art. For example, the IC generation system 1006 may implement a semiconductor device fabrication process to generate ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process to produce an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 1006 may be in the form of computer readable code that the IC generation system 1006 may use to form a suitable mask for generating the IC.
The different processes performed by IC fabrication system 1002 may all be implemented at one location, e.g., by a party. Alternatively, IC fabrication system 1002 may be a distributed system such that some processes may be performed at different locations and by different parties. For example, some of the following phases may be performed at different locations and/or by different parties: (i) Synthesizing an RTL code representing the IC definition dataset to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.
In other examples, processing the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture the adder without processing the integrated circuit definition data set to determine the circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor, such as an FPGA, and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.
In some embodiments, the integrated circuit manufacturing system may be caused to generate an apparatus as described herein when the integrated circuit manufacturing definition dataset/hardware design is processed in the integrated circuit manufacturing system. For example, by configuring an integrated circuit manufacturing system in the manner described above with reference to fig. 10 via an integrated circuit manufacturing definition dataset, an apparatus as described herein may be manufactured.
In some examples, the integrated circuit definition dataset may include software running on or in combination with hardware defined at the dataset. In the example shown in fig. 10, the IC generation system may be further configured by the integrated circuit definition dataset/hardware design to load firmware onto the integrated circuit in accordance with program code defined at the integrated circuit definition dataset at the time of manufacturing the integrated circuit or to otherwise provide the program code to the integrated circuit for use with the integrated circuit.
Embodiments of the concepts set forth in the present application in apparatuses, devices, modules, and/or systems (and in methods implemented herein) may result in performance improvements over known embodiments. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatuses, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvements and physical implementation, thereby improving the manufacturing method. For example, a tradeoff may be made between performance improvement and layout area, matching the performance of known embodiments, but using less silicon. This may be accomplished, for example, by reusing the functional blocks in a serial fashion or sharing the functional blocks among elements of a device, apparatus, module, and/or system. Rather, the concepts described herein that lead to improvements in the physical implementation of devices, apparatus, modules and systems, such as reduced silicon area, can be weighed against performance improvements. This may be accomplished, for example, by fabricating multiple instances of the module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the application.
Cross Reference to Related Applications
The present application claims priority from UK patent applications 2202126.5 and 2202128.1, filed on 17 February 2022, which are incorporated herein by reference in their entirety.

Claims (20)

1. A method of performing a dot product of an array of '2k' floating point numbers, k ≥ 3, using a hardware implementation, the array comprising a first set of k floating point numbers a_0, a_1, …, a_{k-1} and a second set of k floating point numbers b_0, b_1, …, b_{k-1}, wherein the method comprises:
receiving the two sets of 'k' floating point numbers;
multiplying each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r+log(k-1)+1' bits;
creating, based on the k product numbers (z_i), a set of 'k' numbers (y_i), the numbers (y_i) having a bit length of 'n' bits obtained by adding at least an additional most significant bit to the bit length of the product numbers (z_i), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is 'r+log(k-1)+1+x' bits, wherein x is an integer and x ≥ 1;
identifying the maximum exponent sum (e_max) of k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i;
aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max); and
adding the set of 'k' numbers together at the same time.
2. The method of claim 1, wherein each floating point number of the first set of k floating point numbers a_0, a_1, …, a_{k-1} comprises a mantissa (ma_i) and an exponent (ea_i), and each floating point number of the second set of k floating point numbers b_0, b_1, …, b_{k-1} comprises a mantissa (mb_i) and an exponent (eb_i), wherein each mantissa (ma_i) has a bit length of 'p' bits, and each mantissa (mb_i) has a bit length of 'q' bits.
3. The method of claim 2, wherein multiplying each floating point number a_i by the corresponding floating point number b_i comprises multiplying the mantissa (ma_i) by the mantissa (mb_i) to obtain an intermediate mantissa product (mab_i).
4. A method as claimed in any of claims 1 to 3, wherein the method of performing the dot product emulates the accuracy obtained using separate multiplication and addition units for performing a dot product having an output mantissa bit length of P bits, by setting the value of 'r' to 'r = P+1-log(k-1)' bits.
5. A method as claimed in claim 3, wherein generating the k product numbers (z_i) having the mantissa bit length of 'r+log(k-1)+1' bits comprises:
if p+q+2 > r+log(k-1)+1 bits, rounding the intermediate mantissa product (mab_i) to the r+log(k-1)+1 bits; or
if p+q+2 < r+log(k-1)+1 bits, padding additional least significant bits into the intermediate mantissa product (mab_i) to produce r+log(k-1)+1 bits.
6. A method as claimed in any preceding claim, wherein identifying the maximum exponent sum (e_max) comprises identifying the maximum value among the k exponent sums (eab_i), wherein the k exponent sums (eab_i) are obtained by summing the exponents (ea_i) and the exponents (eb_i).
7. A method as claimed in any preceding claim, wherein an additional most significant bit is added to the product number (z i ) Comprises adding at leastA number of said most significant bits.
8. A method as claimed in any preceding claim, wherein adding at least an additional most significant bit to the bit length of the product number (z_i) further comprises adding one or more least significant bits to the bit length of the product number (z_i).
9. The method of any preceding claim, wherein the method further comprises:
calculating an output value by processing the 'k' numbers (y_i);
re-normalizing the output value; and
rounding the output value to represent the output value as a floating point number.
10. A method as claimed in any preceding claim, wherein aligning the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max) comprises, for each floating point number (i):
calculating the difference (e_d) between the maximum exponent sum (e_max) and each exponent sum (eab_i); and
shifting the magnitude bits of the corresponding number (y_i) towards the LSB side based on the calculated difference (e_d).
11. A hardware implementation for performing a dot product of an array of '2k' floating point numbers, k ≥ 3, the array comprising a first set of k floating point numbers a_0, a_1, …, a_{k-1} and a second set of k floating point numbers b_0, b_1, …, b_{k-1}, wherein the hardware implementation comprises:
a multiplication unit comprising a plurality of multipliers configured to:
receive the two sets of 'k' floating point numbers; and
multiply each floating point number a_i by the floating point number b_i to generate k product numbers (z_i), each product number (z_i) having a mantissa bit length of 'r+log(k-1)+1' bits;
a format conversion unit configured to:
create, based on the k product numbers (z_i), a set of 'k' numbers (y_i), the numbers (y_i) having a bit length of 'n' bits obtained by adding at least an additional most significant bit to the bit length of the product numbers (z_i), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is 'r+log(k-1)+1+x' bits, wherein x is an integer and x ≥ 2;
a maximum exponent detection unit configured to identify the maximum exponent sum (e_max) of k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i;
an alignment unit configured to align the magnitude bits of the numbers (y_i) based on the maximum exponent sum (e_max); and
a processing unit configured to simultaneously add the set of 'k' numbers to generate an output value.
12. The hardware implementation of claim 11, further comprising a renormalization unit configured to:
re-normalize the output value; and
round the output value to represent the output value as a floating point number.
13. The hardware implementation of claim 11 or claim 12, wherein each floating point number of the first set of k floating point numbers a_0, a_1, …, a_{k-1} comprises a mantissa (ma_i) and an exponent (ea_i), and each floating point number of the second set of k floating point numbers b_0, b_1, …, b_{k-1} comprises a mantissa (mb_i) and an exponent (eb_i), wherein each mantissa (ma_i) has a bit length of 'p' bits, and each mantissa (mb_i) has a bit length of 'q' bits.
14. A hardware implementation as claimed in any of claims 11 to 13, wherein the multiplication unit comprises a plurality of multiplier units configured to multiply each mantissa (ma_i) simultaneously by the corresponding mantissa (mb_i) to obtain an intermediate mantissa product (mab_i).
15. The hardware implementation of any of claims 11 to 14, wherein the hardware implementation for performing a dot product operation emulates the accuracy obtained using separate multiplication and addition units for performing a dot product having an output mantissa bit length of P bits, by setting the value of 'r' to 'r = P+1-log(k-1)'.
16. The hardware implementation of any of claims 11 to 15, wherein the multiplication unit is configured to generate the k product numbers (z_i) having the mantissa bit length of 'r+log(k-1)+1' bits by:
if p+q+2 > r+log(k-1)+1 bits, rounding the intermediate mantissa product (mab_i) to the r+log(k-1)+1 bits; or
if p+q+2 < r+log(k-1)+1 bits, padding additional least significant bits into the intermediate mantissa product (mab_i) to produce r+log(k-1)+1 bits.
17. The hardware implementation of any of claims 11 to 16, wherein the maximum exponent detection unit is configured to identify the maximum exponent sum (e_max) among k exponent sums (eab_i), wherein the k exponent sums (eab_i) are obtained by summing the exponents (ea_i) and the exponents (eb_i).
18. A method of performing a dot product of an array of '2k' floating point numbers, k ≥ 3, using a hardware implementation, the array comprising a first set of k floating point numbers a_0, a_1, …, a_{k-1} and a second set of k floating point numbers b_0, b_1, …, b_{k-1}, wherein the method comprises:
receiving the two sets of 'k' floating point numbers;
multiplying each floating point number a_i by the floating point number b_i, each multiplication generating a first intermediate product number (z_i') and a second intermediate product number (z_i''), to generate 2k product numbers comprising k first intermediate product numbers (z_i') and k second intermediate product numbers (z_i''), each having a bit length of 'r+log(k-1)+2' bits;
creating, based on the 2k product numbers, k first numbers (y_i') and k second numbers (y_i'') having a bit length of 'n' bits obtained by adding an additional most significant bit to the bit length of the product numbers (z_i' and z_i''), wherein the 'n' bits comprise a plurality of magnitude bits, wherein 'n' is 'r+log(k-1)+2+x' bits, wherein x is an integer and x ≥ 1;
identifying the maximum exponent sum (e_max) of k exponent sums (eab_i), each exponent sum being the sum of the exponents of the floating point number a_i and the floating point number b_i;
aligning the magnitude bits of the numbers (y_i' and y_i'') based on the maximum exponent sum (e_max); and
adding the set of '2k' numbers together at the same time.
19. An integrated circuit definition data set that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a hardware implementation as claimed in any one of claims 11 to 17.
20. A non-transitory computer readable storage medium having stored thereon a computer readable description of a hardware implementation as claimed in any one of claims 11 to 17 which, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the hardware implementation.
CN202310144578.3A 2022-02-17 2023-02-10 Method and system for calculating dot product Pending CN116610285A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GB2202128.1 2022-02-17
GB2202126.5 2022-02-17
GB2202126.5A GB2615773B (en) 2022-02-17 2022-02-17 Method and system for calculating dot products

Publications (1)

Publication Number Publication Date
CN116610285A true CN116610285A (en) 2023-08-18

Family

ID=80934498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310144578.3A Pending CN116610285A (en) 2022-02-17 2023-02-10 Method and system for calculating dot product

Country Status (2)

Country Link
CN (1) CN116610285A (en)
GB (1) GB2615773B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10175944B2 (en) * 2017-04-12 2019-01-08 Intel Corporation Mixed-precision floating-point arithmetic circuitry in specialized processing blocks
US11175891B2 (en) * 2019-03-30 2021-11-16 Intel Corporation Systems and methods to perform floating-point addition with selected rounding

Also Published As

Publication number Publication date
GB202202126D0 (en) 2022-04-06
GB2615773A (en) 2023-08-23
GB2615773B (en) 2024-02-14

Similar Documents

Publication Publication Date Title
CN114816331B (en) Hardware unit for performing matrix multiplication with clock gating
Brunie Modified fused multiply and add for exact low precision product accumulation
US20230221924A1 (en) Apparatus and Method for Processing Floating-Point Numbers
US20240126507A1 (en) Apparatus and method for processing floating-point numbers
CN110858137B (en) Floating point division divided by integer constant
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
US20220050665A1 (en) Method and system for processing floating point numbers
CN116610285A (en) Method and system for calculating dot product
CN116610284A (en) Method and system for calculating dot product
EP4231135A1 (en) Method and system for calculating dot products
Hass Synthesizing optimal fixed-point arithmetic for embedded signal processing
US20220334799A1 (en) Method of Performing Hardware Efficient Unbiased Rounding of a Number
US20230409287A1 (en) Accumulator hardware
US20220075598A1 (en) Systems and Methods for Numerical Precision in Digital Multiplier Circuitry
THUAN A NOVEL QUOTIENT PREDICTION FOR FLOATING-POINT DIVISION
Bommana et al. A Run-time Tapered Floating-Point Adder/Subtractor Supporting Vectorization
CN115543255A (en) Constant multiplication by division
JP2022101463A (en) Rounding circuitry for floating-point mantissa
GB2614327A (en) Configurable pooling process unit for neural network accelerator
CN114174982A (en) Calculation unit, method and computer program for multiplication
CN117785111A (en) Computing hardware block with adaptive fidelity

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication