CN117813585A - Systolic array with efficient input reduction and extended array performance - Google Patents
Systolic array with efficient input reduction and extended array performance
- Publication number
- CN117813585A (application number CN202280052183.4A)
- Authority
- CN
- China
- Prior art keywords
- input
- bit
- reduced
- bits
- reducer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8046—Systolic arrays
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/483—Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
- G06F7/487—Multiplying; Dividing
- G06F7/4876—Multiplying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/499—Denomination or exception handling, e.g. rounding or overflow
- G06F7/49942—Significance control
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/50—Adding; Subtracting
- G06F7/501—Half or full adders, i.e. basic adder cells for one denomination
Landscapes
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Nonlinear Science (AREA)
- Complex Calculations (AREA)
Abstract
Systems and methods for performing reduced-precision multiply-accumulate operations in systolic arrays are provided. Each row of the systolic array may receive a reduced input from a respective reducer. The reduced input may include a reduced input data element and/or a reduced weight. The systolic array may lack support for inputs having a first bit length, and the reducer may reduce the bit length of a given input from the first bit length to a shorter second bit length and provide the reduced input to the array. To reduce the bit length, the reducer may reduce the number of trailing bits of the input. Additionally, the systolic array may receive an input that has been reduced and rounded. The systolic array may propagate the reduced input through the processing elements of the systolic array. Each processing element may include a multiplier and/or adder to perform arithmetic operations based on the reduced input.
Description
Background
An artificial neural network is a computing system having an architecture based on a biological neural network. The neural network may be implemented with circuits such as systolic arrays and data paths. A systolic array can accelerate the training and inference phases of an artificial neural network. During the training phase, input data may be provided to train a model. During the inference phase, new inputs may be processed according to the model to obtain a prediction result. User applications typically use the model in the inference phase, so the inference phase is often time sensitive, and delays during the inference phase can negatively impact the user experience.
As more applications use artificial neural networks, those applications also use a wider range of numbers, including numbers with increased bit lengths (e.g., 32-bit floating point numbers), which may require greater computing power or modifications to the neural network. While computational support for numbers with increased bit lengths may provide increased accuracy for mathematical operations, providing that support may increase the complexity, size, and cost of the processing elements in a systolic array. These increases may also affect system processing speed and system power consumption. When a systolic array is required to support a wide range of numbers, power consumption and the size of the systolic array become significant concerns.
Drawings
Various features will now be described with reference to the following figures. Throughout the drawings, reference numerals may be repeated to indicate corresponding relationships between the referenced elements. The drawings are provided to illustrate examples described herein and are not intended to limit the scope of the disclosure.
Fig. 1A shows an exemplary 4 x 4 systolic array and an exemplary column of reducers.
Fig. 1B shows an exemplary 8 x 1 column of a systolic array.
Fig. 2A illustrates a processing element for neural network computation in which inputs enter through a separate reducer, in accordance with certain examples of the disclosed technology.
Fig. 2B illustrates a processing element for neural network computation in which inputs enter through the same reducer, in accordance with some examples of the disclosed technology.
Fig. 3 illustrates an apparatus including zero detector circuits for reduced input data elements and reduced weights entering a systolic array for neural network computation, in accordance with certain examples of the disclosed technology.
Fig. 4A illustrates a reducer showing selection of inputs to be reduced and rounded, according to some examples of the disclosed technology.
Fig. 4B illustrates a reducer showing selection of a rounded input to be reduced, in accordance with some examples of the disclosed technology.
Fig. 4C illustrates a reducer that illustrates generating multiple reduced inputs from a selected input, in accordance with some examples of the disclosed technology.
Fig. 5 illustrates multiply-accumulate data paths for neural network computation in accordance with certain examples of the disclosed technology.
Fig. 6 illustrates an apparatus for neural network computation in accordance with some examples of the disclosed technology.
Fig. 7 illustrates a method performed by a reducer and processing element for neural network computation, in accordance with some examples of the disclosed technology.
Fig. 8 illustrates a method performed by a reducer and processing element for neural network computation, in accordance with some examples of the disclosed technology.
Fig. 9A-9H illustrate an exemplary systolic array that processes data over a series of systolic intervals.
Fig. 10 illustrates an example of a computing device in accordance with certain aspects of the present disclosure.
Detailed Description
In general, the present disclosure relates to a systolic array that supports converting an input whose bit length exceeds the native support of the elements of the array into one or more reduced inputs. The input may be converted into a single reduced input for a single-pass, reduced-accuracy calculation on an input whose bit length exceeds the native support of the elements of the array. For example, the elements of the array may support single-pass computation on inputs having a particular bit length, and the systolic array may receive inputs from a reducer that reduces the bit length of the inputs to match the bit length natively supported by the elements during single-pass computation. The input may also be converted into multiple reduced inputs for a multi-pass, full-precision computation on an input whose bit length exceeds the native support of the elements of the array. As described herein, providing reduced inputs to the systolic array using such a reducer may enable inputs with arbitrary bit lengths to be given to the systolic array, with the inputs programmatically adjusted to a particular bit length (e.g., the highest bit length supported during single-pass computation) so that the user does not need to know the particular input bit length supported by the processing elements of the systolic array. While conventional systolic arrays may support different bit lengths, native support for single-pass computation of higher bit lengths may increase the size and power consumption of the systolic array and may affect the processing of shorter bit lengths. Thus, conventional systolic arrays must balance the ability to perform single-pass calculations on longer bit lengths against the efficiency of processing shorter bit lengths, which may result in systolic arrays not supporting longer bit lengths at all because of the efficiency loss in handling shorter bit lengths. Disclosed herein is a systolic array that supports arbitrarily long bit lengths with reduced accuracy and minimal efficiency loss compared to processing shorter bit lengths. The systolic array may support inputs with arbitrary bit lengths through a reducer that discards excess bits from the significand of an arbitrary-bit-length input and rounds the remaining bits. Discarding the extra bits enables the reducer to reduce the input bit length to the maximum bit length supported by the systolic array in a single-pass computation, at the cost of reduced accuracy relative to the original bit length. In addition, the use of such a reducer may enable a systolic array that receives inputs having arbitrary bit lengths to provide the same performance as a systolic array that receives inputs having fixed bit lengths. Allowing a user to provide an input with an arbitrary (or non-fixed) bit length may allow lower-cost or lower-power elements to be used in a systolic array that receives inputs with larger bit lengths, while maintaining the overall performance of the systolic array because the reducer reduces the bit length of the input. Additionally, by reducing the bit length of the input (e.g., a 32-bit floating point number), the reducer may provide a reduced-precision version of the input (e.g., a 22-bit reduced-precision floating point number). Thus, the reducer may generate a reduced input from the input by reducing the bit length of the input.
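As a non-limiting software sketch of the trailing-bit reduction described above (assuming the reducer keeps the top 11 of the 23 stored FP32 significand bits and simply truncates the rest; the rounding step discussed later is omitted here):

```python
import struct

def reduce_fp32(x: float, kept_significand_bits: int = 11) -> float:
    """Zero the trailing significand bits of an FP32 value so that only the
    top `kept_significand_bits` of the 23 stored significand bits remain."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]      # raw IEEE-754 single-precision bits
    drop = 23 - kept_significand_bits                        # number of trailing bits to discard
    raw &= 0xFFFFFFFF ^ ((1 << drop) - 1)                    # clear the discarded bits
    return struct.unpack(">f", struct.pack(">I", raw))[0]

# The reduced value approximates the original with fewer significand bits.
print(reduce_fp32(3.14159265), 3.14159265)
```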
The reducer may generate a plurality of reduced inputs from an input. The systolic array may utilize the multiple reduced inputs in a multi-pass multiply-accumulate operation in order to maintain the accuracy of the input. For example, each combination of reduced inputs (e.g., in the case where the reducer generates two reduced inputs for an input data element and two for a weight: input data element 1 and weight 1, input data element 2 and weight 1, input data element 1 and weight 2, and input data element 2 and weight 2) may be passed through the multi-pass multiply-accumulate operation. By generating multiple reduced inputs, each with a reduced bit length, the reducer can reduce the bit length of the input to the maximum bit length supported by the systolic array in a single-pass calculation, at the cost of reduced performance for the longer bit length. In addition, the use of such a reducer may enable a systolic array that receives multiple reduced inputs (whose bit lengths are reduced from the original bit length) to provide the same frequency, power, and/or size advantages as a systolic array that receives inputs with a fixed (e.g., standard) bit length, at the cost of lower performance than a systolic array that operates natively on inputs with the original bit length. Allowing a user to provide an input having an arbitrary bit length may allow the use of lower-cost or lower-power elements (e.g., elements configured to operate on standard bit lengths) in a systolic array that receives inputs having arbitrary bit lengths, while providing increased accuracy compared to a systolic array that receives inputs having only a standard bit length.
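The decomposition behind the multi-pass operation can be illustrated with ordinary Python floats and hypothetical example values (a sketch only; the hardware operates on the reduced bit-level formats described later):

```python
# Suppose a reducer has split each operand into a high and a low reduced input
# such that w == w_hi + w_lo and d == d_hi + d_lo. The product (w * d) then
# expands into the four partial products computed by the four passes.
w_hi, w_lo = 1.5, 0.000244140625        # example high/low reduced weight
d_hi, d_lo = 2.75, 0.00048828125        # example high/low reduced input data element
total = w_lo * d_lo + w_lo * d_hi + w_hi * d_lo + w_hi * d_hi
print(total, (w_hi + w_lo) * (d_hi + d_lo))   # identical for these exactly representable values
```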
As described herein, a systolic array includes an array of processing elements (PEs), typically arranged in two dimensions (e.g., columns and rows). The PEs of the array may be interconnected to enable data to pass through the PEs, which may perform one or more mathematical operations on the data. For example, each PE may perform a "multiply accumulate" operation, whereby input values are fed into the PEs in each row of the array, each PE multiplies its respective input by a stored weight value, and passes the product result to PEs in subsequent rows.
One illustrative use of systolic arrays is to perform the inference phase of a machine learning application. Machine learning typically involves at least two phases: a "learning phase," in which a model is trained from training data, and an "inference phase," in which the trained model is applied to production data to predict an outcome. Inference-phase applications are typically latency sensitive because they run in a production environment. Furthermore, inference-phase applications (and in particular neural network applications) typically require intensive algebraic computations, such as matrix multiplications. Systolic arrays can be used to accelerate inference-phase workloads in machine learning applications.
As described above, the PEs of the systolic array may be divided into rows and columns. Each PE in the input layer may receive an element of the input data set and scale the element using a weight (e.g., a filter) to indicate the extent to which the element affects the output. Each PE in an intermediate layer may receive at least one of an element and a weight (or filter) from another PE in the systolic array. Each PE in an intermediate layer may combine the elements received from corresponding PEs of the systolic array to calculate a set of intermediate outputs. For example, each PE in an intermediate layer may calculate a sum of element-weight products, and an activation function may then be applied to the sum (e.g., by a system separate from the PEs of the systolic array).
Generally, an input data set (e.g., an input feature map) may be fed, one input data element at a time, into its respective row of the systolic array and transferred from one PE to another PE in a given row, e.g., starting from the leftmost PE. Each row receives a particular input data element and weight that are fed into the first PE in the row and then passed to the adjacent PE to the right of the first PE in the same row. In addition, input partial sums may be fed, one at a time, into their respective columns of the systolic array and passed from one PE to another PE in a given column, starting with the uppermost PE. Generally, an input partial sum may be passed in a column from a first PE to the adjacent PE directly below the first PE in the same column, so that each column corresponds to a particular input partial sum that passes through each PE of the given column. Doing so allows each PE of a given column to perform mathematical operations on the input partial sum to produce an output partial sum. As an input data element passes through a PE, the input data element may be multiplied by the weight value and summed with the input partial sum. The first PE in a column is provided with an input partial sum and generates an output partial sum based on the mathematical operations performed by the PE. The output partial sum is then provided as the input partial sum to the adjacent PE in the same column. That neighboring PE may then perform further mathematical operations before generating and passing its own output partial sum to the next neighboring PE. In some embodiments, the input data may be fed into the systolic array in a cascaded fashion, where the PE in the first column and first row (which may be designated as position [0,0], indicating row 0 and column 0) receives an input data element and an input partial sum in the first clock cycle. Thereafter, data may flow to subsequent rows and columns at a given rate (e.g., one PE per cycle). For example, the output partial sum of the PE at [0,0] may be fed to the PE at [1,0] along with the input data element of row 1, such that the PE at [1,0] performs mathematical operations on that input data element and partial sum during the second clock cycle. Similarly, the input data element of the PE at [0,0] may be passed to the PE of the subsequent column (e.g., at position [0,1]), which may also be fed an input partial sum, such that the PE at [0,1] performs a mathematical operation on that input partial sum and input data element during the second clock cycle. Assuming a convention in which rows proceed downward and columns proceed to the right, data generally flows downward and to the right during array operation. To assist in these calculations, weights may be provided to the PEs within the array prior to the first clock cycle, or the PEs may receive weights during the first clock cycle or during the calculations.
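A behavioral sketch of the column-wise accumulation just described (weights assumed preloaded; the cycle-by-cycle staggering of the wavefront is ignored):

```python
def column_pass(input_data, weights, input_partial_sum=0.0):
    """Behavioral model of one systolic-array column: each PE multiplies its
    row's input data element by its preloaded weight and adds the product to
    the partial sum received from the PE above it."""
    partial_sum = input_partial_sum
    for x, w in zip(input_data, weights):
        partial_sum += x * w          # one multiply-accumulate per PE
    return partial_sum

# Four rows feeding one column, as in the 4 x 4 example array of Fig. 1A.
print(column_pass([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125, 0.0625]))
```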
As machine learning applications and neural network applications proliferate, the need for increased processing power (e.g., the ability to process larger numbers and/or more accurate numbers) while achieving higher accuracy and maintaining performance has also increased. For example, the need to support numbers (e.g., decimal digits of the number and/or significant digits of the number) with increased accuracy has increased. Providing support for numbers having a larger bit length (e.g., 32-bit floating point numbers) results in a significant increase in integrated circuit die cost, power consumption, and circuit complexity as compared to supporting only numbers having a fixed (e.g., particular) bit length (e.g., 16-bit floating point numbers), because conventional PEs may not be able to receive numbers having bit lengths exceeding a particular length. In systolic arrays of hundreds or thousands of PEs, additional support for numbers with larger bit lengths may lead to an exponential increase in integrated circuit die cost, power consumption, and circuit complexity. In some configurations, the PE supports performing mathematical operations on numbers having increased bit lengths (e.g., 32 bits) with dedicated circuitry configured for larger bit lengths. For example, a 32-bit floating point systolic array may be used exclusively to perform mathematical operations on 32-bit floating point (FP 32) numbers. Such modifications may be particularly undesirable, may provide reduced performance, and may be costly and/or time consuming to implement. In other configurations, the PE does not support mathematical operations on numbers having bit lengths exceeding a given size. For example, a 16-bit floating point systolic array may not be able to perform mathematical operations on numbers other than the 16-bit floating point (FP 16) number. This lack of capability may be particularly undesirable and may provide reduced accuracy and/or reduced processing power.
The present disclosure provides a systolic array that has significant advantages over existing implementations. The present disclosure enables systolic arrays to support arbitrary bit lengths and maintain performance of shorter bit lengths relative to arrays that support single pass computation of longer bit lengths natively without significantly increasing the power consumption of the array. Further, the present disclosure may enable the use of numbers with arbitrary bit lengths (e.g., 32-bit floating point numbers) as inputs to systolic arrays (e.g., as inputs to reducers of the arrays). In addition, the reducer of the systolic array may programmatically adjust the input to a particular bit length (e.g., the highest bit length supported during a single pass computation) so that the user does not need to know the particular bit length of the input received by the processing elements of the systolic array. These advantages are provided by the embodiments discussed herein, and in particular by creating a systolic array by utilizing one or more reducers that reduce one or more inputs to be provided to the systolic array. In addition, one or more reducers may generate multiple reduced inputs for a particular input in order to maintain the accuracy of the original input.
The systolic array may support specific bit lengths or data types. For example, the systolic array may support standard bit lengths and/or data types (e.g., FP16 numbers). A consumer or user may be notified that the systolic array supports a particular bit length or data type. In addition, the reducer may receive inputs having arbitrary bit lengths that do not correspond to the supported bit lengths and/or data types (e.g., FP32 numbers). The reducer may convert an input having an unsupported bit length into a reduced format (e.g., a reduced bit length) and provide the input in the reduced format (e.g., a 22-bit floating point number) to the systolic array. The reduced format may be a non-standard format, a non-standard bit length, and/or a non-standard data type. The consumer may not be notified that the systolic array supports inputs having the reduced format. Inputs having the reduced format may have higher accuracy or precision than inputs having standard bit lengths and/or data types, and may offer higher performance than inputs having arbitrary bit lengths and/or data types, which may require specialized software and/or hardware to use. Additionally, the internal structure of the systolic array may be a superset of the components for each supported data type. For example, the internal structure of the systolic array may support standard significand bit lengths from A to B and standard exponent bit lengths from X to Y. Thus, the maximum internally supported bit length of the array may be 1+B+Y, where B and Y may be any numbers. The length 1+B+Y may not correspond to a standard format (e.g., 1+B+Y may correspond to a 22-bit format), but the reducer may be able to reduce inputs to this format for the array. Thus, while the set of data types and/or bit lengths supported by the systolic array may be exposed to the client, the reduced format (e.g., an intermediate bit length between an arbitrary bit length and a standard bit length) may not be exposed to the client and may correspond to the maximum format (e.g., bit length) supported by the systolic array. This may enable improved accuracy relative to inputs having standard bit lengths and improved performance relative to inputs having arbitrary bit lengths.
As disclosed herein, each reducer (e.g., bit reducer, zeroer, etc.) assigned to a particular row of the systolic array may reduce one or more inputs provided to the reducer (e.g., change one or more bits to zero) and output one or more reduced inputs based at least in part on the one or more inputs. The input provided to the reducer may be a number represented by a significand and an exponent. For example, the input may be in a floating point format. One or more of the reduced inputs may be represented in a modified format having a reduced significand and an extended exponent. The reduced input may have a sign bit, exponent bits, and significand bits. The most significant bit of the significand may be implied or hidden. Each reducer may include one or more of the following: a rounder, an exponent extender, a trailing bit reducer, and a multiplexer. The reducer may adjust the input provided to the reducer by maintaining the exponent of the original input and reducing the significand of the original input. The reducer may utilize the rounder to round the reduced input that the reducer generates, based on the non-reduced number. In some implementations, the input may be rounded in advance to a given precision (e.g., the number of bits supported by a single-pass calculation) and the reducer may discard the resulting trailing zeros to generate the reduced input. The rounder may round the input using various rounding techniques (e.g., any standard rounding technique). In addition, the reducer may utilize the exponent extender to extend the number of bits of the exponent portion of the number, and the trailing bit reducer to reduce the number of bits of the significand portion of the number. Each reducer may contain any combination of these components. Each reducer may utilize the components included in the reducer to generate and provide a reduced input to the systolic array or to a processing element of the systolic array. By generating a reduced input, the reducer can reduce or adjust any bit length (e.g., an arbitrarily long bit length) to a bit length supported by the processing elements of the array during a single-pass computation, with some loss of precision relative to the original input having the arbitrary bit length.
By discarding bits and providing a single pass through the systolic array, the reducer may result in reduced accuracy (e.g., loss of the data corresponding to the discarded bits). For example, the final output may be a reduced output equal to the reduced weight multiplied by the reduced input data element. This accuracy can be regained by implementing additional passes through the array. For example, the reducer may convert a weight into a high reduced weight and a low reduced weight, and convert an input data element into a high reduced input data element and a low reduced input data element. The final output may then include greater accuracy and may be equal to the low reduced weight times the low reduced input data element, plus the low reduced weight times the high reduced input data element, plus the high reduced weight times the low reduced input data element, plus the high reduced weight times the high reduced input data element. While multi-pass computation may reduce speed (e.g., because multiple passes through the array are needed to obtain a single overall output), multi-pass computation may provide a significant increase in accuracy over the reduced-accuracy single-pass computation. Thus, the systolic array may be able to support higher bit lengths by receiving inputs from the reducer, using hardware whose natively supported maximum bit length is lower than the higher bit length. Each reducer assigned to a particular row of the systolic array may receive a particular input data element and/or weight and generate a plurality of reduced inputs from the received input for multiple passes of the original input through the systolic array. For example, the reducer may receive an input data element and generate a plurality of reduced input data elements based on the input data element in order to maintain a higher precision of the original input data element than would be obtained by reducing the input to a standard bit length. The multiple reduced inputs may sum to the original input. It should be appreciated that each input may be converted into any number of reduced inputs. The reducer may generate the reduced inputs as a first reduced input (e.g., a high reduced input) and a second reduced input (e.g., a low reduced input). The first reduced input may be based on the higher-magnitude significand bits of the input, and the second reduced input may be based on the lower-magnitude significand bits. For example, the first reduced input may be based on the leftmost bits of the significand (e.g., the bits having the highest magnitude) and the second reduced input may be based on the rightmost bits of the significand (e.g., the bits having the lowest magnitude). The significand of the input may thus be divided between the first reduced input and the second reduced input. For example, for a 23-bit significand, the first reduced input may be based on the first 11 bits of the significand read from left to right (e.g., bits 22 to 12) and the second reduced input may be based on the next 12 bits of the significand read from left to right (e.g., bits 11 to 0).
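A bit-level sketch of this split (an illustration under stated assumptions: the 23 stored significand bits are divided into the top 11 bits for the high reduced input and the bottom 12 bits for the low reduced input by zeroing the other portion):

```python
FP32_SIGNIFICAND_MASK = (1 << 23) - 1            # the 23 stored FP32 significand bits
LOW_MASK = (1 << 12) - 1                         # bits 11..0  -> low reduced input
HIGH_MASK = FP32_SIGNIFICAND_MASK ^ LOW_MASK     # bits 22..12 -> high reduced input

def split_significand_bits(raw_fp32: int) -> tuple[int, int]:
    """Return raw bit patterns for the high and low reduced inputs by zeroing
    the other portion of the significand. The sign and exponent fields are
    copied unchanged; the low pattern still requires the renormalization and
    exponent extension described in the next paragraph before it encodes the
    low-order value."""
    sign_and_exponent = raw_fp32 & 0xFF800000
    high = sign_and_exponent | (raw_fp32 & HIGH_MASK)
    low = sign_and_exponent | (raw_fp32 & LOW_MASK)
    return high, low
```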
The reducer may generate the first reduced input by zeroing out a plurality of low-order bits of the original input. In addition, the reducer may generate the second reduced input by zeroing out a plurality of high-order bits of the original input. In some implementations, where the input is a normal (e.g., not a denormal or subnormal) number, the reducer may remove the implicit leading bit and renormalize the reduced significand (e.g., the significand after zeroing out a number of leading bits). The reducer may renormalize the input by shifting the significand by a number of bits based on the number of leading zeros. For example, the leading one of the reduced significand may be shifted into the implied bit position. The reducer may also adjust the exponent based on the number of bits by which the reducer shifts the significand. Since adjusting the exponent may produce a value outside the range of the current exponent, the reducer may extend the exponent (e.g., from 8 bits to 9 bits) so that the adjusted exponent fits within the extended exponent representation. For example, an 8-bit exponent may represent exponent values between -126 and +127, and by extending the exponent to a 9-bit exponent, the reducer may represent exponent values between -254 and +255. Since renormalizing a 32-bit input may require an exponent as low as -149 (-126-23) to allow shifting across all 23 bits of the significand (e.g., where the exponent field is "00000000" and the significand is "00000000000000000000001"), the reducer may extend the 8-bit exponent of the input when generating the second reduced input. The reducer may extend the exponents of both the first and second reduced inputs. In some implementations, the reducer may extend only the exponent of the second reduced input.
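A sketch of that renormalization for the low reduced input (assuming a normal FP32 input and a 12-bit low portion; the widened exponent is kept here as a plain signed integer, whereas the hardware would widen the exponent field itself):

```python
def renormalize_low_part(raw_fp32: int, low_bits: int = 12):
    """Renormalize the low-order significand portion of a normal FP32 value.
    Returns (sign, exponent, fraction), where the unbiased exponent may fall
    below the FP32 minimum of -126, which is why the reduced format carries
    an extended exponent field."""
    sign = raw_fp32 >> 31
    exponent = ((raw_fp32 >> 23) & 0xFF) - 127        # unbiased exponent of the normal input
    low_sig = raw_fp32 & ((1 << low_bits) - 1)        # low significand bits (no implicit 1)
    if low_sig == 0:
        return sign, 0, 0                             # the low part is exactly zero
    shift = 24 - low_sig.bit_length()                 # move the leading 1 to the implicit position
    fraction = (low_sig << shift) & ((1 << 23) - 1)   # leading 1 becomes implicit again
    return sign, exponent - shift, fraction
```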
Each of the first and second reduced inputs may be represented in a reduced (e.g., compressed) format (e.g., a 21-bit length). One or more reducers may generate reduced inputs for both the input data elements and the weights. The one or more reducers may also provide each combination of reduced inputs to the systolic array for a multiply-accumulate operation. The systolic array may implement a multi-pass multiply-accumulate operation over the combinations of reduced inputs to generate a total output. For example, a multiply-accumulate operation may be performed on the first reduced weight and the first reduced input data element, the first reduced weight and the second reduced input data element, the second reduced weight and the first reduced input data element, and the second reduced weight and the second reduced input data element. The final output may then be equal to the first reduced weight times the first reduced input data element, plus the first reduced weight times the second reduced input data element, plus the second reduced weight times the first reduced input data element, plus the second reduced weight times the second reduced input data element. An adder may sum the outputs of each multiply-accumulate operation (e.g., each partial multiply-accumulate operation) to generate the total output. By generating multiple reduced inputs (e.g., inputs having reduced bit lengths) from an input (e.g., an input having an arbitrary bit length), the systolic array may be able to perform multiply-accumulate operations on the input (via the multiple reduced versions of the input) without needing to support the arbitrary bit length of the input. A systolic array may have certain frequency constraints, size constraints, etc. in order to meet performance goals. Given these limitations, a conventional systolic array may not be able to support arbitrary bit lengths. By generating multiple reduced inputs for a particular input, the systolic array can satisfy these constraints while generating an output based on inputs having arbitrary bit lengths. It should be appreciated that any number of reduced inputs may be generated from the original input. For example, a 64-bit floating point number may be converted to five 21-bit reduced floating point numbers. Each of the reduced inputs may correspond to a portion of the significand of the original input. For example, a first reduced input may correspond to a first portion of the significand of the original input, a second reduced input may correspond to a second portion of the significand, a third reduced input may correspond to a third portion of the significand, and so on. The particular portion of the original input's significand for a particular reduced input may be identified by zeroing out the other portions of the significand.
In some implementations, the reducer may include or receive signals from a multiplexer that selects between two or more inputs based on a control signal, such as an opcode or a data type indicator. For example, the multiplexer may identify a particular input (e.g., weight or input data element) for downscaling.
In some implementations, the systolic array may have a separate reducer that receives one of the input data elements or weights and provides a corresponding reduced version of that input to the systolic array. Each processing element in an initial column of processing elements of the systolic array may receive a plurality of reduced inputs from one or more reducers. For example, a first processing element of the initial column may receive a downscaled input data element from a first downscaler and a downscaled weight from a second downscaler, and a second processing element of the initial column may receive a downscaled input data element from a third downscaler and a downscaled weight from a fourth downscaler.
Each reducer may reduce inputs having a bit length of 16 bits, 32 bits, or any number of bits. For example, the reducer may reduce a 32-bit floating point number to a 22-bit floating point number. In one embodiment, a 32-bit floating point number has a 1-bit sign, an 8-bit exponent, and a 23-bit significand. From such a 32-bit floating point number, the reducer may generate a reduced 20-bit floating point number having a 1-bit sign, an 8-bit exponent, and an 11-bit significand. In some implementations, the reducer may increase the bit length of the input exponent in order to adjust the format of the reduced input to a format supported by the processing element. For example, the reducer may increase the exponent from 8 bits to 10 bits. In some implementations, to reduce the bit length of a particular number, the reducer may reduce the number of trailing bits of the significand of the number (e.g., the reducer may zero the low-order bits of the significand). For example, the number may have the binary significand "10101010101111111111111", and the reducer may zero out the twelve trailing bits to generate the reduced binary significand "10101010101000000000000" and/or "10101010101".
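The trailing-bit example above, expressed as a quick sketch in integer arithmetic:

```python
significand = 0b10101010101111111111111       # the 23-bit significand from the example
reduced = significand & ~((1 << 12) - 1)      # zero the twelve trailing bits
print(f"{reduced:023b}")                      # -> 10101010101000000000000
print(f"{reduced >> 12:011b}")                # the 11 kept bits -> 10101010101
```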
Each reducer may also round the reduced input provided to the systolic array. The reducer may round the reduced input to a particular precision or number of bits supported by the processing elements of the systolic array. For example, the reducer may round the number to generate a rounded number. By rounding the input to the systolic array, the systolic array may obtain higher-accuracy systolic array calculations. In some implementations, the reducer may round the reduced input. In other implementations, the reducer may receive rounded inputs (e.g., inputs rounded by a separate system) and reduce the rounded inputs. Rounding may include one or more of stochastic rounding, rounding to the nearest even, rounding toward zero, rounding down, or rounding up. Additionally, a user, system, etc. may specify a rounding method (e.g., via selection from a user interface) to be used for rounding an input.
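A sketch of one of the rounding options named above (round to nearest, ties to even) applied to the discarded trailing significand bits; the other modes would differ only in the rounding decision:

```python
def round_significand_nearest_even(significand: int, drop_bits: int) -> int:
    """Round a significand to nearest (ties to even) while discarding its
    `drop_bits` trailing bits; the result is expressed in the kept bits only."""
    kept = significand >> drop_bits
    remainder = significand & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1        # a carry out of the kept bits would also adjust the exponent
    return kept

print(bin(round_significand_nearest_even(0b10111000, 4)))  # tie, kept part odd  -> 0b1100
print(bin(round_significand_nearest_even(0b10101000, 4)))  # tie, kept part even -> 0b1010
```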
The systolic array may have PEs that each include a 22-bit multiplier and a 34-bit adder. The 22-bit multiplier may operate on 22-bit reduced floating point numbers, reduced from 32-bit floating point numbers by a reducer, to generate a multiplication product having one sign bit, ten exponent bits, and 23 significand bits. The multiplication product may thus include 24 significand bits, where the most significant bit is implicit or hidden. The 34-bit adder may operate on a 34-bit number (e.g., a 34-bit multiplication product). In addition, the adder may operate on a 35-bit number, where one bit is implicit or hidden. In some embodiments, the systolic array may include an n-bit multiplier and an m-bit adder, where the n-bit multiplier may operate on an x-bit reduced floating point number and the m-bit adder may operate on a y-bit number. The variables n, m, x, and y may be any numbers, where n is greater than x.
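To see why the product significand fits in 24 bits (23 stored plus the implicit bit), a quick check with the widest possible operand significands of 1 implicit bit plus 11 stored bits each (an illustration only, not the multiplier circuit itself):

```python
max_operand = (1 << 12) - 1        # implicit 1 + 11 stored significand bits, all ones
product = max_operand * max_operand
print(product.bit_length())        # 24: the 23 stored product bits plus the implicit bit
```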
In the following description, various examples will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the examples. However, it will be apparent to one skilled in the art that the examples may be practiced without the specific details. Moreover, well-known features may be omitted or simplified in order not to obscure the described examples.
Fig. 1A shows an exemplary 4 x 4 systolic array 100A. Systolic array 100A illustratively includes four columns of PEs and four rows of PEs, with four PEs in each row and four PEs in each column. It should be appreciated that systolic array 100A is simplified for descriptive purposes, and that systolic array 100A according to the present disclosure may include any number of PEs in each row and each column. In addition, the number of PEs in each row may be different from the number of PEs in each column. It should further be appreciated that such systolic arrays 100A may be logically organized into any number of rows and any number of columns. In addition, the number of rows may be different from the number of columns. Systolic array 100A may be part of a neural network processor in a computer system. For example, the computer system may provide multi-tenant computing services for data processing applications, such as image recognition services, text-based data processing (e.g., processing of search queries), audio or video data processing, and so forth.
Each PE may include a respective row input bus 102, a respective column input bus 104, a respective column output bus 106, and a respective row output bus 108. A PE may receive input from the PE to its left in the same row (or from external circuitry) via the row input bus 102. The PE may also receive input from the PE above it in the same column (or from external circuitry) via the column input bus 104. The PE may perform arithmetic computations based on the inputs and transmit the results of the arithmetic computations to the PE below it in the same column (or to external circuitry) via the column output bus 106. The PE may also forward inputs received via the row input bus 102 to the PE to its right in the same row via the row output bus 108.
Systolic array 100A may perform arithmetic calculations, including multiplication and addition operations, for the processing elements of the neural network. For example, each PE may include an arithmetic unit such as a multiplier and an adder. In some embodiments, the multiplier and adder may be a fused multiplier adder. In the example of fig. 1A, each row of PEs may process one set of input data, and each column of PEs may generate one set of output data based on the set of input data received by each PE in a given column.
The column 112 of PEs (leftmost column) may receive four sets of input data, where each set of input data is processed by a row of PEs. The column 116 of reducers may provide four sets of reduced input data to the column 112 of PEs, where each set of input data is provided by one reducer, which may improve the overall performance of the array as compared to conventional arrays. It should be appreciated that the column 116 of the reducer may provide any number of sets of reduced inputs to the column 112 of the PE. For example, the number of reducers and/or the number of sets of reduction inputs may be based on the number of PEs in a given column. In the example of FIG. 1A, column 112 of PEs includes four PEs (PE 112a, PE 112b, PE 112c, PE 112 d), and column 116 of reducers includes four corresponding reducers (reducer 116a, reducer 116b, reducer 116c, reducer 116 d). It should be appreciated that column 116 of reducers may include any number of reducers. Each reducer in the column 116 of reducers may provide a set of reduced input data for a particular PE in the column 112 of PEs, where each set of reduced input data includes two or more reduced inputs. For example, reducer 116a may provide a reduced input data element and a reduced weight to PE 112 a. Each reducer in column 116 of reducers may convert an input into a reduced input. For example, reducer 116a may convert 32-bit input data elements into reduced 22-bit input data elements.
Each reducer in the column 116 of reducers may further select a reduced input to provide to a PE in the column 112 of PEs. For example, each reducer in the column 116 of reducers may include a multiplexer to select a reduced weight or a reduced input data element to provide to the PE. In some embodiments, each reducer 116a-116d may be implemented as a plurality of reducers (e.g., a first reducer and a second reducer). The first reducer and the second reducer may each provide one or more inputs to the column 112 of PEs. For example, a first reducer of reducer 116a may provide a reduced input data element to PE 112a, and a second reducer of reducer 116a may provide a reduced weight to PE 112a. In some implementations, a PE may receive a reduced input (e.g., a reduced input data element) and a non-reduced input (e.g., a non-reduced weight) for its arithmetic operations.
Each PE in column 112 may obtain a reduced input data element and a reduced weight from the corresponding input data set received via the row input bus 102. Each PE in column 112 may multiply the reduced input data element by the reduced weight to generate a scaled input. The scaled inputs generated by the PEs within any column (including column 112) may be accumulated by the adder of each PE. For example, PE 112a (of column 112) may generate a first scaled input (from the first input data set), and the first scaled input may be combined with the output of the adder. For example, the adder may generate a first output partial sum based at least in part on the first scaled input. PE 112a may transmit the first output partial sum to PE 112b via the column output bus 106. PE 112b may also generate a second scaled input (from the second input data set) and add the second scaled input to the partial sum. The updated partial sum, accumulated with the first and second scaled inputs, is then transmitted to PE 112c via the column output bus 106. The partial sum is updated and propagated down column 112, and PE 112d may generate a sum of the scaled inputs from the four input data sets.
The sum generated by PE 112d may correspond to an output data set and may be fed back to the leftmost PEs after passing through an activation function. In addition, each PE in column 112 may also propagate its input data set to the other PE columns (e.g., column 114), which may scale the input data sets using a different set of weights than column 112. Each column of PEs may perform the arithmetic operations (multiplications and additions) to generate output data elements in parallel. In the example of Fig. 1A, systolic array 100A may generate output data elements for four output data sets corresponding to the four columns of systolic array 100A.
Systolic array 100A may perform convolution calculations in multiple waves. In one embodiment, the waves represent a stream of input data elements that are processed while reusing the same weights in systolic array 100A. For example, the respective weights may have been preloaded in each PE in systolic array 100A sequentially or in parallel prior to starting the wave computation. The PE-generated partial sums may correspond to a single wave. Since the PEs of systolic array 100A perform the arithmetic operations of convolution calculations, the dynamic power consumed by all multipliers in the PEs can be significant. This problem may be further exacerbated for systolic arrays that include a large number of PEs (e.g., thousands). The arithmetic operations performed by the PE will be further explained with reference to fig. 2A and 2B.
As described above, an input may be reduced to generate a reduced input that is provided to the systolic array. In addition, the input may be reduced into multiple reduced inputs for multiple single-pass reduced-precision computations, which may be combined into a higher-precision computation. The systolic array may include an aggregator to combine the partial outputs into a higher-precision output (e.g., a higher-precision output relative to a single-pass calculation). Fig. 1B shows an exemplary configuration of a column 120 of eight PEs within systolic array 100B. Array 100B may be similar to array 100A of Fig. 1A, but illustratively includes eight rows and one column. Specifically, as shown in Fig. 1B, an input may be converted into a plurality of reduced inputs, and each PE may perform a multiply-accumulate operation on each combination of the reduced inputs and provide a partial output partial sum to the corresponding neighboring PE. The number of partial output partial sums generated and the number of multiply-accumulate operations may be varied by varying the number of reduced inputs. Thus, each higher-bit-length input may be converted by a reducer into any number of reduced inputs having lower bit lengths, in order to satisfy the bit lengths natively supported by the systolic array.
To facilitate computing a total output sum for the column, column 120 in Fig. 1B includes an aggregator 130. The aggregator 130 may be located within or outside of the array 100B. For each pass of a given input (e.g., each combination of reduced inputs associated with a particular input) through the array, the aggregator 130 may store and sum the partial outputs. The aggregator 130 may add the partial sums generated for each combination of the reduced inputs. The aggregator 130 may calculate a running sum of the outputs (e.g., by iteratively summing the partial output sums of a given set of reduced inputs) as the total output sum. For example, the aggregator 130 may include a partial sum buffer 132.
In some implementations, the systolic array may identify a particular order in which to pass the reduced inputs and reduced weights through the array. For example, the reduced inputs and reduced weights with lower magnitudes may be passed through the array first in order to maintain the accuracy of the lower-magnitude numbers. Thus, the reduced inputs with lower magnitude may be accumulated first in order to maintain accuracy. For example, the product of the low reduced input data element and the low reduced weight may be generated as a first partial output. The first partial output may be added to the product of the high reduced input data element and the low reduced weight (or the product of the low reduced input data element and the high reduced weight) to generate a second partial output. The second partial output may then be added to the other of those two products to generate a third partial output. The third partial output may be added to the product of the high reduced input data element and the high reduced weight to generate the total output. By adding the reduced inputs having lower magnitudes first, the accuracy of the reduced inputs may be maintained, minimizing the loss of precision of the low reduced inputs when they are combined with the high reduced inputs.
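A sketch of that accumulation order (a software illustration only, assuming two reduced parts per operand):

```python
def accumulate_passes(w_hi, w_lo, d_hi, d_lo):
    """Accumulate the four reduced-precision passes lowest-magnitude first,
    so the small partial products are not absorbed by the largest one."""
    first = w_lo * d_lo                    # first partial output (smallest magnitude)
    second = first + w_lo * d_hi           # add one cross product
    third = second + w_hi * d_lo           # add the other cross product
    return third + w_hi * d_hi             # total output (largest product last)
```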
Although fig. 1B shows the aggregator 130 providing a pair-wise summation, the aggregator 130 may alternatively implement other aggregation techniques. In some implementations, the column 120 of PEs may not include an aggregator 130 and may provide an output dataset consisting of partial sums for each combination of reduced inputs. In one implementation, column 120 may not include an aggregator 130 and column 120 may provide multiple partial output data sets. In some implementations, the multiple output data sets may each correspond to a partial sum generated for each combination of the reduced inputs of column 120. In another implementation, the aggregator 130 may provide more or fewer output data sets. The aggregator 130 may provide one or more output data sets that each correspond to one or more partial sums. In some cases, the output of the aggregator 130 may be capable of being configured according to the desired use of the array, and thus may accept instructions as to what output should be provided. In some cases, the aggregator 130 may provide combinations of the above-described outputs (e.g., by providing a final sum of four partial sums corresponding to each combination of reduced inputs and non-reduced inputs). In some embodiments, a portion of the partial sum aggregation may occur within a systolic array. For example, the systolic array may add the first partial sum to the second partial sum (using one or more components) to generate a third partial sum, and may add the fourth partial sum to the fifth partial sum to generate a sixth partial sum. Additionally, the systolic array may provide a third partial sum and a sixth partial sum for accumulation to the aggregator 130.
FIG. 2A illustrates PE 00 in a systolic array for neural network computation, in accordance with certain embodiments of the disclosed technology. PE 00 can be part of a systolic array similar to systolic array 100A in FIG. 1A. FIG. 4A and FIG. 4B show additional details of the reducers 225, 227 of FIG. 2A. Some embodiments may be described with reference to neural networks; however, it should be understood that certain embodiments may be used for other applications such as pattern recognition, image processing, audio processing, video processing, and the like, without departing from the scope of the present technology.
Systolic array 200 includes reducers 225, 227 and a plurality of processing elements including PE 00 and PE 01. PE 00 may comprise one or more of the following: a data element load generator 202, an input data element register 204, a weight register 206, a multiplier 208, an adder 210, a skip computation generator 212, a skip computation register 214, a selector circuit 216, an input partial sum register 218, a cached weight register 220, and an operation decoder 256. According to some implementations, PE 00 may receive one or more of the following: a reduced input data element 222, a reduced weight 224, a zero data element indicator 226, a zero weight indicator 228, an opcode 230, a weight load 232, and an input partial sum 234 to perform convolution calculations.
PE 00 can be coupled to a first reducer 225 and a second reducer 227. The first reducer 225 may receive a first input (such as the input data element 221) and the second reducer 227 may receive a second input (such as the weight 223). The first reducer 225 may convert the first input to a first reduced input and the second reducer 227 may convert the second input to a second reduced input. The first reducer 225 may provide the reduced input data element 222 (e.g., a reduced version of the input data element 221) to the PE 00. In addition, the second reducer 227 may provide the reduced weight 224 (e.g., a reduced version of the weight 223) to the PE 00. In some implementations, one or more of the first reducer 225 or the second reducer 227 can round the input and/or the reduced input. The rounding may be based on a rounding method identified by the system, the user, etc. (e.g., user input may specify a particular rounding method). In other implementations, one or more of the first reducer 225 or the second reducer 227 may reduce a pre-rounded input (e.g., an input that has already been rounded by a system local or remote to the systolic array). Additionally, the first reducer 225 and the second reducer 227 may convert one or more floating point inputs to a reduced representation. The floating point input may include a bit length of 32 bits, 64 bits, or any number of bits.
In some implementations, one or more of the first reducer 225 or the second reducer 227 can detect when one or both of the input data element 221 and the weight 223 exceeds a particular bit length. For example, the first reducer 225 may determine whether the input data element 221 exceeds 22 bits, and the second reducer 227 may determine whether the weight 223 exceeds 22 bits. In addition, a user, system, etc. may provide a particular bit length for comparison with the bit length of the input data element 221 and the weight 223. Upon determining that a particular input (e.g., input data element 221) exceeds the identified bit length, one or more of first reducer 225 or second reducer 227 may generate a reduced input (e.g., reduced input data element 222).
To reduce the bit length of the input data element 221 and/or the weight 223, the first reducer 225 and/or the second reducer 227 may reduce the bit length of a significand portion having a particular length. The first reducer 225 and/or the second reducer 227 may reduce the bit length of the significand portion to match the maximum bit length of the significand supported by the components of the systolic array (e.g., the multipliers of each processing element). For example, the first reducer 225 and/or the second reducer 227 may reduce the bit length of the significand portion of the input from 23 bits to 11 bits. In some embodiments, the first reducer 225 and/or the second reducer 227 may extend the exponent portion of the input to a particular format required by the multiplier. For example, the first reducer 225 and/or the second reducer 227 may extend the bit length of the exponent portion of the input from 8 bits to 10 bits.
In the event that the significand portion of one or both of the input data element 221 and the weight 223 has been reduced, the first reducer 225 and the second reducer 227 may still extend the number of bits used to represent the respective exponent portion. Thus, subsequent arithmetic circuitry such as the multiplier 208 may perform computations on the numbers in a single format (e.g., a 22-bit floating point format).
PE 00 may receive a reduced input data element 222 via a first input port. The reduced input data element 222 may belong to an input data set, or to any array of input data elements. PE 00 can receive one reduced input data element at a time from an input data set within a uniform time period. For example, the uniform time period may correspond to a clock cycle. The input data set may be similar to an input feature map comprising input feature map elements. For example, the input data set may correspond to an input image, an audio clip, a video clip, a text portion, or any other data that may be provided for data processing to identify a particular pattern or object. In some cases, the input data set may be an intermediate output data set that has undergone an activation function, such as ReLU or Sigmoid, as discussed with reference to FIG. 1A. Each reduced input data element 222 may be a floating point data type or any suitable data type. Each reduced input data element 222 may comprise 22 bits, 21 bits, 20 bits, or any suitable number of bits. The reduced input data element 222 may be stored in the input data element register 204 for a period of time.
PE 00 may receive the reduction weight 224 via a second input port. In some implementations, the reduced weights 224 may belong to a set of weight values corresponding to a convolution filter. The reduction weight 224 may be preloaded in the PE 00 prior to receiving the reduction input data element 222. In some embodiments, PE 00 can receive one reduced weight value at a time from a set of reduced weight values over a uniform period of time to preload each PE in a given row with a corresponding reduced weight value. The PE may pass the reduced weight value to the next PE in the corresponding row until each PE in the given row has been preloaded. Each PE may cache a corresponding reduction weight value for computation with the reduced input data element. Each of the reduction weights 224 may be a floating point data type or any suitable data type. Each reduction weight 224 may include 22 bits, 21 bits, 20 bits, or any suitable number of bits. The reduced weight 224 may be stored in the cached weight register 220 for a period of time.
PE 00 may receive the input partial sum 236 of the current operation via a third input port. In some embodiments, the input partial sum 236 may be a 16-bit, 18-bit, 32-bit, 33-bit, or 34-bit number, or have any number of bits.
PE 00 may receive the zero data element indicator 226 of the current operation via the fourth port. The zero data element indicator 226 may comprise a single bit or multiple bits. The zero data element indicator 226 may indicate whether the reduced input data element 222 is zero. The zero data element indicator 226 may also indicate whether the input data element 221 is zero. For example, a value of "1" for the zero data element indicator 226 may indicate that the reduced input data element 222 associated with the zero data element indicator 226 is zero, and a value of "0" for the zero data element indicator 226 may indicate that the reduced input data element 222 associated with the zero data element indicator 226 is not zero. In addition, "0" may correspond to a logic zero or logic low, and "1" may correspond to a logic one or logic high. For example, a logical zero may be represented by a first range of voltage levels (e.g., 0 to 2 volts) and a logical one may be represented by a second range of voltage levels (e.g., 3 to 5 volts). It should be appreciated that other implementations of the values representing "0" and "1" are possible without departing from the scope of the disclosed technology. The zero data element indicator 226 may be generated by circuitry external to the PE 00 and communicated sequentially to all PEs in the same row within a uniform time period.
PE 00 may receive the zero weight indicator 228 via a fifth port. The zero weight indicator 228 may comprise a single bit or multiple bits. The zero weight indicator 228 may indicate whether the reduced weight 224 associated with the zero weight indicator 228 is zero. The zero weight indicator 228 may also indicate whether the weight 223 associated with the zero weight indicator 228 is zero. For example, a value of "1" for the zero weight indicator 228 may indicate that the reduced weight 224 is zero, and a value of "0" for the zero weight indicator 228 may indicate that the reduced weight 224 is not zero. The zero weight indicator 228 may be generated by circuitry external to PE 00 and communicated sequentially to all PEs in the same row along with the reduced weight 224.
The weight load 232 may load the reduced weight 224 into the cached weight register 220 to provide the cached weight 246. Before the reduced input data element 222 is fed into the array, the weight load 232 may be asserted to cache the reduced weight 224 of PE 00 in the cached weight register 220. As weights are shifted into the array to preload each PE with a corresponding weight value, the weight load 232 may be asserted for each PE for a particular period of time to preload each PE with the appropriate weight value.
The operation decoder 256 may decode the opcode 230 to determine the operations to be performed by PE 00 for different instructions represented by different opcode values. In some implementations, a first opcode value may correspond to an instruction to shift a reduced weight from one PE to another PE in the systolic array. A second opcode value may correspond to an instruction for the PE to start performing arithmetic computations. For example, once the reduced weights have been preloaded into the systolic array, the reduced input data elements may be read from memory and arithmetic computations may be performed as the reduced input data elements pass through the array. A third opcode value may correspond to an instruction to execute a NOP. NOPs may be used to separate two systolic array instructions, or when there are no reduced input data elements to read from memory. For example, a NOP may be used to separate an instruction that shifts reduced weights from an instruction that starts the arithmetic computations. For example, for a 4 x 4 array, up to 15 cycles may be required to shift the reduced weights into all PEs in the array before starting the arithmetic computations, and thus up to 15 NOP cycles may be required. The operation decoder 256 may decode the opcode 230 to generate the NOP 258 and a start computation signal 260. The operation decoder 256 may provide the start computation signal 260 to the weight register 206 coupled to the multiplier 208 and the adder 210. The operation decoder 256 may also provide the start computation signal 260 to the multiplier 208. The opcode 230 may include any suitable number of bits, e.g., two, four, etc. In some implementations, the operation decoder 256 may also decode the opcode to determine the data type and provide a data type control signal.
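A simplified software model of the decoding behavior described above is sketched below. The concrete opcode encodings (SHIFT_WEIGHTS, START_COMPUTE, NOP_OPCODE) are assumptions made only for illustration; the description does not fix the actual values.

    # Hypothetical opcode values, assumed for illustration only.
    SHIFT_WEIGHTS, START_COMPUTE, NOP_OPCODE = 0, 1, 2

    def decode_opcode(opcode):
        """Return (nop_258, start_computation_260) as 0/1 signals."""
        nop = 1 if opcode == NOP_OPCODE else 0
        start_computation = 1 if opcode == START_COMPUTE else 0
        return nop, start_computation

    # Example: NOPs may separate the weight-shift instruction from the
    # instruction that starts the arithmetic computations.
    for opcode in (SHIFT_WEIGHTS, NOP_OPCODE, NOP_OPCODE, START_COMPUTE):
        print(decode_opcode(opcode))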
In some implementations, the reduced input data element 222, reduced weight 224, opcode 230, zero data element indicator 226, and zero weight indicator 228 may belong to the row input bus 102, as discussed with reference to fig. 1A. In other implementations, a splitter (not shown) may be used in PE 00 to split row input bus 102 into different internal buses to carry reduced input data elements 222, reduced weights 224, opcodes 230, zero data element indicators 226, and zero weight indicators 228 within PE 00. For example, the reduced input data element 222 and the reduced weight 224 may belong to a first row input bus and the opcode 230, the zero data element indicator 226, and the zero weight indicator 228 may belong to a second row input bus.
The data element load generator 202 may generate a data load signal 242 that may be used to allow the input data element register 204 to skip storage of the reduced input data element 222 under certain conditions. In some implementations, the reduced input data element 222 may be loaded into the input data element register 204 when the data load signal 242 is asserted based on the zero data element indicator 226 and the NOP 258. The data load signal 242 may be asserted when the zero data element indicator 226 corresponding to the reduced input data element 222 is "0" and the opcode 230 does not indicate a NOP (e.g., NOP 258 is "0"). When the zero data element indicator 226 corresponding to the reduced input data element 222 or NOP 258 is a "1," the data load signal 242 may not be asserted. The data element load generator 202 may be implemented using OR, NOR, NAND or any suitable circuitry.
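As a rough software model (an illustrative sketch, not the circuit itself), the data load signal 242 behaves like a NOR of the zero data element indicator 226 and the NOP 258: it is asserted only when the element is non-zero and the current instruction is not a NOP.

    def data_load_signal(zero_data_element_indicator, nop):
        # Equivalent to NOR(zero indicator, NOP): assert only when both are 0.
        return 1 if (zero_data_element_indicator == 0 and nop == 0) else 0

    assert data_load_signal(0, 0) == 1  # non-zero element, real instruction: load
    assert data_load_signal(1, 0) == 0  # zero element: skip the register update
    assert data_load_signal(0, 1) == 0  # NOP: skip the register update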
The input data element register 204 may store the reduced input data element 222 or skip the storage of the reduced input data element 222 based on the currently operated data load signal 242 to provide a stored input data element 244. In some implementations, if the load input is a "1", the input data element register 204 may store the Din input, and if the load input is a "0", the previous value may be saved. For example, if the data load signal 242 is "1", the input data element register 204 may store a new value of the reduced input data element 222, and if the data load signal 242 is "0", the input data element register 204 may skip storing the new value of the reduced input data element 222. Thus, in some cases, the input data element register 204 may store only non-zero values of the reduced input data element 222. According to some embodiments, skipping the storage of a new value by the input data element register 204 may result in not switching the stored input data element 244 and maintaining the previous value of the stored input data element 244.
The weight register 206 may store cached weights 246 to provide stored weight values 248 based on a start calculation signal 260. In some implementations, the weight register 206 may store the Din input if the load input is a "1" and may hold the previous value if the load input is a "0". For example, if the start calculation signal 260 is asserted (e.g., the start calculation signal 260 is "1"), the cached weights 246 may be loaded into the weight register 206, otherwise the weight register 206 may hold the previous value. Thus, the reduced weight 224 previously loaded into the cached weight register 220 using the weight load 232 may be shifted into the weight register 206 at the beginning of the arithmetic computation. In some embodiments, the stored weight values 248, once loaded at the beginning of the arithmetic computation, remain unchanged as the input data elements are fed through the systolic array one element at a time to PE 00 for computation corresponding to one or more waves.
PE 00 can provide a stored input data element 244 to PE 01 based on the currently operating data load signal 242. PE 01 may receive stored input data element 244 as reduced input data element 222 via a first port. In some implementations, if the load input is a "1", the input data element register 204 may store the Din input, and if the load input is a "0", the previous value may be saved. PE 00 can provide the stored weight value 248 to PE 01 based on the start calculation signal 260. PE 01 may receive the stored weight value 248 as the reduced weight 224 via the second port. In some implementations, the weight register 206 may store the Din input if the load input is a "1" and may hold the previous value if the load input is a "0".
Multiplier 208 may perform a multiplication operation between stored input data elements 244 and stored weight values 248. Multiplier 208 may generate product 250 based on the multiplication operation. Multiplier 208 may receive an input having a fixed bit length. For example, multiplier 208 may receive a 22-bit floating point input. Thus, the reducer may enable the systolic array to receive an input having an arbitrary bit length and provide the multiplier 208 with a reduced input having a bit length supported by the multiplier 208. In some embodiments, the product 250 may be an integer product, a floating point product, or any other product. In addition, multiplier 208 may generate product 250 having 8 bits, 16 bits, 18 bits, 32 bits, 34 bits, or any other number of bits. Multiplier 208 may be implemented using a multiplier circuit. Multiplier 208 may perform floating point multiplication, integer multiplication, or multiplication involving any other data type. Multiplier 208 may be implemented using a 16-bit multiplier data path, an 18-bit multiplier data path, a 22-bit multiplier data path, or a multiplier data path having any number of bits. Multiplier 208 may support at least n-bit operations, where n is greater than or equal to the number of bits in the input (e.g., input data element).
Multiplier 208 may include multiple data paths, e.g., as further discussed with respect to fig. 5. With respect to fig. 2A, multiplier 208 may include separate data paths for calculating sign bits, significands and exponents. It should be appreciated that the significand data-path and the exponent data-path may comprise data having any number of bits.
Multiplier 208 may provide product 250 to adder 210. Adder 210 may perform an addition operation on the product 250 and the stored input partial sum 236 to provide an addition result 238. Adder 210 may be implemented using adder circuitry. Adder 210 may perform floating point addition, integer addition, or non-integer addition. Adder 210 may perform addition on inputs having 8 bits, 16 bits, 18 bits, 32 bits, 34 bits, or any number of bits. Adder 210 may be implemented using a 16-bit adder data path, an 18-bit adder data path, a 32-bit adder data path, a 34-bit adder data path, or an adder data path having any number of bits. In one implementation, adder 210 is implemented with a given bit size (e.g., with a given bit size adder data path), which may represent the maximum bit size of the expected input of the array. In some implementations, each processing element may include an adder with a larger bit size and a multiplier with a smaller bit size, as an adder with an increased bit size may be more cost effective than a multiplier with the same increased bit size. Thus, the present disclosure enables systolic arrays to support larger bit sizes with reduced accuracy using multipliers of lower bit sizes. In another embodiment, adder 210 may be implemented with a bit size that is smaller than the maximum bit size of the expected input of the array. Adder 210 may support at least m-bit operations, where m is equal to or greater than the bit width of the multiplier data path. The adder data path may be a superset of the multiplier data paths.
Multiplier 208 and adder 210 may provide a fused multiply-accumulate operation. Multiplier 208 and adder 210 may be integrated together to perform a single-step multiply-add operation. In some implementations, rounding may not be performed on the output of multiplier 208 before the output is provided to adder 210. In addition, multiplier 208 may provide an accurate product 250 to adder 210. In other implementations, PE 00 can perform rounding on the output of multiplier 208.
Selector circuit 216 may receive the addition result 238, the input partial sum 236, and the stored skip computation indicator 254. The selector circuit 216 may select either the addition result 238 or the input partial sum 236 to be provided as the output partial sum 240 via a sixth port. In some implementations, the selector circuit 216 may include at least one multiplexer that may select the addition result 238 or the input partial sum 236 to be provided. The selector circuit 216 may select the addition result 238 or the input partial sum 236 to be provided as the output partial sum 240 via the sixth port based on the stored skip computation indicator 254. According to some embodiments, when the value of the reduced input data element 222 or the reduced weight 224 for the current operation is zero, or the NOP 258 is asserted, the addition result 238 may hold a value from a previous operation because the product 250 may also hold a value from a previous operation. In this case, the stored skip computation indicator 254 may allow the addition result 238 to be bypassed and the input partial sum 236 to be selected and provided as the output partial sum 240. For example, when the stored skip computation indicator 254 provides a skip computation signal of "1", the input partial sum 236 may be selected as the output partial sum 240 for the systolic interval, and when the stored skip computation indicator 254 provides a skip computation signal of "0", the addition result 238 may be selected as the output partial sum 240 for the systolic interval.
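A minimal software sketch of this selection follows; it models only the multiplexing behavior described above and none of the timing of the actual circuit.

    def select_output_partial_sum(addition_result, input_partial_sum,
                                  stored_skip_indicator):
        if stored_skip_indicator == 1:
            return input_partial_sum   # bypass the stale addition result
        return addition_result         # include this PE's contribution

    # Example: a zero weight or zero input data element makes the multiply
    # redundant, so the running partial sum is passed through unchanged.
    assert select_output_partial_sum(7.5, 5.0, stored_skip_indicator=1) == 5.0
    assert select_output_partial_sum(7.5, 5.0, stored_skip_indicator=0) == 7.5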
FIG. 2B shows the arrangement of FIG. 2A in which a shared reducer 225 replaces the first reducer 225 and the second reducer 227. The shared reducer 225 may receive the input data element 221 and the weight 223. The shared reducer 225 may also receive the opcode 230. The shared reducer 225 may perform a selection operation on the input data element 221 and the weight 223 based at least in part on the opcode 230. In some implementations, the shared reducer 225 may generate a reduced input based at least in part on the opcode 230. For example, when the opcode 230 is a particular value, the shared reducer 225 may reduce the weight 223 and provide the reduced weight 224 to the PE 00. Additionally, the shared reducer 225 may reduce the input data element 221 and provide the reduced input data element 222 to the PE 00 when the opcode 230 provides some other specified value. Thus, the shared reducer 225 may reduce the bit length of the significand portion of both the input data element 221 and the weight 223 to match the maximum bit length of the significand supported by the components of the systolic array (e.g., the multipliers of each processing element). In some implementations, the shared reducer 225 can receive multiple input data elements and/or multiple weights and generate multiple reduced input data elements and/or multiple reduced weights. For example, the shared reducer 225 may generate any number (e.g., four) of reduced input data elements and/or any number (e.g., four) of reduced weights.
The shared reducer 225 may select between the input data element 221 and the weight 223 using a multiplexer. In some implementations, the reduced input data element 222 and the reduced weight 224 may be delivered to the PE 00 on separate buses. In other implementations, the reduced input data element 222 and the reduced weight 224 may be delivered on the same bus. Additionally, the shared reducer 225 may reduce both the input data element 221 and the weight 223 within the same clock cycle, and provide the reduced input data element 222 and the reduced weight 224 to the PE 00. In some implementations, the shared reducer 225 can reduce the weight 223 and provide the reduced weight 224 to the PE 00 during a first clock cycle. The shared reducer 225 may then reduce the input data element 221 and provide the reduced input data element 222 to the PE 00 during a second clock cycle.
FIG. 3 illustrates an apparatus 300 including zero detector circuits for reduced input data elements and reduced weights entering a systolic array for neural network computation, in accordance with certain implementations of the disclosed technology.
The device 300 may include a two-dimensional systolic array 302 comprising PEs arranged in rows and columns. Systolic array 302 may be similar to systolic array 100A in FIG. 1A. A first row of systolic array 302 may include PE 00, PE 01, PE 02, ..., PE 0y, a second row of systolic array 302 may include PE 10, PE 11, PE 12, ..., PE 1y, a third row of systolic array 302 may include PE 20, PE 21, PE 22, ..., PE 2y, and an Xth row of systolic array 302 may include PE X0, PE X1, PE X2, ..., PE Xy. x and y may be positive integers such as 32, 64, 128, or any suitable number. Each PE of systolic array 302 may be similar to PE 01 and include means for performing arithmetic calculations on the reduced input using a power-efficient method, as discussed with reference to FIG. 2A and FIG. 2B.
In some implementations, the first (e.g., leftmost) PE in each row of systolic array 302 may be coupled to a respective zero input data detector circuit to detect zero values on input data elements and to a respective zero weight detector circuit to detect zero values on weight values entering systolic array 302. For example, PE 00 in a first row may be coupled to a first zero input data detector 306a and a first zero weight detector 308a, PE 10 in a second row may be coupled to a second zero input data detector 306b and a second zero weight detector 308b, PE 20 in a third row may be coupled to a third zero input data detector 306c and a third zero weight detector 308c, and PE X0 in an Xth row may be coupled to an Xth zero input data detector 306x and an Xth zero weight detector 308x. The first zero input data detector 306a, the second zero input data detector 306b, the third zero input data detector 306c, ..., and the Xth zero input data detector 306x may detect zero values on corresponding reduced input data elements in input data set 0, input data set 1, input data set 2, ..., and input data set x, respectively. Similarly, the first zero weight detector 308a, the second zero weight detector 308b, the third zero weight detector 308c, ..., and the Xth zero weight detector 308x may detect zero values on corresponding reduced weight values in filter 0, filter 1, filter 2, ..., and filter x, respectively.
Each zero input data detector and each zero weight detector in each row of systolic array 302 may be coupled to a respective reducer to receive a reduced input. Each zero input data detector may receive a reduced input data element and each zero weight detector may receive a reduced weight. For example, the first zero input data detector 306a may be coupled to a first reducer 307a and the first zero weight detector 308a may be coupled to a second reducer 309a, the second zero input data detector 306b may be coupled to a third reducer 307b and the second zero weight detector 308b may be coupled to a fourth reducer 309b, the third zero input data detector 306c may be coupled to a fifth reducer 307c and the third zero weight detector 308c may be coupled to a sixth reducer 309c, and the Xth zero input data detector 306x may be coupled to an Xth reducer 307x and the Xth zero weight detector 308x may be coupled to a Yth reducer 309x.
The reducers 307a-307x and 309a-309x may be implemented as separate entities external to the systolic array 302. For example, reducers 307a-307x and 309a-309x may be part of a separate circuit from the systolic array. In some embodiments, the circuit and systolic array 302 may be part of a computation engine that may perform arithmetic computations of convolution operations. In other embodiments, reducers 307a-307x and 309a-309x may be implemented as part of systolic array 302.
In some embodiments, the first and second reducers 307a and 309a may form a first shared reducer, the third and fourth reducers 307b and 309b may form a second shared reducer, the fifth and sixth reducers 307c and 309c may form a third shared reducer, and the Xth and Yth reducers 307x and 309x may form an Xth shared reducer. Each shared reducer may provide a reduced input data element and a reduced weight. In some implementations, each shared reducer may include one output bus and may select the reduced input to be generated. In other implementations, each shared reducer may include multiple output buses and may output reduced input data elements and reduced weights.
The zero input data detectors 306a-306x and/or the zero weight detectors 308a-308x may be arranged before the respective reducers 307a-307x, 309a-309x such that a zero input may be detected and, if a zero input is detected, the respective reducer 307a-307x, 309a-309x may not operate, in order to save power. In some embodiments, both the zero input data detectors 306a-306x and the respective reducers 307a-307x may receive the input data sets and operate in parallel rather than sequentially. In addition, both the zero weight detectors 308a-308x and the respective reducers 309a-309x may receive the filters and operate in parallel rather than sequentially.
Each of input data set 0, input data set 1, input data set 2, ..., and input data set x may belong to an image, text, a video clip, an audio clip, or another type of data set that may need to be processed by a neural network processor for convolution computation.
In some cases, input data set 0, input data set 1, input data set 2, ..., and input data set x may be associated with output data set 0, output data set 1, output data set 2, ..., and output data set y generated by an intermediate layer of the convolution operation. For example, output data set 0, output data set 1, output data set 2, ..., and output data set y may be fed back to systolic array 302, after passing through an activation function, as input data set 0, input data set 1, input data set 2, ..., and input data set x. Filter 0, filter 1, filter 2, ..., and filter x may include different sets of weight values to convolve with input data set 0, input data set 1, input data set 2, ..., and input data set x. The weight values in filter 0, filter 1, filter 2, ..., and filter x may be predetermined using supervised learning, unsupervised learning, or any suitable method of determining a convolution filter.
Each zero input data detector of a respective row may detect whether a reduced input data element from the input data set entering the respective row is a "0" and generate a corresponding zero input data indicator for the reduced input data element. In addition, each zero input data detector of a respective row may also detect whether an input data element from the input data set entering the respective reducer is a "0" and generate a corresponding zero input data indicator for that input data element. The corresponding zero data element indicator may be passed into the first PE of the respective row along with the reduced input data element. For example, PE 00 can be the first PE in the first row in systolic array 302. PE 00 can receive a reduced input data element from input data set 0 prior to other PEs in the first row (e.g., PE 01, PE 02, … …, PE 0 y). In some embodiments, the reduced input data elements may be sequentially fed to PE 00 from input data set 0 one at a time within a uniform time period. The first zero input data detector 306a may generate a zero data element indicator 226 for each input data element from the input data set 0 for each of the uniform time periods (e.g., clock cycles). The zero data element indicator 226 may be fed sequentially to the PE 00 with each reduced input data element over a uniform period of time. PE 00 may or may not store the reduced input data element 222 based on the value of the corresponding data load signal 242. In some implementations, the first zero input data detector 306a may include a comparator to compare an incoming reduced input data element to zero to assert (e.g., set to "1") or de-assert (e.g., set to "0") the zero data element indicator 226 based on the value of the incoming reduced input data element. For example, the comparator may be implemented using OR, XOR, NAND or any suitable circuit.
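A bit-level software model of such a zero detector is sketched below; the 22-bit width is illustrative, and the OR-reduction mirrors one of the comparator implementations mentioned above rather than any particular circuit.

    def zero_indicator(bits):
        """bits: iterable of 0/1 values of one reduced input data element."""
        any_set = 0
        for b in bits:
            any_set |= b            # OR-reduction across the word
        return 0 if any_set else 1  # assert ("1") only for an all-zero word

    assert zero_indicator([0] * 22) == 1           # zero value detected
    assert zero_indicator([0, 1] + [0] * 20) == 0  # non-zero value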
Each zero weight detector of a respective row may detect whether a reduced weight from a set of reduced weights entering the respective row is zero and generate a corresponding zero weight indicator for the reduced weight. In addition, each zero weight detector may also detect whether a weight from the set of filters entering the respective reducer is zero, and generate a corresponding zero weight indicator for that weight. For example, the first zero weight detector 308a may detect whether a reduced weight (e.g., reduced weight 224) from filter 0 includes a zero value and generate a zero weight indicator 228 for the reduced weight. In some implementations, the first zero weight detector 308a may include a comparator to compare the reduced weight to zero to assert (e.g., set to "1") or de-assert (e.g., set to "0") the zero weight indicator 228. For example, the comparator may be implemented using OR, XOR, NAND or any suitable circuit. In one embodiment, the respective reduced weights for PE 00 through PE 0y may be sequentially fed from filter 0, one at a time, to PE 00 within a uniform time period in order to preload PE 00 through PE 0y prior to starting the arithmetic computations. The first zero weight detector 308a may generate a corresponding zero weight indicator for each of those reduced weights, which may be sequentially fed to PE 00 along with the corresponding reduced weights over a uniform period of time. PE 00 may sequentially pass the respective reduced weight and the corresponding zero weight indicator to the next adjacent PE until all PEs in the first row are preloaded with the respective reduced weight and the corresponding zero weight indicator. The respective reduced weights and the corresponding zero weight indicators may be cached in each PE prior to feeding the respective reduced input data elements to each row in systolic array 302.
The second, third, ..., and Xth zero input data detectors 306b, 306c, ..., and 306x may be similar to the first zero input data detector 306a, and may generate corresponding zero data element indicators, similar to the zero data element indicator 226, to be sequentially provided to PE 10, PE 20, ..., and PE X0 for power optimization within a uniform time period. The respective zero data element indicators generated for each row may be received by the respective first PE in each row via the respective row input bus 102 and propagated sequentially by the first PE to all PEs in a given row within a uniform time period. The second, third, ..., and Xth zero weight detectors 308b, 308c, ..., and 308x may be similar to the first zero weight detector 308a, and may generate corresponding zero weight indicators, similar to the zero weight indicator 228, to be sequentially provided to PE 10, PE 20, ..., and PE X0 so that each PE in a respective row may be preloaded with a respective weight value prior to initiating the arithmetic computations.
In some implementations, the zero input data detectors 306a-306x and the zero weight detectors 308a-308x may be implemented as separate entities external to the systolic array 302. For example, the zero input data detectors 306a-306x and the zero weight detectors 308a-308x may be part of the circuit 304. In other embodiments, the circuit 304 and systolic array 302 may be part of a computing engine that may perform arithmetic computations of convolution operations. Some implementations of the disclosed technology may provide reduced gate count and dynamic power consumption by detecting zeros on input data elements and weights of respective first PEs into each row of the systolic array, and passing a zero indicator to all PEs in the array, as compared to using respective zero detectors within each PE in the systolic array 302.
It is noted that, for ease of illustration, FIG. 3 only shows the respective zero data element indicators and zero weight indicators entering the first PE of each row of systolic array 302; however, it should be understood that each PE in the respective row of systolic array 302 may also receive the respective reduced input data elements and the respective reduced weights along with certain control signals (e.g., the opcode 230, the weight load 232, the data type, etc.), which may propagate from left to right of systolic array 302 for each row.
FIG. 4A illustrates an example reduction system 400A (e.g., a 32-bit floating point ("FP32") reduction system) according to an example implementation. The reduction system 400A includes a multiplexer 402, a rounding identifier 404, and a reducer 405. The reducer 405 may reduce an input having an arbitrary bit length to the maximum bit length supported by elements of the systolic array during a single pass computation. For example, reducer 405 may reduce the input to a 22-bit input, where 22 bits are the maximum bit length supported by the multipliers of the systolic array. The reducer 405 may include an exponent extender 406, a rounder 408, and a trailing bit reducer 410. In some implementations, the reducer 405 can include an exponent extender 406. In other embodiments, reducer 405 may not include exponent extender 406. For example, the reducer 405 may not extend the exponent of the input to generate a reduced input. In some implementations, the multiplexer 402 may be separate from the reducer 405. In other implementations, the reducer 405 may include the multiplexer 402. As previously discussed, the reducer 405 processes the original number 401A to produce a reduced number 403A.
The reduction system 400A may receive one or more numbers to be reduced. The one or more numbers may include one or more of the input data elements 221 and/or weights 223. For example, the reduction system 400A may receive FP32 weights and FP32 input data elements. In some implementations, the reduction system 400A may receive the input data elements 221 or weights 223 without a multiplexer.
Multiplexer 402 may receive one or more numbers received by the reduction system 400A. The multiplexer 402 may also receive the opcode 230 or another indicator of whether a weight or an input data element should be selected. The multiplexer 402 may decode the opcode 230 to select the number to be operated on by the reduction system 400A. The multiplexer 402 may output different numbers for the reduction operation based on the value of the opcode 230. In some implementations, a first opcode value may correspond to an instruction to output the weight 223 as the multiplexer output 420 and a second opcode value may correspond to an instruction to output the input data element 221 as the multiplexer output 420. For example, once the input data element 221 and the weight 223 have been provided to the reduction system 400A, the multiplexer 402 may output the input data element 221 based at least in part on the opcode 230, and later output the weight 223.
In the example of FIG. 4A, the original number 401A is an FP32 number having a sign bit portion, an exponent bit portion, and a significand bit portion. It should be appreciated that the original number 401A may be a number of any arbitrary bit length having any exponent bit length and/or significand bit length. The FP32 format of the original number 401A includes a 1-bit sign, an 8-bit exponent, and a 23-bit significand. In some embodiments, the original number 401A may include more, fewer, or different bits. In addition, the original number 401A may include more, fewer, or different bits for the sign bit portion, the exponent bit portion, and/or the significand bit portion.
Exponent extender 406 may receive the 8-bit exponent 428 from the original number 401A. Exponent extender 406 may increase the number of bits representing exponent 428 from 8 bits to 10 bits. In some implementations, exponent extender 406 may add 1, 2, 3, or any number of bits to exponent 428. The number of bits added may be sufficient to represent the number in the format expected by the PE (e.g., a PE may expect a 10-bit exponent). In other implementations, the exponent extender 406 may not add any bits to exponent 428. For example, exponent extender 406 (or another component) may determine that a sufficient number of bits are already included in exponent 428 and may not extend exponent 428.
Exponent extender 406 may extend exponent 428 while maintaining the value of exponent 428. Exponent extender 406 may extend the exponent using a range conversion by copying the most significant bit, appending an inverted copy of the most significant bit, and appending the other bits of exponent 428 to the end of the extended exponent 434. For example, if the value of exponent 428 is "10101010", exponent extender 406 may copy the most significant bit "1", append the inverted copy "0", and append the last seven bits "0101010" such that the extended exponent 434 is "100101010". In some implementations, if the exponent starts with a leading zero, the exponent extender 406 may perform a different operation. For example, exponent extender 406 may extend the exponent by copying the most significant bit, appending a second copy of the most significant bit, and appending the other bits of exponent 428 to the end of the extended exponent 434. For example, if exponent 428 is "00000000", exponent extender 406 may extend exponent 428 such that the extended exponent 434 is "000000000". In some implementations, exponent extender 406 may add additional bits of data to any location of the exponent field, depending on the endianness and the signed or unsigned representation of the exponent. Thus, exponent extender 406 may extend exponent 428 to generate the extended exponent 434.
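The bit manipulation described above can be sketched as follows; this follows the worked bit patterns in this paragraph (adding one bit), and the exact output width is illustrative since other parts of this description use a 10-bit extended exponent.

    def extend_exponent(exponent):
        """exponent: bit string such as '10101010', most significant bit first."""
        msb, rest = exponent[0], exponent[1:]
        if int(exponent, 2) == 0:
            # All-zero exponent: prepend a second copy of the (zero) MSB.
            return msb + msb + rest
        # Otherwise prepend the MSB followed by an inverted copy of the MSB.
        inverted = '0' if msb == '1' else '1'
        return msb + inverted + rest

    assert extend_exponent('10101010') == '100101010'
    assert extend_exponent('00000000') == '000000000'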
The exponent extender 406 may provide the extended exponent 434 as the 10-bit extended exponent field of the reduced number 403A.
The reducer 405 may also receive the rounding identifier 404. The rounding identifier 404 may identify the type of rounding to be performed by the reducer 405. For example, the rounding identifier 404 may identify a rounding method, such as random (stochastic) rounding, rounding to the nearest even, rounding to zero, rounding down, rounding up, or any other rounding method. Random rounding may include randomly rounding to the next larger or smaller number. For example, random rounding may include rounding down with 50% probability and rounding up with 50% probability. Alternatively, in random rounding, the probability of rounding up or down may be based on the relative position of the number to be rounded. For example, a number x between y and z may have a first probability of rounding up to z equal to (x-y)/(z-y) and a second probability of rounding down to y equal to (z-x)/(z-y), where y and z may be any numbers and x may be any number between y and z. Rounding to the nearest even may include rounding to the nearest even value representable with a particular number of bits, rounding to zero may include rounding toward zero at a particular bit position, rounding up may include rounding up at a particular bit position, and rounding down may include rounding down at a particular bit position. The rounding identifier 404 may be provided by a user (e.g., via a user interface), another system, etc. In addition, the rounding identifier 404 may be a custom rounding identifier or a default rounding identifier.
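The probability-weighted random rounding described above can be sketched as follows (illustration only; the hardware implementation is not specified here).

    import random

    def stochastic_round(x, y, z):
        """Round x (with y <= x <= z) up to z with probability (x-y)/(z-y)."""
        probability_up = (x - y) / (z - y)
        return z if random.random() < probability_up else y

    # On average the result is unbiased: 2.25 rounds up to 3.0 about 25% of
    # the time and down to 2.0 about 75% of the time.
    samples = [stochastic_round(2.25, 2.0, 3.0) for _ in range(10000)]
    print(sum(samples) / len(samples))   # close to 2.25 in expectation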
The reducer 405 may include a rounder 408 to round the significand 430. The rounder 408 may perform rounding based on the rounding method identified by the rounding identifier 404. For example, the rounding method may be random rounding, rounding to the nearest even, rounding to zero, rounding down, rounding up, or any other rounding method. The rounder 408 may perform rounding based on any bit of the significand. In addition, the rounder 408 may determine the number of bits to be reduced (e.g., the number of bits to be zeroed out) by the trailing bit reducer 410 and may initiate rounding at the bit immediately preceding the bits to be reduced. In addition, the rounder 408 may round away the bits to be reduced by the trailing bit reducer 410. For example, if the significand 430 includes the bits "1110111" and the trailing bit reducer 410 determines that it is to reduce the three trailing bits (e.g., the first three bits read from right to left), the rounder 408 may perform rounding based on the "0" in position 4. If the rounder 408 determines to perform rounding to zero, the rounder 408 may generate a rounded significand 432 of "1110000"; if the rounder 408 determines to perform rounding up, the rounder 408 may generate a rounded significand 432 of "1111000"; and so on. In some implementations, the rounder 408 may logically follow the trailing bit reducer 410, and the rounder 408 may round the reduced significand.
The reducer 405 may also include a trailing bit reducer 410 to reduce the bit representation of the rounded significand 432. The trailing bit reducer 410 may receive the rounded significand 432 as an input. The trailing bit reducer 410 may identify the number of bits to be reduced from the rounded significand 432. The number of bits to be reduced may be based on the difference between the bit length of the rounded significand 432 and the maximum single-pass computation bit length supported by elements of the systolic array. Additionally, the number of bits may be based on a user input or a system input (e.g., an input identifying the maximum number of bits supported). The bits to be reduced may be the trailing bits (e.g., the rightmost or least significant bits) of the rounded significand 432. For example, if the trailing bit reducer 410 determines that 3 bits should be reduced from the rounded significand 432, the trailing bit reducer 410 may identify 3 bits from right to left in the rounded significand 432. These bits may correspond to position 0, position 1, and position 2 within the original number 401A. The trailing bit reducer 410 may identify the bits and zero out those bits (e.g., reduce or eliminate them, or force them to a logic zero). In the example of FIG. 4A, the trailing bit reducer 410 identifies that 12 bits should be reduced from the rounded significand 432, and zeroes out the trailing 12 bits of the rounded significand 432. By reducing the bit representation of the rounded significand 432, the trailing bit reducer 410 can generate a reduced significand 436 that includes only the non-reduced (non-zeroed) bits of the significand 430.
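Using the "1110111" example above, the combined effect of the rounder 408 and the trailing bit reducer 410 can be sketched as follows; carries out of the kept bits (into the exponent) are not modeled in this simplified illustration.

    def round_and_reduce(significand, bits_to_drop, mode):
        """significand: bit string, MSB first; mode: 'zero' or 'up'."""
        kept, dropped = significand[:-bits_to_drop], significand[-bits_to_drop:]
        if mode == 'up' and int(dropped, 2) != 0:
            # Round up by incrementing the kept bits (carry-out not modeled).
            kept = format(int(kept, 2) + 1, '0{}b'.format(len(kept)))
        # Round toward zero keeps the bits unchanged; the trailing bit
        # reducer then forces the dropped positions to logic zero.
        return kept + '0' * bits_to_drop

    assert round_and_reduce('1110111', 3, 'zero') == '1110000'
    assert round_and_reduce('1110111', 3, 'up') == '1111000'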
The trailing bit reducer 410 may provide the reduced significand 436 as the 11-bit rounded significand of the reduced number 403A.
The reduced number 403A may have a second bit length, where the second bit length is any number of bits less than the first bit length. In some embodiments, the second bit length may be the maximum bit length supported by elements of the systolic array. It should be appreciated that the reduced number 403A may be a number of any arbitrary bit length having any exponent bit length and/or significand bit length. In the example of FIG. 4A, the reduced number 403A may be a 22-bit floating point number having a sign bit portion, an exponent bit portion, and a significand bit portion, while the original number 401A may be a 32-bit floating point number. The reduced number 403A may include a 1-bit sign (e.g., sign 426), a 10-bit exponent (e.g., extended exponent 434), and an 11-bit significand (e.g., reduced significand 436). The reduction system 400A may provide the reduced number 403A as a reduced output 421. The reduced output 421 may be a reduced input data element 222, a reduced weight 224, or any other reduced number.
Fig. 4B illustrates an example reduction system 400B (e.g., a 32-bit floating point ("FP 32") reduction system) according to an example implementation. The reduction system 400B may include a reducer 405 that may reduce an input having an arbitrary bit length to a maximum bit length supported by elements of the systolic array during a single pass computation. For example, reducer 405 may reduce the input to a 22-bit input, where 22 bits are the maximum bit length supported by the multipliers of the systolic array. The reduction system 400B includes similar components to the reduction system 400A except that in fig. 4B, the original number 401B is rounded by the system before being provided to the reduction system 400B.
In the example of FIG. 4B, the original number 401B may be an FP32 number having a sign bit portion, an exponent bit portion, and a significand bit portion. It should be appreciated that the original number 401B may be a number of any arbitrary bit length having any exponent bit length and/or significand bit length. The FP32 format of the original number 401B includes a 1-bit sign, an 8-bit exponent, and a 23-bit rounded significand. In some implementations, the original number 401B may include any number of bits or be associated with any other bit format. The 23-bit significand may be rounded by a system external or internal to the reduction system 400B.
The reducer 405 may also include a trailing bit reducer 410 to reduce the rounded significand 450. The trailing bit reducer 410 may receive the rounded significand 450 as an input and reduce the number of bits (e.g., from 23 bits to 11 bits) representing the rounded significand 450. The trailing bit reducer 410 may generate a reduced significand 452 that includes only the non-reduced (non-zeroed) bits of the rounded significand 450. In addition, the trailing bit reducer 410 can provide the reduced significand 452 as the 11-bit rounded significand of the reduced number 403B.
In some implementations, the reduction system 400B may not receive the rounding identifier 404. For example, the rounding identifier 404 may instead be provided to the system that performs the rounding and generates the rounded significand 450, in order to identify the rounding method. The reduction system 400B may provide the reduced number 403B as a reduced output 441. The reduced output 441 may be the reduced input data element 222, the reduced weight 224, or any other reduced number.
Fig. 4C illustrates an example reduction system 400C (e.g., a 32-bit floating point ("FP 32") reduction system) according to an example implementation. The reduction system 400C may include a reducer 405 that may reduce an input having an arbitrary bit length to a plurality of reduced inputs having a maximum bit length supported by elements of the systolic array during a single pass computation. For example, reducer 405 may reduce the input to a 21-bit input, where 21 bits are the maximum bit length supported by the multipliers of the systolic array. The reduction system 400C includes similar components to the reduction systems 400A and 400B except that in fig. 4C, the reducer 405 converts the original number 401C into a plurality of reduced inputs.
In the example of FIG. 4C, the original number 401C may be an FP32 number having a sign bit portion, an exponent bit portion, and a significand bit portion. It should be appreciated that the original number 401C may be a number of any arbitrary bit length having any exponent bit length and/or significand bit length. The FP32 format of the original number 401C includes a 1-bit sign, an 8-bit exponent, and a 23-bit significand. In some implementations, the original number 401C may include any number of bits or be associated with any other bit format.
The original number 401C, as input 454, may be provided to a format detector 456 for normal and/or denormal detection. For example, format detector 456 may be a denormal detector and/or a normal detector. Format detector 456 may detect whether input 454 is normal or denormal based at least in part on at least one of the value of the 1-bit sign, the value of the 8-bit exponent, or the value of the 23-bit significand. For example, format detector 456 may detect a denormal number when the 8-bit exponent contains a zero in each bit and the significand is non-zero. Format detector 456 may provide an enable signal 458 to normalizer 455 based at least in part on the detection of a normal number. For example, if format detector 456 detects that input 454 is normal, format detector 456 may provide a first value to normalizer 455. If format detector 456 detects that input 454 is denormal, format detector 456 may provide a second value to normalizer 455. In some implementations, the first value can be 1 and the second value can be 0. The detection of a normal number may correspond to a logic high and the detection of a denormal number may correspond to a logic zero. In some implementations, format detector 456 may detect normal numbers by zeroing out the significand 450 (e.g., replacing the significand 450 with zero) and subtracting the original number 401C having the reduced significand 451 from the original number 401C having the zeroed-out significand to generate a normal identifier. In addition, if the original number 401C is normal, the normal identifier may contain an implied leading bit, and if the original number 401C is denormal, the normal identifier may be equal to zero.
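A minimal software model of this normal/denormal detection (using the FP32 field widths of the original number 401C) is sketched below; it is an illustration, not the detection circuit.

    def format_detector(exponent_bits, significand_bits):
        """Return 1 (normal, enable signal asserted) or 0 (denormal)."""
        denormal = int(exponent_bits, 2) == 0 and int(significand_bits, 2) != 0
        return 0 if denormal else 1

    assert format_detector('00000000', '00000000000000000000001') == 0  # denormal
    assert format_detector('10000001', '10100000000000000000000') == 1  # normal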
The reducer 405 may provide the 1-bit sign as the 1-bit sign of the reduced numbers 403C and 403D.
The reducer 405 may also include a trailing bit reducer 410 and a leading bit reducer 453 to reduce the significand 450. The trailing bit reducer 410 and the leading bit reducer 453 may receive the significand 450 as input and reduce the number of bits representing the significand 450 (e.g., from 23 bits to 11 bits). The trailing bit reducer 410 may generate a reduced significand 452 that includes only the non-reduced (non-zeroed) bits of the significand 450 by removing the trailing (or low) bits of the significand 450. The leading bit reducer 453 may generate a reduced significand 451 that includes only the non-reduced (non-zeroed) bits of the significand 450 by removing the high bits of the significand 450. In addition, the trailing bit reducer 410 may provide the reduced significand 452 as the 11-bit reduced significand of the reduced number 403C, and the leading bit reducer 453 may provide the reduced significand 451 as an input to the normalizer 455.
As described above, the reducer 405 may also include exponent extenders 406A and 406B to extend exponent 428. The exponent extender 406A may generate the extended exponent 434 and may provide the extended exponent 434 as the exponent of the reduced number 403C, and the exponent extender 406B may provide the extended exponent 433 as an input to the exponent adjuster 435.
The reducer 405 may include a normalizer 455 (e.g., a shifter). The normalizer 455 may be enabled based at least in part on the enable signal 458 received from the format detector 456. Normalizer 455 may receive the reduced significand 451 from the leading bit reducer 453. The normalizer 455 may shift the reduced significand 451 based at least in part on the number of leading zeros of the reduced significand 451 (as detected by the normalizer 455). Normalizer 455 may further shift the reduced significand 451 such that the first non-zero bit is shifted out of the reduced significand 451 and represented by an implied bit. Normalizer 455 may shift the reduced significand 451 by appending bits of logic low or zero to the right (i.e., the end) of the reduced significand 451. Normalizer 455 may generate a shifted significand 452, where the shifted significand 452 may have the same number of bits as the reduced significand 451. For example, if the reduced significand 451 is 00001100000, the normalizer 455 may count four leading zeros and further adjust the shift count to five, and the normalizer 455 may shift the reduced significand 451 five times in total and generate the shifted significand 452, i.e., 10000000000. Normalizer 455 may then provide the shifted significand 452 as the significand portion of the reduced number 403D. Where format detector 456 does not identify the original number 401C as a normal number (e.g., the original number 401C is a denormal number), normalizer 455 may provide the reduced significand 451 as the significand portion of the reduced number 403D. In some implementations, if format detector 456 determines that the original number 401C is normal, the reducer 405 may calculate a zeroed number by zeroing out the significand of the original number 401C. In addition, the reducer 405 may generate the significand of the reduced number 403D by subtracting the reduced significand from the zeroed number. In other implementations, the reduced number 403D may be determined by subtracting the reduced number 403C from the original number 401C.
The exponent extender 406B may provide the extended exponent 433 to the exponent adjuster 435 (e.g., a subtractor) based at least in part on the enable signal 458 (when the format detector 456 detects the normal format of the first input) and a signal 437 from the normalizer 455 associated with the renormalized significand 452. The exponent adjuster 435 may receive the extended exponent 433 from the exponent extender 406B and a leading-zero count from the normalizer 455. The leading-zero count may identify the number of leading zeros removed by normalizer 455 to renormalize the reduced significand 451. The exponent adjuster 435 may subtract a value from the extended exponent 433 based at least in part on the leading-zero count output by the normalizer 455. Thus, the exponent adjuster 435 may compensate the exponent value for the shift applied to the significand. For example, if the leading-zero output is equal to 5 and the extended exponent is equal to 000011111, or 31, the exponent adjuster 435 may subtract 5 from 000011111, or 31, such that the adjusted exponent 439 is equal to 000011010, or 26. The exponent adjuster 435 may provide the adjusted exponent 439 as the 9-bit extended exponent field of the reduced number 403D. Otherwise, the extended exponent 433 may be stored as the 9-bit extended exponent field of the reduced number 403D. In some implementations, the exponent extender 406B may generate the extended exponent 433 before the normalizer 455 normalizes the reduced significand 451. In other implementations, the exponent extender 406B may generate the extended exponent 433 after, or in parallel with, the normalization of the reduced significand 451.
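The interaction of the normalizer 455 and the exponent adjuster 435 can be sketched as follows, following the worked example above (a shift of five positions and an exponent adjusted from 31 to 26); this simplified sketch assumes the reduced significand contains at least one non-zero bit.

    def normalize(reduced_significand, extended_exponent):
        """Bit strings, MSB first; returns (shifted significand, adjusted exponent)."""
        width = len(reduced_significand)
        leading_zeros = width - len(reduced_significand.lstrip('0'))
        shift = leading_zeros + 1          # also shift out the leading one
        shifted = (reduced_significand + '0' * shift)[shift:shift + width]
        adjusted = int(extended_exponent, 2) - shift
        return shifted, format(adjusted, '0{}b'.format(len(extended_exponent)))

    significand, exponent = normalize('00001100000', '000011111')  # exponent 31
    assert significand == '10000000000'
    assert exponent == '000011010'                                 # exponent 26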
The reduction system 400C may provide the reduced number 403C and the reduced number 403D as reduced inputs 457 and 459 corresponding to the original number 401C. The reduced inputs 457 and 459 may be reduced input data elements 222, reduced weights 224, or any other reduced numbers.
Fig. 5 illustrates an exemplary multiply-accumulate data path 500. The exemplary data path 500 may be implemented as the multiplier 208 and adder 210 discussed with respect to fig. 2A and 2B. As shown in fig. 5, multiplier 208 may receive a reduced input data element 222 and a reduced weight 224 and provide the product to adder 210. Adder 210 may receive the product and the input partial sum 234 and provide an addition result 238. By converting the input to a reduced representation before presenting the input to multiplier 208, multiplier 208 may omit support for numbers having a larger bit length (e.g., 32 bits) and instead support numbers having a reduced bit length (e.g., 22 bits). Thus, a systolic array may maintain the performance associated with shorter-bit-length inputs while accepting inputs of arbitrary bit length, by adjusting each input to a particular bit length (e.g., the maximum bit length supported by the processing elements of the systolic array).
The reduced input data element 222 may be a 22-bit number. In some implementations, the reduced input data element 222 may have any bit length and/or any number of bits. Additionally, the reduced input data element 222 may be a floating point number. In some embodiments, the reduced input data element 222 may be a brain floating point number. In addition, the reduced input data element 222 may be a number of any data type. The reduced input data element 222 may be comprised of a sign bit field, an exponent field, and a significand field. Multiplier 208 may support different types of reduced input data elements. For example, the reduced input data element 222 may include a 1-bit sign, a 10-bit exponent, and an 11-bit significand. Additionally, the reduced input data element 222 may include a 1-bit sign, an 8-bit exponent, and an 11-bit significand. Multiplier 208 may support both types of reduced input data elements. In some implementations, the reduced input data element 222 may include an x-bit sign, a y-bit exponent, and a z-bit significand, where x, y, and z may be any numbers. The reduced input data element 222 may be provided to the multiplier 208 via a first sign data path 511, a first exponent data path 521, and a first significand data path 531.
The reduced weight 224 may be a 22-bit number. In some embodiments, the reduced weight 224 may have any bit length and/or any number of bits. Additionally, the reduced weight 224 may be a floating point number. In some embodiments, the reduced weight 224 may be a brain floating point number. In addition, the reduced weight 224 may be of any data type. The reduced weight 224 may be comprised of a sign bit field, an exponent field, and a significand field. For example, the reduced weight 224 may include a 1-bit sign, a 10-bit exponent, and an 11-bit significand. Additionally, the reduced weight 224 may include a 1-bit sign, an 8-bit exponent, and a 10-bit significand. In some implementations, the reduced weight 224 may include an x-bit sign, a y-bit exponent, and a z-bit significand, where x, y, and z may be any numbers. The reduced weight 224 may be provided to the multiplier 208 via a second sign data path 512, a second exponent data path 522, and a second significand data path 532.
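For illustration only, the following Python sketch packs and unpacks one of the 22-bit layouts mentioned above (a 1-bit sign, 10-bit exponent, and 11-bit significand). The field widths and helper names are assumptions; as noted, the multiplier 208 may support other layouts.

SIGN_BITS, EXP_BITS, SIG_BITS = 1, 10, 11  # assumed 22-bit layout

def pack(sign, exponent, significand):
    # Concatenate the sign, exponent, and significand fields into one word.
    return (sign << (EXP_BITS + SIG_BITS)) | (exponent << SIG_BITS) | significand

def unpack(word):
    significand = word & ((1 << SIG_BITS) - 1)
    exponent = (word >> SIG_BITS) & ((1 << EXP_BITS) - 1)
    sign = word >> (EXP_BITS + SIG_BITS)
    return sign, exponent, significand

word = pack(0, 31, 0b11111111011)
print(f"{word:022b}")  # the 22-bit reduced representation
print(unpack(word))    # (0, 31, 2043)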
Multiplier 208 may include a sign data path, an exponent data path, and a significand data path. Multiplier 208 may receive the first sign data path 511, first exponent data path 521, and first significand data path 531 from the reduced input data element 222. Multiplier 208 may receive the second sign data path 512, second exponent data path 522, and second significand data path 532 from the reduced weight 224. In some implementations, multiplier 208 may also receive a data type control signal. Multiplier 208 may perform a multiplication operation on the received inputs.
The sign data path of multiplier 208 may receive the first sign data path 511 and the second sign data path 512. The sign data path may output a partial sign data path 513 based at least in part on the first sign data path 511 and the second sign data path 512. In some embodiments, the sign data path may be implemented as an exclusive-or (XOR) function. The sign data path may provide the partial sign data path 513 to the adder 210.
The exponent data path of multiplier 208 may receive the first exponent data path 521 and the second exponent data path 522. The exponent data path of multiplier 208 may include adder 526. In some implementations, the exponent data path of multiplier 208 may include a mapper to adjust the output of multiplier 208 to a desired format for one or more components of the systolic array (e.g., an adder separate from adder 526). For example, an adder of the systolic array may expect (e.g., operate on) an input with an 11-bit exponent. Additionally, the mapper may receive the first exponent data path 521 and the second exponent data path 522 and perform a mapping operation to add one or more bits to the exponent of each of the reduced input data element 222 and the reduced weight 224.
Adder 526 may receive mapped or unmapped versions of first exponent data path 521 and second exponent data path 522. Adder 526 may perform addition on the two values received from first exponent data path 521 and second exponent data path 522. Adder 526 may also receive shift/carry information (not shown) from the significand data-path. Adder 526 may provide a partial exponent data path 523 based at least in part on the addition performed on the two values. The partial exponent data path 523 may be 10 bits or other range sufficient to accommodate the exponent sum without overflowing.
The significand data path of multiplier 208 may receive the first significand data path 531 and the second significand data path 532. The significand data path of multiplier 208 may include a binary multiplier 534 and a formatter 536. Binary multiplier 534 may multiply the value of the first significand data path 531 by the value of the second significand data path 532. Binary multiplier 534 may generate a multiplication product based on the multiplication operation. In some embodiments, the product may be an integer product, a floating point product, or any other product. In addition, binary multiplier 534 may generate a product having 8 bits, 16 bits, 32 bits, or any other number of bits. The product may have a bit length up to the maximum bit length supported by the elements of the systolic array during a single-pass calculation. Thus, the systolic array may receive an input of arbitrary bit length, and the reducer may reduce the input to a bit length corresponding to the maximum bit length supported by elements of the systolic array (e.g., multipliers of the processing elements). Binary multiplier 534 may also perform floating point multiplication, integer multiplication, or multiplication involving any other data type. Binary multiplier 534 may be implemented using a 16-bit multiplier data path, an 18-bit multiplier data path, or a multiplier data path having any number of bits. Binary multiplier 534 may provide the multiplier product to formatter 536. In some implementations, binary multiplier 534 may be implemented using a multiplier circuit.
The formatter 536 may adjust the format of the multiplication product generated by the binary multiplier 534. The significand data path of multiplier 208 may include the formatter 536 to adjust the output of multiplier 208 to a desired format for one or more components of the systolic array (e.g., an adder separate from adder 526). For example, an adder of the systolic array may expect (e.g., operate on) an input having a 23-bit significand. The formatter 536 may increase or decrease the number of bits used to represent the multiplication product, for example, by increasing the bit size to 23 bits. The formatter 536 may provide the partial significand data path 533 to the adder 210.
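A behavioral sketch of the three multiplier data paths described above is given below in Python. The tuple representation, the unbiased exponents, and the assumed 23-bit output significand width are illustrative assumptions rather than the actual circuit of fig. 5.

def multiply_reduced(a, b):
    # a and b are (sign, exponent, significand) tuples of a reduced input
    # data element and a reduced weight; exponents are treated as unbiased.
    sign_a, exp_a, sig_a = a
    sign_b, exp_b, sig_b = b
    partial_sign = sign_a ^ sign_b   # sign data path (XOR)
    partial_exp = exp_a + exp_b      # exponent data path (adder 526)
    raw_product = sig_a * sig_b      # significand data path (binary multiplier 534)
    # Formatter 536 (sketch): place the raw product into a 23-bit field for the
    # downstream adder; the product of two 11-bit significands fits in 22 bits.
    partial_sig = raw_product & ((1 << 23) - 1)
    return partial_sign, partial_exp, partial_sig

print(multiply_reduced((0, 12, 0b10000000000), (1, 7, 0b11000000000)))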
Adder 210 may include a sign data path, an exponent data path, and a significand data path. Adder 210 may be implemented with a given bit size (e.g., with an adder data path of a given size). In some implementations, each processing element may include an adder with a larger bit size and a multiplier with a smaller bit size, as an adder with an increased bit size may be more cost effective than a multiplier with the same increased bit size. Thus, the present disclosure enables systolic arrays to support larger bit sizes with reduced accuracy using multipliers of lower bit sizes. Adder 210 may receive the partial sign data path 513, the partial exponent data path 523, and the partial significand data path 533 from multiplier 208. Adder 210 may also receive the input partial sum 234. Adder 210 may perform an addition operation on the multiplier product, comprised of the partial sign data path 513, the partial exponent data path 523, and the partial significand data path 533, and the input partial sum 234. In some implementations, adder 210 may perform addition operations on both floating point numbers and brain floating point numbers. Additionally, adder 210 may be a 34-bit floating point adder, a 32-bit floating point adder, or an adder of any other bit length.
Adder 210 may generate addition result 238 based on the addition operation. The addition result 238 may be composed of a sign data path 515, an exponent data path 525, and a significand data path 535. In some embodiments, the addition result 238 may be an integer sum, a floating point sum, or any other sum. In addition, adder 210 may generate a sum having 8 bits, 16 bits, 32 bits, 34 bits, or any other number of bits. In some implementations, adder 210 may be implemented using binary adder circuitry.
Fig. 6 illustrates an apparatus 600 for neural network computation, in accordance with some embodiments of the disclosed technology. The apparatus 600 may be part of a computer system (e.g., a host server). For example, the host server may provide multi-tenant computing services for data processing applications such as image recognition services, text-based data processing (e.g., processing of search queries), audio data processing, video data processing, and the like. In some embodiments, the host device may operate a software application and communicate with the apparatus 600 to make predictions based on calculations using a predictive model of the neural network processor. For example, the host device may use the predictive model to make predictions by identifying information included in the input dataset of images, text, audio, video, etc.
The device 600 may include a neural network processor 602 coupled to a memory 614, a host interface 616, and a Direct Memory Access (DMA) controller 618 via an interconnect 620. The neural network processor 602 may include a compute engine 604, a compute controller 606, a state buffer 608, an output buffer 610, and an activation engine 612. The neural network processor 602 may provide computing resources to support computation of the predictive model. The neural network processor 602 may be implemented as a system on a chip (SoC), a Field Programmable Gate Array (FPGA), or any suitable circuitry.
The memory 614 may store instructions received from the host device, input data sets (e.g., pixel data of an image), and weights (e.g., weights corresponding to certain visual and/or non-visual features). The memory 614 may also store the output of the neural network processor 602 (e.g., one or more image recognition decisions made on the input image in the form of an output dataset). Memory 614 may include any suitable memory, such as Dynamic Random Access Memory (DRAM), synchronous DRAM (SDRAM), double data rate DRAM (DDR DRAM), storage Class Memory (SCM), flash memory, and the like.
The host interface 616 may enable communication between the host device and the neural network processor 602. For example, the host interface 616 may transmit memory descriptors between the host device and the neural network processor 602, including memory addresses of stored data (e.g., input data sets, weights, calculation results, etc.). Host interface 616 may include, for example, a peripheral component interconnect express (PCIe) interface, or any suitable interface for communicating with a host device. The host device may include a host processor and a host memory.
DMA controller 618 may perform DMA operations to transfer data between the neural network processor 602 and a host device. For example, as described above, the host device may store instructions, input data sets, and weights in the memory 614. The host device may provide the memory addresses of the stored instructions, data, and weights to the neural network processor 602 (e.g., in the form of memory descriptors). The neural network processor 602 may then obtain the stored instructions, data, and weights based on the memory addresses provided by the host device. The neural network processor 602 may also store the results of the computation (e.g., one or more image recognition decisions) in the memory 614 and provide the memory addresses of the stored results to the host device.
The state buffer 608 may provide a cache of data for computation at the compute engine 604. The data cached at the state buffer 608 may include, for example, input data sets and weights retrieved from the memory 614, as well as intermediate outputs of the computations at the compute engine 604. The cache may reduce the impact of memory access bottlenecks (e.g., caused by latency at the memory 614, DMA controller 618, interconnect 620, etc.) on the performance of the compute engine 604. The state buffer 608 may be an on-chip memory device and may include Static Random Access Memory (SRAM) or any suitable memory.
The computation controller 606 may provide control of the various components of the neural network processor 602 to perform neural network computations. In some implementations, the compute controller 606 may read instructions stored in the memory 614 and schedule execution of the instructions by the compute engine 604. In a first implementation, the compute controller 606 may perform scheduling of loading weights into the compute engine 604 before reading input data elements from the state buffer 608. For example, as described with reference to fig. 2A, 2B, 4A, and 4B, the compute controller 606 may provide the opcodes 230 and the weight loads 232 to the compute engine 604 based on instructions received from a host device. The compute controller 606 may provide the appropriate values of the opcodes 230 to the compute engine 604, which may be decoded by each PE in the compute engine 604 to perform the corresponding operations. For example, compute engine 604 may use weight load 232 and opcode 230 to preload weights in all PEs in compute engine 604. Once the weights have been preloaded, the computation controller 606 may execute a schedule of loading input data elements from the state buffer 608 sequentially into the computation engine 604 within a uniform time period to begin arithmetic computation.
In a second embodiment, the compute controller 606 may perform a schedule in which weights and input data elements are sequentially loaded from the state buffer 608 into the compute engine 604 within a uniform time period. The compute controller 606 may schedule the loading of the input data elements and weights in the respective first PEs of each row in the systolic array 302 using the respective row data buses. For example, the corresponding input data element and weight value may be loaded into the first PE of the corresponding row at each cycle.
In another implementation, the compute controller 606 may schedule loading of weights in the systolic array 302 for each row in parallel using the respective column data bus for each PE in a given row. For example, the weights for each row may be loaded in parallel in each cycle. In some implementations, the computing controller 606 may determine the data type of the input data set based on instructions received from the host device. The instruction may be in the form of an opcode. The data type may indicate the size and type of the input data element, e.g., 4-bit, 8-bit, 16-bit, signed, unsigned, or floating point.
The calculation engine 604 may perform calculations of the neural network. For example, the compute engine 604 may reduce the input provided to the systolic array to generate a reduced input. In addition, the compute engine 604 may determine the maximum supported bit length of the systolic array and generate a reduced input with the maximum supported bit length. In some embodiments, the compute engine 604 may include a set of PEs that perform one or more arithmetic operations involved in neural network computation. Each PE may perform a multiply-accumulate operation using the input data set and associated weights. For example, the compute engine 604 may include the systolic array 302 and the circuit 304 including the zero input data detectors 306a-306x and the zero weight detectors 308a-308x. In some embodiments, the zero input data detectors 306a-306x and the zero weight detectors 308a-308x may be external to the compute engine 604. The compute engine 604 may execute instructions scheduled by the compute controller 606 to sequentially load weights and input data sets from the state buffer 608 into the compute engine 604.
In a first implementation, the weights may be preloaded prior to reading the input data set from the state buffer 608, as described with reference to fig. 4. The respective zero weight indicator corresponding to each weight may be cached locally in each PE and the cached values may be used to perform arithmetic calculations of the respective input data element as the input data element is fed into the calculation engine 604 along with the corresponding zero data element indicator. In a second implementation, the weights and input data sets may be read simultaneously from the state buffer 608, as described with reference to fig. 5. The corresponding zero data element indicator and zero weight indicator may be provided by the respective zero detector circuit and propagated sequentially from one PE to another PE for the respective row. The weights and input data sets may be obtained from the state buffer 608 using one or more interfaces. In some embodiments, the calculation engine 604 may perform arithmetic calculations to reduce the dynamic power consumption of the systolic array 302 using the corresponding zero data element indicators and zero weight indicator signals as described with reference to fig. 2-5 and provide the calculation results to be stored in the output buffer 610.
The output buffer 610 may include a set of registers to store the output data set generated by the compute engine 604. In some embodiments, the output buffer 610 may also implement additional processing, such as, for example, pooling operations, to reduce the size of the stored output. In addition, the calculation engine 604 may be operable to perform calculations for a particular neural network layer, and the output buffer 610 may process the output of that neural network layer and store the processed output data set at the state buffer 608 (with or without processing by the activation engine 612). The calculation engine 604 may use the processed output data set as an intermediate output. In some implementations, the output buffer 610 can include adders to accumulate partial sums generated for different sets of filters and input data sets to generate a convolved output array. The final output values of the convolved output array may be retrieved by the computation controller 606 for storage at the state buffer 608.
The activation engine 612 may apply one or more activation functions (e.g., reLu functions) on the output of the output buffer 610. For example, the activation engine 612 may include one or more look-up tables (e.g., in the form of a multiplexer circuit) that may map an input to one of the candidate outputs representing the result of applying the activation function to the input. In some examples, the activation engine 612 may also include a bypass path to allow the output from the output buffer 610 to be directly stored at the state buffer 608 when the activation function is not applied.
Fig. 7 illustrates a method 700 performed by a compute engine 604 utilizing a systolic array (e.g., a set of processing elements) in accordance with some examples of the disclosed technology. The array may be similar to, for example, array 100A and include multiple PEs similar to, for example, PE 112a. The systolic array may include a plurality of PEs configured in a plurality of rows and/or a plurality of columns. For example, a systolic array may include 65,536 PEs divided into 256 rows and 256 columns. The compute engine 604 may be a systolic circuit that includes a systolic array and one or more reducers (e.g., converters) to receive inputs having any bit length and to convert those inputs to inputs having a reduced bit length corresponding to the maximum supported bit length of elements of the systolic array. For example, one or more reducers may convert multiple input data elements (e.g., 32-bit input data elements) to multiple reduced input data elements (e.g., 22-bit input data elements) and/or multiple weights (e.g., 32-bit weights) to multiple reduced weights (e.g., 22-bit weights).
In block 702, a first reducer receives a first input (e.g., a first number) having a first bit length (e.g., 32 bits). The first input bit length may be any bit length. The first input may be represented in a floating point format. In addition, the first reducer may identify the number of trailing bits of the first input and reduce the number of trailing bits of the first input. The first input may represent an input data element. The first reducer may convert 32-bit floating point numbers to 22-bit floating point numbers. In some embodiments, the first reducer may convert an m-bit floating point number to an n-bit floating point number, where n and m may be any number, where n is less than m.
In block 704, the first reducer generates a first reduced input having a second bit length (e.g., 22 bits). The second bit length may be a maximum bit length supported by elements of the systolic array. For example, the first reduced input may be a 22-bit floating point number. In addition, the second bit length may be less than the first bit length (e.g., the second bit length may be any bit length less than the first bit length). The first reducer may generate the first reduced input based on reducing the number of trailing bits of the first input. To generate the first reduced input (or any other reduced input), the first reducer may include a trailing bit reducer to reduce the number of trailing bits representing the significand portion of the first input and to generate a reduced significand portion of the first input (e.g., a 32-bit first input). For example, the trailing bit reducer may zero a number of trailing bits. Additionally, the first reducer may include a rounder to round the reduced significand portion of the first input based at least in part on the remainder of the bits representing the significand portion of the first input that are not included within the reduced significand portion (e.g., the trailing bits of the first input that were removed). For example, rounding the first input may include rounding a portion of the bits of the first input. The rounder may further round the first input to a particular number (e.g., a particular floating point number). In some implementations, the rounder may round the significand portion and the trailing bit reducer may generate a reduced significand portion from the rounded significand portion (e.g., the first input may be a first rounded input to the trailing bit reducer). In other embodiments, the first reducer may not include a rounder and the significand portion may be pre-rounded (e.g., rounded or not rounded by another system). The rounder may round the input based on one or more of stochastic rounding, rounding to the nearest even, rounding toward zero, rounding down, rounding up, or any other rounding method. Stochastic rounding may include rounding the input up to a first number or down to a second number with probabilities weighted by the relative distance between the input and the first number and the relative distance between the input and the second number, respectively. In some implementations, the input may be rounded based on user input (e.g., selection of a rounding method). The first reducer may further include an exponent extender for increasing the number of bits representing the exponent portion of the first input. In some embodiments, the first reduced input may be stored in a 24-bit format.
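The following Python sketch illustrates trailing-bit reduction with optional stochastic rounding on an FP32 value. The choice of 11 kept significand bits, the helper name, and the use of the host float format are assumptions for illustration; they do not reproduce the reducer hardware.

import random
import struct

def reduce_trailing_bits(value, keep_bits=11, stochastic=False):
    # Zero the trailing bits of the 23-bit FP32 significand, keeping
    # `keep_bits` leading significand bits; optionally round stochastically.
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    drop = 23 - keep_bits
    remainder = bits & ((1 << drop) - 1)   # bits that will be discarded
    bits &= ~((1 << drop) - 1)             # trailing bit reducer: zero them
    if stochastic and random.random() < remainder / (1 << drop):
        bits += 1 << drop                  # round up with distance-weighted probability
    return struct.unpack(">f", struct.pack(">I", bits))[0]

x = 1.2345678
print(reduce_trailing_bits(x), reduce_trailing_bits(x, stochastic=True))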
In some implementations, the first reducer can generate the second input. In other implementations, the calculation engine 604 may include a second reducer for receiving weights in a floating point format having the first bit length. The second reducer may identify the number of trailing bits of the weights and reduce the number of trailing bits of the weights. In addition, the second reducer may generate weights in a floating point format having the second bit length based on reducing the number of trailing bits of the weights. For example, the second input may be a second 22-bit floating point number.
In block 706, each processing element in at least one row of the systolic array multiplies the first reduced input by a second input (e.g., a second number) to generate a multiplication product. In some embodiments, the second input may be a second reduced input. For example, the second input may be a reduced weight. The first reducer may receive a first input and a weight and generate the first reduced input and the second input. In addition, the first reducer may select either the first reduced input or the second input to be provided to the respective processing element. Each processing element may include a multiplier for multiplying the first reduced input by the second input. For example, each processing element may include a 22-bit multiplier. In addition, each processing element may include a multiplier for multiplying at least two numbers having the second bit length (e.g., n-bit numbers). In addition, the multiplier may multiply two 22-bit floating point numbers. The multiplier may include a 1-bit sign data path, an 11-bit significand data path, and a 10-bit exponent data path.
In block 708, the respective processing element adds the input partial sum to the multiplier product to generate an adder partial sum (e.g., an addition result). Each processing element may also include an adder for adding the input partial sum to the multiplier product. For example, each processing element may include a 34-bit adder. In addition, each processing element may include an adder for adding at least two numbers having a third bit length (e.g., p-bit numbers, where p is greater than the n of the n-bit numbers received by the multiplier). In addition, the adder may add two floating point numbers. The adder may include a 1-bit sign data path, a 23-bit significand data path, and a 10-bit exponent data path.
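As a compact, hedged sketch of blocks 706 and 708 for a single processing element, the following Python function performs the multiply followed by the accumulate. It treats the reduced inputs as ordinary floats once decoded, which is an assumption for illustration; the bit-width split between the narrower multiplier and the wider adder is a hardware property not modeled here.

def pe_multiply_accumulate(reduced_input, reduced_weight, input_partial_sum):
    product = reduced_input * reduced_weight  # block 706: reduced-width multiplier
    return input_partial_sum + product        # block 708: wider (e.g., 34-bit) adder

print(pe_multiply_accumulate(1.5, 2.0, 10.0))  # 13.0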
Fig. 8 illustrates a method 800 performed by the compute engine 604 with a systolic array in accordance with some examples of the disclosed technology. The array may be similar to, for example, array 100A and include multiple PEs similar to, for example, PE 112a. The systolic array may include a plurality of PEs configured in a plurality of rows and/or a plurality of columns. For example, a systolic array may include 65,536 PEs divided into 256 rows and 256 columns. The compute engine 604 may be a systolic circuit that includes a systolic array and one or more reducers (e.g., converters) to receive inputs having any bit length and to convert those inputs to a plurality of reduced inputs having reduced bit lengths corresponding to the maximum supported bit lengths of elements of the systolic array. For example, one or more reducers may convert each of a plurality of input data elements (e.g., 32-bit input data elements) to a plurality of reduced input data elements (e.g., 21-bit input data elements) and/or convert each of a plurality of weights (e.g., 32-bit weights) to a plurality of reduced weights (e.g., 21-bit weights).
In block 802, a systolic array (e.g., a reducer of the systolic array) receives a first input (e.g., an input data element, a weight, etc.) in a floating point format having a first bit length. For example, the first input may be a 32-bit floating point number. The systolic array may also receive a second input (e.g., input data elements, weights, etc.) for a multiply-accumulate operation. The reducer may convert an m-bit floating point number to one or more n-bit floating point numbers, where n may be any number less than m. For example, the reducer may convert a 32-bit floating point number to two 21-bit floating point numbers.
In block 804, the systolic array generates a first reduced input (e.g., a high reduced input) having a second bit length. The first reduced input may correspond to a set of most significant bits of the significand portion of the first input (e.g., leading bits of the significand portion of the first input).
In block 806, the systolic array generates a second reduced input (e.g., a low reduced input) having a third bit length. The second reduced input may correspond to a set of least significant bits of the significand portion of the first input (e.g., trailing bits of the significand portion of the first input). The first reduced input and the second reduced input may sum to the first input. In addition, the second bit length and the third bit length may be less than the first bit length of the first input. For example, the first reduced input and the second reduced input may each be 21-bit floating point numbers. In addition, the reducer may convert the input data elements and weights into respective first and second reduced inputs.
Each of the first reduced input and the second reduced input may be represented in a floating point format. In some implementations, the reducer can generate the first reduced input and subtract the first reduced input from the first input to generate the second reduced input. For example, if the first input includes a significand of "11111111011010101010101" and the first reduced input includes a significand of "11111111011", the second reduced input may be determined to be "010101010101" by subtracting the first reduced input from the first input. The first reduced input and the second reduced input may each have the maximum bit length supported by the systolic array and/or by a particular processing element. In some implementations, the reducer can include a first sub-reducer for generating the first reduced input. The first sub-reducer may include a trailing bit reducer for reducing the number of trailing bits of the significand portion of the first input to produce a high reduced significand portion. The first sub-reducer may further include a first exponent extender to increase the number of bits representing the exponent portion of the first input to produce a first increased exponent portion. Based on the first increased exponent portion and the high reduced significand portion, the first sub-reducer may generate the first reduced input (e.g., a high reduced input). In addition, the reducer may include a second sub-reducer for generating the second reduced input. The second sub-reducer may include a leading bit reducer for reducing the number of leading bits of the significand portion of the first input to produce a low reduced significand portion. The second sub-reducer may further include a second exponent extender for increasing the number of bits representing the exponent portion of the first input to produce a second increased exponent portion. Based on the second increased exponent portion and the low reduced significand portion, the second sub-reducer may generate the second reduced input (e.g., a low reduced input). In some embodiments, the second sub-reducer may further comprise: a format detector for detecting whether the first input is denormal or normal; a normalizer for removing the implied bit of the first input and renormalizing the low reduced significand portion to produce a normalized significand portion based on determining that the first input is normal; and an exponent adjuster for adjusting the second increased exponent portion based on the renormalized significand portion to produce an adjusted exponent portion. Additionally, the second reduced input may include the adjusted exponent portion and the normalized significand portion.
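For illustration, the following Python sketch splits the example significand above into high and low reduced parts by subtraction. The 23-bit original width and 11 high bits are assumptions consistent with the example; the sketch omits exponent extension and normalization.

SIG_BITS, HIGH_BITS = 23, 11  # assumed original and high-part widths

def split_significand(significand):
    low_width = SIG_BITS - HIGH_BITS
    high = significand >> low_width            # leading bits -> high reduced input
    low = significand - (high << low_width)    # subtracting leaves the trailing bits
    return high, low

high, low = split_significand(0b11111111011010101010101)
print(f"{high:0{HIGH_BITS}b}", f"{low:0{SIG_BITS - HIGH_BITS}b}")
# 11111111011 010101010101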
In block 808, the systolic array performs a plurality of multiply-accumulate operations on the first reduced input, the second reduced input, and the second input. The first input may be an input data element or a weight and the second input may be the other of the input data element or the weight. In some implementations, the second input may not be reduced. In other implementations, the systolic array may reduce the second input to generate a third reduced input and a fourth reduced input for the multiple multiply-accumulate operations. To perform the multiple multiply-accumulate operations, the systolic array may compute multiple partial sums. In addition, for each combination of high/low reduced inputs, the systolic array may calculate a partial sum. For example, the systolic array may include a processing element for multiply-accumulate operations on the reduced inputs. The processing elements may each include a multiplier for multiplying two 21-bit floating point numbers and an adder for adding two floating point numbers. In addition, the multiplier may include a 1-bit sign data path, an 11-bit significand data path, and a 9-bit exponent data path, and the adder may include a 1-bit sign data path, a 23-bit significand data path, and a 10-bit exponent data path. In addition, the reducer may generate the reduced inputs and select the reduced input to be provided for processing by the processing element. The plurality of operations may be a plurality of sequential multiply-accumulate operations (e.g., a plurality of multiply operations and a plurality of accumulate operations for the first input). The processing element may comprise a multiplier for multiplying at least two n-bit numbers and an adder for adding two p-bit numbers, where p may be any number greater than n. For example, the multiplier may be a 21-bit multiplier for multiplying two 21-bit numbers, and the adder may be a 34-bit adder. In addition, to perform the operations, the processing element may multiply the second reduced input by the second reduced weight to generate a first product, multiply the first reduced input by the second reduced weight to generate a second product, multiply the second reduced input by the first reduced weight to generate a third product, multiply the first reduced input by the first reduced weight to generate a fourth product, add the first product to the input partial sum to generate a first sum, add the first sum to the second product to generate a second sum, add the second sum to the third product to generate a third sum, and add the third sum to the fourth product to generate a total output. An arithmetic sketch of this four-pass combination is shown below.
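The arithmetic sketch below shows the four-pass combination as plain real-valued operations, with the input split into x_hi + x_lo and the weight split into w_hi + w_lo. The variable names are hypothetical and the sketch ignores per-pass rounding; it only demonstrates that the four partial products recombine to the full product.

def four_pass_mac(x_hi, x_lo, w_hi, w_lo, input_partial_sum):
    s = input_partial_sum
    s += x_lo * w_lo   # first product: second reduced input x second reduced weight
    s += x_hi * w_lo   # second product: first reduced input x second reduced weight
    s += x_lo * w_hi   # third product: second reduced input x first reduced weight
    s += x_hi * w_hi   # fourth product: first reduced input x first reduced weight
    return s           # equals input_partial_sum + (x_hi + x_lo) * (w_hi + w_lo)

x_hi, x_lo, w_hi, w_lo = 3.0, 0.125, 2.0, 0.25
assert four_pass_mac(x_hi, x_lo, w_hi, w_lo, 1.0) == 1.0 + (x_hi + x_lo) * (w_hi + w_lo)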
The systolic array may generate a full-precision total output from multiple portions of the first input and the second input (e.g., input data elements and weights) based on the reduced input. In some implementations, to generate the overall output, the systolic array may provide each sub-product to an adder (e.g., an accumulator). The adder may perform block-based accumulation on the outputs of the systolic array (e.g., each of the sub-products).
To better illustrate the operation of systolic arrays with multiple combinations of reduced inputs, figs. 9A-9H show an exemplary four-PE column 900 of a neural network computation processing multiply-accumulate operations within systolic intervals 0-9 of a systolic array according to some examples of the disclosed technology. PE column 900 may be part of a systolic array similar to systolic array 100A in fig. 1A, which may extend to any number of rows and any number of columns. In some implementations, the systolic array may perform a full multiply-accumulate operation for each combination of reduced inputs (e.g., low/high inputs and low/high weights), and the outputs of each operation may be summed.
PE column 900 comprises four PEs labeled PE00, PE10, PE20, and PE30 according to their Row and Column (RC) numbers. In the example of figs. 9A-9J, column 900 implements a two-pass multiply-accumulate operation. For example, an input data element may be converted into two reduced input data elements for multiply-accumulate operations. Weights may be preloaded into the array and used in each multiply-accumulate operation with the reduced inputs to generate an output. In some embodiments, the weights may also be converted to two (or any number of) reduced weights. A first reduced weight (e.g., a low reduced weight) from the weight may be preloaded for multiply-accumulate operations with the reduced input data elements, and a second reduced weight (e.g., a high reduced weight) from the weight may then be loaded for multiply-accumulate operations with the same reduced input data elements. The outputs of each combination of the reduced inputs and reduced weights may be summed to generate a total output. It should be appreciated that column 900 may implement an n-pass multiply-accumulate operation, where n may be any number. For example, the weights may be converted to any number of reduced weights, and each reduced weight may be iteratively loaded into the systolic array for multiply-accumulate operations with a set of reduced input data elements.
Each PE illustratively includes a multiplier with a single systolic-interval delay (e.g., an input provided at interval n is provided as an output at interval n+1) and an adder with a two-interval delay (e.g., an input provided at interval n is provided as an output at interval n+2). Adders with other delays may be implemented. As shown in figs. 9A to 9H, each PE of the PE column 900 includes a data register Data RegRC for receiving an input data element, a weight storage register Weight RegRC, a multiplier denoted by "X", and an adder or accumulator denoted by "+".
The values provided as input partial sums at systolic intervals 0 to 9 are shown along the top, where PE00 receives value A1. (While the value A1 is shown for illustrative purposes, in some cases all partial sums fed to the top row of the array may be set to the same value, which may be zero.) Values provided as input data elements at systolic intervals 0 to 9 are shown along the left column, with PE00 in row 0 receiving values C1 and C2 at the indicated times, PE10 in row 1 receiving values D1 and D2 at the indicated times, PE20 in row 2 receiving values E1 and E2 at the indicated times, and PE30 in row 3 receiving values F1 and F2 at the indicated times. C1, D1, E1, and F1 may each be a first reduced input data element (e.g., a low reduced input data element) and C2, D2, E2, and F2 may each be a second reduced input data element (e.g., a high reduced input data element). G1, H1, I1, and J1 may be weights. In some implementations, the weights may each be converted to a first reduced weight (e.g., a low reduced weight) and a second reduced weight (e.g., a high reduced weight). When no value is shown, zero or a NOP may be assumed. Where indicated, the system is initialized with zero values for clarity and to facilitate understanding. However, other examples may occur in different states and/or with other internal values.
Figs. 9A to 9H show the progress of data when performing multiply-accumulate operations. The multiply-accumulate operations across the intervals shown include (as discussed in more detail below): multiplying the weight G1 by the input data element C1 and accumulating the input partial sum A1; multiplying the weight G1 by the input data element C2; multiplying the weight H1 by the input data element D1 and accumulating the input partial sum from PE00 to produce X1; multiplying the weight H1 by the input data element D2 and accumulating the input partial sum from PE00 to produce X2; multiplying the weight I1 by the input data element E1 and accumulating the input partial sum from PE10 to produce Y1; multiplying the weight I1 by the input data element E2 and accumulating the input partial sum from PE10 to produce Y2; multiplying the weight J1 by the input data element F1 and accumulating the input partial sum from PE20 to produce Z1; and multiplying the weight J1 by the input data element F2 and accumulating the input partial sum from PE20 to produce Z2. The techniques disclosed herein may be extended to additional sequences of input data elements and input partial sums. A short sketch of the partial sums this column produces is given below.
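As a quick sanity sketch of what this column accumulates, the Python snippet below computes the two per-pass partial sums directly from the sequence above, using arbitrary placeholder values. The intermediate names P1 and P2 (for the unlabeled PE00 outputs) are hypothetical.

A1 = 1.0
C1, D1, E1, F1 = 0.1, 0.2, 0.3, 0.4  # first (low) reduced input data elements
C2, D2, E2, F2 = 0.5, 0.6, 0.7, 0.8  # second (high) reduced input data elements
G1, H1, I1, J1 = 2.0, 3.0, 4.0, 5.0  # preloaded weights

P1 = C1 * G1 + A1   # PE00, first pass
P2 = C2 * G1        # PE00, second pass
X1 = D1 * H1 + P1   # PE10
X2 = D2 * H1 + P2
Y1 = E1 * I1 + X1   # PE20
Y2 = E2 * I1 + X2
Z1 = F1 * J1 + Y1   # PE30 output for the first pass
Z2 = F2 * J1 + Y2   # PE30 output for the second pass
print(Z1, Z2)       # the two partial sums handed off for aggregation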
Fig. 9A shows the state of PE column 900 at systolic interval 0. Weights G1, H1, I1, and J1 are each preloaded into a respective weight register. For example, weights G1, H1, I1, and J1 may be preloaded in a weight-loading operation. In PE00, input data element C1 is received for writing and storage in Data Reg00 for use during the next systolic interval. All other inputs and other states are initialized to zero.
Fig. 9B shows the state of PE column 900 at systolic interval 1. In PE00, input data element C2 is received for writing and storage in Data Reg00 for use during the next systolic interval. In some implementations, the weight G1 may be preloaded into Weight Reg00 for multiple systolic intervals and may not be preloaded again. For example, the weight G1 may be preloaded for multiple multiply-accumulate operations with multiple reduced input data elements. The weight G1 may then be replaced with a new weight G2 for multiply-accumulate operations with the reduced inputs. For example, G1 and G2 may be reduced weights generated from the same weight. Thus, the weight G1 may be preloaded into the array only once. It should be appreciated that the combinations of inputs or weights may be ordered such that any of the reduced inputs or weights may be stored for multiple systolic intervals in the respective data registers and may not be re-read into the PE. For example, the combinations of reduced inputs or weights may be ordered or assigned such that the weight G1 is not re-read into the PE. The stored input data element C1 is read from Data Reg00 and provided as an input to both the multiplier of PE00 and the data registers of the PEs in the subsequent columns. The multiplier in PE00 multiplies C1 by G1 to generate a multiplication result C1×G1, which is provided to the adder of PE00. The input partial sum A1 is also received at the adder of PE00. Each adder is pipelined with a delay of two intervals and thus processes the respective input partial sum and the respective multiplication result during a time period corresponding to the delay (e.g., the following two intervals).
In PE10, an input Data element D1 is received for writing and storage in Data Reg10 for use during the next systolic interval.
Fig. 9C shows the state of PE column 900 at systolic interval 2. In PE00, input data element C2 is read from Data Reg00 and provided as an input to both the multiplier of PE00 and the data registers of the PEs in the subsequent columns. The multiplier in PE00 multiplies C2 by G1 to generate a multiplication result C2×G1, which is provided to the adder of PE00 for use in the adder operation. Note that during systolic interval 2, the adder of PE00 continues the addition operation between the multiplication result C1×G1 obtained during interval 1 and the input partial sum A1.
In PE10, an input data element D2 is received for writing and storage in Data Reg10 for use during the next systolic interval. The stored input data element D1 is read from Data Reg10 and provided as an input to both the multiplier of PE10 and the data registers of the PEs in the subsequent columns. The multiplier in PE10 multiplies D1 by H1 to generate a multiplication result D1×H1, which is provided to the adder of PE10.
In PE20, input Data element E1 is received for writing and storage in Data Reg20 for use during the next systolic interval.
Fig. 9D shows the state of PE column 900 at systolic interval 3. In PE00, the adder completes the addition of A1 and C1×G1 and generates an addition result A1+C1×G1. The addition result A1+C1×G1 is transmitted as an input partial sum to PE10. The addition results of PEs within a given column may be generally referred to herein as "partial sums". Note that during systolic interval 3, the adder of PE00 continues the addition operation on the multiplication result C2×G1 obtained during interval 2.
In PE10, the stored input data element D2 is read from Data Reg10 and provided as an input to both the multiplier of PE10 and the data registers of the PEs in the subsequent columns. The multiplier in PE10 multiplies D2 by H1 to generate a multiplication result D2×H1, which is provided to the adder of PE10. The input partial sum C1×G1+A1 is received from PE00 and is also provided to the adder of PE10 for use in the adder operation. Note that during systolic interval 3, the adder of PE10 continues the addition operation between the multiplication result D1×H1 and the input partial sum (A1+C1×G1) from PE00.
In PE20, input Data element E2 is received for writing and storage in Data Reg20 for use during the next systolic interval. The stored input Data element E1 is read from Data Reg20 and provided as an input to both the multiplier of PE20 and the Data registers of the PE in the subsequent columns. The multiplier in PE20 multiplies E1 by I1 to generate a multiplication result E1×I1, which is provided to the adder of PE20 for use in the adder operation.
In PE30, input Data element F1 is received for writing and storage in Data Reg30 for use during the next systolic interval.
Fig. 9E shows the state of PE column 900 at systolic interval 4. In PE00, the adder completes the addition of 0 and C2×G1 and generates an addition result C2×G1. In some embodiments, the input partial sum A1 may be added to each combination of reduced inputs. For example, in the case where each input is converted to two reduced inputs, resulting in four combinations of reduced inputs for each weight and input data element (e.g., a four-pass multiply-accumulate operation for a pair of inputs), the input partial sum may be added to each combination of reduced inputs. In other embodiments, a portion of the input partial sum may be added to each combination of the reduced inputs. For example, the input partial sum may be divided across each combination of the reduced inputs. The addition result C2×G1 is transmitted as an input partial sum to PE10.
In PE10, the input partial sum C2×G1 is received from PE00 and is also provided to the adder of PE10 for use in the adder operation. Note that during systolic interval 4, the adder of PE10 continues the addition operation between the multiplication result D2×H1 and the input partial sum (C2×G1) from PE00.
In addition, in PE10, the adder completes the addition of D1×H1+C1×G1+A1 and generates an addition result X1. The addition result X1 is transmitted as an input partial sum to PE20.
In PE20, the stored input data element E2 is read from Data Reg20 and provided as an input to both the multiplier of PE20 and the data registers of the PEs in the subsequent columns. The multiplier in PE20 multiplies E2 by I1 to generate a multiplication result E2×I1, which is provided to the adder of PE20 for use in the adder operation. The input partial sum X1 is received from PE10 and is also provided to the adder of PE20 for use in the adder operation. Note that during systolic interval 4, the adder of PE20 continues the addition operation between the multiplication result E1×I1 and the input partial sum (X1) from PE10.
In PE30, input Data element F2 is received for writing and storage in Data Reg30 for use during the next systolic interval. The stored input Data element F1 is read from Data Reg30 and provided as an input to both the multiplier of PE30 and the Data registers of the PE in the subsequent columns. The multiplier in PE30 multiplies F1 by J1 to generate a multiplication result F1×J1, which is provided to the adder of PE30 for use in the adder operation.
Fig. 9F shows the state of PE column 900 at systolic interval 5. In PE10, the adder completes the addition of D2×H1+C2×G1 and generates an addition result X2. The addition result X2 is transmitted as an input partial sum to PE20.
In PE20, the input partial sum X2 is received from PE10 and is also provided to the adder of PE20 for use in the adder operation. Note that during systolic interval 5, the adder of PE20 continues the addition operation between the multiplication result E2×I1 and the input partial sum (X2) from PE10.
In addition, in PE20, the adder completes the addition of E1×I1+X1 and generates an addition result Y1. The addition result Y1 is transmitted as an input partial sum to PE30.
In PE30, the stored input data element F2 is read from Data Reg30 and provided as an input to both the multiplier of PE30 and the data registers of the PEs in the subsequent columns. The multiplier in PE30 multiplies F2 by J1 to generate a multiplication result F2×J1, which is provided to the adder of PE30 for use in the adder operation. Note that during systolic interval 5, the adder of PE30 continues the addition operation between the multiplication result F1×J1 obtained during interval 4 and the input partial sum (Y1) from PE20.
Fig. 9G shows the state of PE column 900 at systolic interval 6. In PE20, the adder completes the addition of E2×I1+X2 and generates an addition result Y2. The addition result Y2 is transmitted as an input partial sum to PE30.
In PE30, the adder of PE30 continues the addition operation between the multiplication result F2×J1 obtained during interval 5 and the input partial sum (Y2) from PE20.
In addition, in PE30, the adder completes the addition of F1×J1+Y1 and generates an addition result Z1. The addition result Z1 may be transmitted to another PE and/or an aggregator for aggregation with additional combinations of reduced inputs for a particular set of inputs.
Fig. 9H shows the state of PE column 900 at systolic interval 7. In PE30, the adder completes the addition of F2×J1+Y2 and generates an addition result Z2. The addition result Z2 may be transmitted to another PE and/or an aggregator for aggregation with additional combinations of reduced inputs for a particular set of inputs.
The exemplary states of the data flows shown in figs. 9A-9H may be performed for one or more starting input data elements and any number of starting input partial sums.
Fig. 10 illustrates an example of a computing device 1000. The functionality and/or several components of computing device 1000 may be used with other embodiments disclosed elsewhere in this disclosure without limitation. The computing device 1000 may perform computations to facilitate processing of tasks. As an illustrative example, computing device 1000 may be part of a server in a multi-tenant computing service system. Various hardware and software resources of computing device 1000 (e.g., hardware and software resources associated with data processing) may be allocated to clients upon request.
In one example, computing device 1000 may include processing logic 1002, a bus interface module 1004, a memory 1006, and a network interface module 1008. These modules may be hardware modules, software modules, or a combination of hardware and software. In some cases, modules may be used interchangeably with components or engines without departing from the scope of the present disclosure. Computing device 1000 may include additional modules that are not illustrated herein for ease of description. In some implementations, the computing device 1000 may include fewer modules. For example, one or more of the modules may be combined into one module. One or more of the modules may communicate with each other over a communication channel 1010. The communication channels 1010 may include one or more buses, grids, matrices, fabrics, combinations of these communication channels, or some other suitable communication channel.
The processing logic 1002 may include an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a system on a chip (SoC), a Network Processing Unit (NPU), a processor configured to execute instructions, or any other circuitry for performing logical, arithmetic, and floating point operations. Processors that may be included in the processing logic 1002 may include processors from a variety of vendors. In some embodiments, a processor may include multiple processing cores, and each processing core may execute instructions independently of the other processing cores. In addition, each processor or processing core may implement multiple processing threads executing instructions on the same processor or processing core while maintaining logical separation between the multiple processing threads. Such processing threads executing on a processor or processing core may be exposed to software as separate logical processors or processing cores. In some embodiments, multiple processors, processing cores, or processing threads executing on the same core may share certain resources, such as, for example, a bus, a level 1 (L1) cache, and/or a level 2 (L2) cache. Instructions executed by the processing logic 1002 may be stored on a computer-readable storage medium, for example, in the form of a computer program. The computer readable storage medium may be non-transitory. In some cases, the computer readable medium may be part of the memory 1006. The processing logic 1002 may also include hardware circuitry for performing artificial neural network calculations, including, for example, the neural network processor 602, etc.
The client may be granted access to the processing logic 1002 to provide the personal assistant service requested by the client. For example, the computing device 1000 may host a virtual machine on which an image recognition software application may execute. The image recognition software application, when executed, may access the processing logic 1002 to predict, for example, objects included in an image. As another example, access to the processing logic 1002 may also be granted as part of a bare metal instance, where an image recognition software application executing on a client device (e.g., remote computer, smart phone, etc.) may directly access the processing logic 1002 to perform recognition of an image.
The memory 1006 may include volatile memory or nonvolatile memory, or both volatile and nonvolatile types of memory. For example, memory 1006 may include Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, and/or some other suitable storage medium. In some cases, some or all of memory 1006 may be internal to computing device 1000, while in other cases, some or all of memory may be external to computing device 1000. The memory 1006 may store an operating system including executable instructions that, when executed by the processing logic 1002, provide an execution environment for executing instructions that provide functionality for performing convolution calculations for the computing device 1000. The memory 1006 may also store software applications, for example, for performing artificial neural network calculations. The memory may also store and maintain several data structures and tables to facilitate the functionality of the computing device 1000.
The bus interface module 1004 may enable communication with external entities such as host devices and/or other components in the computing system through external communication media. Bus interface module 1004 may include a physical interface for connecting to cables, sockets, ports, or other connections to external communication media. The bus interface module 1004 may also include hardware and/or software to manage incoming and outgoing transactions. The bus interface module 1004 may implement a local bus protocol such as a Peripheral Component Interconnect (PCI) -based protocol, non-volatile memory express (NVMe), advanced Host Controller Interface (AHCI), small Computer System Interface (SCSI), serial Attached SCSI (SAS), serial AT attachment (SATA), parallel ATA (PATA), some other standard bus protocol, or proprietary bus protocol. Bus interface module 1004 may include a physical layer for any of these bus protocols, including connectors, power management, error handling, and the like. In some implementations, the computing device 1000 may include multiple bus interface modules for communicating with multiple external entities. These multiple bus interface modules may implement the same local bus protocol, different local bus protocols, or a combination of the same and different bus protocols.
The network interface module 1008 may include hardware and/or software for communicating with a network. For example, the network interface module 1008 may include a physical connector or physical port for wired connection to a network, and/or an antenna for wireless communication with a network. The network interface module 1008 may also include hardware and/or software that implements a network protocol stack. The network interface module 1008 may communicate with a network using network protocols such as, for example, TCP/IP, InfiniBand, RoCE, Institute of Electrical and Electronics Engineers (IEEE) 802.11 wireless protocols, User Datagram Protocol (UDP), Asynchronous Transfer Mode (ATM), Token Ring, Frame Relay, High-level Data Link Control (HDLC), Fiber Distributed Data Interface (FDDI), and/or Point-to-Point Protocol (PPP), among others. In some implementations, the computing device 1000 may include multiple network interface modules, each configured to communicate with a different network. For example, the computing device 1000 may include a network interface module for communicating with a wired Ethernet network, a wireless 802.11 network, a cellular network, an InfiniBand network, or the like. In some embodiments, the computing device 1000 may receive parameter sets from a server, such as the weight values described above for convolution calculations, through the network interface module 1008.
The various components and modules of computing device 1000 described above may be implemented as discrete components, system on a chip (SoC), ASIC, NPU, FPGA, or any combination thereof. In some embodiments, the SoC or other component may be communicatively coupled to another computing system to provide various services, such as traffic monitoring, traffic shaping, computing, and the like. In some embodiments of the present technology, the SoC or other component may include multiple subsystems as disclosed herein.
The modules described herein may be software modules, hardware modules, or suitable combinations thereof. If the module is a software module, the module may be embodied on a non-transitory computer readable medium and processed by a processor in any of the computer systems described herein. It should be noted that the described processes and architectures may be performed in real-time or asynchronous mode prior to any user interaction. The modules may be configured in the manner suggested in fig. 10, and/or the functionality described herein may be provided by one or more modules existing as separate modules, and/or the functionality of the modules described herein may be distributed across multiple modules.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the claims.
Other variations are also within the spirit of the present disclosure. Thus, while the disclosed technology is susceptible to various modifications and alternative constructions, specific embodiments thereof have been shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the disclosure to the specific form or forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the disclosure as defined in the appended claims.
The use of the terms "a" and "an" and "the" and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Unless otherwise indicated, the terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (i.e., meaning "including, but not limited to"). The term "connected" is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the disclosure and does not pose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the disclosure.
Unless specifically stated otherwise, disjunctive language such as the phrase "at least one of X, Y, or Z" is intended, in context, to be understood to mean that an item, term, etc., may be X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is generally not intended to, and should not, imply that certain embodiments require the presence of at least one of X, at least one of Y, or at least one of Z, respectively.
Various embodiments of the disclosure are described herein, including the best mode known to the inventors for carrying out the disclosure. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the disclosure to be practiced otherwise than as specifically described herein. Accordingly, this disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Various exemplary embodiments of the present disclosure may be described by the following clauses:
Clause 1: a systolic array processor organized in rows and columns, each row comprising:
a reducer configured to convert 32-bit input data elements into reduced 22-bit input data elements, the reducer comprising:
a tail bit reducer configured to reduce a number of bits representing a significand portion of a 32-bit input data element in the 32-bit input data element to produce a reduced significand portion of the 32-bit input data element;
a rounder configured to round the reduced significant portion of the 32-bit input data element to produce a rounded significant portion; and
an exponent extender configured to increase a number of bits representing an exponent portion of the 32 bit input data element to produce an increased exponent portion,
wherein the reducer generates a reduced 22-bit input data element based on the rounded significant portion and the increased exponent portion; and
a plurality of processing elements configured to receive the reduced 22-bit input data elements from the reducer and receive weights for performing multiply-accumulate operations.
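For readers tracing the data flow recited in clause 1, the following sketch illustrates one way the three reducer stages (tail-bit reduction, rounding, and exponent extension) could act on a 32-bit floating point value. The 1/10/11 sign/exponent/fraction layout, the bias arithmetic, and the helper name reduce_fp32 are assumptions made for illustration only; they are not taken from the claims, and special values (zero, infinity, NaN, subnormals) are ignored here.

```python
import struct

F32_FRAC_BITS = 23                     # FP32: 1 sign bit, 8 exponent bits, 23 fraction bits
RED_EXP_BITS, RED_FRAC_BITS = 10, 11   # assumed reduced layout: 1 + 10 + 11 = 22 bits

def reduce_fp32(value: float):
    """Drop trailing fraction bits, round to nearest even, and re-bias the
    exponent into a wider field (illustrative sketch of the reducer stages)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits >> 31
    exp = (bits >> F32_FRAC_BITS) & 0xFF
    frac = bits & ((1 << F32_FRAC_BITS) - 1)

    # Tail-bit reduction: keep only the most significant RED_FRAC_BITS fraction bits.
    drop = F32_FRAC_BITS - RED_FRAC_BITS
    kept, rem = frac >> drop, frac & ((1 << drop) - 1)

    # Rounding: round to nearest, ties to even, using the discarded bits.
    half = 1 << (drop - 1)
    if rem > half or (rem == half and kept & 1):
        kept += 1
        if kept >> RED_FRAC_BITS:      # fraction overflowed; carry into the exponent
            kept, exp = 0, exp + 1

    # Exponent extension: re-bias from the 8-bit FP32 field into a 10-bit field.
    new_exp = exp - 127 + (1 << (RED_EXP_BITS - 1)) - 1
    return sign, new_exp, kept         # fields of the assumed 22-bit element
```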
Clause 2: the systolic array processor of clause 1, wherein the reducer is further configured to convert 32-bit weights into the weights.
Clause 3: the systolic array processor of clause 1 or clause 2, wherein the reducer further comprises a first reducer, each row further comprising:
a second reducer configured to convert 32-bit weights into the weights.
Clause 4: the systolic array processor of any one of clauses 1-3, wherein the rounder is configured to round the reduced significant portion of the 32-bit input data element based on one or more of:
random rounding;
rounding to the nearest even;
rounding to zero;
rounding down; or
rounding up.
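As a sketch of how the rounding options listed in clause 4 differ, the helper below rounds a non-negative integer fraction after discarding its trailing bits; the mode names and the helper itself are illustrative assumptions rather than terminology from the claims.

```python
import random

def round_fraction(frac: int, drop: int, mode: str) -> int:
    """Round a non-negative integer fraction after discarding `drop` trailing bits."""
    kept, rem = frac >> drop, frac & ((1 << drop) - 1)
    if rem == 0:
        return kept
    if mode == "random":                 # round up with probability proportional to the remainder
        return kept + (random.randrange(1 << drop) < rem)
    if mode == "nearest_even":
        half = 1 << (drop - 1)
        return kept + (rem > half or (rem == half and kept & 1))
    if mode in ("to_zero", "down"):      # identical for a non-negative magnitude
        return kept
    if mode == "up":
        return kept + 1
    raise ValueError(f"unknown rounding mode: {mode}")
```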
Clause 5: a pulsing circuit, comprising:
a set of processing elements arranged in a plurality of rows; and
a first converter configured to:
receiving a first input represented by a floating point having a first bit length;
identifying a number of trailing bits of the first input;
reducing the number of trailing bits of the first input; and
generating a first reduced input represented in floating point having a second bit length based on the reduced number of trailing bits of the first input, wherein the second bit length is less than the first bit length, wherein the second bit length corresponds to a bit length supported by the group of processing elements;
wherein each processing element in at least one row of the set of processing elements is configured to receive the first reduction input from the first converter and to receive a second input for performing a multiply-accumulate operation.
Clause 6: the pulsing circuit of clause 5, wherein each processing element in the plurality of rows of the set of processing elements comprises:
a multiplier configured to multiply two 22-bit floating point numbers, wherein the multiplier consists of a 1-bit sign data path, an 11-bit significand data path, and a 10-bit exponent data path; and
an adder configured to add two floating point numbers, wherein the adder is comprised of a 1-bit sign data path, a 23-bit significand data path, and a 10-bit exponent data path.
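As a point of reference, the recited data path widths appear consistent with the bit lengths used elsewhere in these clauses: 1 + 11 + 10 = 22 bits across the multiplier's sign, significand, and exponent data paths, and 1 + 23 + 10 = 34 bits across the adder's, matching the 22-bit multiplier and 34-bit adder of clause 12.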
Clause 7: the pulsing circuit of clause 5 or clause 6, wherein the first input comprises an input data element and the second input comprises a reduced weight, wherein the first converter is further configured to:
receiving the first input and a weight;
generating the first reduced input and the second input; and
selecting the first reduced input or the second input to be provided.
Clause 8: the ripple circuit of any one of clauses 5 to 7, wherein the first converter comprises:
a tail bit reducer configured to reduce a number of bits representing a significant portion of the first input to produce a reduced significant portion of the first input;
a rounder configured to round the reduced significant portion of the first input based on a remaining portion of the bits of the significant portion that are not included within the reduced significant portion that represent the first input; and
an exponent extender configured to increase a number of bits representing an exponent portion of the first input.
Clause 9: the ripple circuit of any one of clauses 5 to 8, wherein the first input comprises a first rounding input, wherein the first converter comprises:
a tail bit reducer configured to reduce a number of bits representing a significant portion of the first input to produce a reduced significant portion of the first input; and
An exponent extender configured to increase a number of bits representing an exponent portion of the first input.
Clause 10: the pulsing circuit of any one of clauses 5-9, wherein the first reduction input comprises a first reduction rounding input, wherein the first reduction rounding input rounds based on one or more of:
random rounding;
rounding to the nearest even;
rounding to zero;
rounding down; or
rounding up.
Clause 11: the pulsing circuit of any one of clauses 5 to 10, wherein the first reduction input comprises a first reduction rounding input, wherein the first reduction rounding input rounds based on a user input.
Clause 12: the ripple circuit of any one of clauses 5 to 11, wherein:
the first converter is configured to convert 32-bit floating point numbers to 22-bit floating point numbers,
wherein each of the processing elements comprises:
a 22 bit multiplier; and
a 34 bit adder.
Clause 13: the ripple circuit of any one of clauses 5 to 12, wherein:
the first converter is further configured to convert an m-bit floating point number to an n-bit floating point number, where n and m may be any positive integer, where n is less than m,
Wherein each of the processing elements comprises:
a multiplier configured to multiply at least two n-bits; and
an adder configured to add two p-bits, where p is greater than n.
Clause 14: the pulsing circuit of any one of clauses 5-13, wherein to reduce the number of trailing bits of the first input, the first converter is configured to:
set the number of trailing bits to zero.
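A minimal sketch of this reduction, assuming the trailing bits are simply masked off in place; the helper name is illustrative:

```python
def zero_trailing_bits(frac: int, drop: int) -> int:
    """Clear the `drop` least significant bits of an integer fraction."""
    return frac & ~((1 << drop) - 1)
```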
Clause 15: the pulsing circuit of any one of clauses 5 to 14, further comprising:
a second converter configured to:
receiving weights represented in floating points having the first bit length;
identifying a number of trailing bits of the weight;
reducing the number of trailing bits of the weight; and
generating the second input, represented in floating point having the second bit length, based on the reduced number of trailing bits of the weight.
Clause 16: the pulsing circuit of any one of clauses 5 to 15, wherein the first reduced input is stored in a 24-bit format.
Clause 17: a method, comprising:
receiving a first input represented by a floating point having a first bit length;
reducing the number of trailing bits of the first input;
generating a first reduced input in a floating point representation having a second bit length based on reducing the number of trailing bits of the first input, wherein the second bit length is less than the first bit length, wherein the second bit length corresponds to a supported bit length; and
receiving the first reduced input and the second input for performing a multiply-accumulate operation.
Clause 18: the method of clause 17, wherein:
the first input comprises a 32-bit floating point number;
the first reduced input includes a first 22-bit floating point number; and
the second input includes a second 22-bit floating point number.
Clause 19: the method of clause 17 or clause 18, wherein generating the first reduction input comprises:
rounding the first input based on a remaining portion of non-trailing bits of the first input to generate the first reduced input, wherein the first input comprises a number of bits, wherein rounding the first input comprises rounding a portion of the number of bits.
Clause 20: the method of any of clauses 17-19, wherein one or more of the first reduction input or the second input comprises a rounding reduction input, wherein the rounding reduction input rounds based on one or more of:
random rounding;
rounding to the nearest even;
rounding to zero;
rounding down; or
rounding up.
Various exemplary embodiments of the present disclosure may be described by the following clauses:
clause 1: a systolic array processor organized in rows and columns, each row comprising:
a reducer configured to convert a 32-bit input data element into two 21-bit input data elements, the reducer comprising:
a first sub-reducer configured to convert a 32-bit input data element of the 32-bit input data elements into a first 21-bit input data element, the first 21-bit input data element corresponding to a set of most significant bits of a significant portion of the 32-bit input data element, the first sub-reducer comprising:
a trailing bit reducer configured to reduce a number of trailing bits representing the significant portion of the 32-bit input data element to produce a first reduced significant portion of the 32-bit input data element, the first reduced significant portion corresponding to the set of most significant bits; and
a first exponent extender configured to increase a number of bits representing an exponent portion of the 32 bit input data element to produce a first increased exponent portion,
wherein the first sub-reducer generates the first 21-bit input data element based on the first reduced significant portion and the first increased exponent portion; and
a second sub-reducer configured to convert the 32-bit input data element into a second 21-bit input data element, the second 21-bit input data element corresponding to a set of least significant bits of the significant portion of the 32-bit input data element, the second sub-reducer comprising:
a leading bit reducer configured to reduce a number of leading bits representing the significant portion of the 32-bit input data element to produce a second reduced significant portion of the 32-bit input data element, the second reduced significant portion corresponding to the set of least significant bits; and
a second exponent extender configured to increase a number of bits representing the exponent portion of the 32-bit input data element to produce a second increased exponent portion,
wherein the second sub-reducer generates a second 21-bit input data element based on the second reduced significant portion and the second increased exponent portion; and
A plurality of processing elements, a processing element of the plurality of processing elements configured to iteratively perform a plurality of pairwise multiply-accumulate operations on the first 21-bit input data element, the second 21-bit input data element, and weights to provide a total output, wherein a 21-bit length corresponds to a maximum supported bit length of the processing element.
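The decomposition recited in clause 1 can be pictured with a small numeric sketch: the high part keeps the leading significand bits, the low part carries the remaining bits at an adjusted (smaller) exponent, and the two parts sum exactly to the original value. The 11-bit split point, the helper names, and the use of host floating-point subtraction to obtain the low part are assumptions made for illustration only; special values such as infinities and NaN are ignored.

```python
import struct

def to_f32(v: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def split_fp32(value: float, keep: int = 11):
    """Split an FP32 value into a high part holding the top `keep` fraction bits
    and a low part holding the rest; high + low reconstructs the value exactly."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - keep
    hi_bits = bits & ~((1 << drop) - 1)   # zero the trailing fraction bits
    hi = struct.unpack("<f", struct.pack("<I", hi_bits))[0]
    lo = value - hi                       # exact: the difference needs at most `drop` significand bits
    return hi, lo

x = to_f32(3.1415927)
hi, lo = split_fp32(x)
assert hi + lo == x                       # the two reduced parts sum to the original input
```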
Clause 2: the systolic array processor of clause 1, wherein the first 21-bit input data element and the second 21-bit input data element sum to the 32-bit input data element.
Clause 3: the systolic array processor of clause 1 or clause 2, wherein the second sub-reducer is further configured to determine that the 32-bit input data element comprises a normal number, the second sub-reducer further comprising:
a normalizer for removing implied bits of the 32-bit input data element and renormalizing the second reduced significant portion to produce a normalized significant portion based on determining that the 32-bit input data element includes a normal number; and
an exponent adjuster for adjusting the second increased exponent portion based on renormalizing the second reduced significant portion to produce an adjusted exponent portion,
Wherein the second 21-bit input data element is further based on the normalized significand portion and the adjusted exponent portion.
Clause 4: the systolic array processor of any one of clauses 1-3, the weights comprising a first reduced weight and a second reduced weight, wherein the processing element is further configured to:
multiplying the second 21-bit input data element with the second reduced weight to generate a first product;
multiplying the first 21-bit input data element with the second reduced weight to generate a second product;
multiplying the second 21-bit input data element with the first reduced weight to generate a third product; and
multiplying the first 21-bit input data element with the first reduced weight to generate a fourth product,
wherein the systolic array processor further comprises a partial sum buffer configured to:
sum the first product, the second product, the third product, the fourth product, and the input partial sum to generate the total output.
Clause 5: a pulsing circuit, comprising:
a set of processing elements arranged in a plurality of rows; and
A first converter configured to:
receiving a first input represented by a floating point having a first bit length;
generating a first reduced input represented in floating point having a second bit length, the first reduced input corresponding to a set of most significant bits of a significant portion of the first input; and
generating a second reduced input represented in floating point having a third bit length, the second reduced input corresponding to a set of least significant bits of the significant portion of the first input, wherein the first reduced input and the second reduced input sum to the first input, wherein the second bit length and the third bit length are less than the first bit length, wherein the second bit length and the third bit length correspond to bit lengths supported by the group of processing elements,
wherein each processing element in at least one row of the set of processing elements is configured to receive the first reduced input and the second reduced input and to perform a plurality of multiply-accumulate operations on the first reduced input, the second reduced input, and the second input.
Clause 6: the pulsing circuit of clause 5, wherein each processing element in the plurality of rows of the set of processing elements comprises:
A multiplier configured to multiply two 21-bit floating point numbers, wherein the multiplier consists of a 1-bit sign data path, an 11-bit significand data path, and a 9-bit exponent data path; and
an adder configured to add two floating point numbers, wherein the adder is comprised of a 1-bit sign data path, a 23-bit significand data path, and a 10-bit exponent data path.
Clause 7: the pulsing circuit of clause 5 or clause 6, wherein the first input corresponds to an input data element and the second input corresponds to a weight, wherein the first converter is further configured to:
receiving the second input represented in floating point having a fourth bit length;
generating a third reduced input represented in floating point having a fifth bit length, the third reduced input corresponding to a set of most significant bits of a significant portion of the second input;
generating a fourth reduced input represented in floating point having a sixth bit length, the fourth reduced input corresponding to a set of least significant bits of the significant portion of the second input, wherein the third reduced input and the fourth reduced input sum to the second input, wherein the fifth bit length and the sixth bit length are less than the fourth bit length, wherein the fifth bit length and the sixth bit length correspond to the bit lengths supported by the group of processing elements; and is also provided with
Selecting the first reduced input, the second reduced input, the third reduced input or the fourth reduced input to be provided.
Clause 8: the ripple circuit of any one of clauses 5 to 7, wherein the first converter comprises:
a first sub-reducer, the first sub-reducer comprising:
a tail bit reducer configured to reduce a number of the set of least significant bits of the significant portion of the first input to produce a first reduced significant portion of the first input; and
a first exponent extender configured to increase a number of bits representing an exponent portion of the first input to produce a first increased exponent portion,
wherein the first sub-reducer generates the first reduced input based on the first reduced significant portion and the first increased exponent portion; and
a second sub-reducer, the second sub-reducer comprising:
a leading bit reducer configured to reduce the number of the set of most significant bits of the significant portion of the first input to produce a second reduced significant portion of the first input; and
A second exponent extender configured to increase a number of bits representing the exponent portion of the first input to produce a second increased exponent portion,
wherein the second sub-reducer generates the second reduced input based on the second reduced significant portion and the second increased exponent portion.
Clause 9: the ripple circuit of clause 8, wherein the second sub-reducer is configured to determine that the first input comprises a normal number, the second sub-reducer further comprising:
a normalizer for removing implied bits of the first input and renormalizing the second reduced significant-number portion to produce a normalized significant-number portion based on determining that the first input includes normal numbers; and
an exponent adjuster for adjusting the second increased exponent portion based on renormalizing the second reduced significant portion to produce an adjusted exponent portion,
wherein the second reduction input is further based on the normalized significand portion and the adjusted exponent portion.
Clause 10: the pulsing circuit of any one of clauses 5 to 9, wherein the second input corresponds to a first reduced weight and a second reduced weight, wherein to perform the plurality of multiply-accumulate operations, the respective processing elements are configured to:
multiplying the second reduced input with the second reduced weight to generate a first product;
adding the first product to an input partial sum to generate a first sum;
multiplying the first reduced input with the second reduced weight to generate a second product;
multiplying the second reduced input with the first reduced weight to generate a third product; and
multiplying the first reduced input with the first reduced weight to generate a fourth product,
wherein the ripple circuit further comprises a partial sum buffer configured to:
adding the first sum and the second product to generate a second sum;
adding the second sum to the third product to generate a third sum; and
adding the third sum and the fourth product to generate a total output.
Clause 11: the ripple circuit of any one of clauses 5 to 10, wherein the plurality of multiply-accumulate operations comprises an ordered plurality of multiply-accumulate operations.
Clause 12: the ripple circuit of any one of clauses 5 to 11, wherein:
the first converter is configured to convert a 32-bit floating point number to a plurality of 22-bit floating point numbers,
wherein each of the processing elements comprises:
A 22 bit multiplier; and
a 34 bit adder.
Clause 13: the ripple circuit of any one of clauses 5 to 12, wherein:
the first converter is further configured to convert an m-bit floating point number to one or more n-bit floating point numbers, where n and m can be any number, where n is less than m,
wherein each of the processing elements comprises:
a multiplier configured to multiply at least two n-bits; and
an adder configured to add two p-bits, where p is greater than n.
Clause 14: the ripple circuit of any one of clauses 5 to 13, further comprising:
a partial sum buffer configured to perform block-based accumulation based on a plurality of outputs of the set of processing elements.
Clause 15: the pulsing circuit of any one of clauses 5 to 14, further comprising:
a second converter configured to:
receiving the second input represented in floating point having a fourth bit length, the second input corresponding to a weight;
generating a third reduced input represented in floating point having a fifth bit length, the third reduced input corresponding to a set of most significant bits of a significant portion of the second input; and
generating a fourth reduced input represented in floating point having a sixth bit length, the fourth reduced input corresponding to a set of least significant bits of the significant portion of the second input, wherein the third reduced input and the fourth reduced input sum to the second input, wherein the fifth bit length and the sixth bit length are less than the fourth bit length, wherein the fifth bit length and the sixth bit length correspond to the bit lengths supported by the group of processing elements,
wherein the respective processing elements in the at least one row of the group of processing elements are further configured to receive the third reduced input and the fourth reduced input and to perform the plurality of multiply-accumulate operations on the first reduced input, the second reduced input, the third reduced input, and the fourth reduced input.
Clause 16: the pulsing circuit of any one of clauses 5 to 15, wherein the group of processing elements performs a first accumulation of a plurality of outputs of the group of processing elements to produce a reduced plurality of outputs, the pulsing circuit further comprising:
a partial sum buffer configured to perform block-based accumulation based on the reduced plurality of outputs to generate an output.
Clause 17: a method, comprising:
receiving a first input represented in floating point;
generating a first reduced input represented in floating point, the first reduced input corresponding to a set of most significant bits of a significant portion of the first input;
generating a second reduced input represented in floating point, the second reduced input corresponding to a set of least significant bits of the significant portion of the first input, wherein the first reduced input and the second reduced input sum to the first input, wherein the first reduced input and the second reduced input correspond to supported bit lengths; and
performing one or more operations based on the first reduced input, the second reduced input, and the second input to generate an output.
Clause 18: the method of clause 17, wherein:
the first input comprises a 32-bit floating point number;
the first reduced input includes a first 22-bit floating point number; and
the second reduced input includes a second 22-bit floating point number.
Clause 19: the method of clause 17 or clause 18, further comprising:
receiving the second input in floating point representation;
generating a third reduced input represented in floating point, the third reduced input corresponding to a set of most significant bits of a significant portion of the second input; and
generating a fourth reduced input represented in floating point, the fourth reduced input corresponding to a set of least significant bits of the significant portion of the second input, wherein the third reduced input and the fourth reduced input sum to the second input,
wherein the one or more operations are further based on the third reduced input and the fourth reduced input.
Clause 20: the method of any of clauses 17 to 19, wherein each of the first input and the second input comprises an input data element or weight.
The processes described herein or shown in the figures of the present disclosure may begin in response to an event, such as on a predetermined or dynamically determined schedule, on demand when initiated by a user or system administrator, or in response to some other event. When such a process is initiated, a set of executable program instructions stored on one or more non-transitory computer-readable media (e.g., hard disk drive, flash memory, removable media, etc.) may be loaded into a memory (e.g., RAM) of a server or another computing device. The executable instructions may then be executed by a hardware-based computer processor of the computing device. In some implementations, such processes, or portions thereof, may be implemented serially or in parallel on multiple computing devices and/or multiple processors.
Depending on the implementation, certain actions, events, or functions of any of the processes or algorithms described herein may be performed in a different sequence, may be added, combined, or omitted altogether (e.g., not all of the described operations or events are necessary to practice the algorithm). Further, in some embodiments, operations or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors or processor cores, or on other parallel architectures, rather than sequentially.
The various illustrative logical blocks, modules, routines, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware (e.g., ASIC or FPGA devices), computer software running on computer hardware, or combinations of both. The processor device may be a microprocessor, but in the alternative, the processor device may be a controller, a microcontroller, or a state machine, combinations thereof, or the like. The processor device may include electronic circuitry for processing computer-executable instructions. In another embodiment, the processor device includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor device may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor device may also include primarily analog components. For example, some or all of the rendering techniques described herein may be implemented in analog circuitry or mixed analog and digital circuitry. A computing environment may include any type of computer system, including, but not limited to, a microprocessor-based computer system, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computing engine within an appliance, to name a few.
Elements of the methods, processes, routines, or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor device, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable storage medium. An exemplary storage medium may be coupled to the processor device such that the processor device can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor device. The processor device and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor device and the storage medium may reside as discrete components in a user terminal.
Unless specifically stated otherwise, or otherwise understood in the context of use, conditional language (such as "can," "could," "might," "may," "e.g.," and the like) as used herein is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements, or steps. Thus, such conditional language is not generally intended to imply that features, elements, or steps are in any way required for one or more embodiments, or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, or steps are included in or are to be performed in any particular embodiment. The terms "comprising," "including," "having," and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Furthermore, the term "or" is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term "or" means one, some, or all of the elements in the list.
Unless specifically stated otherwise, disjunctive language such as the phrase "at least one of X, Y, or Z" should, in this context, be understood to mean that an item, term, etc., may be X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is generally not intended to, and should not, imply that certain embodiments require the presence of at least one of X, at least one of Y, and at least one of Z, respectively.
While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or algorithm illustrated may be made without departing from the scope of the disclosure. As may be recognized, certain embodiments described herein may be embodied within a form that does not provide all of the features and benefits set forth herein, as some features may be used or practiced separately from others. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims (15)
1. A pulsing circuit, comprising:
a set of processing elements arranged in a plurality of rows; and
A first converter configured to:
receiving a first input represented by a floating point having a first bit length;
identifying a number of trailing bits of the first input;
reducing the number of trailing bits of the first input; and
generating a first reduced input represented in floating point having a second bit length based on the reduced number of trailing bits of the first input, wherein the second bit length is less than the first bit length, wherein the second bit length corresponds to a bit length supported by the group of processing elements;
wherein each processing element in at least one row of the set of processing elements is configured to receive the first reduction input from the first converter and to receive a second input for performing a multiply-accumulate operation.
2. The pulsing circuit of claim 1 wherein each processing element in said plurality of rows of said set of processing elements comprises:
a multiplier configured to multiply two 22-bit floating point numbers, wherein the multiplier consists of a 1-bit sign data path, an 11-bit significand data path, and a 10-bit exponent data path; and
an adder configured to add two floating point numbers, wherein the adder is comprised of a 1-bit sign data path, a 23-bit significand data path, and a 10-bit exponent data path.
3. The pulsing circuit of any of claims 1 or 2, wherein the first input comprises an input data element and the second input comprises a reduced weight, wherein the first converter is further configured to:
receiving the first input and a weight;
generating the first reduced input and the second input; and
selecting the first reduced input or the second input to be provided.
4. A ripple circuit according to any one of claims 1 to 3, wherein the first converter comprises:
a tail bit reducer configured to reduce a number of bits representing a significant portion of the first input to produce a reduced significant portion of the first input;
a rounder configured to round the reduced significant portion of the first input based on a remaining portion of the bits of the significant portion that are not included within the reduced significant portion that represent the first input; and
an exponent extender configured to increase a number of bits representing an exponent portion of the first input.
5. The pulsing circuit of any of claims 1 to 4, wherein the first input comprises a first rounding input, wherein the first converter comprises:
A tail bit reducer configured to reduce a number of bits representing a significant portion of the first input to produce a reduced significant portion of the first input; and
an exponent extender configured to increase a number of bits representing an exponent portion of the first input.
6. The pulsing circuit of any of claims 1-5 wherein the first reduction input comprises a first reduction rounding input, wherein the first reduction rounding input rounds based on one or more of:
random rounding;
rounding to the nearest even;
rounding to zero;
rounding down; or
rounding up.
7. The ripple circuit of any one of claims 1 to 6, wherein:
the first converter is configured to convert 32-bit floating point numbers to 22-bit floating point numbers,
wherein each of the processing elements comprises:
a 22 bit multiplier; and
a 34 bit adder.
8. The ripple circuit of any one of claims 1 to 7, wherein:
the first converter is further configured to convert an m-bit floating point number to an n-bit floating point number, where n and m may be any positive integer, where n is less than m,
Wherein each of the processing elements comprises:
a multiplier configured to multiply at least two n-bits; and
an adder configured to add two p-bits, where p is greater than n.
9. The pulsing circuit of any of claims 1 to 8, wherein to reduce the number of trailing bits of the first input, the first converter is configured to:
set the number of trailing bits to zero.
10. The ripple circuit of any one of claims 1 to 9, further comprising:
a second converter configured to:
receiving weights represented in floating points having the first bit length;
identifying a number of trailing bits of the weight;
reducing the number of trailing bits of the weight; and
generating the second input, represented in floating point having the second bit length, based on the reduced number of trailing bits of the weight.
11. The ripple circuit of any one of claims 1 to 10, wherein the first reduced input is stored in a 24-bit format.
12. A method, comprising:
receiving a first input represented by a floating point having a first bit length;
reducing the number of trailing bits of the first input;
generating a first reduced input in a floating point representation having a second bit length based on reducing the number of trailing bits of the first input, wherein the second bit length is less than the first bit length, wherein the second bit length corresponds to a supported bit length; and
receiving the first reduced input and the second input for performing a multiply-accumulate operation.
13. The method of claim 12, wherein:
the first input comprises a 32-bit floating point number;
the first reduced input includes a first 22-bit floating point number; and
the second input includes a second 22-bit floating point number.
14. The method of any of claim 12 or claim 13, wherein generating the first reduction input comprises:
rounding the first input based on a remaining portion of non-trailing bits of the first input to generate the first reduced input, wherein the first input comprises a number of bits, wherein rounding the first input comprises rounding a portion of the number of bits.
15. The method of any one of claims 12 to 14, wherein one or more of the first or second reduction inputs comprises a rounding reduction input, wherein the rounding reduction input rounds based on one or more of:
random rounding;
rounding to the nearest even;
rounding to zero;
rounding down; or
rounding up.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/363,900 US20230004523A1 (en) | 2021-06-30 | 2021-06-30 | Systolic array with input reduction to multiple reduced inputs |
US17/363,900 | 2021-06-30 | ||
US17/363,894 | 2021-06-30 | ||
PCT/US2022/035353 WO2023278475A1 (en) | 2021-06-30 | 2022-06-28 | Systolic array with efficient input reduction and extended array performance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117813585A true CN117813585A (en) | 2024-04-02 |
Family
ID=84785517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202280052183.4A Pending CN117813585A (en) | 2021-06-30 | 2022-06-28 | Systolic array with efficient input reduced and extended array performance |
Country Status (2)
Country | Link |
---|---|
US (1) | US20230004523A1 (en) |
CN (1) | CN117813585A (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11842169B1 (en) | 2019-09-25 | 2023-12-12 | Amazon Technologies, Inc. | Systolic multiply delayed accumulate processor architecture |
US11467806B2 (en) | 2019-11-27 | 2022-10-11 | Amazon Technologies, Inc. | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range |
US11816446B2 (en) | 2019-11-27 | 2023-11-14 | Amazon Technologies, Inc. | Systolic array component combining multiple integer and floating-point data types |
US11308027B1 (en) | 2020-06-29 | 2022-04-19 | Amazon Technologies, Inc. | Multiple accumulate busses in a systolic array |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5764556A (en) * | 1995-07-18 | 1998-06-09 | Advanced Micro Devices, Inc. | Method and apparatus for performing floating point addition |
US9552189B1 (en) * | 2014-09-25 | 2017-01-24 | Altera Corporation | Embedded floating-point operator circuitry |
US10049322B2 (en) * | 2015-05-21 | 2018-08-14 | Google Llc | Prefetching weights for use in a neural network processor |
US10019231B2 (en) * | 2016-08-22 | 2018-07-10 | Arm Limited | Apparatus and method for fixed point to floating point conversion and negative power of two detector |
KR20200107295A (en) * | 2019-03-07 | 2020-09-16 | 에스케이하이닉스 주식회사 | Systolic array and processing system |
US11210063B2 (en) * | 2019-03-27 | 2021-12-28 | Intel Corporation | Machine learning training architecture for programmable devices |
US11494163B2 (en) * | 2019-09-06 | 2022-11-08 | Intel Corporation | Conversion hardware mechanism |
US11188303B2 (en) * | 2019-10-02 | 2021-11-30 | Facebook, Inc. | Floating point multiply hardware using decomposed component numbers |
CN116594589B (en) * | 2019-12-31 | 2024-03-26 | 华为技术有限公司 | Method, device and arithmetic logic unit for floating point number multiplication calculation |
CN115934030B (en) * | 2020-01-20 | 2024-01-16 | 华为技术有限公司 | Arithmetic logic unit, method and equipment for floating point number multiplication |
- 2021-06-30: US US17/363,900 (US20230004523A1), active, Pending
- 2022-06-28: CN CN202280052183.4A (CN117813585A), active, Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160004506A1 (en) * | 2014-07-02 | 2016-01-07 | Via Alliance Semiconductor Co, Ltd. | Standard format intermediate result |
US20180121168A1 (en) * | 2016-10-27 | 2018-05-03 | Altera Corporation | Denormalization in multi-precision floating-point arithmetic circuitry |
US20190377549A1 (en) * | 2018-06-06 | 2019-12-12 | Nvidia Corporation | Stochastic rounding of numerical values |
US20210064985A1 (en) * | 2019-09-03 | 2021-03-04 | International Business Machines Corporation | Machine learning hardware having reduced precision parameter components for efficient parameter update |
WO2021108660A1 (en) * | 2019-11-27 | 2021-06-03 | Amazon Technologies, Inc. | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range |
Non-Patent Citations (1)
Title |
---|
H.T. KUNG 等: "Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization", 《ACM SESSION: MACHINE LEARNING II》, 17 April 2019 (2019-04-17) * |
Also Published As
Publication number | Publication date |
---|---|
US20230004523A1 (en) | 2023-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4066100B1 (en) | Systolic array component combining multiple integer and floating-point data types | |
US12067375B2 (en) | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range | |
US10698657B2 (en) | Hardware accelerator for compressed RNN on FPGA | |
KR102557589B1 (en) | Accelerated mathematical engine | |
US10817260B1 (en) | Reducing dynamic power consumption in arrays | |
CN107239829B (en) | Method for optimizing artificial neural network | |
CN117813585A (en) | Systolic array with efficient input reduced and extended array performance | |
US11880682B2 (en) | Systolic array with efficient input reduction and extended array performance | |
US11762803B2 (en) | Multiple accumulate busses in a systolic array | |
CN111008003B (en) | Data processor, method, chip and electronic equipment | |
EP4363963A1 (en) | Systolic array with efficient input reduction and extended array performance | |
TWI776213B (en) | Hardware circuit and method for multiplying sets of inputs, and non-transitory machine-readable storage device | |
CN110659014B (en) | Multiplier and neural network computing platform | |
US20220012304A1 (en) | Fast matrix multiplication | |
CN209895329U (en) | Multiplier and method for generating a digital signal | |
US11842169B1 (en) | Systolic multiply delayed accumulate processor architecture | |
CN210109863U (en) | Multiplier, device, neural network chip and electronic equipment | |
CN110647307B (en) | Data processor, method, chip and electronic equipment | |
WO2020108486A1 (en) | Data processing apparatus and method, chip, and electronic device | |
CN111260069B (en) | Data processing device, method, chip and electronic equipment | |
Zhang | Systolic Architectures for Efficient Deep Neural Network Implementations with Assured Performance | |
CN113031909A (en) | Data processor, method, device and chip | |
CN118765393A (en) | Method for calculating narrow bit width linear algebraic operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||