US11520854B2 - Support for different matrix multiplications by selecting adder tree intermediate results - Google Patents

Support for different matrix multiplications by selecting adder tree intermediate results

Info

Publication number
US11520854B2
Authority
US
United States
Prior art keywords
adders
elements
matrix
group
hierarchical tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/667,700
Other versions
US20210125044A1 (en)
Inventor
Yuchen Hao
Krishnakumar Narayanan Nair
Ehsan Khish Ardestani Zadeh
Rakesh Komuravelli
Abdulkadir Utku Diril
Thomas Mark Ulrich
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meta Platforms Inc
Original Assignee
Meta Platforms Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meta Platforms Inc
Priority to US16/667,700
Assigned to FACEBOOK, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: ZADEH, EHSAN KHISH ARDESTANI; DIRIL, ABDULKADIR UTKU; HAO, Yuchen; KOMURAVELLI, RAKESH; NAIR, KRISHNAKUMAR NARAYANAN; ULRICH, Thomas Mark
Priority to CN202011133658.1A (publication CN112749368A)
Priority to EP20203634.9A (publication EP3816790A1)
Publication of US20210125044A1
Assigned to META PLATFORMS, INC. CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignor: FACEBOOK, INC.
Application granted
Publication of US11520854B2
Legal status: Active
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F 7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G06F 7/53 Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel
    • G06F 7/5318 Multiplying only in parallel-parallel fashion, i.e. both operands being entered in parallel with column wise addition of partial products, e.g. using Wallace tree, Dadda counters

Definitions

  • FIG. 3 b is a diagram illustrating an example of an inefficient reduction of a data dimension associated with groupwise convolution due to small input and output channel sizes.
  • For reduction 325, suppose the number of input channels Cg is 16 and the number of output channels Kg is also 16. Because there is no cross-group computation, a 16×16 square of filters is stored in storage 332 and a 1×16 row of activations is broadcasted from storage 334 (compare with the 32×32 square and 1×32 row in FIG. 2 b ). Consequently, math engine 336 (configured for 32×32 data in the example illustrated, as in FIG. 2 b ) cannot be fully utilized (only 25% utilized in the example shown). Math engine 336 (as is the case with math engine 236 of FIG. 2 b ) is part of processing elements 104 of FIG. 1 .
  • An insufficiently large Kg leads to unused space toward the right end of matrix output 340, which, when written to memory, produces gaps (internal fragmentation) in the innermost dimension of the output tensor.
  • Another potential problem is that when utilization is 25%, at least one operand needs to be padded with zeros. For cache lines that are designed for 32 elements, only the first 16 elements would be valid. If groupwise convolution is followed by another type of convolution (e.g., normal convolution), additional processing is needed to remove zero-padding in the input operand.
  • FIG. 3 c is a diagram illustrating an example of a more efficient reduction of a data dimension associated with groupwise convolution with small input and output channel sizes.
  • Low utilization and output fragmentation occur in the example of FIG. 3 b because the per-group numbers of input and output channels are not sufficiently large to saturate the math engine.
  • Utilization can be increased and fragmentation reduced by creating two logically independent parts and mapping two independent groups to the two parts.
  • storage 352 and 354 are input storage A 106 and input storage B 108 of FIG. 1 , respectively, or vice versa.
  • Math engine utilization improves twofold with this type of packing in which each row in math engine 356 produces two results.
  • column vector 358 is filled with useful results, which are then stored in matrix output 360 transposed.
  • math engine 356 is part of processing elements 104 of FIG. 1 .
  • matrix output 360 is stored in output storage 110 of FIG. 1 .
  • the packing scheme shown requires modifications to hardware implementation 250 of FIG. 2 c.
  • Packing more groups into the math engine to improve efficiency for use with hardware implementation 250 of FIG. 2 c is problematic because each row of activations (e.g., in storage 354 ) is broadcasted to all 32 rows of the weights (e.g., storage 352 ) and the reduction tree in the math engine (e.g., reduction tree 254 of FIG. 2 c ) reduces each row to a single value. Stated alternatively, the adder tree of hardware implementation 250 of FIG. 2 c would not work because it is configured to reduce across 32 elements. As described below (e.g., see FIG. 4 ), the hardware implementation can be modified to support grouped reductions to allow packing of additional groups to improve utilization and reduce fragmentation.
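  • For illustration, the packed grouped reduction described above can be sketched in Python as follows (a hypothetical model, not part of the patent disclosure; the 32-element row width and 16-element group size follow the FIG. 3 c example). Two groups share one engine row, and the reduction stops one level early so that each group receives its own result:

```python
def packed_grouped_dot(weights_row, activations_row, group_size=16):
    """Compute independent per-group dot products from one packed engine row.

    weights_row and activations_row each hold two groups side by side
    (e.g., G0 in elements 0..15 and G1 in elements 16..31). Instead of
    summing all 32 products, the reduction stops one level early so each
    group's partial sum is returned separately.
    """
    products = [w * a for w, a in zip(weights_row, activations_row)]
    return [sum(products[i:i + group_size])
            for i in range(0, len(products), group_size)]


# G0 occupies the left half of the packed row, G1 the right half.
g0_w, g0_a = [1.0] * 16, [2.0] * 16
g1_w, g1_a = [3.0] * 16, [0.5] * 16
result_g0, result_g1 = packed_grouped_dot(g0_w + g1_w, g0_a + g1_a)
assert result_g0 == sum(w * a for w, a in zip(g0_w, g0_a))   # 32.0
assert result_g1 == sum(w * a for w, a in zip(g1_w, g1_a))   # 24.0
```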
  • FIG. 4 is a diagram illustrating an example hardware implementation of a dot product operation that includes support for small channel sizes.
  • hardware implementation 400 performs the reductions (associated with dot product operations) in math engine 356 of FIG. 3 c .
  • hardware implementation 400 replaces hardware implementation 250 of FIG. 2 c to support groupwise convolution (e.g., as shown in FIG. 3 a and FIG. 3 c ) as well as normal convolution (e.g., as shown in FIGS. 2 a - 2 b ).
  • Multipliers 402 corresponds to multipliers 252 in FIG. 2 c .
  • reduction tree 404 , a hierarchical adder tree, is a modified version of reduction tree 254 of FIG. 2 c .
  • the adder tree is broken up with multiplexers and demultiplexers so that results can be pulled from the next to last level of the adder tree in addition to the last level of the adder tree.
  • demultiplexer 406 receives the output from the left adder of the next to last level of adders and demultiplexer 408 receives the output from the right adder of the next to last level of adders.
  • the output of the left adder corresponds to a reduction value of G 0 in FIG. 3 c and the output of the right adder corresponds to a reduction value of G 1 in FIG. 3 c .
  • hardware implementation 400 can produce two independent reduction results if data routing of outputs of the next to last level in the adder tree is performed.
  • the left half of the inputs to multipliers 402 are associated with one group (e.g., G 0 in FIG. 3 c ) and the right half of the inputs are associated with another group (e.g., G 1 in FIG. 3 c ).
  • demultiplexers 406 and 408 route the next to last level outputs to either the last level in the adder tree or to multiplexer 410 .
  • the output of the last level adder corresponds to a single reduction value, which can be used if the math engine is saturated (e.g., for the normal convolution in FIG. 2 b ).
  • multiplexer 410 selects between the output of the last level adder or the output of the left next to last level adder to be placed in storage 412 . If a single reduction is desired (e.g., for normal convolution), the output of the last level adder (final result of the hierarchical tree of adders) should be selected. If two independent reductions for groupwise convolution are desired, the output of the left next to last level adder should be transferred to storage 412 and the output of the right next to last level adder should be transferred to storage 414 (because two independent reductions require two storage locations).
  • storage 412 and 414 are temporary registers. Storage 412 and 414 may also be static random-access memory. The contents of storage 412 and 414 may be transferred to a column vector output (e.g., column vector 238 of FIG. 2 b or column vector 358 of FIG. 3 c ) and/or matrix output storage (e.g., matrix output 240 of FIG. 2 or matrix output 360 of FIG. 3 c ).
  • the matrix output storage is a register array. Control signals are needed to indicate to the multiplexers and demultiplexers which results to pull (e.g., to indicate the size of the reduction dimension).
  • a signal for normal convolution can be sent to demultiplexers 406 and 408 to route next to last level adder tree outputs to the last level adder and sent to multiplexer 410 to select the result from the last level adder.
  • For groupwise convolution support (e.g., two groups of 16 elements), the signal can be inverted to direct demultiplexer 406 to route the left next to last level adder tree output to multiplexer 410 , direct demultiplexer 408 to route the right next to last level adder tree output to storage 414 , and direct multiplexer 410 to select the left next to last level adder tree output of demultiplexer 406 to send to storage 412 .
  • these control signals are not drawn explicitly in FIG. 4 .
  • one or more control signals are provided by control unit 126 of FIG. 1 .
  • control circuitry can be added at other levels in the adder tree.
  • control circuitry (e.g., multiplexers and demultiplexers) is inserted at the next to last level where there are two adders to produce two reduction values corresponding to two independent groups. If four reduction values corresponding to four independent groups are desired, control circuitry can be inserted one level above the next to last level in the adder tree where there are four adders.
  • control circuitry can be added at any level that exists in the adder tree to pull intermediate results corresponding to reduction of groups of size 2, 4, . . . , and up to 2^(N-1), where N is the number of levels in the adder tree.
  • the hardware implementation illustrated improves utilization from 25% to 50%, which translates to increased energy efficiency.
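  • As a rough functional model of this selection logic (illustrative only; the storage and component numbers follow FIG. 4 , and the control signal is simplified to a single boolean), the routing performed by demultiplexers 406 and 408 and multiplexer 410 can be sketched as follows:

```python
def route_reduction_outputs(left_partial, right_partial, grouped_mode):
    """Model the demultiplexer/multiplexer routing around the last adder level.

    left_partial and right_partial are the outputs of the next to last
    level adders. In normal mode both feed the last-level adder and the
    single result goes to storage 412. In grouped mode the last-level
    adder is bypassed: the left partial sum goes to storage 412 and the
    right partial sum goes to storage 414.
    """
    if grouped_mode:
        storage_412 = left_partial               # demux 406 -> mux 410 -> storage 412
        storage_414 = right_partial              # demux 408 -> storage 414
    else:
        storage_412 = left_partial + right_partial   # last-level adder result
        storage_414 = None                           # unused for a single reduction
    return storage_412, storage_414


assert route_reduction_outputs(10, 7, grouped_mode=False) == (17, None)
assert route_reduction_outputs(10, 7, grouped_mode=True) == (10, 7)
```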
  • FIG. 5 a is a flow chart illustrating an embodiment of a process for performing matrix multiplication that includes zero-padding in software.
  • this process includes pre-storing zeros and weights in memory (e.g., random-access memory, read-only memory, flash memory, etc.) so that the zeros and weights multiplied with inputs produce correct results.
  • the memory is memory 124 of FIG. 1 .
  • This technique is similar to the technique illustrated in FIG. 3 c and FIG. 4 in that throughput of the math engine can be increased by packing multiple groups together. Pre-storing zeros and weights in memory can result in more memory being used (e.g., see FIG. 5 b ) compared with the implementation associated with FIG. 3 c and FIG. 4 .
  • a first matrix to be multiplied with a first operand and a second matrix to be multiplied with a second operand concurrently are identified.
  • the two matrix multiplications are performed concurrently using a same dedicated matrix multiplication hardware unit configured to perform a multiplication of a matrix that is larger in size than the first matrix and the second matrix.
  • a multiplication hardware unit configured to perform multiplication of 32 ⁇ 32 matrices (e.g., math engine 356 of FIG. 3 c ) is used.
  • the first matrix and the second matrix can be 16 ⁇ 16 matrices (as is the case for the examples in FIG. 3 c and FIG. 4 ).
  • the process of FIG. 5 a produces the same outputs as in FIG. 3 c and FIG. 4 .
  • the first matrix and the second matrix are combined into a combined matrix.
  • the combined matrix includes zero-padded elements and a first group of elements of the combined matrix corresponding to the first matrix that do not share a column or a row of the combined matrix with a second group of elements of the combined matrix corresponding to the second matrix.
  • FIG. 5 b is a diagram illustrating an example of a combined matrix.
  • combined matrix 515 includes portions 520 , 522 , 524 , and 526 .
  • portion 520 stores the first matrix and portion 522 stores the second matrix. Portions 524 and 526 store zeros (are zero-padded).
  • combined matrix 515 may be a 32 ⁇ 32 matrix in which portions 520 and 522 are 16 ⁇ 16 data squares. Stated alternatively, portion 520 may correspond to a first 16 ⁇ 16 G 0 group and portion 522 may correspond to a second 16 ⁇ 16 G 1 group.
  • the storing of zero-padded portions in memory represents additional memory usage compared with the example of FIG. 3 c in which no zero-padded portions are stored in memory (e.g., memory 124 of FIG. 1 ). In the FIG. 3 c example, only non-zero data needs to be stored in memory and then transferred to storage (e.g., input storage A 106 and input storage B 108 of FIG. 1 ) associated with the math engine.
  • the first operand and the second operand are combined into a combined operand.
  • the operands are the counterparts with which the first and second matrices are multiplied.
  • the first matrix and second matrix are multiplied with the first operand and second operand, respectively, in a way analogous to the multiplication of the contents of storage 352 and storage 354 in FIG. 3 c .
  • the combined operand can be operands of different groups stored adjacent to each other in memory in a similar manner that G 0 and G 1 data are stored adjacent to each other in storage 354 of FIG. 3 c . Because the combined matrix is zero-padded, the combined operand does not need to be zero-padded.
  • the G 0 portion of the combined operand can be stored so that it aligns with and multiplies with either the G 0 portion of the combined matrix or a zero-padded portion so as not to generate erroneous results.
  • the G 1 portion of the combined operand can be stored so that it aligns with and multiplies with either the G 1 portion of the combined matrix or a zero-padded portion.
  • a multiplication of the combined matrix with the combined operand is performed to determine a combined result matrix.
  • the dedicated matrix multiplication hardware unit is used to determine the combined result matrix. Because the combined matrix includes half zeros, the combined result matrix also includes at least half zeros.
  • the combined result matrix has the same layout as the combined matrix. For example, if the combined matrix has the layout shown in FIG. 5 b , the combined result matrix will also have a layout where non-zero data is located in the upper left and lower right quadrants and zeros are located in the other two quadrants.
  • a result of multiplying the first matrix with the first operand is obtained from a first portion of the combined result matrix.
  • the result of multiplying the first matrix with the first operand is located in an upper left portion of the combined result matrix (e.g., in the same relative position as portion 520 of FIG. 5 b ).
  • a result of multiplying the second matrix with the second operand is obtained from a second portion of the combined result matrix.
  • the result of multiplying the second matrix with the second operand is located in a lower right portion of the combined result matrix (e.g., in the same relative position as portion 522 of FIG. 5 b ).
  • FIG. 5 b is a diagram illustrating an example of a combined matrix in accordance with the process of FIG. 5 a . See the discussion above with respect to FIG. 5 a .
  • the layout of the combined result matrix is the same as that illustrated in FIG. 5 b (data in portions corresponding to portions 520 and 522 and zeros in portions corresponding to portions 524 and 526 ) because the zeros in the combined matrix, when multiplied with operands, produce zeros in the combined result matrix.
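  • A small numpy sketch of this zero-padding approach is shown below (illustrative only; the 16×16 per-group size matches the example above, and laying the combined operand out block-diagonally is an assumption made to keep the example simple and to reproduce the quadrant layout described for the combined result matrix):

```python
import numpy as np

n = 16                       # per-group matrix size (16x16 in the example above)
A = np.random.rand(n, n)     # first matrix (portion 520)
B = np.random.rand(n, n)     # second matrix (portion 522)
op_a = np.random.rand(n, n)  # first operand
op_b = np.random.rand(n, n)  # second operand

# Combined 32x32 matrix: A in the upper left quadrant, B in the lower right,
# zero padding elsewhere (portions 524 and 526 of FIG. 5b).
combined = np.zeros((2 * n, 2 * n))
combined[:n, :n] = A
combined[n:, n:] = B

# Combined operand laid out so each group's operand aligns with its own group's
# portion of the combined matrix; the off-diagonal blocks align with zero padding
# and therefore do not affect the extracted results (filled with zeros here).
combined_op = np.zeros((2 * n, 2 * n))
combined_op[:n, :n] = op_a
combined_op[n:, n:] = op_b

# One multiplication on the larger hardware unit produces both results.
combined_result = combined @ combined_op
result_a = combined_result[:n, :n]   # upper left portion: A @ op_a
result_b = combined_result[n:, n:]   # lower right portion: B @ op_b
assert np.allclose(result_a, A @ op_a)
assert np.allclose(result_b, B @ op_b)
```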

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)

Abstract

A first group of elements is element-wise multiplied with a second group of elements using a plurality of multipliers belonging to a matrix multiplication hardware unit. Results of the plurality of multipliers are added together using a hierarchical tree of adders belonging to the matrix multiplication hardware unit and a final result of the hierarchical tree of adders or any of a plurality of intermediate results of the hierarchical tree of adders is selectively provided for use in determining an output result matrix. A control unit is used to instruct the matrix multiplication hardware unit to perform a plurality of different matrix multiplications in parallel by using a combined matrix that includes elements of a plurality of different operand matrices and utilize one or more selected ones of the intermediate results of the hierarchical tree of adders for use in determining the output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices.

Description

BACKGROUND OF THE INVENTION
Matrix multiplication is a central operation in many numerical algorithms used to solve various scientific and engineering computations. For example, matrix multiplication is an important component in artificial intelligence computations, such as inference. Since these computations are often demanding and data intensive, hardware solutions are often beneficial for improving performance. Computations can often be performed more quickly using hardware-based solutions that optimize the performance of various operations, e.g., matrix operations to support convolution. It is a technical challenge to create a hardware platform compatible with performing different matrix operations while also significantly improving performance and efficiency. Therefore, there exists a need for hardware and data path solutions that improve on the ability to efficiently perform operations without introducing significant complexity and restrictions.
BRIEF DESCRIPTION OF THE DRAWINGS
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
FIG. 1 is a block diagram illustrating an embodiment of a system for performing matrix multiplication using dot product operations.
FIG. 2 a is a diagram illustrating example data parameters associated with convolution.
FIG. 2 b is a diagram illustrating an example reduction of a data dimension associated with convolution performed via matrix multiplication.
FIG. 2 c is a diagram illustrating an example hardware implementation of a dot product operation associated with convolution performed via matrix multiplication.
FIG. 3 a is a diagram illustrating example data parameters associated with groupwise convolution.
FIG. 3 b is a diagram illustrating an example of an inefficient reduction of a data dimension associated with groupwise convolution due to small input and output channel sizes.
FIG. 3 c is a diagram illustrating an example of a more efficient reduction of a data dimension associated with groupwise convolution with small input and output channel sizes.
FIG. 4 is a diagram illustrating an example hardware implementation of a dot product operation that includes support for small channel sizes.
FIG. 5 a is a flow chart illustrating an embodiment of a process for performing matrix multiplication that includes zero-padding in software.
FIG. 5 b is a diagram illustrating an example of a combined matrix in accordance with the process of FIG. 5 a.
DETAILED DESCRIPTION
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
A system for performing matrix multiplication in hardware that includes support for various size inputs is disclosed. The disclosed system includes a matrix multiplication hardware unit comprising a plurality of multipliers configured to element-wise multiply a first group of elements with a second group of elements and a hierarchical tree of adders configured to add together results of the plurality of multipliers and selectively provide a final result of the hierarchical tree of adders or any of a plurality of intermediate results of the hierarchical tree of adders for use in determining an output result matrix. The disclosed system further includes a control unit configured to instruct the matrix multiplication hardware unit to perform a plurality of different matrix multiplications in parallel by using a combined matrix that includes elements of a plurality of different operand matrices and utilize one or more selected ones of the intermediate results of the hierarchical tree of adders for use in determining the output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices. Practical and technological benefits of the disclosed system include improved computational and power efficiency and reduced data fragmentation when performing matrix multiplication.
Matrix multiplication is an important component in many types of computations, e.g., artificial intelligence computations such as inference. In various embodiments, matrix multiplication is used to perform various types of convolutions. Matrix multiplication can be performed by using hardware that performs dot product operations (e.g., a dot product engine). Stated alternatively, a dot product engine can be used to compute different variants of convolutions. In various embodiments, the dot product engine requires large enough inputs (e.g., large enough input channels and output channels) to be efficient. Insufficiently large inputs can lead to low utilization and output fragmentation, which requires more hardware resources to remove gaps in the output.
Groupwise convolution, which is a common building block in modern deep learning neural networks, is oftentimes associated with limited input channels and output channels per convolution group. When the number of input channels and/or output channels is smaller than a specified size for which the dot product engine is configured, the overall efficiency of the dot product engine can drop and the output of the convolution can become fragmented in memory. Reorganizing the fragmented output uses more hardware resources. In various embodiments, the output is an output of a layer in a neural network and the output is fed to a next layer. The problem of fragmentation can be particularly significant for neural networks in which many layers have limited input channels and/or output channels (e.g., computer vision neural networks).
As described in further detail herein, in various embodiments, the dot product engine is logically partitioned by pulling out intermediate results from an adder tree so that input activations and weights from multiple groups can be fed to the dot product engine at the same time and computed independently. Benefits of this approach include improved dot product engine efficiency and more tightly packed (less fragmented) output tensors. Stated alternatively, a benefit is improved efficiency of groupwise convolution with small input and output channel sizes.
FIG. 1 is a block diagram illustrating an embodiment of a system for performing matrix multiplication using dot product operations. Matrix multiplication system 100 includes matrix multiplication unit 102, memory 124, and control unit 126. In the example shown, matrix multiplication unit 102 includes input storage A 106, input storage B 108, and output storage 110. In some embodiments, input storage A 106, input storage B 108, and output storage 110 are implemented as groups of hardware registers, such as flip-flop circuits. In some embodiments, input storage A 106 stores a first group of elements to be element-wise multiplied with a second group of elements and input storage B 108 stores the second group of elements. In some embodiments, the first group of elements are values of an input matrix (e.g., input activations) and the second group of elements are values of weights of one or more convolution filters (or vice versa with respect to what is stored in input storage A 106 and input storage B 108).
In various embodiments, output storage 110 stores a result produced by processing elements 104. The example shown is illustrative and not restrictive. For example, input storage A 106, input storage B 108, and output storage 110 may be implemented as a single group of the same type of registers. Furthermore, no additional storage for the output of processing elements 104 may be needed if the output is written to (and thus overwrites) the registers used to store the groups of elements to be element-wise multiplied.
In the example shown, matrix multiplication unit 102 includes processing elements 104, and processing elements 104 includes a first element 112 through an nth element 118. In various embodiments, each element of processing elements 104 performs a dot product operation. Having multiple elements (e.g., in parallel) allows for multiple dot product operations to be performed concurrently. In some embodiments, input storage A 106, input storage B 108, and output storage 110 store the various elements to be multiplied and the output elements for all of the parallel dot product operations. In the example illustrated, each element of processing elements 104 includes an element-wise multiplier and an adder tree. For example, first element 112 includes element-wise multiplier 114 and adder tree 116 and nth element 118 includes element-wise multiplier 120 and adder tree 122.
In various embodiments, each element-wise multiplier includes a plurality of individual multipliers. For example, if matrix multiplication unit 102 is configured to perform 32-element dot products, each element-wise multiplier includes 32 individual multipliers (e.g., see FIG. 2 c and FIG. 4 ). The specific type of multiplier depends on the format of the values to be multiplied. Examples of formats include unsigned integers, signed integers, various floating-point number formats, and so forth for various bit sizes (e.g., 8-bit, 16-bit, etc.). Various multiplier implementations known in the art may be used (e.g., serial multipliers, pipelined multipliers, combinatorial multipliers, etc.).
In various embodiments, each adder tree is configured to add together results of the plurality of multipliers. For example, if 32 values are element-wise multiplied, the adder tree can produce an output that is the sum of 32 values. In various embodiments, the adder tree is hierarchical and includes a number of levels equal to log2N, where N is the number of elements to be summed. In some embodiments, the output of the adder tree is stored in output storage 110. In various embodiments, each adder in the adder tree is implemented using basic digital logic gates. As described in further detail herein, in various embodiments, each adder tree is configured to selectively provide a final result of the adder tree or any of one or more intermediate results of the adder tree to support groupwise convolution (e.g., see FIG. 4 ). Matrix multiplication system 100 may be configured for various types of convolution (e.g., normal, groupwise, etc.). In various embodiments, matrix multiplication system 100 is configured to adapt to the type of convolution being performed (e.g., based on data parameters associated with the type of convolution) in order to improve efficiency.
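As an illustration of this adder tree structure, the following Python sketch (a simplified model for explanation only, not part of the patent disclosure) reduces a list of multiplier outputs level by level and keeps every level's partial sums, so that either the final result or any intermediate result can be selected:

```python
def adder_tree_levels(products):
    """Model a hierarchical adder tree.

    products is a list whose length is a power of two (e.g., 32 multiplier
    outputs). The return value holds the partial sums produced at every
    level of the tree: levels[0] is the multiplier outputs, levels[-1]
    contains the single final result, and the levels in between are the
    intermediate results that can optionally be tapped.
    """
    levels = [list(products)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


# Example: 32 products reduced in log2(32) = 5 levels.
products = list(range(32))
levels = adder_tree_levels(products)
final_result = levels[-1][0]      # output of the single last-level adder
two_group_sums = levels[-2]       # next to last level: two partial sums
assert final_result == sum(products)
assert two_group_sums == [sum(products[:16]), sum(products[16:])]
```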
In the example illustrated, memory 124 is coupled to matrix multiplication unit 102. In some embodiments, data stored in input storage A 106 and/or input storage B 108 is loaded from memory 124. In various embodiments, an output produced by processing elements 104 is written back to memory 124. The output result may first be stored in output storage 110. Examples of memory 124 include non-volatile memory (e.g., flash memory, various types of read-only memory, etc.), volatile memory (e.g., various types of random-access memory), or any other type of memory.
In the example illustrated, control unit 126 is coupled to matrix multiplication unit 102 and memory 124. In various embodiments, control unit 126 is implemented using digital electronic circuits (e.g., assemblies of digital logic gates printed on an integrated circuit). Control unit 126 directs the transfer of data from memory 124 to and from matrix multiplication unit 102 (e.g., to input storage A 106 and input storage B 108 and from output storage 110). In various embodiments, control unit 126 instructs matrix multiplication unit 102 to perform a plurality of different matrix multiplications in parallel by using a combined matrix that includes elements of a plurality of different operand matrices (e.g., see FIG. 3 c ) and utilize one or more selected ones of the intermediate results of a hierarchical tree of adders for use in determining an output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices (e.g., see FIG. 4 ). Stated alternatively, as described in further detail herein, in various embodiments, control unit 126 directs matrix multiplication unit 102 to provide either a final result of an adder tree or any of one or more intermediate results of the adder tree to support groupwise convolution. As used herein, the final result of the adder tree refers to the output of the single adder in the last level of the adder tree. In some embodiments, controlling which result to provide includes supplying control signals to a plurality of multiplexers and/or demultiplexers (e.g., see FIG. 4 ).
In the example illustrated in FIG. 1 , portions of the communication path between the components are shown. Other communication paths may exist, and the example of FIG. 1 has been simplified to illustrate the example clearly. For example, control signals are not shown explicitly in FIG. 1 . Furthermore, not all connections between storage elements and memory are shown. Although single instances of components have been shown to simplify the diagram, additional instances of any of the components shown in FIG. 1 may exist. The number of components and the connections shown in FIG. 1 are merely illustrative. For example, additional instances of internal storage may exist. Components not shown in FIG. 1 may also exist.
FIG. 2 a is a diagram illustrating example data parameters associated with convolution. In the example illustrated, convolution 200 includes convolving input 202 with filters 204 to produce output 206. Input 202 may be a three-dimensional image with height (H), width (W), and channel (C) dimensions. C is also referred to herein as the number of input channels. The channel dimension can comprise various data types. For example, an input may have three channels for the colors red, green, and blue (or more channels for more colors, and so forth; the number of channels can be much larger than three). Channels can also correspond to other data types. In the example illustrated, input 202 is convolved with each of K number of filters. Each filter has dimensions of FH×FW×C where FH is the filter height, FW is the filter width, and C is the same number of channels as in input 202. Convolution 200 reduces the C dimension so that the C dimension is not present in output 206. In the example shown, output 206 has the same height and width as input 202. The depth of output 206 is K (the number of filters, which is also referred to herein as the number of output channels) because each filter in filters 204 reduces input 202 along the C dimension and creates a separate H×W image for each filter.
FIG. 2 b is a diagram illustrating an example reduction of a data dimension associated with convolution performed via matrix multiplication. In some embodiments, the reduction is performed by matrix multiplication system 100 when it is configured to handle normal convolution (as opposed to groupwise convolution). Filter values (e.g., associated with filters 204 of FIG. 2 a ) may be loaded into input storage A 106 of FIG. 1 . These filter values are also referred to herein as weights. Input values (e.g., associated with input 202 of FIG. 2 a ) may be loaded into input storage B 108 of FIG. 1 . These input values are also referred to herein as activations. Either input storage A 106 or input storage B 108 of FIG. 1 may be used for either weights or activations.
In some embodiments, a 32×32 square of filter values is loaded into storage 232 and a 1×32 row of activations in storage 234 is broadcasted into math engine 236 to perform reduction 225 (e.g., when math engine 236 is configured for 32-element multiplication). In some embodiments, math engine 236 is part of processing elements 104 of FIG. 1 . One dimension of storage 232 is dedicated to storing channel (C) data. The other dimension of storage 232 stores linearized FH, FW, and K data (labeled as K in the example shown). One dimension of storage 234 is also dedicated to storing C data. The other dimension of storage 234 stores linearized H and W data (labeled as W in the example shown). Stated alternatively, the innermost (contiguous) dimension for storage 232 and storage 234 is C, the outer dimension for storage 232 is FH, FW, and K linearized, and the outer dimension for storage 234 is H and W linearized.
If math engine 236 is configured for 32-element multiplication, it performs 1024 multiplications and 32 row-wise adder tree computations to produce 32-element column vector 238. By storing linearized inputs, convolution can be mapped into a set of matrix multiplications. The three-dimensional convolution shown in FIG. 2 a can be mapped to two-dimensional matrix multiplication. In the example illustrated, during each cycle, a row from storage 234 is broadcasted to every row in storage 232 and element-wise vector multiplication is performed. An adder tree (e.g., see FIG. 2 c ) can be used to reduce each product vector into a single element. After broadcasting the row through all of storage 232, column vector 238 is the result. After 32 broadcasts, 32×32 matrix output 240 is the result. In some embodiments, column vector 238 is transferred to matrix output 240 and stored in transposed form. In some embodiments, matrix output 240 is stored in output storage 110 of FIG. 1 . As shown, the C dimension of storage 232 and the C dimension of storage 234 have been collapsed, resulting in an output with the non-C dimensions K and W of storage 232 and storage 234, respectively. In this example, when C and K are at least 32, math engine 236 can be fully utilized.
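The broadcast-and-reduce scheme described above can be modeled with a short NumPy sketch (sizes and names are assumptions for illustration; in hardware each per-row sum is produced by an adder tree rather than a software reduction):

```python
import numpy as np

weights = np.random.rand(32, 32)      # analogous to storage 232; innermost dimension is C
activations = np.random.rand(32, 32)  # analogous to storage 234; innermost dimension is C

output = np.zeros((32, 32))
for i in range(32):
    row = activations[i]                        # one 1x32 activation row
    # Broadcast the row against every weight row: 1024 multiplications,
    # then 32 row-wise reductions yield a 32-element column vector.
    column_vector = (weights * row).sum(axis=1)
    output[i] = column_vector                   # the column vector is stored transposed

# After 32 broadcasts the result equals an ordinary matrix multiplication.
assert np.allclose(output, activations @ weights.T)
```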
FIG. 2 c is a diagram illustrating an example hardware implementation of a dot product operation associated with convolution performed via matrix multiplication. In some embodiments, hardware implementation 250 performs the C reduction of math engine 236 in FIG. 2 b . Stated alternatively, hardware implementation 250 performs element-wise multiplication of two vectors and sums the element-wise multiplied results. In the example illustrated, element-wise multiplication of two 32-element vectors is performed with multipliers 252 (comprising 32 multipliers). Various multiplier implementations known in the art may be used (e.g., serial multipliers, pipelined multipliers, combinational multipliers, etc.). Reduction tree 254 sums the outputs of multipliers 252. In this example, reduction tree 254 includes a hierarchical tree of adders. In various embodiments, the number of levels in the adder tree is equal to log₂(N), where N is the number of elements to be summed. In various examples illustrated herein (e.g., FIG. 2 b , FIG. 2 c , FIG. 3 b , FIG. 3 c , and FIG. 4 ), 32-element multiplication is shown. This is illustrative and not restrictive. Multiplication of other numbers of elements (e.g., 64, 128, etc.) is also possible.
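For illustration, the following behavioral sketch sums a 32-element product vector with a software model of a hierarchical adder tree and records the outputs of every level; it mirrors the log₂(N) levels of reduction described above but is not a description of the actual circuit:

```python
import numpy as np

def adder_tree_sum(values):
    """Sum a power-of-two-length vector with a hierarchical tree of adders.

    Returns the final result and the outputs of each adder level; a behavioral
    sketch only, with names that are illustrative rather than from the figures."""
    level = list(values)
    level_outputs = []
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        level_outputs.append(level)            # outputs of this level of adders
    return level[0], level_outputs

products = np.random.rand(32)                   # outputs of the 32 multipliers
final_result, levels = adder_tree_sum(products)
assert len(levels) == 5                         # log2(32) = 5 levels of adders
assert np.isclose(final_result, products.sum())
```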
FIG. 3 a is a diagram illustrating example data parameters associated with groupwise convolution. In groupwise convolution 300, input 302 is analogous to input 202 of FIG. 2 a except that data is organized as groups. Input 302 includes H, W, and C data as is the case for input 202 of FIG. 2 a . In the example shown, filters 304 is analogous to filters 204 of FIG. 2 a . Filters 304 includes FH, FW, C, and K data as is the case for filters 204 of FIG. 2 a . However, input 302 and filters 304 are organized as groups. In various embodiments, for groupwise convolution, grouping is associated with C and K (not H, W, FH, or FW). In the example shown, input 302 is partitioned along C in groups G0, G1, G2, and G3. In the example shown, filters 304 is partitioned along C and K in two dimensions with linearized FH and FW comprising the third dimension. In the example of FIG. 2 a , C could be reduced completely. In the groupwise convolution example shown, reduction of C does not occur across groups because there is no cross-group computation. With very large C and K values, squares of data to saturate a math engine can also be produced (e.g., 32×32 squares for a math engine configured for 32-element multiplication) as in FIG. 2 b . However, as described below, when C (number of input channels) and K (number of output channels) are not sufficiently large, underutilization and fragmentation can occur. There is a greater tendency for insufficiently large C and K in groupwise convolution because of the grouping and the lack of cross-group computation.
FIG. 3 b is a diagram illustrating an example of an inefficient reduction of a data dimension associated with groupwise convolution due to small input and output channel sizes. For reduction 325, suppose the number of input channels Cg is 16 and the number of output channels Kg is also 16. Because there is no cross-group computation, a 16×16 square of filters is stored in storage 332 and a 1×16 row of activations is broadcasted from storage 334 (compare with the 32×32 square and 1×32 row in FIG. 2 b ). Consequently, math engine 336 (configured for 32×32 data in the example illustrated, as in FIG. 2 b ) cannot be fully utilized (only 25% utilized in the example shown). Math engine 336 (as is the case with math engine 236 of FIG. 2 b ) collapses the input channel (Cg) dimension. In this groupwise convolution example, due to the underutilization of math engine 336, only a 16-element column vector 338 is produced (compare with the 32-element column vector in FIG. 2 b ). In the example shown, column vector 338 is transferred to matrix output 340 and stored in transposed form.
As illustrated here, insufficiently large Kg leads to unused space towards the right end of matrix output 340, which, when written to memory, produces gaps (internal fragmentation) in the innermost dimension of the output tensor. Another potential problem is that when utilization is 25%, at least one operand needs to be padded with zeros. For cache lines that are designed for 32 elements, only the first 16 elements would be valid. If groupwise convolution is followed by another type of convolution (e.g., normal convolution), additional processing is needed to remove the zero-padding in the input operand.
FIG. 3 c is a diagram illustrating an example of a more efficient reduction of a data dimension associated with groupwise convolution with small input and output channel sizes. Low utilization and output fragmentation occur in the example of FIG. 3 b because the per-group numbers of input and output channels are not sufficiently large to saturate the math engine. Utilization can be increased and fragmentation reduced by creating two logically independent parts and mapping two independent groups to the two parts. In the example shown, for reduction 350, with Cg=Kg=16, two groups, G0 and G1, are packed along the Cg dimension in storage 352 and storage 354 as opposed to a single group in storage 332 and storage 334 of FIG. 3 b . In some embodiments, storage 352 and 354 are input storage A 106 and input storage B 108 of FIG. 1 , respectively, or vice versa. Math engine utilization improves twofold with this type of packing, in which each row in math engine 356 produces two results. Furthermore, column vector 358 is filled with useful results, which are then stored transposed in matrix output 360. In some embodiments, math engine 356 is part of processing elements 104 of FIG. 1 . In some embodiments, matrix output 360 is stored in output storage 110 of FIG. 1 . As described below, the packing scheme shown requires modifications to hardware implementation 250 of FIG. 2 c.
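The packing can be sketched as follows (16-channel groups are assumed, as in the example above; names are illustrative): each packed weight row holds the G0 weights in its left half and the G1 weights in its right half, and reducing each half separately yields two independent dot products per row.

```python
import numpy as np

Cg, Kg = 16, 16
w_g0 = np.random.rand(Kg, Cg)     # G0 filter square
w_g1 = np.random.rand(Kg, Cg)     # G1 filter square
a_g0 = np.random.rand(Cg)         # one G0 activation row
a_g1 = np.random.rand(Cg)         # one G1 activation row

# Pack the two groups along the innermost (channel) dimension.
packed_weights = np.concatenate([w_g0, w_g1], axis=1)    # Kg x 32
packed_activations = np.concatenate([a_g0, a_g1])        # 1 x 32

products = packed_weights * packed_activations            # element-wise, per packed row
g0_results = products[:, :Cg].sum(axis=1)                 # left-half reduction per row
g1_results = products[:, Cg:].sum(axis=1)                 # right-half reduction per row

assert np.allclose(g0_results, w_g0 @ a_g0)
assert np.allclose(g1_results, w_g1 @ a_g1)
# The per-row pairs of results fill the column vector with useful values.
```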
Packing more groups into the math engine to improve efficiency for use with hardware implementation 250 of FIG. 2 c is problematic because each row of activations (e.g., in storage 354) is broadcasted to all 32 rows of the weights (e.g., storage 352) and the reduction tree in the math engine (e.g., reduction tree 254 of FIG. 2 c ) reduces each row to a single value. Stated alternatively, the adder tree of hardware implementation 250 of FIG. 2 c would not work because it is configured to reduce across 32 elements. As described below (e.g., see FIG. 4 ), the hardware implementation can be modified to support grouped reductions to allow packing of additional groups to improve utilization and reduce fragmentation.
FIG. 4 is a diagram illustrating an example hardware implementation of a dot product operation that includes support for small channel sizes. In some embodiments, hardware implementation 400 performs the reductions (associated with dot product operations) in math engine 356 of FIG. 3 c . In some embodiments, hardware implementation 400 replaces hardware implementation 250 of FIG. 2 c to support groupwise convolution (e.g., as shown in FIG. 3 a and FIG. 3 c ) as well as normal convolution (e.g., as shown in FIGS. 2 a-2 b ). Multipliers 402 corresponds to multipliers 252 in FIG. 2 c . In the example shown, reduction tree 404, a hierarchical adder tree, is a modified version of reduction tree 254 of FIG. 2 c . In the example shown, the adder tree is broken up with multiplexers and demultiplexers so that results can be pulled from the next to last level of the adder tree in addition to the last level of the adder tree.
In the example illustrated, demultiplexer 406 receives the output from the left adder of the next to last level of adders and demultiplexer 408 receives the output from the right adder of the next to last level of adders. The output of the left adder corresponds to a reduction value of G0 in FIG. 3 c and the output of the right adder corresponds to a reduction value of G1 in FIG. 3 c . Thus, hardware implementation 400 can produce two independent reduction results if data routing of outputs of the next to last level in the adder tree is performed. To support groupwise convolution, the left half of the inputs to multipliers 402 are associated with one group (e.g., G0 in FIG. 3 c ) for reduction and the right half of the inputs to multipliers 402 are associated with another group (e.g., G1 in FIG. 3 c ) so that the output of the left next to last level adder corresponds to the first group and the output of the right next to last level adder corresponds to the other group.
In the example shown, demultiplexers 406 and 408 route the next to last level outputs to either the last level in the adder tree or to multiplexer 410. When the next to last level outputs are routed to the last level adder, the output of the last level adder corresponds to a single reduction value, which can be used if the math engine is saturated (e.g., for the normal convolution in FIG. 2 b ). In the example illustrated, multiplexer 410 selects between the output of the last level adder or the output of the left next to last level adder to be placed in storage 412. If a single reduction is desired (e.g., for normal convolution), the output of the last level adder (final result of the hierarchical tree of adders) should be selected. If two independent reductions for groupwise convolution are desired, the output of the left next to last level adder should be transferred to storage 412 and the output of the right next to last level adder should be transferred to storage 414 (because two independent reductions require two storage locations).
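The selection described above can be modeled with a short sketch (the function name and the grouped flag are illustrative assumptions, not hardware signals): when a single reduction is desired, the two next to last level sums feed the last level adder; when two group reductions are desired, the two sums are returned separately.

```python
import numpy as np

def reduce_with_select(products, grouped):
    """Behavioral sketch of the selectable reduction of hardware implementation 400."""
    left_intermediate = products[:16].sum()    # output of the left next to last level adder
    right_intermediate = products[16:].sum()   # output of the right next to last level adder
    if grouped:
        # Demultiplexers route the two intermediate results to two storage locations.
        return left_intermediate, right_intermediate
    # Demultiplexers feed the last level adder; the multiplexer selects its final result.
    return left_intermediate + right_intermediate

products = np.random.rand(32)
assert np.isclose(reduce_with_select(products, grouped=False), products.sum())
g0_sum, g1_sum = reduce_with_select(products, grouped=True)
assert np.isclose(g0_sum + g1_sum, products.sum())
```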
In some embodiments, storage 412 and 414 are temporary registers. Storage 412 and 414 may also be static random-access memory. The contents of storage 412 and 414 may be transferred to a column vector output (e.g., column vector 238 of FIG. 2 b or column vector 358 of FIG. 3 c ) and/or matrix output storage (e.g., matrix output 240 of FIG. 2 b or matrix output 360 of FIG. 3 c ). In some embodiments, the matrix output storage is a register array. Control signals are needed to indicate to the multiplexers and demultiplexers which results to pull (e.g., to indicate the size of the reduction dimension). For example, a signal for normal convolution (e.g., a single group of 32 elements) can be sent to demultiplexers 406 and 408 to route the next to last level adder tree outputs to the last level adder and sent to multiplexer 410 to select the result from the last level adder. For groupwise convolution support (e.g., two groups of 16 elements), the signal can be inverted to direct demultiplexer 406 to route the left next to last level adder tree output to multiplexer 410, direct demultiplexer 408 to route the right next to last level adder tree output to storage 414, and direct multiplexer 410 to select the left next to last level adder tree output of demultiplexer 406 to send to storage 412. For purposes of clarity of illustration, these control signals are not drawn explicitly in FIG. 4 . In some embodiments, one or more control signals are provided by control unit 126 of FIG. 1 .
The example shown is illustrative and not restrictive. Various other data routing configurations can be used (e.g., using different combinations of multiplexers and demultiplexers or with other logic circuits). In addition, control circuitry can be added at other levels in the adder tree. In the example illustrated, control circuitry (e.g., multiplexers and demultiplexers) is inserted at the next to last level where there are two adders to produce two reduction values corresponding to two independent groups. If four reduction values corresponding to four independent groups are desired, control circuitry can be inserted one level above the next to last level in the adder tree where there are four adders. Stated alternatively, control circuitry can be added at any level that exists in the adder tree to pull intermediate results corresponding to reduction of groups of size 2, 4, . . . , and up to 2^(N−1), where N is the number of levels in the adder tree. The hardware implementation illustrated improves utilization from 25% to 50%, which translates to increased energy efficiency.
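A behavioral sketch of this generalization (illustrative only): pulling intermediate results at a given level of the adder tree is equivalent to summing the multiplier outputs in contiguous groups whose size is the corresponding power of two.

```python
import numpy as np

def grouped_reduce(products, group_size):
    """Sum 'products' in contiguous groups of 'group_size' (a power of two).

    Behavioral model of pulling results at the adder tree level whose partial
    sums cover 'group_size' elements; names are illustrative assumptions."""
    assert len(products) % group_size == 0
    return products.reshape(-1, group_size).sum(axis=1)

products = np.random.rand(32)
print(grouped_reduce(products, 32))  # final result: one full 32-element reduction
print(grouped_reduce(products, 16))  # next to last level: two group reductions
print(grouped_reduce(products, 8))   # one level higher: four group reductions
```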
FIG. 5 a is a flow chart illustrating an embodiment of a process for performing matrix multiplication that includes zero-padding in software. In various embodiments, this process includes pre-storing zeros and weights in memory (e.g., random-access memory, read-only memory, flash memory, etc.) so that the zeros and weights multiplied with inputs produce correct results. In some embodiments, the memory is memory 124 of FIG. 1 . This technique is similar to the technique illustrated in FIG. 3 c and FIG. 4 in that throughput of the math engine can be increased by packing multiple groups together. Pre-storing zeros and weights in memory can result in more memory being used (e.g., see FIG. 5 b ) compared with the implementation associated with FIG. 3 c and FIG. 4 .
At 502, a first matrix to be multiplied with a first operand and a second matrix to be multiplied with a second operand concurrently are identified. In various embodiments, the two matrix multiplications are performed concurrently using a same dedicated matrix multiplication hardware unit configured to perform a multiplication of a matrix that is larger in size than the first matrix and the second matrix. In some embodiments, a multiplication hardware unit configured to perform multiplication of 32×32 matrices (e.g., math engine 356 of FIG. 3 c ) is used. The first matrix and the second matrix can be 16×16 matrices (as is the case for the examples in FIG. 3 c and FIG. 4 ). Thus, in some embodiments, the process of FIG. 5 a produces the same outputs as in FIG. 3 c and FIG. 4 .
At 504, the first matrix and the second matrix are combined into a combined matrix. In various embodiments, the combined matrix includes zero-padded elements, and a first group of elements of the combined matrix corresponding to the first matrix does not share a column or a row of the combined matrix with a second group of elements of the combined matrix corresponding to the second matrix. FIG. 5 b is a diagram illustrating an example of a combined matrix. In the example shown in FIG. 5 b , combined matrix 515 includes portions 520, 522, 524, and 526. In the example shown, portion 520 stores the first matrix and portion 522 stores the second matrix. Portions 524 and 526 store zeros (are zero-padded). For example, combined matrix 515 may be a 32×32 matrix in which portions 520 and 522 are 16×16 data squares. Stated alternatively, portion 520 may correspond to a first 16×16 G0 group and portion 522 may correspond to a second 16×16 G1 group. The storing of zero-padded portions in memory represents additional memory usage compared with the example of FIG. 3 c in which no zero-padded portions are stored in memory (e.g., memory 124 of FIG. 1 ). In the FIG. 3 c example, only non-zero data needs to be stored in memory and then transferred to storage (e.g., input storage A 106 and input storage B 108 of FIG. 1 ) associated with the math engine.
At 506, the first operand and the second operand are combined into a combined operand. In various embodiments, the operands are the counterparts with which the first and second matrices are multiplied. The first matrix and second matrix are multiplied with the first operand and second operand, respectively, in a way analogous to the multiplication of the contents of storage 352 and storage 354 in FIG. 3 c . For example, the combined operand can be operands of different groups stored adjacent to each other in memory in a similar manner that G0 and G1 data are stored adjacent to each other in storage 354 of FIG. 3 c . Because the combined matrix is zero-padded, the combined operand does not need to be zero-padded. The G0 portion of the combined operand can be stored so that it aligns with and multiplies with either the G0 portion of the combined matrix or a zero-padded portion so as not to generate erroneous results. Similarly, the G1 portion of the combined operand can be stored so that it aligns with and multiplies with either the G1 portion of the combined matrix or a zero-padded portion.
At 508, a multiplication of the combined matrix with the combined operand is performed to determine a combined result matrix. In various embodiments, the dedicated matrix multiplication hardware unit is used to determine the combined result matrix. Because the combined matrix includes half zeros, the combined result matrix also includes at least half zeros. In various embodiments, the combined result matrix has the same layout as the combined matrix. For example, if the combined matrix has the layout shown in FIG. 5 b , the combined result matrix will also have a layout where non-zero data is located in the upper left and lower right quadrants and zeros are located in the other two quadrants.
At 510, a result of multiplying the first matrix with the first operand is obtained from a first portion of the combined result matrix. In some embodiments, the result of multiplying the first matrix with the first operand is located in an upper left portion of the combined result matrix (e.g., in the same relative position as portion 520 of FIG. 5 b ).
At 512, a result of multiplying the second matrix with the second operand is obtained from a second portion of the combined result matrix. In some embodiments, the result of multiplying the second matrix with the second operand is located in a lower right portion of the combined result matrix (e.g., in the same relative position as portion 522 of FIG. 5 b ).
FIG. 5 b is a diagram illustrating an example of a combined matrix in accordance with the process of FIG. 5 a . See the discussion above with respect to FIG. 5 a . As described above, in various embodiments, the layout of the combined result matrix is the same as that illustrated in FIG. 5 b (data in portions corresponding to portions 520 and 522 and zeros in portions corresponding to portions 524 and 526) because the zeros in the combined matrix, when multiplied with operands, produce zeros in the combined result matrix.
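The process of FIG. 5 a can be sketched in NumPy as follows (16×16 matrices and a 32×32 multiplication are assumed; for illustration the combined operand is also laid out block-diagonally so that the combined result matrix reproduces the quadrant layout of FIG. 5 b , although, as noted at 506, the operand itself need not store zeros in memory):

```python
import numpy as np

first_matrix = np.random.rand(16, 16)
second_matrix = np.random.rand(16, 16)
first_operand = np.random.rand(16, 16)
second_operand = np.random.rand(16, 16)

zeros = np.zeros((16, 16))
# Step 504: combine the matrices; portions 520 and 522 hold data, 524 and 526 hold zeros.
combined_matrix = np.block([[first_matrix, zeros], [zeros, second_matrix]])
# Step 506: combine the operands (laid out block-diagonally here for illustration).
combined_operand = np.block([[first_operand, zeros], [zeros, second_operand]])

# Step 508: one large multiplication determines the combined result matrix.
combined_result = combined_matrix @ combined_operand

first_result = combined_result[:16, :16]    # step 510: upper-left portion
second_result = combined_result[16:, 16:]   # step 512: lower-right portion

assert np.allclose(first_result, first_matrix @ first_operand)
assert np.allclose(second_result, second_matrix @ second_operand)
# The other two quadrants of the combined result matrix contain only zeros.
assert not combined_result[:16, 16:].any() and not combined_result[16:, :16].any()
```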
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

Claims (20)

What is claimed is:
1. A system, comprising:
a matrix multiplication hardware unit configured to perform matrix multiplications of a first size, comprising:
a plurality of multipliers configured to element-wise multiply a first group of elements with a second group of elements; and
a hierarchical tree of adders configured to add together results of the plurality of multipliers and selectively provide a final result of the hierarchical tree of adders or any of a plurality of intermediate results of the hierarchical tree of adders for use in determining an output result matrix, wherein the hierarchical tree of adders includes a demultiplexer configured to receive an output of an adder in a next-to-last level of adders in the hierarchical tree of adders, wherein the demultiplexer is configured to provide an output that is an input to an adder in a last level of adders in the hierarchical tree of adders; and
a control unit configured to instruct the matrix multiplication hardware unit to perform a plurality of different matrix multiplications of a second size smaller than the first size in parallel by using a combined matrix that includes elements of a plurality of different operand matrices and utilize one or more selected ones of the intermediate results of the hierarchical tree of adders for use in determining the output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices.
2. The system of claim 1, wherein the first group of elements is associated with convolution filter values and the second group of elements is associated with input values to be convolved with the convolution filter values, or vice versa.
3. The system of claim 1, wherein the first group of elements and the second group of elements include color channel data.
4. The system of claim 1, wherein the first group of elements and the second group of elements are loaded from a memory and stored in registers.
5. The system of claim 1, wherein the first group of elements and the second group of elements include data associated with a neural network computation.
6. The system of claim 1, wherein the plurality of multipliers includes thirty-two multipliers.
7. The system of claim 1, wherein the hierarchical tree of adders includes a number of hierarchical levels of adders equal to logarithm base two of the number of multipliers in the plurality of multipliers.
8. The system of claim 1, wherein the hierarchical tree of adders includes five hierarchical levels.
9. The system of claim 1, wherein the output result matrix is stored in a register array and transferred to a memory.
10. The system of claim 1, wherein the plurality of intermediate results is two intermediate results.
11. The system of claim 1, wherein the plurality of intermediate results are outputs of adders in the next-to-last level of adders in the hierarchical tree of adders.
12. The system of claim 1, wherein the matrix multiplication hardware unit further comprises one or more multiplexers.
13. The system of claim 1, wherein an output of the adder in the last level of adders in the hierarchical tree of adders is an input to a multiplexer having another input that is another output of the demultiplexer.
14. The system of claim 1, wherein the control unit is configured to transfer the elements of the plurality of different operand matrices from a memory to registers.
15. The system of claim 1, wherein the control unit is configured to transfer the elements of the plurality of different operand matrices from registers to the matrix multiplication hardware unit.
16. The system of claim 1, wherein the control unit is configured to send one or more control signals to one or more multiplexers and demultiplexers in the matrix multiplication hardware unit.
17. A method, comprising:
element-wise multiplying a first group of elements with a second group of elements using a plurality of multipliers belonging to a matrix multiplication hardware unit configured to perform matrix multiplications of a first size;
adding together results of the plurality of multipliers using a hierarchical tree of adders belonging to the matrix multiplication hardware unit and selectively providing a final result of the hierarchical tree of adders or any of a plurality of intermediate results of the hierarchical tree of adders for use in determining an output result matrix, wherein the hierarchical tree of adders includes a demultiplexer configured to receive an output of an adder in a next-to-last level of adders in the hierarchical tree of adders, wherein the demultiplexer is configured to provide an output that is an input to an adder in a last level of adders in the hierarchical tree of adders; and
using a control unit to instruct the matrix multiplication hardware unit to perform a plurality of different matrix multiplications of a second size smaller than the first size in parallel by using a combined matrix that includes elements of a plurality of different operand matrices and utilize one or more selected ones of the intermediate results of the hierarchical tree of adders for use in determining the output result matrix that includes different groups of elements representing different multiplication results corresponding to different ones of the different operand matrices.
18. The method of claim 17, wherein the first group of elements is associated with convolution filter values and the second group of elements is associated with input values to be convolved with the convolution filter values, or vice versa.
19. The method of claim 17, wherein the first group of elements and the second group of elements include color channel data.
20. The method of claim 17, wherein the first group of elements and the second group of elements are loaded from a memory and stored in registers.

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/667,700 US11520854B2 (en) 2019-10-29 2019-10-29 Support for different matrix multiplications by selecting adder tree intermediate results
CN202011133658.1A CN112749368A (en) 2019-10-29 2020-10-21 Supporting different matrix multiplications by selecting adder tree intermediate results
EP20203634.9A EP3816790A1 (en) 2019-10-29 2020-10-23 Support for different matrix multiplications by selecting adder tree intermediate results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16/667,700 US11520854B2 (en) 2019-10-29 2019-10-29 Support for different matrix multiplications by selecting adder tree intermediate results

Publications (2)

Publication Number Publication Date
US20210125044A1 US20210125044A1 (en) 2021-04-29
US11520854B2 true US11520854B2 (en) 2022-12-06

Family

ID=73014331

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/667,700 Active 2041-04-17 US11520854B2 (en) 2019-10-29 2019-10-29 Support for different matrix multiplications by selecting adder tree intermediate results

Country Status (3)

Country Link
US (1) US11520854B2 (en)
EP (1) EP3816790A1 (en)
CN (1) CN112749368A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220101518A (en) * 2021-01-11 2022-07-19 에스케이하이닉스 주식회사 Multiplication and accumulation circuit and processing-in-memory device having the same


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190227982A1 (en) 2006-04-12 2019-07-25 Intel Corporation Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
US20130262548A1 (en) * 2012-03-27 2013-10-03 Fujitsu Semiconductor Limited Matrix calculation unit
US20180321938A1 (en) * 2017-05-08 2018-11-08 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US20190303749A1 (en) 2018-03-30 2019-10-03 International Business Machines Corporation Massively parallel neural inference computing elements

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Extended European Search Report for European Application No. 20203634.9, dated Mar. 19, 2021, 08 Pages.
Office Action dated Jun. 14, 2022 for European Application No. 20203634.9, filed Oct. 23, 2020, 7 pages.

Also Published As

Publication number Publication date
US20210125044A1 (en) 2021-04-29
EP3816790A1 (en) 2021-05-05
CN112749368A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CA3070972C (en) Accelerated mathematical engine
US11409838B2 (en) High throughput matrix processor with support for concurrently processing multiple matrices
US10120649B2 (en) Processor and method for outer product accumulate operations
US10747844B2 (en) Systems and methods for converting a matrix input to a vectorized input for a matrix processor
US11520853B2 (en) Mapping convolution to a partition channel convolution engine
US11537865B2 (en) Mapping convolution to a channel convolution engine
CN113282273A (en) Hardware for multi-format floating point operations
US8667043B2 (en) Method and apparatus for multiplying binary operands
US11580192B2 (en) Grouped convolution using point-to-point connected channel convolution engines
US11520854B2 (en) Support for different matrix multiplications by selecting adder tree intermediate results
US11256979B2 (en) Common factor mass multiplication circuitry
CN113626759A (en) Summing high bit widths using a low bit width dot product engine
EP3940605A1 (en) Mapping convolution to connected processing elements using distributed pipelined separable convolution operations
US11443013B2 (en) Pipelined pointwise convolution using per-channel convolution operations
EP3783478B1 (en) Mapping convolution to a matrix processor unit
US11379557B2 (en) Device and method for flexibly summing matrix values
US10061559B2 (en) Apparatus and method for controlling operation
Singh et al. An Overlap-and-Add Based Time Domain Acceleration of CNNs on FPGA-CPU Systems

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

AS Assignment

Owner name: FACEBOOK, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HAO, YUCHEN;NAIR, KRISHNAKUMAR NARAYANAN;ZADEH, EHSAN KHISH ARDESTANI;AND OTHERS;SIGNING DATES FROM 20191216 TO 20191223;REEL/FRAME:051525/0632

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: META PLATFORMS, INC., CALIFORNIA

Free format text: CHANGE OF NAME;ASSIGNOR:FACEBOOK, INC.;REEL/FRAME:058214/0351

Effective date: 20211028

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE