WO2024056984A1 - Multiple-outer-product instruction - Google Patents
- Publication number
- WO2024056984A1 (PCT/GB2023/051858)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30038—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
- G06F9/30101—Special purpose registers
- G06F9/30109—Register structure having multiple operands in a single register
Definitions
- the present technique relates to the field of data processing.
- a data processing apparatus may comprise processing circuitry which is capable of performing outer product operations, in which an outer product of two vectors is calculated.
- the processing circuitry may further be arranged to perform multiple outer product operations on different pairs of vectors in order to multiply together two matrices.
- multiplication operations can be relatively slow, and each outer product operation (let alone each matrix multiplication) typically involves a large number of multiplication operations (e.g. a multiplication for each pair of data elements in the input vectors).
- an apparatus comprising: processing circuitry to perform vector operations; and instruction decoder circuitry to decode instructions from a set of instructions to control the processing circuitry to perform the vector operations specified by the instructions, wherein: the set of instructions comprises a multiple-outer-product instruction specifying a plurality of first source vector operands, at least one second source vector operand and correlation information associated with the at least one second source vector operand, wherein each vector operand comprises a plurality of data elements and, for a given second source vector operand, the correlation information is arranged to indicate, for each data element of the given second source vector operand, a corresponding first source vector operand; the instruction decoder circuitry is responsive to the multiple-outer-product instruction to control the processing circuitry to perform a plurality of computations to implement a plurality of outer product operations, wherein the plurality of outer product operations comprise, for a given first source vector operand, performing an associated outer product operation to calculate an outer product of the given first source vector operand with a subset of data elements of the at least one second source vector operand, the subset being determined based on the correlation information.
- a method comprising: performing vector operations using processing circuitry; decoding instructions from a set of instructions to control the processing circuitry to perform the vector operations specified by the instructions, wherein: the set of instructions comprises a multiple-outer-product instruction specifying a plurality of first source vector operands, at least one second source vector operand and correlation information associated with the at least one second source vector operand, wherein each vector operand comprises a plurality of data elements and, for a given second source vector operand, the correlation information is arranged to indicate, for each data element of the given second source vector operand, a corresponding first source vector operand; the method comprises, in response to the multiple-outer-product instruction, performing a plurality of computations to implement a plurality of outer product operations, wherein the plurality of outer product operations comprise, for a given first source vector operand, performing an associated outer product operation to calculate an outer product of the given first source vector operand with a subset of data elements of the at least one second source vector operand, the subset being determined based on the correlation information.
- a computer program comprising instructions which, when executed on a computer, control the computer to provide: processing program logic to perform vector operations; and instruction decoder program logic to decode target instructions from a set of target instructions to control the processing program logic to perform the vector operations specified by the target instructions, wherein: the set of target instructions comprises a multiple-outer-product instruction specifying a plurality of first source vector operands, at least one second source vector operand and correlation information associated with the at least one second source vector operand, wherein each vector operand comprises a plurality of data elements and, for a given second source vector operand, the correlation information is arranged to indicate, for each data element of the given second source vector operand, a corresponding first source vector operand; the instruction decoder program logic is responsive to the multiple-outer-product instruction to control the processing program logic to perform a plurality of computations to implement a plurality of outer product operations, wherein the plurality of outer product operations comprise, for a given first source vector operand, performing an associated outer product operation to calculate an outer product of the given first source vector operand with a subset of data elements of the at least one second source vector operand, the subset being determined based on the correlation information.
- the computer program described above is, in some examples, stored on a computer-readable storage medium.
- the computer-readable storage medium can be transitory or non-transitory.
- Figure 1 is a block diagram of a data processing apparatus
- Figure 2 shows an example of architectural registers that may be provided within the apparatus, including vector registers for storing vector operands and array registers for storing 2D arrays of data elements, including an example of a physical implementation of the array registers;
- Figures 3A and 3B schematically illustrate how accesses may be performed to a square 2D array within the array storage
- Figure 4A illustrates an outer product operation
- Figure 4B illustrates a matrix multiplication operation
- Figures 5, 6A and 6B illustrate N:M structured sparsity in matrices
- Figure 7 shows an example of multiplying a matrix of activations by a matrix of weights using a multiply-accumulate (MAC) array
- Figures 8A and 8B illustrate how matrices with 2:4 and 4:8 structured sparsity can be compressed
- Figure 9 illustrates a matrix multiplication in which one of the input matrices to be multiplied is a compressed matrix
- Figure 10 is a block diagram of a data processing apparatus, illustrating how the processing circuitry is used to perform outer product operations
- Figures 11A and 11B illustrate how generated outer product results can be used to update an associated storage element within a 2D array of the array storage
- Figure 12 schematically illustrates fields that may be provided within a multiple outer product instruction
- Figures 13 to 15 illustrate examples of multiple outer product instructions and examples of how correlation information may be represented
- Figures 16 and 17 illustrate examples of circuitry for performing a sum of outer products operation
- Figure 18 is a flow diagram illustrating the steps performed upon decoding a multiple outer product instruction.
- Figure 19 illustrates a simulator implementation that may be used.
- in one example arrangement, there is provided an apparatus comprising processing circuitry and instruction decoder circuitry.
- the instruction decoder circuitry (also referred to herein as decoder circuitry or an instruction decoder) is arranged to decode instructions from a set of instructions and control the processing circuitry to perform the vector operations specified by the instructions.
- the instruction decoder circuitry may be responsive to the instructions in the set of instructions to generate control signals, and the control signals may control the processing circuitry to perform vector operations.
- Vector operations are operations performed on vector operands - e.g. operands comprising multiple data elements.
- Vector operations can include any operation that involves at least one vector operand, such as a load or store operation to load/store a vector from/to a storage location (e.g. memory or a cache) or an arithmetic operation (e.g. addition, multiplication) performed on vector operands.
- the processing circuitry is capable of performing vector operations that include at least outer product operations.
- vector operands may (but need not necessarily) be stored in vector registers, where a single vector register may store an entire vector operand, or where a single vector operand may be spread between multiple vector registers (and hence a single vector register may store elements from multiple vector operands).
- the set of instructions which the instruction decoder circuitry is configured to decode for execution by the processing circuitry include at least a “multiple-outer-product” (MOP) instruction.
- the MOP instruction is defined in the instruction set architecture (ISA) and identifies (e.g. in respective fields of the instruction, either directly or indirectly, e.g. by identifying a register storing the corresponding data values) at least:
- the terms “first” and “second” are merely labels, and the first and second source vector operands need not necessarily be the first and second (groups of) operands specified by the instruction respectively. On the contrary, the first and second (groups of) source vector operands could be identified by the MOP instruction in either order.
- Each source vector operand (e.g. the second source vector operand(s) and each of the plurality of first source vector operands) comprises a plurality of data elements and, for a given second source vector operand, the correlation information is arranged to indicate (e.g. identify, directly or indirectly) a corresponding first source vector operand for each data element in the given second source vector operand.
- the correlation information may (optionally) also identify corresponding first source vector operands for data elements in any other second source vector operands identified by the instruction (e.g. if more than one second source vector operand is identified by the MOP instruction).
- the correlation information identifies any data elements in the second source vector operand that are associated with individual first source vector operands.
- the instruction decoder circuitry is responsive to the MOP instruction to control the processing circuitry to perform a plurality of computations to implement a plurality of outer product operations - e.g. in response to a single MOP instruction being decoded by the instruction decoder circuitry, the processing circuitry is configured to perform computations which generate results equivalent to performing two or more outer product operations.
- the plurality of outer product operations to be performed by the processing circuitry in response to the decoded MOP instruction comprise an outer product operation for each of at least a subset of the first source vector operands identified by the MOP instruction. More particularly, the processing circuitry performs an associated outer product operation for a given first source vector operand to calculate an outer product of that first source vector operand with a subset (e.g. fewer than all) of the data elements of the at least one second source vector operand.
- the subset of data elements of the at least one second source vector operand to be used for each outer product operation are identified based on the correlation information - in particular, for each data element of the at least one second source vector operand, a corresponding first source vector operand is selected based on the correlation information, to be used when performing the associated outer product operation. (Note that an outer product operation need not, necessarily, be performed for every one of the first source vector operands, or for every subset of the at least one second source vector operand.)
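- As a purely illustrative sketch of this reference behaviour (not the architected encoding), the following Python models a MOP instruction with one second source vector operand, per-element indices as the correlation information, and a destination that is accumulated into; the register names, the number of operands and the accumulate behaviour are all assumptions made for illustration.

```python
# Reference-semantics sketch of a multiple-outer-product (MOP) instruction
# (illustrative only; operand layout and accumulate behaviour are assumptions).

def mop(first_ops, second_op, indices, dest):
    """first_ops: list of first source vectors (each a list of N values)
    second_op: one second source vector (list of M values)
    indices:   correlation information - indices[j] names the first source
               vector operand corresponding to element j of second_op
    dest:      N x M destination array, updated in place (accumulated into)
    """
    n = len(first_ops[0])
    for j, b in enumerate(second_op):      # each element of the second source operand
        a = first_ops[indices[j]]          # select its corresponding first source operand
        for i in range(n):                 # one column of the associated outer product
            dest[i][j] += a[i] * b

# Example: two first source vectors, one second source vector whose elements
# alternate between them, so a single instruction implements two outer products.
Z0 = [1, 2, 3, 4]
Z1 = [5, 6, 7, 8]
Z4 = [10, 20, 30, 40]          # compressed second source operand
idx = [0, 1, 0, 1]             # element j of Z4 pairs with first_ops[idx[j]]
ZA = [[0] * 4 for _ in range(4)]
mop([Z0, Z1], Z4, idx, ZA)
```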
- By defining an instruction (the MOP instruction) that enables multiple outer product operations to be performed in response to a single instance of the instruction, the throughput of instructions executed by the processing circuitry can be significantly improved. Accordingly, the MOP instruction of the present technique allows the performance of the processing circuitry to be improved. Indeed, in one example implementation, these multiple outer product operations can be performed in parallel, further improving throughput.
- providing correlation information to associate subsets of data elements in the at least one second source vector operand with individual first source vector operands allows the at least one second source vector operand to be defined such that multiple vectors are effectively compressed into one source vector operand.
- using the correlation information in this way provides freedom over how the multiple vectors are compressed into a single second source vector operand - for example, the correlation information allows the multiple vectors to be compressed so that data elements from any one of the multiple vectors can occupy any data element positions in the at least one second source vector operand - including consecutive and/or non-consecutive data elements.
- Providing an instruction which supports execution of multiple outer product operations based on a source vector operand which may comprise data elements from multiple vectors that have been compressed into a smaller number of source vector operands can be advantageous in a number of scenarios. In one particular, non-limiting example, this may allow vectors which comprise one or more “zero” elements to be represented in a more compact form.
- an outer product operation relies on multiplication - e.g. multiplying each data element in one vector by each data element in another vector - and since multiplication of any number by zero is zero, it is possible to effectively compress the input vectors for an outer product operation by removing some or all zero elements.
- the approach of the present technique - in which correlation information is used to identify which first source vector operands correspond with which elements in one or more second source vector operands - can allow input vector elements equalling zero to be removed from one or more source vector operands without changing the result of the outer product operation.
- the present technique can allow such vectors to be represented in such a way that they can take up less space in any data storage structures storing the compressed operand (e.g. by removing data elements equal to zero from the vectors, so that the total number of data elements to be recorded - and hence the space in the storage structures needed to store those vectors - is reduced).
- Reducing the amount of space taken up in storage provides numerous advantages, including reduced energy consumption if the data is stored in volatile storage and reduced latency and power consumption in loading, manipulating and storing the vectors.
- the total number of multiplications to be performed is reduced leading to reduced overall latency associated with the operation and to increased data throughput.
- it is not essential for the at least one second source vector operand to represent a compressed form of multiple vectors, or for those multiple vectors to comprise zeroes in one or more data elements.
- since the results generated by executing the MOP instruction are equivalent to calculating multiple outer product operations, the order in which individual data elements of each vector operand are used/consumed by the processing circuitry when executing the MOP instruction is a matter of implementation. For example, the outer product operations do not necessarily need to be performed one at a time (e.g. by considering each first source vector operand in turn).
- each data element in the at least one second source vector operand could be considered in turn (e.g. for a given data element in the at least one second source vector operand, selecting an appropriate first source vector operand).
- all of the data elements in the at least one second source vector can be considered in parallel, with a corresponding first source vector operand being selected in parallel for each element of the second source operand.
- the apparatus comprises array storage circuitry comprising storage elements to store data elements, the array storage circuitry being arranged to store at least one two dimensional (2D) array of data elements accessible to the processing circuitry when performing the vector operations.
- the multiple-outer-product instruction may specify a given two dimensional array of data elements within the array storage forming a destination operand
- the processing circuitry may be configured to perform the associated outer product operation for a given first source vector operand by multiplying each data element of the given first source vector operand by each data element in the subset of data elements of the at least one second source vector operand in order to generate a plurality of outer product result elements, and using each outer product result element to update a value held in an associated storage element within the given two dimensional array of storage elements.
- Array storage circuitry - which could, for example, comprise a set of array registers - can provide a useful mechanism for performing certain types of operations, for example outer product operations.
- the matrix of data elements produced as a result of performing an outer product operation can be stored within associated data elements of a 2D array represented in the array storage circuitry.
- the inventors of the present technique realised that in some example use cases, when performing an outer product operation using two source vectors, it may be the case that some of the elements in one or both of the source vectors are zero, as noted above. This can result in inefficient use of the storage elements of the 2D array, since a significant number of those storage elements may not then be used, or will only be used to store a zero value, and can also result in inefficient use of the resources of the hardware components forming the processing circuitry (which may be capable of performing computations to produce results for each of the storage elements).
- a single instruction (namely the MOP instruction discussed above) can be defined that, through the use of correlation information associating subsets of data elements in the at least one second source vector operand with corresponding first source vector operands, enables multiple outer product operations to be performed, with the results of each outer product operation being stored within associated storage elements of the two-dimensional (2D) array. This can significantly improve throughput (as noted above), whilst also making more efficient utilisation of the available storage elements within the array storage.
- there need not necessarily be a 1:1 correlation between each outer product result and an associated storage element in the 2D array.
- multiple outer product results might be used to determine the value held in a given storage element, with extra operations such as accumulation (addition) operations being performed to combine the multiple outer product results.
- the processing circuitry comprises multiplication circuitry to generate each outer product result when performing the plurality of outer product operations, and multiplexer circuitry associated with the multiplication circuitry, the multiplexer circuitry to select, under control of the correlation information, a selected data element of the plurality of first source vector operands and a selected data element of the at least one second source vector operand to be multiplied in order to generate an associated outer product result element.
- the multiplication circuitry could include a multiplier circuit - e.g. this could be a simple multiplier for multiplying together two values or a multiply-accumulate (MAC) circuit, which multiplies together two values and adds the result to an accumulate value - associated with each storage element in the 2D array of storage elements.
- the same MAC circuit may both multiply together the selected data elements to generate the associated outer product result and add the associated outer product result to the value currently stored in the associated storage element of the array storage circuitry (which could be zero).
- there may not necessarily be a 1:1 correlation between each multiplier circuit or MAC circuit and each storage element in the 2D array - for example, there could be one multiplier/MAC per storage element (in which case the outer product results for all storage elements may be calculated in parallel), or there may be fewer than one multiplier/MAC per storage element (in which case each multiplier/MAC is used to compute outer product results for multiple storage elements in series).
- the correlation information comprises at least one set of indices, and the at least one set of indices comprises an index associating each data element of the given second source vector operand with the corresponding first source vector operand.
- an index may be provided for each data element of the given second source vector operand.
- each index may be associated with a plurality of data elements in the given second source vector operand.
- the correlation information is provided by at least one correlation source operand specified by the multiple-outer-product instruction, and a given correlation source operand comprises, for each data element of the given second source vector operand, a corresponding element comprising the index associating that data element of the given second source vector operand with the corresponding first source vector operand.
- providing an index for each data element of the at least one second source vector operand provides a simple mechanism for determining which index corresponds to which data element of the second source vector operand, which can in turn simplify processing and, therefore, provide improvements in performance.
- the set of indices may be stored as a multi-bit scalar value (e.g. in a scalar register or a predicate register), where each bit is associated with a different vector element in one of one or more vector registers.
- the set of indices may be stored as a vector (e.g. in a vector register), with each data element in the vector holding one or more indices for corresponding data elements in the at least one second source vector operand. It will be appreciated that these are just some examples of how the set of indices may be stored, and other examples are also possible.
- the correlation information comprises a set of indices for each second source vector operand.
- the MOP instruction may also identify a separate set of indices for each second source vector operand.
- the number of sets of indices may, therefore, be at least as large as the number of second source vector operands.
- the at least one set of indices comprises a set of indices providing the correlation information for a plurality of second source vector operands.
- each set of indices identified by the MOP instruction may provide correlation information for more than one second source operand.
- the number of sets of indices may be less than the number of second source vectors.
- the correlation information is provided by at least one correlation source operand specified by the multiple-outer-product instruction, and each element of a given correlation source operand comprises a plurality of indices, the plurality of indices comprising an index for a corresponding data element of each of the plurality of second source vector operands.
- each element of the given correlation source operand provides multiple indices, including at least one index for each second source vector.
- for example, the top bits in a given element could provide an index for a data element of one second source vector operand, while the bottom bits provide an index for a data element of another second source vector operand.
- This approach takes advantage of the fact that the indices might each be formed of fewer bits than there is space for in each element of a correlation source operand.
- multiple indices can be stored in each element of a correlation source operand, so that less storage space is needed to store the correlation information.
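- A minimal sketch of this packing is given below, assuming (purely for illustration) 4-bit index fields packed into an 8-bit correlation element; the field widths are assumptions, not an architected format.

```python
# Sketch of packing/unpacking two 4-bit indices into one 8-bit correlation
# element (field widths are illustrative assumptions).

def pack_indices(idx_lo, idx_hi, bits=4):
    """Pack the index for one second source operand into the bottom bits and
    the index for another second source operand into the top bits."""
    assert 0 <= idx_lo < (1 << bits) and 0 <= idx_hi < (1 << bits)
    return (idx_hi << bits) | idx_lo

def unpack_indices(element, bits=4):
    mask = (1 << bits) - 1
    return element & mask, (element >> bits) & mask

e = pack_indices(2, 3)          # 0b0011_0010
assert unpack_indices(e) == (2, 3)
```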
- each outer product operation performed by the processing circuitry is based on a different subset of data elements of the at least one second source vector operand.
- each subset of data elements of the at least one second source vector operand may comprise different data elements to each other subset (e.g. each outer product operation may be based on different data elements of the second source vector operand).
- the subsets of data elements of the at least one second source vector operand used for any two of the outer product operations differ by at least one data element.
- the term “data element” here generally refers to a particular data element position in a vector operand, rather than the specific numerical value held in that position.
- the numerical values of the data elements in different subsets might not necessarily differ, since it is possible that multiple data elements in a given source vector operand hold data with the same numerical value.
- the subsets may differ by more than one data element - as a particular example, each data element in the at least one second source vector operand may be part of at most one subset of data elements.
- multiple vectors can be represented in each second source vector operand, with each subset representing a different vector, and the multiple outer product operations can be performed based on the multiple vectors.
- the multiple-outer-product instruction comprises a sum-of-outer- products instruction, such that multiple outer product results have the same associated storage element within the given two dimensional array of storage elements, and the processing circuitry is configured to combine those multiple outer product results in order to update the value held in the associated storage element.
- the MOP instruction is a sum-of-outer-products instruction (which could also be referred to as a “sum-of-multiple-outer-products” instruction, SMOP), execution of which involves accumulating multiple outer product results into a single storage element of the 2D array.
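- The following sketch illustrates the combining idea only, assuming (as a simplifying illustration, not the architected grouping) that pairs of adjacent element positions contribute products to the same destination element.

```python
# Sketch of a sum-of-outer-products variant: pairs of adjacent element
# positions are assumed to contribute to the same destination element,
# so two outer product results are combined per storage element.

def smop_pairwise(a, b, dest):
    """a, b: source vectors of equal, even length.
    dest: square array of size len(a)//2, updated in place."""
    half = len(a) // 2
    for i in range(half):
        for j in range(half):
            dest[i][j] += a[2 * i] * b[2 * j] + a[2 * i + 1] * b[2 * j + 1]

dest = [[0, 0], [0, 0]]
smop_pairwise([1, 2, 3, 4], [5, 6, 7, 8], dest)
assert dest == [[17, 23], [39, 53]]
```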
- the apparatus comprises a set of vector registers accessible to the processing circuitry, wherein each vector register is arranged to store a vector comprising a plurality of data elements, and the plurality of first source vector operands and the at least one second source vector operand comprise vectors contained within vector registers of the set of vector registers.
- a vector is a one-dimensional (1D) array comprising multiple (more than one) data elements.
- one might represent a vector of data elements as a single column or a single row of data elements; in a data processing system such as the apparatus in this example, the data elements of a vector are stored within a single vector register.
- Vector operands contrast with scalar operands, since each scalar operand comprises a single data item (e.g. each data element in a vector may be a scalar operand).
- the apparatus comprises a set of predicate registers accessible to the processing circuitry, wherein each predicate register is arranged to store predicate information comprising a plurality of elements, each element providing a predicate value, and the correlation information is stored within at least one predicate register of the set of predicate registers.
- a data processing system - such as the apparatus of this example - may be provided with a set of predicate registers to store predicates (e.g. predicate information).
- Each predicate may, for example, be a mask of true/false (e.g. 1/0) values to be used in vector processing - for example, a predicate may indicate which data elements in a vector should and should not be operated on.
- This example makes use of the predicate registers for another purpose: storing the correlation information.
- This approach is advantageous because it makes use of circuitry (the predicate registers) which might otherwise have been used to perform the outer product operations if each subset of data elements had been provided in a separate vector register (e.g. to predicate out the zero values).
- this approach provides the correlation information without taking up additional storage space (e.g. taking up an additional architectural register, or additional space in memory, a cache or some other storage structure).
- in some examples, the data elements of each second source vector operand represent data values from a plurality of rows and a plurality of columns of a source matrix.
- each subset of the data elements in the at least one second source vector operand may represent a different row/column of a source matrix, and hence the present technique provides a more compact representation of the matrix.
- each element of the given correlation source operand is arranged to enable reconstruction of the source matrix from the at least one second source vector operand.
- each element of the given correlation source operand may indicate which row/column of the original matrix one or more corresponding elements of the second source vector operand(s) came from.
- each element could be a bitmap of 1s and 0s, indicating which rows/columns of the original matrix held non-zero values (e.g. if the original matrix had two rows and the first bitmap is (1, 0), this might indicate that a first data element in a given second source vector operand represents a first data element of a first row of the matrix).
- each data element in each second source vector operand is associated with a corresponding first dimension in the source matrix, wherein the corresponding first dimension comprises a corresponding row or a corresponding column in the source matrix, and each data element in each second source vector operand provides a data value selected from among the data values in the corresponding first dimension in the source matrix.
- the first data element in a given second source vector operand may be the first data element in a given row/column of the source matrix, and the correlation information may indicate which row/column the data element came from.
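- A hedged sketch of such a reconstruction is shown below; the particular layout assumed (one surviving value per compressed operand per column, each with a row index) is an illustration only and not the only possible arrangement.

```python
# Sketch of reconstructing a source matrix from compressed second source
# vector operands plus per-element row indices (the layout is an assumption).

def reconstruct(compressed, row_idx, n_rows):
    """compressed: list of compressed vectors, each of length n_cols
    row_idx: same shape; row_idx[c][p] is the source row of compressed[c][p]
    returns the n_rows x n_cols source matrix with zeros restored."""
    n_cols = len(compressed[0])
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for c, vec in enumerate(compressed):
        for p, value in enumerate(vec):
            matrix[row_idx[c][p]][p] = value
    return matrix

# Two compressed vectors describing a 4-row matrix with at most two
# non-zero values per column (2:4 structured sparsity down each column).
Z4  = [3, 7, 1, 9]
Z5  = [5, 2, 8, 4]
idx = [[0, 2, 1, 0],            # rows that the elements of Z4 came from
       [3, 3, 2, 1]]            # rows that the elements of Z5 came from
M = reconstruct([Z4, Z5], idx, n_rows=4)
```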
- the source matrix comprises a matrix with N:M structured sparsity wherein each defined group of M data values in the source matrix comprises at most N non-zero data values.
- each group of M data elements in the source matrix is a defined/specific group - not just any group of M elements in the source matrix can be used.
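- A small sketch of checking N:M structured sparsity is given below; it assumes the defined groups are runs of M consecutive values along each row, which is an illustrative assumption (the same check applies to column-wise or other fixed groupings).

```python
# Sketch of checking N:M structured sparsity, assuming the defined groups
# are aligned runs of M consecutive values along each row.

def has_nm_sparsity(matrix, n, m):
    for row in matrix:
        assert len(row) % m == 0, "row length must be a multiple of M"
        for start in range(0, len(row), m):
            group = row[start:start + m]
            if sum(1 for v in group if v != 0) > n:
                return False
    return True

# 2:4 sparsity: every aligned group of four values holds at most two non-zeros.
assert has_nm_sparsity([[1, 0, 0, 2, 0, 3, 0, 0]], 2, 4) is True
assert has_nm_sparsity([[1, 2, 3, 0, 0, 0, 0, 0]], 2, 4) is False
```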
- the source matrix comprises a matrix of weights or a matrix of activations for use in execution of an artificial neural network (ANN).
- the nodes of an ANN are typically represented as matrices of weights, and these matrices can be large.
- data input to the nodes of the ANN typically takes the form of an activation matrix. Since these matrices can be large, and the number of nodes in an ANN is typically very large, a significant amount of data is typically needed to represent an ANN. Therefore, it can be useful to prune a neural network by clearing some of the data elements (e.g. setting them to zero) in the weight matrices. In particular, this may be done in a structured manner, in accordance with a defined N:M sparsity.
- the same techniques may be implemented in a computer program (e.g. an architecture simulator or model) which may be provided for controlling a host data processing apparatus to provide an instruction execution environment for execution of instructions from target code.
- the computer program may include instruction decoding program logic for decoding instructions of the target code so as to control a host data processing apparatus to perform data processing including performing vector operations.
- the instruction decoding program logic emulates the functionality of the instruction decoder (instruction decoder circuitry) of a hardware apparatus as discussed above.
- the program may also include processing program logic to, when executed on the host data processing apparatus, perform the data processing (and hence emulate the functionality of the processing circuitry described above).
- the program may include register maintenance program logic which maintains a data structure (within the memory or architectural registers of the host apparatus) which represents (emulates) the architectural registers of the instruction set architecture being simulated by the program - for example, these could include any or all of vector registers, scalar registers, array registers and predicate registers.
- the instruction decoding program logic includes support for the MOP instruction which has the same functionality as described above for the hardware example.
- a simulator computer program may present, to target code executing on the simulator computer program, a similar instruction execution environment to that which would be provided by an actual hardware apparatus capable of directly executing the target instruction set, even though there may not be any actual hardware providing these features in the host computer which is executing the simulator program.
- the simulator computer program provides all of the advantages discussed above in relation to the apparatus.
- the simulation can be useful for executing code written for one instruction set architecture on a host platform which does not actually support that architecture.
- the simulator can be useful during development of software for a new version of an instruction set architecture while software development is being performed in parallel with development of hardware devices supporting the new architecture. This can allow software to be developed and tested on the simulator so that software development can start before the hardware devices supporting the new architecture are available.
- the simulator program may be stored on a storage medium, and that storage medium may be transitory or non-transitory.
- Figure 1 schematically illustrates a data processing system 10 comprising a processor 20 coupled to a memory 30 storing data values 32 and program instructions 34.
- the processor 20 includes an instruction fetch unit 40 for fetching program instructions 34 from the memory 30 and supplying the fetched program instructions to instruction decoder circuitry 50.
- the decoder circuitry 50 decodes the fetched program instructions and generates control signals to control processing circuitry 60 to perform processing operations upon data values held within storage elements of register storage 65 as specified by the decoded vector instructions.
- the register storage 65 may be formed of multiple different blocks.
- a scalar register file 70 may be provided that comprises a plurality of scalar registers that can be specified by instructions
- a vector register file 80 may be provided that comprises a plurality of vector registers that can be specified by instructions.
- the processor 20 can access an array storage 90.
- the array storage 90 is provided as part of the processor 20, but this is not a requirement.
- the array storage can be implemented as any one or more of the following: architecturally-addressable registers; non-architecturally-addressable registers; a scratchpad memory; and a cache.
- the processing circuitry 60 may in one example implementation comprise both vector processing circuitry and scalar processing circuitry.
- vector processing may involve applying a single vector processing instruction to data elements of a data vector having a plurality of data elements at respective positions in the data vector.
- the processing circuitry may also perform vector processing to perform operations on a plurality of vectors within a two dimensional array of data elements (which may also be referred to as a sub-array) stored within the array storage 90.
- Scalar processing operates on, effectively, single data elements rather than on data vectors.
- Vector processing can be useful in instances where processing operations are carried out on many different instances of the data to be processed.
- a single instruction can be applied to multiple data elements (of a data vector) at the same time. This can improve the efficiency and throughput of data processing compared to scalar processing.
- the processor 20 may be arranged to process two dimensional arrays of data elements stored in the array storage 90.
- the two-dimensional arrays may, in at least some examples, be accessed as one-dimensional vectors of data elements in multiple directions.
- the array storage 90 may be arranged to store one or more two dimensional arrays of data elements, and each two dimensional array of data elements may form a square array portion of a larger or even higher-dimensioned array of data elements in memory.
- the register storage 65 also includes a predicate register file 75. This stores predicate information (e.g. masks) for use in data processing operations (e.g. to mask out certain data elements of a vector so that they are excluded from a particular processing operation).
- Figure 2 shows an example of the architectural registers 65 of the processor 20 that may be provided in one example implementation.
- the architectural registers (as defined in the instruction set architecture (ISA)) may include a set of scalar registers (not shown) and a set of predicate registers 100 for storing predicate information.
- the predicate registers may also store correlation information for executing multiple-outer-product (MOP) instructions, as will be discussed below. A certain number of predicate registers 100 may be provided, for example 16 registers P0-P15 in this example.
- the predicate registers may have a fixed size, although depending on the datatype of the elements stored in the predicate registers, some bits in each element may not necessarily be used.
- the architectural registers available for selection by program instructions in the ISA supported by the decoder 50 may include a certain number of vector registers 105 (labelled Z0-Z31 in this example). Of course, it is not essential to provide the number of predicate/vector registers shown in Figure 2, and other examples may provide a different number of registers specifiable by program instructions.
- Each vector register may store a vector operand comprising a variable number of data elements, where each data element may represent an independent data value.
- the processing circuitry may perform vector processing on vector operands stored in the registers to generate results.
- the vector processing may include lane-by-lane operations where a corresponding operation is performed on each lane of elements in one or more operand vectors to generate corresponding results for elements of a result vector.
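- A minimal sketch of such a lane-by-lane operation is shown below; the element-wise addition used is just one illustrative example of a per-lane operation.

```python
# Minimal sketch of a lane-by-lane vector operation: the same operation is
# applied independently at each element position of the operand vectors.

def vadd(za, zb):
    return [a + b for a, b in zip(za, zb)]

assert vadd([1, 2, 3, 4], [10, 20, 30, 40]) == [11, 22, 33, 44]
```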
- each vector register may have a certain vector length VL where the vector length refers to the number of bits in a given vector register.
- the vector length VL used in vector processing mode may be fixed for a given hardware implementation or could be variable.
- the ISA supported by the processor 20 may support variable vector lengths so that different processor implementations may choose to implement different sized vector registers but the ISA may be vector length agnostic so that the instructions are designed so that code can function correctly regardless of the particular vector length implemented on a given CPU executing that program.
- the vector registers Z0-Z31 may also serve as operand registers for storing the vector operands which provide the inputs to processing and accumulate operations performed by the processing circuitry 60 on two dimensional arrays of data elements stored within the array storage 90.
- the architectural registers also include a certain number NA of array registers 110 forming the earlier-mentioned array storage 90, ZA0-ZA(NA-1).
- Each array register can be seen as a set of register storage for storing a single 2D array of data elements, e.g. the result of a processing and accumulate operation. However, processing and accumulate operations may not be the only operations which can use the array registers.
- the array registers could also be used to store square arrays while performing transposition of the row/column direction of an array structure in memory.
- when a program instruction references one of the array registers 110, it is referenced as a single entity using an array identifier ZAi, but some types of instructions (e.g. data transfer instructions) may also select a sub-portion of the array by defining an index value which selects a part of the array (e.g. one horizontal/vertical group of elements).
- the physical implementation of the register storage corresponding to the array registers may comprise a certain number NR of array vector registers, ZAR0-ZAR(NR-1), as also shown in Figure 2.
- the array vector registers ZAR forming the array register storage 110 may be a distinct set of registers from the vector registers Z0-Z31 used for SIMD processing and vector inputs to array processing.
- Each of the array vector registers ZAR may have the vector length VL, so each array vector register ZAR may store a 1D vector of length VL, which may be partitioned logically into a variable number of data elements.
- If VL is 512 bits, then this could be a set of 64 8-bit elements, 32 16-bit elements, 16 32-bit elements, 8 64-bit elements or 4 128-bit elements, for example. It will be appreciated that not all of these options would need to be supported in a given implementation. By supporting variable element size this provides flexibility to handle calculations involving data structures of different precision.
- a group of array vector registers ZAR0-ZAR(NR-1) can be logically considered as a single entity assigned a given one of the array register identifiers ZA0-ZA(NA-1), so that the 2D array is formed with the elements extending within a single vector register corresponding to one dimension of the array and the elements in the other dimension of the array striped across multiple vector registers.
- the processing circuitry 60 is arranged, under control of instructions decoded by decoder circuitry 50, to access the scalar registers 70, the vector registers 80 and/or the array storage 90. Further details of this latter arrangement will now be described with reference to Figure 3A, which merely provides one illustrative example of how the array storage may be accessed, in particular considering access to a square 2D array within the array storage.
- a square 2D array within the array storage 90 is arranged as an array 205 of n x n storage elements/locations 200, where n is an integer greater than 1.
- n is 16 which implies that the granularity of access to the storage locations 200 is 1/16th of the total storage of the 2D array, in either horizontal or vertical array directions.
- the array of n x n locations is accessible as n linear (one-dimensional) vectors in a first direction (for example, a horizontal direction as drawn) and n linear vectors in a second array direction (for example, a vertical direction as drawn).
- the n x n storage locations are accessible, from the point of view of the processing circuitry 60, as 2n linear vectors, each of n data elements.
- the array of storage locations 200 is accessible by access circuitry 210, 220, column selection circuitry 230 and row selection circuitry 240, under the control of control circuitry 250 in communication with at least the processing circuitry 60 and optionally with the decoder circuitry 50.
- the n linear vectors in the first direction (a horizontal or “H” direction as drawn), in the case of an example square 2D array designated as “ZA1” (noting that as discussed below, there could be more than one such 2D array provided within the array storage 90, for example ZA0, ZA1, ZA2 and so on) are each of 16 data elements 0...F (in hexadecimal notation) and may be referenced in this example as ZA1H0...ZA1H15.
- the same underlying data, stored in the 256 entries (16 x 16 entries) of the array storage 90 ZA1 of Figure 3B may instead be referenced in the second direction (a vertical or “V” direction as drawn) as ZA1V0...ZA1V15.
- a data element 260 is referenced as item F of ZA1H0 but item 0 of ZA1V15.
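- The two access directions can be sketched as below; the toy contents and the simple row-major layout are illustrative assumptions, but the sketch shows that the same underlying storage element appears both as item F of ZA1H0 and as item 0 of ZA1V15.

```python
# Sketch of the two access directions onto one square n x n array: the same
# underlying storage can be read as n horizontal or n vertical vectors.

n = 16
za1 = [[r * n + c for c in range(n)] for r in range(n)]   # toy contents

def za1_h(r):               # ZA1Hr: horizontal vector r
    return za1[r]

def za1_v(c):               # ZA1Vc: vertical vector c
    return [za1[r][c] for r in range(n)]

# Item F of ZA1H0 is the same storage element as item 0 of ZA1V15.
assert za1_h(0)[0xF] == za1_v(15)[0]
```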
- the use of the labels H and V does not imply any spatial or physical layout requirement relating to the storage of the data elements making up the array storage 90, nor does it have any relevance to whether the 2D arrays within the array storage store row or column data in any example application.
- Figure 4A illustrates an outer product operation.
- the outer product operation takes, as inputs, two vectors A and B, which may be stored in the vector register file as discussed above.
- the result of the outer product operation is a matrix (e.g. a 2D array) A⊗B.
- each data element in the output matrix is determined by multiplying together corresponding data elements in each input vector - for example, the top-left element in the result matrix is calculated by multiplying together element a0 of vector A and element b0 of vector B.
- each data element in vector A is multiplied by each data element in vector B; hence, the result of calculating an outer product operation of a vector of n elements and a vector of m elements is an n x m matrix.
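- A minimal sketch of this operation (vector names and values are illustrative) is:

```python
# Minimal sketch of the outer product of Figure 4A: every element of A is
# multiplied by every element of B, giving an n x m result matrix.

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

# A 3-element vector and a 2-element vector give a 3 x 2 matrix.
assert outer([1, 2, 3], [10, 20]) == [[10, 20], [20, 40], [30, 60]]
```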
- Figure 4B illustrates a matrix multiplication operation.
- Figure 4B shows an operation involving multiplying together two matrices C and D (which could, for example, be stored in vector registers (e.g. one row or one column being held in each register) or in array storage circuitry) to generate a matrix CD.
- the result of multiplying together two n x n matrices is an n x n matrix (more generally, an n x k matrix multiplied by a k x m matrix will result in an n x m matrix).
- processing circuitry may first calculate the outer product of the left-most column, i, of matrix C and the top row, w, of matrix D to generate 16 outer product results and populate a 4x4 array in the array storage with the outer product results.
- the processing circuitry may then calculate an outer product of the next column, j, of matrix C with the next row, x, of D, to generate another 16 outer product results which are added to the outer product results already stored in the array. This process can then be repeated for the last two pairs of vectors (k⊗y and l⊗z), to generate the final result CD.
- a matrix multiplication can be carried out by performing multiple outer product operations and accumulating the results - for example, this could be by performing one or more multiple-outer-product instructions.
- the order in which the pairs of vectors are multiplied together is not limited to the order described above - the outer products can be calculated in any order. In addition, it is not necessary to perform the outer product operations one after the other.
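- The decomposition into accumulated outer products can be sketched as follows (matrix names follow Figure 4B; the small values are illustrative):

```python
# Sketch of computing C x D as a sum of outer products: the outer product of
# column t of C with row t of D is accumulated into the result for each t.

def matmul_by_outer_products(c, d):
    n, k, m = len(c), len(c[0]), len(d[0])
    result = [[0] * m for _ in range(n)]
    for t in range(k):
        col = [c[i][t] for i in range(n)]      # column t of C
        row = d[t]                             # row t of D
        for i in range(n):
            for j in range(m):
                result[i][j] += col[i] * row[j]
    return result

C = [[1, 2], [3, 4]]
D = [[5, 6], [7, 8]]
assert matmul_by_outer_products(C, D) == [[19, 22], [43, 50]]
```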
- matrix multiplication involves performing a significant number of operations - for example, multiple outer product operations may be performed, each of which involves multiple multiplication operations.
- matrix multiplication can be a particularly time- and energy-consuming process. This is especially true in situations such as execution of an artificial neural network (ANN), which may comprise performing a significant number of matrix multiplications.
- Figure 5 shows a matrix with N:M structured sparsity.
- a shaded element represents a non-zero element, while a blank (unshaded) element represents a zero element.
- structured sparsity can be introduced into a matrix (e.g. by clearing / setting to zero some of its elements) in order to reduce the number of data elements in the matrix that will be involved in at least some matrix operations performed on that matrix.
- the inventors realised (as will be discussed in more detail below) that introducing structured sparsity into a matrix can allow the matrix to be compressed into a smaller number of vector operands, which reduces the amount of space taken up by the matrix.
- Figures 6A and 6B show further examples of sparse matrices for information - as in Figure 5, a shaded element represents a non-zero element, while a blank (unshaded) element represents a zero element.
- in a matrix with 2:4 structured sparsity, each defined group of four elements (e.g. the first four elements in a column or the last four elements in a column) has at most two non-zero elements. This can be seen by comparing, for example, the left-most column of each matrix - in the matrix of Figure 6B, the first four data elements in the left-most column are all non-zero; this is not permitted in a matrix with 2:4 structured sparsity, where (as shown in the left-most column of the matrix in Figure 6A) at most 2 of these four elements can be non-zero.
- FIG. 7 illustrates how an array of multiply-accumulate (MAC) units can be used to calculate the result of multiplying an 8 x 4 matrix of activations by a 4 x 8 matrix (with 2:4 sparsity) of weights (e.g. the weights at a given node of an ANN).
- Predicate information may be used to mask out the zero values, for example.
- Figure 8A shows how a matrix that originally takes up four vector registers (Z8, Z9, Z10, Z11) can be compressed into two vector registers (Z4, Z5), each of which stores data elements from one or more rows of the source matrix. This frees up two vector registers in the vector register file (and/or an equivalent amount of space in memory / another data store).
- a set of indices Id is held for each of the compressed source vector operands, which can be used to reconstruct the source matrix.
- the indices may indicate which row of the source matrix each element in Z4 or Z5 comes from.
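- A hedged sketch of the compression step is shown below, in the spirit of Figure 8A: the non-zero values in each column are packed into the compressed vectors and each packed element's source row is recorded as an index. The register names Z4/Z5 and Z8-Z11 follow the figure, but the exact packing order and per-column layout are assumptions made for illustration.

```python
# Sketch of compressing a 2:4-sparse matrix (at most two of the four row
# values in each column are non-zero) into two compressed vectors plus
# indices; the packing order is an illustrative assumption.

def compress_2of4_columns(rows):
    """rows: 4 vectors (e.g. the contents of Z8-Z11), all the same length.
    Returns (compressed, indices): two packed vectors and, for each packed
    element, the row of the source matrix it came from."""
    n_cols = len(rows[0])
    compressed = [[0] * n_cols for _ in range(2)]
    indices = [[0] * n_cols for _ in range(2)]
    for col in range(n_cols):
        nonzero = [(r, rows[r][col]) for r in range(4) if rows[r][col] != 0]
        assert len(nonzero) <= 2, "column violates 2:4 structured sparsity"
        for slot, (r, value) in enumerate(nonzero):
            compressed[slot][col] = value
            indices[slot][col] = r
    return compressed, indices

Z8, Z9, Z10, Z11 = [0, 7, 0, 4], [3, 0, 0, 0], [0, 0, 5, 1], [2, 0, 6, 0]
(Z4, Z5), idx = compress_2of4_columns([Z8, Z9, Z10, Z11])
```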
- FIG. 8B Another example of how a sparse matrix can be compressed into a smaller number of vector operands is shown in Figure 8B, where an 8 x 8 sparse matrix (with 4:8 structured sparsity) is compressed into four vector operands.
- each column of the matrix of activations e.g. the vectors ZO, Z1 , Z2, Z3
- data elements in Z4 and Z5 that have the same shading.
- the indices can be used to identify which vector in the first group of vector operands (Z0:Z3) each data element in the second group of vector operands (Z4:Z5) should be multiplied by.
- the multiplication circuitry 255 shown in Figure 9 comprises a number of multiplexers 270, each of which provides an input into a corresponding multiplier circuit (not shown).
- an 8 x 8 array of multiplexers 270 is provided, including a multiplexer (and corresponding multiplier) for each multiplication to be performed.
- Each multiplexer takes, as inputs, a data element from a corresponding position in each of the activation vectors Z0 to Z3, and selects one of the data elements for multiplication with a data element in a corresponding position in Z4 or Z5 based on the correlation information. It will be appreciated, however, that in other examples there may be fewer multiplexers and multipliers, with each multiplexer/multiplier pair being used to perform several of the multiplication operations to be performed.
- the apparatus of the present technique is configured to support execution of a multiple-outer-product (MOP) instruction that identifies, as inputs, a plurality of first vector operands (e.g. vector registers Z0:Z3 in this case), at least one second source vector operand (e.g. Z4 and/or Z5) and correlation information (e.g. one or both of the sets of indices).
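- The following sketch (the function name and calling convention are assumptions for illustration, not the architectural definition) captures the effect described above: the correlation information acts like a per-element multiplexer select, choosing which first source vector each element of a compressed second source vector is multiplied by before the result is accumulated into the 2D array:

```python
import numpy as np

def mop(z_first, z_second, indices, acc):
    """Model of one multiple-outer-product instruction (illustrative only).

    z_first  : list of first source vectors, e.g. [Z0, Z1, Z2, Z3], each of length m
    z_second : one compressed second source vector, e.g. Z4, of length n
    indices  : correlation information; indices[j] selects which first source
               vector element j of z_second is multiplied by
    acc      : m x n accumulator (the 2D array of storage elements)
    """
    m, n = len(z_first[0]), len(z_second)
    for j in range(n):
        selected = z_first[indices[j]]      # multiplexer-style selection
        for r in range(m):
            acc[r, j] += selected[r] * z_second[j]
    return acc

# Issuing this once for Z4 with its indices and once for Z5 with its indices
# (as with the two FMOPA instructions discussed later) should accumulate the
# same result as multiplying by the uncompressed weights matrix.
```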
- whilst in the example above the weights matrix is a compressed sparse matrix, it is equally possible for the activations matrix to be a compressed sparse matrix instead of, or in addition to, the weights matrix. More generally (since multiplying activations and weights matrices is just one example use case for the present technique), it does not matter whether the first (group of) vector operand(s) or the second (group of) vector operand(s) specified by the MOP instruction represent a compressed sparse matrix - it can be either or both.
- Figure 10 is a block diagram of an apparatus in accordance with one example implementation, illustrating how the processing circuitry is used to perform outer product operations.
- the vector register file 80 provides a plurality of vector registers that can be used to store vectors of data elements.
- the MOP instruction can be arranged to identify a plurality of first source vector operands 300 and at least one second source vector operand 320.
- the at least one second source vector operand 320 (and optionally also the plurality of first source vector operands 300) comprises multiple subsets of data elements, with each subset being for a different outer product operation.
- the terms "first" and "second" used herein to refer to the two source (groups of) vector operands are purely labels to distinguish between them, and do not imply any particular ordering with regards to how those operands are specified by the instruction.
- either of the source operand fields of the instruction may be used to specify the first source vector operands referred to above, and the other of the source operand fields will then be used to specify the second source vector operand referred to above.
- At least one of the two source vector operands will comprise multiple subsets of data elements for use in different outer product operations, and the other source vector operand may not; however, it may also be the case in some example implementations that both source vector operands comprise multiple subsets of data elements.
- more than one second source vector operand 320 may be specified, in addition to specifying a plurality of first source vector operands.
- the number of first source vector operands and the number of second source vector operands specified by the MOP instruction are not limited to one or two vector operands - more than two first/second vector operands may be specified (for example four vector operands or eight vector operands, etc.).
- the MOP instruction also specifies correlation information, which identifies which data elements in the at least one second source vector operand 320 are to be multiplied by which vector operands in the plurality of first vector operands.
- the correlation information is stored in predicate registers 325 in the predicate register file 75, and hence the MOP instruction also identifies one or more predicate registers 325.
- the processing circuitry 60 is controlled in dependence on control signals received from the decoder circuitry (decoder) 50, and when the decoder circuitry 50 decodes the earlier- mentioned MOP instruction, it will send control signals to the processing circuitry to control the processing circuitry to perform a plurality of computations to implement a plurality of outer product operations, wherein the plurality of outer product operations comprise, for a given first source vector operand, performing an associated outer product operation to calculate an outer product of the given first source vector operand with a subset of data elements of the at least one second source vector operand.
- those control signals will control selection circuitry 340 provided by the processing circuitry 60 to select the appropriate data elements to be processed by each outer product operation.
- Each outer product operation comprises multiplying each data element of an associated subset of data elements in the at least one second source vector operand by each data element in an associated first source vector operand, and then using each outer product result element to update a value held in an associated storage element within a given two-dimensional array 380 of storage elements within the array storage 90.
- the selection circuitry 340 can be organised in a variety of ways, but in one example implementation comprises multiplexer circuitry provided for each of a plurality of multipliers used to generate an outer product result from two input data elements, that multiplexer circuitry being used to select the appropriate two input data elements for each multiplier.
- the selected input data elements are then forwarded to multiplication circuitry 350, which as noted above may in one example implementation comprise a multiplier circuit for each outer product result to be produced.
- Each outer product result element is produced by multiplying the two input data elements provided to the corresponding multiplier within the multiplication circuitry 350.
- the outer product result element may be provided directly to array update circuitry 370 used to update the storage elements within the 2D array 380, each outer product result element having an associated storage element within the 2D array 380 and being used to update the value held in that associated storage element.
- each outer product result element generated is combined with the existing value stored in the associated storage element of the 2D array 380 (for example by adding the outer product result to the existing value or subtracting the outer product result from the existing value), using the optional accumulate circuitry 360.
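- As a small sketch of the per-element update just described (the flag names are illustrative; the way accumulate/subtract variants are actually encoded is not modelled here):

```python
def update_storage_element(current, result_element, accumulate=True, subtract=False):
    """Update one storage element of the 2D array with one outer product
    result element: overwrite it, add the result to it, or subtract the
    result from it."""
    if not accumulate:
        return result_element                      # non-accumulating variant
    return current - result_element if subtract else current + result_element
```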
- whilst the multiplication circuitry 350 and optional accumulate circuitry 360 are shown as separate blocks in Figure 10, in one example implementation they may be provided as a combined block formed of multiply accumulate circuits.
- the array update circuitry 370 is used to control access to the relevant storage elements within the 2D array 380, so as to ensure that each value received by the array update circuitry is used to update the associated storage element within the 2D array 380.
- Outer product operations are usefully employed within data processing systems for a variety of different reasons, and hence the ability to perform multiple outer product operations in response to a single instruction can provide significant performance/throughput improvements, as well as making more efficient use of the available storage resources provided by the two-dimensional arrays within the array storage 90.
- outer product operations can be used to implement matrix multiplication operations.
- Matrix multiplication may for example involve multiplying a first m x k matrix of data elements by a k x n matrix of data elements to produce an m x n matrix result of data elements.
- This operation can be decomposed into a plurality of outer product operations (more particularly k outer product operations, where k may be referred to as the depth), where each outer product operation involves performing a sequence of multiply accumulate operations to multiply each data element of an m vector of data elements from the first matrix by each data element of an n vector of data elements from the second matrix to produce an m x n matrix of result data elements stored within the 2D array.
- the results of the plurality of outer product operations can be accumulated within the same 2D array in order to produce the m x n matrix that would have been generated by performing the earlier-mentioned matrix multiplication.
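- Expressed compactly (a purely notational restatement of the above, with $a_i$ denoting the i-th column of the first matrix and $b_i^{\mathsf{T}}$ the i-th row of the second):

$$
A B = \sum_{i=1}^{k} a_i\, b_i^{\mathsf{T}}, \qquad A \in \mathbb{R}^{m \times k},\; B \in \mathbb{R}^{k \times n},
$$

so accumulating the k outer products in the 2D array yields the same m x n result as the matrix multiplication.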
- Matrix multiplication has a number of potential applications. In addition to application in executing an ANN as noted above, matrix multiplication may be employed in, for example, image processing.
- Figure 11A illustrates how an outer product result element may be associated with a particular storage element in the 2D array.
- a data element 570 from the first source vector operand is multiplied by a data element 572 from the second source vector operand using the multiply function 574, producing an outer product result element which is then subjected to an accumulate operation by the accumulate function 576 to add that outer product result to the current value stored in the associated storage element 578 (or to subtract that outer product result from the current value stored in the associated storage element 578) in order to produce an updated value that is then stored in the associated storage element.
- Figure 11B illustrates a sum of outer products operation where two outer product result elements are associated with the same storage element within the 2D array.
- a data element 580 from the first source vector operand is multiplied by a data element 582 from the second source vector operand using the multiply function 584 in order to produce a first outer product result element.
- a data element 586 from the first source vector operand and a data element 588 from the second source vector operand are multiplied by the multiply function 590 in order to produce a second outer product result element.
- the two outer product results are then added together using the add function 592, and an accumulate function 594 is performed in order to produce an updated data value for storing in the associated storage element 596.
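- A minimal sketch of the Figure 11B style update for a single storage element (illustrative only):

```python
def sum_of_outer_products_element(current, a_elems, b_elems):
    """Sum the outer product result elements that map to the same storage
    element (an elementary 2-way dot product when two pairs are supplied),
    then accumulate the sum into the existing value."""
    partial = sum(a * b for a, b in zip(a_elems, b_elems))
    return current + partial

# e.g. sum_of_outer_products_element(c, [a0, a1], [b0, b1]) returns
# c + a0*b0 + a1*b1, matching the multiply/add/accumulate chain of Figure 11B.
```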
- Figure 12 is a diagram schematically illustrating fields that may be provided within the MOP instruction, in accordance with one example implementation.
- An opcode field 605 is used to identify the type of instruction, in this case identifying that the instruction is a MOP instruction.
- One or more control information fields 610 can be provided, for example to identify one or more predicates as referred to earlier.
- a field 615 identifies the correlation information to be used in the operation - for example, this could be an identifier of a register (e.g. a predicate register) storing the correlation information. Note that this is separate from any predicate information held in the control information field 610, which is to be used as a predicate/mask in the operation.
- the field 620 is then used to identify the plurality of first source vector operands (for example by specifying vector registers within the vector register file 80, wherein one or more vector operands are implicitly associated with the identified register(s)).
- a field 625 can be used to identify the one or more second source vector operands, again for example by specifying one or more vector registers within the vector register file 80. Note that this is just one example, and in fact either one of the fields 620, 625 can be used to specify the earlier-mentioned first source vector operands, with the other field then specifying the second source vector operand.
- the field 630 may be used to identify the destination 2D array within the array storage 90 that is to be used to store the matrices generated as a result of performing the multiple outer product operations specified by the multiple outer product instruction.
- Figures 13 to 15 illustrate some examples of how the correlation information may be represented in one or more predicate registers.
- Figure 13 shows a first example in which each predicate register P0, P1 holds a set of indices for a corresponding vector register Z4, Z5, each set of indices forming the correlation information for its corresponding vector register.
- Each index in this example identifies the register of the source matrix (and hence indirectly identifies a row of the source matrix) that the corresponding data element came from, with the first (bottom) row being row “0” and the fourth (top) row being row “3”.
- the left-most element in predicate register P0 (which holds the correlation information for vector register Z4 in this example) is “2”, indicating that the corresponding element (the left-most element) in vector register Z4 comes from the 3rd row of the source matrix (e.g. from vector register Z10 in this example).
- Figure 13 also shows how the MOP instructions (identified as “FMOPA” instructions in this case) may be represented in this example.
- a separate MOP instruction is executed for each of the vector registers Z4, Z5, with each MOP instruction identifying:
- Figure 14 shows another example of how the correlation information may be represented; in this example, the same indices are used, but the indices for both Z4 and Z5 are stored in the same predicate register P0.
- This example takes advantage of the fact that the number of bits required to represent each index is typically significantly smaller than the number of bits available in each element of the predicate register - for example, the indices in this example are at most two bits long (the indices 0, 1, 2 and 3 are represented in binary as 00, 01, 10 and 11 respectively), whereas the number of bits in each element of a predicate register may be 4 or more.
- the inventors realised that the indices for the two vector registers Z4, Z5 can be packed into a single predicate register, reducing the number of predicate registers that are taken up by the correlation information.
- the top-half (e.g. the top two bits) of each data element can be used to store the index for a corresponding data element of one of the vector registers (e.g. the top-half of the left-most element stores the value “3”, which is the index for the leftmost element in Z5)
- the bottom-half (e.g. the bottom two bits) of each data element can be used to store the index for a corresponding data element of the other of the vector registers (e.g. the bottom-half of the left-most element stores the value “2”, which is the index for the leftmost element in Z4).
- this example requires two slightly different MOP instructions (i.e. an “FMOPA1” instruction and an “FMOPA2” instruction), with both identifying the same predicate register P0, but one causing the processing circuitry to read the top two bits of each element, and the other causing the processing circuitry to read the bottom two bits of each element.
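- A small sketch of this packing scheme, assuming 4-bit predicate elements with the Z5 index in the top two bits and the Z4 index in the bottom two bits as described above (the wider-than-4-bit case and the exact register layout are not modelled; the function names are illustrative):

```python
def pack_indices(idx_z4, idx_z5):
    """Pack two sets of 2-bit indices into one 4-bit field per element position."""
    assert len(idx_z4) == len(idx_z5)
    packed = []
    for lo, hi in zip(idx_z4, idx_z5):
        assert 0 <= lo < 4 and 0 <= hi < 4
        packed.append((hi << 2) | lo)   # top half: Z5 index, bottom half: Z4 index
    return packed

def unpack_indices(packed, top_half):
    """FMOPA1/FMOPA2-style read-out: one variant reads the bottom two bits of
    each element, the other reads the top two bits."""
    return [(p >> 2) & 0b11 if top_half else p & 0b11 for p in packed]

# e.g. pack_indices([2], [3]) == [0b1110];
#      unpack_indices([0b1110], top_half=True) == [3]   (index for Z5)
#      unpack_indices([0b1110], top_half=False) == [2]  (index for Z4)
```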
- each of the two MOP instructions could, for example, have a different opcode.
- the encoding of the two instructions could be different in some other way.
- each MOP instruction in Figure 14 specifies:
- Figure 15 shows another example of how the correlation information may be represented, in which the correlation information for two vector registers Z4, Z5 is compressed into a single predicate register P0.
- each data element in the predicate register P0 holds a bitmap indicating which data elements in a corresponding column of the source matrix held non-zero values.
- the left-most data element in the predicate register reads “1100”, indicating that there are non-zero elements in the top two rows of the left-most column of the source matrix, and zero elements in the bottom two rows of the same column. From this, it can be determined that the left-most data elements in the vector registers Z4, Z5 are from the top two (2nd and 3rd) rows of the source matrix.
- the processing circuitry determines a position of either the first or the second “1” in the corresponding bitmap to identify which of the first source vectors should be used in the associated outer product operation.
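- A sketch of this bitmap decoding (here bit r of each bitmap is taken to mean "row r of the corresponding column is non-zero", with row 0 as the bottom row; the exact bit ordering within the predicate element is an assumption for illustration):

```python
def bitmap_to_row_indices(bitmap, num_rows=4):
    """Decode a per-column bitmap of non-zero rows into row indices."""
    return [r for r in range(num_rows) if (bitmap >> r) & 1]

# e.g. bitmap_to_row_indices(0b1100) == [2, 3]: the top two rows are non-zero,
# so one MOP instruction variant can take the first set bit (row 2) and a
# second variant the second set bit (row 3) to select the first source vector.
```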
- Each instruction specifies:
- the MOP instruction in this example is a sum-of-outer-products (SMOPA) instruction, for performing elementary 2-way dot product operations.
- the base instruction is a sum-of-outer-products instruction with accumulation on 32 bits.
- the elementary operations are 4-way dot products operating on 4-way interleaved data.
- two source registers are used for the activations (left-hand operand) and one source register for the weights (right-hand operand).
- since 4 elements are selected among 8 elements in this variant of the instruction (e.g. 2 times 2 elements among 4), two predicate registers are used.
- Four 8-bit 4-to-1 multiplexers are provided in front of each multiplier to achieve a throughput of one operation per cycle.
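- As an illustrative sketch of the elementary operation of this variant (operand widths as described above; saturation and other architectural corner cases are not modelled):

```python
import numpy as np

def smopa_4way_element(acc32, a_bytes, b_bytes):
    """Elementary 4-way dot product of the sum-of-outer-products variant:
    four pairs of 8-bit values are multiplied, the products summed, and the
    sum accumulated into a 32-bit storage element."""
    a = np.asarray(a_bytes, dtype=np.int8).astype(np.int32)
    b = np.asarray(b_bytes, dtype=np.int8).astype(np.int32)
    assert a.size == 4 and b.size == 4
    return np.int32(acc32 + np.dot(a, b))

# e.g. smopa_4way_element(10, [1, -2, 3, 4], [5, 6, -7, 8])
#      == 10 + (1*5 - 2*6 - 3*7 + 4*8) == 14
```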
- the MOP instruction in this example is a sum-of-outer-products instruction (SMOPA), which specifies:
- Figure 18 is a flow diagram illustrating steps performed on decoding a multiple outer product instruction in accordance with one example implementation.
- At step 650, it is determined whether a MOP instruction has been encountered. If not, then standard decoding of the relevant instruction is performed at step 655, with the processing circuitry then being controlled to perform the required operation defined by that instruction.
- If a MOP instruction has been encountered, then at step 660 that instruction is decoded in order to identify the source vector operands (e.g. the plurality of first source vector operands and the at least one second source vector operand), the destination 2D array, the correlation information, and the form of outer product that is to be performed (for example whether an accumulating outer product is being performed or a non-accumulating variant is being performed, and also for example whether normal outer product operations are to be performed or sum of outer products operations are to be performed).
- Thereafter, the processing circuitry is controlled to perform the required outer product operations and to perform the required updates to the 2D array storage elements.
- selection circuitry is controlled to select the data elements for each multiplication operation dependent on the correlation information.
- Figure 19 illustrates a simulator implementation that may be used. Whilst the earlier described embodiments implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment in accordance with the embodiments described herein which is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software-based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation may run on a host processor 715, optionally running a host operating system 710, supporting the simulator program 705.
- there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple distinct instruction execution environments provided on the same host processor.
- powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons.
- the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture.
- An overview of simulation is given in “Some Efficient Architecture Simulation Techniques”, Robert Bedichek, Winter 1990 USENIX Conference, pages 53-63.
- the simulator program 705 may comprise processing program logic 720 to emulate the behaviour of the processing circuitry described above and instruction decode program logic to emulate the behaviour of the instruction decoder circuitry described above.
- memory hardware such as a register or cache
- array storage emulating program logic 722 is provided to emulate the array storage described above.
- the simulator program 705 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 700 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 705.
- the program instructions of the target code 700, including the MOP instruction described above, may be executed from within the instruction execution environment using the simulator program 705, so that a host computer 715 which does not actually have the hardware features of the apparatus discussed above can emulate these features.
- the words “configured to...” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation.
- a “configuration” means an arrangement or manner of interconnection of hardware or software.
- the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Complex Calculations (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2213475.3A GB2622581A (en) | 2022-09-14 | 2022-09-14 | Multiple-outer-product instruction |
GB2213475.3 | 2022-09-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024056984A1 true WO2024056984A1 (en) | 2024-03-21 |
Family
ID=83945083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2023/051858 WO2024056984A1 (en) | 2022-09-14 | 2023-07-14 | Multiple-outer-product instruction |
Country Status (3)
Country | Link |
---|---|
GB (1) | GB2622581A (zh) |
TW (1) | TW202411860A (zh) |
WO (1) | WO2024056984A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180189234A1 (en) * | 2016-12-31 | 2018-07-05 | Intel Corporation | Hardware accelerator architecture for processing very-sparse and hyper-sparse matrix data |
US20200159810A1 (en) * | 2018-11-15 | 2020-05-21 | Hewlett Packard Enterprise Development Lp | Partitioning sparse matrices based on sparse matrix representations for crossbar-based architectures |
US20200401440A1 (en) * | 2016-12-31 | 2020-12-24 | Intel Corporation | Systems, methods, and apparatuses for heterogeneous computing |
US20220058026A1 (en) * | 2020-08-19 | 2022-02-24 | Facebook Technologies, Llc | Efficient multiply-accumulation based on sparse matrix |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10929503B2 (en) * | 2018-12-21 | 2021-02-23 | Intel Corporation | Apparatus and method for a masked multiply instruction to support neural network pruning operations |
GB2594971B (en) * | 2020-05-13 | 2022-10-05 | Advanced Risc Mach Ltd | Variable position shift for matrix processing |
CN116888591A (zh) * | 2021-03-31 | 2023-10-13 | Huawei Technologies Co., Ltd. | Matrix multiplier, matrix calculation method and related device |
2022
- 2022-09-14 GB GB2213475.3A patent/GB2622581A/en active Pending
2023
- 2023-07-14 WO PCT/GB2023/051858 patent/WO2024056984A1/en unknown
- 2023-08-18 TW TW112131068A patent/TW202411860A/zh unknown
Also Published As
Publication number | Publication date |
---|---|
GB2622581A (en) | 2024-03-27 |
GB202213475D0 (en) | 2022-10-26 |
TW202411860A (zh) | 2024-03-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
- 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23748299; Country of ref document: EP; Kind code of ref document: A1 |