US20200341772A1 - Efficient Architectures For Deep Learning Algorithms - Google Patents


Info

Publication number
US20200341772A1
Authority
US
United States
Prior art keywords
operand
simd
values
engines
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/397,401
Other languages
English (en)
Inventor
Shashi Kiran CHILAPPAGARI
Winston Lee
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Degirum Corp
Original Assignee
Degirum Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Degirum Corp filed Critical Degirum Corp
Priority to US16/397,401 priority Critical patent/US20200341772A1/en
Assigned to DeGirum Corporation reassignment DeGirum Corporation ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHILAPPAGARI, SHASHI KIRAN, LEE, WINSTON
Priority to PCT/US2020/026337 priority patent/WO2020222971A1/fr
Priority to KR1020217033459A priority patent/KR20220002295A/ko
Priority to CN202080032192.8A priority patent/CN113748417A/zh
Priority to EP20799017.7A priority patent/EP3963462A4/fr
Priority to JP2021563271A priority patent/JP7361133B2/ja
Priority to CA3137873A priority patent/CA3137873A1/fr
Priority to TW109111546A priority patent/TWI833003B/zh
Publication of US20200341772A1 publication Critical patent/US20200341772A1/en
Priority to US17/470,675 priority patent/US20210406030A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/3887 Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 9/30109 Register structure having multiple operands in a single register
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30141 Implementation provisions of register files, e.g. ports
    • G06F 9/3824 Operand accessing
    • G06F 9/3891 Concurrent instruction execution using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, organised in groups of units sharing resources, e.g. clusters
    • G06N 20/00 Machine learning
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the dot product of p and q, denoted by p · q, is defined as: p · q = p1q1 + p2q2 + . . . + pnqn   (1)
  • let A be an (m × n) matrix (i.e., A is a matrix with m rows (horizontal) and n columns (vertical)).
  • let ai,j denote the element in the i-th row and j-th column of matrix A.
  • let B be an (n × k) matrix.
  • A = [ a1,1  a1,2  a1,3  . . .  a1,n
          a2,1  a2,2  a2,3  . . .  a2,n
           ⋮     ⋮     ⋮            ⋮
          am,1  am,2  am,3  . . .  am,n ]
  • B = [ b1,1  b1,2  b1,3  . . .  b1,k
          b2,1  b2,2  b2,3  . . .  b2,k
           ⋮     ⋮     ⋮            ⋮
          bn,1  bn,2  bn,3  . . .  bn,k ]   (2)
  • the matrices A and B can be multiplied only if their dimensions are compatible (i.e., if the number of columns in A is equal to the number of rows in B).
  • the product C of matrices A and B is the (m × k) matrix in which element ci,j is the dot product of the i-th row of A with the j-th column of B: ci,j = ai,1b1,j + ai,2b2,j + . . . + ai,nbn,j.
  • the multiplication of two matrices of dimensions (m × n) and (n × k) thus consists of computing (m × k) dot products of length-n vectors.
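The row-by-column procedure described above can be sketched in a few lines of Python (an illustrative sketch only; the function name and list-of-lists matrix representation are ours, not the patent's):

```python
def matmul_via_dot_products(A, B):
    """Multiply an (m x n) matrix A by an (n x k) matrix B by computing
    m*k dot products, each over length-n vectors."""
    m, n = len(A), len(A[0])
    n2, k = len(B), len(B[0])
    assert n == n2, "dimensions must be compatible"
    C = [[0] * k for _ in range(m)]
    for i in range(m):
        for j in range(k):
            # C[i][j] is the dot product of row i of A with column j of B
            C[i][j] = sum(A[i][l] * B[l][j] for l in range(n))
    return C
```

Each of the m·k result elements is an independent dot product, which is what makes the parallel MAC arrangements described below attractive.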
  • the dot product computation is generally implemented as a series of multiply-accumulate operations.
  • the multiply-accumulate operation computes the product of two numbers and adds that product to an accumulator. This can be represented as: c ← c + (a × b).
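The series of multiply-accumulate operations that implements one dot product can be sketched as follows (a minimal Python illustration; the function name is ours):

```python
def dot_product_mac(p, q):
    """Compute the dot product of equal-length vectors p and q as a
    series of multiply-accumulate (MAC) operations."""
    assert len(p) == len(q)
    c = 0  # accumulator register, cleared before the first MAC
    for a, b in zip(p, q):
        c = c + a * b  # one MAC: multiply the operands, add to accumulator
    return c
```

One MAC executes per cycle in the simple unit of FIG. 1, so a length-n dot product takes n multiply-accumulate steps.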
  • FIG. 1 is a block diagram of a simple MAC unit 100 that includes input operand registers 101 and 102 , which store operands a and b, respectively, multiply circuit 103 , addition circuit 104 and accumulator register 105 , which stores the accumulator value c.
  • FIG. 2 is a block diagram of a system 200 that includes multiple parallel MACs 201 - 204 for performing multiple dot products in parallel.
  • MACs 201 , 202 , 203 , and 204 include operand registers 211 - 212 , 213 - 214 , 215 - 216 and 217 - 218 , respectively, multiplier circuits 221 , 222 , 223 and 224 , respectively, addition circuits 231 , 232 , 233 and 234 , respectively, and accumulators 241 , 242 , 243 and 244 , respectively.
  • System 200 includes four parallel MACs in which MAC 201 computes the dot product of a 1,: with b :,1 , MAC 202 computes the dot product of a 2,: with b :,1 , MAC 203 computes the dot product of a 3,: with b :,1 and MAC 204 computes the dot product of a 4,: with b :,1 .
  • Supplying the MACs 201 - 204 with the input data is a challenge that needs to be solved. It would therefore be desirable to have efficient ways to supply computation units such as MACs 201 - 204 with the required data.
  • all the MAC units 201 - 204 of FIG. 2 are performing the same operations, but with different inputs. Hence, instead of providing separate instructions to each of the MACs 201 - 204 , it is possible to group all the MACs 201 - 204 together to form a single instruction multiple data (SIMD) engine that operates in response to a common instruction.
  • FIG. 3 is a block diagram of a SIMD engine 300 that groups operand registers 211 , 213 , 215 and 217 of MACs 201 - 204 to form a first operand register 301 , and groups operand registers 212 , 214 , 216 and 218 of MACs 201 - 204 to form a second operand register 302 .
  • the multiplier circuits 221 - 224 of MACs 201 - 204 are combined to form multiplier 321
  • the addition circuit 231 - 234 of MACs 201 - 204 are combined to form addition circuit 331 .
  • the accumulators 241 - 244 of MACs 201 - 204 are combined to form an accumulator 341 .
  • in this manner, the various elements of parallel MACs 201 - 204 are combined to form SIMD engine 300 . It is important to note that the scalar inputs of the different MACs 201 - 204 are combined to form vector inputs in the SIMD engine 300 . In addition, the output of the accumulator 341 of the SIMD engine 300 is also a vector.
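The grouping of the four scalar MACs into one vector MAC can be illustrated as follows (a Python sketch under the assumption of one operand word per lane; names are ours, not the patent's):

```python
def simd_mac_step(acc, vec_a, vec_b):
    """One SIMD multiply-accumulate step: every lane executes the same
    instruction, multiplying its own pair of operand words and adding
    the product to its own accumulator."""
    return [c + a * b for c, a, b in zip(acc, vec_a, vec_b)]

# Four lanes driven by a single instruction, each with different data:
acc = [0, 0, 0, 0]
acc = simd_mac_step(acc, [1, 2, 3, 4], [10, 10, 10, 10])
```

The single instruction replaces the four separate instructions that the independent MACs of FIG. 2 would otherwise require.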
  • although SIMD engines are efficient at processing vectors and are capable of executing a variety of instructions, they require significant control logic and local memory. It can be seen that in order to obtain the maximum number of operations per unit silicon area, the number of SIMD engines needs to be maximized.
  • One way to achieve this is to have a design in which multiple SIMD engines can share control logic and memory resources. However, this imposes restrictions on the type of operations that can be performed by the SIMD engines and may require additional logic to drive the SIMD engines. It would therefore be desirable to have improved computer architectures that include SIMD engines.
  • the present invention provides an improved computer architecture that includes a plurality of single instruction, multiple data (SIMD) engines that operate in parallel.
  • An Operand A register file stores a first set of one or more operand values (Operand A values), wherein each of the Operand A values includes a plurality of operand words.
  • An Operand B register file stores a second set of one or more operand values (Operand B values), wherein each of the Operand B values includes a plurality of operand words.
  • each of the Operand A and Operand B values includes four 32-bit operand words.
  • each of the Operand A and Operand B values includes eight 16-bit operand words.
  • An input distribution block includes an Operand A distribution circuit and an Operand B distribution circuit.
  • the Operand A distribution circuit is controlled to route the received Operand A value to each of the SIMD engines in parallel. For example, if the received Operand A value includes four operand words [w, x, y, z], and there are four parallel SIMD engines, then each of the SIMD engines would receive the four operand words [w, x, y, z].
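The parallel routing described above can be sketched as a simple broadcast (an illustrative Python sketch; names are ours, not the patent's):

```python
def broadcast_operand_a(operand_a, num_engines=4):
    """Route one Operand A value (a list of operand words) to every
    SIMD engine in parallel; each engine receives its own copy of the
    full value."""
    return [list(operand_a) for _ in range(num_engines)]
```

For the example in the text, `broadcast_operand_a(['w', 'x', 'y', 'z'])` delivers the four operand words [w, x, y, z] to each of the four engines.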
  • the Operand B distribution circuit is coupled to receive one or more Operand B values from the Operand B register file.
  • the Operand B distribution circuit selectively routes one or more of the operand words from one or more of the received Operand B values to create a plurality of input Operand B values, wherein each of the input Operand B values is routed to a corresponding one of the plurality of SIMD engines.
  • the Operand B distribution circuit is controlled to route a received Operand B value to each of the SIMD engines in parallel. For example, if the received Operand B value includes four operand words [a, b, c, d], and there are four parallel SIMD engines, then each of the SIMD engines would receive the four operand words [a, b, c, d].
  • the Operand B distribution circuit includes a plurality of buffers to store a plurality of Operand B values.
  • Operand B select logic is used to select which of the SIMD engines receives which of the buffered Operand B values. For example, if the buffered Operand B values include [a, b, c, d], [e, f, g, h], [i, j, k, l] and [m, n, o, p], and there are four parallel SIMD engines, then one of the four SIMD engines could receive input Operand B value [a, b, c, d], one could receive input Operand B value [e, f, g, h], one could receive input Operand B value [i, j, k, l], and one could receive input Operand B value [m, n, o, p].
  • two of the four SIMD engines could receive input Operand B value [a, b, c, d], one of the four SIMD engines could receive input Operand B value [e, f, g, h], and one of the four SIMD engines could receive input Operand B value [i, j, k, l].
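The buffered selection just described can be sketched as a lookup per engine (Python illustration; the function name and buffer layout are our assumptions):

```python
def select_operand_b(buffers, selection):
    """Operand B select logic: `buffers` holds the buffered Operand B
    values, and `selection[i]` names the buffer whose value is routed
    to SIMD engine i.  Several engines may select the same buffer."""
    return [buffers[s] for s in selection]

# Four buffered Operand B values, as in the example in the text:
buffers = [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h'],
           ['i', 'j', 'k', 'l'], ['m', 'n', 'o', 'p']]
```

With `selection = [0, 0, 1, 2]`, two engines receive [a, b, c, d], one receives [e, f, g, h], and one receives [i, j, k, l], matching the second example above.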
  • the Operand B register file can include a single register file (such that the plurality of Operand B values are loaded into the Operand B buffers in a serial manner), or a plurality of register files (such that the plurality of Operand B values are loaded into the Operand B buffers in parallel). If the Operand B register file is implemented using a plurality of register files, then the Operand B buffers can be implemented using a double buffer configuration, wherein Operand B values are transferred from the Operand B register file to the Operand B distribution circuit at the same time that Operand B values are transferred from the Operand B distribution circuit to the SIMD engines.
  • the Operand B distribution circuit receives a plurality of Operand B values in parallel from the Operand B register file. These received Operand B values are provided to a shift logic circuit within the Operand B distribution circuit. Control logic specifies an amount of shift (in operand words) that the shift logic circuit introduces to the received Operand B values. The shifted Operand B values are buffered within the Operand B distribution circuit, and are then routed to the SIMD engines in parallel.
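The shift-then-buffer path can be sketched as follows (a hedged Python illustration; the text does not fix what fills the vacated positions, so this sketch assumes a rotation over the received operand words):

```python
def shift_operand_b(words, shift):
    """Shift a flat sequence of received Operand B operand words by
    `shift` words before buffering.  Fill behavior is an assumption:
    here the vacated positions wrap around (a rotation)."""
    if not words:
        return words
    shift %= len(words)
    return words[shift:] + words[:shift]
```

Control logic would supply `shift` (in operand words); the shifted result is then buffered and routed to the SIMD engines in parallel.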
  • the improved computer architecture also includes a plurality of output register sets, each coupled to a corresponding one of the plurality of SIMD engines.
  • data (e.g., dot product values) provided by the SIMD engines are stored in the corresponding output register sets.
  • each of the plurality of output register sets is independently addressed, providing flexibility to the operations performed.
  • the computer architecture of the present invention enables efficient sparse matrix multiplication.
  • FIG. 1 is a block diagram of a conventional multiplier-accumulator (MAC) unit.
  • FIG. 2 is a block diagram of a conventional system that includes multiple parallel MAC units for calculating multiple dot products in parallel.
  • FIG. 3 is a block diagram of a conventional single instruction multiple data (SIMD) engine that is created by grouping various elements of the multiple parallel MAC units of FIG. 2 .
  • FIG. 5 is a block diagram illustrating an architecture (Architecture 1A) for routing a first operand value (Operand A) having four 32-bit operand words from an Operand A register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 6 is a block diagram illustrating an architecture (Architecture 1A) for routing a first operand value (Operand A) having eight 16-bit operand words from an Operand A register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 9 and FIG. 10 are block diagrams illustrating an architecture (Architecture 2A), for routing a first operand value (Operand A) having eight 16-bit operand words from an Operand A register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 11 is a block diagram illustrating an architecture (Architecture 3A) for routing a first operand value (Operand A) having four 32-bit operand words from an Operand A register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 12 and FIG. 13 are block diagrams illustrating an architecture (Architecture 3A) for routing a first operand value (Operand A) having eight 16-bit operand words from an Operand A register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 14 is a block diagram illustrating an architecture (Architecture 1B) for routing a second operand value (Operand B) having four 32-bit operand words from an Operand B register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 15 is a block diagram illustrating an architecture (Architecture 1B) for routing a second operand value (Operand B) having eight 16-bit operand words from an Operand B register file to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 16 is a block diagram illustrating an architecture (Architecture 2B) for routing a second operand value (Operand B) having four 32-bit operand words from an Operand B register file and a plurality of Operand B buffers to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 17 is a block diagram illustrating an architecture (Architecture 3B) for routing a second operand value (Operand B) having four 32-bit operand words from a plurality of parallel Operand B register files and a plurality of Operand B buffers to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 18 is a block diagram illustrating an architecture (Architecture 3B) for routing a second operand value (Operand B) having four 32-bit operand words from a plurality of parallel Operand B register files and a plurality of Operand B double buffers to a plurality of SIMD engines in accordance with one embodiment of the present invention.
  • FIG. 20 is a block diagram of a computer system that includes a SIMD block having four parallel SIMD engines and an output circuit having four parallel output register sets in accordance with one embodiment of the present invention.
  • FIG. 21 is a block diagram of the computer system of FIG. 20 , which illustrates the addressing of the four parallel output register sets in accordance with one embodiment of the present invention.
  • FIG. 24 is a diagram illustrating two matrices I and J to be multiplied by the computer architecture of FIG. 23 in accordance with one embodiment of the present invention.
  • FIG. 25 is a block diagram illustrating the manner in which the contents of Matrix I and Matrix J of FIG. 24 are logically stored within system memory in accordance with one embodiment of the present invention.
  • FIG. 26 is a block diagram illustrating the manner in which results of the multiplication of Matrix I and Matrix J of FIG. 24 are stored within the output register sets of the computer architecture of FIG. 23 in accordance with one embodiment of the present invention.
  • FIG. 27 and FIG. 28 are block diagrams of a computer architecture during various stages of a sparse matrix multiplication in accordance with one embodiment of the present invention.
  • FIG. 29 , FIG. 30 and FIG. 31 are block diagrams of a computer architecture during various stages of a sparse matrix multiplication in accordance with an alternate embodiment of the present invention.
  • the following sections describe various efficient SIMD engine architectures. Specifically, ways to operate multiple SIMD engines in parallel are proposed, and manners of supplying the SIMD engines with inputs are described. While the following description uses examples that implement 128-bit wide input operands and 4 SIMD engines, it is noted that the described examples can be extended to smaller or larger input operand widths and/or fewer or more SIMD engines.
  • FIG. 4 is a block diagram of a computer system 400 that includes various hardware resources needed for operating a SIMD block 401 in accordance with one embodiment. These resources include an operand buffer 410 (which includes Operand A register file 411 and Operand B register file 412 ), input distribution block 415 (which includes operand A distribution circuit 416 and operand B distribution circuit 417 ), SIMD block 401 , output circuit 420 , control logic 430 (which includes state machine and scheduler 431 , control registers 432 and operand packaging circuit 433 ), and system memory 440 .
  • the important parameters for the SIMD operation are the operands, the type of operation and the addresses for the output circuit. These parameters are described in more detail in the subsequent sections.
  • state machine and scheduler 431 provides addresses to input distribution block 415 , wherein these addresses control the manner in which the Operand A distribution circuit 416 routes the Operand A values received from Operand A register file 411 to SIMD block 401 , and also control the manner in which the Operand B distribution circuit 417 routes the Operand B values received from Operand B register file 412 to SIMD block 401 .
  • Operand B distribution circuit 417 may include buffers to store multiple Operand B values, as well as shift logic that controls an amount of shift to be applied to the Operand B values received from Operand B register file 412 .
  • State machine and scheduler 431 also provides addresses used to access memory banks included within the output circuit 420 . These addresses include read addresses, which enable accumulation values stored in the memory banks to be routed to the SIMD block 401 for multiply-accumulate operations, as well as write addresses, which enable updated accumulation values provided by SIMD block 401 to be written back to the memory banks within output circuit 420 .
  • Control registers 432 store values that control the manner in which the state machine and scheduler 431 generates the various addresses for different modes of operation (which are described in more detail below). The operation of the various elements of computer system 400 is described in more detail below for various modes (i.e., architectures).
  • FIG. 6 is a block diagram illustrating another embodiment of the first Operand A architecture (Architecture 1A), wherein eight 16-bit input words [s, t, u, v, w, x, y, z] stored in Operand A register file 411 are routed to Operand A distribution circuit 416 . These input words [s, t, u, v, w, x, y, z] are buffered within Operand A distribution circuit 416 , and are then routed in parallel to each of the four SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) included in the SIMD block 401 .
  • FIG. 7 is a block diagram illustrating one embodiment of the second Operand A architecture (Architecture 2A), wherein four 32-bit input words [w, x, y, z] stored in Operand A register file 411 are routed to Operand A distribution circuit 416 .
  • Operand A distribution circuit 416 includes a buffer that stores the received input words [w, x, y, z].
  • State machine and scheduler 431 provides an index value to Operand A distribution circuit 416 , wherein this index value specifies the input word [w].
  • Operand A distribution circuit 416 performs a switching/demultiplexing operation, wherein the input word [w] is routed in parallel to each of the four SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) included in the SIMD block 401 . That is, the 32-bit input word [w] is effectively repeated four times to provide a 128-bit input Operand A [w, w, w, w].
  • this 128-bit input Operand A [w, w, w, w] is provided to each of the SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) in parallel.
  • the 32-bit input word [y] is effectively repeated four times to provide a 128-bit input Operand A [y, y, y, y], which is provided to each of the SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) in parallel.
  • similarly, in a 16-bit input mode, the 16-bit input word [u] is effectively repeated eight times to provide a 128-bit input Operand A [u, u, u, u, u, u, u, u], which is provided to each of the SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) in parallel.
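The index-based broadcast of Architecture 2A can be sketched as (Python illustration; names are ours, and the lane count per engine is an assumption from the 128-bit/32-bit example):

```python
def broadcast_indexed_word(operand_a, index, num_lanes=4):
    """Architecture-2A-style routing sketch: the operand word selected
    by `index` is repeated to fill a full-width input Operand A, which
    every SIMD engine then receives in parallel."""
    return [operand_a[index]] * num_lanes
```

With `operand_a = ['w', 'x', 'y', 'z']` and `index = 0`, every engine receives [w, w, w, w]; with `index = 2`, every engine receives [y, y, y, y], as in the examples above.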
  • Operand A distribution circuit 416 performs switching/demultiplexing operations, wherein: the 32-bit input word [w] is repeated four times to create a 128-bit Operand A value of [w, w, w, w], which is routed to SIMD 0 ; the 32-bit input word [x] is repeated four times to create a 128-bit Operand A value of [x, x, x, x], which is routed to SIMD 1 ; the 32-bit input word [y] is repeated four times to create a 128-bit Operand A value of [y, y, y, y], which is routed to SIMD 2 ; and the 32-bit input word [z] is repeated four times to create a 128-bit Operand A value of [z, z, z, z], which is routed to SIMD 3 .
  • FIG. 13 is a block diagram illustrating the continuation of the distribution started by FIG. 12 , wherein the 16-bit input word [w] is repeated eight times to create a 128-bit Operand A value of [w, w, w, w, w, w, w, w], which is routed to SIMD 0 ; the 16-bit input word [x] is repeated eight times to create a 128-bit Operand A value of [x, x, x, x, x, x, x, x], which is routed to SIMD 1 ; the 16-bit input word [y] is repeated eight times to create a 128-bit Operand A value of [y, y, y, y, y, y, y, y], which is routed to SIMD 2 ; and the 16-bit input word [z] is repeated eight times to create a 128-bit Operand A value of [z, z, z, z, z, z, z, z], which is routed to SIMD 3 .
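The per-engine repetition of Architecture 3A can be sketched as (Python illustration; the function name is ours, and the lane count is an assumption from the 128-bit examples):

```python
def distribute_architecture_3a(operand_a, lanes_per_engine=4):
    """Architecture-3A-style routing sketch: operand word i is repeated
    to fill the full-width input of SIMD engine i, so engine 0 sees
    [w, w, w, w], engine 1 sees [x, x, x, x], and so on."""
    return [[word] * lanes_per_engine for word in operand_a]
```

For `['w', 'x', 'y', 'z']`, the four engines receive [w, w, w, w], [x, x, x, x], [y, y, y, y] and [z, z, z, z] respectively; with `lanes_per_engine=8` the same scheme covers the 16-bit mode of FIGS. 12-13.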
  • Architectures 1A, 2A and 3A implement 32-bit input and 16-bit input modes. However, these embodiments are provided for illustration purposes only. The ideas are general and can be extended in a straightforward manner to other input modes (e.g., 8-bit input mode). Moreover, although the Architectures 1A, 2A and 3A have been described in connection with embodiments that include 4 SIMD engines and a 128-bit register file word, other numbers of SIMD engines and register file word widths can be used in other embodiments in a straightforward manner.
  • the index of the single value to be broadcast needs to be provided to the Operand A distribution circuit 416 .
  • the index can have different interpretations depending on whether the data is 8-bit, 16-bit or 32-bit wide (which could be specified by a control register setting).
  • the index of the value to be broadcast to SIMD 0 needs to be provided to the Operand A distribution circuit 416 .
  • the indices for the values to be broadcast to the other SIMD engines can be inferred by the hardware by incrementing. Alternatively, all four indices can be provided to the Operand A distribution circuit 416 .
  • the data stored in the buffers of the Operand A distribution circuit 416 can be reused over multiple cycles so that the register file words do not need to be read every cycle from the Operand A register file 411 .
  • Separate control logic can supply a flag specifying which cycles need to load the data from the Operand A register file 411 .
  • the Operand A distribution circuit 416 can contain multiple buffers to hold the register file word data with control logic specifying the buffer indices to use for writing and reading.
  • the Operand A distribution circuit 416 contains two buffers: one for writing and one for reading, which are used in a ping-pong manner.
  • the state machine and scheduler 431 automatically manages the read and write indices. This scheme is generally known as double buffering. In such cases, no additional control logic is needed to specify buffer indices for read and write.
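The ping-pong scheme just described can be sketched as follows (a minimal Python model; the class and method names are ours, not the patent's):

```python
class DoubleBuffer:
    """Double (ping-pong) buffering: one buffer is written while the
    other is read, and the roles swap each cycle, so loads from the
    register file overlap with delivery to the SIMD engines."""

    def __init__(self):
        self.buffers = [None, None]
        self.write_index = 0  # the read side is always the other buffer

    def write(self, value):
        self.buffers[self.write_index] = value

    def read(self):
        return self.buffers[1 - self.write_index]

    def swap(self):
        # Exchange the roles of the read and write buffers.
        self.write_index = 1 - self.write_index
```

After each `swap`, the value loaded in the previous cycle becomes readable while a new value is being written, mirroring the automatic read/write index management performed by the state machine and scheduler 431.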
  • Operand B distribution circuit 417 can be configured in four different architectures (Architecture 1B, Architecture 2B, Architecture 3B and Architecture 4B) to provide the input Operand B to SIMD block 401 .
  • each of the four SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) included in the SIMD block 401 receives a full register file word (which includes four 32-bit word values a, b, c and d) as the input Operand B.
  • Architecture 1B for providing Operand B to the SIMD block 401 is similar to Architecture 1A for providing Operand A to the SIMD block 401 .
  • FIG. 14 is a block diagram illustrating one embodiment of the first Operand B architecture (Architecture 1B), wherein four 32-bit input words [a, b, c, d] stored in Operand B register file 412 are routed to Operand B distribution circuit 417 .
  • Operand B distribution circuit 417 includes a buffer that stores the received input words [a, b, c, d].
  • Operand B distribution circuit 417 also includes circuitry for performing a switching/demultiplexing function, wherein the buffered input words [a, b, c, d] are routed in parallel to each of the four SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) included in the SIMD block 401 .
  • each of SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 receives the full register file word [a, b, c, d] as input Operand B.
  • FIG. 15 is a block diagram illustrating another embodiment of the first Operand B architecture (Architecture 1B), wherein eight 16-bit input words [a, b, c, d, e, f, g, h] stored in Operand B register file 412 are routed to Operand B distribution circuit 417 . These input words [a, b, c, d, e, f, g, h] are buffered within Operand B distribution circuit 417 , and are then routed in parallel to each of the four SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 , SIMD 3 ) included in the SIMD block 401 .
  • each of the SIMD engines receives the full register file word [a, b, c, d, e, f, g, h] as input Operand B.
  • multiple entries can be read simultaneously from Operand B register file 412 .
  • the most general way to accomplish this is to implement the register file 412 using a multi-read-port memory.
  • a memory with four read ports can be used to simultaneously read four entries from the Operand B register file 412 .
  • such a memory configuration has a high hardware complexity (occupies a relatively large area and consumes a relatively high power).
  • preferred embodiments of the present invention include low complexity methods and structures for supplying the different SIMD engines with (possibly) different input Operand B values. While these preferred embodiments may not provide as much generality as the broad (multiple read port) method, they are efficient for the purposes of the algorithms to be implemented.
  • a second architecture for providing input Operand B to the SIMD engines is provided, wherein a small number of entries from the Operand B register file 412 are buffered in the Operand B distribution circuit 417 and then distributed to the SIMD engines of SIMD block 401 .
  • this can be thought of as an approach that gives each SIMD engine some flexibility by allowing it to address any entry from a small set of entries, while keeping the hardware complexity small.
  • the main characteristics of the second architecture (Architecture 2B) for providing the input Operand B to the SIMD block 401 can be defined as follows.
  • the Operand B distribution circuit 417 includes a plurality of Operand B buffers to hold values read from the Operand B register file 412 . Each of these Operand B buffers can hold one full register file word. Each SIMD can receive the register file word stored in any one of the Operand B buffers.
  • a buffer select mechanism is used to specify which of the Operand B buffers is coupled to each of the SIMD engines.
  • the Operand B buffers are filled one at a time from the Operand B register file 412 .
  • each Operand B buffer may use a double buffering scheme so that read and write operations to an Operand B buffer do not occur in the same cycle.
  • FIG. 16 is a block diagram of the second architecture (Architecture 2B) for providing input Operand B to the SIMD engines in accordance with one embodiment.
  • Operand B distribution logic 417 includes four Operand B buffers B0-B3, each of which is capable of storing a full register word from Operand B register file 412 .
  • Operand B buffers B0-B3 store register file words received from operand B register file 412 .
  • Operand B buffers B0, B1, B2 and B3 store values [a, b, c, d], [e, f, g, h], [i, j, k, l] and [m, n, o, p], respectively (wherein each of the values a-p is a 32-bit word).
  • the four entries bs0, bs1, bs2 and bs3 specify operand B buffers B0, B0, B1 and B2, respectively, indicating that the contents of operand B buffer B0 (i.e., [a, b, c, d]) are provided to SIMD 0 and SIMD 1 , the contents of operand B buffer B1 (i.e., [e, f, g, h]) are provided to SIMD 2 , and the contents of operand B buffer B2 (i.e., [i, j, k, l]) are provided to SIMD 3 .
  • the buffer selection may change by changing the buffer select entries bs0, bs1, bs2 and bs3. It is noted that if the number of operand buffers is reduced to 1, then Architecture 2B would be equivalent to Architecture 1B.
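The buffer-select routing of Architecture 2B can be modeled as a simple lookup. This is an illustrative sketch (function and variable names are ours), reproducing the FIG. 16 example in which the select entries bs0-bs3 point at buffers B0, B0, B1 and B2:

```python
def distribute_operand_b(operand_b_buffers, buffer_select):
    """Route one buffered register file word to each SIMD engine,
    according to the per-SIMD buffer select entries."""
    return [operand_b_buffers[sel] for sel in buffer_select]

# Four Operand B buffers, each holding one full register file word.
buffers = {
    "B0": ["a", "b", "c", "d"],
    "B1": ["e", "f", "g", "h"],
    "B2": ["i", "j", "k", "l"],
    "B3": ["m", "n", "o", "p"],
}

# The FIG. 16 example: bs0..bs3 select B0, B0, B1, B2, so SIMD0 and
# SIMD1 both receive [a, b, c, d], SIMD2 receives [e, f, g, h], and
# SIMD3 receives [i, j, k, l].
routed = distribute_operand_b(buffers, ["B0", "B0", "B1", "B2"])
```

In subsequent cycles only the four select entries change, so different data patterns can be fed to the SIMD engines without re-reading the Operand B register file.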
  • Another approach to effectively allow for multiple reads from the operand B register file 412 is to implement the operand B register file 412 using a plurality of register files, each of which allows a single read operation to be performed at a time.
  • having one large memory with 4 read ports can be more expensive than four smaller memories with one read port each.
  • the larger memory with 4 read ports offers more flexibility in terms of the data that can be read.
  • if four smaller memories with a single read port each are used, four entries can be read at a given time, but each of the entries has to belong to a different memory. This is not the case with a 4-read-port memory, which allows any 4 entries to be read simultaneously.
  • the main characteristics of the third architecture (Architecture 3B) for providing the input Operand B to the SIMD block 401 can be defined as follows. There is more than one Register File for Operand B. In one specific case, the number of Register Files for Operand B is equal to the number of SIMD engines included in SIMD block 401 . Thus, if there are four SIMD engines, then there will be four corresponding Operand B register files. However, other cases are possible and it is easy to extend the architecture to those cases.
  • the multiple Operand B Register Files can be read simultaneously.
  • each SIMD receives its input Operand B directly from one of the Operand B register files. If the number of SIMD engines is equal to the number of operand B register files, then each of the SIMD engines can receive an input Operand B from a corresponding one of the Operand B register files.
  • the Operand B distribution circuit 417 can contain operand buffers (similar to Architecture 2B) to hold the data read from the Operand B register files. This allows the same data to be used over multiple cycles. Also, the Operand B register files need not be read every cycle due to reuse of the buffered data.
  • a load flag can specify the cycles in which data needs to be read from the Operand B register files to the Operand B distribution circuit 417 .
  • a separate block can also specify the address of the buffer to load for every SIMD, as described above in connection with Architecture 2B.
  • FIG. 17 is a block diagram of the third architecture (Architecture 3B) for providing the input Operand B to the SIMD block 401 in accordance with one embodiment.
  • four Operand B register files 412 0 , 412 1 , 412 2 and 412 3 provide four corresponding Operand B register file words (e.g., [a0, b0, c0, d0], [e0, f0, g0, h0], [i0, j0, k0, l0] and [m0, n0, o0, p0]) to Operand B distribution circuit 417 .
  • the Operand B distribution circuit 417 routes the register file words provided by operand B register files 412 0 , 412 1 , 412 2 and 412 3 to buffer memories BM 0 , BM 1 , BM 2 and BM 3 , respectively, within Operand B distribution circuit 417 .
  • Operand B buffer select logic 1701 (which may be included in the state machine and scheduler 431 of control logic 430 ) is used to determine the manner in which the contents of Operand B buffers BM 0 , BM 1 , BM 2 and BM 3 are provided to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 .
  • Operand B buffer select logic 1701 includes four buffer select entries bms0, bms1, bms2 and bms3, which store values that specify which of the Operand B buffers BM 0 -BM 3 provide their contents to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively.
  • the four entries bms0, bms1, bms2 and bms3 specify operand B buffers BM 0 , BM 1 , BM 2 and BM 3 , respectively, indicating that the contents of operand B buffer BM 0 (i.e., [a0, b0, c0, d0]) are provided to SIMD 0 , the contents of operand B buffer BM 1 (i.e., [e0, f0, g0, h0]) are provided to SIMD 1 , the contents of operand B buffer BM 2 (i.e., [i0, j0, k0, l0]) are provided to SIMD 2 , and the contents of operand B buffer BM 3 (i.e., [m0, n0, o0, p0]) are provided to SIMD 3 .
  • the buffer selection may change by changing the buffer memory select entries bms0, bms1, bms2 and bms3.
  • FIG. 18 is a block diagram of the third architecture (Architecture 3B) for providing input Operand B to the SIMD block 401 in accordance with an alternate embodiment.
  • Operand B distribution circuit 417 includes double Operand B buffers B01-B02, B11-B12, B21-B22 and B31-B32, which store data provided by the Operand B register files 412 0 , 412 1 , 412 2 and 412 3 , respectively.
  • Operand B register file words [a0, b0, c0, d0] and [a1, b1, c1, d1] from operand B register file 412 0 are stored in Operand B buffers B02 and B01, respectively.
  • Operand B register file words [e0, f0, g0, h0] and [e1, f1, g1, h1] from operand B register file 412 1 are stored in Operand B buffers B12 and B11, respectively.
  • Operand B register file words [i0, j0, k0, l0] and [i1, j1, k1, l1] from operand B register file 412 2 are stored in Operand B buffers B22 and B21, respectively.
  • Operand B register file words [m0, n0, o0, p0] and [m1, n1, o1, p1] from operand B register file 412 3 are stored in Operand B buffers B32 and B31, respectively.
  • Operand B buffer select logic 1801 (which may be included in the state machine and scheduler 431 of control logic 430 ) is used to determine the manner in which the contents of Operand B buffers B01-B02, B11-B12, B21-B22 and B31-B32 are provided to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 .
  • Operand B buffer select logic 1801 includes four buffer select entries bs01, bs11, bs21 and bs31, which store values that specify which of the Operand B buffers B01-B02, B11-B12, B21-B22 and B31-B32 provide their contents to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively.
  • the four entries bs01, bs11, bs21 and bs31 specify operand B buffers B02, B12, B22 and B32, respectively, indicating that the contents of Operand B buffer B02 (i.e., [a0, b0, c0, d0]) are provided to SIMD 0 , the contents of Operand B buffer B12 (i.e., [e0, f0, g0, h0]) are provided to SIMD 1 , the contents of Operand B buffer B22 (i.e., [i0, j0, k0, l0]) are provided to SIMD 2 , and the contents of Operand B buffer B32 (i.e., [m0, n0, o0, p0]) are provided to SIMD 3 .
  • Operand B distribution circuit 417 includes switching/demultiplexing circuitry that performs the above-described routing in response to the buffer select entries bs01, bs11, bs21 and bs31.
  • Operand B buffer select logic 1801 can select any of the operand buffers B01-B02, B11-B12, B21-B22 and B31-B32 to provide input Operand B to any of the SIMD engines.
  • buffer select entry bs01 may store a value (B31) that causes the contents of Operand B buffer B31 (i.e., [m1, n1, o1, p1]) to be routed to SIMD 0 .
  • the buffer selection may change by changing the buffer select entries bs01, bs11, bs21 and bs31.
  • different numbers of Operand B buffers can be included in Operand B distribution circuit 417 .
  • in the fourth architecture (Architecture 4B), each Operand B register file allows two entries to be read at a time, with one register file word worth of data being chosen by applying shifting operations.
  • Control logic 430 specifies the addresses of the two rows to be read from each Operand B register file, as well as the amount of shift to be applied to the entries read from these two rows. This functionality is typically realized in hardware by implementing each Operand B register file memory as two banks of memory, which allows two entries to be read at the same time.
  • the two register file words are then fed into a shifting logic module that receives an amount of shift as an input parameter and outputs one register file word worth of data.
  • the addresses for the two banks and the amount of shift are supplied by state machine and scheduler 431 .
  • FIG. 19 is a block diagram of the fourth architecture (Architecture 4B) for providing input Operand B to the SIMD block 401 in accordance with one embodiment.
  • Operand B register files 1912 0 , 1912 1 , 1912 2 and 1912 3 include memory banks 1912 00 - 1912 01 , 1912 10 - 1912 11 , 1912 20 - 1912 21 and 1912 30 - 1912 31 , respectively.
  • Each of the memory bank pairs 1912 00 - 1912 01 , 1912 10 - 1912 11 , 1912 20 - 1912 21 and 1912 30 - 1912 31 stores different register file words.
  • memory bank 1912 00 stores register file words [a0, b0, c0, d0], [a2, b2, c2, d2] and [a4, b4, c4, d4]
  • memory bank 1912 01 stores register file words [a1, b1, c1, d1], [a3, b3, c3, d3] and [a5, b5, c5, d5].
  • Memory bank 1912 10 stores register file words [e0, f0, g0, h0], [e2, f2, g2, h2] and [e4, f4, g4, h4] and memory bank 1912 11 stores register file words [e1, f1, g1, h1], [e3, f3, g3, h3] and [e5, f5, g5, h5].
  • Memory bank 1912 20 stores register file words [i0, j0, k0, l0], [i2, j2, k2, l2] and [i4, j4, k4, l4] and memory bank 1912 21 stores register file words [i1, j1, k1, l1], [i3, j3, k3, l3] and [i5, j5, k5, l5].
  • Memory bank 1912 30 stores register file words [m0, n0, o0, p0], [m2, n2, o2, p2] and [m4, n4, o4, p4] and memory bank 1912 31 stores register file words [m1, n1, o1, p1], [m3, n3, o3, p3] and [m5, n5, o5, p5].
  • Register file words read from the memory bank pairs 1912 00 - 1912 01 , 1912 10 - 1912 11 , 1912 20 - 1912 21 and 1912 30 - 1912 31 are provided to shift logic circuit 1901 in Operand B distribution circuit 417 .
  • Outputs of shift logic circuit 1901 are provided to Operand B buffers B0, B1, B2 and B3 in Operand B distribution circuit 417 .
  • Control logic 430 controls the register file words read from memory banks 1912 00 - 1912 01 , 1912 10 - 1912 11 , 1912 20 - 1912 21 and 1912 30 - 1912 31 .
  • control logic 430 causes register file words to be simultaneously read from the memory banks 1912 00 - 1912 01 , 1912 10 - 1912 11 , 1912 20 - 1912 21 and 1912 30 - 1912 31 .
  • the addresses provided to each of the memory bank pairs may be selected such that two consecutive register file words are read from each memory bank pair, thereby providing the register file words necessary to perform a shifting operation.
  • register file words [a0, b0, c0, d0] and [a1, b1, c1, d1] may be simultaneously read from memory banks 1912 00 and 1912 01 , respectively; register file words [e0, f0, g0, h0] and [e1, f1, g1, h1] may be simultaneously read from memory banks 1912 10 and 1912 11 , respectively; register file words [i0, j0, k0, l0] and [i1, j1, k1, l1] may be simultaneously read from memory banks 1912 20 and 1912 21 , respectively; and register file words [m0, n0, o0, p0] and [m1, n1, o1, p1] may be simultaneously read from memory banks 1912 30 and 1912 31 , respectively.
  • the shift logic circuit 1901 receives the eight register file words provided by Operand B register files 1912 0 - 1912 3 .
  • Control logic 430 also controls the amount of shift introduced by shift logic circuit 1901 .
  • Table 1 defines the values provided by shift logic circuit 1901 to operand buffers B0-B3 in the present example, for various shift values. Note that each shift value introduces an additional 32-bit shift to the received pairs of register file words.
  • FIG. 19 illustrates the results for a shift value of 1.
  • the contents of operand B buffers B0, B1, B2 and B3 are routed to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as input operand B.
  • shifting may be efficiently performed within the register file words stored by Operand B register files 1912 0 - 1912 3 .
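The shift-select step of Architecture 4B can be sketched as a windowed read over two consecutive register file words. This is our behavioral reconstruction for illustration; the exact windowing convention (window starting at the shift offset, in 32-bit steps) is an assumption consistent with the statement that each shift value introduces an additional 32-bit shift:

```python
def shift_select(word_even, word_odd, shift):
    """Model of the shift logic: concatenate two consecutive register
    file words read from the paired memory banks, then select a window
    one register-file-word wide at a 32-bit (one-element) granularity.
    NOTE: the windowing convention is an assumption for illustration."""
    combined = word_even + word_odd          # eight 32-bit values
    return combined[shift:shift + len(word_even)]

# Shift of 0 returns the even-bank word unchanged.
unshifted = shift_select(["a0", "b0", "c0", "d0"],
                         ["a1", "b1", "c1", "d1"], 0)

# Shift of 1 (as illustrated in FIG. 19) slides the window by one
# 32-bit value into the odd-bank word: [b0, c0, d0, a1].
shifted = shift_select(["a0", "b0", "c0", "d0"],
                       ["a1", "b1", "c1", "d1"], 1)
```

Because the shift is applied after the banked reads, the same two register file words can be re-read with different shift amounts, which is the reuse property exploited for convolution-style access patterns.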
  • multiple architectures can be implemented together by sharing hardware resources.
  • the hardware can be programmed to operate different architectures as modes that can be chosen by a register.
  • Architectures 1B, 2B, 3B and 4B implement 32-bit input and 16-bit input modes, it is understood that these architectures can easily be modified to implement input modes of other widths (e.g., 8-bit input mode).
  • the Architectures 1B, 2B, 3B and 4B have been described in connection with embodiments that include 4 SIMD engines and a 128-bit register file word, other numbers of SIMD engines and register file word widths can be used in other embodiments in a straightforward manner.
  • control registers 432 can store values that configure Operand B distribution circuit 417 to implement Architecture 1B, 2B, 3B or 4B for Operand B in the manners described above.
  • Output circuit 420 ( FIG. 4 ) is used for storing (and specifying addresses for) the outputs of the SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 ). Each SIMD can write the output of an operation performed within the SIMD to a certain number of output registers within output circuit 420 .
  • FIG. 20 is a block diagram that shows each of the SIMD engines (SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 ) coupled to corresponding memory banks 2000 0 - 2000 3 , wherein each memory bank includes k rows, with each row forming an output register.
  • the control logic 430 specifies a row address within each of the memory banks 2000 0 - 2000 3 , such that previously stored accumulation values are read from the addressed output registers of the memory banks 2000 0 - 2000 3 , and are provided to the corresponding SIMD engines, SIMD 0 -SIMD 3 .
  • the SIMD engines (SIMD 0 -SIMD 3 ) perform multiply-accumulate operations to generate updated accumulation values, which are then written back to the addressed output registers within the corresponding memory banks 2000 0 - 2000 3 .
  • the row addresses of the output registers associated with each SIMD can be thought of as input signals to the SIMD engines.
  • the row address is the index of the row within the SIMD (referred to as relative index within the SIMD).
  • FIG. 21 is a block diagram that shows register select logic 2101 used to determine the manner in which the contents of output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are provided to SIMD engines SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively.
  • Register select logic 2101 is implemented within state machine and scheduler 431 of control logic 430 .
  • register select logic 2101 includes four register select entries R0, R1, R2 and R3, which store row address values that specify which of the output registers within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 provide their contents to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively (or store values received from SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively).
  • the four register select entries R0, R1, R2 and R3 specify the output registers in Row 1, Row (K-1), Row 0 and Row 2 of output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively (indicating that the contents of these output registers are provided to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively). Note that the selected output registers are highlighted in FIG. 21 . In subsequent cycles, the register selection may change by changing the register select entries R0, R1, R2 and R3.
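The per-SIMD output register selection and read-modify-write accumulation described above can be sketched as follows. This is an illustrative model using our own names; the row selections match the FIG. 21 example (Row 1, Row K-1, Row 0, Row 2):

```python
K = 4  # rows (output registers) per bank; "k" in the text

def accumulate(banks, row_select, products):
    """For each SIMD engine: read the selected output register, add the
    newly computed product, and write the sum back to the same register
    (the read-modify-write accumulation step)."""
    for simd, (row, product) in enumerate(zip(row_select, products)):
        banks[simd][row] += product
    return banks

# Four output banks (one per SIMD engine), all registers cleared.
banks = [[0] * K for _ in range(4)]

# Two accumulation cycles using the FIG. 21 row selections.
accumulate(banks, [1, K - 1, 0, 2], [10, 20, 30, 40])
accumulate(banks, [1, K - 1, 0, 2], [1, 2, 3, 4])
# banks[0][1] is now 11, banks[1][K-1] is 22,
# banks[2][0] is 33, and banks[3][2] is 44.
```

Because each SIMD engine owns its bank, all four read-modify-write accumulations can proceed in parallel without port conflicts.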
  • FIG. 22 is a block diagram of a hardware system 2200 that unifies the various architectures and features proposed above for the different operands and outputs.
  • the system 2200 includes (1) an operand block 2210 that includes one or more register files for storing each of the operands (Operand A and Operand B), (2) an input distribution block (IDB) 2220 that includes one or more buffers for each of the operands and a logic block for each of the operands, (3) a SIMD block 2230 that includes one or more SIMD engines, and (4) an output block 2240 that includes one or more output register files for each of the SIMD engines.
  • operand block 2210 includes operand A register file(s) 2211 and operand B register file(s) 2212 , which may be used to implement the various embodiments of Operand A register file 411 and Operand B register file 412 described above.
  • Input distribution block (IDB) 2220 includes Operand A IDB buffers 2221 and Operand A IDB logic 2223 , which may be used to implement the various embodiments of Operand A distribution circuit 416 described above.
  • Input distribution block 2220 also includes Operand B IDB buffers 2222 and Operand B IDB shift logic 2224 , which may be used to implement the various embodiments of Operand B distribution circuit 417 described above.
  • SIMD block 2230 , which may be used to implement the various embodiments of SIMD block 401 described above, includes SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 .
  • Output block 2240 , which may be used to implement the various embodiments of output circuit 420 described above, includes output register files 2241 , 2242 , 2243 and 2244 , which are coupled to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively.
  • three control signals OP_A_RF_SRC_ADDR_SEL, OP_A_RF_DEST_ADDR_SEL and OP_A_RF_LOAD_FLAG, are used to control the operation of Operand A register files 2211 .
  • the OP_A_RF_LOAD_FLAG and OP_B_RF_LOAD_FLAG signals specify if data needs to be transferred from the Operand A register files 2211 and the Operand B register files 2212 , respectively, to the Operand A IDB buffers 2221 and Operand B IDB buffers 2222 , respectively. If the OP_A_RF_LOAD_FLAG signal has a value of 1, then the two associated control signals (OP_A_RF_SRC_ADDR_SEL and OP_A_RF_DEST_ADDR_SEL) specify the source and destination addresses for the Operand A data.
  • similarly, if the OP_B_RF_LOAD_FLAG signal has a value of 1, the two associated control signals (OP_B_RF_SRC_ADDR_SEL and OP_B_RF_DEST_ADDR_SEL) specify the source and destination addresses for the Operand B data. Note that not all embodiments will require destination addresses for the Operand A and Operand B data (i.e., if there is only one possible destination for the Operand A or Operand B data). If the OP_A_RF_LOAD_FLAG signal or the OP_B_RF_LOAD_FLAG signal has a value of 0, no data is read or transferred from the corresponding Operand A register files 2211 or Operand B register files 2212 .
  • the OP_A_RF_LOAD_FLAG signal and the OP_B_RF_LOAD_FLAG signal can be generated by state machine and scheduler 431 of control logic 430 .
  • the OP_A_RF_SRC_ADDR_SEL signal specifies the row address in the Operand A register file(s) 2211 to be read. In the modes of operation described above, the OP_A_RF_SRC_ADDR_SEL signal will include just one address, which specifies the row of the Operand A register file to be read.
  • the OP_B_RF_SRC_ADDR_SEL signal specifies the row address(es) in the Operand B register file(s) 2212 to be read. Depending on the mode of operation, the OP_B_RF_SRC_ADDR_SEL signal can be just one address (Architecture 1B or 2B), four addresses (Architecture 3B) or eight addresses (Architecture 4B).
  • the hardware has appropriate modes to handle the different cases, wherein these modes are specified by control registers 432 of control logic 430 .
  • the above-described source addresses can be generated in hardware by state machine and scheduler 431 of control logic 430 .
  • the OP_A_RF_DEST_ADDR_SEL signal specifies the destination address in the Operand A IDB buffers 2221 , into which the data read from the Operand A register files 2211 is transferred.
  • the OP_B_RF_DEST_ADDR_SEL signal specifies the destination addresses in the Operand B IDB buffers 2222 (or shift logic 2224 ) to which the data read from the Operand B register files 2212 is transferred.
  • these addresses can be a single address or multiple addresses.
  • the addresses can be generated in hardware by state machine and scheduler 431 of control logic 430 .
  • SIMD engines share the operand register files and control logic. This results in higher compute density, i.e., more computation capacity per unit silicon area. Sharing of operand register files and control logic also saves power and SRAM bandwidth. The savings in SRAM bandwidth come from the fact that only two operand register files need to be written in order to support multiple SIMD engines.
  • the input distribution block 2220 includes buffers 2221 - 2222 to hold the Operand data and logic blocks 2223 - 2224 to manipulate the data.
  • Operand A values from register files 2211 are stored in buffers 2221 before being provided to Operand A IDB logic 2223 (which performs the switching/demultiplexing functions described above).
  • Operand B values from register files 2212 are routed through shift logic 2224 before being stored in buffers 2222 .
  • Operand A data is first loaded and then manipulated to obtain the inputs to SIMD block 2230
  • Operand B data is first manipulated (shifted) and then buffered.
  • the input distribution block 2220 acts as a small cache from which the SIMD engines are fed the operands. Input distribution block 2220 allows multiple SIMD engines to run in parallel using a single control circuit. As described above, the data from Operand A register file 2211 can be manipulated so that it provides distinct data to multiple SIMD engines.
  • the OP_A_IDB_ADDR_SEL signal specifies the address of the Operand A IDB buffer to be used for each SIMD.
  • buffers 2221 can hold multiple register file words and each of the SIMD engines can possibly choose a different register file word from the buffers 2221 .
  • this single buffer is actually implemented as a double buffer so that read and write operations do not occur on the same buffer in one cycle.
  • the hardware manages this double buffer in a transparent way. Hence, this signal is internally managed by hardware in most cases.
  • the OP_A_IDB_DATA_SEL signal, which controls the Operand A IDB logic 2223 , specifies the data that needs to be transferred to each SIMD. For example, in Architecture 2A, a single value is effectively replicated and broadcast to all SIMD engines; this signal specifies the index of the value that needs to be replicated. Similarly, in Architecture 3A, four consecutive values are taken from a register file word and each of them is effectively replicated and sent to one SIMD. In this case, the OP_A_IDB_DATA_SEL signal specifies the index of the value that needs to be replicated for SIMD 0 ; for the other SIMD engines (SIMD 1 -SIMD 3 ), the index values are incremented. For Architecture 1A, the OP_A_IDB_DATA_SEL signal is not needed, because the full register file word stored in buffer 2221 is sent to all of the SIMD engines.
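The Architecture 3A index behavior of OP_A_IDB_DATA_SEL can be sketched as a base index plus per-SIMD increments. This is an illustrative model with our own names, not the actual signal encoding:

```python
def operand_a_data_select(buffered_word, base_index, num_simds=4):
    """Model of the Architecture 3A data select: the base index picks
    the value replicated to SIMD0, and each subsequent SIMD engine takes
    the next consecutive value from the same buffered register file word."""
    return [buffered_word[base_index + s] for s in range(num_simds)]

# 16-bit input mode: one 128-bit register file word holds 8 values.
word = ["a", "b", "c", "d", "e", "f", "g", "h"]

# With a base index of 4, SIMD0..SIMD3 receive e, f, g and h,
# each effectively replicated across the lanes of its SIMD engine.
selected = operand_a_data_select(word, 4)
```

Only the single base index needs to be supplied by the control logic; the remaining three indices are implied, which keeps the select signal narrow.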
  • the OP_B_IDB_SHIFT_SEL signal, which controls the Operand B IDB shift logic 2224 , is used to control the manner in which register file words received from Operand B register files 2212 are shifted (i.e., when two register file words are read from the same Operand B register file).
  • the OP_B_IDB_SHIFT_SEL signal (and the Operand B IDB shift logic 2224 ) is only required when the system 2200 is implementing Architecture 4B.
  • the OP_B_IDB_SHIFT_SEL signal specifies how the two register file words need to be manipulated to produce one register file word (in the manner described above).
  • Convolution operations typically involve data shifts. Locating the Operand B shift logic 2224 between the Operand B register files 2212 and the SIMD engines advantageously reduces hardware overhead by allowing data to be read from the Operand B register files 2212 multiple times, with different shifts applied to the data each time. If the shift logic 2224 is not implemented in this manner, the shifted data would need to be written to Operand B register file 2212 , and therefore could not be reused as many times as in the proposed architecture.
  • the OP_B_IDB_ADDR_SEL value specifies the addresses of the Operand B IDB buffers that will provide their contents as inputs for each of the SIMD engines. This signal was illustrated for Architecture 2B in FIG. 16 . This is one of the most important signals in the architecture, providing significant flexibility in the types of computations that can be performed.
  • the OP_B_IDB_ADDR_SEL value typically comes from state machine and scheduler 431 for the mode corresponding to Architecture 2B but can also be managed by hardware in cases where data access patterns are predictable.
  • the Operand B buffers 2222 in the input distribution block 2220 allow different SIMD engines to potentially receive different Operand B data in a given cycle.
  • Using four Operand B buffers 2222 (i.e., the same number as the number of SIMD engines) allows four simultaneous reads, so that each SIMD receives different data. This is much less expensive (from a hardware perspective) than implementing the Operand B register file 2212 with a four-port memory (which would also allow four simultaneous read operations to supply SIMD 0 -SIMD 3 ).
  • Providing Operand B buffers 2222 to buffer a small number of register words from the Operand B register files 2212 effectively provides a small cache that can be accessed by any of SIMD 0 -SIMD 3 . This presents a good compromise between hardware complexity and the required flexibility for some classes of algorithms.
  • the SIMD block 2230 includes one or more SIMD engines, which perform the actual computations on the data provided by the input distribution block 2220 . Because SIMD engines can support different types of operations, the operation to be performed must be provided as an input. Thus, the SIMD_OPERATION_SEL value is used to specify the operations to be performed by the SIMD engines. In theory, different SIMD engines can perform different operations, but in general, the same operation select value SIMD_OPERATION_SEL is used to drive all of the SIMD engines.
  • the results of the computations performed by the SIMD engines need to be written to output register files 2241 - 2244 within output block 2240 . Also, for operations like accumulation, previously accumulated values need to be read from the output register files 2241 - 2244 (and provided to the SIMD engines). Generally, the accumulated values are written back into the same location as the previously accumulated values. However, for the sake of generality, two control values, OUTPUT_RF_ADDR_SEL_0 and OUTPUT_RF_ADDR_SEL_1, are provided to output block 2240 , thereby allowing the read and write addresses of each of the output register files 2241 - 2244 to be specified separately.
  • control value OUTPUT_RF_ADDR_SEL_0 specifies the write addresses to each of the output register files 2241 - 2244
  • control value OUTPUT_RF_ADDR_SEL_1 specifies the read addresses to each of the output register files 2241 - 2244 .
  • the separate addressing of each of the output register files 2241 - 2244 advantageously provides flexibility with regard to the types of operations that can be performed by the described system architecture. Some examples of this flexibility are described in more detail below.
  • FIG. 23 is a block diagram of a computer architecture 2300 which can be used to perform matrix multiplication in accordance with one embodiment of the present invention.
  • FIG. 24 is a diagram illustrating two matrices I and J to be multiplied by the computer architecture 2300 of FIG. 23 .
  • Matrix I has 64 rows and 16 columns, and matrix J has 16 rows and 4 columns. Each row of matrix I may represent a weight vector, while each column of matrix J may represent an activation vector in a machine learning system.
  • Matrix I includes 1024 (32-bit) values w 0,0 to w 63,15 , as illustrated.
  • Matrix J includes 64 (32-bit) values a 0 -a 15 , b 0 -b 15 , c 0 -c 15 and d 0 -d 15 , as illustrated.
  • FIG. 25 is a block diagram illustrating the manner in which the contents of Matrix I and Matrix J are logically stored within system memory 440 .
  • Matrix I is stored in an Operand A memory block 441 that includes 256 rows, each row including four weight values.
  • the first row of Operand A memory block 441 (Row 0) includes weight values [w 0,0 , w 1,0 , w 2,0 , w 3,0 ].
  • the remaining columns (Col. 1-Col. 15) of Matrix I are stored in consecutive sets of sixteen rows within Operand A memory block 441 , as illustrated.
  • Matrix J is stored in an Operand B memory block 442 that includes 16 rows, each row including four activation values.
  • the first row of Operand B memory block 442 includes activation values [d 0 , c 0 , b 0 , a 0 ] included in the first row of Matrix J.
  • the remaining rows (Row 1-Row 15) of Matrix J are stored in consecutive rows (Row 1-Row 15) of Operand B memory block 442 .
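The layouts of FIG. 25 can be reconstructed in a few lines of NumPy. The weight and activation values below are arbitrary stand-ins; only the addressing pattern is taken from the description above (Matrix I stored column by column, four weights per row; Matrix J stored one row per row, in [d, c, b, a] order).

```python
import numpy as np

I = np.arange(64 * 16).reshape(64, 16)   # stand-in for weights w r,c
J = np.arange(16 * 4).reshape(16, 4)     # stand-in columns a, b, c, d

# Operand A memory block 441: 256 rows of four weights, Col 0 first.
operand_a = [list(I[r:r + 4, c]) for c in range(16) for r in range(0, 64, 4)]
# Operand B memory block 442: 16 rows of four activations, as [d, c, b, a].
operand_b = [list(J[r, ::-1]) for r in range(16)]

assert len(operand_a) == 256 and len(operand_b) == 16
assert operand_a[0] == [I[0, 0], I[1, 0], I[2, 0], I[3, 0]]  # Row 0
```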
  • State machine and scheduler 431 causes operand packing logic 433 to retrieve the entries w 0,0 , w 1,0 , w 2,0 and w 3,0 from the first row of Operand A memory block 441 , and to retrieve the entries a 0 , b 0 , c 0 and d 0 from the first row of Operand B memory block 442 .
  • State machine and scheduler 431 writes the retrieved entries w 0,0 , w 1,0 , w 2,0 and w 3,0 to Operand A register file 411 , and writes the retrieved entries a 0 , b 0 , c 0 and d 0 to Operand B register file 412 . This result is illustrated in FIG. 23 .
  • Operand A distribution circuit 416 within input distribution block 415 is controlled to route the entries w 0,0 , w 1,0 , w 2,0 and w 3,0 from Operand A register file 411 as ‘Operand A’ to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively (in the manner specified by FIG. 11 above).
  • Operand B distribution circuit 417 within input distribution block 415 is controlled to route the entries a 0 , b 0 , c 0 and d 0 from Operand B register file 412 to each of SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 (in the manner specified by FIG. 14 above).
  • SIMD 0 -SIMD 3 multiplies the corresponding entries of Operand A and Operand B (e.g., SIMD 0 performs (a 0 ⁇ w 0,0 ), (b 0 ⁇ w 0,0 ), (c 0 ⁇ w 0,0 ) and (d 0 ⁇ w 0,0 )) to generate corresponding products.
  • FIG. 26 illustrates the mapping of the contents of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 in accordance with the present example.
  • Each entry of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 is initially set to a zero value.
  • Each entry of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 corresponds with a dot product of the matrix multiplication.
  • Each dot product is specified by a row of matrix I and a column of matrix J.
  • the entry of output register set 2000 0 labeled (w 0,i a i ) stores the dot product of Row 0 of matrix I (w 0,0 , w 0,1 , . . . , w 0,15 ) and Col 0 of matrix J (a 0 , a 1 , . . . , a 15 ).
  • State machine and scheduler 431 controls addressing of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 in parallel. During the initial calculation (described above and illustrated in FIG. 23 ), state machine and scheduler 431 addresses Row 0 of each of the output register sets 2000 0 - 2000 3 . As a result, the zero values stored in Row 0 of the output register sets 2000 0 - 2000 3 are provided to SIMD 0 -SIMD 3 , respectively.
  • each of SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 performs accumulation operations, wherein the zero values retrieved from the output register sets 2000 0 - 2000 3 are added to the products calculated by SIMD 0 -SIMD 3 . The accumulated values are then written back to Row 0 of the corresponding output register sets 2000 0 - 2000 3 .
  • the zero values from the entries (w 0,i d i ), (w 0,i c i ), (w 0,i b i ) and (w 0,i a i ) of Row 0 of output register set 2000 0 are provided to SIMD 0 .
  • SIMD 0 then adds the calculated products (w 0,0 ⁇ d 0 ), (w 0,0 ⁇ c 0 ), (w 0,0 ⁇ b 0 ) and (w 0,0 ⁇ a 0 ) to these retrieved zero values to create accumulated values.
  • SIMD 0 then writes these accumulated values back to the entries (w 0,i d i ), (w 0,i c i ), (w 0,i b i ) and (w 0,i a i ) of Row 0 of output register set 2000 0 . Similar operations are performed by SIMD 1 -SIMD 3 .
  • State machine and scheduler 431 increments the address used to access Operand A memory block 441 , causing the next row of values (i.e., w 4,0 , w 5,0 , w 6,0 , and w 7,0 ) to be retrieved and stored in Operand A register file 411 .
  • Operand A distribution circuit 416 routes these received values in the same manner described above in connection with FIG. 23 . That is, the values w 4,0 , w 5,0 , w 6,0 , and w 7,0 are provided to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively. Operand B remains unchanged at this time.
  • SIMD 0 -SIMD 3 multiplies the corresponding entries of Operand A and Operand B (e.g., SIMD 0 performs (a 0 ⁇ w 4,0 ), (b 0 ⁇ w 4,0 ), (c 0 ⁇ w 4,0 ) and (d 0 ⁇ w 4,0 )) thereby providing corresponding products.
  • state machine and scheduler 431 increments the row address of each of the output register sets 2000 0 - 2000 3 , thereby addressing Row 1 within each of these output register sets.
  • the zero values stored in Row 1 of the output register sets 2000 0 - 2000 3 are provided to SIMD 0 -SIMD 3 .
  • each of SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 performs accumulation operations, wherein the zero values retrieved from the output register sets 2000 0 - 2000 3 are added to the products calculated by the SIMD engines. The accumulated values are then written back to the output register sets 2000 0 - 2000 3 .
  • the zero values from the entries (w 4,i d i ), (w 4,i c i ), (w 4,i b i ) and (w 4,i a i ) of Row 1 of output register set 2000 0 are provided to SIMD 0 .
  • SIMD 0 then adds the calculated products (w 4,0 ⁇ d 0 ), (w 4,0 ⁇ c 0 ), (w 4,0 ⁇ b 0 ) and (w 4,0 ⁇ a 0 ) to these retrieved zero values to create accumulated values.
  • SIMD 0 then writes these accumulated values back to the entries (w 4,i d i ), (w 4,i c i ), (w 4,i b i ) and (w 4,i a i ) of Row 1 of output register set 2000 0 . Similar operations are performed by SIMD 1 -SIMD 3 .
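The two accumulation steps just described reduce to a few lines of plain Python. The numeric values here are made up for illustration; only the dataflow (SIMD 0 multiplies one weight by [d, c, b, a] and accumulates into successive rows of output register set 2000 0 ) follows the text.

```python
w_0_0, w_4_0 = 3, 5      # stand-ins for weights w 0,0 and w 4,0
b = [7, 2, 4, 1]         # stand-ins for activations [d0, c0, b0, a0]

out_rf0 = [[0] * 4 for _ in range(16)]   # output register set 2000_0

# First step: products of w 0,0 accumulate into Row 0 (initially zero).
out_rf0[0] = [acc + w_0_0 * x for acc, x in zip(out_rf0[0], b)]
# Second step: products of w 4,0 accumulate into Row 1.
out_rf0[1] = [acc + w_4_0 * x for acc, x in zip(out_rf0[1], b)]

assert out_rf0[0] == [21, 6, 12, 3]
assert out_rf0[1] == [35, 10, 20, 5]
```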
  • Operand A distribution circuit 416 sequentially routes all (64) of the weight values w 0,0 to w 63,0 from the first column (Col 0) of Matrix I to SIMD 0 -SIMD 3 as Operand A values in the manner described above.
  • After the weight values from the first column (Col 0) of Matrix I have been used to perform multiply-accumulate operations (e.g., after products associated with values w 0,0 to w 63,0 have been calculated), state machine and scheduler 431 resets the addresses of output register sets 2000 0 - 2000 3 to Row 0. In addition, state machine and scheduler 431 increments the address used to access Operand A memory block 441 , such that the values (w 0,1 , w 1,1 , w 2,1 , w 3,1 ) are retrieved and stored in Operand A register file 411 .
  • Operand A distribution circuit 416 routes these values (w 0,1 , w 1,1 , w 2,1 , w 3,1 ) to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively (in the same manner described above in connection with FIG. 23 ).
  • State machine and scheduler 431 also increments the address used to access Operand B memory block 442 by one, such that values (a 1 , b 1 , c 1 , d 1 ) from Row 1 of Operand B memory block 442 are retrieved and stored in Operand B register file 412 .
  • Operand B distribution circuit 417 routes these values (a 1 , b 1 , c 1 , d 1 ) to each of SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 (in the same manner that values (a 0 , b 0 , c 0 , d 0 ) were previously routed to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 in FIG. 23 ).
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform multiply-accumulate operations on the received values, and the results are stored in Row 0 of the output registers 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, in the manner described above.
  • State machine and scheduler 431 increments the address used to access Operand A memory block 441 , causing the next row of values (i.e., w 4,1 , w 5,1 , w 6,1 , and w 7,1 ) to be retrieved and stored in Operand A register file 411 .
  • Operand A distribution circuit 416 routes these received values in the same manner described above in connection with FIG. 23 . That is, the values w 4,1 , w 5,1 , w 6,1 , and w 7,1 are provided to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively.
  • Operand B (a 1 , b 1 , c 1 , d 1 ) remains unchanged at this time.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform multiply-accumulate operations on these received values, and the results are stored in Row 1 of the output registers 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, in the manner described above.
  • Operand A distribution circuit 416 sequentially routes all (64) of the weight values w 0,1 to w 63,1 from the second column (Col 1) of Matrix I to SIMD 0 -SIMD 3 as Operand A values (while Operand B (a 1 , b 1 , c 1 , d 1 ) remains unchanged).
  • After all sixteen columns of Matrix I have been processed in this manner, the output register sets 2000 0 - 2000 3 store the dot product of each row of matrix I with each column of matrix J.
  • the entry (w 0,i a i ) of output register set 2000 0 stores the dot product of Row 0 of matrix I and Col 0 of matrix J
  • the entry (w 29,i d i ) of output register set 2000 1 stores the dot product of Row 29 of matrix I and Col 3 of matrix J.
  • the present invention provides an efficient structure for multiplying matrix I and matrix J.
  • SIMD 0 -SIMD 3 provide a high degree of processing parallelism (e.g., sixteen parallel multiply-accumulate operations at a time), which advantageously reduces the time required to perform the matrix multiplication.
  • the control circuitry required to implement the matrix multiplication is advantageously simple. Address inputs to Operand A memory block 441 and output register sets 2000 0 - 2000 3 are simply incremented after each multiply-accumulate operation, and the address input to Operand B memory block 442 is simply incremented after every 16 multiply-accumulate operations.
  • Input distribution block 415 advantageously maintains the same configuration during the entire matrix multiplication.
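Putting the pieces together, the entire dense schedule can be simulated in NumPy and checked against an ordinary matrix product. This is a behavioral sketch under the assumptions above: the lane ordering is simplified to a-d rather than the [d, c, b, a] ordering of FIG. 25, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.integers(-8, 8, (64, 16))   # Matrix I (weights)
J = rng.integers(-8, 8, (16, 4))    # Matrix J (activations)

# Four output register sets, each 16 rows x 4 lanes.
out = np.zeros((4, 16, 4), dtype=I.dtype)

for col in range(16):                 # Operand B address: +1 per column
    b = J[col]
    for step in range(16):            # Operand A address: +1 per cycle
        a = I[4 * step:4 * step + 4, col]       # four weights per cycle
        for engine in range(4):       # SIMD0-SIMD3 operate in parallel
            out[engine, step] += a[engine] * b  # multiply-accumulate

# Row r of Matrix I lands in row r // 4 of output register set r % 4.
result = np.array([out[r % 4, r // 4] for r in range(64)])
assert np.array_equal(result, I @ J)
```

Note how simple the addressing is: the inner loops only ever increment, exactly as described above.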
  • a matrix that contains a large number of zero value entries is referred to as a ‘sparse’ matrix.
  • for example, a matrix in which 7/8 or more of the entries have zero values may be referred to as a sparse matrix.
  • Multiplication involving a sparse matrix may involve a large number of unnecessary operations.
  • In the present example, 7/8 of the entries of Matrix I have zero values.
  • As a result, only 512 (i.e., 16×16×16×(1/8)) multiply-accumulate operations are required to multiply Matrix I and Matrix J.
  • Absent special handling, however, all 4096 (16×16×16) operations would be performed by the method described above in connection with FIGS. 23-26 .
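The operation counts above are easy to check: Matrix I is 64×16 and Matrix J is 16×4, so a dense multiplication needs 64×16×4 = 4096 scalar multiply-accumulates (equivalently 16×16×16, counting 16 scalar multiplies per cycle over 256 cycles); with only 1/8 of Matrix I non-zero, just 512 of them are useful.

```python
dense_macs = 64 * 16 * 4        # rows of I x cols of I x cols of J
sparse_macs = dense_macs // 8   # only 1/8 of Matrix I is non-zero
assert (dense_macs, sparse_macs) == (4096, 512)
```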
  • a method for using the structure of FIGS. 23-26 for performing multiplication with a sparse matrix is provided.
  • Matrix I is a sparse matrix, wherein only one eighth of the entries of Matrix I have non-zero values.
  • processing is sequentially performed for each column of Matrix I (e.g., column 0 of Matrix I is initially processed, followed by column 1 of Matrix I, etc.).
  • column 0 of Matrix I is initially processed, followed by column 1 of Matrix I, etc.
  • the processing of the first column of Matrix I will be described, with the understanding that the remaining columns of Matrix I are processed in the same manner.
  • operand packaging logic 433 identifies the row addresses of the non-zero values within Matrix I.
  • operand packing logic 433 determines that the non-zero values w 3,0 , w 5,0 , w 8,0 , w 10,0 , w 11,0 , w 24,0 , w 58,0 , and w 61,0 are located in rows 3, 5, 8, 10, 11, 24, 58 and 61, respectively, of Matrix I.
  • operand packing logic 433 determines which of the output register sets 2000 0 - 2000 3 are used to store the dot products associated with the identified non-zero values.
  • this determination is made by dividing the row address of the non-zero value within Matrix I by ‘4’, and then using the remainder (R) of this division operation to identify the output register set (wherein the remainder (R) identifies output register set 2000 R ).
  • Operand packing logic 433 also determines the row within the output register set where the dot product is stored. In general, this determination is made by dividing the row address of the non-zero value within Matrix I by ‘4’ and using the quotient (i.e., ignoring the remainder (R)).
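The two determinations above amount to a single `divmod`: for a non-zero value in row r of Matrix I, the quotient of r / 4 is the row within the output register set, and the remainder selects the set. The function name below is illustrative.

```python
def map_nonzero(row, num_engines=4):
    # Returns (output register set index R, row within set 2000_R).
    out_row, out_set = divmod(row, num_engines)
    return out_set, out_row

# The non-zero rows of Column 0 in the example: 3, 5, 8, 10, 11, 24, 58, 61.
mapping = {r: map_nonzero(r) for r in (3, 5, 8, 10, 11, 24, 58, 61)}
assert mapping[8] == (0, 2)    # w 8,0  -> Row 2 of set 2000_0
assert mapping[24] == (0, 6)   # w 24,0 -> Row 6 of set 2000_0
assert mapping[3] == (3, 0)    # w 3,0  -> Row 0 of set 2000_3
```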
  • output register set 2000 0 includes the dot products [(w 8,i d i ) (w 8,i c i ) (w 8,i b i ) (w 8,i a i )] in Row 2 of output register set 2000 0 , and the dot products [(w 24,i d i ), (w 24,i c i ), (w 24,i b i ) (w 24,i a i )] in Row 6 of output register set 2000 0 .
  • operand packing logic 433 sorts (packs) the non-zero values w 3,0 , w 5,0 , w 8,0 , w 10,0 , w 11,0 , w 24,0 , w 58,0 , and w 61,0 of Column 0 of Matrix I into Operand A memory block 441 as follows.
  • the first non-zero values to have dot products stored in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are stored in the first row (Row 0) of Operand A memory block 441 .
  • non-zero values w 8,0 , w 5,0 , w 10,0 and w 3,0 which have dot products in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, are stored in Row 0 of Operand A memory block 441 .
  • the next non-zero values to have dot products stored in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are stored in the second row (Row 1) of Operand A memory block 441 .
  • non-zero values w 24,0 , w 61,0 , w 58,0 , and w 11,0 which have dot products in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, are stored in Row 1 of Operand A memory block 441 .
  • the above-described sorting/packing of the non-zero values of Column 0 of matrix I into the Operand A memory block 441 is illustrated in FIG. 27 .
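For a single column, the sort/pack step can be sketched as a per-set FIFO: each non-zero row index is queued under its target output register set (row mod 4), and each Operand A memory row then takes one entry from each queue. The names here are illustrative; the asserts reproduce the two packed rows described above.

```python
from collections import defaultdict

nonzero_rows = [3, 5, 8, 10, 11, 24, 58, 61]   # rows of w r,0 in Column 0

queues = defaultdict(list)
for r in nonzero_rows:
    queues[r % 4].append(r)        # queue per target output register set

packed = []
while any(queues.values()):
    # One Operand A memory row: one entry for each of SIMD0-SIMD3.
    packed.append([queues[s].pop(0) if queues[s] else None
                   for s in range(4)])

assert packed[0] == [8, 5, 10, 3]      # Row 0: w 8,0 w 5,0 w 10,0 w 3,0
assert packed[1] == [24, 61, 58, 11]   # Row 1: w 24,0 w 61,0 w 58,0 w 11,0
```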
  • Operand A register file 411 stores the non-zero weight values w 8,0 , w 5,0 , w 10,0 and w 3,0 of Matrix I
  • Operand B register file 412 stores the activation values d 0 , c 0 , b 0 and a 0 of Matrix J.
  • State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w 8,0 , w 5,0 , w 10,0 and w 3,0 , from Operand A register file 411 to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as Operand A.
  • state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d 0 , c 0 , b 0 and a 0 to each of the SIMD engines as Operand B.
  • FIG. 27 is a block diagram illustrating the above-described configuration.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 multiply the received Operands A and B in the manner described above.
  • State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 that are associated with the non-zero values w 8,0 , w 5,0 , w 10,0 and w 3,0 . That is, state machine and scheduler 431 addresses Row 2, Row 1, Row 2 and Row 0 within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively. As described above, all rows of the output register sets initially store ‘0’ values.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 then write the accumulated values to the addressed rows (Row 2, Row 1, Row 2 and Row 0, respectively) of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • state machine and scheduler 431 then retrieves the non-zero values w 24,0 , w 61,0 , w 58,0 and w 11,0 , from the second row of Operand A memory block 441 , and stores these non-zero values in Operand A register file 411 .
  • Operand A register file 411 stores the non-zero weight values w 24,0 , w 61,0 , w 58,0 and w 11,0 of Matrix I
  • Operand B register file 412 stores the activation values d 0 , c 0 , b 0 and a 0 of Matrix J
  • State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w 24,0 , w 61,0 , w 58,0 and w 11,0 , from Operand A register file 411 to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as Operand A.
  • state machine and scheduler 431 continues to cause Operand B distribution circuit 417 to route the values d 0 , c 0 , b 0 and a 0 to each of the SIMD engines as Operand B.
  • These values d 0 , c 0 , b 0 and a 0 are routed from Row 0 of the Operand B memory block 442 (i.e., Row 0 of Matrix J) because each of the Operand A values w 24,0 , w 61,0 , w 58,0 and w 11,0 is from Column 0 of Matrix I.
  • FIG. 28 is a block diagram illustrating the above-described configuration.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 multiply the received Operands A and B in the manner described above.
  • State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 that are associated with the non-zero values w 24,0 , w 61,0 , w 58,0 and w 11,0 . That is, state machine and scheduler 431 addresses Row 6, Row 15, Row 14 and Row 2 within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 then write the accumulated values to the addressed rows (Row 6, Row 15, Row 14 and Row 2, respectively) of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • the SIMD engines are kept busy (i.e., perform multiply-accumulate operations for non-zero matrix values), while minimizing the number of multiply-accumulate operations required to perform the multiplication of ‘sparse’ Matrix I and Matrix J.
  • the computer architecture performs multiplication of a sparse matrix in a highly efficient (and fast) manner.
  • the first sixteen non-zero entries in the first three columns of Matrix I are entries w 2,0 , w 12,0 , w 32,0 , w 38,0 , w 45,0 , w 56,0 (in Col. 0 of Matrix I); w 7,1 , w 14,1 , w 21,1 , w 25,1 , w 37,1 , w 43,1 (in Col. 1 of Matrix I); and w 8,2 , w 10,2 , w 23,2 and w 51,2 (in Col. 2 of Matrix I).
  • Operand packing logic 433 identifies the row addresses of the non-zero values within Matrix I (e.g., non-zero entry w 2,0 is located in row 2 of Matrix I). Using this row address information, operand packing logic 433 determines which of the output register sets 2000 0 - 2000 3 are used to store the dot products associated with the identified non-zero values in the manner described above. Operand packing logic 433 also determines the row within the output register set where the dot product is stored, in the manner described above.
  • operand packing logic 433 determines that the dot products associated with non-zero entries w 12,0 , w 32,0 , w 56,0 and w 8,2 are mapped to rows 3, 8, 14 and 2, respectively, of output register set 2000 0 ; the dot products associated with non-zero entries w 45,0 , w 21,1 , w 25,1 and w 37,1 are mapped to rows 11, 5, 6 and 9, respectively, of output register set 2000 1 ; the dot products associated with non-zero entries w 2,0 , w 38,0 , w 14,1 and w 10,2 are mapped to rows 0, 9, 3 and 2, respectively, of output register set 2000 2 ; and the dot products associated with non-zero entries w 7,1 , w 43,1 , w 23,2 and w 51,2 are mapped to rows 1, 10, 5 and 12, respectively, of output register set 2000 3 .
  • No non-zero entries of column 1 of Matrix I are mapped to output register set 2000 0 , three non-zero entries (w 21,1 , w 25,1 , w 37,1 ) of column 1 of Matrix I are mapped to output register set 2000 1 , one non-zero entry (w 14,1 ) of column 1 of Matrix I is mapped to output register set 2000 2 , and two non-zero entries (w 7,1 and w 43,1 ) of column 1 of Matrix I are mapped to output register set 2000 3 .
  • operand packing logic 433 sorts (packs) the non-zero values of columns 0, 1 and 2 of Matrix I into Operand A memory block 441 as follows.
  • the first non-zero values to have dot products stored in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are stored in the first row (Row 0) of Operand A memory block 441 .
  • non-zero values w 12,0 , w 45,0 , w 2,0 and w 7,1 which have dot products in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, are stored in Row 0 of Operand A memory block 441 .
  • the next non-zero values to have dot products stored in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are stored in the second row (Row 1) of Operand A memory block 441 .
  • non-zero values w 32,0 , w 21,1 , w 38,0 , and w 23,2 which have dot products in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, are stored in Row 1 of Operand A memory block 441 .
  • the next non-zero values to have dot products stored in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 are stored in the third row (Row 2) of Operand A memory block 441 .
  • non-zero values w 8,2 , w 37,1 , w 10,2 , and w 43,1 which have dot products in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively, are stored in Row 2 of Operand A memory block 441 .
  • Operand A register file 411 stores the non-zero weight values w 12,0 , w 45,0 , w 2,0 and w 7,1 of Matrix I. Note that in an alternate embodiment, these non-zero weight values w 12,0 , w 45,0 , w 2,0 and w 7,1 are stored in a buffer within Operand A distribution circuit 416 .
  • state machine and scheduler 431 retrieves the activation values from Row 0 and Row 1 of Operand B memory block 442 because these two rows of activation values are required to calculate the required dot products associated with the retrieved weight values included in Operand A (which were taken from the first two columns of Matrix I).
  • Operand B register file 412 can be loaded in series or parallel from Operand B memory block 442
  • the buffers B0-B3 of Operand B distribution circuit 417 can be loaded in series ( FIG. 16 ) or parallel ( FIGS. 17-18 ) from Operand B register file 412 .
  • State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w 12,0 , w 45,0 , w 2,0 , and w 7,1 from Operand A register file 411 to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as Operand A.
  • state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d 0 , c 0 , b 0 and a 0 to each of SIMD 0 , SIMD 1 and SIMD 2 as Operand B, and also causes Operand B distribution circuit 417 to route the values d 1 , c 1 , b 1 and a 1 to SIMD 3 .
  • the Operand B selection register 1601 stores the Operand B select signals that enable the routing of these Operand B values.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 multiply the received Operands A and B in the manner described above.
  • State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 that are associated with the non-zero values w 12,0 , w 45,0 , w 2,0 and w 7,1 . That is, state machine and scheduler 431 addresses Row 3, Row 11, Row 0 and Row 1 within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively. As described above, all rows of the output register sets initially store ‘0’ values.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 then write the accumulated values to the addressed rows (Row 3, Row 11, Row 0 and Row 1, respectively) of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • the register select logic 2101 stores register select entries that enable the routing of values to/from output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 .
  • the multiply-accumulate operations implemented in FIG. 29 advantageously use non-zero weight values from both the first column of Matrix I (i.e., weight values w 12,0 , w 45,0 and w 2,0 ), and the second column of Matrix I (i.e., weight value w 7,1 ), thereby increasing efficiency (because none of the SIMD engines are idle, even though there are not enough non-zero entries in the first column of Matrix I to supply all four of the SIMD engines).
  • state machine and scheduler 431 then causes Row 1 of Operand A memory block 441 to be retrieved and loaded into Operand A register file 411 , and then transferred into an Operand A buffer within Operand A distribution circuit 416 .
  • Operand A register file 411 and Operand A distribution circuit 416 store the non-zero weight values w 32,0 , w 21,1 , w 38,0 and w 23,2 of Matrix I.
  • State machine and scheduler 431 also causes Row 2 of Operand B memory block 442 to be retrieved and loaded into Operand B register file 412 , and then transferred into Operand B buffer B2 within Operand B distribution circuit 417 .
  • Operand B register file 412 and Operand B buffer B0 store the activation values d 0 , c 0 , b 0 and a 0 of Matrix J
  • Operand B register file 412 and Operand B buffer B1 store the activation values d 1 , c 1 , b 1 and a 1
  • Operand B register file 412 and Operand B buffer B2 store the activation values d 2 , c 2 , b 2 and a 2 .
  • state machine and scheduler 431 retrieves the activation values from Rows 0, 1 and 2 of Operand B memory block 442 because these three rows of activation values are required to calculate the required dot products associated with the retrieved weight values included in Operand A (which were taken from the first three columns of Matrix I).
  • State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w 32,0 , w 21,1 , w 38,0 and w 23,2 from Operand A register file 411 to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as Operand A.
  • state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d 0 , c 0 , b 0 and a 0 to each of SIMD 0 and SIMD 2 as Operand B, causes Operand B distribution circuit 417 to route the values d 1 , c 1 , b 1 and a 1 to SIMD 1 , and causes Operand B distribution circuit 417 to route the values d 2 , c 2 , b 2 and a 2 to SIMD 3 .
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 multiply the received Operands A and B in the manner described above.
  • State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 that are associated with the non-zero values w 32,0 , w 21,1 , w 38,0 and w 23,2 . That is, state machine and scheduler 431 addresses Row 8, Row 5, Row 9 and Row 5 within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively. As described above, all rows of the output register sets initially store ‘0’ values.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 then write the accumulated values to the addressed rows (Row 8, Row 5, Row 9 and Row 5, respectively) of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • the multiply-accumulate operations implemented in FIG. 30 advantageously use non-zero weight values from the first column of Matrix I (i.e., weight values w 32,0 and w 38,0 ), the second column of Matrix I (i.e., weight value w 21,1 ), and the third column of Matrix I (i.e., weight value w 23,2 ), thereby increasing operational efficiency (because none of the SIMD engines are idle).
  • state machine and scheduler 431 then causes Row 2 of Operand A memory block 441 to be retrieved and loaded into Operand A register file 411 , and then transferred into an Operand A buffer within Operand A distribution circuit 416 .
  • Operand A register file 411 and Operand A distribution circuit 416 store the non-zero weight values w 8,2 , w 37,1 , w 10,2 and w 43,1 of Matrix I.
  • the activation values already stored in Operand B buffers B1-B2 of Operand B distribution are used in multiply-accumulate operations associated with the non-zero weight values w 8,2 , w 37,1 , w 10,2 and w 43,1 of Matrix I.
  • State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w 8,2 , w 37,1 , w 10,2 and w 43,1 from Operand A register file 411 to SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 , respectively, as Operand A.
  • state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d 2 , c 2 , b 2 and a 2 to each of SIMD 0 and SIMD 2 as Operand B, and causes Operand B distribution circuit 417 to route the values d 1 , c 1 , b 1 and a 1 to SIMD 1 and SIMD 3 .
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 multiply the received Operands A and B in the manner described above.
  • State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 that are associated with the non-zero values w 8,2 , w 37,1 , w 10,2 and w 43,1 . That is, state machine and scheduler 431 addresses Row 2, Row 9, Row 2 and Row 10 within output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively. As described above, all rows of the output register sets initially store ‘0’ values.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • SIMD 0 , SIMD 1 , SIMD 2 and SIMD 3 then write the accumulated values to the addressed rows (Row 2, Row 9, Row 2 and Row 10, respectively) of the output register sets 2000 0 , 2000 1 , 2000 2 and 2000 3 , respectively.
  • the multiply-accumulate operations implemented in FIG. 31 advantageously use non-zero weight values from the second column of Matrix I (i.e., weight values w 37,1 and w 43,1 ) and the third column of Matrix I (i.e., weight values w 8,2 and w 10,2 ), thereby increasing operational efficiency (because none of the SIMD engines are idle).
  • Although operand packing logic 433 is shown as a part of control logic 430 in the embodiments described above, it is understood that in an alternate embodiment, the functionality of operand packing logic 433 can be implemented external to system 400 .
  • software can be used to identify the non-zero values of Matrix I (because the weight values for a network, as represented by the entries of Matrix I, are known), determine the output registers (and output register row addresses) associated with these non-zero values, identify the addresses of the values of the Matrix J required to perform the multiply-accumulate operations with the non-zero values of Matrix I, and determine the manner in which the non-zero values of Matrix I should be packed within the Operand A register file 411 .
  • the packed Operand A values can then be loaded directly into Operand A register file 411 (and/or system memory 440 ).
  • the addresses required to load and access Operand B register file 412 and the addresses required to access the output registers 2000 0 - 2000 3 can be loaded into state machine and scheduler 431 .
  • State machine and scheduler 431 then simply retrieves the non-zero values from memory and supplies the required address signals during runtime, without any extra hardware complexity. In this manner, this alternate embodiment advantageously reduces the hardware requirements of system 400 .
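The software-side preprocessing and the runtime scheduling described above can be sketched as follows. This is an illustrative model only: the function names, the four-engine count, and the use of a single shared accumulator array in place of per-engine output register sets are assumptions for clarity, not structures taken from the disclosure.

```python
# Sketch of the software preprocessing (pack the non-zero weights of
# Matrix I) and the runtime multiply-accumulate scheduling across four
# SIMD engines. All names here are illustrative, not from the patent.

NUM_ENGINES = 4  # models SIMD0..SIMD3


def pack_nonzero_weights(weight_matrix):
    """Scan Matrix I column by column, keeping only non-zero weights.

    Each packed entry records the weight value, the output row it
    accumulates into, and the Matrix J (input) row it multiplies.
    This models the software pass that fills Operand A register file 411
    and precomputes the addresses loaded into the state machine."""
    packed = []
    rows, cols = len(weight_matrix), len(weight_matrix[0])
    for c in range(cols):
        for r in range(rows):
            w = weight_matrix[r][c]
            if w != 0:
                packed.append((w, r, c))
    return packed


def sparse_matmul(weight_matrix, input_matrix):
    """Multiply Matrix I by Matrix J using only packed non-zero weights,
    dispatching NUM_ENGINES multiply-accumulates per modeled cycle."""
    n_out = len(weight_matrix)
    width = len(input_matrix[0])
    # Output register sets start zero-initialized; modeled here as one
    # shared accumulator array rather than one set per engine.
    out = [[0] * width for _ in range(n_out)]
    packed = pack_nonzero_weights(weight_matrix)
    for cycle_start in range(0, len(packed), NUM_ENGINES):
        # Each engine takes one packed weight; because zero weights were
        # never packed, no engine sits idle while packed entries remain.
        for w, r, c in packed[cycle_start:cycle_start + NUM_ENGINES]:
            for k in range(width):
                out[r][k] += w * input_matrix[c][k]
    return out
```

Because the zero weights are discarded before runtime, the number of modeled cycles scales with the count of non-zero weights rather than with the full size of Matrix I, which mirrors the efficiency claim made for the packed-operand embodiment.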



